Scholarly Culture and Accountability Plan (SCAP)

The Department of Biostatistics and Bioinformatics is committed to creating a science culture that promotes our core values of ethical inquiry, skepticism, free discussion, and open science with respect to secure data generation, responsible data sharing, research integrity, and complete and clear analysis. In our own work and in our training programs we emphasize the use and the promotion of reasonable measures to ensure the integrity and reproducibility of analysis of our collaborators’ research.

Complete the B&B SCAP Attestation               

Our roles and responsibilities with respect to these areas are outlined below:

B1. Independent Methods Research

  1. Purely theoretical
    1. Before submitting a manuscript, obtain colleague review of theoretical results
    2. Because reviewers do not necessarily check derivations, confirm results by simulation where possible
  2. Simulations
    1. Adhere to the principles of literate programming when writing code for designing simulation studies
    2. Document all parameter settings and seeds
    3. Keep audit trail of all programs and changes to programs with respect to parameter settings, seeds, and code using source code management software (e.g., git or mercurial)
    4. Use reproducible audit, deployment and runtime strategies, for instance Jupyter Notebooks and containers like Docker
    5. Avoid hard coding parameters in functions
  3. Querying and manipulating source data
    1. Adhere to the principles of literate programming when writing code for querying and manipulating source data
    2. Keep the raw data file unmodified, and make and keep an audit trail of all changes to the analysis file
    3. Document all derived variables
    4. Keep audit trail of all programs, changes to programs, and results using source code management software (e.g., git or mercurial)
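
As a minimal sketch of the simulation items above (documenting parameter settings and seeds, and avoiding hard-coded parameters), the Python example below passes every setting to the simulation through one explicit specification object and stores that specification next to the results. The names `SimulationSpec`, `run_simulation`, and `save_record`, and the toy task of estimating the bias of a sample mean, are hypothetical illustrations rather than a departmental standard.

```python
import json
import random
import statistics
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SimulationSpec:
    """Hypothetical container for every setting a simulation run depends on."""
    n_replicates: int
    sample_size: int
    true_mean: float
    true_sd: float
    seed: int


def run_simulation(spec: SimulationSpec) -> dict:
    """Estimate the bias of the sample mean under the settings in `spec`."""
    # A generator seeded from the spec keeps the run repeatable and
    # independent of any global random state.
    rng = random.Random(spec.seed)
    estimates = [
        statistics.fmean(
            rng.gauss(spec.true_mean, spec.true_sd)
            for _ in range(spec.sample_size)
        )
        for _ in range(spec.n_replicates)
    ]
    mean_estimate = statistics.fmean(estimates)
    # The full spec travels with the results, so settings and seed are documented.
    return {
        "spec": asdict(spec),
        "mean_estimate": mean_estimate,
        "bias": mean_estimate - spec.true_mean,
    }


def save_record(result: dict, path: str) -> None:
    """Write results and settings to a JSON file that can be put under version control."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle, indent=2, sort_keys=True)
```

Calling `run_simulation` twice with the same `SimulationSpec` returns identical results, and committing the JSON record with git or mercurial extends the audit trail to the settings themselves.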

B2. Software Development in General

  1. Strong preference for open source code
  2. Literate programming: emphasis on readability and maintainability
  3. Preference for agile methods (versioning, continuous deployment, multilayer testing)
  4. Use of version control
  5. Tests and code coverage
  6. Commitment to code maintenance
  7. Recognition that code and software form an evolving ecosystem

B3. Publication and Authorship

  1. True contributions to research need to be recognized through authorship
  2. Contribution must be substantive and reflect real involvement
  3. Authorship should not be accepted, even when offered, when contribution is not substantive
  4. A statement of contribution(s) should reflect level of involvement

B4. Collaborative Research

Biostatisticians/Bioinformaticians in collaborative research have a responsibility to promote a culture of open discussion and skepticism. Although such an approach can be intimidating, especially when it involves potential confrontation with a more senior investigator, it is fundamental to honest collaboration, open science, scientific integrity, and reproducibility. Such challenges might be considered “speed bumps” in the process of research, in that probing questions may lead to a change in approach along with a delay. It is important to remember that, while speed bumps slow the traffic, their purpose is protection; in our case, the goal is to protect the integrity of the study. It is also important to note that scientific discussion and debate contribute to and strengthen the merit and value of the research.

The process for statistical collaboration is outlined on the Duke Biostatistics BERD Core website. 

We are responsible for the integrity of the data and the analyses from the point we receive the data.  Although we cannot take sole responsibility for any data manipulations that occurred before that, we should promote good practices and do reasonable data checking.

Statisticians need to ensure:

  1. Creation of a pre-defined statistical analysis plan (SAP) and data management plan
  2. Procedure for requesting changes to the SAP
  3. Scripted workflows and reproducible analysis
  4. Procedure for reporting any concerns (suspicious data, outliers, p-hacking)

Whether supervising staff or performing analyses, ensure the following standards:
(Cited from: Gentzkow, Matthew and Jesse M. Shapiro. 2014. "Code and Data for the Social Sciences: A Practitioner’s Guide". University of Chicago, January 2014.)

  1.  Data management
    1. Create a project structure for keeping all things at the right place (data, code, figures, etc.)
    2. Never modify raw data files (ideally, they should be read-only and in a versioned repository); copy them to new files before transforming or cleaning the data
    3. Check data consistency
    4. Manage script dependencies and data flow with a build automation tool (e.g., GNU make) or dynamic report generating software (e.g., knitr)
  2. Coding
    1. Adhere to the principles of literate programming
    2. Organize source code in logical units or building blocks
    3. Separate source code from written material such as manuscript and report text, especially for large projects (this partly overlaps with the previous item and with reporting)
    4. Document all processes
  3. Analysis
    1. Set and record the seed whenever you call random number generators or stochastic algorithms (e.g., k-means)
    2. For Monte Carlo studies, it is a good idea to store specs/parameters in a separate file (sumatra may be a good candidate)
  4. Versioning
    1. Use some kind of revision control for easy tracking/export, e.g. git or mercurial
    2. Back up everything on a regular basis
  5. Use a tool for dynamic report generation, such as
    1. [R] Sweave or knitr
    2. [R] Brew
    3. [R] R2HTML or ascii
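
One way to follow the data management items above (never modifying raw data files and checking consistency) is to record a checksum of each raw file, mark the file read-only, and do all cleaning on a copy. The Python sketch below uses only the standard library; the function names `sha256_of`, `protect_raw_file`, and `make_working_copy` are hypothetical, and the permission change assumes a POSIX-style file system.

```python
import hashlib
import os
import shutil
import stat


def sha256_of(path: str) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def protect_raw_file(raw_path: str) -> str:
    """Mark the raw file read-only and return its checksum for the project log."""
    os.chmod(raw_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return sha256_of(raw_path)


def make_working_copy(raw_path: str, working_path: str, expected_sha256: str) -> None:
    """Copy the raw file for cleaning, refusing if it no longer matches its checksum."""
    if sha256_of(raw_path) != expected_sha256:
        raise ValueError(f"{raw_path} does not match its recorded checksum")
    # copyfile copies contents only, so the working copy is not read-only.
    shutil.copyfile(raw_path, working_path)
```

Storing the checksum returned by `protect_raw_file` alongside the analysis code makes it possible to verify later that the raw file is still the one the analysis started from.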

It is our responsibility to reinforce the following principles with our collaborators, staff, and students:

  1. To avoid bias or suspicion thereof, investigators should not take their own outcome measurements or analyze their own data, to the extent possible
  2. Statisticians MUST NOT certify or validate an analysis that was completed by an investigator
  3. All data, from raw to final analysis, should be traceable with an audit trail
  4. To the extent possible, collect data in a tool that allows tracking
  5. Original raw data files should be saved and all changes made only to subsequent copies
  6. Our educational programs should ensure that our students are practicing these principles. By the time they graduate they should be totally indoctrinated in these practices
  7. For investigators doing clinical studies:
    1. Limitations of secondary data – issues of confounding, selection bias, standardization of variable definitions
    2. There is no good solution for missing data

All laboratory scientists keep a document of Standard Operating Procedures for Data Management and Processing that must be signed by each member of the lab.

November 2022