Guidelines for Collecting and Sharing Data

These guidelines were created to promote research integrity, protect patient privacy, and make the data transfer process more efficient.  Any research involving human subjects must follow the Health Insurance Portability and Accountability Act (HIPAA), Duke Institutional Review Board (IRB) regulations, and all other applicable regulatory guidelines.

IRB Approval

For human research studies, anyone who will have access to the data (including data management and the quantitative methodologist) must first be added to ‘key personnel’ in the IRB study protocol. For more information, visit https://irb.duhs.duke.edu/.

Data Collection

All data should be stored in a database program that allows for proper reproducible research, data integrity, and data security/protection. REDCap is a widely used data collection program and available for free at Duke. We recommend that investigators involve the quantitative methodologist in the discussion on database design, review of data elements, testing of the database, and the data collection process.

  • DOCR’s instance of REDCap has been systematically tested and validated.
  • School of Medicine supports infrastructure costs associated with DOCR’s instance of REDCap, including dedicated servers, daily backups, application updates, and software validation.
  • DOCR initial consultations are free.  DOCR’s data managers and analysts can help you find the best data collection solution for your project.
  • Contracting with DOCR is often less expensive than hiring your own staff because you do not have to find, hire, and train staff, or worry about staff turnover.
  • DOCR provides training classes in the use of REDCap, available in the LMS.
  • Data can be exported to the biostatistician’s preferred format.

Contact DOCR at redcap-docr@duke.edu to discuss the use of REDCap.

The use of Excel for research data collection/management should be avoided if possible (why we avoid Excel). If you need to use Excel, please follow the guidelines below to minimize errors and ensure data quality. If using REDCap, most of these items will be ensured through database data validation processes.

  • Every patient/subject should have a unique identifier. 
  • Avoid all use of commas in data fields. This includes both text and numeric fields.  For example, use 1298 rather than 1,298.
  • Do not include line breaks within a cell.
  • Keep column/variable names under 32 characters, while keeping each one unique.  Do not start a column/variable name with a number or symbol. Column/variable names should not include spaces or special characters.
  • Dedicate the top row only for the column/variable name; do not repeat rows of column/variable names. 
  • If there are several groups of patients, use a separate column to identify group membership for each patient. Do not indicate any distinguishing patient characteristic with highlighted cells.  Instead, add a separate column to indicate the characteristic. The following spreadsheet contains group information in the column labeled ‘Arm’. In this case the patients were randomized to one of two treatment arms: experimental (Exper) or control (Control).

DukeID  Arm      Value
A0001   Exper    5
B0002   Control  4
C0004   Exper    2
D0005   Control  6
E0006   Control  8

  • Use the same format for all variables in a column.  If a variable is to be analyzed as numeric then all entries in that column must be numeric.  Any characters or symbols including ‘<’, ‘>’, ‘=’, ‘*’, ‘?’ etc. are not permitted. 
  • For missing data, leave the cell empty to indicate a missing value; do not use ‘N/A’.
  • For character variables, be consistent with the letter case and exact cell content.  For example, yes, Yes, and YES are all considered different responses.  Spaces are considered characters; two spaces between characters are different from one space.
  • For variables with the same response options (such as yes/no), use consistent coding. Do not code one variable as ‘1=yes, 0=no’ and another variable as ‘1=no, 0=yes’.
  • Do not include blank rows or columns.
  • Do not hide rows or columns of data that should be excluded; delete them instead, because hidden rows and columns will still be imported into the statistical software.
  • Do not include summary data in the data file. 
  • Do not include comments or footnotes in the data file. Comments or explanations of variable names, study design, data collection, and any irregularities that occurred during the study or data collection are encouraged, but they should be listed in a separate document.
  • Data with repeated measures can be collected in long format or wide format (see examples below). For long format, each patient has a row of data for each time measurement and all observations from the same patient are indicated with the unique identifier. For wide format, each patient has one row of data and repeated columns of measurements.
Long format dataset

DukeID  Time  Marker
3       1     34
3       2     23
3       3     45
4       2     35
5       1     27
5       3     76

Wide format dataset

DukeID  Marker1  Marker2  Marker3
3       34       23       45
4                35
5       27                76

  • If there are corrections to the data, it is the responsibility of the investigator to provide the statistician with an updated file as soon as possible. Please include an explanation of why the correction was warranted.
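
The relationship between the long and wide formats described in these guidelines can be sketched in a few lines of Python. This is only an illustration using the example values from this page; the function name long_to_wide and the use of None for an empty cell are assumptions made for the sketch, not part of any Duke tool.

```python
from collections import defaultdict

# Long-format rows copied from the example on this page: (DukeID, Time, Marker).
long_rows = [
    (3, 1, 34), (3, 2, 23), (3, 3, 45),
    (4, 2, 35),
    (5, 1, 27), (5, 3, 76),
]

def long_to_wide(rows, times=(1, 2, 3)):
    """Pivot (id, time, value) rows into one row per id.

    Time points with no measurement are left as None, which stands in
    for an empty spreadsheet cell (an assumption of this sketch).
    """
    by_id = defaultdict(dict)
    for subject_id, time, value in rows:
        by_id[subject_id][time] = value
    wide = []
    for subject_id in sorted(by_id):
        row = {"DukeID": subject_id}
        for t in times:
            row[f"Marker{t}"] = by_id[subject_id].get(t)
        wide.append(row)
    return wide

wide_rows = long_to_wide(long_rows)
for row in wide_rows:
    print(row)
```

Either format carries the same information, so the choice can follow whatever is easier to enter consistently; the unique identifier is what ties the rows together.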

Incorrect data collection example

This spreadsheet breaks many of the rules above, and the quantitative methodologist would need a lot of time to clean the data.

Data for ARC trial

         Treatment  1st Date           Age of Subject  Patient's  Height   *blood pressure
                                                       Gender              at baseline
         1          Oct 25th, 2019     44              m          67"      120/82
         2          7/5/2018           62              Female     5'10"    >150/90
         3          28-Feb-2018        30-34           female     182cm    135 over 85
         4          6-Apr-18           22.5            male       74       normal
         5          9/12/2017          69              F          5.5      160/110

         Control
         1          6/22/2018          65+             femlae     5ft3     130/60 130/70
         2          December 26, 2019  73              Male       unknown  140/75
         3          8/5                49              M          66       N/A
         4          4/12/2019          60 1/2          MAle       ~61      SBP: 110 DBP: 60
         5          10/16/18           ??              f          6'3      80/120
Average                                                           68

*collected at baseline
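
To make the cleaning burden concrete, the short Python sketch below tallies the gender and age entries from the spreadsheet above. It is an illustration only; the helper name is_number is hypothetical, and the lists are simply the cell values retyped from the example.

```python
from collections import Counter

# Gender and age entries copied from the spreadsheet above.
gender_raw = ["m", "Female", "female", "male", "F",
              "femlae", "Male", "M", "MAle", "f"]
age_raw = ["44", "62", "30-34", "22.5", "69",
           "65+", "73", "49", "60 1/2", "??"]

# Statistical software treats each distinct spelling as its own category.
gender_counts = Counter(gender_raw)
print(len(gender_counts), "distinct gender codes:", sorted(gender_counts))

def is_number(text):
    """Return True if the cell parses as a plain number."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# Entries that would block the column from being analyzed as numeric.
bad_ages = [a for a in age_raw if not is_number(a)]
print("non-numeric age entries:", bad_ages)
```

Ten patients produce ten different gender codes here, and four of the ten age entries cannot be read as numbers, so each of these cells would have to be resolved by hand before analysis.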

   

 

 

Good data collection example

This spreadsheet requires very little data cleaning, so the quantitative methodologist will be able to get to the analysis more quickly.

ID  Arm  First_Date  Age  Gender  Height  SBP  DBP
1   0    10/25/2019  44   M       67      120  82
2   0    7/5/2018    62   F       70      150  90
3   0    2/28/2018   32   F       72      135  85
4   0    4/6/2018    22   M       74      120  80
5   0    9/12/2017   69   F       65      160  110
6   1    6/22/2018   66   F       63      130  60
7   1    12/26/2019  73   M               140  75
8   1    8/5/2017    49   M       66
9   1    4/12/2019   60   M       61      110  60
10  1    10/16/2018       F       75      120  80
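
Several of the spreadsheet rules in these guidelines can be checked automatically before a file is sent. The following Python sketch assumes the spreadsheet has been saved as CSV and checks column names, unique identifiers, and numeric columns; the function name check_table, the sample rows chosen, and the list of numeric columns are assumptions for illustration, not an official validation tool.

```python
import csv
import io
import re

# A few rows of the good example above, as a CSV export (assumed format).
GOOD_CSV = """ID,Arm,First_Date,Age,Gender,Height,SBP,DBP
1,0,10/25/2019,44,M,67,120,82
2,0,7/5/2018,62,F,70,150,90
7,1,12/26/2019,73,M,,140,75
10,1,10/16/2018,,F,75,120,80
"""

# Names must start with a letter and contain no spaces or special characters.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def check_table(text, id_column, numeric_columns):
    """Return a list of human-readable problems found in a CSV table."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    names = reader.fieldnames or []
    for name in names:
        if len(name) >= 32 or not NAME_RE.match(name):
            problems.append(f"bad column name: {name!r}")
    if len(set(names)) != len(names):
        problems.append("duplicate column names")
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):
        subject_id = row[id_column] or ""
        if not subject_id or subject_id in seen_ids:
            problems.append(f"line {line_no}: missing or repeated identifier")
        seen_ids.add(subject_id)
        for col in numeric_columns:
            value = row[col] or ""
            if value == "":
                continue  # an empty cell is the accepted way to mark missing data
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: {col}={value!r} is not numeric")
    return problems

print(check_table(GOOD_CSV, "ID", ["Age", "Height", "SBP", "DBP"]))

# Introduce one of the errors from the incorrect example to see it reported.
bad_csv = GOOD_CSV.replace("150,90", ">150,90")
print(check_table(bad_csv, "ID", ["SBP"]))
```

A check like this does not replace review by the quantitative methodologist, but it can catch symbols in numeric columns and repeated identifiers before the file changes hands.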

 

Sending/Sharing Data

  • Information about all the permitted data storage resources is on ISO’s website: https://security.duke.edu/policies/duke-services-and-data-classification.
  • Duke Box (NOT Dropbox) is the preferred way to share data. It is only for the transfer of data and should not be used for long-term data storage.
  • Sharing research data via email, even with Send Secure, is discouraged and should be avoided.
  • Any questions pertaining to data sharing or storage can be directed to the DHTS Information Security Officer (ISO) at security@duke.edu or the research practice manager.
  • If data are not shared or stored in an approved manner, a violation of the protocol must be reported to the IRB in a timely manner.