Guidelines for Collecting and Sharing Data

These guidelines have been created to promote research integrity, protect patient privacy, and to make the data transfer process more efficient. Any research involving human subjects must follow Health Insurance Portability and Accountability Act (HIPAA), Duke Institutional Review Board (IRB) regulations, and all applicable regulatory guidelines.

IRB Approval

For human research studies, before allowing access to the data for anyone (including data management and the quantitative methodologist) you will need to add him or her to ‘key personnel’ in the IRB study protocol. For more information, visit https://irb.duhs.duke.edu/.

Data Collection

All data should be stored in a database program that allows for proper reproducible research, data integrity, and data security/protection. REDCap is a widely used data collection program and available for free at Duke. We recommend that investigators involve the quantitative methodologist in the discussion on database design, review of data elements, testing of the database, and the data collection process.

DOCR’s instance of REDCap has been systematically tested and validated.
School of Medicine supports infrastructure costs associated with DOCR’s instance of REDCap, including dedicated servers, daily backups, application updates, and software validation.
DOCR initial consultations are free. DOCR’s data managers and analysts can help you find the best data collection solution for your project.
Contracting with DOCR is often less expensive than hiring your own staff because you do not have to find, hire, and train staff, or worry about staff turnover.
DOCR provides training classes in the use of REDCap, available in the LMS system.
Data can be exported to the biostatistician’s preferred format.

Contact DOCR at redcap-docr@duke.edu to discuss the use of REDCap.

The use of Excel for research data collection/management should be avoided if possible (why we avoid Excel). If you need to use Excel, please follow the guidelines below to minimize errors and ensure data quality. If using REDCap, most of these items will be ensured through database data validation processes.

Every patient/subject should have a unique identifier.
Avoid all use of commas in data fields. This includes both text and numeric fields. For example, use 1298 rather than 1,298.
Do not include line breaks within a cell.
Keep column/variable names under 32 characters, while keeping each one unique. Do not start a column/variable name with a number or symbol. Column/variable names should not include spaces or special characters.
Dedicate the top row only for the column/variable name; do not repeat rows of column/variable names.
If there are several groups of patients, use a separate column to identify group membership for each patient. Do not indicate any distinguishing patient characteristic with highlighted cells. Instead incorporate a separate column to indicate the characteristic. The following spreadsheet contains group information in the column labeled ‘Arm’. In this case the patients were randomized to one of two treatment arms: experimental (Exper) or control (Control).

DukeID	Arm	Value
A0001	Exper	5
B0002	Control	4
C0004	Exper	2
D0005	Control	6
E0006	Control	8

Use the same format for all variables in a column. If a variable is to be analyzed as numeric then all entries in that column must be numeric. Any characters or symbols including ‘<’, ‘>’, ‘=’, ‘*’, ‘?’ etc. are not permitted.
For missing data, leave the cell empty to indicate a missing value; do not use ‘N/A’.
For character variables, be consistent with the letter case and exact cell content. For example, yes, Yes, and YES are all considered different responses. Spaces are considered characters; 2 spaces between characters are different than 1 space.
For variables with the same response options (such as yes/no), use consistent coding. Do not code one variable as ‘1=yes, 0=no’ and another variable as ‘1=no, 0=yes’.
Do not include blank rows or columns.
Do not hide rows or columns of data instead of deleting them as they will still be imported into the statistical software.
Do not include summary data in the data file.
Do not include comments or footnotes in the data file. Comments or explanations of variable names, study design, data collection, any irregularities that occurred during the study or data collection are encouraged, but they should be listed in a separate document.
Data with repeated measures can be collected in long format or wide format (see examples below). For long format, each patient has a row of data for each time measurement and all observations from the same patient are indicated with the unique identifier. For wide format, each patient has one row of data and repeated columns of measurements.

Long format dataset
DukeID	Time	Marker
3	1	34
3	2	23
3	3	45
4	2	35
5	1	27
5	3	76

Wide format dataset
DukeID	Marker1	Marker2	Marker3
3	34	23	45
4		35
5	27		76

If there are corrections to the data, it is the responsibility of the investigator to provide the statistician with an updated file as soon as possible. Please include an explanation why data correction was warranted.

Incorrect data collection example

This spreadsheet breaks many of the rules above and would require a lot of time for the quantitative methodologist to clean the data.

Data for ARC trial

	Treatment	1st Date	Age of Subject	Patient's	Height	*blood pressure
				Gender	at baseline
	1	Oct 25th, 2019	44	m	67"	120/82
	2	7/5/2018	62	Female	5'10"	>150/90
	3	28-Feb-2018	30-34	female	182cm	135 over 85
	4	6-Apr-18	22.5	male	74	normal
	5	9/12/2017	69	F	5.5	160/110

	Control
	1	6/22/2018	65+	femlae	5ft3	130/60 130/70
	2	December 26, 2019	73	Male	unknown	140/75
	3	8/5	49	M	66	N/A
	4	4/12/2019	60 1/2	MAle	~61	SBP: 110 DBP: 60
	5	10/16/18	??	f	6'3	80/120
Average					68
	*collected at baseline

Good data collection example

This spreadsheet requires very little data cleaning, so the quantitative methodologist will be able to get to the analysis more quickly.

ID	Arm	First_Date	Age	Gender	Height	SBP	DBP
1	0	10/25/2019	44	M	67	120	82
2	0	7/5/2018	62	F	70	150	90
3	0	2/28/2018	32	F	72	135	85
4	0	4/6/2018	22	M	74	120	80
5	0	9/12/2017	69	F	65	160	110
6	1	6/22/2018	66	F	63	130	60
7	1	12/26/2019	73	M		140	75
8	1	8/5/2017	49	M	66
9	1	4/12/2019	60	M	61	110	60
10	1	10/16/2018		F	75	120	80

Sending/Sharing Data

Information about all the permitted data storage resources is on ISO’s website: https://security.duke.edu/policies/duke-services-and-data-classification.
Duke Box (NOT Drop Box) is the preferred way to share data. It is only for the transfer of data and should not be used for long-term data storage.
Sharing research data via email, even with send secure is discouraged and should be avoided.
Any questions pertaining to data sharing or storage can be directed to the DHTS Information Security Officer (ISO) at security@duke.edu or the research practice manager.
If data are not shared or stored in an approved manner, a violation of the protocol must be reported to the IRB in a timely manner.

IRB Approval

Data Collection

Advantages of Using REDCap at Duke

Data Collection in Excel

Example Data in Excel

Incorrect data collection example

Good data collection example

Sending/Sharing Data