Ben Goldstein, PhD
Principal Investigator
Professor of Biostatistics & Bioinformatics
Professor in Population Health Sciences
Associate Professor in Pediatrics
Member in the Duke Clinical Research Institute
Contact Information

Department of Biostatistics & Bioinformatics
Duke University
2424 Erwin Road
Suite 9023
Durham, NC 27705
ben.goldstein@duke.edu

About

I am Professor of Biostatistics & Bioinformatics at Duke University and Chief of the Division of Translational Biomedical Informatics. I have secondary appointments in the Department of Pediatrics (no, I’m not a pediatrician!) and the Department of Population Health Sciences, and I am a member of the Duke Clinical Research Institute and the Children’s Health & Discovery Initiative.

I serve as the Chief Data Scientist for Duke AI Health and the Associate Chief Data Scientist for the Duke University Health System. In these roles I provide guidance and support on health-system-based analytics.

My research interests are in the meaningful use of Electronic Health Records data. My work sits at the intersection of Biostatistics, Biomedical Informatics, Epidemiology and Machine Learning. I collaborate actively with both clinicians and fellow methodologists. I believe that effectively working with EHR data requires a true team science approach and appreciate the perspective and expertise my colleagues bring to any project.

Venn diagram showing roles of biostatisticians, informaticists, and clinicians in research.

This diagram illustrates how biostatisticians, informaticists, and clinicians collaborate in clinical research. Each professional contributes unique skills, such as data analysis, study design, and data extraction, highlighting their shared responsibilities in successful research.

 

If you are a student or collaborator interested in working with EHR data, please feel free to email me to set up a meeting.

 

I received my BA in psychology in 2002 from Wesleyan University, MPH in Epidemiology & Biostatistics in 2007 and PhD in Biostatistics with a Designated Emphasis in Genomic & Computational Biology from UC Berkeley in 2011. My PhD adviser was Dr. Alan Hubbard, and I worked in the lab of Dr. Lisa Barcellos studying machine learning methods for genome-wide association studies.

In 2011 I joined the newly formed Quantitative Sciences Unit in the Department of Medicine at Stanford University. I worked as a collaborative biostatistician on a range of clinical projects. Through this work I developed an interest in the use of Electronic Health Records. Under the primary mentorship of Dr. Wolfgang Winkelmayer I received an NIH K25 career development award focused on the use of EHRs to develop clinical risk models.

In 2014 I joined the faculty of Duke University in the Department of Biostatistics and Bioinformatics as well as the Duke Clinical Research Institute. I was one of the original members of the DCRI Center for Predictive Medicine.  In 2018 I was named the Data Science Lead for the newly formed Children’s Health & Discovery Initiative. In 2019 I was appointed Associate Professor of Biostatistics and Bioinformatics. In 2023 I was named Director of Data Science for Duke AI Health and the Associate Chief Data Scientist for DUHS. In 2024, I was named Division Chief for the Division of Translational Biomedical Informatics. I have secondary appointments in the departments of Population Health Sciences and Pediatrics.

My research focuses on the meaningful use of EHRs both for the purpose of risk prediction and causal inference and sits at the intersection of biostatistics, informatics, machine learning and epidemiology. I study how the way people interact with the health care system can bias EHR-based studies – something we’ve termed informed presence. I also apply machine learning methods to understand patients’ evolving health trajectories captured in EHR data. I have a strong interest in making EHR data easily and reliably available to researchers and have worked to develop the Clinical Research DataMart (CRDM) to support research activities at Duke. I enjoy collaborating with clinical researchers and other methodologists on projects related to EHR data.

On the translational side, I have a strong interest in using health data to better inform clinical practice and decision making – the Learning Healthcare System. I work closely with members of Duke University Health System on quality improvement initiatives. In particular, I help develop, evaluate and implement clinical decision support (CDS) tools. I serve as the co-chair of the evaluation committee for the Algorithm Based Clinical Decision Support (ABCDS) governance committee.

In my free time I enjoy cooking and spending time with my wife and two boys. I always look for opportunities to travel and explore new parts of the world.

Research

My research is focused on the meaningful use of Electronic Health Records data with an interest in both deriving inference from EHR data and developing risk prediction models and clinical decision support tools with EHRs. From an inferential standpoint, I am interested in understanding the potential and limitations of EHRs for clinical research and adapting methods for the analytic challenges that arise. From a risk prediction standpoint, I am interested in best practices for developing, implementing and evaluating clinical decision support tools. I also have a growing interest in best practices for implementing tools into clinical environments to enhance usability and acceptability. Overall, my research sits at the intersection of biostatistics, biomedical informatics, machine learning, epidemiology and implementation science.

I enjoy collaborating with both clinicians and methodologists and involving students in these projects.

A particular focus is identifying biases that may arise in the use of EHRs. One such bias is what we term informed presence bias. This arises from the observation that people only interact with the health system when they are sick. While this is a missing data problem – we are missing healthy observations – the potential for bias lies in what we observe as opposed to what we miss. One of my foci is identifying situations in which this bias can arise, characterizing the problems it can engender and ultimately developing solutions for addressing it.

One of the key ways that EHR data have been used is to develop risk prediction models. EHRs allow us to observe patients repeatedly over time. We can exploit these repeated measurements to better characterize a patient’s risk profile and/or develop dynamic (time-updated) risk models. I have been working on understanding the best ways to incorporate these repeated measures into risk models. I am also interested in developing effective and robust approaches for dynamic prediction. In one collaboration, we have been working with clinicians at Duke Hospital to evaluate and improve a risk tool for time-updated detection of patient deterioration.

I have a strong interest in using routinely collected health data to inform and improve the way clinical care is administered. I serve as the Chief Data Scientist for Quality, providing analytic support for quality improvement and health equity projects. I also work closely with members of the Duke University Health System to develop, implement and evaluate clinical decision support (CDS) tools. I co-lead the evaluation committee for DUHS’s CDS governance committee, ABCDS.

EHR data present a unique opportunity to understand and assess pharmaceutical interventions in a real-world environment. We have conducted comparative effectiveness studies of pharmaceutical interventions, assessed the degree of inequity in the usage of medications, developed approaches to capture adverse events, mined EHR databases to identify medications related to positive outcomes, and examined the impacts of polypharmacy in children.

EHR data represent a unique and valuable data source to understand a person’s clinical status. Unfortunately, EHR data are stored in complex and hard-to-access formats. Our group is working to make Duke EHR data generally and reproducibly available to clinical researchers. We have stood up the Duke Clinical Research DataMart (CRDM), which allows users to set up regular data pulls for either research purposes or tracking of clinical populations.

Funded Work

The following are external grants and projects I’ve been awarded as PI or Co-PI:

  • Using Natural Language Processing (NLP) to detect underdiagnosis of late talking
  • Mapping the development of late-talking children
  • Collaborators: Lauren Franz (co-PI), Danai Fanin (co-I)
  • Part of Duke’s Autism Center of Excellence (ACE)
  • Using EHR and claims data to create an automated screening tool for kids at risk of ASD
  • Developing machine learning approaches to model rare outcomes
  • Collaborators: Geraldine Dawson (Duke – Psychiatry, ACE PI), Gary Maslow (Duke – Psychiatry, Project co-lead)
  • Leveraging EHR data on dialysis patients
  • Using deep learning to predict life expectancy and identify patient clinical trajectories
  • Administrative supplement on ethics of ML based tools
  • Collaborators: Julia Sciala (U Virginia – Nephrology), Ricardo Henao (Duke – B&B), Tariq Shafi (U Mississippi – Nephrology)
  • Conduct stakeholder focus groups to understand the process by which Duke’s early warning score was implemented
  • Create guidance for implementing ML-based CDS tools
  • Collaborators: Nina Sperber (Duke – Population Health, coPI)
  • Develop analytic approaches for assessing COVID vaccine safety and effectiveness with EHR data
  • Compare what is observed directly from health systems versus what is imported from state data
  • Collaborators: Jillian Hurst (Duke – Pediatric ID), Deverick Anderson (Duke – Adult ID), Emily O’Brien (Duke – Population Health)
  • Exploring the use of EHR data to detect adverse events and associate them with therapies
  • Providing advice and guidance to the FDA biostatistics group tasked with RWD
  • Collaborators: Jillian Hurst (Duke – Pediatric ID), JJ Strouse (Duke – Hematology), Haley Hostetler (Duke – Allergy & Immunology)
  • Linking Duke EHR data on kids with asthma with geospatial and temporal factors
  • Assessing impact of environmental factors on asthma exacerbations
  • Collaborators: Jason Lang (Duke – Pediatric Pulmonology, coPI)
  • Comparing Duke EHR data and Optum Health Data to develop predictive models
  • Apply deep learning methods and assess transportability
  • Collaborators: Neha Pagidipati (Duke – Cardiology), Ricardo Henao (Duke – B&B)
  • K Award using EHR data on dialysis patients
  • Predictive modelling and assessment of EHR data quality
  • Mentors: Wolfgang Winkelmayer (Baylor), Michael Pencina (Duke), Trevor Hastie (Stanford), Tim Assimes (Stanford)

Students & Staff

AMIA LAB DINNER 2024

This image shows biostatistics students and staff at the AMIA 2024 lab dinner, celebrating their work together.

Current Students 

  • Menying Yan (PhD Candidate, Biostatistics)
  • Zigui Wang (PhD Student, Biostatistics)
  • Scott Sun (PhD Student, Biostatistics)
  • Jiang Shu (PhD Student, Biostatistics)
  • Gabby Walczak (2026, MB Student)
  • Achint Kaur (2026, MB Student)

AI Health Fellow

Staff Members


Past Students

PhD

  • Zidi Xu (2021, PhD Biostatistics) – Data Scientist at Apple

Masters

  • Xiruo Ding (2017, MB) – PhD Student at University of Washington
  • Karine Yenokyan (2018, MB) – Biostatistician at Johns Hopkins
  • Aijing Gao (2018, MB) – Biostatistician at HealthNet
  • Yue Liang (2019, MB) – PhD Student at University of Minnesota
  • Zhecheng Sheng (2019, MB) – PhD Student at University of Minnesota
  • Jingyi He (2020, MB) – Biostatistician at Parexel
  • Zhenhui Xu (2021, MB) – Data Scientist at Brightech Intl
  • Tingxuan Li (2021, MB) – Biostatistician at IQVIA
  • Jiamu He (2022, MB Biostatistics) – Biostatistician at University of Pennsylvania
  • Yili Li (2022, MB Biostatistics) – Biostatistician at Eli Lilly
  • Feier Chang (2023, MB Biostatistics) – PhD Student at University of Pittsburgh
  • Caitlyn Ngyuen (2023, MB Biostatistics) – GlaxoSmithKline
  • Yvonne Feng (2023, MB Biostatistics) – Biostatistician at University of Pennsylvania
  • Ziyi Wang (2023, MB Biostatistics) – Biostatistician at University of Pennsylvania
  • Scott Sun (2023, MB Biostatistics) – PhD Student at Duke University
  • Mufan Wang (2024, MB Biostatistics) – PhD Student at NCSU
  • Sam Albertson (2024, MB Biostatistics) – Data Scientist at University of Washington
  • Peng Wu (2024, MB Biostatistics) – PhD Student at University of Wisconsin
  • Yixin Chen (2024, MB Biostatistics) – PhD Student at Ohio State University
  • Alexander Da Silva (2024, MB Biostatistics)
  • Ruobing Xue (2025, MB Biostatistics) – Data Scientist at Atrium Health
  • Jonathan Hui (2025, MB Biostatistics) – PhD Student at University of Pittsburgh
  • Rushi Tang (2025, MB Biostatistics) – PhD Student at Case Western Reserve University
  • Jiang Shu (2025, MB Biostatistics) – PhD Student at Duke University

Undergraduate

  • Angie Shen (2018, BA Statistics) – PhD Student at UNC
  • Paul Sabharwal (2022, BA Computer Science) – Medical School
  • Yihan Shi (2023, BA Statistics) – Law School

Publications

I publish methodological, applied, and collaborative papers. Many papers are written by trainees, which makes my life easier!

You can find most of my publications on Google Scholar or PubMed.

Courses

  • Master’s of Biostatistics course introducing topics in health data science
  • Exploratory data analysis
  • Unsupervised learning
  • Supervised learning
  • Special topics course for data science master’s students
  • Biases with EHR data
  • Inference with EHR data
  • Predictive modelling with EHR data
  • Interdisciplinary course for quantitative and clinical investigators on how to work with EHR data
  • Surveyed topics on EHR data access, data processing, study design and analytics