New AI Method Excels at Predicting Disease Risk from Clinical Notes

Long documents with important details scattered throughout, such as clinical notes, are a problem for artificial intelligence (AI) models. Duke researchers, including computational biology and bioinformatics PhD student Fengnan Li and Matthew Engelhard, MD, PhD, assistant professor of biostatistics and bioinformatics, have developed a new natural language processing method, IRIS (Interpretable Retrieval-Augmented Classification for long Interspersed Document Sequences), to address this challenge.

IRIS breaks documents into smaller segments, stores numeric representations of each segment in a searchable database, and learns to identify the segments that are most relevant to a given disease, Engelhard said. Unlike alternative approaches, IRIS doesn’t require heavy computing, so the model can be trained on a single graphics processing unit (GPU).
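To make the segment-and-retrieve idea concrete, here is a minimal Python sketch. It is not the researchers' implementation: the function names are hypothetical, the word-count "embedding" is a toy stand-in for a neural text encoder, and a fixed text query stands in for the disease-specific relevance that IRIS learns during training.

```python
import math
import re
from collections import Counter


def split_into_segments(text, max_words=8):
    """Break a long document into consecutive segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy numeric representation: word counts (assumption: a real system uses a neural encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(document, max_words=8):
    """Store each segment next to its vector; a plain list plays the role of the searchable database."""
    return [(seg, embed(seg)) for seg in split_into_segments(document, max_words)]


def retrieve(index, query, k=2):
    """Return the k (score, segment) pairs most similar to the query.

    Assumption: the query is fixed text here, whereas IRIS is described
    as learning which segments matter for a given disease.
    """
    q = embed(query)
    scored = [(cosine(q, vec), seg) for seg, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    note = (
        "Patient seen for routine visit. Weight and height normal. "
        "Parent reports limited eye contact and delayed speech. "
        "Vaccines given today. Follow up in six months."
    )
    for score, segment in retrieve(build_index(note), "eye contact delayed speech"):
        print(f"{score:.2f}  {segment}")
```

Because the retrieved segments are returned as readable text alongside their scores, a user can see which parts of the note drove the result, which mirrors the interpretability the researchers describe.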

Tests on six datasets showed that IRIS performs as well as other top models, and it does especially well in health care tasks, such as predicting risk for diseases like autism or ADHD from clinical notes. IRIS also helps users understand why it made a certain prediction by showing the document segments it identified as relevant.

Their work was published in August at the 2025 Annual Meeting of the Association for Computational Linguistics.

Next, the team plans to incorporate IRIS into its early autism and ADHD prediction models to make them more effective and interpretable.

Other authors: Elliot D. Hill, Shu Jiang, and Jiaxin Gao.

Funding: The National Institutes of Health.
