Large Language Models Can Outperform Human Programmers

February 10, 2025

By Angela Spivey


For researchers and others with ideas about how a computer program could make their lives easier, a publicly available artificial intelligence (AI) large language model may be able to help.

In a direct comparison, one large language model, GPT-4, outperformed 85% of human programmers in writing code to execute simple tasks, reported Zhicheng “Jason” Ji, PhD, assistant professor of biostatistics and bioinformatics at Duke University School of Medicine, in a study published December 2024 in the journal Advanced Science.

This was the first known study to directly compare large language models to human programmers. The results suggest that large language models like GPT-4 have the potential to serve as reliable assistants in generating programming code and developing software.

“By democratizing access to computer programming, large language models have the potential to help people without extensive programming backgrounds access this powerful tool to realize their ideas,” Ji said. He co-authored the study with Wenpin Hou, PhD, of Mailman School of Public Health at Columbia University.

The researchers systematically evaluated the capabilities of seven large language models (LLMs) in generating programming code, including by comparing the performance of the LLMs directly to that of human participants in online programming contests.

Ji and Hou served as the “interface” between the models and contest websites, writing prompts in English that asked the models to write code to perform particular tasks. The researchers then ran the code on online platforms such as LeetCode, which are often used for training and technical tests for programmers.

Comparing the results from the LLMs to published statistics and rankings about human participants in these contests provided a way to rigorously compare the performance of LLMs to that of human programmers, Ji said.
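
To illustrate the arithmetic behind a statement such as "outperformed 85% of human programmers," here is a minimal Python sketch. It assumes a contest publishes a rank for each participant, with rank 1 as the best. The function name and the numbers are hypothetical and are not taken from the study, which may have computed its comparisons differently.

```python
def percent_outperformed(rank: int, total_participants: int) -> float:
    """Return the percentage of participants who placed below `rank` (1 = best)."""
    if total_participants < 1 or not 1 <= rank <= total_participants:
        raise ValueError("rank must be between 1 and total_participants")
    # Everyone ranked worse than `rank` counts as outperformed.
    return 100.0 * (total_participants - rank) / total_participants


# Hypothetical numbers for illustration only; not data from the study.
print(percent_outperformed(rank=1500, total_participants=10000))  # 85.0
```

In this made-up example, an entry that placed 1,500th among 10,000 participants finished ahead of 8,500 of them, or 85%.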

LLMs compare patterns across, and classify, large datasets of text. For writing programming code, LLMs can be thought of as “translators” between natural English language and programming code, Ji said.

The research was supported by the National Institutes of Health under Award Number R00HG011468 to Hou.
