A high school student and a master's student walk into a biomedical data competition. It sounds like the setup for a joke, but the punchline is that they won, in part because AI did the heavy programming that seasoned bioinformaticians typically spend months crafting by hand. Their AI assistants finished the job in under two minutes.
The duo was part of a collaboration between the University of California, San Francisco, and Wayne State University that set out to determine whether large language models could tackle one of the harder problems in reproductive health research: analyzing large biological datasets to predict outcomes such as gestational age and the risk of preterm birth. These are not easy questions. Roughly 11 percent of infants worldwide are born prematurely, and we still lack reliable tools to identify at-risk pregnancies.
So the researchers pitted eight AI chatbots against results from DREAM challenges, in which teams of data scientists from around the world spend months building predictive models from the same datasets. More than a hundred teams had previously competed in three such challenges focused on pregnancy, working with data ranging from blood transcriptomics and placental DNA methylation to vaginal microbiome profiles. The researchers gave each LLM a single natural-language prompt describing the data, the goal, and the metrics to report, then simply ran the generated code. No hand-holding, no iterative debugging. One shot.
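The paper's exact prompts are not reproduced here, but the one-shot protocol is easy to picture. Below is a minimal sketch of such a harness in Python; the prompt wording is illustrative, and `query_llm` is a hypothetical stand-in for whichever chatbot API is being tested, not a function from the study.

```python
# Sketch of a one-shot evaluation harness. The prompt text is illustrative,
# and query_llm is a hypothetical wrapper, not the study's actual code.
import subprocess

PROMPT = (
    "You are given placental DNA methylation beta values (rows = samples, "
    "columns = CpG probes) and gestational age in weeks for the training "
    "samples. Write a complete script that trains a model, predicts "
    "gestational age for the test samples, and reports the error in weeks."
)

def query_llm(model_name: str, prompt: str) -> str:
    """Placeholder: plug in the vendor's client library for each chatbot."""
    raise NotImplementedError

def evaluate_once(model_name: str) -> str:
    """Send the prompt once, save the reply verbatim, and run it as-is."""
    code = query_llm(model_name, PROMPT)
    path = f"{model_name}_attempt.py"
    with open(path, "w") as f:
        f.write(code)
    # No edits, no retries: if the generated script fails to run, it fails.
    result = subprocess.run(["python", path], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else "FAILED"
```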
Half of the models failed outright: four of the eight could not produce code that ran.
But the other four, led by OpenAI's o3-mini-high, did something remarkable. Across four prediction tasks, the best LLM-generated models matched or exceeded the median performance of human DREAM participants. And on one task, predicting placental gestational age from roughly 350,000 DNA methylation markers, the AI-generated code actually beat the top human team. The LLM's ridge regression model achieved an error of 1.12 weeks, versus 1.24 weeks for the best human entry, a statistically significant difference. The human team had spent months building multi-stage random forest models that folded in additional clinical data the AI never even saw. The LLM went with a simpler approach, and the simpler approach won.
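The paper does not reprint the winning script, but ridge regression over methylation beta values is a standard recipe. Here is a minimal sketch in Python with scikit-learn; the file names, column layout, penalty grid, and choice of error metric are all assumptions for illustration, not the model that won.

```python
# Minimal sketch: ridge regression over DNA methylation beta values to
# predict gestational age. File names, column layout, and the alpha grid
# are illustrative assumptions, not the study's generated code.
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error

# Rows = samples, columns = ~350,000 CpG probes with beta values in [0, 1];
# the outcome files hold gestational age in weeks for each sample.
X_train = pd.read_csv("methylation_train.csv", index_col=0)
y_train = pd.read_csv("gestational_age_train.csv", index_col=0).squeeze()
X_test = pd.read_csv("methylation_test.csv", index_col=0)
y_test = pd.read_csv("gestational_age_test.csv", index_col=0).squeeze()

# Ridge regression handles far more features than samples by shrinking
# coefficients; RidgeCV picks the penalty strength by cross-validation
# on the training set only.
model = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0, 1000.0])
model.fit(X_train, y_train)

# Report an error in weeks; MAE is used here for illustration.
preds = model.predict(X_test)
print(f"Error: {mean_absolute_error(y_test, preds):.2f} weeks")
```

The appeal of this kind of model for methylation data is that a single regularized linear fit copes with hundreds of thousands of correlated probes without the hand-tuned feature engineering a multi-stage ensemble requires.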
“These AI tools could alleviate one of the most significant bottlenecks in data science: constructing our analysis workflows,” says Marina Sirota, a professor of pediatrics and interim director of the Bakar Computational Health Sciences Institute at UCSF. “The acceleration couldn’t arrive sooner for patients who urgently need assistance.”
There is, of course, context that tempers the enthusiasm. In three of the four challenges, human teams still came out ahead overall; the LLMs matched the median but could not touch the top performers. And the humans had certain advantages, including access to additional demographic data and the ability to submit multiple models and keep the best score. The LLMs got one try. It is also worth noting that R code ran far more reliably than Python (14 of 16 task completions versus seven), likely in part because Bioconductor packages for R ship with thorough code examples that the LLMs absorbed during training. None of the AIs completed the most complex data-retrieval task in Python.
What may matter more than raw accuracy, though, is speed. Code that took human participants hours to days to write, within a three-month competition window, came back from the LLMs in seconds. The entire project, from first prompt to journal submission, took about six months. Adi L. Tarca, a professor at Wayne State University who co-led the study, believes this changes the calculus for researchers without deep programming expertise. “Thanks to generative AI, researchers with limited data science backgrounds won’t always need to build extensive collaborations or spend hours troubleshooting code,” he says. “They can concentrate on addressing the pertinent biomedical questions.”
There is also a subtler finding in the results. None of the four successful LLMs committed what is arguably the cardinal sin of predictive modeling: leaking information from the test set into training. That kind of contamination is a common source of inflated accuracy in human-built models, and the fact that the AI-generated code avoided it says something about the caliber of the training data these models have seen.
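For readers unfamiliar with the failure mode, leakage most often creeps in when a preprocessing step, such as feature selection, is fit on the full dataset before it is split. The sketch below is a generic illustration in Python with scikit-learn, not code from the study: on pure noise, the leaky pipeline reports a flatteringly high held-out score, while the honest one hovers near zero.

```python
# Hypothetical illustration of test-set leakage, not code from the study.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Pure noise: 100 samples, 10,000 features, no real signal at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))
y = rng.normal(size=100)

# Leaky version: features are selected using the *entire* dataset, so the
# held-out samples influence which features the model gets to see.
X_leaky = SelectKBest(f_regression, k=50).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
leaky = Ridge().fit(X_tr, y_tr)
print(f"Leaky R^2 on 'held-out' data: {leaky.score(X_te, y_te):.2f}")

# Clean version: split first, then wrap selection and model in a pipeline
# so every step is fit on the training fold only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(SelectKBest(f_regression, k=50), Ridge())
clean.fit(X_tr, y_tr)
print(f"Honest R^2 on held-out data: {clean.score(X_te, y_te):.2f}")
```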
Still, the researchers are upfront about the limitations. All of the data were tabular; we do not yet know how these models would fare with imaging, unstructured clinical notes, or the complicated longitudinal designs typical of much of real-world medicine. There is also the question of convergence. Three LLMs produced identical models for one task, which is beneficial for