Recent research suggests that drug discovery programs relying heavily on DNA-encoded chemical libraries may be missing many viable drug candidates. These libraries tag each molecule with a unique DNA sequence that works like a barcode, allowing researchers to screen vast numbers of compounds simultaneously. The resulting datasets are often used to train machine learning models that search for potential drug candidates.
A team led by Raphael Franzini at the University of Utah set out to assess the reliability of data from DNA-encoded chemical libraries. They analyzed a library of more than 58,000 compounds aimed at enzymes involved in DNA repair and cancer. After synthesizing and testing 33 molecules that the screens had dismissed, they found many to be just as effective as those flagged as promising. Notably, several screens nearly missed compounds that are structurally similar to the approved cancer drug olaparib.
“We discovered that DNA-encoded library data frequently categorizes effective molecules as ineffective ones,” Franzini says. The problem appears to stem from the DNA barcodes themselves: molecules tested with the tags attached showed reduced activity, particularly against targets they were not originally designed to interact with.
Laura Guasch, a computational chemist at Roche in Switzerland, calls the research a “highly relevant contribution.” It highlights how false negatives contaminate screening datasets and can undermine the machine learning algorithms trained on them. “False negatives introduce considerable noise and bias into training datasets, causing machine learning models to detect misleading patterns or disregard valid chemotypes,” says Srinivas Chamakuri, an assistant professor at Baylor College of Medicine.
Franzini’s team demonstrated that machine learning models that appeared to perform well were simply recognizing recurring structural patterns and lacked genuine predictive power. “A key implication of this study is the notable risk that existing drug discovery initiatives might be overlooking potential drug candidates because of high rates of false negatives,” Guasch says.
The research showed that removing unreliable data from training sets and focusing on verified active compounds significantly improved the models’ ability to identify promising drug candidates. This suggests that machine learning methods for drug discovery may need substantial changes to account for the biases inherent in screening data.
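To make the idea of cleaning a training set concrete, here is a minimal Python sketch. It is not the study's method: the record layout, the field names `screen_hit` and `validated_active`, the compound names, and the toy values are all assumptions invented for illustration. It measures how often a screen mislabels confirmed actives, then keeps only compounds whose activity was checked in a follow-up assay.

```python
# Hypothetical records, not data from the study. "screen_hit" is the label
# from the DNA-encoded library screen; "validated_active" is the result of a
# follow-up assay on the resynthesized molecule (None if never retested).
compounds = [
    {"name": "cpd-A", "screen_hit": True,  "validated_active": True},
    {"name": "cpd-B", "screen_hit": False, "validated_active": True},
    {"name": "cpd-C", "screen_hit": False, "validated_active": True},
    {"name": "cpd-D", "screen_hit": False, "validated_active": False},
    {"name": "cpd-E", "screen_hit": False, "validated_active": None},
]


def false_negative_rate(records):
    """Fraction of confirmed actives that the screen labeled as inactive."""
    actives = [r for r in records if r["validated_active"] is True]
    if not actives:
        return None  # no confirmed actives, so the rate is undefined
    missed = sum(1 for r in actives if not r["screen_hit"])
    return missed / len(actives)


def verified_training_set(records):
    """Keep only retested compounds, labeled by the follow-up result."""
    return [
        (r["name"], r["validated_active"])
        for r in records
        if r["validated_active"] is not None
    ]


rate = false_negative_rate(compounds)
training = verified_training_set(compounds)
print(f"False negative rate among confirmed actives: {rate:.2f}")
print(f"Verified training examples: {len(training)}")
```

In this toy set the screen misses two of three confirmed actives, and the untested compound is dropped rather than trusted as a negative. A real pipeline would also have to decide how to treat the large number of compounds that are never resynthesized.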