"Incorrect pKa Values in Chemical Databases Could Compromise Drug Development Initiatives"

**Discrepancies in Acid Dissociation Constants for Zwitterionic Compounds: Consequences for Pharmaceutical Development and Environmental Science**

Acid dissociation constants, usually reported as pKa values, are fundamental to chemistry because they govern molecular properties such as solubility, ionization state, membrane permeability, and reactivity. These properties matter in fields ranging from drug development to environmental research. A recent study, however, has identified a problem: pKa values for zwitterionic compounds are recorded inconsistently in chemical databases, and the errors carry over into the predictive models trained on them. The finding could change how scientists and industry design drug candidates and simulate chemical behavior in complex systems.

### **Origins of the Issue**

The study that identified the issue comes from the Massachusetts Institute of Technology, where it was led by Jonathan Zheng and his team. Their work focused on ChEMBL, a prominent database of chemical and pharmacological information, and exposed errors in how pKa values for zwitterionic compounds are recorded. “Our investigation revealed that the ChEMBL database contains numerous erroneous pKa values due to nomenclature confusion,” Zheng says. The problem is compounded because many current machine-learning models, such as QupKake, are trained on ChEMBL data and therefore make systematic prediction errors for zwitterionic compounds.

The discrepancies stem from the dual character of zwitterions: molecules that carry both a positively and a negatively charged group and are therefore neutral overall. That makes assigning pKa values harder than for simpler, non-zwitterionic molecules.

As Zheng notes, the ambiguity lies in what counts as an “acidic” versus a “basic” pKa for these compounds. In solution, a zwitterion-forming molecule can exist in several forms (uncharged, singly charged, and dipolar), and which site gains or loses a proton depends on the pH and on the protonation state of the other functional groups. By convention, chemists label the lower pKa of a zwitterion as acidic and the higher one as basic, whereas for non-zwitterionic compounds the labels are reversed. This inconsistency in nomenclature has led to widespread reporting errors.
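The pH dependence described above can be illustrated with a short Python sketch. It assumes a simple two-step model with two macroscopic pKa values and uses approximate literature values for glycine (about 2.34 and 9.60) purely as an illustration. The function name `species_fractions` is hypothetical and is not taken from the study.

```python
def species_fractions(ph, pka_low, pka_high):
    """Return fractions of the net-cationic, net-neutral, and net-anionic forms.

    Assumes a two-step dissociation H2A+ -> HA -> A- described by two
    macroscopic pKa values (an illustrative simplification).
    """
    h = 10.0 ** (-ph)          # proton concentration from pH
    k1 = 10.0 ** (-pka_low)    # first dissociation constant
    k2 = 10.0 ** (-pka_high)   # second dissociation constant
    denom = h * h + k1 * h + k1 * k2
    return h * h / denom, k1 * h / denom, k1 * k2 / denom


if __name__ == "__main__":
    # Approximate glycine values, used here only as an example.
    for ph in (1.0, 6.0, 12.0):
        cation, neutral, anion = species_fractions(ph, 2.34, 9.60)
        print(f"pH {ph:4.1f}: cation={cation:.3f} neutral={neutral:.3f} anion={anion:.3f}")
```

In this model the net-neutral form dominates between the two pKa values. For glycine that form is overwhelmingly the zwitterion, although the macroscopic model itself does not distinguish the zwitterion from the uncharged tautomer.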

### **Practical Consequences**

Errors in pKa data are not trivial. Because pKa values affect solubility, absorption, and membrane permeability, they are crucial in judging whether a compound is a viable drug candidate. Kai Leonhard, a chemist at RWTH Aachen University in Germany, emphasizes this concern. “Promising drug candidates could be overlooked,” he says, “merely due to errors in pKa interpretation indicating suboptimal solubility or undesirable properties, when in reality, these candidates might possess advantageous profiles.”
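To see why a mislabeled value matters, consider a minimal sketch based on the Henderson-Hasselbalch relationship for a single ionizable group. The function name `fraction_charged`, the example pKa of 9.5, and the pH of 7.4 are illustrative assumptions rather than values from the study.

```python
def fraction_charged(pka, ph, role):
    """Fraction of a single ionizable group that carries a charge at a given pH.

    role="acid": the group is charged after losing a proton (HA -> A-).
    role="base": the group is charged while it holds the proton (BH+ -> B).
    """
    if role == "acid":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if role == "base":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("role must be 'acid' or 'base'")


if __name__ == "__main__":
    # Same number, two interpretations (hypothetical example values).
    for role in ("acid", "base"):
        print(role, round(fraction_charged(9.5, 7.4, role), 3))
```

With these example inputs, the same pKa read as acidic predicts a group that is less than 1% charged, while read as basic it predicts a group that is more than 99% charged. A swapped label can therefore reverse the predicted ionization state, and with it the estimated solubility and permeability.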

The consequences extend beyond drug discovery. In environmental science, pKa values are instrumental in modeling how a compound behaves in aqueous conditions, covering acid-base equilibria, pollutant stability, and degradation processes. Erroneous pKa values could, therefore, result in flawed predictions about the environmental fate, transport, and biochemical interactions of chemical entities.

### **Issues with Machine Learning and Predictive Models**

The findings from Zheng and his team matter all the more given the growing reliance on data-driven models. Their study assessed QupKake, a widely used machine-learning model trained on ChEMBL data, and found that its predictions for zwitterionic compounds were systematically less accurate when compared with experimental data curated by the International Union of Pure and Applied Chemistry (IUPAC). Because pKa modeling is increasingly important in drug development and physicochemical prediction, such errors may distort the results of large-scale studies and practical applications.

### **Solutions to the Issue**

Resolving these problems may require a thorough revision of how pKa data are documented, curated, and used. Zheng points out that systematic errors can be minimized if data curators adopt clearer, universally accepted nomenclature. He suggests dropping the labels “acidic” and “basic” for pKa values. Instead, terms like *proton loss constants* and *proton gain constants* could describe the underlying molecular processes more explicitly.

He also proposes recording whether a pKa value is “macroscopic” or “microscopic” as metadata. A macroscopic pKa describes the overall equilibrium between two protonation levels, with all isomeric forms at each level lumped together, while a microscopic pKa describes proton loss from one specific isomer or protonation state. Recording the distinction removes much of the ambiguity for zwitterionic and other multiprotic species.
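The relationship between the two kinds of constant can be sketched for the simplest case, a molecule with two ionizable sites, A and B. The sketch below uses the standard microconstant relations (the first macroscopic constant is the sum of the two first-step microconstants, and the product of the two macroscopic constants equals the product of the microconstants along either route). The function name `macro_from_micro` and its parameters are hypothetical.

```python
import math


def macro_from_micro(pk_a, pk_b, pk_ab):
    """Combine microscopic pKa values of a two-site molecule into macroscopic ones.

    pk_a  : loss of the proton on site A from the fully protonated form
    pk_b  : loss of the proton on site B from the fully protonated form
    pk_ab : loss of the proton on site B after site A has already lost its proton
    """
    k_a = 10.0 ** (-pk_a)
    k_b = 10.0 ** (-pk_b)
    k_ab = 10.0 ** (-pk_ab)
    big_k1 = k_a + k_b            # first macroscopic constant sums both routes
    big_k2 = k_a * k_ab / big_k1  # follows from K1 * K2 = k_a * k_ab
    return -math.log10(big_k1), -math.log10(big_k2)


if __name__ == "__main__":
    # Two identical, non-interacting sites (hypothetical example).
    print(macro_from_micro(5.0, 5.0, 5.0))
```

For two identical, independent sites the macroscopic values split symmetrically around the microscopic one by log10(2), which shows that the two quantities differ even in the simplest case.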

Another practical measure is for researchers to re-examine existing chemical databases, such as ChEMBL, and pinpoint compounds with known errors. Going forward, a concerted effort should be made to annotate molecules with comprehensive descriptors of their protonation states and the conditions under which each value was measured.
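One way such annotations might look in practice is sketched below. This is a hypothetical record layout, not the schema of ChEMBL or any other database: the class `PkaRecord`, its field names, and the helper `missing_conditions` are all assumptions made for illustration. It encodes the proposed proton loss and proton gain labels and the macroscopic and microscopic distinction as required metadata.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed labels, following the terminology proposed in the text.
PROCESSES = {"proton_loss", "proton_gain"}
SCOPES = {"macroscopic", "microscopic"}


@dataclass(frozen=True)
class PkaRecord:
    """Hypothetical annotated pKa entry (illustrative schema only)."""
    compound_id: str
    value: float
    process: str                          # "proton_loss" or "proton_gain"
    scope: str                            # "macroscopic" or "microscopic"
    temperature_c: Optional[float] = None
    solvent: Optional[str] = None

    def __post_init__(self):
        # Reject ambiguous labels such as "acidic" or "basic".
        if self.process not in PROCESSES:
            raise ValueError(f"unknown process label: {self.process!r}")
        if self.scope not in SCOPES:
            raise ValueError(f"unknown scope label: {self.scope!r}")


def missing_conditions(records):
    """Return the records that lack temperature or solvent annotations."""
    return [r for r in records if r.temperature_c is None or r.solvent is None]
```

Rejecting free-form labels when a record is created is one possible design choice. It keeps ambiguous entries out of the dataset, at the cost of requiring curators to resolve each legacy label before import.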