
Bootstrapped Machine Learning for Characterising Compounds in Drug Discovery

Researchers from Purdue University have developed a bootstrapped machine learning method for interpreting tandem mass spectrometry data, with the aim of improving how compounds are characterised in drug discovery.

Tandem mass spectrometry

Researchers use tandem mass spectrometry (MS/MS) for characterisation of complex mixtures in fields such as drug discovery. Collision-activated dissociation (CAD) is a common MS/MS technique for obtaining structural information for ionised and isolated mixture components. However, this approach is limited by the fact that isomeric ions often generate identical fragmentation patterns, making identification of different compounds via CAD unreliable.

To address this, a new MS/MS approach has been developed, based on diagnostic, reliable and predictable gas-phase ion-molecule reactions. Researchers can use this approach to identify specific functional groups, or combinations of them, in ionised and isolated mixture components. This facilitates the differentiation of isomeric ions without the need for reference compounds.

One neutral reagent previously used to differentiate two isomeric drug metabolites is 2-methoxypropene (MOP). In these experiments, the analytes were protonated by atmospheric pressure chemical ionisation (APCI) in a linear quadrupole ion trap mass spectrometer. The researchers then transferred the protonated analytes into the ion trap, isolated them and allowed them to react with MOP. Some protonated analytes are unreactive towards MOP, whereas others transfer a proton to MOP. The protonated analytes of greatest interest are those that form a diagnostic, stable addition product with MOP. The researchers then ejected all of the product ions from the ion trap into external detectors in a mass-selective manner. This gives their m/z values and relative abundances, from which the reactions that have taken place can be determined.
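The peak assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' software: MOP is taken as C4H8O, all ions are assumed to be singly charged, and the function names, the 0.01 m/z tolerance and the example peaks are hypothetical.

```python
# Standard monoisotopic masses (u) of the atoms in MOP.
MASS_C, MASS_H, MASS_O = 12.0, 1.00782503, 15.99491462
PROTON = 1.00727647  # mass of a proton

# 2-methoxypropene (MOP), C4H8O: about 72.0575 u
MOP_MASS = 4 * MASS_C + 8 * MASS_H + MASS_O


def expected_product_mz(analyte_mz):
    """m/z at which each outcome would appear for a singly charged [M+H]+ ion."""
    return {
        "unreactive": analyte_mz,              # [M+H]+ left unchanged
        "proton_transfer": MOP_MASS + PROTON,  # [MOP+H]+
        "adduct": analyte_mz + MOP_MASS,       # [M+H+MOP]+
    }


def assign_peaks(analyte_mz, peaks, tol=0.01):
    """Sum the relative abundance of observed peaks matching each outcome.

    peaks: iterable of (m/z, relative abundance) pairs.
    tol:   hypothetical matching tolerance in m/z units.
    """
    expected = expected_product_mz(analyte_mz)
    assigned = {}
    for mz, abundance in peaks:
        for outcome, target in expected.items():
            if abs(mz - target) <= tol:
                assigned[outcome] = assigned.get(outcome, 0.0) + abundance
    return assigned
```

For a hypothetical protonated analyte at m/z 200.0, calling `assign_peaks(200.0, [(200.001, 20.0), (272.058, 100.0)])` would label the second peak as the MOP addition product, because it lies one MOP mass above the analyte ion.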

However, interpretation of the data obtained for complex mixtures in these experiments is challenging and time-consuming due to the amount of data produced.

Why is a bootstrapped machine learning model needed?

To overcome the difficulty of processing data from MOP experiments, the researchers behind this study developed an interpretable machine learning method based on chemical graphs. The method predicts whether a protonated analyte will form a diagnostic product ion upon reaction with MOP, which simplifies data interpretation. Long short-term memory (LSTM) networks, multilayer perceptrons and graph convolutional networks (GCN) have previously been shown to be suitable for predicting reaction outcomes when a large number of known reactions is available. Unfortunately, because the diagnostic ion-molecule reactions of interest are so specific, only a small set of known reactions exists. Moreover, these models are difficult to interpret and yield no additional chemical insight. A new machine learning method was therefore needed.

Bootstrapped machine learning model

This study represents the first use of a bootstrapped machine learning model for this application. The researchers developed a machine learning model based on the Morgan fingerprint algorithm. The algorithm was used to represent functional groups, each defined as the presence or absence of a particular topology of a collection of atoms. This avoids the use of manually defined functional groups, which are subject to human bias and interpretation. The Morgan fingerprint algorithm works by finding the subgraphs of a molecular graph (that is, the connectivity of the atoms in a molecule) centred on each atom, and assigning a number to each subgraph. The researchers calculated these numbers via a set of hashing functions applied to each atom and its neighbourhood. Each resulting number can be used as a substitute for a functional group.
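The idea of hashing each atom together with its neighbourhood can be illustrated with a simplified, Morgan-like routine. This is a sketch under stated assumptions rather than the published algorithm or the researchers' code: hydrogens are implicit, bond orders are ignored, the only atom properties used are element and degree, and the function names are hypothetical.

```python
import hashlib


def _hash(*parts):
    """Deterministic 32-bit hash of a tuple of values."""
    digest = hashlib.sha256(repr(parts).encode()).hexdigest()
    return int(digest[:8], 16)


def morgan_like_identifiers(atoms, bonds, radius=2):
    """Return the set of substructure identifiers for a molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom index pairs
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Radius 0: each atom is identified by its element and its degree.
    current = [_hash(sym, len(neighbours[i])) for i, sym in enumerate(atoms)]
    identifiers = set(current)

    # Each round folds the neighbours' identifiers into the atom's own,
    # so the new identifier describes a neighbourhood one bond larger.
    for _ in range(radius):
        current = [
            _hash(current[i], tuple(sorted(current[j] for j in neighbours[i])))
            for i in range(len(atoms))
        ]
        identifiers.update(current)
    return identifiers


def to_bits(identifiers, n_bits=64):
    """Fold the identifiers into a fixed-length presence/absence vector."""
    bits = [0] * n_bits
    for ident in identifiers:
        bits[ident % n_bits] = 1
    return bits
```

Because neighbour identifiers are sorted before hashing, the result does not depend on the order in which the atoms are listed. The isomers ethanol (`["C", "C", "O"]`) and dimethyl ether (`["C", "O", "C"]`), each with bonds `[(0, 1), (1, 2)]`, produce different identifier sets, which is the property that lets such numbers stand in for functional groups.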


Overall, the bootstrapped decision tree model was trained on 36 known ion-molecule reactions with MOP. When the model was tested with a blind test set, it achieved a Cohen's kappa statistic of 0.70. On the commonly used interpretation scale, this indicates substantial agreement between predicted and observed reactivity, despite the limited training data. More specifically, the model correctly predicted the reactivity of 11 of the 13 analytes in the test set, rising to 13 of 13 when an additional quantum mechanics (QM) filter based on the relevant proton affinities was applied. The researchers also evaluated other machine learning models, including k-nearest neighbour, and none of them outperformed the 0.70 kappa value of the bootstrapped machine learning model generated in this study.
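For readers unfamiliar with the statistic, Cohen's kappa corrects the observed agreement for the agreement expected by chance. The sketch below computes it for two lists of binary labels. The example labels are hypothetical: only the figure of 11 correct predictions out of 13 comes from the text above, and the split between reactive and unreactive analytes is an assumption made purely for illustration.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal frequency of each label.
    expected = sum(
        (y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels
    )
    if expected == 1.0:
        return 1.0  # only one label present in both lists
    return (observed - expected) / (1 - expected)


# Hypothetical test set: 5 reactive (1) and 8 unreactive (0) analytes,
# with 2 unreactive analytes wrongly predicted as reactive (11/13 correct).
y_true = [1] * 5 + [0] * 8
y_pred = [1] * 5 + [1] * 2 + [0] * 6
kappa = cohen_kappa(y_true, y_pred)
```

With this assumed split, the accuracy is 11/13 (about 0.85) while kappa is 60/86 (about 0.70). A different split with the same accuracy would give a different kappa, which is why the statistic is more informative than accuracy alone on small, unbalanced test sets.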

To ensure that the introduction of new data does not cause extensive changes to the decision tree model, prospective diagnostic product predictions were experimentally tested for 13 unpublished analytes. The results indicated minimal changes, which suggests that the selection of chemical features is robust. The researchers therefore hope that this method will pave the way for expanding MS/MS methods to include new diagnostic reactions for the identification of many different functionalities in drug metabolites in an easy, accurate and automated manner.
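One generic way to probe how stable a feature selection is on a small data set is to resample the training data with replacement (bootstrapping) and count how often each feature is chosen as the best split. The sketch below does this with single-feature splits instead of full decision trees. It is an illustration of the general technique, not the procedure used in the study, and the feature names and toy data are hypothetical.

```python
import random
from collections import Counter


def best_single_feature(rows, labels, feature_names):
    """Pick the feature whose presence/absence best separates the labels."""
    best_name, best_acc = None, -1.0
    for name in feature_names:
        agree = sum(row[name] == label for row, label in zip(rows, labels))
        # A split may predict "reactive" for presence or for absence.
        acc = max(agree, len(rows) - agree) / len(rows)
        if acc > best_acc:  # ties keep the earlier feature
            best_name, best_acc = name, acc
    return best_name


def bootstrap_feature_counts(rows, labels, feature_names, n_rounds=200, seed=0):
    """Count how often each feature wins across bootstrap resamples."""
    rng = random.Random(seed)
    counts = Counter()
    n = len(rows)
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        winner = best_single_feature(
            [rows[i] for i in idx], [labels[i] for i in idx], feature_names
        )
        counts[winner] += 1
    return counts


# Hypothetical toy data: 12 analytes, 3 binary substructure features.
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
frag_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # matches the label 11 times out of 12
frag_b = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # matches 6 times out of 12
frag_c = [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1]  # matches 6 times out of 12
names = ["frag_A", "frag_B", "frag_C"]
rows = [dict(zip(names, values)) for values in zip(frag_a, frag_b, frag_c)]
counts = bootstrap_feature_counts(rows, labels, names)
```

A feature that wins in nearly every resample, as `frag_A` does here, is unlikely to be displaced when a few new analytes are added, which is the kind of robustness the paragraph above describes.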


This study developed a new bootstrapped machine learning method for the characterisation of compounds in drug discovery. None of the other machine learning models evaluated outperformed it, and it integrated new data effectively. The researchers hope that this research will pave the way for the fast determination of unknown isomeric metabolites of medicinal compounds via the identification of diagnostic product ions formed with selected neutral reagents. In the future, they hope to showcase a fully automated pipeline for mixture component identification that incorporates multiple models similar to the one developed in this study, and to demonstrate how this methodology can be used for the development of new therapeutics.

Image credit: FreePik
