A research group at the Institute for Research in Biomedicine (IRB) in Barcelona has developed a machine-learning tool that uses neural networks to predict the bioactivity signature of any given compound even without experimental data. This provides important biological insight into the therapeutic potential of compounds to facilitate the selection of efficacious candidates in the drug discovery pipeline.
Computational drug discovery
The vastness of the chemical space is a double-edged sword. It offers an extensive source of new drugs, but it also complicates drug discovery, since a single drug candidate must be selected from an enormous pool of possibilities. Computational drug discovery (CDD) has aided in the selection process, using chemoinformatics to navigate the chemical space. At the heart of chemoinformatics is a range of chemical descriptors capturing the physicochemical and structural properties of small molecules. The recent availability of bioactivity descriptors, such as ligand-binding affinities and target profiles, has enabled a more biologically relevant characterisation of these chemicals. Supplementing chemical information with biological information facilitates the selection of candidates with desired properties at each stage of drug discovery.
Previously, the same research group had collated major chemogenomics and drug databases to create the Chemical Checker (CC), the largest database of small-molecule bioactivity signatures to date. The CC is composed of 25 bioactivity spaces for ~800,000 molecules. Together, the spaces describe properties such as chemical structure, target information, network properties, plus cellular and clinical responses. However, because experimentally derived bioactivity data are incomplete, most molecules documented in the CC are missing signatures in many of these spaces. As a result, bioactivity signatures are useful only for well-characterised compounds, and CDD is restricted to chemical information alone for most compounds.
Developing bioactivity signaturizers with neural networks
By integrating deep neural networks with experimental information in the CC, the researchers behind this study developed “signaturizers”. These are a set of models that infer the bioactivity signatures for any given compound even in the absence of experimental data.
In chemoinformatics, bioactivity signatures are represented as multi-dimensional vectors encapsulating the biological properties of small molecules of interest. The shorter the distance between the bioactivity signatures of two compounds, the more similar their biological behaviour. In the CC, the researchers had observed correlations between different bioactivity spaces: for any pair of compounds, their similarity in one type of bioactivity signature correlates with their similarities across the rest of the CC signatures. Accordingly, a similarity measure in a given space can be inferred for a compound even when no experimental data exist for it in that space.
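To make the idea of "distance between signatures" concrete, here is a minimal sketch in Python. It assumes cosine distance as the metric and uses made-up four-dimensional vectors. The article does not say which distance the CC uses, and the names `cosine_distance`, `sig_a`, `sig_b` and `sig_c` are purely illustrative.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy signatures (real ones would have many more dimensions).
sig_a = [0.9, 0.1, -0.3, 0.5]
sig_b = [0.8, 0.2, -0.2, 0.4]   # points in nearly the same direction as sig_a
sig_c = [-0.7, 0.6, 0.4, -0.1]  # points in a very different direction

# A smaller distance is read as more similar biological behaviour.
print(cosine_distance(sig_a, sig_b))
print(cosine_distance(sig_a, sig_c))
```

In this toy example, the compound behind `sig_b` would be considered a closer biological neighbour of `sig_a` than the compound behind `sig_c`.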
The researchers treated the inference of bioactivity signatures as a metric learning problem. They trained Siamese Neural Networks (SNNs), each corresponding to one CC space, with 10^7 triplets of molecules: an anchor, a molecule similar to the anchor (positive) and a molecule dissimilar to it (negative). The SNNs were tasked with correctly classifying this pattern using a distance measurement in the relevant CC space. This ultimately generated 25 SNN signaturizers. A signaturizer is a model that takes the CC signatures available for a compound as input and yields a 128-dimensional signature as output. This signature captures the similarity profile of the compound in the corresponding CC space.
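The triplet set-up can be illustrated with a standard triplet margin loss, sketched below in plain Python. This is an assumption for illustration only: the article does not specify the loss function, the distance or the margin used by the researchers, and `euclidean` and `triplet_loss` are hypothetical helper names. The loss is zero when the anchor already sits closer to the positive than to the negative by at least the margin, and grows as that ordering is violated.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-dimensional embeddings (real signatures are 128-dimensional).
anchor = [0.0, 0.0]

# Well-ordered triplet: the positive is near the anchor, the negative is far.
satisfied = triplet_loss(anchor, positive=[0.1, 0.0], negative=[3.0, 0.0])

# Badly ordered triplet: the "positive" is farther away than the "negative".
violated = triplet_loss(anchor, positive=[2.0, 0.0], negative=[0.5, 0.0])

print(satisfied, violated)
```

During training, a network would adjust its embeddings to push losses like `violated` towards zero across all triplets.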
In the validation stage, signaturizer performance varied according to the CC space and molecule of interest. Nevertheless, the signaturizers predicted bioactivity well across all spaces.
Assessing the reliability of the neural network bioactivity predictions
Next, the validated signaturizers were used to fill in the signatures missing from the CC. Incorporating these SNN predictions increased the number of signatures available.
To assess the reliability of these predicted signatures, the model also provides an applicability score for each compound. The score accounts for the resemblance of the predicted signatures to the experimental signatures in the training set, the robustness of the predictions to test-time data dropout, and the expected accuracy based on the available CC datasets. Regions of the prediction space with higher applicability scores can then be identified easily, highlighting where bioactivity is predicted more reliably and accurately.
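One ingredient of such a score, robustness to test-time dropout, can be sketched as follows. The idea is to mask random inputs, predict repeatedly and measure how much the output moves. This is a toy illustration under stated assumptions, not the researchers' procedure: `toy_predict` is a hypothetical stand-in for a trained model, and the number of rounds, dropout probability and use of the standard deviation are arbitrary choices.

```python
import random
import statistics

def toy_predict(features):
    """Hypothetical stand-in for a trained model: mean of the inputs present."""
    present = [x for x in features if x is not None]
    return sum(present) / len(present)

def dropout_spread(features, predict, n_rounds=50, drop_prob=0.2, seed=0):
    """Spread (population std. dev.) of predictions under random input masking.

    A small spread suggests the prediction does not hinge on any single input.
    """
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_rounds):
        masked = [None if rng.random() < drop_prob else x for x in features]
        if all(x is None for x in masked):
            continue  # nothing left to predict from; skip this round
        predictions.append(predict(masked))
    if not predictions:
        raise ValueError("every round masked all inputs")
    return statistics.pstdev(predictions)

# Redundant inputs: any subset gives the same answer, so the spread is zero.
stable = dropout_spread([1.0] * 8, toy_predict)

# Heterogeneous inputs: the answer depends on which inputs survive the masking.
unstable = dropout_spread([0.0, 10.0, -5.0, 7.0, 1.0, -8.0, 3.0, 6.0], toy_predict)

print(stable, unstable)
```

A low spread would count towards a higher applicability score. How this term is combined with the other two components is not described in the article and is left out here.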
Validating the signaturizers with Snail1
For validation, the researchers used the signaturizers to identify potential drug candidates against an orphan drug target, Snail1, a cancer-associated transcription factor that is considered near-undruggable. The signaturizers were applied to ~20,000 compounds from two libraries, namely the Prestwick collection and the IRB Barcelona proprietary library. Overall, the signaturizers identified 222 compounds with the chemical and biological characteristics expected of molecules capable of downregulating Snail1. The ability of these predicted compounds to reduce Snail1 levels was also verified experimentally.
Summary
This new methodology harnesses neural networks to infer the biological activity of any chemical compound even without experimental data. When evaluating therapeutic potential, these signaturizers extend the applicability of bioactivity data to new or poorly characterised compounds. Signaturizers may improve CDD efficiency as they provide the crucial biological insight needed to select compounds throughout the drug discovery pipeline. Importantly, this tool may also aid in the development of drugs for undruggable targets.
Image credit: jcomp – Freepik