Target prediction with machine learning algorithms can help accelerate the identification of protein targets of hit molecules, limiting the number of required experiments. However, the drug-target interaction databases used for training carry strong statistical biases, leading to a high number of false positives and therefore increasing the time and cost of experimental validation. A recent study developed a method to correct these statistical biases, thereby reducing false positive drug target predictions and the number of experiments required for validation.
Drug target predictions
Drug discovery often relies on the identification of therapeutic targets, which are usually proteins playing a role in the disease. This allows researchers to design or search for small-molecule drugs that interact with the protein target to alter disease development. More recently, however, there has been renewed interest in phenotypic drug discovery, which does not rely on prior knowledge of targets. While phenotypic drug discovery has yielded a few first-in-class drugs, once a phenotypic hit is identified, not knowing its mechanism of action remains a strong limitation when the drug reaches the market.
A recent study set out to determine the protein targets of hit molecules discovered in phenotypic screens. The researchers tackled target identification as a drug-target interaction (DTI) prediction problem, using machine learning chemogenomic algorithms.
Biases in databases
Various machine learning algorithms have been proposed for drug target prediction. These include similarity-based methods such as kernel ridge regression, support vector machines (SVM) and neighbourhood regularised logistic matrix factorisation (NRLMF). However, whatever algorithm is used, training a good machine learning chemogenomic model is hindered by biases in the drug-target interaction database. One such bias is whether the molecule for which one wishes to make predictions already has known interactions in the database. Another issue arises when databases contain only positive examples of pairs known to interact, but no negative examples of pairs known not to interact.
This work explores how best to choose negative examples to correct the statistical bias of databases and reduce the number of false positive drug target predictions, which is essential to limit the time- and resource-intensive experiments required to validate true protein targets.
Building datasets for drug target predictions
Hit molecules in phenotypic screens for drug discovery are mainly drug-like compounds. The researchers used the DrugBank database to build their training dataset, since it provides high-quality bio-activity information relating to approved and experimental drugs, including their targets. Overall, DrugBank contains around 17,000 curated drug-target interactions (DTIs). Using this data, they built their DB-database, which comprises all DTIs reported in DrugBank that involve a human protein and a small-molecule drug. In total, the DB-database contains 14,637 interactions between 2,670 human proteins and 5,070 drug-like molecules, which make up their positive DTIs. Since training a machine learning algorithm also requires negative examples, the researchers added an equal number of negative DTIs to their database using random and balanced sampling.
Their random sampling method chose negative examples at random, among pairs that are not labelled as DTIs but where both the small molecule and human protein are in the database. This was based on the assumption that most of the unlabelled interactions are expected to be negative. This process was repeated 5 times, leading to 5 training sets called RN-datasets.
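The random sampling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual code: the function name and data representation (drug and protein identifiers as strings, pairs as tuples) are assumptions for the example.

```python
import random

def sample_random_negatives(positive_pairs, drugs, proteins, n_negatives, seed=0):
    """Draw unlabelled (drug, protein) pairs at random as presumed negatives.

    Assumes most unlabelled pairs do not interact, as in the study.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(drugs), rng.choice(proteins))
        # Keep only pairs not already labelled as positive interactions.
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)
```

Repeating this with five different seeds would yield five training sets analogous to the RN-datasets.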
Meanwhile, during balanced sampling, the researchers randomly chose negative examples among unlabelled DTIs, in such a way that each protein and each drug appeared an equal number of times in positive and negative interactions. This process was also repeated 5 times, leading to 5 training datasets called BN-datasets.
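The balanced sampling idea can be sketched similarly: each drug and each protein is given a quota equal to its number of positive interactions, and unlabelled pairs are accepted only while both quotas remain open. This greedy sketch is an assumption for illustration, not the authors' implementation, and may not fully exhaust the quotas on every dataset.

```python
import random
from collections import Counter

def sample_balanced_negatives(positive_pairs, seed=0, max_attempts=100_000):
    """Draw unlabelled pairs so each drug and protein appears (at most)
    as often in negatives as it does in positives -- a greedy sketch."""
    rng = random.Random(seed)
    drug_quota = Counter(d for d, _ in positive_pairs)
    prot_quota = Counter(p for _, p in positive_pairs)
    positives = set(positive_pairs)
    drugs, proteins = list(drug_quota), list(prot_quota)
    negatives = set()
    for _ in range(max_attempts):
        if len(negatives) == len(positive_pairs):
            break
        d, p = rng.choice(drugs), rng.choice(proteins)
        if (d, p) in positives or (d, p) in negatives:
            continue
        # Accept the pair only if both the drug and the protein still
        # have room in their negative-example quotas.
        if drug_quota[d] > 0 and prot_quota[p] > 0:
            negatives.add((d, p))
            drug_quota[d] -= 1
            prot_quota[p] -= 1
    return sorted(negatives)
```

As with random sampling, running this with five seeds would produce five training sets analogous to the BN-datasets.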
Overall, the RN- and BN-datasets share the same set of positive DTIs, which are those contained in the DB-database, and their total number of negative DTIs are equal to that of positive DTIs.
Using machine learning to reduce false positive drug target predictions
Throughout the article, the main algorithm the team used was the SVM algorithm. To compare the performance of their model trained on the RN- or the BN-datasets when predicting targets for difficult molecules (molecules that have no or few known targets), the researchers considered a small dataset of DTIs involving drugs with few known targets. This dataset was built from the DB-database by compiling all the drugs with no more than 4 targets. This led to 560 drugs involved in 851 interactions, from which the researchers selected 200 as positive DTIs, involving 200 different drugs, defining the so-called 200-positive-dataset. Following that, 200 negative DTIs were also randomly chosen among the unlabelled DTIs that did not belong to the RN- or BN-datasets, defining the 200-negative-dataset.
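An SVM-based prediction step of the kind described above can be sketched as follows. The feature encoding here is a placeholder: the study's actual molecular and protein descriptors are not specified in this article, so random vectors stand in for them, and the scikit-learn classifier is an assumed but standard choice for an SVM.

```python
# Minimal sketch: train an SVM on labelled DTI pairs, then rank candidate
# targets for a query molecule by predicted interaction probability.
# Feature vectors are random placeholders, not real descriptors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 32))        # placeholder pair features
y_train = rng.integers(0, 2, size=400)      # 1 = positive DTI, 0 = sampled negative

clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

X_candidates = rng.normal(size=(10, 32))    # candidate (drug, protein) pairs
scores = clf.predict_proba(X_candidates)[:, 1]
ranked = np.argsort(scores)[::-1]           # top-ranked predicted targets first
```

In the study's setting, the top of this ranking is what matters: fewer false positives among the highest-ranked predictions means fewer wasted validation experiments.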
Analysis of the model's performance on three drugs (DB11363, DB11842 and DB11732, which had all their known DTIs removed from the training set) showed that training with the BN-datasets allowed the researchers to recover all of the true targets in each case. Moreover, the researchers' method decreased the number of false positives among the top-ranked predicted targets and, overall, improved the rank of the true targets. The key result of the paper was to show that choosing an equal number of positive and negative DTIs per molecule and per protein decreases the number of false positive drug target predictions made from biased datasets.
This study used computational methods to correct databases’ statistical biases and reduce the number of false positive predictions. It is hoped that their method will be used to accelerate the search for drug targets, thereby reducing the number of costly experiments required.
Image credit: unoL – Canva