Epigenetic targets are of significant importance in drug discovery research, and there is increasing availability of chemogenomic data related to epigenetics. A recent study, published in the Journal of Medicinal Chemistry, developed models for epigenetic target fishing based on established machine learning algorithms trained on different fingerprint representations of compounds, to facilitate epigenetic drug discovery.
Epigenetics is an essential component in an organism’s normal development and responsiveness, and its dysregulation has been associated with altered gene expression patterns related to multiple diseases. This makes epigenetic targets a focus for drug discovery research. Successful examples of drugs targeting epigenetic proteins can be found in cancer research, with the approval of eight epigenetic drugs for clinical use. These include azacytidine, which targets DNMT1, and belinostat, which targets HDACs. Over the past decade there has been increasing availability of chemogenomic databases related to epigenetics, illustrating its importance in drug discovery. One such example is EpiFactors, one of the largest databases of annotated proteins related to epigenetics. In total, this database contains 815 different targets.
The increasing availability of chemogenomic data for all target classes has opened up the opportunity to construct ligand-based models to assist target prediction of small molecules. However, the data available for epigenetic targets is still small compared with that available for other protein families such as kinases, ion channels or G protein-coupled receptors. This suggests that epigenetic targets are commonly underrepresented in current target prediction methods and, unless a query compound is sufficiently similar to a known ligand, they are less likely to be predicted as potential targets of small molecules by existing prediction models. This stresses the need to develop predictive models focused on epigenetic targets to assist medicinal chemistry efforts.
Machine learning for epigenetic target fishing
The application of machine learning models for large-scale epigenetic target prediction has only been explored on a limited basis. Most of the research has focused on single targets or protein families such as HDACs or the BET family. Therefore, the researchers behind this study aimed to develop accurate models for epigenetic target fishing, based on machine learning algorithms trained on different fingerprint representations of compounds.
In this study, the researchers developed and evaluated the performance of five machine learning algorithms built on three molecular fingerprints of different designs to predict 55 epigenetic targets of small molecules.
The five machine learning algorithms used were:
- k-nearest neighbours (k-NN)
- Random Forest (RF)
- Gradient Boosting Trees (GBT)
- Support Vector Machines (SVM)
- Feed-Forward Neural Networks (FFNN).
The selected molecular fingerprints were:
- Molecular ACCess System (MACCS) Keys as a dictionary-based fingerprint where each position indicated the presence or absence of a predefined structure.
- Morgan fingerprint with radius 2 as a circular fingerprint where each position represented an atom environment including all atoms connected up to a radius of two bonds.
- RDK fingerprint as a topological fingerprint where each position represented a linear substructure including all atoms connected up to a length of seven bonds.
The performance of the models was validated using two approaches. First, the binary classifiers for each target were assessed by 10-fold cross-validation. Second, the classifiers were combined for epigenetic target prediction, and this combination was evaluated over 10 balanced samples of compounds containing an equal number of known active compounds for each target.
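To illustrate the first validation approach, here is a minimal sketch of a 10-fold split for one target's binary classifier. It is not the study's code: the stratification by class, the function names and the toy labels are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Return k folds of indices with roughly equal class proportions.

    Stratification is an assumption here; the article only says "10-fold".
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a similar share of each class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set exactly once."""
    folds = stratified_k_fold(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# Toy labels for one hypothetical target: 30 actives (1) and 70 inactives (0).
labels = [1] * 30 + [0] * 70
splits = list(train_test_splits(labels))
```

Each compound is predicted exactly once by a model that never saw it during training, which is what makes the cross-validated metrics an estimate of performance on unseen compounds.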
Identifying the best machine learning model for epigenetic target fishing
This study identified Morgan and RDK fingerprints as the best representations for deriving binary classifiers for the targets, especially when used with SVM. However, as a model’s performance depends on dataset composition, the researchers stressed that the trends found in this study could change as more bioactivity data is published and if different sets of hyperparameters are studied.
The researchers built a consensus model by combining the predictions of the best models derived from Morgan and RDK fingerprints (Morgan:SVM and RDK:SVM). The performance of the consensus model, as well as that of the two source models, was assessed on a distance-to-model basis, categorising the predictions according to the Jaccard distance between the compounds in the test set and those in the training set. The Jaccard distance measures the dissimilarity of two sets: one minus the size of their intersection divided by the size of their union. When tested using single-target binary classification, the consensus model showed a significantly higher precision for identifying active compounds than the individual source models. This trend also held when the models were evaluated for epigenetic target fishing.
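The distance-to-model idea can be sketched in a few lines. The example below treats each fingerprint as a set of on-bit indices and takes the distance to the nearest training compound. The fingerprints are made up. The consensus rule shown (both source models must agree) is only one possible way to combine two classifiers and is an assumption, since the article does not state how the predictions were combined.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of on-bit indices: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0  # convention used here: two empty fingerprints count as identical
    return 1.0 - len(a & b) / len(union)


def distance_to_model(query, training_set):
    """Distance from a query compound to its nearest neighbour in the training set."""
    return min(jaccard_distance(query, fp) for fp in training_set)


def consensus_active(pred_morgan, pred_rdk):
    """ASSUMED rule, not taken from the study: active only if both models say active."""
    return pred_morgan and pred_rdk


# Made-up fingerprints, written as sets of on-bit positions.
training = [{1, 4, 7, 9}, {2, 4, 8}, {1, 2, 3, 4}]
query = {1, 4, 7, 10}

d = distance_to_model(query, training)
```

Here the query shares three of five distinct bits with its nearest training compound, so `d` is 1 - 3/5 = 0.4. Predictions for compounds with a small distance fall in the "closer to the training set" category, where higher precision would be expected.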
The consensus model showed a mean balanced accuracy (BA) of 0.835 considering the cross-validated predictions of 55 target-associated binary classifiers, with mean precisions for identifying active compounds ranging from 0.923 for compounds closer to the training set, to 0.810 for compounds further from the training set. Mean BA was used as a performance metric to select the best set of hyperparameters. For epigenetic target prediction, mean precisions were found to range from 0.952 to 0.773. To demonstrate the applicability of the consensus model, the researchers performed retrospective identification of epigenetic targets of two external and recently reported compounds.
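For readers less familiar with the metrics quoted above, the sketch below computes balanced accuracy and precision from the counts of a confusion matrix. The counts are invented for illustration and are not results from the study.

```python
def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (recall on actives) and specificity (recall on inactives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def precision(tp, fp):
    """Fraction of compounds predicted active that are truly active."""
    return tp / (tp + fp)


# Invented counts for one hypothetical binary classifier.
tp, fp, tn, fn = 40, 5, 45, 10

ba = balanced_accuracy(tp, fp, tn, fn)  # (0.8 + 0.9) / 2
prec = precision(tp, fp)                # 40 / 45
```

Balanced accuracy weights actives and inactives equally, so it is not inflated when inactive compounds greatly outnumber active ones. Precision answers the question a medicinal chemist is likely to care about most: of the targets the model proposes, how many are correct.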
This study produced a consensus machine learning model that was demonstrated to be a robust and accurate method for epigenetic target prediction for small molecules. It is hoped that this model will be a helpful tool in practical medicinal chemistry applications for epigenetic drug discovery. As more data becomes available, the researchers will update the number of epigenetic targets included and the classification models implemented.