
Predicting Moonlighting Proteins Using Machine Learning

Moonlighting proteins are a subclass of multifunctional proteins that play an important role in disease pathways and drug-target discovery. Because detecting these proteins experimentally is challenging, most of them have been discovered by chance. A new study has therefore used eight machine-learning classification models to try to predict moonlighting proteins.

Moonlighting proteins

Moonlighting proteins are a subset of multifunctional proteins in which one polypeptide chain exhibits more than one biochemical function. The term "moonlighting" can be applied to proteins with at least two different, unrelated functions, as long as this multifunctionality is not the result of gene fusion, multiple domains, multiple splice variants, proteolytic fragments, or pleiotropic effects. Another important feature of moonlighting proteins is the independence of their functions: inactivating one function does not affect the protein's other functions. Online databases such as MoonProt and MoonDB list 400 and 238 moonlighting proteins, respectively.

Moonlighting proteins fall into several subtypes, which differ in how the functions are arranged on the protein:

  1. Different sites for different functions in the same domain.
  2. Different sites for different functions in different domains.
  3. The same residues used for different functions.
  4. Different residues of the same site used for different functions.
  5. Different structural conformations or folds used for different functions.

Why is predicting moonlighting proteins important for drug discovery?

Although there have been various studies on moonlighting proteins, a great deal about them remains unknown. There are numerous reasons why predicting moonlighting proteins is appealing: it can help detect unknown cellular processes, identify new protein mechanisms, improve protein function prediction, clarify a protein's role in a disease pathway, and provide information on protein evolution and drug discovery. Previous research has indicated that 78% of moonlighting proteins are involved in human disease pathways and 48% are the targets of active medicines. For example, the moonlighting protein phosphoglucose isomerase is an enzyme involved in glycolysis and is also a cytokine involved in breast cancer metastasis.

Computational methods for predicting moonlighting proteins

To date, researchers have used several computational methods to detect moonlighting proteins. Since moonlighting proteins tend to interact with other proteins that have different functions, they can be detected by studying protein-protein interactions. However, these previous studies have often not combined machine-learning methods with feature extraction.

In this study, researchers used eight classification models and 37 different feature vectors to detect moonlighting proteins. The team used a dataset of 351 proteins (215 moonlighting and 136 non-moonlighting). To evaluate the models, the researchers split the dataset into a training set (80%) and a test set (20%). Out of the 37 feature vectors, they then selected the 10 that performed best. Among these 10, the SAAC vector (with the support vector machine (SVM) and K-nearest neighbour (KNN) models) and the QSorder vector (with the naive Bayes (NB) model) had the highest classification accuracy on the test dataset.
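As a rough illustration of this evaluation setup, the sketch below splits 351 labelled samples 80/20 and scores a simple KNN classifier on the held-out part. It is not the study's code. The feature values are synthetic stand-ins (not SAAC or QSorder vectors), the split is stratified by class as an assumption the article does not state, and the helper names `stratified_split`, `knn_predict` and `accuracy` are hypothetical.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic stand-in data with the class sizes quoted in the article:
# 215 moonlighting (label 1) and 136 non-moonlighting (label 0) samples.
rng = random.Random(42)
labels = [1] * 215 + [0] * 136
features = [[rng.gauss(0.5 * y, 1.0) for _ in range(20)] for y in labels]

train_idx, test_idx = stratified_split(labels)
train_X = [features[i] for i in train_idx]
train_y = [labels[i] for i in train_idx]
test_y = [labels[i] for i in test_idx]
pred_y = [knn_predict(train_X, train_y, features[i]) for i in test_idx]

print(len(train_idx), len(test_idx))  # 281 training and 70 test samples
print(round(accuracy(test_y, pred_y), 3))
```

Splitting each class separately keeps the roughly 61:39 class ratio the same in both parts, which matters for a dataset this small. In practice a library such as scikit-learn would normally replace these hand-written helpers.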

To identify outlier proteins, the researchers used the NB model with the QSorder vector, as well as the SVM and KNN models with the SAAC vector. They ran tenfold cross-validation 100 times on these models, which allowed them to count how often each protein was misclassified when it fell in the validation fold. If a protein was misclassified more than 90 times, it was termed a candidate outlier protein. The outlier tests showed that outlier proteins can greatly reduce the accuracy of classifier models, so identifying these proteins and their properties can help researchers build more appropriate and accurate classification models. Non-moonlighting proteins flagged as candidate outliers share characteristics with moonlighting proteins, and studying them could reveal proteins that are in fact moonlighting proteins.
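The outlier procedure described above can be sketched as follows. This is a minimal illustration, not the study's implementation. It swaps in a nearest-centroid classifier for the NB, SVM and KNN models, uses a small toy dataset with one deliberately mislabelled point, and the function names are hypothetical. Only the 100 repeats, the 10 folds and the "more than 90 misclassifications" threshold come from the article.

```python
import random

def nearest_centroid_predict(train_X, train_y, x):
    """Predict the class whose mean feature vector is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in set(train_y):
        members = [p for p, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, x))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def misclassification_counts(X, y, n_repeats=100, n_folds=10, seed=0):
    """Count, per sample, how often it is misclassified while held out."""
    rng = random.Random(seed)
    counts = [0] * len(X)
    indices = list(range(len(X)))
    for _ in range(n_repeats):
        rng.shuffle(indices)  # new random fold assignment for each repeat
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for fold in folds:
            held_out = set(fold)
            train = [i for i in indices if i not in held_out]
            train_X = [X[i] for i in train]
            train_y = [y[i] for i in train]
            for i in fold:
                if nearest_centroid_predict(train_X, train_y, X[i]) != y[i]:
                    counts[i] += 1
    return counts

# Toy data: class 0 clusters near (0, 0) and class 1 near (10, 10).
X = [[0.1 * (i % 5), 0.1 * (i % 3)] for i in range(29)]
y = [0] * 29
X += [[10 + 0.1 * (i % 5), 10 + 0.1 * (i % 3)] for i in range(30)]
y += [1] * 30
# Planted outlier (index 59): labelled class 0 but located in the class 1 cluster.
X.append([10.0, 10.0])
y.append(0)

counts = misclassification_counts(X, y)
candidate_outliers = [i for i, c in enumerate(counts) if c > 90]
print(candidate_outliers)  # [59]: only the planted point exceeds the threshold
```

Each sample is held out exactly once per repeat, so its count ranges from 0 to 100. A sample that is wrong in nearly every repeat is misclassified regardless of how the folds are drawn, which is what makes it a candidate outlier rather than an unlucky split.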

Summary

This study used distinct feature vectors to predict moonlighting proteins, which are typically difficult to identify through experimentation. The methods used have also helped pinpoint a number of proteins labelled as non-moonlighting that may in fact be moonlighting proteins. Moonlighting proteins are important targets in drug discovery, so identifying them effectively could help unlock novel drug targets in the future.

Image credit: pikisuperstar – FreePik
