
Advances in machine learning for constructing biomedical knowledge graphs

Knowledge graphs (KGs), as a means of structuring information and modelling complex relationships, have rocketed in popularity across many industries, including drug development. By representing biomedical concepts as nodes and the relationships between them as edges, KGs can be used to map out how those concepts relate to one another, for example to understand drug-target interactions or to identify genes that play a role in disease progression.

Traditionally, KGs have been curated manually by experts, but machine learning methods have been employed successfully to speed up the process. Nicholson and Greene from the University of Pennsylvania review advances in these automated systems in an Elsevier publication.

There are two main ways to construct a knowledge graph: from manually curated databases and through text mining systems.

The first approach uses pre-existing databases that were manually curated by experts. This is time-consuming because experts must read research papers to pick out biomedical interactions. For example, COSMIC is a database that contained 45 million entries on key cancer-related genes as of 2018. Although manual extraction gives high precision, recall is low: research papers are published faster than experts can keep pace with. However, these manually curated datasets can be used as the gold standard to train and validate machine learning software.
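To make the precision and recall trade-off concrete, here is a minimal sketch in Python. The gene and disease names are invented placeholders, and `precision_recall` is a hypothetical helper written for this illustration, not part of any curation tool.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) of extracted relationships against a reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    # Precision: what fraction of the extracted relationships are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of the true relationships were captured.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical relationships reported in the literature (illustration only).
true_relations = {
    ("GENE_A", "disease_1"),
    ("GENE_B", "disease_1"),
    ("GENE_C", "disease_2"),
    ("GENE_D", "disease_3"),
}
# Curators only had time to read the papers covering the first two pairs.
curated = {("GENE_A", "disease_1"), ("GENE_B", "disease_1")}

precision, recall = precision_recall(curated, true_relations)
print(precision, recall)  # 1.0 0.5
```

Everything the curators recorded is correct (precision 1.0), but half of the reported relationships were never captured (recall 0.5), which is the pattern described above.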

The second approach uses automated natural language processing, such as text mining software, to quickly detect sentences that describe biomedical interactions. There are several methods:

1. Rule-based relationship extraction identifies important keywords and grammatical structures to extract sentences that allude to a relationship. Keywords and grammatical patterns are first determined by experts. Sentences are then simplified grammatically before relationships are extracted. Although precision is high, recall is low, because sentences in which the directionality of a biological event is ambiguous may not be detected by the software.

2. Unsupervised extraction clusters entities and calculates statistical associations without using expert-annotated labels. The software scans for statistical co-occurrence: two entities that are assumed to be independent of each other are treated as related if they are mentioned together more often than chance would predict. For example, the DISEASES and STRING databases used this method on PubMed abstracts to find disease-gene interactions and protein-protein associations, respectively.

3. Supervised relationship extraction constructs generalised patterns to distinguish sentences that allude to a relationship from sentences that do not. Patterns are learned from publicly available datasets that have been manually curated. Techniques include support vector machines (SVMs) and deep learning.
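The co-occurrence idea behind unsupervised extraction can be sketched as follows. This is a simplified illustration that uses pointwise mutual information (PMI) as the association score; it is not the scoring scheme used by DISEASES or STRING, and the entity names and toy abstracts are invented.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_scores(documents):
    """Score entity pairs by PMI: log2(observed / expected co-mentions).

    documents: list of sets, each holding the entities mentioned in one abstract.
    The expected count assumes the two entities are mentioned independently.
    """
    n_docs = len(documents)
    entity_counts = Counter()
    pair_counts = Counter()
    for entities in documents:
        entity_counts.update(entities)
        # Sort so each pair is always counted under the same key.
        pair_counts.update(combinations(sorted(entities), 2))

    scores = {}
    for (a, b), observed in pair_counts.items():
        expected = entity_counts[a] * entity_counts[b] / n_docs
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Hypothetical entity mentions in four abstracts (illustration only).
documents = [
    {"GENE_A", "disease_1"},
    {"GENE_A", "disease_1"},
    {"GENE_B", "disease_2"},
    {"GENE_B", "disease_1"},
]

scores = cooccurrence_scores(documents)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

A positive score means the pair is mentioned together more often than independence would predict; a negative score means less often. No expert labels are needed, which is what makes the approach unsupervised.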

Knowledge graphs help researchers identify novel treatments for diseases by establishing new associations between diseases and biomolecules or drug targets. These applications rely on translating a knowledge graph, which is a high-dimensional structure, into a low-dimensional space, where machine learning methods can be used to build predictors for a given problem. The three main categories of techniques for this are:

Image Source: Science Direct

1. Matrix factorisation, such as Isomap, Laplacian eigenmaps, and Singular Value Decomposition (SVD), uses linear algebra to determine the relationships between entities within the knowledge graph. Using the SVD technique, researchers have managed to construct a miRNA-disease network. After the network is decomposed into smaller matrices, similarity scores between miRNAs and diseases can be calculated to estimate the likelihood of a strong correlation between disease progression and miRNA activation.

2. Translational distance models, such as TransE and TransH, treat the relationships (edges) in a knowledge graph as linear transformations between entities. These models have been used to support improvements in patient care by delineating the relationship between drug prescriptions and disease progression in patients.

3. Neural networks, such as word2vec, apply non-linear transformations. Researchers have used neural networks to construct a disease-target-disease network to help identify new disease treatments.
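As an illustration of the translational distance idea, the sketch below scores candidate edges in the style of TransE, where a relationship is a vector that should carry the head entity's embedding close to the tail's (head + relation ≈ tail). The two-dimensional embeddings are hand-picked for the example rather than learned, and the entity and relation names are hypothetical; a real model would learn the vectors from the graph during training.

```python
import math


def transe_score(head, relation, tail):
    """TransE-style plausibility: negative Euclidean distance of head + relation from tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical, hand-picked embeddings (not learned), for illustration only.
entities = {
    "drug_X": [0.0, 1.0],
    "disease_1": [1.0, 1.0],
    "disease_2": [-1.0, 0.0],
}
relations = {"treats": [1.0, 0.0]}


def rank_tails(head, relation):
    """Rank every other entity as a candidate tail, most plausible first."""
    candidates = [name for name in entities if name != head]
    return sorted(
        candidates,
        key=lambda tail: transe_score(entities[head], relations[relation], entities[tail]),
        reverse=True,
    )


ranking = rank_tails("drug_X", "treats")
print(ranking)  # ['disease_1', 'disease_2']
```

Here `drug_X` translated by `treats` lands exactly on `disease_1`, so that edge receives the best possible score (zero distance), while `disease_2` lies further away and ranks lower. Ranking candidate edges by such scores is how embeddings of this kind can be used to propose new associations.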

Despite the many promising applications of knowledge graphs, future algorithms must be designed to account for missing data and for biases in how relationships between entities are established. Because the applications of knowledge graphs are so diverse, there is also no standardised set of evaluations to counter these biases. The expanded use of knowledge graphs is expected to mitigate these problems as better machine learning software is designed.

Journal source: Constructing knowledge graphs and their biomedical applications
