In 2011, researchers from The Institute of Cancer Research, UK, created canSAR, the largest public cancer drug discovery resource. The researchers behind the canSAR knowledgebase have recently enhanced its data and demonstrated how it can be adapted and used outside oncology.
canSAR
canSAR was originally created to provide unique data, curation, annotation and, most importantly, AI-informed target assessment for drug discovery. To achieve this, the researchers developed canSAR to be a scalable, adaptable and fully integrative knowledgebase. It integrates data from multi-omic profiling of cancer cell lines and patient tumour tissue with data on genetic vulnerabilities and dependencies. These data are in turn integrated with medicinal chemistry and pharmacology data, annotation of the human proteome, 3D protein structures, protein-protein interactions, drug approvals and clinical trials, among other data. This full integration means that researchers can identify non-obvious connections, which supports the discovery of novel targets and new insights for cancer drug discovery.
The researchers then developed a variety of machine learning algorithms that learn from the data to provide comprehensive, rapidly updated target druggability/ligandability assessment. The algorithms assess target feasibility for drug discovery based on 3D structure, chemistry, behaviour in protein interaction networks, and availability to antibodies/biotherapeutics.
The initial study focused on the use of this knowledgebase for cancer drug discovery. However, the researchers now aim to demonstrate how canSAR can be used to interpret complex findings and support experimental design, as well as how it can be used outside oncology.
Data in the canSAR knowledgebase
canSAR contains the entire human proteome from the UniProt Swiss-Prot database, as well as around 542,000 non-human sequences. The recent update of the knowledgebase expanded the molecular profiling data to cover more than 25,000 cancer patients. The data now include almost 10 million protein-coding mutation data points and more than 107 million gene-level copy number alterations, as well as more than 218 million gene expression profiles from tumour samples and around 194 million normal gene expression profiles.
The knowledgebase also contains curated and uniformly assessed protein-protein interaction data, compiled from interactome databases including the IMEx consortium and TRRUST. Moreover, the researchers curated protein-protein interactions from the Protein Data Bank (PDB) and identified druggable protein-protein interaction interfaces.
The canSAR update carried out in this study also added curated and standardised drug combination data for 1,456 clinical trials and over 316,000 drug synergies from cancer cell line models from the canSynergize database.
Knowledge of 3D protein structure is important for drug discovery and for understanding the likely impact of molecular aberrations on protein function, which in turn allows researchers to generate hypotheses for disease causation. canSAR regularly updates its database of 3D protein structures from PDB Europe. However, 3D protein structures are not available for all targets. To overcome this, the researchers developed canSAR to contain orthogonal assessments of the suitability of targets for drug discovery. The knowledgebase provides ligand/chemistry-based assessment for over 8,000 human targets by using chemical and bioactivity information.
Beyond oncology
Although canSAR was originally developed to support cancer drug discovery, the researchers argue that it can be used beyond oncology. They reason that canSAR contains the entire human proteome, all protein sequences in UniProt, and the entire PDB. canSAR therefore has enough generalised data to be applied effectively to other therapeutic areas.
In the wake of the COVID-19 pandemic, the international community has scrambled to identify potential therapies and vaccines. A growing body of research data has been produced and published, some of it with very little scientific validation. As a result, misinformation was frequent and some clinical trials were started without a clear rationale. Meanwhile, in some cases, valuable insights were lost amongst the chaotic data available. The researchers argue that there is a clear need for an objective, data-driven resource to inform coronavirus drug development. The investigators therefore rapidly developed a coronavirus edition of canSAR to facilitate more informed drug development for the virus. The aim of this new edition was to cut through the chaotic data available.
The future of the canSAR knowledgebase
Moving forward, the researchers hope to further develop canSAR to focus on decision support and experimental guidance for translational research in oncology, while maintaining the knowledgebase's disease-independent capabilities in target druggability assessment and prioritisation.
Recently, the researchers have expanded canSAR greatly, and it now contains around 10 billion experimental measurements. The knowledgebase provides up-to-date assessments of target druggability and ligandability, supporting more informed drug discovery and therapeutic development. Moreover, using the example of the coronavirus pandemic, the researchers have demonstrated how the existing canSAR system can be used to rapidly develop different editions of the knowledgebase for specific conditions.
Image credit: Rawpixel – FreePik