A framework to generate real-world evidence using real-world data from clinical notes

A new study, published as a preprint in BMC Medical Informatics and Decision Making, describes a framework for generating real-world evidence from unstructured clinical notes in order to explore the clinical utility of genetic tests. The researchers used BRCAness (a defect in homologous recombination repair in the absence of BRCA1/2 mutations) as a use case.

Precision medicine and real-world evidence

Precision medicine aims to improve clinical decision-making by incorporating individuals’ genomic information and clinical characteristics. This, in turn, improves the selection of targeted therapies, reduces side effects and improves cost-effectiveness. For precision medicine to be implemented effectively, the value of harnessing real-world data (RWD) and generating real-world evidence (RWE) has become clear.

Advances in next-generation sequencing and genetic testing have been critical to the practice of precision medicine. Nevertheless, the clinical utility of genetic testing in real-world settings remains unevaluated. In addition, data quality currently limits the use of RWD in RWE studies. Investigators have used natural language processing (NLP) techniques to extract information from unstructured clinical notes, including clinical data elements such as adverse drug events.

Real-world evidence study framework

The researchers proposed an RWE study framework that incorporates context-based NLP methods for data extraction and data quality examination. The team note that the novelty of their work lies in extracting patients’ personal genetic information and differentiating it from the general genetic information that is also documented in electronic health records (EHRs).

Their cohort included 196 female cancer patients who had undergone genetic testing. The team collected the patients’ genetic reports as well as their unstructured clinical notes. They applied NLP-based approaches to capture data on relevant topics relating to BRCA1/2 within the clinical notes, and compared rule-based and machine learning NLP systems for genetic information extraction. They then examined data quality, focusing on incompleteness and discrepancies. Finally, the team used the cleaned RWD to conduct an RWE study exploring the association between BRCA1/2 mutations and prescription of PARP inhibitors.


The team identified seven topics in the clinical context of genetic mentions: information, evaluation, insurance, order, negative, positive, and variants of unknown significance (VUS). They also found that the rule-based NLP system achieved the best performance, with a precision of 0.87, recall of 0.93 and F-measure of 0.91.
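To make the idea of a rule-based topic classifier concrete, here is a minimal Python sketch. It is not the study's system: the keyword patterns, the rule ordering and the function name `classify_mention` are all hypothetical, chosen only to illustrate how a sentence that mentions BRCA might be assigned to one of the seven topics, and how general statements ("information") can be separated from a patient's own result.

```python
import re

# Hypothetical keyword rules, one per topic named in the article.
# These patterns are illustrative assumptions, not the study's rules.
# Order matters: the first matching rule wins, so "negative" is checked
# before "positive" to handle phrases like "no pathogenic mutation".
TOPIC_RULES = [
    ("VUS", re.compile(r"\b(vus|variants? of (unknown|uncertain) significance)\b", re.I)),
    ("negative", re.compile(r"\b(negative|no (pathogenic|deleterious) mutation)\b", re.I)),
    ("positive", re.compile(r"\b(positive|pathogenic|deleterious)\b", re.I)),
    ("insurance", re.compile(r"\b(insurance|coverage|prior authorization)\b", re.I)),
    ("order", re.compile(r"\b(order(ed)?|sent for)\b", re.I)),
    ("evaluation", re.compile(r"\b(evaluat\w*|counsel\w*|referr\w*)\b", re.I)),
]

# Matches "BRCA", "BRCA1", "BRCA2" and the leading part of "BRCA1/2".
BRCA_MENTION = re.compile(r"\bBRCA[12]?\b", re.I)


def classify_mention(sentence):
    """Return a topic label for a sentence, or None if BRCA is not mentioned."""
    if not BRCA_MENTION.search(sentence):
        return None
    for topic, pattern in TOPIC_RULES:
        if pattern.search(sentence):
            return topic
    # No patient-specific cue found: treat as general genetic information.
    return "information"


examples = [
    "Patient tested positive for a BRCA1 mutation.",
    "BRCA1/2 testing was negative.",
    "BRCA mutations increase the risk of breast cancer.",
]
labels = [classify_mention(s) for s in examples]
```

A real system would need far richer rules (negation scope, family history, section headers) and would be evaluated against annotated notes, which is where figures such as precision and recall come from.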

The system revealed discrepancies and missing genetic data within EHRs: only 75% of BRCA1/2 mutation information was captured. As a result, the researchers had to clean the data manually before further analysis could be performed. Using the cleaned RWD, the team then found significant associations between positive BRCA1/2 mutation status and targeted therapies.
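A data quality check of this kind can be sketched as a comparison between the status extracted from clinical notes and the status recorded in the genetic report. The Python example below is an illustration under stated assumptions: the function `assess_quality`, the patient identifiers and the records are all made up, and the article does not describe how the researchers implemented their check.

```python
def assess_quality(report_status, note_status):
    """Compare note-derived BRCA1/2 status with the genetic report, per patient.

    report_status: dict mapping patient id -> status from the genetic report
    note_status:   dict mapping patient id -> status extracted from notes,
                   or None when nothing was found in the notes
    Assumes report_status is non-empty.
    """
    # Incompleteness: the report has a result but the notes yielded nothing.
    missing = [pid for pid in report_status if note_status.get(pid) is None]
    # Discrepancy: the notes yielded a result that contradicts the report.
    discrepant = [
        pid
        for pid, truth in report_status.items()
        if note_status.get(pid) is not None and note_status[pid] != truth
    ]
    captured = len(report_status) - len(missing) - len(discrepant)
    return {
        "missing": missing,
        "discrepant": discrepant,
        "capture_rate": captured / len(report_status),
    }


# Made-up records for illustration only.
reports = {"p1": "positive", "p2": "negative", "p3": "VUS", "p4": "negative"}
notes = {"p1": "positive", "p2": "negative", "p3": None, "p4": "positive"}
result = assess_quality(reports, notes)
```

Patients flagged as missing or discrepant are the ones that would need manual review before any downstream analysis.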


This system can resolve contextual variability to extract RWD from unstructured clinical notes. The findings show that data quality issues exist and can vary by data type. Most importantly, the team were able to use cleaned RWD to show that the real-world association between BRCA1/2 mutations and prescription of PARP inhibitors is significant.
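The article does not say which statistical test the researchers used. As one possible illustration, the Python sketch below applies a two-sided Fisher's exact test to a 2x2 table of mutation status against PARP inhibitor prescription. The choice of test is an assumption, the function name `fisher_exact_two_sided` is hypothetical, and the counts are invented; they are not the study's data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Invented counts for illustration only:
#                      PARP inhibitor   no PARP inhibitor
# BRCA1/2 positive            8                 2
# BRCA1/2 negative            1                 5
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

Fisher's exact test is a common choice for small cohorts because it does not rely on large-sample approximations.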

