Validating a new text mining tool for electronic health records

Researchers have undertaken a validation study in patients with metastatic renal cell carcinoma (mRCC) to assess how efficiently a new text mining tool extracts data from electronic health records (EHRs).

Filling in the gap

Randomised controlled trials (RCTs) are the gold standard for investigating the efficacy of drug therapies and are therefore important for drug authorisation. However, in accelerated approval pathways, expensive anticancer drugs are increasingly being approved based on studies with surrogate end points such as progression-free survival. A large proportion of these studies also lack a standard-of-care control arm, so the effect on overall survival is unclear at the point of approval. Furthermore, researchers often test novel drugs in specific patient populations that may not represent the full cohort of patients who will eventually receive the treatment.

Real-world data (RWD) can close the gap between evidence from RCTs and clinical practice, because it shows how effective a new drug is in daily practice. EHRs are a key source of RWD. They contain longitudinal patient data and detailed health information. They also contain free-text notes, which hold very detailed but unstructured information about patients, their illnesses and their treatments. As a result, manual review is still the standard method for collecting data from EHRs. However, this process is laborious, time-consuming and error-prone. Therefore, more robust and advanced methods are required.

Natural language processing

Natural language processing (NLP) and text mining techniques offer key opportunities to extract this information. These techniques are currently used only in institutions with strong informatics departments. For example, an NLP pipeline is being used to extract mentions of urinary incontinence and erectile dysfunction from data about patient outcomes of prostate cancer treatment.

Clinical Data Collection (CDC) is an NLP and text mining-based tool currently available in hospitals in the Netherlands and Belgium. It can collect structured as well as unstructured data from EHRs, extracting the relevant parts of a record based on queries built for the task. In this study, published in Clinical Pharmacology & Therapeutics, researchers proposed CDC as a useful extraction tool for retrieving RWD from EHRs. They therefore designed a validation study to assess the tool's ability to extract clinical trial parameters from EHRs. As they wanted to look at effectiveness data for specific oncologic drug treatments, they performed this study in patients with mRCC receiving systemic treatment.

Automated vs manual

The team applied CDC to patients' EHRs to collect patient characteristics, treatment outcomes and adverse drug events (ADEs) during drug treatment for mRCC. They then compared these data with data obtained by manual review.

They first investigated whether CDC could trace all patients who met the inclusion criteria. CDC identified 99% of the population selected by manual review. Overall survival extracted by CDC did not differ significantly from that obtained by manual review (21.7 months (95% confidence interval (CI) 18.7–24.8) vs. 21.7 months (95% CI 18.6–24.8)). Calculated progression-free survival was also similar, at 8.9 months (95% CI 5.4–12.4) for CDC vs. 7.6 months (95% CI 5.7–9.4) for manual review. For categorical characteristics and ADEs, the team calculated F1-scores. They found the highest F1-scores for cancer-related variables (88.1–100), followed by comorbidities (71.5–90.4) and ADEs (53.3–74.5). Notably, CDC resulted in a seven-fold reduction in time per patient (mean data collection time was 12 minutes for CDC vs. 86 minutes for manual review).
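
For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The article does not describe how the study computed it, so the following is only a minimal sketch. It assumes manual review is the reference standard and reports the score on a 0 to 100 scale, matching the ranges quoted above. The function name and the counts are hypothetical and are not taken from the study. The last line reproduces the time comparison from the reported means.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the F1-score on a 0-100 scale from agreement counts.

    Assumption: manual review is treated as the reference standard, so a
    'true positive' is an item found by both the tool and manual review.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 100 * 2 * precision * recall / (precision + recall)


# Hypothetical counts for one variable (illustrative only, not study data):
# 40 items found by both methods, 10 found only by the tool,
# 10 found only by manual review.
score = f1_score(true_positives=40, false_positives=10, false_negatives=10)

# Time saving from the reported means: 86 minutes vs. 12 minutes per patient.
fold_reduction = 86 / 12  # roughly seven-fold
```

With these illustrative counts, precision and recall are both 0.8, so the F1-score is 80. Because the harmonic mean is pulled towards the lower of the two values, a low F1-score can reflect either missed items or spurious ones.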


These results demonstrate that CDC can accurately collect main treatment outcomes, including progression-free survival and overall survival. Therefore, the team concluded that healthcare professionals could adequately apply CDC to retrieve RWD from EHRs. CDC represents a more consistent and timely technical solution for extracting data. The team believe that with further efforts, researchers could optimise queries to improve the accuracy of data collection. They also suggest that experts could apply these queries to obtain RWD from several other oncologic drug treatments.

