Using machine learning algorithms and ensemble learning, researchers have modelled the factors underlying the completion or cessation of COVID-19 clinical trials. The models were found to predict clinical trial outcomes with good accuracy, thereby helping to reduce resource use and subsequent costs for future COVID-19 research.
COVID-19 clinical trials
In the battle against the COVID-19 pandemic, significant research effort has been put into arming health systems with vaccines and drug treatments. Clinical trials are vital to this research because they are necessary to validate the safety and efficacy of any intervention or treatment in human subjects. These trials also help to bring about a better understanding of the biology of disease.
Clinical trials are notoriously resource-intensive and time-consuming. Yet, previous studies have shown that 10-12% of trials are terminated for a variety of reasons, including insufficient enrolment, safety concerns and administrative issues. In the context of COVID-19, the cessation of clinical trials would result in substantial losses of money and resources from already-stretched systems.
It is therefore imperative to make research efforts as efficient as possible. In this study, researchers at Florida Atlantic University’s College of Engineering and Computer Science aimed to identify the features associated with completion or cessation of COVID-19 clinical trials. With machine learning algorithms and ensemble learning, the features may subsequently be used to design models that predict the outcome of future trials. The researchers hoped that these approaches would help stakeholders plan resources, reduce costs and shorten the time taken by clinical trials.
Clinical trial data
Commencing in January 2021, the researchers first retrieved data on 4,441 COVID-19 clinical trials from ClinicalTrials.gov. Hosted by the US National Library of Medicine (NLM), ClinicalTrials.gov is an online public database that contains registered clinical trials from 108 different countries or regions.
Very few COVID-19 trials have been formally terminated due to the relative novelty of the disease. As such, the researchers considered 3 types of trials as cessation trials: terminated, suspended and withdrawn trials. These represented unsuccessful research efforts and those that were stopped for particular reasons. Terminated trials are those that had enrolled participants but were stopped prematurely. Suspended trials are similar but may resume in the future. Meanwhile, withdrawn studies are those that were halted prematurely before participant enrolment.
Overall, the final training dataset included in the analysis had 772 clinical trials, with 81.34% completion trials and 18.65% cessation trials.
Features of COVID-19 clinical trials
The researchers then designed four types of features to represent each clinical trial. In total, feature engineering produced a 693-dimensional feature vector for each trial, representing the most extensive set of features for clinical trial reports to date.
Statistics features, keyword features and drug features are extracted from the data fields in clinical trial reports. For instance, statistics features model clinical trials based on trial administration, study information, study design, and eligibility criteria. Meanwhile, drug features describe the different types of drug interventions used to treat COVID-19. Keyword features capture the key words or phrases describing the trial’s protocol based on the NLM Medical Subject Heading (MeSH) terms. Users of ClinicalTrials.gov use MeSH terms to search for specific trials in the database.
Each clinical trial report also includes a detailed textual description of the clinical trial. The fourth feature type, the embedding feature, is a feature vector generated to represent each description. It is produced with Doc2Vec, a neural network language model that generates vector representations for whole documents rather than for individual words.
To identify the features that were most important in determining trial completion or cessation, the researchers then applied ReliefF. This is a similarity-based feature selection method that ranks features by how well they separate similar trials with different outcomes. The results demonstrated that keyword features were most informative, followed in turn by drug features, statistics features and embedding features. However, all four feature types are essential for predictive accuracy.
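The idea behind similarity-based ranking can be sketched in a few lines. The example below is a minimal illustration only: it implements the original two-class Relief algorithm with a single nearest neighbour, which is a simpler relative of ReliefF and not the researchers' actual configuration. The function name, the toy data and the assumption that features are scaled to the range 0 to 1 are all hypothetical.

```python
def relief_weights(X, y):
    """Basic two-class Relief (a simplified relative of ReliefF).

    Assumes every feature is already scaled to [0, 1] and that each
    class contains at least two samples.
    """
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def distance(a, b):
        # Manhattan distance between two feature vectors.
        return sum(abs(u - v) for u, v in zip(a, b))

    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        hit = min(same, key=lambda j: distance(X[i], X[j]))    # nearest same-class sample
        miss = min(other, key=lambda j: distance(X[i], X[j]))  # nearest other-class sample
        for f in range(d):
            # Reward features that differ on the miss, penalise those that differ on the hit.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


# Toy data, invented for illustration: feature 0 tracks the label, feature 1 is noise.
X = [[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
ranking = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
```

A feature earns a high weight when it differs between a trial and its nearest neighbour of the opposite outcome but stays similar to its nearest neighbour of the same outcome. ReliefF extends this idea by averaging over several nearest neighbours.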
Predictive modelling with ensemble learning
Using the above features, the researchers developed a model to predict clinical trial completion and cessation. Ensemble learning was employed to combine models to achieve the best possible predictive performance.
The training dataset had only 144 cessation trials out of 772 total trials. To account for the resulting class imbalance, the researchers applied random under-sampling to completion trials. This involved randomly deleting completion trials from the dataset to even out class proportions. Since this method may inadvertently remove important samples, sampling was repeated 10 times. Each sampled dataset produced one predictive model, and the resulting models were combined to form an ensemble.
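The repeated under-sampling step can be sketched as follows. This is an illustration under stated assumptions rather than the researchers' code: the 1:1 class ratio in each sample, the averaging rule used to combine models, and the names `balanced_subsets` and `ensemble_score` are all hypothetical, since the article does not specify them.

```python
import random


def balanced_subsets(labels, n_subsets=10, seed=0):
    """Return n_subsets index lists, each holding every minority sample
    (label 1, cessation) plus an equal-sized random draw of majority
    samples (label 0, completion). Assumes a 1:1 target ratio."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    subsets = []
    for _ in range(n_subsets):
        kept_majority = rng.sample(majority, len(minority))  # drawn without replacement
        subsets.append(sorted(minority + kept_majority))
    return subsets


def ensemble_score(models, x):
    """Combine models by averaging their predicted cessation probabilities
    (one possible combination rule, assumed here)."""
    return sum(model(x) for model in models) / len(models)


# Class sizes from the article: 144 cessation and 628 completion trials.
labels = [1] * 144 + [0] * 628
subsets = balanced_subsets(labels)
```

Each of the 10 index lists would be used to train one model. Because every list draws a different random set of completion trials, a trial discarded in one sample is likely to appear in another, which is what limits the information lost to under-sampling.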
The researchers compared 4 predictive models, namely Neural Network, Random Forest, XGBoost, and Logistic Regression models. Overall, the Random Forest model was best at predicting COVID-19 trial completion or cessation. This model achieved an area under the curve (AUC) score of 0.87 out of a maximum of 1. An AUC closer to 1 means that a model is better at telling the two outcomes apart, while 0.5 corresponds to random guessing. Meanwhile, balanced accuracy was over 0.81 out of 1.
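Both metrics can be computed by hand, as the minimal sketch below shows. The function names and the toy labels and scores are invented for illustration and are not taken from the study.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels, predictions):
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy data, invented for illustration (1 = cessation, 0 = completion).
y_true = [1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

auc = auc_score(y_true, y_score)           # 7 of 8 pairs ranked correctly
bal_acc = balanced_accuracy(y_true, y_pred)
```

Balanced accuracy is informative here because the classes are uneven. A model that labelled every trial in the 772-trial dataset as completed would be right about 81% of the time, yet its balanced accuracy would be only 0.5.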
A limitation of this work is the small number of clinical trials reported in the dataset. Nevertheless, as COVID-19 clinical research continues to progress, more clinical trial reports can be integrated to improve model performance.
This study demonstrated that machine learning methods can deliver effective models to understand the features that distinguish completed from ceased COVID-19 clinical trials. They also enable future COVID-19 trial statuses to be predicted with good accuracy. When applied, the proposed approach would help stakeholders better plan procedures and resource allocations. Ultimately, these models may empower researchers to make the most of valuable time and resources in the fight against COVID-19.