Is data cleansing worth the cost?

Whilst many people who are not directly involved in knowledge management or IT infrastructure may be unfamiliar with the challenges of handling poor-quality, messy or unstructured knowledge, most will have experienced how it can hamper downstream analysis. Here, Tom Plasterer, Director of Bioinformatics, Data Science & AI, BioPharmaceuticals R&D at AstraZeneca, discusses how to prioritise your legacy data for cleansing programmes and how to evaluate the quality of your data.

Executive Summary:

  • Not all datasets need to be cleaned; organisations should first prioritise legacy data for cleaning by weighing the expected benefits against those of cleaning newly generated data
  • When analysing whether data cleansing is required, companies should evaluate whether their data is valid, accurate, complete, consistent and uniform
  • The first stage of cleansing data is ensuring that methodologies follow community best practice, that all relevant individuals are on board with the project workflow, and that steps are taken to maximise findability
  • Cleansing data requires little in the way of novel development. There is a wealth of online resources to increase understanding of the importance of clean data and of the FAIR principles for better data stewardship. Major bodies with useful guides, hackathons and groups include:
    + GoFAIR, created by the European Open Science Cloud, the NIH Commons, the E-African Open Science Cloud, and the Australian Open Science Cloud
    + the IMI FAIR Plus Project, led by ELIXIR and Janssen
    + the Pistoia FAIR implementation group
  • Data cleansing software is available, but it should be chosen specifically for what it can do for your data, especially whether it makes the data more interoperable and reusable
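
To make the five quality dimensions named in the summary (valid, accurate, complete, consistent, uniform) more concrete, the sketch below shows one way they might be checked on a small set of records. It is illustrative only: the records, the field names (`sample_id`, `assay_date`, `dose_mg`, `unit`) and the `quality_report` function are hypothetical and do not come from the article. Accuracy in particular cannot be verified without a trusted reference, so a simple plausibility check stands in for it here.

```python
from datetime import date

# Hypothetical records; field names and values are invented for illustration.
RECORDS = [
    {"sample_id": "S001", "assay_date": "2020-03-14", "dose_mg": 5.0, "unit": "mg"},
    {"sample_id": "S002", "assay_date": "14/03/2020", "dose_mg": 5.0, "unit": "mg"},
    {"sample_id": "S003", "assay_date": "2020-03-15", "dose_mg": None, "unit": "mg"},
    {"sample_id": "S003", "assay_date": "2020-03-16", "dose_mg": -2.0, "unit": "g"},
]


def quality_report(records, expected_unit="mg"):
    """Return, for each quality dimension, the indexes of records that fail it."""
    dimensions = ("valid", "accurate", "complete", "consistent", "uniform")
    report = {name: [] for name in dimensions}
    seen_ids = set()

    for index, record in enumerate(records):
        # Complete: no field is missing a value.
        if any(value is None for value in record.values()):
            report["complete"].append(index)

        # Valid: the date parses as an ISO 8601 calendar date (YYYY-MM-DD).
        try:
            date.fromisoformat(record["assay_date"])
        except (TypeError, ValueError):
            report["valid"].append(index)

        # Accurate (proxy only): a dose cannot be negative.
        dose = record["dose_mg"]
        if dose is not None and dose < 0:
            report["accurate"].append(index)

        # Consistent: an identifier should not appear on more than one record.
        if record["sample_id"] in seen_ids:
            report["consistent"].append(index)
        seen_ids.add(record["sample_id"])

        # Uniform: every record uses the same unit of measure.
        if record["unit"] != expected_unit:
            report["uniform"].append(index)

    return report


if __name__ == "__main__":
    for dimension, failures in quality_report(RECORDS).items():
        print(f"{dimension}: {len(failures)} failing record(s) at {failures}")
```

A count of failures per dimension, weighed against the expected value of the dataset, is one possible input when deciding whether a legacy dataset is worth cleansing at all.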
Data cleansing and FAIR

