The challenges in FAIRifying medical datasets

The FAIR guiding principles for data management are widely accepted across research, despite their young age. In a ‘lessons learned’ style paper published this week, researchers at the University of Leipzig discuss the major difficulties of putting FAIR into practice for sensitive patient data.

The FAIR (Findable, Accessible, Interoperable, Reusable) data guiding principles were created to improve data management and data sharing, and to make data more machine-actionable. Applied correctly, the FAIR framework enables accurate queries across metadata and helps us fully leverage artificial intelligence and machine learning in R&D.

But whilst most data repositories today claim to follow the FAIR principles, Löbe et al. remark on the significant variation in how well they are implemented. Because the standards are deliberately generic, their application often ends up being domain-specific.

Although the data generated by research projects are often high-quality, complete, and accurate, they are frequently used only once, for a single objective. Löbe et al. explore the challenges that arise in the FAIR provisioning of research data repositories, using the Leipzig Health Atlas project as their use case.

The authors tackle each of the so-called axes of FAIR, outlining the hurdles each poses for biomedical research:


  • Recognising and referencing resources – neither locally nor globally unique numerical identifiers can guarantee the “eternal persistence” that FAIR requires, given the unknown number of updates or system changes that may occur during the data lifecycle. Solutions already in use include Object Identifiers (OIDs) and Digital Object Identifiers (DOIs). However, whilst these systems can in principle be designed to be resolvable, there is limited agreement on their desirable characteristics.
  • The granularity at which data are assigned identifiers – biomedical datasets are extended and updated so frequently that continuous versioning would be impractical. In an ideal world, systems would reference single data elements through the Semantic Web, with each “information atom” having its own Uniform Resource Identifier (URI). Current software is not there yet.
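To make the identifier and granularity trade-offs above concrete, here is a minimal sketch of building versioned, resolvable URIs down to the level of single data elements. The namespace, dataset, and element names are invented for illustration; a real deployment would use its own resolvable domain and an agreed identifier scheme.

```python
from urllib.parse import quote

# Hypothetical base namespace -- not a real resolver.
BASE = "https://data.example.org"

def element_uri(dataset: str, version: str, element: str) -> str:
    """Build a versioned URI for a single data element ("information atom").

    Putting the version in the path keeps old references stable while
    still allowing the dataset to be updated -- one way around the
    continuous-versioning problem described above.
    """
    return f"{BASE}/{quote(dataset)}/v{quote(version)}/{quote(element)}"

# quote() percent-encodes unsafe characters so the URI stays valid.
print(element_uri("leipzig-cohort", "1.2", "patient-042/height"))
# -> https://data.example.org/leipzig-cohort/v1.2/patient-042/height
```

Whether such fine-grained URIs are practical depends, as the authors note, on software support that largely does not exist yet.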


  • The challenges of keeping metadata permanently available tend to be cultural rather than technical, as some researchers have yet to fully embrace the implications of this criterion. At the same time, regulatory restrictions on personal biomedical data often hinder collection and processing. There is a great need for harmonised metadata vocabularies covering both data collection and informed consent.
  • The authors recommend establishing a Data Access Board to approve data access requests according to their intended usage. Further considerations include an authentication service and contact points for questions on data sharing and protection.


  • Researchers, Löbe et al. observe, often have the liberty to act according to their own ideas, and because data structures are poorly enforced the resulting datasets cannot be fully exploited. To make data formats applicable across knowledge representations, a rich target data model should be defined. The most practical, universally applicable model in use today is HL7 Fast Healthcare Interoperability Resources (FHIR). International medical terminologies, however, only partially meet FAIR-compliant vocabulary requirements.
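As a flavour of what a rich target data model looks like in practice, here is a minimal sketch of an HL7 FHIR Patient resource. The field names follow the published Patient resource; the values (id, name, dates) are invented for illustration and are not from the paper.

```python
import json

# Minimal FHIR Patient resource. Every FHIR resource declares its type
# via "resourceType", which is what makes payloads self-describing.
patient = {
    "resourceType": "Patient",
    "id": "example-001",  # logical id within a server; illustrative only
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1970-01-01",
}

# FHIR resources are typically exchanged as JSON between systems,
# which is what makes the model interoperable across repositories.
payload = json.dumps(patient)
print(json.loads(payload)["resourceType"])  # -> Patient
```

The uniform structure means any FHIR-aware tool can parse the resource, regardless of which repository produced it.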


  • Provenance is central to reusability: data owners should clearly document how their data were collected and processed, to give external researchers confidence in the datasets. The authors recommend tranSMART, a simple web-based tool for visual analysis that gives researchers a deeper overview of the data, thereby increasing reusability.
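A provenance record of the kind described above can be as simple as structured metadata attached to a dataset. The sketch below is loosely inspired by W3C PROV concepts (entity, activity, agent); the field names and values are illustrative, not a standard and not taken from the paper.

```python
from datetime import date

# Illustrative provenance record for a hypothetical dataset release.
provenance = {
    "entity": "leipzig-cohort-v1.2",          # the dataset being described
    "generated_by": {
        "activity": "data-cleaning",          # how the data were processed
        "software": "pandas 2.2",             # tool and version used
        "performed_on": str(date(2020, 7, 1)),
    },
    "attributed_to": "data management team",  # who is responsible
    "source": "routine clinical documentation",
}

for key, value in provenance.items():
    print(f"{key}: {value}")
```

Recording who produced the data, with what tools, and from which source is what lets an external researcher judge whether a dataset fits their question.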

Further considerations:

Beyond the axes of FAIR themselves, the authors discuss additional considerations for FAIRifying medical datasets. These include maintaining standard operating procedures to ensure high data quality, and considering privacy-preserving data analysis where datasets are not suitable for sharing. Addressing repository operators, the authors suggest providing additional services to aid data sharing and usage, such as pseudonymization, de-identification, anonymization, and record linkage.
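One common way a repository can offer pseudonymization, sketched below, is to replace a direct identifier with a keyed hash: the secret key stays with the repository operator, so recipients cannot reverse the mapping, yet the same patient always maps to the same pseudonym, preserving record linkage. The key and identifiers here are invented for illustration; this is one possible approach, not the paper's prescribed method.

```python
import hashlib
import hmac

# Illustrative secret -- in practice this would live in a key vault
# controlled by the repository operator.
SECRET_KEY = b"keep-this-in-a-vault"

def pseudonymize(patient_id: str) -> str:
    """Return a deterministic pseudonym for a patient identifier.

    HMAC-SHA256 with a secret key: without the key the mapping cannot
    be reversed, but identical inputs always yield identical outputs,
    which is what enables record linkage across datasets.
    """
    digest = hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# Same patient -> same pseudonym; different patients -> different ones.
assert pseudonymize("patient-042") == pseudonymize("patient-042")
assert pseudonymize("patient-042") != pseudonymize("patient-043")
```

Truncating the digest keeps pseudonyms short; full de-identification would additionally require handling quasi-identifiers such as dates and postcodes.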

If you want to find out more about how the FAIR principles can be used to save researchers time and help maximise the impact of health data, join us for a 3-part series with FAIR champions including Tom Plasterer, Martin Romacker, Andrea Splendiani, Philippe Rocca-Serra, and many, many more, starting on Tuesday 21st July at 4 pm BST/5 pm CET/11 am EST.
