Preprint: Federated, privacy-preserved machine learning for GWAS

Genome-wide association studies (GWAS), by measuring single nucleotide polymorphisms, can help us to identify possible genetic variants associated with disease phenotypes. However, the scale of GWAS sampling, where larger samples would produce more accurate genetic predictions, has been limited by data protection issues and privacy restrictions applied to biomedical and clinical data.

As PLINK, the most widely used open-source software tool for GWAS, can only perform association analysis on local data, geneticists have established several methods of meta-analysis for combining the summary statistics of individual GWAS. However, the statistical power of these privacy-protecting methods is negatively impacted by cross-study heterogeneity, which can result in inaccurate or misleading conclusions.
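To illustrate what combining summary statistics can look like, here is a minimal sketch of fixed-effect, inverse-variance weighted meta-analysis for a single variant. This is one common textbook approach, not necessarily the method used by any particular tool mentioned here; the function name and the example effect sizes are hypothetical.

```python
import math

def inverse_variance_meta(estimates):
    """Combine per-study (beta, standard_error) pairs for one variant.

    Each study is weighted by 1 / SE^2, so more precise studies
    contribute more to the pooled effect estimate.
    """
    weights = [1.0 / (se ** 2) for _, se in estimates]
    total_weight = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z_score = pooled_beta / pooled_se
    return pooled_beta, pooled_se, z_score

# Hypothetical per-study summary statistics (effect size, standard error).
studies = [(0.20, 0.10), (0.10, 0.05), (0.30, 0.20)]
pooled_beta, pooled_se, z_score = inverse_variance_meta(studies)
print(f"pooled beta={pooled_beta:.4f}, SE={pooled_se:.4f}, z={z_score:.2f}")
```

Only the per-study estimates are exchanged, which protects the raw data, but the pooled result depends on how well each study's summary statistics reflect its underlying samples.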

A German and Danish team last week presented sPLINK (safe PLINK), a federated, privacy-preserving software tool for GWAS on distributed datasets, in a preprint published on bioRxiv. Using federated machine learning, the tool allows researchers to analyse sensitive, distributed sources of raw data without the data leaving local sites.

sPLINK serves as an alternative to meta-analysis and aggregated analysis approaches to GWAS, as the software extracts local model parameters from the data of individually submitted cohorts and shares only those with the central server. The tool is said to be user-friendly, both in the functionality of the web interface and the facilitation of collaborative GWAS projects. Additionally, multiple association tests are supported for GWAS, including linear regression, logistic regression and chi-square.
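The general idea of sharing only local parameters can be sketched for an allelic chi-square test on one variant. The code below is an illustrative assumption, not sPLINK's actual protocol or API: each site reduces its raw genotypes to a 2x2 table of allele counts, and a hypothetical server sums the tables and computes the statistic. All function names are invented for this sketch, and any additional safeguards the real tool applies to the shared parameters are not modelled.

```python
from typing import List, Tuple

# 2x2 allele-count table: (case_minor, case_major, control_minor, control_major)
Table = Tuple[int, int, int, int]

def local_allele_counts(genotypes: List[int], phenotypes: List[int]) -> Table:
    """Runs at a local site. genotypes hold minor-allele counts (0, 1 or 2);
    phenotypes are 1 for case and 0 for control. Only the counts leave the site."""
    case_minor = sum(g for g, p in zip(genotypes, phenotypes) if p == 1)
    control_minor = sum(g for g, p in zip(genotypes, phenotypes) if p == 0)
    n_cases = sum(1 for p in phenotypes if p == 1)
    n_controls = len(phenotypes) - n_cases
    return (case_minor, 2 * n_cases - case_minor,
            control_minor, 2 * n_controls - control_minor)

def aggregate(tables: List[Table]) -> Table:
    """Runs at the (hypothetical) server: element-wise sum of the site tables."""
    a, b, c, d = (sum(column) for column in zip(*tables))
    return (a, b, c, d)

def chi_square(table: Table) -> float:
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    a, b, c, d = table
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # statistic is undefined when a margin is empty
    return n * (a * d - b * c) ** 2 / denominator

# Hypothetical data held at two sites.
site_a = ([0, 1, 2, 1], [1, 1, 0, 0])
site_b = ([2, 2, 0], [1, 0, 0])

federated = chi_square(aggregate([local_allele_counts(*site_a),
                                  local_allele_counts(*site_b)]))
pooled = chi_square(local_allele_counts(site_a[0] + site_b[0],
                                        site_a[1] + site_b[1]))
print(federated, pooled)
```

Because the counts are additive, the statistic computed from the summed tables matches the one computed on the pooled raw data in this sketch, regardless of how cases and controls are split between sites.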

The authors report that the software is robust against imbalanced phenotype distributions, to which conventional meta-analysis approaches to GWAS are vulnerable. In highly heterogeneous samples, current meta-analysis tools typically lose accuracy, whereas sPLINK provides results congruent with aggregated analysis, irrespective of the heterogeneity of sample phenotype distributions. Thus, the authors suggest that sPLINK has the potential to supersede meta-analysis as the gold standard in collaborative GWAS projects.

In their concluding remarks, the authors outlined their future plans for the development of sPLINK: “We plan to implement the federated version of more association tests or more machine learning algorithms including random forest or deep neural networks (DNN) leveraged by the GWAS community in sPLINK. We will also investigate sPLINK’s potential to tackle other open challenges in GWAS such as trans-ethnicity, where the samples in the distributed datasets are from different ethnic groups.”

Article reference: sPLINK: A Federated, Privacy-Preserving Tool as a Robust Alternative to Meta-Analysis in Genome-Wide Association Studies
