Learning your identity and disease from research papers: information leaks in genome wide association study

  • Authors:
  • Rui Wang;Yong Fuga Li;XiaoFeng Wang;Haixu Tang;Xiaoyong Zhou

  • Affiliations:
  • Indiana University Bloomington, Bloomington, IN, USA;Indiana University Bloomington, Bloomington, IN, USA;Indiana University Bloomington, Bloomington, IN, USA;Indiana University Bloomington, Bloomington, IN, USA;Indiana University Bloomington, Bloomington, IN, USA

  • Venue:
  • Proceedings of the 16th ACM conference on Computer and communications security
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

Genome-wide association studies (GWAS) aim at discovering the association between genetic variations, particularly single-nucleotide polymorphism (SNP), and common diseases, which is well recognized to be one of the most important and active areas in biomedical research. Also renowned is the privacy implication of such studies, which has been brought into the limelight by the recent attack proposed by Homer et al. Homer's attack demonstrates that it is possible to identify a GWAS participant from the allele frequencies of a large number of SNPs. Such a threat, unfortunately, was found in our research to be significantly understated. In this paper, we show that individuals can actually be identified from even a relatively small set of statistics, as those routinely published in GWAS papers. We present two attacks. The first one extends Homer's attack with a much more powerful test statistic, based on the correlations among different SNPs described by coefficient of determination (r2). This attack can determine the presence of an individual from the statistics related to a couple of hundred SNPs. The second attack can lead to complete disclosure of hundreds of participants' SNPs, through analyzing the information derived from published statistics. We also found that those attacks can succeed even when the precisions of the statistics are low and part of data is missing. We evaluated our attacks on the real human genomes and concluded that such threats are completely realistic.