Genotype Tagging with Limited Overfitting

  • Authors:
  • Irina Astrovskaya;Alex Zelikovsky

  • Affiliations:
  • Department of Computer Science, Georgia State University, Atlanta GA 30303;Department of Computer Science, Georgia State University, Atlanta GA 30303

  • Venue:
  • BSB '09 Proceedings of the 4th Brazilian Symposium on Bioinformatics: Advances in Bioinformatics and Computational Biology
  • Year:
  • 2009

Abstract

Because genotyping is costly and genome-wide association studies produce large volumes of data, it is desirable to find a small subset of SNPs, referred to as tag SNPs, that covers the genetic variation of the entire data set. To represent the genetic variation of an untagged SNP, existing tagging methods use either a single tag SNP (e.g., Tagger, IdSelect) or several tag SNPs (e.g., MLR, STAMPA). When multiple tags are used to explain the variation of a single SNP, fewer tags are usually needed but overfitting is higher. This paper explores the trade-off between the number of tags and overfitting, and considers the problem of finding a minimum number of tags when at most two tags can represent the variation of an untagged SNP. We show that this problem is hard to approximate and propose an efficient heuristic, referred to as 2LR. Our experimental results show that 2LR tagging falls between Tagger and MLR both in the number of tags and in overfitting. Specifically, 2LR uses slightly more tags than MLR, but its overfitting, measured with 2-fold cross-validation, is practically the same as that of Tagger. 2LR tagging also tolerates missing data better than Tagger.
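
The constraint that at most two tags may represent an untagged SNP can be illustrated with a small sketch. This is not the 2LR heuristic, whose details the abstract does not give. It is a plain greedy selection written under several assumptions: genotypes are coded 0/1/2, an untagged SNP is predicted from one or two tags by ordinary least-squares regression rounded to the nearest genotype, and a SNP counts as covered when the prediction accuracy reaches a chosen threshold. All function names and the `min_accuracy` parameter are hypothetical.

```python
from itertools import combinations


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination; return None if singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(tag_columns, target):
    """Least-squares fit target ~ w0 + sum(w_k * tag_k) via normal equations."""
    rows = [[1.0] + [float(col[i]) for col in tag_columns]
            for i in range(len(target))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, tag_columns, n):
    """Predict genotypes, rounded and clipped to the assumed 0/1/2 coding."""
    out = []
    for i in range(n):
        value = weights[0] + sum(w * col[i]
                                 for w, col in zip(weights[1:], tag_columns))
        out.append(min(2, max(0, round(value))))
    return out


def covered(snp, tags, genotypes, min_accuracy):
    """True if one tag or a pair of tags predicts `snp` accurately enough."""
    target = genotypes[snp]
    for size in (1, 2):  # at most two tags per untagged SNP
        for combo in combinations(tags, size):
            cols = [genotypes[t] for t in combo]
            weights = fit_linear(cols, target)
            if weights is None:  # collinear or constant tags, skip
                continue
            pred = predict(weights, cols, len(target))
            accuracy = sum(p == y for p, y in zip(pred, target)) / len(target)
            if accuracy >= min_accuracy:
                return True
    return False


def greedy_tags(genotypes, min_accuracy=0.9):
    """Greedily add the SNP that covers the most still-uncovered SNPs."""
    tags = []
    uncovered = set(range(len(genotypes)))
    while uncovered:
        best = max(
            sorted(uncovered),
            key=lambda c: sum(covered(s, tags + [c], genotypes, min_accuracy)
                              for s in uncovered),
        )
        tags.append(best)
        uncovered = {s for s in uncovered
                     if s != best and not covered(s, tags, genotypes, min_accuracy)}
    return tags


# Toy data: one list per SNP, one entry per individual (made up for illustration).
snp0 = [0, 1, 0, 1, 0, 1, 1, 0]
snp1 = [0, 0, 1, 1, 0, 0, 1, 1]
snp2 = list(snp0)                           # explained by a single tag
snp3 = [a + b for a, b in zip(snp0, snp1)]  # needs two tags
toy_genotypes = [snp0, snp1, snp2, snp3]
print(greedy_tags(toy_genotypes, min_accuracy=1.0))
```

In this toy example `snp3` cannot be recovered from either `snp0` or `snp1` alone but is recovered exactly from the pair, which is the situation where allowing a second tag reduces the number of tags needed. Note that accuracy here is measured on the same individuals used for fitting, so the sketch says nothing about overfitting; estimating that would require a held-out evaluation such as the 2-fold cross-validation mentioned in the abstract.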