Genotype Tagging with Limited Overfitting

Authors:
Irina Astrovskaya;Alex Zelikovsky
Affiliations:
Department of Computer Science, Georgia State University, Atlanta GA 30303;Department of Computer Science, Georgia State University, Atlanta GA 30303
Venue:
BSB '09 Proceedings of the 4th Brazilian Symposium on Bioinformatics: Advances in Bioinformatics and Computational Biology
Year:
2009

Abstract

Due to the high genotyping cost and large data volume in genome-wide association studies, it is desirable to find a small subset of SNPs, referred to as tag SNPs, that covers the genetic variation of the entire data set. To represent the genetic variation of an untagged SNP, existing tagging methods use either a single tag SNP (e.g., Tagger, IdSelect) or several tag SNPs (e.g., MLR, STAMPA). When multiple tags are used to explain the variation of a single SNP, fewer tags are usually needed but overfitting is higher. This paper explores the trade-off between the number of tags and overfitting, and considers the problem of finding a minimum number of tags when at most two tags can represent the variation of an untagged SNP. We show that this problem is hard to approximate and propose an efficient heuristic, referred to as 2LR. Our experimental results show that 2LR tagging falls between Tagger and MLR both in the number of tags and in overfitting. Indeed, 2LR uses slightly more tags than MLR, but its overfitting, measured with 2-fold cross-validation, is practically the same as for Tagger. 2LR tagging also tolerates missing data better than Tagger.
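To make the problem statement concrete, the following is a minimal sketch of tag selection in which each untagged SNP must be explained by at most two tags. It is not the 2LR heuristic from the paper, whose details the abstract does not give. It is a generic greedy cover under stated assumptions: genotypes are coded 0/1/2, a SNP counts as "represented" when the squared (multiple) correlation with one or two tags reaches a threshold, and the names `pearson`, `r2_two_tags`, `is_covered`, `greedy_tags` and the default threshold of 0.8 are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def r2_two_tags(y, t1, t2):
    """Squared multiple correlation of y with a linear combination of t1 and t2."""
    r1, r2, r12 = pearson(y, t1), pearson(y, t2), pearson(t1, t2)
    denom = 1.0 - r12 * r12
    if denom < 1e-12:
        # The two tags are collinear, so the pair is no better than one tag.
        return max(r1 * r1, r2 * r2)
    return (r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * r12) / denom


def is_covered(snp, tags, columns, threshold):
    """True if snp is a tag or is explained by at most two of the tags."""
    if snp in tags:
        return True
    y = columns[snp]
    if any(pearson(y, columns[t]) ** 2 >= threshold for t in tags):
        return True
    return any(
        r2_two_tags(y, columns[a], columns[b]) >= threshold
        for a, b in combinations(tags, 2)
    )


def greedy_tags(columns, threshold=0.8):
    """Greedily add the SNP that covers the most SNPs until all are covered.

    columns[i] is the list of genotypes (assumed coded 0/1/2) of SNP i.
    """
    n = len(columns)
    tags = []

    def coverage(candidate_tags):
        return sum(is_covered(s, candidate_tags, columns, threshold) for s in range(n))

    while coverage(tags) < n:
        best = max(
            (s for s in range(n) if s not in tags),
            key=lambda s: coverage(tags + [s]),
        )
        tags.append(best)
    return tags
```

Calling `greedy_tags(columns)` with one genotype list per SNP returns the indices of the chosen tags. The sketch fits on the full data only, so it says nothing about overfitting; assessing that would require holding out individuals, as in the cross-validation the abstract mentions.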