On the red-blue set cover problem
SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Approximation algorithms
Bioinformatics
Finding associations among SNPS for prostate cancer using collaborative filtering
Proceedings of the ACM sixth international workshop on Data and text mining in biomedical informatics
Due to the high genotyping cost and large data volume in genome-wide association studies, it is desirable to find a small subset of SNPs, referred to as tag SNPs, that covers the genetic variation of the entire data set. To represent the genetic variation of an untagged SNP, existing tagging methods use either a single tag SNP (e.g., Tagger, IdSelect) or several tag SNPs (e.g., MLR, STAMPA). When multiple tags are used to explain the variation of a single SNP, fewer tags are usually needed, but overfitting is higher. This paper explores the trade-off between the number of tags and overfitting, and considers the problem of finding a minimum number of tags when at most two tags can represent the variation of an untagged SNP. We show that this problem is hard to approximate and propose an efficient heuristic, referred to as 2LR. Our experimental results show that 2LR tagging lies between Tagger and MLR in both the number of tags and overfitting. Indeed, 2LR uses slightly more tags than MLR, but its overfitting, measured with 2-fold cross-validation, is practically the same as for Tagger. 2LR tagging also tolerates missing data better than Tagger.
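To make the problem statement concrete, the following is a minimal sketch of a generic greedy selection in which every untagged SNP must be explained by at most two tags. It is not the 2LR heuristic itself, whose details the abstract does not give. The coverage criterion (in-sample R^2 of a least-squares fit at or above a threshold), the default threshold of 0.8, and the names `r_squared`, `is_covered`, and `greedy_two_tag_selection` are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def r_squared(G, predictors, target):
    """In-sample R^2 of a least-squares fit of SNP `target` on the SNPs in `predictors`."""
    y = G[:, target].astype(float)
    ss_tot = np.sum((y - y.mean()) ** 2)
    if ss_tot == 0:
        return 1.0  # a constant SNP carries no variation to explain
    cols = [np.ones(len(y))] + [G[:, p].astype(float) for p in predictors]
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - np.sum(resid ** 2) / ss_tot


def is_covered(G, tags, target, threshold):
    """Assumed criterion: a SNP is covered if it is a tag, or one or two tags reach the R^2 threshold."""
    if target in tags:
        return True
    for k in (1, 2):
        for combo in combinations(tags, k):
            if r_squared(G, combo, target) >= threshold:
                return True
    return False


def greedy_two_tag_selection(G, threshold=0.8):
    """Repeatedly add the SNP whose inclusion covers the most still-uncovered SNPs."""
    m = G.shape[1]
    tags = []
    uncovered = set(range(m))
    while uncovered:
        best, best_cov = None, None
        for cand in range(m):
            if cand in tags:
                continue
            trial = tags + [cand]
            cov = {s for s in uncovered if is_covered(G, trial, s, threshold)}
            if best_cov is None or len(cov) > len(best_cov):
                best, best_cov = cand, cov
        tags.append(best)
        uncovered -= best_cov
    return tags


# Toy genotype matrix (rows: individuals, columns: SNPs), made up for illustration.
# SNP 2 is the sum of SNPs 0 and 1; SNP 3 duplicates SNP 0.
G = np.array([
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 2, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 2, 1],
])
tags = greedy_two_tag_selection(G)
print(tags)
```

In this toy matrix no single SNP explains SNP 2 at the assumed threshold, but the pair of SNPs 0 and 1 explains it exactly, which illustrates why allowing two tags per SNP can reduce the number of tags compared with single-tag methods. Because the fit is evaluated on the same data used to select the tags, a sketch like this says nothing about overfitting; assessing that would need held-out data, as with the cross-validation mentioned in the abstract.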