Computational Problems in Noisy SNP and Haplotype Analysis: Block Scores, Block Identification, and Population Stratification

Authors:
Gad Kimmel; Roded Sharan; Ron Shamir
Affiliations:
School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel; International Computer Science Institute, 1947 Center St., Suite 600, Berkeley, California 94704, USA; School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel
Venue:
INFORMS Journal on Computing
Year:
2004

Abstract

The study of haplotypes and their diversity in a population is central to disease-association research. We study several problems arising in haplotype block partitioning. Our objective function is the total number of distinct haplotypes in blocks. We show that the problem is NP-hard when there are errors or missing data, and provide approximation algorithms for several of its variants. We also give an algorithm that solves the problem with high probability under a probabilistic model that allows noise and missing data. In addition, we study the multipopulation case, where one has to partition the haplotypes into populations and seek a different block partition in each one. We provide a heuristic for that problem and use it to analyze simulated and real data. On simulated data, our blocks resemble the true partition more closely than the blocks generated by the linkage disequilibrium (LD)-based algorithm of Gabriel et al. (2002). On single-population real data, we generate a more concise block description than do extant approaches, with better average LD within blocks. The algorithm also gives promising results on real two-population genotype data.
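To make the objective function concrete, the following is a minimal sketch that scores a block partition by the total number of distinct haplotypes in its blocks and finds a minimum-score partition by dynamic programming. It assumes the simplest setting only: error-free data with no missing entries, and haplotypes given as equal-length strings over '0' and '1'. The function names and the example data are hypothetical and chosen for illustration; this is not presented as the authors' algorithm, and it does not address the noisy or multipopulation variants.

```python
def distinct_in_block(haplotypes, start, end):
    """Number of distinct haplotypes restricted to SNP columns [start, end)."""
    return len({h[start:end] for h in haplotypes})


def partition_score(haplotypes, blocks):
    """Total number of distinct haplotypes over all blocks of a partition."""
    return sum(distinct_in_block(haplotypes, s, e) for s, e in blocks)


def best_partition(haplotypes):
    """Minimum-score partition of the SNP columns into consecutive blocks.

    Assumes noise-free, complete data (an assumption of this sketch).
    best[j] holds the minimum score for the first j columns.
    """
    m = len(haplotypes[0])
    best = [0] + [float("inf")] * m
    prev = [0] * (m + 1)  # prev[j]: start of the last block ending at column j
    for end in range(1, m + 1):
        for start in range(end):
            cost = best[start] + distinct_in_block(haplotypes, start, end)
            if cost < best[end]:
                best[end] = cost
                prev[end] = start
    # Walk the back-pointers to recover the block boundaries.
    blocks = []
    end = m
    while end > 0:
        blocks.append((prev[end], end))
        end = prev[end]
    blocks.reverse()
    return best[m], blocks


# Hypothetical data: two 2-SNP regions, each with 3 local patterns,
# combined in all 9 ways.
patterns = ("00", "01", "11")
example_haplotypes = [a + b for a in patterns for b in patterns]

print(partition_score(example_haplotypes, [(0, 4)]))  # one block: 9
print(best_partition(example_haplotypes))             # (6, [(0, 2), (2, 4)])
```

In this toy example a single block needs 9 distinct haplotypes, while splitting at the region boundary needs only 3 + 3 = 6, which is why the objective rewards partitions that follow the underlying block structure.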