Multivariate imputation of genotype data using short and long range disequilibrium

  • Authors:
  • María M. Abad-Grau;Paola Sebastiani

  • Affiliations:
  • Software Engineering Department, University of Granada, Granada, Spain;Department of Biostatistics, Boston University, Boston, MA

  • Venue:
  • EUROCAST'07 Proceedings of the 11th international conference on Computer aided systems theory
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Missing values in genetic data are a common issue. In this paper we explore several machine learning techniques for creating models that can be used to impute the missing genotypes using multiple genetic markers. We map the machine learning techniques to different patterns of transmission and, in particular, we contrast the effect of short and long range disequilibrium between markers. The assumption of short range disequilibrium implies that only physically close genetic variants are informative for reconstructing missing genotypes, while this assumption is relaxed in long range disequilibrium and physically distant genetic variants become informative for imputation. We evaluate the accuracy of a flexible feature selection model that fits both patterns of transmission using six real datasets of single nucleotide polymorphisms (SNP). The results show an increased accuracy compared to standard imputation models.