Learning phenotype mapping for integrating large genetic data

  • Authors:
  • Chun-Nan Hsu;Cheng-Ju Kuo;Congxing Cai;Sarah A. Pendergrass;Marylyn D. Ritchie;Jose Luis Ambite

  • Affiliations:
  • USC Information Sciences Institute, Marina del Rey, CA and Institute of Information Sciences, Academia Sinica, Taipei, Taiwan;Institute of Information Sciences, Academia Sinica, Taipei, Taiwan;USC Information Sciences Institute, Marina del Rey, CA;Center for Human Genetics Research;Center for Human Genetics Research and Vanderbilt University, Nashville, TN;USC Information Sciences Institute, Marina del Rey, CA

  • Venue:
  • BioNLP '11 Proceedings of BioNLP 2011 Workshop
  • Year:
  • 2011

Abstract

Accurate phenotype mapping will play an important role in facilitating Phenome-Wide Association Studies (PheWAS), and potentially in other phenomics-based studies. The PheWAS approach investigates the association between genetic variation and an extensive range of phenotypes in a high-throughput manner, to better understand the impact of genetic variations on multiple phenotypes. Herein we define the phenotype mapping problem posed by PheWAS analyses, discuss its challenges, and present a machine-learning solution. Our key ideas include the use of weighted Jaccard features and term augmentation by dictionary lookup. Compared with features based on string similarity metrics, our approach improves the F-score from 0.59 to 0.73. With augmentation, the F-score improves further to 0.89. For terms not covered by the dictionary, we use transitive closure inference and reach an F-score of 0.91, close to a level sufficient for practical use. We also show that our model generalizes well to phenotypes not used in our training dataset.
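The abstract names two techniques without defining them, so the following is only a minimal sketch of one common reading, not the authors' implementation. It assumes that a weighted Jaccard score between two phenotype descriptions is the summed weight of their shared tokens divided by the summed weight of all their tokens, and that transitive closure inference means grouping phenotypes so that if A matches B and B matches C, then A is also taken to match C. The function names, the whitespace tokenization, and the default weight of 1.0 are all hypothetical choices made for illustration.

```python
def weighted_jaccard(desc_a, desc_b, weights, default=1.0):
    """Weighted Jaccard over token sets (assumed definition, not from the paper).

    `weights` maps a token to its importance (for example an IDF-style value);
    tokens missing from `weights` fall back to `default`.
    """
    tokens_a = set(desc_a.lower().split())
    tokens_b = set(desc_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty descriptions: define the score as 0
    shared = tokens_a & tokens_b
    numerator = sum(weights.get(t, default) for t in shared)
    denominator = sum(weights.get(t, default) for t in union)
    return numerator / denominator


def transitive_closure(matched_pairs):
    """Group items connected by any chain of matches, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return list(groups.values())
```

Under these assumptions, a rare and informative token contributes more to the score than a common one, and any two phenotypes that land in the same group returned by `transitive_closure` would be treated as mapped to each other even if the classifier never compared them directly.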