Exploring label dependency in active learning for phenotype mapping
BioNLP '12 Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
Accurate phenotype mapping will play an important role in facilitating Phenome-Wide Association Studies (PheWAS), and potentially other phenomics-based studies. The PheWAS approach investigates associations between genetic variation and an extensive range of phenotypes in a high-throughput manner, to better understand the impact of genetic variation on multiple phenotypes. Herein we define the phenotype mapping problem posed by PheWAS analyses, discuss its challenges, and present a machine-learning solution. Our key ideas include the use of weighted Jaccard features and term augmentation by dictionary lookup. Compared to features based on string similarity metrics, our approach improves the F-score from 0.59 to 0.73. With augmentation we show a further improvement in F-score, to 0.89. For terms not covered by the dictionary, we use transitive closure inference and reach an F-score of 0.91, close to a level sufficient for practical use. We also show that our model generalizes well to phenotypes not used in our training dataset.
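The weighted Jaccard feature mentioned above can be illustrated with a minimal sketch. The function below is a hypothetical implementation, not the authors' code: it scores the overlap between two phenotype term token sets, where each token contributes a weight (e.g. an IDF-style weight that downplays common words) rather than a uniform count.

```python
def weighted_jaccard(tokens_a, tokens_b, weight):
    """Weighted Jaccard similarity between two token collections.

    weight: a function mapping a token to a non-negative weight
    (e.g. inverse document frequency); with a constant weight this
    reduces to the plain Jaccard set similarity.
    """
    a, b = set(tokens_a), set(tokens_b)
    inter = sum(weight(t) for t in a & b)
    union = sum(weight(t) for t in a | b)
    return inter / union if union else 0.0

# Hypothetical phenotype terms; with uniform weights the score is
# |intersection| / |union| of the token sets.
sim = weighted_jaccard(
    "type 2 diabetes mellitus".split(),
    "diabetes mellitus type ii".split(),
    weight=lambda t: 1.0,
)
```

In this toy example the two terms share three of five distinct tokens, so the uniform-weight score is 0.6; a learned or IDF-based weight function would let informative tokens such as "diabetes" dominate the score.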