Exploring label dependency in active learning for phenotype mapping
BioNLP '12 Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
Accurate phenotype mapping will play an important role in facilitating Phenome-Wide Association Studies (PheWAS), and potentially other phenomics-based studies. The PheWAS approach investigates associations between genetic variation and an extensive range of phenotypes in a high-throughput manner, to better understand the impact of genetic variation on multiple phenotypes. Herein we define the phenotype mapping problem posed by PheWAS analyses, discuss its challenges, and present a machine-learning solution. Our key ideas include the use of weighted Jaccard features and term augmentation by dictionary lookup. Compared to features based on string similarity metrics, our approach improves the F-score from 0.59 to 0.73. With augmentation we show a further improvement in F-score, to 0.89. For terms not covered by the dictionary, we use transitive closure inference and reach an F-score of 0.91, close to a level sufficient for practical use. We also show that our model generalizes well to phenotypes not used in our training dataset.
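The weighted Jaccard feature mentioned above can be illustrated with a minimal sketch. The function below is a hypothetical implementation, not the authors' code: it scores the overlap between two phenotype term token sets, where each token contributes a weight (e.g. an IDF-style weight that downplays common words) rather than a uniform count.

```python
def weighted_jaccard(tokens_a, tokens_b, weight):
    """Weighted Jaccard similarity between two token collections.

    weight: a function mapping a token to a non-negative weight
    (e.g. inverse document frequency); with a constant weight this
    reduces to the plain Jaccard set similarity.
    """
    a, b = set(tokens_a), set(tokens_b)
    inter = sum(weight(t) for t in a & b)
    union = sum(weight(t) for t in a | b)
    return inter / union if union else 0.0

# Hypothetical phenotype terms; with uniform weights the score is
# |intersection| / |union| of the token sets.
sim = weighted_jaccard(
    "type 2 diabetes mellitus".split(),
    "diabetes mellitus type ii".split(),
    weight=lambda t: 1.0,
)
```

In this toy example the two terms share three of five distinct tokens, so the uniform-weight score is 0.6; a learned or IDF-based weight function would let informative tokens such as "diabetes" dominate the score.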