An effective approach to biomedical information extraction with limited training data
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology for unlocking the knowledge they contain and for supporting more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of treatment effectiveness. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, because the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns they cover, which degrades the performance of supervised machine learning applications trained on these data. This paper proposes an approach that mitigates this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text. The approach uses a sequential discriminative classifier (Conditional Random Fields) to extract mentions of medical problems, treatments, and tests from clinical narratives. It takes advantage of all Medline abstracts indexed with the publication type "clinical trials" to estimate the relatedness between words in the i2b2/VA training and testing corpora. In addition to traditional features such as dictionary matching, pattern matching, and part-of-speech tags, we also used as features the words that appear in contexts similar to those of the word in question, that is, words whose vector representations, derived using methods of distributional semantics, are close under the commonly used cosine metric. To the best of our knowledge, this is the first effort to explore the use of distributional semantics, the semantics derived empirically from unannotated text, often using vector space models, for a sequence classification task such as concept extraction. Accordingly, we first experimented with different sliding window models and identified the parameter settings that yielded the best performance on a preliminary sequence labeling task. The evaluation of this approach, performed against the i2b2/VA concept extraction corpus, showed that incorporating features based on the distribution of words across a large unannotated corpus significantly aids concept extraction. Compared with a supervised-only baseline, the micro-averaged F-score for exact match increased from 80.3% to 82.3%, and the micro-averaged F-score for inexact match increased from 89.7% to 91.3%. These improvements are highly significant according to the bootstrap resampling method and also hold up against the performance of other systems. Thus, distributional semantic features significantly improve the performance of concept extraction from clinical narratives by taking advantage of word distribution information obtained from unannotated data.
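To make the distributional-semantics feature concrete, the following is a minimal Python sketch of the general idea the abstract describes: build word vectors from a sliding window over unannotated text, compare them with the cosine metric, and surface each word's nearest distributional neighbors as candidate extra features for the sequence classifier. This sketch assumes plain co-occurrence counts and illustrative function names; the paper's actual vector construction, window parameters, and feature encoding are not specified here and may differ.

    import math
    from collections import Counter, defaultdict

    def build_vectors(tokens, window=2):
        """Sliding-window co-occurrence model: for each word, count the
        words appearing within `window` positions of it (an assumed,
        simplified stand-in for the paper's distributional model)."""
        vectors = defaultdict(Counter)
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
        return vectors

    def cosine(u, v):
        """Cosine similarity between two sparse count vectors."""
        dot = sum(u[w] * v[w] for w in set(u) & set(v))
        norm_u = math.sqrt(sum(c * c for c in u.values()))
        norm_v = math.sqrt(sum(c * c for c in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def nearest_neighbors(word, vectors, k=3):
        """Return the k words whose vectors are most similar to `word`;
        such neighbors could then be emitted as additional CRF features
        alongside dictionary, pattern, and part-of-speech features."""
        target = vectors[word]
        scored = [(other, cosine(target, vectors[other]))
                  for other in vectors if other != word]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

    # Toy usage on a tiny "corpus"; in the paper the statistics come
    # from a large body of unannotated Medline clinical-trial abstracts.
    tokens = "patient denies chest pain and denies shortness of breath".split()
    vecs = build_vectors(tokens, window=2)
    print(nearest_neighbors("pain", vecs))

The design point this illustrates is that the neighbor lookup requires no annotation at all: the similarity statistics come entirely from unannotated text, so rare or unseen surface forms in the annotated training set can still receive informative features through their distributional neighbors.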