An effective approach to biomedical information extraction with limited training data
Extracting concepts (such as drugs, symptoms, and diagnoses) from clinical narratives constitutes a basic enabling technology for unlocking the knowledge they contain and for supporting more advanced reasoning applications such as diagnosis explanation, disease progression modeling, and intelligent analysis of treatment effectiveness. The recent release of annotated training sets of de-identified clinical narratives has contributed to the development and refinement of concept extraction methods. However, because the annotation process is labor-intensive, training data are necessarily limited in the concepts and concept patterns they cover, which degrades the performance of supervised machine learning applications trained on these data. This paper proposes an approach that mitigates this limitation by combining supervised machine learning with empirical learning of semantic relatedness from the distribution of the relevant words in additional unannotated text. The approach uses a sequential discriminative classifier (Conditional Random Fields) to extract mentions of medical problems, treatments, and tests from clinical narratives. It takes advantage of all Medline abstracts indexed with the publication type "clinical trials" to estimate the relatedness between words in the i2b2/VA training and testing corpora. In addition to traditional features such as dictionary matching, pattern matching, and part-of-speech tags, we also used as features the words that appear in contexts similar to those of the word in question, that is, words whose vector representations, derived using methods of distributional semantics, are close under the commonly used cosine metric. To the best of our knowledge, this is the first effort to explore the use of distributional semantics, the semantics derived empirically from unannotated text, often using vector space models, for a sequence classification task such as concept extraction. Accordingly, we first experimented with different sliding window models and identified the parameter settings that yielded the best performance on a preliminary sequence labeling task. The evaluation of this approach, performed against the i2b2/VA concept extraction corpus, showed that incorporating features based on the distribution of words across a large unannotated corpus significantly aids concept extraction. Compared with a supervised-only baseline, the micro-averaged F-score for exact match increased from 80.3% to 82.3%, and the micro-averaged F-score for inexact match increased from 89.7% to 91.3%. These improvements are highly significant according to the bootstrap resampling method and also hold up against the performance of other systems. Thus, distributional semantic features significantly improve the performance of concept extraction from clinical narratives by taking advantage of word distribution information obtained from unannotated data.
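To make the distributional-semantics feature concrete, the following is a minimal Python sketch of the general idea the abstract describes: build word vectors from a sliding window over unannotated text, compare them with the cosine metric, and surface each word's nearest distributional neighbors as candidate extra features for the sequence classifier. This sketch assumes plain co-occurrence counts and illustrative function names; the paper's actual vector construction, window parameters, and feature encoding are not specified here and may differ.

    import math
    from collections import Counter, defaultdict

    def build_vectors(tokens, window=2):
        """Sliding-window co-occurrence model: for each word, count the
        words appearing within `window` positions of it (an assumed,
        simplified stand-in for the paper's distributional model)."""
        vectors = defaultdict(Counter)
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
        return vectors

    def cosine(u, v):
        """Cosine similarity between two sparse count vectors."""
        dot = sum(u[w] * v[w] for w in set(u) & set(v))
        norm_u = math.sqrt(sum(c * c for c in u.values()))
        norm_v = math.sqrt(sum(c * c for c in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def nearest_neighbors(word, vectors, k=3):
        """Return the k words whose vectors are most similar to `word`;
        such neighbors could then be emitted as additional CRF features
        alongside dictionary, pattern, and part-of-speech features."""
        target = vectors[word]
        scored = [(other, cosine(target, vectors[other]))
                  for other in vectors if other != word]
        return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

    # Toy usage on a tiny "corpus"; in the paper the statistics come
    # from a large body of unannotated Medline clinical-trial abstracts.
    tokens = "patient denies chest pain and denies shortness of breath".split()
    vecs = build_vectors(tokens, window=2)
    print(nearest_neighbors("pain", vecs))

The design point this illustrates is that the neighbor lookup requires no annotation at all: the similarity statistics come entirely from unannotated text, so rare or unseen surface forms in the annotated training set can still receive informative features through their distributional neighbors.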