Using domain knowledge for ontology-guided entity extraction from noisy, unstructured text data

Authors:
Sergey Bratus;Anna Rumshisky;Rajendra Magar;Paul Thompson
Affiliations:
Dartmouth College, Hanover, NH;Brandeis University, Waltham, MA;Dartmouth College, Hanover, NH;Dartmouth College, Hanover, NH
Venue:
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
Year:
2009

Citing 4

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Information Extraction with HMM Structures Learned by Stochastic Optimization
Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence

The general motors variation-reduction adviser: evolution of a CBR system
ICCBR'03 Proceedings of the 5th international conference on Case-based reasoning: Research and Development

Reasoning with textual cases
ICCBR'05 Proceedings of the 6th international conference on Case-Based Reasoning Research and Development

Cited 3

Discovering users' topics of interest on twitter: a first look
AND '10 Proceedings of the fourth workshop on Analytics for noisy unstructured text data

Discovering context: classifying tweets through a semantic transform based on wikipedia
FAC'11 Proceedings of the 6th international conference on Foundations of augmented cognition: directing the future of adaptive systems

A novel semantic information retrieval system based on a three-level domain model
Journal of Systems and Software

Abstract

Domain-specific knowledge is often recorded by experts in the form of unstructured text. For example, in the medical domain, clinical notes from electronic health records contain a wealth of information. Similar practices are found in other domains. The challenge we discuss in this paper is how to identify and extract part names from technicians' repair notes, a noisy, unstructured text data source from General Motors' archives of solved vehicle repair problems. The goal is to develop a robust and dynamic reasoning system to be used as a repair adviser by service technicians. In the present work, we discuss two approaches to this problem. First, we present an algorithm for ontology-guided entity disambiguation that uses existing knowledge sources such as domain-specific ontologies and other structured data. We illustrate its use in the automotive domain, using the GM parts ontology and the unit structure of repair manual text to build context models, which are then used to disambiguate mentions of part-related entities in the text. Second, we describe extraction of part names from a small amount of annotated data using Hidden Markov Models (HMM) with shrinkage, achieving an F-score of approximately 80%. We then used linear-chain Conditional Random Fields (CRF) to model observation dependencies present in the repair notes. Using CRF alone did not improve performance, but a weighted combination of the HMM and CRF models yielded a slight improvement over the HMM results.
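The abstract does not say how the HMM and CRF outputs were weighted, so the following is only a minimal sketch of one common way to combine two sequence labelers: linear interpolation of their per-token label probabilities. The function name `combine`, the weight `w`, the label set (`O`, `PART`), and the example probabilities are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# One token's label distribution: label -> probability (assumed format).
Dist = Dict[str, float]


def combine(hmm: List[Dist], crf: List[Dist], w: float = 0.5) -> List[str]:
    """For each token, pick the label maximizing w*P_hmm + (1 - w)*P_crf.

    `w` is a hypothetical interpolation weight in [0, 1]; w=1 uses only
    the HMM scores and w=0 uses only the CRF scores.
    """
    if len(hmm) != len(crf):
        raise ValueError("both models must score the same tokens")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    labels = []
    for p_hmm, p_crf in zip(hmm, crf):
        # Consider every label either model assigns mass to.
        candidates = sorted(set(p_hmm) | set(p_crf))
        scores = {
            lab: w * p_hmm.get(lab, 0.0) + (1.0 - w) * p_crf.get(lab, 0.0)
            for lab in candidates
        }
        labels.append(max(candidates, key=lambda lab: scores[lab]))
    return labels


# Made-up posteriors for the tokens "replaced", "fuel", "pump".
hmm_post = [{"O": 0.9, "PART": 0.1}, {"O": 0.6, "PART": 0.4}, {"O": 0.2, "PART": 0.8}]
crf_post = [{"O": 0.8, "PART": 0.2}, {"O": 0.3, "PART": 0.7}, {"O": 0.1, "PART": 0.9}]

print(combine(hmm_post, crf_post, w=0.5))  # ['O', 'PART', 'PART']
```

In this toy input the two models disagree on the middle token, and the interpolated score decides between them. In practice the weight would presumably be tuned on held-out annotated notes.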