Term extraction from sparse, ungrammatical domain-specific documents

Authors:
Ashwin Ittoo; Gosse Bouma
Affiliations:
Department of Operations, Faculty of Economics and Business, University of Groningen, The Netherlands; Computational Linguistics (Information Science), Faculty of Arts, University of Groningen, The Netherlands
Venue:
Expert Systems with Applications: An International Journal
Year:
2013

Abstract

Existing term extraction (TE) systems have predominantly targeted large, well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific content, such as customer complaint emails and engineers' repair notes. To this end, we present ExtTerm, a novel term extraction system. As its core innovations, ExtTerm accurately detects rare (low-frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by existing TE systems. ExtTerm also precisely detects multi-word terms of arbitrary length, e.g. with more than 2 words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. We thereby address a limitation of existing TE systems, which are primarily designed to identify terms with 2 words. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction, and can thus compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection.
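
The abstract does not say how ExtTerm computes collocation strength, so the following is only a generic illustration of the idea of scoring word sequences of any length, not the authors' technique. It is a minimal sketch that assumes a straightforward n-word generalization of pointwise mutual information, log(P(w1..wn) / (P(w1) * ... * P(wn))). The function name `ngram_pmi` and the toy repair-note tokens are hypothetical.

```python
import math
from collections import Counter


def ngram_pmi(tokens, n):
    """Score each n-gram by log(P(w1..wn) / (P(w1) * ... * P(wn))).

    Illustrative assumption only: a plain n-word generalization of
    pointwise mutual information, not the measure used by ExtTerm.
    """
    unigram_counts = Counter(tokens)
    ngram_counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    total_unigrams = len(tokens)
    total_ngrams = sum(ngram_counts.values())

    scores = {}
    for gram, count in ngram_counts.items():
        # Observed probability of the whole sequence.
        p_joint = count / total_ngrams
        # Probability expected if the words occurred independently.
        p_independent = math.prod(
            unigram_counts[word] / total_unigrams for word in gram
        )
        scores[gram] = math.log(p_joint / p_independent)
    return scores


# Hypothetical toy corpus in the style of an engineer's repair note.
tokens = "power supply unit failed replaced power supply unit fan noisy".split()
scores = ngram_pmi(tokens, 3)
```

A positive score means the sequence occurs more often than independence of its words would predict. Plain pointwise mutual information is known to overrate sequences built from very rare words, which is one reason sparse collections such as those targeted in the abstract are hard for frequency-based measures.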