Expert Systems with Applications: An International Journal
Existing term extraction (TE) systems have predominantly targeted large, well-written document collections, which provide reliable statistical and linguistic evidence to support term extraction. In this article, we address the term extraction challenges posed by sparse, ungrammatical texts with domain-specific content, such as customer complaint emails and engineers' repair notes. To this end, we present ExtTerm, a novel term extraction system. As our core innovations, we first accurately detect rare (low-frequency) terms, overcoming the issue of data sparsity. These rare terms may denote critical events, but they are often missed by existing TE systems. Second, ExtTerm precisely detects multi-word terms of arbitrary length, i.e. terms with more than two words. This is achieved by exploiting fundamental theoretical notions underlying term formation, and by developing a technique to compute the collocation strength between any number of words. We thereby address a limitation of existing TE systems, which are primarily designed to identify two-word terms. Furthermore, we show that open-domain (general) resources, such as Wikipedia, can be exploited to support domain-specific term extraction, and can thus compensate for the unavailability of domain-specific knowledge resources. Our experimental evaluations reveal that ExtTerm outperforms a state-of-the-art baseline in extracting terms from a domain-specific, sparse and ungrammatical real-life text collection.
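The abstract does not specify how ExtTerm computes collocation strength for more than two words. As a rough illustration of the general idea only, the sketch below uses one common generalization of pointwise mutual information to n words: the log ratio of the observed n-gram probability to the probability expected if the words occurred independently. The function names `ngrams` and `generalized_pmi`, the toy corpus, and the choice of measure are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def generalized_pmi(candidate, tokens):
    """Hypothetical n-word association score (NOT ExtTerm's measure).

    Computes log2( P(w1..wn) / (P(w1) * ... * P(wn)) ), which reduces to
    ordinary PMI when the candidate has two words.
    """
    n = len(candidate)
    unigram_counts = Counter(tokens)
    ngram_counts = Counter(ngrams(tokens, n))

    total_unigrams = len(tokens)
    total_ngrams = max(len(tokens) - n + 1, 1)

    # Observed probability of the whole word sequence.
    joint = ngram_counts[tuple(candidate)] / total_ngrams
    if joint == 0:
        return float("-inf")  # candidate never occurs in the corpus

    # Probability expected if the words were statistically independent.
    independent = 1.0
    for word in candidate:
        independent *= unigram_counts[word] / total_unigrams

    return math.log2(joint / independent)


# Toy corpus standing in for short, noisy repair notes.
corpus = ("hard disk drive failure after hard disk drive replacement "
          "the drive was hard").split()

score = generalized_pmi(("hard", "disk", "drive"), corpus)
unseen = generalized_pmi(("disk", "hard", "drive"), corpus)
print(score, unseen)
```

A known weakness of PMI-style scores is that they tend to overrate sequences seen only once or twice. This is the data-sparsity problem the abstract refers to, so a practical system for rare terms would need additional evidence beyond raw corpus counts.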