Identification of probable real words: an entropy-based approach

Authors:
Youngja Park
Affiliations:
IBM T. J. Watson Research Center, New York
Venue:
ULA '02 Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition - Volume 9
Year:
2002

Abstract

This paper proposes a method for identifying probable real words among the out-of-vocabulary (OOV) words in a text. Real words are identified using the entropy of character-trigram probabilities together with the morphological rules of English. The method also generates possible parts of speech (POS) for each identified real word on the basis of lexical formation rules and word endings. It achieves high precision and high recall. It is particularly useful for recognizing domain-specific technical terms and has been successfully embedded in a glossary extraction system that identifies single-word and multi-word glossary items and builds a domain-specific dictionary.
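The abstract does not give the scoring formula, so the following is only a minimal sketch of one way a character-trigram entropy score could be computed, not the paper's actual method. It assumes a trigram model estimated from a list of known words with add-one smoothing, and scores a candidate by its average surprisal in bits per trigram. The function names (trigrams, train, avg_surprisal), the boundary markers, and the toy lexicon are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def trigrams(word):
    """Return the character trigrams of a word padded with boundary markers."""
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(lexicon):
    """Count trigrams and their two-character contexts over known words."""
    tri, ctx = Counter(), Counter()
    for w in lexicon:
        for t in trigrams(w):
            tri[t] += 1
            ctx[t[:2]] += 1
    # Symbols that can follow a context: observed letters plus the end marker.
    alphabet = {c for w in lexicon for c in w.lower()} | {"$"}
    return tri, ctx, len(alphabet)


def avg_surprisal(word, model):
    """Average -log2 P(c3 | c1 c2) over the word's trigrams (lower = more word-like)."""
    tri, ctx, vocab_size = model
    grams = trigrams(word)
    total = 0.0
    for t in grams:
        # Add-one smoothing (an assumption): unseen trigrams get a small
        # nonzero probability. Characters outside the training alphabet are
        # treated the same way, so the estimate is only approximate.
        p = (tri[t] + 1) / (ctx[t[:2]] + vocab_size)
        total -= math.log2(p)
    return total / len(grams)


# Toy lexicon standing in for a real dictionary (hypothetical data).
lexicon = ["nation", "station", "relation", "rational", "national",
           "notion", "motion", "ration", "action", "traction"]
model = train(lexicon)

for candidate in ["nation", "xqzvkj"]:
    print(candidate, round(avg_surprisal(candidate, model), 2))
```

Under these assumptions, an OOV string whose trigrams resemble those of the known vocabulary receives a lower score than a random character sequence. A decision threshold, the morphological rules, and the POS guessing from word endings would have to come from the paper itself and are not modeled here.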