Identification of probable real words: an entropy-based approach

  • Authors:
  • Youngja Park

  • Affiliations:
  • IBM T. J. Watson Research Center, New York

  • Venue:
  • ULA '02 Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition - Volume 9
  • Year:
  • 2002

Abstract

This paper proposes a method for identifying probable real words among the out-of-vocabulary (OOV) words in a text. Real words are identified using the entropy of character trigram probabilities together with the morphological rules of English. The method also generates possible parts of speech (POS) for each identified word on the basis of lexical formation rules and word endings. It achieves high precision and high recall. The method is particularly useful for recognizing domain-specific technical terms and has been embedded in a glossary extraction system that identifies single-word and multi-word glossary items and builds a domain-specific dictionary.
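
To make the trigram-entropy idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not give the exact formula, so this sketch assumes a character trigram model estimated from a list of known words, add-one smoothing, and a score equal to the average negative log2 probability per trigram. The function names, the boundary markers `^` and `$`, the toy vocabulary, and the threshold value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def trigrams(word):
    """Return the character trigrams of a word, padded with boundary markers."""
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(vocabulary):
    """Count trigrams and their two-character contexts over known words."""
    tri_counts = Counter()
    ctx_counts = Counter()
    alphabet = {"$"}  # symbols that can follow a context, including end marker
    for word in vocabulary:
        alphabet.update(word.lower())
        for gram in trigrams(word):
            tri_counts[gram] += 1
            ctx_counts[gram[:2]] += 1
    return tri_counts, ctx_counts, len(alphabet)


def avg_surprisal(word, model):
    """Average -log2 P(c3 | c1 c2) over the word's trigrams (lower = more word-like)."""
    tri_counts, ctx_counts, alphabet_size = model
    grams = trigrams(word)
    total = 0.0
    for gram in grams:
        # Add-one smoothing (an assumption of this sketch) keeps unseen
        # trigrams from receiving zero probability.
        p = (tri_counts[gram] + 1) / (ctx_counts[gram[:2]] + alphabet_size)
        total -= math.log2(p)
    return total / len(grams)


def looks_like_word(word, model, threshold=3.0):
    """Hypothetical decision rule: accept words whose score is below a threshold."""
    return avg_surprisal(word, model) < threshold


# Toy vocabulary, for illustration only.
known_words = ["nation", "station", "relation", "creation",
               "national", "rational", "action", "fraction"]
model = train(known_words)
```

With this toy model, `looks_like_word("nation", model)` is accepted and `looks_like_word("xqzvk", model)` is rejected, because every trigram of the second string is unseen. A realistic system would train on a large lexicon, tune the threshold on held-out data, and combine the score with the morphological checks the abstract mentions.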