This paper proposes a method for identifying probable real words among out-of-vocabulary (OOV) words in text. Real words are identified using the entropy of character trigram probabilities together with the morphological rules of English. The method also generates the possible parts of speech (POS) of each identified word on the basis of lexical formation rules and word endings. It achieves high precision and high recall. The method is particularly useful for recognizing domain-specific technical terms and has been embedded in a glossary extraction system that identifies single-word and multiword glossary items and builds a domain-specific dictionary.
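
As a rough illustration of the two ideas named above, the following Python sketch scores a candidate word by the average surprisal of its character trigrams and guesses POS tags from its ending. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not define the entropy measure, so per-trigram surprisal with add-one smoothing is assumed here, and the names `train`, `avg_surprisal`, `guess_pos`, the toy lexicon, the suffix table, and the Penn Treebank style tags are all hypothetical.

```python
import math
from collections import Counter


def trigrams(word):
    """Character trigrams of a word, padded with start (^) and end ($) marks."""
    padded = f"^^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(words):
    """Count trigrams and their two-character contexts over a known lexicon."""
    tri, ctx, alphabet = Counter(), Counter(), set()
    for word in words:
        for gram in trigrams(word):
            tri[gram] += 1
            ctx[gram[:2]] += 1
            alphabet.update(gram)
    return tri, ctx, len(alphabet)


def avg_surprisal(word, model):
    """Mean -log2 P(c3 | c1 c2) over the word's trigrams (assumed measure).

    Lower values mean the character sequence looks more like the lexicon.
    """
    tri, ctx, vocab_size = model
    grams = trigrams(word)
    total = 0.0
    for gram in grams:
        # Add-one smoothing so unseen trigrams get a small nonzero probability.
        p = (tri[gram] + 1) / (ctx[gram[:2]] + vocab_size)
        total -= math.log2(p)
    return total / len(grams)


# Hypothetical ending-to-tag table; longer endings are checked first.
SUFFIX_POS = [
    ("ization", ["NN"]),
    ("ness", ["NN"]),
    ("able", ["JJ"]),
    ("ing", ["VBG", "NN"]),
    ("ly", ["RB"]),
    ("ed", ["VBD", "VBN"]),
]


def guess_pos(word):
    """Return candidate POS tags from the word ending, defaulting to noun."""
    lowered = word.lower()
    for suffix, tags in SUFFIX_POS:
        if lowered.endswith(suffix):
            return tags
    return ["NN"]


# Toy lexicon standing in for a real dictionary (illustrative only).
LEXICON = ["token", "organization", "nation", "station", "tokenize",
           "realization", "normal", "formal", "normalize", "relation"]
MODEL = train(LEXICON)
```

In this sketch an OOV string such as `tokenization` receives a lower score from `avg_surprisal` than a random string such as `xqzvkj`, so a threshold on the score could separate probable real words from noise. How such a threshold would be chosen, and how the score would be combined with morphological rules, is not stated in the abstract and is left open here.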