Treatment of Unknown Words

Authors:
Jan Daciuk
Affiliations:
-
Venue:
WIA '99 Revised Papers from the 4th International Workshop on Automata Implementation
Year:
1999

Citing (5)

A corpus-based approach to language learning
Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics
Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics
Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics - Special issue on using large corpora: II
Automatic rule induction for unknown-word guessing. Computational Linguistics

Cited (3)

Using Word Formation Rules to Extend MT Lexicons. AMTA '02 Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users
Beyond N in N-gram tagging. ACLstudent '04 Proceedings of the ACL 2004 workshop on Student research
Robust ending guessing rules with application to Slavonic languages. ROMAND '04 Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data

Abstract

Unrestricted texts almost always contain words that are not in the dictionary. Even so, their likely base forms (in lemmatization), their morphological categories (in tagging), or both still have to be obtained. Some of these words eventually find their way into dictionaries, and it would be useful to predict what their entries should look like. Humans can perform those tasks using the endings of words (and sometimes prefixes and infixes as well), and so can computers. Previous approaches used manually constructed lists of endings and associated information. Brill proposed transformation-based learning from corpora, and Mikheev applied Brill's approach to morphological lexicon data. However, both Brill's algorithm and Mikheev's algorithm, which is derived from Brill's, are slow in both the rule acquisition phase and the rule application phase. Their algorithms handle only tagging, although an extension to other tasks seems possible. We propose a very fast finite-state method that handles all of the tasks described above and achieves a similar quality of guessing.
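To make the idea of guessing from word endings concrete, here is a minimal Python sketch. It is not the finite-state method the abstract proposes: it uses a plain dictionary of endings rather than an automaton, and the names `build_suffix_table`, `guess_tags`, and `TOY_LEXICON`, the `max_len` cutoff, and the toy tags are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_suffix_table(lexicon, max_len=4):
    """Map each word ending (1..max_len characters) to a Counter of tags."""
    table = defaultdict(Counter)
    for word, tag in lexicon:
        for k in range(1, min(max_len, len(word)) + 1):
            table[word[-k:]][tag] += 1
    return table


def guess_tags(word, table, max_len=4):
    """Return the tags seen with the longest known ending of `word`,
    most frequent first, or an empty list if no ending is known."""
    for k in range(min(max_len, len(word)), 0, -1):
        ending = word[-k:]
        if ending in table:
            return [tag for tag, _ in table[ending].most_common()]
    return []


# Hypothetical toy lexicon of (word, tag) pairs; not data from the paper.
TOY_LEXICON = [
    ("walking", "VBG"), ("talking", "VBG"), ("running", "VBG"),
    ("quickly", "RB"), ("slowly", "RB"),
    ("nation", "NN"), ("station", "NN"),
]

table = build_suffix_table(TOY_LEXICON)
print(guess_tags("blogging", table))       # longest known ending is "ing"
print(guess_tags("globalization", table))  # longest known ending is "tion"
```

The sketch prefers the longest matching ending because longer endings tend to be more specific. It covers only the tagging task; guessing base forms would additionally require storing, for each ending, how to rewrite it into the lemma.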