A corpus-based approach to language learning
A corpus-based approach to language learning
Deterministic part-of-speech tagging with finite-state transducers
Computational Linguistics
Coping with ambiguity and unknown words through probabilistic models
Computational Linguistics - Special issue on using large corpora: II
Automatic rule induction for unknown-word guessing
Computational Linguistics
Using Word Formation Rules to Extend MT Lexicons
AMTA '02 Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users
ACLstudent '04 Proceedings of the ACL 2004 workshop on Student research
Robust ending guessing rules with application to Slavonic languages
ROMAND '04 Proceedings of the 3rd Workshop on RObust Methods in Analysis of Natural Language Data
Hi-index | 0.00 |
Words not present in the dictionary are almost always found in unrestricted texts. However, there is a need to obtain their likely base forms (in lemmatization), or morphological categories (in tagging), or both. Some of them find their ways into dictionaries, and it would be nice to predict what their entries should look like. Humans can perform those tasks using endings of words (sometimes prefixes and infixes as well), and so can do computers. Previous approaches used manually constructed lists of endings and associated information. Brill proposed transformation-based learning from corpora, and Mikheev used Brill's approach on data for a morphological lexicon. However, both Brill's algorithm, and Mikheev's algorithm that is derived from Brill's one, lack speed, both in the rule acquisition phase, and in the rule application phase. Their algorithms handle only the case of tagging, although an extension to other tasks seems possible. We propose a very fast finite-state method that handles all of the tasks described above, and that achieves similar quality of guessing.