A probabilistic model for guessing base forms of new words by analogy

Authors:
Krister Lindén
Affiliations:
Department of General Linguistics, University of Helsinki
Venue:
CICLing'08 Proceedings of the 9th international conference on Computational linguistics and intelligent text processing
Year:
2008

Abstract

Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names, or compounds of such words. Looking at English, one might assume that new words appear in their base form, i.e., the lexical look-up form. However, in more highly inflecting languages such as Finnish or Swahili, only 40-50% of new words appear in base form. To index documents or discover translations for these languages, it would be useful to reduce new words to their base forms as well. We often have access to analyses of more frequent words, which shape our intuition for how new words will inflect. We formalize this intuition into a probabilistic model for lemmatizing new words by analogy, i.e., guessing base forms. We test the model on English, Finnish, Swedish, and Swahili and obtain a recall of 89-99% with an average precision of 76-94%, depending on the language and the amount of training material.
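
To make the idea of guessing base forms by analogy concrete, here is a minimal Python sketch. It is not the model from the paper: it assumes a simple suffix-rewrite scheme with relative-frequency estimates, and the names `split_rule`, `train`, `guess`, the `context` parameter, and the toy English `PAIRS` list are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Toy (inflected form, base form) pairs; illustrative data, not from the paper.
PAIRS = [
    ("cats", "cat"), ("dogs", "dog"), ("boxes", "box"), ("foxes", "fox"),
    ("cities", "city"), ("parties", "party"), ("walked", "walk"),
    ("buses", "bus"), ("houses", "house"),
]

def split_rule(word, lemma):
    """Strip the longest common prefix and return the two differing suffixes."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]

def train(pairs, context=2):
    """Count suffix rewrites, each extended by 0..context trailing stem characters."""
    rules = defaultdict(Counter)
    for word, lemma in pairs:
        w_suf, l_suf = split_rule(word, lemma)
        stem = word[:len(word) - len(w_suf)]
        for k in range(min(context, len(stem)) + 1):
            ctx = stem[len(stem) - k:]  # last k characters of the shared stem
            rules[ctx + w_suf][ctx + l_suf] += 1
    return rules

def guess(word, rules):
    """Rank candidate base forms using the longest suffix of `word` seen in training."""
    for start in range(len(word) + 1):  # longest suffix first, empty suffix last
        suffix = word[start:]
        if suffix in rules:
            counts = rules[suffix]
            total = sum(counts.values())
            # Relative frequency of each rewrite serves as a crude probability.
            return [(word[:start] + repl, n / total)
                    for repl, n in counts.most_common()]
    return [(word, 1.0)]  # no analogy found: assume the word is already a base form

rules = train(PAIRS)
print(guess("taxes", rules))   # candidates for an unseen word
print(guess("lenses", rules))  # an ambiguous case with two competing analogies
```

In this sketch, an unseen word such as "taxes" is matched against the longest suffix observed in training ("xes"), and each candidate base form is scored by how often that rewrite occurred. Returning a ranked list rather than a single answer mirrors the recall and precision trade-off the abstract reports, although the scoring here is only a stand-in for the paper's probabilistic model.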