Lemmatisation is the process of finding the normalised forms (lemmas) of wordforms as they appear in text. It is a useful pre-processing step for many language engineering tasks, and is especially important for languages with rich inflectional morphology. This paper presents a machine learning approach to automated word lemmatisation that uses a Ripple Down Rule (RDR) learning algorithm specially adapted to the task. By focusing on word suffixes, the induced Ripple Down Rules determine which wordform suffix should be removed and/or added to generate the lemma. The rules, induced from a lexicon of lemmatised Slovene words, were evaluated by cross-validation on the lexicon and on a hand-validated, annotated corpus, and compared with previous work using two other inductive lemmatisers, ATRIS and CLOG. We show that RDR outperforms ATRIS and is more flexible than CLOG because, unlike CLOG, it can also work without prior part-of-speech tagging. The RDR lemmatiser is easy to train and use for new languages and is, together with CLOG, available via a Web service.
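To make the idea of suffix-based rules with exceptions concrete, the following is a minimal sketch of how such a rule structure might be applied to a wordform. It is not the paper's implementation: the `Rule` class, its field names, and the toy English rules are all assumptions introduced for illustration, and the induction of rules from a lexicon is not shown. Each rule fires when the wordform ends with its suffix, and a matching exception overrides its parent rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """Hypothetical suffix rule with nested exceptions (illustrative only)."""
    suffix: str   # condition: the wordform must end with this suffix
    strip: str    # suffix to remove (assumed to be an ending of `suffix`)
    add: str      # suffix to append after stripping
    exceptions: List["Rule"] = field(default_factory=list)

    def lemmatise(self, wordform: str) -> str:
        # The first exception whose condition matches overrides this rule.
        for exc in self.exceptions:
            if wordform.endswith(exc.suffix):
                return exc.lemmatise(wordform)
        # No exception applies: use this rule's own transformation.
        stem = wordform[: len(wordform) - len(self.strip)]
        return stem + self.add


# Toy English rules, invented for this example (not taken from the paper).
root = Rule(
    suffix="", strip="", add="",              # default: leave the word unchanged
    exceptions=[
        Rule("s", "s", "", exceptions=[
            Rule("ies", "ies", "y"),          # more specific exception to "-s"
        ]),
        Rule("ing", "ing", ""),
    ],
)

for word in ["cats", "ponies", "walking", "dog"]:
    print(word, "->", root.lemmatise(word))
```

In this sketch the default rule at the root always matches, so every wordform receives some lemma; more specific suffixes refine the result only where the general rule would be wrong.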