Ripple Down Rule learning for automated word lemmatisation

  • Authors:
  • Joël Plisson; Nada Lavrač; Dunja Mladenić; Tomaž Erjavec

  • Affiliations:
  • Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia (all authors). E-mail: joel.plisson@ijs.si; Nada Lavrač is also affiliated with the University of Nova Gorica, Vipavska 13, 5000 Nova Gorica, Slovenia

  • Venue:
  • AI Communications
  • Year:
  • 2008

Abstract

Lemmatisation is the process of finding the normalised forms of wordforms as they appear in text. It is a useful pre-processing step for a large number of language engineering tasks, and is especially important for languages with rich inflectional morphology. This paper presents a machine learning approach to automated word lemmatisation using a Ripple Down Rule (RDR) learning algorithm, specially adapted to this task. By focusing on word suffixes, the induced Ripple Down Rules determine which wordform suffix should be removed and/or added to generate the lemma. The rules, induced from a lexicon of lemmatised Slovene words, were evaluated by cross-validation on the lexicon and on a hand-validated annotated corpus, and compared to previous work using two other inductive lemmatisers, ATRIS and CLOG. We show that RDR outperforms ATRIS and is more flexible than CLOG, as it can, unlike CLOG, also work without prior part-of-speech tagging. The RDR lemmatiser is easy to train and use for new languages and is, together with CLOG, available via a Web service.
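
To make the suffix-based idea concrete, the following is a minimal sketch of how a ripple-down rule tree might be applied to a wordform. It is not the authors' implementation: the `Rule` class, its field names, and the `ROOT` tree are hypothetical, and the rules are hand-written English toy examples rather than rules induced from the Slovene lexicon. Each rule states a suffix condition together with the suffix to remove and the suffix to add; a more specific exception, when it matches, overrides its parent rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """One node of a ripple-down rule tree (hypothetical structure)."""
    suffix: str   # condition: the wordform must end with this suffix
    remove: str   # suffix stripped from the wordform (assumed to be a suffix of it)
    add: str      # suffix appended to produce the lemma
    exceptions: List["Rule"] = field(default_factory=list)

    def lemmatise(self, word: str) -> str:
        # The first matching exception overrides this rule.
        for exc in self.exceptions:
            if word.endswith(exc.suffix):
                return exc.lemmatise(word)
        # No exception matched, so apply this rule's own transformation.
        return word[: len(word) - len(self.remove)] + self.add


# Hand-written toy rules for English (an assumption, for illustration only).
ROOT = Rule("", "", "", exceptions=[            # default: the word is its own lemma
    Rule("s", "s", "", exceptions=[             # ...except words ending in -s
        Rule("ies", "ies", "y"),                # ......except -ies, which becomes -y
        Rule("ss", "", ""),                     # ......except -ss, which stays unchanged
    ]),
    Rule("ing", "ing", ""),                     # ...except words ending in -ing
])

for w in ["cat", "cats", "ponies", "glass", "walking"]:
    print(w, "->", ROOT.lemmatise(w))
```

In this sketch the default rule at the root leaves the word unchanged, and each level of exceptions refines the decision for a longer or more specific suffix. A learner would build such a tree from (wordform, lemma) pairs instead of having it written by hand.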