Morphological lexicon extraction from raw text data

  • Authors:
  • Markus Forsberg;Harald Hammarström;Aarne Ranta

  • Affiliations:
  • Department of Computing Science, Chalmers University of Technology, Sweden;Department of Computing Science, Chalmers University of Technology, Sweden;Department of Computing Science, Chalmers University of Technology, Sweden

  • Venue:
  • FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing
  • Year:
  • 2006

Abstract

The extract tool enables the automatic extraction of lemma-paradigm pairs from raw text data. It uses search patterns that combine regular expressions with propositional logic. These search patterns define sufficient conditions for including a lemma-paradigm pair in the lexicon, on the basis of the word forms that occur in the data. This paper explains the search pattern syntax of extract and its search algorithm, and discusses the design of search patterns from the point of view of recall and precision. The extract tool was developed for morphologies defined in the Functional Morphology tool [1], but it is usable with any system that implements a word-and-paradigm description of a morphology. The usefulness of the tool is demonstrated by a case study on the Canadian Hansards Corpus of French. The result is evaluated in terms of the precision of the extracted lemmas, together with statistics on coverage and rule productiveness. Competitive extraction figures show that human-written rules in a tailored tool are a time-efficient approach to the task at hand.
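To make the idea of a search pattern concrete, the following is a minimal Python sketch of rules that pair a regular expression over the stem with a propositional condition over attested word forms. It does not reproduce the syntax or the search algorithm of extract, which the abstract does not show; the rule format, the paradigm name `regular_er_verb`, the function `extract_pairs`, and the toy word list are all assumptions made up for illustration.

```python
import re

# Hypothetical rule format, invented for illustration (not extract's syntax).
# A rule fires for a stem if every suffix in `required` yields an attested
# word form and no suffix in `forbidden` does (a conjunction with negation).
RULES = [
    {
        "paradigm": "regular_er_verb",   # hypothetical paradigm name
        "stem": r"[a-z]+",               # regular expression constraining the stem
        "lemma_suffix": "er",            # suffix that turns the stem into the lemma
        "required": ["er", "e", "ons", "ez"],
        "forbidden": [],
    },
]


def extract_pairs(word_forms, rules=RULES):
    """Return the set of (lemma, paradigm) pairs licensed by the rules."""
    attested = set(word_forms)
    pairs = set()
    for rule in rules:
        stem_re = re.compile(rule["stem"])
        first = rule["required"][0]
        # Candidate stems come from forms ending in the first required suffix.
        for form in attested:
            if not form.endswith(first) or len(form) <= len(first):
                continue
            stem = form[: -len(first)]
            if not stem_re.fullmatch(stem):
                continue
            has_all = all(stem + s in attested for s in rule["required"])
            has_none = not any(stem + s in attested for s in rule["forbidden"])
            if has_all and has_none:
                pairs.add((stem + rule["lemma_suffix"], rule["paradigm"]))
    return pairs


# Toy data (assumed): only "parl-" is attested with all four required endings.
words = ["parler", "parle", "parlons", "parlez", "hiver", "mer", "chantez"]
print(sorted(extract_pairs(words)))  # [('parler', 'regular_er_verb')]
```

In this sketch, adding suffixes to `required` makes the condition harder to satisfy, which would tend to raise precision at the cost of recall; removing suffixes has the opposite effect. This is the kind of trade-off the abstract refers to when it discusses pattern design.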