Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters

  • Authors:
  • Hans van Halteren;Margit Rem

  • Affiliations:
  • CLS--Department of Linguistics/CLST, Radboud University Nijmegen, Nijmegen, The Netherlands;CLS--Department of Dutch, Radboud University Nijmegen, Nijmegen, The Netherlands

  • Venue:
  • Language Resources and Evaluation
  • Year:
  • 2013

Abstract

In this paper we describe a tagger-lemmatizer for fourteenth-century Dutch charters (as found in the corpus van Reenen/Mulder), with a special focus on the treatment of the extensive orthographic variation in this material. We show that, despite the difficulties caused by this variation, we still reach about 95% accuracy in a tenfold cross-validation experiment for both tagging and lemmatization. Variation in tokenization (as applied by the authors) can be handled effectively by pre-normalization (retokenization). For variation in spelling, however, we choose to expand our lexicon with predicted spelling variants. For forms that cannot be found in this expanded lexicon either, we first derive the word class and subsequently search for the most similar lexicon word. Interestingly, our techniques for recognizing spelling variants turn out to be vital for lemmatization accuracy, but much less important for tagging accuracy.
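
The two-stage handling of spelling variation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the rewrite rules in `VARIANT_RULES`, the function names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration, and the word class of an unknown form is simply passed in rather than derived by a tagger.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon: attested spelling -> (word class, lemma).
LEXICON = {
    "scepenen": ("N", "schepen"),
    "brief": ("N", "brief"),
    "gheven": ("V", "geven"),
    "ende": ("CONJ", "en"),
}

# Hypothetical substring rewrite rules used to predict spelling variants.
VARIANT_RULES = [("gh", "g"), ("sc", "sch"), ("ie", "ei")]


def expand_lexicon(lexicon, rules):
    """Return a copy of the lexicon with predicted spelling variants added."""
    expanded = dict(lexicon)
    for form, entry in lexicon.items():
        for old, new in rules:
            if old in form:
                # Attested forms take priority over predicted ones.
                expanded.setdefault(form.replace(old, new), entry)
    return expanded


def lemmatize(form, word_class, lexicon):
    """Look up a form; fall back to the most similar entry of the same class.

    word_class is assumed to come from an earlier tagging step.
    Returns None if the lexicon has no entry for that word class.
    """
    if form in lexicon:
        return lexicon[form][1]
    candidates = [f for f, (cls, _) in lexicon.items() if cls == word_class]
    if not candidates:
        return None
    best = max(candidates, key=lambda f: SequenceMatcher(None, form, f).ratio())
    return lexicon[best][1]


expanded = expand_lexicon(LEXICON, VARIANT_RULES)
print(lemmatize("schepenen", "N", expanded))  # predicted variant, direct hit
print(lemmatize("ghevene", "V", expanded))    # unknown form, similarity fallback
```

Restricting the fallback search to entries of the predicted word class keeps the candidate set small and avoids matching, for example, a verb form to a superficially similar noun.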