Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters

Authors:
Hans Halteren;Margit Rem
Affiliations:
CLS--Department of Linguistics/CLST, Radboud University Nijmegen, Nijmegen, The Netherlands;CLS--Department of Dutch, Radboud University Nijmegen, Nijmegen, The Netherlands
Venue:
Language Resources and Evaluation
Year:
2013

Abstract

In this paper we describe a tagger-lemmatizer for fourteenth-century Dutch charters (as found in the corpus van Reenen/Mulder), with a special focus on the treatment of the extensive orthographic variation in this material. We show that, despite the difficulties caused by this variation, we still reach about 95% accuracy for both tagging and lemmatization in a tenfold cross-validation experiment. We deal effectively with the variation in tokenization (as applied by the authors) through pre-normalization (retokenization). For variation in spelling, however, we choose to expand our lexicon with predicted spelling variants. For forms that cannot be found even in this expanded lexicon, we first derive the word class and then search for the most similar lexicon word. Interestingly, our techniques for recognizing spelling variants turn out to be vital for lemmatization accuracy, but much less important for tagging accuracy.
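
The fallback step for forms missing from the lexicon (derive the word class first, then look for the most similar lexicon word) can be illustrated with a minimal sketch. The abstract does not say which similarity measure is used, so plain Levenshtein edit distance is swapped in here as an assumption. The toy lexicon, the tag labels `N` and `V`, and the function names are all hypothetical and are not taken from the paper or the corpus.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy lexicon: (word form, word class) -> lemma.
# The entries are illustrative only, not drawn from the actual corpus.
LEXICON: Dict[Tuple[str, str], str] = {
    ("coninc", "N"): "koning",
    ("scepenen", "N"): "schepen",
    ("comen", "V"): "komen",
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as an assumed similarity measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def lemmatize(form: str, word_class: str,
              lexicon: Dict[Tuple[str, str], str] = LEXICON) -> Optional[str]:
    """Return a lemma for `form`, given an already predicted word class."""
    # Direct lookup first: the form is known with this word class.
    if (form, word_class) in lexicon:
        return lexicon[(form, word_class)]
    # Fallback: nearest lexicon word among entries of the same word class.
    candidates = [(edit_distance(form, w), w)
                  for (w, c) in lexicon if c == word_class]
    if not candidates:
        return None
    _, best = min(candidates)
    return lexicon[(best, word_class)]


print(lemmatize("coninck", "N"))
```

Restricting the candidates to the predicted word class reflects the order of operations described in the abstract. A real system would presumably weight character substitutions by how plausible they are as spelling variants rather than treating all edits as equal.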