Improving accuracy in word class tagging through the combination of machine learning systems
Computational Linguistics
TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing
A default first order family weight determination procedure for WPDV models
ConLL '00 Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning - Volume 7
Evaluating the pairwise string alignment of pronunciations
LaTeCH-SHELT&R '09 Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education
Mining Synonymous Transliterations from the World Wide Web
ACM Transactions on Asian Language Information Processing (TALIP)
Comparing canonicalizations of historical German text
SIGMORPHON '10 Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology
Learning trees and rules with set-valued features
AAAI'96 Proceedings of the thirteenth national conference on Artificial intelligence - Volume 1
Hi-index | 0.00 |
In this paper we describe a tagger-lemmatizer for fourteenth century Dutch charters (as found in the corpus van Reenen/Mulder), with a special focus on the treatment of the extensive orthographic variation in this material. We show that despite the difficulties caused by the variation, we are still able to reach about 95 % accuracy in a tenfold cross-validation experiment for both tagging and lemmatization. We can deal effectively with the variation in tokenization (as applied by the authors) by pre-normalization (retokenization). For variation in spelling, however, we choose to expand our lexicon with predicted spelling variants. For those forms which can also not be found in this expanded lexicon, we first derive the word class and subsequently search for the most similar lexicon word. Interestingly, our techniques for recognizing spelling variants turn out to be vital for lemmatization accuracy, but much less important for tagging accuracy.