Extending the tool, or how to annotate historical language varieties

Authors:
Cristina Sánchez-Marco; Gemma Boleda; Lluís Padró
Affiliations:
Universitat Pompeu Fabra, Barcelona, Spain; Universitat Politècnica de Catalunya, Barcelona, Spain; Universitat Politècnica de Catalunya, Barcelona, Spain
Venue:
LaTeCH '11 Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Year:
2011

Abstract

We present a simple, general method for adapting an existing NLP tool to historical varieties of a language. The approach consists of expanding the tool's dictionary with old word variants and retraining the tagger on a small training corpus. We implement this approach for Old Spanish. A thorough evaluation of the extended tool shows that the method achieves almost state-of-the-art performance, adequate for quantitative studies in the humanities: 94.5% accuracy for the main part of speech and 92.6% for lemma. To our knowledge, this is the first time that such a strategy has been adopted to annotate historical language varieties, and we believe it could also be applied to other non-standard language varieties.
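
The dictionary-expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the spelling rules in `RULES`, and the function names `modernize` and `expand_lexicon` are all assumptions made for illustration. The idea shown is only that an old spelling variant inherits the lemma and tag of the modern form it maps to.

```python
import re

# Toy modern lexicon (hypothetical): word form -> list of (lemma, POS) analyses.
MODERN_LEXICON = {
    "hacer": [("hacer", "VERB")],
    "cabeza": [("cabeza", "NOUN")],
    "pasar": [("pasar", "VERB")],
}

# Hypothetical old-to-modern spelling rules, applied in order.
# They are illustrative only and are not taken from the paper.
RULES = [
    (re.compile(r"^f(?=[aeiou])"), "h"),  # initial f- before a vowel -> h-
    (re.compile(r"ç"), "z"),              # c-cedilla -> z
    (re.compile(r"ss"), "s"),             # double s -> single s
    (re.compile(r"z(?=er$)"), "c"),       # -zer infinitive ending -> -cer
]


def modernize(form):
    """Apply every spelling rule in sequence to an old word form."""
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def expand_lexicon(lexicon, old_forms):
    """Return a copy of the lexicon with old variants added.

    An old form is added only when its modernized spelling is already
    in the lexicon; it then inherits that entry's analyses.
    """
    expanded = {form: list(analyses) for form, analyses in lexicon.items()}
    for old in old_forms:
        old = old.lower()
        modern = modernize(old)
        if old not in expanded and modern in lexicon:
            expanded[old] = list(lexicon[modern])
    return expanded


expanded = expand_lexicon(MODERN_LEXICON, ["fazer", "cabeça", "passar", "fijo"])
for form in sorted(expanded):
    print(form, expanded[form])
```

In this sketch, forms whose modernized spelling is not in the lexicon (here "fijo") are left out rather than guessed at. The second step of the method, retraining the tagger on a small annotated corpus, is not covered here.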