TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing
Retrieval in text collections with historic spelling using linguistic and spelling variants
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Semantic density analysis: comparing word meaning across time and phonetic space
GEMS '09 Proceedings of the Workshop on Geometrical Models of Natural Language Semantics
Parsing the past: identification of verb constructions in historical text
LaTeCH '12 Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
An open diachronic corpus of historical Spanish
Language Resources and Evaluation
Hi-index | 0.05 |
We present a general and simple method to adapt an existing NLP tool in order to enable it to deal with historical varieties of languages. This approach consists basically in expanding the dictionary with the old word variants and in retraining the tagger with a small training corpus. We implement this approach for Old Spanish. The results of a thorough evaluation over the extended tool show that using this method an almost state-of-the-art performance is obtained, adequate to carry out quantitative studies in the humanities: 94.5% accuracy for the main part of speech and 92.6% for lemma. To our knowledge, this is the first time that such a strategy is adopted to annotate historical language varieties and we believe that it could be used as well to deal with other non-standard varieties of languages.