A gold standard corpus of early modern German

Authors:
Silke Scheible;Richard J. Whitt;Martin Durrell;Paul Bennett
Affiliations:
University of Manchester;University of Manchester;University of Manchester;University of Manchester
Venue:
LAW V '11 Proceedings of the 5th Linguistic Annotation Workshop
Year:
2011

Citing 3

TnT: a statistical part-of-speech tagger
ANLC '00 Proceedings of the sixth conference on Applied natural language processing

Comparing canonicalizations of historical German text
SIGMORPHON '10 Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology

Generating search term variants for text collections with historic spellings
ECIR'06 Proceedings of the 28th European conference on Advances in Information Retrieval

Cited 1

Evaluating an 'off-the-shelf' POS-tagger on early modern German text
LaTeCH '11 Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Abstract

This paper describes an annotated gold standard sample corpus of Early Modern German containing over 50,000 tokens of text manually annotated with POS tags, lemmas, and normalised spelling variants. The corpus is the first resource of its kind for this variant of German, and represents an ideal test bed for evaluating and adapting existing NLP tools on historical data. We describe the corpus format, annotation levels, and challenges, providing an example of the requirements and needs of smaller humanities-based corpus projects.