Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, concurrent tokenizations of the same text are a pervasive problem in everyday NLP practice and pose a non-trivial theoretical problem for the integration of linguistic annotations and for their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format and discusses the consequences for handling queries on annotated corpora.
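To make the idea of concurrent tokenizations concrete, the following is a minimal sketch of one common way to reconcile them. It is not necessarily the format or algorithm the paper proposes. It assumes each tokenization is given as character-offset spans over the same text, and it derives a shared layer of minimal segments that every token can point to, in standoff fashion. The names `merge_tokenizations`, `layers`, and the layer labels are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def merge_tokenizations(
    layers: Dict[str, List[Span]],
) -> Tuple[List[Span], Dict[str, List[List[int]]]]:
    """Derive shared minimal segments from several tokenizations.

    Returns the segment list and, per layer, the segment indices that
    each token covers. This is an illustrative assumption, not the
    paper's actual data model.
    """
    # Every token boundary from every layer, in text order.
    boundaries = sorted({b for spans in layers.values() for sp in spans for b in sp})

    def covered(seg: Span) -> bool:
        # A candidate segment is kept only if some token contains it,
        # which drops gaps such as whitespace between tokens.
        return any(
            s <= seg[0] and seg[1] <= e
            for spans in layers.values()
            for (s, e) in spans
        )

    segments = [seg for seg in zip(boundaries, boundaries[1:]) if covered(seg)]

    # Standoff-style mapping: each token refers to segment indices
    # instead of carrying its own copy of the text.
    mapping = {
        name: [
            [i for i, seg in enumerate(segments) if s <= seg[0] and seg[1] <= e]
            for (s, e) in spans
        ]
        for name, spans in layers.items()
    }
    return segments, mapping


text = "don't stop"
layers = {
    "coarse": [(0, 5), (6, 10)],        # "don't", "stop"
    "fine": [(0, 2), (2, 5), (6, 10)],  # "do", "n't", "stop"
}
segments, mapping = merge_tokenizations(layers)
```

In this example the coarse token "don't" maps to two segments while the fine tokens "do" and "n't" map to one each, so annotations attached to either tokenization can be compared or queried through the shared segment layer.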