By all these lovely tokens...: merging conflicting tokenizations

Authors:
Christian Chiarcos;Julia Ritz;Manfred Stede
Affiliations:
University of Potsdam, Golm, Germany;University of Potsdam, Golm, Germany;University of Potsdam, Golm, Germany
Venue:
ACL-IJCNLP '09 Proceedings of the Third Linguistic Annotation Workshop
Year:
2009

Abstract

Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, concurrent tokenizations of the same text are a pervasive problem in everyday NLP practice and pose a non-trivial theoretical problem for the integration of linguistic annotations and for their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format and discusses the consequences for handling queries on annotated corpora.
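
As a rough illustration of the general idea, and not necessarily the procedure the paper itself uses, conflicting tokenizations can be reconciled by anchoring every token to character offsets in the shared text, as standoff formats do, and splitting the text at every boundary that any tokenizer proposes. The sketch below assumes each tokenization is given as a list of (start, end) offsets; the function name merge_tokenizations, the layer names, and the sample sentence are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def merge_tokenizations(
    layers: Dict[str, List[Span]],
) -> Tuple[List[Span], Dict[str, List[List[int]]]]:
    """Split the text at every token boundary found in any layer.

    Returns the minimal shared segments plus, for each layer, the
    indices of the segments that each original token covers.
    """
    # Every start or end offset used by any tokenizer is a possible cut.
    cuts = sorted({off for spans in layers.values() for s in spans for off in s})

    segments = set()
    pieces_by_layer: Dict[str, List[List[Span]]] = {}
    for name, spans in layers.items():
        pieces_by_layer[name] = []
        for start, end in spans:
            # Cuts inside this token divide it into minimal pieces.
            inner = [c for c in cuts if start <= c <= end]
            pieces = list(zip(inner, inner[1:]))
            pieces_by_layer[name].append(pieces)
            segments.update(pieces)

    ordered = sorted(segments)
    index = {seg: i for i, seg in enumerate(ordered)}
    mapping = {
        name: [[index[p] for p in pieces] for pieces in token_pieces]
        for name, token_pieces in pieces_by_layer.items()
    }
    return ordered, mapping


# Hypothetical example: a whitespace tokenizer and a clitic-splitting one.
text = "don't stop"
layers = {
    "whitespace": [(0, 5), (6, 10)],        # don't | stop
    "clitics": [(0, 2), (2, 5), (6, 10)],   # do | n't | stop
}
segments, mapping = merge_tokenizations(layers)
segment_strings = [text[a:b] for a, b in segments]
```

Here the whitespace token "don't" maps to two shared segments while each token of the clitic-splitting layer maps to exactly one, so annotations from either layer can be addressed through the same segment indices without altering the text itself.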