By all these lovely tokens...: merging conflicting tokenizations

  • Authors:
  • Christian Chiarcos;Julia Ritz;Manfred Stede

  • Affiliations:
  • University of Potsdam, Golm, Germany;University of Potsdam, Golm, Germany;University of Potsdam, Golm, Germany

  • Venue:
  • ACL-IJCNLP '09 Proceedings of the Third Linguistic Annotation Workshop
  • Year:
  • 2009

Abstract

Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text is a pervasive problem in everyday NLP practice and poses a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences for the handling of queries on annotated corpora.
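
To make the problem concrete, the following is a minimal sketch of one common way to reconcile two tokenizations of the same text: anchor every token to character offsets (the basic idea behind standoff annotation) and split the text at every boundary that any tokenization proposes. It is an illustration only and is not taken from the paper; the function names `token_spans` and `merge_tokenizations` are hypothetical, and the sketch assumes that each token occurs verbatim in the text.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def token_spans(text: str, tokens: List[str]) -> List[Span]:
    """Locate each token in the text, left to right, and return its offsets."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if the token is absent
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans


def merge_tokenizations(*tokenizations: List[Span]) -> List[Span]:
    """Return the smallest segments delimited by the boundaries of all inputs."""
    covered = sorted({s for spans in tokenizations for s in spans})
    boundaries = sorted({b for span in covered for b in span})
    segments = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Keep a segment only if some original token covers it,
        # which drops the whitespace gaps between tokens.
        if any(s <= start and end <= e for s, e in covered):
            segments.append((start, end))
    return segments


text = "Don't stop."
a = token_spans(text, ["Do", "n't", "stop", "."])  # tokenizer A (assumed output)
b = token_spans(text, ["Don't", "stop."])          # tokenizer B (assumed output)
merged = merge_tokenizations(a, b)
print([text[s:e] for s, e in merged])  # ['Do', "n't", 'stop', '.']
```

Because every original token is a contiguous run of the merged segments, annotations attached to either tokenization can be re-expressed over the shared segments without altering the text itself.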