By all these lovely tokens... Merging conflicting tokenizations

  • Authors:
  • Christian Chiarcos; Julia Ritz; Manfred Stede

  • Affiliations:
  • Universität Potsdam, Sonderforschungsbereich 632 "Information Structure", Potsdam, Germany 14476 (all three authors)

  • Venue:
  • Language Resources and Evaluation
  • Year:
  • 2012

Abstract

Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, concurrent tokenizations of the same text are a pervasive problem in everyday NLP practice and pose a non-trivial theoretical problem for the integration of linguistic annotations and for their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format and discusses the consequences from a corpus-linguistic perspective.
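
To make the problem concrete, the following is a minimal sketch of one generic way to reconcile two tokenizations of the same text: project every token onto character offsets and split the text at the union of all token boundaries. It is not taken from the paper and does not reproduce its standoff XML format; the function names `token_spans` and `merge_tokenizations` are hypothetical, and the sketch assumes that every token occurs verbatim in the text, in order.

```python
def token_spans(text, tokens):
    """Locate each token in text, left to right, as (start, end) character offsets.

    Assumption: every token appears verbatim in the text, in order.
    """
    spans = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if the token is missing
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def merge_tokenizations(text, *tokenizations):
    """Return the minimal segments induced by the union of all token boundaries."""
    all_spans = [token_spans(text, toks) for toks in tokenizations]
    boundaries = sorted({b for spans in all_spans for span in spans for b in span})
    segments = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Keep a segment only if some token covers it; this drops whitespace gaps.
        if any(s <= start and end <= e for spans in all_spans for s, e in spans):
            segments.append((start, end))
    return segments


text = "don't stop"
tokens_a = ["don't", "stop"]        # one tool keeps the contraction whole
tokens_b = ["do", "n't", "stop"]    # another tool splits it
segments = merge_tokenizations(text, tokens_a, tokens_b)
print([text[s:e] for s, e in segments])  # prints ['do', "n't", 'stop']
```

Each original token can then be described as a span over these minimal segments, so annotations attached to either tokenization can refer to the same underlying units.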