By all these lovely tokens... Merging conflicting tokenizations

Authors:
Christian Chiarcos;Julia Ritz;Manfred Stede
Affiliations:
Universität Potsdam, Sonderforschungsbereich 632 "Information Structure", Potsdam, Germany 14476;Universität Potsdam, Sonderforschungsbereich 632 "Information Structure", Potsdam, Germany 14476;Universität Potsdam, Sonderforschungsbereich 632 "Information Structure", Potsdam, Germany 14476
Venue:
Language Resources and Evaluation
Year:
2012

Abstract

Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, concurrent tokenizations of the same text are a pervasive problem in everyday NLP practice and pose a non-trivial theoretical problem for the integration of linguistic annotations and for their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format and discusses the consequences from a corpus-linguistic perspective.
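
To make the idea of integrating conflicting tokenizations concrete, the following is a minimal sketch of one common approach: take the union of all token boundaries, derive the smallest segments they induce, and let each original token point to the segments it covers. This is an illustration under stated assumptions, not necessarily the procedure the paper describes. Tokens are assumed to be given as character offsets into the same primary text, and the names `merge_tokenizations` and `align` are hypothetical.

```python
from typing import List, Tuple

# A token is a (start, end) pair of character offsets, end exclusive.
# Assumption: both tokenizations refer to the same primary text.
Span = Tuple[int, int]


def merge_tokenizations(a: List[Span], b: List[Span]) -> List[Span]:
    """Return the minimal segments induced by the union of all token boundaries."""
    tokens = sorted(set(a) | set(b))
    boundaries = sorted({offset for span in tokens for offset in span})
    segments = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Keep a segment only if some token covers it; this skips
        # stretches (such as whitespace) that no tokenization claims.
        if any(s <= start and end <= e for s, e in tokens):
            segments.append((start, end))
    return segments


def align(tokens: List[Span], segments: List[Span]) -> List[List[int]]:
    """Map each original token to the indices of the segments it spans."""
    return [
        [i for i, (s, e) in enumerate(segments) if ts <= s and e <= te]
        for ts, te in tokens
    ]


# Example text: "don't stop"
tok_a = [(0, 5), (6, 10)]          # "don't", "stop"
tok_b = [(0, 2), (2, 5), (6, 10)]  # "do", "n't", "stop"

segments = merge_tokenizations(tok_a, tok_b)
print(segments)                # [(0, 2), (2, 5), (6, 10)]
print(align(tok_a, segments))  # [[0, 1], [2]]
print(align(tok_b, segments))  # [[0], [1], [2]]
```

In a standoff setting, the segments would form a shared base layer, and each tool's tokens and annotations would refer to ranges of that layer rather than to one another. That keeps every original tokenization recoverable.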