Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, concurrent tokenizations of the same text are a pervasive problem in everyday NLP practice and pose a non-trivial theoretical problem for the integration of linguistic annotations and for their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format and discusses the consequences from a corpus-linguistic perspective.
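
To make the problem concrete, the following is a minimal sketch of one way two tokenizations of the same text could be reconciled over character offsets, the kind of anchoring a standoff format relies on. It is not taken from the paper: the function names `merge_boundaries` and `align`, the offset-pair representation, and the sample tokenizations are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# A token is a (start, end) pair of character offsets, end exclusive.
# This representation is an assumption of the sketch, not the paper's format.
Span = Tuple[int, int]


def merge_boundaries(*tokenizations: List[Span]) -> List[Span]:
    """Return the minimal segments induced by all token boundaries.

    Every boundary of every tokenization becomes a cut point. Segments
    that no token covers (e.g. whitespace) are dropped.
    """
    cuts = sorted({off for toks in tokenizations for span in toks for off in span})
    candidates = list(zip(cuts, cuts[1:]))
    all_tokens = [span for toks in tokenizations for span in toks]
    return [
        (s, e)
        for (s, e) in candidates
        if any(ts <= s and e <= te for (ts, te) in all_tokens)
    ]


def align(tokens: List[Span], segments: List[Span]) -> Dict[Span, List[int]]:
    """Map each token to the indices of the shared segments it covers."""
    return {
        (ts, te): [i for i, (s, e) in enumerate(segments) if ts <= s and e <= te]
        for (ts, te) in tokens
    }


text = "don't stop"
tok_a = [(0, 5), (6, 10)]           # don't | stop
tok_b = [(0, 2), (2, 5), (6, 10)]   # do | n't | stop

segments = merge_boundaries(tok_a, tok_b)
alignment_a = align(tok_a, segments)
alignment_b = align(tok_b, segments)
```

In this sketch both tokenizations end up pointing at the same list of shared segments, so annotations attached to either one can be compared without modifying the underlying text. How the paper itself encodes this in XML is not shown here.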