Tokenization: returning to a long solved problem: a survey, contrastive experiment, recommendations, and toolkit

Authors:
Rebecca Dridan;Stephan Oepen
Affiliations:
Universitetet i Oslo;Universitetet i Oslo
Venue:
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
Year:
2012

Citing 5

Building a large annotated corpus of English: the Penn Treebank
Computational Linguistics - Special issue on using large corpora: II

Coarse-to-fine n-best parsing and MaxEnt discriminative reranking
ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics

OntoNotes: the 90% solution
NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

Syntactic scope resolution in uncertainty analysis
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics

The ACL Anthology Searchbench
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations

Cited 1

Machine learning for high-quality tokenization: replicating variable tokenization schemes
CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I

Abstract

We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genre- or domain-specific idiosyncrasies).
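
To illustrate what "stand-off pointers" can mean in practice, the following is a minimal sketch of a rule-based tokenizer that records character offsets into the original text for every token. It is not the toolkit described in the abstract: the rule set, the `Token` type, and the `tokenize` function are hypothetical and heavily simplified, and only a small number of Treebank-style conventions (clitic splitting and quote normalization) are approximated.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    start: int  # offset of the token's first character in the original text
    end: int    # offset one past the token's last character
    form: str   # possibly normalized token form


# Hypothetical, heavily simplified rules; alternatives are tried in order.
TOKEN_RE = re.compile(
    r"""
    [A-Za-z]+(?=n't\b)        # stem before a negation clitic (do + n't)
  | n't\b                     # the negation clitic itself
  | '(?:s|re|ve|ll|d|m)\b     # other common clitics
  | \d+(?:[.,]\d+)*           # numbers with internal separators
  | \w+                       # ordinary words
  | \S                        # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> List[Token]:
    """Split text into tokens, keeping offsets into the untouched input."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        start, end = match.span()
        form = match.group()
        if form == '"':
            # Treebank-style quote normalization: opening vs. closing.
            opening = start == 0 or text[start - 1].isspace()
            form = "``" if opening else "''"
        tokens.append(Token(start, end, form))
    return tokens


text = 'She said "don\'t".'
for tok in tokenize(text):
    # The span always recovers the original characters, even when
    # the token form has been normalized.
    print(tok.start, tok.end, tok.form, repr(text[tok.start:tok.end]))
```

Because each token carries its own span, normalization of the token form (here, of double quotes) never loses the link to the source string; downstream annotations can be mapped back to the original text by slicing with `start` and `end`.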