Tokenization: Returning to a Long Solved Problem. A Survey, Contrastive Experiment, Recommendations, and Toolkit

  • Authors:
  • Rebecca Dridan; Stephan Oepen

  • Affiliations:
  • Universitetet i Oslo; Universitetet i Oslo

  • Venue:
  • ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
  • Year:
  • 2012

Abstract

We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g., to genre- or domain-specific idiosyncrasies).
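
To make the notion of stand-off pointers concrete, the following is a minimal sketch in Python. It is not the toolkit described in the abstract: the function name `tokenize_with_offsets` and the small rule set are hypothetical, chosen only to illustrate one Penn Treebank convention (splitting clitics such as "n't") while keeping each token's character offsets into the untouched original text.

```python
import re

# Hypothetical, deliberately tiny rule set for illustration only;
# these are NOT the rules of the toolkit described in the abstract.
TOKEN_PATTERN = re.compile(
    r"""
    n't\b                     # negative clitic, split off as its own token
  | '(?:s|re|ve|ll|d|m)\b     # other common English clitics
  | \w+?(?=n't\b)             # word stem immediately before "n't"
  | \w+                       # ordinary word
  | [^\w\s]                   # any single punctuation character
    """,
    re.VERBOSE | re.IGNORECASE,
)


def tokenize_with_offsets(text):
    """Return (token, start, end) triples.

    The offsets are stand-off pointers: text[start:end] == token
    holds for every triple, and the original text is never modified.
    """
    return [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]


text = "They don't work."
tokens = tokenize_with_offsets(text)
# tokens == [("They", 0, 4), ("do", 5, 7), ("n't", 7, 10),
#            ("work", 11, 15), (".", 15, 16)]
```

Because every token carries its character span, downstream annotations can be mapped back to the original string even if a later step normalizes the token forms themselves.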