Phrase-based approach for adaptive tokenization

  • Authors: Jianqiang Ma; Dale Gerdemann

  • Affiliations: University of Tübingen, Tübingen, Germany (both authors)

  • Venue: SIGMORPHON '12 Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology
  • Year: 2012

Abstract

Word segmentation models must be re-trained quickly when adapting to new resources or domains in NLP for the many Asian languages that are written without word delimiters. The traditional tokenization model is efficient but inaccurate. This paper proposes a phrase-based model that factors sentence tokenization into phrase tokenizations and also takes the dependencies among them into account. The model recognizes out-of-vocabulary (OOV) words well, which significantly improves overall performance. Training consists of linear-time phrase extraction followed by maximum likelihood estimation (MLE), and decoding uses dynamic-programming-based algorithms.
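
To make the train/decode split concrete, here is a minimal Python sketch of the general idea. It is not the authors' implementation: the function names `train` and `decode`, the parameters `max_words`, `max_chars`, and `unk_logp`, and the single-character fallback for unseen text are all assumptions made for illustration, and the sketch deliberately ignores the dependencies between phrases that the paper's model takes into account. Training counts how each unsegmented phrase string was tokenized in a segmented corpus and keeps the relative-frequency (MLE) estimate of its most frequent tokenization; with a fixed `max_words` this is one linear pass over the corpus. Decoding finds the best-scoring sequence of phrases by dynamic programming over character positions.

```python
import math
from collections import Counter, defaultdict


def train(corpus, max_words=3):
    """Phrase extraction + MLE (hypothetical sketch).

    corpus: list of sentences, each a list of words (gold segmentation).
    Returns {unsegmented phrase: (most frequent tokenization, log probability)}.
    """
    counts = defaultdict(Counter)
    for words in corpus:
        for i in range(len(words)):
            # Extract every phrase of 1..max_words consecutive words.
            for j in range(i + 1, min(i + max_words, len(words)) + 1):
                phrase = tuple(words[i:j])
                counts["".join(phrase)][phrase] += 1
    table = {}
    for text, tokenizations in counts.items():
        total = sum(tokenizations.values())
        best, n = tokenizations.most_common(1)[0]
        # MLE: relative frequency of this tokenization for the phrase string.
        table[text] = (best, math.log(n / total))
    return table


def decode(sentence, table, max_chars=10, unk_logp=-10.0):
    """Best phrase sequence by dynamic programming (hypothetical sketch).

    Unseen single characters become one-character tokens with an assumed
    penalty `unk_logp`; this is a crude stand-in, not the paper's OOV handling.
    """
    n = len(sentence)
    # best[k] = (score of the best tokenization of sentence[:k], backpointer)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_chars), end):
            chunk = sentence[start:end]
            if chunk in table:
                tokens, logp = table[chunk]
            elif end - start == 1:
                tokens, logp = (chunk,), unk_logp
            else:
                continue
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, (start, tokens))
    # Follow backpointers from the end of the sentence to recover the tokens.
    out, pos = [], n
    while pos > 0:
        start, tokens = best[pos][1]
        out[:0] = tokens
        pos = start
    return out


# Toy corpus with Latin strings standing in for unsegmented text.
toy_corpus = [["ab", "cd"], ["ab", "e"], ["cd", "e"]]
toy_table = train(toy_corpus)
print(decode("cdab", toy_table))  # ['cd', 'ab']
print(decode("abxe", toy_table))  # ['ab', 'x', 'e']
```

In the toy run, "cdab" never occurs as a phrase in the corpus, yet it is tokenized correctly by combining two known phrases, while the unseen character "x" falls back to a single-character token. Because each phrase is scored independently here, longer known phrases are favored simply for contributing fewer factors; modeling the dependencies between neighboring phrases, as the abstract describes, is what this sketch leaves out.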