Chinese text segmentation with MBDP-1: making the most of training corpora

  • Authors:
  • Michael R. Brent; Xiaopeng Tao

  • Affiliations:
  • Washington University, St. Louis, MO; Washington University, St. Louis, MO

  • Venue:
  • ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
  • Year:
  • 2001

Abstract

This paper describes a system for segmenting Chinese text into words using the MBDP-1 algorithm. MBDP-1 is a knowledge-free segmentation algorithm that bootstraps its own lexicon, which starts out empty. Experiments on Chinese and English corpora show that MBDP-1 reliably outperforms the best previous algorithm when the available hand-segmented training corpus is small. As the size of the hand-segmented training corpus grows, the performance of MBDP-1 converges toward that of the best previous algorithm. The fact that MBDP-1 can be used with a small corpus is expected to be useful not only for the rare event of adapting to a new language, but also for the common event of adapting to a new genre within the same language.