Chinese text segmentation with MBDP-1: making the most of training corpora

Authors:
Michael R. Brent; Xiaopeng Tao
Affiliations:
Washington University, St. Louis, MO; Washington University, St. Louis, MO
Venue:
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Year:
2001


Abstract

This paper describes a system for segmenting Chinese text into words using the MBDP-1 algorithm. MBDP-1 is a knowledge-free segmentation algorithm that bootstraps its own lexicon, which starts out empty. Experiments on Chinese and English corpora show that MBDP-1 reliably outperforms the best previous algorithm when the available hand-segmented training corpus is small. As the size of the hand-segmented training corpus grows, the performance of MBDP-1 converges toward that of the best previous algorithm. The fact that MBDP-1 can be used with a small corpus is expected to be useful not only for the rare event of adapting to a new language, but also for the common event of adapting to a new genre within the same language.
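
To make the idea of a lexicon that "starts out empty" and grows as text is segmented more concrete, the following is a minimal sketch and not MBDP-1 itself. It swaps in a much simpler unigram cost model with a dynamic-programming search. The class name BootstrapSegmenter, the parameters novel_char_cost and max_word_len, and the cost formula are all assumptions made for illustration; the abstract does not specify MBDP-1's probability model.

```python
import math
from collections import Counter


class BootstrapSegmenter:
    """Toy lexicon-bootstrapping segmenter (illustrative stand-in, NOT MBDP-1)."""

    def __init__(self, novel_char_cost=4.0, max_word_len=6):
        self.counts = Counter()   # the lexicon: word -> frequency; starts empty
        self.total = 0            # total number of word tokens seen so far
        self.novel_char_cost = novel_char_cost  # assumed per-character penalty
        self.max_word_len = max_word_len        # assumed cap on candidate length

    def _cost(self, word):
        # Known word: negative log of a smoothed relative frequency.
        if word in self.counts:
            return -math.log(self.counts[word] / (self.total + 1))
        # Unseen word: cost of a once-seen word plus a length penalty.
        return -math.log(1 / (self.total + 1)) + self.novel_char_cost * len(word)

    def segment(self, text):
        # best[i] is the lowest total cost of segmenting text[:i];
        # back[i] is the start index of the last word in that segmentation.
        n = len(text)
        best = [0.0] + [math.inf] * n
        back = [0] * (n + 1)
        for end in range(1, n + 1):
            for start in range(max(0, end - self.max_word_len), end):
                cost = best[start] + self._cost(text[start:end])
                if cost < best[end]:
                    best[end] = cost
                    back[end] = start
        # Follow the back-pointers from the end of the text to recover words.
        words = []
        end = n
        while end > 0:
            words.append(text[back[end]:end])
            end = back[end]
        words.reverse()
        return words

    def learn(self, words):
        # Supervised update from a hand-segmented sentence.
        self.counts.update(words)
        self.total += len(words)

    def segment_and_learn(self, text):
        # Unsupervised update: segment, then add the result to the lexicon.
        words = self.segment(text)
        self.learn(words)
        return words


if __name__ == "__main__":
    seg = BootstrapSegmenter()
    seg.learn(["the", "dog"])              # small hand-segmented corpus
    seg.learn(["the", "cat"])
    print(seg.segment("thedog"))
    print(seg.segment_and_learn("thebird"))  # "bird" enters the lexicon
```

In this sketch, learn plays the role of a small hand-segmented training corpus, while segment_and_learn lets the lexicon keep growing from unsegmented text. How MBDP-1 actually scores candidate words is described in the paper and is not reproduced here.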