This paper describes a system for segmenting Chinese text into words using the MBDP-1 algorithm. MBDP-1 is a knowledge-free segmentation algorithm that bootstraps its own lexicon, which starts out empty. Experiments on Chinese and English corpora show that MBDP-1 reliably outperforms the best previous algorithm when the available hand-segmented training corpus is small. As the size of the hand-segmented training corpus grows, the performance of MBDP-1 converges toward that of the best previous algorithm. The fact that MBDP-1 can be used with a small corpus is expected to be useful not only for the rare event of adapting to a new language, but also for the common event of adapting to a new genre within the same language.
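The idea of a lexicon that starts out empty and grows as text is segmented can be illustrated with a minimal sketch. This is not MBDP-1's actual probability model, which the abstract does not spell out. It is a simplified cost-based stand-in, and the function names (`segment`, `bootstrap`) and the cost parameters (`novel_word_cost`, `novel_char_cost`) are assumptions made only for illustration.

```python
import math
from collections import Counter


def segment(text, lexicon, novel_word_cost=2.0, novel_char_cost=3.0):
    """Split `text` into the lowest-cost word sequence by dynamic programming.

    Known words cost -log of a smoothed relative frequency; unknown words
    pay a flat penalty plus a per-character penalty (hypothetical values).
    """
    total = sum(lexicon.values())
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[i]: lowest cost of segmenting text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            count = lexicon.get(word, 0)
            if count:
                cost = -math.log(count / (total + 1))
            else:
                cost = novel_word_cost + novel_char_cost * len(word)
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


def bootstrap(utterances):
    """Segment utterances one at a time, adding each result to the lexicon."""
    lexicon = Counter()  # starts out empty
    segmented = []
    for utterance in utterances:
        words = segment(utterance, lexicon)
        lexicon.update(words)
        segmented.append(words)
    return lexicon, segmented


lexicon, segmented = bootstrap(["ab", "abcd", "cd"])
print(segmented)
```

In this sketch the first utterance is kept whole because nothing is known yet. Once `ab` is in the lexicon, reusing it becomes cheaper than positing a new four-character word, so `abcd` is split into `ab` and `cd`. Hand-segmented training text, where available, could be used to pre-populate the lexicon in the same way, although how the system described here combines the two is not stated in the abstract.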