Chinese text segmentation for text retrieval: achievements and problems
Journal of the American Society for Information Science
A stochastic finite-state word-segmentation algorithm for Chinese
Computational Linguistics
A study on word-based and integral-bit Chinese text compression algorithms
Journal of the American Society for Information Science
An Efficient, Probabilistically Sound Algorithm for Segmentation andWord Discovery
Machine Learning - Special issue on natural language learning
Statistical Models for Text Segmentation
Machine Learning - Special issue on natural language learning
A new statistical formula for Chinese text segmentation incorporating contextual information
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Self-Supervised Chinese Word Segmentation
IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis
USe: A Retargetable Word Segmentation Procedure for Information Retrieval
USe: A Retargetable Word Segmentation Procedure for Information Retrieval
A compression-based algorithm for Chinese word segmentation
Computational Linguistics
Mostly-unsupervised statistical segmentation of Japanese: applications to kanji
NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Chinese word segmentation without using lexicon and hand-crafted training data
COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Chinese text segmentation with MBDP-1: making the most of training corpora
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Unsupervised segmentation of Chinese text by use of branching entropy
COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Integrating unsupervised and supervised word segmentation: The role of goodness measures
Information Sciences: an International Journal
A new unsupervised approach to word segmentation
Computational Linguistics
AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
ACM Transactions on Asian Language Information Processing (TALIP)
Unknown Chinese word extraction based on variety of overlapping strings
Information Processing and Management: an International Journal
Hi-index | 0.00 |
The lack of word delimiters such as spaces in Chinese texts makes word segmentation a special issue in Chinese text processing. As the volume of Chinese texts grows rapidly on the Internet, the number of unknown words increases accordingly. However, word segmentation approaches relying solely on existing dictionaries are helpless in handling unknown words. In this paper, we propose a novel unsupervised method to segment large Chinese corpora using contextual information. In particular, the number of characters preceding and following a string, known as the accessors of the string, is used to measure the independence of the string. The greater the independence, the more likely it is that the string is a word. The segmentation problem is then considered an optimization problem to maximize the target function of this number over all word candidates in an utterance. Our purpose here is to explore the best function in terms of segmentation performance. The performance is evaluated with the word token recall measure in addition to word type precision and word type recall. Among the three types of target functions that we have explored, polynomial functions turn out to outperform others. This simple method is effective in unsupervised segmentation of Chinese texts and its performance is highly comparable to other recently reported unsupervised segmentation methods.