Unsupervised segmentation of Chinese corpus using accessor variety

Authors:
Haodi Feng; Kang Chen; Chunyu Kit; Xiaotie Deng
Affiliations:
School of Computer Science and Technology, Shandong University, Jinan; Department of Computer Science and Technology, Tsinghua University, Beijing; Department of Chinese, Translation and Linguistics, City University of Hong Kong, Kowloon, Hong Kong; Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
Venue:
IJCNLP'04 Proceedings of the First International Joint Conference on Natural Language Processing
Year:
2004

Abstract

Chinese text has no word delimiters such as spaces, which makes word segmentation a distinctive problem in Chinese text processing. As the volume of Chinese text on the Internet grows rapidly, the number of unknown words increases accordingly, and segmentation approaches that rely solely on existing dictionaries cannot handle them. In this paper, we propose a novel unsupervised method for segmenting large Chinese corpora using contextual information. In particular, the number of distinct characters that precede and follow a string, known as the accessors of the string, is used to measure the independence of the string: the greater the independence, the more likely the string is a word. Segmentation is then treated as an optimization problem: maximize a target function of this number over all word candidates in an utterance. Our purpose here is to find the function that gives the best segmentation performance. Performance is evaluated with word token recall in addition to word type precision and word type recall. Among the three types of target functions that we explored, polynomial functions outperform the others. This simple method is effective for unsupervised segmentation of Chinese text, and its performance is comparable to that of other recently reported unsupervised segmentation methods.
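The idea in the abstract can be illustrated with a minimal sketch. It assumes that a string's accessor variety is the smaller of its distinct left-neighbour and right-neighbour counts, and that segmentation picks the split maximizing the summed score of its words by dynamic programming. The names `accessor_variety`, `segment`, `poly_score`, the `max_len` cap, the handling of sentence boundaries, the zero score for single characters, and the particular polynomial are all illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Map each substring (up to max_len chars) to its accessor variety."""
    left = defaultdict(set)
    right = defaultdict(set)
    for sid, s in enumerate(sentences):
        n = len(s)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                w = s[i:j]
                # Assumption: a sentence boundary counts as an accessor,
                # and boundaries of different sentences are distinct.
                left[w].add(s[i - 1] if i > 0 else ("<s>", sid))
                right[w].add(s[j] if j < n else ("</s>", sid))
    # Assumption: variety is the smaller of the two distinct-neighbour counts.
    return {w: min(len(left[w]), len(right[w])) for w in left}


def poly_score(word, av):
    """Hypothetical polynomial target function: length squared times variety."""
    return len(word) ** 2 * av


def segment(sentence, av, score=poly_score, max_len=4):
    """Return the split of `sentence` with the highest total score."""
    n = len(sentence)
    best = [0.0] + [float("-inf")] * n  # best[j]: top score for sentence[:j]
    back = [0] * (n + 1)                # back[j]: start index of the last word
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            w = sentence[i:j]
            # Assumption: single characters are always allowed and score 0.
            s = 0.0 if len(w) == 1 else score(w, av.get(w, 0))
            if best[i] + s > best[j]:
                best[j] = best[i] + s
                back[j] = i
    words = []
    j = n
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return words[::-1]


# Toy corpus in Latin letters, standing in for unsegmented Chinese text.
corpus = ["abc", "xab", "aby"]
av = accessor_variety(corpus, max_len=2)
print(av["ab"])                          # "ab" occurs in three distinct contexts
print(segment("xaby", av, max_len=2))
```

In this toy corpus the string "ab" appears with three different neighbours on each side, so it scores higher than "xa" or "by" and the search keeps it as one unit. Swapping `poly_score` for another function is how different target functions could be compared under this sketch.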