A new unsupervised approach to word segmentation

Authors:
Hanshi Wang;Jian Zhu;Shiping Tang;Xiaozhong Fan
Affiliations:
Beijing Institute of Technology;Beijing Institute of Technology;Beijing Institute of Technology;Beijing Institute of Technology
Venue:
Computational Linguistics
Year:
2011

Abstract

This article proposes ESA, a new unsupervised approach to word segmentation. ESA is an iterative process consisting of three phases: Evaluation, Selection, and Adjustment. In Evaluation, both the certainty and the uncertainty of character-sequence co-occurrence in corpora serve as statistical evidence for measuring goodness. In addition, a simple process called Balancing makes the statistics of character sequences of different lengths comparable with one another. In Selection, a local-maximum strategy is adopted that requires no thresholds and can be implemented with dynamic programming. In Adjustment, part of the statistical data is updated to improve subsequent results. In our experiments, ESA was evaluated on the SIGHAN Bakeoff-2 data set. The results suggest that ESA is effective on Chinese corpora. Notably, the F-measures of the results increase almost monotonically and converge rapidly to relatively high values. Furthermore, empirical formulae derived from the results can be used to predict the parameter in ESA, avoiding parameter estimation, which is usually time-consuming.
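
The abstract notes that Selection can be implemented with dynamic programming but does not give the goodness measure or the exact selection rule. As a rough illustration only, the sketch below shows a generic threshold-free, maximum-score segmentation by dynamic programming. It is not the paper's Selection procedure: the function `segment`, the `max_len` limit, and the toy scores in `TOY_SCORES` are all assumptions made for this example, and the real goodness values would come from the Evaluation phase.

```python
import math


def segment(text, score, max_len=4):
    """Split `text` into the word sequence with the highest summed score.

    `score` is any callable mapping a candidate word to a number. It is a
    placeholder for a goodness measure, which is NOT specified here.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        # Try every candidate last word text[start:end] up to max_len characters.
        for start in range(max(0, end - max_len), end):
            total = best[start] + score(text[start:end])
            if total > best[end]:
                best[end] = total
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Hypothetical goodness values, invented purely for this illustration.
TOY_SCORES = {"ab": 2.0, "cd": 2.0, "abc": 2.5}


def toy_score(seq):
    """Toy stand-in for a goodness measure (an assumption, not ESA's)."""
    if seq in TOY_SCORES:
        return TOY_SCORES[seq]
    # Single characters are neutral; unknown longer sequences are penalized.
    return 0.0 if len(seq) == 1 else -1.0


print(segment("abcd", toy_score))  # ['ab', 'cd']
```

No threshold appears anywhere in the sketch: a boundary is placed wherever it raises the total score, so "ab" + "cd" (4.0) beats "abc" + "d" (2.5) even though "abc" has the highest individual score.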