Self-Supervised Chinese Word Segmentation

Authors:
Fuchun Peng; Dale Schuurmans
Venue:
IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis
Year:
2001

Abstract

We propose a new unsupervised training method for acquiring probability models that accurately segment Chinese character sequences into words. By constructing a core lexicon to guide unsupervised word learning, self-supervised segmentation overcomes the local maxima problems that hamper standard EM training. Our procedure uses successive EM phases to learn a good probability model over character strings, and then prunes this model with a mutual information selection criterion to obtain a more accurate word lexicon. The segmentations produced by these models are more accurate than those produced by training with EM alone.