Word-based and character-based word segmentation models: comparison and combination

Authors:
Weiwei Sun
Affiliations:
Saarland University
Venue:
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Year:
2010

Citing 9
Cited 7

Bagging predictors

Machine Learning
Word identification for Mandarin Chinese sentences

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 1
Online Passive-Aggressive Algorithms

The Journal of Machine Learning Research
Structure compilation: trading structure for features

Proceedings of the 25th international conference on Machine learning
A hybrid Markov/semi-Markov conditional random field for sequence segmentation

EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing
Subword-based tagging by conditional random fields for Chinese word segmentation

NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
A discriminative latent variable Chinese segmenter with hybrid word/character information

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging: a case study

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
Chinese semantic role labeling with shallow parsing

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3

A stacked sub-word model for joint Chinese word segmentation and part-of-speech tagging

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Enhancing Chinese word segmentation using unlabeled data

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Integrating Generative and Discriminative Character-Based Models for Chinese Word Segmentation

ACM Transactions on Asian Language Information Processing (TALIP)
Reducing approximation and estimation errors for Chinese lexical processing with heterogeneous annotations

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Capturing paradigmatic and syntagmatic lexical relations: towards accurate Chinese part-of-speech tagging

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Unknown Chinese word extraction based on variety of overlapping strings

Information Processing and Management: an International Journal

Abstract

We present a theoretical and empirical comparative analysis of the two dominant categories of approaches to Chinese word segmentation: word-based models and character-based models. We show that, despite similar overall performance, the two models produce different distributions of segmentation errors, in a way that can be explained by their theoretical properties. We further exploit this analysis to improve segmentation accuracy by integrating a word-based segmenter and a character-based segmenter in a Bootstrap Aggregating (bagging) model. By letting multiple segmenters vote, our model consistently improves segmentation on the four data sets from the second SIGHAN bakeoff.
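
The abstract does not spell out how the segmenters are trained or how their votes are combined, so the following Python sketch only illustrates the general idea of bagging with voting, under stated assumptions. It resamples the training sentences with replacement, trains one toy segmenter per sample, and keeps a word boundary when more than half of the segmenters propose it. The forward maximum-matching segmenter, the per-boundary majority rule, and all function names (`bootstrap_sample`, `train_max_match`, `max_match`, `vote`, `bagging_segment`) are hypothetical stand-ins, not the paper's word-based and character-based models.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_max_match(sentences):
    """Toy 'training' (an assumption): collect the lexicon of segmented sentences."""
    return {word for sent in sentences for word in sent}


def max_match(text, lexicon, max_len=4):
    """Forward maximum matching; unknown characters become one-character words."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def boundaries(words):
    """Character offsets at which a word ends, excluding the end of the text."""
    cuts, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        cuts.add(pos)
    return cuts


def vote(text, segmentations):
    """Keep a boundary when a strict majority of segmentations propose it."""
    if not text:
        return []
    counts = Counter(c for seg in segmentations for c in boundaries(seg))
    cuts = sorted(c for c, n in counts.items() if 2 * n > len(segmentations))
    edges = [0] + cuts + [len(text)]
    return [text[a:b] for a, b in zip(edges, edges[1:])]


def bagging_segment(text, train, m=5, seed=0):
    """Train m toy segmenters on bootstrap samples and let them vote."""
    rng = random.Random(seed)
    models = [train_max_match(bootstrap_sample(train, rng)) for _ in range(m)]
    return vote(text, [max_match(text, lex) for lex in models])
```

For example, `vote("abcd", [["ab", "cd"], ["ab", "cd"], ["a", "bcd"]])` keeps only the boundary after `"ab"`, because two of the three segmentations agree on it. In a setting closer to the one the abstract describes, the toy segmenter would be replaced by trained word-based and character-based models, both contributing votes.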