Enhancing Chinese word segmentation using unlabeled data

Authors:
Weiwei Sun;Jia Xu
Affiliations:
Saarland University, and German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany;German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany
Venue:
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Year:
2011

Citing 10
Cited 2

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Accessor variety criteria for Chinese word extraction

Computational Linguistics
Bayesian semi-supervised Chinese word segmentation for statistical machine translation

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
A discriminative latent variable chinese segmenter with hybrid word/character information

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Punctuation as implicit annotations for chinese word segmentation

Computational Linguistics
Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging: a case study

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
Word representations: a simple and general method for semi-supervised learning

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Profiting from mark-up: hyper-text annotations for guided parsing

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Word-based and character-based word segmentation models: comparison and combination

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
A stacked sub-word model for joint Chinese word segmentation and part-of-speech tagging

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1

Reducing approximation and estimation errors for Chinese lexical processing with heterogeneous annotations

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Capturing paradigmatic and syntagmatic lexical relations: towards accurate Chinese part-of-speech tagging

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea about transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.