Integrating Generative and Discriminative Character-Based Models for Chinese Word Segmentation

Authors:
Kun Wang; Chengqing Zong; Keh-Yih Su
Affiliations:
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; Behavior Design Corporation
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2012

Abstract

Among statistical approaches to Chinese word segmentation, two dominate the literature: the word-based n-gram (generative) model and the character-based tagging (discriminative) model. The former gives excellent performance on in-vocabulary (IV) words but handles out-of-vocabulary (OOV) words poorly. The latter is more robust for OOV words but fails to deliver satisfactory performance on IV words. The two approaches behave differently because of the unit they use (word vs. character) and the model form they adopt (generative vs. discriminative). In general, character-based approaches are more robust than word-based ones, because the vocabulary of characters is a closed set, and discriminative models are more robust than generative ones, because they can flexibly include all kinds of available information, such as future context. This article first proposes a character-based n-gram model to enhance the robustness of the generative approach. The proposed generative model is then integrated with the character-based discriminative model to take advantage of both approaches. Our experiments show that this integrated approach outperforms all the existing approaches reported in the literature. A complete and detailed error analysis is then conducted. Because a significant portion of the critical errors is related to numerical/foreign strings, character-type information is incorporated into the model to further improve its performance. Finally, the proposed integrated approach is tested on cross-domain corpora, and a semi-supervised domain adaptation algorithm is proposed and shown to be effective in our experiments.
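
The character-based view described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the tag set or how the two models are combined, so everything here is an assumption made for illustration: the B/M/E/S position tags (begin, middle, end, single-character word), the function names, and the use of a weighted sum of log-scores with a hypothetical weight `alpha` to combine a generative and a discriminative score.

```python
def words_to_tags(words):
    """Turn a segmented sentence into (character, tag) pairs.

    Assumed tag set: B = word-initial, M = word-internal,
    E = word-final, S = single-character word.
    """
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def tags_to_words(pairs):
    """Recover the word sequence from (character, tag) pairs."""
    words, current = [], ""
    for ch, tag in pairs:
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


def combined_score(gen_logprob, disc_logprob, alpha=0.5):
    """Hypothetical combination: weighted sum of two log-scores.

    gen_logprob  -- log-score from a character-based generative model
    disc_logprob -- log-score from a character-based discriminative model
    alpha        -- assumed interpolation weight in [0, 1]
    """
    return alpha * gen_logprob + (1.0 - alpha) * disc_logprob


pairs = words_to_tags(["中国", "人"])
print(pairs)
print(tags_to_words(pairs))
print(combined_score(-2.0, -4.0, alpha=0.25))
```

Because every character receives a tag from a small closed set, an unseen word can still be segmented as long as its characters are known, which is the robustness argument the abstract makes for character-based models.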