Chinese word segmentation and statistical machine translation

Authors:
Ruiqiang Zhang;Keiji Yasuda;Eiichiro Sumita
Affiliations:
National Institute of Information and Communications Technology, Kyoto, Japan;National Institute of Information and Communications Technology, Kyoto, Japan;National Institute of Information and Communications Technology, Kyoto, Japan
Venue:
ACM Transactions on Speech and Language Processing (TSLP)
Year:
2008

Citing 17
Cited 2

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
A systematic comparison of various statistical alignment models

Computational Linguistics
Using POS information for statistical machine translation into morphologically rich languages

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
BLEU: a method for automatic evaluation of machine translation

ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics
Chunking with support vector machines

NAACL '01 Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies
Statistical phrase-based translation

NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
Minimum error rate training in statistical machine translation

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Chinese word segmentation as LMR tagging

SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
HHMM-based Chinese lexical analyzer ICTCLAS

SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Adaptive Chinese word segmentation

ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Combination of Arabic preprocessing schemes for statistical machine translation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Chinese segmentation and new word detection using conditional random fields

COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Improving statistical MT through morphological analysis

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Measuring Word Alignment Quality for Statistical Machine Translation

Computational Linguistics
Morphological analysis for statistical machine translation

HLT-NAACL-Short '04 Proceedings of HLT-NAACL 2004: Short Papers
Subword-based tagging by conditional random fields for Chinese word segmentation

NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
Mixture-model adaptation for SMT

StatMT '07 Proceedings of the Second Workshop on Statistical Machine Translation

Improved statistical machine translation by multiple Chinese word segmentation

StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Automatic evaluation of Chinese translation output: word-level or character-level?

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2

Quantified Score

Hi-index	0.00

Visualization

Abstract

Chinese word segmentation (CWS) is a necessary step in Chinese-English statistical machine translation (SMT) and its performance has an impact on the results of SMT. However, there are many choices involved in creating a CWS system such as various specifications and CWS methods. The choices made will create a new CWS scheme, but whether it will produce a superior or inferior translation has remained unknown to date. This article examines the relationship between CWS and SMT. The effects of CWS on SMT were investigated using different specifications and CWS methods. Four specifications were selected for investigation: Beijing University (PKU), Hong Kong City University (CITYU), Microsoft Research (MSR), and Academia SINICA (AS). We created 16 CWS schemes under different settings to examine the relationship between CWS and SMT. Our experimental results showed that the MSR's specifications produced the lowest quality translations. In examining the effects of CWS methods, we tested dictionary-based and CRF-based approaches and found there was no significant difference between the two in the quality of the resulting translations. We also found the correlation between the CWS F-score and SMT BLEU score was very weak. We analyzed CWS errors and their effect on SMT by evaluating systems trained with and without these errors. This article also proposes two methods for combining advantages of different specifications: a simple concatenation of training data and a feature interpolation approach in which the same types of features of translation models from various CWS schemes are linearly interpolated. We found these approaches were very effective in improving the quality of translations.