Class-based n-gram models of natural language. Computational Linguistics.
Foundations of Statistical Natural Language Processing.
Toward a unified approach to statistical language modeling for Chinese. ACM Transactions on Asian Language Information Processing (TALIP).
Distributional part-of-speech tagging. EACL '95: Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics.
An iterative algorithm to build Chinese language models. ACL '96: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
An empirical study of smoothing techniques for language modeling. ACL '96: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
Introduction to Information Retrieval.
Modeling characters versus words for Mandarin speech recognition. ICASSP '09: Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.
Optimizing Chinese word segmentation for machine translation performance. StatMT '08: Proceedings of the Third Workshop on Statistical Machine Translation.
A word clustering approach for language model-based sentence retrieval in question answering systems. Proceedings of the 18th ACM Conference on Information and Knowledge Management.
Statistical Machine Translation.
Integrating history-length interpolation and classes in language modeling. HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1.
This paper addresses the use of novel class-based language models on parallel corpora, focusing specifically on English and Chinese. We find that the perplexity of Chinese is generally much higher than that of English and discuss possible reasons. We demonstrate the relative effectiveness of class-based models over the modified Kneser-Ney trigram model for our task. We also introduce a rare-events clustering method and a polynomial discounting mechanism, which are shown to improve results. Our experimental results on parallel corpora indicate that the improvement due to classes is similar for English and Chinese, suggesting that class-based language models should be used for both languages.
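A class-based n-gram model of the kind discussed in the abstract factors the word probability through word classes, e.g. for a bigram: P(w_i | w_{i-1}) ≈ P(c(w_i) | c(w_{i-1})) · P(w_i | c(w_i)). The following is a minimal sketch of such a model and its perplexity computation; the toy corpus and the hand-assigned word-to-class map are purely illustrative assumptions, not data or classes from the paper (which learns classes automatically and uses smoothed trigrams).

```python
import math
from collections import Counter

# Hypothetical toy corpus and hand-assigned word classes (illustrative only).
train = "the cat sat on the mat the dog sat on the rug".split()
word2class = {"the": "DET", "cat": "N", "dog": "N", "mat": "N",
              "rug": "N", "sat": "V", "on": "P"}

# Map the training sequence to classes and collect the counts the model needs:
# class-to-class transitions and class-to-word emissions.
class_seq = [word2class[w] for w in train]
class_bigrams = Counter(zip(class_seq, class_seq[1:]))   # (c_prev, c) counts
class_histories = Counter(class_seq[:-1])                # c_prev counts
emissions = Counter(zip(class_seq, train))               # (c, w) counts
class_totals = Counter(class_seq)                        # c counts

def prob(prev_w, w):
    """Class-based bigram: P(w | prev_w) = P(c(w) | c(prev_w)) * P(w | c(w))."""
    c_prev, c = word2class[prev_w], word2class[w]
    p_class = class_bigrams[(c_prev, c)] / class_histories[c_prev]
    p_word = emissions[(c, w)] / class_totals[c]
    return p_class * p_word

def perplexity(words):
    """Per-word perplexity over the bigrams of a test sequence."""
    logp = sum(math.log2(prob(a, b)) for a, b in zip(words, words[1:]))
    return 2 ** (-logp / (len(words) - 1))

print(perplexity("the dog sat on the mat".split()))
```

Because "dog" and "mat" share a class with words seen in similar positions, the model assigns them non-zero probability in contexts where a word-level bigram model would need heavy smoothing; this sharing of statistics across class members is the effect the paper measures on English and Chinese.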