Toward a unified approach to statistical language modeling for Chinese

  • Authors:
  • Jianfeng Gao, Joshua Goodman, Mingjing Li, Kai-Fu Lee

  • Affiliations:
  • Microsoft Research (Asia), Beijing, China; Microsoft Research (Redmond), Washington; Microsoft Research (Asia), Beijing, China; Microsoft Research (Asia), Beijing, China

  • Venue:
  • ACM Transactions on Asian Language Information Processing (TALIP)
  • Year:
  • 2002

Abstract

This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques such as trigram language models to Chinese is challenging because (1) there is no standard definition of a word in Chinese; (2) word boundaries are not marked by spaces; and (3) training data are scarce. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by applying the maximum likelihood principle, which is consistent with trigram model training. We show that each of these methods improves over standard SLM, and that the combined method yields the best pinyin conversion result reported.
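To make the maximum-likelihood theme concrete, the sketch below shows how a segmenter might pick the most probable word sequence for unspaced Chinese text. It is an illustrative simplification, not the paper's method: it uses a toy hand-made lexicon with assumed unigram probabilities and a Viterbi-style dynamic program, whereas the paper learns its lexicon from Web data and keeps segmentation consistent with trigram training.

```python
import math

# Toy lexicon with assumed unigram probabilities (hypothetical values,
# not from the paper; the paper's lexicon is built from Web data).
LEXICON = {
    "北京": 0.02, "大学": 0.015, "北京大学": 0.01,
    "学生": 0.02, "生": 0.005, "大": 0.01,
    "学": 0.01, "北": 0.001, "京": 0.001,
}

MAX_WORD_LEN = 8  # longest candidate word, in characters


def segment(text):
    """Return the maximum-likelihood segmentation of `text` under a
    unigram model, via a Viterbi-style dynamic program."""
    n = len(text)
    # best[i] = (log-probability, word list) of the best parse of text[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            word = text[j:i]
            if word in LEXICON and best[j][0] > -math.inf:
                score = best[j][0] + math.log(LEXICON[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]


# "北京大学生" parses as 北京大学 + 生 because that path has the
# highest product of word probabilities in this toy lexicon.
print(segment("北京大学生"))
```

Note that maximum likelihood naturally arbitrates between competing segmentations (e.g. 北京大学/生 versus 北京/大/学生) using the same criterion as language-model training, which is the consistency the abstract emphasizes.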