A scalable distributed syntactic, semantic, and lexical language model

Authors:
Ming Tan;Wenli Zhou;Lei Zheng;Shaojun Wang
Affiliations:
Wright State University;Wright State University;Wright State University;Wright State University
Venue:
Computational Linguistics
Year:
2012

Abstract

This paper presents an attempt at building a large-scale distributed composite language model formed by seamlessly integrating an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm. The model simultaneously accounts for local word lexical information, mid-range sentence syntactic structure, and long-span document semantic content. It is trained with a convergent N-best list approximate EM algorithm, followed by a follow-up EM algorithm to improve word prediction power, on corpora of up to a billion tokens stored on a supercomputer. The large-scale distributed composite language model gives drastic perplexity reductions over n-grams and, when applied to re-ranking the N-best list from a state-of-the-art parsing-based machine translation system, achieves significantly better translation quality as measured by the BLEU score and the "readability" of translations.
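
The two evaluation settings named in the abstract, perplexity and N-best list re-ranking, can be illustrated with a minimal sketch. This is not the authors' composite model: a toy unigram table (`TOY_UNIGRAM`) stands in for the language model, and the function names, the interpolation `weight`, and the example hypotheses and scores are all assumptions made up for illustration.

```python
import math

# Hypothetical toy unigram probabilities standing in for a real language model.
TOY_UNIGRAM = {"the": 0.5, "cat": 0.25, "sat": 0.25}
UNSEEN_PROB = 1e-6  # assumed floor probability for out-of-vocabulary words


def toy_lm_logprob(sentence):
    """Natural-log probability of a sentence under the toy unigram model."""
    return sum(math.log(TOY_UNIGRAM.get(w, UNSEEN_PROB)) for w in sentence.split())


def perplexity(token_logprobs):
    """Perplexity from a list of per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rerank(nbest, lm_logprob, weight=0.5):
    """Re-rank (hypothesis, baseline_score) pairs.

    Each hypothesis is scored by its baseline score plus a weighted
    language-model log-probability; higher combined scores rank first.
    """
    return sorted(
        nbest,
        key=lambda pair: pair[1] + weight * lm_logprob(pair[0]),
        reverse=True,
    )


# Illustrative N-best list: the baseline system slightly prefers the
# hypothesis containing an out-of-vocabulary word.
nbest = [("the cat zat", -1.0), ("the cat sat", -1.2)]
reranked = rerank(nbest, toy_lm_logprob)
```

In this toy setup the language-model term penalizes the unseen word heavily, so the re-ranked list puts "the cat sat" first even though its baseline score is lower. In practice the interpolation weight would be tuned on held-out data rather than fixed by hand.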