Exploiting syntactic, semantic and lexical regularities in language modeling via directed Markov random fields

Authors:
Shaojun Wang; Shaomin Wang; Russell Greiner; Dale Schuurmans; Li Cheng
Affiliations:
University of Alberta; Massachusetts Institute of Technology; University of Alberta; University of Alberta; University of Alberta
Venue:
ICML '05 Proceedings of the 22nd international conference on Machine learning
Year:
2005

Abstract

We present a directed Markov random field (MRF) model that combines n-gram models, probabilistic context-free grammars (PCFGs) and probabilistic latent semantic analysis (PLSA) for statistical language modeling. Although the composite directed MRF model potentially has an exponential number of loops and becomes a context-sensitive grammar, we are nevertheless able to estimate its parameters in cubic time using an efficient modified EM method, the generalized inside-outside algorithm, which extends the inside-outside algorithm to incorporate the effects of the n-gram and PLSA language models. We generalize various smoothing techniques to alleviate the sparseness of n-gram counts in cases where there are hidden variables. We also derive an analogous algorithm to calculate the probability of an initial subsequence of a sentence generated by the composite language model. Our experimental results on the Wall Street Journal corpus show significant reductions in perplexity compared to the state-of-the-art baseline trigram model with Good-Turing and Kneser-Ney smoothing.
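
For orientation, the cubic-time claim can be related to the standard inside algorithm for a plain PCFG, which the abstract says the generalized inside-outside algorithm extends. The sketch below is not the authors' generalized algorithm: it ignores the n-gram and PLSA components entirely, assumes a grammar in Chomsky normal form, and uses a hypothetical toy grammar and function name chosen only for illustration.

```python
from collections import defaultdict


def inside_probability(words, lexical, binary, start="S"):
    """Return P(words) under a PCFG in Chomsky normal form (standard inside pass).

    lexical: dict mapping (A, word) -> P(A -> word)
    binary:  dict mapping (A, B, C) -> P(A -> B C)
    All names here are illustrative assumptions, not taken from the paper.
    """
    n = len(words)
    # beta[i][j][A] = probability that nonterminal A derives words[i:j]
    beta = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]

    # Base case: spans of length one are produced by lexical rules.
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                beta[i][i + 1][A] += p

    # Recursive case: build each span from two adjacent sub-spans.
    # The loops over span length, start position and split point give
    # the cubic dependence on sentence length.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    left = beta[i][k].get(B, 0.0)
                    right = beta[k][j].get(C, 0.0)
                    if left and right:
                        beta[i][j][A] += p * left * right

    return beta[0][n].get(start, 0.0)


# Hypothetical toy grammar, for illustration only.
toy_lexical = {
    ("NP", "dogs"): 0.5,
    ("NP", "cats"): 0.5,
    ("V", "chase"): 1.0,
}
toy_binary = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 1.0,
}

sentence_probability = inside_probability(
    ["dogs", "chase", "cats"], toy_lexical, toy_binary
)
```

With this toy grammar the sentence has a single parse, so `sentence_probability` is the product of the rule probabilities used in it, 0.5 × 0.5 = 0.25. In the full EM procedure the inside pass is paired with an outside pass to obtain expected rule counts; that step is omitted here.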