The supremacy of n-gram models in statistical language modelling has recently been challenged by parametric models that use distributed representations to counteract the difficulties caused by data sparsity. We propose three new probabilistic language models that define the distribution of the next word in a sequence, given several preceding words, by using distributed representations of those words. We show how real-valued distributed representations for words can be learned at the same time as learning a large set of stochastic binary hidden features that are used to predict the distributed representation of the next word from the previous distributed representations. Adding connections from the previous states of the binary hidden features improves performance, as does adding direct connections between the real-valued distributed representations. One of our models significantly outperforms the very best n-gram models.
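To make the basic mechanism concrete, the sketch below shows one way the "direct connections between the real-valued distributed representations" can be used to define the next-word distribution: the context words' representations are linearly combined into a predicted representation, and each candidate word is scored by the dot product with its own representation. This is a minimal illustration only; the vocabulary size, dimensionalities, and variable names are assumptions, and the stochastic binary hidden features and their temporal connections are omitted.

```python
import numpy as np

# Illustrative sizes (not taken from the paper's experiments).
vocab_size = 1000      # number of words in the vocabulary
embed_dim = 100        # dimensionality of the real-valued word representations
context_size = 2       # number of preceding words used as context

rng = np.random.default_rng(0)

# R[w] is the real-valued distributed representation of word w.
R = 0.01 * rng.standard_normal((vocab_size, embed_dim))
# C[i] maps the representation of the i-th context word to a prediction of
# the next word's representation (the "direct connections" between
# real-valued representations).
C = 0.01 * rng.standard_normal((context_size, embed_dim, embed_dim))
b = np.zeros(vocab_size)   # per-word bias

def next_word_distribution(context):
    """P(next word | context) for a log-bilinear-style model.

    context: sequence of word indices for the preceding words.
    The predicted representation of the next word is a learned linear
    function of the context words' representations; each candidate word
    is scored by the dot product with its own representation.
    """
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context))
    scores = R @ r_hat + b
    scores -= scores.max()          # numerical stability before exponentiating
    p = np.exp(scores)
    return p / p.sum()

# Example: distribution over the next word given the two-word context (3, 17).
probs = next_word_distribution([3, 17])
print(probs.shape, probs.sum())     # (1000,) 1.0
```

In a full model the parameters R, C, and b would be trained on text, for example by gradient ascent on the log-likelihood of the next word, with the word representations and the prediction weights learned jointly as the abstract describes.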