The supremacy of n-gram models in statistical language modelling has recently been challenged by parametric models that use distributed representations to counteract the difficulties caused by data sparsity. We propose three new probabilistic language models that define the distribution of the next word in a sequence, given several preceding words, by using distributed representations of those words. We show how real-valued distributed representations for words can be learned at the same time as learning a large set of stochastic binary hidden features that are used to predict the distributed representation of the next word from the previous distributed representations. Adding connections from the previous states of the binary hidden features improves performance, as does adding direct connections between the real-valued distributed representations. One of our models significantly outperforms the very best n-gram models.
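To make the basic mechanism concrete, the sketch below shows one way the "direct connections between the real-valued distributed representations" can be used to define the next-word distribution: the context words' representations are linearly combined into a predicted representation, and each candidate word is scored by the dot product with its own representation. This is a minimal illustration only; the vocabulary size, dimensionalities, and variable names are assumptions, and the stochastic binary hidden features and their temporal connections are omitted.

```python
import numpy as np

# Illustrative sizes (not taken from the paper's experiments).
vocab_size = 1000      # number of words in the vocabulary
embed_dim = 100        # dimensionality of the real-valued word representations
context_size = 2       # number of preceding words used as context

rng = np.random.default_rng(0)

# R[w] is the real-valued distributed representation of word w.
R = 0.01 * rng.standard_normal((vocab_size, embed_dim))
# C[i] maps the representation of the i-th context word to a prediction of
# the next word's representation (the "direct connections" between
# real-valued representations).
C = 0.01 * rng.standard_normal((context_size, embed_dim, embed_dim))
b = np.zeros(vocab_size)   # per-word bias

def next_word_distribution(context):
    """P(next word | context) for a log-bilinear-style model.

    context: sequence of word indices for the preceding words.
    The predicted representation of the next word is a learned linear
    function of the context words' representations; each candidate word
    is scored by the dot product with its own representation.
    """
    r_hat = sum(C[i] @ R[w] for i, w in enumerate(context))
    scores = R @ r_hat + b
    scores -= scores.max()          # numerical stability before exponentiating
    p = np.exp(scores)
    return p / p.sum()

# Example: distribution over the next word given the two-word context (3, 17).
probs = next_word_distribution([3, 17])
print(probs.shape, probs.sum())     # (1000,) 1.0
```

In a full model the parameters R, C, and b would be trained on text, for example by gradient ascent on the log-likelihood of the next word, with the word representations and the prediction weights learned jointly as the abstract describes.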