Using fast weights to improve persistent contrastive divergence

  • Authors:
  • Tijmen Tieleman; Geoffrey Hinton

  • Affiliations:
  • University of Toronto, Toronto, Ontario, Canada; University of Toronto, Toronto, Ontario, Canada

  • Venue:
  • ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
  • Year:
  • 2009

Abstract

The most commonly used learning algorithm for restricted Boltzmann machines is contrastive divergence, which starts a Markov chain at a data point and runs the chain for only a few iterations to get a cheap, low-variance estimate of the sufficient statistics under the model. Tieleman (2008) showed that better learning can be achieved by estimating the model's statistics using a small set of persistent "fantasy particles" that are not reinitialized to data points after each weight update. With sufficiently small weight updates, the fantasy particles represent the equilibrium distribution accurately, but to explain why the method works with much larger weight updates it is necessary to consider the interaction between the weight updates and the Markov chain. We show that the weight updates force the Markov chain to mix fast, and using this insight we develop an even faster-mixing chain that uses an auxiliary set of "fast weights" to implement a temporary overlay on the energy landscape. The fast weights learn rapidly but also decay rapidly, and they do not contribute to the normal energy landscape that defines the model.
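
To make the idea concrete, the sketch below shows one update step of fast persistent contrastive divergence for a binary RBM in NumPy, as described at a high level in the abstract. It is a minimal illustration, not the authors' reference implementation: the layer sizes, learning rates, and decay factor are illustrative assumptions, and helper names such as fpcd_update are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes; real experiments would use task-specific values.
n_visible, n_hidden, n_particles = 784, 500, 100
W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # regular (slow) weights
W_fast = np.zeros_like(W)                               # fast-weight overlay
vb = np.zeros(n_visible)                                # visible biases
hb = np.zeros(n_hidden)                                 # hidden biases

# Persistent "fantasy particles": never reset to data points between updates.
fantasy_v = rng.integers(0, 2, size=(n_particles, n_visible)).astype(float)

def fpcd_update(v_data, lr=0.01, lr_fast=0.05, fast_decay=0.95):
    """One FPCD step: positive phase from a data minibatch, negative phase
    from the persistent chain sampled under the combined weights W + W_fast."""
    global W, W_fast, vb, hb, fantasy_v

    # Positive phase: hidden probabilities given the data (model weights only).
    h_data = sigmoid(v_data @ W + hb)

    # Negative phase: advance the persistent chain one Gibbs step on the
    # temporarily overlaid energy landscape (regular + fast weights).
    W_eff = W + W_fast
    h_fant = sigmoid(fantasy_v @ W_eff + hb)
    h_sample = (rng.random(h_fant.shape) < h_fant).astype(float)
    v_prob = sigmoid(h_sample @ W_eff.T + vb)
    fantasy_v = (rng.random(v_prob.shape) < v_prob).astype(float)
    h_fant = sigmoid(fantasy_v @ W_eff + hb)

    # Approximate log-likelihood gradient: data statistics minus model
    # statistics estimated from the fantasy particles.
    grad_W = v_data.T @ h_data / len(v_data) - fantasy_v.T @ h_fant / n_particles

    # Regular weights take small updates and define the model's energy landscape.
    W += lr * grad_W
    vb += lr * (v_data.mean(0) - fantasy_v.mean(0))
    hb += lr * (h_data.mean(0) - h_fant.mean(0))

    # Fast weights learn with a larger rate but decay rapidly, so they only
    # form a temporary overlay that raises the energy around the modes the
    # fantasy particles currently occupy, forcing the chain to mix.
    W_fast = fast_decay * W_fast + lr_fast * grad_W
```

In use, fpcd_update would be called once per minibatch of training data; because the fast weights decay toward zero, they never become part of the learned model, only of the sampling dynamics.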