Recurrent neural networks can be used to map input sequences to output sequences, for example in recognition, production, or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on to information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.
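The core difficulty can be illustrated numerically. In a simple RNN with update h_t = tanh(W h_{t-1}), the gradient of h_T with respect to h_0 is a product of T per-step Jacobians; when each Jacobian's norm is below 1, that product shrinks exponentially in T, so long-range dependencies contribute vanishingly small gradients. The following NumPy sketch (an illustration under assumed parameters, not code from the paper; the weight scale is chosen so the recurrence is contractive) tracks an upper bound on the gradient norm over time:

```python
import numpy as np

# Minimal sketch of the vanishing-gradient effect in a simple RNN.
# Update: h_t = tanh(W h_{t-1}); per-step Jacobian: diag(1 - h_t^2) @ W.
# d h_T / d h_0 is the product of these Jacobians, so its norm is
# bounded by the product of the per-step spectral norms computed below.

rng = np.random.default_rng(0)
n = 20
# Small weight scale (assumption for illustration) keeps ||W||_2 < 1.
W = rng.normal(scale=0.25 / np.sqrt(n), size=(n, n))

h = rng.normal(size=n)
bound = 1.0        # running product of per-step Jacobian norms
norms = []
for t in range(50):
    h = np.tanh(W @ h)
    J = np.diag(1.0 - h**2) @ W          # Jacobian of this step
    bound *= np.linalg.norm(J, 2)        # spectral norm
    norms.append(bound)

print(f"gradient-norm bound after 10 steps: {norms[9]:.2e}")
print(f"gradient-norm bound after 50 steps: {norms[49]:.2e}")
```

Since each per-step Jacobian here has spectral norm below 1, the bound decays roughly geometrically, which is exactly the regime in which gradient descent receives almost no signal about events far in the past.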