Selective Recurrent Neural Network
Neural Processing Letters
Recurrent neural networks (RNNs) unfolded in time can, in theory, map any open dynamical system. Nevertheless, they are often claimed to be unable to identify long-term dependencies in the data. In particular, when they are trained with backpropagation through time (BPTT), it is asserted that RNNs unfolded in time fail to learn inter-temporal influences more than ten time steps apart. This paper refutes this frequently cited claim. We show that RNNs, and especially normalised recurrent neural networks (NRNNs), unfolded in time are in fact well capable of learning time lags of at least one hundred time steps. We further demonstrate that the vanishing gradient problem does not apply to these networks.
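The vanishing-gradient claim being refuted can be made concrete with a small numerical sketch. The snippet below is illustrative only (it does not implement the paper's NRNN): in a vanilla RNN with state update h_t = tanh(W h_{t-1}), the gradient of the state at time T with respect to the state at time 0 is a product of T per-step Jacobians, and with a contractive recurrent weight matrix its norm shrinks roughly geometrically in T. The network size and weight scaling below are arbitrary choices for the demonstration.

```python
import numpy as np

# Illustrative sketch of the classic vanishing-gradient effect in a
# vanilla RNN unfolded in time (NOT the paper's normalised RNN).
rng = np.random.default_rng(0)
n = 8  # hypothetical state dimension, chosen for illustration
# Small recurrent weight matrix (spectral norm well below 1).
W = 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)

def jacobian_product_norm(T):
    """Norm of d h_T / d h_0 for h_t = tanh(W h_{t-1}), along one trajectory."""
    h = rng.standard_normal(n)  # random initial state
    J = np.eye(n)               # accumulated Jacobian d h_t / d h_0
    for _ in range(T):
        h = np.tanh(W @ h)
        # Jacobian of one step: diag(tanh') @ W, with tanh' = 1 - tanh^2.
        J = (np.diag(1.0 - h**2) @ W) @ J
    return np.linalg.norm(J)

for T in (1, 10, 100):
    print(f"T = {T:3d}: ||d h_T / d h_0|| = {jacobian_product_norm(T):.3e}")
```

Running this shows the Jacobian norm collapsing by many orders of magnitude between ten and one hundred unfolded steps, which is exactly the regime in which the paper reports that NRNNs still learn successfully.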