Speeding up reinforcement learning using recurrent neural networks in non-Markovian environments

  • Authors:
  • Le Tien Dung; Takashi Komeda; Motoki Takagi

  • Affiliations:
  • Shibaura Institute of Technology, Minuma-ku, Saitama, Japan (all authors)

  • Venue:
  • ASC '07 Proceedings of The Eleventh IASTED International Conference on Artificial Intelligence and Soft Computing
  • Year:
  • 2007

Abstract

Reinforcement Learning (RL) has been widely used to solve problems with little feedback from the environment. Q-learning can solve Markov Decision Processes quite well. For Partially Observable Markov Decision Processes, a Recurrent Neural Network (RNN) can be used to approximate Q values. However, learning time for these problems is typically very long. In this paper, we present a method to speed up learning in non-Markovian environments by focusing on the necessary state-action pairs in learning episodes. Whenever the agent attains the goal, it checks the episode and relearns the necessary actions. We use a table storing the minimum number of appearances of each state over all successful episodes to remove unnecessary state-action pairs from a successful episode and form a min-episode. To verify this method, we performed two experiments: the E-maze problem with a Time-Delay Neural Network and the lighting grid world problem with a Long Short-Term Memory (LSTM) RNN. Experimental results show that the proposed method enables an agent to acquire a policy with better learning performance than the standard method.
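
As a rough illustration of the min-episode idea summarized above, the sketch below keeps a table of the minimum number of times each state has appeared in any successful episode, and uses it to trim a newly completed successful episode before it is replayed for relearning. The function names (`update_min_counts`, `form_min_episode`) and the choice to drop the earliest surplus visits of a state are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def update_min_counts(min_counts, episode):
    """Update the table of minimum state-appearance counts with a successful episode.

    min_counts: dict mapping state -> smallest number of times that state has
        appeared in any successful episode seen so far.
    episode: list of (state, action) pairs that reached the goal.
    """
    counts = Counter(state for state, _ in episode)
    for state, n in counts.items():
        min_counts[state] = min(min_counts.get(state, n), n)
    return min_counts

def form_min_episode(min_counts, episode):
    """Trim a successful episode down to a min-episode.

    For each state, keep at most as many visits as the minimum recorded in
    min_counts, dropping the earliest surplus visits (an assumption: the later
    visits are the ones that actually lead to the goal).
    """
    remaining = Counter(state for state, _ in episode)
    kept = []
    for state, action in episode:
        allowed = min_counts.get(state, remaining[state])
        if remaining[state] > allowed:
            # Surplus visit: treat this state-action pair as unnecessary.
            remaining[state] -= 1
            continue
        kept.append((state, action))
    return kept
```

In a training loop, the resulting min-episode would then be replayed through the recurrent Q-function (the TDNN or LSTM network) in place of the full successful episode, so that learning focuses on the state-action pairs that were actually needed to reach the goal.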