Exploration remains one of the crucial problems in reinforcement learning, especially for agents acting in safety-critical situations. We propose a new directed exploration method based on a notion of state controllability. Intuitively, an agent that wants to stay safe should seek out states where the effects of its actions are easier to predict; we call such states more controllable. Our main contribution is a new measure of controllability, computed directly from temporal-difference errors. Unlike existing approaches of this type, our method scales linearly with the number of state features and is directly applicable to function approximation. In the policy evaluation setting, the method converges to the correct values. We also demonstrate significantly faster learning when this exploration strategy is used in large control problems.