Exploration remains one of the crucial problems in reinforcement learning, especially for agents acting in safety-critical situations. We propose a new directed exploration method based on a notion of state controllability. Intuitively, an agent that wants to stay safe should seek out states where the effects of its actions are easier to predict; we call such states more controllable. Our main contribution is a new measure of controllability, computed directly from temporal-difference errors. Unlike existing approaches of this type, our method scales linearly with the number of state features and is directly applicable to function approximation. In the policy evaluation setting, the method converges to the correct values. We also demonstrate significantly faster learning when this exploration strategy is used in large control problems.