The paper investigates the applicability of value-function-based reinforcement learning (RL) methods in cases where the environment may change over time. First, theorems are presented showing that the optimal value function of a discounted Markov decision process (MDP) depends Lipschitz continuously on the immediate-cost function and on the transition-probability function. The dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε,δ)-MDPs is introduced as a generalization of MDPs and ε-MDPs: in this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must remain bounded in the limit. Learning algorithms in such changing environments are then analyzed, and a general relaxed convergence theorem for stochastic iterative algorithms is presented. The results are also demonstrated on three classical RL methods: asynchronous value iteration, Q-learning, and temporal difference learning. Finally, numerical experiments concerning changing environments are presented.
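To illustrate what a Lipschitz-continuity result of this kind looks like, consider the following simulation-lemma-style bound; the exact constants below are illustrative and need not match the paper's. For two discounted MDPs sharing state and action spaces, with immediate-cost functions c_1, c_2 and transition kernels p_1, p_2, and with ||c||_∞ a common bound on the costs,

\[
\|V_1^* - V_2^*\|_\infty \;\le\; \frac{\|c_1 - c_2\|_\infty}{1-\gamma}
\;+\; \frac{\gamma\,\|c\|_\infty}{(1-\gamma)^2}\,
\max_{s,a}\,\bigl\|p_1(\cdot \mid s,a) - p_2(\cdot \mid s,a)\bigr\|_1 .
\]

The non-Lipschitz dependence on the discount factor is already visible in the simplest case: with a constant cost c, the optimal value is V* = c/(1-γ), whose derivative c/(1-γ)^2 grows without bound as γ → 1, so no single Lipschitz constant can cover the whole interval [0, 1).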
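To make the learning setting concrete, here is a minimal tabular Q-learning sketch in an environment whose transition kernel drifts within a shrinking bound, a special case of "changes bounded in the limit". The 5-state MDP, the drift schedule, and all constants are illustrative assumptions, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 2, 0.95

# Fixed base MDP; the actual environment drifts around it over time.
P_base = rng.dirichlet(np.ones(nS), size=(nS, nA))  # kernel, shape (nS, nA, nS)
C_base = rng.uniform(0.0, 1.0, size=(nS, nA))       # immediate costs

def drifted_kernel(eps):
    # Perturb each row of the base kernel, then renormalize; the L1
    # distance from the base stays of order eps per state-action pair.
    noise = rng.uniform(0.0, eps / nS, size=P_base.shape)
    P = P_base + noise
    return P / P.sum(axis=-1, keepdims=True)

Q = np.zeros((nS, nA))
s = 0
for t in range(1, 50_001):
    eps_t = 0.2 / np.sqrt(t)              # drift vanishes in the limit (a special case)
    P = drifted_kernel(eps_t)
    # epsilon-greedy action selection; costs are minimized, hence argmin
    a = rng.integers(nA) if rng.random() < 0.1 else int(Q[s].argmin())
    s_next = rng.choice(nS, p=P[s, a])
    alpha = 100.0 / (100.0 + t)           # Robbins-Monro step sizes
    Q[s, a] += alpha * (C_base[s, a] + gamma * Q[s_next].min() - Q[s, a])
    s = s_next

print("greedy policy:", Q.argmin(axis=1))

Roughly speaking, under relaxed convergence results of the kind the paper presents, such an iteration is expected to settle into a neighborhood of the optimal value function whose radius scales with the asymptotic bound on the environment's drift (and shrinks to zero when, as here, the drift itself vanishes).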