The paper investigates the applicability of value-function-based reinforcement learning (RL) methods in cases where the environment may change over time. First, theorems are presented showing that the optimal value function of a discounted Markov decision process (MDP) depends Lipschitz continuously on the immediate-cost function and on the transition-probability function. The dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε,δ)-MDPs is introduced as a generalization of MDPs and ε-MDPs: in this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must remain bounded in the limit. Learning algorithms in such changing environments are then analyzed, and a general relaxed convergence theorem for stochastic iterative algorithms is presented. The results are also demonstrated on three classical RL methods: asynchronous value iteration, Q-learning, and temporal difference learning. Finally, numerical experiments concerning changing environments are presented.
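To illustrate what a Lipschitz-continuity result of this kind looks like, consider the following simulation-lemma-style bound; the exact constants below are illustrative and need not match the paper's. For two discounted MDPs sharing state and action spaces, with immediate-cost functions c_1, c_2 and transition kernels p_1, p_2, and with ||c||_∞ a common bound on the costs,

\[
\|V_1^* - V_2^*\|_\infty \;\le\; \frac{\|c_1 - c_2\|_\infty}{1-\gamma}
\;+\; \frac{\gamma\,\|c\|_\infty}{(1-\gamma)^2}\,
\max_{s,a}\,\bigl\|p_1(\cdot \mid s,a) - p_2(\cdot \mid s,a)\bigr\|_1 .
\]

The non-Lipschitz dependence on the discount factor is already visible in the simplest case: with a constant cost c, the optimal value is V* = c/(1-γ), whose derivative c/(1-γ)^2 grows without bound as γ → 1, so no single Lipschitz constant can cover the whole interval [0, 1).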
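To make the learning setting concrete, here is a minimal tabular Q-learning sketch in an environment whose transition kernel drifts within a shrinking bound, a special case of "changes bounded in the limit". The 5-state MDP, the drift schedule, and all constants are illustrative assumptions, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 2, 0.95

# Fixed base MDP; the actual environment drifts around it over time.
P_base = rng.dirichlet(np.ones(nS), size=(nS, nA))  # kernel, shape (nS, nA, nS)
C_base = rng.uniform(0.0, 1.0, size=(nS, nA))       # immediate costs

def drifted_kernel(eps):
    # Perturb each row of the base kernel, then renormalize; the L1
    # distance from the base stays of order eps per state-action pair.
    noise = rng.uniform(0.0, eps / nS, size=P_base.shape)
    P = P_base + noise
    return P / P.sum(axis=-1, keepdims=True)

Q = np.zeros((nS, nA))
s = 0
for t in range(1, 50_001):
    eps_t = 0.2 / np.sqrt(t)              # drift vanishes in the limit (a special case)
    P = drifted_kernel(eps_t)
    # epsilon-greedy action selection; costs are minimized, hence argmin
    a = rng.integers(nA) if rng.random() < 0.1 else int(Q[s].argmin())
    s_next = rng.choice(nS, p=P[s, a])
    alpha = 100.0 / (100.0 + t)           # Robbins-Monro step sizes
    Q[s, a] += alpha * (C_base[s, a] + gamma * Q[s_next].min() - Q[s, a])
    s = s_next

print("greedy policy:", Q.argmin(axis=1))

Roughly speaking, under relaxed convergence results of the kind the paper presents, such an iteration is expected to settle into a neighborhood of the optimal value function whose radius scales with the asymptotic bound on the environment's drift (and shrinks to zero when, as here, the drift itself vanishes).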