Adaptive Strategies and Regret Minimization in Arbitrarily Varying Markov Environments

Authors:
Shie Mannor;Nahum Shimkin
Affiliations:
-;-
Venue:
COLT '01/EuroCOLT '01 Proceedings of the 14th Annual Conference on Computational Learning Theory and and 5th European Conference on Computational Learning Theory
Year:
2001

Citing 5
Cited 0

Competitive Markov decision processes

Competitive Markov decision processes
A game of prediction with expert advice

Journal of Computer and System Sciences - Special issue on the eighth annual workshop on computational learning theory, July 5–8, 1995
Neuro-Dynamic Programming

Neuro-Dynamic Programming
Gambling in a rigged casino: The adversarial multi-armed bandit problem

FOCS '95 Proceedings of the 36th Annual Symposium on Foundations of Computer Science
Stochastic shortest path games: theory and algorithms

Stochastic shortest path games: theory and algorithms

Quantified Score

Hi-index	0.00

Visualization

Abstract

We consider the problem of maximizing the average reward in a controlled Markov environment, which also contains some arbitrarily varying elements. This problem is captured by a two-person stochastic game model involving the reward maximizing agent and a second player, which is free to use an arbitrary (non-stationary and unpredictable) control strategy. While the minimax value of the associated zero-sum game provides a guaranteed performance level, the fact that the second player's behavior is observed as the game unfolds opens up the opportunity to improve upon this minimax value if the second player is not playing a worst-case strategy. This basic idea has been formalized in the context of repeated matrix games by the classical notions of regret minimization with respect to the Bayes envelope, where an attainable performance goal is defined in terms of the empirical frequencies of the opponent's actions. This paper presents an extension of these ideas to problems with Markovian dynamics, under appropriate recurrence conditions. The Bayes envelope is first defined in a natural way in terms of the observed state action frequencies. As this envelope may not be attained in general, we define a proper convexification thereof as an attainable solution concept. In the specific case of single-controller games, where the opponent alone controls the state transitions, the Bayes envelope itself turns out to be convex and attainable. Some concrete examples are shown to fit in this framework.