The empirical Bayes envelope and regret minimization in competitive Markov decision processes

Authors:
Shie Mannor; Nahum Shimkin
Affiliations:
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts; Department of Electrical Engineering, Technion, Israel Institute of Technology, Haifa 32000, Israel
Venue:
Mathematics of Operations Research
Year:
2003

Abstract

This paper proposes an extension of the regret-minimizing framework from repeated matrix games to stochastic game models, under appropriate recurrence conditions. A decision maker, P1, who wishes to maximize his long-term average reward faces a Markovian environment that may also be affected by the arbitrary actions of other agents. The latter are collectively modeled as a second player, P2, whose strategy is arbitrary. Both states and actions are fully observed by both players. While P1 can always secure the min-max value of the game, he may wish to improve on it when the opponent is not playing a worst-case strategy. For repeated matrix games, an achievable goal is provided by the Bayes envelope, which traces P1's best-response payoff against the observed frequencies of P2's actions. We propose a generalization to the stochastic game framework, under recurrence conditions that amount to fixed-state reachability. The empirical Bayes envelope (EBE) is defined as P1's best-response payoff against the stationary strategies of P2 that agree with the observed state-action frequencies. Because the EBE may not be attainable in general, we consider its lower convex hull, the convex Bayes envelope (CBE), which is proved to be achievable by P1. The analysis relies on Blackwell's approachability theory. The CBE is bounded below by the value of the game and, for irreducible games, is strictly above the value whenever P2's frequencies deviate from a worst-case strategy. In the special case of single-controller games, where P2 alone affects the state transitions, the EBE itself is shown to be attainable.
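
For the repeated matrix game case mentioned in the abstract, the Bayes envelope can be illustrated with a small sketch. The example below is not taken from the paper: the reward matrix (matching pennies, whose min-max value is 0), the observed action sequence, and the function names `empirical_frequencies` and `bayes_envelope` are all assumptions chosen for illustration. It computes the empirical frequencies of P2's actions and P1's best-response payoff against them.

```python
from collections import Counter


def empirical_frequencies(actions, num_actions):
    """Relative frequency of each P2 action in the observed sequence."""
    counts = Counter(actions)
    total = len(actions)
    return [counts[b] / total for b in range(num_actions)]


def bayes_envelope(reward, q):
    """P1's best-response payoff against P2's mixed action q.

    reward[a][b] is P1's reward when P1 plays a and P2 plays b.
    """
    return max(sum(r * p for r, p in zip(row, q)) for row in reward)


# Matching pennies from P1's point of view (illustrative assumption).
REWARD = [[1.0, -1.0],
          [-1.0, 1.0]]

# Hypothetical sequence of P2's observed actions (six 0s, two 1s).
observed = [0, 0, 0, 1, 0, 0, 1, 0]
q = empirical_frequencies(observed, num_actions=2)

print("empirical frequencies:", q)
print("Bayes envelope at q:", bayes_envelope(REWARD, q))
print("Bayes envelope at uniform q:", bayes_envelope(REWARD, [0.5, 0.5]))
```

In this toy game the envelope equals the value 0 only at the uniform frequencies, which are P2's worst-case mix, and exceeds it once P2's frequencies are skewed. This mirrors, in the simplest setting, the gap above the value that the abstract describes.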