Combining expert advice in reactive environments

  • Authors:
  • Daniela Pucci de Farias; Nimrod Megiddo

  • Affiliations:
  • Massachusetts Institute of Technology, Cambridge, Massachusetts; IBM Almaden Research Center, San Jose, California

  • Venue:
  • Journal of the ACM (JACM)
  • Year:
  • 2006

Abstract

“Experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of states of the environment depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game.

A general exploration-exploitation experts method is presented along with a proper definition of value. The definition is shown to be adequate in that it both captures the impact of an expert's actions on the environment and is learnable. The new experts method is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player's actions on the environment. The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner's Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven.
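
To make the abstract's point concrete, the following is a minimal, hypothetical Python sketch of an explore-then-exploit experts scheme in a reactive environment, not the authors' algorithm. It assumes a repeated Prisoner's Dilemma against a Tit-for-Tat opponent, three hand-picked experts, and arbitrary phase lengths, all invented for illustration. Because each expert is actually followed for a long stretch, its estimated value reflects how it shapes the opponent's behavior, so a cooperation-inducing expert ends up preferred over always-defect, in line with the shift away from myopic regret minimization described above.

# Illustrative sketch only; environment, experts, and phase lengths are assumptions.

# Prisoner's Dilemma payoffs for the player: (my_move, opponent_move) -> reward.
# Moves: 'C' = cooperate, 'D' = defect.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}


class TitForTat:
    """A reactive environment: the opponent repeats the player's previous move."""

    def __init__(self):
        self.last_player_move = 'C'

    def respond(self):
        return self.last_player_move

    def update(self, player_move):
        self.last_player_move = player_move


# Experts are strategies mapping the opponent's last move to the player's next move.
EXPERTS = {
    'always_defect': lambda opp_last: 'D',
    'always_cooperate': lambda opp_last: 'C',
    'tit_for_tat': lambda opp_last: opp_last,
}


def follow_expert(expert, env, rounds):
    """Follow one expert for `rounds` steps and return its average reward.

    Following the expert for a long stretch lets its influence on the reactive
    environment play out, so the estimate captures long-run value rather than a
    myopic, counterfactual comparison against a fixed state sequence.
    """
    total, opp_last = 0, 'C'
    for _ in range(rounds):
        opp_move = env.respond()
        my_move = expert(opp_last)
        total += PAYOFF[(my_move, opp_move)]
        env.update(my_move)
        opp_last = opp_move
    return total / rounds


def explore_then_exploit(exploration_rounds=200, exploitation_rounds=2000):
    # Exploration: estimate each expert's long-run value by following it.
    estimates = {name: follow_expert(expert, TitForTat(), exploration_rounds)
                 for name, expert in EXPERTS.items()}

    # Exploitation: commit to the expert with the best estimated value.
    best = max(estimates, key=estimates.get)
    reward = follow_expert(EXPERTS[best], TitForTat(), exploitation_rounds)
    return best, estimates, reward


if __name__ == '__main__':
    best, estimates, reward = explore_then_exploit()
    print('estimated values:', estimates)
    print('chosen expert:', best, 'long-run average reward:', reward)

In this toy run, always-defect earns roughly 1 per round once Tit-for-Tat retaliates, while the cooperating experts earn about 3, so the exploit phase settles on cooperation; a regret-minimizing algorithm that scores actions against the observed state sequence would instead favor defection.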