Self-Optimizing and Pareto-Optimal Policies in General Environments Based on Bayes-Mixtures

Authors:
Marcus Hutter
Affiliations:
-
Venue:
COLT '02 Proceedings of the 15th Annual Conference on Computational Learning Theory
Year:
2002

Abstract

The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle t action y_t results in perception x_t and reward r_t, where all quantities in general may depend on the complete history. The perception x_t and reward r_t are sampled from the (reactive) environmental probability distribution μ. This very general setting includes, but is not limited to, (partial observable, k-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if μ is known. Reinforcement learning is usually used if μ is unknown. In the Bayesian approach one defines a mixture distribution ξ as a weighted sum of distributions ν∈M, where M is any class of distributions including the true environment μ. We show that the Bayes-optimal policy p^ξ based on the mixture ξ is self-optimizing in the sense that the average value converges asymptotically for all μ∈M to the optimal value achieved by the (infeasible) Bayes-optimal policy p^μ which knows μ in advance. We show that the necessary condition that M admits self-optimizing policies at all, is also sufficient. No other structural assumptions are made on M. As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that p^ξ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments ν∈M and a strictly higher value in at least one.
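
The mixture construction described in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the class M is a hypothetical pair of two-armed Bernoulli bandits, the prior weights are uniform, and the action rule is a one-step greedy simplification rather than the full Bayes-optimal policy, which maximizes total expected reward over the whole horizon.

```python
# Hypothetical environment class M (an assumption, not from the paper):
# each environment is a two-armed Bernoulli bandit given as
# (P(reward=1 | action 0), P(reward=1 | action 1)).
M = {"nu_a": (0.9, 0.1), "nu_b": (0.1, 0.9)}

# Prior weights w_nu; assumed uniform here. They must sum to 1.
prior = {"nu_a": 0.5, "nu_b": 0.5}


def likelihood(env, history):
    """Probability that environment `env` assigns to the observed rewards,
    given the actions taken. `history` is a list of (action, reward) pairs."""
    p = 1.0
    for action, reward in history:
        p_one = env[action]
        p *= p_one if reward == 1 else 1.0 - p_one
    return p


def mixture(history):
    """Mixture probability: sum over nu in M of w_nu * nu(history)."""
    return sum(prior[name] * likelihood(M[name], history) for name in M)


def posterior(history):
    """Posterior weight of each environment after observing `history`."""
    xi = mixture(history)
    return {name: prior[name] * likelihood(M[name], history) / xi for name in M}


def greedy_action(history):
    """Action with the highest expected next reward under the mixture.
    This horizon-1 rule is a simplification, not the full policy."""
    post = posterior(history)
    expected = [sum(post[name] * M[name][a] for name in M) for a in (0, 1)]
    return max((0, 1), key=lambda a: expected[a])
```

For example, after two rewarded pulls of action 0, `posterior([(0, 1), (0, 1)])` concentrates almost all weight on `nu_a`, and `greedy_action` keeps choosing action 0. In this toy class the posterior shifts toward whichever environment better explains the observed history, which is the mechanism behind the mixture-based policy adapting to the true environment.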