Reinforcement learning in POMDPs without resets

  • Authors:
  • Eyal Even-Dar; Sham M. Kakade; Yishay Mansour

  • Affiliations:
  • School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel; Computer and Information Science, University of Pennsylvania, Philadelphia, PA; School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel

  • Venue:
  • IJCAI'05: Proceedings of the 19th International Joint Conference on Artificial Intelligence
  • Year:
  • 2005

Abstract

We consider the most realistic reinforcement learning setting in which an agent starts in an unknown environment (the POMDP) and must follow one continuous and uninterrupted chain of experience with no access to "resets" or "offline" simulation. We provide algorithms for general connected POMDPs that obtain near optimal average reward. One algorithm we present has a convergence rate which depends exponentially on a certain horizon time of an optimal policy, but has no dependence on the number of (unobservable) states. The main building block of our algorithms is an implementation of an approximate reset strategy, which we show always exists in every POMDP. An interesting aspect of our algorithms is how they use this strategy when balancing exploration and exploitation.
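The abstract only sketches the algorithmic idea: interleave an approximate-reset ("homing") strategy with exploration and exploitation phases, so that the belief state is approximately re-centered between phases even though no true reset is available. Below is a minimal, heavily simplified Python sketch of that control flow under assumed names (`ToyPOMDP`, `homing_policy`, `run_agent` are all illustrative, not from the paper), and with a crude memoryless reward estimate standing in for the paper's horizon-based policies; it is not the authors' algorithm.

```python
import random

# Toy two-state POMDP, used only to make the sketch runnable; the paper's
# setting is a general connected POMDP, not this specific environment.
class ToyPOMDP:
    actions = ["left", "right"]

    def __init__(self):
        self.state = random.choice([0, 1])

    def step(self, action):
        # Reward 1 for choosing "right" while in state 1.
        reward = 1.0 if (self.state == 1 and action == "right") else 0.0
        # "left" tends to drive the chain toward state 0, "right" toward state 1.
        target = 0 if action == "left" else 1
        if random.random() < 0.9:
            self.state = target
        # Observation is the state, flipped with probability 0.2 (partial observability).
        obs = self.state if random.random() < 0.8 else 1 - self.state
        return obs, reward


def homing_policy(_obs):
    # Assumed approximate-reset strategy: repeatedly taking "left" drives the
    # belief toward a fixed distribution concentrated on state 0.
    return "left"


def run_agent(env, num_phases=200, homing_steps=5, phase_len=20):
    """Alternate exploration and exploitation phases, invoking the homing
    policy between phases to approximately reset the belief state."""
    counts = {a: [0.0, 0] for a in env.actions}  # crude running reward estimates
    total, steps = 0.0, 0
    for phase in range(num_phases):
        # Approximate reset between phases.
        for _ in range(homing_steps):
            _, r = env.step(homing_policy(None))
            total += r; steps += 1
        explore = (phase % 2 == 0)
        for _ in range(phase_len):
            if explore:
                a = random.choice(env.actions)          # exploration phase
            else:
                a = max(env.actions,                     # exploitation phase
                        key=lambda a: counts[a][0] / max(counts[a][1], 1))
            _, r = env.step(a)
            counts[a][0] += r; counts[a][1] += 1
            total += r; steps += 1
    return total / steps


if __name__ == "__main__":
    print("average reward:", run_agent(ToyPOMDP()))
```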