Learning in non-stationary MDPs as transfer learning

Authors:
M.M. Hassan Mahmud;Subramanian Ramamoorthy
Affiliations:
University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom
Venue:
Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems
Year:
2013

Citing 4
Cited 0

Markov Decision Processes: Discrete Stochastic Dynamic Programming

Markov Decision Processes: Discrete Stochastic Dynamic Programming
Introduction to Reinforcement Learning

Introduction to Reinforcement Learning
Probabilistic policy reuse in a reinforcement learning agent

AAMAS '06 Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems
Transfer Learning for Reinforcement Learning Domains: A Survey

The Journal of Machine Learning Research

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper we introduce the MDP-A model for addressing a particular sub-class of non-stationary environments where the learner is required to interact with other agents. The behavior-policies of the agents are determined by a latent variable that changes rarely, but can modify the agent policies drastically when it does change (like traffic conditions in a driving problem). This unpredictable change in the latent variable results in non-stationarity. We frame this problem as transfer learning in a particular subclass of MDPs, which we call MDPs-with-agents (MDP-A), where each task/MDP requires the learner to learn to interact with opponent agents with fixed policies. Across the tasks, the state and action space remains the same (and is known) but the agent-policies change. We transfer information from previous tasks to quickly infer the combined agent behavior policy in a new task after some limited initial exploration, and hence rapidly learn an optimal/near-optimal policy. We propose a transfer algorithm which given a collection of source behavior policies, eliminates the policies that do not apply in the new task in time polynomial in the relevant parameters using novel a statistical test. We also perform experiments in three interesting domains and show that our algorithm significantly outperforms relevant alternative algorithms.