Addressing the policy-bias of Q-learning by repeating updates

  • Authors:
  • Sherief Abdallah; Michael Kaisers

  • Affiliations:
  • British University in Dubai, Dubai, UAE & University of Edinburgh, United Kingdom; Maastricht University, Maastricht, Netherlands

  • Venue:
  • Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2013)
  • Year:
  • 2013

Abstract

Q-learning is a very popular reinforcement learning algorithm that has been proven to converge to optimal policies in Markov decision processes. However, Q-learning exhibits artifacts when the optimal action is played with low probability, a situation that may arise from a poor initialization of Q-values or from convergence to an almost pure policy after which a change in the environment makes another action optimal. These artifacts were resolved in the literature by the variant Frequency Adjusted Q-learning (FAQL). However, FAQL suffers from practical concerns that limit the policy subspace over which the behavior is improved. Here, we introduce Repeated Update Q-learning (RUQL), a variant of Q-learning that resolves the undesirable artifacts of Q-learning without the practical concerns of FAQL. We show, both theoretically and experimentally, the similarities and differences between RUQL and FAQL (the closest state of the art). Experimental results verify the theoretical insights and show that RUQL outperforms both FAQL and Q-learning in non-stationary environments.
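
The abstract does not spell out the update rule, but the title points at the core idea: repeat the standard Q-learning update for the chosen action in inverse proportion to the probability with which that action was selected, so that rarely played actions are not under-updated. The sketch below is a minimal illustration of that repeated-update idea, assuming a tabular Q-function and a repetition count of roughly 1/π(a|s); the function names and hyperparameters are illustrative, not the paper's exact formulation.

    import numpy as np

    def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.95):
        """One standard tabular Q-learning update (for contrast)."""
        target = r + gamma * np.max(q[s_next])
        q[s, a] += alpha * (target - q[s, a])

    def repeated_update(q, s, a, r, s_next, pi_sa, alpha=0.1, gamma=0.95):
        """Illustrative repeated-update variant (assumption: the update
        for the chosen action is repeated about 1/pi(a|s) times)."""
        target = r + gamma * np.max(q[s_next])
        k = 1.0 / max(pi_sa, 1e-6)  # more repetitions for unlikely actions
        # Applying "Q <- Q + alpha * (target - Q)" k times with a fixed
        # target has the closed form Q <- target - (1 - alpha)^k * (target - Q),
        # which avoids literally looping when pi(a|s) is tiny.
        q[s, a] = target - (1.0 - alpha) ** k * (target - q[s, a])

    # Toy usage: 3 states, 2 actions; action 1 was selected with probability 0.05.
    q = np.zeros((3, 2))
    repeated_update(q, s=0, a=1, r=1.0, s_next=2, pi_sa=0.05)

The closed form makes the effect on the policy bias explicit: for a fixed learning rate, the smaller the selection probability π(a|s), the further the update moves Q(s, a) toward the bootstrapped target.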