Addressing the policy-bias of Q-learning by repeating updates

  • Authors:
  • Sherief Abdallah; Michael Kaisers

  • Affiliations:
  • British University in Dubai, Dubai, UAE & University of Edinburgh, United Kingdom; Maastricht University, Maastricht, Netherlands

  • Venue:
  • Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2013)
  • Year:
  • 2013

Abstract

Q-learning is a very popular reinforcement learning algorithm that has been proven to converge to optimal policies in Markov decision processes. However, Q-learning exhibits artifacts when the optimal action is played with low probability, a situation that may arise from a poor initialization of Q-values or from convergence to an almost pure policy after which a change in the environment makes another action optimal. These artifacts were resolved in the literature by the variant Frequency Adjusted Q-learning (FAQL). However, FAQL suffers from practical concerns that limit the policy subspace over which the behavior is improved. Here, we introduce Repeated Update Q-learning (RUQL), a variant of Q-learning that resolves the undesirable artifacts of Q-learning without the practical concerns of FAQL. We show, both theoretically and experimentally, the similarities and differences between RUQL and FAQL (the closest state of the art). Experimental results verify the theoretical insights and show that RUQL outperforms both FAQL and Q-learning in non-stationary environments.
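
The abstract does not spell out the update rule, but the title points at the core idea: repeat the standard Q-learning update for the chosen action in inverse proportion to the probability with which that action was selected, so that rarely played actions are not under-updated. The sketch below is a minimal illustration of that repeated-update idea, assuming a tabular Q-function and a repetition count of roughly 1/π(a|s); the function names and hyperparameters are illustrative, not the paper's exact formulation.

    import numpy as np

    def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.95):
        """One standard tabular Q-learning update (for contrast)."""
        target = r + gamma * np.max(q[s_next])
        q[s, a] += alpha * (target - q[s, a])

    def repeated_update(q, s, a, r, s_next, pi_sa, alpha=0.1, gamma=0.95):
        """Illustrative repeated-update variant (assumption: the update
        for the chosen action is repeated about 1/pi(a|s) times)."""
        target = r + gamma * np.max(q[s_next])
        k = 1.0 / max(pi_sa, 1e-6)  # more repetitions for unlikely actions
        # Applying "Q <- Q + alpha * (target - Q)" k times with a fixed
        # target has the closed form Q <- target - (1 - alpha)^k * (target - Q),
        # which avoids literally looping when pi(a|s) is tiny.
        q[s, a] = target - (1.0 - alpha) ** k * (target - q[s, a])

    # Toy usage: 3 states, 2 actions; action 1 was selected with probability 0.05.
    q = np.zeros((3, 2))
    repeated_update(q, s=0, a=1, r=1.0, s_next=2, pi_sa=0.05)

The closed form makes the effect on the policy bias explicit: for a fixed learning rate, the smaller the selection probability π(a|s), the further the update moves Q(s, a) toward the bootstrapped target.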