A Bayesian Learning Automaton for Solving Two-Armed Bernoulli Bandit Problems

  • Authors: Ole-Christoffer Granmo

  • Venue: ICMLA '08 Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications
  • Year: 2008

Abstract

The two-armed Bernoulli bandit (TABB) problem is a classical optimization problem in which an agent sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in either a reward or a penalty. The reward probability of each arm is unknown, so one must balance exploiting existing knowledge about the arms against obtaining new information. Over the last decades, several computationally efficient algorithms for tackling this problem have emerged, with Learning Automata (LA) being known for their $\epsilon$-optimality, and confidence interval based algorithms for their logarithmically growing regret. Applications include treatment selection in clinical trials, route selection in adaptive routing, and plan exploration in games like Go. The TABB has also been extensively studied from a Bayesian perspective; in general, however, such analysis leads to computationally inefficient solution policies.

This paper introduces the Bayesian Learning Automaton (BLA). The BLA is inherently Bayesian in nature, yet relies simply on counting rewards/penalties and on random sampling from a pair of twin beta distributions. Furthermore, we report that BLA is self-correcting and converges to pulling only the optimal arm with probability 1. Extensive experiments demonstrate that, in contrast to most LA, BLA does not rely on external learning speed/accuracy control. It also outperforms recently proposed confidence interval based algorithms. We thus believe that BLA opens the door to improved performance in a number of applications, and that it forms the basis for a new avenue of research.
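The decision loop the abstract describes (count rewards and penalties per arm, draw one sample from each arm's beta distribution, pull the arm with the larger sample, then update that arm's counts) is simple enough to sketch directly. Below is a minimal Python sketch of that scheme under stated assumptions; the pull_arm callback, the horizon parameter, and the simulated reward probabilities in the usage example are illustrative assumptions, not details taken from the paper.

    import random

    def bla(pull_arm, horizon):
        # One reward/penalty counter pair per arm; starting both at 1
        # corresponds to a uniform Beta(1, 1) prior on each arm.
        rewards = [1, 1]
        penalties = [1, 1]
        for _ in range(horizon):
            # Draw one sample from each arm's beta distribution and
            # pull the arm whose sample is larger (the twin-sampling step).
            samples = [random.betavariate(rewards[i], penalties[i])
                       for i in (0, 1)]
            arm = 0 if samples[0] >= samples[1] else 1
            if pull_arm(arm):          # True on reward, False on penalty
                rewards[arm] += 1
            else:
                penalties[arm] += 1
        return rewards, penalties

    # Illustrative usage: simulate a TABB with assumed reward
    # probabilities 0.9 and 0.6 for arms 0 and 1.
    probs = [0.9, 0.6]
    print(bla(lambda arm: random.random() < probs[arm], horizon=10000))

As the counters grow, each beta distribution concentrates around its arm's observed reward rate, so the inferior arm wins the sample comparison less and less often; this corresponds to the self-correcting, parameter-free behaviour the abstract claims for BLA.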