Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game

Authors:
Ole-Christoffer Granmo; Sondre Glimsdal
Affiliations:
Department of Information and Communication Technology, University of Agder, 4604 Kristiansand, Norway (both authors)
Venue:
Applied Intelligence
Year:
2013

Abstract

The two-armed bandit problem is a classical optimization problem in which a decision maker sequentially pulls one of two arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, so one must balance exploiting existing knowledge about the arms against obtaining new information. Bandit problems are particularly fascinating because a large class of real-world problems, including routing, Quality of Service (QoS) control, game playing, and resource allocation, can be solved in a decentralized manner when modeled as a system of interacting gambling machines.

Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel scheme for decentralized decision making based on the Goore Game, in which each decision maker is inherently Bayesian in nature yet avoids computational intractability by relying simply on updating the hyperparameters of sibling conjugate priors and on random sampling from the resulting posteriors. We further report theoretical results on the variance of the random rewards experienced by each individual decision maker. Based on these theoretical results, each decision maker is able to accelerate its own learning by taking advantage of the increasingly reliable feedback that is obtained as exploration gradually turns into exploitation in bandit-based learning.

Extensive experiments, involving QoS control in simulated wireless sensor networks, demonstrate that the accelerated learning allows us to combine the benefit of conservative learning, which is high accuracy, with the benefit of hurried learning, which is fast convergence. In this manner, our scheme outperforms recently proposed Goore Game solution schemes, where one has to trade off accuracy against speed. As an additional benefit, performance also becomes more stable. We thus believe that our methodology opens avenues for improved performance in a number of applications of bandit-based decentralized decision making.
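
To make the idea of posterior sampling in a Goore Game concrete, the following is a minimal Python sketch and not the scheme proposed in the paper. The abstract does not specify the conjugate family or the acceleration rule, so the sketch assumes Bernoulli rewards with Beta priors (standard Thompson sampling) and omits the acceleration step entirely. The names `BetaBandit` and `goore_game`, the Gaussian-shaped referee function, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


class BetaBandit:
    """Two-armed Bernoulli bandit learner with one Beta posterior per arm.

    Assumption: Beta/Bernoulli conjugacy stands in for the paper's
    "sibling conjugate priors", which the abstract does not specify.
    """

    def __init__(self, rng):
        self.rng = rng
        # [alpha, beta] hyperparameters per arm; Beta(1, 1) is the uniform prior.
        self.params = [[1.0, 1.0], [1.0, 1.0]]

    def choose(self):
        # Draw one sample from each posterior and pull the arm with the larger draw.
        draws = [self.rng.betavariate(a, b) for a, b in self.params]
        return 0 if draws[0] >= draws[1] else 1

    def update(self, arm, reward):
        # Conjugate update: a reward increments alpha, a penalty increments beta.
        if reward:
            self.params[arm][0] += 1.0
        else:
            self.params[arm][1] += 1.0


def goore_game(n_players=10, target=0.3, rounds=2000, seed=0):
    """Play a Goore Game in which every voter is an independent BetaBandit."""
    rng = random.Random(seed)
    players = [BetaBandit(rng) for _ in range(n_players)]
    yes_fraction = 0.0
    for _ in range(rounds):
        votes = [p.choose() for p in players]  # arm 1 = "yes", arm 0 = "no"
        yes_fraction = sum(votes) / n_players
        # Hypothetical unimodal referee function peaking at `target`.
        p_reward = math.exp(-((yes_fraction - target) / 0.2) ** 2)
        # Each voter is rewarded independently; voters never see other votes.
        for player, vote in zip(players, votes):
            player.update(vote, rng.random() < p_reward)
    return yes_fraction, players
```

Calling `goore_game()` returns the fraction of "yes" votes in the final round together with the list of learners, whose `params` hold the posterior hyperparameters. Each voter only ever updates two pairs of counters and draws two random samples per round, which is the sense in which this style of Bayesian learner sidesteps an intractable exact computation.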