In continuous learning settings, stochastic, stable policies are often necessary to ensure that agents keep adapting to dynamic environments. The choice of decentralised learning system and policy plays an important role in the optimisation task. For example, a policy that exhibits fluctuations may introduce non-linear effects that other agents in the environment cannot cope with and may even amplify. In dynamic and unpredictable multiagent environments, these oscillations can destabilise the system. In this paper, we take inspiration from the limbic system and introduce an extension to the weighted policy learner in which agents evaluate rewards as either positive or negative feedback, depending on how they deviate from the average expected reward. Each agent has a positive and a negative bias, which magnifies or depresses the corresponding feedback signal. To contain the non-linear effects of biased rewards, we incorporate a decaying memory of past positive and negative feedback signals, which yields a smoother gradient update on the probability simplex by spreading the effect of each feedback signal over time. Splitting the feedback signal also provides more leverage on the win or learn fast (WoLF) principle. The resulting cognitive policy learner is evaluated on a small queueing network and compared with the fair action learner and the weighted policy learner. Emphasis is placed on analysing the dynamics of the learning algorithms with respect to the stability of the queueing network and overall queueing performance.
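The core update described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name, bias values, decay rate, and learning rate below are all hypothetical, and the simplex projection uses the standard sort-based Euclidean projection rather than whatever renormalisation the paper actually employs.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

class CognitivePolicyLearner:
    """Illustrative sketch of the biased positive/negative feedback split
    with a decaying memory of past feedback signals (hypothetical parameters)."""

    def __init__(self, n_actions, pos_bias=1.2, neg_bias=0.8,
                 decay=0.9, lr=0.1, avg_rate=0.05):
        self.policy = np.full(n_actions, 1.0 / n_actions)
        self.avg_reward = 0.0                 # running estimate of expected reward
        self.pos_trace = np.zeros(n_actions)  # decaying memory of positive feedback
        self.neg_trace = np.zeros(n_actions)  # decaying memory of negative feedback
        self.pos_bias, self.neg_bias = pos_bias, neg_bias
        self.decay, self.lr, self.avg_rate = decay, lr, avg_rate

    def update(self, action, reward):
        # split the reward into positive/negative feedback relative to the average
        delta = reward - self.avg_reward
        pos, neg = max(delta, 0.0), max(-delta, 0.0)
        self.avg_reward += self.avg_rate * (reward - self.avg_reward)
        # decay old traces, then magnify/depress the new feedback via the biases
        self.pos_trace *= self.decay
        self.neg_trace *= self.decay
        self.pos_trace[action] += self.pos_bias * pos
        self.neg_trace[action] += self.neg_bias * neg
        # gradient step driven by the smoothed net feedback, kept on the simplex
        self.policy = project_to_simplex(
            self.policy + self.lr * (self.pos_trace - self.neg_trace))
```

Spreading the update over the decaying traces, rather than applying each raw reward directly, is what smooths the gradient step and damps the oscillations discussed above.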