A study of permutation crossover operators on the traveling salesman problem. Proceedings of the Second International Conference on Genetic Algorithms and Their Application.
Technical Note: Q-Learning. Machine Learning.
The dynamics of reinforcement learning in cooperative multiagent systems. AAAI '98/IAAI '98: Proceedings of the Fifteenth National Conference on Artificial Intelligence / Tenth Conference on Innovative Applications of Artificial Intelligence.
Introduction to Reinforcement Learning.
Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm. ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning.
Implicit Imitation in Multiagent Reinforcement Learning. ICML '99: Proceedings of the Sixteenth International Conference on Machine Learning.
MICAI '00: Proceedings of the Mexican International Conference on Artificial Intelligence: Advances in Artificial Intelligence.
Sequential optimality and coordination in multiagent systems. IJCAI '99: Proceedings of the 16th International Joint Conference on Artificial Intelligence, Volume 1.
The traveling salesman: computational solutions for TSP applications.
A New Approach to Improve the Ant Colony System Performance: Learning Levels. HAIS '09: Proceedings of the 4th International Conference on Hybrid Artificial Intelligence Systems.
Multiobjective water pinch analysis of the Cuernavaca city water distribution network. EMO '05: Proceedings of the Third International Conference on Evolutionary Multi-Criterion Optimization.
Journal of Computational and Applied Mathematics.
In reinforcement learning, an autonomous agent learns an optimal policy while interacting with its environment. In particular, in one-step Q-learning the agent updates its Q-values after each action using the immediate reward. In this paper a new strategy for updating Q-values is proposed. The strategy, implemented in an algorithm called DQL, uses a set of agents that all search for the same goal in the same space in order to learn the same optimal policy. While searching for a goal, each agent leaves traces over its own copy of the environment's Q-values, and these copies guide the agents' action choices. Once all the agents have reached a goal, the original Q-values along the best solution found by the agents are updated using Watkins' Q-learning formula. DQL has some similarities with Gambardella's Ant-Q algorithm [4]; however, it does not require the definition of a domain-dependent heuristic, and consequently avoids the tuning of additional parameters. Unlike Ant-Q, DQL also does not update the original Q-values with zero reward while the agents are searching. It is shown that DQL's guided exploration by several agents, combined with selective exploitation (updating only the best solution), produces faster convergence than Q-learning and Ant-Q on several testbed problems under similar conditions.
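The update scheme the abstract describes can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the corridor environment, the `step` interface, the parameter values, and all function names are assumptions; only the copy-then-update-the-best structure and Watkins' one-step rule come from the text above.

```python
import random

def dql(n_states, n_actions, step, start, goal,
        n_agents=5, episodes=200, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Toy sketch of DQL: agents explore on private copies of the
    Q-table, and only the best solution of each generation updates
    the shared Q-values with Watkins' one-step Q-learning rule."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def greedy(q, s):
        best = max(q[s])  # break ties among equal Q-values at random
        return rng.choice([a for a, v in enumerate(q[s]) if v == best])

    for _ in range(episodes):
        best_traj = None
        for _ in range(n_agents):
            q = [row[:] for row in Q]          # private copy: traces go here
            s, traj = start, []
            for _ in range(10 * n_states):     # step cap per episode
                a = rng.randrange(n_actions) if rng.random() < eps else greedy(q, s)
                s2, r = step(s, a)
                traj.append((s, a, r, s2))
                # unlike Ant-Q, the shared Q-values are NOT touched here
                q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
                s = s2
                if s == goal:
                    break
            if s == goal and (best_traj is None or len(traj) < len(best_traj)):
                best_traj = traj
        if best_traj is not None:              # exploit only the best solution
            for s, a, r, s2 in best_traj:
                Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
    return Q
```

On a small deterministic task (e.g. a short corridor with a rewarded goal state), the shared Q-table converges to a greedy policy that heads for the goal, even though exploration traces never touch it directly.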