Two Steps Reinforcement Learning in Continuous Reinforcement Learning Tasks
IWANN '09 Proceedings of the 10th International Work-Conference on Artificial Neural Networks: Part I: Bio-Inspired Systems: Computational and Ambient Intelligence
A state-cluster based Q-learning
ICNC'09 Proceedings of the 5th International Conference on Natural Computation
Probabilistic Policy Reuse for inter-task transfer learning
Robotics and Autonomous Systems
Unsupervised modeling of partially observable environments
ECML PKDD'11 Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Multi-agent reinforcement learning for simulating pedestrian navigation
ALA'11 Proceedings of the 11th International Conference on Adaptive and Learning Agents
Safe exploration of state and action spaces in reinforcement learning
Journal of Artificial Intelligence Research
When reinforcement learning is applied in domains with very large or continuous state spaces, the experience the learning agent gathers while interacting with the environment must be generalized. Generalization methods typically approximate the value functions used to compute the action policy, and they do so in one of two ways: either by approximating the value functions with a supervised learning method, or by discretizing the environment so that the value functions can be stored in a tabular representation. In this work, we propose an algorithm that combines both approaches in order to exploit the benefits of each and achieve higher performance. The approach consists of two learning phases. In the first phase, a supervised learner approximates the value function, using a machine learning technique that also produces a discretization of the state space, as nearest prototype classifiers or decision trees do. In the second phase, the discretization computed in the first phase is used to build a tabular representation of the value function obtained there, allowing that approximation to be further tuned. Experiments in different domains show that executing both learning phases improves the results obtained by executing only the first one, where the results account for both the resources used and the performance of the learned behavior. © 2008 Wiley Periodicals, Inc.
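To make the two-phase idea concrete, below is a minimal sketch in Python. It is not the authors' implementation: the toy environment (ToyChainEnv), the hyperparameters, and the choice of scikit-learn's DecisionTreeRegressor as the phase-one approximator are all illustrative assumptions. The tree approximates the value function in phase one, its leaves provide the state-space discretization, and phase two runs tabular Q-learning over those discrete states, starting from the phase-one approximation and tuning it further.

# Minimal sketch of the two-phase approach, under the assumptions stated above;
# ToyChainEnv and all hyperparameters are illustrative, not from the paper.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class ToyChainEnv:
    """Hypothetical 1-D chain: actions move left/right, reward 1 at the right end."""
    def reset(self):
        self.x = np.random.uniform(0.0, 0.5)
        return self.x

    def step(self, action):  # action: 0 = left, 1 = right
        self.x = float(np.clip(self.x + (0.1 if action == 1 else -0.1), 0.0, 1.0))
        done = self.x >= 0.99
        return self.x, (1.0 if done else 0.0), done

GAMMA, ACTIONS = 0.95, (0, 1)
env = ToyChainEnv()

# ---- Phase 1: supervised approximation of the value function ----
# Collect random episodes, compute Monte Carlo returns, and fit one regression
# tree per action; each tree approximates Q(s, a) and, through its leaves,
# induces a partition of the continuous state space.
data = {a: ([], []) for a in ACTIONS}
for _ in range(300):
    s, done, traj = env.reset(), False, []
    while not done and len(traj) < 50:
        a = int(np.random.choice(ACTIONS))
        s2, r, done = env.step(a)
        traj.append((s, a, r))
        s = s2
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + GAMMA * G          # discounted return from (s, a) onwards
        data[a][0].append([s])
        data[a][1].append(G)

trees = {a: DecisionTreeRegressor(max_leaf_nodes=16).fit(X, y)
         for a, (X, y) in data.items()}

def discretize(s):
    """Map a continuous state to the tuple of leaf indices assigned by the trees."""
    return tuple(int(t.apply([[s]])[0]) for t in trees.values())

# ---- Phase 2: tabular Q-learning over the tree-induced discretization ----
# Each table entry is initialised from the phase-1 approximation and then
# tuned with standard Q-learning updates.
Q = {}
def q(sd, a, s):
    return Q.setdefault((sd, a), float(trees[a].predict([[s]])[0]))

alpha, epsilon = 0.1, 0.1
for _ in range(500):
    s, done, steps = env.reset(), False, 0
    while not done and steps < 200:
        sd = discretize(s)
        if np.random.rand() < epsilon:
            a = int(np.random.choice(ACTIONS))
        else:
            a = max(ACTIONS, key=lambda a_: q(sd, a_, s))
        s2, r, done = env.step(a)
        sd2 = discretize(s2)
        target = r + (0.0 if done else GAMMA * max(q(sd2, a_, s2) for a_ in ACTIONS))
        Q[(sd, a)] = q(sd, a, s) + alpha * (target - q(sd, a, s))
        s, steps = s2, steps + 1

Because each table entry is initialised from the phase-one tree prediction for its state, the second phase does not learn from scratch; it refines the approximated value function within the discretization that the first phase produced, which is the tuning role the abstract describes.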