Continuous-action reinforcement learning with fast policy search and adaptive basis function selection

  • Authors:
  • Xin Xu; Chunming Liu; Dewen Hu

  • Affiliations:
  • National University of Defense Technology, College of Mechatronics and Automation, Institute of Automation, Changsha 410073, Hunan, People’s Republic of China

  • Venue:
  • Soft Computing - A Fusion of Foundations, Methodologies and Applications, Special Issue on Recent Advances in Machine Learning and Cybernetics
  • Year:
  • 2011

Abstract

As an important approach to solving complex sequential decision problems, reinforcement learning (RL) has been widely studied in the artificial intelligence and machine learning communities. However, the generalization ability of RL remains an open problem, and existing RL algorithms have difficulty solving Markov decision problems (MDPs) with both continuous state and action spaces. In this paper, a novel RL approach with fast policy search and adaptive basis function selection, called Continuous-action Approximate Policy Iteration (CAPI), is proposed for MDPs with both continuous state and action spaces. In CAPI, based on value functions estimated by temporal-difference learning, a fast policy search technique is proposed to find optimal actions in continuous action spaces; the technique is computationally efficient and easy to implement. To improve the generalization ability and learning efficiency of CAPI, two adaptive basis function selection methods are developed so that sparse approximations of the value function can be obtained efficiently, both for linear function approximators and for kernel machines. Simulation results on benchmark learning control tasks with continuous state and action spaces show that the proposed approach not only converges to a near-optimal policy within a few iterations but also achieves performance comparable to or better than Sarsa learning and previous approximate policy iteration methods such as LSPI and KLSPI.
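
The abstract describes two algorithmic components: a fast search for the greedy action in a continuous action space using a learned value function, and adaptive basis function selection for sparse value-function approximation. The Python sketch below is illustrative only and is not the paper's actual CAPI procedure: it assumes a bounded one-dimensional action, uses a simple grid-refinement search for the greedy action, and uses an approximate-linear-dependence (ALD) style test (the kind of sparsification criterion used in kernel methods such as KLSPI) for basis selection; all function names and parameters are hypothetical.

```python
import numpy as np

def rbf_features(state, action, centers, width=1.0):
    """Gaussian RBF features over a concatenated (state, action) vector."""
    x = np.concatenate([np.atleast_1d(state), np.atleast_1d(action)])
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * width ** 2))

def q_value(state, action, weights, centers):
    """Linear value-function approximation: Q(s, a) = w^T phi(s, a)."""
    return float(weights @ rbf_features(state, action, centers))

def fast_policy_search(state, weights, centers, a_low, a_high,
                       n_grid=11, n_refine=4):
    """Greedy continuous action by iterative grid refinement: evaluate Q on a
    coarse grid over [a_low, a_high], then shrink the interval around the best
    grid point and repeat."""
    lo, hi = a_low, a_high
    best_a = 0.5 * (lo + hi)
    for _ in range(n_refine):
        grid = np.linspace(lo, hi, n_grid)
        values = [q_value(state, a, weights, centers) for a in grid]
        best_a = grid[int(np.argmax(values))]
        spacing = (hi - lo) / (n_grid - 1)
        lo, hi = max(a_low, best_a - spacing), min(a_high, best_a + spacing)
    return best_a

def ald_select_centers(samples, kernel, threshold=1e-3):
    """Sparse dictionary selection with an approximate-linear-dependence (ALD)
    test: a sample is kept only if its kernel features cannot be reproduced,
    up to `threshold`, by the samples already in the dictionary."""
    dictionary = [samples[0]]
    for x in samples[1:]:
        K = np.array([[kernel(u, v) for v in dictionary] for u in dictionary])
        k = np.array([kernel(u, x) for u in dictionary])
        c = np.linalg.solve(K + 1e-8 * np.eye(len(dictionary)), k)
        delta = kernel(x, x) - k @ c            # squared projection residual
        if delta > threshold:
            dictionary.append(x)
    return np.array(dictionary)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy setting: 2-D state, 1-D action in [-1, 1], 20 random RBF centers.
    centers = rng.uniform(-1.0, 1.0, size=(20, 3))
    weights = rng.normal(size=20)
    a_star = fast_policy_search(np.array([0.1, -0.2]), weights, centers,
                                a_low=-1.0, a_high=1.0)
    print("greedy action:", a_star)

    # ALD-style sparsification of candidate (state, action) centers.
    gauss = lambda u, v: np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2))
    sparse_centers = ald_select_centers(rng.uniform(-1, 1, size=(50, 3)), gauss)
    print("dictionary size:", len(sparse_centers))
```

The grid-refinement search stands in for the paper's fast policy search only in spirit: it needs a handful of Q-evaluations per decision rather than a fine discretization of the action space, which is the efficiency argument the abstract makes; the paper's exact search procedure and its basis selection rules for the linear-approximator case are described in the full text.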