Hierarchical Approximate Policy Iteration With Binary-Tree State Space Decomposition

  • Authors:
  • Xin Xu, Chunming Liu, Simon X. Yang, Dewen Hu

  • Affiliations:
  • Xin Xu, Chunming Liu, Dewen Hu: College of Mechatronics and Automation, National University of Defense Technology, Changsha, China
  • Simon X. Yang: School of Engineering, University of Guelph, Guelph, Canada

  • Venue:
  • IEEE Transactions on Neural Networks
  • Year:
  • 2011


Abstract

In recent years, approximate policy iteration (API) has attracted increasing attention in reinforcement learning (RL), e.g., least-squares policy iteration (LSPI) and its kernelized version, the kernel-based LSPI (KLSPI) algorithm. However, it remains difficult for API algorithms to obtain near-optimal policies for Markov decision processes (MDPs) with large or continuous state spaces. To address this problem, this paper presents a hierarchical API (HAPI) method with binary-tree state space decomposition for RL in a class of absorbing MDPs that can be formulated as time-optimal learning control tasks. In the proposed method, after samples are collected adaptively in the state space of the original MDP, a learning-based strategy decomposes the sample set to carry out the binary-tree state space decomposition, and API algorithms are then applied to the sample subsets to approximate locally optimal policies of the sub-MDPs. Because the original MDP is decomposed during learning into a binary tree of absorbing sub-MDPs, the local near-optimal policies can be approximated by API algorithms with reduced complexity and higher precision. Furthermore, owing to the improved quality of the local policies, the combined global policy performs better than the near-optimal policy obtained by a single API algorithm on the original MDP. Three learning control problems, including path-tracking control of a real mobile robot, were studied to evaluate the performance of the HAPI method. With the same settings for basis function selection and sample collection, the proposed HAPI obtained better near-optimal policies than previous API methods such as LSPI and KLSPI.
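
To make the decomposition concrete, below is a minimal Python sketch of the binary-tree idea under stated assumptions: all names (Node, solve_local_policy, build_tree, global_policy) are illustrative, the variance-based splitting rule is a stand-in for the paper's learning-based decomposition strategy, and a trivial reward-averaging solver stands in for LSPI/KLSPI. Samples are split recursively on one state dimension, a local policy is fit on each leaf's subset, and the global policy routes each state down the tree to its leaf.

```python
# Illustrative sketch only; not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple
import numpy as np

Sample = Tuple[np.ndarray, int, float, np.ndarray]  # (s, a, r, s')
Policy = Callable[[np.ndarray], int]

@dataclass
class Node:
    dim: int = 0                     # splitting dimension (internal nodes)
    threshold: float = 0.0           # splitting threshold (internal nodes)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    policy: Optional[Policy] = None  # local policy (leaf nodes)

def solve_local_policy(samples: List[Sample]) -> Policy:
    """Placeholder for an API solver such as LSPI/KLSPI: picks the action
    with the highest mean observed reward in this sample subset."""
    rewards: dict = {}
    for _, a, r, _ in samples:
        rewards.setdefault(a, []).append(r)
    best = max(rewards, key=lambda a: float(np.mean(rewards[a]))) if rewards else 0
    return lambda state: best

def build_tree(samples: List[Sample], depth: int) -> Node:
    # Assumed stopping rule: depth limit or too few samples to split.
    if depth == 0 or len(samples) < 20:
        return Node(policy=solve_local_policy(samples))
    states = np.array([s for s, _, _, _ in samples])
    dim = int(np.argmax(states.var(axis=0)))   # split the widest dimension
    thr = float(np.median(states[:, dim]))
    left = [x for x in samples if x[0][dim] <= thr]
    right = [x for x in samples if x[0][dim] > thr]
    if not left or not right:                  # degenerate split: stop here
        return Node(policy=solve_local_policy(samples))
    return Node(dim=dim, threshold=thr,
                left=build_tree(left, depth - 1),
                right=build_tree(right, depth - 1))

def global_policy(root: Node, state: np.ndarray) -> int:
    """Combine leaf policies: descend the tree, then act with the leaf."""
    node = root
    while node.policy is None:
        node = node.left if state[node.dim] <= node.threshold else node.right
    return node.policy(state)
```

Each sub-MDP then covers a smaller region of the state space, which is what allows the local API solvers to work with reduced complexity and higher precision before their leaf policies are combined into the global policy.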