Efficient sample reuse in policy gradients with parameter-based exploration

  • Authors:
  • Tingting Zhao, Hirotaka Hachiya, Voot Tangkaratt, Jun Morimoto, Masashi Sugiyama

  • Venue:
  • Neural Computation
  • Year:
  • 2013

Abstract

The policy gradient approach is a flexible and powerful reinforcement learning method, particularly for problems with continuous actions such as robot control. A common challenge is reducing the variance of policy gradient estimates so that policy updates remain reliable. In this letter, we combine the following three ideas to obtain a highly effective policy gradient method: (1) policy gradients with parameter-based exploration, a recently proposed policy search method with low variance of gradient estimates; (2) an importance sampling technique, which allows us to reuse previously gathered data in a consistent way; and (3) an optimal baseline, which minimizes the variance of gradient estimates while maintaining their unbiasedness. For the proposed method, we give a theoretical analysis of the variance of gradient estimates and demonstrate its usefulness through extensive experiments.
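
To make the combination concrete, below is a minimal sketch of how the three ingredients fit together, assuming a factorized Gaussian hyper-policy over policy parameters (the standard choice in parameter-based exploration, commonly abbreviated PGPE). The function and variable names are illustrative rather than taken from the letter, and the baseline uses a variance-minimizing form weighted by squared importance weights and squared score norms; treat it as a sketch under these assumptions, not the authors' exact estimator.

    import numpy as np

    def gaussian_logpdf(theta, mean, std):
        """Log-density of a factorized Gaussian hyper-policy p(theta | mean, std)."""
        return np.sum(-0.5 * np.log(2 * np.pi * std**2)
                      - (theta - mean)**2 / (2 * std**2), axis=-1)

    def iw_pgpe_gradient(thetas, returns, mean, std, mean_old, std_old):
        """Importance-weighted parameter-based policy gradient with a baseline.

        thetas  : (N, D) policy parameters sampled from the *old* hyper-policy
        returns : (N,)   returns R(theta_n) obtained with those parameters
        mean, std         : current hyper-policy parameters, shape (D,)
        mean_old, std_old : hyper-policy parameters that generated the samples, shape (D,)
        """
        # Importance weights w(theta) = p(theta | current) / p(theta | old),
        # which let previously gathered samples be reused in a consistent way.
        log_w = gaussian_logpdf(thetas, mean, std) - gaussian_logpdf(thetas, mean_old, std_old)
        w = np.exp(log_w)                                    # (N,)

        # Score function of the Gaussian hyper-policy w.r.t. (mean, std).
        g_mean = (thetas - mean) / std**2                    # (N, D)
        g_std = ((thetas - mean)**2 - std**2) / std**3       # (N, D)
        score = np.concatenate([g_mean, g_std], axis=1)      # (N, 2D)

        # Variance-reducing baseline (assumed form: second moments weighted by
        # squared importance weights and squared score norms).
        sq_norm = np.sum(score**2, axis=1)                   # (N,)
        b = np.sum(w**2 * returns * sq_norm) / np.sum(w**2 * sq_norm)

        # Importance-weighted, baseline-subtracted policy gradient estimate.
        grad = np.mean(w[:, None] * (returns - b)[:, None] * score, axis=0)
        D = thetas.shape[1]
        return grad[:D], grad[D:]  # gradients w.r.t. mean and std

Under these assumptions, parameter samples and their returns collected with an earlier hyper-policy (mean_old, std_old) can be reused when updating the current (mean, std): the importance weights keep the gradient estimate consistent, while the baseline reduces its variance without introducing bias.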