The penalty avoiding rational policy making algorithm in continuous action spaces

  • Authors:
  • Kazuteru Miyazaki

  • Affiliations:
  • National Institution for Academic Degrees and University Evaluation, Kodaira, Tokyo, Japan

  • Venue:
  • IDEAL'10: Proceedings of the 11th International Conference on Intelligent Data Engineering and Automated Learning
  • Year:
  • 2010

Abstract

Reinforcement learning involves learning to adapt to an environment through the presentation of rewards, a special input that serves as a clue. To obtain rational policies quickly, methods such as profit sharing, the rational policy making algorithm, the penalty avoiding rational policy making algorithm (PARP), PS-r*, and PS-r# are used; collectively, these are called Exploitation-oriented Learning (XoL). Applying reinforcement learning to real problems sometimes requires handling continuous-valued input and output. A PARP-based method has been proposed as a XoL method that handles continuous-valued input, but it cannot treat continuous-valued output. We study a treatment of continuous-valued output suited to a XoL method in an environment that includes both a reward and a penalty. We extend PARP with continuous-valued input to continuous-valued output, apply the proposal to the pole-cart balancing problem, and confirm its validity.
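
The abstract does not spell out the algorithm, but the core PARP idea it builds on is: remember which rules (state-action pairs) have ever led to a penalty, never reselect them, and exploit rewarded rules among the rest. Below is a minimal illustrative sketch of that idea with a continuous scalar action, assuming a simple per-state discretised bookkeeping scheme; the class and method names (PenaltyAvoidingAgent, act, record) and the binning scheme are our own illustration, not the paper's method.

```python
import random
from collections import defaultdict

class PenaltyAvoidingAgent:
    """Illustrative penalty-avoiding agent with a continuous action
    in [lo, hi]. Sketch only; not the algorithm from the paper."""

    def __init__(self, n_bins=10, action_range=(-1.0, 1.0)):
        self.n_bins = n_bins
        self.lo, self.hi = action_range
        # Per state: list of (action, outcome) pairs,
        # where outcome is +1 (reward) or -1 (penalty).
        self.memory = defaultdict(list)

    def _bin(self, action):
        # Discretise the continuous action for bookkeeping.
        width = (self.hi - self.lo) / self.n_bins
        return min(int((action - self.lo) / width), self.n_bins - 1)

    def _penalty_bins(self, state):
        # Action bins that have ever led to a penalty are forbidden,
        # mirroring PARP's "never reselect a penalty rule" idea.
        return {self._bin(a) for a, o in self.memory[state] if o < 0}

    def act(self, state):
        forbidden = self._penalty_bins(state)
        rewarded = [a for a, o in self.memory[state]
                    if o > 0 and self._bin(a) not in forbidden]
        if rewarded:
            # Exploit: average the rewarded continuous actions.
            return sum(rewarded) / len(rewarded)
        # Otherwise explore a continuous action from a safe bin.
        safe = [b for b in range(self.n_bins) if b not in forbidden]
        if not safe:
            safe = list(range(self.n_bins))  # no safe bin is known yet
        b = random.choice(safe)
        width = (self.hi - self.lo) / self.n_bins
        return self.lo + (b + random.random()) * width

    def record(self, state, action, outcome):
        # Store the observed outcome of taking `action` in `state`.
        self.memory[state].append((action, outcome))

# Usage: after observing a penalty, that action region is avoided.
agent = PenaltyAvoidingAgent()
agent.record(state=0, action=-0.95, outcome=-1)
agent.record(state=0, action=0.3, outcome=+1)
print(agent.act(0))  # exploits near 0.3, never picks the penalised bin
```

In the paper's pole-cart setting, the outcome signal would come from the balancing task itself (a penalty when the pole falls); the contribution the abstract describes is handling the continuous action directly rather than through a fixed discrete action set, which this sketch only approximates.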