This article proposes an adaptive action-selection method for a model-free reinforcement learning system, based on the concept of the 'reliability of internal prediction/estimation'. This concept is realized through an internal variable, called the Reliability Index (RI), which estimates the accuracy of the internal estimator. We define this index for the value function of a temporal-difference learning system and substitute it for the temperature parameter of the Boltzmann action-selection rule, so that the weight of exploratory actions changes adaptively with the uncertainty of the prediction. We apply this idea to both tabular and weighted-sum value functions. Moreover, we use the RI to adjust the learning coefficient in addition to the temperature parameter, so that reliability becomes a general basis for meta-learning. Numerical experiments were performed to examine the behavior of the proposed method. The RI-based Q-learning system demonstrated its features when the adaptive learning coefficient and a large RI-discount rate (which indicates how the RI values of future states are reflected in the RI value of the current state) were introduced. Statistical tests confirmed that the algorithm spent more time exploring in the initial phase of learning but accelerated learning from the midpoint onward. We also show that the proposed method does not work well with actor-critic models. The limitations of the proposed method and its relationship to related research are discussed.
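The core mechanism described above can be sketched in code: a tabular Q-learner whose Boltzmann temperature is replaced by a per-state Reliability Index. This is a minimal illustration, not the authors' implementation; in particular, the specific RI update rule (a moving average of the absolute TD error plus a discounted share of the next state's RI, governed by the `ri_discount` rate) and all parameter names here are assumptions consistent with the abstract, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_probs(q_values, temperature):
    """Boltzmann (softmax) action probabilities at the given temperature."""
    z = (q_values - q_values.max()) / max(temperature, 1e-8)
    e = np.exp(z)
    return e / e.sum()

class RIQLearner:
    """Tabular Q-learning where a per-state Reliability Index (RI) is
    substituted for the Boltzmann temperature. The RI update rule below
    is an assumed sketch: it tracks |TD error| plus a discounted share
    of the next state's RI (the 'RI-discount rate' of the abstract)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 ri_rate=0.1, ri_discount=0.9):
        self.q = np.zeros((n_states, n_actions))
        self.ri = np.ones(n_states)     # high RI -> high temperature -> more exploration
        self.alpha = alpha              # learning coefficient (could itself be RI-scaled)
        self.gamma = gamma              # reward discount rate
        self.ri_rate = ri_rate          # step size of the RI moving average
        self.ri_discount = ri_discount  # how much next-state RI flows back

    def act(self, state):
        # Unreliable predictions (large RI) yield a flatter, more exploratory policy.
        p = boltzmann_probs(self.q[state], temperature=self.ri[state])
        return rng.choice(len(p), p=p)

    def update(self, s, a, r, s_next):
        td_error = r + self.gamma * self.q[s_next].max() - self.q[s, a]
        self.q[s, a] += self.alpha * td_error
        # RI estimates prediction unreliability from recent TD errors,
        # blended with the (discounted) unreliability of the successor state.
        target = abs(td_error) + self.ri_discount * self.ri[s_next]
        self.ri[s] += self.ri_rate * (target - self.ri[s])
```

As learning converges, TD errors shrink, the RI decays, the effective temperature drops, and the policy shifts from exploration to exploitation without a hand-tuned annealing schedule.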