LC-Learning: Phased Method for Average Reward Reinforcement Learning - Analysis of Optimal Criteria
PRICAI '02 Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
This paper presents two methods to accelerate LC-learning, a model-based average-reward reinforcement learning method that computes a bias-optimal policy in a cyclic domain. LC-learning computes the bias-optimal policy without any approximation, exploiting the fact that only the optimal cycle needs to be found in order to obtain a gain-optimal policy. However, its complexity is large, because it examines most combinations of actions to detect all cycles. In this paper, we first introduce two pruning methods that prevent the state-explosion problem of LC-learning. Second, we compare the improved LC-learning with one of the fastest existing methods, Prioritized Sweeping, on a bus-scheduling task. We show that LC-learning computes the bias-optimal policy more quickly than standard Prioritized Sweeping, and that it performs as well as the fully tuned version in the medium-sized case.
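The core observation behind LC-learning, as described in the abstract, is that in a cyclic domain the gain-optimal policy can be found by searching for the cycle with maximal average reward. The following is a minimal illustrative sketch of that cycle-enumeration idea on a small, hypothetical deterministic MDP; the transition table and state names are invented for illustration, and LC-learning's actual pruning rules and bias-optimality computation are not reproduced here.

```python
# Hypothetical deterministic MDP: state -> {action: (next_state, reward)}.
# This toy example is an assumption for illustration, not taken from the paper.
mdp = {
    0: {'a': (1, 1.0), 'b': (2, 0.0)},
    1: {'a': (0, 1.0)},
    2: {'a': (0, 5.0)},
}

def best_cycle(mdp):
    """Enumerate simple cycles by depth-first search and return the one
    with maximal average reward (gain). Without pruning this examines
    most action combinations, which is the complexity problem the
    abstract refers to."""
    best = (float('-inf'), None)  # (gain, cycle states)

    def dfs(path, rewards):
        nonlocal best
        state = path[-1]
        for action, (nxt, r) in mdp[state].items():
            if nxt in path:
                # A cycle closes at nxt: average the rewards along it.
                i = path.index(nxt)
                cyc_rewards = rewards[i:] + [r]
                gain = sum(cyc_rewards) / len(cyc_rewards)
                if gain > best[0]:
                    best = (gain, path[i:])
            else:
                dfs(path + [nxt], rewards + [r])

    # Starting DFS from every state re-finds some cycles; harmless for a sketch.
    for s in mdp:
        dfs([s], [])
    return best

gain, cycle = best_cycle(mdp)  # the 0 -> 2 -> 0 cycle has gain (0.0 + 5.0) / 2
```

In this toy domain the cycle through states 0 and 2 attains the highest gain (2.5), so a gain-optimal policy would follow that cycle; LC-learning's pruning methods aim to avoid exhaustively expanding paths like this DFS does.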