An average-reward reinforcement learning algorithm for computing bias-optimal policies

  • Author: Sridhar Mahadevan
  • Affiliation: Department of Computer Science and Engineering, University of South Florida, Tampa, Florida
  • Venue: AAAI'96: Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 1
  • Year: 1996

Abstract

Average-reward reinforcement learning (ARL) is an undiscounted optimality framework that applies to a broad range of control tasks. ARL computes gain-optimal control policies that maximize the expected payoff per step. However, gain optimality has some intrinsic limitations as an optimality criterion: for example, it cannot distinguish between policies that all reach an absorbing goal state but incur varying costs. A more selective criterion is bias optimality, which can filter gain-optimal policies to select those that reach absorbing goals with minimum cost. While several ARL algorithms for computing gain-optimal policies have been proposed, none of them can guarantee bias optimality, since this requires solving at least two nested optimality equations. In this paper, we describe a novel model-based ARL algorithm for computing bias-optimal policies. We test the proposed algorithm on an admission-control queuing system and show that, by learning bias-optimal policies, it utilizes the queue much more efficiently than a gain-optimal method.
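
The "two nested optimality equations" mentioned in the abstract have a standard form in the average-reward dynamic programming literature. As a sketch for a unichain MDP with states s, actions a, rewards r(s,a), and transition probabilities p(s'|s,a) (this notation is assumed here, not taken from the paper), the gain equation and the bias equation can be written as:

```latex
% First (gain) optimality equation: determines the optimal gain \rho^*
% and a relative value function h.
\rho^* + h(s) = \max_{a \in A(s)} \Big[ r(s,a) + \sum_{s'} p(s' \mid s, a)\, h(s') \Big]

% Second (bias) optimality equation: restricted to the gain-optimal action
% set A^*(s), i.e., the argmax of the first equation, it selects among
% gain-optimal policies those with maximal bias.
h(s) + w(s) = \max_{a \in A^{*}(s)} \sum_{s'} p(s' \mid s, a)\, w(s')
```

A policy attaining the maximum in both equations simultaneously is bias-optimal. The sketch below illustrates the nested computation on a known model, a bias-sensitive policy iteration in the style of Veinott and Puterman; it is not the paper's learning algorithm, which additionally estimates the model online, and the helper names and tolerances are illustrative assumptions.

```python
import numpy as np

def stationary_dist(P):
    """Stationary distribution pi of a unichain transition matrix P
    (solves pi P = pi, pi 1 = 1 as a least-squares system)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def bias_optimal_policy(P, R, iters=100, tol=1e-8):
    """Nested (gain, then bias) policy iteration for a known unichain MDP.
    P[s, a, s'] are transition probabilities, R[s, a] expected rewards.
    Returns a bias-optimal deterministic policy, its gain, and its bias."""
    n_s, n_a, _ = P.shape
    d = np.zeros(n_s, dtype=int)                      # arbitrary initial policy
    for _ in range(iters):
        Pd = P[np.arange(n_s), d]                     # (n_s, n_s) under policy d
        rd = R[np.arange(n_s), d]
        pi = stationary_dist(Pd)
        rho = pi @ rd                                 # gain of current policy
        # Fundamental matrix gives the bias h (normalized so pi @ h = 0)
        # and the second-order term w used by the bias equation.
        Z = np.linalg.inv(np.eye(n_s) - Pd + np.outer(np.ones(n_s), pi))
        h = Z @ (rd - rho)                            # solves rho + h = r + Pd h
        w = -Z @ h                                    # solves h + w = Pd w
        # First-level improvement: the gain optimality equation.
        Q1 = R + P @ h                                # r(s,a) + sum_s' p(.|s,a) h
        ties = Q1 >= Q1.max(axis=1, keepdims=True) - tol   # gain-optimal actions
        # Second-level improvement among the ties: the bias equation.
        Q2 = P @ w
        Q2[~ties] = -np.inf
        d_new = Q2.argmax(axis=1)
        # Keep the incumbent action when it already attains both maxima,
        # so the iteration terminates instead of cycling among ties.
        ok = ties[np.arange(n_s), d] & (
            Q2[np.arange(n_s), d] >= Q2.max(axis=1) - tol)
        d_new[ok] = d[ok]
        if np.array_equal(d_new, d):
            break
        d = d_new
    return d, rho, h
```

A purely gain-optimal method would stop after the first-level improvement; the second level is what lets the procedure prefer, among policies with the same average reward, those that incur less transient cost, which is the distinction the abstract draws for the admission-control queue.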