A key open problem in reinforcement learning is to ensure convergence when a compact hypothesis class is used to approximate the value function. Although the standard temporal-difference learning algorithm has been shown to converge when the hypothesis class is the set of linear combinations of fixed basis functions, it may diverge with a general (non-linear) hypothesis class. This paper describes the Bridge algorithm, a new method for reinforcement learning, and shows that it converges to an approximate global optimum for any agnostically learnable hypothesis class. Convergence is demonstrated on a simple example for which temporal-difference learning fails. Weak conditions are identified under which the Bridge algorithm converges for any hypothesis class. Finally, connections are made between the complexity of reinforcement learning and the PAC-learnability of the hypothesis class.
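To make the convergent setting mentioned above concrete, the following is a minimal sketch of TD(0) with a linear hypothesis class (a linear combination of fixed basis functions), the case in which temporal-difference learning is known to converge. The two-state chain MDP and all parameter values here are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

def td0_linear(episodes=5000, alpha=0.05, gamma=0.9):
    """TD(0) with linear value-function approximation V(s) ~= phi(s) @ theta.

    Uses a deterministic two-state chain: state 0 -> state 1 (reward 0),
    then state 1 -> terminal (reward 1). One-hot features make this the
    tabular special case of a linear hypothesis class.
    """
    phi = np.eye(2)            # fixed basis functions (one-hot per state)
    theta = np.zeros(2)        # weight vector to learn
    for _ in range(episodes):
        # Transition 0 -> 1, reward 0: theta += alpha * TD-error * phi(s)
        delta = 0.0 + gamma * phi[1] @ theta - phi[0] @ theta
        theta += alpha * delta * phi[0]
        # Transition 1 -> terminal, reward 1 (terminal state has value 0)
        delta = 1.0 - phi[1] @ theta
        theta += alpha * delta * phi[1]
    return theta               # true values: V(1) = 1, V(0) = gamma = 0.9

theta = td0_linear()
```

With a linear (here tabular) hypothesis class, the weights converge to the true values; the divergence phenomenon the abstract refers to arises only once the hypothesis class is non-linear.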