Environment-Independent Reinforcement Acceleration

  • Authors:
  • Juergen Schmidhuber

  • Affiliations:
  • -

  • Venue:
  • -
  • Year:
  • 1995

Abstract

A reinforcement learning system with limited computational resources interacts with an unrestricted, unknown environment. Its goal is to maximize cumulative reward, to be obtained throughout its limited, unknown lifetime. The system's policy is an arbitrary modifiable algorithm mapping environmental inputs and internal states to outputs and new internal states. The problem is: in realistic, unknown environments, each policy modification process (PMP) occurring during the system's life may have an unpredictable influence on environmental states, rewards, and PMPs at any later time. Existing reinforcement learning algorithms cannot properly deal with this. Neither can naive exhaustive search among all policy candidates, not even in the case of very small search spaces. In fact, a reasonable way of measuring performance improvements in such general (but typical) situations is missing. I define such a measure based on the novel "reinforcement acceleration criterion" (RAC). At a given time, RAC is satisfied if the beginning of each completed PMP that computed a currently valid policy modification has been followed by a long-term acceleration of average reinforcement intake (the computation time for later PMPs is taken into account). I present a method called "environment-independent reinforcement acceleration" (EIRA) which is guaranteed to achieve RAC. EIRA cares neither whether the system's policy allows for changing itself, nor whether there are multiple, interacting learning systems. Consequences are: (1) a sound theoretical framework for "meta-learning" (because the success of a PMP recursively depends on the success of all later PMPs, for which it is setting the stage); (2) a sound theoretical framework for multi-agent learning. The principles have been implemented (1) in a single system using an assembler-like programming language to modify its own policy, and (2) in a system consisting of multiple agents, where each agent is in fact just a connection in a fully recurrent reinforcement learning neural net. A by-product of this research is a general reinforcement learning algorithm for such nets. Preliminary experiments illustrate the theory.
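
Illustrative sketch (not part of the original abstract): the RAC check described above can be read as a monotonicity test over the start times of the still-valid PMPs. The Python sketch below uses hypothetical names (Checkpoint, rac_satisfied) and simplified bookkeeping; the paper's own formulation also charges the computation time of later PMPs, which is only approximated here by measuring elapsed wall-clock time including all PMP computation.

    # Hypothetical illustration of the reinforcement acceleration criterion (RAC).
    # Class and function names are illustrative, not taken from the paper.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Checkpoint:
        """Start of a completed PMP whose policy modification is still valid."""
        start_time: float     # time at which the PMP began
        reward_so_far: float  # cumulative reward collected up to start_time

    def rac_satisfied(checkpoints: List[Checkpoint], now: float,
                      total_reward: float) -> bool:
        """RAC holds if average reward per time, measured from the beginning of
        each successive still-valid PMP, strictly increases: reward intake has
        accelerated since every valid policy modification was computed."""
        prev_rate = total_reward / now  # baseline: average reward over the lifetime so far
        for cp in sorted(checkpoints, key=lambda c: c.start_time):
            elapsed = now - cp.start_time
            if elapsed <= 0:
                return False  # a PMP with no elapsed time yet cannot show acceleration
            rate = (total_reward - cp.reward_so_far) / elapsed
            if rate <= prev_rate:
                return False  # no long-term acceleration since this PMP began
            prev_rate = rate
        return True

Under this reading, a method guaranteeing RAC (such as EIRA) would invalidate recent policy modifications whenever the check fails, restoring earlier policy components until the reward-per-time rates increase strictly again; this is an interpretation consistent with the abstract, not a description of the paper's exact procedure.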