Natural actor and belief critic: Reinforcement algorithm for learning parameters of dialogue systems modelled as POMDPs

Authors:
Filip Jurčíček; Blaise Thomson; Steve Young
Affiliations:
University of Cambridge, Cambridge, UK; University of Cambridge, Cambridge, UK; University of Cambridge, Cambridge, UK
Venue:
ACM Transactions on Speech and Language Processing (TSLP)
Year:
2011

Abstract

This article presents a novel algorithm for learning parameters in statistical dialogue systems which are modeled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy that selects the system's responses based on the inferred state; and a reward function that specifies the desired behavior of the system. Ideally both the model parameters and the policy would be designed to maximize the cumulative reward. However, while there are many techniques available for learning the optimal policy, no good ways of learning the optimal model parameters that scale to real-world dialogue systems have been found yet. The presented algorithm, called the Natural Actor and Belief Critic (NABC), is a policy gradient method that offers a solution to this problem. Based on observed rewards, the algorithm estimates the natural gradient of the expected cumulative reward. The resulting gradient is then used to adapt both the prior distribution of the dialogue model parameters and the policy parameters. In addition, the article presents a variant of the NABC algorithm, called the Natural Belief Critic (NBC), which assumes that the policy is fixed and only the model parameters need to be estimated. The algorithms are evaluated on a spoken dialogue system in the tourist information domain. The experiments show that model parameters estimated to maximize the expected cumulative reward result in significantly improved performance compared to the baseline hand-crafted model parameters. The algorithms are also compared to optimization techniques using plain gradients and state-of-the-art random search algorithms. In all cases, the algorithms based on the natural gradient work significantly better.
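
The abstract does not give the update equations, so the following is only a toy sketch of the general idea behind the NBC variant: sample model parameters from a prior, observe a reward, and estimate the natural gradient with respect to the prior's parameters by least-squares regression of rewards on score functions, in the style of episodic natural actor-critic methods. Everything specific here is an assumption made for illustration: the one-dimensional Gaussian prior with fixed `sigma`, the hypothetical `toy_reward` function standing in for a full dialogue, and the names `estimate_natural_gradient` and `train`. It is not the authors' implementation.

```python
import random


def estimate_natural_gradient(scores, rewards):
    """Fit rewards ~ w * score + c by least squares and return (w, c).

    In episodic natural-gradient methods the slope w serves as the
    natural-gradient estimate and c acts as a baseline.
    """
    n = len(scores)
    mean_s = sum(scores) / n
    mean_r = sum(rewards) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(scores, rewards))
    var = sum((s - mean_s) ** 2 for s in scores)
    w = cov / var
    c = mean_r - w * mean_s
    return w, c


def toy_reward(theta, rng):
    # Hypothetical stand-in for the cumulative reward of one dialogue:
    # highest when the sampled model parameter is close to 3.0.
    return -(theta - 3.0) ** 2 + rng.gauss(0.0, 0.1)


def train(iterations=200, episodes=50, sigma=0.5, step=0.1, seed=0):
    """Adapt the mean of an assumed Gaussian prior N(mu, sigma^2)."""
    rng = random.Random(seed)
    mu = 0.0
    for _ in range(iterations):
        scores, rewards = [], []
        for _ in range(episodes):
            theta = rng.gauss(mu, sigma)               # sample a model parameter
            scores.append((theta - mu) / sigma ** 2)   # d/dmu of log N(theta; mu, sigma^2)
            rewards.append(toy_reward(theta, rng))     # observe the reward
        w, _ = estimate_natural_gradient(scores, rewards)
        mu += step * w                                 # natural-gradient ascent step
    return mu


if __name__ == "__main__":
    print(train())
```

In this toy setting the regression slope approximates the plain gradient rescaled by the inverse Fisher information of the Gaussian prior, so the prior mean drifts toward the reward-maximizing value. Adapting policy parameters as well, as NABC does, would add the policy's score function as further regression features.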