MAP adaptation of stochastic grammars

  • Authors:
  • Michiel Bacchiani;Michael Riley;Brian Roark;Richard Sproat

  • Affiliations:
  • IBM TJ Watson Research Center, Rm. 24-124, 1101 Kitchawan Rd, Rt 134, Yorktown Heights, NY 10598, USA; Google Inc., 1440 Broadway, New York, NY 10018, USA; Center for Spoken Language Understanding, Department of CS&EE, OGI School of Science & Engineering at Oregon Health & Science University, 20000 NW Walker Road, Beaverton, OR 97006, USA; Departments of Linguistics and ECE, University of Illinois at Urbana-Champaign, Foreign Languages Building 4103, 707 South Mathews Avenue, MC-168, Urbana, IL 61801, USA

  • Venue:
  • Computer Speech and Language
  • Year:
  • 2006

Abstract

This paper investigates supervised and unsupervised adaptation of stochastic grammars, including n-gram language models and probabilistic context-free grammars (PCFGs), to a new domain. It is shown that the commonly used approaches of count merging and model interpolation are special cases of a more general maximum a posteriori (MAP) framework, which additionally allows for alternate adaptation approaches. We investigate the effectiveness of different adaptation strategies and, in particular, focus on the need for supervision in the adaptation process. We show that n-gram models as well as PCFGs benefit from either supervised or unsupervised MAP adaptation in various tasks. For n-gram models, we compare the benefit of supervised adaptation with that of unsupervised adaptation on a speech recognition task with an adaptation sample of limited size (about 17 h), and show that unsupervised adaptation obtains 51% of the 7.7% adaptation gain obtained by supervised adaptation. We also investigate the benefit of using multiple word hypotheses (in the form of a word lattice) for unsupervised adaptation on a speech recognition task for which a much larger adaptation sample was available. The use of word lattices for adaptation required the derivation of a generalization of the well-known Good-Turing estimate. Using this generalization, we derive a method that uses Monte Carlo sampling for building Katz backoff models. The adaptation results show that, for adaptation samples of limited size (several tens of hours), unsupervised adaptation on lattices gives a performance gain over using transcripts. The experimental results also show that with a very large adaptation sample (1050 h), the benefit from transcript-based adaptation matches that of lattice-based adaptation. Finally, we show that PCFG domain adaptation using the MAP framework provides gains in F-measure accuracy on a parsing task comparable to the ASR accuracy improvements seen with n-gram adaptation. Experimental results show that unsupervised adaptation provides 37% of the 10.35% gain obtained by supervised adaptation.
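
To make the MAP framework concrete, here is a minimal sketch in Python (not from the paper; the function name map_adapt_unigram, the toy counts, and the prior weight tau are illustrative assumptions). It adapts a unigram distribution by mixing an out-of-domain prior with in-domain adaptation counts; the count-merging and model-interpolation special cases mentioned in the abstract correspond to particular choices of tau.

    from collections import Counter

    def map_adapt_unigram(prior_counts, adapt_counts, tau):
        """MAP estimate of unigram probabilities under a Dirichlet prior
        centred on the out-of-domain model (a sketch, not the authors' code).

        prior_counts: Counter of out-of-domain counts (defines the prior).
        adapt_counts: Counter of in-domain adaptation counts.
        tau: prior weight; larger values trust the out-of-domain model more.
        """
        vocab = set(prior_counts) | set(adapt_counts)
        prior_total = sum(prior_counts.values())
        adapt_total = sum(adapt_counts.values())
        probs = {}
        for w in vocab:
            prior_prob = prior_counts[w] / prior_total if prior_total else 0.0
            # MAP estimate: (tau * prior_prob + in-domain count) / (tau + in-domain total).
            # Setting tau to the out-of-domain sample size reproduces plain count merging;
            # choosing tau relative to adapt_total yields linear model interpolation.
            probs[w] = (tau * prior_prob + adapt_counts[w]) / (tau + adapt_total)
        return probs

    # Toy example: a larger tau pulls estimates toward the out-of-domain distribution.
    out_domain = Counter({"the": 900, "stock": 50, "grammar": 50})
    in_domain = Counter({"the": 80, "grammar": 20})
    print(map_adapt_unigram(out_domain, in_domain, tau=10.0))
    print(map_adapt_unigram(out_domain, in_domain, tau=1000.0))

Varying tau moves the estimate between the unadapted out-of-domain model (large tau) and the maximum likelihood estimate on the adaptation sample alone (tau near zero), which is what allows both count merging and model interpolation to be expressed within the same MAP framework.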