Improving Acoustic Models with Captioned Multimedia Speech

  • Authors: Photina Jaeyun Jang; Alexander G. Hauptmann
  • Affiliations: Carnegie Mellon University; Carnegie Mellon University
  • Venue: ICMCS '99 Proceedings of the 1999 IEEE International Conference on Multimedia Computing and Systems - Volume 02
  • Year: 1999

Abstract

Speech recognition can be used to create searchable transcripts for audio indexing in digital video libraries. With current technology, building or improving the acoustic models of a highly accurate speech recognition system requires large amounts of hand-transcribed speech training data. We present a technique that uses closed-captioned television broadcasts as a source of large amounts of automatically extracted, accurately transcribed speech for improving acoustic models. The errorful closed-caption text is aligned with the (also errorful) speech recognition output, and the matching segments, together with the corresponding audio, are used as acoustic training data to improve the speech recognition system. Our technique automatically extracted 131.4 hours of transcribed speech and improved the word error rate of our currently best speech recognition system (Sphinx-III) from 32.82% to 31.19%. A speech recognizer trained exclusively on 70.7 hours of this automatically transcribed speech produced a word error rate of 32.7%.
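
The abstract does not specify the alignment algorithm, so the following is only a minimal sketch of the general idea: align two errorful word sequences (caption text and recognizer output) and keep long runs where they agree, on the assumption that extended agreement between two independent, errorful transcripts is very likely correct. All function names and parameters here are hypothetical, and Python's difflib stands in for whatever alignment the authors actually used.

```python
from difflib import SequenceMatcher

def matching_segments(caption_words, asr_words, min_run=5):
    """Align two errorful word sequences and keep runs where they agree.

    Only runs of at least `min_run` consecutive matching words are kept
    (a hypothetical threshold): long agreements between two independent,
    errorful transcripts are assumed to be reliable.
    """
    matcher = SequenceMatcher(a=caption_words, b=asr_words, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:
            # Record the start index in the ASR output plus the matched
            # words; in a real system the ASR index would be mapped back
            # to audio time stamps to cut out the corresponding acoustic
            # training segment.
            segments.append((block.b, asr_words[block.b:block.b + block.size]))
    return segments

caption = "the president said the economy is growing strongly this year".split()
asr_out = "the president said the economy is growing strongly this here".split()
print(matching_segments(caption, asr_out, min_run=5))
```

In this toy example, the two transcripts disagree only on the final word, so the single surviving segment is the ten-word matching prefix; in the paper's setting, such matched segments and their audio would feed acoustic model training.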