Improving Acoustic Models with Captioned Multimedia Speech

  • Authors: Photina Jaeyun Jang; Alexander G. Hauptmann
  • Affiliations: Carnegie Mellon University; Carnegie Mellon University
  • Venue: ICMCS '99 Proceedings of the 1999 IEEE International Conference on Multimedia Computing and Systems - Volume 02
  • Year: 1999

Abstract

Speech recognition can be used to create searchable transcripts for audio indexing in digital video libraries. With current technology, building or improving the acoustic models of a highly accurate speech recognition system requires large amounts of hand-transcribed speech training data. We present a technique that uses closed-captioned television broadcasts as a source of large amounts of automatically extracted, accurately transcribed speech for improving acoustic models. The errorful closed-caption text is aligned with the (also errorful) speech recognition output, and the matching segments, together with the corresponding audio, are used as acoustic training data to improve the speech recognition system. Our technique automatically extracted 131.4 hours of transcribed speech and improved the word error rate of our currently best speech recognition system (Sphinx-III) from 32.82% to 31.19%. A speech recognizer trained exclusively on 70.7 hours of this automatically transcribed speech produced a word error rate of 32.7%.
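
The abstract does not specify the alignment algorithm, so the following is only a minimal sketch of the general idea: align two errorful word sequences (caption text and recognizer output) and keep long runs where they agree, on the assumption that extended agreement between two independent, errorful transcripts is very likely correct. All function names and parameters here are hypothetical, and Python's difflib stands in for whatever alignment the authors actually used.

```python
from difflib import SequenceMatcher

def matching_segments(caption_words, asr_words, min_run=5):
    """Align two errorful word sequences and keep runs where they agree.

    Only runs of at least `min_run` consecutive matching words are kept
    (a hypothetical threshold): long agreements between two independent,
    errorful transcripts are assumed to be reliable.
    """
    matcher = SequenceMatcher(a=caption_words, b=asr_words, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:
            # Record the start index in the ASR output plus the matched
            # words; in a real system the ASR index would be mapped back
            # to audio time stamps to cut out the corresponding acoustic
            # training segment.
            segments.append((block.b, asr_words[block.b:block.b + block.size]))
    return segments

caption = "the president said the economy is growing strongly this year".split()
asr_out = "the president said the economy is growing strongly this here".split()
print(matching_segments(caption, asr_out, min_run=5))
```

In this toy example, the two transcripts disagree only on the final word, so the single surviving segment is the ten-word matching prefix; in the paper's setting, such matched segments and their audio would feed acoustic model training.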