Long-Term Temporal Features for Conversational Speech Recognition

  • Authors:
  • Barry Chen, Qifeng Zhu, Nelson Morgan

  • Affiliations:
  • International Computer Science Institute, Berkeley, CA (all authors)

  • Venue:
  • MLMI'04: Proceedings of the First International Conference on Machine Learning for Multimodal Interaction
  • Year:
  • 2004


Abstract

The automatic transcription of conversational speech, both from telephone and in-person interactions, is still an extremely challenging task. Our efforts to recognize speech from meetings are likely to benefit from any advances we achieve with conversational telephone speech, a topic of considerable focus in our research. Towards both of these ends, we have developed, in collaboration with our colleagues at SRI and IDIAP, techniques to incorporate long-term (~500 ms) temporal information using multi-layer perceptrons (MLPs). Much of this work builds on prior achievements at the former lab of Hynek Hermansky at the Oregon Graduate Institute (OGI), where the TempoRAl Pattern (TRAP) approach was developed. The contribution here is to present experiments showing: 1) that simply widening acoustic context by using more frames of full-band speech energies as input to the MLP is suboptimal compared to a more constrained two-stage approach that first focuses on long-term temporal patterns in each critical band separately and then combines them; 2) that the best two-stage approach studied uses the hidden activation values of MLPs trained on the log critical band energies (LCBEs) of 51 consecutive frames; and 3) that combining the best two-stage approach with conventional short-term features significantly reduces word error rates on the 2001 NIST Hub-5 conversational telephone speech (CTS) evaluation set, with models trained on the Switchboard Corpus.
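To make the two-stage arrangement described above concrete, here is a minimal sketch in PyTorch: a per-critical-band MLP processes 51 frames of log critical band energy (LCBE) trajectories, and the band-level hidden activations are concatenated and merged by a second MLP. The dimensions (`NUM_BANDS`, `HIDDEN`, `MERGER_HIDDEN`, `NUM_PHONES`) and the single end-to-end forward pass are illustrative assumptions, not the paper's configuration; in the approach the abstract describes, the band-level MLPs are trained first and only their hidden activations are passed to the merger.

```python
# Sketch of a two-stage TRAP-style network: band-level MLPs over long-term
# LCBE trajectories, followed by a merger MLP over their hidden activations.
# All layer sizes below are assumed for illustration.
import torch
import torch.nn as nn

NUM_BANDS = 15       # assumed number of critical bands
CONTEXT = 51         # frames of temporal context per band (~500 ms), per the abstract
HIDDEN = 60          # assumed hidden layer size of each band-level MLP
MERGER_HIDDEN = 300  # assumed hidden layer size of the merger MLP
NUM_PHONES = 46      # assumed number of phone targets

class BandMLP(nn.Module):
    """First stage: one MLP per critical band, fed 51 frames of that band's LCBEs."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(CONTEXT, HIDDEN), nn.Sigmoid())
        self.out = nn.Linear(HIDDEN, NUM_PHONES)  # used when training the band MLP

    def forward(self, x):
        h = self.hidden(x)  # hidden activations are the features passed to stage two
        return h, self.out(h)

class TwoStageTRAP(nn.Module):
    """Second stage: merger MLP over concatenated band-level hidden activations."""
    def __init__(self):
        super().__init__()
        self.bands = nn.ModuleList(BandMLP() for _ in range(NUM_BANDS))
        self.merger = nn.Sequential(
            nn.Linear(NUM_BANDS * HIDDEN, MERGER_HIDDEN), nn.Sigmoid(),
            nn.Linear(MERGER_HIDDEN, NUM_PHONES),
        )

    def forward(self, lcbe):
        # lcbe: (batch, NUM_BANDS, CONTEXT) log critical band energy trajectories
        hidden = [self.bands[b](lcbe[:, b, :])[0] for b in range(NUM_BANDS)]
        return self.merger(torch.cat(hidden, dim=1))

model = TwoStageTRAP()
posteriors = model(torch.randn(8, NUM_BANDS, CONTEXT))  # dummy batch of 8 frames
print(posteriors.shape)  # torch.Size([8, 46])
```

The constraint the sketch illustrates is the one the abstract's first finding turns on: rather than feeding all bands and frames into one wide MLP, each band's long-term trajectory is modeled separately before any cross-band combination.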