Multiple Regression of Log Spectra for In-Car Speech Recognition Using Multiple Distributed Microphones

  • Authors:
  • Weifeng Li;Tetsuya Shinde;Hiroshi Fujimura;Chiyomi Miyajima;Takanori Nishino;Katunobu Itou;Kazuya Takeda;Fumitada Itakura

  • Affiliations:
  • The authors are with the Department of Information Electronics, Graduate School of Engineering, Nagoya University, Nagoya-shi, 464-8603 Japan. E-mail: lee@itakura.nuee.nagoya-u.ac.jp; The authors are with the Department of Media Science, Graduate School of Information Science, Nagoya University, Nagoya-shi, 464-8603 Japan; The author is with the Faculty of Science and Technology, Meijo University, Nagoya-shi, 468-8502 Japan.

  • Venue:
  • IEICE - Transactions on Information and Systems
  • Year:
  • 2005

Abstract

This paper describes a new multi-channel method for noisy speech recognition, which estimates the log spectrum of speech at a close-talking microphone by multiple regression of the log spectra (MRLS) of noisy signals captured by distributed microphones. The advantages of the proposed method are as follows: 1) it requires neither a sensitive geometric layout, nor calibration of the sensors, nor additional pre-processing for tracking the speech source; 2) it requires very little computation; and 3) the regression weights can be statistically optimized over the given training data. Once the optimal regression weights have been obtained by regression learning, they can be used to generate the estimated log spectrum in the recognition phase, where the close-talking speech is no longer required. The performance of the proposed method is illustrated by speech recognition experiments on real in-car dialogue data. Compared with the nearest distant microphone and a multi-microphone adaptive beamformer, the proposed approach obtains relative word error rate (WER) reductions of 9.8% and 3.6%, respectively.
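
The regression idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an independent linear regression per frequency bin with a bias term, fitted by ordinary least squares, and the names `fit_mrls` and `apply_mrls` as well as the array layouts are hypothetical choices made for this illustration.

```python
import numpy as np


def fit_mrls(distant_log_spec, close_log_spec):
    """Learn regression weights from training data (illustrative sketch).

    distant_log_spec: array (frames, mics, bins), log spectra of the
        distributed microphones.
    close_log_spec: array (frames, bins), log spectrum of the
        close-talking microphone (needed only during training).
    Returns an array (bins, mics + 1); the last column is a bias term,
    which is an assumption of this sketch.
    """
    n_frames, n_mics, n_bins = distant_log_spec.shape
    weights = np.empty((n_bins, n_mics + 1))
    ones = np.ones((n_frames, 1))
    for k in range(n_bins):
        # Design matrix for bin k: one column per microphone plus a bias column.
        design = np.hstack([distant_log_spec[:, :, k], ones])
        solution, *_ = np.linalg.lstsq(design, close_log_spec[:, k], rcond=None)
        weights[k] = solution
    return weights


def apply_mrls(distant_log_spec, weights):
    """Estimate the close-talking log spectrum from distant microphones only."""
    # Weighted sum over microphones for every frame and bin, plus the bias.
    weighted = np.einsum("tmk,km->tk", distant_log_spec, weights[:, :-1])
    return weighted + weights[:, -1]
```

In the recognition phase only `apply_mrls` would be needed, which is a weighted sum per frequency bin; this is consistent with the low computational cost claimed in the abstract. The estimated log spectrum would then be passed on to feature extraction for the recognizer.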