Microphone array processing for robust speech recognition

  • Authors:
  • Michael L. Seltzer;Richard M. Stern

  • Venue:
  • Microphone array processing for robust speech recognition
  • Year:
  • 2003

Abstract

Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing followed by recognition. Array processing algorithms designed for signal enhancement are applied to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance. However, speech recognition systems are statistical pattern classifiers that process features derived from the speech waveform, not the waveform itself. An array processing algorithm can therefore only be expected to improve recognition if it maximizes, or at least increases, the likelihood of the correct hypothesis relative to other competing hypotheses. In this thesis, a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features that maximizes the likelihood of the correct hypothesis. In this approach, called Likelihood Maximizing Beamforming (LIMABEAM), information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Using LIMABEAM, significant improvements in recognition accuracy over conventional array processing approaches are obtained in moderately reverberant environments over a wide range of signal-to-noise ratios. However, only limited improvements are obtained in environments with more severe reverberation. To address this issue, a subband filtering approach to LIMABEAM is proposed, called Subband-Likelihood Maximizing Beamforming (S-LIMABEAM).
S-LIMABEAM employs a new subband filter-and-sum architecture that explicitly considers how the features used for recognition are computed. This enables S-LIMABEAM to achieve dramatically improved performance over the original LIMABEAM algorithm in highly reverberant environments. Because the algorithms in this thesis are data-driven, they do not require a priori knowledge of the room impulse response, nor any particular number of microphones or array geometry. To demonstrate this, LIMABEAM and S-LIMABEAM are evaluated using multiple array configurations and environments, including an array-equipped personal digital assistant (PDA) and a meeting room with a few tabletop microphones. In all cases, the proposed algorithms significantly outperform conventional array processing approaches.
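
For readers unfamiliar with the filter-and-sum structure mentioned in the abstract, the following is a minimal illustrative sketch and is not taken from the thesis. It assumes one FIR filter per microphone and computes y[n] as the sum over microphones m and taps p of h_m[p] * x_m[n - p]. The function names (`filter_and_sum`, `delay_and_sum_filters`) and the toy signals are hypothetical. The sketch only applies a given set of filters; the likelihood-based optimization of the filter taps that LIMABEAM performs is not shown.

```python
from typing import List, Sequence


def filter_and_sum(channels: Sequence[Sequence[float]],
                   filters: Sequence[Sequence[float]]) -> List[float]:
    """Apply one FIR filter per channel and sum the filtered channels.

    Computes y[n] = sum_m sum_p filters[m][p] * channels[m][n - p],
    treating samples before the start of each signal as zero.
    """
    if len(channels) != len(filters):
        raise ValueError("need exactly one FIR filter per microphone channel")
    n_samples = min(len(ch) for ch in channels)
    output = [0.0] * n_samples
    for signal, taps in zip(channels, filters):
        for n in range(n_samples):
            acc = 0.0
            for p, tap in enumerate(taps):
                if n - p >= 0:
                    acc += tap * signal[n - p]
            output[n] += acc
    return output


def delay_and_sum_filters(delays: Sequence[int]) -> List[List[float]]:
    """Build filters that reduce filter-and-sum to delay-and-sum.

    Each filter is a single impulse of height 1/M at the channel's
    integer delay (in samples), where M is the number of microphones.
    """
    n_mics = len(delays)
    length = max(delays) + 1
    return [[1.0 / n_mics if p == d else 0.0 for p in range(length)]
            for d in delays]


# Toy two-microphone example (assumed data): the source reaches mic1
# one sample earlier than mic0.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
mic0 = [0.0] + source[:-1]   # source delayed by one sample
mic1 = list(source)          # source with no delay

# Delay mic1 by one sample so that both channels line up before summing.
filters = delay_and_sum_filters([0, 1])
aligned = filter_and_sum([mic0, mic1], filters)
```

In this toy case the two aligned channels add coherently, so `aligned` reproduces the one-sample-delayed source. A data-driven method such as the one described in the abstract would instead adjust every tap of every filter, rather than fixing them to shifted impulses.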