Robust distant speech recognition by combining multiple microphone-array processing with position-dependent CMN

Authors:
Longbiao Wang;Norihide Kitaoka;Seiichi Nakagawa
Affiliations:
Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi-shi, Japan;Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi-shi, Japan;Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi-shi, Japan
Venue:
EURASIP Journal on Applied Signal Processing
Year:
2006

Abstract

We propose a robust distant speech recognition method that combines multiple microphone-array processing with position-dependent cepstral mean normalization (CMN). In the recognition stage, the system estimates the speaker position and selects the compensation parameters that were estimated a priori for that position. The system then applies CMN to the speech (i.e., position-dependent CMN) and performs speech recognition for each channel. The features obtained from the multiple channels are integrated in one of two ways. The first method uses the maximum vote or the maximum summation likelihood of the recognition results from multiple channels to obtain the final result; this is called multiple-decoder processing. The second method calculates the output probability of each input at the frame level, and a single decoder uses these output probabilities to perform speech recognition. This is called single-decoder processing, and it has a lower computational cost. We combine delay-and-sum beamforming with either multiple-decoder processing or single-decoder processing, and we term the combination multiple microphone-array processing. We evaluated the proposed method on a limited-vocabulary (100 words) distant isolated-word recognition task in a real environment. The proposed multiple microphone-array processing using multiple decoders with position-dependent CMN achieved a 3.2% improvement (a 50% relative error reduction) over delay-and-sum beamforming with conventional CMN (i.e., the conventional method). The multiple microphone-array processing using a single decoder needs about one-third of the computational time of that using multiple decoders, without degrading speech recognition performance.
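
The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the area labels, the integer sample delays, and all numeric values are assumptions made up for illustration. The sketch shows delay-and-sum beamforming with known integer delays, CMN using a mean stored a priori for the estimated position, and the two multiple-decoder combination rules (maximum vote and maximum summation likelihood).

```python
from collections import Counter


def delay_and_sum(channels, delays):
    """Align each channel by its (assumed known) integer sample delay and average."""
    n = min(len(x) - d for x, d in zip(channels, delays))
    return [
        sum(x[d + t] for x, d in zip(channels, delays)) / len(channels)
        for t in range(n)
    ]


def position_dependent_cmn(frames, position, mean_table):
    """Subtract the cepstral mean stored a priori for `position` from every frame."""
    mean = mean_table[position]
    return [[c - m for c, m in zip(frame, mean)] for frame in frames]


def maximum_vote(channel_words):
    """Return the word recognized by the most channels (ties go to the first seen)."""
    return Counter(channel_words).most_common(1)[0][0]


def maximum_summation_likelihood(channel_scores):
    """Sum each word's log-likelihood over all channels and return the best word."""
    totals = {}
    for scores in channel_scores:
        for word, log_likelihood in scores.items():
            totals[word] = totals.get(word, 0.0) + log_likelihood
    return max(totals, key=totals.get)


# Hypothetical data: two microphones, the second one lagging by one sample.
beamformed = delay_and_sum([[0, 1, 2, 3, 0], [0, 0, 1, 2, 3]], delays=[0, 1])

# Hypothetical table of cepstral means, one entry per pre-measured area.
mean_table = {"area_1": [1.0, -0.5], "area_2": [0.2, 0.3]}
normalized = position_dependent_cmn([[1.5, 0.0], [0.5, -1.0]], "area_1", mean_table)

# Hypothetical per-channel recognition results and log-likelihoods.
voted = maximum_vote(["tokyo", "kyoto", "tokyo", "osaka"])
summed = maximum_summation_likelihood([
    {"tokyo": -10.0, "kyoto": -9.0},
    {"tokyo": -8.0, "kyoto": -12.0},
    {"tokyo": -11.0, "kyoto": -10.5},
])
```

In this sketch the compensation parameter is reduced to a single cepstral mean vector per area, and the per-channel recognizers are replaced by hand-written scores. Single-decoder processing is not shown, because the abstract does not state how the frame-level output probabilities are combined.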