Temporal AM-FM combination for robust speech recognition

  • Authors:
  • Yotaro Kubo; Shigeki Okawa; Akira Kurematsu; Katsuhiko Shirai

  • Affiliations:
  • Department of Computer Science, Waseda University, 3-4-1 Ohkubo, Shinjuku, Tokyo 169-8555, Japan; Chiba Institute of Technology, 2-17-1 Tsudanuma, Narashino, Chiba 275-0016, Japan

  • Venue:
  • Speech Communication
  • Year:
  • 2011

Abstract

A novel method for extracting features from the frequency modulation (FM) of speech signals is proposed for robust speech recognition. To fully exploit multistream speech recognizers, each stream should compensate for the shortcomings of the other streams. In this light, FM features are promising complements to amplitude modulation (AM) features. To extract effective features from FM patterns, the proposed method performs data-driven modulation analysis of the instantaneous frequency. By evaluating the frequency responses of the temporal filters obtained by the proposed method, we confirmed that modulation around 4 Hz is important for discriminating FM patterns, as is the case for AM features. We evaluated the robustness of our method in noisy speech recognition experiments and confirmed that the FM features improve the noise robustness of speech recognizers even when they are not combined with conventional AM and/or spectral envelope features. We also performed multistream speech recognition experiments. The results show that the combination of the conventional AM system and the proposed FM system reduced the word error rate at 10 dB SNR by 43.6% relative to the baseline MFCC system and by 20.2% relative to the conventional AM system. We investigated the complementarity of the AM and FM features in speech recognition experiments with artificial noisy environments and found the FM features to be robust to wide-band noise, which markedly degrades the performance of AM features. Further, we evaluated the effect of multiconditional training: although it degraded the performance of the proposed combination method, it improved the performance of the proposed FM method. Through this series of experiments, we confirmed that the FM features can be used both as independent features and as complementary features.
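The abstract describes the FM feature chain only at a high level, so the following is a minimal, illustrative sketch rather than the authors' method: it estimates the instantaneous frequency of one sub-band from the analytic signal and band-passes the resulting trajectory around the ~4 Hz modulation range reported as discriminative. The fixed Butterworth filters, the frame-energy feature, the function names, and the toy FM test signal are all assumptions made for illustration; the paper's temporal filters are learned from data, not hand-picked.

    # Hedged sketch: sub-band instantaneous frequency -> ~4 Hz modulation band-pass.
    # Not the authors' data-driven method; fixed filters are used for illustration only.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    def subband_instantaneous_frequency(x, fs, band):
        """Band-pass the signal, then estimate instantaneous frequency (Hz)
        from the phase derivative of the analytic signal."""
        sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
        analytic = hilbert(sosfiltfilt(sos, x))
        phase = np.unwrap(np.angle(analytic))
        return np.diff(phase) * fs / (2 * np.pi)

    def fm_modulation_feature(x, fs, band, mod_band=(2.0, 8.0), frame_hop=0.01):
        """Keep slow FM fluctuations around the syllabic (~4 Hz) rate and
        summarize them as log frame energies (a crude stand-in feature)."""
        inst_freq = subband_instantaneous_frequency(x, fs, band)
        sos = butter(2, mod_band, btype="bandpass", fs=fs, output="sos")
        fm_traj = sosfiltfilt(sos, inst_freq - inst_freq.mean())
        hop = int(frame_hop * fs)
        n_frames = len(fm_traj) // hop
        frames = fm_traj[: n_frames * hop].reshape(n_frames, hop)
        return np.log(np.mean(frames ** 2, axis=1) + 1e-10)

    if __name__ == "__main__":
        fs = 16000
        t = np.arange(fs) / fs
        # Toy FM tone: carrier 1000 Hz, frequency deviation 50 Hz at a 4 Hz rate.
        x = np.cos(2 * np.pi * 1000 * t + (50 / 4) * np.sin(2 * np.pi * 4 * t))
        feats = fm_modulation_feature(x, fs, band=(800, 1200))
        print(feats.shape)

In a full front end one would compute such trajectories for many sub-bands and, as in the paper, learn the temporal (modulation) filters from data instead of fixing them by hand; the sketch only shows why a band-pass centered near 4 Hz isolates the modulation range the abstract highlights.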