Robust speech detection in real acoustic backgrounds with perceptually motivated features

  • Authors:
  • Jörg-Hendrik Bach;Jörn Anemüller;Birger Kollmeier

  • Affiliations:
  • Medical Physics Department, University of Oldenburg, 26111 Oldenburg, Germany;Medical Physics Department, University of Oldenburg, 26111 Oldenburg, Germany;Medical Physics Department, University of Oldenburg, 26111 Oldenburg, Germany

  • Venue:
  • Speech Communication
  • Year:
  • 2011

Abstract

The current study presents an analysis of the robustness of a speech detector in real background sounds. One of the most important aspects of automatic speech/nonspeech classification is robustness in the presence of strongly varying external conditions. These include variations of the signal-to-noise ratio (SNR) as well as fluctuations of the background noise. These variations are systematically evaluated by choosing different mismatched conditions between training and testing of the speech/nonspeech classifiers. The detection performance of the classifier with respect to these mismatched conditions is used as a measure of robustness and generalisation. The generalisation towards untrained SNR conditions and unknown background noises is evaluated and compared to a matched baseline condition. The classifier consists of a feature front-end, which computes amplitude modulation spectral features (AMS), and a support vector machine (SVM) back-end. The AMS features are based on Fourier decomposition over time of short-term spectrograms. Mel-frequency cepstral coefficients (MFCC) as well as relative spectral features (RASTA) based on perceptual linear prediction (PLP) serve as baselines. The results show that RASTA-filtered PLP features perform best in the matched task. In the generalisation tasks, however, the AMS features emerge as more robust in most cases, while MFCC features are outperformed by both other feature types. In a second set of experiments, a hierarchical approach is analysed which employs a background classification step prior to the speech/nonspeech classifier in order to improve the robustness of the detection scores in novel backgrounds. The background sounds used are recorded in typical everyday scenarios. The hierarchy provides a benefit in overall performance if the robust AMS features are employed.
The generalisation capabilities of the hierarchy towards novel backgrounds and SNRs are found to be optimal when a limited number of training backgrounds is used (compared to the inclusion of all available background data). The best backgrounds in terms of generalisation capabilities are found to be those in which some component of speech (such as unintelligible background babble) is present, which corroborates the hypothesis that the AMS features provide a decomposition of signals which is by itself very suitable for training very general speech/nonspeech detectors. This is also supported by the finding that the SVMs combined with RASTA-PLPs require nonlinear kernels to reach a performance similar to that of the AMS patterns with linear kernels.
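
The abstract describes the AMS features only as a "Fourier decomposition over time of short-term spectrograms". The following is a minimal sketch of that idea, not the authors' implementation. The frame length, hop size, equal-width band pooling, Hann window, sample rate, and all function names are assumptions chosen for illustration; the paper's front-end may well use different (for example perceptually spaced) bands and parameters.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes of the non-negative-frequency DFT bins of a real sequence."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def band_envelopes(signal, frame_len=64, hop=32, n_bands=4):
    """Short-term spectrogram pooled into n_bands equal-width bands.

    Returns one list per band holding that band's energy in each frame.
    Equal-width pooling is an illustrative assumption.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]  # Hann window
    envelopes = [[] for _ in range(n_bands)]
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        mags = dft_magnitudes(frame)[1:]  # drop the DC bin
        width = len(mags) // n_bands
        for b in range(n_bands):
            band = mags[b * width:(b + 1) * width]
            envelopes[b].append(sum(m * m for m in band))
    return envelopes


def ams_pattern(signal, **kwargs):
    """Modulation spectrum per band: DFT over time of each band envelope."""
    pattern = []
    for env in band_envelopes(signal, **kwargs):
        mean = sum(env) / len(env)  # remove the constant offset first
        pattern.append(dft_magnitudes([e - mean for e in env]))
    return pattern


# Usage: a 500 Hz tone, amplitude-modulated at 10 Hz (hypothetical 8 kHz rate).
fs = 8000
n_samples = 64 + 49 * 32  # yields exactly 50 frames with the defaults
tone = [
    (1 + 0.8 * math.cos(2 * math.pi * 10 * t / fs))
    * math.sin(2 * math.pi * 500 * t / fs)
    for t in range(n_samples)
]
pattern = ams_pattern(tone)
low_band = pattern[0]  # the band containing the 500 Hz carrier
peak_bin = max(range(1, len(low_band)), key=lambda k: low_band[k])
frame_rate = fs / 32  # frames per second for hop=32
peak_mod_hz = peak_bin * frame_rate / 50  # modulation frequency of the peak
```

In a detector of the kind described, each such band-by-modulation-frequency pattern would be flattened into a feature vector and passed to the SVM back-end. The pure-Python DFT keeps the sketch dependency-free but is slow; an FFT routine would replace it in practice.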