Environmental sound recognition with time-frequency audio features

  • Authors:
  • Selina Chu
  • Shrikanth Narayanan
  • C.-C. Jay Kuo

  • Affiliations:
  • Department of Computer Science, Signal and Image Processing Institute, University of Southern California, Los Angeles, CA
  • Ming Hsieh Department of Electrical Engineering, Department of Computer Science, Signal and Image Processing Institute, University of Southern California, Los Angeles, CA
  • Ming Hsieh Department of Electrical Engineering, Department of Computer Science, Signal and Image Processing Institute, University of Southern California, Los Angeles, CA

  • Venue:
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Year:
  • 2009

Abstract

The paper considers the task of recognizing environmental sounds for understanding the scene or context surrounding an audio sensor. A variety of features have been proposed for audio recognition, including the popular Mel-frequency cepstral coefficients (MFCCs), which describe the audio spectral shape. Environmental sounds, such as insect chirps and rain, are typically noise-like with a broad, flat spectrum, yet may carry strong temporal-domain signatures. However, few temporal-domain features have previously been developed to characterize such diverse audio signals. Here, we perform an empirical feature analysis for audio environment characterization and propose using the matching pursuit (MP) algorithm to obtain effective time-frequency features. The MP-based method selects features from a dictionary of atoms, yielding a flexible, intuitive, and physically interpretable feature set. The MP-based features are adopted to supplement the MFCC features and achieve higher recognition accuracy for environmental sounds. Extensive experiments, including listening tests that study human recognition capabilities, demonstrate the effectiveness of these joint features for unstructured environmental sound classification. Our recognition system is shown to produce performance comparable to that of human listeners.
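To illustrate the core idea, below is a minimal sketch of matching pursuit over a Gabor-style time-frequency dictionary: at each iteration the atom most correlated with the residual is selected, and its parameters (frequency, scale) plus coefficient serve as features. The dictionary grid, atom parameterization, and iteration count here are illustrative assumptions for a toy example, not the authors' exact setup.

```python
import numpy as np

def gabor_atom(n, freq, scale, shift):
    """Unit-norm Gaussian-windowed cosine atom of length n.
    freq is in cycles per sample; scale is the Gaussian width; shift is the center."""
    t = np.arange(n)
    g = np.exp(-0.5 * ((t - shift) / scale) ** 2) * np.cos(2 * np.pi * freq * t)
    return g / np.linalg.norm(g)

def build_dictionary(n):
    """Small illustrative grid of (frequency, scale, shift) atoms."""
    atoms, params = [], []
    for freq in (0.01, 0.05, 0.1, 0.2):
        for scale in (8, 32, 128):
            for shift in range(0, n, n // 8):
                atoms.append(gabor_atom(n, freq, scale, shift))
                params.append((freq, scale, shift))
    return np.array(atoms), params

def matching_pursuit(signal, atoms, params, n_iter=5):
    """Greedy MP: repeatedly subtract the best-matching atom from the residual.
    Returns the selected (freq, scale, coefficient) triples and the final residual."""
    residual = signal.astype(float).copy()
    features = []
    for _ in range(n_iter):
        corr = atoms @ residual              # inner product with every atom
        k = int(np.argmax(np.abs(corr)))     # best-matching atom
        coeff = corr[k]
        residual -= coeff * atoms[k]         # remove its contribution
        freq, scale, _ = params[k]
        features.append((freq, scale, coeff))
    return features, residual
```

Each selected atom is directly interpretable: its frequency and scale describe where the signal's energy lies in time-frequency, which is what makes MP-derived features complementary to spectral-envelope descriptors like MFCCs.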