Latent variable framework for modeling and separating single-channel acoustic sources

  • Authors:
  • Barbara G. Shinn-Cunningham;Madhusudana Shashanka

  • Affiliations:
  • Boston University;Boston University

  • Venue:
  • Latent variable framework for modeling and separating single-channel acoustic sources
  • Year:
  • 2008

Abstract

Auditory Scene Analysis refers to the human ability to extract different perceptual objects from a sound mixture. Replicating this ability in artificial systems has been an active area of research, concerned both with how one characterizes acoustic sources and with how one separates them from mixtures. The focus of this thesis is to develop models and algorithms that provide a framework to address these questions. The framework comprises latent variable models that employ hidden variables to model unobservable quantities. Such models are appropriate for obtaining representations of data that make hidden structure explicit. This work shows how one can utilize these ideas for the problem of source separation using single-channel audio signals. The proposed framework focuses on learning the time-frequency (TF) structure in a data-driven manner. TF representations of sounds are modeled by treating the energy in every TF bin as histogram counts of multiple draws. This formulation allows the extraction of the characteristic frequency structure of individual sources as latent components and models the sources as additive combinations of these components. The framework is then extended to incorporate the idea of sparse coding to overcome an important limitation of the basic model: an upper bound on the number of extractable components. Sparsity, imposed in the form of an entropic prior distribution, allows extraction of overcomplete sets of components that are more expressive and better characterize the sources. The statistical foundation of the framework makes it amenable to other extensions where known or hypothesized structure about the data can be easily incorporated by imposing appropriate prior distributions. Theoretical analyses of the proposed methods and algorithms for parameter inference are presented. Applications of the models to real-world problems are evaluated and discussed.
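The histogram view of a TF representation can be illustrated with a small expectation-maximization sketch. It is not taken from the thesis: it assumes the common mixture-of-multinomials form P_t(f) = sum_z P(f|z) P_t(z), where P(f|z) is the frequency structure of latent component z and P_t(z) is its weight in frame t. The function name `fit_latent_components`, its parameters, and the toy matrix `V` are hypothetical, and the entropic sparsity prior is left out.

```python
import random


def fit_latent_components(V, n_components, n_iter=200, seed=0):
    """Fit P_t(f) ~= sum_z P(f|z) P_t(z) to a nonnegative matrix V[f][t] by EM.

    Assumed model form (a sketch, not the thesis's exact algorithm).
    Returns (basis, weights) with basis[z][f] = P(f|z), weights[t][z] = P_t(z).
    """
    rng = random.Random(seed)
    n_freq, n_frames = len(V), len(V[0])

    def normalized(values):
        total = sum(values)
        return [v / total for v in values]

    # Random positive starting point for both sets of distributions.
    basis = [normalized([rng.random() + 0.1 for _ in range(n_freq)])
             for _ in range(n_components)]
    weights = [normalized([rng.random() + 0.1 for _ in range(n_components)])
               for _ in range(n_frames)]

    for _ in range(n_iter):
        # Tiny floor keeps every probability strictly positive.
        new_basis = [[1e-12] * n_freq for _ in range(n_components)]
        new_weights = [[1e-12] * n_components for _ in range(n_frames)]
        for t in range(n_frames):
            for f in range(n_freq):
                if V[f][t] == 0:
                    continue
                # E-step: posterior P(z | f, t) for this TF bin.
                joint = [weights[t][z] * basis[z][f]
                         for z in range(n_components)]
                total = sum(joint)
                for z in range(n_components):
                    # M-step accumulators: counts in bin (f, t) credited to z.
                    share = V[f][t] * joint[z] / total
                    new_basis[z][f] += share
                    new_weights[t][z] += share
        basis = [normalized(row) for row in new_basis]
        weights = [normalized(row) for row in new_weights]
    return basis, weights


# Toy "spectrogram": 4 frequency bins, 3 frames, built from two components.
V = [[4, 0, 2],
     [4, 0, 2],
     [0, 6, 3],
     [0, 6, 3]]
basis, weights = fit_latent_components(V, n_components=2)
```

Each column of `V` is treated as a histogram over frequency, so the fitted `basis` rows and `weights` rows are probability distributions, and frame t is modeled as the additive combination `sum(weights[t][z] * basis[z][f])`.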
The latent components learned from acoustic sources are used in a supervised setting for source separation and in a semi-supervised setting for denoising. Unlike approaches based on time-frequency masks that reconstruct partial spectral descriptions of sources by identifying time-frequency bins in which a source dominates, this approach reconstructs entire spectral descriptions of all sources. Various experimental results demonstrate the utility of the proposed framework.
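The supervised setting can be sketched as follows, under stated assumptions rather than as the thesis's exact procedure: component distributions P(f|z) are taken as already learned per source and held fixed, only the per-frame weights are estimated from the mixture, and each source is then rebuilt over all frequency bins from its own components. The function name `separate_sources`, its arguments, and the toy data are hypothetical.

```python
def separate_sources(mixture, bases_by_source, n_iter=100):
    """Rebuild a full spectrum for every source from a mixture[f][t].

    bases_by_source[s] is a list of fixed distributions P(f|z) for source s
    (assumed to come from training data). Returns estimates[s][f][t].
    """
    n_freq, n_frames = len(mixture), len(mixture[0])
    # Flatten the bases and remember which source owns each component.
    bases = [b for source in bases_by_source for b in source]
    owner = [s for s, source in enumerate(bases_by_source) for _ in source]
    n_comp = len(bases)
    estimates = [[[0.0] * n_frames for _ in range(n_freq)]
                 for _ in bases_by_source]

    for t in range(n_frames):
        frame = [mixture[f][t] for f in range(n_freq)]
        energy = sum(frame)
        if energy == 0:
            continue
        w = [1.0 / n_comp] * n_comp  # per-frame weights P_t(z)
        for _ in range(n_iter):
            credit = [1e-12] * n_comp
            for f in range(n_freq):
                if frame[f] == 0:
                    continue
                joint = [w[z] * bases[z][f] for z in range(n_comp)]
                total = sum(joint)
                if total == 0:
                    continue  # no component can explain this bin
                for z in range(n_comp):
                    credit[z] += frame[f] * joint[z] / total
            norm = sum(credit)
            w = [c / norm for c in credit]
        # Every source gets a value in every bin, not only where it dominates.
        for z in range(n_comp):
            for f in range(n_freq):
                estimates[owner[z]][f][t] += energy * w[z] * bases[z][f]
    return estimates


# Toy example: one component per source, overlapping in the middle bins.
basis_a = [0.6, 0.3, 0.1, 0.0]
basis_b = [0.0, 0.1, 0.3, 0.6]
# One frame: 10 units of source A plus 20 units of source B.
mixture = [[10 * a + 20 * b] for a, b in zip(basis_a, basis_b)]
estimates = separate_sources(mixture, [[basis_a], [basis_b]])
```

In the overlapping bins both sources receive a share of the energy, which is the contrast with a binary time-frequency mask that would assign each bin wholly to one source.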