Computational auditory induction as a missing-data model-fitting problem with Bregman divergence

  • Authors:
  • Jonathan Le Roux; Hirokazu Kameoka; Nobutaka Ono; Alain de Cheveigné; Shigeki Sagayama

  • Affiliations:
  • Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8656, Japan and NTT Communication Science Laboratories, NTT Corporation, 3-1, Mor ...; NTT Communication Science Laboratories, NTT Corporation, 3-1, Morinosato Wakamiya, Atsugi, Kanagawa 243-0198, Japan; Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8656, Japan; Centre National de la Recherche Scientifique, Université Paris 5, and Ecole Normale Supérieure, 29 rue d'Ulm, 75230 Paris Cedex 05, France; Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

  • Venue:
  • Speech Communication
  • Year:
  • 2011

Abstract

The human auditory system has the ability, known as auditory induction, to estimate the missing parts of a continuous auditory stream briefly covered by noise and to perceptually resynthesize them. In this article, we formulate this ability as a model-based spectrogram analysis and clustering problem with missing data, show how to solve it using an auxiliary function method, and explain how this method is generally related to the expectation-maximization (EM) algorithm for a certain class of divergence measures called Bregman divergences, thus enabling the use of prior distributions on the parameters. We illustrate how our method can be used to simultaneously analyze a scene and estimate missing information with two algorithms. The first, based on non-negative matrix factorization (NMF), performs analysis of polyphonic multi-instrumental musical pieces; our method allows this algorithm to cope with gaps within the audio data, estimating the timbre of the instruments and their pitch, and reconstructing the missing parts. The second, based on harmonic-temporal clustering (HTC), a recently introduced technique for the analysis of complex acoustical scenes, enables us to perform robust fundamental frequency estimation from incomplete speech data.
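To make the missing-data NMF idea concrete, the following is a minimal sketch, not the authors' implementation: a binary mask marks observed spectrogram cells, and mask-weighted multiplicative updates minimize the Euclidean cost (one member of the Bregman divergence family) over observed cells only, so that the low-rank product `W @ H` extrapolates into the gaps. The function name, rank, and iteration count are illustrative assumptions.

```python
import numpy as np

def masked_nmf(V, mask, rank=4, n_iter=500, eps=1e-9):
    """NMF with missing data.

    V    : non-negative (freq x time) magnitude spectrogram
    mask : same shape; 1.0 where V is observed, 0.0 where it is missing
    Returns non-negative factors W (freq x rank) and H (rank x time);
    W @ H provides a reconstruction that also fills the masked gaps.
    """
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    Vm = mask * V  # missing cells contribute nothing to the cost
    for _ in range(n_iter):
        # Standard multiplicative updates for the Euclidean (Frobenius)
        # cost, with the mask weighting both numerator and denominator.
        WH = W @ H
        W *= (Vm @ H.T) / ((mask * WH) @ H.T + eps)
        WH = W @ H
        H *= (W.T @ Vm) / (W.T @ (mask * WH) + eps)
    return W, H
```

Because the updates are multiplicative and start from positive values, W and H stay non-negative throughout; the same masking trick carries over to the generalized Kullback-Leibler and other Bregman costs by weighting their respective update rules.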