The human auditory system has the ability, known as auditory induction, to estimate the missing parts of a continuous auditory stream briefly covered by noise and to perceptually resynthesize them. In this article, we formulate this ability as a model-based spectrogram analysis and clustering problem with missing data, show how to solve it using an auxiliary function method, and explain how this method relates in general to the expectation-maximization (EM) algorithm for a certain class of divergence measures called Bregman divergences, thus enabling the use of prior distributions on the parameters. We illustrate how our method can be used to simultaneously analyze a scene and estimate missing information with two algorithms. The first, based on non-negative matrix factorization (NMF), performs analysis of polyphonic multi-instrumental musical pieces; our method allows this algorithm to cope with gaps within the audio data, estimating the timbre of the instruments and their pitch, and reconstructing the missing parts. The second, based on a recently introduced technique for the analysis of complex acoustical scenes called harmonic-temporal clustering (HTC), enables us to perform robust fundamental frequency estimation from incomplete speech data.
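To make the NMF-with-missing-data idea concrete, the sketch below shows one common way such a scheme can be realized: weighted (masked) multiplicative updates for KL-divergence NMF, in which missing spectrogram entries are simply excluded from the updates and later filled in from the low-rank model. This is a minimal illustration under that assumption, not the authors' exact algorithm; the function name `masked_nmf` and all parameters are hypothetical.

```python
import numpy as np

def masked_nmf(V, M, K, n_iter=200, eps=1e-9, seed=0):
    """KL-divergence NMF with a binary observation mask.

    V : nonnegative magnitude spectrogram, shape (F, T)
    M : mask of the same shape, 1 where V is observed, 0 where missing
    K : number of NMF components (e.g., notes or instrument templates)
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps  # spectral templates
    H = rng.random((K, T)) + eps  # temporal activations
    for _ in range(n_iter):
        R = W @ H + eps
        # Multiplicative KL updates restricted to observed entries:
        # missing cells contribute nothing because M zeros them out.
        W *= ((M * V / R) @ H.T) / (M @ H.T + eps)
        R = W @ H + eps
        H *= (W.T @ (M * V / R)) / (W.T @ M + eps)
    return W, H

# Reconstruction: the model W @ H provides estimates for the
# masked-out (missing) time-frequency cells.
```

In this formulation the gaps are imputed by the product `W @ H`, so the quality of the fill-in depends on how well the observed data constrain the templates and activations, which is the role the analysis (timbre and pitch estimation) plays in the article.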