Determining mixing parameters from multispeaker data using speech-specific information

Authors:
B. Yegnanarayana; R. Kumara Swamy; K. Sri Rama Murty
Affiliations:
International Institute of Information Technology, Hyderabad, India; Department of Electronics and Communications Engineering, Siddaganga Institute of Technology, Tumkur, India; Department of Computer Science and Engineering, Indian Institute of Technology-Madras, Chennai, India
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2009

Abstract

In this paper, we propose an approach for processing multispeaker speech signals collected simultaneously using a pair of spatially separated microphones in a real room environment. The spatial separation of the microphones results in a fixed time-delay of arrival of the speech signal from a given speaker at the pair of microphones. These time-delays are estimated by exploiting the impulse-like characteristic of the excitation during speech production. The differences in the time-delays for different speakers are used to determine the number of speakers from the mixed multispeaker speech signals. The signal levels at the two microphones also differ, because each speaker is at a different distance from each microphone, and these level differences determine the values of the mixing parameters. Knowledge of speech production, especially the excitation source characteristics, is used to derive an approximate weight function for locating the regions specific to a given speaker. Scatter plots of the weighted and delay-compensated mixed speech signals are then used to estimate the mixing parameters. The proposed method is applied to data collected in an actual laboratory environment for an underdetermined case, where the number of speakers exceeds the number of microphones. Enhancement of the speech of an individual speaker using the estimated time-delays and mixing parameters is also examined, and is evaluated using objective measures proposed in the literature.
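The two quantities named in the abstract, the inter-microphone time-delay and the level ratio that sets a mixing parameter, can be illustrated with a minimal single-speaker sketch. This is not the authors' method: plain cross-correlation stands in for their excitation-based delay estimator, a least-squares slope stands in for reading the scatter plot, and the speaker-specific weight function is omitted. The function names `estimate_delay` and `estimate_gain` and the synthetic impulse-train signal and its parameters are all hypothetical, chosen only for illustration.

```python
import random


def estimate_delay(x1, x2, max_lag):
    """Return the lag (in samples) at which x2 best aligns with x1.

    Assumption: plain cross-correlation, not the paper's estimator.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(x1)
    for lag in range(-max_lag, max_lag + 1):
        # Correlate x1[t] with x2[t + lag] over the overlapping region.
        score = sum(x1[t] * x2[t + lag]
                    for t in range(max(0, -lag), min(n, n - lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def estimate_gain(x1, x2, lag):
    """Least-squares slope of the delay-compensated scatter (x1[t], x2[t + lag])."""
    n = len(x1)
    pairs = [(x1[t], x2[t + lag])
             for t in range(max(0, -lag), min(n, n - lag))]
    num = sum(a * b for a, b in pairs)
    den = sum(a * a for a, _ in pairs)
    return num / den


# Hypothetical test signal: a sparse, impulse-like excitation from one speaker.
random.seed(0)
n, true_delay, true_gain = 400, 5, 0.6
source = [random.uniform(0.5, 1.0) if t % 37 == 0 else 0.0 for t in range(n)]

# Microphone 2 receives a delayed, attenuated copy of what microphone 1 receives.
mic1 = source
mic2 = [true_gain * source[t - true_delay] if t >= true_delay else 0.0
        for t in range(n)]

lag = estimate_delay(mic1, mic2, max_lag=20)
gain = estimate_gain(mic1, mic2, lag)
print(lag, round(gain, 3))
```

With several simultaneous speakers, each speaker would contribute its own delay and its own slope in the scatter, which is why the abstract relies on a weight function to isolate speaker-specific regions before estimating the mixing parameters.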