Speech activity detection for multi-party conversation analyses based on likelihood ratio test on spatial magnitude

Authors:
Kentaro Ishizuka;Shoko Araki;Tatsuya Kawahara
Affiliations:
NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan;NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan;Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010


Abstract

This paper proposes a microphone array-based speech activity detection (SAD) method for analyzing multiparty conversations recorded in the presence of noise. In particular, the proposed method targets conversations in which the number of speakers and their locations cannot be restricted, such as conversations held while standing or at poster sessions. In recordings of such conversations, directional noise sources and diffuse noise degrade the direction-of-arrival estimates of the target speech signals. To detect speech activity without a priori knowledge about the speakers and noise environments, a likelihood ratio test (LRT)-based SAD method is applied to spatial magnitudes, which are estimated by time-frequency masking of the observed spectra. The proposed method can exploit the enhanced signals obtained from time-frequency masking, and it works even in the presence of environmental noise. Experiments with recorded simulated poster sessions confirmed that the proposed method outperformed conventional methods based on the LRT for a single channel, magnitude coherence, or crosspower spectrum phase.
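
To illustrate the kind of likelihood ratio test the abstract refers to, the following is a minimal sketch of a generic frame-level LRT under a complex Gaussian model of the spectral bins. It is not the authors' spatial-magnitude method: the abstract does not specify the statistical model, so the Gaussian assumption, the maximum-likelihood estimate of the a priori SNR, the threshold value, and the function names `frame_log_likelihood_ratio` and `detect_speech` are all assumptions made for illustration. In the proposed method, the input would presumably be magnitudes obtained after time-frequency masking rather than raw single-channel power spectra.

```python
import math


def frame_log_likelihood_ratio(power, noise_power, floor=1e-12):
    """Mean per-bin log likelihood ratio (speech vs. noise only) for one frame.

    power       -- observed power per frequency bin for this frame
    noise_power -- estimated noise power per frequency bin (assumed known here)
    """
    total = 0.0
    for p, n in zip(power, noise_power):
        gamma = p / max(n, floor)      # a posteriori SNR
        xi = max(gamma - 1.0, 0.0)     # ML estimate of the a priori SNR (assumption)
        # Per-bin log likelihood ratio under the Gaussian model
        total += gamma * xi / (1.0 + xi) - math.log1p(xi)
    return total / len(power)


def detect_speech(frames, noise_power, threshold=0.5):
    """Return one boolean per frame: True where the LRT exceeds the threshold.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    return [frame_log_likelihood_ratio(f, noise_power) > threshold for f in frames]


if __name__ == "__main__":
    noise = [1.0, 1.0, 1.0, 1.0]
    frames = [
        [1.0, 1.0, 1.0, 1.0],      # same power as the noise estimate
        [10.0, 10.0, 10.0, 10.0],  # 10 dB above the noise estimate
    ]
    print(detect_speech(frames, noise))
```

With the maximum-likelihood a priori SNR estimate, a frame whose power does not exceed the noise estimate yields a ratio of zero, so the decision depends only on how far the observed power rises above the noise floor.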