Speech activity detection on multichannels of meeting recordings

  • Authors:
  • Zhongqiang Huang; Mary P. Harper

  • Affiliation:
  • Electrical and Computer Engineering, Purdue University, West Lafayette, IN (both authors)

  • Venue:
  • MLMI'05: Proceedings of the Second International Conference on Machine Learning for Multimodal Interaction
  • Year:
  • 2005

Abstract

The Purdue SAD system was originally designed to identify speech regions in multichannel meeting recordings, with the goal of focusing transcription effort on regions containing speech. In the NIST RT-05S evaluation, this system was evaluated in the individual head microphone (ihm) condition of the speech activity detection task, where the goal is to separate the voice of the speaker on each channel from silence and crosstalk. Our system consists of several steps and does not require a training set. It starts with a simple silence detection algorithm that uses pitch and energy to roughly separate silence from speech and crosstalk. A global Bayesian Information Criterion (BIC) is integrated with a Viterbi segmentation algorithm that divides the concatenated stream of local speech and crosstalk into homogeneous portions, allowing an energy-based clustering process to then separate local speech from crosstalk. The second step uses the resulting segment information to iteratively train a Gaussian mixture model for each speech activity category and decodes the whole sequence over an ergodic network to refine the segmentation. The final step first uses a cross-correlation analysis to eliminate crosstalk, and then applies a series of post-processing operations to adjust the segments to the evaluation scenario. In this paper, we describe our system and discuss various issues related to its evaluation.
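
The BIC-based segmentation step can be illustrated with a short sketch. This is not the authors' implementation; it shows the standard ΔBIC test for a single candidate change point in a window of feature frames (the function name `delta_bic` and the penalty weight `lam` are our own), which a Viterbi-style boundary search such as the one described above would evaluate at many candidate positions.

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """Standard Delta-BIC for a candidate change point t in X (frames x dims).

    Compares one full-covariance Gaussian over all of X against two
    Gaussians over X[:t] and X[t:]. A positive value favors splitting,
    i.e., placing a segment boundary at frame t.
    """
    n, d = X.shape

    def logdet_cov(Z):
        # Regularized log-determinant of the sample covariance.
        cov = np.atleast_2d(np.cov(Z, rowvar=False)) + 1e-6 * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    # Log-likelihood gain of the two-segment model over the single model.
    gain = 0.5 * (n * logdet_cov(X)
                  - t * logdet_cov(X[:t])
                  - (n - t) * logdet_cov(X[t:]))
    # Penalty for the extra Gaussian's parameters (mean + full covariance).
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty
```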
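A minimal sketch of the second step's iterative refinement is shown below, assuming scikit-learn's `GaussianMixture`. For brevity it replaces the paper's ergodic-network Viterbi decode with per-frame maximum-likelihood relabeling (the zero-transition-cost special case); `refine_segmentation` and its parameters are hypothetical names, not from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def refine_segmentation(X, labels, n_iter=3, n_components=8):
    """Iteratively retrain one GMM per speech-activity class and relabel frames.

    X: (frames, dims) feature matrix; labels: initial per-frame class ids
    from the first-pass segmentation. Each class is assumed to have at
    least n_components frames. A full decoder would add transition
    penalties between classes via Viterbi over an ergodic network.
    """
    classes = np.unique(labels)
    for _ in range(n_iter):
        # Retrain one mixture model per class on its current frames.
        gmms = {}
        for c in classes:
            g = GaussianMixture(n_components=n_components,
                                covariance_type='diag')
            g.fit(X[labels == c])
            gmms[c] = g
        # Relabel every frame by its maximum per-class log-likelihood.
        scores = np.stack([gmms[c].score_samples(X) for c in classes], axis=1)
        labels = classes[np.argmax(scores, axis=1)]
    return labels
```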
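The crosstalk-elimination idea in the final step rests on cross-correlation between channels: speech originating at another participant's microphone reaches the local channel delayed and attenuated, so a strong correlation peak at a nonzero lag suggests crosstalk rather than local speech. A sketch of that test, assuming aligned same-length waveform windows and `max_lag` shorter than the window (`xcorr_peak` is an illustrative name, not from the paper):

```python
import numpy as np

def xcorr_peak(x, y, max_lag):
    """Peak normalized cross-correlation of x and y within +/- max_lag samples.

    Returns (peak magnitude, lag at peak). A high peak at a nonzero lag
    indicates that one channel is hearing a delayed copy of the other,
    i.e., likely crosstalk.
    """
    x = (x - x.mean()) / (x.std() + 1e-12)
    y = (y - y.mean()) / (y.std() + 1e-12)
    n = len(x)
    best_c, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[lag:], y[:n - lag]
        else:
            a, b = x[:n + lag], y[-lag:]
        c = abs(float(np.dot(a, b)) / len(a))
        if c > best_c:
            best_c, best_lag = c, lag
    return best_c, best_lag
```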