Annotation of heterogeneous multimedia content using automatic speech recognition
SAMT'07: Proceedings of the 2nd International Conference on Semantic and Digital Media Technologies (Semantic Multimedia)
In this paper we present a speech/non-speech classification method that allows high-quality classification without the need to know in advance what kinds of audible non-speech events are present in an audio recording, and without a single parameter having to be tuned on in-domain data. Because no parameter tuning is needed and no training data is required to train models for specific sounds, the classifier can process a wide range of audio types with varying conditions, and thereby contributes to a more robust automatic speech recognition framework. Our system does not attempt to classify all audible non-speech in a single run. Instead, a bootstrap speech/silence classification is first obtained with a standard speech/non-speech classifier; models for speech, silence and audible non-speech are then trained on the target audio itself using this bootstrap classification. Experiments show that the proposed system performs 83% (relative) better than a common broadcast news speech/non-speech classifier on a collection of meetings recorded with table-top microphones, and 44% (relative) better on a collection of Dutch television broadcasts used for TRECVID 2007.
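The two-stage procedure described in the abstract can be illustrated in a few lines of code. The following is a minimal sketch, not the authors' implementation: the bootstrap speech/silence step is approximated by a hypothetical energy-percentile rule (`bootstrap_classify`), MFCC features and Gaussian mixture models stand in for whatever front-end and models the actual system uses, and every threshold and component count is an illustrative assumption.

```python
# Minimal sketch of the two-stage idea: bootstrap speech/silence labels,
# then retrain per-recording models for speech, silence and audible
# non-speech on the target audio itself. All names, thresholds and
# component counts below are illustrative assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def bootstrap_classify(energy, percentile=30):
    """Hypothetical stand-in for a standard speech/non-speech classifier:
    frames below an energy percentile are labelled silence (0), the rest
    speech (1)."""
    return (energy > np.percentile(energy, percentile)).astype(int)

def classify(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)
    energy = librosa.feature.rms(y=y)[0]
    n = min(len(feats), len(energy))
    feats, energy = feats[:n], energy[:n]

    # Stage 1: bootstrap speech/silence classification.
    boot = bootstrap_classify(energy)

    # Stage 2: train silence and speech models on the target audio.
    sil = GaussianMixture(4, covariance_type="diag").fit(feats[boot == 0])
    spk = GaussianMixture(8, covariance_type="diag").fit(feats[boot == 1])

    # Frames labelled speech but scored poorly by the speech model are
    # treated as candidate audible non-speech; fit a third model on them.
    spk_ll = spk.score_samples(feats)
    cand = (boot == 1) & (spk_ll < np.percentile(spk_ll[boot == 1], 20))
    models = [sil, spk]
    if cand.sum() >= 32:  # need enough frames to fit a small GMM
        models.append(
            GaussianMixture(4, covariance_type="diag").fit(feats[cand]))

    # Final pass: maximum-likelihood frame labels over the per-recording
    # models (0 = silence, 1 = speech, 2 = audible non-speech).
    scores = np.stack([m.score_samples(feats) for m in models])
    return scores.argmax(axis=0)
```

Because every model in this sketch is re-estimated on the recording being processed, nothing is tuned to a particular domain beforehand, which is the property the abstract emphasizes.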