Annotation of heterogeneous multimedia content using automatic speech recognition
SAMT'07: Proceedings of the 2nd International Conference on Semantic and Digital Media Technologies (Semantic Multimedia)
In this paper we present a speech/non-speech classification method that allows high-quality classification without the need to know in advance what kinds of audible non-speech events are present in an audio recording, and without a single parameter having to be tuned on in-domain data. Because no parameter tuning is needed and no training data is required to train models for specific sounds, the classifier can process a wide range of audio types with varying conditions, and thereby contributes to a more robust automatic speech recognition framework. Our system does not attempt to classify all audible non-speech in a single run. Instead, a bootstrap speech/silence classification is first obtained with a standard speech/non-speech classifier; models for speech, silence and audible non-speech are then trained on the target audio itself using this bootstrap classification. Experiments show that the proposed system performs 83% (relative) better than a common broadcast news speech/non-speech classifier on a collection of meetings recorded with table-top microphones, and 44% (relative) better on a collection of Dutch television broadcasts used for TRECVID 2007.
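The two-stage procedure described in the abstract can be illustrated in a few lines of code. The following is a minimal sketch, not the authors' implementation: the bootstrap speech/silence step is approximated by a hypothetical energy-percentile rule (`bootstrap_classify`), MFCC features and Gaussian mixture models stand in for whatever front-end and models the actual system uses, and every threshold and component count is an illustrative assumption.

```python
# Minimal sketch of the two-stage idea: bootstrap speech/silence labels,
# then retrain per-recording models for speech, silence and audible
# non-speech on the target audio itself. All names, thresholds and
# component counts below are illustrative assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def bootstrap_classify(energy, percentile=30):
    """Hypothetical stand-in for a standard speech/non-speech classifier:
    frames below an energy percentile are labelled silence (0), the rest
    speech (1)."""
    return (energy > np.percentile(energy, percentile)).astype(int)

def classify(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)
    energy = librosa.feature.rms(y=y)[0]
    n = min(len(feats), len(energy))
    feats, energy = feats[:n], energy[:n]

    # Stage 1: bootstrap speech/silence classification.
    boot = bootstrap_classify(energy)

    # Stage 2: train silence and speech models on the target audio.
    sil = GaussianMixture(4, covariance_type="diag").fit(feats[boot == 0])
    spk = GaussianMixture(8, covariance_type="diag").fit(feats[boot == 1])

    # Frames labelled speech but scored poorly by the speech model are
    # treated as candidate audible non-speech; fit a third model on them.
    spk_ll = spk.score_samples(feats)
    cand = (boot == 1) & (spk_ll < np.percentile(spk_ll[boot == 1], 20))
    models = [sil, spk]
    if cand.sum() >= 32:  # need enough frames to fit a small GMM
        models.append(
            GaussianMixture(4, covariance_type="diag").fit(feats[cand]))

    # Final pass: maximum-likelihood frame labels over the per-recording
    # models (0 = silence, 1 = speech, 2 = audible non-speech).
    scores = np.stack([m.score_samples(feats) for m in models])
    return scores.argmax(axis=0)
```

Because every model in this sketch is re-estimated on the recording being processed, nothing is tuned to a particular domain beforehand, which is the property the abstract emphasizes.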