Speech/music segmentation using entropy and dynamism features in a HMM classification framework

  • Authors:
  • Jitendra Ajmera; Iain McCowan; Hervé Bourlard

  • Affiliations:
  • IDIAP, Case postale 592, Rue du Simplon 4, CH-1920 Martigny, Switzerland and EPFL, CH-1015 Lausanne, Switzerland; IDIAP, Case postale 592, Rue du Simplon 4, CH-1920 Martigny, Switzerland; IDIAP, Case postale 592, Rue du Simplon 4, CH-1920 Martigny, Switzerland and EPFL, CH-1015 Lausanne, Switzerland

  • Venue:
  • Speech Communication
  • Year:
  • 2003

Abstract

In this paper, we present a new approach to high-performance speech/music discrimination on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, an artificial neural network (ANN) trained on clean speech only (as used in a standard large-vocabulary speech recognition system) serves as a channel model, at whose output the entropy and "dynamism" are measured every 10 ms. These features are then integrated over time through an ergodic 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints on each HMM state. In the case of entropy, for instance, it is clear (and observed in practice) that, on average, the entropy at the output of the ANN is larger for non-speech segments than for speech segments presented at its input. In our case, the ANN acoustic model was a multi-layer perceptron (MLP, as often used in hybrid HMM/ANN systems) producing at its output estimates of the phonetic posterior probabilities given the acoustic vectors at its input. It is from these outputs, thus from "real" probabilities, that the entropy and dynamism are estimated. The 2-state speech/non-speech HMM takes these two-dimensional features (entropy and dynamism) as input, and their distributions are modeled by multi-Gaussian densities or a secondary MLP. The parameters of this HMM are trained in a supervised manner using the Viterbi algorithm.

Although the proposed method can easily be adapted to other speech/non-speech discrimination applications, the present paper focuses only on speech/music segmentation. Different experiments, covering different speech and music styles as well as different temporal distributions of the speech and music signals (real data distribution, mostly speech, or mostly music), illustrate the robustness of the approach, which always achieves a correct segmentation rate higher than 90%. Finally, we show how a confidence measure can be used to further improve the segmentation results, and discuss how this may be used to extend the technique to the case of speech/music mixtures.
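
The frame-level features described in the abstract can be computed directly from the MLP posteriors. The following Python/NumPy sketch shows one plausible formulation; the exact dynamism definition used here (mean squared frame-to-frame change of the posteriors), the clipping constant, and the function name are illustrative assumptions, not taken verbatim from the paper.

```python
import numpy as np

def entropy_dynamism(posteriors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Return a (T, 2) array of [entropy, dynamism] features.

    `posteriors` is a (T, K) array of per-frame phonetic posterior
    probabilities produced by the MLP acoustic model (one row every 10 ms).
    """
    p = np.clip(posteriors, eps, 1.0)
    # Entropy of the posterior distribution at each frame: on average it is
    # higher for non-speech (flatter posteriors) than for clean speech.
    entropy = -np.sum(p * np.log(p), axis=1)
    # Dynamism: how quickly the posteriors change between consecutive frames.
    # Speech jumps between phonetic classes; music tends to evolve more slowly.
    diff = np.diff(posteriors, axis=0, prepend=posteriors[:1])
    dynamism = np.mean(diff ** 2, axis=1)
    return np.stack([entropy, dynamism], axis=1)
```

The ergodic 2-state HMM with minimum duration constraints can be approximated by expanding each state into a left-to-right chain of sub-states and decoding with the Viterbi algorithm, as in the minimal sketch below. It assumes per-frame log-likelihoods for the two classes are available (for example from Gaussian mixtures fitted on the [entropy, dynamism] features); the minimum duration of 100 frames (1 s at 10 ms frames) and the switching penalty are illustrative values, not the paper's settings.

```python
import numpy as np

def viterbi_min_duration(loglik: np.ndarray, min_dur: int = 100,
                         switch_logprob: float = -10.0) -> np.ndarray:
    """loglik: (T, 2) per-frame log-likelihoods for [speech, non-speech].

    Each class is expanded into `min_dur` sub-states so that, once entered,
    it must be kept for at least `min_dur` frames. Returns per-frame labels.
    """
    T, n_classes = loglik.shape
    S = n_classes * min_dur                      # total number of sub-states
    cls = np.repeat(np.arange(n_classes), min_dur)
    delta = np.full(S, -np.inf)
    delta[::min_dur] = loglik[0, :]              # start in each entry sub-state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        new = np.full(S, -np.inf)
        for s in range(S):
            c, d = divmod(s, min_dur)
            # Advance within the current class chain (no penalty).
            cands = {s - 1: delta[s - 1]} if d > 0 else {}
            if d == 0:                           # entry sub-state of class c
                # Enter from the last sub-state of any class, paying a
                # penalty only when the class actually switches.
                for c2 in range(n_classes):
                    last = c2 * min_dur + min_dur - 1
                    pen = 0.0 if c2 == c else switch_logprob
                    cands[last] = delta[last] + pen
            if d == min_dur - 1:                 # last sub-state may self-loop
                cands[s] = max(cands.get(s, -np.inf), delta[s])
            prev = max(cands, key=cands.get)
            new[s] = cands[prev] + loglik[t, cls[s]]
            back[t, s] = prev
        delta = new
    # Trace back the best sub-state path and map it to class labels.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return cls[path]
```

In the paper the state distributions are trained in a supervised manner with the Viterbi algorithm; in a quick reimplementation one could instead, for instance, fit scikit-learn GaussianMixture models on labelled [entropy, dynamism] frames and pass their score_samples outputs as the two columns of `loglik`.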