This work assesses different approaches to speech/non-speech segmentation of audio data and proposes a new, high-level representation of audio signals based on phoneme recognition features suited to speech/non-speech discrimination. Unlike previous model-based approaches, in which the speech and non-speech classes were usually each modeled by several models, we develop a representation in which just one model per class is used in the segmentation process. For this purpose, four measures based on consonant-vowel pairs obtained from different phoneme speech recognizers are introduced and applied in two different segmentation-classification frameworks. The segmentation systems were evaluated on several broadcast news databases. The evaluation results indicate that the proposed phoneme recognition features outperform standard mel-frequency cepstral coefficients and posterior-probability-based features (entropy and dynamism), and proved more robust and less sensitive to mismatched training and unforeseen conditions. Additional experiments with fusion models combining cepstral features and the proposed phoneme recognition features produced the highest overall scores, indicating that the most suitable approach to speech/non-speech segmentation is a combination of low-level acoustic features and high-level recognition features.
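The posterior-probability-based baseline features mentioned above (entropy and dynamism) have standard frame-level definitions: entropy measures how flat the phoneme posterior distribution is at a given frame (high for non-speech, where no phoneme dominates), while dynamism measures how rapidly the posteriors change between frames (high for speech, with its fast phoneme switching). A minimal sketch of these two measures, assuming the phoneme posteriors are available as a frames-by-classes NumPy array (this is an illustrative formulation, not the paper's code):

```python
import numpy as np

def entropy_dynamism(posteriors, eps=1e-12):
    """Compute frame-level entropy and dynamism from phoneme
    posterior probabilities.

    posteriors: array of shape (n_frames, n_classes), rows sum to 1.
    Returns (entropy of length n_frames, dynamism of length n_frames-1).
    Definitions are the standard ones for these features; treating
    them this way is an assumption for illustration.
    """
    p = np.clip(posteriors, eps, 1.0)
    # Entropy per frame: flat (uncertain) posteriors -> high entropy,
    # typical of non-speech; peaked posteriors -> low entropy.
    entropy = -np.sum(p * np.log(p), axis=1)
    # Dynamism: mean squared change of posteriors between consecutive
    # frames; speech tends to show rapid posterior switching.
    dynamism = np.mean(np.diff(posteriors, axis=0) ** 2, axis=1)
    return entropy, dynamism
```

For example, a segment with perfectly uniform posteriors over K classes yields entropy log(K) at every frame and zero dynamism, the pattern these features associate with non-speech audio.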