Speech/music discrimination in audio podcast using structural segmentation and timbre recognition

  • Authors:
  • Mathieu Barthet; Steven Hargreaves; Mark Sandler

  • Affiliations:
  • Centre for Digital Music, Queen Mary University of London, London, United Kingdom (all authors)

  • Venue:
  • CMMR'10 Proceedings of the 7th international conference on Exploring music contents
  • Year:
  • 2010

Abstract

We propose two speech/music discrimination methods using timbre models and measure their performance on a 3-hour database of BBC radio podcasts. In the first method, the machine-estimated classifications obtained with an automatic timbre recognition (ATR) model are post-processed using median filtering. The classification system (LSF/K-means) was trained at two taxonomic levels: a high-level one (speech, music) and a lower-level one (male and female speech, classical, jazz, rock & pop). The second method combines automatic structural segmentation and timbre recognition (ASS/ATR). The ASS evaluates the similarity between feature distributions (MFCC, RMS) using HMM and soft K-means algorithms. Both methods were evaluated at the semantic level (relative correct overlap, RCO) and at the temporal level (boundary retrieval F-measure). The ASS/ATR method obtained the best results (an average RCO of 94.5% and a boundary F-measure of 50.1%). These performances compared favourably with those obtained by an SVM-based technique that provides a good benchmark of the state of the art.
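The median-filtering post-processing and the RCO metric mentioned above can be sketched in a few lines. This is an illustrative toy example, not the paper's implementation: the frame labels, frame rate, and kernel size below are invented for demonstration, and the paper's actual filter length and evaluation tooling are not specified here.

```python
import numpy as np
from scipy.signal import medfilt

# Hypothetical frame-wise classifier output: 0 = speech, 1 = music.
# Isolated misclassifications inside otherwise homogeneous regions are
# smoothed out by a median filter, as in the ATR post-processing step.
raw = np.array([0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1])
smoothed = medfilt(raw, kernel_size=3).astype(int)

# Relative correct overlap (RCO): the fraction of frames whose smoothed
# label agrees with a ground-truth annotation of the same podcast.
truth = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
rco = float(np.mean(smoothed == truth))
```

On this toy sequence the filter removes both single-frame errors, so the RCO reaches 1.0; on real podcast audio the kernel size trades off noise suppression against the risk of erasing genuinely short segments.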