Audio Partitioning and Transcription for Broadcast Data Indexation

Authors:
J. L. Gauvain; L. Lamel; G. Adda
Affiliations:
Spoken Language Processing Group, LIMSI-CNRS, BP 133, 91403 Orsay, France. gauvain@limsi.fr; Spoken Language Processing Group, LIMSI-CNRS, BP 133, 91403 Orsay, France. lamel@limsi.fr; Spoken Language Processing Group, LIMSI-CNRS, BP 133, 91403 Orsay, France. gadda@limsi.fr
Venue:
Multimedia Tools and Applications
Year:
2001

Abstract

This work addresses the automatic transcription of television and radio broadcasts in multiple languages. Transcribing such data is a major step toward automatic tools for indexing and retrieving the vast amounts of information generated daily. Radio and television broadcasts form a continuous data stream made up of segments of differing linguistic and acoustic nature, which poses challenges for transcription. Prior to word recognition, the data is partitioned into homogeneous acoustic segments. Non-speech segments are identified and removed, and the speech segments are clustered and labeled according to bandwidth and gender. Word recognition is carried out with a speaker-independent, large-vocabulary continuous speech recognizer that uses n-gram statistics for language modeling and continuous-density HMMs with Gaussian mixtures for acoustic modeling. This system has consistently obtained top-level performance in DARPA evaluations. Over 500 hours of unpartitioned, unrestricted American English broadcast data have been partitioned, transcribed, and indexed, with an average word error rate of about 20%. With current IR technology, retrieval performance on this data set is essentially the same for automatic transcriptions as for manual ones.
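
The word error figure quoted in the abstract is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The abstract does not say which scoring tool or text normalization was used, so the following is only a minimal sketch of that standard definition. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper: splits on whitespace only and applies no text
    normalization, which real scoring tools usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from hypothesis (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution and one deletion against a 10-word reference.
reference = "the prime minister spoke to reporters in paris this morning"
hypothesis = "the prime minister spoke to reporter in paris morning"
print(word_error_rate(reference, hypothesis))  # 0.2
```

With two errors against ten reference words, the example yields a rate of 20%, the same order as the average reported in the abstract. Because insertions count as errors, the rate can exceed 100% when the hypothesis is much longer than the reference.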