Speaker-adaptive speech recognition using speaker diarization for improved transcription of large spoken archives

Authors:
Petr Cerva; Jan Silovsky; Jindrich Zdansky; Jan Nouza; Ladislav Seps
Venue:
Speech Communication
Year:
2013

Abstract

This paper deals with speaker-adaptive speech recognition for large spoken archives. The goal is to improve the recognition accuracy of an automatic speech recognition (ASR) system that is being deployed for transcription of a large archive of Czech radio. This archive represents a significant part of Czech cultural heritage, as it contains recordings covering 90 years of broadcasting. A large portion of these documents (100,000 h) is to be transcribed and made public for browsing. To improve the transcription results, an efficient speaker-adaptive scheme is proposed. The scheme integrates speaker diarization and adaptation methods and, because the archive is enormous, is designed to keep the Real-Time Factor (RTF) of the entire adaptation process low. It therefore employs just two decoding passes, the first of which uses a lexicon with a reduced number of items. Moreover, the transcripts from the first pass serve not only for adaptation but also as input to the speaker diarization module, which employs two-stage clustering. The output of diarization is then used for a cluster-based unsupervised Speaker Adaptation (SA) approach that also exploits the gender of each individual speaker. Experimental results on various types of programs show that the adaptation scheme yields a significant Word Error Rate (WER) reduction, from 22.24% to 18.85%, over the Speaker Independent (SI) system while operating at a reasonable RTF.
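The two figures of merit named in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and not taken from the paper, RTF is assumed to follow its usual definition (processing time divided by audio duration), and the timing values in the usage lines are invented placeholders, since the abstract does not report the RTF that was achieved. Only the two WER values come from the abstract; they correspond to an absolute reduction of about 3.39 percentage points, or roughly 15.2% relative to the SI baseline.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are WER values in percent, e.g. 22.24 and 18.85.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF under the usual definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # WER values reported in the abstract (SI baseline vs. adapted system).
    si_wer, sa_wer = 22.24, 18.85
    print(f"absolute reduction: {si_wer - sa_wer:.2f} points")
    print(f"relative reduction: {relative_wer_reduction(si_wer, sa_wer):.1f}%")

    # Hypothetical timing, NOT from the paper: 30 min of compute for 1 h of audio.
    print(f"example RTF: {real_time_factor(1800.0, 3600.0):.2f}")
```

An RTF below 1 means the system processes audio faster than real time. At the archive scale quoted in the abstract (100,000 h), every 0.1 of RTF corresponds to about 10,000 hours of sequential compute.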