Integrating imperfect transcripts into speech recognition systems for building high-quality corpora

Authors:
Benjamin Lecouteux; Georges Linarès; Stanislas Oger
Affiliations:
Laboratoire Informatique de Grenoble (LIG), University of Grenoble, France; Laboratoire Informatique d'Avignon (LIA), University of Avignon, France; Laboratoire Informatique d'Avignon (LIA), University of Avignon, France
Venue:
Computer Speech and Language
Year:
2012

Abstract

Training state-of-the-art automatic speech recognition (ASR) systems requires large, relevant training corpora. Such databases are expensive to build, and their cost remains a major limitation for the development of speech-enabled applications in particular contexts (e.g. low-density languages or specialized domains). On the other hand, large amounts of data can be found in news prompts, movie subtitles, scripts, etc. Using such data as a training corpus could provide a low-cost solution to the acoustic model estimation problem. Unfortunately, prior transcripts are seldom exact with respect to the content of the speech signal, and they lack temporal information. This paper tackles the improvement of prompt-based speech corpora by addressing both problems. We propose a method that locates accurate transcript segments in the speech signal and automatically corrects erroneous or missing transcript text surrounding these segments. The method relies on a new decoding strategy in which the search algorithm is driven by the imperfect transcription of the input utterances. The experiments are conducted on French, using the ESTER database and a set of recordings (with associated prompts) from RTBF (Radio Television Belge Francophone). The results demonstrate the effectiveness of the proposed approach in terms of both error correction and text-to-speech alignment.
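
To make the idea of locating accurate transcript segments more concrete, here is a minimal sketch in Python. It is not the decoding strategy proposed in the paper, which drives the recognizer's search with the imperfect transcript. It only illustrates a simpler, related idea: comparing an imperfect prompt with an ASR hypothesis after decoding and keeping the word runs on which both agree. The function name `find_anchor_segments`, the `min_len` threshold, and the example sentences are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def find_anchor_segments(prompt_words, hypothesis_words, min_len=3):
    """Return (prompt_start, hyp_start, length) for each run of identical
    words shared by the prompt and the ASR hypothesis.

    Hypothetical helper: runs shorter than min_len words are discarded,
    on the assumption that short matches are more likely to be accidental.
    """
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent items in long sequences.
    matcher = SequenceMatcher(a=prompt_words, b=hypothesis_words, autojunk=False)
    return [
        (block.a, block.b, block.size)
        for block in matcher.get_matching_blocks()
        if block.size >= min_len
    ]


# Invented example: the prompt omits a hesitation and differs on a few words.
prompt = "the minister said that the budget will be voted next week".split()
hypothesis = "the minister said uh that the budget would be voted on next week".split()

anchors = find_anchor_segments(prompt, hypothesis, min_len=3)
for prompt_start, hyp_start, length in anchors:
    words = " ".join(prompt[prompt_start:prompt_start + length])
    print(f"prompt[{prompt_start}] ~ hypothesis[{hyp_start}]: {words}")
```

In this sketch, the regions between two anchors are the places where the prompt and the audio may disagree, and these are the regions that would need correction. If the ASR hypothesis carried word timings, the anchors would also supply the temporal information that the prompt lacks.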