Utterance-Based Selective Training for the Automatic Creation of Task-Dependent Acoustic Models

  • Authors:
  • Tobias Cincarek; Tomoki Toda; Hiroshi Saruwatari; Kiyohiro Shikano

  • Affiliations:
  • The authors are with the Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma-shi, 630-0192 Japan. E-mail: cincar-t@is.naist.jp

  • Venue:
  • IEICE - Transactions on Information and Systems
  • Year:
  • 2006

Abstract

To obtain a robust acoustic model for a given speech recognition task, a large amount of speech data is necessary. However, preparing speech data, including recording and transcription, is costly and time-consuming. Although there have been attempts to build generic acoustic models that are portable across applications, speech recognition performance is typically task-dependent. This paper introduces a method for automatically building task-dependent acoustic models based on selective training. Instead of setting up a new database, only a small amount of task-specific development data needs to be collected. Based on the likelihood of the target model parameters given this development data, utterances that are acoustically close to the development data are selected from existing speech data resources. Since the number of possible subsets of a large database is in general too large to search exhaustively, a heuristic has to be employed. The proposed algorithm either deletes single utterances temporarily or alternates between successive deletion and addition of multiple utterances. To make selective training computationally practical, model retraining and likelihood calculation need to be fast. It is shown that the model likelihood can be calculated quickly and easily from sufficient statistics, without explicit reconstruction of the model parameters. The algorithm is applied to obtain an infant- and elderly-dependent acoustic model with only very little development data available. Word accuracy improves by up to 9% in comparison to conventional EM training without selection. Furthermore, the approach also outperformed MLLR and MAP adaptation with the development data.
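
The single-utterance deletion idea from the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a single diagonal-covariance Gaussian in place of an HMM acoustic model, makes a deletion permanent whenever the development-data likelihood improves, and omits the variant that alternates deletion and addition of multiple utterances. The names `suff_stats`, `dev_log_likelihood`, `select_utterances`, `var_floor`, and `max_passes` are hypothetical. The point it shows is that removing an utterance only requires subtracting its sufficient statistics, so no retraining pass over the data is needed.

```python
import numpy as np


def suff_stats(frames):
    """Return (frame count, sum, sum of squares) for a (frames x dims) array."""
    frames = np.asarray(frames, dtype=float)
    return float(len(frames)), frames.sum(axis=0), (frames ** 2).sum(axis=0)


def dev_log_likelihood(n, s1, s2, dev, var_floor=1e-3):
    """Log-likelihood of the development data under the Gaussian implied by
    the training statistics (n, s1, s2), using only the development statistics."""
    dev_n, dev_s1, dev_s2 = dev
    mean = s1 / n
    var = np.maximum(s2 / n - mean ** 2, var_floor)  # floored ML variance
    # sum_t (x_t - mean)^2 expanded in terms of the development statistics
    sq = dev_s2 - 2.0 * mean * dev_s1 + dev_n * mean ** 2
    return float(-0.5 * np.sum(dev_n * np.log(2.0 * np.pi * var) + sq / var))


def select_utterances(pool, dev_utts, max_passes=5):
    """Greedy selection: temporarily delete each utterance and keep the
    deletion only if the development-data likelihood increases."""
    stats = [suff_stats(u) for u in pool]
    dev = suff_stats(np.vstack(dev_utts))
    n = sum(s[0] for s in stats)
    s1 = sum(s[1] for s in stats)
    s2 = sum(s[2] for s in stats)
    keep = set(range(len(pool)))
    best = dev_log_likelihood(n, s1, s2, dev)
    for _ in range(max_passes):
        changed = False
        for i in sorted(keep):
            if len(keep) == 1:
                break  # never delete the last remaining utterance
            un, us1, us2 = stats[i]
            cand = dev_log_likelihood(n - un, s1 - us1, s2 - us2, dev)
            if cand > best:  # deletion helps: make it permanent
                keep.remove(i)
                n, s1, s2 = n - un, s1 - us1, s2 - us2
                best = cand
                changed = True
        if not changed:
            break
    return sorted(keep), best
```

Calling `select_utterances(pool, dev_utts)` with lists of per-utterance feature arrays returns the indices of the retained utterances and the final development-data log-likelihood. Each candidate evaluation costs a handful of vector operations regardless of how many frames the pool contains, which is what makes a greedy search over many utterances affordable in this simplified setting.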