Techniques in rapid unsupervised speaker adaptation based on HMM-Sufficient Statistics

  • Authors:
  • Randy Gomez;Tomoki Toda;Hiroshi Saruwatari;Kiyohiro Shikano

  • Affiliations:
  • Graduate School of Information Science, Nara Institute of Science and Technology, Japan;Graduate School of Information Science, Nara Institute of Science and Technology, Japan;Graduate School of Information Science, Nara Institute of Science and Technology, Japan;Graduate School of Information Science, Nara Institute of Science and Technology, Japan

  • Venue:
  • Speech Communication
  • Year:
  • 2009

Abstract

A speech recognition system that is robust to speaker variation needs a reliable adaptation algorithm. Most adaptation techniques require a large amount of adaptation data from the target speaker. The time needed to gather and transcribe adaptation utterances, together with the time needed to execute adaptation, limits their application to speech recognition. We propose a rapid approach to speaker adaptation. We use HMM-Sufficient Statistics to store speaker-dependent subspaces, and N-Closest speaker selection to resolve the combinatorics of these subspaces during recognition. Because the target speaker's utterance drives the N-Closest speaker selection, the adapted model corresponds directly to the target speaker. The proposed method employs a series of adaptation processes: a general model is trained first, then adapted to broad gender/age classes, which are in turn adapted to speaker-specific data. Since the HMM-Sufficient Statistics are pre-computed offline, little computation is needed to carry out adaptation online. Moreover, the method requires only a single arbitrary utterance from the target speaker. In this paper, we discuss the modification, expansion, and improvement of rapid adaptation based on HMM-Sufficient Statistics in the framework of Baum-Welch and maximum likelihood linear regression (MLLR). Experimental results using conventional MLLR, speaker-adaptive training, and constrained MLLR (CMLLR) are evaluated and compared. We also test robustness in office, car, crowd, and booth environments under several SNR conditions.
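
The idea of selecting the N closest speakers and pooling their pre-computed statistics can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single one-dimensional Gaussian per speaker instead of a full HMM, and it assumes that closeness is measured by the log-likelihood of the target utterance under each speaker's model, which the abstract does not specify. All names (`SufficientStats`, `select_n_closest`, `combine_stats`) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class SufficientStats:
    """Statistics for one 1-D Gaussian, accumulated offline per speaker."""
    occupancy: float     # sum of occupation probabilities (gamma)
    first_order: float   # sum of gamma * x
    second_order: float  # sum of gamma * x^2


def gaussian_log_likelihood(frames: Sequence[float], mean: float, var: float) -> float:
    """Total log-likelihood of 1-D frames under a single Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )


def select_n_closest(
    frames: Sequence[float],
    speaker_models: Dict[str, Tuple[float, float]],
    n: int,
) -> List[str]:
    """Rank speakers by likelihood of the target utterance and keep the top n.

    Assumption: likelihood-based ranking stands in for the selection criterion.
    """
    ranked = sorted(
        speaker_models,
        key=lambda spk: gaussian_log_likelihood(frames, *speaker_models[spk]),
        reverse=True,
    )
    return ranked[:n]


def combine_stats(
    selected: Sequence[str], stats: Dict[str, SufficientStats]
) -> Tuple[float, float]:
    """Pool the stored statistics and re-estimate mean and variance.

    No pass over the training audio is needed, only sums of stored values.
    """
    occ = sum(stats[s].occupancy for s in selected)
    first = sum(stats[s].first_order for s in selected)
    second = sum(stats[s].second_order for s in selected)
    mean = first / occ
    var = second / occ - mean ** 2
    return mean, var


# Toy data: (mean, variance) per speaker and matching offline statistics.
speaker_models = {"spk_a": (0.0, 1.0), "spk_b": (1.0, 1.0), "spk_c": (5.0, 1.0)}
stats = {
    "spk_a": SufficientStats(10.0, 0.0, 10.0),
    "spk_b": SufficientStats(10.0, 10.0, 20.0),
    "spk_c": SufficientStats(10.0, 50.0, 260.0),
}

target_utterance = [0.8, 1.1, 0.9]  # a single short utterance from the target speaker
selected = select_n_closest(target_utterance, speaker_models, n=2)
adapted_mean, adapted_var = combine_stats(selected, stats)
```

With this toy data the two selected speakers are `spk_b` and `spk_a`, and pooling their statistics gives a mean of 0.5 and a variance of 1.25. The online step reduces to a ranking and a few sums, which is the property the abstract attributes to pre-computing the statistics offline.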