Self-learning speaker identification for enhanced speech recognition

  • Authors:
  • Tobias Herbig;Franz Gerl;Wolfgang Minker

  • Affiliations:
  • University of Ulm, Institute of Information Technology, Ulm, Germany and Nuance Communications Aachen GmbH, Ulm, Germany;SVOX Deutschland GmbH, Ulm, Germany;University of Ulm, Institute of Information Technology, Ulm, Germany

  • Venue:
  • Computer Speech and Language
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

A novel approach for joint speaker identification and speech recognition is presented in this article. Unsupervised speaker tracking and automatic adaptation of the human-computer interface is achieved by the interaction of speaker identification, speech recognition and speaker adaptation for a limited number of recurring users. Together with a technique for efficient information retrieval a compact modeling of speech and speaker characteristics is presented. Applying speaker specific profiles allows speech recognition to take individual speech characteristics into consideration to achieve higher recognition rates. Speaker profiles are initialized and continuously adapted by a balanced strategy of short-term and long-term speaker adaptation combined with robust speaker identification. Different users can be tracked by the resulting self-learning speech controlled system. Only a very short enrollment of each speaker is required. Subsequent utterances are used for unsupervised adaptation resulting in continuously improved speech recognition rates. Additionally, the detection of unknown speakers is examined under the objective to avoid the requirement to train new speaker profiles explicitly. The speech controlled system presented here is suitable for in-car applications, e.g. speech controlled navigation, hands-free telephony or infotainment systems, on embedded devices. Results are presented for a subset of the SPEECON database. The results validate the benefit of the speaker adaptation scheme and the unified modeling in terms of speaker identification and speech recognition rates.