Fast unsupervised adaptation based on efficient statistics accumulation using frame independent confidence within monophone states

Authors:
Satoshi Kobashikawa;Atsunori Ogawa;Taichi Asami;Yoshikazu Yamaguchi;Hirokazu Masataki;Satoshi Takahashi
Affiliations:
NTT Media Intelligence Laboratories, NTT Corporation, 1-1 Hikari-no-oka, Yokosuka-Shi, Kanagawa 239-0847, Japan;NTT Communication Science Laboratories, NTT Corporation, 2-4 Hikari-dai, Seika-cho, Kyoto 619-0237, Japan;NTT Media Intelligence Laboratories, NTT Corporation, 1-1 Hikari-no-oka, Yokosuka-Shi, Kanagawa 239-0847, Japan;NTT Media Intelligence Laboratories, NTT Corporation, 1-1 Hikari-no-oka, Yokosuka-Shi, Kanagawa 239-0847, Japan;NTT Media Intelligence Laboratories, NTT Corporation, 1-1 Hikari-no-oka, Yokosuka-Shi, Kanagawa 239-0847, Japan;NTT Media Intelligence Laboratories, NTT Corporation, 1-1 Hikari-no-oka, Yokosuka-Shi, Kanagawa 239-0847, Japan
Venue:
Computer Speech and Language
Year:
2013

Abstract

This paper proposes a fast unsupervised acoustic model adaptation technique for speech recognition that accumulates statistics efficiently. Conventional adaptation techniques accumulate acoustic statistics with the forward-backward algorithm or the Viterbi algorithm. Since both algorithms require a state sequence before statistics accumulation, conventional techniques spend time determining the state sequence by transcribing the target speech in advance. Instead of pre-determining the state sequence, the proposed technique reduces computation time by accumulating the statistics frame by frame using state confidence within each monophone. It also rapidly selects the appropriate gender-dependent acoustic model before adaptation, and further increases accuracy by applying a power term after adaptation. Recognition experiments on spontaneous speech show that the proposed technique reduces computation time by 57.3% while providing the same accuracy as the conventional adaptation technique.
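
The idea of accumulating statistics without a pre-determined state sequence can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not give the exact confidence formula. The sketch assumes that each state is a single diagonal-covariance Gaussian, that the per-frame confidence is the state likelihood normalized over all monophone states with uniform priors, and that only the means are re-estimated. All function names, the `states` layout, and the toy data are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def frame_confidences(x, states):
    """Per-frame confidence of each state: likelihoods normalized over all states.

    Assumption: confidence is a frame-wise posterior with uniform state priors.
    No state sequence (decoding or alignment) is used.
    """
    logs = [log_gauss_diag(x, s["mean"], s["var"]) for s in states]
    top = max(logs)  # subtract the max for numerical stability
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]

def accumulate(frames, states):
    """Accumulate zeroth- and first-order statistics, one frame at a time."""
    dim = len(states[0]["mean"])
    occ = [0.0] * len(states)                  # sum of confidences per state
    first = [[0.0] * dim for _ in states]      # confidence-weighted feature sums
    for x in frames:
        for i, c in enumerate(frame_confidences(x, states)):
            occ[i] += c
            for d in range(dim):
                first[i][d] += c * x[d]
    return occ, first

# Toy data: two hypothetical monophone states with 2-dimensional features.
states = [
    {"mean": [0.0, 0.0], "var": [1.0, 1.0]},
    {"mean": [4.0, 4.0], "var": [1.0, 1.0]},
]
frames = [[0.1, -0.2], [3.9, 4.2], [0.3, 0.1], [4.1, 3.8]]

occ, first = accumulate(frames, states)
# Re-estimated means: weighted feature sum divided by accumulated confidence.
adapted_means = [[f / o for f in fs] for fs, o in zip(first, occ)]
```

Because each frame is scored independently, the loop needs no prior recognition pass; that is the source of the time saving the abstract describes, although the actual confidence measure, gender model selection, and power term in the paper may differ from this sketch.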