Multi-speaker/speaker-independent architectures for the multi-state time delay neural network

  • Authors:
  • Hermann Hild; Alex Waibel

  • Affiliations:
  • School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (both authors)

  • Venue:
  • ICASSP '93: Proceedings of the 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing: Speech Processing - Volume II
  • Year:
  • 1993

Abstract

In this paper we present an improved Multi-State Time Delay Neural Network (MS-TDNN) for speaker-independent, connected-letter recognition which outperforms an HMM-based system (SPHINX) and previous MS-TDNNs [2], and we explore new network architectures with "internal speaker models". Four different architectures, characterized by an increasing number of speaker-specific parameters, are introduced. The speaker-specific parameters can be adjusted by "automatic speaker identification" or by speaker adaptation, allowing the network to "tune in" to a new speaker. Both methods lead to significant improvements over the straightforward speaker-independent architecture. As in [1], even unsupervised "tuning-in" (i.e., on unlabeled speech) works astonishingly well.
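
The abstract outlines, but does not detail, how speaker-specific parameters are combined with the shared MS-TDNN weights. Purely as an illustrative sketch (the class name, shapes, and the additive per-speaker bias below are assumptions, not the authors' design), the following Python example shows a time-delay layer whose hidden activations are conditioned on a small per-speaker parameter vector; "tuning in" to a new speaker would then update only those few parameters while the shared weights stay fixed.

```python
# Hypothetical sketch of a speaker-conditioned time-delay layer.
# Not the authors' implementation; shapes and the additive speaker bias
# are illustrative assumptions only.
import numpy as np

class SpeakerConditionedTDNNLayer:
    def __init__(self, n_in, n_hidden, delay, n_speakers, rng=None):
        rng = rng or np.random.default_rng(0)
        # Shared (speaker-independent) weights over a sliding window of frames.
        self.W = rng.normal(0.0, 0.1, size=(n_hidden, n_in * delay))
        self.b = np.zeros(n_hidden)
        # Speaker-specific parameters: one small bias vector per speaker.
        self.speaker_bias = np.zeros((n_speakers, n_hidden))
        self.delay = delay

    def forward(self, frames, speaker_id):
        """frames: (T, n_in) acoustic feature frames.
        Returns hidden activations of shape (T - delay + 1, n_hidden)."""
        T, _ = frames.shape
        outputs = []
        for t in range(T - self.delay + 1):
            # Stack the delayed input frames into one window vector.
            window = frames[t:t + self.delay].reshape(-1)
            act = self.W @ window + self.b + self.speaker_bias[speaker_id]
            outputs.append(np.tanh(act))
        return np.stack(outputs)

# "Tuning in" to a new speaker would adapt only speaker_bias, either
# supervised (labeled adaptation data) or unsupervised (using the network's
# own hypotheses as targets), leaving the shared W and b unchanged.
```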