Gaussian Mixture Models (GMMs) have been the most popular approach to speaker recognition and verification for over two decades. The inefficiencies of this model for signals such as speech are well documented and include an inability to model temporal dependencies that result from nonlinearities in the speech signal. The resulting models are often complex and overparameterized, which leads to poor generalization. In this paper, we present a nonlinear mixture autoregressive model (MixAR) that attempts to directly model nonlinearities in the trajectories of the speech features. We apply this model to the problem of speaker verification. Experiments with synthetic data demonstrate the viability of the model. Evaluations on standard speech databases, including TIMIT, NTIMIT, and NIST-2001, demonstrate that MixAR, using only half the number of parameters and only static features, can achieve a lower equal error rate than GMMs, particularly in the presence of previously unseen noise. Performance as a function of the duration of both the training and evaluation utterances is also analyzed.
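To make the modeling difference concrete, the following is a minimal sketch of one common scalar form of a mixture autoregressive density, p(x_t | history) = Σ_m w_m N(x_t; a_mᵀ·history + b_m, σ_m²): unlike a GMM, each mixture component predicts the current frame from the previous p frames. The function name and all parameter values here are illustrative, not those of the paper's MixAR implementation.

```python
import numpy as np

def mixar_loglik(x, weights, coeffs, biases, variances):
    """Log-likelihood of a 1-D series x under a toy mixture AR model.

    weights:   (M,) mixture weights summing to 1
    coeffs:    (M, p) AR coefficients per component
    biases:    (M,) per-component offsets
    variances: (M,) per-component Gaussian variances
    """
    M, p = coeffs.shape
    total = 0.0
    for t in range(p, len(x)):
        hist = x[t - p:t][::-1]            # most recent sample first
        means = coeffs @ hist + biases     # (M,) per-component predictions
        # Mixture of Gaussians centered on the AR predictions
        dens = weights * np.exp(-(x[t] - means) ** 2 / (2.0 * variances)) \
               / np.sqrt(2.0 * np.pi * variances)
        total += np.log(dens.sum())
    return total
```

With all AR coefficients set to zero this reduces to an ordinary GMM likelihood, which is one way to see MixAR as a strict generalization of the GMM baseline.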