A study of voice activity detection techniques for NIST speaker recognition evaluations

Authors:
Man-Wai Mak;Hon-Bill Yu
Venue:
Computer Speech and Language
Year:
2014


Abstract

Since 2008, interview-style speech has become an important part of the NIST speaker recognition evaluations (SREs). Unlike telephone speech, interview speech has a lower signal-to-noise ratio (SNR), which necessitates robust voice activity detectors (VADs). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties of performing speech/non-speech segmentation on these files. To overcome these difficulties, the paper proposes using speech enhancement techniques as a pre-processing step to improve the reliability of energy-based and statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects of impulsive signals and sinusoidal background signals. The proposed VAD is compared with the ASR transcripts provided by NIST, the VAD in the ETSI-AMR Option 2 coder, a statistical-model (SM) based VAD, and a Gaussian mixture model (GMM) based VAD. Experimental results on the NIST 2010 SRE dataset suggest that the proposed VAD outperforms these conventional ones whenever interview-style speech is involved. The study also demonstrates that (1) noise reduction is vital for energy-based VAD under low SNR; (2) the ASR transcripts and the ETSI-AMR speech coder do not produce accurate speech/non-speech segmentations; and (3) spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD. The segmentation files produced by the proposed VAD are available at http://bioinfo.eie.polyu.edu.hk/ssvad.
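
To make the idea of noise reduction ahead of an energy-based VAD concrete, the following is a minimal sketch and not the authors' implementation. It applies textbook power spectral subtraction to each frame and thresholds the remaining energy. The function names, the frame length, the over-subtraction factor `alpha`, the spectral floor `beta`, the threshold `ratio`, and the assumption that the first few frames contain background only are all illustrative choices that do not come from the paper. The decision strategy for impulsive and sinusoidal signals described in the abstract is not reproduced here.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum of a frame via a naive DFT (O(N^2), no dependencies)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2)
    return spec


def ss_energy_vad(signal, frame_len=64, noise_frames=5, alpha=2.0, beta=0.01, ratio=3.0):
    """Return one boolean per non-overlapping frame: True = speech, False = non-speech.

    All parameter defaults are hypothetical values chosen for illustration.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    spectra = [power_spectrum(f) for f in frames]
    bins = len(spectra[0])

    # Assumption: the first `noise_frames` frames hold background noise only,
    # so their average spectrum serves as the noise estimate.
    noise = [sum(s[k] for s in spectra[:noise_frames]) / noise_frames
             for k in range(bins)]

    # Power spectral subtraction with over-subtraction (alpha) and a spectral
    # floor (beta) that keeps every bin non-negative.
    energies = []
    for s in spectra:
        cleaned = [max(s[k] - alpha * noise[k], beta * s[k]) for k in range(bins)]
        energies.append(sum(cleaned))

    # Energy decision: compare each frame with the residual energy that the
    # noise-only frames still have after subtraction.
    floor = sum(energies[:noise_frames]) / noise_frames
    return [e > ratio * floor for e in energies]


if __name__ == "__main__":
    # Synthetic check: a steady low-level hum throughout, with a louder tone
    # standing in for speech during frames 5 to 9.
    n = 64
    sig = []
    for f in range(15):
        for t in range(n):
            x = 0.1 * math.sin(2 * math.pi * 4 * t / n)
            if 5 <= f < 10:
                x += math.sin(2 * math.pi * 8 * t / n)
            sig.append(x)
    print(ss_energy_vad(sig))
```

In this toy signal the hum is perfectly stationary, so subtraction reduces it to the spectral floor and only the frames containing the louder tone exceed the threshold. Real interview recordings have non-stationary noise, which is why the noise estimate and threshold would need to be tracked over time rather than fixed from the first frames.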