A two-stage speech activity detection system considering fractal aspects of prosody

Authors:
Soheil Shafiee;Farshad Almasganj;Bahram Vazirnezhad;Ayyoob Jafari
Affiliations:
Biomedical Engineering Department, Amirkabir University of Technology (Polytechnic of Tehran), Iran;Biomedical Engineering Department, Amirkabir University of Technology (Polytechnic of Tehran), Iran;Biomedical Engineering Department, Amirkabir University of Technology (Polytechnic of Tehran), Iran and Iran Electronics Research Institute, Iran;Biomedical Engineering Department, Amirkabir University of Technology (Polytechnic of Tehran), Iran
Venue:
Pattern Recognition Letters
Year:
2010

Abstract

Speech activity detectors (SADs) are essential in noisy environments for speech applications, such as speech recognition, to achieve acceptable performance. This paper presents a two-stage speech activity detection system. In the first stage, a voice activity detector discards pause segments from the audio signal, even in the presence of stationary background noise. In the second stage, the remaining segments are classified as speech or non-speech. To find the best feature set for speech/non-speech classification, a large set of robust features is introduced, and an optimal subset is selected by applying a Genetic Algorithm (GA) to the initial feature set. The fractal dimensions of numeric series of prosodic features are found to be the most discriminative features for separating speech from non-speech. The system's models are trained on a Farsi database, FARSDAT; test experiments on the English TIMIT database have also been conducted. Employing the SAD system in conjunction with an ASR system resulted in a relative Word Error Rate (WER) reduction of as high as 28.3%.
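
As an illustration of what a "fractal dimension of a numeric series" can look like in practice, the following minimal sketch estimates it with Higuchi's method. The abstract does not state which estimator or which prosodic features the authors use, so the choice of Higuchi's method, the function name `higuchi_fd`, and the parameter `kmax` are assumptions made here for illustration only. The input could be, for example, a frame-by-frame pitch or energy contour of one segment.

```python
import math


def higuchi_fd(series, kmax=8):
    """Estimate the fractal dimension of a 1-D series with Higuchi's method.

    Illustrative only: the estimator and the default kmax are assumptions,
    not details taken from the paper.
    """
    n = len(series)
    if kmax < 2 or n < 2 * kmax:
        raise ValueError("need kmax >= 2 and at least 2 * kmax samples")

    log_inv_k, log_len = [], []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):
            # Number of k-spaced steps available when starting at offset m.
            q = (n - 1 - m) // k
            dist = sum(
                abs(series[m + i * k] - series[m + (i - 1) * k])
                for i in range(1, q + 1)
            )
            # Normalised curve length for this offset and scale.
            lengths.append(dist * (n - 1) / (q * k) / k)
        mean_len = sum(lengths) / k
        if mean_len == 0:
            raise ValueError("zero curve length (constant or exactly periodic series)")
        log_inv_k.append(math.log(1.0 / k))
        log_len.append(math.log(mean_len))

    # Least-squares slope of log L(k) against log(1/k) is the estimate.
    mx = sum(log_inv_k) / kmax
    my = sum(log_len) / kmax
    num = sum((x - mx) * (y - my) for x, y in zip(log_inv_k, log_len))
    den = sum((x - mx) ** 2 for x in log_inv_k)
    return num / den
```

A smooth contour such as a linear ramp yields a value close to 1, while an irregular, noise-like contour yields a value closer to 2. A per-segment value of this kind could serve as one entry in a feature vector passed to a speech/non-speech classifier.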