Robust Speech Rate Estimation for Spontaneous Speech

Authors:
Dagen Wang;S. S. Narayanan
Affiliations:
IBM T. J. Watson Res. Center, Yorktown Heights, NY;-
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2007

Cited by (2):

SocioPhone: everyday face-to-face interaction monitoring platform using multi-phone sensor fusion
Proceeding of the 11th annual international conference on Mobile systems, applications, and services

Automated assessment and treatment of speech rate and intonation in dysarthria
Proceedings of the 7th International Conference on Pervasive Computing Technologies for Healthcare


Abstract

In this paper, we propose a direct method for speech rate estimation from acoustic features without requiring any automatic speech transcription. We compare various spectral and temporal signal analysis and smoothing strategies to better characterize the underlying syllable structure from which speech rate is derived. The proposed algorithm extends the methods of spectral subband correlation by including temporal correlation and the use of prominent spectral subbands for improving the signal correlation essential for syllable detection. Furthermore, to address some of the practical robustness issues in previously proposed methods, we introduce several novel components into the algorithm, such as the use of pitch confidence for filtering spurious syllable envelope peaks, a magnifying window for tackling neighboring syllable smearing, and relative peak measure thresholds for pseudo peak rejection. We also describe an automated approach for learning algorithm parameters from data, and find the optimal settings through Monte Carlo simulations and parameter sensitivity analysis. Final experimental evaluations are conducted on a portion of the Switchboard corpus for which both manual phonetic segmentation information and published results for direct comparison are available. The results show a correlation coefficient of 0.745 with respect to the ground truth based on manual segmentation. This result is about a 17% improvement over the current best single estimator and an 11% improvement over the multiestimator evaluated on the same Switchboard database.
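
The general idea of counting syllable-like envelope peaks and scoring the estimates by correlation can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a syllable envelope has already been computed, it replaces the paper's relative peak measure with a much simpler threshold relative to the global maximum, and it omits subband and temporal correlation, pitch confidence filtering, and the magnifying window. The names pick_peaks, speech_rate, pearson, and the rel_threshold parameter are hypothetical.

```python
from math import sqrt


def pick_peaks(envelope, rel_threshold=0.3):
    """Return indices of local maxima that reach rel_threshold * max(envelope).

    Assumption: the threshold is relative to the global maximum, a
    simplification that stands in for the paper's relative peak measure.
    """
    if len(envelope) < 3:
        return []
    floor = rel_threshold * max(envelope)
    peaks = []
    for i in range(1, len(envelope) - 1):
        # Strict rise on the left so a flat plateau is counted only once.
        is_local_max = envelope[i - 1] < envelope[i] >= envelope[i + 1]
        if is_local_max and envelope[i] >= floor:
            peaks.append(i)
    return peaks


def speech_rate(envelope, frame_rate_hz, rel_threshold=0.3):
    """Estimate speech rate as accepted peaks per second of signal."""
    duration_s = len(envelope) / frame_rate_hz
    return len(pick_peaks(envelope, rel_threshold)) / duration_s


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy envelope sampled at 10 frames per second (made-up values).
    toy_envelope = [0.0, 1.0, 0.0, 0.2, 0.0, 0.9, 0.0, 0.8, 0.0, 0.0]
    print(pick_peaks(toy_envelope))         # indices of accepted peaks
    print(speech_rate(toy_envelope, 10.0))  # peaks per second
```

In an evaluation of the kind the abstract describes, pearson would be applied to per-utterance rate estimates and the reference rates derived from manual segmentation; the toy envelope above is invented purely for illustration.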