Data-driven voice source waveform analysis and synthesis

Authors:
Jon Gudnason;Mark R. P. Thomas;Daniel P. W. Ellis;Patrick A. Naylor
Affiliations:
School of Science and Engineering, Reykjavik University, Iceland;Electrical and Electronic Engineering Department, Imperial College London, London SW7 2AZ, UK;LabROSA, Columbia University, New York, NY 10027, USA;Electrical and Electronic Engineering Department, Imperial College London, London SW7 2AZ, UK
Venue:
Speech Communication
Year:
2012



Abstract

A data-driven approach is introduced for studying, analyzing and processing the voice source signal. Existing approaches parameterize the voice source signal with models motivated by, for example, a physical model or function fitting. Such parameterization is often difficult to achieve, and it approximates poorly the large variety of real voice source waveforms produced by the human voice. This paper presents a novel data-driven approach that analyzes different types of voice source waveforms using principal component analysis and Gaussian mixture modeling. The approach captures certain voice source features that many other approaches fail to model. Prototype voice source waveforms are obtained from each mixture component and analyzed with respect to speaker, phone and pitch. An analysis/synthesis scheme was set up to demonstrate the effectiveness of the method. Compressing the proposed voice source representation by discarding 75% of the features yields a segmental signal-to-reconstruction error ratio of 13 dB and a Bark spectral distortion of 0.14.
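The compression experiment in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the voice source has already been cut into fixed-length, pitch-synchronous frames, and it substitutes synthetic Gaussian-shaped pulses for real data. The function names (`fit_pca`, `compress`, `segmental_srer_db`) are hypothetical. The segmental signal-to-reconstruction error ratio is taken here to be the mean over frames of the per-frame energy ratio in dB, which may differ from the paper's exact definition. The Gaussian mixture modeling step and the Bark spectral distortion are left out.

```python
import numpy as np


def fit_pca(frames):
    """Return the mean frame and the principal directions (one per row)."""
    mean = frames.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal directions,
    # ordered from largest to smallest explained variance.
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt


def compress(frames, mean, basis, keep_fraction):
    """Project onto the leading components and reconstruct the frames."""
    k = max(1, int(round(keep_fraction * basis.shape[0])))
    coeffs = (frames - mean) @ basis[:k].T  # retained features per frame
    return mean + coeffs @ basis[:k]


def segmental_srer_db(frames, recon, eps=1e-12):
    """Mean per-frame signal-to-reconstruction error ratio in dB (assumed definition)."""
    signal = np.sum(frames ** 2, axis=1)
    error = np.sum((frames - recon) ** 2, axis=1)
    return float(np.mean(10.0 * np.log10((signal + eps) / (error + eps))))


# Synthetic stand-in for voice source frames (assumption: 200 frames, 64 samples each).
rng = np.random.default_rng(0)
n_frames, frame_len = 200, 64
t = np.linspace(0.0, 1.0, frame_len, endpoint=False)
centers = rng.uniform(0.4, 0.8, size=(n_frames, 1))
widths = rng.uniform(0.1, 0.2, size=(n_frames, 1))
frames = np.exp(-0.5 * ((t - centers) / widths) ** 2)
frames += 0.01 * rng.standard_normal(frames.shape)

mean, basis = fit_pca(frames)
recon = compress(frames, mean, basis, keep_fraction=0.25)  # discard 75% of the features
srer_db = segmental_srer_db(frames, recon)
print(f"segmental SRER with 25% of features kept: {srer_db:.1f} dB")
```

The figure printed for this synthetic data is not comparable to the 13 dB reported in the abstract, since real voice source waveforms are far less regular than the pulses generated here.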