Phonetic alignment: speech synthesis-based vs. Viterbi-based

  • Authors:
  • F. Malfrère, O. Deroo, T. Dutoit, C. Ris

  • Affiliations:
  • F. Malfrère, O. Deroo: Faculté Polytechnique de Mons-TCTS, 31, Bld. Dolez, B-7000 Mons, Belgium; Babel Technologies SA, Boulevard Dolez 33, 7000 Mons, Belgium
  • T. Dutoit, C. Ris: Faculté Polytechnique de Mons-TCTS, 31, Bld. Dolez, B-7000 Mons, Belgium

  • Venue:
  • Speech Communication
  • Year:
  • 2003

Abstract

In this paper we compare two different methods for automatically phonetically labeling a continuous speech database, as usually required for designing a speech recognition or speech synthesis system. The first method is based on temporal alignment of speech on a synthetic speech pattern; the second method uses either a continuous density hidden Markov model (HMM) system or a hybrid HMM/ANN (artificial neural network) system in forced alignment mode. Both systems have been evaluated on read utterances not contained in the training set of the HMM systems and compared to manual segmentation. This study outlines the advantages and drawbacks of both methods. The synthesis-based system has the great advantage that no training stage (and hence no large labeled database) is needed, while HMM systems easily handle multiple phonetic transcriptions (phonetic lattices). We derive a method for the automatic creation of large phonetically labeled speech databases, based on using the synthetic speech segmentation tool to bootstrap the training process of either an HMM or a hybrid HMM/ANN system. Such segmentation tools are a key element in the development of improved multilingual speech synthesis and recognition systems.
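The synthesis-based method described above rests on temporally aligning the natural utterance against a synthetic rendition of the same text whose phone boundaries are known exactly, typically via dynamic time warping (DTW) over acoustic feature frames. The sketch below is a minimal, self-contained illustration of that idea, not the authors' implementation: the function names (dtw_path, map_boundaries), the plain Euclidean frame distance, and the toy random "MFCC-like" features are all hypothetical stand-ins for a real front end and synthesizer output.

```python
import numpy as np

def dtw_path(ref, test):
    """Plain DTW between two feature sequences (frames x dims).
    Returns the warping path as a list of (ref_frame, test_frame) pairs."""
    n, m = len(ref), len(test)
    # Local Euclidean distance between every reference/test frame pair.
    dist = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    # Cumulative cost with the usual (diagonal, vertical, horizontal) steps.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def map_boundaries(synth_boundaries, path):
    """Project phone boundaries known on the synthetic side onto the
    natural utterance via the warping path (first matching frame wins)."""
    lookup = {}
    for ref_frame, test_frame in path:
        lookup.setdefault(ref_frame, test_frame)
    return [lookup[b] for b in synth_boundaries]

# Toy usage with hypothetical data: a 120-frame synthetic utterance whose
# phone boundaries are known from the synthesizer, and a 150-frame natural
# utterance to be labeled.
rng = np.random.default_rng(0)
synth_feats = rng.normal(size=(120, 13))
natural_feats = rng.normal(size=(150, 13))
synth_phone_boundaries = [0, 30, 55, 90]  # frame indices, known a priori
path = dtw_path(synth_feats, natural_feats)
print(map_boundaries(synth_phone_boundaries, path))
```

In contrast, the Viterbi-based alternative in the paper runs a trained HMM or hybrid HMM/ANN system in forced alignment mode over the known transcription, which is what requires the bootstrapped training stage the abstract refers to.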