Statistical modeling for unit selection in speech synthesis

Authors:
Cyril Allauzen;Mehryar Mohri;Michael Riley
Affiliations:
AT&T Labs -- Research, Florham Park, NJ;AT&T Labs -- Research, Florham Park, NJ;AT&T Labs -- Research, Florham Park, NJ
Venue:
ACL '04: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics
Year:
2004

Abstract

Traditional concatenative speech synthesis systems rely on a number of heuristics to define the target and concatenation costs, which are essential to the design of the unit selection component. In contrast to these approaches, we introduce a general statistical modeling framework for unit selection inspired by automatic speech recognition. Given appropriate data, techniques based on this framework can yield more accurate unit selection, thereby improving the overall quality of a speech synthesizer. They can also lead to a more modular and substantially more efficient system.

We present a new unit selection system based on statistical modeling. To overcome the initial absence of data, we use an existing high-quality unit selection system to generate a corpus of unit sequences. We show that the concatenation cost can be accurately estimated from this corpus using a statistical n-gram language model over units. We use weighted automata and transducers to represent the components of the system, and we designed a new and more efficient composition algorithm that uses string potentials to combine them. The resulting statistical unit selection is shown to be about 2.6 times faster than the last release of the AT&T Natural Voices Product while preserving the same quality, and it offers considerable flexibility for the use and integration of new and more complex components.
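
The idea of estimating the concatenation cost from a corpus of unit sequences can be illustrated with a minimal sketch. It assumes a bigram model (the simplest n-gram case) with add-alpha smoothing, and it takes the cost of joining two units to be the negative log of the conditional probability. The smoothing scheme, the function name `train_bigram_costs`, the start symbol `<s>`, and the toy unit labels are all illustrative assumptions, not details given in the abstract.

```python
import math
from collections import Counter


def train_bigram_costs(unit_sequences, alpha=1.0):
    """Build cost(prev, cur) = -log P(cur | prev) from unit sequences.

    Assumption: add-alpha smoothing over the observed unit vocabulary;
    the abstract does not say which smoothing the authors used.
    """
    history_counts = Counter()  # how often each unit appears as a history
    bigram_counts = Counter()   # how often each (prev, cur) pair appears
    vocab = set()
    for seq in unit_sequences:
        vocab.update(seq)
        padded = ["<s>"] + list(seq)  # hypothetical start-of-sequence symbol
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab_size = len(vocab)

    def cost(prev, cur):
        # Smoothed conditional probability, mapped to a cost.
        numerator = bigram_counts[(prev, cur)] + alpha
        denominator = history_counts[prev] + alpha * vocab_size
        return -math.log(numerator / denominator)

    return cost


# Toy corpus standing in for the output of an existing unit selection system.
corpus = [
    ["u1", "u2", "u3"],
    ["u1", "u2", "u4"],
    ["u5", "u2", "u3"],
]
concat_cost = train_bigram_costs(corpus)

# A frequently observed join is cheaper than one never seen in the corpus.
print(concat_cost("u1", "u2") < concat_cost("u1", "u3"))
```

In a system built on weighted automata, costs of this kind would label the transitions of an automaton over units, so that the cheapest unit sequence corresponds to a shortest path; that step is not shown here.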