The SR-tree: an index structure for high-dimensional nearest neighbor queries
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
R-trees: a dynamic index structure for spatial searching
SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
CHATR: a generic speech synthesis system
COLING '94 Proceedings of the 15th conference on Computational linguistics - Volume 2
Statistical modeling for unit selection in speech synthesis
ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
Unit selection in a concatenative speech synthesis system using a large speech database
ICASSP '96 Proceedings of the Acoustics, Speech, and Signal Processing, 1996. on Conference Proceedings., 1996 IEEE International Conference - Volume 01
Segment pre-selection in decision-tree based speech synthesis systems
ICASSP '00 Proceedings of the Acoustics, Speech, and Signal Processing, 2000. on IEEE International Conference - Volume 02
Hi-index | 0.00 |
We propose a new approach to rapidly identifying adequate synthesis units in extremely large speech corpora. Our aim is to develop a concatenative speech synthesis system with high performance (both speech quality and throughput) for various practical applications. Utilizing very large speech corpora allows more natural sounding synthesized speech to be created; the downside is an increase in the time taken to locate the synthesis units needed. The key to overcoming this problem is introducing state-of-the art database retrieval technologies. The first selection step, based on simple hash search, tabulates all synthesis unit candidates. The second step selects N best candidates using nearest neighbor search, a typical database retrieval technique. Finally, the best sequence of synthesis units is determined by Viterbi search. A runtime measurement test and subjective experiment are carried out. Their results confirm that the proposed approach reduces the runtime by about 40% compared to using only hash search with no degradation in the quality of synthesized speech for a 15 hour corpus.