Unit selection using k-nearest neighbor search for concatenative speech synthesis

Authors:
Hideyuki Mizuno;Satoshi Takahashi
Affiliations:
NTT Cyber Space Laboratories, Yokosuka-Shi, Kanagawa, Japan;NTT Cyber Space Laboratories, Yokosuka-Shi, Kanagawa, Japan
Venue:
Proceedings of the 3rd International Universal Communication Symposium
Year:
2009

Abstract

We propose a new approach for rapidly identifying adequate synthesis units in extremely large speech corpora. Our aim is to develop a concatenative speech synthesis system that offers high performance, in both speech quality and throughput, for various practical applications. Using very large speech corpora allows more natural-sounding synthesized speech to be created; the downside is an increase in the time taken to locate the required synthesis units. The key to overcoming this problem is to introduce state-of-the-art database retrieval technologies. The first selection step, based on a simple hash search, tabulates all synthesis unit candidates. The second step selects the N-best candidates using nearest neighbor search, a typical database retrieval technique. Finally, the best sequence of synthesis units is determined by Viterbi search. A runtime measurement test and a subjective experiment were carried out. The results confirm that, for a 15-hour corpus, the proposed approach reduces runtime by about 40% compared with using hash search alone, with no degradation in the quality of the synthesized speech.
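The three steps described in the abstract (hash lookup, N-best preselection by nearest neighbor search, Viterbi search) can be illustrated with a minimal sketch. This is not the authors' implementation. The unit records, the `feat` feature vectors, the Euclidean target and join costs, and all function names are assumptions made for illustration. The nearest neighbor step here is a brute-force sort; a tree-based index would replace it in a large-corpus setting.

```python
import math
from collections import defaultdict


def build_index(units):
    """Step 1 (hash search): map each label to all units carrying that label."""
    index = defaultdict(list)
    for unit in units:
        index[unit["label"]].append(unit)
    return dict(index)  # plain dict, so an unknown label raises KeyError


def n_best(candidates, target_feat, n):
    """Step 2 (nearest neighbor search): keep the n candidates closest to the target.

    Brute-force stand-in for an indexed k-nearest-neighbor query.
    """
    return sorted(candidates, key=lambda u: math.dist(u["feat"], target_feat))[:n]


def select_units(index, targets, n=2, join_weight=1.0):
    """Step 3 (Viterbi search): pick one unit per target, minimizing total cost.

    Assumed costs: target cost = distance between unit and target features,
    join cost = distance between the features of adjacent units.
    """
    lattice = [n_best(index[label], feat, n) for label, feat in targets]
    # best[i][j] = (cost of the cheapest path ending at lattice[i][j], backpointer)
    best = [[(math.dist(u["feat"], targets[0][1]), None) for u in lattice[0]]]
    for i in range(1, len(lattice)):
        row = []
        for u in lattice[i]:
            target_cost = math.dist(u["feat"], targets[i][1])
            cost, back = min(
                (best[i - 1][k][0] + join_weight * math.dist(p["feat"], u["feat"]), k)
                for k, p in enumerate(lattice[i - 1])
            )
            row.append((cost + target_cost, back))
        best.append(row)
    # Trace back from the cheapest final state.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(lattice) - 1, -1, -1):
        path.append(lattice[i][j])
        j = best[i][j][1]
    return path[::-1]


# Hypothetical toy corpus: two labels, two-dimensional feature vectors.
units = [
    {"id": "a1", "label": "a", "feat": (1.0, 0.0)},
    {"id": "a2", "label": "a", "feat": (5.0, 5.0)},
    {"id": "a3", "label": "a", "feat": (1.2, 0.1)},
    {"id": "b1", "label": "b", "feat": (1.0, 1.0)},
    {"id": "b2", "label": "b", "feat": (9.0, 9.0)},
]
targets = [("a", (1.0, 0.0)), ("b", (1.0, 1.0))]

index = build_index(units)
chosen = select_units(index, targets, n=2)
print([u["id"] for u in chosen])
```

In this sketch the preselection step bounds the Viterbi lattice to at most `n` candidates per position, so the search cost grows with `n` squared per transition rather than with the full number of hash matches.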