A segmental speech coder based on a concatenative TTS

Authors:
Ki-Seung Lee;Richard V. Cox
Affiliations:
Department of Electronic Engineering, Konkuk University, 1 Hwayang-dong, Gwangjin-gu, Seoul 143-701, South Korea and Speech Processing Software and Technology Research Department of AT&T Laborator ...;Speech Processing Software and Technology Research Department of AT&T Laboratories Research, NJ
Venue:
Speech Communication
Year:
2002

Abstract

An extremely low bit rate speech coder based on a recognition/synthesis paradigm is proposed. In the proposed coder, the speech signal is produced in a manner similar to concatenative text-to-speech (TTS) synthesis. Hence, database construction, unit selection, and prosody modification, which are the major parts of concatenative TTS, are employed to implement the speech coder. The synthesis units are found automatically in a large database using a joint segmentation/classification scheme. Dynamic programming (DP) is applied to unit selection, in which two cost functions, an acoustic target cost and a concatenation cost, are used to increase naturalness as well as intelligibility. Prosodic differences between the selected unit and the input segment are compensated for by time-scale and pitch modifications based on the harmonic plus noise model (HNM) framework. In single-speaker tests, the proposed scheme gave intelligible and natural-sounding speech at an average bit rate of about 580 b/s.
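
The DP-based unit selection mentioned in the abstract can be illustrated with a minimal Viterbi-style sketch. This is not the authors' implementation: the function name `select_units`, the weight `w_concat`, the use of a single feature vector per unit, and the Euclidean distance for both costs are assumptions made only for illustration.

```python
import math


def select_units(targets, candidates, w_concat=1.0):
    """Pick one candidate unit per input segment by dynamic programming.

    targets:    list of feature vectors, one per input segment
    candidates: candidates[t] is a list of feature vectors for segment t
    w_concat:   hypothetical weight on the concatenation cost
    Returns (path, total_cost), where path[t] indexes candidates[t].
    """
    n = len(targets)
    # Assumed target cost: distance between input segment and candidate.
    tcost = [[math.dist(c, targets[t]) for c in candidates[t]]
             for t in range(n)]

    best = list(tcost[0])  # best accumulated cost ending in each candidate
    back = []              # back[t-1][j]: best predecessor of candidate j at t
    for t in range(1, n):
        new_best, pointers = [], []
        for j, cur in enumerate(candidates[t]):
            # Assumed concatenation cost: distance between adjacent units.
            costs = [best[i] + w_concat * math.dist(prev, cur)
                     for i, prev in enumerate(candidates[t - 1])]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            pointers.append(i_min)
            new_best.append(costs[i_min] + tcost[t][j])
        back.append(pointers)
        best = new_best

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path, total
```

With `w_concat=0` the search reduces to picking the closest candidate for each segment independently; raising the weight trades target accuracy for smoother joins between consecutive units.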