Voice quality conversion using interactive evolution of prosodic control

  • Authors:
  • Yuji Sato

  • Affiliations:
  • Faculty of Computer and Information Sciences, Hosei University, 3-7-2, Kajino-cho, Koganei-shi, Tokyo 184-8584, Japan

  • Venue:
  • Applied Soft Computing
  • Year:
  • 2005

Abstract

Recent years have seen the birth of new markets using voice quality conversion technology in a variety of application fields, including personal man-machine interfaces, addition of narration in multimedia-content editing, and computer games. Optimal parameters for voice quality conversion, however, are speaker dependent. Consequently, no clear-cut algorithm has existed, and parameter adjustment has usually been performed by an experienced designer on a trial-and-error basis. This paper proposes the application of evolutionary computation, a stochastic search technique based on organic evolution, to parameter adjustment for voice conversion, and reports on several experimental results applicable to the fitting of prosodic coefficients. Evolutionary computation is said to be "applicable even to cases where the properties of the target function are not well known," and we decided to apply it on the expectation that this feature would be effective in our study. Providing an explicit evaluation function for evolutionary computation, however, is difficult, so we adopt an interactive-evolution system in which genetic manipulation is performed repeatedly while results are evaluated based on human emotions. Evaluation experiments were performed on raw human speech recorded by a microphone and on speech mechanically synthesized from text. It was found that, compared with parameter adjustment based on designer experience or trial and error, the application of evolutionary computation could achieve voice conversion satisfying specific targets with relatively little degradation of sound quality and no impression of artificial processing. This paper also shows that prosodic conversion coefficients determined by the interactive evolution technique, while exhibiting speaker dependency, are not text dependent.
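The interactive-evolution loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the three-coefficient genome (pitch, power, duration), the coefficient bounds, the population size, the genetic operators, and all function names are assumptions chosen for illustration. The `rate_fn` callback stands in for the human listener who would score each converted voice.

```python
import random

# Hypothetical genome: three prosodic scaling coefficients
# (pitch, power, duration). The bounds are illustrative assumptions.
BOUNDS = [(0.5, 2.0), (0.5, 2.0), (0.5, 2.0)]


def random_individual(rng):
    """Draw one coefficient vector uniformly within the bounds."""
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS]


def crossover(a, b, rng):
    """Uniform crossover: each coefficient comes from either parent."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(ind, rng, rate=0.2, sigma=0.1):
    """Perturb some coefficients with Gaussian noise, clamped to bounds."""
    out = []
    for value, (lo, hi) in zip(ind, BOUNDS):
        if rng.random() < rate:
            value += rng.gauss(0.0, sigma)
        out.append(min(hi, max(lo, value)))
    return out


def interactive_evolution(rate_fn, pop_size=8, generations=10, seed=0):
    """Evolve coefficients using rate_fn as the (human) evaluator.

    rate_fn takes a coefficient vector and returns a score; higher is
    better. In a real system it would play the converted speech and
    collect a listener's rating.
    """
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Rank candidates by the evaluator's score, best first.
        ranked = sorted(population, key=rate_fn, reverse=True)
        # Keep the top half unchanged so the best candidate is never lost.
        parents = ranked[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=rate_fn)


if __name__ == "__main__":
    # Simulated listener (an assumption) that prefers a fixed target.
    target = [1.3, 0.9, 1.1]
    listener = lambda c: -sum((x - t) ** 2 for x, t in zip(c, target))
    print(interactive_evolution(listener))
```

A small population and few generations are used here because, with a human in the loop, every evaluation costs listening time; a practical system would also cache ratings rather than ask the listener to score the same candidate twice.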