Mapping between acoustic and articulatory gestures

  • Authors:
  • G. Ananthakrishnan; Olov Engwall

  • Affiliation (both authors):
  • Centre for Speech Technology (CTT), School of Computer Science and Communication, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden

  • Venue:
  • Speech Communication
  • Year:
  • 2011

Abstract

This paper proposes a definition for both articulatory and acoustic gestures, along with a method to segment measured articulatory trajectories and acoustic waveforms into gestures. Using a simultaneously recorded acoustic-articulatory database, gestures are detected by locating critical points in the utterance, in both the acoustic and the articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized using a two-dimensional discrete cosine transform (2D-DCT), the same transformation that is applied to the acoustics. The relationship between the detected acoustic and articulatory gestures is studied in terms of both timing and shape. To examine this relationship further, acoustic-to-articulatory inversion is performed using GMM-based regression. The accuracy of predicting the articulatory trajectories from the acoustic waveforms is on par with state-of-the-art frame-based methods with dynamical constraints (an average error of 1.45–1.55 mm for the two speakers in the database). To evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in the estimated critical points is suggested. Using this method, it was found that the articulatory trajectories estimated by acoustic-to-articulatory inversion were still not accurate enough to fall within the perceptual tolerance of audio-visual asynchrony.
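Two of the techniques named in the abstract lend themselves to a short illustration: parameterizing a fixed-length segment of trajectories with a 2D-DCT, and recovering articulation from acoustics with GMM-based regression, i.e. the conditional mean of a joint Gaussian mixture over stacked acoustic and articulatory features. The sketch below is a minimal Python illustration of these standard techniques, not the authors' implementation; the function names, the coefficient truncation (`n_coeffs`), and the number of mixture components are assumptions made for the example.

```python
import numpy as np
from scipy.fft import dct
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def dct2_features(segment, n_coeffs=5):
    """2D-DCT of a (frames x channels) segment; keep the low-order
    (n_coeffs x n_coeffs) block as a compact gesture parameterization.
    The truncation size is illustrative, not the paper's setting."""
    c = dct(dct(segment, norm='ortho', axis=0), norm='ortho', axis=1)
    return c[:n_coeffs, :n_coeffs].ravel()

class GMMRegression:
    """Joint-density GMM regression: fit a GMM on z = [x; y], then
    estimate E[y | x] as a responsibility-weighted sum of the
    per-component linear regressions from x to y."""
    def __init__(self, n_components=16):
        self.gmm = GaussianMixture(n_components=n_components,
                                   covariance_type='full')

    def fit(self, X, Y):
        self.dx = X.shape[1]
        self.gmm.fit(np.hstack([X, Y]))
        return self

    def predict(self, X):
        dx = self.dx
        w, mu, S = self.gmm.weights_, self.gmm.means_, self.gmm.covariances_
        n_comp = len(w)
        resp = np.zeros((X.shape[0], n_comp))
        cond_means = []
        for k in range(n_comp):
            Sxx, Sxy = S[k][:dx, :dx], S[k][:dx, dx:]
            # E[y | x, k] = mu_y + Syx Sxx^{-1} (x - mu_x);
            # Sxx is symmetric, so Syx Sxx^{-1} = (Sxx^{-1} Sxy)^T.
            A = np.linalg.solve(Sxx, Sxy).T
            cond_means.append(mu[k, dx:] + (X - mu[k, :dx]) @ A.T)
            # Posterior over components uses only the acoustic marginal.
            resp[:, k] = w[k] * multivariate_normal.pdf(X, mu[k, :dx], Sxx)
        resp /= resp.sum(axis=1, keepdims=True)  # assumes nonzero likelihood
        preds = np.zeros_like(cond_means[0])
        for k in range(n_comp):
            preds += resp[:, [k]] * cond_means[k]
        return preds

# Hypothetical usage: X_train (N x Dx) acoustic 2D-DCT features and
# Y_train (N x Dy) articulatory 2D-DCT features from time-aligned
# gesture segments.
#   model = GMMRegression(n_components=16).fit(X_train, Y_train)
#   Y_hat = model.predict(X_test)
```

In this formulation, each mixture component contributes a local linear mapping from acoustic to articulatory features, weighted by the posterior probability of the component given the acoustic observation. The frame-based systems the paper compares against typically add dynamical (delta-feature) constraints on top of such a regression, which this sketch omits.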