Audio-to-Visual Conversion Using Hidden Markov Models

Authors:
Soonkyu Lee; Dongsuk Yook
Affiliations:
-;-
Venue:
PRICAI '02 Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence: Trends in Artificial Intelligence
Year:
2002

Abstract

We describe audio-to-visual conversion techniques for efficient multimedia communications. The audio signals are automatically converted to visual images of mouth shapes. Visual speech can be represented as a sequence of visemes, which are the generic face images corresponding to particular sounds. Visual images synchronized with audio signals can provide a user-friendly interface for man-machine interaction. They can also help people with impaired hearing. We use hidden Markov models (HMMs) to convert audio signals to a sequence of visemes. In this paper, we compare two approaches to using HMMs. In the first approach, an HMM is trained for each viseme, and the audio signals are directly recognized as a sequence of visemes. In the second approach, each phoneme is modeled with an HMM, and a general phoneme recognizer is used to produce a phoneme sequence from the audio signals. The phoneme sequence is then converted to a viseme sequence. We implemented the two approaches and tested them on the TIMIT speech corpus. The viseme recognizer shows a 33.9% error rate, and the phoneme-based approach exhibits a 29.7% viseme recognition error rate. When similar viseme classes are merged, we found that the error rates can be reduced to 20.5% and 13.9%, respectively.
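
The second approach ends with a phoneme-to-viseme conversion step, which can be illustrated as a many-to-one table lookup. The following is a minimal sketch under stated assumptions, not the authors' implementation: the table `PHONEME_TO_VISEME`, the viseme class names, the function `phonemes_to_visemes`, and the choice to collapse consecutive repeats are all hypothetical and chosen only for illustration. The abstract does not list the viseme classes actually used.

```python
from itertools import groupby

# Hypothetical many-to-one phoneme-to-viseme table (illustrative only;
# this is NOT the class inventory used in the paper).
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "s": "V_alveolar", "z": "V_alveolar",
    "aa": "V_open", "ae": "V_open",
    "iy": "V_spread", "ih": "V_spread",
    "uw": "V_round", "ow": "V_round",
    "sil": "V_silence",
}


def phonemes_to_visemes(phonemes, table=PHONEME_TO_VISEME, collapse=True):
    """Map a recognized phoneme sequence to a viseme sequence.

    With collapse=True, consecutive identical visemes are merged into one
    (an assumption made here; the abstract does not say whether this is done).
    """
    visemes = [table[p] for p in phonemes]  # raises KeyError on unknown phonemes
    if collapse:
        visemes = [v for v, _ in groupby(visemes)]
    return visemes


# Stand-in for the output of a phoneme recognizer.
recognized = ["sil", "m", "aa", "p", "b", "iy", "sil"]
print(phonemes_to_visemes(recognized))
```

In this hypothetical table, "p" and "b" map to the same class, so a recognizer that confuses the two still produces the same viseme. Merging similar viseme classes, as in the final experiment, corresponds to making the table coarser still.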