Patch-based representation of visual speech

  • Authors:
  • Patrick Lucey; Sridha Sridharan

  • Affiliations:
  • Queensland University of Technology, Brisbane, QLD, Australia; Queensland University of Technology, Brisbane, QLD, Australia

  • Venue:
  • VisHCI '06 Proceedings of the HCSNet workshop on Use of vision in human-computer interaction - Volume 56
  • Year:
  • 2006


Abstract

Visual information from a speaker's mouth region is known to improve automatic speech recognition robustness, especially in the presence of acoustic noise. To date, the vast majority of work in this field has treated these visual features in a holistic manner, which may not capture the various changes that occur during articulation (the process of changing the shape of the vocal tract using the articulators, i.e. the lips and jaw). Motivated by work in audio-visual automatic speech recognition (AVASR) and face recognition using articulatory features (AFs) and patches respectively, we present a proof-of-concept paper in which the mouth region is represented as an ensemble of image patches. Our experiments show that by treating the mouth region in this manner, we are able to extract more speech information from the visual domain. For the task of visual-only, speaker-independent, isolated-digit recognition, we achieved a relative word error rate improvement of more than 23% on the CUAVE audio-visual corpus.
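
To illustrate the general idea of a patch-based representation, the sketch below splits a grayscale mouth region of interest into a grid of patches and extracts a low-order 2-D DCT feature vector from each patch before concatenating them into one observation vector. The grid size, patch feature (DCT), and number of retained coefficients are illustrative assumptions for this example and are not taken from the paper, whose exact feature extraction may differ.

```python
import numpy as np
from scipy.fftpack import dct

def patch_features(mouth_roi, grid=(2, 3), coeffs_per_patch=10):
    """Split a grayscale mouth ROI into a grid of patches and extract
    a low-order 2-D DCT feature vector from each patch (assumed setup).

    mouth_roi        : 2-D array (H x W), grayscale mouth region
    grid             : (rows, cols) of the patch grid
    coeffs_per_patch : number of low-frequency DCT coefficients kept per patch
    """
    rows, cols = grid
    h, w = mouth_roi.shape
    ph, pw = h // rows, w // cols
    features = []
    for r in range(rows):
        for c in range(cols):
            patch = mouth_roi[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            # 2-D DCT (type II, orthonormal) of the patch
            d = dct(dct(patch.astype(float), norm='ortho', axis=0),
                    norm='ortho', axis=1)
            # keep the lowest-frequency coefficients via a simple zig-zag scan
            idx = sorted(((i, j) for i in range(ph) for j in range(pw)),
                         key=lambda t: (t[0] + t[1], t[0]))[:coeffs_per_patch]
            features.append(np.array([d[i, j] for i, j in idx]))
    # concatenate per-patch vectors into one observation for the recogniser
    return np.concatenate(features)

# Example: a synthetic 32x48 mouth ROI split into a 2x3 patch grid
roi = np.random.rand(32, 48)
fv = patch_features(roi)
print(fv.shape)  # (60,) = 6 patches x 10 coefficients per patch
```

In such a setup, each patch yields its own feature vector, so local articulatory changes (e.g. around the lip corners or jaw) contribute separately to the final observation rather than being averaged into a single holistic descriptor.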