Local spatiotemporal descriptors for visual recognition of spoken phrases

  • Authors:
  • Guoying Zhao, Matti Pietikäinen, Abdenour Hadid

  • Affiliations:
  • University of Oulu, Oulu, Finland

  • Venue:
  • Proceedings of the international workshop on Human-centered multimedia
  • Year:
  • 2007

Abstract

Visual speech information plays an important role in speech recognition under noisy conditions or for listeners with hearing impairment. In this paper, we propose local spatiotemporal descriptors to represent and recognize spoken isolated phrases based solely on visual input. Eye positions determined by a robust face and eye detector are used to localize the mouth regions in face images. Spatiotemporal local binary patterns extracted from these regions are used to describe phrase sequences. In our experiments with 817 sequences covering ten phrases and 20 speakers, promising accuracies of 62% and 70% were obtained in speaker-independent and speaker-dependent recognition, respectively. In comparison with other methods on the Tulips1 audio-visual database, our method's accuracy of 92.7% clearly outperforms the others. Advantages of our approach include local processing and robustness to monotonic gray-scale changes. Moreover, no error-prone segmentation of moving lips is needed.
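The spatiotemporal descriptors in the paper build on the 2-D local binary pattern (LBP) operator, whose thresholding step is what gives the claimed robustness to monotonic gray-scale changes. As a minimal sketch (not the authors' implementation, which extends LBP to spatiotemporal volumes), the basic 8-neighbour LBP and its histogram descriptor can be written as:

```python
import numpy as np

def lbp_basic(img):
    """Basic 8-neighbour LBP codes for a 2-D grayscale array.

    Each neighbour is thresholded against the centre pixel and the
    resulting bits are packed into an 8-bit code. Because only the
    ordering of gray values matters, the codes are unchanged by any
    monotonic gray-scale transform.
    """
    img = np.asarray(img, dtype=np.int32)
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.int32)
    # Neighbour offsets, clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:h - 1, 1:w - 1]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neigh >= center).astype(np.int32) << bit
    return codes.astype(np.uint8)

def lbp_histogram(img, bins=256):
    """Normalized histogram of LBP codes, usable as a region descriptor."""
    codes = lbp_basic(img)
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist / hist.sum()
```

In the spatiotemporal setting, such histograms are computed not only in the image plane but also in planes spanning the time axis of the mouth-region sequence, and the concatenated histograms describe the spoken phrase.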