Visual voice activity detection (V-VAD) plays an important role in both human-computer interaction (HCI) and human-robot interaction (HRI), affecting both the conversation strategy and the synchronization between humans and robots or computers. The typical V-VAD speakingness decision consists of post-processing for signal smoothing and threshold-based classification. Several parameters that ensure a good trade-off between hit rate and false-alarm rate are usually defined heuristically. This makes such V-VAD approaches vulnerable to noisy observations and changing environmental conditions, resulting in poor performance and frequent undesired changes of the detected speaking state. To overcome these difficulties, this paper proposes a new probabilistic approach, termed the bi-level HMM, which analyzes lip-activity energy for V-VAD in HRI. The design is based on assumptions about lip movement and speaking, embracing two essential procedures in a single model. A bi-level HMM is an HMM with two state variables on different levels, where the occurrence of a state at the lower level conditionally depends on the state at the upper level. The approach works online with low-resolution images and under various lighting conditions, and has been successfully tested on 21 image sequences (22,927 frames). It achieved a probability of detection above 90%, an improvement of almost 20% over four other V-VAD approaches.
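To make the bi-level structure concrete, the sketch below filters a sequence of lip-activity energies with a two-level HMM: an upper-level state (silent/speaking) and a lower-level lip-activity state whose transitions are conditioned on the upper state. All transition and emission parameters here are illustrative assumptions, not values from the paper; the paper's actual model and training are not reproduced.

```python
import math

# Hedged sketch of a bi-level HMM for V-VAD (assumed parameters throughout).
# Upper-level state u: 0 = silent, 1 = speaking.
# Lower-level state l: 0 = low lip activity, 1 = high lip activity.
# The lower-level transition depends on the current upper-level state,
# mirroring the idea that lip movement is conditioned on speakingness.

TRANS_U = [[0.95, 0.05],              # P(u'|u): speaking state is persistent
           [0.05, 0.95]]

TRANS_L = {                           # P(l'|l, u'), conditioned on upper state
    0: [[0.9, 0.1], [0.6, 0.4]],      # u' = silent: lips mostly inactive
    1: [[0.3, 0.7], [0.1, 0.9]],      # u' = speaking: lips mostly active
}

def gauss(x, mu, sigma):
    """Gaussian likelihood of a scalar lip-energy observation."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Emission model: lip-activity energy given the lower-level state (assumed means).
EMIT = [lambda e: gauss(e, 0.1, 0.15),   # l = 0: low energy
        lambda e: gauss(e, 1.0, 0.3)]    # l = 1: high energy

def forward_vvad(energies):
    """Forward filtering over the joint state (u, l); returns P(speaking) per frame."""
    # Uniform prior over the four joint states, weighted by the first emission.
    alpha = {(u, l): 0.25 * EMIT[l](energies[0]) for u in (0, 1) for l in (0, 1)}
    out = []
    for t, e in enumerate(energies):
        if t > 0:
            alpha = {
                (u2, l2): EMIT[l2](e) * sum(
                    alpha[(u, l)] * TRANS_U[u][u2] * TRANS_L[u2][l][l2]
                    for u in (0, 1) for l in (0, 1))
                for u2 in (0, 1) for l2 in (0, 1)
            }
        z = sum(alpha.values())
        alpha = {k: v / z for k, v in alpha.items()}  # normalize to avoid underflow
        out.append(alpha[(1, 0)] + alpha[(1, 1)])     # marginal P(speaking)
    return out
```

Because the speaking decision comes from a filtered posterior rather than a raw threshold on the energy signal, brief noisy spikes are absorbed by the persistent upper-level dynamics, which is the kind of smoothing the single-model design aims to provide.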