Techniques for vision-based human-computer interaction

  • Authors:
  • Gregory D. Hager; Jason J. Corso

  • Affiliations:
  • The Johns Hopkins University; The Johns Hopkins University

  • Venue:
  • Techniques for vision-based human-computer interaction
  • Year:
  • 2006

Abstract

With the ubiquity of powerful mobile computers and rapid advances in sensing and robot technologies, there is great potential for creating advanced, intelligent computing environments. We investigate techniques for integrating passive, vision-based sensing into such environments, spanning both conventional interfaces and large-scale environments. We propose a new methodology for vision-based human-computer interaction called the Visual Interaction Cues (VICs) paradigm. VICs fundamentally relies on a shared perceptual space between the user and the computer, established using monocular and stereoscopic video. In this space, we represent each interface component as a localized region in the image(s). Because each component provides a clearly defined interaction locale, it is not necessary to visually track the user; instead, we model interaction as an expected stream of visual cues corresponding to a gesture. Example interaction cues include motion, as when a finger moves to press a push-button, and 3D hand posture for a communicative gesture such as a letter in sign language. We explore both procedurally defined parsers of the low-level visual cues and machine-learning techniques (e.g., neural networks) for cue parsing.

Individual gestures are analogous to a language with only words and no grammar. We have therefore constructed a high-level language model that integrates a set of low-level gestures into a single, coherent probabilistic framework. In this language model, every low-level gesture is called a gesture word. We build a probabilistic graphical model in which each node is a gesture word and use an unsupervised learning technique to train the gesture-language model. A complete action is then a sequence of these words through the graph and is called a gesture sentence.

We are especially interested in building mobile interactive systems in large-scale, unknown environments. We study the associated "where am I?" problem: the mobile system must map the environment and localize itself in it using the video imagery. Under the VICs paradigm, we can solve the interaction problem using local geometry without requiring a complete metric map of the environment. (Abstract shortened by UMI.)
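To illustrate the core VICs idea of watching a localized image region for an expected cue stream rather than tracking the user, here is a minimal Python sketch. The class name, region coordinates, frame-differencing cue, and threshold are illustrative assumptions, not details from the dissertation.

```python
import numpy as np

class VICComponent:
    """Hypothetical sketch of a VICs-style interface component.

    The component watches a fixed, localized image region (its interaction
    locale) and fires when a simple low-level motion cue, modeled here as
    frame differencing, exceeds a threshold. All parameters are assumed.
    """

    def __init__(self, x, y, w, h, motion_threshold=12.0):
        self.region = (slice(y, y + h), slice(x, x + w))  # image-plane locale
        self.motion_threshold = motion_threshold
        self.prev_patch = None

    def observe(self, frame):
        """Feed one grayscale frame; return True when the 'press' cue fires."""
        patch = frame[self.region].astype(np.float32)
        fired = False
        if self.prev_patch is not None:
            # Mean absolute frame difference inside the locale serves as the
            # low-level motion cue for a push-button style gesture.
            motion = np.abs(patch - self.prev_patch).mean()
            fired = motion > self.motion_threshold
        self.prev_patch = patch
        return fired

# Toy usage: synthetic 120x160 frames, with "motion" injected at frame 5.
button = VICComponent(x=40, y=30, w=32, h=32)
rng = np.random.default_rng(0)
for t in range(10):
    frame = rng.integers(0, 10, size=(120, 160)).astype(np.uint8)
    if t == 5:
        frame[30:62, 40:72] = 255  # simulate a finger entering the locale
    if button.observe(frame):
        print(f"frame {t}: push-button cue detected")
```

In practice a component could chain several such cue checks (motion, then color, then posture) into the expected stream described in the abstract; this sketch shows only the single motion cue.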
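The abstract does not specify the structure of the gesture-language model, so the following sketch simply assumes a first-order Markov chain over gesture words, with transition probabilities estimated from unlabeled gesture sequences and a gesture sentence scored as a path through the word graph. The vocabulary, smoothing, and training data are hypothetical.

```python
import numpy as np

# Hypothetical vocabulary of low-level gesture words; the actual set used
# in the dissertation is not given in the abstract.
WORDS = ["hover", "press", "release", "grab", "drop"]
IDX = {w: i for i, w in enumerate(WORDS)}

def train_transitions(sequences, smoothing=1.0):
    """Estimate gesture-word transition probabilities from unlabeled
    sequences (a stand-in for unsupervised training of the language model)."""
    counts = np.full((len(WORDS), len(WORDS)), smoothing)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[IDX[a], IDX[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sentence_log_prob(sentence, trans):
    """Score a candidate gesture sentence as a path through the word graph."""
    return sum(np.log(trans[IDX[a], IDX[b]])
               for a, b in zip(sentence, sentence[1:]))

# Toy usage with synthetic observation sequences.
data = [["hover", "press", "release"],
        ["hover", "grab", "drop"],
        ["hover", "press", "release"]]
T = train_transitions(data)
print(sentence_log_prob(["hover", "press", "release"], T))  # likely sentence
print(sentence_log_prob(["drop", "grab", "hover"], T))      # unlikely sentence
```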