Mouth gesture and voice command based robot command interface
ICRA'09 Proceedings of the 2009 IEEE international conference on Robotics and Automation
In recent years, audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing, and machine vision. Its ultimate goal is to enable human-computer communication by voice while taking into account the visual information contained in the audio-visual speech signal. This document presents an automatic command recognition system based on audio-visual information, intended to control the da Vinci laparoscopic surgical robot. The audio signal is parametrized using the Mel Frequency Cepstral Coefficients (MFCC) method. In addition, features based on the points that define the outer contour of the mouth, as specified by the MPEG-4 standard, are used to extract the visual speech information.
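To make the audio parametrization step concrete, the following is a minimal NumPy-only sketch of the standard MFCC pipeline (framing, windowing, power spectrum, mel filterbank, log compression, DCT). The frame length, hop size, filter count, and number of coefficients below are illustrative assumptions, not the parameters used in the system described above.

```python
# Minimal MFCC sketch with NumPy only; all parameter values are
# illustrative assumptions, not those of the described system.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def dct_ii(x, n_out):
    # Type-II DCT along the last axis (decorrelates log filter energies).
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # 1. Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank energies with log compression.
    log_e = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 4. DCT to obtain the cepstral coefficients.
    return dct_ii(log_e, n_ceps)

# Toy usage: one second of a synthetic 440 Hz tone.
t = np.arange(16000) / 16000.0
feats = mfcc(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)  # one 13-coefficient vector per frame
```

In a full command recognizer these per-frame MFCC vectors would then be concatenated with the visual features derived from the MPEG-4 mouth-contour points and passed to the classifier.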