Smart homes for the aging population have recently started attracting the attention of the research community. One problem of interest is that of monitoring the activities of daily living (ADLs) of the elderly, aiming at their protection and well-being. In this work, we present our initial efforts to automatically recognize ADLs using multimodal input from audio-visual sensors. For this purpose, and as part of the Integrated Project Netcarity, far-field microphones and cameras have been installed inside an apartment and used to collect a corpus of ADLs enacted by multiple subjects. The resulting data streams are processed to generate perception-based acoustic features, as well as human location coordinates that serve as visual features. The extracted features are then presented to Gaussian mixture models for classification into a set of predefined ADLs. Our experimental results show that both acoustic and visual features are useful for ADL classification, but the performance of the latter deteriorates when subject tracking becomes inaccurate. Furthermore, joint audio-visual classification by simple concatenative feature fusion significantly outperforms both unimodal classifiers.
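The classification scheme described in the abstract, one Gaussian mixture model per ADL class trained on feature vectors that concatenate acoustic features with subject-location coordinates, can be sketched as follows. This is a minimal illustration assuming scikit-learn's GaussianMixture; the feature dimensions, class labels, mixture order, and randomly generated data are hypothetical stand-ins, not the paper's actual features, models, or corpus.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_adl_models(train_data, n_components=8, seed=0):
    """Fit one GMM per ADL class.

    train_data: dict mapping an ADL label to an (n_frames, n_dims)
    array of fused feature vectors, i.e. acoustic features
    concatenated with x/y subject-location coordinates.
    """
    models = {}
    for label, frames in train_data.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              random_state=seed)
        gmm.fit(frames)
        models[label] = gmm
    return models

def classify_segment(models, frames):
    """Assign a segment to the ADL whose GMM yields the highest
    total log-likelihood summed over the segment's frames."""
    scores = {label: gmm.score_samples(frames).sum()
              for label, gmm in models.items()}
    return max(scores, key=scores.get)

# Toy usage with random stand-in features: 20 acoustic dimensions
# plus 2 location coordinates per frame (illustrative numbers only).
rng = np.random.default_rng(0)
train = {adl: rng.normal(size=(500, 22))
         for adl in ("cooking", "eating", "cleaning")}
models = train_adl_models(train)
print(classify_segment(models, rng.normal(size=(100, 22))))
```

Concatenative fusion here amounts to nothing more than stacking the two feature streams into one vector before training, which is why the abstract calls it "simple"; the unimodal baselines would train the same per-class GMMs on the acoustic or location features alone.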