Multimodal interfaces, which enable natural interaction through multiple modalities such as touch, hand gestures, speech, and facial expressions, represent a paradigm shift in human-computer interaction. Their aim is to allow rich and intuitive multimodal interaction similar to human-to-human communication. From the system's perspective, beyond the input modalities themselves, user context information such as states of attention and activity and the identities of interacting users can greatly improve the interaction experience. For example, when sensors such as cameras (webcams, depth sensors, etc.) and microphones are always on and continuously capturing signals from their environment, user context information helps distinguish genuine system-directed activity from ambient speech and gesture activity in the surroundings, and helps identify the "active user" among a set of users. Information about user identity may be used to personalize the system's interface and behavior -- e.g. the look of the GUI, modality recognition profiles, and information layout -- to suit the specific user. In this paper, we present a set of algorithms and an architecture that perform audiovisual analysis of user context using sensors such as cameras and microphone arrays, integrating components for lip activity and audio direction detection (speech activity), face detection and tracking (attention), and face recognition (identity). The proposed architecture allows the component data flows to be managed and fused with low latency, a low memory footprint, and low CPU load, since such a system is typically required to run continuously in the background and report attention, activity, and identity events to consuming applications in real time.
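The abstract does not specify how the component outputs are combined, so the following is only a minimal illustrative sketch of one way such a fusion loop could be structured: each detector (face tracking, lip activity, audio direction) pushes timestamped observations onto a shared queue, and a lightweight fusion step keeps only the most recent observation per source within a short window before deriving attention, activity, and identity events. All names, payload fields, and thresholds (Observation, ContextFusion, azimuth agreement of 20 degrees, lip-activity score of 0.5) are hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import time
from dataclasses import dataclass, field
from queue import Queue, Empty
from typing import Optional

@dataclass
class Observation:
    """Hypothetical timestamped output of one detector component."""
    timestamp: float          # capture time in seconds
    source: str               # "face", "lip", or "audio"
    payload: dict = field(default_factory=dict)

@dataclass
class ContextEvent:
    """Fused user-context event reported to consuming applications."""
    timestamp: float
    attention: bool           # a tracked, frontal face is present
    activity: bool            # lip motion coincides with speech from the face's direction
    identity: Optional[str]   # label from face recognition, if available

class ContextFusion:
    """Single-threaded sketch: keep only the latest observation per source
    within a short window, so memory use stays bounded."""

    def __init__(self, window_s: float = 0.5):
        self.window_s = window_s
        self.queue: "Queue[Observation]" = Queue()
        self.latest: dict = {}

    def push(self, obs: Observation) -> None:
        self.queue.put(obs)

    def step(self, now: float) -> ContextEvent:
        # Drain queued observations, keeping only the newest per source.
        while True:
            try:
                obs = self.queue.get_nowait()
            except Empty:
                break
            self.latest[obs.source] = obs

        # Discard observations older than the fusion window.
        self.latest = {s: o for s, o in self.latest.items()
                       if now - o.timestamp <= self.window_s}

        face = self.latest.get("face")
        lip = self.latest.get("lip")
        audio = self.latest.get("audio")

        attention = face is not None and face.payload.get("frontal", False)
        # Activity: lip motion plus speech arriving from roughly the
        # direction of the tracked face (thresholds are illustrative).
        activity = (
            attention
            and lip is not None and lip.payload.get("score", 0.0) > 0.5
            and audio is not None
            and abs(audio.payload.get("azimuth_deg", 180.0)
                    - face.payload.get("azimuth_deg", 0.0)) < 20.0
        )
        identity = face.payload.get("identity") if face else None
        return ContextEvent(now, attention, activity, identity)

if __name__ == "__main__":
    fusion = ContextFusion()
    t = time.time()
    fusion.push(Observation(t, "face", {"frontal": True, "azimuth_deg": 5.0,
                                        "identity": "user_1"}))
    fusion.push(Observation(t, "lip", {"score": 0.8}))
    fusion.push(Observation(t, "audio", {"azimuth_deg": 2.0}))
    print(fusion.step(t))
```

In this sketch the fusion step is polled rather than event-driven, which keeps CPU load predictable; an actual implementation of the architecture described above could equally well trigger fusion on each detector callback.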