The paper investigates the problem of addressee recognition, that is, determining to whom a speaker's utterance is directed, in a setting where a humanoid robot interacts with multiple people. More specifically, since it is well known that the addressee can primarily be derived from the speaker's visual focus of attention (VFOA), defined as whom or what a person is looking at, we address the following questions. How much does performance degrade when the VFOA is automatically estimated from head pose rather than taken from ground-truth annotations? Can the conversational context improve addressee recognition, either directly as a side cue in the addressee classifier, indirectly by improving VFOA recognition, or in both ways? Finally, from a computational perspective, which VFOA features and normalizations work best, and does it matter whether the VFOA recognition module only monitors whether a person looks at the potential addressees (the robot, the other people), or whether it also considers objects of interest in the environment (paintings, in our case) as additional VFOA targets? Experiments on the public Vernissage database, in which the humanoid robot Nao conducts a quiz with two participants, show that reducing VFOA confusion (either through context or by ignoring the additional object VFOA targets) improves addressee recognition.
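To make the idea of combining VFOA evidence with a conversational-context cue more concrete, the sketch below shows one minimal way such an addressee decision could be structured. It is an illustrative assumption, not the paper's actual model: the target names, the context cue (whether the robot has just asked the current speaker a question), and the additive context weight are all hypothetical choices made for this example.

```python
# Minimal illustrative sketch: per-utterance VFOA features plus a crude
# dialog-context prior for addressee recognition. Names and weights are
# assumptions for illustration, not the method evaluated in the paper.
from collections import Counter

TARGETS = ["robot", "left_person", "right_person", "painting"]   # candidate VFOA targets (illustrative)
ADDRESSEES = ["robot", "left_person", "right_person"]            # objects cannot be addressees

def vfoa_features(vfoa_frames, targets=TARGETS):
    """Fraction of the utterance during which the speaker looks at each target.

    vfoa_frames: per-frame VFOA labels of the speaker over one utterance,
    coming either from ground-truth annotation or from a head-pose-based tracker.
    """
    counts = Counter(vfoa_frames)
    total = max(len(vfoa_frames), 1)
    return {t: counts.get(t, 0) / total for t in targets}

def classify_addressee(vfoa_frames, robot_just_asked_speaker=False, context_weight=0.2):
    """Score each candidate addressee and return the highest-scoring one.

    Gaze evidence (time spent looking at a candidate) is combined with a crude
    context prior: if the robot has just asked the current speaker a question,
    the reply is more likely addressed to the robot. The 0.2 weight is an
    arbitrary illustrative value, not a tuned parameter.
    """
    feats = vfoa_features(vfoa_frames)
    scores = {a: feats.get(a, 0.0) for a in ADDRESSEES}
    if robot_just_asked_speaker:
        scores["robot"] += context_weight
    return max(scores, key=scores.get)

if __name__ == "__main__":
    # Speaker mostly looks at the robot, glances briefly at a painting.
    frames = ["robot"] * 30 + ["painting"] * 5 + ["robot"] * 15
    print(classify_addressee(frames, robot_just_asked_speaker=True))  # -> robot
```

In practice a learned classifier over such features would replace the hand-set weight, but the overall structure, gaze proportions over candidate targets combined with a conversational-context cue, mirrors the questions examined in the paper.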