Latent Semantic Analysis for Multimodal User Input With Speech and Gestures
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)
We develop a framework for the automatic semantic interpretation of multimodal user interactions involving speech and pen gestures. The two input modalities abstract the user's intended message differently into input events, e.g., key terms/phrases in speech or different types of gestures in the pen modality. The proposed framework begins by generating partial interpretations for each input event as a ranked list of hypothesized semantics. We devise a cross-modality semantic integration procedure that aligns the pair of hypothesis lists between every speech input event and every pen input event in a multimodal expression. This is achieved with a Viterbi alignment algorithm that enforces both the temporal ordering of the input events and the semantic compatibility of aligned events. The alignment enables generation of a unimodal, verbalized paraphrase that is semantically equivalent to the original multimodal expression. Our experiments are based on a multimodal corpus in the domain of city navigation. Applying the cross-modality integration procedure to near-perfect (manual) transcripts of the speech and pen modalities shows that correct unimodal paraphrases are generated for over 97% of the training and test sets. However, when automatic speech and pen recognition transcripts are used instead, the performance drops to 53.7% and 54.8% for the training and test sets, respectively. To address this issue, we devise a hypothesis rescoring procedure that evaluates all candidates of cross-modality integration derived from multiple recognition hypotheses in each modality. The rescoring function incorporates the integration score, the N-best purity of recognized spoken locative expressions, and the distances between the coordinates of recognized pen gestures and their interpreted icons on the map. Cross-modality hypothesis rescoring improves the performance to 67.5% and 69.9% for the training and test sets, respectively.
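The temporally ordered alignment of the two hypothesis lists lends itself to a compact dynamic-programming formulation. The Python sketch below is an illustrative reconstruction, not the authors' implementation: the event structures, the semantic-compatibility rule, the skip penalty, and the example scores are all hypothetical, and the rescoring features described above (N-best purity of the spoken locative expressions, gesture-to-icon distances) are omitted for brevity.

```python
"""Sketch of a monotonic, Viterbi-style alignment between speech and pen
input events, each carrying an N-best list of hypothesized semantics.
All names, weights, and scores here are illustrative assumptions."""

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Hypothesis:
    semantics: str        # e.g., an interpreted map icon or location category
    score: float          # recognizer/interpretation confidence

@dataclass
class InputEvent:
    time: float                                   # onset time; events are assumed sorted by this
    nbest: List[Hypothesis] = field(default_factory=list)

def compatibility(s: Hypothesis, p: Hypothesis) -> float:
    """Hypothetical semantic-compatibility score between a spoken hypothesis
    and a pen-gesture hypothesis."""
    return 1.0 if s.semantics == p.semantics else -1.0

SKIP = -0.5   # assumed penalty for leaving an event unaligned

def align(speech: List[InputEvent], pen: List[InputEvent]) -> Tuple[float, List[Tuple[int, int]]]:
    """Monotonic DP alignment of speech events to pen events.

    Because events in each modality are time-sorted, a monotonic alignment
    preserves temporal ordering. Returns the total integration score and the
    aligned (speech_index, pen_index) pairs."""
    n, m = len(speech), len(pen)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int, bool]]]] = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == NEG:
                continue
            # Align speech event i with pen event j using their best
            # mutually compatible hypothesis pair.
            if i < n and j < m:
                pair = max(
                    compatibility(s, p) + s.score + p.score
                    for s in speech[i].nbest for p in pen[j].nbest
                )
                if score[i][j] + pair > score[i + 1][j + 1]:
                    score[i + 1][j + 1] = score[i][j] + pair
                    back[i + 1][j + 1] = (i, j, True)
            # Or leave one event in either modality unaligned.
            if i < n and score[i][j] + SKIP > score[i + 1][j]:
                score[i + 1][j] = score[i][j] + SKIP
                back[i + 1][j] = (i, j, False)
            if j < m and score[i][j] + SKIP > score[i][j + 1]:
                score[i][j + 1] = score[i][j] + SKIP
                back[i][j + 1] = (i, j, False)
    # Trace back the best-scoring path to recover the aligned pairs.
    pairs, i, j = [], n, m
    while back[i][j] is not None:
        pi, pj, matched = back[i][j]
        if matched:
            pairs.append((pi, pj))
        i, j = pi, pj
    return score[n][m], list(reversed(pairs))

if __name__ == "__main__":
    speech = [InputEvent(0.2, [Hypothesis("hotel", 0.9), Hypothesis("hostel", 0.4)]),
              InputEvent(1.5, [Hypothesis("museum", 0.8)])]
    pen = [InputEvent(0.4, [Hypothesis("hotel", 0.7)]),
           InputEvent(1.7, [Hypothesis("museum", 0.6), Hypothesis("monument", 0.3)])]
    total, pairs = align(speech, pen)
    print(total, pairs)   # -> 5.0 [(0, 0), (1, 1)]
```

In this sketch, hypothesis rescoring would amount to running the alignment over every combination of recognition hypotheses from the two modalities and reranking the resulting integrations with the additional features named in the abstract; the weighting of those features is not reproduced here.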