Latent Semantic Analysis for Multimodal User Input With Speech and Gestures
IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP)
We develop a framework for the automatic semantic interpretation of multimodal user interactions involving speech and pen gestures. The two input modalities abstract the user's intended message differently into input events, e.g., key terms/phrases in speech or different types of gestures in the pen modality. The proposed framework begins by generating partial interpretations for each input event as a ranked list of hypothesized semantics. We devise a cross-modality semantic integration procedure that aligns the pair of hypothesis lists between every speech input event and every pen input event in a multimodal expression. This is achieved with a Viterbi alignment algorithm that enforces both the temporal ordering of the input events and the semantic compatibility of aligned events. The alignment enables generation of a unimodal, verbalized paraphrase that is semantically equivalent to the original multimodal expression. Our experiments are based on a multimodal corpus in the domain of city navigation. Applying the cross-modality integration procedure to near-perfect (manual) transcripts of the speech and pen modalities shows that correct unimodal paraphrases are generated for over 97% of the training and test sets. However, when automatic speech and pen recognition transcripts are used instead, the performance drops to 53.7% and 54.8% for the training and test sets, respectively. To address this issue, we devise a hypothesis rescoring procedure that evaluates all candidates of cross-modality integration derived from multiple recognition hypotheses in each modality. The rescoring function incorporates the integration score, the N-best purity of recognized spoken locative expressions, and the distances between the coordinates of recognized pen gestures and their interpreted icons on the map. Cross-modality hypothesis rescoring improves the performance to 67.5% and 69.9% for the training and test sets, respectively.
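The temporally ordered alignment of the two hypothesis lists lends itself to a compact dynamic-programming formulation. The Python sketch below is an illustrative reconstruction, not the authors' implementation: the event structures, the semantic-compatibility rule, the skip penalty, and the example scores are all hypothetical, and the rescoring features described above (N-best purity of the spoken locative expressions, gesture-to-icon distances) are omitted for brevity.

```python
"""Sketch of a monotonic, Viterbi-style alignment between speech and pen
input events, each carrying an N-best list of hypothesized semantics.
All names, weights, and scores here are illustrative assumptions."""

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Hypothesis:
    semantics: str        # e.g., an interpreted map icon or location category
    score: float          # recognizer/interpretation confidence

@dataclass
class InputEvent:
    time: float                                   # onset time; events are assumed sorted by this
    nbest: List[Hypothesis] = field(default_factory=list)

def compatibility(s: Hypothesis, p: Hypothesis) -> float:
    """Hypothetical semantic-compatibility score between a spoken hypothesis
    and a pen-gesture hypothesis."""
    return 1.0 if s.semantics == p.semantics else -1.0

SKIP = -0.5   # assumed penalty for leaving an event unaligned

def align(speech: List[InputEvent], pen: List[InputEvent]) -> Tuple[float, List[Tuple[int, int]]]:
    """Monotonic DP alignment of speech events to pen events.

    Because events in each modality are time-sorted, a monotonic alignment
    preserves temporal ordering. Returns the total integration score and the
    aligned (speech_index, pen_index) pairs."""
    n, m = len(speech), len(pen)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int, bool]]]] = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == NEG:
                continue
            # Align speech event i with pen event j using their best
            # mutually compatible hypothesis pair.
            if i < n and j < m:
                pair = max(
                    compatibility(s, p) + s.score + p.score
                    for s in speech[i].nbest for p in pen[j].nbest
                )
                if score[i][j] + pair > score[i + 1][j + 1]:
                    score[i + 1][j + 1] = score[i][j] + pair
                    back[i + 1][j + 1] = (i, j, True)
            # Or leave one event in either modality unaligned.
            if i < n and score[i][j] + SKIP > score[i + 1][j]:
                score[i + 1][j] = score[i][j] + SKIP
                back[i + 1][j] = (i, j, False)
            if j < m and score[i][j] + SKIP > score[i][j + 1]:
                score[i][j + 1] = score[i][j] + SKIP
                back[i][j + 1] = (i, j, False)
    # Trace back the best-scoring path to recover the aligned pairs.
    pairs, i, j = [], n, m
    while back[i][j] is not None:
        pi, pj, matched = back[i][j]
        if matched:
            pairs.append((pi, pj))
        i, j = pi, pj
    return score[n][m], list(reversed(pairs))

if __name__ == "__main__":
    speech = [InputEvent(0.2, [Hypothesis("hotel", 0.9), Hypothesis("hostel", 0.4)]),
              InputEvent(1.5, [Hypothesis("museum", 0.8)])]
    pen = [InputEvent(0.4, [Hypothesis("hotel", 0.7)]),
           InputEvent(1.7, [Hypothesis("museum", 0.6), Hypothesis("monument", 0.3)])]
    total, pairs = align(speech, pen)
    print(total, pairs)   # -> 5.0 [(0, 0), (1, 1)]
```

In this sketch, hypothesis rescoring would amount to running the alignment over every combination of recognition hypotheses from the two modalities and reranking the resulting integrations with the additional features named in the abstract; the weighting of those features is not reproduced here.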