Cross-modality semantic integration with hypothesis rescoring for robust interpretation of multimodal user interactions

  • Authors:
  • Pui-Yu Hui; Helen M. Meng

  • Affiliations:
  • Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China (both authors)

  • Venue:
  • IEEE Transactions on Audio, Speech, and Language Processing - Special issue on multimodal processing in speech-based interactions
  • Year:
  • 2009

Abstract

We develop a framework for automatic semantic interpretation of multimodal user interactions using speech and pen gestures. The two input modalities abstract the user's intended message into input events in different ways, e.g., key terms/phrases in speech or different types of gestures in the pen modality. The proposed framework begins by generating partial interpretations for each input event as a ranked list of hypothesized semantics. We devise a cross-modality semantic integration procedure to align the pair of hypothesis lists for every speech input event and every pen input event in a multimodal expression. This is achieved with a Viterbi alignment algorithm that enforces both the temporal ordering of the input events and the semantic compatibility of aligned events. The alignment enables generation of a unimodal, verbalized paraphrase that is semantically equivalent to the original multimodal expression. Our experiments are based on a multimodal corpus in the domain of city navigation. Applying the cross-modality integration procedure to near-perfect (manual) transcripts of the speech and pen modalities shows that correct unimodal paraphrases are generated for over 97% of the training and test sets. However, if we use automatic speech and pen recognition transcripts instead, performance drops to 53.7% and 54.8% for the training and test sets, respectively. To address this issue, we devise a hypothesis rescoring procedure that evaluates all candidate cross-modality integrations derived from multiple recognition hypotheses from each modality. The rescoring function incorporates the integration score, the N-best purity of recognized spoken locative expressions, and the distances between the coordinates of recognized pen gestures and their interpreted icons on the map. Applying cross-modality hypothesis rescoring improves performance to 67.5% and 69.9% for the training and test sets, respectively.
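
The abstract describes a Viterbi-style alignment of the hypothesis lists from the two modalities, followed by a rescoring step over candidate integrations. The sketch below is a minimal illustration of that idea, not the authors' implementation: the event fields, the compatibility scores, the skip penalty, and the linear rescoring weights are all assumptions made for illustration. It aligns two time-ordered event lists with a monotone dynamic program (so temporal ordering is preserved), rewards semantically compatible speech/pen pairs, and then rescores a candidate using the integration score, the N-best purity of the spoken locative expressions, and gesture-to-icon distances.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpeechEvent:
    """A recognized spoken locative expression (fields are illustrative)."""
    text: str            # e.g. "this building"
    semantics: str       # hypothesized semantic type, e.g. "LANDMARK"
    nbest_purity: float  # fraction of N-best hypotheses agreeing on this expression


@dataclass
class PenEvent:
    """A recognized pen gesture (fields are illustrative)."""
    gesture: str         # e.g. "point", "circle", "stroke"
    semantics: str       # semantic type of the icon it is interpreted as
    dist_to_icon: float  # distance from gesture coordinates to the interpreted icon


def compatibility(s: SpeechEvent, p: PenEvent) -> float:
    """Assumed semantic compatibility score between a speech and a pen event."""
    return 1.0 if s.semantics == p.semantics else 0.1


def align(speech: List[SpeechEvent], pen: List[PenEvent],
          skip_penalty: float = -0.5) -> Tuple[float, List[Tuple[int, int]]]:
    """Monotone DP (Viterbi-style) alignment of two time-ordered event lists.

    Only forward moves are allowed, so the temporal ordering of events in
    each modality is preserved; move scores reward semantically compatible
    speech/pen pairs and penalize leaving an event unaligned."""
    n, m = len(speech), len(pen)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int, str]]]] = \
        [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == NEG:
                continue
            if i < n and j < m:                       # align speech[i] with pen[j]
                s = score[i][j] + compatibility(speech[i], pen[j])
                if s > score[i + 1][j + 1]:
                    score[i + 1][j + 1], back[i + 1][j + 1] = s, (i, j, "pair")
            if i < n:                                 # leave speech[i] unaligned
                s = score[i][j] + skip_penalty
                if s > score[i + 1][j]:
                    score[i + 1][j], back[i + 1][j] = s, (i, j, "skip")
            if j < m:                                 # leave pen[j] unaligned
                s = score[i][j] + skip_penalty
                if s > score[i][j + 1]:
                    score[i][j + 1], back[i][j + 1] = s, (i, j, "skip")
    pairs: List[Tuple[int, int]] = []                 # trace back aligned index pairs
    i, j = n, m
    while back[i][j] is not None:
        pi, pj, move = back[i][j]
        if move == "pair":
            pairs.append((pi, pj))
        i, j = pi, pj
    return score[n][m], list(reversed(pairs))


def rescore(integration_score: float,
            speech: List[SpeechEvent], pen: List[PenEvent],
            w: Tuple[float, float, float] = (1.0, 0.5, 0.5)) -> float:
    """Combine the integration score with N-best purity and gesture-to-icon
    distances; the linear form and weights are placeholder assumptions."""
    purity = sum(s.nbest_purity for s in speech) / max(len(speech), 1)
    proximity = -sum(p.dist_to_icon for p in pen) / max(len(pen), 1)
    return w[0] * integration_score + w[1] * purity + w[2] * proximity
```

In the setting described above, each candidate integration derived from the N-best recognition hypotheses of both modalities would be aligned and rescored in this manner, and the top-scoring candidate would then drive generation of the unimodal paraphrase.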