Utilizing gestures to improve sentence boundary detection

  • Authors:
  • Lei Chen; Mary P. Harper

  • Affiliations:
  • Lei Chen: School of Electrical and Computer Engineering, Purdue University, West Lafayette, USA 47905; Educational Testing Service (ETS), Princeton, USA 08541
  • Mary P. Harper: Department of Computer Science, University of Maryland, College Park, USA 20742; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, USA 21211

  • Venue:
  • Multimedia Tools and Applications
  • Year:
  • 2011

Abstract

An accurate estimation of sentence units (SUs) in spontaneous speech is important for (1) helping listeners to better understand speech content and (2) supporting other natural language processing tasks that require sentence information. There has been much research on automatic SU detection; however, most previous studies have used only lexical and prosodic cues, and have not used nonverbal cues, e.g., gesture. Gestures play an important role in human conversations, including providing semantic content, expressing emotional status, and regulating conversational structure. Given the close relationship between gestures and speech, gestures may provide additional contributions to automatic SU detection. In this paper, we investigate the use of gesture cues for enhancing SU detection. In particular, we focus on: (1) collecting multimodal data resources involving gestures and SU events in human conversations, (2) analyzing the collected data sets to enrich our knowledge about the co-occurrence of gestures and SUs, and (3) building statistical models for detecting SUs using speech and gestural cues. Our data analyses suggest that some gesture patterns influence the probability of a word boundary being an SU boundary. On the basis of these analyses, a set of novel gestural features was proposed for SU detection. A combination of speech and gestural features was found to provide more accurate SU predictions than speech features alone in discriminative models. The findings in this paper support the view that human conversations are processes involving multimodal cues, and so they are more effectively modeled using information from both verbal and nonverbal channels.
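To make the modeling setup concrete, the sketch below frames SU detection as word-boundary classification with combined speech and gestural features. This is not the paper's implementation: the feature names and values are hypothetical illustrations, and scikit-learn logistic regression merely stands in for the discriminative models evaluated in the paper.

    # Minimal sketch: classify each word boundary as SU vs. non-SU by combining
    # speech (lexical/prosodic) and gestural features in a discriminative model.
    # Feature names/values are hypothetical; logistic regression is a stand-in.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # One feature dict per word boundary (toy examples).
    boundaries = [
        {"word": "okay",  "pause_sec": 0.62, "pitch_reset": 1, "gesture_hold": 1},
        {"word": "and",   "pause_sec": 0.05, "pitch_reset": 0, "gesture_hold": 0},
        {"word": "right", "pause_sec": 0.48, "pitch_reset": 1, "gesture_hold": 1},
        {"word": "the",   "pause_sec": 0.02, "pitch_reset": 0, "gesture_hold": 0},
    ]
    labels = [1, 0, 1, 0]  # 1 = SU boundary, 0 = non-boundary

    model = make_pipeline(DictVectorizer(sparse=False), LogisticRegression())
    model.fit(boundaries, labels)

    # Probability that a new word boundary (with a pause, pitch reset, and a
    # gesture hold) is an SU boundary.
    test = {"word": "yeah", "pause_sec": 0.55, "pitch_reset": 1, "gesture_hold": 1}
    print(model.predict_proba([test])[0][1])

The point of the sketch is only the feature-combination step: gestural cues (here the hypothetical "gesture_hold") enter the classifier alongside lexical and prosodic cues, so the model can weight verbal and nonverbal evidence jointly when scoring each boundary.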