Accessing speech data using strategic fixation

  • Authors:
  • Steve Whittaker;Julia Hirschberg

  • Affiliations:
  • Department of Information Studies, University of Sheffield, 211 Portobello Street, Sheffield S1 4DP, UK;Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, M/C 0401, 450 CS Building, New York, NY 10027, USA

  • Venue:
  • Computer Speech and Language
  • Year:
  • 2007

Abstract

When users access information from text, they engage in strategic fixation, visually scanning the text to focus on regions of interest. However, because speech is both serial and ephemeral, it does not readily support strategic fixation. This paper describes two design principles, indexing and transcript-centric access, that address the problem of speech access by supporting strategic fixation. Indexing involves users constructing external visual indices into speech. Users visually scan these indices to find information-rich regions of speech for more detailed processing and playback. Transcript-centric access involves transcribing speech using automatic speech recognition (ASR) and enriching that transcription with visual cues. The resulting enriched transcript is time-aligned to the original speech, allowing users to scan the transcript as a whole, or the additional visual cues present in it, to fixate on and play regions of interest. We tested the effectiveness of these two approaches on a set of reference tasks derived from observations of current voicemail practice. A field trial evaluation of JotMail, an index-based interface similar to commercial unified messaging clients, showed that our approaches were effective in supporting speech scanning, information extraction and status tracking, but not archive management. However, users found it onerous to take manual notes with JotMail to provide effective retrieval indices. We therefore built SCANMail, a transcript-based interface that constructs indices automatically, using ASR to generate a transcript of the speech data. SCANMail also uses information extraction techniques to identify regions of potential interest, e.g. telephone numbers, within the transcript. Laboratory and field trials showed that SCANMail overcame most of the problems users reported with JotMail, supporting scanning, information extraction and archiving.
Importantly, our evaluations showed that, despite errors, ASR transcripts provide a highly effective tool for browsing. Users exploited the enriched transcript to determine the gist of the underlying speech, and as a guide to identifying the regions of speech that were critical for them to play. Long-term field trials also showed the utility of transcripts to support notification and mobile access.
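
The idea of a time-aligned transcript enriched with cues such as telephone numbers can be illustrated with a minimal sketch. This is not SCANMail's actual method, which the abstract does not specify. The `Word` record, the `DIGITS` table, the `find_digit_runs` function and the `min_len` threshold are all hypothetical names introduced here. The sketch assumes that the recognizer emits one token per spoken digit with start and end times in seconds, and it treats any run of at least seven consecutive digit words as a candidate telephone number whose time span could be offered for playback.

```python
from dataclasses import dataclass

# Hypothetical mapping from spoken digit words to digits (an assumption,
# not taken from the paper).
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


@dataclass
class Word:
    """One recognized word with its time alignment, in seconds."""
    text: str
    start: float
    end: float


def find_digit_runs(words, min_len=7):
    """Return (digits, start_time, end_time) for each run of at least
    min_len consecutive digit words in a time-aligned transcript."""
    runs = []
    current = []  # (digit, Word) pairs of the run being collected
    # A trailing None acts as a sentinel so the final run is flushed.
    for word in list(words) + [None]:
        digit = DIGITS.get(word.text.lower()) if word is not None else None
        if digit is not None:
            current.append((digit, word))
            continue
        if len(current) >= min_len:
            number = "".join(d for d, _ in current)
            runs.append((number, current[0][1].start, current[-1][1].end))
        current = []
    return runs


# Usage: build a toy transcript in which each word lasts 0.25 s and
# words start every 0.5 s, then locate the candidate number.
tokens = "call me at five five five one two one two thanks".split()
transcript = [Word(t, i * 0.5, i * 0.5 + 0.25) for i, t in enumerate(tokens)]
for number, start, end in find_digit_runs(transcript):
    print(f"candidate number {number}: play from {start}s to {end}s")
```

Because each cue carries the time span of the words it covers, an interface could highlight the cue in the transcript and start playback at that offset, which is the kind of fixation on a region of interest that the abstract describes.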