Semantic speech editing

  • Authors:
  • Steve Whittaker; Brian Amento

  • Affiliations:
  • Sheffield University, Sheffield, UK; AT&T Labs-Research, Florham Park, NJ

  • Venue:
  • Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
  • Year:
  • 2004

Abstract

Editing speech data is currently time-consuming and error-prone. Speech editors rely on acoustic waveform representations, which force users to repeatedly sample the underlying speech to identify words and phrases to edit. Instead, we developed a semantic editor that reduces the need for extensive sampling by providing access to meaning. The editor shows a time-aligned, errorful transcript produced by applying automatic speech recognition (ASR) to the original speech. Users visually scan the words in the transcript to identify important phrases. They then edit the transcript directly using standard word processing 'cut and paste' operations, which extract the corresponding time-aligned speech. ASR errors mean that users must supplement what they read in the transcript by accessing the original speech. Even when there are transcript errors, however, the semantic representation still provides users with enough information to target what they edit and play, reducing the need for extensive sampling. A laboratory evaluation showed that semantic editing is more efficient than acoustic editing even when ASR is highly inaccurate.
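The core mechanism described in the abstract is that each transcript word carries start and end timestamps from ASR alignment, so a text edit can be mapped back to spans of the original audio. The following is a minimal sketch of that idea, with hypothetical data structures (`Word`, `extract_spans`) that are not from the paper's implementation:

```python
# Sketch (hypothetical, not the paper's code): map a "cut and paste" edit on a
# time-aligned ASR transcript back to (start, end) spans of the original audio.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds into the recording
    end: float


def extract_spans(transcript: List[Word], keep: List[int]) -> List[Tuple[float, float]]:
    """Return merged (start, end) audio spans for the kept word indices."""
    spans: List[Tuple[float, float]] = []
    for i in sorted(keep):
        w = transcript[i]
        if spans and spans[-1][1] == w.start:
            # Word is contiguous with the previous span: extend it.
            spans[-1] = (spans[-1][0], w.end)
        else:
            spans.append((w.start, w.end))
    return spans


transcript = [
    Word("please", 0.0, 0.4), Word("call", 0.4, 0.7),
    Word("me", 0.7, 0.9), Word("back", 0.9, 1.3),
    Word("tomorrow", 1.3, 2.0),
]
# Keep only the phrase "call me back": one merged audio span results.
print(extract_spans(transcript, [1, 2, 3]))  # [(0.4, 1.3)]
```

The merged spans would then be passed to an audio backend to splice the selected speech, which is the step the semantic editor performs behind the familiar word-processing interface.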