Semantic speech editing

  • Authors:
  • Steve Whittaker; Brian Amento

  • Affiliations:
  • Sheffield University, Sheffield, UK; AT&T Labs-Research, Florham Park, NJ

  • Venue:
  • Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
  • Year:
  • 2004

Abstract

Editing speech data is currently time-consuming and error-prone. Speech editors rely on acoustic waveform representations, which force users to repeatedly sample the underlying speech to identify words and phrases to edit. Instead, we developed a semantic editor that reduces the need for extensive sampling by providing access to meaning. The editor shows a time-aligned, errorful transcript produced by applying automatic speech recognition (ASR) to the original speech. Users visually scan the words in the transcript to identify important phrases. They then edit the transcript directly using standard word processing 'cut and paste' operations, which extract the corresponding time-aligned speech. ASR errors mean that users must supplement what they read in the transcript by accessing the original speech. Even when there are transcript errors, however, the semantic representation still provides users with enough information to target what they edit and play, reducing the need for extensive sampling. A laboratory evaluation showed that semantic editing is more efficient than acoustic editing even when ASR is highly inaccurate.
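The core mechanism described in the abstract is that each transcript word carries start and end timestamps from ASR alignment, so a text edit can be mapped back to spans of the original audio. The following is a minimal sketch of that idea, with hypothetical data structures (`Word`, `extract_spans`) that are not from the paper's implementation:

```python
# Sketch (hypothetical, not the paper's code): map a "cut and paste" edit on a
# time-aligned ASR transcript back to (start, end) spans of the original audio.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds into the recording
    end: float


def extract_spans(transcript: List[Word], keep: List[int]) -> List[Tuple[float, float]]:
    """Return merged (start, end) audio spans for the kept word indices."""
    spans: List[Tuple[float, float]] = []
    for i in sorted(keep):
        w = transcript[i]
        if spans and spans[-1][1] == w.start:
            # Word is contiguous with the previous span: extend it.
            spans[-1] = (spans[-1][0], w.end)
        else:
            spans.append((w.start, w.end))
    return spans


transcript = [
    Word("please", 0.0, 0.4), Word("call", 0.4, 0.7),
    Word("me", 0.7, 0.9), Word("back", 0.9, 1.3),
    Word("tomorrow", 1.3, 2.0),
]
# Keep only the phrase "call me back": one merged audio span results.
print(extract_spans(transcript, [1, 2, 3]))  # [(0.4, 1.3)]
```

The merged spans would then be passed to an audio backend to splice the selected speech, which is the step the semantic editor performs behind the familiar word-processing interface.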