Automatic speech recognition for webcasts: how good is good enough and what to do when it isn't

Authors:
Cosmin Munteanu;Gerald Penn;Ron Baecker;Yuecheng Zhang
Affiliations:
University of Toronto, Canada;University of Toronto, Canada;University of Toronto, Canada;University of Toronto, Canada
Venue:
Proceedings of the 8th international conference on Multimodal interfaces
Year:
2006

Abstract

The increased availability of broadband connections has recently led to wider use of Internet broadcasting (webcasting). Most webcasts are archived and then accessed numerous times retrospectively. One obstacle to skimming and browsing such archives is the lack of text transcripts of the webcast's audio channel. This paper describes a procedure for prototyping an Automatic Speech Recognition (ASR) system that generates realistic transcripts at any desired Word Error Rate (WER), thus overcoming the drawbacks of both prototype-based and Wizard of Oz simulations. We used such a system in a user study showing that transcripts with WERs below 25% are acceptable for use in webcast archives. Because current ASR systems can only achieve WERs of around 45% under realistic conditions, we also describe a solution for reducing the WER of such transcripts by engaging users to collaboratively edit the imperfect ASR transcripts in a "wiki" fashion.
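The abstract measures transcript quality by Word Error Rate throughout. WER is conventionally defined as (S + D + I) / N, where S, D and I are the substitutions, deletions and insertions in a minimum-edit word alignment between the ASR output and a reference transcript of N words. The Python sketch below computes it with a standard word-level Levenshtein alignment. It is an illustration under that conventional definition, not the authors' tooling: the function names and example sentences are hypothetical, and the 25% constant simply mirrors the acceptability threshold reported in the abstract.

```python
# Minimal WER sketch (hypothetical helper names; not the paper's implementation).

# Acceptability threshold taken from the abstract's user-study result (WER < 25%).
ACCEPTABLE_WER = 0.25


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (S + D + I) / N via a word-level Levenshtein alignment.

    N is the number of reference words. The result can exceed 1.0
    when the hypothesis contains many insertions.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of ref[i-1]
                curr[j - 1] + 1,     # insertion of hyp[j-1]
                prev[j - 1] + cost,  # substitution, or match when cost == 0
            )
        prev = curr
    return prev[-1] / len(ref)


def meets_threshold(wer: float, threshold: float = ACCEPTABLE_WER) -> bool:
    """True if a transcript's WER falls below the acceptability threshold."""
    return wer < threshold


if __name__ == "__main__":
    reference = "the lecture covers speech recognition for webcasts"
    asr_output = "the lecture covers beach recognition for web casts"
    wer = word_error_rate(reference, asr_output)
    print(f"WER = {wer:.2f}, acceptable: {meets_threshold(wer)}")
```

In this hypothetical example, one substitution ("beach"), plus "webcasts" recognized as "web casts" (one substitution and one insertion), gives 3 errors over 7 reference words, a WER of about 0.43. A single wiki-style correction of "beach" to "speech" lowers it to 2/7, about 0.29, which is still above the 25% threshold, illustrating why several rounds of user edits may be needed.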