Real-time captioning by groups of non-experts

  • Authors:
  • Walter Lasecki, Christopher Miller, Adam Sadilek, Andrew Abumoussa, and Donato Borrello (University of Rochester, Rochester, New York, USA); Raja Kushalnagar (Rochester Institute of Technology, Rochester, New York, USA); Jeffrey Bigham (University of Rochester, Rochester, New York, USA)

  • Venue:
  • Proceedings of the 25th Annual ACM Symposium on User Interface Software and Technology (UIST '12)
  • Year:
  • 2012

Abstract

Real-time captioning provides deaf and hard of hearing people with immediate access to spoken language and enables participation in dialogue with others. Low latency is critical because it allows speech to be paired with relevant visual cues. Currently, the only reliable source of real-time captions is expensive stenographers, who must be recruited in advance and are trained to use specialized keyboards. Automatic speech recognition (ASR) is less expensive and available on demand, but its low accuracy, high sensitivity to noise, and need for prior training render it unusable in real-world situations. In this paper, we introduce a new approach in which groups of non-expert captionists (people who can hear and type) collectively caption speech on demand in real time. We present Legion:Scribe, an end-to-end system that allows deaf people to request captions at any time. We introduce an algorithm for merging partial captions into a single output stream in real time, and a captioning interface designed to encourage coverage of the entire audio stream. An evaluation with 20 local participants and 18 crowd workers shows that non-experts can provide an effective captioning solution, accurately covering an average of 93.2% of an audio stream with only 10 workers and an average per-word latency of 2.9 seconds. More generally, our model, in which multiple workers contribute partial inputs that are automatically merged in real time, may be extended to allow dynamic groups to surpass their constituent individuals (even experts) on a variety of human performance tasks.
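
For intuition, the merging step can be pictured with a deliberately simplified sketch: bucket each worker's typed words by approximate audio timestamp, then keep the words that enough workers agree on. The Python below is not the paper's algorithm (which performs real-time alignment of overlapping partial captions); the CaptionMerger class, the one-second WINDOW, the quorum threshold, and the submit/flush interface are all illustrative assumptions.

    from collections import defaultdict, Counter

    WINDOW = 1.0  # assumed alignment granularity, in seconds of audio time

    class CaptionMerger:
        """Merge partial captions from several workers into one stream."""

        def __init__(self, quorum=2):
            self.buckets = defaultdict(list)  # window index -> [(worker, word)]
            self.emitted = set()              # windows already flushed
            self.quorum = quorum              # workers needed to confirm a word

        def submit(self, worker_id, timestamp, word):
            # Record one word a worker typed, stamped with audio time (seconds).
            self.buckets[int(timestamp // WINDOW)].append((worker_id, word))

        def flush(self, now):
            # Emit merged words for every window that closed before `now`.
            ready = sorted(b for b in self.buckets
                           if b not in self.emitted and (b + 1) * WINDOW <= now)
            merged = []
            for b in ready:
                counts = Counter(word.lower() for _, word in self.buckets[b])
                # Majority vote within the window: a crude stand-in for the
                # paper's real-time merging of overlapping partial captions.
                merged.extend(w for w, c in counts.most_common()
                              if c >= self.quorum)
                self.emitted.add(b)
            return merged

A quick, equally hypothetical use: a word that two workers both catch is confirmed and emitted, while a word typed by only one worker is held back.

    merger = CaptionMerger()
    merger.submit("w1", 0.2, "real-time")
    merger.submit("w2", 0.4, "real-time")
    merger.submit("w1", 0.7, "captioning")
    merger.submit("w3", 0.8, "captioning")
    print(merger.flush(now=1.5))  # ['real-time', 'captioning']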