Tools for collecting speech corpora via Mechanical-Turk

  • Authors:
  • Ian Lane; Alex Waibel; Matthias Eck; Kay Rottmann

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA and Mobile Technologies LLC, Pittsburgh, PA; Carnegie Mellon University, Pittsburgh, PA and Mobile Technologies LLC, Pittsburgh, PA; Mobile Technologies LLC, Pittsburgh, PA; Mobile Technologies LLC, Pittsburgh, PA

  • Venue:
  • CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
  • Year:
  • 2010

Abstract

When rapidly porting speech applications to new languages, one of the most difficult tasks is the initial collection of sufficient speech corpora. State-of-the-art automatic speech recognition systems are typically trained on hundreds of hours of speech data. While corpora exist for major languages, a sufficient amount of quality speech data is not available for most of the world's languages. Previous work has focused on collecting translations and transcribing audio via Mechanical-Turk mechanisms; in this paper we introduce two tools that enable the remote collection of speech data. We then compare the quality of audio collected from paid part-time staff and unsupervised volunteers, and determine that basic user training is critical to obtaining usable data.