Tools for collecting speech corpora via Mechanical-Turk

Authors:
Ian Lane;Alex Waibel;Matthias Eck;Kay Rottmann
Affiliations:
Carnegie Mellon University, Pittsburgh, PA and Mobile Technologies LLC, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA and Mobile Technologies LLC, Pittsburgh, PA;Mobile Technologies LLC, Pittsburgh, PA;Mobile Technologies LLC, Pittsburgh, PA
Venue:
CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
Year:
2010

Abstract

One of the most difficult tasks in rapidly porting speech applications to new languages is the initial collection of sufficient speech corpora. State-of-the-art automatic speech recognition systems are typically trained on hundreds of hours of speech data. Although corpora exist for major languages, a sufficient amount of quality speech data is not available for most of the world's languages. Previous work has focused on collecting translations and transcribing audio via Mechanical-Turk mechanisms; in this paper we introduce two tools that enable speech data to be collected remotely. We then compare the quality of audio collected from paid part-time staff and unsupervised volunteers, and determine that basic user training is critical to obtaining usable data.