The web as a platform to build machine translation resources
Proceedings of the 2009 international workshop on Intercultural collaboration
Creating speech and language data with Amazon's Mechanical Turk
CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
An introduction to crowdsourcing for language and multimedia technology research
PROMISE'12 Proceedings of the 2012 international conference on Information Retrieval Meets Information Visualization
A smartphone-based ASR data collection tool for under-resourced languages
Speech Communication
Hi-index | 0.00 |
To rapidly port speech applications to new languages one of the most difficult tasks is the initial collection of sufficient speech corpora. State-of-the-art automatic speech recognition systems are typical trained on hundreds of hours of speech data. While pre-existing corpora do exist for major languages, a sufficient amount of quality speech data is not available for most world languages. While previous works have focused on the collection of translations and the transcription of audio via Mechanical-Turk mechanisms, in this paper we introduce two tools which enable the collection of speech data remotely. We then compare the quality of audio collected from paid part-time staff and unsupervised volunteers, and determine that basic user training is critical to obtain usable data.