Mood is an important access point in music digital libraries and online music repositories, but generating ground truth for evaluating music mood classification algorithms is a challenging problem: because music mood is subjective, collecting enough human judgments is time-consuming and costly. In this study, we explore the viability of crowdsourcing music mood classification judgments using Amazon Mechanical Turk (MTurk). Specifically, we compare the mood classification judgments collected for the annual Music Information Retrieval Evaluation eXchange (MIREX) with judgments collected using MTurk. Our data show that the overall distribution of mood clusters and the agreement rates from MIREX and MTurk were comparable. However, Turkers tended to agree less with the pre-labeled mood clusters than MIREX evaluators did. The system evaluation results generated from the two datasets were mostly the same, except for one statistically significant system pair detected by Friedman's test. We conclude that MTurk can potentially serve as a viable alternative for ground truth collection, with some reservations regarding particular mood clusters.
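As a rough illustration of the kind of system comparison mentioned above, Friedman's test ranks competing systems within each test item (here, each song) and asks whether their mean ranks differ significantly. The sketch below is a minimal stdlib-only implementation of the uncorrected Friedman chi-square statistic; the input table and the helper names are hypothetical, not MIREX data or code:

```python
from itertools import groupby

def average_ranks(row):
    """Rank one song's scores across systems, averaging tied ranks."""
    indexed = sorted(enumerate(row), key=lambda t: t[1])
    ranks = [0.0] * len(row)
    pos = 0
    for _, group in groupby(indexed, key=lambda t: t[1]):
        group = list(group)
        # Tied values all receive the average of the ranks they occupy.
        avg = pos + (len(group) + 1) / 2
        for idx, _ in group:
            ranks[idx] = avg
        pos += len(group)
    return ranks

def friedman_statistic(scores):
    """Uncorrected Friedman chi-square over an n_songs x k_systems table.

    chi2 = 12n / (k(k+1)) * sum_j (mean_rank_j - (k+1)/2)^2
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    mean_ranks = [s / n for s in rank_sums]
    return 12 * n / (k * (k + 1)) * sum(
        (m - (k + 1) / 2) ** 2 for m in mean_ranks
    )

# Hypothetical per-song scores for three systems over five songs.
scores = [
    [0.9, 0.6, 0.3],
    [0.8, 0.7, 0.4],
    [0.7, 0.5, 0.6],
    [0.9, 0.4, 0.2],
    [0.6, 0.8, 0.5],
]
print(f"Friedman chi-square = {friedman_statistic(scores):.3f}")
```

In practice the statistic would be compared against a chi-square distribution with k−1 degrees of freedom (or computed via `scipy.stats.friedmanchisquare`, which also applies a tie correction) to obtain the p-value used in the significance judgment.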