Mood is an important access point in music digital libraries and online music repositories, but generating ground truth for evaluating music mood classification algorithms is a challenging problem: because music mood is subjective, collecting enough human judgments is time-consuming and costly. In this study, we explore the viability of crowdsourcing music mood classification judgments using Amazon Mechanical Turk (MTurk). Specifically, we compare the mood classification judgments collected for the annual Music Information Retrieval Evaluation eXchange (MIREX) with judgments collected using MTurk. Our data show that the overall distribution of mood clusters and the agreement rates from MIREX and MTurk were comparable. However, Turkers tended to agree less with the pre-labeled mood clusters than MIREX evaluators did. The system evaluation results generated from the two datasets were mostly the same, except for one statistically significant system pair detected by Friedman's test. We conclude that MTurk can potentially serve as a viable alternative for ground truth collection, with some reservations regarding particular mood clusters.
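As a rough illustration of the kind of system comparison mentioned above, Friedman's test ranks competing systems within each test item (here, each song) and asks whether their mean ranks differ significantly. The sketch below is a minimal stdlib-only implementation of the uncorrected Friedman chi-square statistic; the input table and the helper names are hypothetical, not MIREX data or code:

```python
from itertools import groupby

def average_ranks(row):
    """Rank one song's scores across systems, averaging tied ranks."""
    indexed = sorted(enumerate(row), key=lambda t: t[1])
    ranks = [0.0] * len(row)
    pos = 0
    for _, group in groupby(indexed, key=lambda t: t[1]):
        group = list(group)
        # Tied values all receive the average of the ranks they occupy.
        avg = pos + (len(group) + 1) / 2
        for idx, _ in group:
            ranks[idx] = avg
        pos += len(group)
    return ranks

def friedman_statistic(scores):
    """Uncorrected Friedman chi-square over an n_songs x k_systems table.

    chi2 = 12n / (k(k+1)) * sum_j (mean_rank_j - (k+1)/2)^2
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    mean_ranks = [s / n for s in rank_sums]
    return 12 * n / (k * (k + 1)) * sum(
        (m - (k + 1) / 2) ** 2 for m in mean_ranks
    )

# Hypothetical per-song scores for three systems over five songs.
scores = [
    [0.9, 0.6, 0.3],
    [0.8, 0.7, 0.4],
    [0.7, 0.5, 0.6],
    [0.9, 0.4, 0.2],
    [0.6, 0.8, 0.5],
]
print(f"Friedman chi-square = {friedman_statistic(scores):.3f}")
```

In practice the statistic would be compared against a chi-square distribution with k−1 degrees of freedom (or computed via `scipy.stats.friedmanchisquare`, which also applies a tie correction) to obtain the p-value used in the significance judgment.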