Exploring automatic music annotation with "acoustically-objective" tags

Authors:
Derek Tingle;Youngmoo E. Kim;Douglas Turnbull
Affiliations:
Swarthmore College, Swarthmore, PA, USA;Drexel University, Philadelphia, PA, USA;Swarthmore College, Swarthmore, PA, USA
Venue:
Proceedings of the international conference on Multimedia information retrieval
Year:
2010

Abstract

The task of automatically annotating music with text tags (referred to as autotagging) is vital to creating a large-scale semantic music discovery engine. Yet for an autotagging system to be successful, a large and cleanly annotated data set must exist to train the system. For this reason, we have collected a data set, called Swat10k, which consists of 10,870 songs annotated using a vocabulary of 475 acoustic tags and 153 genre tags from Pandora's Music Genome Project. The acoustic tags are considered "acoustically-objective" because they can be consistently applied to songs by expert musicologists. To develop an autotagging system, we use the Swat10k data set in conjunction with two new sets of content-based audio features obtained using the publicly available Echo Nest API. The Echo Nest Timbre (ENT) features represent a song using a collection of short-time feature vectors. Compared with Mel-frequency cepstral coefficients (MFCCs), ENTs provide a more compact representation of music and improve autotagging performance. We also evaluate the Echo Nest Song (ENS) feature vector, which is a collection of mid-level acoustic features (e.g., beats per minute, average loudness). While the ENS features generally perform worse than the ENTs, they increase the performance of several individual tags. Furthermore, we plan to publicly release our song annotations and corresponding Echo Nest features so that other researchers will be able to use Swat10k to develop and compare alternative autotagging algorithms.