Exploring automatic music annotation with "acoustically-objective" tags

  • Authors:
  • Derek Tingle; Youngmoo E. Kim; Douglas Turnbull

  • Affiliations:
  • Swarthmore College, Swarthmore, PA, USA; Drexel University, Philadelphia, PA, USA; Swarthmore College, Swarthmore, PA, USA

  • Venue:
  • Proceedings of the International Conference on Multimedia Information Retrieval
  • Year:
  • 2010


Abstract

The task of automatically annotating music with text tags (referred to as autotagging) is vital to creating a large-scale semantic music discovery engine. Yet for an autotagging system to be successful, a large and cleanly-annotated data set must exist to train the system. For this reason, we have collected a data set, called Swat10k, which consists of 10,870 songs annotated using a vocabulary of 475 acoustic tags and 153 genre tags from Pandora's Music Genome Project. The acoustic tags are considered "acoustically-objective" because they can be consistently applied to songs by expert musicologists. To develop an autotagging system, we use the Swat10k data set in conjunction with two new sets of content-based audio features obtained using the publicly-available Echo Nest API. The Echo Nest Timbre (ENT) features represent a song using a collection of short-time feature vectors. Compared with Mel-frequency cepstral coefficients (MFCCs), ENTs provide a more compact representation of music and improve autotagging performance. We also evaluate the Echo Nest Song (ENS) feature vector, which is a collection of mid-level acoustic features (e.g., beats per minute, average loudness). While the ENS features generally perform worse than the ENTs, they increase the performance of several individual tags. Furthermore, we plan to publicly release our song annotations and corresponding Echo Nest features so that other researchers will be able to use Swat10k to develop and compare alternative autotagging algorithms.
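
To illustrate the kind of short-time, bag-of-frames representation the abstract compares against, here is a minimal sketch of extracting per-song MFCC vectors and summarizing them into one fixed-length feature vector. This is an assumption-laden illustration using librosa, not the authors' pipeline; the actual ENT/ENS features were obtained from the Echo Nest API, which is not used here, and the file name is a placeholder.

```python
# Hypothetical sketch: short-time timbre features (MFCCs) per song,
# the baseline representation the abstract compares ENTs against.
# Assumes librosa and numpy are installed; the audio path is a placeholder.
import numpy as np
import librosa

def song_mfcc_vectors(path, n_mfcc=13):
    """Return a (num_frames, n_mfcc) matrix of short-time MFCC vectors for one song."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, num_frames)
    return mfcc.T

def bag_of_frames_summary(frames):
    """Collapse the collection of frame vectors into a single song-level vector
    (per-dimension mean and standard deviation), a common simplification
    before training a tag classifier."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Example usage with a placeholder file:
# x = bag_of_frames_summary(song_mfcc_vectors("some_song.mp3"))
```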