Large-scale content-based audio retrieval from text queries

Authors:
Gal Chechik;Eugene Ie;Martin Rehn;Samy Bengio;Dick Lyon
Affiliations:
Google, Mountain View, CA, USA;Google, Mountain View, CA, USA;Google, Mountain View, CA, USA;Google, Mountain View, CA, USA;Google, Mountain View, CA, USA
Venue:
MIR '08 Proceedings of the 1st ACM international conference on Multimedia information retrieval
Year:
2008

Abstract

In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed, user-labeled recordings (25K files, 2000-term vocabulary). We find that all three methods achieve very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.
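The abstract names PAMIR only as a passive-aggressive model and gives no formulas. As a rough illustration of what one passive-aggressive ranking step can look like, the sketch below assumes a bilinear score between a text-query vector and an acoustic feature vector, a hinge loss with margin 1, and an aggressiveness cap C. These choices, and all function and variable names, are assumptions made for illustration and are not taken from the abstract.

```python
# Sketch of a PAMIR-style passive-aggressive ranking update (pure Python).
# Assumptions (not stated in the abstract): bilinear score q^T W a,
# hinge loss with margin 1, and an aggressiveness cap C.

def score(W, q, a):
    """Bilinear relevance score q^T W a for query vector q and audio vector a."""
    return sum(q[t] * sum(W[t][j] * a[j] for j in range(len(a)))
               for t in range(len(q)))

def hinge_loss(W, q, a_pos, a_neg):
    """Zero once the relevant clip outscores the irrelevant one by at least 1."""
    return max(0.0, 1.0 - score(W, q, a_pos) + score(W, q, a_neg))

def pa_update(W, q, a_pos, a_neg, C=1.0):
    """Return a new W after one passive-aggressive step on a (q, a+, a-) triplet."""
    loss = hinge_loss(W, q, a_pos, a_neg)
    delta = [p - n for p, n in zip(a_pos, a_neg)]
    # Update direction V = q * delta^T; its squared Frobenius norm factorizes.
    norm_sq = sum(x * x for x in q) * sum(d * d for d in delta)
    if loss == 0.0 or norm_sq == 0.0:
        return [row[:] for row in W]  # passive: ranking constraint already met
    tau = min(C, loss / norm_sq)      # aggressive: step size capped by C
    return [[W[t][j] + tau * q[t] * delta[j] for j in range(len(delta))]
            for t in range(len(q))]

# Toy example: 3-term vocabulary, 2-dimensional acoustic features.
W0 = [[0.0, 0.0] for _ in range(3)]
query = [1.0, 0.0, 1.0]          # query contains terms 0 and 2
clip_relevant = [1.0, 0.0]
clip_irrelevant = [0.0, 1.0]
W1 = pa_update(W0, query, clip_relevant, clip_irrelevant, C=10.0)
```

In this toy case the loss on the triplet starts at 1 and drops to 0 after a single step, because C is large enough not to cap the step size. Each update touches only the rows of W for terms present in the query, which is one plausible reason an online method of this kind could train quickly on a large vocabulary.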