Efficient and robust music identification with weighted finite-state transducers

Authors:
Mehryar Mohri; Pedro J. Moreno; Eugene Weinstein
Affiliations:
Courant Institute of Mathematical Sciences, New York University, New York, NY and Google, Inc., New York, NY; Google, Inc., New York, NY; Courant Institute of Mathematical Sciences, New York University, New York, NY and Google, Inc., New York, NY
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2010

Abstract

We present an approach to music identification based on weighted finite-state transducers and Gaussian mixture models, inspired by techniques used in large-vocabulary speech recognition. Our modeling approach is based on learning a set of elementary music sounds in a fully unsupervised manner. While the space of possible music sound sequences is very large, our method enables the construction of a compact and efficient representation for the song collection using finite-state transducers. This paper gives a novel and substantially faster algorithm for the construction of factor transducers, the key representation of song snippets supporting our music identification technique. The complexity of our algorithm is linear with respect to the size of the suffix automaton constructed. Our experiments further show that it speeds up the construction of the weighted suffix automaton in our task by a factor of 17 relative to our previous method, which used the intermediate steps of determinization and minimization. We show that, using these techniques, a large-scale music identification system can be constructed for a database of over 15,000 songs while achieving an identification accuracy of 99.4% on undistorted test data and performing robustly in the presence of noise and distortions.
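
To make the notion of a factor (snippet) index concrete, the following is a minimal sketch of the standard online construction of an unweighted suffix automaton for a single symbol sequence. It is not the paper's algorithm, which builds weighted factor transducers over a whole song collection. The class name `SuffixAutomaton`, the method `accepts_factor`, and the idea of representing a song as a list of discrete sound labels are assumptions made only for illustration.

```python
class SuffixAutomaton:
    """Unweighted suffix automaton of one sequence (textbook online construction).

    Illustrative only: the paper works with weighted automata over many songs.
    """

    def __init__(self, sequence):
        # Parallel lists indexed by state id; state 0 is the initial state.
        self.next = [{}]      # outgoing transitions: symbol -> state id
        self.link = [-1]      # suffix link of each state
        self.length = [0]     # length of the longest string reaching the state
        self.last = 0         # state reached by the whole sequence read so far
        for symbol in sequence:
            self._extend(symbol)

    def _new_state(self, length, link, transitions):
        self.next.append(transitions)
        self.link.append(link)
        self.length.append(length)
        return len(self.next) - 1

    def _extend(self, symbol):
        cur = self._new_state(self.length[self.last] + 1, -1, {})
        p = self.last
        # Add the new transition along the suffix-link chain where it is missing.
        while p != -1 and symbol not in self.next[p]:
            self.next[p][symbol] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][symbol]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps q's transitions but a shorter length.
                clone = self._new_state(
                    self.length[p] + 1, self.link[q], dict(self.next[q])
                )
                while p != -1 and self.next[p].get(symbol) == q:
                    self.next[p][symbol] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def accepts_factor(self, snippet):
        """Return True if `snippet` occurs contiguously in the indexed sequence."""
        state = 0
        for symbol in snippet:
            if symbol not in self.next[state]:
                return False
            state = self.next[state][symbol]
        return True
```

In this sketch every state is treated as accepting, so the structure recognizes all factors of the sequence rather than only its suffixes. A song would be passed in as a sequence of hashable labels, for example `SuffixAutomaton(["mp_3", "mp_7", "mp_3"])` with hypothetical label names, and a query snippet is matched in time proportional to its length.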