Power-law distribution in encoded MFCC frames of speech, music, and environmental sound signals

Authors:
Martín Haro;Joan Serrà;Álvaro Corral;Perfecto Herrera
Affiliations:
Universitat Pompeu Fabra, Barcelona, Spain;Consejo Superior de Investigaciones Científicas, Bellaterra, Spain;Centre de Recerca Matemàtica, Bellaterra, Spain;Universitat Pompeu Fabra, Barcelona, Spain
Venue:
Proceedings of the 21st international conference companion on World Wide Web
Year:
2012

Abstract

Many sound-related applications use Mel-Frequency Cepstral Coefficients (MFCCs) to describe audio timbral content. Most research on MFCCs has focused on different classification and clustering algorithms, the use of complementary audio descriptors, or the effect of different distance measures. The goal of this paper is to study the statistical properties of the MFCC descriptor itself. For that purpose, we use a simple encoding process that maps each short-time MFCC vector to an entry in a dictionary of binary code-words. We study and characterize the rank-frequency distribution of these MFCC code-words for speech, music, and environmental sound sources. We show that, regardless of the sound source, MFCC code-words follow a shifted power-law distribution. This implies that a few code-words occur very frequently while many occur rarely. We also observe that the inner structure of the most frequent code-words has characteristic patterns. For instance, neighboring MFCC coefficients tend to have similar quantization values in the case of music signals. Finally, we study the rank-frequency distributions of individual music recordings and show that they present the same type of heavy-tailed distribution as found in the large-scale databases. This fact is exploited in two supervised semantic inference tasks: genre and instrument classification. In particular, using just 50 properly selected frames per recording, we obtain classification results similar to those obtained when considering all frames. Beyond this particular example, we believe that the fact that MFCC frames follow a power-law distribution could have important implications for future audio-based applications.
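
The encoding and rank-frequency analysis described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not specify the quantization rule, so the per-coefficient median threshold used here is an assumption, as are the helper names `encode_frames`, `rank_frequency`, and `shifted_power_law`. The Zipf-Mandelbrot form is one common way to write a shifted power law and may differ from the paper's exact parameterization. The input frames are random numbers standing in for real MFCCs, so their rank-frequency curve is not expected to reproduce the reported heavy tail.

```python
from collections import Counter

import numpy as np


def encode_frames(mfcc, thresholds=None):
    """Map each MFCC frame (one row) to a binary code-word string.

    Assumption: each coefficient is binarized against a per-coefficient
    threshold, defaulting to the median of that coefficient over all frames.
    """
    mfcc = np.asarray(mfcc, dtype=float)
    if thresholds is None:
        thresholds = np.median(mfcc, axis=0)
    bits = (mfcc > thresholds).astype(int)
    return ["".join(str(b) for b in row.tolist()) for row in bits]


def rank_frequency(codewords):
    """Return code-word counts sorted from most to least frequent."""
    counts = Counter(codewords)
    return np.array(sorted(counts.values(), reverse=True))


def shifted_power_law(rank, alpha, shift, scale=1.0):
    """Zipf-Mandelbrot form (assumed): scale * (rank + shift) ** (-alpha)."""
    return scale * (np.asarray(rank, dtype=float) + shift) ** (-alpha)


# Synthetic stand-in for real MFCC frames (1000 frames, 13 coefficients).
rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 13))

codes = encode_frames(frames)      # one binary code-word per frame
freqs = rank_frequency(codes)      # freqs[0] is the count at rank 1
ranks = np.arange(1, len(freqs) + 1)
```

With real data, `freqs` plotted against `ranks` on log-log axes would be the rank-frequency curve to compare against `shifted_power_law(ranks, alpha, shift, scale)` for fitted parameter values.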