Sung language recognition relies on both effective feature extraction and acoustic modeling. In this paper, we study rhythm-based music segmentation, in which the frame size equals the duration of the smallest note in the music, as opposed to the fixed-length segmentation used in spoken language recognition. We find that acoustic features extracted with the rhythm-based segmentation scheme outperform those from fixed-length segmentation. We also study the effectiveness of a musically motivated acoustic feature, octave-scale cepstral coefficients (OSCCs), by comparing it with other acoustic features: log-frequency cepstral coefficients, linear prediction coefficients (LPCs), and LPC-derived cepstral coefficients. Finally, we examine the modeling capabilities of Gaussian mixture models and support vector machines in sung language recognition experiments. In experiments on a corpus of 400 popular songs sung in English, Chinese, German, and Indonesian, the OSCC feature outperformed the other features. A sung language recognition accuracy of 64.9% was achieved when Gaussian mixture models were trained on shifted-delta OSCC features extracted via rhythm-based segmentation.
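As a minimal sketch of the shifted-delta step mentioned above: shifted-delta cepstra stack k delta vectors, spaced p frames apart, onto each frame, where each delta is the difference c[t+d] - c[t-d]. The parameter values below (d=1, p=3, k=7, i.e. the common 7-1-3-7 configuration from spoken language identification) and the function name are illustrative assumptions, not the paper's reported configuration; the same computation applies whether the input cepstra are OSCCs or any other per-frame cepstral feature.

```python
import numpy as np

def shifted_delta(cep, d=1, p=3, k=7):
    """Compute shifted-delta features from a (T, N) cepstral matrix.

    Illustrative parameters (not from the paper):
      d: half-width of the delta difference, delta[t] = c[t+d] - c[t-d]
      p: frame shift between stacked delta blocks
      k: number of delta blocks stacked per frame
    Returns a (T, k*N) matrix.
    """
    T = cep.shape[0]
    # Edge-pad so delta is defined at the sequence boundaries.
    padded = np.pad(cep, ((d, d), (0, 0)), mode="edge")
    delta = padded[2 * d:] - padded[:-2 * d]            # shape (T, N)
    # Pad the tail so blocks at offsets p, 2p, ... stay in range.
    padded_delta = np.pad(delta, ((0, (k - 1) * p), (0, 0)), mode="edge")
    # Stack delta[t], delta[t+p], ..., delta[t+(k-1)p] for every frame t.
    return np.hstack([padded_delta[i * p : i * p + T] for i in range(k)])
```

With a 39-dimensional input cepstrum this 7-1-3-7 configuration would yield 273-dimensional frame vectors; in practice only a subset of the base coefficients is usually stacked to keep the dimensionality manageable.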