A common technique in visual object recognition is to sparsely encode low-level input with a feature dictionary and then pool the resulting codes over local spatial neighbourhoods. While some methods stack these stages in alternating layers to form hierarchies, the two stages alone can also produce state-of-the-art results. Following its success in vision, this framework is moving into speech and audio processing tasks. We investigate the effect of architectural choices when the framework is applied to a spoken digit recognition task. We find that unsupervised learning of the features has a negligible effect on classification accuracy; the number and size of the features are a greater determinant of recognition performance. Finally, we show that, given an optimised architecture, sparse coding performs comparably with Hidden Markov Models (HMMs) and outperforms K-means clustering.
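The two-stage pipeline described above — sparse encoding against a feature dictionary, then pooling over local neighbourhoods — can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the random, unit-norm dictionary stands in for learned (or unlearned) features, and a simple soft-thresholding encoder stands in for a full sparse solver.

```python
import numpy as np

def sparse_encode(patches, dictionary, threshold=0.5):
    """Encode patches as sparse codes: project each patch onto the
    dictionary atoms, then soft-threshold the activations so that most
    coefficients are exactly zero (a crude stand-in for a sparse solver)."""
    activations = patches @ dictionary.T  # shape (n_patches, n_atoms)
    return np.sign(activations) * np.maximum(np.abs(activations) - threshold, 0.0)

def max_pool(codes, pool_size):
    """Max-pool the sparse codes over non-overlapping local neighbourhoods
    along the patch (time) axis, one pooled vector per neighbourhood."""
    n = (codes.shape[0] // pool_size) * pool_size
    regions = codes[:n].reshape(-1, pool_size, codes.shape[1])
    return regions.max(axis=1)  # shape (n_regions, n_atoms)

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(64, 40))  # 64 atoms over 40-dim patches (illustrative sizes)
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

patches = rng.normal(size=(100, 40))    # e.g. spectro-temporal patches from one utterance
codes = sparse_encode(patches, dictionary)
features = max_pool(codes, pool_size=10)  # (10, 64) pooled feature map
```

Pooled feature vectors like `features` would then be concatenated and fed to a linear classifier; in the setting the abstract describes, the number and size of the dictionary atoms matter more for recognition than how the dictionary was obtained.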