Hierarchical and parallel processing of auditory and modulation frequencies for automatic speech recognition

  • Authors: Fabio Valente
  • Affiliation: IDIAP Research Institute, CH-1920 Martigny, Switzerland
  • Venue: Speech Communication
  • Year: 2010

Abstract

This paper investigates, from an automatic speech recognition perspective, the most effective way of combining Multi-Layer Perceptron (MLP) classifiers trained on different ranges of auditory and modulation frequencies. Two MLP-based combination schemes are considered. The first operates in a parallel fashion and is invariant to the order in which feature streams are introduced; the second operates in a hierarchical fashion and is sensitive to that order. The study is carried out on a Large Vocabulary Continuous Speech Recognition (LVCSR) system for the transcription of meeting data using the TANDEM approach. Results reveal that (1) the combination of MLPs trained on different ranges of auditory frequencies is more effective when performed in a parallel fashion; (2) the combination of MLPs trained on different ranges of modulation frequencies is more effective when performed in a hierarchical fashion, moving from high to low modulation frequencies; and (3) the improvement obtained from the separate processing of two modulation frequency ranges (a 12% relative WER reduction w.r.t. the single-classifier approach) is considerably larger than that obtained from the separate processing of two auditory frequency ranges (a 4% relative WER reduction w.r.t. the single-classifier approach). Similar results are also verified on other LVCSR systems and on other languages. Furthermore, the paper extends the discussion to the combination of classifiers trained on separate auditory-modulation frequency channels, showing that the previous conclusions also hold in this scenario.
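
To make the two schemes concrete, the sketch below is a minimal, hypothetical illustration (not the paper's implementation): the feature dimensions, phoneme-class count, hidden-layer size, and function names are all assumptions. In the parallel scheme, both stream-specific MLPs feed a merger MLP simultaneously, so swapping the streams only permutes the merger's inputs; in the hierarchical scheme, the first MLP's posteriors are appended to the second stream's raw features, so the processing order matters.

```python
# Hypothetical sketch of parallel vs. hierarchical MLP combination.
# All sizes are illustrative; in a TANDEM system the resulting phoneme
# posteriors would serve as features for an HMM/GMM recognizer.
import torch
import torch.nn as nn

N_PHONES = 45             # assumed number of phoneme classes
D_A, D_B = 100, 100       # assumed dims of the two feature streams

def mlp(d_in, d_out, d_hidden=500):
    """Single-hidden-layer MLP emitting log-posteriors."""
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid(),
                         nn.Linear(d_hidden, d_out), nn.LogSoftmax(dim=-1))

# One classifier per frequency range (stream A and stream B).
mlp_a = mlp(D_A, N_PHONES)
mlp_b = mlp(D_B, N_PHONES)

# Parallel combination: a merger MLP fuses the two posterior estimates;
# the scheme is invariant to the order of the streams.
merger = mlp(2 * N_PHONES, N_PHONES)

def combine_parallel(x_a, x_b):
    return merger(torch.cat([mlp_a(x_a), mlp_b(x_b)], dim=-1))

# Hierarchical combination: the first classifier's posteriors are
# concatenated with the second stream's raw features, so results
# depend on which stream is processed first.
mlp_hier = mlp(N_PHONES + D_B, N_PHONES)

def combine_hierarchical(x_first, x_second):
    return mlp_hier(torch.cat([mlp_a(x_first), x_second], dim=-1))

# Example: one frame from each stream.
x_a, x_b = torch.randn(1, D_A), torch.randn(1, D_B)
print(combine_parallel(x_a, x_b).shape)      # torch.Size([1, 45])
print(combine_hierarchical(x_a, x_b).shape)  # torch.Size([1, 45])
```

Under this reading, the paper's finding is that auditory-frequency streams favor the first (parallel) structure, while modulation-frequency streams favor the second, with the high-modulation stream processed first.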