Using Broad Phonetic Group Experts for Improved Speech Recognition

Authors:
Patricia Scanlon;Daniel P. W. Ellis;Richard B. Reilly
Affiliations:
Univ. Coll. Dublin;-;-
Venue:
IEEE Transactions on Audio, Speech, and Language Processing
Year:
2007

Abstract

In phoneme recognition experiments, approximately 75% of misclassified frames were found to be assigned labels within the same broad phonetic group (BPG). While the phoneme can be described as the smallest distinguishable unit of speech, phonemes within a BPG share very similar characteristics and are easily confused. Different BPGs, such as vowels and stops, in contrast possess very different spectral and temporal characteristics. To accommodate the full range of phonemes, the acoustic models of speech recognition systems calculate input features from all frequencies over a large temporal context window. A new phoneme classifier is proposed, consisting of a modular arrangement of experts, with one expert assigned to each BPG and focused on discriminating between the phonemes within that BPG. Because each BPG has a different temporal and spectral structure, mutual information is used to select a relevant time-frequency (TF) feature set for each expert. To construct a phone recognition system, the output of each expert is combined with a baseline classifier under the guidance of a separate BPG detector. In phoneme recognition experiments on the TIMIT continuous speech corpus, the proposed architecture afforded significant relative error rate reductions of up to 5%.
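
The mutual-information feature selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how mutual information is estimated, so the plug-in (count-based) estimator, the assumption that each TF cell has already been quantized to a small set of discrete levels, the toy data, and the helper names `mutual_information` and `select_tf_features` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def select_tf_features(frames, labels, k):
    """Score each TF cell by its MI with the phone label; return the top-k cell indices."""
    n_cells = len(frames[0])
    scores = [
        mutual_information([frame[j] for frame in frames], labels)
        for j in range(n_cells)
    ]
    ranked = sorted(range(n_cells), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (assumed): four frames from one hypothetical vowel expert, each with
# three already-quantized TF cells. Only cell 0 varies with the phone label.
frames = [
    (0, 0, 1),
    (0, 1, 0),
    (1, 0, 1),
    (1, 1, 0),
]
labels = ["iy", "iy", "ae", "ae"]

selected, scores = select_tf_features(frames, labels, k=1)
print(selected, scores)
```

In this sketch the selection would be run once per expert, using only frames whose labels belong to that expert's BPG, so that each expert keeps the TF cells most informative for the distinctions inside its own group. Ranking cells independently ignores redundancy between neighbouring cells, which a fuller treatment would likely need to address.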