Hybrid statistical pronunciation models designed to be trained by a medium-size corpus

  • Authors:
  • Bahram Vazirnezhad; Farshad Almasganj; Seyed Mohammad Ahadi

  • Affiliations:
  • Bahram Vazirnezhad and Farshad Almasganj: Biomedical Engineering Department, Amirkabir University of Technology (Tehran Polytechnic), Hafez Avenue, P.O. Box 15875-4413, Tehran, Iran
  • Seyed Mohammad Ahadi: Electrical Engineering Department, Amirkabir University of Technology (Tehran Polytechnic), Hafez Avenue, P.O. Box 15875-4413, Tehran, Iran

  • Venue:
  • Computer Speech and Language
  • Year:
  • 2009

Abstract

Generating pronunciation variants of words is an important subject in speech research and is used extensively in automatic speech recognition and segmentation systems. Decision trees are well-known tools for modeling pronunciation over words or sub-word units. In the case of word units and a very large vocabulary, training the necessary decision trees requires a huge amount of speech data: the training corpus must contain every word in the vocabulary with a sufficient number of repetitions for each one. Additionally, an extra corpus is needed for every word that is not included in the original training corpus but may later be added to the vocabulary. To overcome these drawbacks, we have designed generalized decision trees, which can be trained on a medium-size corpus over groups of similar words so that pronunciation information is shared, instead of training a separate tree for every single word. Generalized decision trees predict the places in a word where substitution, deletion, and insertion of phonemes may occur. After this step, appropriate statistical contextual rules are applied at the permitted places to determine the specific word variants. The hybrids of generalized decision trees and contextual rules are designed in static and dynamic versions. The hybrid static pronunciation models take into account word phonological structure, unigram probabilities, stress, and phone-context information simultaneously, while the hybrid dynamic models consider an extra feature, speaking rate, when generating pronunciation variants of words. Using the word variants generated by the static and dynamic models in the lexicon of the SHENAVA Persian continuous speech recognizer, relative word error rate reductions as high as 8.1% and 11.6%, respectively, are obtained.
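The abstract describes a two-stage generation scheme: a generalized decision tree first marks the phoneme positions of a word where substitution, deletion, or insertion is permitted, and statistical contextual rules are then applied only at those positions to produce weighted pronunciation variants. The sketch below illustrates that control flow only; it is not the authors' implementation. The position-marking heuristic, the rule table, and all probabilities are invented placeholders standing in for the trained generalized decision trees and the corpus-estimated contextual rules of the paper.

```python
from itertools import product

def permitted_operations(phones, stressed_index):
    """Stage 1 stand-in: mark positions where edits are allowed
    (in the paper, a trained generalized decision tree makes this decision)."""
    ops = []
    for i, p in enumerate(phones):
        allowed = set()
        if i != stressed_index:                # assume stressed phones stay intact
            allowed.add("substitute")
            if p in {"t", "d", "h"}:           # assume weak consonants may delete
                allowed.add("delete")
        ops.append(allowed)
    return ops

# Stage 2 stand-in: contextual rewrite rules with illustrative probabilities.
# Each (phone, operation) entry lists possible outputs; None means deletion.
CONTEXT_RULES = {
    ("t", "delete"): [(None, 0.3)],
    ("t", "substitute"): [("d", 0.2)],
    ("e", "substitute"): [("i", 0.25)],
}

def generate_variants(phones, stressed_index, min_prob=0.05):
    """Enumerate pronunciation variants; a variant's score is a naive
    product of the per-position choice weights."""
    choices = []
    for p, allowed in zip(phones, permitted_operations(phones, stressed_index)):
        alts = [(p, 1.0)]                      # keeping the canonical phone
        for op in allowed:
            alts += CONTEXT_RULES.get((p, op), [])
        choices.append(alts)
    variants = {}
    for combo in product(*choices):
        prob, out = 1.0, []
        for phone, weight in combo:
            prob *= weight
            if phone is not None:
                out.append(phone)
        key = " ".join(out)
        if prob >= min_prob:
            variants[key] = max(prob, variants.get(key, 0.0))
    return sorted(variants.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    # A made-up canonical pronunciation with lexical stress on position 1.
    for variant, prob in generate_variants(["s", "e", "t"], stressed_index=1):
        print(f"{prob:.2f}  {variant}")
```

In the approach the abstract actually describes, the first stage would be trees trained over groups of phonologically similar words rather than a hand-written heuristic, and the second-stage rules would be conditioned on phone context, unigram probability, and stress, with speaking rate as an additional feature in the dynamic models.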