A segmental non-parametric-based phoneme recognition approach at the acoustical level

  • Authors: Ladan Golipour; Douglas O'Shaughnessy

  • Affiliations: INRS-EMT, Montreal, QC, Canada (both authors)

  • Venue: Computer Speech and Language
  • Year: 2012

Abstract

Although Hidden Markov Models (HMMs) remain the mainstream approach to speech recognition, their intrinsic limitations, such as the use of first-order Markov models and the assumption of independent and identically distributed frames, lead to extensive reliance on higher-level linguistic information to produce satisfactory results. Researchers have therefore begun investigating the incorporation of various discriminative techniques at the acoustical level to induce more discrimination between speech units. k-nearest neighbour (k-NN) density estimation is discriminant by nature and is widely used in pattern recognition; however, its application to speech recognition has been limited to a few experiments. In this paper, we introduce a new segmental k-NN-based phoneme recognition technique. In this approach, a group-delay-based method generates phoneme boundary hypotheses, and an approximate version of k-NN density estimation is used to classify and score variable-length segments. During decoding, the construction of the phonetic graph starts from the best phoneme boundary setting and progresses by splitting and merging segments, using the remaining boundary hypotheses and constraints such as phoneme duration and broad-class similarity information. To perform the k-NN search, we use a similarity search algorithm called the Spatial Approximate Sample Hierarchy (SASH). One major advantage of the SASH algorithm is that its computational complexity is independent of the dimensionality of the data, which allows us to use high-dimensional feature vectors to represent phonemes. Because phonemes are used as the units of speech, the search space is very limited and the decoding process is fast. Evaluated using only the best hypothesis for every segment, and excluding phoneme transition probabilities, context-based information, and language model information, the proposed algorithm achieves an accuracy of 58.5% with a correctness of 67.8% on the TIMIT test dataset.
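
As a rough illustration of why k-NN density estimation can serve as a classifier, the following minimal sketch scores a segment by the fraction of its k nearest training vectors that carry each phoneme label. It is not the authors' implementation. It assumes each variable-length segment has already been mapped to a fixed-length feature vector (the abstract does not say how), it uses an exact brute-force search where the paper uses the approximate SASH search, and the function name `knn_posteriors` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_posteriors(query, train_vectors, train_labels, k):
    """Estimate P(label | query) as (neighbours with that label) / k.

    Hypothetical helper for illustration only. The neighbour search here is
    exact and brute force; the paper uses the approximate SASH search.
    """
    if len(train_vectors) != len(train_labels):
        raise ValueError("train_vectors and train_labels must have equal length")
    if not 1 <= k <= len(train_vectors):
        raise ValueError("k must be between 1 and the number of training vectors")

    # Rank all training vectors by Euclidean distance to the query.
    order = sorted(
        range(len(train_vectors)),
        key=lambda i: math.dist(query, train_vectors[i]),
    )

    # Count the labels among the k nearest neighbours and normalise by k.
    votes = Counter(train_labels[i] for i in order[:k])
    return {label: count / k for label, count in votes.items()}


# Toy 2-D "segment vectors" with made-up phoneme labels (assumed data).
train_vectors = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                 (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["aa", "aa", "aa", "s", "s", "s"]

posteriors = knn_posteriors((0.2, 0.2), train_vectors, train_labels, k=5)
best_label = max(posteriors, key=posteriors.get)  # best hypothesis for the segment
```

With k = 5 in this toy setup, three of the five nearest neighbours are labelled "aa" and two are labelled "s", so the estimated posteriors are 0.6 and 0.4 and the best hypothesis is "aa". A real system would replace the sorted scan with an approximate index, since the brute-force search visits every training vector for every query.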