A segmental non-parametric-based phoneme recognition approach at the acoustical level

Authors:
Ladan Golipour;Douglas O'Shaughnessy
Affiliations:
INRS-EMT, Montreal, QC, Canada;INRS-EMT, Montreal, QC, Canada
Venue:
Computer Speech and Language
Year:
2012

Abstract

Although Hidden Markov Models (HMMs) are still the mainstream approach to speech recognition, their intrinsic limitations, such as the use of first-order Markov models or the assumption of independent and identically distributed frames, lead to extensive reliance on higher-level linguistic information to produce satisfactory results. Researchers have therefore begun investigating the incorporation of various discriminative techniques at the acoustical level to induce more discrimination between speech units. k-nearest neighbour (k-NN) density estimation is discriminant by nature and is widely used in the pattern recognition field. However, its application to speech recognition has been limited to a few experiments. In this paper, we introduce a new segmental k-NN-based phoneme recognition technique. In this approach, a group-delay-based method generates phoneme boundary hypotheses, and an approximate version of k-NN density estimation is used for the classification and scoring of variable-length segments. During decoding, the construction of the phonetic graph starts from the best phoneme boundary setting and progresses by splitting and merging segments, using the remaining boundary hypotheses and constraints such as phoneme duration and broad-class similarity information. To perform the k-NN search, we take advantage of a similarity search algorithm called Spatial Approximate Sample Hierarchy (SASH). One major advantage of the SASH algorithm is that its computational complexity is independent of the dimensionality of the data. This allows us to use high-dimensional feature vectors to represent phonemes. Because phonemes are used as the units of speech, the search space is very limited and the decoding process is fast.
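As a rough illustration of k-NN-based segment scoring, the sketch below estimates class posteriors for one segment as the fraction of its k nearest training vectors that carry each phoneme label. It is not the authors' implementation: the function name `knn_class_posteriors`, the assumption that each segment has already been mapped to a fixed-length vector, and the exact brute-force search (used here in place of an approximate index such as SASH) are all illustrative assumptions.

```python
import numpy as np


def knn_class_posteriors(query, train_x, train_y, k=5):
    """Estimate P(label | query) from the labels of the k nearest neighbours.

    Assumption: every segment is represented by a fixed-length vector.
    """
    # Exact brute-force Euclidean search. This is a simple stand-in for an
    # approximate similarity index; it is NOT the SASH algorithm.
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:k]

    # Count how often each label occurs among the k neighbours.
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return {str(label): float(count) / k for label, count in zip(labels, counts)}


# Hypothetical toy data: six 2-D "segment vectors" with two phoneme labels.
train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array(["aa", "aa", "aa", "iy", "iy", "iy"])

posteriors = knn_class_posteriors(np.array([0.2, 0.2]), train_x, train_y, k=4)
best_label = max(posteriors, key=posteriors.get)
```

Taking the highest-scoring label, as in the last line, loosely corresponds to keeping only the best hypothesis for a segment; a decoder could equally keep the full score dictionary for later rescoring.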
Evaluated using only the best hypothesis for every segment, and without phoneme transition probabilities, context-based information, or language model information, the proposed algorithm achieves an accuracy of 58.5% and a correctness of 67.8% on the TIMIT test dataset.
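The abstract does not define accuracy and correctness. The sketch below assumes the definitions commonly used in phoneme recognition scoring (for example in HTK's HResults tool), under which correctness ignores insertion errors and accuracy penalizes them. The function name and the error counts are hypothetical and are not taken from the paper.

```python
def phoneme_scores(n_ref, substitutions, deletions, insertions):
    """Return (correctness, accuracy) as fractions of the reference length.

    Assumed definitions (not stated in the abstract):
      correctness = (N - S - D) / N
      accuracy    = (N - S - D - I) / N
    """
    hits = n_ref - substitutions - deletions
    correctness = hits / n_ref
    accuracy = (hits - insertions) / n_ref
    return correctness, accuracy


# Hypothetical counts, chosen only to show how the two measures differ.
correctness, accuracy = phoneme_scores(
    n_ref=1000, substitutions=200, deletions=100, insertions=50
)
```

Under these assumed definitions, the gap between the two measures equals the insertion rate, so a correctness of 67.8% with an accuracy of 58.5% would correspond to insertions amounting to about 9.3% of the reference phonemes.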