Current-generation automatic speech recognition (ASR) systems assume that words are readily decomposable into constituent phonetic components (“phonemes”). A detailed linguistic dissection of state-of-the-art speech recognition systems indicates that the conventional phonemic “beads-on-a-string” approach is of limited utility, particularly with respect to informal, conversational material. The study shows that there is a significant gap between the observed data and the pronunciation models of current ASR systems. It also shows that many important factors affecting recognition performance are not modeled explicitly in these systems. Motivated by these findings, this dissertation analyzes spontaneous speech with respect to three important, but often neglected, components of speech (at least with respect to English ASR): articulatory-acoustic features (AFs), the syllable, and stress accent. The analysis results provide evidence for an alternative approach to speech modeling, one in which the syllable assumes pre-eminent status and is melded to the lower as well as the higher tiers of linguistic representation through the incorporation of prosodic information such as stress accent. Using concrete examples and statistics from spontaneous speech material, it is shown that there exists a systematic relationship between the realization of AFs and stress accent in conjunction with syllable position. This relationship can be used to provide an accurate and parsimonious characterization of pronunciation variation in spontaneous speech. An approach to automatically extracting AFs from the acoustic signal is also developed, as is a system for the automatic stress-accent labeling of spontaneous speech. Based on the results of these studies, a syllable-centric, multi-tier model of speech recognition is proposed.
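The relationship described above can be sketched as a conditional tabulation: how often each canonical AF value is realized as some surface value, given syllable position and stress accent. The following is a minimal illustration only, not the dissertation's actual procedure; the feature names, positions, and token data are invented for the example.

```python
from collections import Counter, defaultdict

# Toy tokens: (canonical AF, realized AF, syllable position, stress accent).
# All values are hypothetical, chosen only to illustrate the tabulation.
tokens = [
    ("voiced", "voiced",   "onset", "stressed"),
    ("voiced", "unvoiced", "coda",  "unstressed"),
    ("voiced", "voiced",   "onset", "stressed"),
    ("voiced", "unvoiced", "coda",  "unstressed"),
    ("voiced", "voiced",   "coda",  "stressed"),
]

def af_variation_stats(tokens):
    """Estimate P(realized AF | canonical AF, syllable position, stress)."""
    counts = defaultdict(Counter)
    for canonical, realized, position, stress in tokens:
        counts[(canonical, position, stress)][realized] += 1
    return {
        context: {real: n / sum(c.values()) for real, n in c.items()}
        for context, c in counts.items()
    }

stats = af_variation_stats(tokens)
# In this toy sample, canonical [voiced] devoices in unstressed codas:
print(stats[("voiced", "coda", "unstressed")])  # {'unvoiced': 1.0}
```

Conditioning jointly on syllable position and stress accent, rather than on phone identity alone, is what makes the resulting pronunciation-variation model both accurate and parsimonious in the sense described above.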
The model explicitly relates AFs, phonetic segments and syllable constituents to a framework for lexical representation, and incorporates stress-accent information into recognition. A test-bed implementation of the model is developed using a fuzzy-based approach for combining evidence from various AF sources and a pronunciation-variation modeling technique using AF-variation statistics extracted from data. Experiments on a limited-vocabulary speech recognition task using both automatically derived and fabricated data demonstrate the advantage of incorporating AF and stress-accent modeling within the syllable-centric, multi-tier framework, particularly with respect to pronunciation variation in spontaneous speech.
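The fuzzy-based evidence combination mentioned above can be sketched as follows. This is an assumed, minimal formulation rather than the dissertation's implementation: each AF source reports a membership degree in [0, 1] for each candidate feature value, and the sources are merged with a t-norm (here, min) before selecting the best-supported value. The detector names and scores are illustrative.

```python
def fuzzy_combine(sources, t_norm=min):
    """Merge per-source membership dicts {af_value: degree} with a t-norm."""
    values = set().union(*(s.keys() for s in sources))
    return {v: t_norm(s.get(v, 0.0) for s in sources) for v in values}

# Two hypothetical AF sources scoring the place-of-articulation feature:
acoustic_detector = {"labial": 0.8, "alveolar": 0.3, "velar": 0.1}
prosodic_context  = {"labial": 0.6, "alveolar": 0.7, "velar": 0.2}

combined = fuzzy_combine([acoustic_detector, prosodic_context])
best = max(combined, key=combined.get)
print(best, combined[best])  # labial 0.6
```

Using min as the t-norm means a feature value is only as well supported as its weakest source; a product t-norm (or any other conjunction operator) could be substituted without changing the structure of the combination.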