Voice-to-phoneme conversion algorithms for voice-tag applications in embedded platforms

Authors:
Yan Ming Cheng; Changxue Ma; Lynette Melnar
Affiliations:
Human Interaction Research, Motorola Labs, Schaumburg, IL (all authors)
Venue:
EURASIP Journal on Audio, Speech, and Music Processing - Scalable Audio-Content Analysis
Year:
2008

Abstract

We describe two voice-to-phoneme conversion algorithms for speaker-independent voice-tag creation, targeted specifically at applications on embedded platforms. The two algorithms (batch mode and sequential) are compared in speech recognition experiments. They are first applied in a same-language context, in which acoustic model training and voice-tag creation and application all use the same language. Their performance is then tested in a cross-language setting, where the acoustic models are trained on one source language while the voice-tags are created and applied in a different target language. In the same-language environment, both algorithms perform either comparably to or significantly better than the baseline in which utterances are manually transcribed by a phonetician. In the cross-language context, voice-tag performance varies with the source-target language pair, and the variation reflects the predicted phonological similarity between the source and target languages. For the most similar language pairs, performance approaches that of the native-trained models and surpasses the native reference baseline.