Improvements in non-verbal cue identification using multilingual phone strings

  • Authors:
  • Tanja Schultz; Qin Jin; Kornel Laskowski; Alicia Tribble; Alex Waibel

  • Affiliations:
  • Carnegie Mellon University (all authors)

  • Venue:
  • S2S '02 Proceedings of the ACL-02 workshop on Speech-to-speech translation: algorithms and systems - Volume 7
  • Year:
  • 2002


Abstract

Today's state-of-the-art front-ends for multilingual speech-to-speech translation systems apply monolingual speech recognizers trained for a single language and/or accent. The monolingual speech engine is usually adapted to an unknown speaker over time using unsupervised training methods; if, however, the speaker was seen during training, that speaker's specialized acoustic model is applied instead, since it achieves better performance. To make full use of specialized acoustic models in this scenario, the speaker must be identified automatically and with high accuracy. Furthermore, monolingual speech recognizers currently rely on the language and/or accent being selected beforehand by the user. This requires the user's cooperation and an interface that easily allows for such selection. Both requirements are awkward and error-prone, especially when translation services are provided for many languages on small devices such as PDAs or telephones. For these scenarios, front-ends are desired that automatically identify the spoken language or accent. We believe that the automatic identification of an utterance's non-verbal cues, such as language, accent, and speaker, is necessary for the successful deployment of speech-to-speech translation systems.
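The abstract motivates automatic identification of language, accent, and speaker, and the title indicates this is done with multilingual phone strings. Below is a minimal, illustrative sketch of the general phone-string idea in the PRLM style (phone recognition followed by phonotactic language modeling): a decoded phone string is scored against per-class phone n-gram models, and the best-scoring class is selected. The toy training data, the bigram order, the add-one smoothing, and all function names are assumptions made for illustration, not the authors' actual system.

```python
# Sketch: phonotactic classification over phone strings (PRLM-style).
# All data and names here are illustrative assumptions.
import math
from collections import defaultdict

def train_phone_bigram(phone_strings):
    """Estimate add-one-smoothed bigram log-probabilities over phone tokens."""
    bigram_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for phones in phone_strings:
        tokens = ["<s>"] + phones.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[prev][cur] += 1
    v = len(vocab)
    model = {}
    for prev, nexts in bigram_counts.items():
        total = sum(nexts.values())
        for cur in vocab:
            model[(prev, cur)] = math.log((nexts.get(cur, 0) + 1) / (total + v))
    model["<unk>"] = math.log(1.0 / v)  # fallback for unseen histories/tokens
    return model

def score(model, phone_string):
    """Log-likelihood of a decoded phone string under one phonotactic model."""
    tokens = ["<s>"] + phone_string.split() + ["</s>"]
    return sum(model.get((p, c), model["<unk>"])
               for p, c in zip(tokens, tokens[1:]))

def identify(models, phone_string):
    """Return the class whose model best explains the phone string."""
    return max(models, key=lambda name: score(models[name], phone_string))

if __name__ == "__main__":
    # Toy phone strings standing in for the output of a phone recognizer.
    models = {
        "EN": train_phone_bigram(["HH AH L OW", "G UH D M AO R N IH NG"]),
        "DE": train_phone_bigram(["G UW T AH N T AA K", "HH AA L OW"]),
    }
    print(identify(models, "G UH D M AO R N IH NG"))  # -> EN
```

In principle, the same scoring machinery can be pointed at speaker or accent classes instead of languages, which is how a single phone-string front-end could serve all three identification tasks named in the abstract.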