Unsupervised learning of acoustic sub-word units

Authors:
Balakrishnan Varadarajan;Sanjeev Khudanpur;Emmanuel Dupoux
Affiliations:
Johns Hopkins University, Baltimore, MD;Johns Hopkins University, Baltimore, MD;Laboratoire de Science Cognitive, Paris, France
Venue:
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Year:
2008

Citing 1
Cited 5

Maximum likelihood successive state splitting

ICASSP '96 Proceedings of the Acoustics, Speech, and Signal Processing, 1996. on Conference Proceedings., 1996 IEEE International Conference - Volume 02

Data-Derived Models for Segmentation with Application to Surgical Assessment and Training

MICCAI '09 Proceedings of the 12th International Conference on Medical Image Computing and Computer-Assisted Intervention: Part I
A Computational Model of Unsupervised Speech Segmentation for Correspondence Learning

Research on Language and Computation
A nonparametric Bayesian approach to acoustic model discovery

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Bootstrapping a unified model of lexical and phonetic acquisition

ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1
Joint training of non-negative Tucker decomposition and discrete density hidden Markov models

Computer Speech and Language

Quantified Score

Hi-index	0.00

Visualization

Abstract

Accurate unsupervised learning of phonemes of a language directly from speech is demonstrated via an algorithm for joint unsupervised learning of the topology and parameters of a hidden Markov model (HMM); states and short state-sequences through this HMM correspond to the learnt sub-word units. The algorithm, originally proposed for unsupervised learning of allophonic variations within a given phoneme set, has been adapted to learn without any knowledge of the phonemes. An evaluation methodology is also proposed, whereby the state-sequence that aligns to a test utterance is transduced in an automatic manner to a phoneme-sequence and compared to its manual transcription. Over 85% phoneme recognition accuracy is demonstrated for speaker-dependent learning from fluent, large-vocabulary speech.