Selective sampling of training data for speech recognition

  • Authors:
  • Teresa M. Kamm; Gerard G. L. Meyer

  • Affiliations:
  • The Johns Hopkins University, Baltimore, MD; The Johns Hopkins University, Baltimore, MD

  • Venue:
  • HLT '02: Proceedings of the Second International Conference on Human Language Technology Research
  • Year:
  • 2002

Abstract

Speech recognition systems are expensive to train, largely because of the high cost of annotating training data. We previously proposed an iterative training selection algorithm [1], which sought to improve speech recognition by automatically selecting a subset of the available manually transcribed training data, thereby improving error rates without incurring additional transcription cost. We suggest one improvement to that "selective sampling" algorithm and show that it reduces the error rate on a particular alphadigit recognition problem from 10.3% to 9.5%. We then extend the iterative training selection algorithm to work with untranscribed speech, guiding the selection of speech that is then transcribed. We show, on a particular alphadigit recognition problem, that it is possible to match the baseline error rate while incurring only 25% of the transcription cost.
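
The abstract does not spell out the selection criterion the algorithm uses. As a rough illustration only, the sketch below shows one common way such an iterative selection loop can be organized: rank the untranscribed pool by a per-utterance score from the current model, pick the lowest-scoring batch for transcription, and repeat until a budget is reached. The function and parameter names (confidence, selective_sampling, budget, batch_size) are invented for this example and are not taken from the paper.

```python
import random

def confidence(utterance, model):
    # Hypothetical stand-in for a recognizer's per-utterance confidence score;
    # the actual ranking signal used by Kamm and Meyer is not given in the abstract.
    return model.get(utterance, random.random())

def selective_sampling(untranscribed, model, budget, batch_size):
    """Iteratively pick the utterances the current model scores lowest,
    send them for transcription, and (in a real system) retrain on them."""
    selected = []
    pool = list(untranscribed)
    while pool and len(selected) < budget:
        # Rank the remaining pool by current-model confidence, lowest first.
        pool.sort(key=lambda u: confidence(u, model))
        batch = pool[:batch_size]
        pool = pool[batch_size:]
        selected.extend(batch)
        # Placeholder: transcribe `batch` and update `model` before the next pass.
    return selected

if __name__ == "__main__":
    utterances = [f"utt_{i:03d}" for i in range(100)]
    model = {}  # empty toy model, so confidences fall back to random values
    chosen = selective_sampling(utterances, model, budget=25, batch_size=5)
    print(f"Selected {len(chosen)} of {len(utterances)} utterances for transcription")
```

In a setting like the one described in the abstract, a loop of this shape would stop once the transcription budget (here, 25% of the pool) is spent, with the selected subset used to train the recognizer.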