Multilingual speech databases at LDC

Authors:
John J. Godfrey
Affiliations:
University of Pennsylvania, Philadelphia, PA
Venue:
HLT '94 Proceedings of the workshop on Human Language Technology
Year:
1994

Citing 1
Cited 0

SWITCHBOARD: telephone speech corpus for research and development

ICASSP'92 Proceedings of the 1992 IEEE international conference on Acoustics, speech and signal processing - Volume 1

Quantified Score

Hi-index	0.00

Visualization

Abstract

As multilingual products and technology grow in importance, the Linguistic Data Consortium (LDC) intends to provide the resources needed for research and development activities, especially in telephone-based, small-vocabulary recognition applications; language identification research; and large vocabulary continuous speech recognition research.The POLYPHONE corpora, a multilingual "database of databases," are specifically designed to meet the needs of telephone application development and testing. Data sets from many of the world's commercially important languages will be available within the next few years.Language identification corpora will be large sets of spontaneous telephone speech in several languages with a wide variety of speakers, channels, and handsets. One corpus is now available, and current plans call for corpora of increasing size and complexity over the next few years.Large vocabulary speech recognition requires transcribed speech, pronouncing dictionaries, and language models. To fill this need, LDC will use the unattended computer-controlled collection methods developed for SWITCH-BOARD to create several similar corpora, each about one-tenth the size of SWITCHBOARD, in other languages. Text corpora sufficient to create useful language models will be collected and distributed as well. Finally, pronouncing dictionaries covering the vocabulary of both transcripts and texts will be produced and made available.