Large Vocabulary Speech Recognition for Read and Broadcast Czech
TSD '99 Proceedings of the Second International Workshop on Text, Speech and Dialogue
Large Vocabulary Continuous Speech Recognizer for Slovenian Language
TSD '01 Proceedings of the 4th International Conference on Text, Speech and Dialogue
Syllable Based Language Model for Large Vocabulary Continuous Speech Recognition of Polish
TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
Analysis of Czech web 1T 5-gram corpus and its comparison with Czech national corpus data
TSD '10 Proceedings of the 13th International Conference on Text, Speech and Dialogue
COST'09 Proceedings of the Second international conference on Development of Multimodal Interfaces: active Listening and Synchrony
A morphological analyzer using hash tables in main memory (MAHT) and a lexical knowledge base
CICLing'12 Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I
In our paper we propose a new technique for language modelling of highly inflectional languages such as Czech, Russian and other Slavic languages. Our aim is to alleviate the main problem encountered in these languages, which is the enormous vocabulary growth caused by the great number of different word forms derived from one word (lemma). We reduced the size of the vocabulary by decomposing words into stems and endings and storing these sub-word units (morphemes) in the vocabulary separately. Then we trained a morpheme-based language model on the decomposed corpus. This paper reports perplexities, OOV rates and some speech recognition results obtained with the new language model.
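The stem-and-ending decomposition described above can be illustrated with a minimal Python sketch. It is not the paper's actual algorithm: the ending inventory `ENDINGS`, the minimum stem length `MIN_STEM`, the longest-suffix matching rule, the `+` marker on endings, and the toy word list are all assumptions made for illustration only.

```python
# Hypothetical ending inventory (illustrative, not taken from the paper).
ENDINGS = ("ami", "ou", "em", "y", "u", "a", "e")
MIN_STEM = 2  # assumed minimum number of characters left in the stem


def decompose(word):
    """Split a word into (stem, ending) using the longest matching ending.

    The ending is prefixed with '+' so it cannot collide with a stem;
    words with no matching ending are returned whole with an empty ending.
    """
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM:
            return word[: -len(ending)], "+" + ending
    return word, ""


def decompose_corpus(tokens):
    """Rewrite a token sequence as a sequence of stems and endings."""
    units = []
    for token in tokens:
        stem, ending = decompose(token)
        units.append(stem)
        if ending:
            units.append(ending)
    return units


# Toy training data: inflected forms of three lemmas (ASCII-only for simplicity).
train = ["lampa", "lampy", "lampu", "lampami",
         "voda", "vody", "vodu", "vodou", "vodami",
         "hrad", "hradu", "hradem", "hrady"]

word_vocab = set(train)                     # one entry per word form
morph_vocab = set(decompose_corpus(train))  # stems and endings stored separately

# "lampou" never occurs in the training data, so it is out of vocabulary
# for the word-based model, yet both of its units were seen in other words.
unseen_units = decompose_corpus(["lampou"])
covered = all(unit in morph_vocab for unit in unseen_units)
```

On this toy data the word vocabulary holds 13 entries while the morpheme vocabulary holds 9, and the unseen form `lampou` is covered by units seen in other words. A language model trained on the output of `decompose_corpus` would then predict stems and endings instead of whole word forms.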