Slovak language model from internet text data

Authors:
Ján Staš;Daniel Hládek;Matúš Pleva;Jozef Juhár
Affiliations:
Technical University of Košice, Faculty of Electrical Engineering and Informatics, Laboratory of Advanced Speech Technologies, Košice, Slovakia;Technical University of Košice, Faculty of Electrical Engineering and Informatics, Laboratory of Advanced Speech Technologies, Košice, Slovakia;Technical University of Košice, Faculty of Electrical Engineering and Informatics, Laboratory of Advanced Speech Technologies, Košice, Slovakia;Technical University of Košice, Faculty of Electrical Engineering and Informatics, Laboratory of Advanced Speech Technologies, Košice, Slovakia
Venue:
Proceedings of the Third COST 2102 international training school conference on Toward autonomous, adaptive, and context-aware multimodal interfaces: theoretical and practical issues
Year:
2010

Abstract

An automatic speech recognition system is one component of a multimodal dialogue system. Such a system needs a correct vocabulary and a suitable language model. The main aim of this article is to describe the process of building large-vocabulary statistical models of the Slovak language, trained on text data gathered mainly from Internet sources. Several smoothing techniques were applied for different vocabulary sizes in order to obtain an optimal model of the Slovak language. We also used a pruning technique based on relative entropy to reduce the size of the language model, looking for the largest pruning threshold that causes minimal degradation in recognition accuracy. Tests were performed with a decoder based on the HTK Toolkit.
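
To illustrate what smoothing a statistical language model involves, here is a minimal Python sketch of a bigram model with interpolated absolute discounting, evaluated by perplexity. The abstract does not say which smoothing techniques were compared, so the choice of absolute discounting, the `BigramLM` class, the discount value 0.75, and the three toy Slovak sentences are all assumptions made for illustration. This is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


class BigramLM:
    """Toy bigram model with interpolated absolute discounting (illustrative only)."""

    def __init__(self, sentences, discount=0.75):
        self.discount = discount          # assumed value, not taken from the paper
        self.unigrams = Counter()         # counts of predicted words
        self.bigrams = Counter()          # counts of (previous, word) pairs
        self.history = Counter()          # how often each word occurs as a history
        self.followers = defaultdict(set) # distinct words seen after each history
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(tokens[1:])  # <s> is never predicted
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.history[prev] += 1
                self.followers[prev].add(word)
        self.total = sum(self.unigrams.values())
        # One extra slot stands for the class of all unseen words.
        self.vocab_size = len(self.unigrams) + 1

    def unigram_prob(self, word):
        # Add-one smoothed unigram estimate, used as the lower-order distribution.
        return (self.unigrams[word] + 1) / (self.total + self.vocab_size)

    def prob(self, prev, word):
        h = self.history[prev]
        if h == 0:
            # Unseen history: fall back to the unigram estimate.
            return self.unigram_prob(word)
        discounted = max(self.bigrams[(prev, word)] - self.discount, 0.0) / h
        # Probability mass removed by discounting, redistributed via unigrams.
        backoff_weight = self.discount * len(self.followers[prev]) / h
        return discounted + backoff_weight * self.unigram_prob(word)

    def perplexity(self, sentences):
        log_sum, n = 0.0, 0
        for sent in sentences:
            tokens = ["<s>"] + sent.split() + ["</s>"]
            for prev, word in zip(tokens, tokens[1:]):
                log_sum += math.log(self.prob(prev, word))
                n += 1
        return math.exp(-log_sum / n)


lm = BigramLM(["dobrý deň", "dobrý večer", "dobrú noc"])
seen = lm.perplexity(["dobrý deň"])    # word order seen in training
unseen = lm.perplexity(["noc dobrý"])  # word order never seen in training
print(f"seen: {seen:.2f}  unseen: {unseen:.2f}")
```

Lower perplexity on held-out text is the usual proxy for a better language model. The entropy-based pruning mentioned in the abstract would additionally drop n-grams whose removal changes the model's distribution only slightly; that step is not shown in this sketch.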