Automatic topic identification for large scale language modeling data filtering

Authors:
Lucie Skorkovská;Pavel Ircing;Aleš Pražák;Jan Lehečka
Affiliations:
University of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics, Plzeň, Czech Republic;University of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics, Plzeň, Czech Republic;University of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics, Plzeň, Czech Republic;University of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics, Plzeň, Czech Republic
Venue:
TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
Year:
2011

Abstract

The paper presents a topic identification module that is embedded in a complex system for acquiring and storing large volumes of text data from the Web. The module processes each acquired data item and assigns it keywords from a defined topic hierarchy, which was developed for this purpose and is also described in the paper. The quality of the topic identification is evaluated in two ways: directly, using classic precision-recall measures, and indirectly, by measuring the ASR performance of topic-specific language models built from the automatically filtered data.
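
As an illustration of the direct evaluation mentioned above, the following minimal sketch computes precision and recall for keyword assignment, where each document may carry several topic keywords. The abstract does not say how scores are averaged, so the micro-averaging used here is an assumption. The function name `micro_precision_recall`, the document identifiers, and the topic labels are likewise hypothetical and not taken from the paper.

```python
from typing import Dict, Set, Tuple


def micro_precision_recall(
    predicted: Dict[str, Set[str]],
    reference: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over all (document, keyword) pairs.

    Assumption: micro-averaging; the paper's exact scheme is not given here.
    """
    tp = fp = fn = 0
    for doc_id, gold in reference.items():
        pred = predicted.get(doc_id, set())
        tp += len(pred & gold)   # keywords assigned and correct
        fp += len(pred - gold)   # keywords assigned but not in the reference
        fn += len(gold - pred)   # reference keywords the module missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: reference annotations vs. module output.
reference = {
    "doc1": {"sport", "football"},
    "doc2": {"politics"},
}
predicted = {
    "doc1": {"sport", "tennis"},
    "doc2": {"politics", "economy"},
}

precision, recall = micro_precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, two of the four assigned keywords are correct and two of the three reference keywords are found, so precision is 0.5 and recall is about 0.67.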