Automatic topic identification for large scale language modeling data filtering

  • Authors:
  • Lucie Skorkovská;Pavel Ircing;Aleš Pražák;Jan Lehečka

  • Affiliations:
  • University of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics, Plzeň, Czech Republic (all four authors)

  • Venue:
  • TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
  • Year:
  • 2011

Abstract

The paper presents a topic identification module embedded in a complex system for acquiring and storing large volumes of text data from the Web. The module processes each acquired data item and assigns it keywords from a defined topic hierarchy, which was developed for this purpose and is also described in the paper. The quality of the topic identification is evaluated in two ways: directly, using classic precision-recall measures, and indirectly, by measuring the automatic speech recognition (ASR) performance of topic-specific language models built from the automatically filtered data.
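
As an illustration of the precision-recall evaluation mentioned in the abstract, the following minimal sketch computes micro-averaged precision and recall when each document is assigned a set of topic keywords. The abstract does not state how the paper averages its scores, so micro-averaging is an assumption here, and the function name `precision_recall` and the topic labels are hypothetical.

```python
def precision_recall(assigned, reference):
    """Micro-averaged precision and recall over a document collection.

    assigned, reference: lists of sets of topic keywords, one set per
    document, in the same document order. (Hypothetical data layout.)
    """
    # Keywords that were both assigned and present in the reference.
    true_positives = sum(len(a & r) for a, r in zip(assigned, reference))
    n_assigned = sum(len(a) for a in assigned)
    n_reference = sum(len(r) for r in reference)
    precision = true_positives / n_assigned if n_assigned else 0.0
    recall = true_positives / n_reference if n_reference else 0.0
    return precision, recall


# Toy example with made-up topic labels (not taken from the paper).
assigned = [{"sport", "football"}, {"politics"}, {"economy", "politics"}]
reference = [{"sport", "football", "hockey"}, {"politics"}, {"economy"}]
p, r = precision_recall(assigned, reference)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision penalizes keywords assigned in error, while recall penalizes reference keywords that were missed; reporting both shows the trade-off when the number of keywords assigned per document is varied.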