Web text data mining for building large scale language modelling corpus

  • Authors:
  • Jan Švec; Jan Hoidekr; Daniel Soutner; Jan Vavruška

  • Affiliations:
  • University of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics, Plzeň, Czech Republic (all four authors)

  • Venue:
  • TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
  • Year:
  • 2011

Abstract

The paper describes a system for collecting a large text corpus from Internet news servers. It presents the system architecture and the text preprocessing algorithms, along with the algorithm used for duplicate detection. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are presented.