Web text data mining for building large scale language modelling corpus

Authors:
Jan Švec; Jan Hoidekr; Daniel Soutner; Jan Vavruška
Affiliations:
University of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics, Plzeň, Czech Republic (all authors)
Venue:
TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
Year:
2011

Abstract

The paper describes a system for collecting a large text corpus from Internet news servers. It presents the system architecture, the text preprocessing algorithms, and the duplicate detection algorithm used. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are presented.
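
The abstract does not say which duplicate detection algorithm the system uses. As an illustration only, the sketch below shows one common approach to finding near-duplicate articles: word shingling with Jaccard resemblance. The function names, the shingle size `k=3`, and the threshold `0.8` are all assumptions made for this example, not values taken from the paper.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word k-grams) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts: treat the whole word sequence as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(sa, sb):
    """Jaccard similarity of two sets, in [0, 1]."""
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def resemblance(a, b, k=3):
    """Resemblance of two texts: Jaccard similarity of their shingle sets."""
    return jaccard(shingles(a, k), shingles(b, k))


def find_duplicates(articles, threshold=0.8, k=3):
    """Return index pairs (i, j), i < j, whose resemblance >= threshold.

    Brute-force pairwise comparison, for illustration only.
    """
    sets = [shingles(text, k) for text in articles]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Comparing every pair of articles is quadratic in the number of articles, so this form is only practical for small collections. At the scale of millions of articles, the shingle sets would normally be reduced to short sketches (for example with min-wise hashing) so that candidate pairs can be found without comparing everything with everything.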