The paper describes a system for collecting a large text corpus from Internet news servers. It presents the system architecture, the text preprocessing algorithms, and the duplicate detection algorithm used. The resulting corpus contains more than 1 billion tokens in more than 3 million articles, with topics assigned and duplicates identified. Corpus statistics such as consistency and perplexity are presented.
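
The abstract does not say which duplicate detection algorithm the system uses. As a rough illustration of the general idea only, the sketch below compares two articles by the Jaccard similarity of their word shingles, a common technique for near-duplicate detection. The function names, the shingle size `k`, and the `threshold` value are all assumptions made for this example and are not taken from the paper.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) of a text.

    Tokenization is a plain lowercase whitespace split, chosen for
    illustration; a real pipeline would use its own preprocessing.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, k=3, threshold=0.8):
    """Flag two texts as near-duplicates if their shingle sets overlap enough.

    The threshold of 0.8 is a hypothetical value for this sketch.
    """
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Comparing every pair of articles this way is quadratic in the number of articles, so at the scale of millions of articles a practical system would need some form of hashing or sampling of the shingle sets to reduce the number of comparisons.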