hrWaC and slWac: compiling web corpora for Croatian and Slovene

Authors:
Nikola Ljubešić; Tomaž Erjavec
Affiliations:
Faculty of Humanities and Social Sciences, University of Zagreb, Croatia; Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
Venue:
TSD'11 Proceedings of the 14th international conference on Text, speech and dialogue
Year:
2011

Abstract

Web corpora have become an attractive source of linguistic content, yet for many languages they are still not available. This paper introduces two new annotated web corpora: the Croatian hrWaC and the Slovene slWaC. Both were built with a modified version of the standard "Web as Corpus" pipeline, adapted to the limited amount of web data available for these languages. The paper describes the modifications, focusing on content extraction from HTML pages, which combines high precision of the extracted language content with decent recall. The paper also investigates the text types of the acquired corpora using topic modeling, comparing the two corpora with each other and with ukWaC.
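
The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of what precision-oriented content extraction from HTML can look like, not the authors' method. It assumes a simple heuristic: split a page into text blocks, then keep a block only if it is long enough and not dominated by link text. The names `BlockCollector` and `extract_content`, and the thresholds `min_words` and `max_link_density`, are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries and tags whose content is ignored.
# Both sets are illustrative assumptions, not taken from the paper.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "blockquote"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collect (text, link_chars) pairs, one per block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Normalize whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_content(html, min_words=10, max_link_density=0.3):
    """Return text blocks that pass the length and link-density filters."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()  # store any trailing text not closed by a block tag
    kept = []
    for text, link_chars in collector.blocks:
        if len(text.split()) < min_words:
            continue  # too short: likely a menu item, caption, or footer
        if link_chars / len(text) > max_link_density:
            continue  # mostly link text: likely navigation
        kept.append(text)
    return kept
```

Stricter thresholds discard more navigation and footer text at the cost of losing some short but genuine paragraphs; looser thresholds recover those paragraphs but let more boilerplate through. This is the precision and recall trade-off the abstract refers to.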