Zero-cost labelling with web feeds for weblog data extraction

Authors:
George Gkotsis;Karen Stepanyan;Alexandra I. Cristea;Mike S. Joy
Affiliations:
University of Warwick, Coventry, United Kingdom;University of Warwick, Coventry, United Kingdom;University of Warwick, Coventry, United Kingdom;University of Warwick, Coventry, United Kingdom
Venue:
Proceedings of the 22nd international conference on World Wide Web companion
Year:
2013

Abstract

Data extraction from web pages often involves either human intervention to train a wrapper or a reduced level of granularity in the information acquired. Even though the study of social media has drawn the attention of researchers, weblogs remain a part of the web that cannot be harvested efficiently. In this paper, we propose a fully automated approach to generating a wrapper for weblogs that exploits web feeds for cheap labelling of weblog properties. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach to derive a set of rules and automate the process of wrapper generation. Our evaluation shows that the approach is robust, accurate and efficient in handling different types of weblogs.
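
The matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rule shape (tag, class, id), the confidence ratio, and all names such as `ElementCollector`, `score_rules` and `best_rule` are assumptions made for illustration, and the HTML is assumed to be well formed. Each post is paired with the value its feed entry gives for one property (here, the title); a candidate rule scores highly when the elements it selects equal the feed value across many posts.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so the element stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ElementCollector(HTMLParser):
    """Collects (rule, text) pairs; a rule is a hypothetical (tag, class, id) triple."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements: [rule, text chunks]
        self.elements = []  # closed elements: (rule, whitespace-normalised text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        a = dict(attrs)
        self.stack.append([(tag, a.get("class"), a.get("id")), []])

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, chunks in self.stack:
            chunks.append(data)

    def handle_endtag(self, tag):
        # Assumes well-formed HTML: the closing tag matches the top of the stack.
        if tag in VOID_TAGS or not self.stack:
            return
        rule, chunks = self.stack.pop()
        self.elements.append((rule, " ".join("".join(chunks).split())))


def score_rules(posts):
    """posts: list of (html, feed_value). Returns {rule: (confidence, support)}.

    support    = number of elements selected by the rule whose text equals the feed value
    confidence = support / number of all elements selected by the rule
    (an illustrative ratio, not the scoring used in the paper)
    """
    matched, seen = Counter(), Counter()
    for html, feed_value in posts:
        parser = ElementCollector()
        parser.feed(html)
        parser.close()
        target = " ".join(feed_value.split())
        for rule, text in parser.elements:
            seen[rule] += 1
            if text == target:
                matched[rule] += 1
    return {rule: (matched[rule] / seen[rule], matched[rule]) for rule in matched}


def best_rule(posts):
    """Returns the rule with the highest (confidence, support), or None."""
    scores = score_rules(posts)
    return max(scores, key=lambda r: scores[r]) if scores else None


# Two made-up posts from the same weblog, labelled by their feed titles.
sample_posts = [
    ('<html><body><div class="nav"><a href="/">Home</a></div>'
     '<h1 class="entry-title">First post</h1>'
     '<div class="entry-content"><p>Hello world</p></div></body></html>',
     "First post"),
    ('<html><body><div class="nav"><a href="/">Home</a></div>'
     '<h1 class="entry-title">Second post</h1>'
     '<div class="entry-content"><p>More text</p></div></body></html>',
     "Second post"),
]

print(best_rule(sample_posts))
```

Because the feed supplies the label for every post, no manual annotation is needed, and the scores are aggregated over all posts at once rather than by comparing posts pairwise.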