Zero-cost labelling with web feeds for weblog data extraction

  • Authors: George Gkotsis, Karen Stepanyan, Alexandra I. Cristea, Mike S. Joy
  • Affiliations: University of Warwick, Coventry, United Kingdom (all authors)
  • Venue: Proceedings of the 22nd International Conference on World Wide Web Companion
  • Year: 2013

Abstract

Data extraction from web pages often involves either human intervention for training a wrapper or a reduced level of granularity in the information acquired. Even though the study of social media has drawn the attention of researchers, weblogs remain a part of the web that cannot be harvested efficiently. In this paper, we propose a fully automated approach to generating a wrapper for weblogs that exploits web feeds for cheap labelling of weblog properties. Instead of performing a pairwise comparison between posts, the model matches the values in the web feed against the corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach to derive a set of rules and automate wrapper generation. Our evaluation shows that the approach is robust, accurate and efficient in handling different types of weblogs.
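The feed-matching idea can be illustrated with a minimal sketch. The abstract does not specify the rule language or the scoring function, so everything here is an assumption made for illustration: a rule is a hypothetical `(tag, class)` pair, a rule's score is the fraction of posts in which the element it selects carries exactly the feed value, and the names `TextByRule`, `candidate_rules` and `induce_rule` are invented. The sample posts are toy data, not taken from the paper.

```python
from collections import Counter
from html.parser import HTMLParser


def normalise(text):
    """Collapse whitespace and lowercase so feed and HTML text compare equal."""
    return " ".join(text.split()).lower()


class TextByRule(HTMLParser):
    """Record the text of every closed element, keyed by an assumed (tag, class) rule."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements: [rule, text_parts]
        self.found = []  # (rule, normalised_text) for each closed element

    def handle_starttag(self, tag, attrs):
        self.stack.append([(tag, dict(attrs).get("class") or ""), []])

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, parts in self.stack:
            parts.append(data)

    def handle_endtag(self, tag):
        if not any(rule[0] == tag for rule, _ in self.stack):
            return  # stray end tag, ignore it
        while True:
            rule, parts = self.stack.pop()
            self.found.append((rule, normalise("".join(parts))))
            if rule[0] == tag:
                break


def candidate_rules(html, feed_value):
    """Rules whose element text equals the value labelled by the feed."""
    parser = TextByRule()
    parser.feed(html)
    parser.close()
    target = normalise(feed_value)
    return {rule for rule, text in parser.found if text == target}


def induce_rule(posts):
    """posts: list of (html, feed_value). Returns (best_rule, score in [0, 1])."""
    votes = Counter()
    for html, feed_value in posts:
        votes.update(candidate_rules(html, feed_value))
    if not votes:
        return None, 0.0
    rule, hits = votes.most_common(1)[0]
    return rule, hits / len(posts)


# Toy data: each post's HTML paired with the title taken from its feed entry.
SIDEBAR = '<ul><li class="recent">First post</li><li class="recent">About</li></ul>'
posts = [
    ("<body>" + SIDEBAR + '<h2 class="entry-title">First post</h2><p>Hello</p></body>', "First post"),
    ("<body>" + SIDEBAR + '<h2 class="entry-title">Second post</h2><p>Hi</p></body>', "Second post"),
    ("<body>" + SIDEBAR + '<h2 class="entry-title">Third post</h2><p>Hey</p></body>', "Third post"),
]

rule, score = induce_rule(posts)
print(rule, score)
```

In this toy data the sidebar link matches the feed title in the first post only, while the `entry-title` heading matches in all three, so aggregating over several posts separates the genuine title element from a coincidental match. A real wrapper would need richer rules and a more careful probability estimate than this vote fraction.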