URL tree: efficient unsupervised content extraction from streams of web documents

  • Authors:
  • Borut Sluban;Miha Grčar

  • Affiliations:
  • Jozef Stefan Institute, Ljubljana, Slovenia;Jozef Stefan Institute, Ljubljana, Slovenia

  • Venue:
  • Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

The Web represents the largest, and an increasingly growing, source of information. Extracting meaningful content from Web pages presents a challenging problem, already extensively addressed in the offline setting. In this work, we focus on content extraction from streams of HTML documents. We present an infrastructure that converts continuously acquired HTML documents into a stream of plain text documents. The presented pipeline consists of RSS readers for data acquisition from different Web sites, a duplicate removal component, and a novel content extraction algorithm which is efficient, unsupervised, and language-independent. Our content extraction approach is based on the observation that HTML documents from the same source normally share a common template. The core of the proposed content extraction algorithm is a simple data structure called URL Tree. The performance of the algorithm was evaluated in a stream setting on a time-stamped semi-automatically annotated dataset which was made publicly available. We compared the performance of URL Tree with that of several open source content extraction algorithms. The evaluation results show that our stream-based algorithm already starts outperforming the other algorithms after only 10 to 100 documents from a specific domain.