URL tree: efficient unsupervised content extraction from streams of web documents

Authors:
Borut Sluban; Miha Grčar
Affiliations:
Jozef Stefan Institute, Ljubljana, Slovenia; Jozef Stefan Institute, Ljubljana, Slovenia
Venue:
Proceedings of the 22nd ACM international conference on information & knowledge management
Year:
2013

Abstract

The Web is the largest, and a continuously growing, source of information. Extracting meaningful content from Web pages is a challenging problem that has already been extensively addressed in the offline setting. In this work, we focus on content extraction from streams of HTML documents. We present an infrastructure that converts continuously acquired HTML documents into a stream of plain text documents. The presented pipeline consists of RSS readers for data acquisition from different Web sites, a duplicate removal component, and a novel content extraction algorithm that is efficient, unsupervised, and language-independent. Our content extraction approach is based on the observation that HTML documents from the same source normally share a common template. The core of the proposed content extraction algorithm is a simple data structure called URL Tree. The performance of the algorithm was evaluated in a stream setting on a time-stamped, semi-automatically annotated dataset, which was made publicly available. We compared the performance of URL Tree with that of several open source content extraction algorithms. The evaluation results show that our stream-based algorithm starts outperforming the other algorithms after only 10 to 100 documents from a specific domain.
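
The abstract does not spell out the internals of URL Tree, so the following is only an illustrative sketch of the general idea it states: documents from the same source tend to share a template, so text blocks that recur across many documents under the same URL prefix are probably boilerplate. Everything specific here is an assumption made for illustration and is not taken from the paper: the class names `UrlTreeNode` and `UrlTree`, the parameters `min_docs` and `max_share`, the choice to key the tree on host name plus directory segments, and the use of a simple frequency threshold. The input is assumed to be a list of text blocks already split out of each HTML document.

```python
from collections import Counter
from urllib.parse import urlsplit


class UrlTreeNode:
    """One URL prefix: child prefixes plus statistics of the documents below it."""

    def __init__(self):
        self.children = {}             # next URL part -> UrlTreeNode
        self.doc_count = 0             # documents seen under this prefix
        self.block_counts = Counter()  # text block -> number of documents containing it


class UrlTree:
    """Hypothetical prefix tree over URLs for streaming boilerplate removal."""

    def __init__(self, min_docs=3, max_share=0.5):
        self.root = UrlTreeNode()
        self.min_docs = min_docs    # assumed: documents needed before a node is trusted
        self.max_share = max_share  # assumed: blocks in a larger share of docs are template

    @staticmethod
    def _keys(url):
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        # Drop the last segment (the document itself) so sibling pages share a node.
        return [parts.netloc.lower()] + segments[:-1]

    def _nodes_for(self, url):
        # Walk from the root to the most specific prefix, creating nodes as needed.
        node = self.root
        nodes = [node]
        for key in self._keys(url):
            node = node.children.setdefault(key, UrlTreeNode())
            nodes.append(node)
        return nodes

    def extract(self, url, blocks):
        """Update the tree with one document and return its likely content blocks."""
        unique = set(blocks)  # count each block at most once per document
        nodes = self._nodes_for(url)
        for node in nodes:
            node.doc_count += 1
            node.block_counts.update(unique)
        # Use the most specific prefix that has already seen enough documents.
        for node in reversed(nodes):
            if node.doc_count >= self.min_docs:
                return [
                    b for b in blocks
                    if node.block_counts[b] / node.doc_count <= self.max_share
                ]
        # Too little evidence so far: keep everything.
        return list(blocks)
```

Under these assumptions, the first documents from a new source pass through unfiltered, and filtering begins once a prefix has accumulated `min_docs` documents. This mirrors the streaming behaviour described in the abstract only in spirit; the thresholds and the block representation used in the paper may differ.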