Template detection via data mining and its applications
Proceedings of the 11th international conference on World Wide Web
Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Template detection for large scale search engines
Proceedings of the 2006 ACM symposium on Applied computing
A fast and robust method for web page template detection and removal
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Boilerplate detection using shallow text features
Proceedings of the third ACM international conference on Web search and data mining
Hi-index | 0.00 |
The Web represents the largest, and an increasingly growing, source of information. Extracting meaningful content from Web pages presents a challenging problem, already extensively addressed in the offline setting. In this work, we focus on content extraction from streams of HTML documents. We present an infrastructure that converts continuously acquired HTML documents into a stream of plain text documents. The presented pipeline consists of RSS readers for data acquisition from different Web sites, a duplicate removal component, and a novel content extraction algorithm which is efficient, unsupervised, and language-independent. Our content extraction approach is based on the observation that HTML documents from the same source normally share a common template. The core of the proposed content extraction algorithm is a simple data structure called URL Tree. The performance of the algorithm was evaluated in a stream setting on a time-stamped semi-automatically annotated dataset which was made publicly available. We compared the performance of URL Tree with that of several open source content extraction algorithms. The evaluation results show that our stream-based algorithm already starts outperforming the other algorithms after only 10 to 100 documents from a specific domain.