Automatic extraction of informative blocks from webpages

Authors:
Sandip Debnath;Prasenjit Mitra;C. Lee Giles
Affiliations:
The Pennsylvania State University, PA;The Pennsylvania State University, PA;The Pennsylvania State University, PA
Venue:
Proceedings of the 2005 ACM symposium on Applied computing
Year:
2005

Citing 7
Cited 17

Wrapper induction: efficiency and expressiveness
Artificial Intelligence - Special issue on Intelligent internet systems

Template detection via data mining and its applications
Proceedings of the 11th international conference on World Wide Web

Visualizing web site comparisons
Proceedings of the 11th international conference on World Wide Web

Hierarchical Wrapper Induction for Semistructured Information Sources
Autonomous Agents and Multi-Agent Systems

RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases

Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

A fast and robust method for web page template detection and removal
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management

Sampling, information extraction and summarisation of hidden web databases
Data & Knowledge Engineering - Special issue: WIDM 2004

Removing manually generated boilerplate from electronic texts: experiments with project Gutenberg e-books
CASCON '07 Proceedings of the 2007 conference of the center for advanced studies on Collaborative research

Combining content extraction heuristics: the CombinE system
Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services

On Finding Templates on Web Collections
World Wide Web

Bridging the gap: from multi document Template Detection to single document Content Extraction
EuroIMSA '08 Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications

CETR: content extraction via tag ratios
Proceedings of the 19th international conference on World wide web

Clustering template based web documents
ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval

Document structure meets page layout: loopy random fields for web news content extraction
Proceedings of the 10th ACM symposium on Document engineering

DOM based content extraction via text density
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Extending web engineering models and tools for automatic usability validation
Journal of Web Engineering

A heuristic approach for topical information extraction from news pages
WISE'06 Proceedings of the 7th international conference on Web Information Systems

Knowledge discovery in web-directories: finding term-relations to build a business ontology
EC-Web'05 Proceedings of the 6th international conference on E-Commerce and Web Technologies

Identifying content blocks from web documents
ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems

Automated internal web page clustering for improved data extraction
Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics

Effectiveness of template detection on noise reduction and websites summarization
Information Sciences: an International Journal

Metadata Extraction from Books with Facts about Austria
Proceedings of International Conference on Information Integration and Web-based Applications & Services

Abstract

Search engines crawl and index webpages based on their informative content. However, webpages, especially dynamically generated ones, contain items that cannot be classified as "primary content", e.g., navigation sidebars, advertisements, and copyright notices. Most end-users search for the primary content and largely do not seek the non-informative content. A tool that helps an end-user or application search and process information from webpages automatically must therefore separate the "primary content blocks" from the other blocks. In this paper, two new algorithms, ContentExtractor and FeatureExtractor, are proposed. The algorithms identify primary content blocks by i) looking for blocks that do not occur a large number of times across webpages and ii) looking for blocks with desired features, respectively. They identify the primary content blocks with high precision and recall, reduce the storage requirement for search engines, and yield smaller indexes, which in turn lead to faster search times and better user satisfaction. Operating on several thousand webpages obtained from 11 news websites, our algorithms significantly outperform the Entropy-based algorithm proposed by Lin and Ho [7] in both accuracy and run-time.
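
The first idea in the abstract, keeping blocks that do not recur across many webpages, can be illustrated with a minimal sketch. This is not the paper's ContentExtractor: it assumes each page has already been split into blocks of plain text, it compares blocks by exact string match, and the function name `rare_blocks` and the `max_fraction` threshold are hypothetical choices made only for this illustration.

```python
from collections import Counter


def rare_blocks(pages, max_fraction=0.5):
    """Keep, for each page, the blocks that appear in few pages overall.

    pages: list of pages, each page a list of block strings (assumed input).
    max_fraction: hypothetical cutoff; a block is dropped when it occurs
    in more than this fraction of the pages.
    """
    # Document frequency: the number of pages in which each block occurs.
    doc_freq = Counter()
    for blocks in pages:
        doc_freq.update(set(blocks))

    limit = max_fraction * len(pages)
    # Preserve the original block order within each page.
    return [[b for b in blocks if doc_freq[b] <= limit] for blocks in pages]


# Toy data: the navigation bar and copyright notice repeat on every page.
pages = [
    ["Home | News | Sports", "Storm hits coast", "Copyright 2005"],
    ["Home | News | Sports", "Election results announced", "Copyright 2005"],
    ["Home | News | Sports", "Local team wins final", "Copyright 2005"],
]
primary = rare_blocks(pages)
```

On this toy input the repeated navigation and copyright blocks are dropped and only the one-off story block remains on each page. Real pages would need block segmentation and a similarity measure more tolerant than exact matching, neither of which is specified in the abstract.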