Identifying content blocks from web documents

Authors:
Sandip Debnath;Prasenjit Mitra;C. Lee Giles
Affiliations:
Department of Computer Science and Engineering;Department of Computer Science and Engineering;Department of Computer Science and Engineering
Venue:
ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems
Year:
2005

Citing 12

Data model and query evaluation in global information systems. Journal of Intelligent Information Systems - Special issue: networked information discovery and retrieval
Ariadne: a system for constructing mediators for Internet sources. SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
A Web-based information system that reasons with structured collections of text. AGENTS '98 Proceedings of the second international conference on Autonomous agents
Wrapper induction: efficiency and expressiveness. Artificial Intelligence - Special issue on Intelligent internet systems
Template detection via data mining and its applications. Proceedings of the 11th international conference on World Wide Web
Visualizing web site comparisons. Proceedings of the 11th international conference on World Wide Web
Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems
Wrapper Generation via Grammar Induction. ECML '00 Proceedings of the 11th European Conference on Machine Learning
RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Proceedings of the 27th International Conference on Very Large Data Bases
Discovering informative content blocks from Web documents. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Eliminating noisy information in Web pages for data mining. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic extraction of informative blocks from webpages. Proceedings of the 2005 ACM symposium on Applied computing

Cited 8

Web Contents Extracting for Web-Based Learning. ICWL '08 Proceedings of the 7th international conference on Advances in Web Based Learning
Combining content extraction heuristics: the CombinE system. Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
Bridging the gap: from multi document Template Detection to single document Content Extraction. EuroIMSA '08 Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications
CETR: content extraction via tag ratios. Proceedings of the 19th international conference on World wide web
Finding and using the content texts of HTML pages. AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
DOM based content extraction via text density. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Knowledge discovery in web-directories: finding term-relations to build a business ontology. EC-Web'05 Proceedings of the 6th international conference on E-Commerce and Web Technologies
Hybrid model of content extraction. Journal of Computer and System Sciences

Abstract

Intelligent information processing systems, such as digital libraries or search engines, index web-pages according to their informative content. However, web-pages also contain a good deal of non-informative content, e.g., navigation sidebars, advertisements, and copyright notices. It is therefore important to separate the informative “primary content blocks” from these non-informative blocks. In this paper, two algorithms, FeatureExtractor and K-FeatureExtractor, are proposed to identify the “primary content blocks” based on their features. Neither algorithm requires supervised learning, yet both identify the “primary content blocks” with high precision and recall. On several thousand web-pages obtained from 15 different websites, our algorithms significantly outperform the Entropy-based algorithm proposed by Lin and Ho [14] in both precision and run-time.
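
To make the evaluation criteria in the abstract concrete, the following is a minimal sketch of how precision and recall could be computed for block identification on a single page. It does not implement FeatureExtractor or K-FeatureExtractor, whose details the abstract does not give; the function name `precision_recall` and the block identifiers are hypothetical and chosen only for illustration.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of block identifiers.

    predicted: blocks an algorithm labelled as primary content (hypothetical ids)
    actual:    blocks a human judged to be primary content (hypothetical ids)
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: fraction of the predicted blocks that are truly primary content.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of the truly primary blocks that were found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Illustrative data only; these ids are not taken from the paper.
predicted_blocks = {"b1", "b2", "b5"}
actual_blocks = {"b1", "b2", "b3", "b4"}
p, r = precision_recall(predicted_blocks, actual_blocks)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three predicted blocks are correct, and two of the four true content blocks are recovered, so an algorithm can score well on one measure while missing on the other.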