Automatic Identification of Informative Sections of Web Pages

Authors:
Sandip Debnath;Prasenjit Mitra;Nirmal Pal;C. Lee Giles
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Citing 0
Cited 17

Page-level template detection via isotonic smoothing

Proceedings of the 16th international conference on World Wide Web
Site-Independent Template-Block Detection

PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases
A Novel Web Page Analysis Method for Efficient Reasoning of User Preference

APCHI '08 Proceedings of the 8th Asia-Pacific conference on Computer-Human Interaction
A densitometric approach to web page segmentation

Proceedings of the 17th ACM conference on Information and knowledge management
An Informative DOM Subtree Identification Method from Web Pages in Unfamiliar Web Sites

IEICE - Transactions on Information and Systems
A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures

PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Boilerplate detection using shallow text features

Proceedings of the third ACM international conference on Web search and data mining
Web page DOM node characterization and its application to page segmentation

IMSAA'09 Proceedings of the 3rd IEEE international conference on Internet multimedia services architecture and applications
Automatic sitemaps generation: Exploring website structures using block extraction and hyperlink analysis

Expert Systems with Applications: An International Journal
Indexing and querying segmented web pages: the BlockWeb Model

World Wide Web
A proposal for the evaluation of adaptive content retrieval, modification and delivery

Proceedings of the First Workshop on Personalised Multilingual Hypertext Retrieval
Detecting splogs using similarities of splog HTML structures

Proceedings of the 4th International Conference on Ubiquitous Information Management and Communication
Knowledge discovery in web-directories: finding term-relations to build a business ontology

EC-Web'05 Proceedings of the 6th international conference on E-Commerce and Web Technologies
Extracting informative textual parts from web pages containing user-generated content

Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies
A collective user preference management system for U-Commerce

APNOMS'07 Proceedings of the 10th Asia-Pacific conference on Network Operations and Management Symposium: managing next generation networks and services
Automated information extraction from web APIs documentation

WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
A hybrid approach for extracting informative content from web pages

Information Processing and Management: an International Journal

Abstract

Web pages, especially dynamically generated ones, contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, and copyright notices. Most clients and end users search for the primary content and largely do not seek the noninformative content. A tool that helps an end user or application search and process information from Web pages automatically must separate the "primary content sections" from the other content sections. We call these sections "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks; second, it must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into its constituent Web page blocks. We then propose four new algorithms: ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, 2) looking for blocks with desired features, and 3) using classifiers trained with block features, respectively. Operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.
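
The first idea in the abstract, that blocks recurring across many pages are unlikely to be primary content, can be illustrated with a small Python sketch. This is not the authors' ContentExtractor; it is a simplified stand-in that assumes pages are already segmented into text blocks, compares blocks by exact match after whitespace and case normalization, and uses a hypothetical `max_fraction` threshold. The function names and sample data are invented for illustration.

```python
from collections import Counter


def normalize(block: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return " ".join(block.split()).lower()


def primary_blocks(pages: list[list[str]], max_fraction: float = 0.5) -> list[list[str]]:
    """For each page, keep the blocks that occur on at most max_fraction of pages.

    Assumption: exact match after normalization stands in for the block
    similarity measure, which the abstract does not specify.
    """
    # Count each distinct block once per page (a document frequency).
    doc_freq = Counter()
    for blocks in pages:
        doc_freq.update({normalize(b) for b in blocks})

    limit = max_fraction * len(pages)
    return [[b for b in blocks if doc_freq[normalize(b)] <= limit] for blocks in pages]


# Hypothetical sample: three pages sharing a navigation bar and a footer.
pages = [
    ["Home | News | Contact", "Storm closes coastal roads", "Copyright Example Corp"],
    ["Home | News | Contact", "Council approves new budget", "Copyright Example Corp"],
    ["Home | News | Contact", "Local team wins final", "Copyright Example Corp"],
]
result = primary_blocks(pages)
```

With this sample, the navigation bar and footer appear on all three pages and are dropped, while each page's unique story block is kept. A realistic system would need a fuzzier similarity measure, since repeated blocks such as advertisements rarely match character for character.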