Web page DOM node characterization and its application to page segmentation

Authors:
Gujjar Vineel
Affiliations:
Computing and Decision Sciences Lab, GE Research, India
Venue:
IMSAA'09 Proceedings of the 3rd IEEE international conference on Internet multimedia services architecture and applications
Year:
2009

Abstract

Web pages are generally organized in terms of visually distinct segments, such as Navigation bars, Advertisement banners, Headers, Portlets and Widgets. Despite this apparently structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page. In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns yields a high Entropy as per our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify the segments of a given web page.
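
To make the two node measures concrete, the following is a minimal sketch in Python and not the paper's implementation. The abstract does not give the Entropy formula, so `repetition_score` is a hypothetical stand-in: it is one minus the normalized Shannon entropy of a node's child tag names, chosen only because it is high when children repeat, which matches the behaviour the abstract describes. Measuring Content Size in characters, and the names `Node`, `TreeBuilder`, `parse`, `content_size` and `repetition_score`, are likewise assumptions made for illustration.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A simplified DOM node: tag name, child elements, and direct text."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def content_size(node):
    """Characters of text contributed by the node's whole subtree (assumed unit)."""
    return len(node.text) + sum(content_size(c) for c in node.children)


def repetition_score(node):
    """Hypothetical stand-in for the paper's Entropy, NOT its actual formula.

    Returns a value in [0, 1]: 1.0 when all children share one tag,
    0.0 when every child tag is different or there are fewer than two children.
    """
    tags = [c.tag for c in node.children]
    if len(tags) < 2:
        return 0.0
    n = len(tags)
    shannon = -sum((k / n) * math.log2(k / n) for k in Counter(tags).values())
    return 1.0 - shannon / math.log2(n)


if __name__ == "__main__":
    page = "<div><ul><li>a</li><li>bb</li><li>ccc</li></ul><p>hello</p></div>"
    div = parse(page).children[0]
    for node in (div, div.children[0]):
        print(node.tag, content_size(node), repetition_score(node))
```

In this toy page the `ul` node, whose children are all `li` elements, gets the maximum score of 1.0, while the enclosing `div`, whose two children differ, scores 0.0. A segmentation step could then combine such scores with Content Size to pick candidate segment roots, but how the paper does this is not stated in the abstract.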