WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model

Authors:
Hung-Yu Kao;Jan-Ming Ho;Ming-Syan Chen
Affiliations:
-;IEEE;IEEE
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

To increase the commercial value and accessibility of their pages, most content sites publish pages that carry intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining the intrapage informative structure of pages in news Web sites in order to find and eliminate redundant information. An intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained, informative blocks. For a page in a news Web site, the intrapage informative structure contains only anchors linking to news pages or the bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web Intrapage Informative Structure Mining based on the Document Object Model), which applies information theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.
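
The abstract does not give the scoring function, the threshold, or the merging methods, so the following Python is only a rough sketch of the general idea of an entropy-guided, top-down search over a DOM-like tree, and not the WISDOM algorithm itself. Everything in it is an assumption made for illustration: the `Node` class, the use of normalized term entropy across pages as a redundancy signal, the mean-entropy block score, the `threshold` value, and all function names. The merging step is left out because the abstract does not describe it.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag, its own text, and children."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

    def terms(self):
        """All lowercase words in this subtree, in document order."""
        words = self.text.lower().split()
        for child in self.children:
            words.extend(child.terms())
        return words


def term_entropy(pages):
    """Normalized entropy of each term's frequency distribution over pages.

    Assumption: a term spread evenly over all pages (entropy near 1) is
    treated as redundant; a term concentrated in one page (entropy near 0)
    is treated as informative.
    """
    counts = [Counter(page.terms()) for page in pages]
    n = len(pages)
    entropy = {}
    for term in set().union(*counts):
        freqs = [c[term] for c in counts]
        total = sum(freqs)
        h = -sum((f / total) * math.log(f / total) for f in freqs if f)
        entropy[term] = h / math.log(n) if n > 1 else 0.0
    return entropy


def block_score(node, entropy):
    """Mean term entropy of a subtree; blocks without text count as redundant."""
    words = node.terms()
    if not words:
        return 1.0
    return sum(entropy.get(w, 1.0) for w in words) / len(words)


def find_candidates(node, entropy, threshold=0.3):
    """Top-down search: keep a subtree if it scores low enough, else descend."""
    if block_score(node, entropy) <= threshold:
        return [node]
    found = []
    for child in node.children:
        found.extend(find_candidates(child, entropy, threshold))
    return found


# Two toy "pages" that share a navigation block but differ in article text.
page1 = Node("body", children=[
    Node("div", "home news sports"),
    Node("div", "election results announced"),
])
page2 = Node("body", children=[
    Node("div", "home news sports"),
    Node("div", "storm hits coast"),
])

entropy = term_entropy([page1, page2])
candidates = find_candidates(page1, entropy)
print([c.text for c in candidates])  # prints ['election results announced']
```

In this toy setup the shared navigation block scores 1.0 and is skipped, while the page-specific block scores 0 and is selected as a candidate. A fuller version would then expand the candidate set by merging neighboring blocks, which is the part the abstract attributes to its proposed merging methods.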