DOM-based web pages to determine the structure of the similarity algorithm

Authors:
Chunying Kang
Affiliations:
College of Information Science and Technology, Heilongjiang University, Harbin, HeiLongJiang, China
Venue:
IITA'09 Proceedings of the 3rd international conference on Intelligent information technology application
Year:
2009

Abstract

Web data currently exists mainly as HTML pages. HTML, once parsed and rendered by a browser, is suitable for people to read but not for data exchange or processing by a computer. This article parses each web page into a DOM tree and then, starting from the body node as the root, traverses the trees in breadth-first order, comparing the DOM trees of two pages layer by layer and counting the changes at each layer. The changes over all layers are then summed: if the total is below a given threshold, the two pages are judged structurally similar; otherwise they are dissimilar. Because the algorithm considers only the structure of a page and ignores its content, it runs efficiently, and because it is not tied to any specific web page, it is broadly applicable.
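The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not say how the changes at a layer are counted, so the sketch assumes the count is the number of tag names that do not match between the two pages at the same depth (a multiset difference). The names `structural_difference` and `structurally_similar` and the default threshold of 3 are hypothetical, chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements have no closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (tag name and children only)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored on purpose."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse_body(html):
    """Parse HTML and return the <body> node (or the document root if absent)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    queue = [builder.root]
    while queue:
        node = queue.pop(0)
        if node.tag == "body":
            return node
        queue.extend(node.children)
    return builder.root


def level_tags(root):
    """Breadth-first traversal: one Counter of tag names per tree level."""
    levels, current = [], [root]
    while current:
        levels.append(Counter(n.tag for n in current))
        current = [c for n in current for c in n.children]
    return levels


def structural_difference(html_a, html_b):
    """Sum, over all levels, the tags that do not match (assumed measure)."""
    levels_a = level_tags(parse_body(html_a))
    levels_b = level_tags(parse_body(html_b))
    total = 0
    for i in range(max(len(levels_a), len(levels_b))):
        a = levels_a[i] if i < len(levels_a) else Counter()
        b = levels_b[i] if i < len(levels_b) else Counter()
        total += sum(((a - b) + (b - a)).values())
    return total


def structurally_similar(html_a, html_b, threshold=3):
    """Pages are similar when the summed change is below the threshold."""
    return structural_difference(html_a, html_b) < threshold
```

In this sketch two pages with the same markup skeleton but different text give a difference of 0, since text nodes are never stored. A suitable threshold depends on the pages being compared and would have to be tuned.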