Web Information Extraction by HTML Tree Edit Distance Matching

Authors:
Yeonjung Kim;Jeahyun Park;Taehwan Kim;Joongmin Choi
Venue:
ICCIT '07 Proceedings of the 2007 International Conference on Convergence Information Technology
Year:
2007

Citing 0
Cited 7

GAIML: A new language for verbal and graphical interaction in chatbots
Mobile Information Systems - Information Assurance and Advanced Human-Computer Interfaces

RENS --- Enabling a Robot to Identify a Person
ICIRA '09 Proceedings of the 2nd International Conference on Intelligent Robotics and Applications

Tag tree template for Web information and schema extraction
Expert Systems with Applications: An International Journal

Bricolage: example-based retargeting for web design
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems

Intelligent self-repairable web wrappers
AI*IA'11 Proceedings of the 12th international conference on Artificial intelligence around man and beyond

RTED: a robust algorithm for the tree edit distance
Proceedings of the VLDB Endowment

Robust web data extraction: a novel approach based on minimum cost script edit model
WISM'12 Proceedings of the 2012 international conference on Web Information Systems and Mining

Abstract

The main issue in effective Web information extraction is how to recognize similar patterns in a Web page. Pattern matching on the HTML DOM tree has traditionally been shown to be more efficient than simple string matching. Nonetheless, previous tree-based pattern matching methods are limited by their assumption that all HTML tags are equally important: they assign the same weight to every node in the HTML tree.

This paper proposes an enhanced tree matching algorithm that improves the tree edit distance method by taking the characteristics of HTML features into account. Different HTML tree nodes are assigned different values according to their weight in displaying the corresponding data objects in the browser. HTML patterns are matched by computing the maximum mapping value of two HTML trees constructed with weighted node values from HTML data objects. Experiments on several Web commerce sites evaluate the effectiveness of the proposed HTML tree matching algorithm.
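
The idea of a maximum mapping value over weighted HTML trees can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses a recurrence in the style of Simple Tree Matching, with each matched node contributing a per-tag weight instead of a constant 1. The `Node` class, the `TAG_WEIGHTS` table and its values, and the function names are all hypothetical and chosen only for illustration; the abstract does not state the actual weights.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical weights: tags that shape the rendered layout count for more.
# These numbers are assumptions, not values taken from the paper.
TAG_WEIGHTS = {"table": 3.0, "tr": 2.0, "td": 2.0, "ul": 2.0, "li": 2.0,
               "div": 1.5, "img": 1.5, "a": 1.0, "b": 0.5, "i": 0.5}
DEFAULT_WEIGHT = 1.0


@dataclass
class Node:
    tag: str
    children: list[Node] = field(default_factory=list)


def weight(node: Node) -> float:
    return TAG_WEIGHTS.get(node.tag, DEFAULT_WEIGHT)


def weighted_match(a: Node, b: Node) -> float:
    """Maximum weighted, order-preserving mapping value between trees a and b."""
    if a.tag != b.tag:
        return 0.0  # roots with different tags cannot be mapped
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best mapping value between the first i children of a
    # and the first j children of b.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = weighted_match(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return weight(a) + dp[m][n]


def total_weight(node: Node) -> float:
    return weight(node) + sum(total_weight(c) for c in node.children)


def similarity(a: Node, b: Node) -> float:
    """Normalize the mapping value to [0, 1]; 1.0 means identical tag trees."""
    return 2 * weighted_match(a, b) / (total_weight(a) + total_weight(b))


# Two table rows that differ only in a low-weight inline tag.
row1 = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("b")])])
row2 = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("i")])])
print(similarity(row1, row1), similarity(row1, row2))
```

With these assumed weights, a mismatch in an inline tag such as `b` versus `i` lowers the similarity only slightly, whereas a mismatch in a structural tag such as `td` would cost more. That is the kind of distinction a uniform-weight tree matcher cannot make.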