Despite the exponential growth of the WWW and the success of the Semantic Web, there is still limited support for handling the information found on the Web. In this scenario, techniques and tools that support effective information retrieval are becoming increasingly important. In this work, we present a technique for recognizing and comparing the visual structural information of Web pages. The technique is based on a classification of the set of HTML tags, guided by the visual effect of each tag on the overall structure of the page. This classification allows us to translate a Web page into a normalized form in which groups of HTML tags are mapped to a common canonical tag. We also introduce a metric for computing the distance between two pages. A compression process then reduces both the complexity of recognizing similar structures and the processing time needed to compare two Web pages. Finally, we briefly describe a prototype implementation of our tool, along with several examples that demonstrate the feasibility of our approach.
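The pipeline outlined in the abstract (tag classification, normalization, compression, distance) can be illustrated with a minimal Python sketch. Everything specific here is an assumption, since the abstract gives no details: the `CANONICAL` tag classes are hypothetical, compression is approximated by collapsing runs of identical sibling subtrees, and the distance is a Selkow-style top-down tree edit distance swapped in as a stand-in for the paper's metric, which is not specified.

```python
from html.parser import HTMLParser

# Hypothetical classification: tags with a similar visual effect share a label.
CANONICAL = {
    "ul": "list", "ol": "list", "li": "item",
    "table": "table", "tr": "row", "td": "cell", "th": "cell",
    "div": "block", "p": "block", "section": "block",
    "b": "emph", "strong": "emph", "i": "emph", "em": "emph",
}
# Void elements never get a closing tag, so they are stored as leaves.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, label):
        self.label = label
        self.children = []

    def signature(self):
        """Hashable description of the whole subtree."""
        return (self.label, tuple(c.signature() for c in self.children))


class _Builder(HTMLParser):
    """Builds a tree of canonical labels; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]
        self.open_tags = ["root"]

    def handle_starttag(self, tag, attrs):
        node = Node(CANONICAL.get(tag, "other"))
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise close up to the matching tag.
        if tag in self.open_tags[1:]:
            while True:
                self.stack.pop()
                if self.open_tags.pop() == tag:
                    break


def to_tree(html):
    builder = _Builder()
    builder.feed(html)
    builder.close()
    return builder.root


def compress(node):
    """Collapse runs of identical sibling subtrees into one (bottom-up)."""
    kept = []
    for child in node.children:
        compress(child)
        if not kept or kept[-1].signature() != child.signature():
            kept.append(child)
    node.children = kept
    return node


def size(node):
    return 1 + sum(size(c) for c in node.children)


def distance(a, b):
    """Top-down edit distance: relabel costs 1, and inserting or deleting
    a subtree costs its number of nodes."""
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),
                d[i][j - 1] + size(b.children[j - 1]),
                d[i - 1][j - 1] + distance(a.children[i - 1], b.children[j - 1]),
            )
    return (0 if a.label == b.label else 1) + d[m][n]


page_a = "<ul><li>x</li><li>y</li><li>z</li></ul>"
page_b = "<ol><li>a</li><li>b</li></ol>"
print(distance(to_tree(page_a), to_tree(page_b)))                      # 1
print(distance(compress(to_tree(page_a)), compress(to_tree(page_b))))  # 0
```

In this sketch, the two lists differ only in tag name and item count. Normalization removes the first difference, and compression removes the second, so the compressed trees are at distance 0. Compression also shrinks the trees before the quadratic alignment step, which is one plausible reading of the processing-time reduction mentioned in the abstract.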