This paper describes an efficient approach to detecting changes in Web pages that restricts the similarity computations between two versions of a page to nodes with the same HTML tag type. Before the similarities are computed, the HTML page is transformed into an XML-like structure in which each node corresponds to an open-closed HTML tag pair. Analytical expressions and supporting experimental results quantify the improvement over the traditional approach, which computes similarities across all nodes of both pages. The improvement depends strongly on the diversity of tags in the page: the more diverse the page (i.e., the more it mixes text, images, links, etc.), the greater the improvement; the more uniform the page, the smaller the improvement.
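The idea of restricting comparisons to nodes of the same tag type can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `NodeCollector`, `collect_nodes`, `compare_all`, and `compare_same_tag` are hypothetical, the node model (one node per open-closed tag, carrying only its directly enclosed text) is a simplifying assumption, and `difflib.SequenceMatcher` stands in for whatever similarity measure the paper actually uses.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from html.parser import HTMLParser


class NodeCollector(HTMLParser):
    """Collects one (tag, text) node per open-closed HTML tag (assumed node model)."""

    def __init__(self):
        super().__init__()
        self.nodes = []    # finished nodes as (tag, text)
        self._stack = []   # currently open tags as [tag, text_parts]

    def handle_starttag(self, tag, attrs):
        self._stack.append([tag, []])

    def handle_data(self, data):
        # Attach text to the innermost open tag only.
        if self._stack and data.strip():
            self._stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        # Close the innermost open tag with this name; discard unclosed tags above it.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                name, parts = self._stack[i]
                del self._stack[i:]
                self.nodes.append((name, " ".join(parts)))
                break


def collect_nodes(html):
    parser = NodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def similarity(a, b):
    # Stand-in similarity measure; an assumption, not the paper's metric.
    return SequenceMatcher(None, a, b).ratio()


def compare_all(old_nodes, new_nodes):
    """Traditional approach: compare every old node with every new node."""
    return [(o, n, similarity(o[1], n[1])) for o in old_nodes for n in new_nodes]


def compare_same_tag(old_nodes, new_nodes):
    """Restricted approach: compare only nodes that share the same tag type."""
    by_tag = defaultdict(list)
    for node in new_nodes:
        by_tag[node[0]].append(node)
    return [(o, n, similarity(o[1], n[1]))
            for o in old_nodes for n in by_tag.get(o[0], [])]


old_page = ("<html><body><h1>News</h1><p>Rain today</p>"
            "<p>Sun tomorrow</p><a>More</a></body></html>")
new_page = ("<html><body><h1>News</h1><p>Rain today</p>"
            "<p>Snow tomorrow</p><a>More</a></body></html>")

old_nodes, new_nodes = collect_nodes(old_page), collect_nodes(new_page)
print(len(compare_all(old_nodes, new_nodes)))       # 36 comparisons
print(len(compare_same_tag(old_nodes, new_nodes)))  # 8 comparisons
```

In this toy example each version has six nodes spread over five tag types, so the restricted variant performs 8 similarity computations instead of 36. A page consisting only of `<p>` elements would see no reduction, which matches the abstract's point that the gain depends on tag diversity.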