Comparing clustering algorithms for the identification of similar pages in web applications

Authors:
Andrea De Lucia;Michele Risi;Giuseppe Scanniello;Genoveffa Tortora
Affiliations:
Dipartimento di Matematica e Informatica, Università di Salerno, Fisciano, SA, Italy;Dipartimento di Matematica e Informatica, Università di Salerno, Fisciano, SA, Italy;Dipartimento di Matematica e Informatica, Università della Basilicata, Potenza, Italy;Dipartimento di Matematica e Informatica, Università di Salerno, Fisciano, SA, Italy
Venue:
ICWE'07 Proceedings of the 7th international conference on Web engineering
Year:
2007

Abstract

In this paper, we analyze some widely employed clustering algorithms to identify duplicated or cloned pages in web applications. Specifically, we consider an agglomerative hierarchical clustering algorithm, a divisive clustering algorithm, the k-means partitional clustering algorithm, and a partitional competitive clustering algorithm, namely Winner Takes All (WTA). All the clustering algorithms take as input a matrix of the distances between the structures of the web pages. The distance between two pages is computed by applying the Levenshtein edit distance to the strings that encode the sequences of HTML tags of the web pages.
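
The distance computation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (TagCollector, tag_sequence, levenshtein, distance_matrix) are hypothetical, and it assumes that each opening or closing tag counts as one symbol of the encoded string and that text content and attributes are ignored. The abstract does not specify these encoding details.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the sequence of tag names in a page (assumed encoding)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped: only the page structure is kept.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Closing tags are kept as separate symbols (an assumption).
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the list of tag symbols that encodes the structure of a page."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def levenshtein(a, b):
    """Levenshtein edit distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_matrix(pages):
    """Symmetric matrix of pairwise edit distances between page structures."""
    seqs = [tag_sequence(p) for p in pages]
    n = len(seqs)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = levenshtein(seqs[i], seqs[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    pages = [
        "<html><body><p>Hi</p></body></html>",
        "<html><body><p>Other text</p></body></html>",
        "<html><body><div><p>x</p></div></body></html>",
    ]
    for row in distance_matrix(pages):
        print(row)
```

Under these assumptions, two pages that differ only in their text content are at distance zero, which is what makes them candidates for the same cluster. The resulting matrix is the kind of input that the clustering algorithms compared in the paper operate on.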