Comparing clustering algorithms for the identification of similar pages in web applications

  • Authors:
  • Andrea De Lucia; Michele Risi; Giuseppe Scanniello; Genoveffa Tortora

  • Affiliations:
  • Dipartimento di Matematica e Informatica, Università di Salerno, Fisciano, SA, Italy; Dipartimento di Matematica e Informatica, Università di Salerno, Fisciano, SA, Italy; Dipartimento di Matematica e Informatica, Università della Basilicata, Potenza, Italy; Dipartimento di Matematica e Informatica, Università di Salerno, Fisciano, SA, Italy

  • Venue:
  • ICWE'07 Proceedings of the 7th international conference on Web engineering
  • Year:
  • 2007

Abstract

In this paper, we analyze several widely used clustering algorithms for identifying duplicated or cloned pages in web applications. In particular, we consider an agglomerative hierarchical clustering algorithm, a divisive clustering algorithm, the k-means partitional clustering algorithm, and a partitional competitive clustering algorithm, namely Winner Takes All (WTA). All the clustering algorithms take as input a matrix of the distances between the structures of the web pages. The distance between two pages is computed by applying the Levenshtein edit distance to the strings that encode the sequences of HTML tags of the web pages.
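
The distance computation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how tag sequences are encoded as strings, so this sketch assumes each opening or closing tag counts as one symbol and ignores attributes and text. The names `TagSequenceParser`, `tag_sequence`, `levenshtein`, and `distance_matrix` are hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects tag names in document order (assumed encoding: one symbol per tag)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored; only the page structure is kept.
        self.tags.append(tag)

    def handle_endtag(self, tag):
        # Closing tags are kept as distinct symbols (an assumption of this sketch).
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the sequence of HTML tags found in a page."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b):
    """Levenshtein edit distance between two sequences, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_matrix(pages):
    """Pairwise structural distances between pages, as a list of rows."""
    seqs = [tag_sequence(page) for page in pages]
    return [[levenshtein(s, t) for t in seqs] for s in seqs]


if __name__ == "__main__":
    pages = [
        "<html><body><p>Hello</p></body></html>",
        "<html><body><p>World</p></body></html>",
        "<html><body><p>Hi</p><p>there</p></body></html>",
    ]
    for row in distance_matrix(pages):
        print(row)
```

In this sketch, two pages that differ only in their text have distance 0, which is the property that makes such a matrix usable as input for detecting cloned pages. A matrix of this form could then be passed to any of the clustering algorithms named in the abstract.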