An investigation of clustering algorithms in the identification of similar web pages

  • Authors:
  • Andrea De Lucia;Michele Risi;Giuseppe Scanniello;Genoveffa Tortora

  • Affiliations:
  • Dipartimento di Matematica e Informatica, University of Salerno, Italy;Dipartimento di Matematica e Informatica, University of Salerno, Italy;Dipartimento di Matematica e Informatica, University of Basilicata, Italy;Dipartimento di Matematica e Informatica, University of Salerno, Italy

  • Venue:
  • Journal of Web Engineering
  • Year:
  • 2009

Abstract

In this paper we investigate the effect of using clustering algorithms in the reverse engineering field to identify pages that are similar either at the structural level or at the content level. To this end, we have used two instances of a general process that differ only in the measure used to compare web pages. In particular, web pages are compared at the structural level using the Levenshtein edit distance and at the content level using Latent Semantic Indexing. The static pages of two web applications and one static web site have been used to compare the results achieved by the considered clustering algorithms at both the structural and the content level. On these applications we generally achieved comparable results. However, the investigation has also suggested some heuristics to quickly identify the best partition of web pages into clusters among the possible partitions, at both the structural and the content level.
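
As an illustration of the structural comparison mentioned in the abstract, the following minimal sketch computes a Levenshtein edit distance between two pages. The abstract does not say how pages are encoded before comparison, so representing each page as its sequence of HTML tag names, and normalizing by the longer sequence, are assumptions made here; the function names `tag_sequence`, `levenshtein`, and `structural_distance` are hypothetical and not taken from the paper.

```python
import re


def tag_sequence(html):
    """Return the ordered list of tag names in an HTML string.

    Closing tags keep their leading slash. This encoding is an
    assumption for illustration, not the paper's documented one.
    """
    pattern = r"<\s*(/?[a-zA-Z][a-zA-Z0-9]*)"
    return [m.group(1).lower() for m in re.finditer(pattern, html)]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def structural_distance(html_a, html_b):
    """Edit distance between tag sequences, scaled to [0, 1] (assumed normalization)."""
    seq_a, seq_b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(seq_a), len(seq_b))
    return levenshtein(seq_a, seq_b) / longest if longest else 0.0
```

Pairwise values of such a distance could then be collected into a matrix and passed to any clustering algorithm that accepts precomputed distances. The content-level instance of the process would replace this measure with one based on Latent Semantic Indexing.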