Identifying similar pages in Web applications using a competitive clustering algorithm: Special Issue Articles

Authors:
Andrea De Lucia;Giuseppe Scanniello;Genoveffa Tortora
Affiliations:
Dipartimento di Matematica e Informatica, Università di Salerno, Via Ponte don Melillo, 84084 Fisciano (SA), Italy;Dipartimento di Matematica e Informatica, Università della Basilicata, Viale dell'Ateneo, 10 Macchia Romana, 85100 Potenza, Italy;Dipartimento di Matematica e Informatica, Università di Salerno, Via Ponte don Melillo, 84084 Fisciano (SA), Italy
Venue:
Journal of Software Maintenance and Evolution: Research and Practice - Web Site Evolution (WSE 2006)
Year:
2007

Citing: 0
Cited by: 2

A Visual Framework for the Definition and Execution of Reverse Engineering Processes
VISUAL '08 Proceedings of the 10th international conference on Visual Information Systems: Web-Based Visual Information Search and Management

An investigation of clustering algorithms in the identification of similar web pages
Journal of Web Engineering

Abstract

We present an approach based on Winner Takes All (WTA), a competitive clustering algorithm, to support the comprehension of static and dynamic Web applications during Web application reengineering. The approach follows a two-step process: it first computes the distance between Web pages and then uses the clustering algorithm to identify and group similar pages. We present an instance of this process that identifies similar pages at the structural level. The structure of each page is encoded as a string of HTML tags, and the structural distance between two pages is computed as the Levenshtein edit distance between their strings. We have implemented a prototype that automates the clustering process and that can be extended to other instances of the process, such as the identification of groups of similar pages at the content level. The approach and the tool have been evaluated in two case studies. The results show that the WTA clustering algorithm suggests heuristics for easily identifying the best partition of Web pages into clusters among the possible partitions. Copyright © 2007 John Wiley & Sons, Ltd.
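
The two-step process described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' prototype. The tag encoding (opening and closing tag names in document order) is one plausible reading of "a string of HTML tags". The abstract does not say how pages are presented to the WTA algorithm, so representing each page by its row of the distance matrix is an assumption, as are the names `tag_sequence`, `distance_matrix`, and `wta_cluster`, the seeding of prototypes with the first `k` pages, and the `epochs` and `rate` parameters.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening and closing tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Encode a page's structure as a list of tag names (text is ignored)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def distance_matrix(pages):
    """Pairwise structural distances between pages (symmetric matrix)."""
    seqs = [tag_sequence(p) for p in pages]
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = levenshtein(seqs[i], seqs[j])
    return dist


def nearest(prototypes, v):
    """Index of the prototype closest to v (squared Euclidean distance)."""
    def sq(p):
        return sum((a - b) ** 2 for a, b in zip(p, v))
    return min(range(len(prototypes)), key=lambda i: sq(prototypes[i]))


def wta_cluster(vectors, k, epochs=20, rate=0.5):
    """Winner-takes-all: only the closest prototype moves toward each input."""
    # Assumption: prototypes are seeded with the first k input vectors.
    prototypes = [list(map(float, vectors[i])) for i in range(k)]
    for _ in range(epochs):
        for v in vectors:
            w = nearest(prototypes, v)
            prototypes[w] = [p + rate * (x - p) for p, x in zip(prototypes[w], v)]
    return [nearest(prototypes, v) for v in vectors]


# Hypothetical pages: two table-based layouts and two list-based layouts.
pages = [
    "<html><body><table><tr><td>a</td></tr></table></body></html>",
    "<html><body><ul><li>x</li><li>y</li></ul></body></html>",
    "<html><body><table><tr><td>b</td><td>c</td></tr></table></body></html>",
    "<html><body><ul><li>z</li></ul></body></html>",
]
dist = distance_matrix(pages)
# Assumption: each page is described by its row of distances to all pages.
labels = wta_cluster(dist, k=2)
```

With these four pages, `dist[0][2]` is 2 (one extra `td` element), `dist[0][1]` is 6, and `labels` groups the table-based pages together and the list-based pages together. Because text content is ignored, pages that differ only in wording have a structural distance of 0.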