An investigation of clustering algorithms in the identification of similar web pages

Authors:
Andrea De Lucia;Michele Risi;Giuseppe Scanniello;Genoveffa Tortora
Affiliations:
Dipartimento di Matematica e Informatica, University of Salerno, Italy;Dipartimento di Matematica e Informatica, University of Salerno, Italy;Dipartimento di Matematica e Informatica, University of Basilicata, Italy;Dipartimento di Matematica e Informatica, University of Salerno, Italy
Venue:
Journal of Web Engineering
Year:
2009

Abstract

In this paper we investigate the effect of using clustering algorithms in the reverse engineering field to identify pages that are similar either at the structural level or at the content level. To this end, we have used two instances of a general process that differ only in the measure used to compare web pages. In particular, web pages are compared at the structural level by using the Levenshtein edit distance and at the content level by using Latent Semantic Indexing. The static pages of two web applications and one static web site have been used to compare the results achieved by the considered clustering algorithms at both the structural and the content level. On these applications we generally achieved comparable results. However, the investigation has also suggested some heuristics to quickly identify the best partition of web pages into clusters among the possible partitions, at both the structural and the content level.
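
As an illustration of the structural comparison mentioned in the abstract, the following is a minimal Python sketch of a Levenshtein edit distance computed over page structure. The abstract does not say how page structure is encoded, so representing a page as its sequence of opening tag names, normalizing by the longer sequence, and the names `tag_sequence` and `structural_distance` are all assumptions made for this example, not the authors' implementation.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening tag names in document order.

    Assumption: a page's structure is approximated by this tag sequence.
    """

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b):
    """Edit distance between two sequences (unit-cost insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def structural_distance(html_a, html_b):
    """Edit distance between tag sequences, scaled to [0, 1] (assumed normalization)."""
    seq_a, seq_b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 0.0
    return levenshtein(seq_a, seq_b) / longest


page_a = "<html><body><p>x</p></body></html>"
page_b = "<html><body><p>x</p><p>y</p></body></html>"
print(structural_distance(page_a, page_b))
```

A matrix of such pairwise distances could then be passed to a clustering algorithm. For the content-level instance of the process, the same pipeline would use a similarity derived from Latent Semantic Indexing in place of `structural_distance`.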