Improving Web site understanding with keyword-based clustering

Authors:
Filippo Ricca;Emanuele Pianta;Paolo Tonella;Christian Girardi
Affiliations:
Unità CINI at DISI, 16146 Genova, Italy;ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, 38050 Povo (Trento), Italy;ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, 38050 Povo (Trento), Italy;ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, 38050 Povo (Trento), Italy
Venue:
Journal of Software Maintenance and Evolution: Research and Practice
Year:
2008



Abstract

Web applications are becoming increasingly complex and difficult to maintain. To satisfy customers' demands, they need to be updated often and quickly. In the maintenance phase, Web site understanding is a central activity: programmers spend considerable time and effort comprehending the internal structure of the Web site. This activity is often required because the available documentation is not aligned with the implementation, or is missing altogether. Reverse engineering techniques have the potential to support Web site understanding by providing views that show the organization of a site and its navigational structure. However, representing each Web page as a node in a diagram recovered from the source code of the Web site often leads to huge and unreadable graphs. Moreover, since the level of connectivity is typically high, the edges in such graphs make the overall result even less usable. In this paper, we propose an approach to Web site understanding based on clustering client-side HTML pages with similar content. The approach is better suited to content-oriented sites than to application-oriented ones, and it uses a crawler to download the pages of the target Web site. The presence of common keywords is exploited to decide when it is appropriate to group pages together. An experimental study of 17 Web sites validates our approach and shows that the automatically produced clusters are close to those that a human would produce for a given Web site. Copyright © 2007 John Wiley & Sons, Ltd.
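
The general idea of grouping pages by shared keywords can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify how keywords are extracted, which similarity measure is used, or which clustering algorithm is applied. The sketch assumes the pages have already been downloaded, takes the most frequent non-stop-words of each page as its keywords, compares pages with Jaccard similarity, and merges any two pages whose similarity reaches a threshold (single-link grouping). The names `extract_keywords`, `jaccard`, `cluster_pages`, the stop-word list, and the default `threshold` and `top_n` values are all hypothetical choices made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stop-word list (an assumption; a real list would be larger).
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "our", "are", "you"}


def extract_keywords(html, top_n=5):
    """Return the top_n most frequent non-stop-words of a page as a set."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping, not a real HTML parser
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return {word for word, _ in counts.most_common(top_n)}


def jaccard(a, b):
    """Jaccard similarity of two keyword sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.3, top_n=5):
    """Group pages whose keyword sets are similar (single-link grouping).

    pages maps a page identifier (e.g. its URL path) to its HTML text.
    Returns a sorted list of clusters, each a sorted list of identifiers.
    """
    keywords = {url: extract_keywords(html, top_n) for url, html in pages.items()}
    parent = {url: url for url in pages}  # union-find forest

    def find(url):
        while parent[url] != url:
            parent[url] = parent[parent[url]]  # path compression
            url = parent[url]
        return url

    # Merge every pair of pages whose keyword overlap reaches the threshold.
    for u, v in combinations(pages, 2):
        if jaccard(keywords[u], keywords[v]) >= threshold:
            parent[find(u)] = find(v)

    clusters = {}
    for url in pages:
        clusters.setdefault(find(url), []).append(url)
    return sorted(sorted(members) for members in clusters.values())
```

Each resulting cluster could then be drawn as a single node in a site diagram, which is the kind of graph reduction the abstract motivates. Single-link grouping is chosen here only because it is short to write; it can chain loosely related pages together, so the threshold would need tuning on a real site.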