Improving Web site understanding with keyword-based clustering

  • Authors:
  • Filippo Ricca; Emanuele Pianta; Paolo Tonella; Christian Girardi

  • Affiliations:
  • Unità CINI at DISI, 16146 Genova, Italy; ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, 38050 Povo (Trento), Italy; ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, 38050 Povo (Trento), Italy; ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, 38050 Povo (Trento), Italy

  • Venue:
  • Journal of Software Maintenance and Evolution: Research and Practice
  • Year:
  • 2008

Abstract

Web applications are becoming increasingly complex and difficult to maintain. To satisfy customers' demands, they need to be updated often and quickly. In the maintenance phase, Web site understanding is a central activity: programmers spend considerable time and effort comprehending the internal structure of the Web site. Such activity is often required because the available documentation is not aligned with the implementation, or is missing altogether. Reverse engineering techniques have the potential to support Web site understanding by providing views that show the organization of a site and its navigational structure. However, representing each Web page as a node in a diagram recovered from the source code of the Web site often leads to huge and unreadable graphs. Moreover, since the level of connectivity is typically high, the edges in such graphs make the overall result even less usable. In this paper, we propose an approach to Web site understanding based on clustering client-side HTML pages with similar content. The approach uses a crawler to download the Web pages of the target Web site, and it works well with content-oriented sites rather than application-oriented ones. The presence of common keywords is exploited to decide when it is appropriate to group pages together. An experimental study on 17 Web sites validates our approach and shows that the automatically produced clusters are close to those that a human would produce for a given Web site. Copyright © 2007 John Wiley & Sons, Ltd.
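
The idea of grouping pages that share keywords can be illustrated with a minimal sketch. The abstract does not specify the keyword extraction or the clustering algorithm, so everything below is an assumption made for illustration only: keywords are taken to be the non-stopword terms of the page text, similarity is the Jaccard index of two keyword sets, and pages are merged single-link style whenever their similarity reaches a threshold. The names `extract_keywords`, `jaccard`, `cluster_pages`, the stopword list, and the threshold value are all hypothetical and are not taken from the paper.

```python
import re

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "our", "the", "to"}


def extract_keywords(text):
    """Return the set of lowercase non-stopword terms longer than two letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(pages, threshold=0.5):
    """Group page URLs whose keyword sets are similar (single-link merging).

    `pages` maps a URL to the already-downloaded text of that page.
    Two pages land in the same cluster if they are connected by a chain of
    pairs whose Jaccard similarity is at least `threshold`.
    """
    keywords = {url: extract_keywords(text) for url, text in pages.items()}
    urls = sorted(keywords)
    parent = {u: u for u in urls}  # union-find forest over URLs

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for i, u in enumerate(urls):
        for v in urls[i + 1:]:
            if jaccard(keywords[u], keywords[v]) >= threshold:
                parent[find(v)] = find(u)

    clusters = {}
    for u in urls:
        clusters.setdefault(find(u), []).append(u)
    return sorted(clusters.values())
```

In such a sketch each resulting cluster could then be drawn as a single node of the site diagram, which is the kind of size reduction the abstract motivates. The threshold controls how aggressively pages are merged and would need tuning per site.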