Web applications are becoming increasingly complex and difficult to maintain, yet they must be updated often and quickly to satisfy customer demands. Understanding the Web site is a central maintenance activity: programmers spend considerable time and effort comprehending the internal structure of a site, often because the available documentation is out of step with the implementation or missing altogether. Reverse engineering techniques can support Web site understanding by providing views that show the organization of a site and its navigational structure. However, a diagram recovered from the source code that represents each Web page as a node is often huge and unreadable, and because connectivity between pages is typically high, the edges make such graphs even less usable. In this paper, we propose an approach to Web site understanding based on clustering client-side HTML pages with similar content. The approach is better suited to content-oriented sites than to application-oriented ones, and it uses a crawler to download the pages of the target Web site. The presence of common keywords is used to decide when pages should be grouped together. An experiment on 17 Web sites validates the approach and shows that the automatically produced clusters are close to those a human would produce for the same site. Copyright © 2007 John Wiley & Sons, Ltd.
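The idea of grouping pages by common keywords can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the keyword extraction, or the clustering algorithm, so everything below is an assumption chosen for illustration: keywords are lowercase words outside a small hypothetical stop-word list, similarity is the Jaccard coefficient, and pages are merged single-link style whenever their similarity reaches a hypothetical `threshold`. The pages are assumed to be already downloaded (the crawler is not shown), and all function names are invented for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def keywords(html):
    """Return the set of candidate keywords of one HTML page (assumed heuristic)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    return {w for w in words if len(w) > 2 and w not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard coefficient of two keyword sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.3):
    """Group page names whose keyword sets are similar (single-link merging)."""
    names = sorted(pages)
    kw = {name: keywords(pages[name]) for name in names}
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(kw[a], kw[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


if __name__ == "__main__":
    demo = {
        "hotel1.html": "<html><body><h1>Hotel rooms</h1><p>Book hotel rooms online</p></body></html>",
        "hotel2.html": "<html><body><p>Hotel rooms booking online</p></body></html>",
        "news.html": "<html><body><script>var hotel=1;</script><p>Latest football results</p></body></html>",
    }
    for group in cluster_pages(demo):
        print(group)
```

In this toy input the two hotel pages share three of five distinct keywords and end up in one cluster, while the news page stays alone. Single-link merging is only one possible choice and tends to chain loosely related pages together, so the threshold would need tuning per site.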