Automatic sitemaps generation: Exploring website structures using block extraction and hyperlink analysis

  • Authors:
  • Shian-Hua Lin;Kuan-Pak Chu;Chun-Ming Chiu

  • Affiliations:
  • Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan;Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan;Department of Computer Science and Information Engineering, National Chi Nan University, Taiwan

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2011

Quantified Score

Hi-index 12.05

Abstract

Sitemaps designed by webmasters not only present the main usage flows for users but also organize the hierarchical concepts of a website. However, websites seldom provide sitemap pages that help users browse pages easily. Even when sitemaps are provided, they are usually not intended for machine understanding, although a few websites provide sitemaps in XML format. In this paper, we develop a system, SiteMap Generator (SMG), that automatically generates a hierarchical sitemap for a website. SMG consists of five components. Sequence Translator translates a page's HTML source into a long sequence, and Page Partitioner then splits the page into blocks by analyzing the sequence complexity. Block Identifier categorizes each block into one of three block types: content, structure, or redundant. Using the popular sequence search tool BLAST, Block Cluster calculates similarities between blocks so that blocks with similar functionality are grouped and considered as candidate blocks for the sitemap. Finally, Hyperlink Analyzer transforms page-to-page links into block-to-block links and applies Kleinberg's HITS algorithm to estimate the authority and hub values of each block. A block entropy value derived from feature entropies is also used to improve HITS. Experiments on three websites (Mozilla, CNN, and Yahoo! News) show that SMG is effective at partitioning a page into blocks (F1=86%), identifying block types (F1=85%), and generating the sitemap for a website (F1=63%).
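
The final step described in the abstract (turning page-to-page links into block-to-block links and scoring blocks with HITS) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `to_block_graph` and `hits`, the block identifiers, and the toy site are all hypothetical. The abstract does not say how a hyperlink is mapped to a target block, so the sketch assumes that a link found in a block points to every block of the target page. The entropy-based refinement is omitted.

```python
import math


def to_block_graph(blocks_on_page, page_links_in_block):
    """Turn block -> page links into block -> block links.

    Assumption (not from the paper): a link to a page is treated as a
    link to every block on that page.
    """
    graph = {}
    for src_block, target_pages in page_links_in_block.items():
        targets = set()
        for page in target_pages:
            targets.update(blocks_on_page.get(page, []))
        graph[src_block] = targets
    return graph


def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(graph, iterations=50):
    """Plain HITS: returns (authority, hub) dictionaries keyed by node."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a block: sum of hub scores of blocks linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        auth = _normalize(new_auth)
        # Hub of a block: sum of authority scores of the blocks it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_hub[src] += auth[dst]
        hub = _normalize(new_hub)
    return auth, hub


# Hypothetical toy site: three pages, five blocks.
blocks_on_page = {
    "home": ["home/nav", "home/body"],
    "news": ["news/nav", "news/body"],
    "about": ["about/body"],
}
page_links_in_block = {
    "home/nav": ["news", "about"],
    "news/nav": ["home", "about"],
    "home/body": ["news"],
}

block_graph = to_block_graph(blocks_on_page, page_links_in_block)
auth, hub = hits(block_graph)
best_hub = max(hub, key=hub.get)
print("strongest hub block:", best_hub)
```

In this toy graph the navigation block of the home page receives the highest hub score, which matches the intuition that link-rich structure blocks are natural sitemap candidates. Blocks without outgoing links, such as `about/body`, get a hub score of zero.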