MenuMiner: revealing the information architecture of large web sites by analyzing maximal cliques

  • Authors: Matthias Keller; Martin Nussbaumer
  • Affiliations: Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany (both authors)
  • Venue: Proceedings of the 21st International Conference Companion on World Wide Web
  • Year: 2012

Abstract

The foundation of almost all web sites' information architecture is a hierarchical content organization. Information architects therefore put much effort into designing taxonomies that structure the content in a comprehensible and sound way. These taxonomies are obvious to human users from the site's system of main menus and submenus. Current methods of web structure mining, however, are not able to extract these central aspects of the information architecture, because they cannot interpret the visual encoding to recognize menus and their rank as humans do. In this paper we show that a web site's main navigation system can be distinguished not only by visual features but also by certain structural characteristics of the HTML tree and the web graph. We have developed a reliable and scalable solution to the problem of extracting menus for mining the information architecture. The novel MenuMiner algorithm makes it possible to retrieve the original content organization of large-scale web sites. These data are valuable for many applications, e.g. the presentation of search results. In an experiment we applied the method to find site boundaries within a large domain. The evaluation showed that the method reliably delivers menus and site boundaries where other current approaches fail.
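
The abstract does not describe the algorithm itself; only the title indicates that maximal cliques in the web graph are analyzed. The following Python sketch is therefore an illustration of the general idea under stated assumptions, not the authors' MenuMiner implementation. It assumes that pages sharing a menu link to each other mutually, so that a menu shows up as a clique of mutually linked pages. The function names (`mutual_link_graph`, `maximal_cliques`, `menu_candidates`), the `min_size` threshold, and the sample site are all hypothetical. Clique enumeration uses the standard Bron-Kerbosch algorithm with pivoting.

```python
from typing import Dict, FrozenSet, List, Set


def mutual_link_graph(links: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build an undirected graph that keeps only pairs of pages linking to each other."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    graph: Dict[str, Set[str]] = {page: set() for page in pages}
    for src, targets in links.items():
        for dst in targets:
            if src != dst and src in links.get(dst, set()):
                graph[src].add(dst)
                graph[dst].add(src)
    return graph


def maximal_cliques(graph: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Enumerate all maximal cliques with Bron-Kerbosch and pivoting."""
    cliques: List[FrozenSet[str]] = []

    def expand(r: Set[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended any further
            return
        # Pivot on the vertex with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def menu_candidates(links: Dict[str, Set[str]], min_size: int = 3) -> List[FrozenSet[str]]:
    """Return maximal cliques of mutually linked pages, largest first.

    Assumption: cliques smaller than min_size are too small to be a menu.
    """
    found = [c for c in maximal_cliques(mutual_link_graph(links)) if len(c) >= min_size]
    return sorted(found, key=len, reverse=True)


# Hypothetical site: four main-menu pages link to each other; a detail page
# carries the menu too but is only linked back from its parent page.
example_links = {
    "/home": {"/products", "/about", "/contact"},
    "/products": {"/home", "/about", "/contact", "/products/widget"},
    "/about": {"/home", "/products", "/contact"},
    "/contact": {"/home", "/products", "/about"},
    "/products/widget": {"/home", "/products", "/about", "/contact"},
}
candidates = menu_candidates(example_links)
```

In this toy site the only candidate of size three or more is the set of four main-menu pages; the detail page forms just a two-page clique with its parent and is filtered out. A real site would need further signals, such as the position of the links in the HTML tree mentioned in the abstract, to separate menus from other densely linked page groups.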