Mining taxonomies from web menus: rule-based concepts and algorithms

  • Authors:
  • Matthias Keller; Hannes Hartenstein

  • Affiliations:
  • Steinbuch Centre for Computing, Karlsruhe Institute of Technology, Karlsruhe, Germany; Steinbuch Centre for Computing, Karlsruhe Institute of Technology, Karlsruhe, Germany

  • Venue:
  • ICWE'13 Proceedings of the 13th international conference on Web Engineering
  • Year:
  • 2013

Abstract

The logical hierarchies of Web sites (i.e., Web site taxonomies) are obvious to humans, because humans can distinguish the different menu levels and their relationships. Such accurate information about the logical structure is, however, not yet available to machines. Many applications would benefit if Web site taxonomies could be mined from menus, but in the past this was an almost unsolvable problem. Today, a tag newly introduced in HTML5 and novel mining methods make it possible to distinguish menus from other content, but how the underlying taxonomies can be extracted from the menus has not yet been researched. In this paper we present the first detailed analysis of the problem and introduce rule-based concepts for addressing each identified subproblem. We report on a large-scale study on mining the hierarchical menus of 350 randomly selected domains. Our methods make it possible to extract, with high precision and high recall, Web site taxonomy information that was not previously available.
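
To make the task concrete, the following is a minimal sketch of reading a nested menu into a tree. It is not the rule-based method of the paper. It assumes that the HTML5 tag mentioned in the abstract is `nav` and that menu levels are encoded as nested `li` items, which real sites frequently violate. The names `NavMenuParser`, `extract_menu_tree`, and `SAMPLE` are hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser


class NavMenuParser(HTMLParser):
    """Collects link labels of nested <li> items inside <nav> as a tree.

    Illustrative assumption: each menu level is a nested list within <nav>.
    """

    def __init__(self):
        super().__init__()
        self.root = {"label": None, "children": []}
        self._stack = [self.root]  # ancestor chain of the current menu item
        self._nav_depth = 0        # greater than 0 while inside a <nav>
        self._in_link = False      # True while inside an <a> within <nav>

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self._nav_depth += 1
        elif self._nav_depth and tag == "li":
            # A new menu item becomes a child of the enclosing item.
            node = {"label": "", "children": []}
            self._stack[-1]["children"].append(node)
            self._stack.append(node)
        elif self._nav_depth and tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "nav" and self._nav_depth:
            self._nav_depth -= 1
        elif self._nav_depth and tag == "li" and len(self._stack) > 1:
            self._stack.pop()
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Only link text inside a menu item contributes to its label.
        if self._nav_depth and self._in_link and len(self._stack) > 1:
            self._stack[-1]["label"] += data.strip()


def extract_menu_tree(html):
    """Return the top-level menu items found inside <nav> elements."""
    parser = NavMenuParser()
    parser.feed(html)
    parser.close()
    return parser.root["children"]


# Hypothetical sample page: one two-level menu plus a list that is not a menu.
SAMPLE = """
<nav><ul>
  <li><a href="/products">Products</a>
    <ul>
      <li><a href="/products/phones">Phones</a></li>
      <li><a href="/products/tablets">Tablets</a></li>
    </ul>
  </li>
  <li><a href="/about">About</a></li>
</ul></nav>
<ul><li><a href="/x">Not a menu</a></li></ul>
"""

tree = extract_menu_tree(SAMPLE)
```

Here `tree` holds two top-level items, the first of which has two children, while the list outside `nav` is ignored. The sketch covers only the easy case in which the markup nesting mirrors the logical hierarchy. Menus whose levels are spread over separate page regions or separate pages are beyond what it handles.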