Finding a Web Community by Maximum Flow Algorithm with HITS Score Based Capacity

  • Authors:
  • Noriko Imafuji;Masaru Kitsuregawa

  • Venue:
  • DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
  • Year:
  • 2003

Abstract

In this paper, we propose an edge capacity based on hub and authority scores, and examine the effect of using this edge capacity in the method for extracting web communities with a maximum flow algorithm proposed by G. Flake et al. A web community is a collection of web pages that share a common (or related) topic. In recent years, various methods for finding web communities have been proposed. G. Flake et al.'s method, which is based on a maximum flow algorithm, has a major advantage: "topic drift" does not easily occur. On the other hand, it sets the edge capacity to a fixed value for every edge, which is one of the major causes of failing to obtain a proper web community. Our approach, which uses a HITS score based edge capacity, effectively extracts web pages that are well balanced in both their global and local relations to the given seed node. We examined the effect through experiments on 20 randomly selected topics using web archives in Japan crawled in 2002. The results confirmed that the average precision rose by approximately 20%.
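
The general idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how capacities are derived from the scores, so the formula `1 + hub[u] + auth[v]`, the unit-capacity edges to a virtual sink, the treatment of links as undirected for flow, and all function names below are assumptions made for illustration only.

```python
from collections import defaultdict, deque

SOURCE, SINK = "_source", "_sink"  # virtual nodes; assumed not to clash with page ids


def hits(edges, iters=50):
    """Iterative HITS on directed (u, v) edges; scores are scaled so the max is 1."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]          # good authorities are linked from good hubs
        top = max(auth.values()) or 1.0
        auth = {n: a / top for n, a in auth.items()}
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]          # good hubs link to good authorities
        top = max(hub.values()) or 1.0
        hub = {n: h / top for n, h in hub.items()}
    return hub, auth


def source_side_of_min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns the nodes still reachable from s afterwards."""
    res = defaultdict(lambda: defaultdict(float))
    for u in cap:
        for v, c in cap[u].items():
            res[u][v] += c
            res[v][u] += 0.0           # make sure the reverse residual edge exists
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:                   # BFS over edges with remaining capacity
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)         # no augmenting path left: this is the cut
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= bottleneck
            res[v][u] += bottleneck


def find_community(edges, seeds):
    """Pages on the seed side of the minimum cut, with score-based capacities."""
    hub, auth = hits(edges)
    cap = defaultdict(lambda: defaultdict(float))
    for u, v in edges:
        c = 1.0 + hub[u] + auth[v]     # hypothetical capacity, not from the paper
        cap[u][v] += c
        cap[v][u] += c                 # assumption: links carry flow both ways
    for node in hub:
        cap[node][SINK] += 1.0         # assumption: unit edge to the virtual sink
    for seed in seeds:
        cap[SOURCE][seed] = float("inf")
    return source_side_of_min_cut(cap, SOURCE, SINK) - {SOURCE}
```

Replacing the capacity line with a constant gives a fixed-capacity baseline of the kind the abstract contrasts against, which makes it easy to compare the two settings on the same link graph.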