Modeling the web as a hypergraph to compute page reputation

  • Authors and affiliations:
  • Klessius Berlt (Department of Computer Science, Federal University of Amazonas, Manaus, Brazil)
  • Edleno Silva de Moura (Department of Computer Science, Federal University of Amazonas, Manaus, Brazil)
  • André Carvalho (Department of Computer Science, Federal University of Amazonas, Manaus, Brazil)
  • Marco Cristo (FUCAPI, Analysis, Research and Technological Innovation Center, Manaus, Brazil)
  • Nivio Ziviani (Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil)
  • Thierson Couto (Institute of Informatics, Federal University of Goiás, Goiânia, Brazil)

  • Venue:
  • Information Systems
  • Year:
  • 2010

Abstract

In this work we propose a model that represents the web as a directed hypergraph (instead of a graph), where links connect pairs of disjoint sets of pages. The web hypergraph is derived from the web graph by dividing the set of pages into non-overlapping blocks and using the links between pages of distinct blocks to create hyperarcs. A hyperarc connects a block of pages to a single page, in order to provide more reliable information for link analysis. We use the hypergraph model to create the hypergraph versions of the Pagerank and Indegree algorithms, referred to as HyperPagerank and HyperIndegree, respectively. The hypergraph is derived from the web graph by grouping pages according to two different partition criteria: pages that belong to the same web host, or pages that belong to the same web domain. We compared the original page-based algorithms with their host-based and domain-based versions, considering a combination of page reputation, the textual content of the pages, and the anchor text. Experimental results using three distinct web collections show that the HyperPagerank and HyperIndegree algorithms may yield better results than the original graph versions of the Pagerank and Indegree algorithms. We also show that the hypergraph versions of the algorithms were slightly less affected by noisy links and spamming.
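To make the construction concrete, below is a minimal Python sketch of one plausible reading of the model described in the abstract: pages are grouped into blocks by web host, intra-block links are discarded, all page-level links from one block to the same target page collapse into a single hyperarc, HyperIndegree counts the distinct source blocks of a page, and a simplified HyperPagerank-style power iteration propagates a block's aggregate score along its outgoing hyperarcs. All URLs are illustrative, and the exact rank-propagation rule is an assumption for illustration, not the paper's implementation.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Toy page-level web graph: directed links (source URL -> target URL).
# These URLs are hypothetical, chosen only to exercise the construction.
links = [
    ("http://a.example.com/1", "http://news.example.org/x"),
    ("http://a.example.com/2", "http://news.example.org/x"),
    ("http://b.example.net/1", "http://news.example.org/x"),
    ("http://news.example.org/x", "http://a.example.com/1"),
]

def block_of(url):
    """Partition criterion: group pages by web host (the URL's netloc)."""
    return urlparse(url).netloc

# Build hyperarcs: a hyperarc connects a block of pages to a single page.
# Intra-block links are dropped, and multiple links from pages of one
# block to the same target collapse into a single hyperarc.
pages = set()
hyperarcs = set()
for src, dst in links:
    pages.update((src, dst))
    if block_of(src) != block_of(dst):
        hyperarcs.add((block_of(src), dst))

# HyperIndegree: number of distinct blocks with a hyperarc to the page.
hyper_indegree = defaultdict(int)
for _, dst in hyperarcs:
    hyper_indegree[dst] += 1

# HyperPagerank-style power iteration (simplified assumption): a block's
# score is the sum of its pages' scores, split evenly among its outgoing
# hyperarcs. Dangling blocks (no outgoing hyperarcs) would leak mass in
# general; this toy graph has none.
DAMPING, ITERATIONS = 0.85, 20
block_pages = defaultdict(set)
for p in pages:
    block_pages[block_of(p)].add(p)
out_arcs = defaultdict(list)
for blk, dst in hyperarcs:
    out_arcs[blk].append(dst)

rank = {p: 1.0 / len(pages) for p in pages}
for _ in range(ITERATIONS):
    block_rank = {blk: sum(rank[p] for p in ps) for blk, ps in block_pages.items()}
    new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
    for blk, targets in out_arcs.items():
        share = block_rank[blk] / len(targets)
        for dst in targets:
            new_rank[dst] += DAMPING * share
    rank = new_rank

print(sorted(hyper_indegree.items()))
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

Note how the two page-level links from host a.example.com to the news page contribute a single hyperarc, which is the intuition behind the abstract's claim of robustness to noisy links and spamming: many links from one block count only once.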