Reliable evaluations of URL normalization

  • Authors: Sung Jin Kim; Hyo Sook Jeong; Sang Ho Lee

  • Affiliations: School of Computer Science and Engineering, Seoul National University, Seoul, Korea; School of Computing, Soongsil University, Seoul, Korea; School of Computing, Soongsil University, Seoul, Korea

  • Venue: ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
  • Year: 2006

Abstract

URL normalization is the process of transforming URL strings into a canonical form. This process can significantly reduce the number of duplicate URL representations of the same web page. A number of normalization methods exist. In this paper, we describe four metrics for evaluating normalization methods. Our evaluation also considers the reliability and consistency of a URL. Using the proposed metrics, we evaluate seven normalization methods and report the results on over 25 million URLs extracted from the web.
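
To make the idea of a canonical form concrete, the following is a minimal sketch of URL normalization in Python. It is not one of the seven methods evaluated in the paper, which the abstract does not enumerate; it assumes a few common syntax-based steps in the spirit of RFC 3986 (lowercasing the scheme and host, dropping the default port, removing dot segments, and discarding the fragment). The names `normalize_url`, `remove_dot_segments`, and `DEFAULT_PORTS` are illustrative only.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only http and https default ports are considered.
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments in an absolute path."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            # Never pop the leading empty segment that represents the root.
            if len(out) > 1:
                out.pop()
            continue
        out.append(seg)
    # A trailing "." or ".." refers to a directory, so keep a trailing slash.
    if segments[-1] in (".", ".."):
        out.append("")
    return "/".join(out)


def normalize_url(url: str) -> str:
    """Return an illustrative canonical form of an http(s) URL.

    Simplifications (assumptions of this sketch): userinfo is dropped,
    IPv6 literals are not handled, and the fragment is discarded.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # An empty path is treated as the root path.
    path = remove_dot_segments(parts.path or "/")
    return urlunsplit((scheme, netloc, path, parts.query, ""))


if __name__ == "__main__":
    urls = [
        "HTTP://Example.COM:80/a/b/../c/./d.html#top",
        "http://example.com/a/c/d.html",
        "http://example.com",
    ]
    # Distinct URL strings before and after normalization.
    print(len(set(urls)), len({normalize_url(u) for u in urls}))
```

Under these assumptions, the first two URLs in the usage example collapse to the same string, which is the kind of duplicate reduction the abstract refers to. Whether such a step is safe for a given site (that is, whether the normalized URL still points to the same page) is exactly the kind of question the paper's metrics are meant to evaluate.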