On URL normalization

  • Authors: Sang Ho Lee; Sung Jin Kim; Seok Hoo Hong

  • Affiliations: School of Computing, Soongsil University, Seoul, Korea; School of Computer Science and Engineering, Seoul National University, Seoul, Korea; School of Computing, Soongsil University, Seoul, Korea

  • Venue: ICCSA'05 Proceedings of the 2005 international conference on Computational Science and Its Applications - Volume Part II
  • Year: 2005

Abstract

Because syntactically different URLs can refer to the same resource on the WWW, there are ongoing efforts in the standards community to define URL normalization. This paper considers three additional URL normalization steps beyond those specified in the standard URL normalization. The idea behind our work is to further reduce false negatives in URL normalization while allowing a limited level of false positives. Two metrics are defined to analyze the effect of each normalization step. We ran an experiment on over 170 million URLs collected from real web pages, and we report interesting statistical results in this paper.
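
To make the terms in the abstract concrete, the following is a minimal sketch of a few syntax-based normalization steps in the spirit of RFC 3986 (case normalization of scheme and host, default-port removal, dot-segment removal, and uppercase percent-encoding hex digits). It is an illustration only and is not the paper's implementation: the abstract does not list the three additional steps or define the two metrics, so neither is reproduced here. The names `normalize`, `remove_dot_segments`, and `DEFAULT_PORTS`, and the `example.com` URLs, are hypothetical, and the sketch assumes absolute http/https URLs without userinfo.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical table of default ports; only http and https are assumed here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Resolve '.' and '..' in an absolute path (simplified RFC 3986 5.2.4)."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            # Never pop the leading empty segment that stands for the root.
            if len(output) > 1:
                output.pop()
            continue
        output.append(segment)
    # A trailing '.' or '..' refers to a directory, so keep the final slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)


def normalize(url: str) -> str:
    """Apply a few syntax-based normalization steps to an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # assumption: no userinfo in the URL
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # An empty path is equivalent to "/" for http and https.
    path = remove_dot_segments(parts.path) or "/"
    # Percent-encoding triplets are case-insensitive; use uppercase hex digits.
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


a = normalize("HTTP://Example.COM:80/a/./b/../c%7euser")
b = normalize("http://example.com/a/c%7Euser")
print(a == b)  # the two spellings collapse to one normalized string

# These two stay distinct under the steps above. If a server returns the same
# page for both, treating them as different is a false negative; merging them
# when the server distinguishes them would be a false positive.
print(normalize("http://example.com/dir") == normalize("http://example.com/dir/"))
```

The last comparison shows the kind of trade-off the abstract describes: a more aggressive step would catch more equivalent URLs at the cost of occasionally merging URLs that name different resources.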