Reliable evaluations of URL normalization
ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
Since syntactically different URLs can represent the same resource on the World Wide Web, there are ongoing efforts in the standards communities to define URL normalization. This paper considers three additional URL normalization steps beyond those specified in the standard. The idea behind our work is to further minimize false negatives in URL normalization while keeping false positives within a limited level. Two metrics are defined to analyze the effect of each normalization step. We conducted an experiment on over 170 million URLs collected from real web pages, and this paper reports the resulting statistics.
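To make the trade-off concrete, the following is a minimal sketch of syntax-based URL normalization in the spirit of the standard (lowercasing the scheme and host, dropping the default port, supplying an empty path as "/", removing the fragment), followed by one illustrative extra step of the kind the paper studies. The specific extra step shown here (stripping a trailing "index.html") is an assumption for illustration, not necessarily one of the paper's three steps; such aggressive steps merge more duplicate URLs (fewer false negatives) at the risk of conflating distinct resources (false positives).

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                    # lowercase the scheme
    host = parts.hostname.lower() if parts.hostname else ""
    port = parts.port
    # keep the port only when it is not the scheme's default
    if port and not (scheme == "http" and port == 80) \
            and not (scheme == "https" and port == 443):
        host = f"{host}:{port}"
    path = parts.path or "/"                         # empty path -> "/"
    # illustrative extra (non-standard) step, assumed for this sketch:
    # treat ".../index.html" as equivalent to the directory URL
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # drop the fragment; it never reaches the server
    return urlunsplit((scheme, host, path, parts.query, ""))

print(normalize("HTTP://Example.COM:80/a/index.html#top"))
# -> http://example.com/a/
```

Each such step can be evaluated exactly as the paper suggests: count how many syntactically distinct URLs it collapses into one (duplicate reduction) against how often the collapsed URLs actually serve different content.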