URL normalization is the process of transforming URL strings into a canonical form. Through this process, the number of duplicate URL representations for web pages can be reduced significantly. A number of normalization methods exist. In this paper, we describe four metrics for evaluating normalization methods; the reliability and consistency of a URL are also considered in our evaluation. Using the proposed metrics, we evaluate seven normalization methods, and we report evaluation results on over 25 million URLs extracted from the web.
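The abstract does not list the specific normalization methods evaluated, but RFC 3986 defines several widely used steps. As a minimal illustrative sketch (not the paper's method), the following Python function applies three such steps: lowercasing the scheme and host, dropping the scheme's default port, and collapsing an empty path to "/".

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Apply a few standard RFC 3986 normalization steps:
    lowercase the scheme and host, drop the default port,
    and replace an empty path with '/'. Illustrative only."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop the port when it equals the scheme's default.
    default_ports = {"http": 80, "https": 443}
    port = parts.port
    if port is None or port == default_ports.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

# Both variants map to the same canonical string, illustrating how
# normalization reduces duplicate URL representations:
print(normalize_url("HTTP://Example.COM:80"))  # http://example.com/
print(normalize_url("http://example.com/"))    # http://example.com/
```

A crawler applying such a function to every extracted link would fetch each of the two example URLs above only once, which is the kind of duplicate reduction the evaluated methods aim for.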