A new approach for verifying URL uniqueness in web crawlers

  • Authors:
  • Wallace Favoreto Henrique (Universidade Federal de Minas Gerais, Department of Computer Science, Belo Horizonte, Brazil)
  • Nivio Ziviani (Universidade Federal de Minas Gerais, Department of Computer Science, Belo Horizonte, Brazil)
  • Marco Antônio Cristo (Universidade Federal do Amazonas, Department of Computer Science, Manaus, Brazil)
  • Edleno Silva de Moura (Universidade Federal do Amazonas, Department of Computer Science, Manaus, Brazil)
  • Altigran Soares da Silva (Universidade Federal do Amazonas, Department of Computer Science, Manaus, Brazil)
  • Cristiano Carvalho (Universidade Federal de Minas Gerais, Department of Computer Science, Belo Horizonte, Brazil)

  • Venue:
  • SPIRE'11: Proceedings of the 18th International Conference on String Processing and Information Retrieval
  • Year:
  • 2011

Abstract

The Web has become a huge repository of pages, and search engines allow users to find relevant information in this repository. Web crawlers are an important component of search engines: they find, download, parse, and store pages in a repository. In this paper, we present a new algorithm for verifying URL uniqueness in a large-scale web crawler. The uniqueness verifier must check whether a URL is already present in the repository of unique URLs and whether the corresponding page has already been collected. The algorithm is based on a novel policy for organizing the set of unique URLs according to the server they belong to, exploiting a locality-of-reference property. This property is inherent to Web traversals and follows from the skewed distribution of links within a web page, which favors references to other pages on the same server. We select the URLs to be crawled taking into account information about the servers they belong to, so the algorithm can be used in the crawler without extra cost to pre-organize the entries. We compare our algorithm with a state-of-the-art algorithm found in the literature: we present a model for both algorithms and compare their performances. Experiments using a crawling simulation over a representative subset of the Web show that the adopted policy yields a significant improvement in the time spent on URL uniqueness verification.
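The server-grouping policy described in the abstract can be illustrated with a toy in-memory sketch. This is not the paper's actual data structure (which targets large, disk-resident URL repositories); the class and method names below are illustrative assumptions, but the idea is the same: partition the set of known URLs by server, so that a crawl visiting many URLs from one server repeatedly touches a single partition.

```python
from urllib.parse import urlsplit

class ServerGroupedURLSet:
    """Toy sketch of a uniqueness verifier whose set of known URLs is
    partitioned by server (host). Because links within a page skew
    toward the same server, consecutive lookups tend to hit the same
    partition, giving locality of reference."""

    def __init__(self):
        # host -> set of URLs already seen for that host
        self._by_server = {}

    def add_if_new(self, url):
        """Return True if the URL was not seen before, recording it;
        return False if it is a duplicate."""
        host = urlsplit(url).netloc
        seen = self._by_server.setdefault(host, set())
        if url in seen:
            return False
        seen.add(url)
        return True
```

In a real crawler the per-server partitions would live on disk, and scheduling URLs server by server keeps the active partition small and cache-resident while it is being checked.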