Bulk-Synchronous On-Line Crawling on Clusters of Computers
PDP '08 Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008)
Large-scale data centers for crawlers can maintain a very large number of active HTTP connections in order to download, as fast as possible, the usually huge number of web pages in given sections of the WWW. This generates a continuous stream of new URLs of documents to be downloaded, and the associated workload can only be served efficiently with proper parallel computing techniques. The incoming URLs have to be organized by a priority measure so that the most relevant documents are downloaded first. This paper addresses how to manage these URLs efficiently, together with related synchronization issues such as URLs being downloaded by different processing nodes of a cluster of computers. We propose efficient and scalable strategies that combine intra-node multi-core multi-threading with an inter-node distributed-memory environment, including efficient use of secondary memory.
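
To make the two ideas in the abstract concrete (priority-ordered URLs, and coordination between nodes so that no URL is downloaded twice), here is a minimal single-process sketch. It is not the strategy proposed in the paper: the names `owner_node`, `Frontier` and `superstep`, the hash-based assignment of URLs to nodes, and the numeric priority score are all assumptions chosen for illustration. Threads, HTTP downloads and secondary memory are left out.

```python
import hashlib
import heapq
import itertools


def owner_node(url: str, num_nodes: int) -> int:
    """Assumed policy: a stable hash decides which node owns a URL."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class Frontier:
    """Per-node queue of URLs waiting to be downloaded, highest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()                # URLs already queued on this node
        self._order = itertools.count()   # tie-breaker: FIFO among equal priorities

    def push(self, url: str, priority: float) -> bool:
        """Queue a URL once; return False if this node has already seen it."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def pop(self):
        """Remove and return (url, priority) for the most relevant queued URL."""
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)


def superstep(frontiers, discovered):
    """Route one round of newly discovered (url, priority) pairs to their owners.

    URLs are first collected into one outbox per destination node and then
    delivered in bulk, which imitates a round-based exchange between nodes.
    """
    num_nodes = len(frontiers)
    outboxes = [[] for _ in range(num_nodes)]
    for url, priority in discovered:
        outboxes[owner_node(url, num_nodes)].append((url, priority))
    for node, batch in enumerate(outboxes):
        for url, priority in batch:
            frontiers[node].push(url, priority)
```

Because every copy of a URL hashes to the same owner, duplicates found by different nodes meet in one `Frontier` and are dropped there, so no global lock is needed in this sketch. A real crawler would also have to bound the in-memory `_seen` set, for instance by spilling it to disk.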