Distributed high-performance web crawler based on peer-to-peer network

Authors:
Liu Fei;Ma Fan-Yuan;Ye Yun-Ming;Li Ming-Lu;Yu Jia-Di
Affiliations:
Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P. R. China;Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P. R. China;Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P. R. China;Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P. R. China;Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P. R. China
Venue:
PDCAT'04 Proceedings of the 5th International Conference on Parallel and Distributed Computing: Applications and Technologies
Year:
2004

Abstract

Distributing the crawling activity among multiple machines spreads the processing and reduces the web page analysis that each machine must perform. This paper presents the design of a distributed web crawler based on a peer-to-peer (P2P) network. The distributed crawler harnesses the excess bandwidth and computing resources of the nodes in the system to crawl the web. Each crawler is deployed on a computing node of the P2P network, where it analyzes web pages and generates indices. A control node is a separate node in charge of distributing URLs to balance the load among the crawlers. The control nodes are themselves organized as a P2P network. The crawler nodes managed by the same control node form a group. Based on its own ID and the average load of its group, a crawler decides whether to transmit a URL to the control node or to keep it. We present an implementation of the distributed crawler based on Igloo and simulate the environment to evaluate load balancing across the crawlers and crawl speed.
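The abstract does not spell out the rule a crawler uses to decide between keeping a URL and sending it to its control node. The following is a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that a URL is mapped to a crawler ID by hashing, and that a crawler keeps a URL only when the URL maps to its own ID and its queue length does not exceed the group's average load. The names `url_owner`, `Crawler`, and `handle` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def url_owner(url: str, group_size: int) -> int:
    """Map a URL to a crawler ID within a group (assumed hashing rule)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % group_size


@dataclass
class Crawler:
    crawler_id: int   # ID of this crawler inside its group
    group_size: int   # number of crawlers managed by the same control node
    queue: list = field(default_factory=list)  # URLs this crawler will fetch

    def load(self) -> int:
        # Assumption: load is measured as the number of queued URLs.
        return len(self.queue)

    def handle(self, url: str, group_avg_load: float) -> str:
        """Return 'hold' if the URL is kept locally, else 'forward'.

        'forward' stands for transmitting the URL to the control node,
        which would then reassign it to balance the group.
        """
        is_mine = url_owner(url, self.group_size) == self.crawler_id
        overloaded = self.load() > group_avg_load
        if is_mine and not overloaded:
            self.queue.append(url)
            return "hold"
        return "forward"
```

In this sketch a crawler whose ID matches the URL's hash and whose queue is at or below the group average keeps the URL; in every other case the URL goes to the control node. The average load would be reported by the control node, which is not modeled here.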