Architectural design and evaluation of an efficient web-crawling system

Authors:
Hongfei Yan;Jianyong Wang;Xiaoming Li;Lin Guo
Affiliations:
Computer Networks and Distributed Systems Laboratory, Department of Computer Science and Technology, Peking University, Beijing 100871, PR China;Computer Networks and Distributed Systems Laboratory, Department of Computer Science and Technology, Peking University, Beijing 100871, PR China;Computer Networks and Distributed Systems Laboratory, Department of Computer Science and Technology, Peking University, Beijing 100871, PR China;Computer Networks and Distributed Systems Laboratory, Department of Computer Science and Technology, Peking University, Beijing 100871, PR China
Venue:
Journal of Systems and Software
Year:
2002

Abstract

This paper presents the architectural design and evaluation results of an efficient Web-crawling system. The design comprises a fully distributed architecture, a URL allocation algorithm, and a method for ensuring system scalability and dynamic reconfigurability. Simulation experiments show that the system achieves load balance, scalability, and efficiency. This distributed Web-crawling subsystem has been successfully integrated with WebGather, a well-known Chinese and English Web search engine that aims to collect all the Web pages in China and keep pace with the rapid growth of Chinese Web information. In addition, we believe that the design can also be useful in other contexts, such as digital libraries.
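
The abstract names a URL allocation algorithm but does not describe it. As a rough illustration of what URL allocation across crawler nodes can look like, the sketch below hashes each URL's host name to a node index, so that all pages of one site go to the same node. This is an assumed, generic scheme and not the algorithm evaluated in the paper; the names `assign_node` and `partition` and the example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_node(url: str, num_nodes: int) -> int:
    """Return the index of the crawler node responsible for this URL.

    Assumption for illustration only: allocation is by a hash of the
    host name, so every URL on one host maps to the same node.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes):
    """Group URLs into one list per crawler node."""
    buckets = [[] for _ in range(num_nodes)]
    for url in urls:
        buckets[assign_node(url, num_nodes)].append(url)
    return buckets


if __name__ == "__main__":
    # Hypothetical seed URLs, spread over four nodes.
    seeds = [
        "http://www.example.cn/index.html",
        "http://www.example.cn/news/1.html",
        "http://www.example.org/",
    ]
    for node, bucket in enumerate(partition(seeds, 4)):
        print(node, bucket)
```

With plain modulo hashing, changing the number of nodes reassigns most hosts. A system that needs dynamic reconfigurability would likely use a scheme that moves fewer hosts when nodes join or leave, but how the paper handles this is not stated in the abstract.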