IglooG: a distributed web crawler based on grid service

Authors:
Fei Liu;Fan-yuan Ma;Yun-ming Ye;Ming-lu Li;Jia-di Yu
Affiliations:
Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P.R. China;Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P.R. China;Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P.R. China;Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P.R. China;Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, P.R. China
Venue:
APWeb'05 Proceedings of the 7th Asia-Pacific web conference on Web Technologies Research and Development
Year:
2005

Citing 8
Cited 0

The anatomy of a large-scale hypertextual Web search engine. WWW7 Proceedings of the seventh international conference on World Wide Web 7
SPHINX: a framework for creating personal, site-specific Web crawlers. WWW7 Proceedings of the seventh international conference on World Wide Web 7
Matrices, Vector Spaces, and Information Retrieval. SIAM Review
A scalable content-addressable network. Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
Mercator: A scalable, extensible Web crawler. World Wide Web
Hyperlink Analysis for the Web. IEEE Internet Computing
pSearch: information retrieval in structured overlays. ACM SIGCOMM Computer Communication Review
Grid Information Services for Distributed Resource Sharing. HPDC '01 Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing

Abstract

A web crawler is a program that downloads documents from web sites. This paper presents the design of a distributed web crawler on a grid platform, based on our previous work, Igloo. Each crawler is deployed as a grid service to improve the scalability of the system. Information services, which are also deployed as grid services and organized as a peer-to-peer overlay network, distribute URLs to balance the load across the crawlers. Using its own ID and the semantic vector of a crawled page, computed by Latent Semantic Indexing, a crawler decides whether to transmit a URL to an information service or to hold the URL itself. We present an implementation of the distributed crawler based on Igloo and simulate a grid environment to evaluate load balance across the crawlers and crawl speed. Both the theoretical analysis and the experimental results show that the system is high-performance and reliable.
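
The hold-or-forward decision described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the actual rule, so everything below is an assumption made for illustration: the names `project`, `Crawler`, `owner_of`, and `route` are hypothetical, the LSI basis is taken as precomputed, and the mapping from a semantic vector to a crawler ID (an even split of the first coordinate's range) is a stand-in rather than the paper's method.

```python
from dataclasses import dataclass
from typing import List, Sequence


def project(term_vector: Sequence[float], basis: Sequence[Sequence[float]]) -> List[float]:
    """Project a term vector onto a precomputed LSI basis (one row per latent dimension).

    Assumption: the basis comes from a truncated SVD computed elsewhere.
    """
    return [sum(w * t for w, t in zip(row, term_vector)) for row in basis]


@dataclass
class Crawler:
    crawler_id: int    # hypothetical: index of the zone this crawler is responsible for
    num_crawlers: int  # hypothetical: total number of crawlers in the system

    def owner_of(self, semantic_vector: Sequence[float]) -> int:
        """Map a semantic vector to the ID of the responsible crawler (assumed rule)."""
        # Clamp the first coordinate to [-1, 1], rescale to [0, 1],
        # then split that range evenly into num_crawlers zones.
        x = max(-1.0, min(1.0, semantic_vector[0]))
        unit = (x + 1.0) / 2.0
        return min(int(unit * self.num_crawlers), self.num_crawlers - 1)

    def route(self, semantic_vector: Sequence[float]) -> str:
        """Return 'hold' if this crawler owns the page's zone, otherwise 'forward'."""
        if self.owner_of(semantic_vector) == self.crawler_id:
            return "hold"
        return "forward"  # hand the URL to an information service


if __name__ == "__main__":
    basis = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]  # illustrative values only
    crawler = Crawler(crawler_id=0, num_crawlers=4)
    vec = project([-0.9, 0.4, 0.6], basis)
    print(crawler.route(vec))
```

In this sketch a crawler keeps a URL only when the page's semantic vector falls in its own zone; every other URL goes to an information service, which is where the load balancing described in the abstract would take place.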