Design and Implementation of a Distributed Crawler and Filtering Processor

  • Authors:
  • Demetrios Zeinalipour-Yazti; Marios D. Dikaiakos

  • Venue:
  • NGITS '02 Proceedings of the 5th International Workshop on Next Generation Information Technologies and Systems
  • Year:
  • 2002

Abstract

Web crawlers are the key component of services running on the Internet that provide searching and indexing support for the entire Web, for corporate Intranets, and for large portal sites. More recently, crawlers have also been used as tools to conduct focused Web searches and to gather data about the characteristics of the WWW. In this paper, we study the employment of crawlers as a programmable, scalable, and distributed component in future Internet middleware infrastructures and proxy services. In particular, we present the architecture and implementation of, and experimentation with, WebRACE, a high-performance, distributed Web crawler, filtering server, and object cache. We address the challenge of designing and implementing modular, open, distributed, and scalable crawlers, using Java. We describe our design and implementation decisions, along with various optimizations. We discuss the advantages and disadvantages of using Java to implement the WebRACE crawler, and present an evaluation of its performance. WebRACE is designed in the context of eRACE, an extensible Retrieval Annotation Caching Engine, which collects, annotates, and disseminates information from heterogeneous Internet sources and protocols, according to XML-encoded user profiles that determine the urgency and relevance of collected information.
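The abstract notes that WebRACE is a scalable crawler implemented in Java. As a minimal sketch of one core data structure such a crawler needs, the following illustrates a thread-safe URL frontier: a FIFO queue of pending URLs with duplicate elimination, which worker threads would drain concurrently. The class and method names here are hypothetical illustrations, not taken from the WebRACE implementation.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of a crawler's URL frontier: pending URLs are
// kept in FIFO order, and a "seen" set rejects URLs that have already
// been queued or fetched. Methods are synchronized so multiple worker
// threads can share one frontier safely.
public class UrlFrontier {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Enqueue a URL only if it has never been seen before.
    // Returns true if the URL was newly added.
    public synchronized boolean offer(String url) {
        if (!seen.add(url)) {
            return false; // duplicate: already queued or fetched
        }
        pending.add(url);
        return true;
    }

    // Hand the next URL to a worker thread, or null if none remain.
    public synchronized String poll() {
        return pending.poll();
    }

    public synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        UrlFrontier frontier = new UrlFrontier();
        frontier.offer("http://example.org/a");
        frontier.offer("http://example.org/b");
        frontier.offer("http://example.org/a"); // duplicate, ignored
        System.out.println(frontier.pendingCount()); // prints 2
    }
}
```

In a real distributed crawler the frontier would also persist to disk, enforce per-server politeness delays, and prioritize URLs by relevance (as eRACE's XML-encoded user profiles suggest), but the queue-plus-seen-set core above is the common starting point.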