Design and Implementation of a Distributed Crawler and Filtering Processor

  • Authors:
  • Demetrios Zeinalipour-Yazti; Marios D. Dikaiakos

  • Venue:
  • NGITS '02 Proceedings of the 5th International Workshop on Next Generation Information Technologies and Systems
  • Year:
  • 2002

Abstract

Web crawlers are the key component of services running on the Internet that provide searching and indexing support for the entire Web, for corporate Intranets, and for large portal sites. More recently, crawlers have also been used as tools to conduct focused Web searches and to gather data about the characteristics of the WWW. In this paper, we study the employment of crawlers as a programmable, scalable, and distributed component in future Internet middleware infrastructures and proxy services. In particular, we present the architecture and implementation of, and experimentation with, WebRACE, a high-performance, distributed Web crawler, filtering server, and object cache. We address the challenge of designing and implementing modular, open, distributed, and scalable crawlers, using Java. We describe our design and implementation decisions, along with various optimizations. We discuss the advantages and disadvantages of using Java to implement the WebRACE crawler, and present an evaluation of its performance. WebRACE is designed in the context of eRACE, an extensible Retrieval Annotation Caching Engine, which collects, annotates, and disseminates information from heterogeneous Internet sources and protocols, according to XML-encoded user profiles that determine the urgency and relevance of collected information.
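The abstract notes that WebRACE is a scalable crawler implemented in Java. As a minimal sketch of one core data structure such a crawler needs, the following illustrates a thread-safe URL frontier: a FIFO queue of pending URLs with duplicate elimination, which worker threads would drain concurrently. The class and method names here are hypothetical illustrations, not taken from the WebRACE implementation.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of a crawler's URL frontier: pending URLs are
// kept in FIFO order, and a "seen" set rejects URLs that have already
// been queued or fetched. Methods are synchronized so multiple worker
// threads can share one frontier safely.
public class UrlFrontier {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Enqueue a URL only if it has never been seen before.
    // Returns true if the URL was newly added.
    public synchronized boolean offer(String url) {
        if (!seen.add(url)) {
            return false; // duplicate: already queued or fetched
        }
        pending.add(url);
        return true;
    }

    // Hand the next URL to a worker thread, or null if none remain.
    public synchronized String poll() {
        return pending.poll();
    }

    public synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        UrlFrontier frontier = new UrlFrontier();
        frontier.offer("http://example.org/a");
        frontier.offer("http://example.org/b");
        frontier.offer("http://example.org/a"); // duplicate, ignored
        System.out.println(frontier.pendingCount()); // prints 2
    }
}
```

In a real distributed crawler the frontier would also persist to disk, enforce per-server politeness delays, and prioritize URLs by relevance (as eRACE's XML-encoded user profiles suggest), but the queue-plus-seen-set core above is the common starting point.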