Web crawlers are the key component of services that run on the Internet and provide searching and indexing support for the entire Web, for corporate intranets, and for large portal sites. More recently, crawlers have also been used as tools to conduct focused Web searches and to gather data about the characteristics of the WWW. In this paper, we study the use of crawlers as a programmable, scalable, and distributed component in future Internet middleware infrastructures and proxy services. In particular, we present the architecture and implementation of, and experimentation with, WebRACE, a high-performance, distributed Web crawler, filtering server, and object cache. We address the challenge of designing and implementing modular, open, distributed, and scalable crawlers using Java. We describe our design and implementation decisions and various optimizations. We discuss the advantages and disadvantages of using Java to implement the WebRACE crawler, and present an evaluation of its performance. WebRACE is designed in the context of eRACE, an extensible Retrieval Annotation Caching Engine, which collects, annotates, and disseminates information from heterogeneous Internet sources and protocols according to XML-encoded user profiles that determine the urgency and relevance of the collected information.