Design and Implementation of a High-Performance Distributed Web Crawler

Authors:
Affiliations:
Venue: ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Year: 2002

Citing 0
Cited 53

I/O-efficient techniques for computing pagerank

Proceedings of the eleventh international conference on Information and knowledge management
Agents, Crawlers, and Web Retrieval

CIA '02 Proceedings of the 6th International Workshop on Cooperative Information Agents VI
Web application security assessment by fault injection and behavior monitoring

WWW '03 Proceedings of the 12th international conference on World Wide Web
Efficient URL caching for world wide web crawling

WWW '03 Proceedings of the 12th international conference on World Wide Web
Effective page refresh policies for Web crawlers

ACM Transactions on Database Systems (TODS)
Design of a crawler with bounded bandwidth

Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
The Evolution of Link-Attributes for Pages and Its Implications on Web Crawling

WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
Researchexplorer: gaining insights through exploration in multimedia scientific data

Proceedings of the 6th ACM SIGMM international workshop on Multimedia information retrieval
Local methods for estimating pagerank values

Proceedings of the thirteenth ACM international conference on Information and knowledge management
SmartCrawl: a new strategy for the exploration of the hidden web

Proceedings of the 6th annual ACM international workshop on Web information and data management
UbiCrawler: a scalable fully distributed web crawler

Software—Practice & Experience
Three-level caching for efficient query processing in large Web search engines

WWW '05 Proceedings of the 14th international conference on World Wide Web
Efficient processing of client transactions in real-time

Distributed and Parallel Databases
Crawling a country: better strategies than breadth-first for web page ordering

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
A testing framework for Web application security assessment

Computer Networks: The International Journal of Computer and Telecommunications Networking - Web security
A computational study of external-memory BFS algorithms

SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
Managing duplicates in a web archive

Proceedings of the 2006 ACM symposium on Applied computing
Evaluation of crawling policies for a web-repository crawler

Proceedings of the seventeenth conference on Hypertext and hypermedia
Automated gathering of Web information: An in-depth examination of agents interacting with search engines

ACM Transactions on Internet Technology (TOIT)
Architecture of a grid-enabled Web search engine

Information Processing and Management: an International Journal
Improving web spam classifiers using link structure

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Optimized query execution in large search engines with global page ordering

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
The Viúva Negra crawler: an experience report

Software—Practice & Experience
Development of an agent system to collect schedule information on the web for intermodal transportation network planning

CEA'07 Proceedings of the 2007 annual Conference on International Conference on Computer Engineering and Applications
Performance of compressed inverted list caching in search engines

Proceedings of the 17th international conference on World Wide Web
IRLbot: scaling to 6 billion pages and beyond

Proceedings of the 17th international conference on World Wide Web
Building Data-Intensive Grid Applications with Globus Toolkit --- An Evaluation Based on Web Crawling

ICSOC '07 Proceedings of the 5th international conference on Service-Oriented Computing
On the feasibility of geographically distributed web crawling

Proceedings of the 3rd international conference on Scalable information systems
IRLbot: Scaling to 6 billion pages and beyond

ACM Transactions on the Web (TWEB)
A practical method for browsing a relational database using a standard search engine

Integrated Computer-Aided Engineering - Selected papers from the IEEE Conference on Information Reuse and Integration (IRI), July 13-15, 2008
Malay document analysis and recognition

WSEAS Transactions on Information Science and Applications
A testing framework for Web application security assessment

Computer Networks: The International Journal of Computer and Telecommunications Networking - Web security
Web Crawling

Foundations and Trends in Information Retrieval
Implementation of a web robot and statistics on the Korean web

HSI'03 Proceedings of the 2nd international conference on Human.society@internet
Adaptive focused crawling

The adaptive web
A full distributed web crawler based on structured network

AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
News page discovery policy for instant crawlers

AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
CRAYSE: design and implementation of efficient text search algorithm in a web crawler

ACM SIGSOFT Software Engineering Notes
Batch query processing for web search engines

Proceedings of the fourth ACM international conference on Web search and data mining
Online social honeynets: trapping web crawlers in OSN

MDAI'11 Proceedings of the 8th international conference on Modeling decisions for artificial intelligence
A new approach for verifying URL uniqueness in web crawlers

SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Discovering URLs through user feedback

Proceedings of the 20th ACM international conference on Information and knowledge management
A tool for link-based web page classification

CAEPIA'11 Proceedings of the 14th international conference on Advances in artificial intelligence: spanish association for artificial intelligence
A framework for utilising usage trends in the crawling and indexing process of search engines

International Journal of Knowledge and Web Intelligence
Reliable evaluations of URL normalization

ICCSA'06 Proceedings of the 2006 international conference on Computational Science and Its Applications - Volume Part V
How to evaluate the effectiveness of URL normalizations

HSI'05 Proceedings of the 3rd international conference on Human Society@Internet: web and Communication Technologies and Internet-Related Social Issues
Educational resources recommendation system based on agents and semantic web for helping students in a virtual learning environment

International Journal of Web Based Communities
The Study of Content Security for Mobile Internet

Wireless Personal Communications: An International Journal
RetriBlog: An architecture-centered framework for developing blog crawlers

Expert Systems with Applications: An International Journal
Ontology-Based Shopping Agent for E-Marketing

International Journal of Intelligent Information Technologies
Current challenges in web crawling

ICWE'13 Proceedings of the 13th international conference on Web Engineering
A brief history of web crawlers

CASCON '13 Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research
Development of an intelligent distributed news retrieval system

International Journal of Knowledge-based and Intelligent Engineering Systems

Abstract

Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost.

In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.