High performance crawling system

Authors:
Younès Hafri; Chabane Djeraba
Affiliations:
Ecole Polytechnique de Nantes, Cédex, France; UMR CNRS, Cédex, France
Venue:
Proceedings of the 6th ACM SIGMM international workshop on Multimedia information retrieval
Year:
2004

Citing 17
Cited 7

Artificial intelligence: a modern approach
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient crawling through URL ordering

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Measuring index quality using random walks on the Web

WWW '99 Proceedings of the eighth international conference on World Wide Web
Focused crawling: a new approach to topic-specific Web resource discovery

WWW '99 Proceedings of the eighth international conference on World Wide Web
Synchronizing a database to improve freshness

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
WebBase: a repository of Web pages

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
On near-uniform URL sampling

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Breadth-first crawling yields high-quality pages

Proceedings of the 10th international conference on World Wide Web
Controlling the robots of Web search engines

Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Mercator: A scalable, extensible Web crawler

World Wide Web
Using Reinforcement Learning to Spider the Web Efficiently

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Distributed Hypertext Resource Discovery Through Examples

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
The Evolution of the Web and Implications for an Incremental Crawler

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Focused Crawling Using Context Graphs

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Approximating Aggregate Queries about Web Pages via Random Walks

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Compressing the Graph Structure of the Web

DCC '01 Proceedings of the Data Compression Conference

Crawling a country: better strategies than breadth-first for web page ordering

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Evaluation of crawling policies for a web-repository crawler

Proceedings of the seventeenth conference on Hypertext and hypermedia
IRLbot: scaling to 6 billion pages and beyond

Proceedings of the 17th international conference on World Wide Web
IRLbot: Scaling to 6 billion pages and beyond

ACM Transactions on the Web (TWEB)
Web Crawling

Foundations and Trends in Information Retrieval
A full distributed web crawler based on structured network

AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Crawling the infinite web

Journal of Web Engineering

Abstract

This paper describes the design and implementation of a real-time distributed Web crawling system running on a cluster of machines. The system crawls several thousand pages per second, includes a high-performance fault manager, is platform independent, and adapts transparently to a wide range of configurations without requiring additional hardware. We then detail the system architecture and the technical choices that enable very high-performance crawling. Finally, we discuss the experimental results and compare them with those of other documented systems.