UbiCrawler: a scalable fully distributed web crawler

Authors:
Paolo Boldi; Bruno Codenotti; Massimo Santini; Sebastiano Vigna
Affiliations:
Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, via Comelico 39/41, I-20135 Milano, Italy; Department of Computer Science, The University of Iowa, 14 Maclean Hall, Iowa City IA; Dipartimento di Scienze Sociali, Cognitive e Quantitative, Università di Modena e Reggio Emilia, via Giglioli Valle 9, I-42100 Reggio Emilia, Italy; Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, via Comelico 39/41, I-20135 Milano, Italy
Venue:
Software—Practice & Experience
Year:
2004

Abstract

We report our experience in implementing UbiCrawler, a scalable distributed Web crawler, using the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and, more generally, the complete decentralization of every task. The need to handle very large data sets highlighted some limitations of the Java APIs, which prompted the authors to partially reimplement them.
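
The abstract names consistent hashing as the basis of the assignment function that partitions hosts among crawling agents. The following Java fragment is a minimal sketch of that general technique, not UbiCrawler's actual code: the class name `ConsistentHashAssigner`, its methods, the replica count, and the hash function are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: assigns host names to crawler agents by consistent hashing. */
final class ConsistentHashAssigner {
    // Ring of hash points; each point is owned by one agent.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    ConsistentHashAssigner(int replicas) {
        if (replicas <= 0) throw new IllegalArgumentException("replicas must be positive");
        this.replicas = replicas;
    }

    /** Assumed hash: 64-bit FNV-1a followed by a bit-mixing step. Any well-spread hash would do. */
    static long hash(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** Places several points ("replicas") for the agent on the ring. */
    void addAgent(String agent) {
        for (int i = 0; i < replicas; i++) {
            // A collision between two points is extremely unlikely and is ignored in this sketch.
            ring.put(hash(agent + "#" + i), agent);
        }
    }

    /** Removes every point owned by the agent, e.g. after a fault. */
    void removeAgent(String agent) {
        ring.values().removeIf(agent::equals);
    }

    /** Returns the agent owning the first ring point at or after the host's hash, wrapping around. */
    String agentFor(String host) {
        if (ring.isEmpty()) throw new IllegalStateException("no agents registered");
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(host));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashAssigner assigner = new ConsistentHashAssigner(100);
        assigner.addAgent("agent-0");
        assigner.addAgent("agent-1");
        assigner.addAgent("agent-2");
        String host = "www.example.org";
        System.out.println(host + " before removal: " + assigner.agentFor(host));
        assigner.removeAgent("agent-1");
        System.out.println(host + " after removal:  " + assigner.agentFor(host));
    }
}
```

In this sketch, removing an agent reassigns only the hosts that agent owned; every other host keeps its previous agent, which is the property that makes this style of assignment suit a crawler that must degrade gracefully when agents fail.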