We report our experience implementing UbiCrawler, a scalable distributed Web crawler, in the Java programming language. The main features of UbiCrawler are platform independence, linear scalability, graceful degradation in the presence of faults, a very effective assignment function (based on consistent hashing) for partitioning the domain to crawl, and, more generally, the complete decentralization of every task. The need to handle very large data sets exposed some limitations of the Java APIs, which prompted us to partially reimplement them.
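To make the idea of an assignment function based on consistent hashing more concrete, the following is a minimal sketch, not UbiCrawler's actual code. The class name `HostAssigner`, the constant `REPLICAS`, the agent naming, and the use of MD5 to place points on the ring are all assumptions made for illustration. Each agent owns several points on a ring of hash values, and a host is assigned to the agent owning the first point at or after the host's own hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative consistent-hashing ring (hypothetical; not taken from UbiCrawler). */
final class HostAssigner {
    // Number of ring points per agent; an assumed value chosen for illustration.
    private static final int REPLICAS = 100;

    // Ring position -> agent name, kept sorted so successors can be looked up.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Maps a string to a 64-bit ring position using the first 8 bytes of its MD5 digest. */
    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Places REPLICAS points for the given agent on the ring. */
    void addAgent(String agent) {
        for (int i = 0; i < REPLICAS; i++) {
            ring.put(hash(agent + "#" + i), agent);
        }
    }

    /** Removes the agent's points, e.g. after a crash; other agents' points are untouched. */
    void removeAgent(String agent) {
        for (int i = 0; i < REPLICAS; i++) {
            ring.remove(hash(agent + "#" + i), agent);
        }
    }

    /** Returns the agent owning the first ring point at or after the host's hash, wrapping around. */
    String agentFor(String host) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no agents registered");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(host));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

Under these assumptions, removing an agent reassigns only the hosts that agent was responsible for; every other host keeps its current agent, which is one way a crawler can degrade gracefully when an agent fails.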