Parallel crawlers

Authors:
Junghoo Cho; Hector Garcia-Molina
Affiliations:
University of California, Los Angeles; Stanford University, Stanford, CA
Venue:
Proceedings of the 11th international conference on World Wide Web
Year:
2002

Abstract

In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.