In this paper, we study how to design an effective parallel crawler. As the Web grows, it becomes imperative to parallelize the crawling process so that pages can be downloaded in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics for evaluating a parallel crawler and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and offer guidance on when to adopt each one.