Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost.

In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.