The architecture and implementation of an extensible web crawler

  • Authors:
  • Jonathan M. Hsieh; Steven D. Gribble; Henry M. Levy

  • Affiliations:
  • Department of Computer Science & Engineering, University of Washington, Seattle, WA (all authors)

  • Venue:
  • NSDI '10: Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation
  • Year:
  • 2010

Abstract

Many Web services operate their own Web crawlers to discover data of interest, despite the fact that large-scale, timely crawling is complex, operationally intensive, and expensive. In this paper, we introduce the extensible crawler, a service that crawls the Web on behalf of its many client applications. Clients inject filters into the extensible crawler; the crawler evaluates all received filters against each Web page, notifying clients of matches. As a result, the act of crawling the Web is decoupled from determining whether a page is of interest, shielding client applications from the burden of crawling the Web themselves. This paper describes the architecture, implementation, and evaluation of our prototype extensible crawler, and also relates early experience from several crawler applications we have built. We focus on the challenges and trade-offs in the system, such as the design of a filter language that is simultaneously expressive and efficient to execute, the use of filter indexing to cheaply match a page against millions of filters, and the use of document and filter partitioning to scale our prototype implementation to high document throughput and large numbers of filters. We argue that the low-latency, highly selective, and scalable nature of our system makes it a promising platform for taking advantage of emerging real-time streams of data, such as Facebook or Twitter feeds.
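
To make the filter-indexing idea in the abstract concrete, the sketch below shows one simple way an inverted index from terms to filter IDs can match a page against many injected filters without scanning every filter. This is a minimal illustration under stated assumptions, not the paper's implementation: the conjunctive-keyword filter form, the names (inject_filter, process_page), and the single-process design are all invented here, and the paper's actual filter language is more expressive and its engine far more scalable.

    from collections import defaultdict

    class ExtensibleCrawlerSketch:
        """Toy sketch of filter injection and index-based matching.
        Clients inject filters; an inverted index from terms to
        filter IDs lets each crawled page be checked against only
        the filters that share a term with it."""

        def __init__(self):
            self.filters = {}              # filter_id -> (client, required term set)
            self.index = defaultdict(set)  # term -> IDs of filters mentioning it
            self.next_id = 0

        def inject_filter(self, client, terms):
            """Register a conjunctive keyword filter on behalf of a client."""
            fid = self.next_id
            self.next_id += 1
            required = set(t.lower() for t in terms)
            self.filters[fid] = (client, required)
            for term in required:
                self.index[term].add(fid)  # index each term for cheap candidate lookup
            return fid

        def process_page(self, url, text):
            """Evaluate one crawled page against all injected filters."""
            page_terms = set(text.lower().split())
            candidates = set()
            for term in page_terms:
                # Only filters sharing at least one term can possibly match.
                candidates |= self.index.get(term, set())
            matches = []
            for fid in candidates:
                client, required = self.filters[fid]
                if required <= page_terms:  # every filter term appears on the page
                    matches.append((client, fid))
            return matches

    crawler = ExtensibleCrawlerSketch()
    crawler.inject_filter("alice", ["web", "crawler"])
    crawler.inject_filter("bob", ["twitter", "feeds"])
    print(crawler.process_page("http://example.com",
                               "an extensible web crawler for real-time feeds"))
    # -> [('alice', 0)] : only alice's filter has all its terms on the page

The sketch also ignores the partitioning dimension the abstract mentions: in the described system, documents and filters are additionally partitioned across machines to scale to high document throughput and large filter populations.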