Many Web services operate their own Web crawlers to discover data of interest, even though large-scale, timely crawling is complex, operationally intensive, and expensive. In this paper, we introduce the extensible crawler, a service that crawls the Web on behalf of its many client applications. Clients inject filters into the extensible crawler; the crawler evaluates all received filters against each Web page and notifies clients of matches. As a result, the act of crawling the Web is decoupled from determining whether a page is of interest, which shields client applications from the burden of crawling the Web themselves. This paper describes the architecture, implementation, and evaluation of our prototype extensible crawler, and relates early experience from several crawler applications we have built. We focus on the challenges and trade-offs in the system, such as the design of a filter language that is simultaneously expressive and efficient to execute, the use of filter indexing to cheaply match a page against millions of filters, and the use of document and filter partitioning to scale our prototype implementation to high document throughput and large numbers of filters. We argue that the low latency, high selectivity, and scalability of our system make it a promising platform for taking advantage of emerging real-time streams of data, such as Facebook or Twitter feeds.
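To make the idea of filter indexing concrete, the following is a minimal sketch, not the system's actual filter language or index, neither of which is specified in the abstract. It assumes a hypothetical filter made of a set of required lowercase terms plus an optional regular expression. Each filter is indexed under a single required term, so a page is fully checked only against filters whose index term appears in it. The names `Filter`, `FilterIndex`, `add`, and `match` are illustrative assumptions.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter: required lowercase terms plus an optional regex."""
    client: str
    terms: frozenset
    pattern: str = ""


class FilterIndex:
    """Illustrative index: each filter is stored under one of its required terms."""

    def __init__(self):
        self._by_anchor = defaultdict(list)

    def add(self, f):
        if not f.terms:
            raise ValueError("this sketch requires at least one term per filter")
        # Arbitrary choice for the sketch: index under the smallest term.
        self._by_anchor[min(f.terms)].append(f)

    def match(self, page_text):
        """Return the sorted client IDs whose filters match the page."""
        tokens = set(re.findall(r"[a-z0-9]+", page_text.lower()))
        matched = []
        for tok in tokens:
            # Only filters anchored on a token present in the page are candidates.
            for f in self._by_anchor.get(tok, ()):
                has_terms = f.terms <= tokens
                has_pattern = not f.pattern or re.search(f.pattern, page_text)
                if has_terms and has_pattern:
                    matched.append(f.client)
        return sorted(matched)


idx = FilterIndex()
idx.add(Filter("news", frozenset({"earthquake", "seattle"})))
idx.add(Filter("price", frozenset({"laptop"}), r"\$\d+"))
idx.add(Filter("other", frozenset({"volcano"})))
print(idx.match("Earthquake reported near Seattle today"))  # ['news']
```

In this sketch the cost of matching a page grows with the number of candidate filters rather than with the total number of filters. Choosing a rare term as the index key, instead of the arbitrary smallest one used here, would shrink the candidate set further.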