Current challenges in web crawling

Authors: Denis Shestakov
Affiliations: Department of Media Technology, Aalto University, Aalto, Finland
Venue: ICWE'13 Proceedings of the 13th international conference on Web Engineering
Year: 2013

Abstract

Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, ranging from a simple program for website backup to a major web search engine. Due to the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we will introduce the audience to five topics: the architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research.