Web crawling, the process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents, ranging from a simple website backup program to a major web search engine. Because of the astronomical amount of data already published on the Web and the ongoing exponential growth of web content, any party that wants to take advantage of massive-scale web data faces a high barrier to entry. In this tutorial, we will introduce the audience to five topics: the architecture and implementation of a high-performance web crawler, collaborative web crawling, crawling the deep Web, crawling multimedia content, and future directions in web crawling research.