User browsing behavior-driven web crawling

Authors:
Minghai Liu; Rui Cai; Ming Zhang; Lei Zhang
Affiliations:
School of EECS, Peking University, Beijing, China; Microsoft Research Asia, Beijing, China; School of EECS, Peking University, Beijing, China; Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

To optimize the performance of web crawlers, various page importance measures have been studied for selecting and ordering URLs during crawling. Most sophisticated measures (e.g., breadth-first and PageRank) are based on link structure. In this paper, we approach the problem from another perspective and propose to measure page importance by mining user interests and behaviors from web browsing logs. Unlike most existing approaches, which operate on individual URLs, both the log mining and the crawl ordering in this paper are performed at the granularity of URL patterns. As a result, the proposed URL pattern-based crawl orderings can predict the importance of newly created (unseen) URLs. Promising experimental results demonstrate the feasibility of our approach.
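
The idea of scoring at the granularity of URL patterns can be illustrated with a minimal sketch. It is not the authors' algorithm: the pattern rule (digit runs in the path replaced by `*`, query values dropped), the visit-share score, and the names `url_pattern`, `pattern_scores`, and `order_frontier` are all assumptions made for illustration. It only shows how a score learned from logged visits can carry over to a URL that never appeared in the log.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def url_pattern(url: str) -> str:
    """Generalize a URL into a coarse pattern (hypothetical rule).

    Digit runs in the path become '*', and query values are dropped so
    that only the sorted parameter names remain.
    """
    parts = urlsplit(url)
    path = re.sub(r"\d+", "*", parts.path)
    keys = sorted(kv.split("=", 1)[0] for kv in parts.query.split("&") if kv)
    query = "?" + "&".join(f"{k}=*" for k in keys) if keys else ""
    return f"{parts.netloc}{path}{query}"


def pattern_scores(visited_urls):
    """Score each pattern by its share of visits in the browsing log."""
    counts = Counter(url_pattern(u) for u in visited_urls)
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}


def order_frontier(frontier, scores):
    """Sort candidate URLs by the score of their pattern, highest first.

    URLs whose pattern never appeared in the log get a score of 0.0.
    """
    return sorted(
        frontier,
        key=lambda u: scores.get(url_pattern(u), 0.0),
        reverse=True,
    )
```

With a log dominated by visits to `/thread/<id>` pages, a newly created thread URL inherits the high score of its pattern and is crawled ahead of URLs whose patterns users rarely or never visit, even though that exact URL was never seen in the log.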