The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Efficient crawling through URL ordering
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Focused crawling: a new approach to topic-specific Web resource discovery
WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment
Journal of the ACM (JACM)
Adaptive on-line page importance computation
WWW '03 Proceedings of the 12th international conference on World Wide Web
Topical web crawlers: Evaluating adaptive algorithms
ACM Transactions on Internet Technology (TOIT)
WWW '05 Proceedings of the 14th international conference on World Wide Web
Crawling a country: better strategies than breadth-first for web page ordering
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Learning to crawl: Comparing classification schemes
ACM Transactions on Information Systems (TOIS)
Structure-driven crawler generation by example
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
RankMass crawler: a crawler with high personalized pagerank coverage guarantee
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Crawl ordering by search impact
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
iRobot: an intelligent crawler for web forums
Proceedings of the 17th international conference on World Wide Web
Exploring traversal strategy for web forum crawling
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
The impact of crawl policy on web search effectiveness
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Foundations and Trends in Information Retrieval
A pattern tree-based approach to learning URL normalization rules
Proceedings of the 19th international conference on World wide web
Studying page life patterns in dynamical web
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Adscape: harvesting and analyzing online display ads
Proceedings of the 23rd international conference on World wide web
Hi-index | 0.00 |
To optimize the performance of web crawlers, various page importance measures have been studied to select and order URLs in crawling. Most sophisticated measures (e.g. breadth-first and PageRank) are based on link structure. In this paper, we treat the problem from another perspective and propose to measure page importance through mining user interest and behaviors from web browse logs. Unlike most existing approaches which work on single URL, in this paper, both the log mining and the crawl ordering are performed at the granularity of URL pattern. The proposed URL pattern-based crawl orderings are capable to properly predict the importance of newly created (unseen) URLs. Promising experimental results proved the feasibility of our approach.