Discovering URLs through user feedback

Authors:
Xiao Bai;B. Barla Cambazoglu;Flavio P. Junqueira
Affiliations:
Yahoo! Research, Barcelona, Spain;Yahoo! Research, Barcelona, Spain;Yahoo! Research, Barcelona, Spain
Venue:
Proceedings of the 20th ACM international conference on Information and knowledge management
Year:
2011

Abstract

Search engines rely on crawling to build their Web page collections. A Web crawler typically discovers new URLs by following the link structure induced by links on Web pages. Because the number of documents on the Web is large, discovering newly created URLs may take arbitrarily long, and depending on how a given page is connected to others, such a crawler may miss the page altogether. In this paper, we evaluate the benefits of integrating a passive URL discovery mechanism into a Web crawler. This mechanism is passive in the sense that it does not require the crawler to actively fetch documents from the Web to discover URLs. We focus here on a mechanism that uses toolbar data as a representative source for new URL discovery. We use the toolbar logs of Yahoo! to characterize the URLs that are accessed by users via their browsers but not discovered by the Yahoo! Web crawler. We show that a high fraction of URLs that appear in toolbar logs are not discovered by the crawler. We also show that a certain fraction of URLs are discovered by the crawler later than the time they are first accessed by users. One important conclusion of our work is that Web search engines can benefit substantially from user feedback in the form of toolbar logs for passive URL discovery.
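
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `compare_discovery`, the two dictionaries mapping each URL to the time it was first seen, the integer timestamps, and the example URLs are all assumptions made for illustration. The sketch computes the fraction of toolbar URLs the crawler never discovered and the fraction it discovered only after users had first accessed them.

```python
from typing import Dict, List, Tuple


def compare_discovery(
    toolbar_first_seen: Dict[str, int],
    crawler_first_seen: Dict[str, int],
) -> Tuple[float, float, List[str]]:
    """Compare URLs seen in toolbar logs against URLs found by a crawler.

    Both arguments are hypothetical inputs: each maps a URL to the
    timestamp at which it was first observed by that source.

    Returns (missed_fraction, late_fraction, missed_urls), where the
    fractions are relative to the number of distinct toolbar URLs.
    """
    total = len(toolbar_first_seen)
    if total == 0:
        return 0.0, 0.0, []

    # URLs users visited that the crawler never discovered.
    missed = [url for url in toolbar_first_seen if url not in crawler_first_seen]

    # URLs the crawler did discover, but only after the first user access.
    late = [
        url
        for url, user_time in toolbar_first_seen.items()
        if url in crawler_first_seen and crawler_first_seen[url] > user_time
    ]

    return len(missed) / total, len(late) / total, sorted(missed)


# Made-up example data; timestamps are arbitrary integers.
toolbar = {
    "http://example.com/a": 10,
    "http://example.com/b": 20,
    "http://example.com/c": 30,
    "http://example.com/d": 40,
}
crawler = {
    "http://example.com/a": 5,   # crawler found it before any user visit
    "http://example.com/b": 25,  # crawler found it after the first visit
    "http://example.com/e": 1,   # never visited by toolbar users
}

missed_fraction, late_fraction, missed_urls = compare_discovery(toolbar, crawler)
print(missed_fraction, late_fraction, missed_urls)
```

In a passive discovery setting, the URLs in `missed_urls` are the candidates that could be handed to the crawler without it having to fetch any additional pages to find them.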