Web search engines employ multiple crawlers to maintain local copies of web pages. Because page owners update these pages frequently, the crawlers must regularly revisit them to keep the local copies fresh. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be to minimize the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important metric, the search engine embarrassment level: the frequency with which a client makes a search engine query and then clicks on a returned URL only to find that the result is incorrect. The first part of our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one that supports a wide range of the complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems that are highly computationally efficient, which is crucial for practicality because the size of the problem in the web environment is immense. The second part takes these crawling frequencies and ideal crawl times as input and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. We analyze the update patterns of a highly accessed and highly dynamic web site to gain insight into the properties of page updates in practice. Based on this analysis, we then perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach.
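The frequency-selection step of the first part can be illustrated with a small sketch. It is not the paper's algorithm: it assumes each page is updated as a Poisson process, that crawls of a page are evenly spaced, and that the crawl budget is a whole number of crawls per period. Under those assumptions the expected freshness of a page is concave in its crawl count, so a standard greedy marginal-gain allocation from the resource allocation literature applies. The names `freshness`, `allocate_crawls`, `rates`, `weights`, and `budget` are hypothetical.

```python
import heapq
import math


def freshness(rate, crawls):
    """Expected long-run fraction of time a local copy is up to date.

    Assumed model (not taken from the paper): the page changes as a
    Poisson process with `rate` > 0 updates per period and is crawled
    `crawls` times per period at equal intervals.
    """
    if crawls == 0:
        return 0.0
    x = crawls / rate
    return x * (1.0 - math.exp(-1.0 / x))


def allocate_crawls(rates, weights, budget):
    """Greedily assign `budget` crawls to pages, one crawl at a time.

    Each step gives the next crawl to the page with the largest weighted
    gain in freshness. Because freshness() is concave in the crawl count,
    this greedy rule maximizes the total weighted freshness.
    """
    counts = [0] * len(rates)
    heap = []  # entries are (-marginal_gain, page_index)
    for i, (rate, weight) in enumerate(zip(rates, weights)):
        gain = weight * (freshness(rate, 1) - freshness(rate, 0))
        heapq.heappush(heap, (-gain, i))
    if not heap:
        return counts
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        gain = weights[i] * (
            freshness(rates[i], counts[i] + 1) - freshness(rates[i], counts[i])
        )
        heapq.heappush(heap, (-gain, i))
    return counts


if __name__ == "__main__":
    # Three hypothetical pages: update rates per period and importance weights.
    print(allocate_crawls([0.5, 2.0, 10.0], [1.0, 1.0, 1.0], 12))
```

Setting all weights to 1 corresponds to minimizing average staleness. Weights proportional to how often a page is returned and clicked would be one rough stand-in for an embarrassment-style metric, though the paper's own formulation is more general than this sketch.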