Monitoring the dynamic web to respond to continuous queries

Authors:
Sandeep Pandey;Krithi Ramamritham;Soumen Chakrabarti
Affiliations:
Indian Institute of Technology, Powai, Mumbai, India;Indian Institute of Technology, Powai, Mumbai, India;Indian Institute of Technology, Powai, Mumbai, India
Venue:
WWW '03 Proceedings of the 12th international conference on World Wide Web
Year:
2003

Abstract

Continuous queries are queries whose responses must be updated continuously as the sources of interest change. Such queries arise, for instance, in on-line decision making, e.g., traffic flow control and weather monitoring. The problem of keeping the responses current reduces to deciding how often to visit a source to determine whether and how it has been modified, so that earlier responses can be updated accordingly. On the surface, this seems similar to the crawling problem, since crawlers attempt to keep indexes up to date as pages change and users pose search queries. We show that this is not the case, because of both the inherent differences in the nature of the two problems and the performance metric. We propose, develop, and evaluate a novel multi-phase solution, Continuous Adaptive Monitoring (CAM), to the problem of maintaining the currency of query results. Some of the important phases are as follows. In the tracking phase, changes to an initially identified set of relevant pages are tracked. From the observed change characteristics of these pages, a probabilistic model of their change behavior is formulated, and weights are assigned to pages to denote their importance for the current queries. In the next phase, the resource allocation phase, the resources needed to continuously monitor these pages for changes are allocated based on these statistics. Given these resource allocations, the scheduling phase produces an optimal achievable schedule for the monitoring tasks. An experimental evaluation of our approach against prior approaches for crawling dynamic web pages shows the effectiveness of CAM for monitoring dynamic changes. For example, by monitoring just 5% of the page changes, CAM is able to return 90% of the changed information to the users.
The experiments also yield some interesting observations about the differences between crawling, whose goal is to build an index, and change tracking, whose goal is to respond to continuous queries.
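
The three phases named in the abstract (tracking, resource allocation, scheduling) can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration rather than CAM's actual method: the `Page` record, the Laplace-smoothed change estimate, the allocation proportional to weight times change probability, and the evenly spaced schedule are all hypothetical stand-ins for the paper's probabilistic model and optimal allocation and scheduling.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical record of what a tracking phase might collect."""
    url: str
    weight: float      # importance for the current queries (assumed given)
    changes_seen: int  # tracking visits that found a change
    visits: int        # total tracking visits


def estimate_change_prob(page: Page) -> float:
    # Laplace-smoothed fraction of visits that observed a change
    # (an assumed estimator, not the paper's model).
    return (page.changes_seen + 1) / (page.visits + 2)


def allocate(pages: list[Page], budget: int) -> dict[str, int]:
    """Split `budget` monitoring visits in proportion to weight * change prob."""
    scores = [p.weight * estimate_change_prob(p) for p in pages]
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one page needs a positive weight")
    ideal = [budget * s / total for s in scores]
    alloc = [int(x) for x in ideal]
    # Hand out visits lost to rounding, largest fractional part first.
    leftover = budget - sum(alloc)
    order = sorted(range(len(pages)),
                   key=lambda i: ideal[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return {p.url: a for p, a in zip(pages, alloc)}


def schedule(allocation: dict[str, int], horizon: float) -> list[tuple[float, str]]:
    """Spread each page's visits evenly over [0, horizon), sorted by time."""
    events = []
    for url, n in allocation.items():
        for k in range(n):
            events.append(((k + 0.5) * horizon / n, url))
    return sorted(events)


pages = [
    Page("http://example.org/traffic", weight=1.0, changes_seen=8, visits=8),
    Page("http://example.org/archive", weight=1.0, changes_seen=0, visits=8),
]
allocation = allocate(pages, budget=10)
plan = schedule(allocation, horizon=60.0)
```

In this sketch the frequently changing page receives most of the visit budget, which mirrors the abstract's point that monitoring effort should follow observed change behavior and query importance rather than being spread uniformly.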