Optimizing content freshness of relations extracted from the web using keyword search

Authors:
Mohan Yang; Haixun Wang; Lipyeow Lim; Min Wang
Affiliations:
Shanghai Jiao Tong University, Shanghai, and Microsoft Research Asia, Beijing, China; Microsoft Research Asia, Beijing, China; University of Hawaii at Manoa, Honolulu, USA; HP Labs China, Beijing, China
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

An increasing number of applications operate on data obtained from the Web. These applications typically maintain local copies of the web data to avoid network latency in data accesses. As the data on the Web evolves, it is critical that the local copy be kept up to date. Data freshness is one of the most important data quality issues, and has been extensively studied for various applications including web crawling. However, web crawling is focused on obtaining as many raw web pages as possible. Our applications, on the other hand, are interested in specific content from specific data sources. Knowing the content or the semantics of the data enables us to differentiate data items based on their importance and volatility, which are key factors that impact the design of the data synchronization strategy. In this work, we formulate the concept of content freshness, and present a novel approach that maintains content freshness with the least amount of web communication. Specifically, we assume the data is accessible through a general keyword search interface, and we form keyword queries based on their selectivity, as well as their contribution to the content freshness of the local copy. Experiments show the effectiveness of our approach compared with several naive methods for keeping data fresh.
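
The abstract does not give the selection algorithm, so the following is only a rough sketch of the general idea of choosing keyword queries by selectivity and freshness contribution. It is not the authors' method. The names `Candidate`, `result_size`, `stale_weight`, and `pick_queries`, the greedy ratio rule, and the communication budget are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate keyword query (all fields are assumptions)."""
    keyword: str
    result_size: int     # assumed number of records returned (selectivity proxy)
    stale_weight: float  # assumed importance-weighted staleness it would refresh


def pick_queries(candidates: List[Candidate], budget: int) -> List[str]:
    """Greedily pick queries with the highest assumed freshness gain per
    returned record, skipping any query that would exceed the budget."""
    ranked = sorted(
        candidates,
        key=lambda c: c.stale_weight / max(c.result_size, 1),
        reverse=True,
    )
    chosen: List[str] = []
    used = 0
    for c in ranked:
        if used + c.result_size <= budget:
            chosen.append(c.keyword)
            used += c.result_size
    return chosen


# Illustrative, made-up numbers: a selective query that refreshes many
# stale records is preferred over a broad query with low gain per record.
example = [
    Candidate("sigmod", result_size=10, stale_weight=8.0),
    Candidate("data", result_size=100, stale_weight=20.0),
    Candidate("hawaii", result_size=5, stale_weight=1.5),
]
selected = pick_queries(example, budget=20)
```

In this sketch the broad query "data" is skipped because its result size alone exceeds the assumed budget of 20 records, while the two selective queries fit. A real system would have to estimate both quantities, since neither is known before a query is issued.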