Modeling and Managing Content Changes in Text Databases
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not evolve over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this article, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use “survival analysis” techniques in general, and Cox's proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the quality of our schedules experimentally over real web databases.
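To make the scheduling idea concrete, here is a minimal pure-Python sketch under a simplifying assumption: each database's content summary becomes stale according to an exponential (memoryless) change model, which is a special case of the survival-analysis framework the article uses (the article's actual model is the richer Cox proportional hazards regression, which conditions the hazard on database features). The function names and the 50% freshness threshold are illustrative, not from the article.

```python
import math

def estimate_change_rate(num_changes, observation_weeks):
    """Maximum-likelihood estimate of the change rate (lambda, changes/week)
    under an exponential change model, from an observation window."""
    return num_changes / observation_weeks

def survival(t, rate):
    """Probability the summary is still fresh t weeks after an update:
    S(t) = exp(-lambda * t)."""
    return math.exp(-rate * t)

def next_update_in(rate, freshness_threshold=0.5):
    """Weeks until the freshness probability S(t) drops below the
    threshold: solve exp(-lambda * t) = threshold for t."""
    return -math.log(freshness_threshold) / rate

# Example: a database whose summary was observed to change 4 times
# over a 52-week study period (numbers are hypothetical).
rate = estimate_change_rate(4, 52)   # ~0.077 changes per week
t = next_update_in(rate, 0.5)        # ~9 weeks until P(fresh) < 50%
```

A scheduler built on this sketch would contact each database only when its predicted freshness falls below the threshold, which is the "contact databases only when needed" behavior the abstract describes; the Cox model refines this by letting the hazard rate vary across databases.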