Effectively Mining and Using Coverage and Overlap Statistics for Data Integration

Authors:
Zaiqing Nie; Subbarao Kambhampati; Ullas Nambiar
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge in learning such statistics is keeping the number of needed statistics low enough that the storage and learning costs remain manageable. In this paper, we present a set of connected techniques that estimate the coverage and overlap statistics, while keeping the needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method, and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics over both controlled data sources and in the context of BibFinder with autonomous online sources.
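
The idea of learning statistics only at the resolution that the query workload supports can be illustrated with a small sketch. This is not the algorithm from the paper. It is a simplified illustration under stated assumptions: the class hierarchy `PARENT`, the log format, the `min_support` threshold, and the working definitions of coverage (fraction of a class's distinct answers that a source returns) and overlap (fraction returned by both sources of a pair) are all hypothetical choices made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical query-class hierarchy: each class maps to its parent (None = root).
PARENT = {"any": None, "db": "any", "ai": "any", "vldb": "db", "sigmod": "db"}


def ancestors(cls):
    """Yield cls, then each of its ancestors up to the root."""
    while cls is not None:
        yield cls
        cls = PARENT[cls]


def learn_stats(log, min_support):
    """Learn coverage/overlap per query class, keeping only frequent classes.

    log: list of (query_class, {source_name: set_of_answer_ids}).
    A non-root class is kept only if at least min_support of all logged
    queries fall under it (assumed threshold rule for this sketch).
    """
    n_queries = defaultdict(int)
    tuples = defaultdict(lambda: defaultdict(set))
    for qid, (cls, results) in enumerate(log):
        for c in ancestors(cls):  # every query also counts for its ancestors
            n_queries[c] += 1
            for src, ids in results.items():
                # Tag answers with the query id so different queries never collide.
                tuples[c][src].update((qid, t) for t in ids)

    stats = {}
    for c, per_source in tuples.items():
        if PARENT[c] is not None and n_queries[c] / len(log) < min_support:
            continue  # too rare: fall back to a coarser class at lookup time
        universe = set().union(*per_source.values())
        if not universe:
            continue
        entry = {(s,): len(ids) / len(universe) for s, ids in per_source.items()}
        for a, b in combinations(sorted(per_source), 2):
            entry[(a, b)] = len(per_source[a] & per_source[b]) / len(universe)
        stats[c] = entry
    return stats


def lookup(stats, cls):
    """Return (class_used, statistics) from the most specific class that was kept."""
    for c in ancestors(cls):
        if c in stats:
            return c, stats[c]
    raise KeyError(cls)


# Toy log with two hypothetical sources, S1 and S2.
log = [
    ("vldb", {"S1": {1, 2, 3}, "S2": {3, 4}}),
    ("vldb", {"S1": {1, 2}, "S2": {2}}),
    ("sigmod", {"S1": {1}, "S2": {1, 2, 3}}),
    ("ai", {"S2": {1, 2}}),
]
stats = learn_stats(log, min_support=0.5)
print(lookup(stats, "vldb")[0])    # frequent class: answered from its own statistics
print(lookup(stats, "sigmod")[0])  # rare class: answered from an ancestor's statistics
```

In this sketch the threshold trades accuracy for storage: raising `min_support` keeps fewer classes, so more queries are answered from coarser ancestor statistics.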