Recent work in data integration has shown that statistical information about the coverage and overlap of sources is important for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. The key challenge is keeping the number of statistics low enough that storage and learning costs remain manageable; naive approaches quickly become infeasible. In this paper we present a set of connected techniques that estimate coverage and overlap statistics while keeping the number of needed statistics tightly under control. Our approach uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We describe the details of our method and present experimental results demonstrating the efficiency of the learning algorithms and the effectiveness of the learned statistics.
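To make the idea of threshold-controlled resolution concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a hypothetical query log in which each entry carries a path in a query class hierarchy and the answer sets returned by each source. The names `learn_statistics`, `resolve_class`, and `min_count` are illustrative assumptions. Query classes seen fewer than `min_count` times are rolled up into their nearest frequent ancestor, so statistics are kept only at a resolution the log can support.

```python
from collections import Counter, defaultdict
from itertools import combinations


def resolve_class(path, counts, min_count):
    """Walk up the hierarchy until the class is frequent enough.

    The root class () is always accepted, whatever its count.
    """
    while path and counts[path] < min_count:
        path = path[:-1]
    return path


def learn_statistics(log, min_count):
    """Estimate coverage and pairwise overlap per query class.

    log: list of (class_path, {source_name: set_of_answer_ids}).
    Returns {(class_path, frozenset_of_sources): fraction_of_answers}.
    Assumption: a query's "full" answer set is the union over all sources.
    """
    # Count how often each class, and each of its ancestors, is queried.
    counts = Counter()
    for path, _ in log:
        for depth in range(len(path) + 1):
            counts[path[:depth]] += 1

    totals = defaultdict(int)  # class -> answers seen across its queries
    hits = defaultdict(int)    # (class, sources) -> answers given by all of them
    for path, answers_by_source in log:
        cls = resolve_class(path, counts, min_count)
        totals[cls] += len(set().union(*answers_by_source.values()))
        names = sorted(answers_by_source)
        for name in names:
            hits[cls, frozenset([name])] += len(answers_by_source[name])
        for a, b in combinations(names, 2):
            shared = answers_by_source[a] & answers_by_source[b]
            hits[cls, frozenset([a, b])] += len(shared)

    return {key: n / totals[key[0]] for key, n in hits.items() if totals[key[0]]}


# Hypothetical log: two queries in a frequent class, one in a rare class.
example_log = [
    (("db", "vldb"), {"S1": {1, 2, 3}, "S2": {3, 4}}),
    (("db", "vldb"), {"S1": {5}, "S2": {5, 6}}),
    (("db", "sigmod"), {"S1": {7, 8}, "S2": set()}),
]
example_stats = learn_statistics(example_log, min_count=2)
```

In this sketch the rare class `("db", "sigmod")` gets no entry of its own; its observations are attributed to the parent class `("db",)`. Raising `min_count` shrinks the number of stored statistics at the cost of coarser estimates, which is the storage versus accuracy trade-off the abstract describes. Only single sources and pairs are tracked here; larger source sets are left out for brevity.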