BibFinder/StatMiner: effectively mining and using coverage and overlap statistics in data integration

Authors:
Zaiqing Nie; Subbarao Kambhampati; Thomas Hernandez
Affiliations:
Department of Computer Science and Engineering, Arizona State University, Tempe, AZ; Department of Computer Science and Engineering, Arizona State University, Tempe, AZ; Department of Computer Science and Engineering, Arizona State University, Tempe, AZ
Venue:
VLDB '03 Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29
Year:
2003

Abstract

Recent work in data integration has shown the importance of statistical information about the coverage and overlap of sources for efficient query processing. Despite this recognition, there are no effective approaches for learning the needed statistics. In this paper, we present StatMiner, a system for estimating coverage and overlap statistics while keeping the amount of needed statistics tightly under control. StatMiner uses a hierarchical classification of the queries and threshold-based variants of familiar data mining techniques to dynamically decide the level of resolution at which to learn the statistics. We will demonstrate the major functionalities of StatMiner and the effectiveness of the learned statistics in BibFinder, a publicly available computer science bibliography mediator that we developed. The sources that BibFinder integrates are autonomous and can have uncontrolled coverage and overlap. An important focus in BibFinder was thus to mine coverage and overlap statistics about these sources and to exploit them to improve query processing.
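
The idea of learning statistics at a threshold-controlled level of resolution can be illustrated with a minimal Python sketch. This is not StatMiner's actual algorithm: the class hierarchy, the source names `src_a` and `src_b`, the log format, the `min_support` parameter, and the set-based definitions of coverage and overlap are all assumptions made for illustration. Statistics are kept only for query classes that account for a large enough share of logged queries, and a lookup for a rarer class falls back to its nearest ancestor that has statistics.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical query-class hierarchy (child -> parent); "ANY" is the root.
PARENT = {"VLDB": "Databases", "SIGMOD": "Databases", "Databases": "ANY",
          "AAAI": "AI", "AI": "ANY"}

def ancestors(cls):
    """Yield cls itself, then each ancestor up to the root."""
    while cls is not None:
        yield cls
        cls = PARENT.get(cls)

def mine_statistics(log, min_support):
    """log: list of (query_class, {source: set of answer ids}).
    Keeps statistics only for classes whose share of logged queries
    is at least min_support (the assumed pruning threshold)."""
    counts = defaultdict(int)
    answers = defaultdict(lambda: defaultdict(set))
    for cls, per_source in log:
        for c in ancestors(cls):          # a query counts for every ancestor
            counts[c] += 1
            for src, ids in per_source.items():
                answers[c][src] |= ids
    stats = {}
    for c, n in counts.items():
        if n / len(log) < min_support:
            continue                      # too rare: no statistics stored
        universe = set().union(*answers[c].values())
        if not universe:
            continue
        # Coverage: fraction of all known answers that a source returns.
        coverage = {s: len(ids) / len(universe)
                    for s, ids in answers[c].items()}
        # Overlap: fraction of all known answers that both sources return.
        overlap = {(a, b): len(answers[c][a] & answers[c][b]) / len(universe)
                   for a, b in combinations(sorted(answers[c]), 2)}
        stats[c] = {"coverage": coverage, "overlap": overlap}
    return stats

def lookup(stats, cls):
    """Return (class used, statistics) for the nearest class with statistics."""
    for c in ancestors(cls):
        if c in stats:
            return c, stats[c]
    return None, None
```

Under these assumptions, a mediator could call `lookup(stats, "SIGMOD")` for an incoming query and, if `SIGMOD` was pruned, receive the coarser `Databases` statistics instead. Raising `min_support` stores fewer, coarser statistics; lowering it stores more, finer ones.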