Explore or exploit?: effective strategies for disambiguating large databases

Authors:
Reynold Cheng;Eric Lo;Xuan S. Yang;Ming-Hay Luk;Xiang Li;Xike Xie
Affiliations:
University of Hong Kong, Hong Kong;Hong Kong Polytechnic University, Kowloon, Hong Kong;University of Hong Kong, Hong Kong;Hong Kong Polytechnic University, Kowloon, Hong Kong;University of Hong Kong, Hong Kong;University of Hong Kong, Hong Kong
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Abstract

Data ambiguity is inherent in applications such as data integration, location-based services, and sensor monitoring. In many situations, it is possible to "clean", or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement error, but context information (e.g., what the user is doing) can be used to reduce the imprecision of the location value. To obtain a higher-quality database, we study how to disambiguate a database by appropriately selecting candidates to clean. This problem is challenging because cleaning incurs a cost, is limited by a budget, may fail, and may not remove all ambiguities. Moreover, the statistical information about how likely database objects are to be cleaned successfully may not be precisely known. We tackle these challenges by proposing two types of algorithms. The first type uses greedy heuristics to make sensible decisions; however, these algorithms do not use information gathered during cleaning, and they require user-supplied parameters to achieve high cleaning effectiveness. The second type is the Explore-Exploit (EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE to achieve optimal cleaning effectiveness. Experimental evaluations on real and synthetic datasets validate the effectiveness and efficiency of our approaches.
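
The explore-then-exploit idea summarized in the abstract can be illustrated with a toy sketch. This is not the paper's EE algorithm, whose details the abstract does not give. The grouping of objects, the `try_clean` callback, the fixed `explore_budget` split, and the success-rate-per-cost score are all hypothetical choices made only for illustration.

```python
from typing import Callable, Dict


def explore_then_exploit(
    costs: Dict[str, int],
    try_clean: Callable[[str], bool],
    budget: int,
    explore_budget: int,
) -> Dict[str, int]:
    """Toy budgeted cleaning loop (illustrative assumption, not the paper's EE).

    costs          -- hypothetical cost of one cleaning attempt per object group
    try_clean      -- hypothetical callback: attempts one cleaning, True on success
    budget         -- total cleaning budget
    explore_budget -- part of the budget reserved for gathering statistics
    Returns the number of successful cleanings per group.
    """
    explore_budget = min(explore_budget, budget)
    attempts = {g: 0 for g in costs}
    successes = {g: 0 for g in costs}
    spent = 0

    def attempt(group: str) -> None:
        nonlocal spent
        spent += costs[group]
        attempts[group] += 1
        if try_clean(group):
            successes[group] += 1

    # Explore: round-robin over groups until the exploration budget runs out,
    # so that every affordable group gets some observed success statistics.
    progressed = True
    while progressed:
        progressed = False
        for group in costs:
            if spent + costs[group] <= explore_budget:
                attempt(group)
                progressed = True

    # Exploit: spend what is left on the group with the best observed
    # success rate per unit cost (groups never tried score 0).
    def score(group: str) -> float:
        if attempts[group] == 0:
            return 0.0
        return successes[group] / attempts[group] / costs[group]

    best = max(costs, key=score)
    while spent + costs[best] <= budget:
        attempt(best)

    return successes


# Deterministic demo: group "a" never cleans successfully, group "b" always does.
result = explore_then_exploit(
    costs={"a": 1, "b": 1},
    try_clean=lambda group: group == "b",
    budget=10,
    explore_budget=4,
)
print(result)  # {'a': 0, 'b': 8}
```

In the demo, four units of budget are spent learning that only group "b" can be cleaned, and the remaining six units are then invested in "b" alone. A purely greedy strategy with no such statistics would have no basis for preferring "b" over "a".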