Cleaning uncertain data with quality guarantees

Authors:
Reynold Cheng;Jinchuan Chen;Xike Xie
Affiliations:
The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong;The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong;The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Abstract

Uncertain or imprecise data are pervasive in applications such as location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that only a limited amount of resources is available to "clean" the database (e.g., by probing some sensors to obtain their latest values), we address the problem of choosing the set of uncertain objects to be cleaned so as to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of each tuple independently of other tuples (e.g., range queries); and (2) queries that require knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution that achieves an optimal improvement in PWS-quality. Other fast heuristics are presented as well. Experiments performed on both real and synthetic datasets show that the PWS-quality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To the best of our knowledge, this is the first work that develops a quality metric for a probabilistic database and investigates how such a metric can be used for data cleaning.
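
To make the idea of measuring answer ambiguity under the possible world semantics more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the formula for PWS-quality, so the sketch assumes one plausible choice: the negated Shannon entropy of the distribution of query answers across possible worlds. The data model (independent objects, each with a list of alternative values) and all function names are hypothetical, and the brute-force enumeration is exponential, so it is only suitable for toy inputs.

```python
import itertools
import math


def possible_worlds(db):
    """Yield (world, probability) pairs.

    db maps an object id to a list of (value, probability) alternatives.
    Objects are assumed independent (an assumption of this sketch).
    """
    ids = sorted(db)
    for combo in itertools.product(*(db[i] for i in ids)):
        prob = math.prod(p for _, p in combo)
        yield {i: v for i, (v, _) in zip(ids, combo)}, prob


def answer_distribution(db, query):
    """Sum world probabilities per distinct query answer."""
    dist = {}
    for world, prob in possible_worlds(db):
        ans = query(world)
        dist[ans] = dist.get(ans, 0.0) + prob
    return dist


def quality(db, query):
    """Assumed metric: negated entropy (base 2) of the answer distribution.

    0 means the answer is unambiguous; more negative means more ambiguous.
    """
    dist = answer_distribution(db, query)
    return sum(p * math.log2(p) for p in dist.values() if p > 0)


def max_query(world):
    """Return the id of the object holding the largest value."""
    return max(world, key=world.get)


def expected_quality_after_cleaning(db, query, obj):
    """Expected quality if obj is cleaned, i.e. its true value is revealed."""
    total = 0.0
    for value, prob in db[obj]:
        cleaned = dict(db)
        cleaned[obj] = [(value, 1.0)]  # obj becomes certain
        total += prob * quality(cleaned, query)
    return total


# Toy database: "a" is uncertain, "b" is already certain.
db = {
    "a": [(10, 0.5), (30, 0.5)],
    "b": [(20, 1.0)],
}

before = quality(db, max_query)
gain_a = expected_quality_after_cleaning(db, max_query, "a") - before
gain_b = expected_quality_after_cleaning(db, max_query, "b") - before
```

In this toy database, cleaning "a" removes all ambiguity about the MAX answer, while cleaning "b" changes nothing because "b" is already certain. A budget-limited cleaner would therefore prefer "a". How the paper selects the optimal set of objects in polynomial time is not described in the abstract and is not reproduced here.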