Towards correcting input data errors probabilistically using integrity constraints

Authors:
Nodira Khoussainova;Magdalena Balazinska;Dan Suciu
Affiliations:
University of Washington, Seattle, WA;University of Washington, Seattle, WA;University of Washington, Seattle, WA
Venue:
MobiDE '06 Proceedings of the 5th ACM international workshop on Data engineering for wireless and mobile access
Year:
2006

Abstract

Mobile and pervasive applications frequently rely on devices such as RFID antennas or sensors (light, temperature, motion) to provide them with information about the physical world. These devices, however, are unreliable: they produce streams of information in which portions of the data may be missing, duplicated, or erroneous. The current state of the art is to correct errors locally (e.g., range constraints for temperature readings) or to use spatial/temporal correlations (e.g., smoothing temperature readings). However, errors are often apparent only in a global setting, e.g., missed readings of objects that are known to be present, or exit readings from a parking garage without matching entry readings.

In this paper, we present StreamClean, a system for correcting input data errors automatically using application-defined global integrity constraints. Because it is frequently impossible to make corrections with certainty, we propose a probabilistic approach, where the system assigns to each input tuple the probability that it is correct.

We show that StreamClean handles a large class of input data errors and corrects them sufficiently fast to keep up with the input rates of many mobile and pervasive applications. We also show that the probabilities assigned by StreamClean correspond to a user's intuitive notion of correctness.