Handling Uncertain Data in Array Database Systems

Authors:
Tingjian Ge; Stan Zdonik
Affiliations:
Brown University, tige@cs.brown.edu; Brown University, sbz@cs.brown.edu
Venue:
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Year:
2008

Cited by (7):
Mining data streams with periodically changing distributions. Proceedings of the 18th ACM Conference on Information and Knowledge Management
PODS: a new model and processing algorithms for uncertain data streams. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data
A*-tree: a structure for storage and modeling of uncertain multidimensional arrays. Proceedings of the VLDB Endowment
Conditioning and aggregating uncertain data streams: going beyond expectations. Proceedings of the VLDB Endowment
CLARO: modeling and processing uncertain data streams. The VLDB Journal
Xtream: a system for continuous querying over uncertain data streams. SUM'12 Proceedings of the 6th International Conference on Scalable Uncertainty Management
Query execution timing: taming real-time anytime queries on multicore processors. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management

Abstract

Scientific and intelligence applications have special data handling needs. In these settings, data does not fit the standard model of short coded records that has dominated the data management area for three decades. Array database systems have a specialized architecture to address this problem. Since the data is typically an approximation of reality, it is important to handle imprecision and uncertainty in an efficient and provably accurate way. We propose a discrete approach to value distributions and adopt a standard metric from probability theory, the variation distance, to measure the quality of a result distribution. We then propose a novel algorithm with a provable upper bound on the variation distance between its result distribution and the "ideal" one. Complementing this, we advocate the use of a "statistical mode" that is suitable for the results of many queries and applications and is also much more efficient to execute. We also show that the statistical mode enables interesting predicate evaluation strategies. Finally, we evaluate our algorithms through extensive experiments on real-world datasets.
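
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of the variation distance between two discrete distributions: half the sum of the absolute differences of their probabilities over all outcomes. The dictionary representation, the function name variation_distance, and the example distributions are assumptions made for illustration; the abstract does not describe how the paper stores distributions or how its algorithm works.

```python
def variation_distance(p, q):
    """Variation distance between two discrete distributions.

    p and q are dicts mapping an outcome to its probability
    (an illustrative representation, not taken from the paper).
    Outcomes missing from a dict are treated as probability 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


# Hypothetical "ideal" result distribution and an approximation of it.
ideal = {1: 0.5, 2: 0.5}
approx = {1: 0.25, 2: 0.5, 3: 0.25}

print(variation_distance(ideal, approx))
```

The distance is 0 when the two distributions are identical and 1 when their supports are disjoint, so an upper bound on it limits how far an approximate result distribution can stray from the ideal one.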