AN EFFICIENT REPRESENTATION MODEL OF DISTANCE DISTRIBUTION BETWEEN UNCERTAIN OBJECTS

Authors:
Edward Hung;Lurong Xiao;Regant Y.S. Hung
Affiliations:
Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong;Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong;Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong
Venue:
Computational Intelligence
Year:
2012

Abstract

In this paper, we consider the problem of efficiently computing the distance between uncertain objects. In many real-life applications, data such as sensor readings and weather forecasts are uncertain when they are collected or produced. An uncertain object has a probability distribution function (PDF) that represents the probability that it is actually located at a particular location. Fast and accurate distance computation between uncertain objects is important to many uncertain query evaluation tasks (e.g., range queries and nearest-neighbor queries) and uncertain data mining tasks (e.g., classification, clustering, and outlier detection). However, existing approaches involve distance computations between samples of two objects, which is very computationally intensive. On one hand, it is expensive to calculate and store the actual distribution of the possible distance values between two uncertain objects. On the other hand, the expected distance (the weighted average of the pairwise distances among samples of two uncertain objects) provides very limited information and also restricts the definitions and usefulness of queries and mining tasks. We propose several approaches to calculate the mean of the actual distance distribution and to approximate its variance. Based on these, we suggest that the actual distance distribution can be approximated using a standard distribution such as a Gaussian or Gamma distribution. Experiments on real and synthetic data show that our approach produces an approximation in a very short time with acceptable accuracy (about 90%). We suggest that it is practical for the research community to define and develop more powerful queries and data mining tasks based on the distance distribution instead of the expected distance. © 2012 Wiley Periodicals, Inc.
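
The idea summarized in the abstract (obtain the mean and variance of the distance distribution, then fit a standard distribution to them) can be illustrated with a minimal Python sketch. This is not the authors' method: it uses the brute-force pairwise computation over samples that the abstract describes as expensive, and it fits the Gaussian and Gamma parameters by simple moment matching, which is an assumption made here for illustration. The function names (pairwise_distance_moments, gamma_from_moments, gaussian_cdf) and the Euclidean distance metric are likewise hypothetical choices, not taken from the paper.

```python
import math


def pairwise_distance_moments(samples_a, samples_b, weights_a=None, weights_b=None):
    """Brute-force mean and variance of the distance between two sampled objects.

    Each object is a list of sample points (tuples of coordinates) with optional
    sample probabilities; uniform weights are assumed when none are given.
    """
    n, m = len(samples_a), len(samples_b)
    wa = weights_a if weights_a is not None else [1.0 / n] * n
    wb = weights_b if weights_b is not None else [1.0 / m] * m

    mean = 0.0    # first moment: the expected distance
    second = 0.0  # second moment: expected squared distance
    for p, wp in zip(samples_a, wa):
        for q, wq in zip(samples_b, wb):
            d = math.dist(p, q)  # Euclidean distance (assumed metric)
            w = wp * wq          # joint weight, assuming independent objects
            mean += w * d
            second += w * d * d

    variance = max(second - mean * mean, 0.0)  # clamp tiny negative rounding error
    return mean, variance


def gamma_from_moments(mean, variance):
    """Moment-matched Gamma parameters: shape k = mean^2 / var, scale = var / mean."""
    return mean * mean / variance, variance / mean


def gaussian_cdf(x, mean, variance):
    """P(distance <= x) under a Gaussian with the given mean and variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


if __name__ == "__main__":
    # Two toy uncertain objects: one with a single sample, one with two samples.
    a = [(0.0, 0.0)]
    b = [(3.0, 4.0), (6.0, 8.0)]
    mean, var = pairwise_distance_moments(a, b)
    shape, scale = gamma_from_moments(mean, var)
    print("mean:", mean, "variance:", var)
    print("gamma shape:", shape, "gamma scale:", scale)
    print("P(distance <= mean) under Gaussian:", gaussian_cdf(mean, mean, var))
```

With a fitted distribution in hand, a query such as "is the distance below a threshold r with probability at least p" can be answered from the approximate CDF rather than from the expected distance alone, which is the kind of richer query the abstract argues for.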