Robust space transformations for distance-based operations

Authors:
Edwin M. Knorr;Raymond T. Ng;Ruben H. Zamar
Affiliations:
Univ. of British Columbia, Vancouver, BC Canada;Univ. of British Columbia, Vancouver, BC Canada;Univ. of British Columbia, Vancouver, BC Canada
Venue:
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2001

Citing 14
Cited 8

Robust regression and outlier detection

Robust regression and outlier detection
The R*-tree: an efficient and robust access method for points and rectangles

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Efficient and effective querying by image content

Journal of Intelligent Information Systems - Special issue: advances in visual information management systems
Distance-based indexing for high-dimensional metric spaces

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Automatic subspace clustering of high dimensional data for data mining applications

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Density-based indexing for approximate nearest-neighbor queries

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
LOF: identifying density-based local outliers

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Efficient algorithms for mining outliers from large data sets

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Algorithms for Mining Distance-Based Outliers in Large Datasets

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Efficient and Effective Clustering Methods for Spatial Data Mining

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Efficient User-Adaptable Similarity Search in Large Multimedia Databases

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
STING: A Statistical Information Grid Approach to Spatial Data Mining

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases

Making every bit count: fast nonlinear axis scaling

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Privacy-preserving k-means clustering over vertically partitioned data

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Computation of Minimum-Volume Covering Ellipsoids

Operations Research
Scale-invariant clustering with minimum volume ellipsoids

Computers and Operations Research
A fuzzy logic-based method for outliers detection

AIAP'07 Proceedings of the 25th conference on Proceedings of the 25th IASTED International Multi-Conference: artificial intelligence and applications
Outlier Detection Based on the Distribution of Distances between Data Points

Informatica
Effective Spatial Characterization System Using Density-Based Clustering

ICCS '07 Proceedings of the 7th international conference on Computational Science, Part II
Probability Based Metrics for Locally Weighted Naive Bayes

CAI '07 Proceedings of the 20th conference of the Canadian Society for Computational Studies of Intelligence on Advances in Artificial Intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

For many KDD operations, such as nearest neighbor search, distance-based clustering, and outlier detection, there is an underlying &kgr;-D data space in which each tuple/object is represented as a point in the space. In the presence of differing scales, variability, correlation, and/or outliers, we may get unintuitive results if an inappropriate space is used.The fundamental question that this paper addresses is: "What then is an appropriate space?" We propose using a robust space transformation called the Donoho-Stahel estimator. In the first half of the paper, we show the key properties of the estimator. Of particular importance to KDD applications involving databases is the stability property, which says that in spite of frequent updates, the estimator does not: (a) change much, (b) lose its usefulness, or (c) require re-computation. In the second half, we focus on the computation of the estimator for high-dimensional databases. We develop randomized algorithms and evaluate how well they perform empirically. The novel algorithm we develop called the Hybrid-random algorithm is, in most cases, at least an order of magnitude faster than the Fixed-angle and Subsampling algorithms.