Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets

Authors:
George Kollios; Dimitrios Gunopulos; Nick Koudas; Stefan Berchtold
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2003

Abstract

We investigate density-biased sampling as a way to speed up general data mining tasks, such as clustering and outlier detection, in large multidimensional data sets. In density-biased sampling, the probability that a given point is included in the sample depends on the local density of the data set around that point. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest and can be tuned for specific data mining tasks. This yields greater flexibility and more accurate results than simple random sampling. We describe our approach in detail, evaluate it analytically, and show how it can be optimized for approximate clustering and outlier detection. Finally, we present a thorough experimental evaluation of the proposed method, applying density-biased sampling to real and synthetic data sets with both clustering and outlier detection algorithms, thus highlighting the utility of our approach.
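
The core idea in the abstract, an inclusion probability that depends on local density, can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions the abstract does not state: local density is approximated by counting points per cell of a uniform grid, the inclusion probability is taken proportional to that density raised to a tunable `exponent`, and the names `grid_cell`, `density_biased_sample`, `expected_size`, and `cell_width` are hypothetical.

```python
import random
from collections import Counter


def grid_cell(point, cell_width):
    """Map a point to the index tuple of the grid cell that contains it."""
    return tuple(int(coord // cell_width) for coord in point)


def density_biased_sample(points, expected_size, exponent, cell_width, rng):
    """Keep each point with probability proportional to density ** exponent.

    Assumption of this sketch: local density is the number of points that
    share the point's grid cell. exponent > 0 favors dense regions,
    exponent < 0 favors sparse regions, exponent == 0 is uniform sampling.
    """
    counts = Counter(grid_cell(p, cell_width) for p in points)
    weights = [counts[grid_cell(p, cell_width)] ** exponent for p in points]
    total = sum(weights)
    sample = []
    for point, weight in zip(points, weights):
        # Scale so the probabilities sum to expected_size, capped at 1.
        prob = min(1.0, expected_size * weight / total)
        if rng.random() < prob:
            sample.append(point)
    return sample


# Hypothetical data: one dense cluster plus ten isolated points.
rng = random.Random(0)
dense = [(rng.random(), rng.random()) for _ in range(1000)]
isolated = [(10.0 + 2 * i, 10.0 + 2 * i) for i in range(10)]
data = dense + isolated

# A negative exponent oversamples sparse regions. Here every isolated point
# gets a capped probability of 1, so all ten are kept.
sparse_biased = density_biased_sample(
    data, expected_size=20, exponent=-1, cell_width=1.0, rng=rng
)
kept_isolated = [p for p in sparse_biased if p in isolated]
```

In this sketch, a negative exponent suits outlier detection because sparse regions survive sampling, while a positive exponent concentrates the sample in dense regions, which may help a clustering algorithm ignore noise. A grid histogram is only one possible density estimate; any other estimator could supply the weights.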