Indexed-based density biased sampling for clustering applications

Authors:
Alexandros Nanopoulos;Yannis Theodoridis;Yannis Manolopoulos
Affiliations:
Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece;Department of Informatics, University of Piraeus, Greece;Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Venue:
Data & Knowledge Engineering
Year:
2006

Abstract

Density biased sampling (DBS) has been proposed to address the limitations of uniform sampling by producing the desired probability distribution in the sample. The ease of producing a random sample depends on the mechanism available for accessing the elements of the dataset. Existing DBS algorithms sample from flat files. In this paper, we develop a new method that exploits spatial indexes, and the local density information they preserve, to provide high-quality samples and fast access to the elements of the dataset. With the proposed method, accurate density estimates can be produced under varying skew, noise, and dimensionality. Moreover, a significant improvement in sampling time is attained. The performance of the proposed method is examined analytically and experimentally, and the comparative results illustrate its superiority over existing methods.
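To make the idea of density biased sampling concrete, the following is a minimal sketch of a generic grid-based variant that scans the whole dataset, in the style of the flat-file approaches the abstract contrasts against. It is not the index-based method proposed in the paper. The function name `density_biased_sample`, the parameters `cell_size`, `target_size`, and `e`, and the inclusion probability `alpha / n**e` (with `alpha` chosen so the expected sample size equals `target_size`) are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def density_biased_sample(points, cell_size, target_size, e=0.5, seed=0):
    """Grid-based density biased sampling (illustrative sketch only).

    Each point in a grid cell holding n points is kept with probability
    alpha / n**e, where alpha normalises the expected sample size to
    target_size. With e = 0 this reduces to uniform sampling; larger e
    under-samples dense cells and over-samples sparse ones.
    """
    rng = random.Random(seed)

    # Pass 1: assign every point to a grid cell to estimate local density.
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(p)

    # Normalisation so that sum over cells of n * (alpha / n**e) = target_size.
    norm = sum(len(members) ** (1.0 - e) for members in cells.values())
    alpha = target_size / norm if norm > 0 else 0.0

    # Pass 2: keep each point independently with its cell's probability.
    sample = []
    for members in cells.values():
        prob = min(1.0, alpha / len(members) ** e)
        sample.extend(p for p in members if rng.random() < prob)
    return sample
```

A call such as `density_biased_sample(points, cell_size=1.0, target_size=100, e=1.0)` would aim for roughly equal expected counts from every non-empty cell, regardless of how many points each cell holds. Note that this sketch needs a full pass over the data to count cell occupancies, which is the kind of cost an index that already preserves local density information could help avoid.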