Density biased sampling: an improved method for data mining and clustering

Authors:
Christopher R. Palmer; Christos Faloutsos
Affiliations:
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA; Computer Science Department, Carnegie Mellon University, Pittsburgh, PA
Venue:
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Year:
2000

Abstract

Data mining in large data sets often requires a sampling or summarization step to form an in-core representation of the data that can be processed more efficiently. Uniform random sampling is frequently used in practice, yet it is also frequently criticized because it tends to miss small clusters. Many natural phenomena are known to follow Zipf's distribution, so the inability of uniform sampling to find small clusters is of practical concern. Density Biased Sampling is proposed to probabilistically under-sample dense regions and over-sample light regions. A weighted sample is used to preserve the densities of the original data. Density biased sampling naturally includes uniform sampling as a special case. A memory-efficient algorithm is proposed that approximates density biased sampling using only a single scan of the data. We empirically evaluate density biased sampling using synthetic data sets that exhibit varying cluster size distributions, finding up to a factor-of-six improvement over uniform sampling.
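
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes density is estimated by counting points in fixed-size grid cells, that a point in a cell holding n points is kept with probability proportional to n^(-e) for a hypothetical bias exponent `e`, and that each kept point carries the weight 1/probability so the weighted sample still reflects the original densities. The names `density_biased_sample`, `cell_size`, `target_size`, and `e` are illustrative. Setting `e = 0` gives every point the same probability, which is the uniform special case. Unlike the single-scan method the abstract mentions, this sketch makes two passes (one to count, one to sample).

```python
import random
from collections import Counter


def density_biased_sample(points, cell_size, target_size, e=1.0, seed=0):
    """Return a list of (point, weight) pairs.

    points      -- sequence of equal-length numeric tuples
    cell_size   -- side length of the grid cells used to estimate density (assumption)
    target_size -- expected number of sampled points
    e           -- bias exponent (assumption); e = 0 reduces to uniform sampling
    """
    rng = random.Random(seed)

    def cell_of(p):
        # Map a point to the integer coordinates of its grid cell.
        return tuple(int(x // cell_size) for x in p)

    # Pass 1: count how many points fall in each cell.
    counts = Counter(cell_of(p) for p in points)

    # Choose alpha so that the expected sample size equals target_size:
    # sum over cells of n * alpha * n**(-e) = alpha * sum of n**(1 - e).
    alpha = target_size / sum(n ** (1.0 - e) for n in counts.values())

    # Pass 2: keep each point with a probability that shrinks as its cell gets denser.
    sample = []
    for p in points:
        n = counts[cell_of(p)]
        prob = min(1.0, alpha * n ** (-e))
        if rng.random() < prob:
            # Weight by the inverse inclusion probability to preserve densities.
            sample.append((p, 1.0 / prob))
    return sample
```

With `e = 1`, each occupied cell contributes the same expected number of sampled points regardless of how many points it holds, so a cluster of a few points is far less likely to be missed than under uniform sampling; the weights then restore the relative sizes of the clusters for any downstream weighted clustering step.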