Data mining in large data sets often requires a sampling or summarization step to form an in-core representation of the data that can be processed more efficiently. Uniform random sampling is frequently used in practice, and just as frequently criticized because it tends to miss small clusters. Many natural phenomena are known to follow Zipf's distribution, so the inability of uniform sampling to find small clusters is of practical concern. Density biased sampling is proposed to probabilistically under-sample dense regions and over-sample light regions. Each sampled point carries a weight so that the weighted sample preserves the densities of the original data. Density biased sampling naturally includes uniform sampling as a special case. A memory-efficient algorithm is proposed that approximates density biased sampling using only a single scan of the data. We empirically evaluate density biased sampling on synthetic data sets that exhibit varying cluster size distributions, finding up to a factor of six improvement over uniform sampling.
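
The idea of under-sampling dense regions while keeping a weighted sample can be illustrated with a minimal sketch. It is not the algorithm from the paper: the grid binning, the function name `density_biased_sample`, the exponent parameter `e`, and the two-pass structure are all assumptions made for illustration (the abstract describes a single-scan approximation, which this sketch does not attempt). A point in a grid cell holding n points is kept with probability proportional to n^(-e) and given weight 1/probability.

```python
import random
from collections import Counter


def density_biased_sample(points, cell_size, target_size, e=0.5, seed=0):
    """Illustrative two-pass density biased sampler (hypothetical helper).

    points      -- list of equal-length numeric tuples
    cell_size   -- side length of the grid cells used to estimate density
    target_size -- expected number of sampled points
    e           -- bias exponent: 0 gives uniform sampling, larger values
                   under-sample dense cells more strongly
    Returns a list of (point, weight) pairs.
    """
    rng = random.Random(seed)

    def cell(p):
        # Map a point to the integer coordinates of its grid cell.
        return tuple(int(x // cell_size) for x in p)

    # Pass 1: count how many points fall into each cell.
    counts = Counter(cell(p) for p in points)

    # Choose alpha so that the expected sample size equals target_size:
    # sum over cells of n * alpha * n**(-e) = target_size.
    alpha = target_size / sum(n ** (1 - e) for n in counts.values())

    # Pass 2: keep each point with a probability that shrinks with density.
    sample = []
    for p in points:
        prob = min(1.0, alpha * counts[cell(p)] ** (-e))
        if rng.random() < prob:
            # Weight 1/prob lets weighted statistics estimate the originals.
            sample.append((p, 1.0 / prob))
    return sample
```

In this sketch, setting `e=0` gives every point the same inclusion probability and the same weight, which corresponds to uniform sampling as a special case. With `e=1` each occupied cell contributes the same expected number of sampled points, so a small cluster is far less likely to be missed.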