We introduce fast algorithms for selecting a random sample of n records without replacement from a pool of N records, where the value of N is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z, which does the sampling in one pass using constant space and in O(n(1 + log(N/n))) expected time; this is optimal up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.
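For orientation, the problem setting can be illustrated with a minimal sketch of the basic one-pass reservoir method. This is not Algorithm Z: it draws one random number per record and so runs in O(N) time, whereas Algorithm Z reaches the stated bound by skipping over records. The function name `reservoir_sample` and its parameters are assumptions chosen for this illustration, and Python is used here rather than the paper's Pascal-like notation.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    records: Iterable[T], n: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of n records, without replacement,
    from a stream whose length N is not known in advance.

    Baseline illustration only (hypothetical helper, not the paper's
    Algorithm Z): every record costs one random draw.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for t, record in enumerate(records):
        if t < n:
            # The first n records fill the reservoir directly.
            reservoir.append(record)
        else:
            # Record number t+1 enters the sample with probability n/(t+1),
            # replacing a uniformly chosen current member.
            j = rng.randrange(t + 1)
            if j < n:
                reservoir[j] = record
    return reservoir
```

If the stream holds fewer than n records, the sketch simply returns all of them. Extra storage beyond the n-record reservoir is constant, matching the one-pass setting described above.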