Random sampling from hash files

Authors:
Frank Olken; Doron Rotem; Ping Xu
Affiliations:
Computer Science Research & Development Dept., Information and Computing Sciences Div., Lawrence Berkeley Laboratory, 1 Cyclotron Road, Berkeley, CA; Computer Science Research & Development Dept., Information and Computing Sciences Div., Lawrence Berkeley Laboratory, 1 Cyclotron Road, Berkeley, CA; Computer Science Dept., San Francisco State University, San Francisco, CA
Venue:
SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Year:
1990

References (13), followed by citing papers (13):

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
Processing aggregate relational queries with hard time constraints

SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data
Estimating the size of generalized transitive closures

VLDB '89 Proceedings of the 15th international conference on Very large data bases
Random sampling from B+ trees

VLDB '89 Proceedings of the 15th international conference on Very large data bases
Statistical estimators for relational algebra expressions

Proceedings of the seventh ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Performance analysis of linear hashing with partial expansions

ACM Transactions on Database Systems (TODS)
Extendible hashing—a fast access method for dynamic files

ACM Transactions on Database Systems (TODS)
Secure statistical databases with random sample queries

ACM Transactions on Database Systems (TODS)
Analysis and performance of inverted data base structures

Communications of the ACM
Simple Random Sampling from Relational Databases

VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
Sampling Algorithms for Differential Batch Retrieval Problems (Extended Abstract)

Proceedings of the 11th Colloquium on Automata, Languages and Programming
New Strategies for Computing the Transitive Closure of a Database Relation

VLDB '87 Proceedings of the 13th International Conference on Very Large Data Bases
Computer based management information systems embodying answer accuracy as a user parameter

Processing time-constrained aggregate queries in CASE-DB

ACM Transactions on Database Systems (TODS)
On the relative cost of sampling for join selectivity estimation

PODS '94 Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Density biased sampling: an improved method for data mining and clustering

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
An Evaluation of Non-Equijoin Algorithms

VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
Practical Skew Handling in Parallel Joins

VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Algebraic Optimization of Computations over Scientific Databases

VLDB '93 Proceedings of the 19th International Conference on Very Large Data Bases
Online maintenance of very large random samples

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
A disk-based join with probabilistic guarantees

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Confidence bounds for sampling-based group by estimates

ACM Transactions on Database Systems (TODS)
Maintaining very large random samples using the geometric file

The VLDB Journal — The International Journal on Very Large Data Bases
Online maintenance of very large random samples on flash storage

Proceedings of the VLDB Endowment
Online maintenance of very large random samples on flash storage

The VLDB Journal — The International Journal on Very Large Data Bases
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Foundations and Trends in Databases


Abstract

In this paper we discuss simple random sampling from hash files on secondary storage. We consider iterative and batch sampling algorithms for both static and dynamic hashing methods. The static methods considered are open addressing hash files and hash files with separate overflow chains. The dynamic hashing methods considered are Linear Hash files [Lit80] and Extendible Hash files [FNPS79]. We give the cost of sampling in terms of the cost of successfully searching a hash file and show how to exploit features of the dynamic hashing methods to improve sampling efficiency.
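
As an illustration of the two static cases named in the abstract, the sketch below draws one record uniformly at random by acceptance/rejection. It is a minimal in-memory sketch under stated assumptions, not necessarily the algorithms analyzed in the paper: the function names, the list-based table layout, and the `max_chain` bound are all hypothetical stand-ins for a hash file on secondary storage.

```python
import random


def sample_open_addressing(slots, rng=random):
    """Draw one record uniformly from an open-addressing table.

    Assumption: `slots` is a list in which empty slots hold None.
    Each loop iteration models one random slot read.
    """
    # Guard against looping forever; a real file would keep a record count.
    if all(s is None for s in slots):
        raise ValueError("table is empty")
    while True:
        i = rng.randrange(len(slots))  # every slot is equally likely
        if slots[i] is not None:       # accept occupied slots, reject empty ones
            return slots[i]


def sample_chained(buckets, max_chain, rng=random):
    """Draw one record uniformly from a table with separate overflow chains.

    Assumptions: `buckets` is a list of lists (one chain per bucket) and
    `max_chain` is an upper bound on the length of any chain.
    """
    if not any(buckets):
        raise ValueError("table is empty")
    while True:
        chain = buckets[rng.randrange(len(buckets))]
        # Accept the bucket with probability len(chain) / max_chain, so each
        # record is returned with probability 1 / (len(buckets) * max_chain)
        # per trial, regardless of how long its chain is.
        if rng.random() < len(chain) / max_chain:
            return rng.choice(chain)
```

In this sketch the expected number of trials per accepted sample is the reciprocal of the load factor for open addressing, and `len(buckets) * max_chain` divided by the number of records for chained buckets, which is why a tight bound on chain length matters.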