Advances in data collection and storage technologies have resulted in large and dynamically growing data sets at many organizations. Database and data mining researchers often use sampling to great effect to scale up performance on these data sets at a small cost in accuracy. However, existing techniques often ignore the cost of computing the sample itself. This cost is often linear in the size of the data set rather than the size of the sample, which is expensive. Furthermore, for data mining applications that rely on progressive sampling or bootstrapping-based techniques, this cost can be prohibitive, since they require the generation of multiple samples.

To address this problem, we present a solution in the context of a state-of-the-art data analysis center. Specifically, we propose a scalable service that supports sample generation with cost linear in the size of the sample. We then present an efficient parallelization of this service. Our solution leverages high-speed interconnects (e.g., Myrinet, InfiniBand) for parallel I/O operations with pipelined data transfers. We export an interface that supports both ad hoc SQL-like querying for database applications and a stand-alone service for data mining applications. We evaluate our work using queries abstracted from a network monitoring and analysis application, which uses both database and progressive sampling queries. We demonstrate that our implementation achieves good load balance and realizes up to an order of magnitude speedup over existing approaches.
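
The difference between sampling cost that is linear in the data set and cost that is linear in the sample can be illustrated with a minimal Python sketch. This is not the service described above. It assumes an in-memory list standing in for a file of fixed-size records that can be fetched by position, and the names `reservoir_sample`, `indexed_sample`, and `fetch` are hypothetical. The first function is the classic one-pass reservoir method, which must touch every record. The second draws random positions first and fetches only those records.

```python
import random


def reservoir_sample(records, k, rng):
    """One-pass reservoir sampling: every record is read once,
    so the cost grows with the size of the data set."""
    reservoir = []
    touched = 0
    for i, rec in enumerate(records):
        touched += 1
        if i < k:
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = rec
    return reservoir, touched


def indexed_sample(fetch, n, k, rng):
    """Draw k distinct positions, then fetch only those records,
    so the cost grows with the size of the sample.
    Assumes records are addressable by position (hypothetical `fetch`)."""
    positions = rng.sample(range(n), k)
    return [fetch(p) for p in positions], len(positions)


# Toy data standing in for a large file of fixed-size records.
data = list(range(100_000))
rng = random.Random(0)

scan_sample, scan_touched = reservoir_sample(data, 100, rng)
idx_sample, idx_touched = indexed_sample(data.__getitem__, len(data), 100, rng)

print("reservoir scan touched:", scan_touched)
print("indexed access touched:", idx_touched)
```

Both functions return a uniform sample of 100 records without replacement, but the scan reads all 100,000 records while the indexed version reads only the 100 it returns. When several samples are needed, as in progressive sampling or bootstrapping, the scan cost is paid again for each one.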