Design of a next generation sampling service for large scale data analysis applications

Authors:
H. Wang;S. Parthasarathy;A. Ghoting;S. Tatikonda;G. Buehrer;T. Kurc;J. Saltz
Affiliations:
The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH
Venue:
Proceedings of the 19th annual international conference on Supercomputing
Year:
2005

Abstract

Advances in data collection and storage technologies have resulted in large and dynamically growing data sets at many organizations. Database and data mining researchers often use sampling to great effect to scale up performance on these data sets at a small cost to accuracy. However, existing techniques often ignore the cost of computing a sample. This cost is often linear in the size of the data set, not the sample, which is expensive. Furthermore, for data mining applications that leverage progressive sampling or bootstrapping-based techniques, this cost can be prohibitive, since they require the generation of multiple samples.

To address this problem, we present a solution in the context of a state-of-the-art data analysis center. Specifically, we propose a scalable service that supports sample generation with cost linear in the size of the sample. We then present an efficient parallelization of this service. Our solution leverages high-speed interconnects (e.g., Myrinet, InfiniBand) for parallel I/O operations with pipelined data transfers. We export an interface that supports both ad-hoc SQL-like querying for database applications and a stand-alone service for data mining applications. We then evaluate our work using queries abstracted from a network monitoring and analysis application, which uses both database and progressive sampling queries. We demonstrate that our implementation achieves good load balance and realizes up to an order of magnitude speedup when compared with extant approaches.
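The cost distinction drawn in the abstract can be illustrated with a small sketch. This is not the service described in the paper. It only contrasts a full-scan sampler (classic reservoir sampling, which reads every record) with a sampler that draws record positions first and reads only those, assuming records can be addressed directly by position. The `CountingStore` class and both function names are hypothetical and exist only for this illustration.

```python
import random


class CountingStore:
    """Hypothetical in-memory record store that counts record reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def __len__(self):
        return len(self._records)

    def read(self, i):
        self.reads += 1
        return self._records[i]


def scan_sample(store, k, rng):
    """Reservoir sampling: one pass that reads every record in the store."""
    reservoir = []
    for i in range(len(store)):
        rec = store.read(i)
        if i < k:
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir


def index_sample(store, k, rng):
    """Draw k distinct positions, then read only those k records."""
    positions = rng.sample(range(len(store)), k)
    # Sorted order mimics a sequential-friendly access pattern.
    return [store.read(i) for i in sorted(positions)]


rng = random.Random(0)
n, k = 100_000, 100

scan_store = CountingStore(range(n))
scan_result = scan_sample(scan_store, k, rng)

index_store = CountingStore(range(n))
index_result = index_sample(index_store, k, rng)

print(scan_store.reads, index_store.reads)
```

Under these assumptions the scan-based sampler performs one read per record in the data set, while the position-based sampler performs one read per sampled record. That gap is multiplied when an application such as progressive sampling requests many samples in sequence.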