With the growing computational capabilities of parallel machines, scientific simulations are being performed at finer spatial and temporal scales, leading to a data explosion. The growing sizes make it extremely hard to store, manage, disseminate, analyze, and visualize these datasets, especially because neither the memory capacity of parallel machines, nor memory access speeds, nor disk bandwidths are increasing at the same rate as computing power. Sampling can be an effective technique for addressing these challenges, but it is important to ensure that dataset characteristics are preserved and that the loss of accuracy stays within acceptable levels. In this paper, we address the data explosion problem by developing a novel sampling approach and implementing it in a flexible system that supports server-side sampling and data subsetting. We observe that, to allow subsetting over scientific datasets, data repositories are likely to use an indexing technique. Among these techniques, bitmap indexing can not only support subsetting effectively, but can also help create samples that preserve both the value distribution and the spatial distribution of a scientific dataset. We have developed algorithms that use bitmap indices to sample datasets. We have also shown how a small amount of additional metadata stored with the bitvectors can help assess the loss of accuracy at a particular subsampling level. Other properties of this approach include: 1) sampling can be flexibly applied to a subset of the original dataset, which may be specified using a value-based and/or a dimension-based subsetting predicate, and 2) no data reorganization is needed once the bitmap indices have been generated. We have extensively evaluated our method with different types of datasets and applications, and demonstrated the effectiveness of our approach.
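To make the idea of sampling through a bitmap index more concrete, the following is a minimal Python sketch and not the paper's algorithm. It assumes a binned index in which each bin's bitvector is represented as a plain list of element positions, and it draws the same fraction from every bin so that the sample keeps the value histogram of the data. The function names `build_bitmap_index` and `sample_from_index`, the bin edges, and the `subset` parameter are all hypothetical choices for illustration. The sketch does not attempt to preserve the spatial distribution or to compute the accuracy metadata mentioned above.

```python
import bisect
import random
from collections import defaultdict


def build_bitmap_index(values, bin_edges):
    """Return {bin id: positions of the elements whose value falls in that bin}.

    Assumption: a real bitmap index would keep one compressed bitvector per
    bin; a list of positions stands in for the set bits. Values are assumed
    to lie within [bin_edges[0], bin_edges[-1]].
    """
    n_bins = len(bin_edges) - 1
    index = defaultdict(list)
    for pos, v in enumerate(values):
        # Bin b covers [bin_edges[b], bin_edges[b + 1]); the top edge is
        # folded into the last bin.
        b = min(bisect.bisect_right(bin_edges, v) - 1, n_bins - 1)
        index[b].append(pos)
    return index


def sample_from_index(index, fraction, rng, subset=None):
    """Pick the same fraction of positions from every bin.

    `subset` is an optional set of positions standing in for a
    dimension-based subsetting predicate; only those positions are eligible.
    """
    sample = []
    for b in sorted(index):
        positions = index[b]
        if subset is not None:
            positions = [p for p in positions if p in subset]
        k = round(len(positions) * fraction)
        sample.extend(rng.sample(positions, k))
    return sorted(sample)


if __name__ == "__main__":
    values = [i % 10 for i in range(1000)]
    index = build_bitmap_index(values, [0, 2, 5, 10])
    picked = sample_from_index(index, 0.1, random.Random(0))
    print(len(picked))  # 100: 20 + 30 + 50 positions from the three bins
```

Because the sample is drawn directly from the per-bin position lists, the original data array is never reorganized, and restricting the draw to `subset` shows how sampling and subsetting can be combined in a single pass over the index.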