SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Random Sampling from Database Files: A Survey
Proceedings of the 5th International Conference SSDBM on Statistical and Scientific Database Management
A new two-phase sampling based algorithm for discovering association rules
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Effective use of block-level sampling in statistics estimation
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Relational confidence bounds are easy with the bootstrap
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Interpreting the data: Parallel analysis with Sawzall
Scientific Programming - Dynamic Grids and Worldwide Computing
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment
Parallel K-Means Clustering Based on MapReduce
CloudCom '09 Proceedings of the 1st International Conference on Cloud Computing
Online aggregation and continuous query support in MapReduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
HaLoop: efficient iterative data processing on large clusters
Proceedings of the VLDB Endowment
A platform for scalable one-pass analytics using MapReduce
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Hyracks: A flexible and extensible foundation for data-intensive computing
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Extending Map-Reduce for Efficient Predicate-Based Sampling
ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Taming massive distributed datasets: data sampling using bitmap indices
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Parallel online aggregation in action
Proceedings of the 25th International Conference on Scientific and Statistical Database Management
A general bootstrap performance diagnostic
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Sampling estimators for parallel online aggregation
BNCOD'13 Proceedings of the 29th British National conference on Big Data
Scalable progressive analytics on big data in the cloud
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
Approximate results based on samples often provide the only way in which advanced analytical applications on very massive data sets can satisfy their time and resource constraints. Unfortunately, methods and tools for the computation of accurate early results are currently not supported in MapReduce-oriented systems although these are intended for 'big data'. Therefore, we proposed and implemented a non-parametric extension of Hadoop which allows the incremental computation of early results for arbitrary work-flows, along with reliable on-line estimates of the degree of accuracy achieved so far in the computation. These estimates are based on a technique called bootstrapping that has been widely employed in statistics and can be applied to arbitrary functions and data distributions. In this paper, we describe our Early Accurate Result Library (EARL) for Hadoop that was designed to minimize the changes required to the MapReduce framework. Various tests of EARL of Hadoop are presented to characterize the frequent situations where EARL can provide major speed-ups over the current version of Hadoop.