Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, and data filtering. At the same time, the rise of social media has motivated a recent uptake of large-scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps that are executed repeatedly until a convergence condition is met. Optimizing cluster resource allocation across multiple workloads of iterative algorithms requires estimating their runtime, which in turn requires: i) predicting the number of iterations, and ii) predicting the processing time of each iteration. Because both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult. This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs to capture the algorithm's convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics, PREDIcT predicts the runtime of iterative algorithms, including algorithms with very different runtime patterns across subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%–30% for runtime, including algorithms with up to 100× runtime variability among consecutive iterations.
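The idea of combining a convergence trend with a per-iteration cost model can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function names, the choice of "active vertices" as the key feature, and the geometric-decay convergence model are all illustrative assumptions. The sketch fits a decay rate and a cost-per-feature-unit from a sample run, scales the feature up to the full input, and simulates iterations until the convergence threshold is reached.

```python
def predict_runtime(sample_values, sample_times, scale, threshold):
    """Extrapolate iteration count and runtime from a sample run.

    sample_values: per-iteration key-feature values observed on the
                   sample (e.g., active-vertex counts), assumed
                   proportional to the work done in that iteration.
    sample_times:  measured per-iteration runtimes (seconds).
    scale:         ratio of full-input feature magnitude to the sample's.
    threshold:     feature value at which the algorithm converges.
    """
    # 1) Convergence trend: average per-iteration decay of the feature
    #    (assumes a roughly geometric trend, as in many graph algorithms).
    ratios = [b / a for a, b in zip(sample_values, sample_values[1:])]
    decay = sum(ratios) / len(ratios)
    # 2) Cost model: seconds per unit of the key feature, fitted from
    #    the sample run.
    cost = sum(sample_times) / sum(sample_values)
    # 3) Replay the trend on the scaled-up feature until convergence.
    v, total, iters = sample_values[0] * scale, 0.0, 0
    while v > threshold:
        total += cost * v
        v *= decay
        iters += 1
    return iters, total

# Toy usage: a sample run whose work halves each iteration.
iters, secs = predict_runtime(
    sample_values=[1000, 500, 250, 125],   # active vertices per iteration
    sample_times=[2.0, 1.0, 0.5, 0.25],    # seconds per iteration
    scale=10, threshold=10)
print(iters, secs)   # ~10 iterations, ~40 s on the full input
```

Real inputs rarely decay so cleanly; the paper's point is that the sampled trend and feature correlations carry over well enough to scale-free graphs to keep the relative error in the 10%–30% range even when per-iteration runtimes vary by up to 100×.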