Machine learning algorithms are widely used today for analytical tasks such as data cleaning, data categorization, and data filtering. At the same time, the rise of social media has motivated a recent uptake of large-scale graph processing. Both categories of algorithms are dominated by iterative subtasks, i.e., processing steps that are executed repeatedly until a convergence condition is met. Optimizing cluster resource allocation across multiple workloads of iterative algorithms requires estimating their runtime, which in turn requires: i) predicting the number of iterations, and ii) predicting the processing time of each iteration. Because both parameters depend on the characteristics of the dataset and on the convergence function, estimating their values before execution is difficult. This paper proposes PREDIcT, an experimental methodology for predicting the runtime of iterative algorithms. PREDIcT uses sample runs to capture the algorithm's convergence trend and per-iteration key input features that are well correlated with the actual processing requirements of the complete input dataset. Using this combination of characteristics, PREDIcT predicts the runtime of iterative algorithms, including algorithms with very different runtime patterns across subsequent iterations. Our experimental evaluation of multiple algorithms on scale-free graphs shows a relative prediction error of 10%–30% for runtime, including algorithms with up to 100× runtime variability among consecutive iterations.
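The idea of combining a convergence trend with a per-iteration cost model can be illustrated with a minimal sketch. This is not the paper's actual implementation: the function names, the choice of "active vertices" as the key feature, and the geometric-decay convergence model are all illustrative assumptions. The sketch fits a decay rate and a cost-per-feature-unit from a sample run, scales the feature up to the full input, and simulates iterations until the convergence threshold is reached.

```python
def predict_runtime(sample_values, sample_times, scale, threshold):
    """Extrapolate iteration count and runtime from a sample run.

    sample_values: per-iteration key-feature values observed on the
                   sample (e.g., active-vertex counts), assumed
                   proportional to the work done in that iteration.
    sample_times:  measured per-iteration runtimes (seconds).
    scale:         ratio of full-input feature magnitude to the sample's.
    threshold:     feature value at which the algorithm converges.
    """
    # 1) Convergence trend: average per-iteration decay of the feature
    #    (assumes a roughly geometric trend, as in many graph algorithms).
    ratios = [b / a for a, b in zip(sample_values, sample_values[1:])]
    decay = sum(ratios) / len(ratios)
    # 2) Cost model: seconds per unit of the key feature, fitted from
    #    the sample run.
    cost = sum(sample_times) / sum(sample_values)
    # 3) Replay the trend on the scaled-up feature until convergence.
    v, total, iters = sample_values[0] * scale, 0.0, 0
    while v > threshold:
        total += cost * v
        v *= decay
        iters += 1
    return iters, total

# Toy usage: a sample run whose work halves each iteration.
iters, secs = predict_runtime(
    sample_values=[1000, 500, 250, 125],   # active vertices per iteration
    sample_times=[2.0, 1.0, 0.5, 0.25],    # seconds per iteration
    scale=10, threshold=10)
print(iters, secs)   # ~10 iterations, ~40 s on the full input
```

Real inputs rarely decay so cleanly; the paper's point is that the sampled trend and feature correlations carry over well enough to scale-free graphs to keep the relative error in the 10%–30% range even when per-iteration runtimes vary by up to 100×.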