Spinning fast iterative data flows

Authors:
Stephan Ewen;Kostas Tzoumas;Moritz Kaufmann;Volker Markl
Affiliations:
Technische Universität Berlin, Germany (all authors)
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are executed inefficiently, which has led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension compensates for the lack of mutable state in dataflows and allows for exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototype implementation shows that exploiting these aspects yields speedups of up to two orders of magnitude in algorithm runtime. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.
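
The contrast between bulk and incremental (workset) iterations can be illustrated with a small single-machine sketch. It is not the system described in the abstract. It assumes connected components by minimum-label propagation as the example algorithm, and the function names `bulk_components` and `incremental_components`, as well as the `work` counter, are hypothetical names chosen for illustration. The bulk variant recomputes every vertex in every superstep, whereas the incremental variant keeps a solution set plus a workset of candidate labels and only touches vertices whose neighbors changed.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bulk_components(edges):
    """Bulk iteration: recompute the label of every vertex in each superstep."""
    adj = build_adjacency(edges)
    labels = {v: v for v in adj}
    work = 0  # crude counter: number of vertex recomputations
    while True:
        new_labels = {}
        for v in adj:
            work += 1
            new_labels[v] = min([labels[v]] + [labels[n] for n in adj[v]])
        if new_labels == labels:  # fixpoint reached
            return labels, work
        labels = new_labels


def incremental_components(edges):
    """Workset iteration: only process candidates sent by changed vertices."""
    adj = build_adjacency(edges)
    solution = {v: v for v in adj}  # partial solution, updated in place
    # Initial workset: every vertex proposes its own label to its neighbors.
    workset = {(n, solution[v]) for v in adj for n in adj[v]}
    work = 0  # crude counter: number of candidate labels examined
    while workset:
        # Keep, per vertex, the smallest candidate that improves its label.
        improved = {}
        for v, label in workset:
            work += 1
            if label < improved.get(v, solution[v]):
                improved[v] = label
        # Apply the changes and derive the next workset from them alone.
        workset = set()
        for v, label in improved.items():
            solution[v] = label
            for n in adj[v]:
                workset.add((n, label))
    return solution, work


if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (4, 5)]
    print(bulk_components(edges))
    print(incremental_components(edges))
```

Both functions return the same labeling. The `work` counters measure different things (recomputations versus candidates examined), so they are only a rough indication of how much less the workset variant touches once most of the graph has converged.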