Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit the computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are executed inefficiently, which has led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension compensates for the lack of mutable state in dataflows and allows for exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that exploiting these aspects speeds up algorithm runtime by up to two orders of magnitude. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.
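The difference between bulk and incremental (workset) iterations can be illustrated with a small single-machine sketch. It is not the system described above: the choice of connected components as the example, the function names (`bulk_components`, `incremental_components`), and the in-memory dictionaries standing in for partitioned datasets are all assumptions made for illustration. The bulk variant recomputes every vertex in every superstep, while the incremental variant keeps a solution set as state and only processes the candidates in the current workset.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bulk_components(edges):
    """Bulk iteration: recompute the label of every vertex in each superstep."""
    adj = build_adjacency(edges)
    labels = {v: v for v in adj}
    while True:
        new_labels = {
            v: min([labels[v]] + [labels[n] for n in adj[v]]) for v in adj
        }
        if new_labels == labels:  # fixpoint reached
            return labels
        labels = new_labels


def incremental_components(edges):
    """Workset iteration: only vertices with new candidate labels are touched.

    Returns the final labels and the workset size of each superstep.
    """
    adj = build_adjacency(edges)
    solution = {v: v for v in adj}  # solution set, updated in place
    # Initial workset: every vertex proposes its own label to its neighbours.
    workset = {(n, v) for v in adj for n in adj[v]}
    workset_sizes = []
    while workset:
        workset_sizes.append(len(workset))
        # Keep only the smallest improving candidate per vertex (the delta).
        best = {}
        for v, label in workset:
            if label < best.get(v, solution[v]):
                best[v] = label
        # Apply the delta and derive the next workset from changed vertices.
        workset = set()
        for v, label in best.items():
            solution[v] = label
            for n in adj[v]:
                if label < solution[n]:
                    workset.add((n, label))
    return solution, workset_sizes


if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (4, 5)]
    print(bulk_components(edges))
    print(incremental_components(edges))
```

In this sketch the workset shrinks as labels converge, so later supersteps touch only the part of the graph that is still changing. That is the sparse computational dependency the incremental model is meant to exploit. A real dataflow implementation would express the same steps as joins and aggregations over partitioned data rather than dictionary lookups.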