Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Authors:
Dominic Battré;Stephan Ewen;Fabian Hueske;Odej Kao;Volker Markl;Daniel Warneke
Affiliations:
Technische Universität Berlin, Berlin, Germany;Technische Universität Berlin, Berlin, Germany;Technische Universität Berlin, Berlin, Germany;Technische Universität Berlin, Berlin, Germany;Technische Universität Berlin, Berlin, Germany;Technische Universität Berlin, Berlin, Germany
Venue:
Proceedings of the 1st ACM symposium on Cloud computing
Year:
2010

Citing 16
Cited 45

A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment

SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data
A Symmetric Fragment and Replicate Algorithm for Distributed Joins

IEEE Transactions on Parallel and Distributed Systems
Access path selection in a relational database management system

SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
An Overview of The System Software of A Parallel Relational Database Machine GRACE

VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
GAMMA - A High Performance Dataflow Database Machine

VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
Map-reduce-merge: simplified relational data processing on large clusters

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Automatic optimization of parallel dataflow programs

ATC'08 USENIX 2008 Annual Technical Conference
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Nephele: efficient parallel data processing in the cloud

Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions

Proceedings of the VLDB Endowment
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Massively parallel data analysis with PACTs on Nephele

Proceedings of the VLDB Endowment
Map-reduce extensions and recursive queries

Proceedings of the 14th International Conference on Extending Database Technology
ASTERIX: towards a scalable, semistructured data platform for evolving-world models

Distributed and Parallel Databases
CloudFuice: a flexible cloud-based data integration system

ICWE'11 Proceedings of the 11th international conference on Web engineering
ChuQL: processing XML with XQuery using Hadoop

Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research
Parallel data processing with MapReduce: a survey

ACM SIGMOD Record
Cluster computing, recursion and datalog

Datalog'10 Proceedings of the First international conference on Datalog Reloaded
The HaLoop approach to large-scale iterative data analysis

The VLDB Journal — The International Journal on Very Large Data Bases
SkewTune: mitigating skew in mapreduce applications

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Optimizing analytic data flows for multiple execution engines

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Inside "Big Data management": ogres, onions, or parfaits?

Proceedings of the 15th International Conference on Extending Database Technology
Clydesdale: structured data processing on MapReduce

Proceedings of the 15th International Conference on Extending Database Technology
An optimization framework for map-reduce queries

Proceedings of the 15th International Conference on Extending Database Technology
Transitive closure and recursive Datalog implemented on clusters

Proceedings of the 15th International Conference on Extending Database Technology
Adaptive MapReduce using situation-aware mappers

Proceedings of the 15th International Conference on Extending Database Technology
Integrating open government data with stratosphere for more transparency

Web Semantics: Science, Services and Agents on the World Wide Web
Massively-parallel stream processing under QoS constraints with Nephele

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Challenges and approaches for distributed workflow-driven analysis of large-scale biological data: vision paper

Proceedings of the 2012 Joint EDBT/ICDT Workshops
Efficient multi-way theta-join processing using MapReduce

Proceedings of the VLDB Endowment
Opening the black boxes in data flow optimization

Proceedings of the VLDB Endowment
Spinning fast iterative data flows

Proceedings of the VLDB Endowment
Optimization of analytic data flows for next generation business intelligence applications

TPCTC'11 Proceedings of the Third TPC Technology conference on Topics in Performance Evaluation, Measurement and Characterization
Myriad: parallel data generation on shared-nothing architectures

Proceedings of the 1st Workshop on Architectures and Systems for Big Data
SCOPE: parallel databases meet MapReduce

The VLDB Journal — The International Journal on Very Large Data Bases
AMADA: web data repositories in the amazon cloud

Proceedings of the 21st ACM international conference on Information and knowledge management
Report from the first workshop on scalable workflow enactment engines and technology (SWEET'12)

ACM SIGMOD Record
Map/reduce on EMF models

Proceedings of the 1st International Workshop on Model-Driven Engineering for High Performance and Cloud computing
Sparkler: supporting large-scale matrix factorization

Proceedings of the 16th International Conference on Extending Database Technology
Iterative parallel data processing with stratosphere: an inside look

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Designing a database system for modern processing architectures

Proceedings of the 2013 Sigmod/PODS Ph.D. symposium on PhD symposium
A case for dynamic memory partitioning in data centers

Proceedings of the Second Workshop on Data Analytics in the Cloud
Reference representation techniques for large models

Proceedings of the Workshop on Scalability in Model Driven Engineering
Large-scale social-media analytics on stratosphere

Proceedings of the 22nd international conference on World Wide Web companion
Data-Fu: a language and an interpreter for interaction with read/write linked data

Proceedings of the 22nd international conference on World Wide Web
Large-scale computation not at the cost of expressiveness

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Mammoth: autonomic data processing framework for scientific state-transition applications

Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
"All roads lead to Rome": optimistic recovery for distributed iterative data processing

Proceedings of the 22nd ACM international conference on information & knowledge management
The family of mapreduce and large-scale data processing systems

ACM Computing Surveys (CSUR)
PonIC: using stratosphere to speed up pig analytics

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Continuous cloud-scale query optimization and processing

Proceedings of the VLDB Endowment
Hardware-oblivious parallelism for in-memory column-stores

Proceedings of the VLDB Endowment
Scalable topic-specific influence analysis on microblogs

Proceedings of the 7th ACM international conference on Web search and data mining
Approaches to Distributed Execution of Scientific Workflows in Kepler

Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology
Hybrid Analytic Flows - the Case for Optimization

Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology
Nephele streaming: stream processing under QoS constraints at scale

Cluster Computing

Abstract

We present a parallel data processor centered around a programming model of so-called Parallelization Contracts (PACTs) and the scalable parallel execution engine Nephele [18]. The PACT programming model is a generalization of the well-known map/reduce programming model, extending it with further second-order functions, as well as with Output Contracts that give guarantees about the behavior of a function. We describe methods to transform a PACT program into a data flow for Nephele, which executes its sequential building blocks in parallel and deals with communication, synchronization, and fault tolerance. Our definition of PACTs allows several types of optimizations to be applied to the data flow during the transformation. The system as a whole is designed to be as generic as (and compatible with) map/reduce systems, while overcoming several of their major weaknesses: 1) The functions map and reduce alone are not sufficient to express many data processing tasks both naturally and efficiently. 2) Map/reduce ties a program to a single fixed execution strategy, which is robust but highly suboptimal for many tasks. 3) Map/reduce makes no assumptions about the behavior of the functions and hence offers only very limited optimization opportunities. With a set of examples and experiments, we illustrate how our system is able to naturally represent and efficiently execute several tasks that do not fit the map/reduce model well.
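
The idea of second-order functions with attached guarantees can be illustrated with a small single-process sketch in Python. It is not the PACT or Nephele API: the names `map_contract`, `reduce_contract`, `match_contract`, `same_key`, and `needs_repartition` are hypothetical, and the abstract does not name the additional second-order functions, so the two-input `match_contract` is only one plausible example. Each contract takes a user function plus data and decides how records are grouped before the user function is called. The `same_key` marker stands in for an Output Contract, which an optimizer could use to skip a repartitioning step.

```python
from collections import defaultdict


def map_contract(udf, records):
    """Call udf once per (key, value) record, independently of all others."""
    out = []
    for key, value in records:
        out.extend(udf(key, value))
    return out


def reduce_contract(udf, records):
    """Group records by key and call udf once per key group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(udf(key, groups[key]))
    return out


def match_contract(udf, left, right):
    """Two-input contract: call udf for every pair of records sharing a key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    out = []
    for key, left_value in left:
        for right_value in right_by_key.get(key, []):
            out.extend(udf(key, left_value, right_value))
    return out


def same_key(udf):
    """Hypothetical output-contract marker: udf emits only its input key."""
    udf.same_key = True
    return udf


def needs_repartition(upstream_udf):
    """A key-based consumer can reuse the partitioning if the key is preserved."""
    return not getattr(upstream_udf, "same_key", False)


@same_key
def sum_amounts(key, values):
    return [(key, sum(values))]


@same_key
def attach_name(key, amount, name):
    return [(key, (name, amount))]


def rekey_by_name(key, pair):
    # Emits a different key, so no same_key guarantee is declared.
    name, amount = pair
    return [(name, amount)]


orders = [(1, 30), (2, 5), (1, 12)]
customers = [(1, "ada"), (2, "bob")]

totals = reduce_contract(sum_amounts, orders)
joined = match_contract(attach_name, orders, customers)
by_name = map_contract(rekey_by_name, joined)

print(totals)
print(joined)
print(by_name)
print(needs_repartition(attach_name), needs_repartition(rekey_by_name))
```

In this sketch the join is expressed directly through a two-input contract rather than being encoded into map and reduce steps, and the declared guarantee on `attach_name` is the kind of information a map/reduce system, which makes no assumptions about user functions, would not have available.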