Nephele: efficient parallel data processing in the cloud

Authors:
Daniel Warneke; Odej Kao
Affiliations:
Technische Universität Berlin, Berlin, Germany (both authors)
Venue:
Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
Year:
2009

Abstract

In recent years, cloud computing has emerged as a promising approach for ad-hoc parallel data processing. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and deploy their programs. However, the processing frameworks currently in use stem from the field of cluster computing and disregard the particular nature of a cloud. As a result, the allocated compute resources may be inadequate for large parts of the submitted job and unnecessarily increase processing time and cost. In this paper, we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our ongoing research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's compute clouds for both task scheduling and execution. It allows the individual tasks of a processing job to be assigned to different types of virtual machines and takes care of their instantiation and termination during job execution. Based on this new framework, we perform evaluations on a compute cloud system and compare the results to the existing data processing framework Hadoop.