FlumeJava: easy, efficient data-parallel pipelines

Authors:
Craig Chambers; Ashish Raniwala; Frances Perry; Stephen Adams; Robert R. Henry; Robert Bradshaw; Nathan Weizenbaum
Affiliations:
Google, Seattle, WA, USA (all authors)
Venue:
PLDI '10: Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation
Year:
2010

Abstract

MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
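
The deferred-evaluation idea in the abstract can be illustrated with a small toy sketch. This is not the FlumeJava API: the class `DeferredCollection`, its methods `of`, `map`, `plannedStages`, and `run`, and the demo class are all hypothetical names invented for illustration. The sketch assumes a pipeline made only of element-wise map steps. Each call to `map` records a stage in a plan rather than running it, and `run` stands in for the optimize-then-execute step by fusing the recorded stages into one function and applying it in a single parallel pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy immutable collection with deferred evaluation (hypothetical, not FlumeJava's API). */
final class DeferredCollection<T> {
    private final List<Object> source;                    // input elements, never modified
    private final List<Function<Object, Object>> stages;  // recorded plan, in call order

    private DeferredCollection(List<Object> source, List<Function<Object, Object>> stages) {
        this.source = source;
        this.stages = stages;
    }

    /** Wraps a copy of the input; no stages are planned yet. */
    static <T> DeferredCollection<T> of(List<T> input) {
        List<Object> copy = new ArrayList<Object>(input);
        List<Function<Object, Object>> none = new ArrayList<Function<Object, Object>>();
        return new DeferredCollection<T>(copy, none);
    }

    /** Records the operation in a new plan instead of executing it. */
    @SuppressWarnings("unchecked")
    <R> DeferredCollection<R> map(Function<? super T, ? extends R> fn) {
        List<Function<Object, Object>> next = new ArrayList<Function<Object, Object>>(stages);
        next.add(x -> fn.apply((T) x));
        return new DeferredCollection<R>(source, next);
    }

    /** Number of operations recorded so far. */
    int plannedStages() {
        return stages.size();
    }

    /** Fuses all recorded stages into one function, then runs a single parallel pass. */
    @SuppressWarnings("unchecked")
    List<T> run() {
        Function<Object, Object> fused = Function.identity();
        for (Function<Object, Object> stage : stages) {
            fused = fused.andThen(stage);
        }
        List<Object> out = source.parallelStream().map(fused).collect(Collectors.toList());
        return (List<T>) (List<?>) out;
    }
}

/** Small usage demo for the sketch above. */
final class DeferredPipelineDemo {
    public static void main(String[] args) {
        DeferredCollection<Integer> scaled =
            DeferredCollection.of(Arrays.asList("a", "bb", "ccc"))
                .map(s -> s.length())   // recorded, not executed
                .map(n -> n * 10);      // recorded, not executed
        System.out.println("planned stages: " + scaled.plannedStages());
        System.out.println("result: " + scaled.run()); // execution happens only here
    }
}
```

Fusing the two map stages means each element is visited once rather than once per stage. That is only a simplified analogue of the plan optimization the abstract describes, which targets underlying primitives such as MapReduces rather than in-memory streams.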