Building a high-level dataflow system on top of Map-Reduce: the Pig experience

  • Authors: Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava

  • Affiliation: Yahoo!, Inc. (all authors)

  • Venue: Proceedings of the VLDB Endowment
  • Year: 2009


Abstract

Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations. Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation. This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.
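To make the abstract's claim concrete, the following is a minimal, hypothetical Pig Latin sketch (file names, field names, and the `myudfs.Score` user-defined function are all illustrative, not from the paper). It shows the combination the abstract describes: SQL-style constructs such as JOIN and GROUP assembled into an explicit, step-by-step dataflow, with a custom function interleaved. Pig compiles a script like this into a sequence of Hadoop Map-Reduce jobs, so the user never writes join or grouping logic by hand.

```pig
-- Load two inputs with declared schemas (paths and schemas are hypothetical)
raw   = LOAD 'clicks.log' AS (user:chararray, url:chararray);
pages = LOAD 'pages.dat'  AS (url:chararray, pagerank:double);

-- SQL-style join, expressed as one named step in an explicit dataflow
joined = JOIN raw BY url, pages BY url;

-- Interleave a custom (user-defined) function with built-in operators
scored = FOREACH joined GENERATE raw::user AS user,
                                 myudfs.Score(pages::pagerank) AS s;

-- SQL-style grouping and aggregation
grouped = GROUP scored BY user;
top     = FOREACH grouped GENERATE group AS user, AVG(scored.s) AS avg_score;

STORE top INTO 'output';
```

Each assignment names an intermediate dataset, so the dataflow's branching and ordering are explicit in the program text rather than hidden inside a query optimizer, which is the "sweet spot" between SQL and raw Map-Reduce that the abstract refers to.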