Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations. Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation. This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.
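To make the point about hand-coded joins concrete, the following is a minimal sketch of the kind of reduce-side join a user would otherwise write by hand in a Map-Reduce-style program. It is an in-memory simulation in plain Python, not Pig or Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `reduce_side_join`), the `"L"`/`"R"` tags, and the sample records are all illustrative assumptions and do not come from the paper.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, tag, key_index):
    """Emit (join_key, (tag, record)) pairs, as a hand-written mapper would."""
    for record in records:
        yield record[key_index], (tag, record)


def shuffle(pairs):
    """Group mapper output by key; stands in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Pair every left-tagged record with every right-tagged record of one key."""
    left = [rec for tag, rec in values if tag == "L"]
    right = [rec for tag, rec in values if tag == "R"]
    for left_rec, right_rec in product(left, right):
        yield left_rec + right_rec


def reduce_side_join(left_records, right_records, left_key=0, right_key=0):
    """Inner equi-join of two lists of tuples via map, shuffle, and reduce."""
    pairs = list(map_phase(left_records, "L", left_key))
    pairs += list(map_phase(right_records, "R", right_key))
    output = []
    # Iterate keys in sorted order so the result is deterministic.
    for _, values in sorted(shuffle(pairs).items(), key=lambda kv: kv[0]):
        output.extend(reduce_phase(values))
    return output


# Hypothetical sample data: (user_id, name) and (user_id, url).
users = [("u1", "alice"), ("u2", "bob")]
clicks = [("u1", "/home"), ("u1", "/search"), ("u3", "/about")]
joined = reduce_side_join(users, clicks)
```

Even in this toy form, the join needs tagging, grouping, and a per-key cross product. This is the boilerplate that a high-level join construct is meant to replace with a single statement.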