Pig latin: a not-so-foreign language for data processing

Authors:
Christopher Olston;Benjamin Reed;Utkarsh Srivastava;Ravi Kumar;Andrew Tomkins
Affiliations:
Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA
Venue:
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Year:
2008

Citing 12
Cited 335

A survey of theoretical research on typed complex database objects

Databases
Fundamentals of database systems

Fundamentals of database systems
Programming parallel algorithms

Communications of the ACM
On random sampling over joins

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Data Mining and Knowledge Discovery
Interpreting the data: Parallel analysis with Sawzall

Scientific Programming - Dynamic Grids and Worldwide Computing
Map-reduce-merge: simplified relational data processing on large clusters

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Dynamo: amazon's highly available key-value store

Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Bigtable: a distributed storage system for structured data

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Community systems research at Yahoo!

ACM SIGMOD Record

Automatic optimization of parallel dataflow programs

ATC'08 USENIX 2008 Annual Technical Conference
Correlated Query Process and P2P Execution

Globe '08 Proceedings of the 1st international conference on Data Management in Grid and Peer-to-Peer Systems
Clustera: an integrated computation and data management system

Proceedings of the VLDB Endowment
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
PNUTS: Yahoo!'s hosted data serving platform

Proceedings of the VLDB Endowment
Large-scale collaborative analysis and extraction of web data

Proceedings of the VLDB Endowment
Data-Continuous SQL Process Model

OTM '08 Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008. Part I on On the Move to Meaningful Internet Systems
Adaptive workload allocation in query processing in autonomous heterogeneous environments

Distributed and Parallel Databases
Traverse: Simplified Indexing on Large Map-Reduce-Merge Clusters

DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
MapReduce optimization using regulated dynamic prioritization

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Open-source grid technologies for web-scale computing

ACM SIGACT News
BotGraph: large scale spamming botnet detection

NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Generating example data for dataflow programs

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Building community-centric information exploration applications on social content sites

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Distributed data-parallel computing using a high-level programming language

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Experiences on Processing Spatial Data with MapReduce

SSDBM 2009 Proceedings of the 21st International Conference on Scientific and Statistical Database Management
Brief announcement: PUSH, a DISC shell

Proceedings of the 28th ACM symposium on Principles of distributed computing
Query interactions in database workloads

Proceedings of the Second International Workshop on Testing Database Systems
A Vision for Next Generation Query Processors and an Associated Research Agenda

Globe '09 Proceedings of the 2nd International Conference on Data Management in Grid and Peer-to-Peer Systems
An In-Database Streaming Solution to Multi-camera Fusion

Globe '09 Proceedings of the 2nd International Conference on Data Management in Grid and Peer-to-Peer Systems
New Challenges in Information Integration

DaWaK '09 Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery
Efficiently support MapReduce-like computation models inside parallel DBMS

IDEAS '09 Proceedings of the 2009 International Database Engineering & Applications Symposium
MapReduce and parallel DBMSs: friends or foes?

Communications of the ACM - Amir Pnueli: Ahead of His Time
MapReduce: a flexible data processing tool

Communications of the ACM - Amir Pnueli: Ahead of His Time
Distributed aggregation for data-parallel computing: interfaces and implementations

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
The nature of data center traffic: measurements & analysis

Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference
Composing and executing parallel data-flow graphs with shell pipes

Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science
A code generation approach to optimizing high-performance distributed data stream processing

Proceedings of the 18th ACM conference on Information and knowledge management
Practical lessons of data mining at Yahoo!

Proceedings of the 18th ACM conference on Information and knowledge management
Nephele: efficient parallel data processing in the cloud

Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
Query processing of massive trajectory data based on mapreduce

Proceedings of the first international workshop on Cloud data management
SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions

Proceedings of the VLDB Endowment
Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
RAPID: Enabling Scalable Ad-Hoc Analytics on the Semantic Web

ISWC '09 Proceedings of the 8th International Semantic Web Conference
Optimizing joins in a map-reduce environment

Proceedings of the 13th International Conference on Extending Database Technology
DEDUCE: at the intersection of MapReduce and stream processing

Proceedings of the 13th International Conference on Extending Database Technology
Xbase: cloud-enabled information appliance for healthcare

Proceedings of the 13th International Conference on Extending Database Technology
Managing scientific data

Communications of the ACM
HadoopToSQL: a mapReduce query optimizer

Proceedings of the 5th European conference on Computer systems
Cloud-TM: harnessing the cloud with distributed transactional memories

ACM SIGOPS Operating Systems Review
Storing and accessing live mashup content in the cloud

ACM SIGOPS Operating Systems Review
Distributed indexing of web scale datasets for the cloud

Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
Towards scalable RDF graph analytics on MapReduce

Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
SPARQL basic graph pattern processing with iterative MapReduce

Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
FlumeJava: easy, efficient data-parallel pipelines

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Stateful bulk processing for incremental analytics

Proceedings of the 1st ACM symposium on Cloud computing
Comet: batched stream processing for data intensive distributed computing

Proceedings of the 1st ACM symposium on Cloud computing
Skew-resistant parallel processing of feature-extracting scientific user-defined functions

Proceedings of the 1st ACM symposium on Cloud computing
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Proceedings of the 1st ACM symposium on Cloud computing
The case for PIQL: a performance insightful query language

Proceedings of the 1st ACM symposium on Cloud computing
Making cloud intermediate data fault-tolerant

Proceedings of the 1st ACM symposium on Cloud computing
Pregel: a system for large-scale graph processing

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
ParaTimer: a progress indicator for MapReduce DAGs

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Integrating hadoop and parallel DBMS

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A comparison of join algorithms for log processing in MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Ricardo: integrating R and Hadoop

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
HadoopDB in action: building real world applications

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Online aggregation and continuous query support in MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Parallel programming framework for large batch transaction processing on scale-out systems

Proceedings of the 3rd Annual Haifa Experimental Systems Conference
ASSET queries: a declarative alternative to MapReduce

ACM SIGMOD Record
Toward a cost-effective cloud storage service

ICACT'10 Proceedings of the 12th international conference on Advanced communication technology
A Map-Reduce System with an Alternate API for Multi-core Environments

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
MRAP: a novel MapReduce-based framework to support HPC analytics applications with access patterns

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Weaver: integrating distributed computing abstractions into scientific workflows using Python

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Massive Semantic Web data compression with MapReduce

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Parallelizing multiple group-by query in share-nothing environment: a MapReduce study case

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
A common substrate for cluster computing

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
DryadInc: reusing work in large-scale computations

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
Wave computing in the cloud

HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
On availability of intermediate data in cloud computations

HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Improving MapReduce performance in heterogeneous environments

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Towards practical incremental recomputation for scientists: an implementation for the Python language

TAPP'10 Proceedings of the 2nd conference on Theory and practice of provenance
An experience report on scaling tools for mining software repositories using MapReduce

Proceedings of the IEEE/ACM international conference on Automated software engineering
Manimal: relational optimization for data-intensive programs

Proceedings of the 13th International Workshop on the Web and Databases
Reliable data-center scale computations

Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
Conductor: orchestrating the clouds

Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
Towards a theory of search queries

ACM Transactions on Database Systems (TODS)
See spot run: using spot instances for mapreduce workflows

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Spark: cluster computing with working sets

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Scripting the cloud with skywriting

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Towards energy proportional cloud for data processing frameworks

SustainIT'10 Proceedings of the First USENIX conference on Sustainable information technology
Scalable clustering algorithm for N-body simulations in a shared-nothing cluster

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Comparing Hadoop and Fat-Btree based access method for small file I/O applications

WAIM'10 Proceedings of the 11th international conference on Web-age information management
JAWS: Job-Aware Workload Scheduling for the Exploration of Turbulence Simulations

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Merging file systems and data bases to fit the grid

Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
Multidimensional arrays for warehousing data on clouds

Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
Processing high data rate streams in System S

Journal of Parallel and Distributed Computing
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment
Dremel: interactive analysis of web-scale datasets

Proceedings of the VLDB Endowment
The performance of MapReduce: an in-depth study

Proceedings of the VLDB Endowment
MRShare: sharing across multiple queries in MapReduce

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
DataGarage: warehousing massive performance data on commodity servers

Proceedings of the VLDB Endowment
Knuckles: bringing the database to the data

International Journal of Computational Science and Engineering
Integrating MapReduce and RDBMSs

Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
Web data processing on the cloud

Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
Nectar: automatic management of data and computation in datacenters

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Piccolo: building fast, distributed programs with partitioned tables

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Chukwa: a system for reliable large-scale log collection

LISA'10 Proceedings of the 24th international conference on Large installation system administration
On the expressiveness and trade-offs of large scale tuple stores

OTM'10 Proceedings of the 2010 international conference on On the move to meaningful internet systems: Part II
The case for object databases in cloud data management

ICOODB'10 Proceedings of the Third international conference on Objects and databases
Online querying of d-dimensional hierarchies

Journal of Parallel and Distributed Computing
Qex: symbolic SQL query explorer

LPAR'10 Proceedings of the 16th international conference on Logic for programming, artificial intelligence, and reasoning
Demaq/Transscale: Automated distribution and scalability for declarative applications

Information Systems
Data structure fusion

APLAS'10 Proceedings of the 8th Asian conference on Programming languages and systems
CPLDP: an efficient large dataset processing system built on cloud platform

ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications - Volume Part II
Map-reduce extensions and recursive queries

Proceedings of the 14th International Conference on Extending Database Technology
RanKloud: a scalable ranked query processing framework on hadoop

Proceedings of the 14th International Conference on Extending Database Technology
Dremel: interactive analysis of web-scale datasets

Communications of the ACM
Optimizing intermediate data management in MapReduce computations

Proceedings of the First International Workshop on Cloud Computing Platforms
Architectural Requirements for Cloud Computing Systems: An Enterprise Cloud Approach

Journal of Grid Computing
ASTERIX: towards a scalable, semistructured data platform for evolving-world models

Distributed and Parallel Databases
CIEL: a universal execution engine for distributed data-flow computing

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Automatic optimization for MapReduce programs

Proceedings of the VLDB Endowment
Aspects of data-intensive cloud computing

From active data management to event-based systems and more
Ripple: A publish/subscribe service for multidata item updates propagation in the cloud

Journal of Network and Computer Applications
Column-oriented storage techniques for MapReduce

Proceedings of the VLDB Endowment
Brasil: basic resource aggregation system infrastructure layer

Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
A latency and fault-tolerance optimizer for online parallel query plans

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Schedule optimization for data processing flows on the cloud

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Processing theta-joins using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Fast personalized PageRank on MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Nova: continuous Pig/Hadoop workflows

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Providing scalable database services on the cloud

WISE'10 Proceedings of the 11th international conference on Web information systems engineering
Optimizing data partitioning for data-parallel computing

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Data representation synthesis

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Steno: automatic optimization of declarative queries

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Otus: resource attribution in data-intensive clusters

Proceedings of the second international workshop on MapReduce and its applications
Static type checking of Hadoop MapReduce programs

Proceedings of the second international workshop on MapReduce and its applications
Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

Proceedings of the second international workshop on MapReduce and its applications
Full-text indexing for optimizing selection operations in large-scale data analytics

Proceedings of the second international workshop on MapReduce and its applications
The case for being lazy: how to leverage lazy evaluation in MapReduce

Proceedings of the 2nd international workshop on Scientific cloud computing
Towards efficient subgraph search in cloud computing environments

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
Adapting skyline computation to the MapReduce framework: algorithms and experiments

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
LinearDB: a relational approach to make data warehouse scale like MapReduce

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications: Part II
PigSPARQL: mapping SPARQL to Pig Latin

Proceedings of the International Workshop on Semantic Web Information Management
HiTune: dataflow-based performance analysis for big data cloud

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
TidyFS: a simple and small distributed file system

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
CoHadoop: flexible data placement and its exploitation in Hadoop

Proceedings of the VLDB Endowment
Sloppy Python: using dynamic analysis to automatically add error tolerance to ad-hoc data processing scripts

Proceedings of the Ninth International Workshop on Dynamic Analysis
An intermediate algebra for optimizing RDF graph pattern matching on MapReduce

ESWC'11 Proceedings of the 8th extended semantic web conference on The semantic web: research and applications - Volume Part II
NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
GBASE: a scalable and general graph management system

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Towards a scalable and robust multi-tenancy SaaS

Proceedings of the Second Asia-Pacific Symposium on Internetware
Spectral analysis for billion-scale graphs: discoveries and implementation

PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part II
New ideas track: testing mapreduce-style programs

Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering
Brown Dwarf: A fully-distributed, fault-tolerant data warehousing system

Journal of Parallel and Distributed Computing
How the minotaur turned into ariadne: ontologies in web data extraction

ICWE'11 Proceedings of the 11th international conference on Web engineering
CloudFuice: a flexible cloud-based data integration system

ICWE'11 Proceedings of the 11th international conference on Web engineering
ETLMR: a highly scalable dimensional ETL framework based on mapreduce

DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
Tagged mapreduce: efficiently computing multi-analytics using mapreduce

DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
Principles of distributed data management in 2020?

DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I
Data integration over NoSQL stores using access path based mappings

DEXA'11 Proceedings of the 22nd international conference on Database and expert systems applications - Volume Part I
Incoop: MapReduce for incremental computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
CoScan: cooperative scan sharing in the cloud

Proceedings of the 2nd ACM Symposium on Cloud Computing
Query optimization for massively parallel data processing

Proceedings of the 2nd ACM Symposium on Cloud Computing
PrIter: a distributed framework for prioritized iterative computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
Trojan data layouts: right shoes for a running elephant

Proceedings of the 2nd ACM Symposium on Cloud Computing
Automatic management of partitioned, replicated search services

Proceedings of the 2nd ACM Symposium on Cloud Computing
Scaling the mobile millennium system in the cloud

Proceedings of the 2nd ACM Symposium on Cloud Computing
Query engine grid for executing SQL streaming process

Globe'11 Proceedings of the 4th international conference on Data management in grid and peer-to-peer systems
Comparing high level mapreduce query languages

APPT'11 Proceedings of the 9th international conference on Advanced parallel processing technologies
CrowdForge: crowdsourcing complex work

Proceedings of the 24th annual ACM symposium on User interface software and technology
The jabberwocky programming environment for structured social computing

Proceedings of the 24th annual ACM symposium on User interface software and technology
Easy and effective parallel programmable ETL

Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
Building wavelet histograms on large data in MapReduce

Proceedings of the VLDB Endowment
Efficient processing of RDF graph pattern matching on MapReduce platforms

Proceedings of the second international workshop on Data intensive computing in the clouds
ChuQL: processing XML with XQuery using Hadoop

Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research
Parallel data processing with MapReduce: a survey

ACM SIGMOD Record
Processing and visualizing the data in tweets

ACM SIGMOD Record
Putting lipstick on pig: enabling database-style workflow provenance

Proceedings of the VLDB Endowment
Of hammers and nails: an empirical comparison of three paradigms for processing large graphs

Proceedings of the fifth ACM international conference on Web search and data mining
Case study of scientific data processing on a cloud using hadoop

HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Executing multiple group by query using mapreduce approach: implementation and optimization

GPC'10 Proceedings of the 5th international conference on Advances in Grid and Pervasive Computing
GLADE: a scalable framework for efficient analytics

ACM SIGOPS Operating Systems Review
Tarazu: optimizing MapReduce on heterogeneous clusters

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Social networking in developing regions

Proceedings of the Fifth International Conference on Information and Communication Technologies and Development
ReStore: reusing results of MapReduce jobs

Proceedings of the VLDB Endowment
Jockey: guaranteed job latency in data parallel clusters

Proceedings of the 7th ACM european conference on Computer Systems
MadLINQ: large-scale distributed matrix computation for the cloud

Proceedings of the 7th ACM european conference on Computer Systems
Static scheduling in clouds

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
The datacenter needs an operating system

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
HiTune: dataflow-based performance analysis for big data cloud

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
A universal calculus for stream processing languages

ESOP'10 Proceedings of the 19th European conference on Programming Languages and Systems
PerfXplain: debugging MapReduce job performance

Proceedings of the VLDB Endowment
Abstract state machines for data-parallel computing

Conceptual Modelling and Its Theoretical Foundations
RDFPath: path query processing on large RDF graphs with mapreduce

ESWC'11 Proceedings of the 8th international conference on The Semantic Web
Resource provisioning framework for mapreduce jobs with performance goals

Middleware'11 Proceedings of the 12th ACM/IFIP/USENIX international conference on Middleware
The HaLoop approach to large-scale iterative data analysis

The VLDB Journal — The International Journal on Very Large Data Bases
Trust me, I'm partially right: incremental visualization lets analysts explore large datasets faster

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Proceedings of the VLDB Endowment
What next?: a half-dozen data management research goals for big data and the cloud

PODS '12 Proceedings of the 31st symposium on Principles of Database Systems
High performance spatial query processing for large scale scientific data

PhD '12 Proceedings of the SIGMOD/PODS 2012 PhD Symposium
Advanced partitioning techniques for massively distributed computation

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Declarative error management for robust data-intensive applications

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
GLADE: big data analytics made easy

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
ReStore: reusing results of MapReduce jobs in pig

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Oracle in-database hadoop: when mapreduce meets RDBMS

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Large-scale machine learning at twitter

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Optimizing analytic data flows for multiple execution engines

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Re-optimizing data-parallel computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Optimizing data shuffling in data-parallel computation by understanding user-defined functions

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Orchestrating the deployment of computations in the cloud with conductor

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
To nest or not to nest, when and how much: representing intermediate results of graph pattern queries in MapReduce based processing

SWIM '12 Proceedings of the 4th International Workshop on Semantic Web Information Management
Inside "Big Data management": ogres, onions, or parfaits?

Proceedings of the 15th International Conference on Extending Database Technology
An optimization framework for map-reduce queries

Proceedings of the 15th International Conference on Extending Database Technology
Efficient SPARQL query processing in mapreduce through data partitioning and indexing

APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Bizard: an online multi-dimensional data analysis visualization tool

APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
ComMapReduce: an improvement of mapreduce with lightweight communication mechanisms

DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part II
Improving the diagnosis of mild hypertrophic cardiomyopathy with MapReduce

Proceedings of the third international workshop on MapReduce and its Applications
Understanding the effects and implications of compute node related failures in hadoop

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Maestro: Replica-Aware Map Scheduling for MapReduce

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
ParaLite: Supporting Collective Queries in Database System to Parallelize User-Defined Executable

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
FunSQL: it is time to make SQL functional

Proceedings of the 2012 Joint EDBT/ICDT Workshops
Cost models for view materialization in the cloud

Proceedings of the 2012 Joint EDBT/ICDT Workshops
Using Pig as a data preparation language for large-scale mining software repositories studies: An experience report

Journal of Systems and Software
ASTERIX: scalable warehouse-style web data integration

Proceedings of the Ninth International Workshop on Information Integration on the Web
From a calculus to an execution environment for stream processing

Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems
Early accurate results for advanced analytics on MapReduce

Proceedings of the VLDB Endowment
Bridging the divide between software developers and operators using logs

Proceedings of the 34th International Conference on Software Engineering
MapReduce indexing strategies: Studying scalability and efficiency

Information Processing and Management: an International Journal
GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
The seven deadly sins of cloud computing research

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing
Systematic approach of using power save mode for cloud data processing services

International Journal of Ad Hoc and Ubiquitous Computing
Towards a hybrid row-column database for a cloud-based medical data management system

Proceedings of the 1st International Workshop on Cloud Intelligence
Opening the black boxes in data flow optimization

Proceedings of the VLDB Endowment
REX: recursive, delta-based data-centric computation

Proceedings of the VLDB Endowment
Optimization of analytic data flows for next generation business intelligence applications

TPCTC'11 Proceedings of the Third TPC Technology conference on Topics in Performance Evaluation, Measurement and Characterization
HadoopRDF: a scalable semantic data analytical engine

ICIC'12 Proceedings of the 8th international conference on Intelligent Computing Theories and Applications
M3R: increased performance for in-memory Hadoop jobs

Proceedings of the VLDB Endowment
The unified logging infrastructure for data analytics at Twitter

Proceedings of the VLDB Endowment
ASTERIX: an open source system for "Big Data" management and analysis (demo)

Proceedings of the VLDB Endowment
Boa: analyzing ultra-large-scale code corpus

Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity
Data-intensive architecture for scientific knowledge discovery

Distributed and Parallel Databases
gbase: an efficient analysis platform for large graphs

The VLDB Journal — The International Journal on Very Large Data Bases
SCOPE: parallel databases meet MapReduce

The VLDB Journal — The International Journal on Very Large Data Bases
Spotting code optimizations in data-parallel pipelines through PeriSCOPE

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
A study on data deduplication in HPC storage systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

ACM Transactions on Database Systems (TODS)
Scripting distributed scientific workflows using Weaver

Concurrency and Computation: Practice & Experience
Multimedia Applications and Security in MapReduce: Opportunities and Challenges

Concurrency and Computation: Practice & Experience
HEDC: a histogram estimator for data in the cloud

Proceedings of the fourth international workshop on Cloud data management
Coflow: a networking abstraction for cluster applications

Proceedings of the 11th ACM Workshop on Hot Topics in Networks
Sailfish: a framework for large scale data processing

Proceedings of the Third ACM Symposium on Cloud Computing
Metaphor: a system for related search recommendations

Proceedings of the 21st ACM international conference on Information and knowledge management
On-the-fly task execution for speeding up pipelined mapreduce

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Towards integrating workflow and database provenance

IPAW'12 Proceedings of the 4th international conference on Provenance and Annotation of Data and Processes
Resource provisioning framework for MapReduce jobs with performance goals

Proceedings of the 12th International Middleware Conference
Optimizing large-scale Semi-Naïve datalog evaluation in hadoop

Datalog 2.0'12 Proceedings of the Second international conference on Datalog in Academia and Industry
Just-in-time data distribution for analytical query processing

ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Cogset: a high performance MapReduce engine

Concurrency and Computation: Practice & Experience
Scalable RDF data compression with MapReduce

Concurrency and Computation: Practice & Experience
Towards building a high performance spatial query system for large scale medical imaging data

Proceedings of the 20th International Conference on Advances in Geographic Information Systems
Robust runtime optimization and skew-resistant execution of analytical SPARQL queries on pig

ISWC'12 Proceedings of the 11th international conference on The Semantic Web - Volume Part I
Static and dynamic semantics of NoSQL languages

POPL '13 Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Report from the first workshop on scalable workflow enactment engines and technology (SWEET'12)

ACM SIGMOD Record
Constructing a data accessing layer for in-memory data grid

Proceedings of the Fourth Asia-Pacific Symposium on Internetware
Learning to rank for spatiotemporal search

Proceedings of the sixth ACM international conference on Web search and data mining
Makeflow: a portable abstraction for data intensive computing on clusters, clouds, and grids

Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies
Turbine: a distributed-memory dataflow engine for extreme-scale many-task applications

Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies
Tiled-MapReduce: Efficient and Flexible MapReduce Processing on Multicore with Tiling

ACM Transactions on Architecture and Code Optimization (TACO)
Supporting data aspects in pig latin

Proceedings of the 12th annual international conference on Aspect-oriented software development
Invisible loading: access-driven data transfer from raw files into database systems

Proceedings of the 16th International Conference on Extending Database Technology
Eagle-eyed elephant: split-oriented indexing in Hadoop

Proceedings of the 16th International Conference on Extending Database Technology
Efficient processing of containment queries on nested sets

Proceedings of the 16th International Conference on Extending Database Technology
HIL: a high-level scripting language for entity integration

Proceedings of the 16th International Conference on Extending Database Technology
Scalable SPARQL querying processing on large RDF data in cloud computing environment

ICPCA/SWS'12 Proceedings of the 2012 international conference on Pervasive Computing and the Networked World
The big data ecosystem at LinkedIn

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
BigBench: towards an industry standard benchmark for big data analytics

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Minimal MapReduce algorithms

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Communication steps for parallel query processing

Proceedings of the 32nd symposium on Principles of database systems
Cumulon: optimizing statistical data analysis in the cloud

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Fast data in the era of big data: Twitter's real-time related query suggestion architecture

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
TimeStream: reliable stream computation in the cloud

Proceedings of the 8th ACM European Conference on Computer Systems
Optimus: a dynamic rewriting framework for data-parallel execution plans

Proceedings of the 8th ACM European Conference on Computer Systems
Presto: distributed machine learning and graph processing with sparse matrices

Proceedings of the 8th ACM European Conference on Computer Systems
CPI2: CPU performance isolation for shared compute clusters

Proceedings of the 8th ACM European Conference on Computer Systems
A bloat-aware design for big data applications

Proceedings of the 2013 international symposium on memory management
Scaling big data mining infrastructure: the twitter experience

ACM SIGKDD Explorations Newsletter
Big graph mining: algorithms and discoveries

ACM SIGKDD Explorations Newsletter
Issues in big data testing and benchmarking

Proceedings of the Sixth International Workshop on Testing Database Systems
HyMR: a hybrid MapReduce workflow system

Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences
Early experiences in using a domain-specific language for large-scale graph analysis

First International Workshop on Graph Data Management Experiences and Systems
On benchmarking online social media analytical queries

First International Workshop on Graph Data Management Experiences and Systems
Exploiting MapReduce and data compression for data-intensive applications

Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
Assisting developers of big data analytics applications when deploying on hadoop clouds

Proceedings of the 2013 International Conference on Software Engineering
Boa: a language and infrastructure for analyzing ultra-large-scale software repositories

Proceedings of the 2013 International Conference on Software Engineering
A characteristic study on failures of production distributed data-parallel programs

Proceedings of the 2013 International Conference on Software Engineering
WTF: the who to follow service at Twitter

Proceedings of the 22nd international conference on World Wide Web
Large-scale computation not at the cost of expressiveness

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Efficient social network data query processing on MapReduce

Proceedings of the 5th ACM workshop on HotPlanet
An adaptive data transfer algorithm using block device reconfiguration in virtual MapReduce clusters

Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
Review: An overview of anonymity technology usage

Computer Communications
Distributed data management using MapReduce

ACM Computing Surveys (CSUR)
Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions

Journal of Grid Computing
MrCrypt: static analysis for secure cloud computations

Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
Scheduling data processing flows under budget constraint on the cloud

Proceedings of the 2013 Research in Adaptive and Convergent Systems
The family of mapreduce and large-scale data processing systems

ACM Computing Surveys (CSUR)
Scalable lineage capture for debugging DISC analytics

Proceedings of the 4th annual Symposium on Cloud Computing
Natjam: design and evaluation of eviction policies for supporting priorities and deadlines in mapreduce clusters

Proceedings of the 4th annual Symposium on Cloud Computing
Apache Hadoop YARN: yet another resource negotiator

Proceedings of the 4th annual Symposium on Cloud Computing
BrackitMR: flexible XQuery processing in mapreduce

WAIM'13 Proceedings of the 14th international conference on Web-Age Information Management
PonIC: using stratosphere to speed up pig analytics

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
MR-runner: a modularized map-reduce job management tool

Proceedings of the 5th Asia-Pacific Symposium on Internetware
Semantics and provenance for processing element composition in dispel workflows

WORKS '13 Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science
CRUCIBLE: towards unified secure on- and off-line analytics at scale

DISCS-2013 Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems
TAO: Facebook's distributed data store for the social graph

USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
PIKACHU: how to rebalance load in optimizing mapreduce on heterogeneous clusters

USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Network support for resource disaggregation in next-generation datacenters

Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks
Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads

Proceedings of the VLDB Endowment
Continuous cloud-scale query optimization and processing

Proceedings of the VLDB Endowment
Piranha: optimizing short jobs in Hadoop

Proceedings of the VLDB Endowment
Hadoop GIS: a high performance spatial data warehousing system over mapreduce

Proceedings of the VLDB Endowment
A demonstration of SpatialHadoop: an efficient mapreduce framework for spatial data

Proceedings of the VLDB Endowment
Medical data management in the SYSEO project

ACM SIGMOD Record
Active data: a data-centric approach to data life-cycle management

PDSW '13 Proceedings of the 8th Parallel Data Storage Workshop
Optimized data management for e-learning in the clouds towards Cloodle

Proceedings of the Fourth Symposium on Information and Communication Technology
Simplifying Scalable Graph Processing with a Domain-Specific Language

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Distributed socialite: a datalog-based language for large-scale graph analysis

Proceedings of the VLDB Endowment
ComMapReduce: An improvement of MapReduce with lightweight communication mechanisms

Data & Knowledge Engineering
Modeling and optimizing large-scale data flows

Future Generation Computer Systems
Dimension independent similarity computation

The Journal of Machine Learning Research
Run-time performance optimization of a BigData query language

Proceedings of the 5th ACM/SPEC international conference on Performance engineering
SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

Journal of Parallel and Distributed Computing
Order matters! Harnessing a world of orderings for reasoning over massive data

Semantic Web
Hybrid Analytic Flows-the Case for Optimization

Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology
Turbine: A Distributed-memory Dataflow Engine for High Performance Many-task Applications

Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology
A platform for eXtreme analytics

IBM Journal of Research and Development
IBM streams processing language: analyzing big data in motion

IBM Journal of Research and Development
Aggregation and degradation in JetStream: streaming analytics in the wide area

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Abstract

There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative SQL style unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of both points. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig and can lead to even higher productivity gains. Pig is an open-source Apache incubator project and is available for general use.
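
To make the "sweet spot" described in the abstract more concrete, the following is a minimal Python sketch of a step-by-step dataflow: each statement applies one high-level transformation (filter, group, per-group aggregate) and names its intermediate result. It is an illustration only, not Pig Latin syntax and not the Pig implementation; the helper names (`filter_step`, `group_step`, `foreach_step`), the record layout, the sample data, and the thresholds are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: (url, category, pagerank) records; not data from the paper.
urls = [
    ("a.example", "news", 0.75),
    ("b.example", "news", 0.5),
    ("c.example", "news", 0.125),
    ("d.example", "sports", 0.75),
    ("e.example", "sports", 0.25),
    ("f.example", "blogs", 0.125),
]


def filter_step(records, predicate):
    """Keep only the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


def group_step(records, key):
    """Collect records into groups that share the same key."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return dict(groups)


def foreach_step(groups, fn):
    """Apply fn to each (key, group) pair, producing one output row per group."""
    return [fn(k, g) for k, g in groups.items()]


# Each line is one transformation with a named intermediate result,
# rather than a single declarative query or hand-written map and reduce functions.
good_urls = filter_step(urls, lambda r: r[2] > 0.2)            # drop low-pagerank URLs
by_category = group_step(good_urls, key=lambda r: r[1])        # group by category
big_groups = {k: g for k, g in by_category.items() if len(g) >= 2}  # keep larger groups
result = foreach_step(big_groups, lambda k, g: (k, mean(r[2] for r in g)))

print(result)
```

In a system like the one the abstract describes, steps of this kind would be compiled into map-reduce jobs rather than run in memory; the sketch only shows the programming style, in which the author controls the sequence of operations while each operation stays high-level.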