Interpreting the data: Parallel analysis with Sawzall

Authors:
Rob Pike; Sean Dorward; Robert Griesemer; Sean Quinlan
Affiliations:
Google, Inc. CA, USA; Google, Inc. CA, USA; Google, Inc. CA, USA; Google, Inc. CA, USA
Venue:
Scientific Programming - Dynamic Grids and Worldwide Computing
Year:
2005

Abstract

Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new procedural programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
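The two-phase structure described in the abstract can be illustrated with a small sketch. This is not Sawzall and not the authors' implementation: it is a minimal Python analogy, and the names `filter_record`, `run_shard`, and `aggregate`, the record layout (`host size`), and the choice of a summing aggregator are all assumptions made for illustration. The filter sees one record at a time and emits values to named aggregators; each shard could be processed on a different machine, and the partial results are then collated.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical representation of an emitted value: (aggregator name, value).
Emit = Tuple[str, int]


def filter_record(record: str) -> Iterator[Emit]:
    """Filtering phase: examine ONE record in isolation and emit values."""
    fields = record.split()
    if len(fields) != 2:
        return  # skip malformed records (assumed format: "host size")
    _host, size = fields
    if not size.isdigit():
        return
    yield ("count", 1)          # one more valid record
    yield ("bytes", int(size))  # bytes contributed by this record


def run_shard(records: Iterable[str]) -> Counter:
    """Run the filter over one shard and sum its emitted values locally."""
    partial: Counter = Counter()
    for record in records:
        for key, value in filter_record(record):
            partial[key] += value
    return partial


def aggregate(partials: Iterable[Counter]) -> Counter:
    """Aggregation phase: collate per-shard partial sums into one result."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


# Each inner list stands in for the records held on one machine.
shards = [
    ["a.example 100", "b.example 250"],
    ["bad line here", "c.example 50"],
]
result = aggregate(run_shard(shard) for shard in shards)
print(dict(result))  # prints {'count': 3, 'bytes': 400}
```

Because the aggregator in this sketch is a sum, which is commutative and associative, the shards can be processed and merged in any order and still give the same totals. That is the kind of aggregator property that makes the work easy to spread across many machines.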