SCOPE: easy and efficient parallel processing of massive data sets

Authors:
Ronnie Chaiken;Bob Jenkins;Per-Åke Larson;Bill Ramsey;Darren Shakib;Simon Weaver;Jingren Zhou
Affiliations:
Microsoft Corporation;Microsoft Corporation;Microsoft Corporation;Microsoft Corporation;Microsoft Corporation;Microsoft Corporation;Microsoft Corporation
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Abstract

Companies providing cloud-scale services have an increasing need to store and analyze massive data sets such as search logs and click streams. For cost and performance reasons, processing is typically done on large clusters of shared-nothing commodity machines. It is imperative to develop a programming model that hides the complexity of the underlying system but provides flexibility by allowing users to extend functionality to meet a variety of requirements. In this paper, we present a new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), targeted for this type of massive data analysis. The language is designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. SCOPE borrows several features from SQL. Data is modeled as sets of rows composed of typed columns. The select statement is retained with inner joins, outer joins, and aggregation allowed. Users can easily define their own functions and implement their own versions of operators: extractors (parsing and constructing rows from a file), processors (row-wise processing), reducers (group-wise processing), and combiners (combining rows from two inputs). SCOPE supports nesting of expressions but also allows a computation to be specified as a series of steps, in a manner often preferred by programmers. We also describe how scripts are compiled into efficient, parallel execution plans and executed on large clusters.
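
The four user-defined operator roles named in the abstract can be illustrated with a minimal sketch. This is not SCOPE syntax and not the authors' implementation; it is a single-machine Python analogy, and every function name, column name, and sample value below is hypothetical. It only shows how an extractor, a processor, a reducer, and a combiner differ in shape, and how a computation can be written as a series of steps.

```python
from itertools import groupby
from operator import itemgetter


def extract(lines):
    """Extractor: parse raw text lines into rows with typed columns."""
    for line in lines:
        query, clicks = line.rstrip("\n").split("\t")
        yield {"query": query, "clicks": int(clicks)}


def process(rows):
    """Processor: row-wise processing; here, normalize the query text."""
    for row in rows:
        yield {**row, "query": row["query"].strip().lower()}


def reduce_by(rows, key):
    """Reducer: group-wise processing; here, sum clicks per group."""
    ordered = sorted(rows, key=itemgetter(key))
    for value, group in groupby(ordered, key=itemgetter(key)):
        yield {key: value, "clicks": sum(r["clicks"] for r in group)}


def combine(left, right, key):
    """Combiner: combine rows from two inputs; here, an inner join on key."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    for l in left:
        for r in index.get(l[key], []):
            yield {**l, **r}


# Hypothetical sample inputs: a tab-separated click log and a lookup table.
log = ["SCOPE\t3", "dryad\t1", "scope \t2"]
categories = [{"query": "scope", "topic": "languages"}]

# The computation written as a series of steps.
totals = list(reduce_by(process(extract(log)), "query"))
joined = list(combine(totals, categories, "query"))
```

In this sketch each operator consumes and produces a stream of rows, which is what lets the steps be chained; a real system would partition the rows across machines rather than sort them in memory.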