A comparison of approaches to large-scale data analysis

Authors:
Andrew Pavlo;Erik Paulson;Alexander Rasin;Daniel J. Abadi;David J. DeWitt;Samuel Madden;Michael Stonebraker
Affiliations:
Brown University, Providence, RI, USA;University of Wisconsin, Madison, WI, USA;Brown University, Providence, RI, USA;Yale University, New Haven, CT, USA;Microsoft Inc., Madison, WI, USA;Massachusetts Institute of Technology, Cambridge, MA, USA;Massachusetts Institute of Technology, Cambridge, MA, USA
Venue:
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Year:
2009

Citing 12
Cited 171

An Overview of The System Software of A Parallel Relational Database Machine GRACE

VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
GAMMA - A High Performance Dataflow Database Machine

VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Implementation of data abstraction in the relational database system INGRES

ACM SIGMOD Record
LINQ: reconciling object, relations and XML in the .NET framework

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Agile Web Development with Rails

Agile Web Development with Rails
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Multiprocessor hash-based join algorithms

VLDB '85 Proceedings of the 11th international conference on Very Large Data Bases - Volume 11
Technical perspective: the data center is the computer

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment

MapReduce and parallel DBMSs: friends or foes?

Communications of the ACM - Amir Pnueli: Ahead of His Time
MapReduce: a flexible data processing tool

Communications of the ACM - Amir Pnueli: Ahead of His Time
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
RAPID: Enabling Scalable Ad-Hoc Analytics on the Semantic Web

ISWC '09 Proceedings of the 8th International Semantic Web Conference
Efficiency matters!

ACM SIGOPS Operating Systems Review
Caching and Materialization for Web Databases

Foundations and Trends in Databases
HadoopToSQL: a mapReduce query optimizer

Proceedings of the 5th European conference on Computer systems
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

Proceedings of the 5th European conference on Computer systems
Towards scalable RDF graph analytics on MapReduce

Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
SPARQL basic graph pattern processing with iterative MapReduce

Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud
Skew-resistant parallel processing of feature-extracting scientific user-defined functions

Proceedings of the 1st ACM symposium on Cloud computing
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Proceedings of the 1st ACM symposium on Cloud computing
Benchmarking cloud serving systems with YCSB

Proceedings of the 1st ACM symposium on Cloud computing
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
An evaluation of alternative architectures for transaction processing in the cloud

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Integrating Hadoop and parallel DBMS

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A comparison of join algorithms for log processing in MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
HadoopDB in action: building real world applications

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
ASSET queries: a declarative alternative to MapReduce

ACM SIGMOD Record
BTWorld: towards observing the global BitTorrent file-sharing network

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Manimal: relational optimization for data-intensive programs

Proceedings of the 13th International Workshop on the Web and Databases
Parallel bulk insertion for large-scale analytics applications

Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
Wimpy node clusters: what about non-wimpy workloads?

Proceedings of the Sixth International Workshop on Data Management on New Hardware
Towards personal high-performance geospatial computing (HPC-G): perspectives and a case study

Proceedings of the ACM SIGSPATIAL International Workshop on High Performance and Distributed Geographic Information Systems
Cloud computing for geosciences: deployment of GEOSS clearinghouse on Amazon's EC2

Proceedings of the ACM SIGSPATIAL International Workshop on High Performance and Distributed Geographic Information Systems
Massive structured data management solution

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Benchmarking cloud-based data management systems

CloudDB '10 Proceedings of the second international workshop on Cloud data management
Tradeoffs between parallel database systems, Hadoop, and HadoopDB as platforms for petabyte-scale analysis

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Supporting web-based visual exploration of large-scale raster geospatial data using binned min-max Quadtree

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Comparing Hadoop and Fat-Btree based access method for small file I/O applications

WAIM'10 Proceedings of the 11th international conference on Web-age information management
High throughput data-compression for cloud storage

Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
A middleware for parallel processing of large graphs

Proceedings of the 8th International Workshop on Middleware for Grids, Clouds and e-Science
Energy management for MapReduce clusters

Proceedings of the VLDB Endowment
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment
Runtime measurements in the cloud: observing, analyzing, and reducing variance

Proceedings of the VLDB Endowment
The performance of MapReduce: an in-depth study

Proceedings of the VLDB Endowment
MRShare: sharing across multiple queries in MapReduce

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
DataGarage: warehousing massive performance data on commodity servers

Proceedings of the VLDB Endowment
HADI: Mining Radii of Large Graphs

ACM Transactions on Knowledge Discovery from Data (TKDD)
Integrating MapReduce and RDBMSs

Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
Large-scale incremental processing using distributed transactions and notifications

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Chukwa: a system for reliable large-scale log collection

LISA'10 Proceedings of the 24th international conference on Large installation system administration
The case for object databases in cloud data management

ICOODB'10 Proceedings of the Third international conference on Objects and databases
Liquid benchmarks: towards an online platform for collaborative assessment of computer science research results

TPCTC'10 Proceedings of the Second TPC technology conference on Performance evaluation, measurement and characterization of complex systems
Big data and cloud computing: current state and future opportunities

Proceedings of the 14th International Conference on Extending Database Technology
RanKloud: a scalable ranked query processing framework on hadoop

Proceedings of the 14th International Conference on Extending Database Technology
10 rules for scalable performance in 'simple operation' datastores

Communications of the ACM
Automatic optimization for MapReduce programs

Proceedings of the VLDB Endowment
Toward a standard benchmark for computer security research: the worldwide intelligence network environment (WINE)

Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security
Towards improved load balancing for data intensive distributed computing

Proceedings of the 2011 ACM Symposium on Applied Computing
Column-oriented storage techniques for MapReduce

Proceedings of the VLDB Endowment
A latency and fault-tolerance optimizer for online parallel query plans

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Processing theta-joins using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Efficient processing of data warehousing queries in a split execution environment

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Providing scalable database services on the cloud

WISE'10 Proceedings of the 11th international conference on Web information systems engineering
Optimizing data partitioning for data-parallel computing

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Full-text indexing for optimizing selection operations in large-scale data analytics

Proceedings of the second international workshop on MapReduce and its applications
The case for being lazy: how to leverage lazy evaluation in MapReduce

Proceedings of the 2nd international workshop on Scientific cloud computing
LinearDB: a relational approach to make data warehouse scale like MapReduce

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications: Part II
PigSPARQL: mapping SPARQL to Pig Latin

Proceedings of the International Workshop on Semantic Web Information Management
HiTune: dataflow-based performance analysis for big data cloud

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
CoHadoop: flexible data placement and its exploitation in Hadoop

Proceedings of the VLDB Endowment
Towards a scalable and robust multi-tenancy SaaS

Proceedings of the Second Asia-Pacific Symposium on Internetware
Brown Dwarf: A fully-distributed, fault-tolerant data warehousing system

Journal of Parallel and Distributed Computing
On the benefits of transparent compression for cost-effective cloud data storage

Transactions on large-scale data- and knowledge-centered systems III
ONE: a predictable and scalable DW model

DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
ETLMR: a highly scalable dimensional ETL framework based on mapreduce

DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
VarDB: high-performance warehouse processing with massive ordering and binary search

DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
Tagged mapreduce: efficiently computing multi-analytics using mapreduce

DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
An efficient multi-tier tablet server storage architecture

Proceedings of the 2nd ACM Symposium on Cloud Computing
DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems

Proceedings of the 2nd ACM Symposium on Cloud Computing
PrIter: a distributed framework for prioritized iterative computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
Trojan data layouts: right shoes for a running elephant

Proceedings of the 2nd ACM Symposium on Cloud Computing
Power-efficient networking for balanced system designs: early experiences with PCIe

HotPower '11 Proceedings of the 4th Workshop on Power-Aware Computing and Systems
Automatic physical database tuning middleware for web-based applications

ADBIS'11 Proceedings of the 15th international conference on Advances in databases and information systems
Building cubes with MapReduce

Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
Improving the efficiency of subset queries on raster images

Proceedings of the ACM SIGSPATIAL Second International Workshop on High Performance and Distributed Geographic Information Systems
A predictable storage model for scalable parallel DW

Proceedings of the 15th Symposium on International Database Engineering & Applications
Building wavelet histograms on large data in MapReduce

Proceedings of the VLDB Endowment
Benchmarking MapReduce Implementations for Application Usage Scenarios

GRID '11 Proceedings of the 2011 IEEE/ACM 12th International Conference on Grid Computing
Parallel data processing with MapReduce: a survey

ACM SIGMOD Record
GLADE: a scalable framework for efficient analytics

ACM SIGOPS Operating Systems Review
HiTune: dataflow-based performance analysis for big data cloud

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Online optimization for scheduling preemptable tasks on IaaS cloud systems

Journal of Parallel and Distributed Computing
Distributed parallel architecture for storing and processing large datasets

SEPADS'12/EDUCATION'12 Proceedings of the 11th WSEAS international conference on Software Engineering, Parallel and Distributed Systems, and proceedings of the 9th WSEAS international conference on Engineering Education
Abstract state machines for data-parallel computing

Conceptual Modelling and Its Theoretical Foundations
Apriori-based frequent itemset mining algorithms on MapReduce

Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
The HaLoop approach to large-scale iterative data analysis

The VLDB Journal — The International Journal on Very Large Data Bases
Performance Evaluation of Range Queries in Key Value Stores

Journal of Grid Computing
High performance spatial query processing for large scale scientific data

PhD '12 Proceedings of the SIGMOD/PODS 2012 PhD Symposium
Shark: fast data analysis using coarse-grained distributed memory

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
GLADE: big data analytics made easy

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Clydesdale: structured data processing on hadoop

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
TIRAMOLA: elastic nosql provisioning through a cloud management platform

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Optimizing data shuffling in data-parallel computation by understanding user-defined functions

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Automatic scaling of selective SPARQL joins using the TIRAMOLA system

SWIM '12 Proceedings of the 4th International Workshop on Semantic Web Information Management
The efficiency of mapreduce in parallel external memory

LATIN'12 Proceedings of the 10th Latin American international conference on Theoretical Informatics
Inside "Big Data management": ogres, onions, or parfaits?

Proceedings of the 15th International Conference on Extending Database Technology
Clydesdale: structured data processing on MapReduce

Proceedings of the 15th International Conference on Extending Database Technology
Adaptive MapReduce using situation-aware mappers

Proceedings of the 15th International Conference on Extending Database Technology
Delay tails in MapReduce scheduling

Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
ComMapReduce: an improvement of mapreduce with lightweight communication mechanisms

DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part II
Halt or continue: estimating progress of queries in the cloud

DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications - Volume Part II
ParaLite: Supporting Collective Queries in Database System to Parallelize User-Defined Executable

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
MapReduce indexing strategies: Studying scalability and efficiency

Information Processing and Management: an International Journal
The seven deadly sins of cloud computing research

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing
Toward efficient querying of compressed network payloads

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Stubby: a transformation-based optimizer for MapReduce workflows

Proceedings of the VLDB Endowment
Only aggressive elephants are fast elephants

Proceedings of the VLDB Endowment
Towards energy-efficient database cluster design

Proceedings of the VLDB Endowment
TEEPA: a timely-aware elastic parallel architecture

Proceedings of the 16th International Database Engineering & Applications Symposium
PRISM: privacy-preserving search in mapreduce

PETS'12 Proceedings of the 12th international conference on Privacy Enhancing Technologies
The MADlib analytics library: or MAD skills, the SQL

Proceedings of the VLDB Endowment
Can the elephants handle the NoSQL onslaught?

Proceedings of the VLDB Endowment
Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads

Proceedings of the VLDB Endowment
Efficient big data processing in Hadoop MapReduce

Proceedings of the VLDB Endowment
Spanner: Google's globally-distributed database

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
HSim: A MapReduce simulator in enabling Cloud Computing

Future Generation Computer Systems
SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

ACM Transactions on Database Systems (TODS)
Multimedia Applications and Security in MapReduce: Opportunities and Challenges

Concurrency and Computation: Practice & Experience
Towards benchmarking stream data warehouses

Proceedings of the fifteenth international workshop on Data warehousing and OLAP
Sailfish: a framework for large scale data processing

Proceedings of the Third ACM Symposium on Cloud Computing
Join processing using Bloom filter in MapReduce

Proceedings of the 2012 ACM Research in Applied Computation Symposium
Overcoming the scalability limitations of parallel star schema data warehouses

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Using mapreduce to scale events correlation discovery for business processes mining

BPM'12 Proceedings of the 10th international conference on Business Process Management
Just-in-time data distribution for analytical query processing

ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Optimizing and Tuning MapReduce Jobs to Improve the Large-Scale Data Analysis Process

International Journal of Intelligent Systems
Cogset: a high performance MapReduce engine

Concurrency and Computation: Practice & Experience
Towards building a high performance spatial query system for large scale medical imaging data

Proceedings of the 20th International Conference on Advances in Geographic Information Systems
Providing timely results with an elastic parallel DW

ISMIS'12 Proceedings of the 20th international conference on Foundations of Intelligent Systems
Map/reduce on EMF models

Proceedings of the 1st International Workshop on Model-Driven Engineering for High Performance and Cloud computing
Speeding up large-scale point-in-polygon test based spatial join on GPUs

Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data
An ontology enhanced parallel SVM for scalable spam filter training

Neurocomputing
Breaking the MapReduce stage barrier

Cluster Computing
Invisible loading: access-driven data transfer from raw files into database systems

Proceedings of the 16th International Conference on Extending Database Technology
A performance comparison of parallel DBMSs and MapReduce on large-scale text analytics

Proceedings of the 16th International Conference on Extending Database Technology
STEMscopes: contextualizing learning analytics in a K-12 science curriculum

Proceedings of the Third International Conference on Learning Analytics and Knowledge
BigBench: towards an industry standard benchmark for big data analytics

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
CARTILAGE: adding flexibility to the Hadoop skeleton

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Shark: SQL and rich analytics at scale

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Optimus: a dynamic rewriting framework for data-parallel execution plans

Proceedings of the 8th ACM European Conference on Computer Systems
Rhea: automatic filtering for unstructured cloud storage

NSDI'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Issues in big data testing and benchmarking

Proceedings of the Sixth International Workshop on Testing Database Systems
DeepSea: self-adaptive data partitioning and replication in scalable distributed data systems

Proceedings of the 2013 SIGMOD/PODS Ph.D. Symposium
MapReduce with communication overlap (MaRCO)

Journal of Parallel and Distributed Computing
Towards a workload for evolutionary analytics

Proceedings of the Second Workshop on Data Analytics in the Cloud
Spanner: Google’s Globally Distributed Database

ACM Transactions on Computer Systems (TOCS)
EMF modeling in traffic surveillance experiments

Proceedings of the Modelling of the Physical World Workshop
Distributed data management using MapReduce

ACM Computing Surveys (CSUR)
Thermal Modeling of Hybrid Storage Clusters

Journal of Signal Processing Systems
Discovering influential authors in heterogeneous academic networks by a co-ranking method

Proceedings of the 22nd ACM international conference on information & knowledge management
Prolog programming with a map-reduce parallel construct

Proceedings of the 15th Symposium on Principles and Practice of Declarative Programming
Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions

Journal of Grid Computing
MrCrypt: static analysis for secure cloud computations

Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
Cloudy: heterogeneous middleware for in time queries processing

Proceedings of the 17th International Database Engineering & Applications Symposium
The family of mapreduce and large-scale data processing systems

ACM Computing Surveys (CSUR)
Scale-up vs scale-out for Hadoop: time to rethink?

Proceedings of the 4th annual Symposium on Cloud Computing
A parallel spatial data analysis infrastructure for the cloud

Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Using a real-time top-k algorithm to mine the most frequent items over multiple streams

ICIC'13 Proceedings of the 9th international conference on Intelligent Computing Theories
Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads

Proceedings of the VLDB Endowment
Hadoop GIS: a high performance spatial data warehousing system over mapreduce

Proceedings of the VLDB Endowment
A survey of multiple classifier systems as hybrid systems

Information Fusion
Instant loading for main memory databases

Proceedings of the VLDB Endowment
Modeling and optimizing large-scale data flows

Future Generation Computer Systems

Quantified Score

Hi-index	0.02

Abstract

There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although loading data into the parallel DBMSs and tuning their execution took much longer than for the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.