Distributed data management using MapReduce

Authors:
Feng Li;Beng Chin Ooi;M. Tamer Özsu;Sai Wu
Affiliations:
National University of Singapore, Singapore;National University of Singapore, Singapore;University of Waterloo, Canada;Zhejiang University, China
Venue:
ACM Computing Surveys (CSUR)
Year:
2014

Citing 78
Cited 0

A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment

SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data
A bridging model for parallel computation

Communications of the ACM
Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Using Semi-Joins to Solve Relational Queries

Journal of the ACM (JACM)
Rethinking Database System Architecture: Towards a Self-Tuning RISC-Style Database System

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
C-store: a column-oriented DBMS

VLDB '05 Proceedings of the 31st international conference on Very large data bases
A Primitive Operator for Similarity Joins in Data Cleaning

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Integrating compression and execution in column-oriented database systems

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
How to wring a table dry: entropy compression of relations and querying of compressed relations

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Interpreting the data: Parallel analysis with Sawzall

Scientific Programming - Dynamic Grids and Worldwide Computing
Scaling up all pairs similarity search

Proceedings of the 16th international conference on World Wide Web
Map-reduce-merge: simplified relational data processing on large clusters

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
The end of an architectural era: (it's time for a complete rewrite)

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Efficient similarity joins for near duplicate detection

Proceedings of the 17th international conference on World Wide Web
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Clustera: an integrated computation and data management system

Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce and parallel DBMSs: friends or foes?

Communications of the ACM - Amir Pnueli: Ahead of His Time
SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions

Proceedings of the VLDB Endowment
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
Optimizing joins in a map-reduce environment

Proceedings of the 13th International Conference on Extending Database Technology
DEDUCE: at the intersection of MapReduce and stream processing

Proceedings of the 13th International Conference on Extending Database Technology
FlumeJava: easy, efficient data-parallel pipelines

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Pregel: a system for large-scale graph processing

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
ParaTimer: a progress indicator for MapReduce DAGs

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Indexing multi-dimensional data in a cloud system

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Continuous sampling for online aggregation over multiple queries

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A comparison of join algorithms for log processing in MaPreduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
HadoopDB in action: building real world applications

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Twister: a runtime for iterative MapReduce

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Improving MapReduce performance in heterogeneous environments

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Spark: cluster computing with working sets

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment
The performance of MapReduce: an in-depth study

Proceedings of the VLDB Endowment
MRShare: sharing across multiple queries in MapReduce

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
Efficient B-tree based indexing for cloud data processing

Proceedings of the VLDB Endowment
Cheetah: a high performance, custom data warehouse on top of MapReduce

Proceedings of the VLDB Endowment
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
S4: Distributed Stream Computing Platform

ICDMW '10 Proceedings of the 2010 IEEE International Conference on Data Mining Workshops
Dremel: interactive analysis of web-scale datasets

Communications of the ACM
Principles of Distributed Database Systems

Principles of Distributed Database Systems
Automatic optimization for MapReduce programs

Proceedings of the VLDB Endowment
Column-oriented storage techniques for MapReduce

Proceedings of the VLDB Endowment
Parallel evaluation of conjunctive queries

Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Processing theta-joins using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Llama: leveraging columnar storage for scalable join processing in the MapReduce framework

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Providing scalable database services on the cloud

WISE'10 Proceedings of the 11th international conference on Web information systems engineering
SystemML: Declarative machine learning on MapReduce

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
ES2: A cloud data storage system for supporting both OLTP and OLAP

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters

IEEE Transactions on Knowledge and Data Engineering
CoScan: cooperative scan sharing in the cloud

Proceedings of the 2nd ACM Symposium on Cloud Computing
Query optimization for massively parallel data processing

Proceedings of the 2nd ACM Symposium on Cloud Computing
Trojan data layouts: right shoes for a running elephant

Proceedings of the 2nd ACM Symposium on Cloud Computing
Fast crash recovery in RAMCloud

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Building wavelet histograms on large data in MapReduce

Proceedings of the VLDB Endowment
V-SMART-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors

Proceedings of the VLDB Endowment
Distributed GraphLab: a framework for machine learning and data mining in the cloud

Proceedings of the VLDB Endowment
SkewTune: mitigating skew in mapreduce applications

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Efficient parallel kNN joins for large data in MapReduce

Proceedings of the 15th International Conference on Extending Database Technology
Fuzzy Joins Using MapReduce

ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Parallel Top-K Similarity Join Algorithms Using MapReduce

ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Load Balancing in MapReduce Based on Scalable Cardinality Estimates

ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
LogBase: a scalable log-structured database system in the cloud

Proceedings of the VLDB Endowment
Efficient processing of k nearest neighbor joins using MapReduce

Proceedings of the VLDB Endowment
Efficient multi-way theta-join processing using MapReduce

Proceedings of the VLDB Endowment
Sailfish: a framework for large scale data processing

Proceedings of the Third ACM Symposium on Cloud Computing
Themis: an I/O-efficient MapReduce

Proceedings of the Third ACM Symposium on Cloud Computing
Balancing reducer skew in MapReduce workloads using progressive sampling

Proceedings of the Third ACM Symposium on Cloud Computing
An efficient and compact indexing scheme for large-scale data store

ICDE '13 Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013)

Quantified Score

Hi-index	0.00

Visualization

Abstract

MapReduce is a framework for processing and managing large-scale datasets in a distributed cluster, which has been used for applications such as generating search indexes, document clustering, access log analysis, and various other forms of data analytics. MapReduce adopts a flexible computation model with a simple interface consisting of map and reduce functions whose implementations can be customized by application developers. Since its introduction, a substantial amount of research effort has been directed toward making it more usable and efficient for supporting database-centric operations. In this article, we aim to provide a comprehensive review of a wide range of proposals and systems that focusing fundamentally on the support of distributed data management and processing using the MapReduce framework.