Map-reduce-merge: simplified relational data processing on large clusters

Authors:
Hung-chih Yang;Ali Dasdan;Ruey-Lung Hsiao;D. Stott Parker
Affiliations:
Yahoo!, Sunnyvale, CA;Yahoo!, Sunnyvale, CA;UCLA, Los Angeles, CA;UCLA, Los Angeles, CA
Venue:
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Year:
2007


Abstract

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning. However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and inefficiency when Map-Reduce is applied to relational operations like joins. We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
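
To make the idea of a Merge phase concrete, here is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not the paper's API: the function names (`map_reduce`, `merge_join`), the sample employee and department records, and the restriction to unique keys per reduced output are all hypothetical. Each dataset goes through its own map and reduce pass, which leaves the output sorted by key; a merge step then joins the two sorted outputs in a single linear scan, in the manner of a sort-merge join.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory map-reduce pass; the output list is sorted by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sorting by key stands in for the sort/partition step of a real cluster.
    return [(key, reducer(key, values)) for key, values in sorted(groups.items())]


def merge_join(left, right, merger):
    """Join two key-sorted lists (unique keys assumed) in one linear scan."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            out.append(merger(left_key, left_value, right_value))
            i += 1
            j += 1
    return out


# Hypothetical sample data: (name, dept_id, pay) and (dept_id, bonus_rate).
employees = [("alice", "d1", 100), ("bob", "d2", 50), ("carol", "d1", 70)]
departments = [("d1", 0.10), ("d2", 0.20), ("d3", 0.05)]

# Two independent map-reduce passes, one per heterogeneous dataset.
pay_by_dept = map_reduce(
    employees,
    mapper=lambda e: [(e[1], e[2])],
    reducer=lambda key, values: sum(values),
)
rate_by_dept = map_reduce(
    departments,
    mapper=lambda d: [(d[0], d[1])],
    reducer=lambda key, values: values[0],
)

# Merge step: combine the two reduced, key-sorted outputs.
joined = merge_join(
    pay_by_dept,
    rate_by_dept,
    merger=lambda key, total_pay, rate: (key, total_pay, rate),
)
print(joined)
```

Because both reduced outputs are already sorted on the join key, the merge step needs no further shuffling or re-sorting, which is the efficiency argument the abstract makes. Department `d3` has no matching employees and is dropped, so this sketch behaves as an inner join.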