Parallelizing multiple group-by query in share-nothing environment: a MapReduce study case

Authors:
Jie Pan;Yann Le Biannic;Frédéric Magoulès
Affiliations:
Ecole Centrale Paris, Cedex, France;SAP Business Objects, Levallois-Perret, Cedex, France;Ecole Centrale Paris, Châtenay-Malabry, Cedex, France
Venue:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Year:
2010



Abstract

MapReduce offers excellent scalability and fault tolerance. It fits well with today's dominant distributed architectures, such as clusters and Grids, which are usually shared-nothing computing environments. However, using MapReduce for data analysis applications still poses challenges, because MapReduce is a low-level procedural programming paradigm and does not directly support relational algebraic operators. In this work, we addressed a typical data analytic query, the multiple group-by query. We parallelized the calculations involved in this type of query with MapReduce, and we introduced indexing and data partitioning into our approach. We measured the speedup of implementations over both horizontally partitioned data and vertically partitioned data. We analysed the factors affecting performance through both measurement and formal estimation.
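
To make the query type concrete, the following is a minimal single-process sketch of a multiple group-by (one aggregate computed along several grouping columns at once) expressed as a map step and a reduce step over horizontally partitioned rows. It is an illustration only, not the authors' implementation: the toy rows, the column names (`region`, `product`, `sales`), the SUM aggregate, and the helper names `map_partition`, `reduce_pairs`, and `multiple_group_by` are all assumptions, and the indexing and vertical partitioning mentioned in the abstract are not modelled.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy records; the column names are illustrative assumptions.
ROWS = [
    {"region": "EU", "product": "A", "sales": 10},
    {"region": "EU", "product": "B", "sales": 5},
    {"region": "US", "product": "A", "sales": 7},
    {"region": "US", "product": "B", "sales": 3},
    {"region": "EU", "product": "A", "sales": 2},
]

GROUP_BY_COLUMNS = ["region", "product"]  # one group-by per column
MEASURE = "sales"                         # aggregated with SUM (assumed)


def map_partition(rows):
    """Map step: pre-aggregate one horizontal partition.

    Emits ((column, value), partial_sum) pairs, one per distinct
    group seen in this partition, for every group-by column.
    """
    partial = defaultdict(int)
    for row in rows:
        for col in GROUP_BY_COLUMNS:
            partial[(col, row[col])] += row[MEASURE]
    return list(partial.items())


def reduce_pairs(pairs):
    """Reduce step: merge the partial sums that share the same key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def multiple_group_by(rows, num_partitions=2):
    """Split rows round-robin into horizontal partitions, map each, then reduce."""
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    mapped = chain.from_iterable(map_partition(p) for p in partitions)
    return reduce_pairs(mapped)


result = multiple_group_by(ROWS)
```

In a real shared-nothing deployment each partition would live on a different worker and the map calls would run in parallel; here they run sequentially so that the data flow is easy to follow.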