Executing multiple group by query using MapReduce approach: implementation and optimization

Authors:
Jie Pan;Frédéric Magoulès;Yann Le Biannic
Affiliations:
Ecole Centrale Paris, Grande Voie des Vignes, Châtenay-Malabry Cedex, France;Ecole Centrale Paris, Grande Voie des Vignes, Châtenay-Malabry Cedex, France;SAP BusinessObjects, Levallois-Perret Cedex, France
Venue:
GPC'10 Proceedings of the 5th international conference on Advances in Grid and Pervasive Computing
Year:
2010

Citing 9
Cited 0

Parallel database systems: the future of high performance database systems. Communications of the ACM.
Horizontal data partitioning in database design. SIGMOD '82 Proceedings of the 1982 ACM SIGMOD international conference on Management of data.
Efficient computation of multiple group by queries. Proceedings of the 2005 ACM SIGMOD international conference on Management of data.
Map-reduce-merge: simplified relational data processing on large clusters. Proceedings of the 2007 ACM SIGMOD international conference on Management of data.
Google's MapReduce programming model - Revisited. Science of Computer Programming.
MapReduce: simplified data processing on large clusters. Communications of the ACM - 50th anniversary issue: 1958 - 2008.
Pig latin: a not-so-foreign language for data processing. Proceedings of the 2008 ACM SIGMOD international conference on Management of data.
MRPGA: An Extension of MapReduce for Parallelizing Genetic Algorithms. ESCIENCE '08 Proceedings of the 2008 Fourth IEEE International Conference on eScience.
Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment.

Abstract

MapReduce is a new parallel programming model initially developed for large-scale web content processing. Data analysis faces the problem of how to perform calculations over extremely large datasets. The arrival of MapReduce provides an opportunity to use commodity hardware for massively parallel data analysis applications. The translation and optimization of relational algebra operators into MapReduce programs is still an open and active research field. In this paper, we focus on a special type of data analysis query, namely the multiple group by query. We first study the communication cost of the MapReduce model, then give an initial implementation of the multiple group by query. We then propose an optimized version that addresses the communication cost issues. Our optimized version shows better speed-up and better scalability than the initial version.
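
To make the idea of a multiple group by query concrete, the following is a minimal single-process sketch in Python. It is not the authors' implementation, which the abstract does not describe. It assumes a query that sums one measure separately for each of several grouping dimensions, and it uses local pre-aggregation (a generic combiner step) purely to illustrate how the number of intermediate pairs, and hence communication cost, can be reduced. The record fields (`country`, `product`, `sales`) and all function names are hypothetical.

```python
from collections import defaultdict


def map_record(record, dims, measure):
    """Emit one ((dimension, value), measure) pair per group-by dimension."""
    return [((d, record[d]), record[measure]) for d in dims]


def combine(pairs):
    """Pre-aggregate pairs locally so fewer pairs leave the mapper."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_pairs(pairs):
    """Sum all values that share the same (dimension, value) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def multiple_group_by(partitions, dims, measure, use_combiner):
    """Run map (and optionally combine) per partition, then reduce.

    Returns the aggregates and the number of intermediate pairs that
    would be sent from mappers to reducers.
    """
    shuffled = []
    for part in partitions:
        emitted = [p for rec in part for p in map_record(rec, dims, measure)]
        shuffled.extend(combine(emitted) if use_combiner else emitted)
    return reduce_pairs(shuffled), len(shuffled)


# Hypothetical data, split into two partitions (one per simulated mapper).
partitions = [
    [{"country": "FR", "product": "A", "sales": 10},
     {"country": "FR", "product": "B", "sales": 5}],
    [{"country": "DE", "product": "A", "sales": 7},
     {"country": "DE", "product": "A", "sales": 3}],
]
dims = ["country", "product"]

naive_result, naive_pairs = multiple_group_by(partitions, dims, "sales", False)
combined_result, combined_pairs = multiple_group_by(partitions, dims, "sales", True)

print(naive_result == combined_result)  # both variants compute the same aggregates
print(naive_pairs, combined_pairs)      # intermediate pair counts without/with combining
```

In this sketch both variants return the same aggregates, while the combining variant sends fewer intermediate pairs. Whether this matches the optimization proposed in the paper cannot be determined from the abstract alone.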