Executing multiple group by query using MapReduce approach: implementation and optimization

  • Authors:
  • Jie Pan;Frédéric Magoulès;Yann Le Biannic

  • Affiliations:
  • Ecole Centrale Paris, Grande Voie des Vignes, Châtenay-Malabry Cedex, France;Ecole Centrale Paris, Grande Voie des Vignes, Châtenay-Malabry Cedex, France;SAP BusinessObjects, Levallois-Perret Cedex, France

  • Venue:
  • GPC'10 Proceedings of the 5th international conference on Advances in Grid and Pervasive Computing
  • Year:
  • 2010

Abstract

The MapReduce model is a new parallel programming model initially developed for large-scale web content processing. Data analysis faces the problem of how to perform calculations over extremely large datasets. The arrival of MapReduce provides a chance to use commodity hardware for massively parallel data analysis applications. The translation and optimization from relational algebra operators to MapReduce programs is still an open and dynamic research field. In this paper, we focus on a special type of data analysis query, namely the multiple group by query. We first study the communication cost of the MapReduce model, then give an initial implementation of the multiple group by query. We then propose an optimized version that addresses and reduces the communication cost. The optimized version shows better speed-up and better scalability than the initial version.
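
To make the notion of a multiple group by query concrete, here is a minimal in-process Python sketch. It is not the authors' implementation: the abstract does not describe their code, so the function names (`map_record`, `combine`, `reduce_all`, `multiple_group_by`), the sample records, and the use of a local combiner as the way to cut shuffled data are all assumptions made for illustration. The query computes a SUM of one measure grouped separately by each of several dimensions, and the sketch counts how many key/value pairs would cross the network between the map and reduce phases.

```python
from collections import defaultdict


def map_record(record, dimensions, measure):
    """Emit one ((dimension, value), measure) pair per group-by dimension."""
    for dim in dimensions:
        yield (dim, record[dim]), record[measure]


def combine(pairs):
    """Pre-aggregate one mapper's output locally, before the shuffle."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_all(shuffled_pairs):
    """Sum the values received for each (dimension, value) key."""
    totals = defaultdict(int)
    for key, value in shuffled_pairs:
        totals[key] += value
    return dict(totals)


def multiple_group_by(partitions, dimensions, measure, use_combiner=True):
    """Run the query over data partitions (one hypothetical mapper each).

    Returns the aggregates and the number of pairs sent to the reducers,
    used here as a rough stand-in for communication cost.
    """
    shuffled = []
    for partition in partitions:
        emitted = [
            pair
            for record in partition
            for pair in map_record(record, dimensions, measure)
        ]
        shuffled.extend(combine(emitted) if use_combiner else emitted)
    return reduce_all(shuffled), len(shuffled)


# Hypothetical sample data: two partitions of sales records.
partitions = [
    [
        {"color": "red", "size": "S", "sales": 10},
        {"color": "red", "size": "M", "sales": 5},
    ],
    [
        {"color": "blue", "size": "S", "sales": 7},
        {"color": "red", "size": "S", "sales": 3},
    ],
]

with_combiner, sent_with = multiple_group_by(partitions, ["color", "size"], "sales")
without_combiner, sent_without = multiple_group_by(
    partitions, ["color", "size"], "sales", use_combiner=False
)
print(with_combiner)
print(sent_with, sent_without)
```

Both runs produce the same aggregates, while the run with local pre-aggregation sends fewer pairs to the reducers. Whether the paper's optimized version works this way is not stated in the abstract; the sketch only shows why the volume of intermediate data is a natural target for optimization.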