Parallelizing multiple group-by query in share-nothing environment: a MapReduce study case

  • Authors: Jie Pan; Yann Le Biannic; Frédéric Magoulès

  • Affiliations: Ecole Centrale Paris, Cedex, France; SAP Business Objects, Levallois-Perret, Cedex, France; Ecole Centrale Paris, Châtenay-Malabry, Cedex, France

  • Venue: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
  • Year: 2010

Abstract

MapReduce offers excellent scalability and fault tolerance. It fits well with today's dominant distributed architectures, such as clusters and Grids, which are usually shared-nothing computing environments. However, using MapReduce for data analysis applications still poses challenges, since MapReduce is a low-level procedural programming paradigm that does not directly support relational algebraic operators. In this work, we addressed a typical data analytic query, the multiple group-by query. We parallelized the calculations involved in this type of query with MapReduce, and we introduced indexing and data partitioning. We measured the speedup of implementations over both horizontally partitioned data and vertically partitioned data. We analysed the factors affecting performance using both measurement and formal estimation.
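
To make the idea of a multiple group-by query concrete, the following is a minimal in-memory sketch of how such a query could be expressed as map and reduce steps over horizontally partitioned records. It is not the authors' implementation: the function names (`map_record`, `multiple_group_by`), the column names, and the sample data are all hypothetical, the aggregate is assumed to be a sum, and the shuffle phase is simulated with a dictionary rather than a real MapReduce framework.

```python
from collections import defaultdict


def map_record(record, group_by_columns, measure):
    """Map step: emit one ((column, value), measure) pair per group-by column."""
    for column in group_by_columns:
        yield (column, record[column]), record[measure]


def multiple_group_by(partitions, group_by_columns, measure):
    """Compute SUM(measure) for several group-by columns in one pass.

    `partitions` is a list of horizontal partitions (lists of dict records);
    in a real shared-nothing cluster each partition would live on its own node.
    """
    # Simulated shuffle: collect all mapped values under their composite key.
    shuffled = defaultdict(list)
    for partition in partitions:
        for record in partition:
            for key, value in map_record(record, group_by_columns, measure):
                shuffled[key].append(value)

    # Reduce step: aggregate the values of each (column, group value) key.
    results = defaultdict(dict)
    for (column, group_value), values in shuffled.items():
        results[column][group_value] = sum(values)
    return {column: dict(groups) for column, groups in results.items()}


# Hypothetical sample data split into two horizontal partitions.
partitions = [
    [
        {"region": "EU", "product": "A", "revenue": 10},
        {"region": "US", "product": "A", "revenue": 5},
    ],
    [
        {"region": "EU", "product": "B", "revenue": 7},
    ],
]

result = multiple_group_by(partitions, ["region", "product"], "revenue")
print(result)
```

Tagging each intermediate key with its group-by column lets a single scan of the data answer all group-by dimensions at once, which is the property that makes this kind of query a natural candidate for parallelization.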