Understanding system design for big data workloads

  • Authors:
  • H. P. Hofstee, G. C. Chen, F. H. Gebara, K. Hall, J. Herring, D. Jamsek, J. Li, Y. Li, J. W. Shi, P. W. Y. Wong

  • Affiliations:
  • IBM Research Division, Austin Research Laboratory, Austin, TX
  • IBM Research Division, China Research Laboratory, ShangDi, Haidian District, Beijing, China
  • IBM Global Business Services, Charlotte, NC
  • IBM System and Technology Group, Poughkeepsie Development Laboratory, Poughkeepsie, NY
  • IBM Research Division, Linux Technology Center, Austin, TX

  • Venue:
  • IBM Journal of Research and Development
  • Year:
  • 2013

Abstract

This paper explores the design and optimization implications for systems targeted at Big Data workloads. We confirm that these workloads differ in fundamental ways from workloads typically run on more traditional transactional and data-warehousing systems, and therefore a system optimized for Big Data can be expected to differ from these other systems. Rather than studying only the performance of representative computational kernels and focusing on central-processing-unit performance, this paper studies the system as a whole. We identify three major phases in a typical Big Data workload, and we propose that each of these phases should be represented in a Big Data systems benchmark. We implemented our ideas on two distinct IBM POWER7® processor-based systems that target different market sectors, and we analyze their performance on a sort benchmark. In particular, this paper includes an evaluation of POWER7 processor-based systems using MapReduce TeraSort, a workload that can serve as a "stress test" for multiple dimensions of system performance. We combine this work with a broader perspective on Big Data workloads and suggest a direction for a future benchmark definition effort. A number of methods to further improve system performance are proposed.
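For readers unfamiliar with why TeraSort stresses a whole system rather than just the CPU, the following toy sketch illustrates its core idea: map tasks assign records to ordered key ranges, a shuffle routes each range to one reduce task, and each reduce task sorts its range locally, so concatenating the ranges yields a globally sorted output. This is a simplified single-process illustration, not the paper's implementation; the function name `terasort` and the partitioning scheme here are assumptions for demonstration only.

```python
# Toy, single-process illustration of the TeraSort pattern (NOT the
# paper's Hadoop implementation): range-partition, shuffle, local sort.
import random

def terasort(records, num_partitions=4):
    lo, hi = min(records), max(records)
    # Width of each key range; fall back to 1 if all keys are equal.
    width = (hi - lo) / num_partitions or 1
    # "Map" + "shuffle": route each record to its range partition.
    # In a real cluster this step moves data across the network,
    # which is one reason TeraSort stresses I/O, not just the CPU.
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        idx = min(int((r - lo) / width), num_partitions - 1)
        partitions[idx].append(r)
    # "Reduce": sort each partition locally. Because the ranges are
    # disjoint and ordered, concatenation is globally sorted.
    return [r for part in partitions for r in sorted(part)]

random.seed(0)
data = [random.randint(0, 10**6) for _ in range(1000)]
out = terasort(data)
assert out == sorted(data)
```

Because every record must be read, moved to the correct partition, and written back out, the benchmark exercises disk, memory, and network bandwidth in proportion to data volume, which is why the paper treats it as a system-wide stress test.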