Understanding system design for big data workloads

  • Authors:
  • H. P. Hofstee, G. C. Chen, F. H. Gebara, K. Hall, J. Herring, D. Jamsek, J. Li, Y. Li, J. W. Shi, P. W. Y. Wong

  • Affiliations:
  • IBM Research Division, Austin Research Laboratory, Austin, TX
  • IBM Research Division, China Research Laboratory, ShangDi, Haidian District, Beijing, China
  • IBM Global Business Services, Charlotte, NC
  • IBM System and Technology Group, Poughkeepsie Development Laboratory, Poughkeepsie, NY
  • IBM Research Division, Linux Technology Center, Austin, TX

  • Venue:
  • IBM Journal of Research and Development
  • Year:
  • 2013

Abstract

This paper explores the design and optimization implications for systems targeted at Big Data workloads. We confirm that these workloads differ in fundamental ways from workloads typically run on more traditional transactional and data-warehousing systems, and therefore a system optimized for Big Data can be expected to differ from these other systems. Rather than studying only the performance of representative computational kernels and focusing on central-processing-unit performance, this paper studies the system as a whole. We identify three major phases in a typical Big Data workload, and we propose that each of these phases should be represented in a Big Data systems benchmark. We implemented our ideas on two distinct IBM POWER7® processor-based systems that target different market sectors, and we analyze their performance on a sort benchmark. In particular, this paper includes an evaluation of POWER7 processor-based systems using MapReduce TeraSort, a workload that can serve as a "stress test" for multiple dimensions of system performance. We combine this work with a broader perspective on Big Data workloads and suggest a direction for a future benchmark definition effort. A number of methods to further improve system performance are proposed.
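For readers unfamiliar with why TeraSort stresses a whole system rather than just the CPU, the following toy sketch illustrates its core idea: map tasks assign records to ordered key ranges, a shuffle routes each range to one reduce task, and each reduce task sorts its range locally, so concatenating the ranges yields a globally sorted output. This is a simplified single-process illustration, not the paper's implementation; the function name `terasort` and the partitioning scheme here are assumptions for demonstration only.

```python
# Toy, single-process illustration of the TeraSort pattern (NOT the
# paper's Hadoop implementation): range-partition, shuffle, local sort.
import random

def terasort(records, num_partitions=4):
    lo, hi = min(records), max(records)
    # Width of each key range; fall back to 1 if all keys are equal.
    width = (hi - lo) / num_partitions or 1
    # "Map" + "shuffle": route each record to its range partition.
    # In a real cluster this step moves data across the network,
    # which is one reason TeraSort stresses I/O, not just the CPU.
    partitions = [[] for _ in range(num_partitions)]
    for r in records:
        idx = min(int((r - lo) / width), num_partitions - 1)
        partitions[idx].append(r)
    # "Reduce": sort each partition locally. Because the ranges are
    # disjoint and ordered, concatenation is globally sorted.
    return [r for part in partitions for r in sorted(part)]

random.seed(0)
data = [random.randint(0, 10**6) for _ in range(1000)]
out = terasort(data)
assert out == sorted(data)
```

Because every record must be read, moved to the correct partition, and written back out, the benchmark exercises disk, memory, and network bandwidth in proportion to data volume, which is why the paper treats it as a system-wide stress test.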