Sailfish: a framework for large scale data processing

  • Authors:
  • Sriram Rao;Raghu Ramakrishnan;Adam Silberstein;Mike Ovsiannikov;Damian Reeves

  • Affiliations:
  • Microsoft Corp.;Microsoft Corp.;LinkedIn;Quantcast Corp;Quantcast Corp

  • Venue:
  • Proceedings of the Third ACM Symposium on Cloud Computing
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we present Sailfish, a new Map-Reduce framework for large scale data processing. The Sailfish design is centered around aggregating intermediate data, specifically data produced by map tasks and consumed later by reduce tasks, to improve performance by batching disk I/O. We introduce an abstraction called I-files for supporting data aggregation, and describe how we implemented it as an extension of the distributed filesystem, to efficiently batch data written by multiple writers and read by multiple readers. Sailfish adapts the Map-Reduce layer in Hadoop to use I-files for transporting data from map tasks to reduce tasks. We present experimental results demonstrating that Sailfish improves performance of standard Hadoop; in particular, we show 20% to 5 times faster performance on a representative mix of real jobs and datasets at Yahoo!. We also demonstrate that the Sailfish design enables auto-tuning functionality that handles changes in data volume and skewed distributions effectively, thereby addressing an important practical drawback of Hadoop, which in contrast relies on programmers to configure system parameters appropriately for each job, for each input dataset. Our Sailfish implementation and the other software components developed as part of this paper has been released as open source.