GPFS-SNC: an enterprise cluster file system for big data

Authors:
R. Jain; P. Sarkar; D. Subhraveti
Affiliations:
IBM Research Division, Almaden Research Center, San Jose, CA (all authors)
Venue:
IBM Journal of Research and Development
Year:
2013

Abstract

A new class of data-intensive applications, commonly referred to as Big Data applications (e.g., customer sentiment analysis based on click-stream logs), involves processing massive amounts of data with a focus on semantically transforming the data. This class of applications is massively parallel and well suited to the MapReduce programming framework, which allows users to perform large-scale data analyses while the application execution layer handles the system architecture, data partitioning, and task scheduling. In this paper, we introduce GPFS-SNC (General Parallel File System for Shared Nothing Clusters), a scalable file system that operates over a cluster of commodity machines with direct-attached storage and meets the requirements of both the analytics and the traditional applications that are typically used together in analytics solutions. The architecture extends an existing enterprise cluster file system to support these emerging classes of workloads by applying five innovative optimizations: 1) locality awareness, which allows compute jobs to be scheduled on the nodes where the data resides; 2) metablocks, which allow large and small block sizes to coexist in the same file system to meet the needs of different types of applications; 3) write affinity, which allows applications to dictate the layout of files on different nodes in order to maximize both write and read bandwidth; 4) pipelined replication, which maximizes the use of network bandwidth for data replication; and 5) distributed recovery, which minimizes the effect of failures on ongoing computation.
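
As a rough illustration of the locality-awareness idea named in the abstract, the following Python sketch assigns each data block's task to a node that already holds a replica of that block, falling back to any node with a free slot otherwise. It is not GPFS-SNC code and does not reflect its interfaces: the function name `schedule_tasks`, the `block_locations` and `free_slots` dictionaries, and the greedy "most free slots first" policy are all assumptions made for this example.

```python
def schedule_tasks(block_locations, free_slots):
    """Greedy locality-aware placement (illustrative only, not GPFS-SNC's algorithm).

    block_locations: dict mapping block id -> list of nodes holding a replica
    free_slots:      dict mapping node name -> number of free task slots
    Returns a dict mapping block id -> (chosen node, True if the read is local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, replicas in block_locations.items():
        # Prefer nodes that store a replica of this block and still have a slot.
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node, is_local = max(local, key=lambda n: slots[n]), True
        else:
            # No replica holder is free: fall back to a remote node.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError("no free task slots in the cluster")
            node, is_local = max(candidates, key=lambda n: slots[n]), False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b0": ["n1"], "b1": ["n2"], "b2": ["n3"]}
    print(schedule_tasks(blocks, {"n1": 2, "n2": 1}))
```

In this hypothetical run, blocks `b0` and `b1` are processed where their data resides, while `b2` (whose only replica is on a node with no slots) is placed remotely and would have to read its data over the network. A file system can enable this style of scheduling by exposing block-to-node placement to the job scheduler.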