The Hadoop Distributed File System (HDFS) is widely adopted to support Internet services. Unfortunately, native HDFS performs poorly on large numbers of small files, a problem that has attracted significant attention. This paper first analyzes the causes of the small-file problem in HDFS: (1) large numbers of small files impose a heavy burden on the NameNode; (2) correlations between small files are not considered in data placement; and (3) no optimization mechanism, such as prefetching, is provided to improve I/O performance. Second, in the context of HDFS, the cut-off point between large and small files is determined experimentally, which helps answer the question of 'how small is small'. Third, according to their correlation features, files are classified into three types: structurally related files, logically related files, and independent files. Finally, based on these three steps, an optimized approach is designed to improve the storage and access efficiency of small files on HDFS: a file merging and prefetching scheme is applied to structurally related small files, while a file grouping and prefetching scheme is used to manage logically related small files. Experimental results demonstrate that the proposed schemes effectively improve the storage and access efficiency of small files compared with native HDFS and a Hadoop file archiving facility.
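The core idea behind the merging scheme can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: it assumes that small files are concatenated into one large block (so the NameNode tracks one object instead of many), that a per-block index maps each file name to its `(offset, length)`, and that reading one file can prefetch its structurally related neighbors in merge order. The class and method names (`MergedBlock`, `prefetch`, the `window` parameter) are hypothetical.

```python
class MergedBlock:
    """Sketch of file merging: many small files packed into one blob,
    located via an in-memory index of (offset, length) entries."""

    def __init__(self):
        self.blob = bytearray()   # the merged large file
        self.index = {}           # file name -> (offset, length)

    def add(self, name, data):
        # Append the small file's bytes and record where they landed.
        self.index[name] = (len(self.blob), len(data))
        self.blob.extend(data)

    def read(self, name):
        # One index lookup plus one contiguous read serves the request.
        off, length = self.index[name]
        return bytes(self.blob[off:off + length])

    def prefetch(self, name, window=2):
        # Model correlation-aware prefetching: fetch the next `window`
        # files after `name` in merge order, on the assumption that
        # structurally related files were merged adjacently.
        names = list(self.index)
        i = names.index(name)
        return {n: self.read(n) for n in names[i + 1:i + 1 + window]}


block = MergedBlock()
block.add("page1.png", b"AAAA")
block.add("page2.png", b"BB")
block.add("page3.png", b"C")

print(block.read("page2.png"))            # b'BB'
print(list(block.prefetch("page1.png")))  # ['page2.png', 'page3.png']
```

The point of the sketch is the metadata reduction: the index entries are small and live alongside one large block, so NameNode memory grows with the number of merged blocks rather than the number of original small files, and sequential placement makes neighbor prefetching a cheap contiguous read.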