The Hadoop Distributed File System (HDFS) is widely adopted to support Internet services. Unfortunately, native HDFS performs poorly on large numbers of small files, a problem that has attracted significant attention. This paper first analyzes the causes of the small-file problem in HDFS: (1) large numbers of small files impose a heavy burden on the NameNode; (2) correlations between small files are not considered in data placement; and (3) no optimization mechanism, such as prefetching, is provided to improve I/O performance. Second, in the context of HDFS, the cut-off point between large and small files is determined experimentally, which helps answer the question of 'how small is small'. Third, according to their correlation features, files are classified into three types: structurally related files, logically related files, and independent files. Finally, based on these three steps, an optimized approach is designed to improve the storage and access efficiency of small files on HDFS: a file merging and prefetching scheme is applied to structurally related small files, while a file grouping and prefetching scheme is used to manage logically related small files. Experimental results demonstrate that the proposed schemes effectively improve the storage and access efficiency of small files compared with native HDFS and a Hadoop file archiving facility.
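The core idea behind the merging scheme can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: it assumes that small files are concatenated into one large block (so the NameNode tracks one object instead of many), that a per-block index maps each file name to its `(offset, length)`, and that reading one file can prefetch its structurally related neighbors in merge order. The class and method names (`MergedBlock`, `prefetch`, the `window` parameter) are hypothetical.

```python
class MergedBlock:
    """Sketch of file merging: many small files packed into one blob,
    located via an in-memory index of (offset, length) entries."""

    def __init__(self):
        self.blob = bytearray()   # the merged large file
        self.index = {}           # file name -> (offset, length)

    def add(self, name, data):
        # Append the small file's bytes and record where they landed.
        self.index[name] = (len(self.blob), len(data))
        self.blob.extend(data)

    def read(self, name):
        # One index lookup plus one contiguous read serves the request.
        off, length = self.index[name]
        return bytes(self.blob[off:off + length])

    def prefetch(self, name, window=2):
        # Model correlation-aware prefetching: fetch the next `window`
        # files after `name` in merge order, on the assumption that
        # structurally related files were merged adjacently.
        names = list(self.index)
        i = names.index(name)
        return {n: self.read(n) for n in names[i + 1:i + 1 + window]}


block = MergedBlock()
block.add("page1.png", b"AAAA")
block.add("page2.png", b"BB")
block.add("page3.png", b"C")

print(block.read("page2.png"))            # b'BB'
print(list(block.prefetch("page1.png")))  # ['page2.png', 'page3.png']
```

The point of the sketch is the metadata reduction: the index entries are small and live alongside one large block, so NameNode memory grows with the number of merged blocks rather than the number of original small files, and sequential placement makes neighbor prefetching a cheap contiguous read.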