The Hadoop Distributed File System

Authors:
Konstantin Shvachko; Hairong Kuang; Sanjay Radia; Robert Chansler
Affiliations:
Yahoo!, Sunnyvale, California, USA (all authors)
Venue:
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Year:
2010

Cited by: 102

BlobSeer: Next-generation data management for large scale infrastructures

Journal of Parallel and Distributed Computing
Scalable knowledge harvesting with high precision and high recall

Proceedings of the fourth ACM international conference on Web search and data mining
Blink: managing server clusters on intermittent power

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Scale and concurrency of GIGA+: file system directories with millions of files

FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies
FATE and DESTINI: a framework for cloud recovery testing

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Cumulus: an open source storage cloud for science

Proceedings of the 2nd international workshop on Scientific cloud computing
Adapting MapReduce for HPC environments

Proceedings of the 20th international symposium on High performance distributed computing
Towards continuous policy-driven demand response in data centers

Proceedings of the 2nd ACM SIGCOMM workshop on Green networking
CassMail: a scalable, highly-available, and rapidly-prototyped e-mail service

Proceedings of the 11th IFIP WG 6.1 international conference on Distributed applications and interoperable systems
Fast crash recovery in RAMCloud

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
PREFAIL: a programmable tool for multiple-failure injection

Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Concurrent non-deferred reference counting on the Microgrid: first experiences

IFL'10 Proceedings of the 22nd international conference on Implementation and application of functional languages
Qserv: a distributed shared-nothing database for the LSST catalog

State of the Practice Reports
Hadoop acceleration through network levitated merge

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
On the duality of data-intensive file system design: reconciling HDFS and PVFS

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
MARIANE: MApReduce Implementation Adapted for HPC Environments

GRID '11 Proceedings of the 2011 IEEE/ACM 12th International Conference on Grid Computing
A hardware and software computational platform for the HiPerDNO (high performance distribution network operation) project

Proceedings of the first international workshop on High performance computing, networking and analytics for the power grid
Experimenting lucene index on HBase in an HPC environment

Proceedings of the first annual workshop on High performance computing meets databases
Riding the elephant: managing ensembles with hadoop

Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
GHOST: GPGPU-offloaded high performance storage I/O deduplication for primary storage system

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Horus: fine-grained encryption-based security for high performance petascale storage

Proceedings of the sixth workshop on Parallel Data Storage
Apriori-based frequent itemset mining algorithms on MapReduce

Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
Consistency without ordering

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Serving large-scale batch computed data with project Voldemort

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Walnut: a unified cloud object store

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Don't lose sleep over availability: the GreenUp decentralized wakeup service

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
An intelligent cloud system adopting file pre-fetching

ADCONS'11 Proceedings of the 2011 international conference on Advanced Computing, Networking and Security
Improving the diagnosis of mild hypertrophic cardiomyopathy with MapReduce

Proceedings of third international workshop on MapReduce and its Applications
CEFLS: A Cost-Effective File Lookup Service in a Distributed Metadata File System

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
MARLA: MapReduce for Heterogeneous Clusters

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Hierarchical MapReduce Programming Model and Scheduling Algorithms

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
A Workflow-Aware Storage System: An Opportunity Study

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Investigation of Data Locality in MapReduce

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
MapReduce Workload Modeling with Statistical Approach

Journal of Grid Computing
Integrated in-system storage architecture for high performance computing

Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
Schönhage-Strassen algorithm with MapReduce for multiplying terabit integers

Proceedings of the 2011 International Workshop on Symbolic-Numeric Computation
The seven deadly sins of cloud computing research

HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing
Gnothi: separating data and metadata for efficient and available storage replication

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Scalability of replicated metadata services in distributed file systems

DAIS'12 Proceedings of the 12th IFIP WG 6.1 international conference on Distributed Applications and Interoperable Systems
HFAA: a generic socket API for Hadoop file systems

Proceedings of the 2nd Workshop on Architectures and Systems for Big Data
An optimized approach for storing and accessing small files on cloud storage

Journal of Network and Computer Applications
Flat datacenter storage

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
CloST: a hadoop-based storage system for big spatio-temporal data analytics

Proceedings of the 21st ACM international conference on Information and knowledge management
Scalable Reed-Solomon-based reliable local storage for HPC applications on iaas clouds

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Data-Intensive Workload Consolidation for the Hadoop Distributed File System

GRID '12 Proceedings of the 2012 ACM/IEEE 13th International Conference on Grid Computing
Expressive Query Support for Multidimensional Data in Distributed Hash Tables

UCC '12 Proceedings of the 2012 IEEE/ACM Fifth International Conference on Utility and Cloud Computing
Towards big linked data: a large-scale, distributed semantic data storage

Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
A RAMCloud Storage System based on HDFS: Architecture, implementation and evaluation

Journal of Systems and Software
Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems

ACM Transactions on Storage (TOS)
Exploiting geospatial and chronological characteristics in data streams to enable efficient storage and retrievals

Future Generation Computer Systems
X10-FT: transparent fault tolerance for APGAS language and runtime

Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
Elastic and effective spatio-temporal query processing scheme on Hadoop

Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data
Indexing and searching 100M images with map-reduce

Proceedings of the 3rd ACM conference on International conference on multimedia retrieval
IBIS: interposed big-data I/O scheduler

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
SCDA: SLA-aware cloud datacenter architecture for efficient content storage and retrieval

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
The big data ecosystem at LinkedIn

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
High performance risk aggregation: addressing the data processing challenge the hadoop mapreduce way

Proceedings of the 4th ACM workshop on Scientific cloud computing
A throughput optimal algorithm for map task scheduling in mapreduce with data locality

ACM SIGMETRICS Performance Evaluation Review
A classification of file placement and replication methods on grids

Future Generation Computer Systems
Input data organization for batch processing in time window based computations

Proceedings of the 28th Annual ACM Symposium on Applied Computing
Power-reduction techniques for data-center storage systems

ACM Computing Surveys (CSUR)
Robustness in the Salus scalable block store

NSDI'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
QuickSAN: a storage area network for fast, distributed, solid state disks

Proceedings of the 40th Annual International Symposium on Computer Architecture
Supporting robust system analysis with the test matrix tool framework

Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation
Obtaining ground-truth software architectures

Proceedings of the 2013 International Conference on Software Engineering
Toward common patterns for distributed, concurrent, fault-tolerant code

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
FSaaS: Configuring Policies for Managing Shared Files Among Cooperating, Distributed Applications

International Journal of Web Portals
ACIC: automatic cloud I/O configurator for HPC applications

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
CooMR: cross-task coordination for efficient data management in MapReduce programs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Prolog programming with a map-reduce parallel construct

Proceedings of the 15th Symposium on Principles and Practice of Declarative Programming
Boosting energy efficiency with mirrored data block replication policy and energy scheduler

ACM SIGOPS Operating Systems Review
PredictionIO: a distributed machine learning server for practical software development

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Simplifying MapReduce data processing

International Journal of Computational Science and Engineering
Leveraging sharding in the design of scalable replication protocols

Proceedings of the 4th annual Symposium on Cloud Computing
Apache Hadoop YARN: yet another resource negotiator

Proceedings of the 4th annual Symposium on Cloud Computing
USTO.RE: a private cloud storage software system

ICWE'13 Proceedings of the 13th international conference on Web Engineering
A protocol for simultaneous use of confidentiality and integrity in large-scale storage systems

Proceedings of the 6th International Conference on Security of Information and Networks
PonIC: using stratosphere to speed up pig analytics

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Generating request streams on Big Data using clustered renewal processes

Performance Evaluation
CRUCIBLE: towards unified secure on- and off-line analytics at scale

DISCS-2013 Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems
Copysets: reducing the frequency of data loss in cloud storage

USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Securing data services: a security architecture design for private storage cloud based on HDFS

International Journal of Grid and Utility Computing
Optimization strategies for A/B testing on HADOOP

Proceedings of the VLDB Endowment
The quantcast file system

Proceedings of the VLDB Endowment
Structuring PLFS for extensibility

PDSW '13 Proceedings of the 8th Parallel Data Storage Workshop
MapReduce "garbage" collection

CASCON '13 Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research
DIMO: distributed index for matching multimedia objects using MapReduce

Proceedings of the 5th ACM Multimedia Systems Conference
A Study of Linux File System Evolution

ACM Transactions on Storage (TOS)
A three-phase energy-saving strategy for cloud storage systems

Journal of Systems and Software
Google hostload prediction based on Bayesian model with optimized feature combination

Journal of Parallel and Distributed Computing
MORM: A Multi-objective Optimized Replication Management strategy for cloud storage cluster

Journal of Systems Architecture: the EUROMICRO Journal
Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments

The Journal of Supercomputing
A multi-dimensional index structure based on improved VA-file and CAN in the cloud

International Journal of Automation and Computing
X10-FT: Transparent fault tolerance for APGAS language and runtime

Parallel Computing
ORTHRUS: a lightweighted block-level cloud storage system

Cluster Computing
Scalable Metadata Management Through OSD+ Devices

International Journal of Parallel Programming
A study of Linux file system evolution

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
HARDFS: hardening HDFS with selective and lightweight versioning

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Horus: fine-grained encryption-based security for large-scale storage

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Analysis of HDFS under HBase: a facebook messages case study

FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies
GPFS-SNC: an enterprise cluster file system for big data

IBM Journal of Research and Development
Exalt: empowering researchers to evaluate large-scale storage systems

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Abstract

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.