Adventures in stochastic processes
Adventures in stochastic processes
Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs
Proceedings of the 2000 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Erasure Coding Vs. Replication: A Quantitative Comparison
IPTPS '01 Revised Papers from the First International Workshop on Peer-to-Peer Systems
Reliability Mechanisms for Very Large Storage Systems
MSS '03 Proceedings of the 20 th IEEE/11 th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03)
Introspective Failure Analysis: Avoiding Correlated Failures in Peer-to-Peer Systems
SRDS '02 Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Disk Scrubbing in Large Archival Storage Systems
MASCOTS '04 Proceedings of the The IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
On the Impact of Replica Placement to the Reliability of Distributed Brick Storage Systems
ICDCS '05 Proceedings of the 25th IEEE International Conference on Distributed Computing Systems
Awarded Best Paper! -- Row-Diagonal Parity for Double Disk Failure Correction
FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
A fresh look at the reliability of long-term digital storage
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Autopilot: automatic data center management
ACM SIGOPS Operating Systems Review - Systems work at Microsoft Research
An analysis of latent sector errors in disk drives
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Efficient replica maintenance for distributed storage systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Exploiting availability prediction in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Subtleties in tolerating correlated failures in wide-area storage systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
An analysis of data corruption in the storage stack
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
GFS: evolution on fast-forward
Communications of the ACM
High availability in DHTs: erasure coding vs. replication
IPTPS'05 Proceedings of the 4th international conference on Peer-to-Peer Systems
TPCTC'10 Proceedings of the Second TPC technology conference on Performance evaluation, measurement and characterization of complex systems
In search of I/O-optimal recovery from disk failures
HotStorage'11 Proceedings of the 3rd USENIX conference on Hot topics in storage and file systems
Understanding network failures in data centers: measurement, analysis, and implications
Proceedings of the ACM SIGCOMM 2011 conference
Small cache, big effect: provable load balancing for randomly partitioned cluster services
Proceedings of the 2nd ACM Symposium on Cloud Computing
Automatic management of partitioned, replicated search services
Proceedings of the 2nd ACM Symposium on Cloud Computing
Thialfi: a client notification service for internet-scale applications
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
PREFAIL: a programmable tool for multiple-failure injection
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
Modeling and tolerating heterogeneous failures in large parallel systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Enhancing application robustness in cloud data centers
Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research
Long-term availability prediction for groups of volunteer resources
Journal of Parallel and Distributed Computing
Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Surviving failures in bandwidth-constrained datacenters
Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
Erasure coding in windows azure storage
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Failure-aware resource provisioning for hybrid Cloud infrastructure
Journal of Parallel and Distributed Computing
MyCassandra: a cloud storage supporting both read heavy and write heavy workloads
Proceedings of the 5th Annual International Systems and Storage Conference
Surviving failures in bandwidth-constrained datacenters
ACM SIGCOMM Computer Communication Review - Special october issue SIGCOMM '12
Robust Redundancy Scheme for the Repair Process: Hierarchical Codes in the Bandwidth-Limited Systems
Journal of Grid Computing
Themis: an I/O-efficient MapReduce
Proceedings of the Third ACM Symposium on Cloud Computing
Theia: visual signatures for problem diagnosis in large hadoop clusters
lisa'12 Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques
ACM Transactions on Storage (TOS)
Cloud API issues: an empirical study and impact
Proceedings of the 9th international ACM Sigsoft conference on Quality of software architectures
In-network redundancy generation for opportunistic speedup of data backup
Future Generation Computer Systems
Juggling the Jigsaw: towards automated problem inference from network trouble tickets
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Robustness in the Salus scalable block store
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
XORing elephants: novel erasure codes for big data
Proceedings of the VLDB Endowment
Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions
Journal of Grid Computing
Limplock: understanding the impact of limpware on scale-out cloud systems
Proceedings of the 4th annual Symposium on Cloud Computing
An untold story of redundant clouds: making your service deployment truly reliable
Proceedings of the 9th Workshop on Hot Topics in Dependable Systems
Effect of codeword placement on the reliability of erasure coded data storage systems
QEST'13 Proceedings of the 10th international conference on Quantitative Evaluation of Systems
HotStorage'13 Proceedings of the 5th USENIX conference on Hot Topics in Storage and File Systems
Copysets: reducing the frequency of data loss in cloud storage
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
On the efficiency of durable state machine replication
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
On the feasibility of completely wirelesss datacenters
IEEE/ACM Transactions on Networking (TON)
Warming up storage-level caches with bonfire
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies
Analysis of HDFS under HBase: a facebook messages case study
FAST'14 Proceedings of the 12th USENIX conference on File and Storage Technologies
Hi-index | 0.00 |
Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloudbased storage services. We characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.