On designing and deploying internet-scale services

Authors: James Hamilton
Affiliations: Windows Live Services Platform
Venue: LISA'07 Proceedings of the 21st conference on Large Installation System Administration Conference
Year: 2007

Citing 2
Cited 22

Recovery Oriented Computing: A New Research Agenda for a New Century
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture

Autopilot: automatic data center management
ACM SIGOPS Operating Systems Review - Systems work at Microsoft Research

Multi-tenant databases for software as a service: schema-mapping techniques
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Enabling the autonomic data center with a smart bare-metal server platform
ICAC '09 Proceedings of the 6th international conference on Autonomic computing

Toward a cloud computing research agenda
ACM SIGACT News

Building reliable large-scale distributed systems: when theory meets practice
ACM SIGACT News

CatchAndRetry: extending exceptions to handle distributed system failures and recovery
Proceedings of the Fifth Workshop on Programming Languages and Operating Systems

Fluxo: a system for internet service programming by non-expert developers
Proceedings of the 1st ACM symposium on Cloud computing

Characterizing cloud computing hardware reliability
Proceedings of the 1st ACM symposium on Cloud computing

Data center TCP (DCTCP)
Proceedings of the ACM SIGCOMM 2010 conference

FLUXO: a simple service compiler
HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems

Towards automatically checking thousands of failures with micro-specifications
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability

DepSky: dependable and secure storage in a cloud-of-clouds
Proceedings of the sixth conference on Computer systems

FATE and DESTINI: a framework for cloud recovery testing
Proceedings of the 8th USENIX conference on Networked systems design and implementation

A cascade ranking model for efficient ranked retrieval
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Enabling dynamic data centers with a smart bare-metal server platform
Cluster Computing

Automatic management of partitioned, replicated search services
Proceedings of the 2nd ACM Symposium on Cloud Computing

Economics of cloud computing for enterprise IT
IBM Journal of Research and Development

Fast candidate generation for real-time tweet search with bloom filter chains
ACM Transactions on Information Systems (TOIS)

Introducing service-level awareness in the cloud
Proceedings of the 4th annual Symposium on Cloud Computing

DepSky: Dependable and Secure Storage in a Cloud-of-Clouds
ACM Transactions on Storage (TOS)

Toward software-defined SLAs
Communications of the ACM

Toward Software-defined SLAs
Queue - Distributed Computing

HARDFS: hardening HDFS with selective and lightweight versioning
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies

Quantified Score

Hi-index	0.02

Abstract

The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services, this ratio can be as low as 2:1, whereas on industry-leading, highly automated services, we've seen ratios as high as 2,500:1. Within Microsoft services, Autopilot [1] is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations-friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.