Recovery Oriented Computing: A New Research Agenda for a New Century
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Autopilot: automatic data center management
ACM SIGOPS Operating Systems Review - Systems work at Microsoft Research
Multi-tenant databases for software as a service: schema-mapping techniques
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Enabling the autonomic data center with a smart bare-metal server platform
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Toward a cloud computing research agenda
ACM SIGACT News
CatchAndRetry: extending exceptions to handle distributed system failures and recovery
Proceedings of the Fifth Workshop on Programming Languages and Operating Systems
Fluxo: a system for internet service programming by non-expert developers
Proceedings of the 1st ACM symposium on Cloud computing
Characterizing cloud computing hardware reliability
Proceedings of the 1st ACM symposium on Cloud computing
Proceedings of the ACM SIGCOMM 2010 conference
FLUXO: a simple service compiler
HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Towards automatically checking thousands of failures with micro-specifications
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
DepSky: dependable and secure storage in a cloud-of-clouds
Proceedings of the sixth conference on Computer systems
FATE and DESTINI: a framework for cloud recovery testing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
A cascade ranking model for efficient ranked retrieval
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Enabling dynamic data centers with a smart bare-metal server platform
Cluster Computing
Automatic management of partitioned, replicated search services
Proceedings of the 2nd ACM Symposium on Cloud Computing
Economics of cloud computing for enterprise IT
IBM Journal of Research and Development
Fast candidate generation for real-time tweet search with bloom filter chains
ACM Transactions on Information Systems (TOIS)
Introducing service-level awareness in the cloud
Proceedings of the 4th annual Symposium on Cloud Computing
DepSky: Dependable and Secure Storage in a Cloud-of-Clouds
ACM Transactions on Storage (TOS)
Communications of the ACM
Queue - Distributed Computing
HARDFS: hardening HDFS with selective and lightweight versioning
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services this ratio can be as low as 2:1, whereas on industry-leading, highly automated services, we've seen ratios as high as 2,500:1. Within Microsoft services, Autopilot [1] is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations-friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.