Microsoft is rapidly increasing the number of large-scale web services that it operates. Services such as Windows Live Search and Windows Live Mail run in data centers that contain tens or hundreds of thousands of computers, and it is essential that these data centers function reliably with minimal human intervention. This paper describes the first version of Autopilot, the automatic data center management infrastructure developed within Microsoft over the last few years. Autopilot is responsible for automating software provisioning and deployment, monitoring systems, and carrying out repair actions to deal with faulty software and hardware. A key assumption underlying Autopilot is that the services built on it must be designed to be manageable. We therefore also outline the best practices adopted by applications that run on Autopilot.