Leases: an efficient fault-tolerant mechanism for distributed file cache consistency
SOSP '89 Proceedings of the twelfth ACM symposium on Operating systems principles
Analysis and simulation of a fair queueing algorithm
SIGCOMM '89 Symposium proceedings on Communications architectures & protocols
Replication in the Harp file system
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Petal: distributed virtual disks
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Generational garbage collection and the radioactive decay model
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A model, analysis, and protocol framework for soft state-based communication
Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication
SEDA: an architecture for well-conditioned, scalable internet services
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Lessons from Giant-Scale Services
IEEE Internet Computing
Perfect Failure Detection in Timed Asynchronous Systems
IEEE Transactions on Computers
Hippodrome: Running Circles Around Storage Administration
FAST '02 Proceedings of the Conference on File and Storage Technologies
Finding surprising patterns in a time series database in linear time and space
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Weighted voting for replicated data
SOSP '79 Proceedings of the seventh ACM symposium on Operating systems principles
Harvest, Yield, and Scalable Tolerant Systems
HOTOS '99 Proceedings of the Seventh Workshop on Hot Topics in Operating Systems
A Methodology for Detection and Estimation of Software Aging
ISSRE '98 Proceedings of the Ninth International Symposium on Software Reliability Engineering
A comparison of hard-state and soft-state signaling protocols
Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications
HOTOS '01 Proceedings of the Eighth Workshop on Hot Topics in Operating Systems
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Using performance reflection in systems software
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Palimpsest: soft-capacity storage for planetary-scale services
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
FAB: enterprise storage systems on a shoestring
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
The case for a session state storage layer
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Scalable, distributed data structures for internet service construction
OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
ATEC '99 Proceedings of the annual conference on USENIX Annual Technical Conference
FAB: building distributed enterprise disk arrays from commodity components
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Cheap recovery: a key to self-managing state
ACM Transactions on Storage (TOS)
Combining statistical monitoring and predictable recovery for self-management
WOSS '04 Proceedings of the 1st ACM SIGSOFT workshop on Self-managed systems
Autonomous recovery in componentized Internet applications
Cluster Computing
J2EE server scalability through EJB replication
Proceedings of the 2006 ACM symposium on Applied computing
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Minimal backups of cryptographic protocol runs
Proceedings of the 6th ACM workshop on Formal methods in security engineering
State considerations in distributed systems
Crossroads
Obtaining resource controllability in service cooperation environments
Proceedings of the 7th International Conference on Mobile and Ubiquitous Multimedia
Secure resource control in service oriented applications
CCNC'09 Proceedings of the 6th IEEE Conference on Consumer Communications and Networking Conference
Centrifuge: integrated lease management and partitioning for cloud services
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Dynamically scaling applications in the cloud
ACM SIGCOMM Computer Communication Review
A study on scalability of services and privacy issues in cloud computing
ICDCIT'12 Proceedings of the 8th international conference on Distributed Computing and Internet Technology
Journal of Ambient Intelligence and Smart Environments
Journal of Ambient Intelligence and Smart Environments
Pico replication: a high availability framework for middleboxes
Proceedings of the 4th annual Symposium on Cloud Computing
The cost and complexity of administering large systems have come to dominate their total cost of ownership. Stateless and soft-state components, e.g., Web servers or network routers, are easy to manage: capacity can be scaled incrementally by adding more nodes, rebalancing load after failover is easy, and reactive or proactive ("rolling") reboots can be used to handle transient failures. We show that it is possible to achieve the same ease of management for the state-storage subsystem by subdividing persistent state according to the specific guarantees needed by each type. While other systems [19,17] have addressed persistent-until-deleted state, we describe SSM, a store for a previously unaddressed class of state (user-session state) that exhibits the same manageability properties as stateless nodes while providing firm storage guarantees. Any node can be proactively or reactively rebooted at any time to recover from transient faults, without impacting online performance or losing data. We exploit this simplified manageability by pairing SSM with an application-generic, statistical-anomaly-based framework that detects crashes, hangs, and performance failures, and automatically attempts to recover from them by rebooting faulty nodes. Although the detection techniques generate some false positives, recovery is so cheap that false positives have little impact. We provide microbenchmarks to demonstrate SSM's built-in overload protection, failure management, and self-tuning. We benchmark SSM integrated into a production enterprise-scale interactive service to demonstrate that these benefits need not come at the cost of significantly decreased throughput or response time.
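The pairing of anomaly detection with cheap reboots can be illustrated with a small sketch. The abstract does not say which statistical test the framework uses, so the version below assumes a simple peer comparison based on the median absolute deviation; the function names, the `threshold` value, and the `reboot` callback are all hypothetical and are not taken from SSM.

```python
from statistics import median


def find_anomalous_nodes(metrics, threshold=3.5):
    """Return the nodes whose metric deviates strongly from the peer median.

    metrics: dict mapping node name -> one recent measurement
             (for example, mean request latency in milliseconds).
    threshold: assumed cutoff on the modified z-score (hypothetical value).
    """
    values = list(metrics.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the nodes report the same value, so this simple
        # test cannot rank deviations; flag nothing rather than guess.
        return []
    return [
        name
        for name, v in metrics.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


def recover(metrics, reboot):
    """Reboot every node flagged as anomalous and return the flagged names.

    reboot: caller-supplied callable taking a node name (hypothetical hook).
    Rebooting on suspicion is only reasonable if a reboot is cheap and loses
    no data, which is the property the abstract claims for SSM nodes.
    """
    faulty = find_anomalous_nodes(metrics)
    for node in faulty:
        reboot(node)
    return faulty


if __name__ == "__main__":
    latencies = {"n1": 10.0, "n2": 11.0, "n3": 9.5, "n4": 10.5, "n5": 95.0}
    rebooted = []
    recover(latencies, rebooted.append)
    print(rebooted)
```

A detector this crude will occasionally flag a healthy node. The sketch tolerates that for the same reason the abstract gives: when a reboot costs little, a false positive costs little.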