The US public switched telephone network (PSTN) is a portion of possibly the largest distributed system in the world, and it is also among the most reliable. But why? The author studied outage records maintained by the US Federal Communications Commission (FCC) to find out. The FCC data tally the number and duration of outages as well as the number of customers affected. Each of these measures is informative on its own, but a meaningful comparison requires combining them. The author used the concept of customer minutes (the number of affected customers multiplied by the length of the outage in minutes) to provide this comparison. The author also used a classification technique that is general enough to allow comparison with other large distributed systems. Major categories of failure include human error, acts of nature, hardware and software failures, overloads, and vandalism. Analysis of the FCC records shows that although human error causes almost half the outages, such outages are relatively short. Overloads, which are outages accepted as a cost-performance trade-off for the telephone system, affect the most customers for the most minutes. The capability for human intervention is one key to the PSTN's reliability. The system also achieves an excellent trade-off between loose coupling and interaction simplicity, showing that the balance between these characteristics is an important consideration in any distributed-system design. Telephone switch manufacturers also produce extremely reliable software.
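The customer-minutes measure described above can be illustrated with a short sketch. This is not the author's analysis code. The `Outage` record, the `summarize` helper, and the sample figures are all hypothetical, chosen only to show how multiplying affected customers by outage duration lets outage categories be compared on one scale.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """Hypothetical outage record; field names are assumptions, not the FCC schema."""
    category: str            # e.g. "human error", "overload"
    customers_affected: int
    duration_minutes: float


def customer_minutes(outage: Outage) -> float:
    """Affected customers multiplied by outage length in minutes."""
    return outage.customers_affected * outage.duration_minutes


def summarize(outages):
    """Return the outage count and total customer minutes for each category."""
    summary = defaultdict(lambda: {"outages": 0, "customer_minutes": 0.0})
    for outage in outages:
        entry = summary[outage.category]
        entry["outages"] += 1
        entry["customer_minutes"] += customer_minutes(outage)
    return dict(summary)


# Made-up sample data, for illustration only (not taken from the FCC records).
SAMPLE = [
    Outage("human error", 10_000, 30),
    Outage("human error", 20_000, 15),
    Outage("overload", 500_000, 120),
]

if __name__ == "__main__":
    for category, stats in summarize(SAMPLE).items():
        print(category, stats["outages"], stats["customer_minutes"])
```

In this made-up sample, the category with more outages accumulates far fewer customer minutes than the single long overload, which is the kind of contrast the combined measure is meant to expose.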