Resilient computing systems: vol. 1
A case for redundant arrays of inexpensive disks (RAID)
SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
The fuzzy barrier: a mechanism for high speed synchronization of processors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
IBM experiments in soft fails in computer electronics (1978–1994)
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Fault-tolerant computer system design
Fault-tolerant computer system design
An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors
IEEE Transactions on Computers
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Generating representative Web workloads for network and server performance evaluation
SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
IEEE Transactions on Parallel and Distributed Systems
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
Architectural support for scalable speculative parallelization in shared-memory multiprocessors
Proceedings of the 27th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Interconnection Networks: An Engineering Approach
Interconnection Networks: An Engineering Approach
Increasing Work, Pushing the Clock
Computer
Computer
Verifying a Multiprocessor Cache Controller Using Random Test Generation
IEEE Design & Test
Spider: A High-Speed Network Interconnect
IEEE Micro
Starfire: Extending the SMP Envelope
IEEE Micro
Error Recovery in Shared Memory Multiprocessors Using Private Caches
IEEE Transactions on Parallel and Distributed Systems
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Checkpointing and Its Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor
Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor
IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective
IBM Journal of Research and Development
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Variability in Architectural Simulations of Multi-Threaded Workloads
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
ReEnact: using thread-level speculation mechanisms to debug data races in multithreaded codes
Proceedings of the 30th annual international symposium on Computer architecture
A "flight data recorder" for enabling full-system multiprocessor deterministic replay
Proceedings of the 30th annual international symposium on Computer architecture
Adaptive incremental checkpointing for massively parallel systems
Proceedings of the 18th annual international conference on Supercomputing
Fingerprinting: bounding soft-error detection latency and bandwidth
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Application-level checkpointing for shared memory programs
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
A serializability violation detector for shared-memory server programs
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging
Proceedings of the 32nd annual international symposium on Computer Architecture
Dynamic Verification of Sequential Consistency
Proceedings of the 32nd annual international symposium on Computer Architecture
Energy reduction in multiprocessor systems using transactional memory
ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
Memory State Compressors for Giga-Scale Checkpoint/Restore
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Cherry-MP: Correctly Integrating Checkpointed Early Resource Recycling in Chip Multiprocessors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Fast and transparent recovery for continuous availability of cluster-based servers
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
An Integrated Framework for Dependable and Revivable Architectures Using Multicore Processors
Proceedings of the 33rd annual international symposium on Computer Architecture
Recording shared memory dependencies using strata
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Experimental evaluation of application-level checkpointing for OpenMP programs
Proceedings of the 20th annual international conference on Supercomputing
Architecture of a Self-Checkpointing Microprocessor that Incorporates Nanomagnetic Devices
IEEE Transactions on Computers
Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
A hardware/software framework for supporting transactional memory in a MPSoC environment
ACM SIGARCH Computer Architecture News
Configurable isolation: building high availability systems with commodity multi-core processors
Proceedings of the 34th annual international symposium on Computer architecture
Adapting to intermittent faults in multicore systems
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Atom-Aid: Detecting and Surviving Atomicity Violations
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Techniques for Efficient Software Checking
Languages and Compilers for Parallel Computing
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Online design bug detection: RTL analysis, flexible mechanisms, and evaluation
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
RT-replayer: a record-replay architecture for embedded real-time software debugging
Proceedings of the 2009 ACM symposium on Applied Computing
Dynamic heterogeneity and the need for multicore virtualization
ACM SIGOPS Operating Systems Review
ESoftCheck: Removal of Non-vital Checks for Fault Tolerance
Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
A case for an interleaving constrained shared-memory multi-processor
Proceedings of the 36th annual international symposium on Computer architecture
Architecture Design for Soft Errors
Architecture Design for Soft Errors
mSWAT: low-cost hardware fault detection and diagnosis for multicore systems
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Offline symbolic analysis for multi-processor execution replay
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Specifying and dynamically verifying address translation-aware memory consistency
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Adaptive online testing for efficient hard fault detection
ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Timetraveler: exploiting acyclic races for optimizing memory race recording
Proceedings of the 37th annual international symposium on Computer architecture
Proceedings of the 37th annual international symposium on Computer architecture
Relax: an architectural framework for software recovery of hardware faults
Proceedings of the 37th annual international symposium on Computer architecture
Fault-Tolerant Flow Control in On-chip Networks
NOCS '10 Proceedings of the 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip
Using introspective software-based testing for post-silicon debug and repair
Proceedings of the 47th Design Automation Conference
Design techniques for cross-layer resilience
Proceedings of the Conference on Design, Automation and Test in Europe
A self-adaptive system architecture to address transistor aging
Proceedings of the Conference on Design, Automation and Test in Europe
CASPAR: hardware patching for multi-core processors
Proceedings of the Conference on Design, Automation and Test in Europe
Checkpoint/restart-enabled parallel debugging
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Rebound: scalable checkpointing for coherent shared memory
Proceedings of the 38th annual international symposium on Computer architecture
Sampling + DMR: practical and low-overhead permanent fault detection
Proceedings of the 38th annual international symposium on Computer architecture
DRAIN: distributed recovery architecture for inaccessible nodes in multi-core chips
Proceedings of the 48th Design Automation Conference
Reliability analysis for MPSoCs with mixed-critical, hard real-time constraints
CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
A systematic methodology to develop resilient cache coherence protocols
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Encore: low-cost, fine-grained transient fault recovery
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Application-Level checkpointing techniques for parallel programs
ICDCIT'06 Proceedings of the Third international conference on Distributed Computing and Internet Technology
Using dynamic task level redundancy for OpenMP fault tolerance
ARCS'12 Proceedings of the 25th international conference on Architecture of Computing Systems
Specification and synthesis of hardware checkpointing and rollback mechanisms
Proceedings of the 49th Annual Design Automation Conference
Proceedings of the 26th ACM international conference on Supercomputing
Proceedings of the 39th Annual International Symposium on Computer Architecture
Viper: virtual pipelines for enhanced reliability
Proceedings of the 39th Annual International Symposium on Computer Architecture
Runtime verification: a computer architecture perspective
RV'11 Proceedings of the Second international conference on Runtime verification
Fault tolerance for multi-threaded applications by leveraging hardware transactional memory
Proceedings of the ACM International Conference on Computing Frontiers
FaulTM: error detection and recovery using hardware transactional memory
Proceedings of the Conference on Design, Automation and Test in Europe
Reli: hardware/software checkpoint and recovery scheme for embedded processors
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
A survey of checker architectures
ACM Computing Surveys (CSUR)
Semi-automated debugging via binary search through a process lifetime
Proceedings of the Seventh Workshop on Programming Languages and Operating Systems
Virtually-aged sampling DMR: unifying circuit failure prediction and circuit failure detection
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
A low-power instruction replay mechanism for design of resilient microprocessors
ACM Transactions on Embedded Computing Systems (TECS)
Hi-index | 0.00 |
We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution.We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common-case of fault-free execution, and (b) avoids a crash when tolerated faults occur.