IEEE Transactions on Parallel and Distributed Systems
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Checkpointing and Its Applications
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
LLVA: A Low-level Virtual Instruction Set Architecture
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Software-controlled fault tolerance
ACM Transactions on Architecture and Code Optimization (TACO)
Automatic Instruction-Level Software-Only Recovery
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Reunion: Complexity-Effective Multicore Redundancy
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Carbon: architectural support for fine-grained parallelism on chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Optimistic parallelism requires abstractions
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Paceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Application-Level Correctness and its Impact on Fault Tolerance
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Effective Optimistic-Checker Tandem Core Design through Architectural Pruning
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Reliable Systems on Unreliable Fabrics
IEEE Design & Test
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Toward a multicore architecture for real-time ray-tracing
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Architectural core salvaging in a multi-core processor for hard-error tolerance
Proceedings of the 36th annual international symposium on Computer architecture
Rigel: an architecture and scalable programming interface for a 1000-core accelerator
Proceedings of the 36th annual international symposium on Computer architecture
Fault Tolerant Computer Architecture
Fault Tolerant Computer Architecture
CodeSurfer/x86—A platform for analyzing x86 executables
CC'05 Proceedings of the 14th international conference on Compiler Construction
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Flikker: saving DRAM refresh-power through critical data partitioning
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
EnerJ: approximate data types for safe and general low-power computation
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Sampling + DMR: practical and low-overhead permanent fault detection
Proceedings of the 38th annual international symposium on Computer architecture
Architecture support for disciplined approximate programming
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Encore: low-cost, fine-grained transient fault recovery
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Static analysis and compiler design for idempotent processing
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Fault resilience of the algebraic multi-grid solver
Proceedings of the 26th ACM international conference on Supercomputing
A defect-tolerant accelerator for emerging high-performance applications
Proceedings of the 39th Annual International Symposium on Computer Architecture
Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
ConAir: featherweight concurrency bug recovery via single-threaded idempotent execution
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
An adaptive low-overhead mechanism for dependable general-purpose many-core processors
ICT-EurAsia'13 Proceedings of the 2013 international conference on Information and Communication Technology
Neural Acceleration for General-Purpose Approximate Programs
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Continuous real-world inputs can open up alternative accelerator designs
Proceedings of the 40th Annual International Symposium on Computer Architecture
A survey of checker architectures
ACM Computing Surveys (CSUR)
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Verifying quantitative reliability for programs that execute on unreliable hardware
Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
SAGE: self-tuning approximation for graphics engines
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Approximate storage in solid-state memories
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
A survey of cross-layer power-reliability tradeoffs in multi and many core systems-on-chip
Microprocessors & Microsystems
Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems
Scientific Programming - Selected Papers from Super Computing 2012
Hi-index | 0.01 |
As technology scales ever further, device unreliability is creating excessive complexity for hardware to maintain the illusion of perfect operation. In this paper, we consider whether exposing hardware fault information to software and allowing software to control fault recovery simplifies hardware design and helps technology scaling. The combination of emerging applications and emerging many-core architectures makes software recovery a viable alternative to hardware-based fault recovery. Emerging applications tend to have few I/O and memory side-effects, which limits the amount of information that needs checkpointing, and they allow discarding individual sub-computations with small qualitative impact. Software recovery can harness these properties in ways that hardware recovery cannot. We describe Relax, an architectural framework for software recovery of hardware faults. Relax includes three core components: (1) an ISA extension that allows software to mark regions of code for software recovery, (2) a hardware organization that simplifies reliability considerations and provides energy efficiency with hardware recovery support removed, and (3) software support for compilers and programmers to utilize the Relax ISA. Applying Relax to counter the effects of process variation, our results show a 20% energy efficiency improvement for PARSEC applications with only minimal source code changes and simpler hardware.