The reliability wall is one of the most challenging problems facing next-generation High Performance Computing (HPC) systems. Traditional system designs add extra fault-tolerance mechanisms, but these mechanisms can themselves be costly enough to reduce the utilization of the HPC system. To address this problem, we present NV-process, a fault-tolerant process model based on non-volatile RAM (NVRAM). NV-process instances run self-contained in NVRAM and can therefore survive an operating system reboot, which gives applications an elegant way to tolerate system crashes. We implement a prototype of NV-process on Linux and analyze its advantages over traditional fault-tolerance mechanisms for future HPC applications.
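The idea of state that outlives a crash can be illustrated with a small sketch. This is not the NV-process prototype, whose design is not described here. It only assumes that a file-backed memory mapping can stand in for an NVRAM region. The names `open_region`, `run`, `REGION_SIZE`, `MAGIC`, and the `crash_after` parameter are hypothetical and exist only for illustration.

```python
import mmap
import os
import struct
import tempfile

REGION_SIZE = 4096                 # one page standing in for an NVRAM region (assumption)
HEADER = struct.Struct("<QQ")      # (magic, completed_steps)
MAGIC = 0x4E5650524F43             # arbitrary marker showing the region is initialised


def open_region(path):
    """Map a file-backed region; the file is a stand-in for NVRAM."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        if os.fstat(fd).st_size < REGION_SIZE:
            os.ftruncate(fd, REGION_SIZE)
        return mmap.mmap(fd, REGION_SIZE)
    finally:
        os.close(fd)               # the mapping stays valid after the descriptor is closed


def run(path, total_steps, crash_after=None):
    """Advance a counter kept in the region, resuming where a prior run stopped."""
    region = open_region(path)
    try:
        magic, done = HEADER.unpack_from(region, 0)
        if magic != MAGIC:         # fresh region: nothing to resume
            done = 0
        steps_this_run = 0
        while done < total_steps:
            if crash_after is not None and steps_this_run == crash_after:
                return done        # simulated crash: stop without further work
            done += 1
            HEADER.pack_into(region, 0, MAGIC, done)
            region.flush()         # push the update to the backing file
            steps_this_run += 1
        return done
    finally:
        region.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "region.bin")
    print(run(path, 10, crash_after=4))   # first run stops early
    print(run(path, 10))                  # second run resumes from the saved counter
```

In this sketch the second call does not restart from zero, because the counter lives in the mapped region rather than in ordinary process memory. A real NVRAM-based design would also have to keep the code, stack, and kernel-side state of the process recoverable, which this example does not attempt.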