Conjugate gradient sparse solvers: performance-power characteristics

Authors:
Konrad Malkowski;Ingyu Lee;Padma Raghavan;Mary Jane Irwin
Affiliations:
The Pennsylvania State University, Department of Computer Science and Engineering, University Park, PA (all authors)
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006

Abstract

We characterize the performance and power attributes of the conjugate gradient (CG) sparse solver, which is widely used in scientific applications. We use cycle-accurate simulations with SimpleScalar and Wattch on a processor and memory architecture similar to that of a BlueGene/L node. We first demonstrate that substantial power savings can be obtained without performance degradation if the low-power modes of caches can be utilized. We next show that if Dynamic Voltage Scaling (DVS) can be used, power and energy savings are possible, but only at the expense of performance. We then consider two simple memory subsystem optimizations, namely memory prefetching and level-2 cache prefetching. We demonstrate that when DVS and the low-power modes of caches are combined with these optimizations, performance can be improved significantly while power and energy are reduced. For example, execution time is reduced by 23%, power by 55%, and energy by 65% in the final configuration at 500 MHz relative to the original at 1 GHz. We also use our codes and the NAS CG benchmark code to demonstrate that performance and power profiles can vary significantly depending on matrix properties and the level of code tuning. These results indicate that architectural evaluations can benefit if traditional benchmarks are augmented with codes more representative of tuned scientific applications.
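
The three percentages quoted in the abstract can be cross-checked against each other. The sketch below is illustrative and not taken from the paper: it assumes the reported power is average power over the run, so that energy is simply power multiplied by execution time. The function name `relative_energy` is hypothetical.

```python
def relative_energy(time_reduction, power_reduction):
    """Return the energy ratio (new / baseline).

    Assumption (not stated in the abstract): energy = average power x execution time.
    Both arguments are fractional reductions, e.g. 0.23 for "reduced by 23%".
    """
    time_ratio = 1.0 - time_reduction    # new time / baseline time
    power_ratio = 1.0 - power_reduction  # new power / baseline power
    return time_ratio * power_ratio


# Figures quoted in the abstract: final configuration at 500 MHz vs. original at 1 GHz.
energy_ratio = relative_energy(time_reduction=0.23, power_reduction=0.55)
energy_reduction = 1.0 - energy_ratio

print(f"energy ratio:     {energy_ratio:.4f}")
print(f"energy reduction: {energy_reduction:.4f}")
```

Under that assumption, the product 0.77 × 0.45 gives an energy ratio of about 0.35, which agrees with the stated 65% energy reduction after rounding.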