High-performance, energy-efficient data management is a necessity for HPC systems because of the extreme scale of data produced by the high-fidelity scientific simulations these systems support. Data layout in memory strongly affects performance. To run faster, most simulations interleave variables in memory during their calculation phase, then deinterleave the data for subsequent storage and analysis. Efficient deinterleaving is therefore critical, yet common deinterleaving methods deliver poor throughput and energy efficiency. To address this problem, we propose a deinterleaving method that is high performance, energy efficient, and generic to any data type. To the best of our knowledge, this is the first deinterleaving method that 1) exploits data cache prefetching, 2) reduces memory accesses, and 3) optimizes the use of complete cache line writes. When evaluated against conventional deinterleaving methods on 105 STREAM standard micro-benchmarks, our method always improved throughput and throughput/watt on multi-core systems. In the best case, it improved throughput by up to 26.2x and throughput/watt by up to 7.8x.
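To make the layout transformation concrete, the following is a minimal sketch of a conventional, element-by-element deinterleave of the kind the abstract describes as the baseline. It is not the proposed method: it does no prefetching and no cache-line-aware writes. The function name `deinterleave` and its parameters `num_vars` and `elem_size` are hypothetical, chosen for illustration. It works at byte granularity, so it is independent of the element type.

```python
import struct
from typing import List


def deinterleave(buf: bytes, num_vars: int, elem_size: int) -> List[bytes]:
    """Split an interleaved buffer into one contiguous buffer per variable.

    Hypothetical helper for illustration: `buf` holds records laid out as
    [v0, v1, ..., v(n-1)], [v0, v1, ...], with each element `elem_size` bytes.
    """
    record_size = num_vars * elem_size
    if num_vars <= 0 or elem_size <= 0:
        raise ValueError("num_vars and elem_size must be positive")
    if len(buf) % record_size != 0:
        raise ValueError("buffer length is not a whole number of records")

    num_records = len(buf) // record_size
    out = [bytearray(num_records * elem_size) for _ in range(num_vars)]
    view = memoryview(buf)

    # Naive strided copy: one element at a time, one output stream per variable.
    for r in range(num_records):
        base = r * record_size
        dst = r * elem_size
        for v in range(num_vars):
            src = base + v * elem_size
            out[v][dst:dst + elem_size] = view[src:src + elem_size]

    return [bytes(o) for o in out]


# Two records of three interleaved doubles: (x, y, z), (x, y, z).
interleaved = struct.pack("<6d", 1.0, 10.0, 100.0, 2.0, 20.0, 200.0)
xs, ys, zs = deinterleave(interleaved, num_vars=3, elem_size=8)
print(struct.unpack("<2d", xs), struct.unpack("<2d", ys), struct.unpack("<2d", zs))
```

Each output stream is written a few bytes at a time while the input is read with a stride. That scattered access pattern is what the three optimizations listed in the abstract are meant to address.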