Dynamic Cluster Resource Allocations for Jobs with Known and Unknown Memory Demands
IEEE Transactions on Parallel and Distributed Systems
Combining analytical and empirical approaches in tuning matrix transposition
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Optimal bit-reversal using vector permutations
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Performance of parallel bit-reversal with Cilk and UPC for fast Fourier transform
GPC'10 Proceedings of the 5th international conference on Advances in Grid and Pervasive Computing
In this paper, we examine methods that use blocking, buffering, and padding to implement bit-reversals efficiently. We evaluate the merits and limits of each technique, together with the application- and architecture-dependent conditions under which it yields cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulations and measurements on two commercial symmetric multiprocessors (SMPs) to provide architectural insights into the methods and their implementations. We present two contributions in this paper: (1) Our integrated blocking methods, which match the cache associativity and the translation-lookaside buffer (TLB) cache size and which make full use of the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Because the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and multiprocessors.
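To make the terms concrete, the following is a minimal sketch of an out-of-place bit-reversal permutation and of one way padding can be applied to the destination array. It is an illustration, not the authors' implementation: the function names, the `block` and `pad` parameters, and the choice to insert `pad` unused slots after every `block` elements are all assumptions made for this example. The intuition is that bit-reversed writes land at large power-of-two strides, which tend to map to the same cache sets; shifting each block by a small offset spreads those writes out. Python is used only to show the index arithmetic and says nothing about actual cache behavior.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Reverse the low `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reversal_copy(x):
    """Naive out-of-place bit-reversal: y[rev(i)] = x[i]."""
    n = len(x)
    bits = n.bit_length() - 1
    assert n == 1 << bits, "length must be a power of two"
    y = [None] * n
    for i in range(n):
        y[bit_reverse(i, bits)] = x[i]
    return y


def padded_index(j: int, block: int, pad: int) -> int:
    """Map logical index j to a physical index in an array that has
    `pad` unused slots after every `block` elements (hypothetical layout)."""
    return j + (j // block) * pad


def bit_reversal_copy_padded(x, block=4, pad=1):
    """Bit-reversal into a padded destination array.

    `block` stands in for a cache-line-sized group of elements and
    `pad` for the number of filler slots; both are assumed values.
    """
    n = len(x)
    bits = n.bit_length() - 1
    assert n == 1 << bits, "length must be a power of two"
    n_blocks = (n + block - 1) // block
    y = [None] * (n + n_blocks * pad)
    for i in range(n):
        y[padded_index(bit_reverse(i, bits), block, pad)] = x[i]
    return y


def unpad(y, n, block=4, pad=1):
    """Read the n logical elements back out of a padded array."""
    return [y[padded_index(j, block, pad)] for j in range(n)]
```

For example, `bit_reversal_copy(list(range(8)))` returns `[0, 4, 2, 6, 1, 5, 3, 7]`, and `unpad(bit_reversal_copy_padded(list(range(8))), 8)` returns the same list. A consumer of the padded array would normally index it through `padded_index` directly instead of unpadding, so that the layout benefit carries over to later passes.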