POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Scanning polyhedra with DO loops
PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Computer organization & design: the hardware/software interface
Computer organization & design: the hardware/software interface
Integration, the VLSI Journal
Software overhead in messaging layers: where does the time go?
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Partitioning and mapping of nested loops for linear array multicomputers
The Journal of Supercomputing - Special issue: trends in parallel operating systems
U-Net: a user-level network interface for parallel and distributed computing
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Network interface for protected, user-level communication
Network interface for protected, user-level communication
Parallel stereocorrelation on a reconfigurable multi-ring network
The Journal of Supercomputing - Special issue on parallel and distributed processing
Parallel Computer Vision on a Reconfigurable Multiprocessor Network
IEEE Transactions on Parallel and Distributed Systems
Communication-minimal tiling of uniform dependence loops
Journal of Parallel and Distributed Computing
Determining the idle time of a tiling
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Effects of communication latency, overhead, and bandwidth in a cluster architecture
Proceedings of the 24th annual international symposium on Computer architecture
Modeling communication pipeline latency
SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
ICS '98 Proceedings of the 12th international conference on Supercomputing
Improving Cache Locality by a Combination of Loop and Data Transformations
IEEE Transactions on Computers - Special issue on cache memory and related problems
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Selecting tile shape for minimal execution time
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Optimal scheduling for UET/VET-UCT generalized n-dimensional grid task graphs
Journal of Parallel and Distributed Computing
Chain Grouping: A Method for Partitioning Loops onto Mesh-Connected Processor Arrays
IEEE Transactions on Parallel and Distributed Systems
Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
User-space communication: a quantitative study
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
EMP: zero-copy OS-bypass NIC-driven gigabit ethernet message passing
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Pipelined Data Parallel Algorithms-II: Design
IEEE Transactions on Parallel and Distributed Systems
Partitioning and Mapping Nested Loops on Multiprocessor Systems
IEEE Transactions on Parallel and Distributed Systems
On Supernode Transformation with Minimized Total Running Time
IEEE Transactions on Parallel and Distributed Systems
On Time Optimal Supernode Shape
IEEE Transactions on Parallel and Distributed Systems
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
VIA over SCI: Consequences of a Zero Copy Implementation and Comparison with VIA over Myrinet
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
The SCI Standard and Applications of SCI
SCI: Scalable Coherent Interface, Architecture and Software for High-Performance Compute Clusters
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
On the Parallel Execution Time of Tiled Loops
IEEE Transactions on Parallel and Distributed Systems
The Sensitivity of Communication Mechanisms to Bandwidth and Latency
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
A Pipelined Execution of Tiled Nested Loops on SMPs with Computation and Communication Overlapping
ICPPW '02 Proceedings of the 2002 International Conference on Parallel Processing Workshops
Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping
IPDPS '01 Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS'01) - Volume 1
Evaluation of Loop Grouping Methods Based on Orthogonal Projection Spaces
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Pin-down Cache: A Virtual Memory Management Technique for Zero-copy Communication
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Tiling, Block Data Layout, and Memory Hierarchy Performance
IEEE Transactions on Parallel and Distributed Systems
A Geometric Programming Framework for Optimal Multi-Level Tiling
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
An efficient code generation technique for tiled iteration spaces
IEEE Transactions on Parallel and Distributed Systems
Parallel loop generation and scheduling
The Journal of Supercomputing
Automatic code generation for distributed memory architectures in the polytope model
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Hi-index | 0.00 |
This paper proposes a novel approach for the parallel execution of tiled Iteration Spaces onto a cluster of SMP PC nodes. Each SMP node has multiple CPUs and a single memory mapped PCI-SCI Network Interface Card. We apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. In this way, intranode (intragroup) communication is annihilated. Groups are atomically executed inside each node. Nodes exchange data between successive group computations. We schedule groups much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive atomic group executions. The applied non-blocking schedule resembles a pipelined datapath, where group computation phases are overlapped with communication ones, instead of being interleaved with them. Our experimental results illustrate that the proposed method outperforms previous approaches involving blocking communication or conventional grouping schemes.