This paper evaluates new architectural solutions for data communication in shared-memory parallel systems. These solutions enable the creation of run-time reconfigurable processor clusters with very efficient inter-processor data exchange. Data brought into the data cache of a processor that enters a cluster can be transparently intercepted by many processors in the cluster. Direct communication between processor caches is possible, which eliminates standard data transactions. The system provides simultaneous connections between processors and many memory modules, which further increases the potential for parallel inter-cluster data exchange. System-on-chip technology is applied. Special program macro-data-flow graphs enable proper structuring of program execution control, including the specification of parallel execution, data cache operations, switching of processors between clusters, and multiple parallel reads of data on the fly. Simulation results from symbolic execution of graphs of fine-grain numerical algorithms illustrate the high efficiency of the proposed architecture and its suitability for massively parallel fine-grain numerical computations.