Performance analysis of multiple-processor systems
Performance analysis of the FFT algorithm on a shared-memory parallel architecture
IBM Journal of Research and Development
Best worst mappings for the omega network
IBM Journal of Research and Development
The influence of parallel decomposition strategies on the performance of multiprocessor systems
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
The NYU Ultracomputer - Designing an MIMD Shared Memory Parallel Computer
IEEE Transactions on Computers
Interconnections Between Processors and Memory Modules Using the Shuffle-Exchange Network
IEEE Transactions on Computers
Access and Alignment of Data in an Array Processor
IEEE Transactions on Computers
On the Impact of Communication Complexity on the Design of Parallel Numerical Algorithms
IEEE Transactions on Computers
Modeling the Weather with a Data Flow Supercomputer
IEEE Transactions on Computers
SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
Iterative Algorithms for Solution of Large Sparse Systems of Linear Equations on Hypercubes
IEEE Transactions on Computers
Exploiting variable grain parallelism at runtime
PPEALS '88 Proceedings of the ACM/SIGPLAN conference on Parallel programming: experience with applications, languages and systems
Implementation and performance analysis of parallel assignment algorithms on a hypercube computer
C3P Proceedings of the third conference on Hypercube concurrent computers and applications - Volume 2
Measuring the scalability of parallel computer systems
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Shared Block Contention in a Cache Coherence Protocol
IEEE Transactions on Computers
The KYKLOS Multicomputer Network: Interconnection Strategies, Properties, and Applications
IEEE Transactions on Computers
Improved Algorithms for Mapping Pipelined and Parallel Computations
IEEE Transactions on Computers
Models of machines and computation for mapping in multicomputers
ACM Computing Surveys (CSUR)
Fault simulation in a distributed environment
DAC '88 Proceedings of the 25th ACM/IEEE Design Automation Conference
Extended Hypercube: A Hierarchical Interconnection Network of Hypercubes
IEEE Transactions on Parallel and Distributed Systems
Synchronization and Communication Costs of Loop Partitioning on Shared-Memory Multiprocessor Systems
IEEE Transactions on Parallel and Distributed Systems
Analysis of Macro-Dataflow Dynamic Scheduling on Nonuniform Memory Access Architectures
IEEE Transactions on Parallel and Distributed Systems
Scheduling DAG's for Asynchronous Multiprocessor Execution
IEEE Transactions on Parallel and Distributed Systems
Parallelism in a Main-Memory DBMS: The Performance of PRISMA/DB
VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
A Non-Uniform Data Fragmentation Strategy for Parallel Main-Memory Database Systems
VLDB '95 Proceedings of the 21st International Conference on Very Large Data Bases
Parallel program performance prediction using deterministic task graph analysis
ACM Transactions on Computer Systems (TOCS)
International Journal of Networking and Virtual Organisations
HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
In this paper we analyze the effects of problem decomposition, the allocation of subproblems to processors, and the grain size of subproblems on the performance of a multiple-processor shared-memory architecture. Our results indicate that for algorithms where both the computation and the communication overhead can be fully decomposed among N processors, the speedup is a nondecreasing function of the level of granularity for an arbitrary interconnection structure and allocation of subproblems to processors. For these algorithms, the speedup is an increasing function of the level of granularity provided that the interconnection bandwidth is greater than unity; if the bandwidth is equal to unity, the speedup converges to the ratio of processing time to communication time. For algorithms where the computation is decomposable but the communication overhead cannot be decomposed, the speedup is a nondecreasing function of the level of granularity only for the best-case bandwidth. If the bandwidth is less than N, the speedup reaches a maximum and then decreases, approaching zero as the level of granularity grows. For algorithms where the computation consists of parallel and serial sections of code and the communication overhead is fully decomposable, the speedup converges to a value inversely proportional to the fraction of time spent in the serial code, even for the best-case interconnection bandwidth.
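The limiting behavior for code with a serial section matches Amdahl's law: with serial fraction f of the execution time, speedup on N processors is bounded by 1/(f + (1-f)/N) and converges to 1/f as N grows. A minimal sketch of that bound (the function name and parameter values are illustrative, not taken from the paper's model):

```python
def amdahl_speedup(f, n):
    """Upper bound on speedup with serial fraction f on n processors.

    Assumes the parallel fraction (1 - f) divides perfectly among the
    n processors and ignores communication overhead entirely.
    """
    return 1.0 / (f + (1.0 - f) / n)

if __name__ == "__main__":
    # With a 5% serial fraction, adding processors helps less and less:
    for n in (1, 10, 100, 10**6):
        print(n, amdahl_speedup(0.05, n))
    # As n grows, the speedup approaches 1/f = 20 but never reaches it.
```

This illustrates only the serial-fraction limit from the last sentence of the abstract; the paper's granularity and bandwidth effects would require a more detailed communication model.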