The subset-sum problem is a well-known NP-complete combinatorial problem that is solvable in pseudo-polynomial time, that is, time proportional to the number of input objects multiplied by the sum of their sizes. This product defines the size of the dynamic programming table used to solve the problem. We show how this problem can be parallelized on three contemporary architectures: a 128-processor Cray Extreme Multithreading (XMT) massively multithreaded machine, a 16-processor IBM x3755 shared-memory machine, and a 240-core NVIDIA FX 5800 graphics processing unit (GPU). We show that parallelizing this algorithm on the Cray XMT is straightforward, primarily because of the word-level locking available on that architecture. For the other two machines, we present an alternating word algorithm that yields an efficient solution. Our results show that the GPU performs well on problems whose tables fit within the device memory. Because GPUs typically have memories on the order of 10 GB, such architectures are best suited to small problems with tables of approximately 10^10 bits. The IBM x3755 performs very well on medium-sized problems that fit within its 64-GB memory, but it scales poorly as the number of processors increases and cannot sustain performance as the problem size grows; it tends to saturate at problem sizes of about 10^11 bits. The Cray XMT scales very well on large problems and sustains its performance as the problem size increases; however, it scales poorly on small problems and performs best at problem sizes of 10^12 bits or more. These results illustrate that the subset-sum problem can be parallelized well on all three architectures, albeit over different ranges of problem sizes. The behavior of the three machines under varying problem sizes highlights the strengths and weaknesses of the three architectures. Copyright © 2012 John Wiley & Sons, Ltd.
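For reference, the serial pseudo-polynomial dynamic program that the abstract alludes to can be sketched as follows. This is a minimal illustration of the classic table-filling algorithm, not the paper's parallel implementation; the function and variable names are assumptions for the sketch.

```python
def subset_sum(weights, target):
    """Classic pseudo-polynomial DP for subset-sum.

    reachable[s] is True iff some subset of `weights` sums to exactly s.
    Time and space are proportional to len(weights) * target, which is
    the dynamic programming table size the abstract refers to.
    """
    reachable = [False] * (target + 1)
    reachable[0] = True  # the empty subset sums to 0
    for w in weights:
        # Scan sums high-to-low so each object is used at most once.
        for s in range(target, w - 1, -1):
            if reachable[s - w]:
                reachable[s] = True
    return reachable[target]

print(subset_sum([3, 34, 4, 12, 5, 2], 9))  # True: 4 + 5 = 9
```

Each row of the table depends only on the previous row's bits, which is what makes row-wise parallelization attractive: the inner loop over sums can be distributed across processors, at the cost of synchronizing updates to the shared bit vector (hence the appeal of the Cray XMT's word-level locking).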