A Synergetic Approach to Throughput Computing on x86-Based Multicore Desktops

Authors:
Chi-Keung Luk;Ryan Newton;William Hasenplaugh;Mark Hampton;Geoff Lowney
Affiliations:
Intel;Intel;Intel;Intel;Intel
Venue:
IEEE Software
Year:
2011

Citing 0
Cited 4

An Intel Cilk plus based task tree executor architecture

SEPADS'12/EDUCATION'12 Proceedings of the 11th WSEAS international conference on Software Engineering, Parallel and Distributed Systems, and proceedings of the 9th WSEAS international conference on Engineering Education
Can traditional programming bridge the Ninja performance gap for parallel computing applications?

Proceedings of the 39th Annual International Symposium on Computer Architecture
High throughput software for direct numerical simulations of compressible two-phase flows

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Parallel rendering of human-computer interaction industrial applications on multi-/many-core platforms

HCI'13 Proceedings of the 15th international conference on Human-Computer Interaction: human-centred design approaches, methods, tools, and environments - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

In the era of multicores, many applications that require substantial computing power and data crunching can now run on desktop PCs. However, to achieve the best possible performance, developers must write applications in a way that exploits both parallelism and cache locality. This article proposes one such approach for x86-based architectures that uses cache-oblivious techniques to divide a large problem into smaller subproblems, which are mapped to different cores or threads. The authors then use the compiler to exploit SIMD parallelism within each subproblem. Finally, they use autotuning to pick the best parameter values throughout the optimization process. The authors have implemented this approach with the Intel compiler and the newly developed Intel Software Autotuning Tool. Experimental results collected on a dual-socket quad-core Nehalem show that the approach achieves an average speed up of almost 20x over the best serial cases for an important set of computational kernels.