Accelerating sequential programs on commodity multi-core processors
Journal of Parallel and Distributed Computing
With the prevalence of chip multiprocessors (CMPs) in server and client computers, it has become important to use multiple cores to speed up existing sequential programs. Decoupled Software Pipelining (DSWP) is a recently proposed technique that extracts non-speculative threads from sequential programs for higher performance. However, this technique is not effective on commodity CMP architectures, because the inter-thread communication and synchronization overhead often offsets the benefit of parallelization. To reduce this overhead without modifying the CMP architecture, this paper presents clustered DSWP (CDSWP), an extension of DSWP. By communicating a set of dependent data items instead of a single dependent data item, this technique transforms a sequential program into a clustered thread pipeline. Here "clustered" means that several dependent data items are grouped together into one communication unit. The advantage of this technique is that it can eliminate false sharing and reduce the average cache latency, which greatly reduces the overhead. In preliminary experiments on several commodity CMP architectures, we achieved loop speedups ranging from 16% to 58% on some SPEC2000 benchmark programs.
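The clustering idea can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes a loop whose body splits into two dependent halves; the names `stage1`, `stage2`, `clustered_pipeline`, and `cluster_size` are hypothetical and chosen only for illustration. The first thread runs the first half of each iteration and forwards its results to the second thread one cluster at a time, so the threads synchronize once per cluster rather than once per item. Python threads show only the structure of the transformation; they do not reproduce the cache and false-sharing effects that the technique targets on real CMP hardware.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage1(x):
    # Hypothetical first half of the loop body.
    return x * x


def stage2(y):
    # Hypothetical second half of the loop body; depends on stage1's result.
    return y + 1


def sequential(data):
    # The original sequential loop, kept as a reference.
    total = 0
    for x in data:
        total += stage2(stage1(x))
    return total


def clustered_pipeline(data, cluster_size=64):
    # Two-stage thread pipeline that communicates clusters of dependent values.
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")

    channel = queue.Queue(maxsize=8)  # bounded, so the producer cannot run far ahead
    result = []

    def producer():
        cluster = []
        for x in data:
            cluster.append(stage1(x))
            if len(cluster) == cluster_size:
                channel.put(cluster)  # one synchronization per cluster
                cluster = []
        if cluster:
            channel.put(cluster)  # flush the final partial cluster
        channel.put(_DONE)

    def consumer():
        total = 0
        while True:
            cluster = channel.get()
            if cluster is _DONE:
                break
            for y in cluster:
                total += stage2(y)
        result.append(total)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

In this sketch, setting `cluster_size=1` corresponds to forwarding one dependent value per communication, as in a plain two-thread pipeline, while larger values amortize the synchronization cost over more loop iterations.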