PARSEC is a reference application suite used in industry and academia to assess new Chip Multiprocessor (CMP) designs. No investigation to date has profiled PARSEC on real hardware to better understand its scaling properties and bottlenecks. This understanding is crucial in guiding future CMP designs for these kinds of emerging workloads. We use hardware performance counters, taking a systems-level approach and varying common architectural parameters: the number of out-of-order cores, memory hierarchy configurations, the number of simultaneous threads per core, the number of memory channels, and processor frequencies. We find these programs to be largely compute-bound, and thus limited by the number of cores, micro-architectural resources, and cache-to-cache transfers rather than by off-chip memory or system bus bandwidth. Half the suite fails to scale linearly with an increasing number of threads, and some applications saturate performance at a small number of threads on all platforms tested. Exploiting thread-level parallelism (TLP) delivers greater payoffs than exploiting instruction-level parallelism (ILP). To reduce power and improve performance, we recommend increasing the number of arithmetic units per core, increasing support for TLP, and reducing support for ILP.
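As a concrete illustration of the counter-based methodology the abstract describes, below is a minimal sketch in C that reads two hardware counters (retired instructions and CPU cycles) around a region of interest via the Linux perf_event_open syscall. The harness, the stand-in loop, and the event choices are assumptions for illustration only, not the authors' actual instrumentation; the study itself measured whole PARSEC applications.

/* Minimal sketch (assumed harness): count retired instructions and CPU
 * cycles around a region of interest using Linux perf events. Requires a
 * Linux kernel with perf events enabled. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Open one generic hardware counter (e.g. PERF_COUNT_HW_INSTRUCTIONS)
 * for the calling thread, counting user-space events only. */
static int open_counter(uint64_t config) {
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = config;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;
    int fd = (int)syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); exit(1); }
    return fd;
}

int main(void) {
    int insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);

    ioctl(insns,  PERF_EVENT_IOC_RESET,  0);
    ioctl(cycles, PERF_EVENT_IOC_RESET,  0);
    ioctl(insns,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);

    /* Region of interest: a hypothetical stand-in for a PARSEC kernel. */
    volatile double acc = 0.0;
    for (long i = 0; i < 50 * 1000 * 1000; i++)
        acc += (double)i * 1e-9;

    ioctl(insns,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t n_insns = 0, n_cycles = 0;
    if (read(insns,  &n_insns,  sizeof(n_insns))  != sizeof(n_insns) ||
        read(cycles, &n_cycles, sizeof(n_cycles)) != sizeof(n_cycles)) {
        perror("read");
        exit(1);
    }

    /* IPC close to the core's issue width suggests a compute-bound region. */
    printf("instructions=%llu cycles=%llu IPC=%.2f\n",
           (unsigned long long)n_insns, (unsigned long long)n_cycles,
           n_cycles ? (double)n_insns / (double)n_cycles : 0.0);
    return 0;
}

Comparing the measured instructions-per-cycle against a core's issue width is one simple way to judge whether a region is compute-bound rather than memory-bound, which is the kind of conclusion the study draws across the suite.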