PARSEC is a reference application suite used in industry and academia to assess new Chip Multiprocessor (CMP) designs. No investigation to date has profiled PARSEC on real hardware to better understand its scaling properties and bottlenecks. This understanding is crucial in guiding future CMP designs for these kinds of emerging workloads. We use hardware performance counters, taking a systems-level approach and varying common architectural parameters: the number of out-of-order cores, memory hierarchy configurations, the number of simultaneous threads per core, the number of memory channels, and processor frequencies. We find these programs to be largely compute-bound, and thus limited by the number of cores, micro-architectural resources, and cache-to-cache transfers rather than by off-chip memory or system bus bandwidth. Half the suite fails to scale linearly with an increasing number of threads, and some applications saturate performance at a small number of threads on all platforms tested. Exploiting thread-level parallelism (TLP) delivers greater payoffs than exploiting instruction-level parallelism (ILP). To reduce power and improve performance, we recommend increasing the number of arithmetic units per core, increasing support for TLP, and reducing support for ILP.
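As a concrete illustration of the counter-based methodology the abstract describes, below is a minimal sketch in C that reads two hardware counters (retired instructions and CPU cycles) around a region of interest via the Linux perf_event_open syscall. The harness, the stand-in loop, and the event choices are assumptions for illustration only, not the authors' actual instrumentation; the study itself measured whole PARSEC applications.

/* Minimal sketch (assumed harness): count retired instructions and CPU
 * cycles around a region of interest using Linux perf events. Requires a
 * Linux kernel with perf events enabled. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Open one generic hardware counter (e.g. PERF_COUNT_HW_INSTRUCTIONS)
 * for the calling thread, counting user-space events only. */
static int open_counter(uint64_t config) {
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = config;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;
    int fd = (int)syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0);
    if (fd == -1) { perror("perf_event_open"); exit(1); }
    return fd;
}

int main(void) {
    int insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);

    ioctl(insns,  PERF_EVENT_IOC_RESET,  0);
    ioctl(cycles, PERF_EVENT_IOC_RESET,  0);
    ioctl(insns,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);

    /* Region of interest: a hypothetical stand-in for a PARSEC kernel. */
    volatile double acc = 0.0;
    for (long i = 0; i < 50 * 1000 * 1000; i++)
        acc += (double)i * 1e-9;

    ioctl(insns,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t n_insns = 0, n_cycles = 0;
    if (read(insns,  &n_insns,  sizeof(n_insns))  != sizeof(n_insns) ||
        read(cycles, &n_cycles, sizeof(n_cycles)) != sizeof(n_cycles)) {
        perror("read");
        exit(1);
    }

    /* IPC close to the core's issue width suggests a compute-bound region. */
    printf("instructions=%llu cycles=%llu IPC=%.2f\n",
           (unsigned long long)n_insns, (unsigned long long)n_cycles,
           n_cycles ? (double)n_insns / (double)n_cycles : 0.0);
    return 0;
}

Comparing the measured instructions-per-cycle against a core's issue width is one simple way to judge whether a region is compute-bound rather than memory-bound, which is the kind of conclusion the study draws across the suite.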