Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems

Authors:
Jatin Chhugani;Changkyu Kim;Hemant Shukla;Jongsoo Park;Pradeep Dubey;John Shalf;Horst D. Simon
Affiliations:
Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation;Lawrence Berkeley National Laboratory;Parallel Computing Lab, Intel Corporation;Parallel Computing Lab, Intel Corporation;Lawrence Berkeley National Laboratory;Lawrence Berkeley National Laboratory
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

References (19):
Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM)
Multidimensional binary search trees used for associative searching. Communications of the ACM
Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. IEEE Transactions on Parallel and Distributed Systems
Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection. IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Carbon: architectural support for fine-grained parallelism on chip multiprocessors. Proceedings of the 34th annual international symposium on Computer architecture
Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors. Proceedings of the 34th annual international symposium on Computer architecture
Larrabee: a many-core x86 architecture for visual computing. ACM SIGGRAPH 2008 papers
Atomic Vector Operations on Chip Multiprocessors. ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Accelerating cosmological data analysis with graphics processors. Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Implementation of the two-point angular correlation function on a high-performance reconfigurable computer. Scientific Programming
FAWN: a fast array of wimpy nodes. Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Scalable work stealing. Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs. Proceedings of the VLDB Endowment
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs. Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Parallel SAH k-D tree construction. Proceedings of the Conference on High Performance Graphics
Data Challenges for Next-generation Radio Telescopes. E-SCIENCEW '10 Proceedings of the 2010 Sixth IEEE International Conference on e-Science Workshops
Designing and dynamically load balancing hybrid LU for multi/many-core. Computer Science - Research and Development
Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Cited by (3):
Performance evaluation of Intel® transactional synchronization extensions for high-performance computing. SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Load balance for semantic cluster-based data integration systems. Proceedings of the 17th International Database Engineering & Applications Symposium
Automatic vectorization of tree traversals. PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Abstract

The two-point correlation function (TPCF) is widely used in astronomy to characterize the distribution of matter and energy in the Universe and to help derive the physics that traces back to its creation. However, computing it is prohibitively slow for current-sized datasets, and it will remain a critical bottleneck as datasets grow to billions of particles and beyond, which makes TPCF a compelling benchmark application for future exascale architectures. State-of-the-art TPCF implementations do not map well to the underlying SIMD hardware and suffer from load imbalance at large core counts. In this paper, we present a novel SIMD-friendly histogram update algorithm that exploits the spatial locality of histogram updates to achieve near-linear SIMD scaling. We also present a load-balancing scheme that combines a domain-specific initial static division of work with dynamic task migration to balance computation effectively across nodes. Using the Zin supercomputer at Lawrence Livermore National Laboratory (25,600 cores of Intel® Xeon® E5-2670, each with 256-bit SIMD), we achieve 90% parallel efficiency and 96% SIMD efficiency, and perform the TPCF computation on a 1.7-billion-particle dataset in 5.3 hours (at least 35x faster than previous approaches). In terms of performance per cost (measured in flops/$), we achieve at least an order of magnitude (11.1x) higher flops/$ than the best known results [1]. Consequently, we now have line of sight to the processing power needed for correlation computation on telescope datasets of a billion or more particles.
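
To make the cost of the computation concrete, the following is a minimal sketch of the naive pair-distance histogram that underlies a TPCF estimate. It is not the SIMD-friendly algorithm or the load-balancing scheme presented in the paper, and the names `pair_distance_histogram`, `total_pairs`, `points`, and `edges` are hypothetical, chosen only for illustration. It assumes 3D Cartesian coordinates and half-open distance bins.

```python
import math
from bisect import bisect_right


def pair_distance_histogram(points, bin_edges):
    """Count unordered point pairs per distance bin (naive O(N^2) baseline).

    points:    list of (x, y, z) tuples (assumed Cartesian coordinates)
    bin_edges: ascending boundaries; bin i covers [bin_edges[i], bin_edges[i+1])
    """
    counts = [0] * (len(bin_edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            r = math.dist(points[i], points[j])
            # Locate the bin containing r; pairs outside the edges are skipped.
            b = bisect_right(bin_edges, r) - 1
            if 0 <= b < len(counts):
                counts[b] += 1  # the histogram update step
    return counts


def total_pairs(n):
    """Number of unordered pairs among n particles."""
    return n * (n - 1) // 2


# Toy data, made up for illustration only.
points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
edges = [0.0, 1.5, 2.5, 4.0]
hist = pair_distance_histogram(points, edges)
print(hist)

# Pair count if the dataset held exactly 1.7 billion particles (about 1.4e18).
print(total_pairs(1_700_000_000))
```

The quadratic growth in pair count is why a brute-force loop like this one is impractical at the billion-particle scale, and why both efficient histogram updates and balanced work distribution matter.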