Parallel Computing

Volume 33, Issue 3, April 2007, Pages 159-173

Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems

https://doi.org/10.1016/j.parco.2007.01.001

Abstract

We discuss the performance of direct summation codes used in the simulation of astrophysical stellar systems on highly distributed architectures. These codes compute the gravitational interaction among stars in an exact way and have an O(N²) scaling with the number of particles. They can be applied to a variety of astrophysical problems, such as the evolution of star clusters, the dynamics of black holes, the formation of planetary systems, and cosmological simulations. The simulation of realistic star clusters with sufficiently high accuracy cannot be performed on a single workstation but may be possible on parallel computers or grids. We have implemented two parallel schemes for a direct N-body code and we study their performance on general purpose parallel computers and large computational grids. We present the results of timing analyses conducted on the different architectures and compare them with the predictions of theoretical models. We conclude that the simulation of star clusters with up to a million particles will be possible on large distributed computers in the next decade. Simulating entire galaxies, however, will in addition require new hybrid methods to speed up the calculation.

Introduction

Numerical methods for solving the classical astrophysical N-body problem have evolved in two main directions in recent years. On the one hand, approximate models like Fokker–Planck models [1], [2], gaseous models [3], and Monte Carlo models [4], [5], [6] have been applied to the simulation of globular clusters and galactic nuclei. These models make it possible to follow the global evolution of large systems over their lifetime, but only with moderate accuracy and resolution. Tree codes, which belong to the same family of approximate methods, group particles according to their spatial distribution and use a truncated multipole expansion to evaluate the force exerted by each group as a whole, instead of evaluating the contributions of the individual particles directly. On the other hand, direct summation methods have been developed to accurately model the dynamics and evolution of collisional systems like dense star clusters. These codes compute all the inter-particle forces and are therefore the most accurate. They are also more general, as they can be used to simulate both low and high density regions. Their high accuracy is necessary when studying physical phenomena like mass segregation, core collapse, dynamical encounters, the formation of binaries or higher order systems, the ejection of high velocity stars, and runaway collisions. Direct methods have an O(N²) scaling with the number of stars and are therefore limited to smaller particle numbers than the approximate methods. For this reason, they have so far only been applied to the simulation of moderate size star clusters.

The simulation of globular clusters containing one million stars is still a challenge from the computational point of view, but it is an important astrophysical problem. It will provide insight into the complex dynamics of these collisional systems and into the microscopic and macroscopic physical phenomena at work, and it will help to establish whether these clusters harbor a central black hole, whose presence is still controversial. The simulation of the evolution of these systems under the combined effects of gravity, stellar evolution, and, eventually, hydrodynamics will make it possible to study the stellar population and to compare the results with observations.

The need to simulate ever larger systems and to include a realistic physical treatment of the stars calls for a speedup of the most demanding part of the calculation: the gravitational dynamics. A significant improvement in the performance of direct codes can be obtained either with special-purpose computers like the GRAPE hardware [7] or with general-purpose distributed systems [8]. In this work, we focus on the performance of direct N-body codes on distributed systems. Two main classes of algorithms can be used to parallelize direct summation N-body codes: replicated-data and distributed-data algorithms. We present a performance analysis of the two algorithms on different architectures: a Beowulf cluster, three supercomputers, and two computational grids. We provide theoretical predictions and actual measurements of the execution time on the different platforms, which allows the best-performing scheme to be chosen for a given architecture.

Section snippets

Numerical method

In the direct method the gravitational force acting on a particle is computed by summing the contributions from all the other particles according to Newton's law:

\mathbf{F}_i = m_i \mathbf{a}_i = -G\, m_i \sum_{j=1,\, j \neq i}^{N} \frac{m_j (\mathbf{r}_i - \mathbf{r}_j)}{|\mathbf{r}_i - \mathbf{r}_j|^3}.

The total number of force calculations per time-step is N(N − 1)/2. Because the force acting on a particle usually varies smoothly with time, the integration of the particle trajectories makes use of force polynomial fitting. In this work we implement the fourth-order Hermite integrator with
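As an illustration of the scaling discussed above, the following sketch shows a serial direct-summation kernel in C; the function and variable names are ours and do not come from the code analyzed in this paper. The double loop over all pairs is what produces the O(N²) cost, and exploiting the symmetry F_ij = −F_ji would halve the number of pairwise evaluations.

#include <math.h>

/* Illustrative serial direct-summation kernel (not the paper's code).
 * Accumulates the acceleration of every particle from Newton's law.
 * eps is an optional softening length; it can be set to zero for a
 * purely collisional (unsoftened) calculation. */
void compute_accelerations(int n, const double m[], const double r[][3],
                           double a[][3], double G, double eps)
{
    for (int i = 0; i < n; i++) {
        a[i][0] = a[i][1] = a[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = r[i][0] - r[j][0];
            double dy = r[i][1] - r[j][1];
            double dz = r[i][2] - r[j][2];
            double d2 = dx*dx + dy*dy + dz*dz + eps*eps;
            double inv_d3 = 1.0 / (d2 * sqrt(d2));
            /* contribution of particle j to the acceleration of particle i */
            a[i][0] -= G * m[j] * dx * inv_d3;
            a[i][1] -= G * m[j] * dy * inv_d3;
            a[i][2] -= G * m[j] * dz * inv_d3;
        }
    }
}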

Parallel schemes for direct N-body codes

The parallelization of a direct N-body code can proceed in different ways depending on the desired intrinsic degree of parallelism and on the communication-to-computation ratio. We implemented two different parallelization algorithms, the copy algorithm and the ring algorithm, for a Hermite scheme with block time-steps using the standard MPI library. If we denote by N the total number of particles in the system and by p the number of available processors, both algorithms have a
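To make the distinction between the two schemes concrete, the sketch below outlines one possible MPI implementation of the copy (replicated-data) scheme: every process keeps a full copy of the particle data, computes the forces only for its own contiguous block of particles, and the partial results are then exchanged with a collective operation. In the ring (distributed-data) scheme, by contrast, each process owns only N/p particles and partial forces are accumulated while the particle data travel around a ring of processes. All names, the data layout, and the assumption that N is divisible by p are ours; this is not the code used in the paper.

#include <math.h>
#include <mpi.h>

/* Sketch of the copy (replicated-data) scheme for one force evaluation.
 * Every rank holds all n particles; it computes accelerations only for
 * its own slice and then gathers the slices from all other ranks.      */
void copy_scheme_forces(int n, const double m[], const double r[][3],
                        double a[][3], double G)
{
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int chunk = n / nproc;      /* assumes n is divisible by nproc      */
    int first = rank * chunk;   /* first particle owned by this rank    */

    for (int i = first; i < first + chunk; i++) {
        a[i][0] = a[i][1] = a[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = r[i][0] - r[j][0];
            double dy = r[i][1] - r[j][1];
            double dz = r[i][2] - r[j][2];
            double d2 = dx*dx + dy*dy + dz*dz;
            double inv_d3 = 1.0 / (d2 * sqrt(d2));
            a[i][0] -= G * m[j] * dx * inv_d3;
            a[i][1] -= G * m[j] * dy * inv_d3;
            a[i][2] -= G * m[j] * dz * inv_d3;
        }
    }

    /* collect on every rank the accelerations computed by all ranks */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  a, 3 * chunk, MPI_DOUBLE, MPI_COMM_WORLD);
}

In a block time-step code only the particles due for an update at the current time are integrated, so in practice the loops above run over the active subset of particles rather than over all n of them.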

Performance analysis of different parallel N-body schemes

The performance of a parallel code depends not only on the properties of the code itself, like the parallelization scheme and the intrinsic degree of parallelism, but also on the properties of the parallel computer used for the computation. The main factors determining the overall performance are the calculation speed of each node, the bandwidth of the inter-processor communication, and the start-up time (latency). A theoretical estimate of the total time needed for one full force
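As an illustration of how such an estimate can be structured (the expression below is a generic sketch of our own, not necessarily the exact model used in the paper), the time for one full force calculation on p processors can be written schematically as

T(N, p) \approx \frac{N^{2}\,\tau_{\mathrm{f}}}{p} + (p-1)\left(\tau_{\mathrm{lat}} + \frac{N w}{p\,\beta}\right),

where τ_f is the time per pairwise force evaluation, τ_lat the message start-up time (latency), β the network bandwidth, and w the number of bytes exchanged per particle. The calculation term shrinks as 1/p while the communication term grows with the number of processors, so for a given N there is an optimal processor count beyond which adding nodes no longer pays off.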

Performance on the BlueGene/L supercomputer

The BlueGene/L supercomputer is a novel machine developed by IBM to provide a very large number of computing nodes with a modest power requirement. Each node consists of two processors, based on a special variant of IBM's Power family, with a clock speed of 700 MHz. To obtain good performance at this relatively low frequency, each node processes multiple instructions per clock cycle. The nodes are interconnected through multiple complementary high-speed, low-latency networks, including a 3D torus network

Performance analysis on the GRID

Grid technology is rapidly becoming a major component of computational science. It offers a unified means of access to different and distant computational resources, including the possibility of securely accessing highly distributed resources that scientists do not necessarily own or have an account on. Connectivity between distant locations, interoperability between different kinds of systems and resources, and high levels of computational performance are some of the most promising characteristics of

Discussion

A numerical challenge for the astronomical community in the coming years will be the simulation of star clusters containing one million stars. We have shown that direct N-body codes can be applied efficiently to the simulation of large stellar systems and that their performance can be predicted with simple models.

In this section, we apply the performance model introduced in Section 4.1 for the copy algorithm to predict the total execution time for the simulation of a system with N = 10⁶. In Table 3
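As a purely illustrative usage of the generic model sketched in Section 4 above (all parameter values below are placeholders, not the calibrated values of Table 3), a few lines of C suffice to evaluate the predicted calculation and communication times for N = 10⁶ as a function of the number of processors:

#include <stdio.h>

/* Evaluates the generic performance model for N = 1e6 particles.
 * Every constant below is an assumed placeholder, chosen only to show
 * how the model is used; these are not the measured parameters of the
 * machines discussed in the paper.                                    */
int main(void)
{
    const double N     = 1.0e6;    /* number of particles                  */
    const double tau_f = 1.0e-8;   /* assumed time per pairwise force [s]  */
    const double tau_l = 1.0e-5;   /* assumed message latency [s]          */
    const double beta  = 1.0e8;    /* assumed network bandwidth [byte/s]   */
    const double w     = 64.0;     /* assumed bytes exchanged per particle */

    for (int p = 1; p <= 4096; p *= 4) {
        double t_calc = N * N * tau_f / p;
        double t_comm = (p - 1) * (tau_l + N * w / (p * beta));
        printf("p = %4d   T_calc = %12.1f s   T_comm = %8.2f s\n",
               p, t_calc, t_comm);
    }
    return 0;
}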

Conclusions

We have implemented two parallelization schemes for direct N-body codes with block time-steps, the copy and the ring algorithm, and compared their performance on different parallel computers. In the case of clusters or supercomputers, the execution times for the two schemes are comparable, except for very small systems, where the communication time dominates over the calculation time and the copy algorithm therefore performs slightly better. The ring algorithm is well suited for the integration of very

Acknowledgments

We thank Jun Makino and Piet Hut for support through the ACS project (http://www.artcompsci.org). We also thank Gyan Bhanot, Bob Walkup, and the IBM T.J. Watson Research Center for performing test runs on the BlueGene supercomputer, the DAS-2 and CrossGrid projects for their technical help with the grid tests, and Rainer Spurzem and David Merritt for useful discussions on parallel computing. This work was supported by the Netherlands Organization for Scientific Research (NWO Grant #635.000.001),

