Parallel Computing

Volume 33, Issue 3, April 2007, Pages 159-173

Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems

https://doi.org/10.1016/j.parco.2007.01.001

Abstract

We discuss the performance of direct summation codes used in the simulation of astrophysical stellar systems on highly distributed architectures. These codes compute the gravitational interaction among stars in an exact way and have an O(N²) scaling with the number of particles. They can be applied to a variety of astrophysical problems, such as the evolution of star clusters, the dynamics of black holes, the formation of planetary systems, and cosmological simulations. The simulation of realistic star clusters with sufficiently high accuracy cannot be performed on a single workstation but may be possible on parallel computers or grids. We have implemented two parallel schemes for a direct N-body code and we study their performance on general purpose parallel computers and large computational grids. We present the results of timing analyses conducted on the different architectures and compare them with the predictions of theoretical models. We conclude that the simulation of star clusters with up to a million particles will be possible on large distributed computers in the next decade. Simulating entire galaxies, however, will in addition require new hybrid methods to speed up the calculation.

Introduction

Numerical methods for solving the classical astrophysical N-body problem have evolved in two main directions in recent years. On the one hand, approximate models like Fokker–Planck models [1], [2], gaseous models [3], and Monte Carlo models [4], [5], [6] have been applied to the simulation of globular clusters and galactic nuclei. These models make it possible to follow the global evolution of large systems over their lifetime, but only with moderate accuracy and resolution. Tree codes, which belong to the same family of approximate methods, group particles according to their spatial distribution and use a truncated multipole expansion to evaluate the force exerted by each group as a whole, instead of evaluating the contributions of the individual particles directly. On the other hand, direct summation methods have been developed to accurately model the dynamics and evolution of collisional systems like dense star clusters. These codes compute all the inter-particle forces and are therefore the most accurate. They are also more general, as they can be used to simulate both low and high density regions. Their high accuracy is necessary when studying physical phenomena like mass segregation, core collapse, dynamical encounters, the formation of binaries or higher order systems, the ejection of high velocity stars, and runaway collisions. Direct methods have an O(N²) scaling with the number of stars and are therefore limited to smaller particle numbers than the approximate methods. For this reason, they have so far only been applied to the simulation of moderate size star clusters.

The simulation of globular clusters containing one million stars is still a challenge from the computational point of view, but it is an important astrophysical problem. It will provide insight into the complex dynamics of these collisional systems and into the microscopic and macroscopic physical phenomena at work, and it will help to establish whether these clusters harbor a central black hole, whose presence is still controversial. The simulation of the evolution of these systems under the combined effects of gravity, stellar evolution, and, eventually, hydrodynamics will make it possible to study the stellar population and to compare the results with observations.

The need to simulate ever larger systems and to include a realistic physical treatment of the stars calls for a speedup of the most demanding part of the calculation: the gravitational dynamics. A significant improvement in the performance of direct codes can be obtained either with special-purpose computers like the GRAPE hardware [7] or with general-purpose distributed systems [8]. In this work, we focus on the performance of direct N-body codes on distributed systems. Two main classes of algorithms can be used to parallelize direct summation N-body codes: replicated-data and distributed-data algorithms. We present a performance analysis of the two algorithms on different architectures: a Beowulf cluster, three supercomputers, and two computational grids. We provide theoretical predictions and actual measurements of the execution time on the different platforms, which allows the best-performing scheme to be chosen for a given architecture.

Section snippets

Numerical method

In the direct method the gravitational force acting on a particle is computed by summing the contributions from all the other particles according to Newton's law:

\mathbf{F}_i = m_i \mathbf{a}_i = -G\, m_i \sum_{j=1,\, j \neq i}^{N} \frac{m_j (\mathbf{r}_i - \mathbf{r}_j)}{|\mathbf{r}_i - \mathbf{r}_j|^3}.

The total number of force calculations per time-step is N(N − 1)/2. Because the force acting on a particle usually varies smoothly with time, the integration of the particle trajectories makes use of force polynomial fitting. In this work we implement the fourth-order Hermite integrator with
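As an illustration of the scaling discussed above, the following sketch shows a serial direct-summation kernel in C; the function and variable names are ours and do not come from the code analyzed in this paper. The double loop over all pairs is what produces the O(N²) cost, and exploiting the symmetry F_ij = −F_ji would halve the number of pairwise evaluations.

#include <math.h>

/* Illustrative serial direct-summation kernel (not the paper's code).
 * Accumulates the acceleration of every particle from Newton's law.
 * eps is an optional softening length; it can be set to zero for a
 * purely collisional (unsoftened) calculation. */
void compute_accelerations(int n, const double m[], const double r[][3],
                           double a[][3], double G, double eps)
{
    for (int i = 0; i < n; i++) {
        a[i][0] = a[i][1] = a[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = r[i][0] - r[j][0];
            double dy = r[i][1] - r[j][1];
            double dz = r[i][2] - r[j][2];
            double d2 = dx*dx + dy*dy + dz*dz + eps*eps;
            double inv_d3 = 1.0 / (d2 * sqrt(d2));
            /* contribution of particle j to the acceleration of particle i */
            a[i][0] -= G * m[j] * dx * inv_d3;
            a[i][1] -= G * m[j] * dy * inv_d3;
            a[i][2] -= G * m[j] * dz * inv_d3;
        }
    }
}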

Parallel schemes for direct N-body codes

The parallelization of a direct N-body code can proceed in different ways depending on the desired intrinsic degree of parallelism and on the communication-to-computation ratio. We implemented two different parallelization algorithms, the copy algorithm and the ring algorithm, for a Hermite scheme with block time-steps using the standard MPI library. If we denote by N the total number of particles in the system and by p the number of available processors, both algorithms have a
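To make the distinction between the two schemes concrete, the sketch below outlines one possible MPI implementation of the copy (replicated-data) scheme: every process keeps a full copy of the particle data, computes the forces only for its own contiguous block of particles, and the partial results are then exchanged with a collective operation. In the ring (distributed-data) scheme, by contrast, each process owns only N/p particles and partial forces are accumulated while the particle data travel around a ring of processes. All names, the data layout, and the assumption that N is divisible by p are ours; this is not the code used in the paper.

#include <math.h>
#include <mpi.h>

/* Sketch of the copy (replicated-data) scheme for one force evaluation.
 * Every rank holds all n particles; it computes accelerations only for
 * its own slice and then gathers the slices from all other ranks.      */
void copy_scheme_forces(int n, const double m[], const double r[][3],
                        double a[][3], double G)
{
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int chunk = n / nproc;      /* assumes n is divisible by nproc      */
    int first = rank * chunk;   /* first particle owned by this rank    */

    for (int i = first; i < first + chunk; i++) {
        a[i][0] = a[i][1] = a[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = r[i][0] - r[j][0];
            double dy = r[i][1] - r[j][1];
            double dz = r[i][2] - r[j][2];
            double d2 = dx*dx + dy*dy + dz*dz;
            double inv_d3 = 1.0 / (d2 * sqrt(d2));
            a[i][0] -= G * m[j] * dx * inv_d3;
            a[i][1] -= G * m[j] * dy * inv_d3;
            a[i][2] -= G * m[j] * dz * inv_d3;
        }
    }

    /* collect on every rank the accelerations computed by all ranks */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  a, 3 * chunk, MPI_DOUBLE, MPI_COMM_WORLD);
}

In a block time-step code only the particles due for an update at the current time are integrated, so in practice the loops above run over the active subset of particles rather than over all n of them.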

Performance analysis of different parallel N-body schemes

The performance of a parallel code depends not only on the properties of the code itself, like the parallelization scheme and the intrinsic degree of parallelism, but also on the properties of the parallel computer used for the computation. The main factors determining the overall performance are the calculation speed of each node, the bandwidth of the inter-processor communication, and the start-up time (latency). A theoretical estimate of the total time needed for one full force
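As an illustration of how such an estimate can be structured (the expression below is a generic sketch of our own, not necessarily the exact model used in the paper), the time for one full force calculation on p processors can be written schematically as

T(N, p) \approx \frac{N^{2}\,\tau_{\mathrm{f}}}{p} + (p-1)\left(\tau_{\mathrm{lat}} + \frac{N w}{p\,\beta}\right),

where τ_f is the time per pairwise force evaluation, τ_lat the message start-up time (latency), β the network bandwidth, and w the number of bytes exchanged per particle. The calculation term shrinks as 1/p while the communication term grows with the number of processors, so for a given N there is an optimal processor count beyond which adding nodes no longer pays off.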

Performance on the BlueGene/L supercomputer

The BlueGene/L supercomputer is a novel machine developed by IBM to provide a very large number of computing nodes with a modest power requirement. Each node consists of two processors, based on a special variant of IBM's Power family, with a clock speed of 700 MHz. To obtain good performance at this relatively low frequency, each node processes multiple instructions per clock cycle. The nodes are interconnected through multiple complementary high-speed, low-latency networks, including a 3D torus network

Performance analysis on the GRID

Grid technology is rapidly becoming a major component of computational science. It offers a unified means of access to different and distant computational resources, including the possibility of securely accessing highly distributed resources that scientists do not necessarily own or have an account on. Connectivity between distant locations, interoperability between different kinds of systems and resources, and high levels of computational performance are some of the most promising characteristics of

Discussion

A numerical challenge for the astronomical community in the coming years will be the simulation of star clusters containing one million stars. We have shown that direct N-body codes can be applied efficiently to the simulation of large stellar systems and that their performance can be predicted with simple models.

In this section, we apply the performance model introduced in Section 4.1 for the copy algorithm to predict the total execution time for the simulation of a system with N = 10⁶. In Table 3
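As a purely illustrative usage of the generic model sketched in Section 4 above (all parameter values below are placeholders, not the calibrated values of Table 3), a few lines of C suffice to evaluate the predicted calculation and communication times for N = 10⁶ as a function of the number of processors:

#include <stdio.h>

/* Evaluates the generic performance model for N = 1e6 particles.
 * Every constant below is an assumed placeholder, chosen only to show
 * how the model is used; these are not the measured parameters of the
 * machines discussed in the paper.                                    */
int main(void)
{
    const double N     = 1.0e6;    /* number of particles                  */
    const double tau_f = 1.0e-8;   /* assumed time per pairwise force [s]  */
    const double tau_l = 1.0e-5;   /* assumed message latency [s]          */
    const double beta  = 1.0e8;    /* assumed network bandwidth [byte/s]   */
    const double w     = 64.0;     /* assumed bytes exchanged per particle */

    for (int p = 1; p <= 4096; p *= 4) {
        double t_calc = N * N * tau_f / p;
        double t_comm = (p - 1) * (tau_l + N * w / (p * beta));
        printf("p = %4d   T_calc = %12.1f s   T_comm = %8.2f s\n",
               p, t_calc, t_comm);
    }
    return 0;
}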

Conclusions

We have implemented two parallelization schemes for direct N-body codes with block time-steps, the copy and the ring algorithm, and compared their performance on different parallel computers. In the case of clusters or supercomputers, the execution times for the two schemes are comparable, except for very small systems, where the communication time dominates over the calculation time and the copy algorithm therefore performs slightly better. The ring algorithm is well suited for the integration of very

Acknowledgments

We thank Jun Makino and Piet Hut for support through the ACS project (http://www.artcompsci.org). We also thank Gyan Bhanot, Bob Walkup, and the IBM T.J. Watson Research Center for performing test runs on the BlueGene supercomputer, the DAS-2 and CrossGrid projects for their technical help with the grid tests, and Rainer Spurzem and David Merritt for useful discussions on parallel computing. This work was supported by the Netherlands Organization for Scientific Research (NWO Grant #635.000.001),

