Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems
Introduction
Numerical methods for solving the classical astrophysical N-body problem have evolved in two main directions in recent years. On the one hand, approximate models like Fokker–Planck models [1], [2], gaseous models [3], and Monte Carlo models [4], [5], [6] have been applied to the simulation of globular clusters and galactic nuclei. These models make it possible to follow the global evolution of large systems over their lifetime, but at the expense of moderate accuracy and resolution. The basic approach is to group particles according to their spatial distribution and to use a truncated multipole expansion to evaluate the force exerted by the whole group, instead of directly evaluating the contributions of the individual particles. On the other hand, direct summation methods have been developed to accurately model the dynamics and evolution of collisional systems like dense star clusters. These codes compute all the inter-particle forces and are therefore the most accurate. They are also more general, as they can be used to simulate both low- and high-density regions. Their high accuracy is necessary when studying physical phenomena like mass segregation, core collapse, dynamical encounters, the formation of binaries or higher-order systems, the ejection of high-velocity stars, and runaway collisions. Direct methods scale as O(N²) with the number of stars and are therefore limited to smaller particle numbers than approximate methods. For this reason, they have so far only been applied to the simulation of moderate-size star clusters. The simulation of globular clusters containing one million stars is still a challenge from the computational point of view, but it is an important astrophysical problem. It will provide insight into the complex dynamics of these collisional systems and into the underlying microscopic and macroscopic physical phenomena, and it will help to establish whether such clusters harbor a central black hole, a question that remains controversial.
The simulation of the evolution of these systems under the combined effects of gravity, stellar evolution, and, eventually, hydrodynamics will make it possible to study the stellar population and to compare the results with observations.
The need to simulate ever larger systems and to include a realistic physical treatment of the stars calls for a speedup of the most demanding part of the calculation: the gravitational dynamics. A significant improvement in the performance of direct codes can be obtained either by means of special-purpose computers like GRAPE hardware [7] or by means of general-purpose distributed systems [8]. In this work, we focus on the performance of direct N-body codes on distributed systems. Two main classes of algorithms can be used to parallelize direct summation N-body codes: replicated-data and distributed-data algorithms. We present a performance analysis of the two algorithms on different architectures: a Beowulf cluster, three supercomputers, and two computational grids. We provide theoretical predictions and actual measurements of the execution time on the different platforms, allowing the choice of the best performing scheme for a given architecture.
Section snippets
Numerical method
In the direct method, the gravitational force acting on a particle is computed by summing the contributions from all the other particles according to Newton's law,

  a_i = G Σ_{j ≠ i} m_j (r_j − r_i) / |r_j − r_i|³.

The number of pairwise force calculations is N(N − 1)/2. Since the force acting on a particle usually varies smoothly with time, the integration of the particle trajectory makes use of force polynomial fitting. In this work we implement the fourth-order Hermite integrator with
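The direct summation above can be sketched in a few lines. The following Python fragment is purely illustrative (the function and variable names are ours; the code analyzed in the paper uses the fourth-order Hermite scheme in a compiled MPI code):

```python
import numpy as np

def direct_acc(pos, mass, G=1.0):
    """Accelerations by direct summation: each of the N(N-1)/2 pairs
    is evaluated once and applied to both particles (Newton's third law)."""
    n = len(mass)
    acc = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            dr = pos[j] - pos[i]                  # separation vector r_j - r_i
            inv_r3 = np.dot(dr, dr) ** -1.5       # 1 / |r_ij|^3
            acc[i] += G * mass[j] * inv_r3 * dr   # pull of j on i
            acc[j] -= G * mass[i] * inv_r3 * dr   # equal and opposite pull of i on j
    return acc

# two bodies at unit separation: the lighter one feels the larger acceleration
a = direct_acc(np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
               np.array([1.0, 2.0]))
```

The double loop makes the O(N²) cost explicit: every pair is visited exactly once, which is the operation count the parallel schemes below must distribute over the processors.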
Parallel schemes for direct N-body codes
The parallelization of a direct N-body code can proceed in different ways depending on the desired intrinsic degree of parallelism and on the communication-to-computation ratio. We implemented two different parallelization algorithms, the copy algorithm and the ring algorithm, for a Hermite scheme with block time-steps using the standard MPI library package. If we denote by N the total number of particles in the system and by p the number of available processors, both algorithms have a
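The communication pattern of the ring scheme can be illustrated with a single-process sketch (an assumption-laden toy, not the paper's implementation: the MPI send/receive of each shift is replaced by a list rotation, and all names are ours):

```python
import numpy as np

def pairwise_acc(local_pos, other_pos, other_m, G=1.0):
    """Accelerations of the local particles due to one visiting block."""
    acc = np.zeros_like(local_pos)
    for i, ri in enumerate(local_pos):
        for rj, mj in zip(other_pos, other_m):
            dr = rj - ri
            r2 = np.dot(dr, dr)
            if r2 > 0.0:                  # skip the self-pair when the blocks coincide
                acc[i] += G * mj * dr / r2 ** 1.5
    return acc

def ring_forces(pos_chunks, mass_chunks, G=1.0):
    """Ring scheme with p 'processors': each holds N/p particles, and the
    particle blocks travel around the ring so that after p shifts every
    processor has accumulated the force from the whole system.  In the
    parallel code each shift is an MPI send/receive; here it is a rotation."""
    p = len(pos_chunks)
    acc = [np.zeros_like(c) for c in pos_chunks]
    trav_pos, trav_m = list(pos_chunks), list(mass_chunks)
    for _ in range(p):
        for rank in range(p):
            acc[rank] += pairwise_acc(pos_chunks[rank], trav_pos[rank], trav_m[rank], G)
        trav_pos = trav_pos[-1:] + trav_pos[:-1]   # shift blocks to the next processor
        trav_m = trav_m[-1:] + trav_m[:-1]
    return np.concatenate(acc)
```

After p shifts the result coincides with the full direct sum; each processor has sent and received p − 1 blocks of N/p particles, which is the origin of the scheme's communication cost.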
Performance analysis of different parallel N-body schemes
The performance of a parallel code depends not only on the properties of the code itself, like the parallelization scheme and the intrinsic degree of parallelism, but also on the properties of the parallel computer used for the computation. The main factors determining the overall performance are the calculation speed of each node, the bandwidth of the inter-processor communication, and the start-up time (latency). A theoretical estimate of the total time needed for one full force
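The form of such an estimate can be illustrated with a toy model (the functional form is the standard computation-plus-communication split used in parallel N-body cost models; the coefficients are illustrative assumptions, not the calibrated values of this paper):

```python
def step_time(N, p, t_pair=1e-8, t_lat=1e-5, t_word=1e-8):
    """Toy cost model for one full force calculation on p processors:
    an O(N^2)/p computation term plus a communication term made of a
    per-message latency and a per-particle bandwidth contribution."""
    t_calc = t_pair * N * N / p
    if p == 1:
        return t_calc                       # serial run: no communication at all
    t_comm = (p - 1) * t_lat + N * t_word   # assumed form; details depend on the scheme
    return t_calc + t_comm

# the model reproduces the generic trade-off: adding processors shrinks the
# computation term until the latency term starts to dominate
best_p = min(range(1, 257), key=lambda p: step_time(10_000, p))
```

Fitting the three coefficients to timing runs on a given machine is what allows the execution time, and the best-performing processor count, to be predicted for problem sizes that were not measured.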
Performance on the BlueGene/L supercomputer
The BlueGene/L supercomputer is a novel machine developed by IBM to provide a very high number of computing nodes with a modest power requirement. Each node consists of two processors, a special variant of IBM’s Power family, with a clock speed of 700 MHz. To obtain good performance at this relatively low frequency, each node processes multiple instructions per clock cycle. The nodes are interconnected through multiple complementary high-speed low-latency networks, including a 3D torus network
Performance analysis on the GRID
Grid technology is rapidly becoming a major component of computational science. It offers a unified means of access to different and distant computational resources, with the possibility of securely accessing highly distributed resources that scientists do not necessarily own or hold accounts on. Connectivity between distant locations, interoperability between different kinds of systems and resources, and high levels of computational performance are some of the most promising characteristics of
Discussion
A numerical challenge for the astronomical community in the coming years will be the simulation of star clusters containing one million stars. We have shown that direct N-body codes can be applied efficiently to the simulation of large stellar systems and that their performance can be predicted with simple models.
In this section, we apply the performance model introduced in Section 4.1 for the copy algorithm to predict the total execution time for the simulation of a system with N = 10^6. In Table 3
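The scale of such a run can be made concrete with a back-of-the-envelope count (the per-pair operation count and the sustained per-processor speed below are hypothetical round numbers for illustration, not measured values from Table 3):

```python
N = 1_000_000
pairs = N * (N - 1) // 2      # pairwise interactions per full force evaluation, ~5e11
flops = 60 * pairs            # assuming ~60 floating-point operations per pair

# hypothetical sustained speed of 10 Gflop/s per processor, ideal scaling,
# communication ignored: an upper bound on what any scheme can achieve
per_proc = 10e9
for p in (1, 128, 1024):
    seconds = flops / (per_proc * p)
```

Even under these optimistic assumptions, a single full force evaluation takes thousands of seconds on one processor, which is why the choice of parallel scheme and machine dominates the feasibility of million-body runs.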
Conclusions
We have implemented two parallelization schemes for direct N-body codes with block time-steps, the copy and the ring algorithm, and compared their performance on different parallel computers. In the case of clusters or supercomputers, the execution times of the two schemes are comparable, except for very small systems, where the communication time dominates over the calculation time and hence the copy algorithm performs slightly better. The ring algorithm is well suited for the integration of very
Acknowledgments
We thank Jun Makino and Piet Hut for support through the ACS project (http://www.artcompsci.org). We also thank Gyan Bhanot, Bob Walkup, and the IBM T.J. Watson Research Center for performing test runs on the BlueGene supercomputer, the DAS-2 and CrossGrid projects for their technical help in the grid tests, and Rainer Spurzem and David Merritt for useful discussions on parallel computing. This work was supported by the Netherlands Organization for Scientific Research (NWO Grant #635.000.001),
References (20)
- E.N. Dorband et al., Systolic and hyper-systolic algorithms for the gravitational N-body problem, with an application to Brownian motion, J. Comput. Phys. (2003)
- J. Makino, An efficient parallel algorithm for O(N²) direct summation method and its variations on distributed-memory parallel machines, NewA (2002)
- H. Cohn, Late core collapse in star clusters and the gravothermal instability, ApJ (1980)
- B.W. Murphy et al., Dynamical and luminosity evolution of active galactic nuclei – Models with a mass spectrum, ApJ (1991)
- P.D. Louis et al., Anisotropic gaseous models for the evolution of star clusters, MNRAS (1991)
- M. Hénon, Two recent developments concerning the Monte Carlo method, in: IAU Symposium 69: Dynamics of the Solar...
- L. Spitzer, Dynamical theory of spherical stellar systems with large N (invited paper), in: IAU Symposium 69: Dynamics...
- M. Giersz, Monte Carlo simulations of star clusters – I. First results, MNRAS (1998)
- J. Makino et al., GRAPE-6: massively-parallel special-purpose computer for astrophysical particle simulations, PASJ (2003)
- J. Makino et al., On a Hermite integrator with Ahmad–Cohen scheme for gravitational many-body problems, PASJ (1992)