Parallel Realization of the Hybrid Model Code for Numerical Simulation of Plasma Dynamics

The work is devoted to a parallel realization of a hybrid model for the study of plasma dynamics in axially symmetric open magnetic traps. The model is based on the MHD approximation for the electron component of the plasma and on the kinetic approach for the ion component. The model employs the particle-in-cell (PIC) method with explicit numerical schemes on staggered grids in the cylindrical coordinate system. The permanent injection of particles, the long characteristic times, the required grid resolution and the conditional stability of the method called for the development of a parallel version of the algorithm. For the parallelization we use a mixed decomposition with static load balancing. We present the parallel algorithm as well as the results of computational experiments on plasma dynamics in the regime of diamagnetic confinement in an open magnetic trap.


Introduction
The present-day leaders among devices for plasma confinement with the purpose of controlled thermonuclear fusion are tokamaks. Experiments on the JET tokamak yielded a fusion power of 16 MW and a fusion power gain Q of around 0.65 [1]. The ratio of the plasma pressure to the magnetic field pressure, β, in tokamaks does not exceed 5-10%. This fact, together with the complicated tokamak design, large sizes and high experimental costs, brings open plasma traps into consideration. New concepts of plasma confinement have been proposed and need to be verified. For example, the formation of a "diamagnetic bubble" allows creating compact linear open traps for fusion purposes [2].
Mathematical modelling is the key instrument for the analysis of the physical effects. The appropriate model for the plasma simulation is based on the kinetic Vlasov equation for electrons and ions. However, the difference between the electron and ion masses makes this model inapplicable due to the enormously high computational requirements. Our 2D hybrid model is based on the MHD approximation for the electron component of the plasma and on the kinetic approach for the ion component [3]. Despite its economy compared with fully kinetic models, a parallel version of the numerical algorithm is a natural necessity for reasonable computation times. The aim of this work is the parallel realization of the hybrid model code for simulations of interacting plasma flows, as well as its testing on the example of beam injection into an open magnetic trap. In the paper the algorithm and the finite-difference schemes are described, and the results of the numerical tests are presented.

Problem statement
We consider a cylindrical open trap of radius R and length L, filled with background hydrogen plasma of density n_0 at the initial time moment. Two coaxial round coils at the ends of the trap produce the magnetic field of the mirrors; the magnetic field at the center of the axis, at the point (0, L/2), is H_0. At the same point a beam particle source is located. The task is to study the self-consistent beam-plasma interaction.
As we consider an axially symmetric case, we use a cylindrical coordinate system in a two-dimensional domain and eliminate the derivatives ∂/∂ϕ from the equations.
For the definition of the dimensionless variables we use the characteristic values H_0 for the magnetic field and n_0 for the density. The characteristic time is t_0 = 1/ω_ci, where ω_ci = eH_0/(m_i c) is the ion gyrofrequency. The characteristic length is L_0 = c/ω_pi, where ω_pi = √(4πn_0e²/m_i) is the ion plasma frequency. The characteristic velocity is the Alfven velocity v_A = H_0/√(4πn_0m_i).
The ion dynamics is described by the Vlasov equation, which we solve with the method of characteristics. The equations for the ion trajectories coincide with the equations of the characteristics of the kinetic Vlasov equation:

$$\frac{d\mathbf{r}_i}{dt} = \mathbf{v}_i, \qquad \frac{d\mathbf{v}_i}{dt} = \mathbf{F}_i, \eqno{(1),(2)}$$

where r_i, v_i are the ion coordinates and velocities. The force F_i includes the Lorentz force and the ion-electron friction force,

$$\mathbf{F}_i = \mathbf{E} + [\mathbf{v}_i \times \mathbf{H}] - \kappa\,\frac{\mathbf{j}}{n}, \eqno{(3)}$$

with κ = cm_e/(eH_0τ_ei), where τ_ei is the characteristic ion-electron collision time [4]. Since we consider a quasi-neutral plasma, n = n_i = n_e. The current is

$$\mathbf{j} = n(\mathbf{V}_i - \mathbf{V}_e), \eqno{(4)}$$

where n_e is the electron density and V_e is the electron velocity. The ion velocity V_i and the ion density n_i are defined by the ion distribution function f_i(t, r, v):

$$n_i = \int f_i\, d\mathbf{v}, \qquad \mathbf{V}_i = \frac{1}{n_i}\int \mathbf{v}\, f_i\, d\mathbf{v}. \eqno{(5),(6)}$$

The massless electron component is described by the transport equation

$$\mathbf{E} = -[\mathbf{V}_e \times \mathbf{H}] - \frac{\nabla p_e}{n} + \kappa\,\frac{\mathbf{j}}{n}. \eqno{(7)}$$

The electron temperature T_e is defined by the heating due to the ion-electron friction, the heat flux and the changes due to the compression/expansion, and obeys the equation

$$\frac{\partial T_e}{\partial t} + (\mathbf{V}_e \cdot \nabla) T_e + (\gamma - 1)\, T_e\, \mathrm{div}\, \mathbf{V}_e = (\gamma - 1)\left(\kappa\,\frac{j^2}{n^2} + \frac{\kappa_1}{n}\,\Delta T_e\right), \eqno{(8)}$$

with κ_1 = 4πeλ/(H_0c), where p_e = n_eT_e is the pressure of the electrons and λ is the thermal conductivity coefficient. The adiabatic index γ = 5/3. We consider κ = const, κ_1 = const.
At the initial time moment the electric field E = 0, and the magnetic field is defined from the potential of the coils in a doubled domain with zero boundary conditions. At the present stage of the model realization we consider the perturbations of all values at the borders to be negligibly small, and we assume that the particles reflect from the external boundaries.

Algorithm description
We use the particle-in-cell method on staggered uniform grids to solve equations (1)-(10) [5]. The grid values of the fields and of the plasma moments are defined in the spatial mesh nodes. The values of the particle data r, z, v_r, v_ϕ, v_z are defined on the Lagrangian mesh.
Since in equations (3), (4), (7), (8) the currents appear with the factor j/n, we substitute U = j/n and use arrays for U during the computations. The only equation requiring the values of j is (10), and there we use the product nU. These changes decrease the number of operations performed by the processors and increase the algorithm performance.
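As a minimal illustration of this substitution (a sketch in C; the array names and layout are assumptions, not the paper's code):

```c
/* Precompute U = j/n on the grid once per time step: every later use of
 * the factor j/n in eqs. (3), (4), (7), (8) then reads U directly, and
 * the current is reconstructed as j = n*U only where eq. (10) needs it. */
void compute_U(const double *jr, const double *jphi, const double *jz,
               const double *n, double *Ur, double *Uphi, double *Uz,
               int m)
{
    for (int k = 0; k < m; ++k) {
        Ur[k]   = jr[k]   / n[k];
        Uphi[k] = jphi[k] / n[k];
        Uz[k]   = jz[k]   / n[k];
    }
}
```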
The electric field (7) is split into two parts,

$$\mathbf{E} = \mathbf{E}^{*} + \mathbf{E}^{\nabla}, \qquad \mathbf{E}^{*} = -[\mathbf{V}_e \times \mathbf{H}], \qquad \mathbf{E}^{\nabla} = -\frac{\nabla p_e}{n}.$$

Here E^∇_ϕ = 0 and we use only two arrays for E^∇. For equations (3) and (7) rewritten in terms of E we obtain the cancellation of the term κU and a computation speed-up due to the use of arrays only for E. We calculate and store the array data of E* and E^∇ separately for the calculation of the magnetic field H_ϕ using equation (9). The additional arrays allow using a more compact scheme template and a more accurate calculation of the pressure gradient contributions to the magnetic field.
The stationary structure of the diamagnetic bubble implies attainment of the bubble saturation with particles and a further steady plasma flux in the trap, defined by the particle injection and losses. The characteristic times of the system evolution are expected to be ∼10^4 ω_ci^−1, so, with ∼10^4 steps per time unit, the simulation requires ∼10^8 time steps.
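Written out, the estimate combines the expected evolution time with the step count per time unit:

$$N_\tau \sim \frac{T_{evol}}{\tau} \sim \frac{10^{4}\,\omega_{ci}^{-1}}{10^{-4}\,\omega_{ci}^{-1}} = 10^{8}.$$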
Since the solution accuracy is determined by the grid resolution, the bubble boundary layer of size λ_b ∼ 1 cm has to be resolved with a minimum of 10 grid nodes. The bubble size of 10 × 100 cm then requires grids with 100 × 1000 nodes. The particle-in-cell accuracy is also defined by the number of model particles in each cell [6], and more accurate results require larger numbers of model beam particles and model background particles. Millions of injected particles require a sufficient amount of processor memory. The grid refinement leads to a proportional increase in the number of background ions. Besides, a smaller time step must be taken for finer grids according to the stability condition of the explicit schemes. All these factors necessitate a parallel version of the described algorithm, both for faster computations and for larger data processing.
The domain is divided into equal parts and each subdomain is assigned to one of np_0 groups of processor cores. The particles of a subdomain are distributed evenly among the cores of the corresponding group [7]. Of the np processor cores, the ones with ranks 0..np_0 − 1 are denominated master cores; the exchanges of the auxiliary grid nodes are performed by the master cores. The rest of the cores are denominated slaves: they receive the same grid data from their masters, process the particles and send the results back. The static core distribution for the particles and the domain decomposition are presented schematically in figure 1. The block scheme of the algorithm is presented in figure 2. Let us consider one step of the time cycle. It consists of two stages and the switching between them. On the Lagrangian stage the particle velocities and coordinates are calculated. The grid values are calculated on the Eulerian stage.
At the beginning of the cycle the master cores distribute the field values H, E* and E^∇ to their slaves using the function MPI_Bcast.
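A minimal sketch of the group setup and of this broadcast (C with MPI; the value of np_0, the round-robin attachment of slaves and the array sizes are illustrative assumptions, since in the paper the central groups receive more cores):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int np0 = 10;                  /* number of groups (= subdomains) */
    /* ranks 0..np0-1 are masters; slaves are attached round-robin here
       for brevity */
    int group = (rank < np0) ? rank : (rank - np0) % np0;
    int key   = (rank < np0) ? 0 : 1;    /* the master gets local rank 0  */
    MPI_Comm gcomm;
    MPI_Comm_split(MPI_COMM_WORLD, group, key, &gcomm);

    enum { NR = 32, NZ = 16, NF = 8 };   /* illustrative local block size */
    static double fields[NF][NR * NZ];   /* components of H, E*, E-grad   */

    /* the master (local rank 0 in gcomm) distributes the grid fields
       to all slaves of its group */
    MPI_Bcast(fields, NF * NR * NZ, MPI_DOUBLE, 0, gcomm);

    MPI_Comm_free(&gcomm);
    MPI_Finalize();
    return 0;
}
```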

Ion motion
We use the scheme [8] for the values r_i, v_i in Cartesian coordinates, and then recalculate them in the cylindrical coordinates. Each processor core computes its own particle velocities and coordinates using bilinear interpolation of the electromagnetic fields. We form buffer arrays for the particles which leave their subdomain, to be sent to one of the cores of the neighbour group. The local rank of the receiving core in the neighbour group is defined as mod(n + 1, C_g), where C_g is the number of cores in group g. This way of exchanging particles yields a uniform particle distribution among the cores of each group.
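This choice can be sketched as follows (a C sketch; keeping the running counter n on the sending core is our reading of the rule, and the names are illustrative):

```c
/* Local rank of the receiving core in the neighbour group g. Cycling a
 * per-group counter n modulo C_g spreads the departing particles
 * uniformly over the C_g cores of that group. */
static int next_receiver(int *n, int Cg)
{
    *n = (*n + 1) % Cg;   /* local rank = mod(n + 1, C_g) */
    return *n;
}
```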

Average ion velocities, densities
The switching from the Lagrangian to the Eulerian stage implies that each processor core computes the average velocity and the density in the cells. First, the cross-section area of the cell [r_{a−1/2}, r_{a+1/2}] × [z_{b−1/2}, z_{b+1/2}] is calculated,

$$S^{cell}_a = \pi r^2_{a+1/2} - \pi r^2_{a-1/2} = 2\pi h_r r_a,$$

and then the volume of the cell,

$$V^{cell}_a = S^{cell}_a h_z = 2\pi h_r h_z r_a.$$

The density values on the spatial grid are

$$n_{i-1/2,\,k-1/2} = \frac{1}{V^{cell}} \sum_{j=1}^{N} q_j\, R^{CIC}(r_{i-1/2} - r_j,\; z_{k-1/2} - z_j).$$

Here j = 1..N is the number of the model particle, q_j is the full charge of the ion j, and R^{CIC} is the cloud-in-cell shape function of the particle (bilinear interpolation). Similarly, from the velocities v_j of the ions the average ion velocity is computed at the nodes of the grid:

$$\mathbf{V}_i = \frac{1}{n\,V^{cell}} \sum_{j=1}^{N} q_j\, \mathbf{v}_j\, R^{CIC}. \eqno{(15)}$$

Since the density is calculated at the points (i − 1/2, k − 1/2), for the velocity components V_ir and V_iz in (15) the values n_{i,k−1/2}, n_{i−1/2,k} are densities calculated as the arithmetic average of the two corresponding density values. The masters collect the processed arrays from their groups using MPI_Reduce and then exchange the boundary grid values with each other.
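A minimal deposition sketch consistent with these formulae (C; the uniform steps h_r, h_z, the half-integer placement of the density nodes, the row-major layout, and particles kept away from the axis and the outer boundaries are all simplifying assumptions):

```c
#include <math.h>

/* Cloud-in-cell (CIC) deposition of the charges q_j onto the density
 * grid, followed by normalization by the cylindrical cell volume
 * V_a = 2*pi*hr*hz*r_a. */
void deposit_density(const double *pr, const double *pz, const double *q,
                     int N, double hr, double hz, int Nr, int Nz,
                     double *dens)
{
    for (int j = 0; j < N; ++j) {
        double fr = pr[j] / hr - 0.5;        /* fractional node index  */
        double fz = pz[j] / hz - 0.5;
        int a = (int)fr, b = (int)fz;
        double wr = fr - a, wz = fz - b;     /* bilinear (CIC) weights */
        dens[ a      * Nz + b    ] += q[j] * (1.0 - wr) * (1.0 - wz);
        dens[(a + 1) * Nz + b    ] += q[j] * wr         * (1.0 - wz);
        dens[ a      * Nz + b + 1] += q[j] * (1.0 - wr) * wz;
        dens[(a + 1) * Nz + b + 1] += q[j] * wr         * wz;
    }
    for (int a = 0; a < Nr; ++a) {
        double ra = (a + 0.5) * hr;          /* radius of the cell center */
        for (int b = 0; b < Nz; ++b)
            dens[a * Nz + b] /= 2.0 * M_PI * hr * hz * ra;
    }
}
```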

Currents
The further computations in the time cycle are performed only by the master cores on the spatial grids, so only the grid boundary exchanges are needed.
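Such an exchange between neighbouring masters can be sketched as follows (assuming a one-dimensional split along z, one ghost column per side, and a hypothetical communicator master_comm containing only the np_0 master cores):

```c
#include <mpi.h>

/* f holds the local block of a grid array; the z-column k occupies
 * f[k*Nr .. k*Nr + Nr - 1], columns 0 and Nzloc-1 being ghost columns.
 * left/right are the neighbour masters' ranks in master_comm, or
 * MPI_PROC_NULL at the ends of the domain. */
void exchange_halo(double *f, int Nr, int Nzloc, int left, int right,
                   MPI_Comm master_comm)
{
    /* send the last owned column to the right, receive the left ghost */
    MPI_Sendrecv(&f[(Nzloc - 2) * Nr], Nr, MPI_DOUBLE, right, 0,
                 &f[0],                Nr, MPI_DOUBLE, left,  0,
                 master_comm, MPI_STATUS_IGNORE);
    /* send the first owned column to the left, receive the right ghost */
    MPI_Sendrecv(&f[Nr],               Nr, MPI_DOUBLE, left,  1,
                 &f[(Nzloc - 1) * Nr], Nr, MPI_DOUBLE, right, 1,
                 master_comm, MPI_STATUS_IGNORE);
}
```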
From Maxwell equation (10), rot H = nU, the values U_r, U_ϕ, U_z are determined. In the axially symmetric case

$$U_r = -\frac{1}{n}\frac{\partial H_\varphi}{\partial z}, \qquad U_\varphi = \frac{1}{n}\left(\frac{\partial H_r}{\partial z} - \frac{\partial H_z}{\partial r}\right), \qquad U_z = \frac{1}{n r}\frac{\partial (r H_\varphi)}{\partial r}.$$
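One component of this evaluation in discrete form may look as follows (central differences at an interior node; the staggering and indexing are illustrative, not the paper's exact scheme):

```c
/* U_r = -(1/n) dH_phi/dz at the node (i, k); row-major arrays
 * f[i*Nz + k], uniform step hz. */
static double Ur_node(const double *Hphi, const double *n,
                      int i, int k, int Nz, double hz)
{
    double dHphi_dz =
        (Hphi[i * Nz + k + 1] - Hphi[i * Nz + k - 1]) / (2.0 * hz);
    return -dHphi_dz / n[i * Nz + k];
}
```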

Electron velocities
The electron velocities are defined from relation (4):

$$\mathbf{V}_e = \mathbf{V}_i - \mathbf{U}.$$

Electric field
Equation (7) is used to compute the electric field in the form E = E* + E^∇. The values of E* are computed according to explicit difference schemes in which the overline denotes the arithmetic average of the four values in the nearest nodes; for the values of E^∇ similar schemes are used.
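The four-point average can be realized, for instance, as follows (the row-major layout and the node offsets are assumptions):

```c
/* Arithmetic average of the four nearest nodes of a staggered grid
 * array (row-major f[i*Nz + k]); the overline in the E* schemes. */
static inline double avg4(const double *f, int i, int k, int Nz)
{
    return 0.25 * (f[ i      * Nz + k    ] + f[(i + 1) * Nz + k    ]
                 + f[ i      * Nz + k + 1] + f[(i + 1) * Nz + k + 1]);
}
```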

Magnetic field
The magnetic field is advanced using equation (9), the induction equation ∂H/∂t = −rot E, with an explicit difference scheme on the staggered grid; the contributions of E* and E^∇ to rot E are differenced separately, as described above.

Numerical experiments
For the numerical experiments a spatial grid with N_r = 60 and N_z = 300 nodes and N_τ = 10^5 time steps of size τ = 5·10^−4 were used. The number of background ions was J_b = 7.2·10^4, and the total number of ions was J_p = 3.4·10^6. The computations took ∼9.5 hours.
For the grids N_r = 30, N_z = 150 and the time step τ = 8·10^−4, from 10 to 72 processor cores in 10..30 groups were used. We denote the core distribution as np_0 − np. The two central groups have the maximal number of cores; for example, for the case 10 − 30 this number is C_4 = C_5 = 5. The number C_g falls off linearly with the factor δ as the group number g approaches the domain ends (figure 8).
The core distribution 30 − 72 is the most effective, as it has a high number of cores in the center (C_7 = 8, δ = 1) and the smallest of all considered subdomains with particles. The configurations 10 − 14 and 10 − 16 have the same number of central cores, C_4 = C_5 = 4, but the higher δ = 2 for np = 16 corresponds to an additional core in the groups g = 3 and g = 6. The more uniform core load decreases the computation time by 3% for T = 100. The linear decomposition np_0 = np is the most inefficient case, but a bigger number of subdomains np_0 allows processing smaller domains and, as a consequence, smaller numbers of particles, providing faster computations.
The computation times, measured in seconds, for the evolution time T = 1.25·10^5 τ are presented in table 1. The table demonstrates the strong dependence of the computational time on the particle processing: the calculation of the velocities, coordinates and currents takes ∼98% of the total time T_total. The largest subdomain, with the biggest number of particles per core J_c, corresponds to the distribution 10 − 10 and to the longest computations. The minimal number of particles J_c corresponds to the configuration 30 − 72 and yields the best results. Since the number of grid nodes is smaller than the number of particles, the computations on the Eulerian stage take significantly less time (less than 1% of T_total). The spatial grid size is a slowly growing parameter in the simulations, whereas the number of particles is defined by the injection intensity and the evolution time T. In the present simulations the particle exchanges take ∼10^−4 T_total or less, but for times of the order of 10^3 ω_ci^−1 these exchanges must be taken into account. The times for the 2D array exchanges grow with the maximal number of cores in a group. An appropriate distribution within the given np cores may significantly accelerate the computations: for example, for np = 30 the most effective configuration in the table corresponds to np_0 = 10 and requires about 5 times less computation time than 10 − 10. The test simulations demonstrate the advantages of the mixed decomposition, but dynamic load balancing may improve it considerably.
The computations were performed on Intel Xeon Phi 7290 processors of the Siberian Supercomputer Center cluster (ICM&MG SB RAS, Novosibirsk).

Conclusion
The mixed decomposition is a powerful method for acceleration of the computations, and an appropriate core distribution may significantly decrease the computation times. For problems with a highly non-uniform particle distribution in time, dynamic load balancing is required for efficient computations.