Atomistic spin model simulations of magnetic nanomaterials

Atomistic modelling of magnetic materials provides unprecedented detail about the underlying physical processes that govern their macroscopic properties, and allows the simulation of complex effects such as surface anisotropy, ultrafast laser-induced spin dynamics, exchange bias, and mi- crostructural effects. Here we present the key methods used in atomistic spin models which are then applied to a range of magnetic problems. We detail the parallelisation strategies used which enable the routine simulation of extended systems with full atomistic resolution.


I. INTRODUCTION
Atomistic models of magnetic materials , where the atoms treated as possessing a local magnetic moment, originated with Ising in 1925 as the first model of the phase transition in a ferromagnet [1]. The Ising model has spin-up and spin-down only states, and is amenable to analytical treatment, at least in two dimensions. Although it is still extensively used in the study of phase transitions, it is limited in applicability to magnetic materials and cannot be used for dynamic simulations. A natural extension of the Ising model is to allow the atomic spin to vary freely in 3D space [2,3] which yields the classical Heisenberg model, where quantum mechanical effects on the atomic spins are neglected [2]. Monte Carlo simulations of the classical Heisenberg model allowed the study of phase transitions, surface and finite size effects in simple magnetic systems. The study of dynamic phenomena however was intrinsically limited due to the use of Monte Carlo methods until the development of dynamic [4,5] and stochastic Landau-Lifshitz-Gilbert atomistic spin models [6][7][8].
Today most magnetic modelling is performed using numerical micromagnetics in finite-difference[37] and finiteelement [38,39] flavours. The theoretical basis of micromagnetics is that the atomic dipoles which make up the magnetic material can be approximated as a continuous vector field where, due to the exchange interaction, the atomic dipoles in a small finite volume are perfectly aligned. Micromagnetics has proven to be an essential tool in understanding a range of complex magnetic effects [40][41][42] but due to the rapid pace of technological development in magnetic materials the continuum approximation at its heart precludes its application to many problems of interest at the beginning of the 21 st century, such as heat assisted magnetic recording [43], ultrafast laser-induced demagnetisation [44,45], exchange bias in spin valves [46], surface and interface anisotropy [47,48] and high anisotropy materials for ultrahigh density recording media such as FePt [49]. The common theme to these problems is a sub-nanometre spatial variation in the magnetisation caused by high temperatures, atomic level ordering (anti-and ferri-magnets), or atomic surface and interface effects. To tackle these problems requires a formalism to take account of the detailed atomic structure to express its impact on the macroscopic behaviour of a nano particle, grain or complete device.
Some but not all of these problems can adequately be tackled by next-generation micromagnetic approaches utilising the Landau-Lifshitz-Bloch equation [50][51][52], which is based on a physically robust treatment of the coupling of a macrospin to a heat bath, allowing the study of high temperature processes [53], ultrafast demagnetisation [54,55] and switching [56]. However, true atomic scale variations of the magnetisation, as apparent in antiferromagnets, surfaces and interfaces, still require an atomistic approach. Additionally with the decreasing size of magnetic elements, finite size effects begin to play in increasing role in the physical properties of magnetic systems [57].
In this article we present an overview of the common computational methods utilised in atomistic spin simu-lations and details of their implementation in the opensource vampire software package [58]. Testing of the code is an essential part of ensuring the accuracy of the model and so we also detail important tests of the various parts of the model and compare them to analytic results while exploring some interesting physics of small magnetic systems.
vampire is designed specifically with these problems in mind, and can easily simulate nanoparticles, multilayer films, interfacial mixing, surface anisotropy and roughness, core-shell systems, granular media and lithographically defined patterns, all with fully atomistic resolution. In addition truly realistic systems predicted by Molecular Dynamics simulations [19,20,59] can also be used giving unprecedented detail about the relationships between shape and structure and the magnetic properties of nanoparticles. In addition to these physical features vampire also utilises the computing power of multiprocessor machines through parallelisation, allowing systems of practical interest to be simulated routinely, and largescale problems on the 100+ nm length scale to be simulated on computing clusters.

II. THE ATOMISTIC SPIN MODEL
The physical basis of the atomistic spin model is the localisation of unpaired electrons to atomic sites, leading to an effective local atomistic magnetic moment. The degree of localisation of electrons has historically been a contentious issue in 3d metals [60], due to the magnetism originating in the outer electrons which are notionally loosely bound to the atoms. Ab initio calculations of the electron density [61] show that in reality even in 'itinerant' ferromagnets the spin polarisation is well-localised to the atomic sites. Essentially this suggests that the bonding electrons are unpolarised, and after taking into account the bonding charge the remaining d-electrons form a welldefined effective localised moment on the atomic sites.
Magnetic systems are fundamentally quantum mechanical in nature since the electron energy levels are quantised, the exchange interaction is a purely quantum mechanical effect, and other important effects such as magnetocrystalline anisotropy arise from relativistic interactions of electronic orbitals with the lattice, which are the province of ab initio models. In addition to these properties at the electronic level, the properties of magnetic materials are heavily influenced by thermal effects which are typically difficult to incorporate into standard density functional theory approaches. Therefore models of magnetic materials should combine the quantum mechanical properties with a robust thermodynamic formalism. The simplest model of magnetism using this approach is the Ising model [1], which allows the atomic moments one of two allowed states along a fixed quantisation axis. Although useful as a descriptive system, the forced quantisation is equivalent to infinite anisotropy, limiting the applicability of the Ising model in relation to real materials. In the classical description the direction of the atomic moment is a continuous variable in 3D space allowing for finite anisotropies and dynamic calculations. In some sense the classical spin model is analogous to Molecular Dynamics, where the energetics of the system are determined primarily from quantum mechanics, but the time evolution and thermodynamic properties are treated classically.

A. The classical spin Hamiltonian
The Heisenberg spin model encapsulates the essential physics of a magnetic material at the atomic level, where the energetics of a system of interacting atomic moments is given by a spin Hamiltonian (which neglects non-magnetic effects such the as the Coulomb term). The spin Hamiltonian H typically has the form: denoting terms for the exchange interaction, magnetic anisotropy, and externally applied magnetic fields respectively.
The dominant term in the spin Hamiltonian is the Heisenberg exchange energy, which arises due to the symmetry of the electron wavefunction and the Pauli exclusion principle [60] which governs the orientation of electronic spins in overlapping electron orbitals. Due to its electrostatic origin, the associated energies of the exchange interaction are around 1−2 eV, which is typically up to 1000 times larger than the next largest contribution and give rise to magnetic ordering temperatures in the range 300-1300K. The exchange energy for a system of interacting atomic moments is given by the expression where J ij is the exchange interaction between atomic sites i and j, S i is a unit vector denoting the local spin moment direction and S j is the spin moment direction of neighbouring atoms. The unit vectors are taken from the actual atomic moment µ s and given by S i = µ s /|µ s |.
It is important to note here the significance of the sign of J ij . For ferromagnetic materials where neighbouring spins align in parallel, J ij > 0, and for anti-ferromagnetic materials where the spins prefer to align anti-parallel J ij < 0. Due to the strong distance dependence of the exchange interaction the sum in Eq. 2 is often truncated to include nearest neighbours only. This significantly reduces the computational effort while being a good approximation for many materials of interest. In reality the exchange interaction can extend to several atomic spacings [29,30], representing hundreds of pairwise interactions.
In the simplest case the exchange interaction J ij is isotropic, meaning that the exchange energy of two spins depends only on their relative orientation. In more complex materials, the exchange interaction forms a tensor with components: which is capable of describing anisotropic exchange interactions, such as two-ion anisotropy [29] and the Dzyaloshinskii-Moriya interaction (off-diagonal components of the exchange tensor). In the case of tensorial exchange H M exc , the exchange energy is given by the product: Obtaining the components of the exchange tensor may be done phenomenologically, or via ab initio methods such as the relativistic torque method [62][63][64][65] or the spincluster expansion technique [30,[66][67][68]. The above expressions for the exchange energy also exclude higherorder exchange interactions such as three-spin and fourspin terms. In most materials the higher order exchange terms are significantly smaller than the leading term and can safely be neglected.
While the exchange energy gives rise to magnetic ordering at the atomic level, the stability of a magnetic material is dominated by the magnetic anisotropy, or preference for the atomic moments to align along a preferred spatial direction. There are several physical effects which give rise to anisotropy, but the most important is the magnetocrystalline anisotropy (namely the preference for spin moments to align with particular crystallographic axes) arising from the interaction of atomic electron orbitals with the local crystal environment [69,70].
The simplest form of anisotropy is of the single-ion uniaxial type, where the magnetic moments prefer to align along a single axis, e, often called the easy axis and is an interaction confined to the local moment. Uniaxial anisotropy is most commonly found in particles with elongated shape (shape anisotropy), or where the crystal lattice is distorted along a single axis as in materials such as hexagonal Cobalt and L1 0 ordered FePt. The uniaxial single ion anisotropy energy is given by the expression: where k u is the anisotropy energy per atom. Materials with a cubic crystal structure, such as Iron and Nickel, have a different form of anisotropy known as cubic anisotropy. Cubic anisotropy is generally much weaker than uniaxial anisotropy, and has three principal directions which energetically are easy, hard and very hard magnetisation directions respectively. Cubic anisotropy is described by the expression: where k c is the cubic anisotropy energy per atom, and S x ,S y , and S z are the x,y, and z components of the spin moment S respectively. Most magnetic problems also involve interactions between the system and external applied fields, denoted as H app . External fields can arise in many ways, for example a nearby magnetic material, or as an effective field from an electric current. In all cases the applied field energy is simply given by: A note on magnetic units The subject of magnetic units is controversial due to the existence of multiple competing standards and historical origins. [60] Starting from the atomic level however, the dimensionality of units is relatively transparent. Atomic moments are usually accounted for in multiples of the Bohr magneton (µ B ), the magnetic moment of an isolated electron, with units of Joules/Tesla. Given a number of atoms of moment µ s in a volume, the moment per unit volume is naturally in units of J/T/m 3 , which is identical to the SI unit of A/m. However, the dimensionality (moment per unit volume) of the unit A/m is not as obvious as JT −1 m −3 , and so the latter form is used herein.
Applied magnetic fields are hence defined in Tesla, which comes naturally from the derivative of the spin Hamiltonian with respect to the local moment. The unit of Tesla for applied field is also beneficial for hysteresis loops, since the area enclosed a typical M-H loop is then given as an energy density (Joules/m 3 ). A list of key magnetic parameters and variables and their units are shown in Tab. I.

III. SYSTEM PARAMETERISATION AND GENERATION
Unlike micromagnetic simulations where the magnetic system can be partitioned using either a finite difference or finite element discretisation, atomistic simulations generally require some a priori knowledge of atomic positions. Most simple magnetic materials such as Fe, Co or Ni form regular crystals, while more complex systems such as oxides, antiferromagnets and Heusler alloys possess correspondingly complex atomic structures. For ferromagnetic metals, the details of atomic positions are generally less important due to the strong parallel orientation of moments, and so they can often (but not al- ways) be represented using a simple cubic discretisation. In contrast, the properties of ferrimagnetic and antiferromagnetic materials are inherently tied to the atomic positions due to frustration and exchange interactions, and so simulation of these materials must incorporate details of the atomic structure. In addition to the atomic structure of the material, it is also necessary to parameterise the terms of the spin Hamiltonian given by Eq. 1, principally including exchange and anisotropy values but also with other terms. There are generally two ways in which this may be done: using experimentally determined properties or with a multiscale approach using ab initio density functional theory calculations as input to the spin model.

Atomistic parameters from ab initio calculations
Ab initio density functional theory (DFT) calculations utilise the Hohenberg-Kohn-Sham theory [71,72] that the total energy E of a system can be written solely in terms the electron density, rho. Thus, if the electron density is known then the physical properties of the system can be found. In practice, the both electron density and the spin density are used as fundamental quantities in the total energy expression for spin-polarised systems [73]. In many implementations DFT-based methods only consider the outer electrons of a system, since the inner electrons play a minimal role in the bonding and also partially screen the effect of the nuclear core. These effects are approximated by a pseudopotential which determines the potential felt by the valence electrons. In all-electron methods, however, the core electron density is also relaxed. By energy minimisation DFT enables the calculation of a wide range of properties, including lattice constants, and in the case of magnetic ma-terials localised spin moments, magnetic ground state and the effective magnetocrystalline anisotropy. Standard software packages such as vasp [74], castep [75,76] and siesta [77] make such calculations readily accessible. At present determining site resolved properties such as anisotropy constants and pairwise exchange interactions is more involved and require ab initio Green's functions techniques such as the fully relativistic Korringa-Kohn-Rostoker method [78,79] or the lmto method [80,81] in conjunction with the magnetic force theorem [62]. An alternative approach for the calculation of exchange parameters is the utilisation of the generalised Bloch theorem for spin-spiral states in magnetic systems [82] together with a Fourier transformation of k-dependent spin-spiral energies [83,84].
A number of studies have determined atomic magnetic properties from first principles calculations by direct mapping onto a spin model, including the principal magnetic elements Co, Ni and Fe [81], metallic alloys including FePt [29], IrMn [31], oxides [85] and spin glasses [86], and also bilayer systems such as Fe/FePt. [87] Such calculations give detailed insight into microscopic magnetic properties, including atomic moments, long-ranged exchange interactions, magnetocrystalline anisotropies (including surface and two-ion interactions) and other details not readily available from phenomenological theories. Combined with atomistic models it is possible to determine macroscopic properties such as the Curie temperature, temperature dependent anisotropies, and magnetic ground states, often in excellent agreement with experiment. However, the computational complexity of DFT calculations also means that the systems which can be simulated with this multi scale approach are often limited to small clusters, perfect bulk systems and 2D periodic systems, while real materials of course often contain a plethora of defects disrupting the long range order. Some studies have also investigated the effects of disorder in magnetic systems combined with a spin model mapping, such as dilute magnetic semiconductors [88] and metallic alloys [89].
Magnetic properties calculated at the electronic level have a synergy with atomistic spin models, since the electronic properties can often be mapped onto a Heisenberg spin model with effective local moments. This multiscale electronic and atomistic approach avoids the continuum approximations of micromagnetics and treats magnetic materials at the natural atomic scale.

Atomistic parameters from macroscopic properties
The alternative approach to multiscale atomistic/density-functional-theory simulations is to derive the parameters from experimentally determined values. This has the advantage of speed and lower complexity, whilst foregoing microscopic details of the exchange interactions or anisotropies. Another key advantage of generic parameters is the possibility of parametric studies, where parameters are varied explicitly to determine their importance for the macroscopic properties of the system, such as has been done for studies of surface anisotropy [17] and exchange bias [13].
Unlike micromagnetic simulations, the fundamental thermodynamic approach of the atomistic model means that all parameters must be determined for zero temperature. The spin fluctuations then determine the intrinsic temperature dependence of the effective parameters which are usually put into micromagnetic simulations as parameters. Fortunately determination of the atomic moments, exchange constants and anisotropies from experimental values is relatively straightforward for most systems.

Atomic spin moment
The atomic spin moment µ s is related to the saturation magnetisation simply by: where M s is the saturation magnetisation at 0K in JT −1 m −3 , a is the unit cell size (m), and n at is the number of atoms per unit cell. We also note the usual convention of expressing atomic moments in multiples or fractions of the Bohr magneton, µ B owing to their electronic origin. Taking BCC Iron as an example, the zero temperature saturation magnetisation is 1.75 MA/m [90], unit cell size of a = 2.866Å, this gives an atomic moment of 2.22 µ B /atom.

Exchange energy
For a generic atomistic model with z nearest neighbour interactions, the exchange constant is given by the meanfield expression: where k B is the Boltzmann constant and T c is the Curie temperature z is the number of nearest neighbours. is a correction factor from the usual mean-field expression which arises due to spin waves in the 3D Heisenberg model [91] and is ∼ 0.86. Because of this is also crystal structure and coordination number dependent, and so the calculated T c will vary slightly according to the specifics of the system. For Cobalt with a T c of 1388K and assuming a hexagonal crystal structure with z = 12, this gives a nearest neighbour exchange interaction J ij = 6.064 × 10 −21 J/link.

Anisotropy energy
The atomistic magnetocrystalline anisotropy k u is derived from the macroscopic anisotropy constant K u by the expression: where K u in given in J/m 3 . In addition to the atomistic parameters, it is also worth noting the analogous expressions for the anisotropy field H a for a single domain particle: where symbols have their usual meaning. At this point it is worth mentioning that the measured anisotropy is a free energy difference. While the intrinsic k u remains (to a first approximation) temperature independent, at a non-zero temperature the free energy in the easy/hard directions is increased/decreased due to the magnetisation fluctuations. Thus the macroscopic anisotropy value decreases with increasing temperature, vanishing at T c . The thermodynamic basis of atomistic models makes them highly suitable for the investigation of such phenomena, as we show later. Applying the preceding operations, parameters for the key ferromagnetic elements are given in Tab. II.

Ferrimagnets and antiferromagnets
In the case of ferrimagnets and anti-ferromagnets the above methods for anisotropy and moment determination do not work due to the lack of macroscopic measurements, although the estimated exchange energies apply equally well to the Néel temperature provided no magnetic frustration (due to lattice symmetry) is present. In general, other theoretical calculations or formalisms are required to determine parameters, such as mean-field approaches [9] or density functional theory calculations [31].

Atomistic system generation
In addition to determining the parameters of the spin Hamiltonian, an essential part of the atomistic model is the determination of the nuclear, or atomic, positions in the system. In the multiscale approach utilising ab initio parameterisation of the system, the spin Hamiltonian is intrinsically tied to the atomic positions. The additional detail offered by first principles calculations is highly desirable even for perfect crystals and atomically sharp interfaces, however the computational complexity of the calculations limits the ability to parameterise a For systems modelled using the nearest neighbour approximation, the atomic structures are much less restricted, allowing for simulations of material defects such as interface roughness [13]and intermixing [21], magnetic multilayers [94,95], disordered magnetic alloys [9], surface [17] and finite size effects [57]. vampire includes extensive functionality to generate such systems, the details of which are included in Appendix B. In addition to crystallographic and molecular systems [96,97] it is also possible to investigate magnetic effects in disordered materials and nanoparticles by incorporating the results of Molecular Dynamics simulations [19,20,98].

IV. INTEGRATION METHODS
Although the spin Hamiltonian describes the energetics of the magnetic system, it provides no information regarding its time evolution, thermal fluctuations, or the ability to determine the ground state for the system. In the following the commonly utilised integration methods for atomistic spin models are introduced.

Spin Dynamics
The first understanding of spin dynamics came from ferromagnetic resonance experiments, where the time dependent behaviour of a magnetic materials is described by the equation derived by Landau and Lifshitz [99], and in the modern form given by: where m is a unit vector describing the direction of the sample magnetisation, H is the effective applied acting on the sample, γ is the gyromagnetic ratio and α is a phenomenological damping constant which is a property of the material. The physical origin of the Landau-Lifshitz (LL) equation arises due to two distinct physical effects. The precession of the magnetisation (first term in Eq. 12) arises due to the quantum mechanical interaction of an atomic spin with an applied field. The relaxation of the magnetisation (second term in Eq. 12) is an elementary formulation of energy transfer representing the coupling of the magnetisation to a heat bath which aligns the magnetisation along the field direction with a characteristic coupling strength determined by α. In the LL equation the relaxation rate of the magnetisation to the field direction is a linear function of the damping parameter, which was shown by Gilbert to yield incorrect dynamics for materials with high damping [100]. Subsequently Gilbert introduced critical damping, with a maximum effective damping for α = 1, to arrive at the Landau-Lifshitz-Gilbert (LLG) equation. Although initially derived to describe the dynamics of the macroscopic magnetisation of a sample, the LLG is the standard equation of motion used in numerical micromagnetics, describing the dynamics of small magnetic elements.
The same equation of motion can also be applied at the atomistic level. The precession term arises quantum mechanically for atomic spins and the relaxation term now describes direct angular momentum transfer between the spins and the heat bath, which includes contributions from the lattice [101] and the electrons [102]. A distinction between the macroscopic LLG and the atomistic LLG now appears in terms of the effects included within the damping parameter. In the macroscopic LLG, α includes all contributions, both intrinsic (such as spin-lattice and spin-electron interactions) and extrinsic (spin-spin interactions arising from demagnetisation fields, surface defects [103], doping [104] and temperature [50]), while the atomistic LLG only includes the local intrinsic contributions. To distinguish the different definitions of damping we therefore introduce a microscopic damping parameter λ. Although the form of the LLG is identical for atomistic and macroscopic length scales, the microstructural detail in the atomistic model allows for calculations of the effective damping including extrinsic effects, such as rare-earth doping [104]. Including a microscopic damping λ the atomistic Landau-Lifshitz-Gilbert equation is given by where S i is a unit vector representing the direction of the magnetic spin moment of site i, γ is the gyromagnetic ratio and H i eff is the net magnetic field on each spin. The atomistic LLG equation describes the interaction of an atomic spin moment i with an effective magnetic field, which is obtained from the negative first derivative of the complete spin Hamiltonian, such that: where µ s is the local spin moment. The inclusion of the spin moment within the effective field is significant, in that the field is then expressed in units of Tesla, given a Hamiltonian in Joules. Given typical energies in the Hamiltonian of 10 µeV -100 meV range. This gives fields typically in the range 0.1 -1000 Tesla, given a spin moment of the same order as the Bohr magneton (µ B ).

Langevin Dynamics
In its standard form the LLG equation is strictly only applicable to simulations at zero temperature. Thermal effects cause thermodynamic fluctuations of the spin moments which at sufficiently high temperatures are stronger than the exchange interaction, giving rise to the ferromagnetic-paramagnetic transition. The effects of temperature can be taken into account by using Langevin Dynamics, an approach developed by Brown [105]. The basic idea behind Langevin Dynamics is to assume that the thermal fluctuations on each atomic site can be represented by a Gaussian white noise term. As the temperature is increased, the width of the Gaussian distribution increases, thus representing stronger thermal fluctuations. The established Langevin Dynamics method is widely used for spin dynamics simulations and incorporates an effective thermal field into the LLG equation to simulate thermal effects [106][107][108]. The thermal fluctuations are represented by a gaussian distribution Γ(t) in three dimensions with a mean of zero. At each time step the instantaneous thermal field on each spin i is given by: where k B is the Boltzmann constant, T is the system temperature, λ is the Gilbert damping parameter, γ is the absolute value of the gyromagnetic ratio, µ s is the magnitude of the atomic magnetic moment, and ∆t is the integration time step. The effective field for application in the LLG equation with Langevin Dynamics then reads: Given that for each time step three Gaussian distributed random numbers are required for every spin, efficient generation of such numbers is essential. vampire makes use the Mersenne Twister [109] uniform random number generator and the Ziggurat method [110] for generating the Gaussian distribution.
It is useful at the this point to address the applicability of the atomistic LLG to slow and fast problems respectively. In reality the thermal and magnetic fluctuations are correlated at the atomic level, arising from the dynamic interactions between the atoms and lattice/electron system. These correlations may be important in terms of the thermal fluctuations experienced by the atomistic spins. In the conventional Langevin dynamics approach described above, the noise term is completely uncorrelated in time and space. For short timescales however, the thermal fluctuations are correlated in time, and so the noise is Coloured [111]. The effect of the Coloured noise is to lessen the effect of sudden temperature changes on the magnetisation dynamics. However, the existence of ultrafast magnetisation dynamics [11,44], and that it is driven by a thermal rather than quantum mechanical process [112], requires that the effective correlation time is short, with an upper bound of around 1 fs. Given that the correlation time is close to the integration timestep, the applicability of the LLG to problems with timescales ≥ 1 fs is sound. There will be a point however where the LLG is no longer valid, where direct simulation of the microscopic damping mechanisms will be necessary. Progress has been made in linking molecular dynamics and spin models [101,113,114] which enables the simulation of spin-lattice interactions, which is particularly relevant for slow problems, such as conventional magnetic recording where switching occurs over ns timescales. However, it is also essential to consider spin-electron effects [102,115] necessary for ultrafast demagnetisation processes, although the physical origins are still currently debated [116].

Time Integration of the LLG Equation
In order to determine the time evolution of a system of spins, it is necessary to solve the stochastic LLG equation, as given by Eqs. 13 and 16, numerically. The choice of solver is limited due to the stochastic nature of the equations. Specifically, it is necessary to ensure convergence to the Stratonovich solution. This has been considered in detail by Garcia-Palacios [106], but the essential requirement [117] is that the solver enforces the conservation of the magnitude of the spin, either implicitly or by renormalisation. The most primitive integration scheme uses Euler's method, which assumes a lin- ear change in the spin direction in a single discretised time step, ∆t. An improved integration scheme, known as the Heun method [106] is commonly used, which allows the use of larger time steps because of its use of a predictor-corrector algorithm. Other more advanced integration schemes have been suggested, such as the midpoint method rule [118] and modified predictor-corrector midpoint schemes [104,119]. The principal advantage of the midpoint scheme is that the length of the spin vector is preserved during the integration which allows larger time steps to be used. However for the midpoint scheme the significant increase in computational complexity outweighs the benefits of larger time steps [119]. Modified predictor-corrector schemes [104,119] reduce the computational complexity of the midpoint scheme, but with a loss of accuracy, particularly in the time-dependent dynamics [104]. For valid integration of the stochastic LLG equation it is also necessary for the numerical scheme to converge to the Stratonovich solution [106,120]. Although proven for the midpoint and Heun numerical schemes, the validity of the predictor-corrector midpoint schemes for the stochastic LLG have yet to be confirmed. On balance the Heun scheme, despite its relative simplicity, is sufficiently computationally efficient that it is still the most widely used integration scheme for stochastic magnetisation dynamics, and so we proceed to describe its implementation in detail.
In the Heun method, the first (predictor) step calculates the new spin direction, S i , for a given effective field H i eff by performing a standard Euler integration step, given by: where The Heun scheme does not preserve the spin length and so it is essential to renormalise the spin unit vector length S i after both the predictor and corrector steps to ensure numerical stability and convergence to the Stratanovich solution. After the first step the effective field must be re-evaluated as the intermediate spin positions have now changed. It should be noted that the random thermal field does not change between steps. The second (corrector) step then uses the predicted spin position and revised effective field H i eff to calculate the final spin position, resulting in a complete integration step given by: where The predictor step of the integration is performed on every spin in the system before proceeding to evaluate the corrector step for every spin. This is then repeated many times so that the time evolution of the system can be simulated. Although the Heun scheme was derived specifically for a stochastic equation with multiplicative noise, in the absence of the noise term the Heun method reduces to a standard second order Runge-Kutta method [121]. In order to test the implementation of the Heun integration scheme, it is possible to compare the calculated result with the analytical solution for the LLG. For the simple case of a single spin initially along the x-axis in an applied field along the z-axis, the time evolution [122] is given by: The expected and simulated time evolution for a single spin with H = 10T, ∆t = 1 × 10 −15 s and λ = 0.1, 0.05 is plotted in Fig 1. Superficially the simulated and expected time evolution agree very well, with errors around 10 −6 . The error gives a characteristic trace the size and shape of which is indicative of a correct implementation of the Heun integration scheme. Ideally one would like to use the largest time step possible so as to simulate systems for the longest time. For micromagnetic simulations at zero temperature, the time step is a well defined quantity since the largest field (usually the exchange term) essentially defines the precession frequency. However, for atomistic simulations using the stochastic LLG equation with Langevin dynamics, the effective field becomes temperature dependent. The con- sequence of this is that for atomistic models the most difficult region to integrate is in the immediate vicinity of the Curie point. Errors in the integration of the system will be apparent from a non-converged value for the average magnetisation. This gives a relatively simple case which can then be used to test the stability of integration schemes for the stochastic LLG model. A plot of the mean magnetisation as a function of temperature is shown in Fig. 2 for a representative system consisting of 22 × 22 × 22 unit cells with generic material parameters of FePt with an fcc crystal structure, nearest neighbour exchange interaction of J ij = 3.0 × 10 −21 J/link and uniaxial anisotropy of 1.0 × 10 −23 J/atom. The system is first equilibrated for 10 ps at each temperature and then the mean magnetisation is calculated over a further 10 ps.
First, comparing the effect of temperature on the minimum allowable time step, the data show that for low temperatures reasonably large time steps of 1 × 10 −15 give the correct solution of the LLG equations, while near the Curie point (690 K) the deviations from the correct equilibrium value are significant. Consequently for simulations studying high temperature reversal processes time steps of 1×10 −16 s are necessary. It should be noted that the time steps which can be used are material dependent -specifically if a material with higher Curie temperature is used then the usable time steps will be correspondingly lower due to the increased exchange field.
From a practical perspective a significant advantage of the spin dynamics method is the ability to parallelise the integration system by domain decomposition, details of which are given in Appendix C. This allows the efficient simulation of relatively large systems consisting of tens or hundreds of grains or nano structures with dimensions greater than 100 nm for nanosecond timescales, with typ-ical numbers of spins in the range 10 6 − 10 8 .

Monte Carlo Methods
While spin dynamics are particularly useful for obtaining dynamic information about the magnetic properties or reversal processes for a system, they are often not the optimal method for determining the equilibrium properties, for example the temperature dependent magnetisation. The Monte Carlo Metropolis algorithm [123] provides a natural way to simulate temperature effects where dynamics are not required due to the rapid convergence to equilibrium and relative ease of implementation.
The Monte Carlo metropolis algorithm for a classical spin system proceeds as follows. First a random spin i is picked and its initial spin direction S i is changed randomly to a new trial position S i , a so-called trial move. The change in energy ∆E = E(S i ) − E(S i ) between the old and new positions is then evaluated, and the trial move is then accepted with probability by comparison with a uniform random number in the range 0-1. Probabilities greater than 1, corresponding with a reduction in energy, are accepted unconditionally. This procedure is then repeated until N trial moves have been attempted, where N is the number of spins in the complete system. Each set of N trial moves comprises a single Monte Carlo step. The nature of the trial move is important due to two requirements of any Monte Carlo algorithm: ergodicity and reversibility. Ergodicity expresses the requirement that all possible states of the system are accessible, while reversibility requires that the transition probability between two states is invariant, explicitly P (S i → S i ) = P (S i → S i ). From Eq. 22 reversibility is obvious since the probability of a spin change depends only on the initial and final energy. Ergodicity is easy to satisfy by moving the selected spin to a random position on the unit sphere, however this has an undesirable consequence at low temperatures since large deviations of spins from the collinear direction are highly improbable due to the strength of the exchange interaction. Thus at low temperatures a series of trial moves on the unit sphere will lead to most moves being rejected. Ideally a move acceptance rate of around 50% is desired, since very high and very low rates require significantly more Monte Carlo steps to reach a state representative of true thermal equilibrium.
One of the most efficient Monte Carlo algorithms for classical spin models was developed by Hinzke and Nowak [124], involving a combinational approach using a mixture of different trial moves. The principal advantage of this method is the efficient sampling of all available phase space while maintaining a reasonable trial move ac- ceptance rate. The Hinzke-Nowak method utilises three distinct types of move: spin-flip, Gaussian and random, as illustrated schematically in Fig. 3.
The spin-flip move simply reverses the direction of the spin such that S i = −S i to explicitly allow the nucleation of a switching event. The spin flip move is identical to a move in Ising spin models. It should be noted that spin flip moves do not by themselves satisfy ergodicity in the classical spin model, since states perpendicular to the initial spin direction are inaccessible. However, when used in combination with other ergodic trial moves this is quite permissible. The Gaussian trial move takes the initial spin direction and moves the spin to a point on the unit sphere in the vicinity of the initial position according to the expression where Γ is a Gaussian distributed random number and σ g is the width of a cone around the initial spin S i . After generating the trial position S i the position is normalised to yield a spin of unit length. The choice of a Gaussian distribution is deliberate since after normalisation the trial moves have a uniform sampling over the cone. The width of the cone is generally chosen to be temperature dependent and of the form The Gaussian trial move thus favors small angular changes in the spin direction at low temperatures, giving a good acceptance probability for most temperatures. The final random trial move picks a random point on the unit sphere according to which ensures ergodicity for the complete algorithm and ensures efficient sampling of the phase space at high temperatures. For each trial step one of these three trial moves is picked randomly, which in general leads to good algorithmic properties. To verify that the random sampling and Gaussian trial moves give the expected behaviour, a plot of the calculated trial moves on the unit sphere for the different algorithms is shown in Fig. 4. The important points are that the random trial move is uniform on the unit sphere, and that the Gaussian trial move is close to the initial spin direction, along the z-axis in this case.
At this point it is worthwhile considering the relative efficiencies of Monte Carlo and spin dynamics for calculating equilibrium properties. Fig. 5 shows the simulated temperature dependent magnetisation for a test system using both LLG spin dynamics and Monte Carlo methods. Agreement between the two methods is good, but the spin dynamics simulation takes around twenty times longer to compute due to the requirements of a low time step and slower convergence to equilibrium. However, Monte Carlo algorithms are notoriously difficult to parallelise, and so for larger systems LLG spin dynamic simulations are generally more efficient than Monte Carlo methods.

V. TEST SIMULATIONS
Having outlined the important theoretical and computational methods for the atomistic simulation of magnetic materials, we now proceed to detail the tests we have refined to ensure the correct implementation of the main components of the model. Such tests are particularly helpful to those wishing to implement these methods. Similar tests developed for micromagnetic packages [125] have proven an essential benchmark for the implementation of improved algorithms and codes with different capabilities but the same core functionality.

Angular variation of the Coercivity
Assuming a correct implementation of an integration scheme as described in the previous section, the first test case of interest is the correct implementation of uniaxial magnetic anisotropy. For a single spin in an applied field and at zero temperature, the behaviour of the magnetisation is essentially that of a Stoner-Wohlfarth particle, where the angular variation of the coercivity, or reversing field, is well known [126]. This test serves to verify the static solution for the LLG equation by ensuring an easy axis loop gives a coercivity of H k = 2k u /µ s as expected analytically. For this problem the Hamiltonian reads where k u is the on-site uniaxial anisotropy constant and H app is the external applied field. The spin is initialised pointing along the applied field direction, and then the LLG equation is solved for the system, until the net torque on the system S × H eff ≤ 10 −6 T, essentially a condition of local minimum energy. The field strength is decreased from saturation in steps of 0.01 H/H k and solved again until the same condition is reached. A plot of the calculated alignment of the magnetisation to the applied field (S · H app ) for different angles from the easy axis is shown in Fig. 6. The cal- culated hysteresis curve conforms exactly to the Stoner-Wohlfarth solution.

Boltzmann distribution for a single spin
To quantitatively test the thermal effects in the model and the correct implementation of the stochastic LLG or Monte Carlo integrators, the simplest case is that of the Boltzmann distribution for a single spin with anisotropy (or applied field), where the probability distribution is characteristic of the temperature and the anisotropy energy. The Boltzmann distribution is given by: where θ is the angle from the easy axis. The spin is initialised along the easy axis direction and the system is allowed to evolve for 10 8 time steps after equilibration, recording the angle of the spin to the easy axis at each time. Since the anisotropy energy is symmetric along the easy axis, the probability distribution is reflected and summed about π/2, since at low temperatures the spin is confined to the upper well (θ < π/2). Fig. 7 shows the normalised probability distribution for three reduced temperatures.
The agreement between the calculated distributions is excellent, indicating a correct implementation of the stochastic LLG equation.

Curie temperature
Tests such as the Stoner-Wohlfarth hysteresis or Boltzmann distribution are helpful in verifying the mechanical implementation of an algorithm for a single spin, but interacting systems of spins present a significant challenge in that no analytical solutions exist. Hence it is necessary to calculate some well defined macroscopic property which ensures the correct implementation of interactions in a system. The Curie temperature T c of a nanoparticle is primarily determined by the strength of the exchange interaction between spins and so makes an ideal test of the exchange interaction. As discussed previously the bulk Curie temperature is related to the exchange coupling by the mean field expression given in Eq. 9. However, for nanoparticles with a reduction in coordination number at the surface and a finite number of spins, the Curie temperature and criticality of the temperature dependent magnetisation will vary significantly with varying size [57].
To investigate the effects of finite size and reduction in surface coordination on the Curie temperature, the equilibrium magnetisation for different sizes of truncated octahedron nanoparticles was calculated as a function of temperature. The Hamiltonian for the simulated system is where J ij = 5.6 × 10 −21 J/link, and the crystal structure is face-centred-cubic, which is believed to be representative of Cobalt nanoparticles. Given the relative strength of the exchange interaction, anisotropy generally has a negligible impact on the Curie temperature of a material, and so the omission of anisotropy from the Hamiltonian is purely for simplicity. The system is simulated using the Monte Carlo method with 10,000 equilibration and 20,000 averaging steps. The system is heated sequentially in 10K steps, with the final state of the previous temperature taken as the starting point of the next temperature to minimise the number of steps required to reach thermal equilibrium. The mean temperature dependent magnetisation for different particle sizes is plotted in Fig. 8.
From Eq. 9 the expected Curie temperature is 1282K, which is in agreement with the results for the 10 nm diameter nanoparticle. For smaller particle sizes the magnetic behaviour close to the Curie temperature loses its criticality, making T C difficult to determine. Traditionally the Curie point is taken as the maximum of the gradient dm/dT [57], however this significantly underestimates the actual temperature at which magnetic order is lost (which is, by definition, the Curie temperature). Other estimates of the Curie point such as the divergence in the susceptibility are probably a better estimate for finite systems, but this is beyond the scope of the present article. Another effect visible for very small particle sizes is the appearance of a magnetisation above the Curie point, an effect first reported by Binder [127]. This arises from local moment correlations which exist above T C . It is an effect only observable in nanoparticles where the system size is close to the magnetic correlation length.

Demagnetising fields
For systems larger than the single domain limit [33] and systems which have one dimension significantly different from another, the demagnetisation field can have a dominant effect on the macroscopic magnetic properties. In micromagnetic formalisms implemented in software packages such as oommf[37], magpar [38] and nmag [39], the calculation of the demagnetisation fields is calculated ac-curately due to the routine simulation of large systems where such fields dominate. Due to the long-ranged interaction the calculation of the demagnetisation field generally dominates the compute time and so computational methods such as the fast-fourier-transform [128,129] and multipole expansion [130] have been developed to accelerate their calculation.
In large-scale atomistic calculations, it is generally sufficient to adopt a micromagnetic discretisation for the demagnetisation fields, since they only have a significant effect on nanometer length scales [7]. Additionally due to the generally slow variation of magnetisation, the timescales associated with the changes in the demagnetisation field are typically much longer than the time step for atomistic spins. Here we present a modified finite difference scheme for calculating the demagnetisation fields, described as follows.
The complete system is first discretised into macrocells with a fixed cell size, each consisting of a number of atoms, as shown in Fig. 9(a). The cell size is freely adjustable from atomistic resolution to multiple unit cells depending on the accuracy required. The position of each macrocell p mc is determined from the magnetic 'centre of mass' given by the expression where n is the number of atoms in the macrocell and α represents the spatial dimension x, y, z. For a magnetic material with the same magnetic moment at each site, Eq. 29 corrects for partial occupation of a macrocell by using the mean atomic position as the origin of the macrocell dipole, as shown in Fig. 9(b). For a sample consisting of two materials with different atomic moments, the 'magnetic centre of mass' is closer to the atoms with the higher atomic moments, as shown in Fig. 9(c). This modified micromagnetic scheme gives a good approximation of the demagnetisation field without having to use computationally costly atomistic resolution calculation of the demagnetisation field.
The total moment in each macrocell m mc is calculated from the vector sum of the atomic moments within each cell, given by Depending on the particulars of the system, the macrocell moments can vary significantly depending on position, composition and temperature. At elevated temperatures close to the Curie point, the macrocell magnetisation becomes small, and so the effects of the demagnetising field are much less important. Similarly in compensated ferrimagnets consisting of two competing sublattices the overall macrocell demagnetisation can also be small again leading to a reduced influence of the demagnetising field. The demagnetisation field within each macrocell p is  9. a). Visualisation of the macrocell approach used to calculate the demagnetisation field, with the system discretised into cubic macrocells. Each macrocell consists of several atoms, shown schematically as cones. b). Schematic of the macrocell discretisation at the curved surface of a material, indicated by the dashed line. The mean position of the atoms within the macrocell defines the centre of mass where the effective macrocell dipole is located. c). Schematic of a macrocell consisting of two materials with different atomic moments. Since the magnetisation is dominated by one material, the magnetic centre of mass moves closer to the material with the higher atomic moments. (Colour online.) given by where r is the separation between dipoles p and q,r is a unit vector in the direction p → q, and V p mc is the volume of the macrocell p. The first term in Eq. 31 is the usual dipole term arising from all other macrocells in the system, while the second term is the self-demagnetisation field of the macrocell, taken here as having a demagnetisation factor 1 / 3 . Strictly this is applicable only for the field at the centre of a cube. However, the nonuniformity of the field inside a uniformly magnetised cube is not large and the assumption of a uniform demag-netisation field is a reasonable approximation. The selfdemagnetisation term is often neglected in the literature, but in fact is essential when calculating the field inside a magnetic material. Once the demagnetisation field is calculated for each macrocell, this is applied uniformly to all atoms as an effective field within the macrocell. It should be noted however that the macrocell size cannot be larger than the smallest sample dimension, otherwise significant errors in the calculation of the demagnetising field will be incurred.
The volume of the macrocell V mc is an effective volume determined from the number of atoms in each cell and given by where n a mc is the number of atoms in the macrocell, n a uc is the number of atoms in the unit cell and V uc is the volume of the unit cell. The macrocell volume is necessary to determine the magnetisation (moment per volume) in the macrocell. For unit cells much smaller than the system size, Eq. 32 is a good approximation, however for a large unit cell with significant free space, for example a nanoparticle in vacuum, the free space contributes to the effective volume which reduces the effective macrocell volume.

Demagnetising field of a platelet
To verify the implementation of the demagnetisation field calculation it is necessary to compare the calculated fields with some analytic solution. Due to the complexity of demagnetisation fields analytical solutions are only available for simple geometric shapes such cubes and cylinders [131], however for an infinite perpendicularly magnetised platelet the demagnetisation field approaches the magnetic saturation −µ 0 M . To test this limit we have calculated the demagnetising field of a 20 nm × 20 nm × 1 nm platelet as shown in Fig. 10. In the centre of the film agreement with the analytic value is good, while at the edges the demagnetisation field is reduced as expected.

Performance characteristics
In micromagnetic simulations, calculation of the demagnetisation field usually dominates the runtime of the code and generally it is preferable to have as large a cell size as possible. For atomistic calculations however, additional flexibility in the frequency of updates of the demagnetisation field is permitted due to the short time steps used and the fact that the magnetisation is generally a slowly varying property.
To investigate the effects of different macrocell sizes and the time taken between updates of the demagnetisa-  . Fig 11(a) shows hysteresis loops calculated for different macrocell sizes for the nanodot. For most cell sizes the results of the calculation agree quite well, however, for a cell size of 4 unit cells, the calculated coercivity is significantly larger, owing to the creation of a flat macrocell (with dimensions 4×4×1 unit cells). This illustrates that for systems with small dimensions, extra care must be taken when specifying the macro cell size in order to avoid non-cubic cells. In general, the problem with asymmetric macrocells is not trivial to solve within the finite difference formalism, since the problem arises due to a modification of both the intracell and intercell contributions to the demagnetising field. Fig 11(b) shows the runtime for a single update of the demagnetising field on a single CPU for different macrocell size discretisations. Noting the logarithmic scale for the simulation time, single unit cell discretisations are computationally costly while not giving significantly better results than larger macrocell discretisations. Although the demagnetisation field calculation is an n 2 mc problem, it is possible to pre-calculate the distances between the macrocells at the cost of increased memory usage. Due to the computational cost of calculating the position vectors, this method is often much faster than the brute force calculation. However, due to the fact that memory usage increases proportionally to n 2 mc , fine discretisations for large systems can require many GBs of memory.
By collating terms in Eq. 31 it is possible to construct the following matrix M pq for each pairwise interaction: (3r x r x − 1)/r 3 pq − 1/3 3r x r y 3r x r z 3r x r y (3r y r y − 1)/r 3 pq − 1/3 3r y r z 3r x r z 3r y r z (3r z r z − 1)/r 3 pq − 1/3) where r x , r y , r z are the components of the unit vector in the direction p → q, and r pq is the separation of macrocells. Since the matrix is symmetric along the diagonal only 6 numbers need to be stored in memory. The total demagnetisation field for each macrocell p is then given by: The relative performance of the matrix optimisation is plotted for comparison in Fig 11(b), showing a significant reduction in runtime. Where the computer memory is sufficiently large, the recalculated matrix should always be employed for optimal performance. In addition to variable macrocell sizes, due to the small time steps employed in atomistic models and that the magnetisation is generally a slowly varying property, it is not always necessary to update the demagnetisation fields every single time step. Hysteresis loops for differ-ent times between updates of the demagnetisation field are plotted in Fig 11(c). In general hysteresis calculations are sufficiently accurate with a picosecond update of the demagnetising field, which significantly reduces the computational cost.
In general good accuracy for the demagnetising field calculation can be achieved with coarse discretisation and infrequent updates, but fast dynamics such as those induced by laser excitation require much faster updates, or simulation of domain wall processes in high anisotropy materials requires finer discretisations to achieve correct results.

Demagnetising field in a prolate ellipsoid
Since the macrocell approach works well in platelets and nanodots, it is also interesting to apply the same method to a slightly more complex system: a prolate ellipsoid. An ellipsoid adds an effective shape anisotropy due to the demagnetisation field, and so for a particle with uniaxial magnetocrystalline anisotropy along the elongated direction (z), the calculated coercivity should increase according to the difference in the demagnetisation field along x and z, given by: where ∆N = N z − N x . The demagnetising factors N x , N y , and N z are known analytically for various ellipticities [132], and here we assume c/a = b/a = 0.5, where a, b, and c are the extent of the ellipsoid along z, x and y respectively. To verify the macrocell approach gives the same expected increase of the coercivity we have simulated a generic ferromagnet with atomic moment 1.5 µ B , an FCC crystal structure with lattice spacing 3.54Åand anisotropy field of H a = 1T. The particle is cut from the lattice in the shape of an ellipsoid, of diameter 10 nm and height of 20 nm, as shown inset in Fig 12. A macrocell size of 2 unit cells is used, which is updated every 100 time steps (0.1 ps).
As expected the coercivity increases due to the shape anisotropy. From Ref. 132 the expected increase in the coercivity is H shape dm = 0.37 T which compares well to the simulated increase of 0.33 T.

VI. PARALLEL IMPLEMENTATION AND SCALING
Although the algorithms and methods discussed in the preceding sections describe the mechanics of atomistic spin models, it is important to note finally the importance of parallel processing in simulating realistic systems which include many-particle interactions, or nano patterned elements with large lateral sizes. Details of the parallelisation strategies which have been adopted to enable the optimum performance of vampire for different problems are presented in Appendix C. In general terms the parallelisation works by subdividing the simulated system into sections, with each processor simulating part of the complete system. Spin orientations at the processor boundaries have to be exchanged with neighbouring processors to calculate the exchange interactions, which for small problems and large numbers of processors can significantly reduce the parallel efficiency. The use of latency hiding, where the local spins are calculated in parallel with the inter-processor communications, is essential to ensure good scaling for these problems.
To demonstrate the performance and scalability of vampire , we have performed tests for three different system sizes: small (10628 spins), medium (8 × 10 5 spins), and large (8×10 6 spins). We have access to two beowulfclass clusters; one with 8 cores/node with an Infiniband 10 Gbps low-latency interconnect, and another with 4 cores/node with a Gigabit Ethernet interconnect. For parallel simulations the interconnect between the nodes can be a limiting factor for increasing performance with increasing numbers of processors, since as more processors are added, each has to do less work per time step. Eventually network communication will dominate the calculation since processors with small amounts of work require the data from other processors in shorter times, leading to a drop in performance. The scaling performance of the code for 100,000 time steps on both machines is presented in Fig. 13.
The most challenging case for parallelisation is the small system size, since a significant fraction of the system must be communicated to other processors during each timestep. On the Ethernet network system for the smallest system size reasonable scaling is seen only for 4 CPUs due to the high latency of the network. However larger problems are much less sensitive to network latency do to latency hiding, and show excellent scalability up to 32 CPUs. Essentially this means that larger problems scale much better than small ones, allowing more processors to be utilised. This is of course well known for parallel scaling problems, but even relatively modest systems consisting of 10 5 spins show significant improvements with more processors.
For the system with the low-latency Infiniband network, excellent scalability is seen for all problems up to 64 CPUs. Beyond 64 CPUs the reduced scalability for all problem sizes is likely due to a lack of network bandwidth. The bandwidth requirements are similar for all problem sizes, since smaller problems complete more time steps in a given period of time and so have to send several sets of data to other processors. Nevertheless improved performance is seen with increasing numbers of CPUs allowing for continued reductions in compute time. Although not shown, initial tests on an IBM Blue Gene class system have demonstrated excellent scaling of vampire up to 16,000 CPUs, allowing the real possibility for atomistic simulations with lateral dimensions of micrometers. Additional scaling tests for systems including calculation of the demagnetising field and a long-ranged exchange interaction are presented in Appendix C.

VII. CONCLUSIONS AND PERSPECTIVES
We have described the physical basis of the rapidly developing field of atomistic spin models, and given examples via its implementation in the form of the vampire code. Although the basic formalism underpinning atomistic spin models is well established, ongoing developments in magnetic materials and devices means that new approaches will need to be developed to simulate a wider range of physical effects at the atomistic scale. One of the most important phenomena is spin transport and magnetoresistance which is behind an emergent field of spin-electronics, or spintronics. Simulation of spin transport and spin torque switching is already in development, and must be included in atomistic level models in order to simulation of a wide range of spintronic materials and devices, including read sensors and MRAM (magnetic random access memory). Other areas of interest include ferroelectrics, the spin Seebeck effect [133], and Coloured noise [111] where simulation capabilities are desirable, and incorporation of these effects are planned in future. In addition to modelling known physical effects, it is hoped that improved models of damping incorporating phononic and electronic mechanisms will be developed which enable the study of magnetic properties of materials at sub-femtosecond timescales.
The ability of atomistic models to incorporate magnetic parameters from density functional theory calculations is a powerful combination which allows complex systems such as alloys, surfaces and defects to be accurately modelled. This multiscale approach is essential to relate microscopic quantum mechanical effects to a macroscopic length scale accessible to experiment. Such a multiscale approach leads to the possibility of simulation driven technological development, where the magnetic properties of a complete device can be predicted and optimised through a detailed understanding of the underlying physics. Due to the potential of multiscale simulations, it is planned in future to develop links to existing DFT codes such as castep [75,76] to allow easier integration of DFT parameters and atomistic spin models.
The computational methods presented here provide a sound basis for atomistic simulation of magnetic materials, but further improvements in both algorithmic and computational efficiency are of course likely. One area of potential computational improvement is GPGPU (general purpose graphics processing unit) computation, which utilises the massively parallel nature of GPUs to accelerate simulations, with speed ups over a single CPU of 75 times routinely reported. With several supercomputers moving to heterogenous computing architectures utilising both CPUs and GPUs, supporting GPGPU computation is likely to be important in future, and an implementation in our vampire code is currently planned. In terms of algorithmic improvements it should be noted that the Heun numerical scheme although simple is relatively primitive by modern standards, and moving to a midpoint scheme may allow for larger time steps than currently to be used.
With the continuing improvements in computer power, atomistic simulations have become a viable option for the routine simulation of magnetic materials. With the increasing complexity of devices and material properties, atomistic models are a significant and important development. While micromagnetic models remain a powerful tool for the simulation and design of devices, the limitations of the (continuum) micromagnetic formalism are increasingly exposed by its failure to deal with the complex physics of elevated temperatures, ultrafast magnetisation processes and interfaces. While micromagnetics will remain the computational model of choice for large scale and coarse-grained applications, the ability to accurately model the effects of microscopic details, temperature effects and ultrafast dynamics make atomistic models an essential tool to understand the physics of magnetic materials at the forefront of of the field. In addition to implementing the necessary computational methods for magnetic atomistic calculations, it is also important to provide a framework structure for the code, where new additions in the form of features or improvements can be made with minimal disturbance to other sections of the code. Equally important for intensive computational problems is ensuring high performance of the code so that simulations proceed as rapidly as possible.
In vampire this is achieved through hybrid coding using a mixture of object-oriented and functional programming styles. Object-oriented programming is widely used in modern software projects as a level of abstraction around the data, or objects, in the code. This abstraction makes it easy to store information about an element, for example an atom, as a single unified data type, known as a class. One significant caveat with object-oriented code is that it is generally hard to optimise for maximum performance. High performance codes generally utilise a different coding approach known as functional programming, where the focus is on functions which operate on numerous data sets. However the organisation of data into large blocks in functional programming generally makes it harder to organise the data. vampire therefore makes use of both methodologies to gain the benefits of object-oriented design during the initialisation phase, while for the performance-critical parts of the code the data is re-arranged to use a functional style for maximum performance. Due to the requirements of high performance, object-oriented design and parallelisation, the C++ programming language was chosen for all parts of the code. The popularity of the C++ language also allows for easy future integration of other libraries, such as nvidia's cuda framework for utilising graphics processing units. For portability the code also has a minimal dependence on external libraries and also conforms to the published standard allowing simple compilation on the widest possible variety of computer architectures.
In addition to the low-level structure described in terms of object-oriented and functional programming styles, the code is also designed in a modular fashion so that most of the mechanistic operations (such as the parallelisation and data analysis) are separated from high level functions which control the various simulation types. This enables users to easily add new simulation types or physical effects to the code, without having to be concerned with the inner workings.

Appendix B: Atomistic system generation in VAMPIRE
vampire has a number of dedicated functions for generating atomic system within the nearest neighbour approximation. The principal advantage of the nearest neighbour approximation is its simplicity and ability to consider a wide range of physical effects such as finite size, surfaces, ferri and antiferromagnets, disordered systems etc. with relative ease. vampire also includes inbuilt particle structures to enable generation of systems with simple geometric shapes such as spheres, cylinders, truncated octahedra and cubes.
The first step is to generate a crystal lattice of the desired type and dimensions sufficiently large to incorporate the complete simulated system. For the nearest neighbour approximation the Hamiltonian is generally only well defined for a single unified crystal structure, and therefore such generic simulations require a single crystal from which the system is cut. More complex structures are readily simulated, however the user must define the complete Hamiltonian for the system, taking into account the realistic interfaces between different crystals. vampire uses the unit cell as the essential building block of the atomic structure, since the exchange interactions of atoms between neighbouring unit cells are known before the structure is generated. The global crystal is generated by replicating the basic unit cell on a grid in x,y and z. This bare crystal structure is then cut into the desired geometry, for example a single nanoparticle, voronoi granular structure, or a user defined 2D geometry by removing atoms from the complete generated crystal. Atoms within this geometry are then assigned to one or more materials as desired (each material having different magnetic properties such as atomic spin moments, anisotropy or exchange interactions), generating the complete atomic system. The assignment of different parts of the system to different materials enables the easy simulation of multilayers and core-shell nanoparticles, as well as combinations of these for systems such as multilayer magnetic recording media. As an example, Fig. 14 shows a visualisation of a multilayer magnetic recording media generated using vampire .
Once the structure is defined the exchange interactions for all atoms in the are calculated from a list of nearest neighbour interactions for the defined unit cell. Since each cell on the grid contains a fixed number of atoms, and the exchange interactions of those atoms with other neighbouring cells is known relative to the local cell, the interaction list is trivial to generate. For computational efficiency the final interaction list is then stored as a 1D linked list.

Appendix C: Parallelisation strategies
A consistent trend among computers today is the drive towards parallel architectures, designed to improve overall performance in a consistent and scalable fashion. The downside of this approach is that the software must be specially modified to take advantage of the hardware, which still presents a significant challenge. In order to make the best use of parallel computers, we have adopted a number of distinct parallelisation strategies. This approach means that for any given problem, an optimal strategy can be utilised to achieve maximum performance.

Statistical Parallelism
The most trivial form of parallelism is batch or statistical, where the statistical properties of a system are determined by a series of independent calculations, each of which can be run in parallel. Statistical parallelism has the prime advantage that the division of work leads to an ideal scaling behaviour, since each of the runs are entirely independent and require no intercommunication.
In magnetic simulations, the most common applications of statistical parallelism are sweeps of the parameter space for a particular system, or in determining thermodynamic averages. For the former, a given system is calculated for different values of key parameters, for example, anisotropy or exchange constants; for the latter, the same system is simulated, but each run is given a different seed for the random number generator. This leads to a different thermodynamic evolution, which can provide information about the statistical behaviour of the system. It should be noted that the correct seeding of the FIG. 14. Visualisation of a magnetic recording medium generated using vampire . The medium consists of magnetically hard and soft layers with interfacial mixing of atoms between the layers. The material is granular in nature, and so a voronoi tessellation as overlaid on top the layers to form the isolated magnetic grains. Finally, a dilute intermixing layer is applied between the grains representing the diffusion of magnetic atoms into the SiO2 between the grains, as seen in real media. (Colour online.) random number generator, where a number of uncorrelated random number sequences are generated, is quite complex [109]. For magnetic simulations, the chaotic nature of the system, whereby a small change in the time evolution rapidly leads to a significantly different result, means that crude sequential number seeding is quite satisfactory.

Geometric Decomposition
Although statistical parallelism is useful for some types of simulation, it has one significant limitation: it can only be applied to relatively small systems, as the entire problem must be solved on a single processor. For larger systems it is necessary to divide the system into smaller parts for parallel execution. The most efficient method for such parallelisation is generally geometric decomposition, where the space is divided into cells, and each processor is assigned a cell to simulate. If well implemented, geometric decomposition can be scaled to run on thousands of CPUs, and this is one of the key aims of our implementation.
The starting point for geometric decomposition is efficiently dividing the space to run on N CPU CPUs. In order to achieve this, we have devised an algorithm which takes into account the physical system dimensions and which searches for a solution where n x · n y · n z = N CPU (C1) while minimising the surface to volume ratio. If the dimensions of the overall system are given by l x ,l y , and l z , then the volume of each cell is: and the surface area of each cell is: The surface to volume ratio is then given by: It is clear that the minimum in the surface to volume ratio occurs for n α /l α = 1 for all three dimensions, essentially showing that longer dimensions parallelise better with more CPUs. Given that the dimensions of the system are fixed, the only free variables are the number of CPUs in each dimension, n x , n y , and n z . These are further constrained by equation C1. In order to find the optimal solution for a given number of CPUs, the starting point is n α = lx √ NCPU . Exact solutions for n x , n y and n z are then searched for and the one with the lowest surface to volume ratio is selected. This approach is very flexible and allows for efficient decomposition for any number of CPUs. The only problematic solution is for prime numbers of CPUs, where only one exact solution exists, though this is a rare occurrence for large numbers of CPUs. A visualisation of a cubic system decomposed into 48 blocks is shown in Fig. 15.
Having decomposed the system, each CPU is allocated a cell which defines its own spatial domain. In order to maintain maximum scalability, each CPU generates its own portion of the complete system, and all associated data. This has the advantage of minimising the memory footprint and also parallelising the system creation, which can become a significant bottleneck for very large numbers of processors. Once the local atoms have been generated it is necessary to know which atoms on remote processors (halo atoms) are potentially interacting with atoms on the local processor (boundary atoms), as well as which atoms are interacting locally only(core atoms). This essentially defines three distinct regions, as shown schematically in Fig. 16. The maximum interaction range of the atoms is known globally, and so provided all atoms in this range are included, generation of the neighbour list is trivially the same as the serial case. In practice this is implemented by a global broadcast of each processor's domain, i.e., which regions of space are 'owned' by each processor. Each processor then looks at each atom in its boundary region, and then dispatches a copy of the atom to the appropriate neighbouring processors. This method has the advantage that it is quite general, and can be applied to any decomposition method, not necessarily cubes. At this point parallel periodic boundary conditions are easily implemented in the same manner, by copying atoms at the edge of the system to the desired processors. Once all boundary atoms have been sent, and all halo atoms have been received, the neighbour list is generated in the usual fashion with a linked-cell algorithm. After the actual neighbour list has been generated, it is likely that some of the copied halo atoms are in fact not needed, and so these atoms are deleted. Similarly some atoms in the in the boundary region may not interact with the halo, and these atoms are re-assigned to the core region. Following this book-keeping exercise, parallel simulation of the system can begin.
The method we have adopted for parallel simulation of the system makes use of latency-hiding, where requests for data from other processors are made prior to a locally compute-intensive period, after which the requested data should have arrived. Such latency-hiding is an important consideration when running the code on many processors. In practice atoms on each processor are ordered according to their interaction classification, ie: core; boundary; and finally halo atoms. The integration of the system proceeds as follows: • A request is made for all halo data from other processors • The core region is then integrated • If the halo data has not arrived, then wait for it • Integrate the boundary region

• Global synchronisation
The parallel integration is repeated the desired number of times during the simulation.

Replicated Data
For continuous systems, geometric decomposition provides an efficient way of parallelisation of the calculation. However, for sparse systems geometric decomposition can be inefficient due to poor load balancing, where some processors have many more atoms than others. This means some processors spend a significant amount of time waiting for others to complete the integration step, leading to a reduction in scalability. In magnetism, such systems are typically granular, consisting of a small number of grains. One solution to sparse systems is to utilise a replicated data approach, where each processor has a complete copy of all data, similar to the statistical parallelism method. Each processor then simulates (1/n cpu ) th of the total system, without any constraints on spatial locality. The atoms are classified in a similar way to the geometric decomposition approach, as core, boundary, and halo, and integrated in exactly the same way.
The principal disadvantages of the replicated data approach are the increased memory footprint (each processor must generate a complete copy of the system), and the tendency to have a high proportion of boundary spins. The latter can be mitigated by re-ordering spins in memory to ensure some degree of spatial locality. For granular systems this is fact trivial, since the assignment of each spin to a grain provides the necessary geometric information, and so the spins are ordered by a unique grain identification, which is also spatially correlated. In addition to its use in simulation of sparse systems, the replicated data approach is also the strategy adopted for the parallel calculation of the long-ranged demagnetisation fields. The method for the parallel code is identical to that described earlier, where the spins are allocated to macrocells which then interact with each other. Since the calculation of the demagnetisation field in each cell requires knowledge of all other cells, replicated data is the logical choice for the parallelisation. The demagnetisation field calculation proceeds as follows: • The macrocell moments are initialised to zero on all processors • Each processor determines the contribution of its spins to each macrocell • The macrocell moments are summed globally, so that each processor has a complete copy of the macrocell magnetisations • Each processor calculates the demagnetisation field only for macrocells which contain local spins • The local demagnetisation field on each spin is determined from its macrocell demagnetisation field This approach leads to excellent scaling, as for reasonable macrocell sizes the communication costs are minimal, and for 1 macrocell per processor the method scales linearly with n cpu . Fine macrocell discretisations ( 27 unit cells/macrocell) can lead to significant memory and computation costs, but in general this is unnecessary for most atomistic scale calculations.

Additional scaling tests
vampire has already been shown to scale well for a generic system with nearest neighbour exchange interactions, but in order to verify the general usefulness of the parallelisation we have also considered an extended system including the demagnetisation field calculation, and a system using ab initio parameters for FePt [29] which includes a long-ranged exchange interaction extending over several unit cells, as shown in Fig 17. The simulations in Fig. 17(a) including the demagnetising fields use a 2 × 2 unit cell macrocell size, updated every 10 steps, for system sizes of approximately 10 4 , 10 5 and10 6 spins respectively. The high spatial and temporal resolution are in some sense a worst case scenario as these are probably not needed for most problems, and so for these simulations, calculation of the demagnetising fields dominates the run time. Nevertheless, the scaling of the code remains very good, showing the effectiveness of the parallelisation of the demagnetising field calculation. For small system sizes the scaling breaks down as the number of macrocells approaches the number of processors. Here the scaling is limited by the time required to update all of the cell magnetisations and the time required to calculate the contributions of the reduced number of local macrocells. For all problem sizes scaling begins to reduce somewhat due to limitations in the network bandwidth. For the long-ranged exchange interactions shown in Fig. 17(b) the scaling is only good within a single node (up to 8 processors). Above that, the larger the system the worse the scaling. This arises due to the large amount of data which has to be shared between processors. For the long-ranged exchange interaction, each atom to be simulated must know the spin configurations of over 1,000 neighbouring atoms. In a parallel simulation these spin directions must be passed between processors twice per time step, which is a bandwidth intensive operation. Thus, the reduced scaling, particularly with arguer system size, is due to saturation of the network link. Due to the long range nature of the exchange interaction, memory use also becomes an issue in terms of storing the neighbour list. For the largest simulation size of around 1.6 × 10 6 atoms,33GB of RAM is required to store all the interactions. However, due to the parallel system generation this divides nicely between all of the processors, so the memory required per processor is quite reasonable for larger numbers of processors.