turboMagnon -- A code for the simulation of spin-wave spectra using the Liouville-Lanczos approach to time-dependent density-functional perturbation theory

We introduce turboMagnon, an implementation of the Liouville-Lanczos approach to linearized time-dependent density-functional theory, designed to simulate spin-wave spectra in solid-state materials. The code is based on the noncollinear spin-polarized framework and the self-consistent inclusion of spin-orbit coupling that allow to model complex magnetic excitations. The spin susceptibility matrix is computed using the Lanczos recursion algorithm that is implemented in two flavors - the non-Hermitian and the pseudo-Hermitian one. turboMagnon is open-source software distributed under the terms of the GPL as a component of Quantum ESPRESSO. As with other components, turboMagnon is optimized to run on massively parallel architectures using native mathematical libraries (LAPACK and FFTW) and a hierarchy of custom parallelization layers built on top of MPI. The effectiveness of the code is showcased by computing magnon dispersions for the CrI$_3$ monolayer, and the importance of the spin-orbit coupling is discussed.


Introduction
The characterization of magnetic excitations at the atomistic level has become possible in the last 50 years thanks to the development and continuous refinement of magnetic spectroscopies, most notably inelastic neutron scattering spectroscopy (INSS) for bulk materials [1], spin-polarized electron energy loss spectroscopy (SPEELS) and inelastic scanning tunneling spectroscopy for thin films [2,3]. Moreover, a promising recent development is represented by resonant inelastic x-ray scattering (RIXS) spectroscopy, which has been shown able to probe magnetic excitations both in bulk [4][5][6] and thin crystals [7,8]. The interpretation of these spectroscopies usually relies on model Hamiltonians, possibly in conjunction with ground-state ab initio calculations to fit the parameters appearing therein [9][10][11][12]. These treatments, however, are formally justified only in the limit of localized magnetic moments and cannot capture most of the features of itinerant magnetic systems. Even in the former case, they face difficulties in complying with the growing number of parameters when more exchange paths are at play or in the presence of magnetic anisotropies, such as the ones induced by spin-orbit coupling. An alternative route is represented by a fully ab initio treatment of spin dynamics, which can be pursued via the computation of the dynamical spin susceptibility from either time-dependent density-functional theory (TDDFT) [13][14][15][16][17][18][19][20][21] or many-body perturbation theory (MBPT) [22][23][24][25][26]. Both these methods treat charge and spin fluctuations on an equal footing in a self-consistent manner and they are formally exact, though in practice they rely on different approximations and have different computational requirements. TDDFT is numerically way less demanding than MBPT, particularly when adopting the adiabatic local spin density approximation (ALSDA), which often results in a good compromise between computational cost and accuracy [27,28] and has in fact been widely adopted for modeling magnetic excitations. Previous attempts to compute magnon dispersion relations from linear-response TDDFT were based on either the solution of the time-dependent Sternheimer equation [13,19] or of the Dyson equation for the spin susceptibility, starting from the independent-electron spin and charge susceptibilities [14][15][16][17][18]21]. In all these approaches the linear-response problem must be solved for every individual value of the excitation frequency, which is one of the main computational bottlenecks to be addressed and overcome in this work. We remark here that, more recently, magnetic excitations were computed beyond the linear response regime from TDDFT via a real-time propagation technique [20].
In this paper we introduce a computer code, named turboMagnon , which implements the Liouville-Lanczos (LL) approach to INSS and SPEELS spectroscopies within time-dependent density-functional perturbation theory (TDDFpT) [29][30][31]. This allows us to treat the dynamical spin-fluctuation response of magnetic systems in a fully noncollinear framework, and thus model their spin-wave excitation spectra including spin-orbit-coupling effects entirely from first principles. By using techniques borrowed from static density-functional perturbation theory (DFpT) [32,33] and similarly to the method of Ref. [19], our method avoids computing independent-particle susceptibilities, and thus does not require computationally expensive and slowly converging sums over empty states. At variance with previous studies, in the case of adiabatic TDDFT kernels our method also avoids repeated linear-response calculations for each individual excitation frequency, by using a recursive Lanczos algorithm to obtain a tridiagonal matrix representation of the Liouvillian superoperator. The actual spectrum is then computed in an inexpensive post-processing step for any desired frequency. This allows us to obtain the full spectrum of magnetic excitations (both magnons and Stoner excitations) in any frequency range with three Lanczos chains per excitation wave-number in the most general case, where the full 3 × 3 spin susceptibility matrix is needed. turboMagnon has a similar structure as the turboTDDFT and turboEELS codes used to compute absorption spectra in molecular systems [34,35] and electron energy loss spectra in periodic solids [31,[36][37][38]. All these two codes share a large number of linear response routines which are gathered in the LR Modules repository of Quantum ESPRESSO. turboMagnon is distributed under the terms of the GPL license [39], as a component of the Quantum ESPRESSO suite of open-source codes based on plane-wave basis sets, pseudopotentials, and using periodic boundary conditions [40][41][42]. This paper is organized as follows. In Sec. 2 we provide a theoretical background for the LL approach to TDDFpT for magnetic excitations. In Sec. 3 we describe turboMagnon and the calculation workflow for computing magnons. In Sec. 4 we provide the instructions for installing turboMagnon and discuss various levels of parallelization that can be used. In Sec. 5 we show how to use turboMagnon for computing magnon dispersions in the CrI 3 monolayer in the noncollinear framework with or without spin-orbit coupling, how to check the sum rules, and also we discuss the scaling of the code. Finally, Sec. 6 contains conclusions and perspectives for future work.

Statement of the problem
In INSS experiments a neutron beam with wave-vector k i and energy E i impinges on the target sample. Due to inelastic scattering, the outgoing neutron will be characterized by the wave-vector k f = k i − q and energy E f = E i − ω, where q and ω are the momentum and energy transferred to the sample, respectively. In the first Born approximation [43,44], the double-differential cross section corresponding to magnetic excitations of electrons can be written in the compact form as: where S(q, ω) = −Im Tr P ⊥ (q) χ(q, q; ω) .
Here, −e and g n ≈ 3.826 are the electron charge and the neutron g-factor, respectively, P ⊥ (q) is the 3 × 3 matrix, P ⊥ αβ (q) = δ αβ − q α q β /q 2 (with α, β = x, y, z), which is a projector on to the plane perpendicular to the direction of q, and χ(q, q; ω) is the 3 × 3 spin susceptibility matrix (see, e.g., Sec. 5.7 in Ref. [45]). The poles of S(q, ω) occur at frequencies of magnons and Stoner excitations. This quantity is accessible from linear-response theory, and in the following we will show how it can be computed efficiently using the LL approach to TDDFpT. Hereafter, Hartree atomic units will be used, and the formalism is presented for insulating systems while the generalization to metals can be found in Appendix B of Ref. [46] and in Ref. [47].

Time-dependent density-functional perturbation theory
Magnetic excitations in solids can be modeled using the time-dependent Pauli-type Kohn-Sham (KS) equations of TDDFT that read: where n and k are the electronic band index and the crystal momentum, respectively, Ψ n,k (r, t) are the twocomponent time-dependent KS spinor wave functions, andĤ(t) is the time-dependent Hamiltonian operator of the system [48]. It is often convenient to use perturbation theory to first order within TDDFT (i.e. TDDFpT) what is also known as the linear-response TDDFT. Moreover, Eq. (3) is often solved in the frequency domain rather than in the time domain. Therefore, it can be shown that but linearizing Eq. (3) and by making a Fourier transform from the time domain to the frequency domain we can obtain the following set of the so-called resonant and anti-resonant Pauli-type linear-response KS (Sternheimer) equations [46,47]: All the quantities appearing in the equations above are defined in the following. We want to note in passing that a step-by-step derivation of the equations (4) and (5) starting from the ground-state problem is presented in Ref. [46]; here we are presenting briefly the final results. In equations (4) and (5) we are dealing with the latticeperiodic quantum-mechanical operators and functions. In particular, we have used the following definitions for the ground-state and linear-response KS spinor wave functions: Ψ • n,k (r) = e ik·r U • n,k (r)/ √ N k andΨ n,k+q (r, ω) = e i(k+q)·rŨ n,k+q (r, ω)/ √ N k , where q is the wavevector of the perturbation, N k is the number of k points in the Brillouin zone (BZ), U • n,k (r) andŨ n,k+q (r, ω) are the lattice-periodic functions. Equations (4) and (5) must be solved for a specific perturbing potential, that in the cases of spin waves is defined asV ext,q (ω) = −µ B σ ·b ext,q (ω), where µ B is the Bohr magneton, σ = (σ x , σ y , σ z ) is the vector of Pauli matrices, andb ext,q (ω) is the lattice-periodic operator of the external magnetic field [33]. Equation (5) was obtained from Eq. (4) by changing the sign of ω, k, and q, and by applying the time-reversal operatorT = iσ yK , whereK is the complex-conjugation operator. The operatorĤ • k has the real-space representation H • k (r, r ) = e −ik·r H • (r, r )e ik·r , and we defined for conveniencê whereĤ • is the Hamiltonian operator of the unperturbed system. The definition ofĤ • k+q is as follows [46]: where σ • is the unit 2 × 2 matrix, the first term in Eq. (6) is the kinetic energy,v • loc is the local part of the pseudopotential (PP) [49],V • NL,k+q is the 2 × 2 matrix which is the nonlocal part of the PP containing the scalar-relativistic and spin-orbit coupling (SOC) potentials [50][51][52][53] δm are respectively the scalar and magnetic parts of the exchange-correlation (XC) ground-state potential [the functional derivatives defining v • XC (r) and b • XC (r) are evaluated at ground-state charge density n = n • (r) and magnetization density m = m • (r)]. In Eqs. (4) and (5), ε • n,k and ε • n,−k are the ground-state KS energies. It is important to remark that the inclusion of SOC when noncollinear magnetism is explicitly contemplated is trivial, because SOC is time-reversal invariant [47]; in practice this means thatTV • NL,−k−q =V • NL,k+qT on the left-hand side of Eq. (5), and thusĤ •+ k+q differs fromĤ • k+q only by the opposite sign in the last term of Eq. (6).
It is important to note that here we are using a number of approximations. Firstly, the external perturbing potentialV ext,q (ω) (defined above) contains only a coupling between the spin angular momentum and the magnetic field (i.e. the spin Zeeman term), while we have neglected the coupling between the orbital angular momentum and the magnetic field (i.e. the orbital Zeeman term). This is motivated by the fact that we are considering a regime of a weak (vanishing) external magnetic field. It was shown that in this case for first-row transition-metal elements the orbital Zeeman term is very small compared to the spin Zeeman term and hence it can be neglected (see e.g. Refs. [17,54]). Secondly, we neglected the diamagnetic term (it is proportional to the second power of the magnetic field amplitude). The diamagnetic term is generally much smaller than the spin (plus orbital) Zeeman term and in particular in the case of weak magnetic fields (which is the case here) [55]. Finally, we follow Ref. [56] and use the common approximation: only interaction terms up to 1/c 2 are considered, where c is the speed of light. For this reason we neglected the interaction between the spin-orbit coupling and the magnetic field that is a higher-order term proportional to 1/c 3 (see e.g. Ref. [54]). Now let us describe the linear-response potentials in Eqs. (4) and (5).V HXC,q (ω) is the monochromatic q component of the response Hartree and XC (HXC) potential, which reads: whereṽ is the response Hartree potential in the coordinate representation, whilev XC,q (ω) andb XC,q (ω) are the response scalar and magnetic XC potentials, respectively, which in the coordinate representation within ALSDA read [33,57]: b XC,q (r, ω) = ∂b XC ∂n n • ,m •ñ q (r, ω) + ∂b XC ∂m n • ,m •m q (r, ω) .
From Eqs. (9) and (10) we can see that there are mixed scalar and magnetic responses of v XC and b XC , which are coupled in a self-consistent way. As will be seen in the following, this allows us to compute the spin susceptibility directly by avoiding calculations of charge-charge responses and cross-terms spin-charge responses. The response potentials in Eqs. (8) - (10) are expressed in terms of the monochromatic q components of the response charge and magnetization densities, which read: and satisfy the following relations [58]:ñ * −q (r, −ω) =ñ q (r, ω) andm * −q (r, −ω) =m q (r, ω). In Eqs. (11) and (12), f n,k and f n,−k are the occupation factors that are equal to 1 for occupied states and to 0 for empty states at zero temperature. Using the aforementioned properties of the response densities it is easy to show thatV + HXC,q (ω) ≡ TV HXC,−q (−ω)T −1 is the operator of Eq. (7) with the opposite sign in the response magnetic XC potential. The same applies forV + ext,q (ω) ≡TV ext,−q (−ω)T −1 , which is the external perturbing potential with a reversed direction of the magnetic field. Lastly, the operatorsP k+q andP + k+q , appearing in Eqs. (4) and (5), respectively, are the projectors on to the empty-states manifold, and in the coordinate representation they read: We stress that these projectors are expressed in terms of the ground-state spinors U • m,k+q andTU • m,−k−q , respectively, which in turn refer to the occupied-states manifold (similarly to the static DFpT [32,33]) and thus avoiding the computationally expensive summations over empty states. Finally, it is important to stress that the self-consistent solution of the coupled equations (4) and (5) is done for a fixed external perturbation with a fixed wavevector q, which thus does not require the mixing of responses to different q-specific perturbations which greatly simplifies the solution of the problem.
We note that the Sternheimer equations (4) and (5) can be solved directly by using e.g. the conjugate-gradient algorithm [19]. However, this requires solving a separate self-consistent problem for each frequency ω, so increasing the computational cost very rapidly for very dense ω grids. In the next section, we present an alternative way to solve these equations using the LL approach, which allows us to bypass the aforementioned bottleneck via the use of an effective recursive Lanczos algorithm.
We recall that the formalism presented above is based on ALSDA. The extension to generalized-gradient approximation (GGA), especially in the noncollinear framework, is quite involved and not considered here. In this respect it is useful to mention Ref. [59] where it was shown that adiabatic GGA generally worsens the spin-excitation spectra by overestimating the magnon energies and suppressing the intensity of spin waves. The extension to meta-GGA functionals is even more challenging, in particular because of the known numerical stability issues even when performing ground-state calculations [60], not to mention the difficulties in generalizing meta-GGA to (time-dependent) linear-response theory. Moreover, e.g. the recently-proposed SCAN meta-GGA functional [61] exhibits some potential limitations in describing magnetic system [62,63]. As a note of caution, when attempting to generalize TDDFpT to GGA and meta-GGA functionals one has to pay special attention on to whether the zero-torque theorem is still satisfied. Finally, as possible future extensions of the current TDDFpT formalism it would be interesting and important to perform generalizations to hybrid and Hubbard functionals which allow to better describe the localized d and f electrons by alleviating large self-interaction errors for these states [25,64,65].

Quantum Liouville equation and spin susceptibility matrix
The resonant and anti-resonant linear-response KS equations (4) and (5) can be equivalently expressed in terms of the quantum Liouville equation for the 2 × 2 response spin-charge density matrix operatorρ q (ω) [31]: whereV ext,q (ω) is the external perturbing potential,ρ • is the unperturbed 2×2 spin-charge density matrix operator, andL q is the Liouvillian superoperator, the action of which is defined as: whereV HXC,q is the response HXC potential [see Eq. (7)].
The magnetization-density response linearly induced by the external magnetic perturbing potential at a specific transferred momentum q and at a specific frequency ω can be defined as: where with (·, ·) we indicate a scalar product in an operator space. Using the following convention for the external perturbing potential [46]V ext,q (ω) =m q ·b ext,q (ω), we can rewrite the expectation value (17) as where χ(q, q; ω) is the 3 × 3 spin susceptibility matrix, which reads: The poles of this quantity mark the magnetic excitations of the system, and they allow to characterize the cross section of numerous magnetic spectroscopies, both bulk ones such as INSS [Eqs.
(1)-(2)], or surface ones such as SPEELS [66]. Moreover, the usage of a noncollinear framework allows us to take into account the spin-orbit coupling effect as well as to study systems with complex noncollinear patterns in the ground state. Finally, it is worth noting that our formalism allows us to compute the whole 4 × 4 generalized susceptibility matrix which contains spin-spin [Eq. (20)], charge-charge, spin-charge, and charge-spin couplings (see Appendix A in Ref. [46]).

Batch representation
Equations (11) and (12) show that the response charge and magnetization densities are uniquely determined by the two sets of spinor wave functions X q = {x n,k+q } and Y q = {y n,k+q }, which are called respectively upper and lower components of the batch representation (BR) of the response spin-charge density matrix operator: This mapping can be formalized by defining BR of a generic operatorÔ q (ω) aŝ similarly to how it is done in Refs. [30,34]. Therefore, the commutator appearing on the right-hand side of Eq. (15) in BR will result in: Thus, the quantum Liouville equation (15) [or equivalently Eqs. (4) and (5)] in BR takes the following form: and the Liouvillian in BR reads: where the actions of the superoperators, appearing in Eq. (25), on the response batches are defined as: It is worth noting that due to the lack of time-reversal symmetry, it is not useful to make a rotation of the batches as it was done for other spectroscopies [30,31]. Finally, we can formally representm q in BR as: Therefore, using Eq. (18) in Eq. (23), and using Eqs. (25) -(30), we can write the spin susceptibility matrix (20) in BR as: which can be efficiently computed using the Lanczos recursion algorithm, as explained in the next section.

Lanczos recursion algorithm
In order to compute the spin susceptibility matrix χ(q, q; ω) using Eq. (31), we need to evaluate the off-diagonal matrix element of the resolvent of the Liouvillian, (ω −L q ) −1 . A straightforward inversion of such a matrix in plane-wave framework is a formidable task. It is therefore convenient to use recursive algorithms, such as e.g. the Lanczos recursion algorithm, which does not rely on the inversion of the matrices, but a recursive evaluation of an off-diagonal matrix element as in Eq. (31) [67]. We will briefly review the two flavors of the Lanczos algorithm that are implemented in the turboMagnon code, namely, the non-Hermitian Lanczos biorthogonalization algorithm [29,30,34], and the pseudo-Hermitian Lanczos algorithm [35,68,69]. A more detailed description of the algorithms can be found in the corresponding references.
According to Eq. (31), each element of the spin susceptibility matrix can be written in the form where λ and µ are the indices labeling Cartesian components, and the specific matrix element χ λµ depends on the u λ and v µ q vectors, which in the BR read: In the non-Hermitian Lanczos biorthogonalization algorithm, by starting from the initial pair of Lanczos vectors , two coupled Lanczos chains are generated by recursively applyingL q andL † q to the previous Lanczos chain vectors, q i and p i [30,31]. We note that q i and p i implicitly depend on the µ and q indices. A pair of biorthogonal basis sets of increasing dimension are thus recursively constructed, {q i } and {p i }, where i = 1, M , and M being the number of Lanczos iterations. The Lanczos coefficients, α µ i,q , β µ i,q , and γ µ i,q , are thus computed at each iteration to form a sparse M -dimensional tridiagonal matrix, T µ q , which is an oblique projection of the Liouvillian onto such biorthogonal bases: T µ q ij = (p i ,L q q j ). In order to speed-up the Lanczos recursion, one can take advantage of the pseudo-Hermiticity of the Liouvillian superoperator [68]. In this case, by defining a proper metric of the linear space, it is possible to recover a standard Hermitian Lanczos algorithm with a modified scalar product [35,68,69], which requires the application of the Liouvillian superoperator only once per Lanczos iteration, resulting in a factor of two speed-up [35]. As a result of the use of this algorithm one also generates the tridiagonal matrix T µ q . After generating the tridiagonal matrix, T µ q , the spin susceptibility matrix (32) can be computed as [36]: where I is the M -dimensional unit matrix, e 1 = (1, 0, . . . , 0) is the M -dimensional unit vector, and z λµ q = (z λµ 1,q , z λµ 2,q , . . . , z λµ M,q ) is the M -dimensional array whose coefficients z λµ i,q are computed on-the-fly of the Lanczos recursion and they are defined as z λµ i,q = (q i , u λ ), and the µ and q dependence implicitly come from q i . In practice, the right-hand side of Eq. (35) is computed by solving the equation: where η µ q,ω is the M -dimensional vector which is the solution of the equation above at a fixed value of ω, and finally calculating the scalar product χ λµ (q, q; ω) z λµ q , η µ q,ω , which are both inexpensive operations from the computational point of view. This allows one to use very dense low-frequency grids for a very accurate sampling of magnons or extended high-frequency grids for exploring Stoner excitations. The convergence of these spectra with respect to the number of Lanczos iterations, M , can be sped up by making use of the extrapolation technique for the Lanczos coefficients, which is described in detail in Refs. [30,34]. More details about the Lanczos algorithm for computing magnetic spectra can be found in Ref. [47]. Finally, we note that in practice when solving Eq. (36), a small imaginary part η is added to the frequency argument, ω → ω + iη, so as to regularize the spin susceptibility matrix χ λµ (q, q; ω + iη) [29,30,47].

Sum rules
As in the case of collective charge excitations [36], collective spin excitations also satisfy certain sum rules [21]. To show this, we introduce the anti-Hermitian part of the susceptibility tensor It can be shown that L λµ (q, ω+iη) satisfies the following sum rule: where m ν is the expectation value of the ν Cartesian component of the total ground-state magnetization in the unit cell, and λµν is the Levi-Civita tensor. It is important to note that Eq. (38) holds for any value of the transferred momentum q and for any broadening parameter η. In addition, we note that the real part of the spin susceptibility matrix is even with respect to ω while the imaginary part is odd, which implies that only the real part is involved in the sum rule. Finally, by using Eqs. (37) and (38), we present the expression for the sum rule in the case of λ = x and µ = y that is particularly useful for magnets polarized along the z direction, which will be used later in Sec. 5 for the CrI 3 monolayer:

Description of software components
The turboMagnon code is designed as a module of the Quantum ESPRESSO distribution [40][41][42], and it resides in a self-contained directory TDDFPT under the root directory of the Quantum ESPRESSO tree, which contains also the turboTDDFT and turboEELS codes for the calculation of the absorption and electron energy loss spectra, respectively. The turboMagnon code uses many of the generic linear response routines contained in the LR Modules repository. When the turboMagnon code is installed (see Sec. 4.1), the bin/ directory in the Quantum ESPRESSO root contains links to the executable turbo magnon.x (the main program) and turbo spectrum.x (a post-processing program). The code turbo magnon.x performs q-specific Lanczos recursions (up to three; one for each column of the spin susceptibility matrix) to obtain Lanczos coefficients α µ i,q , β µ i,q , γ µ i,q , and z λµ i,q coefficients (see Sec. 2.5), and to construct the tridiagonal representation of the Liouvillian, T µ q , while turbo spectrum.x uses this matrix to calculate the spin susceptibility matrix, χ λµ (q, q; ω), according to Eqs. (35) and (36).

Ground-state calculation
In order to compute the magnetic spectra of a system, a standard spin-polarized ground-state DFT calculation has to be performed first, yielding the KS spinor wave functions U • n,k (r) and KS energies ε • n,k for all occupied states, which allow us to determine the ground-state charge density, n • (r), and magnetization density, m • (r). The information thus obtained is then used as input for the linear-response calculation with the turboMagnon code. This ground-state calculation is performed by the pw.x code, which is one of the key components of the Quantum ESPRESSO package. In Appendix A a sample input file for pw.x is shown for the case of a CrI 3 monolayer. After successful completion of the ground-state calculation, the pw.x code writes the ground-state KS wave functions, energies, and charge and magnetization densities to disk, together with all relevant information about the system, like geometry, pseudopotentials, etc. This data is used by the turboMagnon code which reads all this data at program start. Therefore, it is not necessary to redefine the system under study in the input file of turbo magnon.x.

TDDFpT calculation
The linear-response calculation is done using the turbo magnon.x code, which performs the Lanczos recursions (see Sec. 2.5) for a given transferred momentum, q, and for a given direction of the perturbing magnetic field (which is controlled by the input parameter ipol). This is by far the most time consuming step of the calculation. In Appendix A a sample input file for turbo magnon.x is shown for the case of a CrI 3 monolayer. A list of all input variables of turbo magnon.x is given in Table B.1 of Appendix B. The integer input variable itermax sets up the number of Lanczos iterations, and so determines the dimension M of the tridiagonal matrix, T µ q (see Sec. 2.5). In fact, one can check whether the number of iterations is sufficient to achieve an adequately converged spectrum only at the post-processing level (see Sec. 3.3). It is possible to add more iterations to an existing calculation by restarting the turbo magnon.x code, setting the parameter restart=.true. and increasing itermax. The strings defined in the input variables prefix and outdir identify the system data on disk and must correspond to files created by the pw.x code.
The input variables q1, q2, and q3 are the three Cartesian components of the transferred momentum, q, specified in units of 2π/a, where a is the lattice parameter specified in the ground-state calculation by pw.x.
One can choose which flavor of the Lanczos algorithm to use (see Sec. 2.5). By setting pseudo hermitian=.true., the pseudo-Hermitian Lanczos algorithm will be used, otherwise the non-Hermitian Lanczos biorthogonalization algorithm will be used. It is recommended to use the former, because it is two times faster.
During the execution of the turbo magnon.x code, a file named prefix.beta gamma z.dat will be written to the outdir directory. This file contains the Lanczos coefficients α µ i,q , β µ i,q , γ µ i,q , and z λµ i,q coefficients needed for the post-processing calculation. One can use this information for the analysis of the behavior of these coefficients (see Fig. 5).

Post-processing spectrum calculation
Once the tridiagonal matrix, T µ q , is constructed from the Lanczos coefficients, one can compute the spin susceptibility matrix according to Eqs. (35) and (36). This task is performed by the turbo spectrum.x program as a post-processing step, which requires negligible amount of the CPU time with respect to turbo magnon.x.
The turbo spectrum.x program is used also for the calculation of the absorption spectra computed with turboTDDFT and electron energy loss spectra computed using turboEELS. In order to distinguish the different applications, it is necessary to set magnons=.true. in the input for turbo spectrum.x. The labels prefix and outdir identify the system on disk and must correspond to files created by the turbo magnon.x code. It is necessary to specify the input parameter ipol which controls which column of the spin susceptibility matrix is computed for a given direction of the magnetic field (the direction of the magnetic field is specified in the input file for turbo magnon.x).
In Appendix A a sample input file for turbo spectrum.x is shown for the case of a CrI 3 monolayer. A list of input variables for the turbo spectrum.x program is given in Table B.2 of Appendix B.
As it was mentioned in Sec. 2.5, when solving Eq. (36), a small Lorentzian broadening parameter η is added to the frequency in order to regularize the spin susceptibility matrix χ λµ (q, q; ω + iη) [29,30,47]. The magnetic spectra can be computed in any frequency range specified by the keywords start and end, with a step of frequency given by the increment parameter.
The convergence of the spectrum in the desired frequency range can be checked by varying the number of Lanczos coefficients used. This number is set by the input keywords itermax0 and itermax. If no extrapolation of Lanczos coefficients is used (extrapolation='no'), then itermax=itermax0. These variables can take values up to the number of iterations which have been performed using the turbo magnon.x code. For a given number of Lanczos iterations, it is possible to improve the convergence of the computed spectra by extrapolating the coefficients [34]. Such an extrapolation can either be bi-constant (extrapolation='osc') or constant (extrapolation='constant') [34]. In this case, the input variable itermax0 indicates the number of exact coefficients to be read from file, while itermax is set to a value which can be chosen arbitrarily large without any significant computational cost. Such an extrapolation procedure amounts to increasing the dimension of the tridiagonal matrix, T µ q . It is worth noting though that the extrapolation of Lanczos coefficients for magnetic spectroscopies is most useful to converge the Stoner excitations and less relevant for converging magnons (when they are not overlapping with the former).
Finally, the turbo spectrum.x program generates a file called prefix.plot chi.dat which contains real and imaginary parts of one column (controlled by ipol) of the spin susceptibility matrix, χ λµ (q, q; ω), for a given q and for each value of the frequency ω. This can be directly used to plot the magnetic spectra using Eq. (2).

Installation instructions
The turboMagnon program is distributed as source code, like the other components of the Quantum ESPRESSO distribution. The installation procedure is the same for all modules in the Quantum ESPRESSO package. Quantum ESPRESSO and turboMagnon make use of GNU autoconf [70]. The TDDFPT repository, which contains the source turboMagnon code must be residing within the Quantum ESPRESSO tree. The code is compiled with the following commands from within the Quantum ESPRESSO tree: ./configure make pw make tddfpt Alternatively, it is possible to use cmake [71] instead of ./configure. Here, the first step sets up the environment (compilers, libraries, etc.). The second step compiles the pw.x code and creates a link to this executable in the bin/ repository of the Quantum ESPRESSO tree. In the third step, the turboMagnon code (turbo magnon.x and turbo spectrum.x) are compiled, together with the turboTDDFT and turboEELS codes. Links to these programs are created in the bin/ directory of the Quantum ESPRESSO tree. Further detailed installation instructions can be found in the documentation that comes with the Quantum ESPRESSO distribution.

Parallelization
Like the other components of the Quantum ESPRESSO package, the turboMagnon code is optimized to run on massively parallel architectures. The parallelization of the turboMagnon code is achieved by using the message-passing paradigm and calls to standard Message Passing Interface (MPI) libraries [72]. High performance on massively parallel architectures is achieved by distributing both data and computations in a hierarchical way across processors. The turboMagnon code supports two levels of parallelization: i) a plane-wave parallelization, which is implemented by distributing real-and reciprocal-space grids across the processors, and ii) a k points parallelization, which is implemented by dividing all processors into pools, each taking care of one or more k points. The Fast Fourier Transforms (FFT's), which are used for transformations from real space to reciprocal space and vice versa, are also efficiently parallelized among processors.

Benchmarking
We now proceed to the validation of the turboMagnon code by calculating the magnetic spectra for a CrI 3 monolayer, a 2D ferromagnetic insulator with honeycomb-arranged magnetic moments and strong spin-orbit coupling. We remark that the correctness of the LL approach in the case of collinear metallic systems (bulk Fe and Ni) has already been inspected and validated in our previous publications [46,47], and that the goal here is to validate the noncollinear implementation including spin-orbit coupling. Nonetheless, at the end of this section we will inspect the fulfillment of the sum rule not only for CrI 3 monolayer but also for bulk Fe and Ni, since this was not discussed before.

Technical details
All the calculations were performed using the Quantum ESPRESSO distribution [40][41][42], a suite of computer codes based on plane waves and pseudopotentials. We performed the ground-state noncollinear spin-polarized calculations using the pw.x program; local spin density approximation (LSDA) was used for the XC functional, and SOC was included self-consistently. Fully-relativistic norm-conserving (NC) pseudopotentials (PPs) were taken from the pseudopotential library of Ref.
[73] (Cr.rel-pz-n-nc.UPF and I.rel-pz-n-nc.UPF). The KS wave functions and potentials were expanded in plane waves up to a kinetic-energy cutoff of 60 and 240 Ry, respectively. The BZ was sampled using a uniform Γ-centered 8 × 8 × 1 k points mesh. The CrI 3 2D crystal structure (see Fig. 1) is described using a honeycomb lattice lying in the xy plane and leaving 17.5Å of vacuum along the z direction. The monolayer was obtained by extracting one layer from the trigonal bulk structure and by optimizing atomic positions and the in-plane lattice constant (12.98 Bohr). The DFT calculations correctly yield a ferromagnetic (FM) ordering and a magnetic anisotropy with the out-of-plane directions as the easy axis [74]. The ground-state magnetization is polarized along the z direction as depicted in Fig. 1, and the total magnetic moment is 6 µ B per unit cell coming mainly from the Cr atoms, consistently with a S = 3/2 spin on each Cr site. We also quantified the magnetic anisotropy energy: the energy difference between the in-plane and out-of-plane magnetizations is 0.74 meV.
TDDFpT calculations were performed using the turboMagnon code (turbo magnon.x) in the noncollinear framework consistently with the ground state. Magnetic spectra were obtained using the post-processing program turbo spectrum.x using a Lorentzian smearing function with broadening parameters reported in the next sections. The kinetic-energy cutoff and the k points mesh were used the same as for the ground-state calculation. The convergence of spectra with respect to the number of Lanczos iterations will be shown in the following.
The data used to produce the results of this work are available in the Materials Cloud Archive [75].

Spin susceptibility matrix
We start by analyzing the real and imaginary components of the 3 × 3 spin susceptibility matrix χ λµ (q, q; ω) computed for the transferred momentum q being equal to the high-symmetry M point in the BZ. The result for the CrI 3 monolayer is shown in Fig. 2. We used the non-Hermitian Lanczos algorithm that was already introduced in our previous work [46], but here it is generalized to the noncollinear framework including SOC (this will be discussed in more detail in the following). The goal of this subsection is to analyze and compare all components of the spin susceptibility matrix, and to understand their meaning. We are mainly interested in the imaginary part of χ λµ (q, q; ω) since it is used in Eqs. (1) and (2), and can be directly compared with the double-differential cross sections that are determined from the INSS experiments. But it is also useful to analyze the real parts of χ λµ (q, q; ω) that are used e.g. to check the sum rules (see Sec. 2.6).  As can be seen in Fig. 2, the intensities of Im[χ xx ] and Im[χ yy ] are 6 orders of magnitude larger than that of Im[χ zz ]. This is so because for the specific geometry setup shown in Fig. 1, Im[χ xx ] and Im[χ yy ] are the transversal components of the spin susceptibility matrix that correspond to the excitations of magnons, while Im[χ zz ] describes longitudinal spin excitations, which are considerably stiffer than the former. Despite the huge difference in the intensities, the peak positions in the transversal and longitudinal components of the spin susceptibility appear at similar energies. More specifically, the first peak is at exactly the same energy for all three components (17 meV), while the second peak appears at 26 meV for the two transversal components and at 29 meV for the longitudinal component. Moreover, it is interesting to observe that the first peak is more intense than the second peak both in Im[χ xx ] and Im[χ yy ], while the trend is the opposite for Im[χ zz ]. These two peaks in the transversal components of the spin susceptibility matrix correspond to the acoustic and optical magnon excitations, and it appears that the intensity of the acoustic mode is larger than that of the optical mode. It is instructive to analyze also the offdiagonal components of the spin susceptibility matrix. We can see from Fig. 2  It is important to note that these latter components, which couple the transversal and longitudinal excitations, are identically zero in collinear magnets in the absence of spin-orbit coupling. Similar trends are seen for the real part of the spin susceptibility matrix in Fig. 2.
From the practical point of view, in order to obtain the full 3 × 3 spin susceptibility matrix as shown in Fig. 2 one needs to perform 3 Lanczos chains. More specifically, the first column in the χ λµ (q, q; ω) matrix is obtained by performing one Lanczos chain for the external magnetic field applied along the x axis, i.e. µ = x (ipol=1) and by measuring the magnetic response along the 3 Cartesian directions, i.e. λ = x, y, z (this is done automatically by the turboMagnon code). The second and the third columns of χ λµ (q, q; ω) are obtained by applying the external magnetic field along the y (ipol=2) and z (ipol=3) axes, respectively. However, in the case of the setup shown in Fig. 1 there is actually no need to perform 3 Lanczos chains but just one either with ipol=1 or ipol=2, since we are interested only in the transversal components of the spin susceptibility matrix, and Im[χ xx ] = Im[χ yy ]. This is very convenient from the computational point of view since we can save the CPU time by a factor of 3.

Non-Hermitian vs pseudo-Hermitian Lanczos algorithms
We now to proceed to a validation of the new pseudo-Hermitian Lanczos algorithm that was implemented in the turboMagnon code. As was briefly explained in Sec. 2.5, the pseudo-Hermitian Lanczos algorithm is two times faster than the non-Hermitian one due to the reduction in the number of linear-algebra operations by taking advantage from the pseudo-Hermiticity of the linear-response equations [68]. In order to check the implementation, in Fig. 3 we compare the real and imaginary parts of one row (λ = y) of the spin susceptibility matrix χ λµ (q, q; ω) computed using the pseudo-Hermitian and non-Hermitian Lanczos algorithms. For the sake of clearer comparison, we slightly increased the value of the Lorentzian broadening parameter compared to Fig. 2. As can be seen in Fig. 3, the two flavors of the Lanczos algorithm give spectra that are in remarkable agreement with each other which validates the correctness of the implementation of the pseudo-Hermitian algorithm. It is worth mentioning that both algorithms give converged spectra (with a precision of a fraction of meV) for q = M after performing 30000 Lanczos iterations. This is a rather large number of iterations, and hence the numerical stability of the Lanczos chain becomes very relevant (i.e. that Lanczos coefficients do not diverge). In this respect, the pseudo-Hermitian algorithm turns out to be more numerically stable than the non-Hermitian one, therefore the former one should be used by default. Additionally, we want to stress that converging magnetic spectra at different q requires different number of Lanczos iterations, and as will be shown in Sec. 5.5 for smaller values of q we need a smaller number of Lanczos iterations. In the rest of this paper we present results obtained using only the pseudo-Hermitian algorithm.

Magnon dispersion and the importance of spin-orbit coupling
The effect of SOC on magnetic excitations within TDDFT was addressed only in Ref. [17] to the best of our knowledge. Such a scarce availability of TDDFT studies with SOC is partly due to its higher computational cost and partly due to the difficulty in enforcing the correct long-wavelength limit, as it will be explained in the following. As shown in Sec. 2, our formulation of TDDFpT equations naturally incorporates SOC, allowing us to investigate this rather unexplored subject. Figure 4 (a) shows the magnon dispersion for the CrI 3 monolayer along the Γ-M high-symmetry direction in the BZ computed using the pseudo-Hermitian Lanczos algorithm with and without SOC. This magnon dispersion represents the q dependence of the acoustic magnon branch, while the optical magnon branch has vanishing intensity and hence it is not shown on the plot (the optical branch starts having nonvanishing intensities only for q's approaching the M point). The magnon energy at q → Γ deserves a special attention due to the Goldstone theorem, which implies that the acoustic magnon energy must be exactly zero in absence of SOC [76,77]. In practice, the Goldstone theorem is often violated in actual calculations due to different numerical approximations that are used to describe the ground state and the excited states (e.g. different k point grids, basis sets, etc.) [14][15][16]26]. Not incurring in the aforementioned limitations, our implementation is expected to satisfy the Goldstone theorem with high accuracy and yield the correct long-wavelength limit. We note that currently the turboMagnon code does not contain the implementation of the exact limit q = 0 (q = Γ), hence in practice we use very small but finite values of q close to the Γ point. At q → Γ without SOC we find indeed a vanishing magnon energy, with a value of 0.04 meV for |q| = 0.01 (2π/a), where a is the lattice parameter reported in Sec. 5.1. A quadratic fit of the five smallest |q| points close to Γ extrapolates to ≈ 0.02 meV at q = Γ. This value is orders of magnitude smaller than typical values reported in other works when using other methods (often as large as several tenths of meV [21,26]).  In the presence of SOC, the Goldstone theorem is not supposed to hold since spin-rotational symmetry is already broken, and the acoustic magnon can acquire a finite energy at q = Γ, so forming the so-called "Goldstone gap". When accounting for spin-orbit coupling, our calculations yield a Goldstone gap of 1.3 meV at q → Γ in the CrI 3 monolayer. We note that this value is consistent with a localized spin model where magnetic anisotropies are only onsite: in this case the energy difference between the in-plane and out-of-plane magnetization is estimated to be 0.74 meV from our ground-state DFT calculations, and the Goldstone gap is twice of that value (i.e. 1.48 meV) [74]. Finally, we notice that the inclusion of SOC does not induce a simple rigid upward shift of the acoustic magnon branch, but has a more complex q-dependent effect, which in a localized spin model implies some renormalization of the intersite exchange coefficients.
No experimental measurements of magnon energies in the CrI 3 monolayer have been performed so far. Therefore, we resort to the available INSS measurements in bulk samples in order to give some insights about the predictive accuracy of our calculations. The computed magnon dispersion along the Γ-M direction in Fig. 4 (a) presents the same features and trends as the one for the bulk CrI 3 in experiments. We find, however, an overall overestimation of the magnon bandwidth (≈ 26 meV [78] vs ≈ 17 meV [79]), which can be related to the known overestimation of magnon stiffness when using ALSDA. Likewise, the theoretical Goldstone gap in the bulk CrI 3 is about 1.3 meV (which turns out to be the same as in the CrI 3 monolayer) and it overestimates the experimental one, whose estimates in CrI 3 bulk samples point towards values in the order of 0.37 meV [80]. Ultimately, to increase the accuracy of quantitative predictions using the LL approach to TDDFpT more advanced XC functionals must be used, which will be the topic of future developments for the turboMagnon code. Figures 4 (b) and (c) show the convergence of S(q, ω) with and without SOC, respectively. We remind that S(q, ω) is computed using Eq. (2) which requires on input the spin susceptibility matrix χ λµ (q, q; ω) discussed previously. We note that S(q, ω) takes into account both the transversal and longitudinal magnetic excitations, that are both probed in INSS experiments. In the CrI 3 case, however, transverse excitations dominate the response spectrum, so that S(q, ω) is predominantly a fingerprint of magnon excitations. We can see in Figs. 4 (b) and (c) that S(q, ω) is converged after 10000 Lanczos iterations in both cases. We notice, however, that in the case when SOC is included it takes 3 times less Lanczos iterations to convergence the spectrum at q → Γ than at q = M (see Sec. 5.3), which means that the convergence is strongly q-dependent.  Figure 5: The behavior of Lanczos coefficients α µ i,q and β µ i,q (in Ry) and z λµ i,q coefficients (in µ B ) for the CrI 3 monolayer as a function of the number of Lanczos iterations i. The data is generated using the pseudo-Hermitian Lanczos algorithm at q → Γ without SOC (left 4 panels) and with SOC (right 4 panels). The coefficients plotted here correspond to the polarization of the external magnetic field along the y axis (i.e. µ = y); moreover, we note that z zy is vanishing compared to z xy and z yy , thus only the latter two are shown. For the sake of simplicity, we dropped certain indices from the coefficients on the plots.

Convergence of magnetic spectra and the behavior of Lanczos coefficients
It is also useful to analyze the behavior of the Lanczos coefficients. Figure 5 shows the behavior of the Lanczos coefficients α µ i,q and β µ i,q , as well as of the z λµ i,q coefficients as a function of the number of Lanczos iterations i using the pseudo-Hermitian algorithm at q → Γ with and without SOC when computing the response to a magnetic field polarized along the y axis. The Lanczos coefficients γ µ i,q are equal by modulus to β µ i,q and can differ only by sign [34], thus we report only the latter. First of all, we observe that the α µ i,q coefficients are extremely small compared to the β µ i,q coefficients, which was also found in the case of bulk Fe and Ni [46]. In fact, the α µ i,q coefficients are exactly zero in the case of the absorption [30,34] and electron energy loss spectroscopies [31,36], due to the possibility to perform a rotation to the standard batch representation. Such an argument no longer holds when magnetic ground states are considered and, even though in practice we always find α µ i,q to be very small, we chose not to constrain their value in the absence of a formal proof. As for the β µ i,q coefficients, we show even and odd values as was also done in previous works [30,34,36,46]. We find that the even and odd β's oscillate around a mean value that is approximately equal to half the value of the kinetic-energy cutoff (∼ 60/2 = 30 Ry). As we discussed in Sec. 5.4,in Figs. 4 (b) and (c) we see that the magnon peaks are converged after about 10000 Lanczos iterations. From the respective panels in Fig. 5 we see that the Lanczos β µ i,q coefficients have oscillatory behavior and that after 10000 Lanczos iterations signatures of stabilization are observed.
Finally, we analyze the behavior of the z λµ i,q coefficients, which appear to be peaked functions as can be seen in Fig. 5 (lower panels). Interestingly, the z xy i,q components are peaked much earlier than z yy i,q with respect to the number of Lanczos iterations. The z yy i,q coefficients are peaked at ∼ 8000 Lanczos iterations both without and with SOC, while the full convergence of the magnon peaks is reached a couple of thousands of Lanczos iterations later. This means that in order to convergence the magnon energies it is necessary not only to reach the point when the β µ i,q Lanczos coefficients are stabilized but also when all the z λµ i,q coefficients reached their maxima and decay significantly.

Sum rules
In this section we present a validation of the sum rule given by Eq. (39). Here we present the results not only for the CrI 3 monolayer, but also for bulk Fe and Ni, two prototypical metallic ferromagnets whose magnon dispersions have already been studied with the present algorithm and shown elsewhere [46].  [46], and |q| = 0.1 (2π/a) along the Γ-X direction for Ni [46]. The value of the integral in Eq. (39) is reported as a function of the upper integration bound. Lorentzian broadening of η = 1.0 meV was used for CrI 3 , and η = 10 meV was used for Fe and Ni. mz = mz is the z component of the ground-state magnetization density as obtained from DFT.
In practice, the upper bound in the integral in Eq. (39) must be replaced with some finite value of the frequency. We found that this upper bound is very large (several hundreds of eV) compared to typical energies of magnons (from tens to hundreds of meV). This is so because the integration of the spin susceptibility matrix takes into account also the Stoner excitations that appear at larger energies than magnons. As can be seen in Fig. 6, the value of the integral converges to the value of the z component of the DFT ground-state magnetization density with the accuracy of ∼ 1% for all three systems. We checked that the sum rule is satisfied independently of the specific q point, the choice of the Lorentzian broadening, and the number of Lanczos iterations as shown in previous works [29,36]. Finally, we note that in the case of bulk Fe and Ni the Hamiltonian commutes withŜ z and the additional relation χ xy (q, q; ω) = −χ yx (q, q; ω) holds [81,82], so only one of the two components is needed for checking the sum rule using Eq. (39).

Scaling
Scaling of the turboMagnon code is one of the crucial aspects that requires a separate discussion which we present in this section. Here we analyze benchmark tests that were performed for the CrI 3 monolayer on the "Galileo100" HPC cluster at CINECA [83]. Each node is equipped with 2 CPUs of type "Intel CascadeLake 8260", each CPU having 24 cores (2.4 GHz, 384 GB RAM); each node executes 24 MPI ranks, each one with 2 OpenMP threads.
For the scaling tests we used a Monkhorst-Pack [84] 8 × 8 × 1 k points mesh (we checked that the Monkhorst-Pack and Γ-centered k points meshes give the same converged spectra for the CrI 3 monolayer). The LL approach currently does not use symmetries, and hence in the ground-state DFT calculation we end up with the full grid of 64 points that contain k and −k (no inversion symmetry). In the linear-response calculation, we need to add to each of these points q and −q, so that the final list of points is: k, −k, k + q, k − q, −k + q, and −k − q. Therefore, the total number of points that have to be considered when using the 8 × 8 × 1 Monkhorst-Pack mesh is 192. It is important to note that e.g. for the hexagonal BZ (which is the case here) the N × N × 1 Monkhorst-Pack (i.e. shifted) k points meshes with N being odd or for N × N × 1 Γ-centered (i.e. unshifted) k points meshes with N being even some points fall on the edge of the BZ (for the Γ-centered k points meshes, the Γ point always falls on the BZ border irrespective of N ). These "special k points" are assigned a weight of 1/2 and the −k point is generated, so the total number of points is larger than 3N 2 .
The overall scaling of the turboMagnon code is obtained combining the plane-wave and k points parallelization levels. The TDDFpT calculations scale linearly with respect to the number of k and q points. We recall that all q points are independent and hence the TDDFpT calculations for different q points can be run independently and in parallel without communicating with each other. It is instructive to discuss the scaling of the turboMagnon code using different number of compute nodes, number of k points pools, and number of MPI ranks per pool. These results are shown in Figs. 7 (a) and (b) that report time needed to perform 100 Lanczos iterations. In general, for the highest efficiency of computations the number of k points pools should be chosen such that the ratio between the number of pools and the number of nodes is an integer number. If this criteria is not satisfied and some pools are split between different nodes, the calculations are slower due to the loss of time for inter-node communications. For this reason, in Fig. 7 (a) the number of k points pools is chosen to be equal to the number of compute nodes (but of course the number of pools can be chosen to be larger than the number of nodes), and therefore the number of MPI ranks/pool is 24 for all data points. Moving forward, in this specific example we have 192 points, and these should be divided into pools in such a way that 6 "sister points" (i.e. k, −k, k + q, k − q, −k + q, and −k − q) reside in the same pool. For this reason, in Fig. 7 (a) we choose the number of nodes (pools) such that the ratio between 192 and the number of nodes (pools) is 6 times an integer number. By following this logic, the maximum number of nodes (pools) possible is 32 (192/32 = 6, i.e. 6 points per pool). Hence, in Fig. 7 (a) we see that the turboMagnon code scales (quasi-)linearly with the number of compute nodes provided the number of k points pools equals the number of compute nodes: the speedup is ∼ 30 when using 32 nodes compared to the calculation using only 1 node. We note in passing that it is possible to use the number of nodes even larger than 32 (e.g. 48, 64, etc.) however in this case the number of MPI tasks/pool becomes larger than 24. This brings us to Fig. 7 (b) where we investigate the scaling of the turboMagnon code as a function of the MPI ranks/pool. In order to exploit an increasingly large number of processors maintaining a high parallel efficiency it is necessary to tune the number of MPI ranks used in each pool. In order to investigate this point, we fixed the number of k points pools to 32 and changed the number of compute nodes (16, 32, 48, and 64). We can see in Fig. 7 (b) that the time needed to perform 100 Lanczos iterations decreases when we increase the number of MPI ranks/pool as expected, and that there is a change in the slope at 24 MPI ranks/pool. This change in the slope indicates that the parallel efficiency drops when the number of MPI ranks/pool is larger than 24. The trade-off between the total amount of computational time needed to solve the TDDFpT equations and the speedup has to be found for large-scale production calculations on HPCs -from Fig. 7 (a) and (b) we can conclude that this is achieved when using 32 nodes and 32 k points pools for the current example (i.e. 24 MPI ranks/pool).
In contrast, the scaling of the turboMagnon code with respect to the number of atoms in the system is cubic. Finally, in Fig. 7 (c) we show the scaling with respect to the kinetic-energy cutoff for wavefunctions E cut (the kinetic-energy cutoff for the charge and magnetization density and potentials is 4 times larger since we use NC PPs). For these tests we used 32 compute nodes and different number of pools, 16 and 32 (in order to see trends with respect to a different number of MPI ranks/pool). We see that in both cases when increasing E cut from 60 to 100 Ry (i.e. by 40%) the computational time is increased by a factor of ∼ 3 (i.e. by 300%). Interestingly, when using 32 k points pools we see a straight line between 60 and 90 Ry and then a change in the slope, while when using 16 k points pools the data is somewhat more noisy. Overall, we can conclude that the computational cost of turboMagnon calculations increases rapidly when increasing E cut and thus it is important to optimize the value of E cut by checking the converge of the magnetic spectra.
To conclude, we use the scaling tests discussed above to highlight the computational cost for magnetic spectra calculations for the CrI 3 monolayer. Using the Monkhorst-Pack 8 × 8 × 1 k points mesh and E cut = 60 Ry we choose the optimal setup of 32 compute nodes and 32 k points pools (i.e. 24 MPI ranks/pool). With this setting, it takes 1 hour 40 minutes to perform 10000 Lanczos iterations (with SOC), which means approximately 53 node hours. This result does not depend on the value of the q point since we do not use symmetries; however, as we discussed above the number of Lanczos iterations needed to converge the spectra varies for different q points. Finally, without SOC the computational cost is slightly smaller but not substantially (since calculations were also performed in the noncollinear framework).

Conclusions
We have presented the turboMagnon code as a component of the Quantum ESPRESSO distribution that implements a Liouville-Lanczos approach to time-dependent density-functional perturbation theory for the computation of magnetic spectra for any transferred momentum q. The turboMagnon code does not require the calculation of empty electronic states due to the use of standard techniques of the static density-functional perturbation theory [33]. The solution of the linear-response equations is done in the frequency domain using the non-Hermitian or pseudo-Hermitian Lanczos recursive algorithms that allow to avoid computationally expensive inversions of response matrices. The turboMagnon code is implemented in the noncolinear spin-polarized framework including relativistic effects such as spin-orbit coupling. The effectiveness of the code is showcased on the example of the 2D ferromagnetic insulator CrI 3 and the fulfillment of linear-response sum rules is verified for monolayer CrI 3 , bulk Fe and bulk Ni.
In the same spirit as the Quantum ESPRESSO project, turboMagnon provides scientists worldwide with a well commented and open-source framework for implementing their ideas. It is in our best hope that turboMagnon can benefit from the already well established users community of Quantum ESPRESSO for incorporating new ideas and keep growing in the future. The turboMagnon code is hosted in a community accessible Git repository [85] and hence, apart from the releases of Quantum ESPRESSO [86], researchers who are willing to test the latest experimental implementations are welcome to do so and to contribute with their feedback.
The turboMagnon code can be extended so as to employ more advanced exchange-correlation functionals (e.g. to include the Hubbard U correction or SCAN meta-GGA [61]), to use ultrasoft and projector-augmented-wave pseudopotentials, use symmetry, to name a few. Moreover, one of our core goals is to generalize the turboMagnon code to make it run on modern GPU architectures. The files generated by the ground state pw.x run should have this same prefix.
outdir './' Working directory. On start, it should contain the files generated by a ground state pw.x run.
When set to .true., turbo magnon.x will attempt to restart from a previous interrupted calculation (see restart step variable).

restart step itermax
The code writes restart files every restart step iterations. Restart files are automatically written at the end of itermax Lanczos steps.

lr verbosity 1
Verbosity level: the larger the value the more data is printed in the output file.
lr control itermax 500 Number of Lanczos iterations to be performed. q1, q2, q3 1, 1, 1 Cartesian components of the transferred momentum q in units of 2π/a (where a is the lattice parameter of the unit cell).
If .true. then the pseudo-Hermitian Lanczos algorithm is used, if .false. then the non-Hermitian Lanczos biorthogonalization algorithm is used (which is two times slower).

'./'
The directory where the output files produced by the previous turbo magnon.x run are stored.
Must be set to .true. for the magnetic spectrum calculation.
itermax0 1000 Number of Lanczos coefficients to be read from the file.
itermax 1000 The total number of Lanczos coefficients that will be considered in the calculation of the spin susceptibility matrix. If itermax > itermax0, the Lanczos coefficients in between itermax0+1 and itermax will be extrapolated.

2.5
The susceptibility is computed up to this value of ω (in units).
increment 0.001 Incremental step ∆ω used to define the mesh between start and end (in units).
verbosity 0 Verbosity level: the larger the value the more data is printed in the output file.