Wannier90 as a community code: new features and applications

Wannier90 is an open-source computer program for calculating maximally-localised Wannier functions (MLWFs) from a set of Bloch states. It is interfaced to many widely used electronic-structure codes thanks to its independence from the basis sets representing these Bloch states. In the past few years the development of Wannier90 has transitioned to a community-driven model; this has resulted in a number of new developments that have been recently released in Wannier90 v3.0. In this article we describe these new functionalities, that include the implementation of new features for wannierisation and disentanglement (symmetry-adapted Wannier functions, selectively-localised Wannier functions, selected columns of the density matrix) and the ability to calculate new properties (shift currents and Berry-curvature dipole, and a new interface to many-body perturbation theory); performance improvements, including parallelisation of the core code; enhancements in functionality (support for spinor-valued Wannier functions, more accurate methods to interpolate quantities in the Brillouin zone); improved usability (improved plotting routines, integration with high-throughput automation frameworks), as well as the implementation of modern software engineering practices (unit testing, continuous integration, and automatic source-code documentation). These new features, capabilities, and code development model aim to further sustain and expand the community uptake and range of applicability, that nowadays spans complex and accurate dielectric, electronic, magnetic, optical, topological and transport properties of materials.

Wannier90 is an open-source computer program for calculating maximally-localised Wannier functions (MLWFs) from a set of Bloch states.It is interfaced to many widely used electronicstructure codes thanks to its independence from the basis sets representing these Bloch states.In the past few years the development of Wannier90 has transitioned to a community-driven model; this has resulted in a number of new developments that have been recently released in Wannier90 v3.0.In this article we describe these new functionalities, that include the implementation of new features for wannierisation and disentanglement (symmetry-adapted Wannier functions, selectivelylocalised Wannier functions, selected columns of the density matrix) and the ability to calculate new properties (shift currents and Berry-curvature dipole, and a new interface to many-body perturbation theory); performance improvements, including parallelisation of the core code; enhancements in functionality (support for spinor-valued Wannier functions, more accurate methods to interpolate quantities in the Brillouin zone); improved usability (improved plotting routines, integration with

I. INTRODUCTION
Wannier90 is an open-source code for generating Wannier functions (WFs), in particular maximallylocalised Wannier functions (MLWFs), and using them to compute advanced materials properties with high efficiency and accuracy.Wannier90 is a paradigmatic example of interoperable software, achieved by ensuring that all the quantities required as input are entirely independent of the underlying electronic-structure code from which they are obtained.Most of the major and widely used electronic-structure codes have an interface to Wannier90, including Quantum ESPRESSO, 1 ABINIT, 2 VASP, [3][4][5] Siesta, 6 Wien2k, 7 Fleur 8 and Octopus. 9As a consequence, once a property is implemented within Wannier90, it can be immediately available to users of all codes that interface to it.
Over the last few years, Wannier90 has undergone a transition from a code developed by a small group of developers to a community code with a much wider developers' base.This has been achieved in two principal ways: (i) hosting the source code and associated development efforts on a public GitHub repository; 10 and (ii) building a community of Wannier90 developers and facilitating personal interactions between individuals through community workshops, the most recent in 2016.In response, the code has grown significantly, gaining many novel features contributed by this community, as well as numerous fixes.
In this paper, we describe the most important novel contributions to the Wannier90 code, as embodied in its 3.0 release.The paper is structured as follows: In Sec.II we first summarise the background theory for the computation of MLWFs (additional details can be found in Ref. 11), and introduce the notation that will be used throughout the paper.In Sec.III we describe the novel features of Wannier90 that are related to the core wannierisation and disentanglement algorithms; these include symmetry-adapted WFs, selective localisation of WFs, and parallelisation using the message-passing interface (MPI).In Sec.IV we describe new functionality enhancements, including the ability to handle spinor-valued WFs and calculations with noncollinear spin that use ultrasoft pseudopotentials (within Quantum ESPRESSO); improved interpolation of the k-space Hamiltonian; a more flexible approach for handling and using initial projections; and the ability to plot WFs in Gaussian cube format on WF-centred grids with non-orthogonal translation vectors.In Sec.V we describe new functionalities associated with using MLWFs for computing advanced electronic-structure properties, including the calculation of shift currents, gyrotropic effects and spin Hall conductivities, as well as parallelisation improvements and the interpolation of bands originating from calculations performed with many-body perturbation theory (GW).In Sec.VI we describe the selectedcolumns-of-the-density-matrix (SCDM) method, which enables computation of WFs without the need for explicitly defining initial projections.In Sec.VII we describe new post-processing tools and codes, and the integration of Wannier90 with high-throughput automation and workflow management tools (specifically, the Ai-iDA materials' informatics infrastructure 12 ).In Sec.VIII we describe the modern software engineering practices now adopted in Wannier90, that have made it possible to improve the development lifecycle and transform Wannier90 into a community-driven code.Finally, our conclusions and outlook are presented in Sec.IX.

II. BACKGROUND
In the independent-particle approximation, the electronic structure of a periodic system is conventionally represented in terms of one-electron Bloch states ψ nk (r), which are labelled by a band index n and a crystal momentum k inside the first Brillouin zone (BZ), and which satisfy Bloch's theorem: where u nk (r) = u nk (r + R) is a periodic function with the same periodicity of the single-particle Hamiltonian, and R is a Bravais lattice vector.(For the moment we ignore the spin degrees of freedom and work with spinless wave functions; spinor wave functions will be treated in Sec.IV A.) Such a formalism is also commonly applied, via the supercell approximation, to non-periodic systems, typically used to treat point, line and planar defects in crystals, surfaces, amorphous solids, liquids and molecules.

A. Isolated bands
A group of bands is said to be isolated if it is separated by energy gaps from all the other lower and higher bands throughout the BZ (this isolated group of bands may still show arbitrary crossing degeneracies and hybridisations within itself).For such isolated set of J bands, the electronic states can be equivalently represented by a set of J WFs per cell, that are related to the Bloch states via two unitary transformations (one continuous, one discrete): 13 where w nR (r) = w n0 (r − R) is a periodic (but not necessarily localised) WF labelled by the quantum number R (the counterpart of the quasi-momentum k in the Bloch representation), V is the cell volume and U k are unitary matrices that mix Bloch states at a given k and represent the gauge freedom that exists in the definition of the Bloch states and that is inherited by the WFs.MLWFs are obtained by choosing U k matrices that minimise the sum of the quadratic spreads of the WFs about their centres for a reference R (say, R = 0).This sum is given by the spread functional Ω may be decomposed into two positive-definite parts, 14 where is gauge invariant (i.e., invariant under the action of any unitary U k on the Bloch states), and is gauge dependent.Therefore, the "wannierisation" of an isolated manifold of bands, i.e., the transformation of Bloch states into MLWFs, amounts to minimising the gauge-dependent part Ω of the spread functional.Crucially, the matrix elements of the position operator between WFs can be expressed in reciprocal space.Under the assumption that the BZ is sampled on a uniform Monkhorst-Pack mesh of k-points composed of N points (V BZ dk (2π) 3 → 1 N k ), the gauge-independent and gauge-dependent parts of the spread may be expressed, respectively, as 14 and where b are the vectors connecting a k-point to its neighbours, w b are weights associated with the finite-difference representation of ∇ k for a given geometry, the matrix of overlaps M (k,b) is defined by and the centres of the WFs are given by Minimisation of the spread functional is achieved by considering infinitesimal gauge transformations U mnk = δ mn + dW mnk , where dW is anti-Hermitian (dW † = −dW ).The gradient of the spread functional with respect to such variations is given by where A and S are the super-operators A[B] = (B − B † )/2 and S[B] = (B + B † )/2i, respectively, and For the full derivation of Eq. ( 11) we refer to Ref. [14].This gradient is then used to generate a search direction D k for an iterative steepest-descent or conjugate-gradient minimisation of the spread: 15 at each iteration the unitary matrices are updated according to where α is a coefficient that can either be set to a fixed value or determined at each iteration via a simple polynomial line-search, and the matrix exponential is computed in the diagonal representation of D k and then transformed back in the original representation.Once the unitary matrices have been updated, the updated set of M (k,b) matrices is calculated according to where is the set of initial M (k,b) matrices, computed once and for all, at the start of the calculation, from the original set of reference Bloch orbitals |u (0) nk .

B. Entangled bands
It is often the case that the bands of interest are not separated from other bands in the Brillouin zone by energy gaps and are overlapping and hybridising with other bands that extend beyond the energy range of interest.In such cases, we refer to the bands as being entangled.
The difficulty in constructing MLWFs for entangled bands arises from the fact that, within a given energy window, the number of bands J k at each k-point k in the BZ is not a constant and is, in general, different from the target number J of WFs: J k ≥ J.Even making the energy window k-dependent would see discontinuous inclusion and exclusion of bands as the BZ is traversed.The treatment of entangled bands requires thus a more complex approach that is typically a two-step process.In the first step, a J-dimensional manifold of Bloch states is selected at each k-point, chosen to be as smooth as possible as a function of k.In the second step, the gauge freedom associated with the selected manifold is used to obtain MLWFs, just as described in Sec.II A for the case of an isolated set of bands.
Focusing on the first step, an orthonormal basis for the J-dimensional subspace S k at each k can be obtained by performing a semi-unitary transformation on the J k states at k, where To select the smoothest possible manifold, a measure of the intrinsic smoothness of the chosen subspace is needed.It turns out that such a measure is given precisely by what was the gauge-invariant part Ω I of the spread functional for isolated bands. 16Indeed, Eq. ( 7) can be expressed as where and "Tr" represents the trace over the entire Hilbert space.Tr[P k Q k+b ] measures the mismatch between the subspaces S k and S k+b , vanishing if they overlap identically.Hence Ω I measures the average mismatch of the local subspace S k across the BZ, so that an optimallysmooth subspace can be selected by minimising Ω I .Doing this with orthonormality constraints on the Bloch-like states is equivalent to solving self-consistently the set of coupled eigenvalue equations 16 b The solution can be achieved via an iterative procedure, whereby at the i th iteration the algorithm traverses the entire set of k-points, selecting at each one the Jdimensional subspace S and selecting the J eigenvectors with the largest eigenvalues. 16Self-consistency is reached when (to within a user-defined threshold dis conv tol) at all the k-points.To make the algorithm more robust, the projector appearing on the left-hand-side of Eq. ( 21) is replaced with [P which is a linear mixture of the projector that was used as input for the previous iteration and the projector defined by the output of the previous iteration.The parameter 0 < β ≤ 1 determines the degree of mixing, and is typically set to β = 0.5; setting β = 1 reverts precisely to Eq. ( 21), while smaller and smaller values of β make convergence smoother (and thus more robust) but also slower.
In practice, Eq. ( 21) is solved by diagonalising the Hermitian operator appearing on the left-hand-side in the basis of the original J k Bloch states: Once the optimal subspace has been selected, the wannierisation procedure described in Sec.II A is carried out to minimise the gauge-dependent part Ω of the spread functional within that optimal subspace.

C. Initial projections
In principle, the overlap matrix elements M (k,b) mn are the only quantities required to compute and minimise the spread functional, and generate MLWFs for either isolated or entangled bands.In practice, this is generally true when dealing with an isolated set of bands, but in the case of entangled bands a good initial guess for the subspaces S k alleviates problems associated with falling into local minima of Ω I , and/or obtaining MLWFs that cannot be chosen to be real-valued (in the case of spinless WFs).Even in the case of an isolated set of bands, a good initial guess for the WFs, whilst not usually critical, often results in faster convergence of the spread to the global minimum.(It is important to note that both for isolated and for entangled bands multiple solutions to the wannierisation or disentanglement can exist, as discussed later.) A simple and effective procedure for selecting an initial gauge (in the case of isolated bands) or an initial subspace and initial gauge (in the case of entangled bands) is to project a set of J trial orbitals g n (r) localised in real space onto the space spanned by the set of original Bloch states at each k: where the sum runs up to either J or J k , depending on whether the bands are isolated or entangled, respectively, and the inner product A mnk = ψ mk | g n is over all the Born-von Karman supercell.(In practice, the fact that the g n are localised greatly simplifies this calculation.)The matrices A k are square (J × J) or rectangular (J k ×J) in the case of isolated or entangled bands, respectively.The resulting orbitals are then orthonormalised via a Löwdin transformation: 17 where is a unitary or semi-unitary matrix.In the case of entangled bands, once an optimally-smooth subspace has been obtained as described in Sec.II B, the same trial orbitals g n (r) can be used to initialise the wannierisation procedure of Sec.II A. In practice, the matrices A k are computed once and for all at the start of the calculation, together with the overlap matrices M (k,b) .These two operations need to be performed within the context of the electronic-structure code and basis set adopted; afterwards, all the operations of Wannier90 rely only on A k and M (k,b) and not on the specific representation of ψ mk (e.g., plane waves, linearised augmented plane waves, localised basis sets, real-space grids, . . .).

III. NEW FEATURES FOR WANNIERISATION AND DISENTANGLEMENT
In this section we provide an overview of the new features associated with the core wannierisation and disentanglement algorithms in Wannier90, namely the ability to generate WFs of specific symmetry; selectively localise a subset of the WFs and/or constrain their centres to specific sites; and perform wannierisation and disentanglement more efficiently through parallelisation.

A. Symmetry-adapted Wannier functions
In periodic systems, atoms are usually found at sites q whose site-symmetry group G q is a subgroup of the full point group F of the crystal 18 (the symmetry operations in the group G q are those that leave q fixed).The set of points q a that are symmetry-equivalent sites to q is called an orbit. 19These are all the points in the unit cell that can be generated from q by applying the symmetry operations in G that do not leave q fixed.If q a is a high-symmetry site then its Wyckoff position has a single orbit; 19 for low-symmetry sites different orbits correspond to the same Wyckoff position.The number of points in the orbit(s) is the multiplicity n qa of the Wyckoff position.MLWFs, however, are not bound to reside on such high-symmetry sites, and they do not necessarily possess the site symmetries of the crystal. 16,20,21hen using MLWFs as a local orbital basis set in methods such as first-principles tight binding, DFT+U and DFT plus dynamical-mean-field theory (DMFT), which deal with beyond-DFT correlations in a local subspace such as that spanned by 3d orbitals (for transition metals or transition-metal oxides) or 4f orbitals (for rare-earth or actinide intermetallics), it is often desirable to ensure that the WFs basis possesses the local site symmetries.
Sakuma 20 has shown that such symmetry-adapted Wannier functions (SAWFs) can be constructed by introducing additional constraints on the unitary matrices U k of Eq. ( 2) during the minimisation of the spread.SAWFs, therefore, can be fully integrated within the original maximal-localisation procedure.The SAWF approach gives the user a certain degree of control over the symmetry and centres of the Wannier functions at the expense of some localisation since the final total spread of the resulting SAWFs can only be equal to, or most often larger than, that of the corresponding MLWFs with no constraints (note that in principle some SAWFs can have a smaller individual spread than any MLWFs).
A set of SAWFs {w ( ) can be specified by one representative point q a of the orbit (in the home unit cell R = 0), and by the irreducible representation (irrep) of the corresponding site-symmetry group G a (the dimension of the irrep being n ).For instance, in simple fcc crystals, like copper (space group F m−3m), the Cu atom is placed at the origin of the unit cell q = (0, 0, 0) (Wyckoff letter a with multiplicity 1, i.e., only one point in the orbit of (0, 0, 0) in the unit cell, due to the fact that F m−3m is symmorphic 19 ), whose site-symmetry group is m−3m (also referred to as O h ).One of the irreps of O h is that with character T 2g , which is 3-dimensional.
To find symmetry-adapted Wannier functions, one needs to specify the unitary transformations U ( ) miak of the Bloch states, defined by Therefore, the goal is to construct basis functions of the irrep , {ψ ( ) iak (r)} , from a linear combination of the J eigenstates ψ nk (r) of the Hamiltonian H. Since H is invariant under the full space-group G, the representation of a given symmetry operation g = (R|t) ∈ G (where R and t are the rotation and fractional-translation parts of the symmetry operation, respectively) in the {ψ nk (r)} basis must be a J × J unitary matrix 18 d k (g), i.e. d k (g) represents how the J Bloch states are transformed by the symmetry operation g: When single-particle eigenfunctions ψ nk (r) are used, as in this case, the matrix elements d k (g) can be computed directly as As for the overlap matrices M (k,b) , the resulting d k (g) are also basis-set independent.Moreover, once computed (using the original gauge) they remain fixed during the calculation.For instance, in a plane-wave code the integrals in Eq. ( 30) can be easily computed in reciprocal space.
On the other hand, it can be shown that the SAWFs transform under the action of g ∈ G with a different matrix d 18,20 which in turn defines how the ψ ( ) iak (r) transform under the action of g ∈ G: gψ where and is a lattice translation vector.It is worth to mention that a in Eq. ( 31) is uniquely defined by specifying the symmetry operations g ∈ G; see Ref. 18 and 20 for details.D k (g) is block-diagonal in the index.For a given set of Wyckoff positions, the number of blocks is given by the sum of the number of all irreps considered (if non-equivalent Wyckoff positions (q b = q a ) are present then D k (g) contains also blocks corresponding to these positions).Each block contains n 2 qa sub-matrices of dimension n × n .Therefore, if only one Wyckoff position is given with multiplicity n qa , then there are J = n qa n energy bands in the representation given by the {ψ ( ) iak (r)} (see Ref. 20

for full details).
To compute the D k (g) matrices one needs to specify the centre q a and the symmetry of the initial functions (e.g., s, p, and d).Then, for each symmetry operation g qa in the site-symmetry group G qa one needs to calculate the matrix representation of the rotational part expressed in the basis of these functions.
From Eqs. (28), (29) and (31) one can show that the following relationship exists between U k and U Rk Let g k now be the symmetry operations that leave a given k unchanged.Then, Eq. ( 34) gives the condition that U k must satisfy in this case: The initial unitary matrix U k (k ∈ IBZ) must satisfy the constraints in Eq. (35); this can be done in an iterative fashion, as discussed in Ref. 20.In practice, the Wannier functions are generated from a limited subspace spanned by a finite number of states inside a target "energy window", but this does not guarantee that a U k can be constructed for any desired irrep.In fact, if a given irrep is not compatible with the symmetry of the states within the energy window, Eq. ( 35) cannot be fulfilled.
For an isolated set of bands, the minimisation of Ω with the constraints defined in Eq. ( 34) requires the gradient G sym k of the total spread Ω with respect to a symmetryadapted gauge variation to generate a search direction D sym k .The equation for the symmetry-adapted gradient reads where G k is the original gradient given in Eq. ( 11), and n k is the number of symmetry operations in G that leave k fixed.The procedure described above for isolated bands has to be modified only slightly for the case of entangled bands.The main difference with respect to the unconstrained case of Sec.II is that the J eigenvectors of the J largest eigenvalues of the Z k matrix in Eq. ( 23) do not necessarily span the same subspace spanned by the desired symmetry-adapted Wannier functions.Since direct minimisation is not bound to give symmetry-adapted WFs, Sakuma 20 has proposed an alternative steepestdescent approach to construct the optimal unitary matrix from a set of J k ≥ J Bloch wavefunctions that also fulfil symmetry constraints.Once this step is completed and optimal symmetry-adapted Bloch functions ψ ( ) iak (r) have been computed, the algorithm proceeds as in the isolated case where one seeks the U k that minimise Ω and give the symmetry-adapted Bloch functions in terms of ψ ( ) iak (r) as in Eq. (28).

B. Selectively-localised Wannier functions and constrained Wannier centres
Wang et al. have proposed an alternative method 22 to the symmetry-adapted Wannier functions described in Section III A. Their method permits the selective localisation of a subset of the Wannier functions, which may optionally be constrained to have specified centres.Whilst this method does not enforce or guarantee symmetry constraints, it has been observed in the cases that have been studied 22 that Wannier functions whose centres are constrained to a specific site typically possess the corresponding site symmetries.
For an isolated set of J bands, selective localisation of a subset of J ≤ J Wannier functions is accomplished by minimising the total spread Ω with respect to only J ×J degrees of freedom in the unitary matrix U k .The spread functional to minimise is then given by which reduces to the original spread functional Ω of Eq. ( 3) for J = J.When J < J, it is no longer possible to cast the functional Ω as a sum of a gauge-independent term Ω I and gauge-dependent one Ω, as done in Eq. ( 4) for Ω.Nevertheless, the minimisation can be carried out with methods very similar to those described in Section II.In fact, for J < J, Ω can be written as the sum of two gauge-dependent terms, Ω = Ω IOD + Ω D , where Ω IOD is formally given by the sum of Ω I and the off-diagonal term (m = n), m, n ≤ J < J of Ω, and Ω D by the diagonal term (m = n) of Ω.If one adopts the usual discrete representation on a uniform Monkhorst-Pack grid of k-points, Ω IOD and Ω D are given by 22 and With this new spread functional, we can mimic the procedure used to obtain a set of MLWFs, and derive the gradient G k of Ω which gives the search direction to be used in the minimisation.The matrix elements of G k read are given by Eq. ( 12) and Eq. ( 13), respectively.As a result of the minimisation, we obtain a set of J maximally-localised Wannier functions, known as selectively-localised Wannier functions (SLWFs), whose spreads are in general smaller than the corresponding MLWFs.Naturally, the remaining J − J functions will be more delocalised than their MLWF counterparts, as they are not optimised, and the overall sum of spreads will be larger (or in the best case scenario equal).
The centres of the SLWFs may be constrained by adding a quadratic penalty function to the spread functional Ω , defining a new functional given by where λ is a Lagrange multiplier and x n is the desired centre for the n th WF.The procedure outlined above for minimising Ω can be also adapted to deal with Ω λ (see Ref. 22 for details), and minimising Ω λ results in selectively-localised Wannier functions subject to the constraint of fixed centres (SLWF+C).As noted above, it is observed that WFs derived using the SLWF+C approach naturally possess site symmetries, and their individual spreads are usually smaller than the corresponding spreads of MLWFs, although the total spread, combination of the J selectively optimised WFs and the J − J unoptimised functions, is larger than the total spread of the MLWFs (see, for instance last column in Tab. 1).
In the case of entangled bands, the SLWF(+C) method implicitly assumes that a subspace selection has been performed, i.e., that a smooth J-dimensional manifold exists.Since for the Ω and Ω λ functionals it is not possible to define an Ω I that measures the intrinsic smoothness of the underlying manifold, the additional constraints in Eq. (37) and Eq. ( 41) can only be imposed during the wannierisation step.This means that SLWF(+C) can be seamlessly coupled with the disentanglement procedure, with no further additions to the original procedure of Sec.II B.

SAWF and SLWF+C in GaAs
As an example of the capabilities of the SAWF and SLWF+C approaches, we show how to construct atomcentred WFs that possess the local site symmetries in gallium arsenide (GaAs).In particular, we discuss how to obtain one WF from the four valence bands of GaAs that is centred on the As atom and that transforms like the identity under the symmetry operations in T d , the site-symmetry group of the As site (for completeness, we also show one MLWF and one SLWF without constraints).Since we only deal with the four valence bands of GaAs-an isolated manifold-no prior subspace selection is required for the wannierisation.All calculations were carried out with the planewave DFT code Quantum ESPRESSO, 1 employing PAW pseudopotentials 23,24 from the pslibrary (v1.0). 25 For the exchange-correlation functional we use the Perdew-Burke-Ernzerhof approximation. 26The energy cut-off for the plane-waves basis is set to 35.0 Ry, and a 4×4×4 uniform grid is used to sample the Brillouin zone.The lattice parameter is set to the experimental value (5.65 Å).The overlap matrices M (k,b) mn in Eq. ( 9), the projection matrices A mnk in Eq. ( 26) and both d k (g) in Eq. ( 30) and D k (g) in Eq. ( 32) have been computed with the pw2wannier90.xinterface.
GaAs is a III-V semiconductor that crystallises in the fcc cubic structure, with a two-atom basis: the Ga cation and the As anion (space group F −43m); in our example the Ga atom is placed at the origin of the unit cell, whose Wyckoff letter is a and site-symmetry group is −43m, also known as T d .The As atom is placed at ( 1 /4, 1 /4, 1 /4), whose Wyckoff letter is c and site-symmetry group is also T d .
Marzari and Vanderbilt 14 have shown that the MLWFs for the 4-dimensional valence manifold are centred on the four As-Ga bonds, have sp 3 character and can be found by specifying four s-like orbitals on each covalent bond as initial guess (a representative is shown in Fig. 1(a)).These bond-centred functions correspond to the irreducible representation A 1 of the site-symmetry group C 3v of the Wyckoff position e.Hence, the MLWFs can also be obtained with the SAWF approach by specifying the centres and the shapes of the initial projections, e.g.four s-like orbitals centred on the four As-Ga bonds, and the symmetry operations in the point group C 3v .
Using the SAWF method we can enforce the WFs to have the local site symmetries.In particular, since T d has 5 irreps of dimension 1, 1, 2, 3 and 3 respectively, one can form an 1+3-dimensional representation for the four SAWFs.Thus, a set of initial projections compatible with the symmetries of the valence bands is: one s-like orbital (1-dimensional irrep whose character is A 1 ) and three p-like orbitals (3-dimensional irrep whose character is T 2 ) centred on As.Fig. 1(b) shows the SAWF which corresponds to the A 1 representation and transforms like the identity under T d .
The same result can be obtained with the SLWF+C method by selectively localising one function J = 1 (J = 4) and constraining its centre to sit on the As site ( 1 /4, 1 /4, 1 /4).In the case of GaAs the SLWF+C method turns out to be very robust, to the point that four s-like orbitals randomly centred in the unit cell can be used as initial guess without affecting the result of the optimised function.Fig. 1(c) shows the resulting function using the SLWF method without constraints, while Fig. 1(d) shows the result using SLWF+C.
It is worth to note that for this particular system, it is possible to achieve this result with the maximal localisation procedure if one carefully selects the initial projections, i.e., one s-like and three p-like orbitals on the As atom.The resulting WFs will possess the local site symmetries but will not correspond to the global minimum of the spread functional Ω.More precisely, they will correspond to a saddle point of Ω (unstable against small perturbations of the initial projections).In   For MLWF, SLWF and SLWF+C, four s-type orbitals centred at the midpoints of the four Ga-As bonds (( 1 /8, 1 /8, 1 /8),( 1 /8, 1 /8, -3 /8),( -3 /8, 1 /8, 1 /8),( 1 /8, -3 /8, 1 /8)) were used as initial guess.In the case of SLWF+C, we optimise the first WF and constrain its centre to sit at ( 1 /4, 1 /4, 1 /4).For SAWF one s-type and three p-type orbitals centred on the As atom are used for the initial guess.For all plots we choose an isosurface level of ± 0.5 Å−3 /2 (blue for + values and red for − values) using the Vesta visualisation program. 27Bottom (table ): Cartesian coordinates of the centres r, (minimised) individual spreads r 2 − r 2 and the total spread Ω of all four valence WFs using the above mentioned four different minimisation schemes and initial guesses.
FIG. 2: Illustration of the parallelisation scheme for a 3 × 3 mesh of k-points (black dots) and one MPI process per k-point.The calculation of the M (k,b) , Z k , ∆W k and U k matrices are distributed over processes by k-point.The U k+b matrices for the neighbouring k-points are sent from process to process (orange arrows) for the calculation of the M (k,b) and Z k matrices.

C. Parallelisation
In Wannier90 v3.0 we have implemented an efficient parallelisation scheme for the calculation of MLWFs using the message passing interface (MPI).
Calculation of the spread.The time-consuming part in the evaluation of the spread Ω is updating the M (k,b) matrices according to Eq. ( 16), since this requires computing overlap matrix elements between all pairs of bands, and between all k-points k and their neighbours k + b.Therefore, an efficient speed up for the evaluation of the spread can be achieved by distributing over several processes the calculation of the M (k,b) matrices for different k-points.In order to compute the M (k,b) according to Eq. ( 16), the U k+b matrices are sent from process to process prior to the calculation of the overlap matrices.We stress the fact that the U k+b matrices are the only large arrays that have to be shared between processes, which limits the time spent in communication.The relatively large M (k,b) matrices are not sent between processes for the evaluation of Eqs.(7) and (8).Instead, it is enough to collect the contributions to the spread from the different k-points, i.e., a set of scalars, and then sum them up for evaluation of the total spread.This parallelisation scheme is illustrated in Fig. 2 for a 3 × 3 mesh of k-points with 9 MPI processes.
Minimisation of the spread.The minimisation of the spread functional is based on an iterative steepestdescent or conjugate-gradient algorithm.In each iteration, the unitary matrices U k are updated according to 14 where ∆W k = αD k , see Eq. (15).
Updating the U k matrices according to this equation is by far the most time-consuming part in the iterative minimisation algorithm, as it requires a diagonalisation of the ∆W k matrices.A significant speed-up can be obtained, however, by distributing the diagonalisation of the different ∆W k matrices over several processes, and performing the calculations fully in parallel.The evaluation of ∆W k essentially requires the calculation of the overlap matrices M (k,b) , as discussed above.
Disentanglement.The disentanglement procedure is concerned with finding the optimal subspace S k .As the functional Ω I measures the global subspace dispersion across the Brillouin zone, at first sight it is not obvious that the task of minimising the spread Ω I can be parallelised with respect to the k-points.In the iterative algorithm of Eq. ( 21), the systematic reduction of the spread functional at the i th iteration is achieved by minimising the spillage of the subspace S (i) k over the neighbouring subspaces from the previous iteration S k+b .This problem reduces to the diagonalisation of N independent matrices (N is the total number of k-points of the mesh), where an efficient speed-up of the disentanglement procedure can be achieved by distributing the diagonalisation of the Z (i) k matrices of Eq. ( 23) over several processes, which can be done fully in parallel.Since the construction of Z (i) k only requires the knowledge of the U k+b matrices, these must be communicated between processes, as shown in Fig. 2.This results in a similar time spent in communication for the disentanglement part of the code as for the wannierisation part.
Distribution of large matrices.The parallelisation scheme discussed above relies on the parallel evaluation of relevant matrices over k-points on each processor.For systems with large number of k-points and bands, it is also desirable to distribute the matrices themselves among the available cores so that the memory per core required to store them is reduced.For example, in the case of isolated bands, storing all the M (k,b) matrices requires an allocation of dimension J × J × N × N b , where N b is the number neighbours of each of the N k-points of the mesh.By distributing the storage across N c cores, the storage requirement per core decreases accordingly by a factor of approximately N c .
Performance.We have tested the performance of this parallelisation scheme for the calculation of the MLWFs in a L1 0 −FePt(5)/Pt(18) thin film.Computational details were given in Ref. 28.The benchmarks have been performed on the JURECA supercomputer of the Jülich Supercomputing Center.We have extracted an optimal subspace of dimension J = 414 from a set of 580 Bloch states per k-point.The upper limit of the inner window was set to 5 eV above the Fermi energy, and 414 MLWFs were constructed by minimising the spread Ω.The performance benchmark was based on the average wall-clock time for a single iteration of the minimisation procedure (several thousand iterations are usually needed for convergence).We first analyse the weak scaling of our im- plementation, i.e., how the computation time varies with the number of cores N c for a fixed number of k-points per process.We show in Fig. 3(a) the time per iteration for the disentanglement and wannierisation parts of the minimisation, always using one k-point per process.As we vary the number of k-points N from 4 to 144, the computation time increases only by a factor of 1.3 and 1.8 for disentanglement and wannierisation, respectively.
We then demonstrate the strong scaling of our parallelisation scheme in Fig. 3(b), i.e., how the computation time varies with the number of cores N c for a fixed number N = 64 of k-points.When varying the number of cores from 4 to 64, we observe a decrease of the computation time per iteration by a factor of 12.6 and 9.5 for disentanglement and wannierisation, respectively.The deviation from ideal scaling is mostly explained by the time spent in inter-core communication of the U k+b matrices.

IV. ENHANCEMENTS IN FUNCTIONALITY
In this section we describe a number of enhancements to the functionality of the core Wannier90 code, namely: the ability to compute and visualise spinorvalued WFs, including developments to the interface with the Quantum ESPRESSO package to cover also the case of non-collinear spin calculations performed with ultrasoft pseudopotentials (previously not implemented); an improvement to the method for interpolating the kspace Hamiltonian; the ability to select a subset from a larger set of projections of localised trial orbitals onto the Bloch states for initialising the WFs; and new functionality for plotting WFs in Gaussian cube format on WF-centred grids with non-orthogonal translation vec-tors.

A. Spinor-valued Wannier functions with ultrasoft and projector-augmented-wave pseudopotentials
The calculation of the overlap matrix in Eq. ( 17) within the ultrasoft-pseudopotential formalism proceeds via the inclusion of so-called augmentation functions, 29 where |ψ ps mk is the pseudo-wavefunction, is the Fourier transform of the augmentation charge, and , where |β k Ii denotes the i th projector of the pseudopotential on the I th atom in the unit cell.We refer to Appendix B of Ref. 29 for detailed expressions.
When spin-orbit coupling is included, the Bloch functions become two-component spinors (ψ ↑ nk (r), ψ ↓ nk (r)) T , where ψ σ nk (r) is the spin-up (for σ =↑) or spin-down (for σ =↓) component with respect to the chosen spin quantisation axis.Accordingly, 18) in Ref. 30) and Eq. ( 42) becomes The above expressions, together with the corresponding ones for the matrix elements of the spin operator, have been implemented in the pw2wannier90.xinterface between Quantum ESPRESSO and Wannier90.
The plotting routines of Wannier90 have also been adapted to work with the complex-valued spinor WFs obtained from calculations with spin-orbit coupling.It then becomes necessary to decide how to represent graphically the information contained in the two spinor components.
One option is to only plot the norm spinor WFs, which is reminiscent of the total charge density in the case of a 2×2 density matrix in non-collinear DFT.Another possibility is to plot independently the up-and downspin components of the spinor WF.Since each of them is in general complex-valued, two options are provided in the code: (i) to plot only the magnitudes |ψ ↑ nk (r)| and |ψ ↓ nk (r)| of the two components; or (ii) to encode the phase information by outputting |ψ ↑ nk (r)|sgn(Re{ψ ↑ nk (r)}) and |ψ ↓ nk (r)|sgn(Re{ψ ↓ nk (r)}), where sgn is the sign function.Which of these various options is adopted by the Wannier90 code is controlled by two input parameters, wannier plot spinor mode and wannier plot spinor phase.
Finally we note that, for WFs constructed from ultrasoft pseudopotentials or within the projector-augmentedwave (PAW) method, only pseudo-wavefunctions represented on the soft FFT grid are considered in plotting WFs within the present scheme, that is, the WFs are not normalised.

B. Improved Wannier interpolation by minimal-distance replica selection
2][33] In many respects it resembles Fourier interpolation, which uses discrete Fourier transforms to reconstruct faithfully continuous signals from a discrete sampling, provided that the signal has a finite bandwidth and that the sampling rate is at least twice the bandwidth (the so-called Nyquist-Shannon condition).
In the context of Wannier interpolation, the "sampled signal" is the set of matrix elements of a lattice-periodic operator such as the Hamiltonian, defined on the same uniform grid {k j } that was used to minimise the Wannier spread functional (see Sec. II A).
The states |χ nkj are the Bloch sums of the WFs, related to ab initio Bloch eigenstates by |χ nkj = m |ψ mkj U mnkj .
To reconstruct the "continuous signal" H nmk at arbitrary k, the matrix elements of Eq. ( 45) are first mapped onto real space using the discrete Fourier transform where N = N 1 × N 2 × N 3 is the grid size (which is also the number of k-points in Wannier90).The matrices H mnkj are then interpolated onto an arbitrary k using an inverse discrete Fourier transform, where the sum is over N lattice vectors R , and the interpolated energy eigenvalues are obtained by diagonalising H k .In the limit of an infinitely dense grid of k-points the procedure is exact and the sum in Eq. ( 47) becomes an infinite series.Owing to the real-space localisation of the Wannier functions, the matrix elements H mnR become vanishingly small when the distance between the Wannier centres exceeds a critical value L (the "bandwidth" of the Wannier Hamiltonian), so that actually only a finite number of terms contributes significantly to the sum in Eq. ( 47).This means that, even with a finite N 1 × N 2 × N 3 grid, the interpolation is still accurate provided that -by analogy with the Nyquist-Shannon condition -the "sampling rate" N i along each cell vector a i is sufficiently large to ensure that Still, the result of the interpolation crucially depends on the choice of the N lattice vectors to be summed over in Eq. (47).Indeed, when using a finite grid, there is a considerable freedom in choosing the set {R } as H mnR is invariant under R → R+T for any vector T of the Bornvon Karman superlattice generated by {A i = N i a i }.The phase factor in Eq. ( 47) is also invariant when k ∈ {k j }, but not for arbitrary k.Hence we need to choose, among the infinite set of "replicas" R = R + T of R, which one to include in Eq. (47).We take the original vectors R to lie within the Wigner-Seitz supercell centred at the origin.If some of them fall on its boundary then their total number exceeds N and weight factors must be introduced in Eq. (47).For each combination of m, n and R, the optimal choice of T is the one that minimises the distance between the two Wannier centres.With this choice, the spurious effects arising from the artificial supercell periodicity are minimised.Earlier versions of Wannier90 implemented a simplified procedure whereby the vectors R in Eq. ( 47) were chosen to coincide with the unshifted vectors R that are closer to the origin than to any other point T on the superlattice, irrespective of the WF pair (m, n).As illustrated in Fig. 4, this procedure does not always lead to the shortest distance between the pair of WFs, especially when some of the N i are small and the Wannier centres are far from the origin of the cell.
Wannier90 now implements an improved algorithm that enforces the minimal-distance condition of Eq. ( 48), yielding a more accurate Fourier interpolation.The algorithm is the following: (a) For each term in Eq. ( 47) pick, among all the replicas R = R + T of R, the one that minimises the distance between Wannier centres (Eq.( 48)).
(b) If there are N mnR different vectors T for which the distance of Eq. ( 48) is minimal, then include all of them in Eq. ( 47) with a weight factor 1/N mnR .
An equivalent way to describe these steps is that (a) we choose T such that r n +R+T falls inside the Wigner-Seitz supercell centred at r m (see Fig. 4), and that (b) if it falls on a face, edge or vertex of the Wigner-Seitz supercell, we keep all the equivalent replicas with an appropriate weight factor.In practice the condition in step (b) is enforced within a certain tolerance, to account for the numerical imprecision in the values of the Wannier centres and in the definition of the unit cell vectors.Although step (b) is much less important than (a) for obtaining a good Fourier interpolation, it helps ensuring that the interpolated bands respect the symmetries of the system; if step (b) is skipped, small artificial band splittings may FIG. 4: Owing to the periodicity of the Wannier functions over the Born-von Karman supercell (with size 2 × 2 here), the matrix element H mnR describes the interaction between the m th WF w m0 (shown in orange) with centre r m inside the home unit cell R = 0 (green shaded area) and the n th WF w nR (shown in blue) centred inside the unit cell R, or any of its supercell-periodic replicas displaced by a superlattice vector T. When performing Wannier interpolation, we now impose a minimal-distance condition by choosing the replica w n,R+T of w nR whose centre lies within the Wigner-Seitz supercell centred at r m (thick orange line).
occur at high-symmetry points, lines, or planes in the BZ.The procedure outlined above amounts to replacing Eq. (47) with where {T (j) mnR } are the N mnR vectors T that minimise the distance of Eq. ( 48) for a given combination of m, n and R; R lies within the Wigner-Seitz supercell centred on the origin.
The benefits of this modified interpolation scheme are most evident when considering a large unit cell sampled at the Γ point only.In this case N = 1 so that Eq. ( 47) with {R } = {R} = {0} would reduce to H mnk = H mn0 , yielding interpolated bands that do not disperse with k.This is nonetheless an artefact of the choice {R } = {0} (of earlier versions of Wannier90) and not an intrinsic limitation of Wannier interpolation, as first demonstrated in Ref. 31 for one-dimensional systems.Indeed, equation (49), which in a sense extends Ref. 31 to any spatial dimension, becomes in this case which can produce dispersive bands.This is illustrated in Fig. 5(a) for the case of a one-dimensional chain of carbon Clear improvements in the interpolated bands are also obtained for bulk solids, as shown in Fig. 5(b) for the case of silicon.The earlier implementation breaks the two-fold degeneracy along the X−W line, with one of the two bands becoming flat.The new procedure recovers the correct degeneracies, and reproduces more closely the ab initio band structure (the remaining small deviations are due to the use of a coarse k-point mesh that does not satisfy the Nyquist-Shannon condition, and would disappear for denser k-grids together with the differences between the two interpolation procedures).

C. Selection of projections
In many cases, and particularly for entangled bands, it is necessary to have a good initial guess for the MLWFs in order to properly converge the spread to the global minimum.Determining a good initial guess often involves a trial and error approach, using different combinations of orbital types, orientations and positions.While for small systems performing many computations of the projection matrices is relatively cheap, for large systems there is a cost associated with storing and reading the wavefunctions to compute new projection matrices for each new attempt at a better initial guess.Previously, the number of projections that could be specified had to be equal to the number J of WFs to be constructed.The latest version of the code lifts this restriction, making it possible to define in the pre-processing step a larger number J + > J of projection functions to consider as initial guesses.In this way, the computationally expensive and potentially I/O-heavy construction of the projection matrices A k is performed only once for all possible projections that a user would like to consider.
Once the A k matrices (of dimension J × J + at each k) have been obtained, one proceeds with constructing the MLWFs by simply selecting, via a new input parameter (select projections) of the Wannier90 code, which J columns to use among the J + that were computed by the interface code.Experimenting with different trial orbitals can thus be achieved by simply selecting a different set of projections within the Wannier90 input file, without the need to perform the pre-processing step again.
Similarly, another use case for this new option is the construction of WFs for the same material but for different groups of bands.Typically one would have to modify the Wannier90 input file and run the interface code multiple times, while now the interface code may compute A k for a superset of trial orbitals just once, and then different subsets may be chosen by simple modification of a single input parameter.As a demonstration, we have adapted example11 of the Wannier90 distribution (silicon band structure), that considers two band groups: (a) the valence bands only, described by four bond-centred s orbitals, and (b) the four valence and the four lowest-lying conduction bands together, described by atom-centred sp 3 orbitals.In the example, projections onto all 12 trial orbitals are provided, and the different cases are covered by specifying in the Wannier90 input file which subset of projections is required.

D. Plotting cube files with non-orthogonal vectors
In Wannier90 v3.0 it is possible to plot the MLWFs in real-space in Gaussian cube format, including the case of non-orthogonal cell lattice vectors.Many modern visualisation programs such as Vesta 27 are capable of handling non-orthogonal cube files and the cube file format can be read by many computational chemistry programs.Wannier90's representation of MLWFs in cube format can be significantly more compact than using the alternative xsf format.With the latter, MLWFs are calculated (albeit with a coarse sampling) on a supercell of the computational cell that can be potentially large (the extent of the supercell is controlled by an input parameter wannier plot supercell).Whereas, with the cube format, each Wannier function is represented on a grid that is centred on the Wannier function itself and has a user-defined extent, which is the smallest parallelepiped (whose sides are aligned with the cell vectors) that can enclose a sphere with a user-defined radius wannier plot radius.Because MLWFs are strongly localised in real space, relatively small cut-offs are all that is required, significantly smaller than the length-scale over which the MLWFs themselves are periodic.As a result, the cube format is particularly useful when a more memory-efficient representation is needed.The cube format can be activated by setting the input parameter wannier plot mode to cube, and the code can handle both isolated molecular systems (treated within the supercell approximation) as well as periodic crystals by setting wannier plot mode to either molecule or crystal, respectively.

V. NEW POST-PROCESSING FEATURES
Once the electronic bands of interest have been disentangled and wannierised to obtain well-localised WFs, the Wannier90 software package includes a number of modules and utilities that use these WFs to calculate various electronic-structure properties.Much of this functionality exists within postw90.x, an MPI-parallel code that forms an integral part of the Wannier90 package.In v2.x of Wannier90, postw90.xincluded functionality for computing densities of states and partial densities of states, energy bands and Berry curvature along specified lines and planes in k-space, anomalous Hall conductivity, orbital magnetisation and optical conductivity, Boltzmann transport coefficients within the relaxation time approximation, and band energies and derivatives on a generic user-defined list of k-points.Some further functionality exists in a set of utilities that are provided as part of the Wannier90 package, including a code (w90pov.F90) to plot WFs rendered using the Persistence of Vision Raytracer (POV-Ray) 34 code and to compute van der Waals interactions with WFs (w90vdw.F90).
In addition, there are a number of external packages for computing advanced properties based on WFs and which interface to Wannier90.These include codes to generate tight-binding models such as pythTB 35 and tbmodels, 36 quantum transport codes such as sisl, 37 gollum, 38 omen 39 and nanoTCAD-ViDES, 40 the EPW 41 code for calculating properties related to electron-phonon interactions and WannierTools 42 for the investigation of novel topological materials.
Below we describe some of the new post-processing features of Wannier90 that have been introduced in the latest version of the code, v3.0.

A. postw90.x: Shift Current
4][45] It can be divided phenomenologically into linear (LPGE) and cir-cular (CPGE) effects, which have different symmetry requirements within the acentric crystal classes.The CPGE requires elliptically-polarised light, and occurs in gyrotropic crystals (see next subsection).The LPGE occurs with linearly or unpolarised light as well; it is present in piezoelectric crystals and is given by where J(0) is the induced DC photocurrent density, E(ω) = E * (−ω) is the amplitude of the optical electric field, and σ abc = σ acb = σ * abc is a nonlinear photoconductivity tensor.
The shift current is the part of the LPGE photocurrent generated by interband light absorption. 46Intuitively, it arises from a coordinate shift accompanying the photoexcitation of electrons from one band to another.][50][51] The shift current along direction a induced by light that is linearly polarised along b is described by the following photoconductivity tensor: 51,52 Here, f nmk = f nk − f mk is the difference between occupation factors, ω mnk = mk − nk is the difference between energy eigenvalues of the Bloch bands, r b nmk is the b th Cartesian component of the interband dipole matrix (the off-diagonal part of the Berry connection matrix A nmk = i u nk |∂ k u mk ), and is the shift vector (not to be confused with the lattice vector R, or with the matrix R (k,b) defined in Eq. ( 12)).
The shift vector has units of length, and it describes the real-space shift of wavepackets under photoexcitation.The numerical evaluation of Eq. ( 53) is tricky because the individual terms therein are gauge-dependent, and only their sum is unique.Different strategies were discussed in the early literature in the context of model calculations 50,53 and more recently for ab initio calculations.The ab initio implementation of Young and Rappe 54 employed a gauge-invariant k-space discretisation of Eq. ( 53), inspired by the discretised Berry-phase formula for electric polarisation. 55he implementation in Wannier90 is based instead on the formulation of Sipe and co-workers. 51,56In this formulation, the shift (interband) contribution to the LPGE tensor in Eq. ( 51) is expressed as where is the generalised derivative of the interband dipole.When b = c, Eq. ( 54) becomes equivalent to Eq. ( 52). 51he generalised derivative r b;a nmk is a well-behaved (covariant) quantity under gauge transformation but -as in the case of the shift vector -this is not the case for the individual terms in Eq. ( 55), leading to numerical instabilities.To circumvent this problem, Sipe and co-workers used k•p perturbation theory to recast Eq. ( 55) as a summation over intermediate virtual states where the individual terms are gauge-covariant. 51,56That strategy has been successfully employed to evaluate the shift-current spectrum from first principles. 57,58s it is well known, similar "sum rule" expressions can be written for other quantities involving k derivatives, such as the inverse effective-mass tensor and the Berry curvature tensor.When evaluating such expressions, a sufficiently large number of virtual states must be included to achieve convergence.Alternatively, one can work with a basis spanning a finite number of bands, such as a tight-binding or Wannier basis, and carefully reformulate k • p perturbation theory within that incomplete basis to avoid truncation errors.This was done first for the inverse effective-mass tensor 59,60 and later for the Berry curvature, 32 and is at the heart of the Wannier-interpolation technique for calculating the intrinsic anomalous Hall conductivity (AHC). 32 truncation-free tight-binding sum rule for the generalised derivative of Eq. ( 55) was given in Ref. 61.Contrary to the inverse effective-mass tensor, which only depends on the Hamiltonian matrix elements, 59,60 the generalised derivative also depends -in some cases rather strongly -on the intracell coordinates of the basis orbitals. 61Building on that formulation, Wannier interpolation schemes for calculating the shift current were recently introduced 62,63 (the implementation in Wannier90 follows Ref. 63).In addition to the Hamiltonian matrix elements and Wannier centres, the shift current was found to depend sensitively on the off-diagonal position matrix elements.
The generalised derivative can be used to evaluate other nonlinear optical responses, such as secondharmonic generation. 51,62While these are not currently implemented in Wannier90, it should be straightforward to adapt the shift-current routines for that purpose.

B. postw90.x: Gyrotropic module
The spontaneous magnetisation of ferromagnets endows their linear conductivity tensor σ ab (ω) with an antisymmetric part.At ω = 0 that part describes the anomalous Hall conductivity (AHC), and at finite frequencies it gives rise to magneto-optical effects such as Faraday rotation in transmission and magnetic circular dichroism in absorption.In paramagnets, those effects appear under an external magnetic field.
Interestingly, an antisymmetric conductivity can be induced in certain nonmagnetic (semi)conductors by purely electrical means, namely, by passing a current through the sample. 64,65Symmetry arguments indicate that this is allowed in the gyrotropic crystal classes, a subset of the acentric crystal classes that includes those that are chiral, polar, or optically active. 43The first experimental demonstration consisted in the measurement of a currentinduced change in the rotatory power of p-doped trigonal tellurium. 66,679][70][71][72] Like the linear (spontaneous) AHE in ferromagnetic metals, the nonlinear (current-induced) AHE in gyrotropic conductors has an intrinsic contribution associated with the Berry curvature in momentum space. 68long with nonlinear magneto-optical and anomalous Hall effects, the flow of electrical current in a gyrotropic conducting medium also generates a net magnetisation.This kinetic magnetoelectric effect was originally proposed for bulk chiral conductors, 65,73 and later for twodimensional (2D) inversion layers with an out-of-plane polar axis, 74,75 where it has been studied intensively. 76he kinetic magnetoelectric effect in 2D -also known as the Edelstein effect -is a purely spin effect, whereas in bulk crystals an orbital contribution is also present. 73he orbital kinetic magnetoelectric effect was recently formulated in terms of the intrinsic orbital moment of the Bloch electrons, 77,78 a quantity closely related to the Berry curvature.
Another phenomenon characteristic of gyrotropic crystals is the circular photogalvanic effect (CPGE) that was mentioned briefly in Sec.V A. This nonlinear optical effect consists in the generation of a photocurrent that reverses sign with the helicity of light, [43][44][45]65,79 and it occurs when light is absorbed via interband or intraband scattering processes. The itraband contribution to the CPGE is closely related to the nonlinear AHE, as both arise from the Berry curvature of the conduction electrons.68,80,81 The gyrotropic effects listed above are being very actively investigated in connection with novel materials ranging from topological semimetals 69,82,83 to monolayer and bilayer transition-metal dichalcogenides.[70][71][72] The sensitivity of both the Berry curvature and the intrinsic orbital moment to the details of the electronic structure, together with the need to sample them on a dense mesh of k-points, calls for the development of accurate and efficient ab initio methodologies for this class of problems.
Building on existing Wannier-interpolation schemes for calculating the spontaneous intrinsic AHC and orbital magnetisation, 32,84 the corresponding methodology for gyrotropic effects was presented in Ref. 85, where it was applied to p-doped trigonal tellurium (in that work, only the intraband contribution to the CPGE was considered).The resulting computer code has been incorporated in Wannier90 as the gyrotropic.F90 module.
The central task of that module is to evaluate response tensors such as the "Berry-curvature dipole" 68 where f 0 is the equilibrium occupation factor and Ω nk = ∇ k ×A nnk is the Berry curvature of the Bloch bands (the curl of the band-diagonal Berry connection A nnk introduced in Sec.V A).Also of interest is the tensor K ab , obtained by replacing the Berry curvature in Eq. ( 56) with the intrinsic magnetic moment of the Bloch states. 77,78,85he two tensors D ab and K ab describe several of the aforementioned gyrotropic effects as follows: where j is the induced current density, M is the induced magnetisation, E or E(ω) is the amplitude of the static or optical electric field, τ is the relaxation time of the conduction electrons, and ε abc is the alternating tensor.The reader is referred to Ref. 85 for more details such as the prefactors in the expressions above, as well as the formulas for the current-induced Faraday effect and natural optical activity, both of which are also implemented in the gyrotropic.F90 module.

C. postw90.x: Spin Hall conductivity
The spin Hall effect (SHE) is a phenomenon in which a spin current is generated by applying an electric field.The current is often transverse to the field (Hall-like), but this is not always the case. 86The SHE is characterised by the spin Hall conductivity (SHC) tensor σ spin,c ab as follows: where J spin,c a is the spin-current density along direction a with its spin pointing along c, and E b is the external electric field of frequency ω applied along b.In nonmagnetic materials the equal number of up-and downspin electrons forces the AHE to vanish, resulting in a pure spin current.
Like the AHC, the SHC contains both intrinsic and extrinsic contributions. 87The intrinsic contribution to the SHC can be calculated from the following Kubo formula, 88 where s c , v a and j spin,c a = 1 2 {s c , v a } are the spin, velocity and spin current operators, respectively; V is the cell volume, and N is the total number of k-points used to sample the BZ.Equations ( 58) are very similar to the Kubo formula for the AHC, except for the replacement of a velocity matrix element by a spin-current matrix element.As mentioned in the previous two subsections, Wannierinterpolation techniques are very efficient at calculating such quantities.
A Wannier-interpolation method scheme for evaluating the intrinsic SHC was developed in Ref. 88 (see also Ref. 89 for a related but independent work).The required quantities from the underlying ab initio calculation are the spin matrix elements mk δ mn , and the overlap matrix elements of Eq. ( 17).Since the calculation of all these quantities has been previously implemented in pw2wannier90.x(the interface code between pwscf and Wannier90), this advantageous interpolation scheme can be readily used while keeping to a minimum the interaction between the ab initio code and Wannier90.
The application of the method to fcc Pt is illustrated in Fig. 6.Panel (a) shows the calculated SHC as a function of the Fermi-level position, panel (b) depicts the "spin Berry curvature" of Eq. (58b) that gives the contribution from each band state to the SHC.The aforementioned functionalities have been incorporated in the berry.F90, kpath.F90 and kslice.F90 modules of postw90.x.

D. postw90.x: Parallelisation improvements
The original implementation of the berry.F90 module in postw90.x(for computing Berry-phase properties such as orbital magnetisation and anomalous Hall conductivity 84 ), introduced in Wannier90 v2.0, was written with code readability in mind and had not been optimised for computational speed.In Wannier90 v3.0, all parts of the berry.F90 module have been parallelised while keeping the code readable; moreover, its scalability has been improved, accelerating its performance by several orders of magnitude. 90o illustrate the improvements in performance we present calculations on a 128-atom supercell of GaAs interstitially doped with Mn.We use a lattice con- stant of the elementary cell of 5.65 Å.We use normconserving relativistic pseudopotentials with the PBE exchange-correlation functional.The energy cut-off for the plane waves is set to 40 Ry, and the Brillouin-zone sampling of the supercell is 3 × 3 × 3. We use a Gaussian metallic smearing with a broadening of 0.015 Ry.For the non-self-consistent step of the calculation, 600 bands are computed and used to construct 517 Wannier functions.The initial projections are chosen as a set of sp 3 orbitals centred on each Ga and As atom, and a set of d orbitals on Mn.The calculations were performed on the Prometheus supercomputer of PL-GRID (in Poland).
The Berry-phase calculations can be performed in three distinct ways: (i) 3D quantities in k-space (routine berry main), (ii) the same quantities resolved on 2D planes (routine kslice.F90), and (iii) 1D paths (routine kpath.F90) in the Brillouin zone.In the benchmarks, we will refer to these three cases as "Berry 3D", "Berry 2D", and "Berry 1D", respectively.
The first optimisation target was the function utility rotate in the module utility.F90, which calculates a matrix product of the form B = R † AR using Fortran's built-in matmul function.The new routine utility rotate new uses instead BLAS and performs about 5.7 times better than the original one, giving a total speedup for berry main of about 55%.
A second performance-critical section of code was identified in the routine get imfgh k list, which took more than 50% of the total run-time of berry main.This routine computes three quantities: F αβ , G αβ and H αβ , which are defined in Eqs. ( 51), ( 66) and (56) of Ref. 84.By some algebraic transformations, it was possible to reduce 25 calls to matmul, carried out in the innermost runtime-critical loop, to only 5 calls.After replacement of matmul with the Basic Linear Algebra Subprogram (BLAS), the speed up of this routine exceeds a factor of 11, and the total time spent in berry main is 2.5 times shorter (including the speed-up from the first optimisation).
In the third step, a bottleneck was eliminated in the initialisation phase, where mpi bcast was waiting more than two minutes for the master rank to broadcast the parameters.The majority of this time was spent in loops computing matrix products of the form S = (V 1 ) † S 0 V 2 .Again, we replaced this with two calls to the BLAS gemm routine.This resulted in a speed-up of a factor of 610 for the calculation of this matrix product in our test case, and the total initialisation time dropped to less than 15 seconds.In total, the berry main routine runs about 5 times faster than it did originally.
The scalability results of berry main, kslice.F90 kpath.F90 are presented in Fig. 7, and a comparison with the scalability of the previous version of berry main is also given.Absolute times for some of the calculations are reported in Table I TABLE I: Wall-time for some of the runs performed with the Berry module, before (Wannier90 v2.0) and after (Wannier90 v3.0) the optimisations, for the test system described in the main text.N c indicates the number of cores used in the calculation.

E. GW bands interpolation
While density-functional theory (DFT) is the method of choice for most applications in materials modelling, it is well known that DFT is not meant to provide spectral properties such as band structures, band gaps and optical spectra.Green's function formulation of many-body perturbation theory (MBPT) 91 overcomes this limitation, and allows the excitation spectrum to be obtained from the knowledge of the Green's function.Within MBPT the interacting electronic Green's func- FIG.7: (Top) Speedup of the new Wannier90 v3.0 with respect to v2.0, for a run of the berry module (mode "Berry 3D") on the test system described in the text, demonstrating the improvements implemented in the new version of the code.(Bottom) Total CPU time (defined as total walltime times number of CPUs) for the three cases "Berry 3D", "Berry 2D" and "Berry 1D", normalised with respect to the same case run with N cpu = 24, for the Wannier90 v3.0 code.Note that calculations with N cpu ≥ 480 for "Berry 3D" were run on a denser grid (100 × 100 × 100 rather than 30 × 30 × 30) and values have been rescaled using the time measured for both grids at N cpu = 480.
tion G(r, r , ω) may be expressed in terms of the noninteracting Green's function G 0 (r, r , ω) and the so-called self-energy Σ(r, r , ω), where several accurate approximations for Σ have been developed and implemented into first-principles codes. 92While maximally-localised Wannier functions for self-consistent GW quasiparticles have been discussed in Ref. 93, here we focus on the protocol to perform bands interpolation at the one-shot G 0 W 0 level.For solids, the G 0 W 0 approximation has proven to be an excellent compromise between accuracy and computational cost and it has become the most popular MBPT technique in computational materials science. 94In the standard one-shot G 0 W 0 approach, Σ is written in terms of the Kohn-Sham (KS) Green's function and the RPA dielectric matrix, both obtained from the knowledge of DFT-KS orbitals and eigenenergies.Quasi-particle (QP) energies are obtained from: where ψ nk and nk are the KS orbitals and eigenenergies, Z nk is the so-called renormalisation factor and V xc is the DFT exchange-correlation potential.In addition, in the standard G 0 W 0 approximation the QP orbitals are approximated by the KS orbitals.At variance with DFT, QP corrections for a given k-point require knowledge of the KS orbitals and eigenenergies at all points (k + q) in reciprocal space.In practice, codes such as Yambo 95 compute QP corrections on a regular grid and rely on interpolation schemes to obtain the full band structure along high-symmetry lines.Wannier90 supports the use of G 0 W 0 QP corrections through the general interface gw2wannier90.pydistributed with Wannier90, while dedicated tools for Quantum ESPRESSO and Yambo allow an efficient use of symmetries.Here we briefly outline the procedure for performing Wannier interpolation at the G 0 W 0 level using Quantum ESPRESSO and Yambo.The procedure starts by obtaining MLWFs at the DFT level.While Wannier90 works with uniform coarse meshes on the full BZ (FBZ), Yambo uses symmetries to compute quantities on the irreducible BZ (IBZ).In addition, the G 0 W 0 self-energy may require finer k-point grids to achieve convergence compared to those required for the charge density in DFT the Wannier interpolation itself.The k-mapper.pyutility allows the user to quickly select only the symmetryinequivalent k-points in the IBZ that belong to the grid used by Wannier90.At this point, Yambo computes the QP corrections on these selected k-points only.After that, a post-processing code of Yambo (ypp) unfolds the QP corrections onto the full BZ as required by Wannier90.Using the unfolded QP corrections, the utility gw2wannier90.pycorrects and reorders in energy both the KS eigenvalues and the input matrices.After reading these eigenvalues and matrices, Wannier90 can proceed as usual to interpolate the desired quantities such as the band structure, but now at the G 0 W 0 level.

VI. AUTOMATIC WANNIER FUNCTIONS: THE SCDM METHOD
An alternative method for generating localised Wannier functions, known as the selected columns of the density matrix (SCDM) algorithm, has been proposed by Damle, Lin and Ying. 96,97At its core the scheme exploits the information stored in the real-space representation of the single-particle density matrix, a gaugeinvariant quantity.Localisation of the resulting functions is a direct consequence of the well-known nearsightedness principle 98,99 of electronic structure in extended systems with a gapped Hamiltonian, i.e., insulators and semiconductors.1][102] Since the SCDM method does not minimise a given gaugedependent localisation measure via a minimisation procedure, it is free from any issue regarding the dependence on initial conditions, i.e., it does not require a good initial guess of localised orbitals.It also avoids other problems associated with a minimisation procedure, such as getting stuck in local minima.More generally, the localised Wannier functions provided by the SCDM method can be used as starting points for the MLWF minimisation procedure, by using them to generate the A k projection matrices needed by Wannier90.
For extended insulating systems, the density matrix is given by As shown in Sec.II, the P k are the spectral projectors associated with the crystal Hamiltonian operator H k onto the valence space S k , hence their rank is N e (number of valence electrons).Moreover, they are analytic functions of k and also manifestly gauge invariant. 103,104As mentioned above, the nearsightedness principle 99 guarantees that the columns of the kernels P k (r, r ) = r|P k |r are localised along the off-diagonal direction and therefore they may be used to construct a localised basis.If we consider a discretisation of the J Bloch states at each k on a real-space grid of N g points, we can arrange the wavefunctions into the columns of a unitary N g × J k- In this representation, it is straightforward to see that the columns of P k (r i , r j ) are projections of extremely localised functions (i.e., Dirac-delta functions localised on the grid points) onto the valence eigenspace.As a result, selecting any linearly-independent subset of J of them will yield a localised basis for the span of P (r, r ).However, randomly selecting J columns does not guarantee that a well-conditioned basis will be obtained.For instance, there could be too much overlap between the selected columns.Conceptually, the most well conditioned columns may be found via a QR factorisation with column pivoting (QRCP) applied to P (r, r ), in the form P Π = QR, with Π being a matrix permuting the columns of P , Q a unitary matrix and R an upper-triangular matrix (not to be confused with the lattice vector R, or with the matrix R (k,b) defined in Eq. ( 12), or with the shift vector of Eq. ( 53)), and where Π is chosen so that Then the J columns forming a localised basis set are chosen to be the first J of the matrix with permuted columns P Π.
The SCDM-k 97 method suggests that it is sufficient to apply the QRCP factorisation at k = 0 (Γ point) only, and use the same selection of columns at all kpoints.However, this is still often impractical since P Γ is prohibitively expensive to construct and store in memory.Therefore an alternative procedure is proposed, for which the columns can be computed via the QRCP of the (smaller) matrix Ψ † Γ instead: i.e., the same Π matrix is obtained by computing a QRCP on Ψ † only.Once the set of columns has been obtained, we need to impose the orthonormality constraint on the chosen columns without destroying their locality in real space.This can be achieved by a Löwdin orthogonalisation, similarly to Eq. ( 26).In particular, the selection of columns of Ψ Γ can be used to select the columns of all Ψ k , which in turn define the A mnk matrices needed as input by Wannier90 to start the MLWF minimisation procedure, by defining A mnk = ψ * mk (r Π(n) ), where the Π(n) is the index of the n th column of P after permutation with Π.In fact, we can write the n th column of P after permutation, P k (r, r Π(n) ), as The unitary matrix U k sought for is then constructed via Löwdin orthogonalisation We can also extend the SCDM-k method to the case where the Bloch states are represented as two-component spinor wavefunctions ψ nk (r, α), e.g., when including spin-orbit interaction in the Hamiltonian.Here, α =↑, ↓ is the spinor index.In this case, we include the spin index as well as the position index to perform QRCP.First, we define the 2N g × J matrix Ψ k Next, as in the spinless case, the QRCP of Ψ † Γ is computed, and the first J columns of the Π matrix are selected.Now, Π(n), the index of the n th column of P after permutation with Π, determines both the position index r Π(n) and the spin index α Π(n) .We define ) and perform Löwdin orthogonalisation to obtain the unitary matrix U k .
In the case of entangled bands, we need to introduce a so-called quasi-density matrix defined as where f ( nk ) ∈ [0, 1] is a generalisation of the Fermi-Dirac probability for the occupied states.Also in this case we only use the information at Γ to generate the permutation matrix.Depending on what kind of entangled manifold one is interested in, f ( ) can be modelled with various functional forms.In particular, the authors of Ref. 97 suggest the following three forms: 1. Isolated manifold, e.g., the valence bands of an insulator or a semiconductor: f ( ) is a step function, with the step inside the energy gap ∆ g = c − v , where c(v) represents the minimum (maximum) of the conduction (valence) band: Both ∆ g and v are not free parameters, as they may be obtained directly from the ab initio calculation.
2. Entangled manifold (case I), e.g., the valence bands and low-lying conduction bands in a semiconductor: f ( ) is a complementary error function: where µ is used to shift the mid-value of the complementary error function, so that states with energy equal to µ have a weight of f (µ) = 1/2.The parameter σ is used to gauge the "broadness" of the distribution function.
3. Entangled manifold (case II), e.g., the d bands in a transition metal: The procedure then follows as in the previous case, by computing a QRCP factorisation on the quasi-density matrix.It is worth to note that in the case of an entangled manifold, the SCDM method requires the selection of two real numbers: µ and σ, as well as the number of Wannier functions to disentangle J.These parameters play a crucial role in the selection of the columns of the density matrix.While the selection of these parameters requires some care, as a rule of thumb (e.g., in entangled case I) σ is of the order 2−5 eV (which is the energy range of a typical bandwidth), while µ can often be set around the Fermi energy (but the exact value depends on various factors, including the number J of bands chosen and the specific properties of the bands of interest).It is worth to mention that since the SCDM-k method is employed as an alternative way of specifying a set of initial projections and hence to compute the A k matrices in Eq. ( 26), the disentanglement procedure can be used in exactly the same way as described in Sec.II B. However, in the case of entangled bands the column selection is done on a quasi-density matrix, which implicitly defines a working subspace larger than the target subspace of dimension J.We find that for well-known systems SCDM-k is typically already capable of selecting a smooth manifold and no further subspace selection is needed.This method is now implemented as part of the pw2wannier90.xinterface code to Quantum ESPRESSO.We have decided to implement the algorithm in the interface code(s) rather than in Wannier90 itself, because the SCDM method requires knowledge of the wavefunctions ψ nk , which are only available in the ab initio code.
In Wannier90 only a single new input parameter auto projections is required.This disables the check on the number of projections specified in the input file (as we rely on SCDM to provide us with the initial guesses) and adds a new entry to the <seedname>.nnkpfile (which is read by pw2wannier90.x in order to compute the quantities required by Wannier90) that specifies the number of Wannier functions required.The remaining control parameters for the SCDM method are specified in the input file for the pw2wannier90.xcode, including whether to use the SCDM method, the functional form of the f ( ) function in Eq. ( 67) and, optionally, the values of µ and σ in the definition of f ( ).

VII. AUTOMATION AND WORKFLOWS:
AIIDA-WANNIER90 PLUGIN AiiDA 12 (Automated Interactive Infrastructure and Database for Computational Science) is an informatics infrastructure that helps researchers in managing, automating, storing and sharing their computations and results.AiiDA automatically tracks the entire provenance of every calculation to ensure full reproducibility, which is also stored in a tailored database for efficient querying of previous results.Moreover, it provides a workflow engine, allowing researchers to implement high-level workflows to automate sequences of tedious or complex calculation steps.AiiDA supports simulation codes via a plugin interface, and over 30 different plugins are available to date. 105mong these, the AiiDA-Wannier90 plugin provides support for the Wannier90 code.Users interact with the code (to submit calculations and retrieve the results) via the high-level python interface provided by Ai-iDA rather than directly creating the Wannier90 input files.AiiDA will then handle automatically the various steps involved in submitting calculations to a cluster computer, retrieving and storing the results, and parsing them into a database.Furthermore, using the Ai-iDA workflow system users can chain pre-processing and post-processing calculations automatically (e.g., the preliminary electronic structure calculation with an ab initio code).These scientific workflows, moreover, can encode in a reproducible form the scientific knowledge of expert computational researchers in the field on how to run the simulations, choose the numerical parameters and recover from potential errors.In turn, their availability reduces the training time of new researchers, eliminates sources of error and enables large-scale high-throughput simulations.
The AiiDA-Wannier90 plugin expects that each calculation takes a few well-defined input parameters.Among the most important ones, a Wannier90 calculation run via AiiDA requires that the following input nodes are provided: an input crystal structure, a node of parameters with a dictionary of input flags for Wannier90, a node with the list of kpoints, a node representing the atomic projections, and a local input folder or remote input folder node containing the necessary input files (.amn, .mmn,.nnkp,.eig,.dmn)for the Wannier90 calculation as generated by an ab initio code.
All of these parameters, with the exception of projections, are generic to AiiDA to facilitate their reuse with different simulation codes.More detailed information on all inputs can be found in the AiiDA-Wannier90 package documentation. 106fter the Wannier90 execution is completed, the AiiDA-Wannier90 plugin provides parsers that are able to detect whether the convergence was successful and retrieve key parameters including the centres of the Wannier functions and their spread, as well as the different components of the spread (Ω I , Ω D , Ω OD and Ω), and (if computed) the maximum imaginary/real ratio of the Wannier functions and the interpolated band structure.
The whole simulation is stored in the form of a graph, representing explicitly the provenance of the data generated including all inputs and outputs of the codes used in the workflow.An example of a provenance graph, automatically generated by AiiDA when running a Quantum ESPRESSO calculation followed by a Wannier90 calculation, is shown in Fig. 8.
We emphasise that the availability of a platform to run Wannier90 in a fully-automated high-throughput way via the AiiDA-Wannier90 plugin has already proved to be beneficial for the Wannier90 code itself.Indeed, it has pushed the development of additional features or improvements now part of Wannier90 v3.0, including additional output files to facilitate output parsing and improvements in some of the algorithms and their default parameters to increase robustness.

VIII. MODERN SOFTWARE ENGINEERING PRACTICES
In this section, we describe a number of modern software engineering practices that are now part of the de-FIG.8: The provenance graph automatically generated by AiiDA when running a Wannier90 calculation for a diamond crystal using Quantum ESPRESSO as the DFT code.Rectangles represent executions of calculations, ellipses represent data nodes, and diamonds are code executables.Graph edges connect calculations to their inputs and outputs.In particular, the following calculations are visible: Quantum ESPRESSO pw.x SCF (dark blue) and NSCF (green), Quantum ESPRESSO pw2wannier90.x(brown), and Wannier90 pre-processing (yellow) and minimisation run (purple).The initial diamond structure (light blue) and the final interpolated band structure (dark grey) are also highlighted.
velopment cycle of the Wannier90 code.In particular, Wannier90 includes a number of tests that are run at every commit via a continuous integration approach, as well as nightly in a dedicated test farm.Version control is handled using git and the code is hosted on the GitHub platform. 107We follow the fork and pullrequest model, in which users can duplicate (fork) the project into their own private repository, make their own changes, and make a pull request (i.e., request that their changes be incorporated back into the main repository).When a pull request is made, a series of tests are automatically performed: the test suite is run both in serial and parallel using the Travis continuous integration platform, 108 and code coverage is checked using codecov. 109If these tests are successful then the changes are reviewed by members of the Wannier90 developers group and, if the code meets the published coding guidelines, it can be merged into the development branch.
In addition, while interaction with end users happens via a mailing-list forum, discussion among developers is now tracked using GitHub issues.This facilitates the maintenance of independent conversation threads for each different code issue, new feature proposal or bug.These can easily reference code lines as well as be referenced in code commit messages.Moreover, for every new bug report a new issue is opened, and pull requests that close the issue clearly refer to it.This approach facilitates tracking back the reasoning behind the changes in case a similar problem resurfaces.
In the remainder of this section we describe more in detail some of these modern software engineering practices.

A. Code documentation (FORD)
The initial release of Wannier90 came with extensive documentation in the form of a User Guide describing the methodology, input flags to the program and format of the input and output files.This document was aimed at the end users running the software.Documentation of the code itself was done via standard code comments.In order to foster not only a community of users but also of code contributors to Wannier90, we have now created an additional documentation of the internal structure of the code.This makes the code more approachable, particularly for new contributors.To create this code documentation in a fully automated fashion, we use the FORD (FORtran Documenter) 110 documentation generator.We have chosen this over other existing documentation solutions because of FORD's specific support for Fortran.This tool parses the Fortran source, and generates a hyperlinked (HTML) index of source files, modules, procedures, types and programs defined in the code.Furthermore, it constructs graphs showing the dependencies between different modules and subroutines.Additional information can be provided in the form of special in-code comments (marked with double exclamation marks) describing in more detail variables, modules or subroutines.By tightly coupling the code to its documentation using in-code comments, the documentation maintenance efforts are greatly reduced, decreasing the risk of having outdated documentation.The compiled version of the documentation for the most recent code version is made available on the Wannier90 website. 111

B. Testing infrastructure and continuous integration
With the recent opening to the community of the Wannier90 development, it has become crucial to create a non-regression test suite to ensure that new developments do not break existing functionalities of the code.Its availability facilitates the maintenance of the code and ensures its long-term stability.
The Wannier90 test suite relies on a modified version of James Spencer's python testcode.py. 112This provides the functionality to run tests and compare selected quantities parsed from the output files against benchmarked values.
At present, the Wannier90 test suite includes over 50 tests which are run both in serial and parallel and cover over 60% of the source code (with many modules exceeding 80% coverage).The code coverage is measured with the codecov software. 109Developers are now required to add tests when adding new features to the code to ensure that their additions work as expected.This also ensures that future changes to the code will never break that functionality.Two different test approaches are implemented, serving different purposes.
First, the Wannier90 repository is now linked with the Travis continuous integration platform 108 to prevent introducing errors and bugs into the main code branch.Upon any commit to the GitHub repository, the test suite is run both in serial and in parallel.Any test failure is reported back to the GitHub webpage.Additionally, for tests run against pull requests, any failed test results in the pull request being blocked and not permitted to merge.Contributors will first need to change their code to fix the problems highlighted in the tests; pull requests are able to be merged only after all tests pass successfully.
Second, nightly automatic tests are run on a Buildbot test-farm.The test-farm compiles and runs the code with a combination of compilers and libraries (current compilers include GFortran v6.4.0 and v7.3.0,Intel Fortran Compiler v17 and v18, and PGI compiler v18.05; current MPI libraries include Open MPI v1.10.7 and v3.1.3,Intel MPI v17 and MVAPICH v2.3b).This ensures that the code runs correctly on various highperformance computer (HPC) architectures.More information on the test-farm can be found on the Wannier90 GitHub wiki website. 113n addition to these tests, we have implemented git pre-commit hooks to help keep the same code style in all source files.The current pre-commit hooks run Patrick Seewald's Fortran source code formatter fprettify 114 to remove trailing whitespaces at the end of a line and to enforce a consistent indentation style.These precommit hooks, besides validating the code, can reformat it automatically.Developers may simply run the formatter code to convert the source to a valid format.If a developer installs the pre-commit hooks, these will be run automatically before every commit.Even if this is not the case, these tests are also run on Travis; therefore, a pull request that does not conform to the standard code style cannot be merged before the style is fixed.

C. Command-line interface and dry-run
The command-line interface of the code has been improved.Just running wannier90.xwithout parameters shows a short explanation of the available command line options.In addition, a -v flag has been added to print the version of the code, as well as a new -d dry-run mode, that just parses the input file to perform all needed checks of the inputs without running the actual calculation.The latter functionality is particularly useful to be used in input validators for Wannier90 or to precalculate quantities computed by the code at the beginning of the simulation (such as nearest-neighbour shells, b-vectors or expected memory usage) and use this information to validate the run or optimise it (e.g., to decide the parallelisation strategy within automated AiiDA workflows).

D. Library mode
Wannier90 also comes with a library mode, where the core code functionality can be compiled into a library that can then be linked by external programs.This library mode is used as the default interaction protocol by some interface codes.The library mode provides only support for a subset of the full functionality, in particular at the moment it only supports serial execution.We have now added and improved support for the use of excluded bands also within the library mode.Moreover, beside supporting the generation of a statically-linked library, we now also support the generation of dynamically-linked versions.Finally, we have added a minimal test code, run together with all other tests in the test suite, that serves both to verify that the library functionality works as expected, and as an example of the interface of the library mode.

IX. CONCLUSIONS AND OUTLOOK
Wannier90 v2.0 was released in October 2013 with a small update for v2.1 in January 2017.The results and developments of the past years, presented in this work, were released in Wannier90 v3.0 in February 2019.Thanks to the transition of Wannier90 to a community code, Wannier90 includes now a large number of new functionalities and improvements that make it very robust, efficient and rich with features.These include the implementation of new methods for the calculation of WFs and for the generation of the initial projections; parallelisation and optimisations; interfaces with new codes, methods and infrastructures; new user functionality; improved documentation; and various bug fixes.The effect of enlarging the community of developers is not only visible in the large number of contributions to the code, but also in the modern software engineering practices that we have put in place, that help improve the robustness and reliability of the code and facilitate its maintenance by the core Wannier90 developers group and its long-term sustainability.
The next major improvement that we are planning is the implementation of a more robust and general library mode.The features that we envision are: (1) the possibility to call the code from C or Fortran codes without the need to store files but by passing all variables from memory; (2) a more general library interface that is easily extensible in the future when new functionality is added; and (3) the possibility to run Wannier90 from a parallel MPI code, both by running each instance in parallel and by allowing massively-parallel codes to call, in parallel, various instances of Wannier90 on various structures or with different parameters.This improvement will demand a significant restructuring of most of the codebase and requires a good design of the new interface.Currently we are drafting the new library interface, by collecting feedback and use cases from the various contributors and users of the code, to ensure that the new library mode can be beneficial to all different possible use cases.

k
that has the smallest mismatch with the subspaces S (i−1) k+b at the neighbouring k-points obtained in the previous iteration.This amounts to solving b w b P ) where G mnk are the matrix elements of the original gradient in Eq.(11) (see also Ref. 14), and R (k,b) mn and T (k,b) mn

Fig. 1 -
(a)-(b)-(c)-(d) we show a comparison of the centre and symmetries of a MLWF, SAWF, SLWF and SLWF+C; the individual spreads and total spreads-for all four valence states-are reported in the

FIG. 3 :
FIG. 3: Plots of the time per single minimisation iteration as a function of the number of cores N c .(a) Weak scaling of the implementation, where the number of k-points per process is fixed to one, i.e., N c = N .The time only increases by a factor 1.3 (1.8) for the disentanglement (wannierisation) parts of the code, when going from N c = 4 to N c = 144.(b) Strong scaling of the algorithm for a fixed number of k-points N = 64.The time per iteration with one single CPU (serial) is reported in the figure.

FIG. 5 :
FIG.5: Comparison between the bands obtained using the earlier interpolation procedure (blue lines), those obtained using the (current) modified approach of Eq. (49) (orange lines), and the ab initio bands (black crosses).(a) Linear chain of carbon atoms, with 12 atoms per unit cell (separated by a distance of 1.3 Å along the z direction) and Γ-point sampling.36Wannier functions have been computed starting from projections over p x and p y orbitals on carbon atoms and s-orbitals midbond between them.A frozen window up to the Fermi energy (set to zero in the plot) has been considered, while the disentanglement window included all states up to ∼ 14 eV above the Fermi level.(b) Bulk silicon, with the BZ sampled on an unconverged 3 × 3 × 3 grid of k-points.

FIG. 6 :
FIG. 6: (a) Intrinsic spin Hall conductivity σ spin,z xy of fcc Pt, plotted as a function of the shift in Fermi energy relative to its self-consistent value.(b) Band structure colour-coded as a function of the value of the spin Berry curvature Ω spin,z nk,xy of Eq.(58b).Note that the value plotted in the graph is r(Ω spin,z nk,xy ), with the function r(x) defined by r(x) = x/10 |x| < 10 log 10 (|x|)sign(x) |x| ≥ 10 , with sign being the sign function and log 10 the base-10 logarithm.
Table below it. .