Many-body perturbation theory calculations using the yambo code

yambo is an open source project aimed at studying excited state properties of condensed matter systems from first principles using many-body methods. As input, yambo requires ground state electronic structure data as computed by density functional theory codes such as Quantum ESPRESSO and Abinit. yambo’s capabilities include the calculation of linear response quantities (both independent-particle and including electron–hole interactions), quasi-particle corrections based on the GW formalism, optical absorption, and other spectroscopic quantities. Here we describe recent developments ranging from the inclusion of important but oft-neglected physical effects such as electron–phonon interactions to the implementation of a real-time propagation scheme for simulating linear and non-linear optical properties. Improvements to numerical algorithms and the user interface are outlined. Particular emphasis is given to the new and efficient parallel structure that makes it possible to exploit modern high performance computing architectures. Finally, we demonstrate the possibility to automate workflows by interfacing with the yambopy and AiiDA software tools.

Computational materials science based on first principles atomistic methods plays a key role in the discovery, characterization, and engineering of novel and complex materials.While density functional theory (DFT) is the established workhorse for ground state properties of a wide range of systems ranging from atoms and molecules to solids and nanostructures containing thousands of atoms, there is an increasing demand for an accurate description of excited state properties in even the most challenging materials.Within the framework of solid state physics, the Green's function formulation of many-body perturbation theory (MBPT) -specifically the GW approach to quasiparticles (QP) for charged excitations and the Bethe-Salpeter equation (BSE) for neutral excitations -offers a quantitatively accurate solution.The GW-BSE approach has been implemented in a number of free and commercially available codes (for a recent review, see Ref. 1) and applied to a wide range of materials.Nonetheless, the complexity and relatively poor scaling of the GW-BSE method, and often of its implementation, constitutes a barrier towards its application to realistic systems of large size or to physical phenomena that lie outside the scope of most state-of-the-art approaches.
Tackling these challenges in a software environment requires a fourfold strategy: • First, the description of underlying physical phenomena must be regularly advanced, both in terms of extensions of existing tools and by devising new methods.Oft-neglected terms such as electronphonon and spin-orbit coupling play a crucial role in several physical phenomena.Examples are the finite temperature properties (dictated by the electron-phonon interaction) or the study of novel materials like topological insulators, perovskites and layered transition metal dichalcogenides.In addition to extensions of existing tools yambo implements brand new methods like real-time tools to tackle the calculation of nonlinear optical properties.
• Second, algorithms must be refined and augmented in order to improve technical precision and numerical efficiency.This includes tricks for accelerating convergence as well as implementing alternatives to standard GW-BSE approximations such as plasmon-pole models of electronic screening and the Tamm-Dancoff approximation to exciton coupling.
• Third, codes must be designed to follow current trends in high-performance computing towards massively parallel, distributed memory architectures, while allowing for flexibility and control over tasks, memory, and disk usage in order to keep simulations efficient.
• Fourth, as the codes themselves become more complex and harder to maintain, modern software practices must be adopted.These include a wide range of aspects including improved documentation, use of modules and standard libraries, and automation of tasks for convergence, benchmarking and reproducibility.
In this paper we describe how the yambo project has embraced this broad strategy.yambo is an open-source code based on many-body perturbation theory for computing electronic and optical excitations within a high performance environment (Fig. 1).Since its first public release in 2008, the project has evolved in a dramatic fashion and its development and user base has greatly expanded.Within the following ten years, the original paper was cited more than 500 times -considerable for a pure MBPT code -and the code has been used in many high impact studies spanning a wide range of novel materials and exciting technologies.The highest cited applications cover graphene derivatives [2][3][4][5], metal-halide perovskites [6,7], van der Waals bonded layered compounds [8][9][10][11], Li-air and K-ion batteries [12,13], and TiO 2 photocatalytic surfaces [14,15], to select just a handful.yambo has moreover helped advance fundamental understanding of physical phenomena such as excitonic Bose-Einstein condensation [16], excitonic insulators [17], the influence of zero point motion [18], charge transfer excitations [19], etc.A full list of publications can be found through our website [20], http: //www.yambo-code.org.
Part of yambo's popularity and success may be ascribed to the code's user-friendliness: thanks to an intelligent command line interface, a full GW-BSE calculation on an unfamiliar material can in principle be carried out launching a single command.Extensive user documentation is provided on our website [20].This includes descriptions of the fundamental theory, command line interface, and input variables, and provides a wide range of tutorials directed at explaining different functionalities of the code across a number of systems with different dimensionalities.Support is given by the developers through a forum.In addition to the website [20,21], the theory and use of yambo has been disseminated through a num-ber of international schools and workshops including a dedicated biennial CECAM event run by the developers and aimed at showcasing the latest developments.
The first major release, version 3.2.0, was described in detail in Marini et.al. [22] (henceforth referred to as CPC2009), and therefore the basic methodology, formalism, and code structure will not be repeated here.Instead we describe the main additions made to the code up to and including version 4.4.0.Much development of the code has been driven by its status as a key ab initio spectroscopy code of the European Theoretical Spectroscopy Facility (ETSF) [23] and as a flagship code of the MaX European Centre of Excellence for Materials Design at the Exascale [24] and of the Nanoscience Foundries and Fine Analysis -Europe user infrastructure [25].
With regard to the broad strategy outlined above, yambo now includes the possibility to compute the following state-of-the-art physical phenomena discussed later: • Electron-phonon and exciton-phonon interaction: influence of temperature on electronic structure and optical spectra (section VI); • Real-time propagation of the density matrix (section VII A) and Bloch states for nonlinear optics (section VII B); • Spin-orbit coupling and Kerr effect within a fully noncollinear BSE framework (section V B).
Numerous methodological advances have been incorporated in the code in the last decade.We will discuss in more detail the following key features: • Alternative approaches for computing dipole matrix elements and commutators (section III A); • Incorporation of empty state terminators in the linear response (section III C) and self-energy (section IV C); • Full frequency GW, including computation of lifetimes (section IV A); • Double grid approach and Krylov algorithm for improved BSE efficiency (section V A).
Regarding parallelism, Section VIII outlines the code's strategies for exploiting massively-parallel architectures through the use of a highly user-tunable mixed MPI-OpenMP coding paradigm and the use, where possible, of external parallel libraries for linear algebra and I/O tasks.As different quantities (i.e.linear response, GW, BSE) computed by yambo have very different behaviours in terms of performance, scalability, and memory distribution, it is important to outline the different approaches -ultimately controlled by the user -adopted by the code in each case.
Last, yambo has been almost completely rewritten since the first major release in order to follow modern software design practices such as modularity, reuse of routines and libraries, and so on, and the project as a whole has been expanded to include rigorous self-testing and automation frameworks.Here we highlight a few key features: • Test-suite and benchmarking scripts (section IX C); • The yambopy python scripts for automation and analysis (section IX A); • Plugin for workflow management via AiiDA (section IX B); • Wide use of standard libraries (section II A).
• Maintenance and distribution through GitHub.
In the following section we recall the structure of the yambo software package and outline new features in its installation environment and interface with external codes and libraries.Sections III-VII outline new features implemented relating to improved algorithms and new capabilities.Section VIII discusses the new parallelism paradigm and performance issues.Sec.IX introduces new scripting and automation tools.Following some general conclusions, various technical information is presented in the appendices along with a useful glossary of acronyms.

II. TECHNICAL OVERVIEW
The yambo package is released under the GNU GPL (v2) license and is hosted on GitHub in a set of public and private repositories at https://github.com/yambo-code.Snapshots of major releases are also available for direct download through the yambo website [20].
The general structure of the yambo software is laid out in Fig. 2. The software consists of three kinds of executable that generally reflect the order in which the code is run.First, the output from standard DFT codes are converted into NetCDF 'database' files (ns.db1 and ns.wf) within a SAVE directory using the a2y and p2y routines (see Section II C below).Second, the main calculations ('runlevels') of linear response, GW, and BSE are performed using the standard executable yambo or the project-specific executables.These include yambo rt for real-time propagation (Section VII A), yambo nl for nonlinear optics (Section VII B), and yambo ph for electronphonon simulations (Section VI).Running these codes results in the reading and writing of further databases (SAVE/ndb.*),as well as generation of text files for reading or plotting.Third, post-processing routines (ypp and runlevel specific ypp nl or ypp rt executables) are used to manipulate and analyze the computed quantities stored in the databases.In special cases ypp, ypp nl or ypp rt executables are needed as pre-processing tools to further manipulate the core databases (i.e. to remove some symmetries before real time simulations) or to create new databases (i.e. an ndb containing a mapping between core databases on two different k-grids), before actually running the main calculation.
A. Installation & projects yambo is compiled using the standard autotools procedure: ./configure;make all will generate the main executables listed in Fig. 2. Since the first release the configure script has been wholly upgraded to reflect the widespread availability of high performance software libraries and to aid portability across a wider range of system architectures.
By running ./configure;make, the list of possible executables is returned While yambo and ypp are the main code components, a series of projects appear in the form of yambo PJ/ypp PJ with P J being the specific project identifier (ph,rt,nl,kerr).These projects correspond to pre-processor flags that, during the compilation, activate lines of code and procedures that are projectspecific.In yambo several different codes coexist in the same source.

B. Configuration
In many cases configure will manage to detect the compilation environment and external libraries automatically.
For more control, a flexible list of options is available (see ./configure --help).
A wide range of optional features can be activated via --enable-FEATURE[=ARG] flags, e.g.

./configure --enable-open-mp
including options controlling serial/parallel linear algebra, timing/memory profiling, type of fast Fourier transform (FFT) library, etc. External libraries can be linked to by specifying either the installation directory including the "libs" and "include" folders, --with-libname-path=<path> or the "libs" and "include" paths directly --with-libname-libdir=<path> --with-libname-includedir=<path> or finally the libraries and the include command --with-libname-libs=<libs> --with-libname-incs=<include command> This is an important improvement for allowing installation on machines with non-standard system directories.
Choice of compilers and preprocessors can be overridden via the environmental variables FC, CPP etc. Finally, the generated config/setup file can be tweaked by hand prior to compilation.
yambo can make use of several external libraries for improving performance and portability (see Table I).In addition to standard MPI (openmpi, Intel MPI, etc.) and OpenMP for parallel computation, these include standard scientific computation libraries such as BLAS and LAPACK (including the Intel MKL and IBM ESSL), scalable versions of these (BLACS, ScaLAPACK; --enable-par-linalg), as well as advanced parallel numerical libraries (SLEPc, PETSc; --enable-slepc-linalg).
Use of the latter in yambo is discussed in detail in Section VIII F. Heavy use is made of fast Fourier transforms (FFTs).yambo supports many FFT implementations: Goedecker (--enable-internal-fftsg), FFTW (internal default) and 3D or standard FFT implementation of Quantum ESPRESSO (--enable-3D-fft or --enable-internal-fftqe) can be compiled while MKL and ESSL can be externally linked.Regarding internal I/O, linking to NetCDF or HDF5 format libraries is a requirement.The exchangecorrelation functional library libxc is also required.Interfacing with the yambopy and AiiDA platforms is explained thoroughly in Section IX.Libraries related to porting data from DFT codes are discussed in the following section.

External Libraries
An important feature of the new configuration procedure in yambo is that all required libraries can be automatically downloaded, configured and compiled at the compilation time.
Indeed, if configure does not find a required library (dependency), it will automatically download and com-  pile it.A useful option is the --with-extlibs-path=<full_path> where one can provide a path of choice where yambo will install all the automatically downloaded libraries, once and for all.The content of the folder is never erased.In subsequent compilation the library will be automatically re-used just specifying the same option.

C. Interfaces with DFT codes
yambo is interfaced with two widely used planewave first-principles codes: pwscf from the Quantum ESPRESSO (QE) distribution [26,27] and Abinit [28][29][30].The two interfaces have been introduced in Ref. [22] (sections 5.1 and 5.3).Both work with norm conserving pseudo-potentials and import Kohn-Sham (KS) eigenenergies nk and eigen-functions ψ nk as well as information needed to compute the non-local part of the pseudopotential V nl (x, x ).Since the publication of Ref. [22] both interfaces have been largely improved and extended.Two are the most relevant changes.All interfaces are now able to deal with both collinear and non-collinear spin systems.All interfaces take advantage of the XC library [31,32], thus a very broad class of functionals is supported.A more detailed summary of the changes follows.
1. Interface with Quantum ESPRESSO p2y (pwscf-2-yambo) is the yambo interface with Quantum ESPRESSO.Its development line followed two routes, one related to the developments of QE I/O and one aimed at adding new features to p2y.
A wider class of pseudo-potentials (psps) is now supported, including UPF version 2, and multi-projector psps -i.e. with more then one projector per angular momentum channel.In the same direction the XC library [31,32] allows for the support of most of the LDA and GGA functionals as a starting point for the MBPT (quasiparticle or response function) calculations.In addition, hybrid functionals, with fractions of exchange and screened exchange interaction, are also supported within p2y.To keep compatibility with all versions of QE within a user-friendly approach, p2y has now an automatic detection of the I/O format used in the ground-state calculation and is able to read different xml data-file formats (qexml and qexsd, in the QE language), also supporting the more recent HDF5 binary files.
Spin is now fully supported both in collinear and noncollinear frameworks.For example, the use of magnetic symmetries allows to take advantage of composite symmetries, i.e. which contain time-reversal, in systems which are not invariant under pure time-reversal.Work is in progress to extend the support of ultra-soft pseudopotentials (USPP).
Other important changes were carried out to optimize the interface, first of all with an improved parallelization (implemented over the writing of wavefunction fragments).Moreover the Kleinman-Bylander (KB) form factors are now converted in a yambo-like database, while the calculation of the commutator [r, V nl ], which was previously done at the p2y level, is now more efficiently done by yambo while computing the dipoles.

Interface with Abinit
a2y and e2y are the Abinit-2-yambo interfaces.The original a2y implementation reads data in Fortran binary format.e2y was developed later and is based on the ETSF-IO [33] and NetCDF [34,35] libraries.Both interfaces are based on the Abinit KSS file and were developed following the evolution of Abinit.However, since the support to the KSS file was dropped by the Abinit team, the development and maintenance of interfaces based on it became difficult.As an example, the support for multi-projector pseudo-potentials was first released via a patch for the Abinit code, which allows the printing of the relevant data into the Abinit KSS file.[138] As a consequence, the development of the KSSbased interfaces was also dropped by the yambo team.The old a2y implementation works up to Abinit version 7, while e2y is supported up to the very recent Abinit 8 releases.
Starting with yambo 4.4, we will release a new version of the a2y interface, which is based on the direct reading of the Abinit wave-function files (WFK files) written in NetCDF format.A preliminary version of e2y based onto the WFK file was also released with yambo 4.0.However, since the support to the ETSF-IO library is not developed anymore, the WFK based e2y interface was never finalized.The new strategy (i) avoids the need for the KSS file, (ii) is numerically more efficient and (iii) reduces the I/O, since wave functions are stored on the smaller k-centred spheres in reciprocal space (as opposed to the KSS file which relied on a larger gamma-centred sphere).Finally, since WFK files are fully supported by the Abinit team, the new interface will be compatible with all recent Abinit developments (also including multi-projectors pseudo-potentials) and naturally portable to work with future Abinit releases.ypp is the yambo postprocessing and preprocessing tool.It has several capabilities which can be used to prepare yambo simulations (preprocessing) or subsequently analyse (postprocessing) the outcome.
As one of the preprocessing options, ypp can generate random grids of k-points to be used as input for a DFT code to compute the corresponding KS energies.The same ypp can then generate an auxiliary database with a map linking the KS energies on the random grid to the uniform grid used to compute, for example, spectral properties.The approach is useful to speed up convergence as discussed in Sec.V A 1. Another preprocessing option is the removal of a specific set of symmetries and thus the expansion of the wave-functions from the IBZ associated to the full set of symmetries to the resulting IBZ.This is needed to perform real-time simulations as described in Sec.VII.Finally preprocessing can also be used to map DFT calculations with and without spin-orbit coupling (SOC) to compute absorption spectra with SOC corrections included in a perturbative way as described in the supplemental material of Ref. [36].
Most of the postprocessing features involve data analysis.ypp can prepare readable ascii files to plot several single-particle properties such as wave-functions, charge density, density of states (DOS), magnetization, current and band structures.In particular, it can be used to obtain the QP-DOS and to interpolate QP-energies to plot the resulting band-structure along high-symmetry paths.A mixed feature (i.e. which can be used both for prepro-cessing and for postprocessing) is the ability of ypp to manipulate QP-databases (ndb.QP).Indeed, this is useful both for QP plots or for using ndb.QP files as input in the BSE calculations.Finally, it can be used as a tool to analyse the excitonic wave-function.As examples of postprocessing, we discuss in detail (i) how to plot the QP band structure starting from calculations on a regular grid in Sec.IV D and (ii) how to plot the excitonic wave-function in Sec.V C.

E. Usage
yambo relies on a powerful and user friendly command line interface for generating and modifying input files as well as for launching the executables.The basic functionality is unchanged from that described in CPC2009; however, some flags have been changed since the initial release.Several new options have been added to aid usage or debugging on parallel clusters or cross-compiled architectures.For instance, yambo -M and yambo -N switch off the MPI and OpenMP functionalities, respectively, yambo -Q stops the text editor from launching, and yambo -W <opt> places an internal wall clock limit on the runtime.Launching yambo -H shows the fully updated list of command options: see Table II.

III. LINEAR RESPONSE
In the independent particle (IP) approximation, the density-density response function can be written as: where n, m indexes represent band indexes (which also include the spin index in case of spin collinear calculations and which refer to spinors in case of non-collinear calculations), f nk and nk are the occupations and the energies of the KS states, f s = 1 for spin-polarized calculations, f s = 2 otherwise.In practice, the sum in Eq. ( 1) is split into two terms as described in Appendix B. The matrix elements have been already introduced in Ref. [22] and constitute one of the core quantities computed by the yambo code.Their evaluation is done via the Fourier transform of the wave-function product in real space, ψ * nk (r)ψ mk−q (r), and has been strongly optimized being one of the most common operations performed by yambo (see discussion in sec.VIII C).

A. Dipole matrix elements
Despite the computational cost, the numerical algorithm to compute the terms in Eq. ( 2) is straightforward.Since absorption is defined as the macroscopic average of the density-density response function, χ(q → 0), the knowledge of ρ nmk (q → 0, 0) is also needed.To this end, the dipole matrix elements r nmk = nk|r|mk are commonly computed [37,38] within periodic boundary conditions (PBC) using the relation The direct evaluation of Eq. (3) (G-space v approach) is quite demanding due to the [r, V nl ] term, and is evaluated from the KB form factors loaded by the interfaces, see also Sec.II C.This implementation has been strongly optimized and extended to account for projectors with angular momentum l > 2.
We have also made available alternative strategies for computing the dipoles.The shifted grids approach is based on the idea of numerically evaluating ρ nmk (q , 0) for a very small q = |q |.Thus the wave-function at k and the wave-functions at k − q are needed.Since the q → 0 limit may be direction dependent, this is done in practice by means of wave-functions computed on four different grids in k-space, i.e. a starting k-grid plus three grids with k + q e i slightly shifted along the three Cartesian directions e x , e y , e z .Such approach is computationally more efficient, although it requires to generate a larger set of wave-functions.However, there exists a random phase associated to the wave-functions on each of the four k-grids, since they are obtained by independent diagonalizations of the KS Hamiltonian.Because of this, shifted grids dipoles have inconsistent phases among different directions and it is not possible to use them when the dipole matrix elements are needed (instead of their square modulus only) as for example in the evaluation of the Kerr effect (see Sec. V B 1) or for non-linear optics (see Sec. VII B).
The G-space v approach assumes that the only nonlocal terms in H are the kinetic energy and the pseudopotentials.There are however cases, for example when the Hamiltonian contains non-local hybrid functionals, Hubbard U terms, or nonlocal self-energies, in which the evaluation of the commutator may become very cumbersome.To solve this issue one could in principle use the shifted grids approach.However, also this approach may become impractical because of the calculation of wavefunctions on the shifted grids.
For those cases we have implemented in yambo two alternative strategies, one for extended and one for isolated systems.For extended systems the Covariant approach exploits the definition of the position operator in k space: r = i∂ k .The dipole matrix elements are then evaluated as finite differences between the k-points of a single regular grid.A five-point midpoint formula is used, with a truncation error O(∆k 4 ).The shifted grids and the Covariant approach are very similar, however in the latter the arbitrary phase of the wave-functions at different k-points is correctly accounted for.To this aim, i∂ k is implemented as a covariant derivative which cancels the relative phase factor (see Appendix D for details).For these reasons the Covariant approach overcomes the limitations of the shifted grids approach.The main drawbacks of the Covariant approach is that the numerical value of the dipoles needs to be converged against the size of the k grid and the present implementation does not work for metals.However, in practice the convergence of dipole matrix elements is usually faster than that of the absorption spectrum.For finite systems, finally, the dipole matrix elements can be directly evaluated in real space (R-space x approach).
We underline that in the case of a local Hamiltonian all approaches are equivalent.The desired strategy can be selected via the input variable: (G-space v being the default value).

B. Coulomb interaction
The Coulomb interaction enters in many sections of the yambo code, such as linear response, self-energy, and BSE kernel calculation.In reciprocal space, the bare Coulomb interaction for bulk systems is defined as v(q + G) = 4π/|q + G| 2 .For the calculation of quantities requiring integration over transferred momenta in the Brillouin zone (BZ), such as the self-energy, the integrals are evaluated by summations over regular q-grids.In order to remove divergencies in systems of reduced dimensionality, i.e. in the presence of a 2D or 1D sampling of k-points, or to speed up the convergences in 3D systems, yambo offers the possibility to evaluate Coulomb integrals by using the random integration method (RIM), which consists of evaluating these integrals by Monte Carlo sampling (as already discussed in detail in Sec.3.1 of Ref. [22]), dividing the full BZ in small regions around each k-point of the chosen uniform grid.
In order to avoid spurious interaction between replicas when dealing with low-dimensional materials such as clusters, slabs, or wires, yambo can also use Coulomb cutoff truncation techniques.These consists of truncating the Coulomb interaction beyond a certain region (depending on the chosen geometry): Different geometrical choices are available.Spherical and cylindrical shapes, suitable to treat 0D and 1D systems, respectively, have been already described in details in Ref. [39].In addition, a box-like cutoff obtained by performing a numerical Fourier Transform of the real space expression is available for 0D systems.By defining only one or two sides of the box, it is possible to treat 2D or 1D systems within the same numerical approach.It is important to stress that, as the construction of such potential requires integration over the BZ, the RIM method discussed above must be activated.Finally, a Wigner-Seitz truncation scheme, similar to the one discussed in Ref. [40] is also available.In this scheme the Coulomb interaction is truncated at the edge of the Wigner-Seitz super-cell compatible with the kpoint sampling.This truncated Coulomb potential turns out to be suitable for finite systems as well as for 1D and 2D systems, provided that the supercell size, determined by the adopted k-point sampling, is large enough to get converged results [41].
C. Sum-over-states terminators in IP Linear Response The independent particle polarizability χ 0 GG (q, ω), Eq. ( 1), and the correlation part of the GW self-energy Σ c (ω), Eq. ( 7) Sect.IV A, are evaluated through sumover-states (SOS) expressions obtained by applying an energy cutoff to the infinite sum over virtual states.These expressions are, however, slowly convergent and, especially for large systems, require the inclusion of a large number of empty states (N b ).This condition makes GW calculations computationally demanding, both in terms of time-to-solution and memory requirements.In order to overcome this limitation, a number of approaches have been proposed to reduce [42][43][44][45] or remove [46][47][48] sum over states; among them, we have implemented in yambo the extrapolar correction scheme proposed by Bruneval and Gonze (BG) [42].
This scheme, here referred as X-terminator, permits to accelerate GW convergence by reducing of a sensible amount the number of virtual orbitals necessary to calculate both polarizability and self-energy.In this procedure extra terms, whose calculation imply a small computational overhead, are introduced to correct both polarizability and self-energy by approximating the effect of the states not explicitly taken into account.The method consists in replacing the energies of empty states that are above a certain threshold, and that are not explicitly treated, by a single adjustable parameter defined as extrapolar energy.When the method of terminators is applied, the independent-particle polarizability can be written as [42]: where the first term on the r.h.s. is truncated at the N b state (in general N b N b ) and the second term depends on the extrapolar energy for the polarizability ¯ χ0 .The explicit expression for ∆χ GG (q, ω, ¯ χ0 ) is provided in App. C.
In the present implementation of yambo, the input parameter governing the use of the terminator corrections on the response function (X-terminator) is XTermKind= "none" # [X] X terminator ("none","BG") (default: "none").When the variable is set to none (default option), the X-terminator is not applied.On the contrary when XTermKind assumes the value BG, the extrapolar corrective term is calculated.The extrapolar energy ¯ χ0 , see Eq. C3, is defined by the input variable (default: 1.5 Ha)

XTermEn= 1.5 Ha # [X] X terminator energy
The value means ¯ χ0 = N b k + 1.5 Ha, with N b k the highest energy state included in the calculation.
For demonstration purposes, in Fig. 3 we report the calculated QP corrections for the valence band maximum (VBM) and the conduction band minimum (CBM) of a bulk Si described in a supercell (36  states).Results are obtained by increasing the number of bands explicitly included in the calculation of the response function χ and by imposing a very high number of bands in the self-energy, that is therefore converged.Empty circles connected with solid lines denote the results obtained for the VBM and CBM states without applying any correction.Improvements induced by the use of the X-terminator are depicted by solid circles connected with solid lines that have been obtained imposing XTermEn=1.5Ha.We can observe that the X-terminator leads to a relevant reduction in the number of bands necessary to converge the polarizability and thus the GW corrections.

IV. QUASI-PARTICLE CORRECTIONS
Accurate quasi-particle energies can be obtained by calculating self-energy corrections to KS energies [49].In general, the non-local, non-Hermitian and frequency dependent electronic self-energy operator can be expressed as the sum of a bare, energy independent exchange term and a screened, dynamic correlation term: In this section we describe features implemented in yambo aimed at improving the accuracy of GW calculations by going beyond the commonly used plasmon-pole approximation [50] for the dielectric matrix and in speeding up calculations by reducing the number of empty states needed to get converged results.GW energies on top of KS eigenvalues are commonly calculated by considering one-shot corrections using the G 0 W 0 approximation.Nevertheless in yambo is also possible to perform partial self consistent calculations (evGW), where the energies of the Green's function, and the polarizability are iterated until self-consistency is reached, while wave functions are kept frozen.This approach generally reduces the starting point dependence and it has been shown to provide reliable results for molecular systems [51,52], wide band-gap materials [53] and perovskites [54,55].In the following we will just refer in general to the GW approach and discuss how the GW self-energy is computed.
A. Full frequency GW Within the GW approximation, the matrix elements of the correlation self-energy over the KS basis are expressed as: I is linked to the self-energy spectral function.From a computational point of view the definition of I is really critical as, in Eq.7, it is connected to the self-energy via a complex Hilbert transformation.In yambo I is defined as In Eq. ( 8), W δ is the delta-like part of the screened interaction.This is defined by In Eq.9 W GG (q, ω) is the screened Coulomb potential defined as Note that, in the case of systems with both spatial and time reversal symmetry, W GG (q, ω) = W G G (q, ω) and W GG (q, ω) reduces to the imaginary part of W .
In order to take into account the frequency dependence of the self-energy, two different strategies are implemented in yambo.As already described in Ref. [22], it is possible to adopt the plasmon-pole approximation (PPA) in order to model the dynamic screening matrix.This approximation essentially assumes that all the spectral weight of the dielectric function is concentrated at a plasmon excitation.Among different models present in the literature yambo implements the Godby-Needs construction [56] where the parameters of the model are chosen in such a way that −1 GG (q, ω) is reproduced at two different frequencies: the static limit ω = 0 and another imaginary frequency ω = iω p given in the input file by PPAPntXp (default: 1 Ha).Quasi-particle energy levels calculated within this approximation have been shown to agree to a large extent with numerical integration methods for materials with different characteristics including semiconductors and metal-oxides [50,57].Moreover, it has the great advantage to avoid the computation of the inverse of the dielectric matrix for many frequency points and to make the frequency integral of Eq. ( 7) expressible in an analytic form.Nevertheless the assumption made for the PPA breaks down in certain situations as when dealing with metals [58][59][60] or interfaces [61] and the frequency integral need to be solved numerically.In yambo the integral is solved on the real-axis which implies the knowledge of the full frequency dependence of W GG (q, ω).In practice, first the inverse dielectric function −1 GG (q, ω) is evaluated for a number of frequencies set by the variable ETStpsXd, and uniformly distributed in the energy range given by the maximum electron-hole pairs included in the response function defined in Eq. (1).Next, the summation over G and G is performed computing I nn k mq (ω ) defined in Eq. ( 8), and finally the correlation part of the self-energy is computed via a Hilbert transform defined in Eq. (7).
In this scheme, the evaluation of Eq. ( 10) is the most time consuming step due the computation of the inverse dielectric matrix for a large number of frequencies (order of 100) in order to have converged results.Nevertheless as the calculations for each frequency are independent from each other, parallelization over frequencies provides a linear speedup.
Quasi-particle energies calculated by using the realaxis method have been demonstrated to provide the same level of accuracy of other beyond plasmon-pole techniques such as the contour deformation scheme [62].

B. Electron-mediated lifetimes
The ability of yambo to calculate the real-axis GW self-energy allows direct access to the quasiparticle electron-mediated lifetimes.Indeed if we define Γ e−e nk (ω) ≡ ( nk|Σ c (ω)|nk ), from Eq.7 it is easy to see that yambo can evaluate the quasi-particle lifetimes τ nk (ω), proportional to the inverse of Γ e−e nk (ω), either in the onthe-mass-shell approximation (OMS) or in the full GW approximation.The difference between the two is the inclusion of the renormalization factors, Z nk .For details on the theory see, for example, Ref.59 and references therein.
In the OMS we have that Γ e−e nk OM S = Γ e−e nk ( nk ).The e-e lifetimes of bulk copper are shown in Fig. 4 using several flavours of GW approximations [59].
An important numerical property of the electronmediated lifetimes calculation is that they depend only on the k-grid.Indeed, as evident from Eq.11 the band summations are limited by the two theta functions that confine the scattering events in reduced regions of the BZ.This is the mechanism that, in simple metals, leads to the well known quadratic scaling of Γ e−e nk OMS near the Fermi level, as a function of distance of nk from the Fermi level itself.
More physical insight in the electronic lifetimes will be given in Sec.VI B where the phonon-mediated case will be described.In Sec.III C we have discussed the X-terminator procedure.A similar scheme can be adopted to study the correlation part of the GW self-energy, as from Eq. ( 16) of Ref. [42].Also in this case the approximation implies the introduction of an extra term that takes into account contributions arising from states not explicitly included in the calculation.The input parameter governing the use of the terminator corrections on the self-energy (Gterminator) is GTermKind= "none" # GW terminator ("none","BG") (default: "none").When the variable is set to none, the G-terminator is not applied.On the contrary when it assumes the value BG, the extrapolar corrective term is calculated.The extrapolar energy for the self-energy is defined by the tunable input variable Also in this case, the value is referenced to the highest band included in the calculation.In Fig. 5 we reconsider the system discussed in the example of Section III C, Fig. 3.In this case, however, we study the convergence of the self-energy by exploiting the G-terminator procedure.Empty circles connected with solid lines show the usual GW convergence for the VBM and CBM states (no corrections applied).Calculations have been performed by imposing a high number of bands in the polarisability (that is therefore converged) and by increasing the number of bands included in the self-energy.We set GTermEn=1. 5 Ha, that represents the best choice for this system.Improvements provided by the use of the G-terminator procedure are represented by solid circles connected with solid lines; it is evident that the application of this scheme accelerates the convergence by leading to a significant reduction in the number of states necessary to converge the GW self-energy and therefore the calculated QP correction.In order to elucidate the role played by the extrapolar parameter, we report in Fig. 6 a convergence study of the VBM GW correction for a TiO 2 nanowire (NW).The black line is obtained without applying any correction.Coloured lines are instead obtained by applying both Xand G-terminators, moving the extrapolar energy from 1.0 to 3.0 Ha.Results are reported as a function of the number of states explicitly included in the calculation of both polarisability and self-energy.As pointed out in Ref. [42], the extrapolar energy for the self-energy can be safely taken equal to the extrapolar energy introduced in Eq. (C3) for the polarizability; for this reason we impose XTermEn = GTermEn.Consistently with the study of Fig. 6, the convergence of the VBM without terminators is very slow and requires the inclusion of a large number of bands to be achieved; this condition makes the calculation cumbersome also on modern HPC-machines.When the terminator technique is adopted to correct both polarisability and self-energy, the convergence becomes much faster; especially for some values of the extrapolar energy (about 1.5 Ha), we observe a significant reduction in the number of bands necessary to converge the calculation, with a strong reduction of both the time-to-solution and the allocated memory.Noticeably, the correction is almost independent on the selected extrapolar energy (terminators are convergence accelerators and the extrapolar correction vanished in the limit of infinite bands included); this parameter therefore influence the number of bands necessary to converge the calculation (and thus the computational cost of the simulation) but not the final result.

D. Interpolation of the QP band structure
In DFT the eigenvalues of the Hamiltonian at every k-point can be obtained by the knowledge of the groundstate charge density, allowing one to perform non-selfconsistent calculations on an arbitrary set of k-points.Instead at the HF or GW level, to obtain QP corrections for a given k-point it is necessary to know the KS wave-functions and eigen-energies on all (k + q)-points, having chosen a regular grid of q-points as convergence parameter.In practice yambo computes QP corrections on a regular grid.As a consequence the evaluation of band structures along high-symmetry lines can be computationally very demanding.
A simple strategy which is implemented in ypp is to interpolate the QP corrections from such regular grid to the desired high symmetry lines.The approach implemented is based on a smooth Fourier interpolation [63], which is particularly efficient for 3D grids.The interpolation scheme can also take, as additional input, the KS energies computed along the high symmetry lines to better deal with bands crossing and regions with non analytic behaviour, such as cusp-like features.
A more involved strategy is instead based on the Wannier interpolation scheme as implemented in the wannier90 [64] and WanT [65] codes, where electronic properties computed on a coarse reciprocal-space mesh can be used to interpolate onto much finer meshes at low cost [66].In the context of GW calculations, the Wannier interpolation scheme can be used to interpolate the QP energies and other band structure properties [65] (e.g.effective masses) from QP corrections computed only on selected k-points.Wannier interpolation of GW band structures requires two sets of inputs: on one side quantities computed at the DFT level such as KS eigenvalues, overlaps between different KS states, and orbital and spin projections of KS states, that are imported from Quantum ESPRESSO, and on the other side the QP corrections computed by yambo.In fact, wannier90 works with uniform coarse meshes on the whole BZ, while yambo uses symmetries to compute quantities on the IBZ.In addition, converging the GW self-energy typically requires denser meshes with respect to what is needed for the charge density or Wannier interpolation.To address this issue, ypp allows one to unfold the QP corrections from the IBZ to the whole BZ, as required by wannier90 for interpolation purposes [139].Finally the wannier90 code yields a GW-corrected Wannier Hamiltonian and interpolates the GW band structure.A similar procedure is implemented in WanT.
For example, in monolayer WS 2 a grid of 48x48x1 (or denser) is required to converge the GW self-energy.In this case, the band structure can be obtained either by explicitly computing the QP corrections on all k-points of the 48x48x1 grid, or it can be Wannier-interpolated from the QP corrections computed onto coarse subgrids, such as a 6x6x1 corresponding to 7 symmetry-nonequivalent k-points only in the IBZ (see Fig. 7).The second approach requires substantial less CPU time.

V. OPTICAL ABSORPTION
The solution of the Bethe-Salpeter equation on top of DFT-GW is the state-of-the-art first principles approach to calculate neutral excitations in solid-state systems [38], with successful applications to molecules [51,67], surfaces [68,69], two-dimensional materials [70,71], and nanostructures [72,73], including biomolecules in complex environments [74,75].The BSE is a Dyson equation for the four point response function L. It can be rewritten as an eigenproblem for a two-particle effective Hamiltonian H 2p in the basis of electron and hole pairs |eh .H 2p is the sum of an independent-particle Hamiltonian H IP -i.e. the e-h energy differences corresponding to the independent-particle four-point response function L 0 -and the exchange V and direct contributions W accounting for the e-h interaction.The original implementation of the BSE in yambo (See Sec.2.2 and 3.2 of CPC2009) has been extended in the past decade to (i) improve its numerical efficiency (Sec.V A)-allowing one to treat systems with a large number of electronhole pairs (i.e.above 10 5 ) -and (ii) to capture physical effects (Sec.V B) that were originally neglected-e.g.allowing for the description of the Kerr effect in magnetic materials (Sec.V B 1).Finally, a range of tools have been developed to analyse the exciton localization both in real and reciprocal space (Sec.V C).

A. Numerical efficiency
The computational cost of the BSE grows as a power of the number of electron-hole pairs.As this number can be as large as 10 5 − 10 6 , it is crucial to devise numerically efficient algorithms for the calculation of the V and W matrix elements and the solution of the BSE.The massive parallelization and memory distribution which contributes in making these calculations possible for very large systems are discussed in Sec.VIII.Here we discuss the use of the double grid for the sampling of the BZ [76], where the BSE is solved (Sec.V A 1)-which aims at reducing the number of degrees of freedom involvedand the use of Lanczos-based algorithms together with the interface to the SLEPC library (Scalable Library for Eigenvalue Problem Computations) [77] (Sec.V A 2)which aims at avoiding the full diagonalization of H 2p .

Double-grid and the inversion solver
The BSE implementation in yambo is based on an expansion of the relevant quantities in the basis of electronhole states.This expansion often requires a very dense k-point sampling of the Brillouin zone (BZ).Typically, the number of electron-hole states used in the expansion can be relatively small if one is only interested in the absorption spectra, but the number of k-points can easily reach several thousands.Different approaches have been proposed in the literature to solve this problem.A common approach is the use of arbitrarily shifted kpoint grids, that often yield sufficient sampling of the BZ while keeping the number of k-points manageable.Such a shifted grid, indeed, does not use the symmetries of the BZ and guarantees a maximum number of nonequivalent k-points thereby accelerating spectrum convergence.However, it may induce artificial splitting of normally degenerate states, thus producing artifacts in the spectrum.In yambo we introduced a strategy to solve the BSE equation that alleviates the need for dense k-point grids and does not break the BZ symmetries.Such approach takes into account the fast-changing independent-particle contribution [76,78].Indeed the independent-particle term of the BSE, L 0 , is evaluated on a very dense k-grid and then the BSE is solved on a coarse k-grid.This means in practice that L 0 remains defined on the coarse grid, but each matrix element of L 0 contains the sum of the nearby poles on the dense grid.The dense grid can be generated by means of DFT and read using ypp -m, that creates a mapping between the coarse and the dense grid.Then BSE is solved by inversion setting BSSmod='i'.A similar approach can be also used when computing the response function in G space, by replacing each transition in the F nmk (q, ω) term in eq.(B1) with a sum over the transitions in the dense grid.

Spectra and exciton wavefunctions via Krylov subspace methods
Solving the BSE implies the solution of an eigenvalues problem for the two-particle Hamiltonian that in the e-h basis can result in a matrix as large as 10 6 × 10 6 .The standard dense matrix diagonalization algorithm is available in yambo through the interface with the LA-PACK and the ScaLAPACK libraries [79] (Sec.VIII).Alternatively, when only the spectrum is required, yambo provides the Haydock-Lanczos solver [80].The latter, originally developed for the Hermitian case only (see Section 3.2 of CPC2009) -by neglecting the coupling between e-h at positive and negative energies within the Tamm-Dancoff approximation-has been extended to treat the full non-Hermitian two-particle Hamiltonian [81,82].Cases in which considering the full non-Hermitian two-particle Hamiltonian turns out to be important have been discussed in Refs.[81,83,84].
More recently, yambo has been interfaced with the SLEPC library [77] which uses objects and methods from the PETSC library [85] to implement Krylov subspace algorithms to iteratively solve eigenvalue problems.These are used in yambo to obtain selected eigenpairs of the excitonic Hamiltonian.This allows the user to select a fixed number of excitonic states to be explicitly calculated thus avoiding the full dense diagonalization and saving a great amount of computational time and memory.Two options are available for the SLEPC solver.The first, which is the default, uses the PETSC matrix-vector multiplication scheme; it is faster but duplicates the BSE matrix in memory when using MPI.The second, which is activated by the logical BSSSlepcShell in the input file, uses the internal yambo subroutines (the same also used for the Haydock solver); it is slower but distributes the BSE matrix among the MPI tasks.To select the part of the spectra of interest, the library allows one to use different extraction methods controlled by the variable BSSSlepcExtraction.The standard method, ritz, obtains the lowest lying eigenpairs, while the harmonic method obtains the eigenpairs closest to a defined energy.The SLEPC solver makes it possible to obtain and plot exciton wave functions (ypp -e w) in large systems where the full diagonalization might be computationally too demanding.For example, the spectrum and the wave function of the lowest-lying exciton in monolayer hBN are shown in Fig. 8.The BSE eigenmodes were extracted only for the two lowest-lying excitonic states, and a √ 3 × √ 3 × 1 supercell was used in the calculation (the SLEPC spectrum is shown in blue).The full Haydock solution is displayed with a red line for comparison.

Spin-orbit coupling and Kerr
With the implementation and release of the full support for non-collinear systems it is now possible to account for the effects of spin-orbit coupling (SOC) on the optical properties at the BSE level.A detailed description of the implementation and a comparison with other simplified approaches (like the perturbative SOC) can be found in Ref. [87].Since the BSE is written in transition space, the definition of the excitonic matrix is not differ-ent from the collinear cases of both unpolarized and spincollinear systems.For a given number of bands, the main difference is that in the unpolarized case the matrix can be blocked in two matrices of size N × N , describing singlet and triplet excitations.Already in the spin-collinear case this is not possible and the matrix has twice the size 2N × 2N .In the non-collinear case, the z-component of the spin operator, S z , is not a good quantum number and the size of the matrix becomes 4N × 4N .Since SOC is usually a small perturbation, this means in practice that in the non-collinear case there are peaks which are shifted in energy as compared to the collinear cases (∆S z = 0 transitions) plus the possible appearance of very low intensity peaks corresponding to spin flip transitions.
The ability of the BSE matrix to capture the interplay between absorption and spin, makes the approach suitable to describe magneto-optical effects.Indeed, starting from the BSE matrix, the off-diagonal matrix elements of the macroscopic dielectric tensor ε ij (ω) can be derived, thus describing the magneto-optical Kerr effect [88].Notice that in the definition of ε ij (ω) the product of dipoles x * nmk y nmk enters, thus requiring approaches where the relative phases between different dipoles are correctly accounted for.To this end the yambo kerr executable must be used, activating the EvalKerr flag in the input file.The correct off-diagonal matrix elements of the dielectric tensor can be obtained in the velocity gauge (see sec.V B 2), and only for systems with a gap and Chern number equal zero in the length gauge [88].

Fractional occupations, gauges and more
Other extensions have been made available.The implementation has been modified so that the excitonic matrix is now Hermitian (or pseudo-Hermitian if coupling is included) also in the presence of fractional occupations in the ground state.This is done in practice by introducing a slightly modified four-point response function L which is divided by the square root of the occupations as discussed in Eqs.(14-16) of Ref. [89].The resulting excitonic Hamiltonian has the form with l = {nkk} a super-index in the transition space, with the square root of the occupation factors appearing on the left and on the right of the BSE kernel v−W .This has been used to compute absorption of systems out of equilibrium, but it is also important to describe metallic systems like graphene or carbon nanotubes where excitonic effects can be non-negligible due to the reduced dimensionality of the system.Further, it is now possible to compute the dielectric tensor starting from the different response functions, as described in Ref. [90].Indeed, starting from the excitonic propagator L, it is possible to construct the densitydensity response function χ ρ,ρ , or the dipole-dipole response function χ d,d at q = 0 (length gauge), and the current-current response function χ j,j (velocity gauge).This can be controlled by setting Gauge="length" or Gauge="velocity" in the input file (the length gauge is the default).In case the velocity gauge is chosen the conductivity sum rule is imposed unless the flag NoCondSumRule is activated in the input file.At zero momentum, changing response function is equivalent to change gauge.At finite q instead the use of χ j,j allows for the calculation of both the longitudinal and the transverse components of the dielectric function.The finite-q BSE has been implemented and it is currently under testing before its final release.
Another extension is connected to the output of a BSE run, which also generates a file with the joint densityof-states, at the IP level, and the excitonic density of states, at the BSE level.These can be used for example to visualize dark or very small intensity peaks as shown in Fig. 9.

C. Analysis of excitonic wavefunctions
Once a BSE calculation is performed using an algorithm which explicitly computes the excitonic eigenvectors A λ cvk , several properties of the excitons can be analyzed as shown in Sec.V B 2 (see Fig. 8).First of all, the excitonic eigenvalues E λ can be sorted and plotted.The so-called amplitudes and weights can also be calculated to inspect which are the main contributions in terms of single quasi-particles to a given excitonic state.The weights are defined as the squared modulus of the excitonic wavefunction |A λ eh | 2 (by default only electron-hole pairs that contribute to the exciton more than 5% are considered; the threshold can be tuned by modifying the input file 'MinWeight').The amplitudes are defined as cvk |A λ cvk | 2 δ( ck − vk − hω).Moreover the excitonic wavefunction written in real-space Ψ λ (r e , r h ) = cvk A λ cvk ψ * vk (r h )ψ ck (r e ) can be computed.Ψ λ (r e , r h ) is a two-body quantity or jointcorrelation function.Fixing the position of the hole r h = rh , |Ψ λ (r h , r)| 2 provides the conditional probability of finding the electron somewhere in space.This quantity is clearly nonperiodic and its spatial decay can change from material to material, marking the distinction between Frenkel and Wannier excitons.As an alternative it is also possible to plot |Ψ λ (r, r)| 2 which is instead Blochlike.
In Fig. 10 we focus on two interlayer excitonic states of bilayer hexagonal boron nitride (λ = 3 and λ = 8).We first plot |Ψ λ (r h , r)| 2 (top frames), then we proceed to extract more information by analyzing the phase of Ψ λ (r h , r) (lower frames).By comparing the phase for two positions of the hole related by inversion symmetry (r h and I(r h )), we see that the first exciton (Fig. 10a) is even with respect to inversion symmetry, while the second one is odd (Fig. 10a).Symmetry analysis of the wave function permits us to conclude that Ψ 3 (r h , r e ) and Ψ 8 (r h , r e ) transform as the A 1g and A 2u representations of point group D 3h of the lattice, respectively.

VI. ELECTRON-PHONON INTERACTION
The electron-phonon (EP) interaction is related to many materials properties [92] such as the critical temperature of superconductors, the electronic band gap and electronic carrier mobility of semiconductors [93], the temperature dependence of the optical spectra, the Kohn anomalies in metals [94], and the relaxation rates of carriers [95,96].yambo ph calculates fully ab initio the EP coupling effects on the electronic states, on the excitonic states energies, and on the optical spectra.The approach used is the many-body formulation which is the dynamical extension of the static theory of EP coupling originally proposed by Heine, Allen, and Cardona (HAC) [97,98].In this framework, the quasi-particle (QP) energies are the complex poles of the Green's function written in terms of the EP self-energy, Σ el-ph nk (ω, T ), composed of two terms, the Fan, Σ FAN nk (ω, T ), and Debye-Waller Σ DW nk (T ) contributions [99,100] (for the complete derivation see, for example, Ref. 101, 102): Similarly In Eqs.13-14 N qλ (T ) is the Bose function distribution of the phonon mode (q, λ) at temperature T .The ingredients of Σ el−ph nk (ω, T ), apart from the electronic states, are the phonon frequencies ω qλ and the EP matrix elements: g q,λ nn k (first order derivative of the self-consistent and screened ionic potential) and Λ qλ nn k (a complicated expression written in terms of the first order derivative [101,102]).
These quantities are currently calculated with Quantum ESPRESSO within the framework of DFPT [103].They are read and opportunely stored by the post-processing tool ypp ph and then reloaded by yambo ph.The procedure is analogous to the one followed by Abinit [104].
The HAC approach corresponds to the limit lim ω qλ →0 Σ FAN nk ( nk , T ).In the HAC the Fan correction reduces to a static self-energy [101].
In the next subsections we will give more details about how yambo ph has been used to calculate the temperature dependence of the band structure (Sec.VI A) and of the optical spectrum (Sec.VI C).Finally, in Sec.VI D we will describe the way the q → 0 divergence of EP matrix elements has been addressed.

A. Temperature-dependent electronic structure
The HAC approach is based on the static Rayleigh-Schrödinger perturbation theory, allowing one to calculate the temperature-dependent correction of the bare electronic energies, owing to the phonon field perturbation.In the QP approximation, the bare energy is instead renormalized because of the virtual scatterings described by the real part of the self-energy and it also acquires a finite lifetime due to the imaginary part of the self-energy.
The eigenvalues E nk (T ) are then complex and depend on the temperature.The more the QP approximation is valid the more the renormalization factors Z nk are close to 1, analogously to the GW method.
If the QP approximation holds, the spectral function In case of strong EP interaction it has been proven that Ref. [101] and with the permission of American Physical Society from Ref. [105]).
the spectral function spans a wide energy range [105,106] and the QP approximation is no longer valid.
Figure 11 shows the spectral functions (SFs) of transpolyethylene and trans-polyacetylene, calculated at 0K.In Fig. 11(a) multiple structures appear in the SFs.SFs are then spread over a large energy range.In Fig. 11(b) a two-dimensional plot of the SFs reveals a completely different picture with respect to the original electronic band structures.Since SFs are featured by a multiplicity of structures, each carries a fraction of the electronic charge Z nk depriving the dominant peak of its weight.A crucial aspect is that some SFs overlap, like in the case of trans-polyacetylene, and in the end it is impossible to associate a well defined energy to the electron and to state which band it belongs to.

B. Phonon-mediated electronic lifetimes
By following the same strategy used in the electronic case, the phonon-mediated contribution to the electronic lifetimes can be easily calculated from Eq.13.Indeed if we define Γ e−p nk (ω, T ) ≡ Σ FAN nk (ω, T ) it is easy to see that In perfect analogy with the electronic case, within the OMS, we have that Γ e−p nk (T ) OMS = Γ e−p nk ( nk , T ).Like in the electronic case the most important numerical property of the lifetimes calculation is that they depend only on the q-grid.
It is very instructive to compare Γ e−p nk (T ) OMS and the Γ e−e nk (T ) OMS for a paradigmatic material like bulk silicon.This is done in Fig. 12 in the zero temperature limit.The very different nature of the two lifetimes appear clearly.By simple energy conservation arguments, the electronic linewidth are zero by definition in the two energy regions VBM − E g and CBM + E g , with E g the electronic gap (in silicon E g ≈1.1 eV) and CBM ( VBM ) the conduction band minimum (valence band maximum).In these energy regions the e-p contribution is stronger and the corresponding linewidths are larger then the e-e ones.The quadratic energy dependence of the e-e linewidths inverts this trend at higher energies.
While the e-e contributions grow quadratic in the energy dependence, the e-p ones follow the electronic density of states profile.This property is confirmed by Fig. 13(c) and remains accurate when the temperature increases.Figure 13 shows the EP correction in singlelayer MoS 2 of the valence and conduction band states for several temperatures, together with the widths and the density of states (DOS).In general, the EP correction tends to close the bandgap.This is visible in Fig. 13 (panels a and b), the conduction state energy decreases with temperature, while the valence one increases.Only in a few cases we find an opening of the bandgap when temperature increases [108].

C. Finite Temperature Bethe-Salpeter Equation
Once the temperature-dependent corrections to electron and hole states have been calculated, they constitute the key ingredients of the finite temperature excitonic eigenvalue equation.Since the electron and hole eigenvalues are complex numbers the resulting excitonic eigenvalues have a real part (the exciton binding energy) and an imaginary part (the exciton lifetime).The dielectric function then depends explicitly on T , 2 (ω, , where S λ (T ) are the excitonic optical strengths and E λ (T ) are the com- plex excitonic energies.As shown in Fig. 14 for a single layer MoS 2 , the main effect of the temperature on the optical spectra is the renormalization of the energy transitions along with a broadening of the spectrum lineshape related to the finite lifetime of the underlying excitonic states which increases with T [109,110].This picture is also valid when T → 0 because of the zeropoint vibrations.A remarkable effect of the excitonphonon coupling has been observed in hexagonal BN.It has been proven that the optical brightness turns out to be strongly temperature-dependent such as to induce bright to dark (and viceversa) transitions [111].
D. Double grid in the electron-phonon coupling: a way to deal with the q → 0 divergence A technically relevant issue is the slowing down of energy correction convergence at some high-symmetry points.Some EP matrix elements might be zero by symmetry and are not representative of the discretization of an integral.yambo deals with this issue by computing the energy shift corrections on a random q-wavevector grid of transferred momenta.The numerical evaluation of the EP self-energy on a dense q-grid is a formidable task (see Eq. ( 5) of Ref. [101]).The reason is that such dense grids of transferred momenta are inevitably connected with the use of equally dense grids of k-points.The solution implemented in yambo is a double grid approach: matrix elements are calculated for a fixed k-point grid while energies are integrated using a larger grid of random-points in the whole BZ.
To speed-up the convergence with the number of random points, the BZ is divided in small spherical regions R q centered around each q point of the regular grid and the integral is calculated using a numerical Monte-Carlo integration technique.Furthermore, the divergence at q → 0 of the |g qλ n nk | 2 matrix elements is explicitly taken into account for the 3D case for which the q integration compensates the q −2 divergence.In the case of 2D materials the divergence of the EP matrix elements would not be lifted by the surface element 2πq.In principle, an analytic functional form for the EP matrix elements can be envisaged as reported by Ref. [113].

VII. REAL TIME PROPAGATION
A new feature in yambo is the numerical integration of a time-dependent (TD) equation of motion (EOM), able to describe the evolution of the electronic system under the action of an external laser pulse.Two different schemes are available.In one case, the density matrix of the system, ρ(r, r , t), is propagated in time, as described in sec.VII A. In the second case, the periodic part of the Bloch states u nk (r, t) are considered, as described in sec.VII B. In both schemes the propagation is done in the space of the equilibrium KS wave-functions where all quantities are projected and different levels of approximations are available.

A. Time-dependent Screened Exchange
The EOM for the density matrix projected in the space of the single particle wave-functions, ρ, is derived from non-equilibrium (NEQ) many-body perturbation theory and reads Here we underline quantities which are vectors in the transition space (and we will underline twice matricies in transition space).h rt contains the equilibrium eigenvalues nk plus the variation of the self-energy ∆Σ Hxc [ρ]; nk can be the KS energies or the QP corrected energies.QP corrections can be loaded either from a previous calculation or by adding a scissor operator from input.For ∆Σ Hxc [ρ] different levels of approximation can be chosen, setting the HXC kind input variable.U ext = −eE •r represents the external potential written in the length gauge; shape, polarization, intensity (and eventually frequency) of the field E can be selected in input.r is the position operator.The coupling to the external field is exact up to first order.From the knowledge of the density matrix, the first order polarization P(t) = −e i =jk r ijk ρ ijk is computed at each time step.The spectrum of the system can then be obtained by the Fourier transform of the polarization, which can be done as a post-processing step.Absorption is thus obtained via the dipole-dipole response function (equivalent to the length gauge in lin- ear response).A delta-like external field is convenient to obtain the spectrum for all frequencies.
The implementation of the external field in the velocity gauge (equivalent to the velocity gauge in linear response) is currently under testing before its final release.Equation ( 16) represents a set of equations, one for each k-point in the BZ, coupled via the functional dependency of ∆Σ Hxc on the whole ρ.Different options of the selfenergy are available, by setting the HXC kind variable to: IP, Hartree, DFT, Fock, Hartree+Fock, SEX, or Hartree+SEX.For HXC kind=IP one has ∆Σ Hxc = 0.For local HXC kind like Hartree and DFT, ∆Σ Hxc can be computed on the fly from the real-space density n(r, t) and the approach is in practice equivalent to TD-DFT to linear order in the field.For non-local HXC kind like Hartree+Fock (HF) and Hartree+SEX (HSEX), the self-energy is written in the form ∆Σ Hxc [ρ] = K Hxc • ρ with K Hxc computed before starting the real-time propagation.The calculation of K Hxc can be either done in a preliminary run, with the matrix-elements stored on disk and then reloaded, or on-the-fly before starting the real-time propagation.In case the HSEX approximation is used the resulting spectrum is equivalent to a BSE calculation in the limit of small perturbations, as shown both analytically and numerically in Ref. 114.The comparison between the two approaches is reported in Fig. 15.
To run simulations and compute the spectra as described in the present section the yambo rt and ypp rt executables need to be used.

Double-grid in real time
As for the BSE case (see Section V A 1), also the realtime propagation can be done taking advantage of a double-grid in k-space.Similarly to the BSE, the matrix elements, i.e. the dipoles and K Hxc , are computed using the wave-functions on the coarse grid, while energies and occupations are defined on the fine grid.At variance with the BSE implementation however the matrix elements on the coarse grid are then extrapolated onto | (cgs x 10 the fine grid with a nearest-neighbour technique since ρ is then defined and propagated on the double grid.This is different in spirit from the inversion solver.It would be equivalent to define the excitonic matrix (or L propagator) on the double grid.Instead, in the double-grid approach within BSE the excitonic matrix remains defined on the coarse grid, while the fine grid enters only in the definition of L 0 , as described in sec.V A 1.

B. Nonlinear optics
Alternative to the time-evolution of the density matrix, it is possible to perform the time-evolution of the Schrödinger equation for the periodic part of the Bloch states projected in the eigenstates of the equilibrium Hamiltonian: |v mk .Here we briefly present the actual implementation in yambo and how it can be used to obtain non-linear optics response, for more details see Ref. [117].The EOM for the valence band states reads: where the effective Hamiltonian h rt k has been introduced in sec.VII A and ρ(t) is constructed starting from |v mk .The second term in Eq. ( 17), E • ∂k , describes the coupling with the external field E in the dipole approximation.As we imposed Born-von Kármán periodic boundary conditions, the coupling takes the form of a k-derivative operator ∂k .The tilde indicates that the operator is "gauge covariant" and guarantees that the solutions of Eq. ( 17) are invariant under unitary rotations among occupied states at k (see Ref. [118] for more details).
Propagating the single particle wave-functions presents advantages and disadvantages with respect to the density matrix.The major advantage is that the coupling of electrons with the external field, within the length gauge, is now written in terms of Berry's phase, which is exact to all orders also in extended systems [119].Moreover, from the evolution of |v mk in Eq. ( 17) also the timedependent polarization [120] P along the lattice vector a can be computed in terms of the Berry phase: where S(k, k + q) = v nk |v mk+q is the overlap matrix between the valence states, Ω c is the unit cell volume, f is the spin degeneracy, N k is the number of k points along the polarization direction, and q = 2π/(N k a).The resulting polarization can be expanded in a power series of the field E j as: where the coefficients χ (i) are functions of the frequency of the perturbing fields and of the outgoing polarization.From the Fourier analysis of the P i it is possible to extract all the non-linear coefficients (see Ref. [117] for more details).As in Sec.VII A, the level of approximation of the so-calculated susceptibilities depends on the effective Hamiltonian that appears in the right hand side of Eq. (17).Different choices are possible, namely, the independent particle approximation (IPA), the timedependent Hartree, the real-time Bethe-Salpeter equation (RT-BSE) framework, or TD-DFT.This approach has been successfully applied to study second-, thirdharmonic generation and two-photon absorption in bulk materials and nanostructures [117,121,122].As before, in the limit of small perturbation Eq. ( 17) reproduces the optical absorption calculated with the standard GW + BSE approach [123].
Since the exact polarization is available, the approach based on Eq. ( 17) not only reduces to TD-DFT when local functionals are considered, it also includes density polarization functional theory (DPFT) as a special case.Thus specific approximations for both the microscopic and the macroscopic part of ∆Σ Hxc are available, which, within TD-DPFT, are expressed as functionals of ρ for the microscopic part (v Hxc [ρ]) and of P for the macroscopic part (E Hxc ) as discussed in Refs.[124,125].
In non-linear optics simulations the system is excited with a laser at given frequency ω and dephasing term λ deph is added to the Hamiltonian.After a time T λ deph , sufficient to damp out the eigenmodes of the system, the signal is analyzed to extract the non-linear response functions, see Fig. 16 and Ref. [117].To run simulations and compute the spectra as described in the present section the yambo nl and ypp nl executables need to be used.

VIII. PARALLELISM AND PERFORMANCE
During the last years, the evolution of supercomputing technologies pushed towards the adoption of architectural solutions based on many-core platforms.This was due mainly to energetic constraints that did not permit to increase the single core performance, imposing the need for alternative solutions.Two main paradigms arose: on one side, the emergence of hybrid architectures exploiting GPU accelerators.On the other side, homogeneous architectures increased the performance per node, by increasing the number of cores, starting the many-core era.In the latter approach, the main advantage is the possibility to rely on well-known and largely adopted software paradigms, in contrast to the GPU programming model, where the porting required to adopt ad-hoc languages such CUDA or OpenCL, having a deep impact on the sustainability of the software development.However, even if the many-core paradigm can appear easier to adopt, getting a satisfactory performance on such architectures may be very challenging.In fact, in order to exploit as much as possible the features of a many-core node, it is mandatory to use both a shared memory and a distributed memory approach.The first is able to leverage the single node power with an efficient usage of the available memory, while the second one can be used to scale-up on the nodes available on a cluster facility or a multi-purpose processor.
From yambo version 4.0, a deep refactoring of the parallel structure has been put in place in order to take full advantage of nodes with many-cores and a limited amount of memory per core.In particular, a MPI multilevel (up to 3-5 according to the runlevel) approach has been adopted, together with an OpenMP coarse grain implementation.An example of the measured parallel performance reaching up to the use of 1000 Intel KNL nodes in a single run is shown in Fig. 18.We refer to the performance page [126] on the yambo website for a more complete description and up-to-date data.
This novel multi level distribution of the cores is schematically shown in Fig. 17 in the case of 4 cores.Instead of using the core as elemental parallel unit, yambo adopts the concept of computing units (CU).A CU is composed of a varying number of cores.The work-load distribution is done among CU's rather than cores.Each core workload is decided by the workload of the CU that encloses it.To be more clear let's take the simple case of 4 cores shown in Fig. 17.In this case we have three possible levels of grouping with respectively 4, 2 and 1 CU's.The core workload is assigned to the CU's rather FIG.17: yambo parallel structure in the specific case of 4 cores.Each core is member, at the same time, of three different groups composed of different number of cores.These groups, represented with gray boxes, are the actual computing units of yambo.Therefore each core workload is dictated by the computing units directives and changes depending on the group the core belongs to.
that to the single core.This reduced enormously the inter-core communications and allows the distribution of a very large number of cores.
Technical details of the implemented parallelism will be discussed in the next sections.

A. General structure
The multilevel MPI structure of yambo is reflected in the input file where, for each computational kernel (runlevel here) there are two related input variables: the first one, runlevel ROLEs, sets on which parameters the user wants to distribute the MPI workload, while the second runlevel CPU defines how many MPI tasks will be associated to such parameters.As an example X finite q ROLEs="q.k.c.v.g" X finite q CPU="2.3.5.2.1" is a possible input for running on 2 × 3 × 5 × 2 × 1 = 60 MPI tasks, with the q-points distributed on 2 tasks, the k-points on 3, the conduction and valence bands on 5 and 2 respectively.One more level of parallelism (g) is present, acting and distributing the response matrix over space degrees of freedom (plane waves).The order of the parameters in the runlevel ROLEs variables is irrelevant.On top of that, more input variables are available to handle parallel linear algebra (e.g via scalapack and blacs libs) and to select the number of OpenMP threads (on a runlevel basis if needed).Such hierarchical organization makes it possible to have MPI communication only within the subgroups, thus avoiding, whenever possible, to deal with all2all communications.
If the user does not wish to deal with the complexity of such multi-level parallelization a default layout is provided.However, the fine-tuning of the MPI/OpenMP related variables can (further) reduce the load imbalance, improve the memory distribution or decrease the total time-to-solution.For this reason, in the Sec.VIII C- VIII F specific suggestions for best parallel exploitation of each runlevel are provided.

B. I/O: parallel and serial
yambo stores binary data using the netcdf library.Depending on the configuration flags, data can be stored in classic netcdf format (file size limit of 2 GB, activated with --enable-netcdf-classic), 64-bit netcdf format (no file size limit, default) or HDF5 format (requires at least netcdf v4 linked to HDF5, activated with --enable-netcdf-hdf5).Since version 4.4, in case the HDF5 format is specified, parallel I/O can also be activated (--enable-hdf5-par-io) to store the response function in G space or the kernel of the BSE.For the G-space response function, parallel I/O avoids extra communication among the MPI tasks and also reduces the amount of memory allocated per task.For the BSE case, parallel I/O makes it possible to load the kernel computed from a previous calculation using a different parallel scheme and/or a different number of MPI tasks.Indeed the calculation of the kernel matrix elements is very time consuming but has a very efficient memory and load distribution.In contrast, the solution of the BSE eigenproblem is less time consuming but also less efficiently distributed.It is thus suggested to first compute the kernel matrix on a large number of cores and then to solve the BSE on fewer tasks as a follow-up step.

C. Linear Response
According to Eq. ( 1), the computation of the response function in G-space can be distributed over 5 different levels: q-points, k-points, conduction and valence bands (c, v), and G-vectors.The distribution over the q-points would be the most natural choice, since the response functions at different q are completely independent.However, it may lead to significant memory duplication (multiple sets of wavefunctions are managed at the same time) and load imbalance since the number of possible transitions varies from point to point.It is usually not recommended unless a large number of qpoints has to be considered.Instead, it is usually more effective to parallelize over k, and bands (c,v) indexes.This requires slightly more MPI communication (due to a MPI reduction at the end of the calculation) but is very efficient in terms of speedup (almost linear) and in terms of memory distribution (especially for c, v, since usually wavefunctions are the leading memory contribution).While transitions are evenly distributed (balanced workload), yambo sorts and groups transitions which are (almost) degenerate in energy (see App. B) thus, in practice, small imbalances can still be present.
Further, the distribution over the G-vectors is the most internal one, and requires more communication among MPI tasks (unless parallel I/O is activated).It can be useful for systems with a very large number of G-vectors (such as low dimensional systems or surfaces) to distribute the response function and ease memory usage.Finally, the computation of χ 0 can also benefit from OpenMP parallelism.The distribution over threads has been implemented at the same level of the MPI parallelism (i.e. over transitions), resulting in a very good scaling while reducing the memory usage per core.Note however that some memory duplication (a M (G, G ) workspace matrix per thread) has to be paid to make the implementation more efficient.The OpenMP parallelism of χ 0 (including dipoles) is governed by the input variables: both defaults being set to 0, i.e. controlled as usual by the OMP NUM THREADS environment variable.Once the independent-particle response function χ 0 GG (q, ω) has been computed, a Dyson equation is solved for each frequency to construct the RPA response function.This can be done either by distributing over different frequencies or by using parallel linear algebra (see Sec. VIII F).D. Self-Energy: HF-Exchange and GW Following Eq. ( 7), the HF and GW correlation selfenergies can be parallelized with MPI over three different layers: q-points (q); bands in the Green's function (b) [see m in Eq.( 15)]; and quasiparticle corrections Σ nn k to be computed (qp).OpenMP parallelism here acts at the lowest level, dealing with sums over G and G , i.e. spatial degrees of freedom.The following variables can be modified to fine-tune the self-energy parallelization (here shown for 60 MPI tasks and 8 OpenMP threads): SE ROLEs="q.qp.b"SE CPU="1.4.15"SE Threads=8 Since the sum over q-points in Eq. ( 7) is over the whole BZ, the q-parallelism for the self-energy may be even more unbalanced than that for the response function (here every q-point needs to be expanded to account by symmetry for its whole star) and is recommended only when a large number of q-points is available.Instead, the parallelism over bands b tends to distribute evenly both memory and computation, at the price of a mild MPI communication, thereby resulting a natural choice (when enough bands are included in the calculation).qp-parallelism distributes the computation but tends to replicate memory (wavefunctions are not further distributed).In general, the OpenMP parallelism is extremely efficient for the GW self-energy without having to pay for any extra memory workspace.

E. Bethe Salpeter Equation
In the solution of the BSE most of the CPU time is spent in building up the excitonic matrix, or more precisely, its kernel.The input flags which control the parallel distribution of the workload needed to build the kernel are eh.k.t.To distribute the workload, first all possible transitions cvk, i.e. from valence band v to conduction band c at the k-point k, are split into transition groups (TGs).Then for each pair of TGs a block of the BSE matrix is created B ij = {T i → T j }.Defined N t the total number of TG, then the BSE matrix will be divided into N b = N 2 t blocks.In the Hermitian case (as in the Tamm-Dancoff approximation), only N b = N t (N t + 1)/2 blocks will be computed.The parallelization flags for the BSE define both N t and N b , and how the resulting blocks are distributed among the MPI tasks.Indeed N t = n eh n ibz k where n eh is the number of MPI tasks assigned to the eh field and n ibz k is the number of k-points in the IBZ.This means that even setting eh=1 a minimum number of k-based TGs (k-TGs) is always created, which is eventually split into subgroups when n eh > 1.It is important to note that k-TGs are defined using the k-sampling in the the IBZ, while the BSE matrix is defined in the whole BZ, resulting in groups of non-uniform size.However, the symmetry operations relating matrix elements within a given k-TGs are taken into account by yambo.As a consequence, in systems where n ibz k = n bz k the use of n eh > 1 is discouraged, as the splitting of k-TGs over different MPItasks implies that symmetry-related matrix-elements can be assigned to different MPI-tasks and need to be recomputed.Once N t and hence N b are defined, transitions and blocks are distributed among the MPI tasks as explained in the following example.
Suppose we have a system with 18 k-points in the IBZ, and we adopt the parallelization strategy 2.3.3 for eh.k.t in the case the BSE is Hermitian.Then N t = 2×18 = 36 and N b = 666.Thus, in our example we are using in total 2×3×3 MPI-tasks.The eh.k fields define 2×3 = 6 MPIgroups which split the 36 transition-groups.Thus, each MPI-group has to deal with 6 transition-groups.For each transition group T n , there are N t blocks B n ij = T i → T j for the Hermitian case, where the T n appears as initial (T i = T n ) or final (T j = T n ) state.Most of the blocks belongs to two transition-groups (only the blocks B ii belong to one transition-group).This means that each MPIgroup builds half of the B ij (6 * 35/2) plus all B ii (6) blocks.These 111 blocks are divided according to the t field and thus each MPI-task will be assigned to 37 blocks.

F. Linear Algebra
Dense linear algebra is extensively used in yambo.Among the most time-consuming tasks we have identified the inclusion of local field effects [38] in the RPA response function The solution of Eq.20 can be cast in the form of a matrix inversion.Indeed: Eq.21, and the solution of the BSE (diagonalization), which can be considered prototype kernels.
Once a finite basis set is adopted, the operators involved are represented as (dense) N ×N matrices, with N easily reaching few-to-tens of thousands or more, making standard linear algebra tasks (such as matrix multiplication, inversion, diagonalization) quite intense.We have therefore implemented dense parallel linear algebra by exploiting the ScaLAPACK library [79] within the MPI parallel structure of yambo.Concerning the RPA response, this means that on top of the MPI parallelism over qvectors, multiple instances of parallel linear algebra are run at the same time (one per q vector) to compute χ RPA .
The behavior of the yambo parallel linear algebra is governed by the variables: runlevel_nCPU_LinAlg_INV=64 runlevel_nCPU_LinAlg_DIAGO=64 where (runlevel could be, for example, the RPA response function or the BSE).Given the relevance, the calculation of the IP response function χ 0 GG (q, ω) has also been block-distributed over G, G vectors (g-parallelism in Sec.VIII C), both in terms of computation and memory-usage.
When using the SLEPC diagonalization method to obtain the BSE spectra, the memory distribution of the eigensolver (not to be confused with the memory distribution of the BSE matrix discussed in Sec.V A 2) is handled by the SLEPC library itself.For more details, the reader is referred to the SLEPC specific literature [77].

IX. SCRIPTING AND AUTOMATION
As a pure many-body code, yambo works as a sort of "quantum engine" that takes as input DFT calculations and material-specific parameters, producing very large amounts of temporary data (e.g. the response function) and outputs numerical results.Even a single calculation can produce enormous amounts of data.It is therefore necessary to carefully select or extract the relevant information to be stored for future analysis or sharing, possibly ensuring reproducibility.
In addition, the final quantities of interest (e.g.GW band gaps or BSE spectra) are often the results of a complex and tedious sequence of operations, involving transferring data from different codes (e.g. from Quantum ESPRESSO to yambo) or repeating calculations with different parameters (e.g. for convergence tests).The benefit of having platforms to organise, simplify and accelerate many-body perturbation theory calculations is obvious.Two parallel efforts are being developed to facilitate the use of yambo, namely yambopy and the yambo interface with the AiiDA platform [128].

A. Yambopy
Yambopy is a community project to develop Python classes and scripts to express, automate, and analyze calculations with the yambo code.A typical yambo workflow involves a few steps: generating the KS states with a DFT code, preparing the yambo databases and then running the yambo code.These workflows become more complicated when performing convergence tests or when repeating the same calculations for multiple materials.With yambopy the user can express these yambo workflows on a python script that can then be shared and reproduced among different users.We currently provide python classes to read and write the input files as well as the output of the yambo code.A lightweight python interface for the Quantum ESPRESSO Suite is also provided in the qepy package distributed along with yambopy.For more comprehensive python interfaces for the DFT codes see ASE [129], Abipy [130] for the Abinit code, or AiiDA [128].
The YamboIn class provided by yambopy is used to read and write the base yambo input file generated by yambo and modify it in a programmatic way.With this tool it is possible, for example, to create a set of input files by changing a single variable inside a for loop.We also provide classes to read the yambo NetCDF databases in Python (for a complete list see the on-line documentation [131]).These classes provide methods to manipulate and represent the data using the matplotlib [132] library giving great flexibility for the interpretation and analysis of the results.Running these workflows on a HPC context requires to write job submission scripts for different job schedulers, this can be done using the schedulerpy package also accompanying yambopy.
For quick access to some features from the command line, we provide the yambopy shell command.The script is automatically installed with yambopy in such a way that some functionalities of yambopy are directly callable from the command line.This script has features to plot the convergence tests for GW and BSE calculations, excitonic wave functions, and dielectric functions among others.
To ensure software quality and usability we provide yambopy as an open-source code along with documentation and automatic testing.A detailed documentation of the classes, features, and a tutorial are available on the yambopy website [131].We keep a public git repository hosted on Github where the users can download the latest version of the code as well as contribute with patches, features, and workflows.Sharing the workflows among users allow us to avoid repeated technical work and greatly simplifies the use of the yambo code.Continuous integration tests are done using the Travis-CI platform [133] leading the code to be tested at each commit and thus enforcing its reliability.yambopy is a project under development, and will be described on its own in a future work.

B. yambo within the AiiDA platform
The AiiDA platform is a materials' informatics infrastructure which implements the so-called ADES model (Automation, Data, Environment and Sharing) for computational science [128].The AiiDA plugins and workflows for yambo are publicly available on GitHub [134], while online documentation and tutorials are available on Read the Docs [135].

The yambo-AiiDA plugin
Input parameters and scheduler settings are stored as code-agnostic AiiDA data-types in a database, then converted by the yambo-AiiDA plugin into yambo input files and transferred to a computational unit (e.g a remote workstation or an HPC cluster).The AiiDA daemon submits, monitors and eventually retrieves the output files of the yambo calculation, the relevant information is then parsed and stored by the plugin.While the relevant data is properly stored in a suitable database, the raw input and output files are also stored locally in a repository.Therefore inputs, calculations and outputs are all stored as nodes of a database connected by directional links, preserving the full data provenance and ensuring reproducibility.
The yambo-AiiDA plugin currently supports calculations of quasi-particle corrections (e.g. at the COH-SEX or GW level) and optical properties (e.g.IP-RPA).Quantum ESPRESSO, one of the main DFT codes interfaced with yambo, is also strongly supported with specific plugins and workflows for AiiDA [136].Some of the parsing functionalities of the plugin are powered by the yambopy package [131].Different types of calculations can be performed, either starting from Quantum ESPRESSO or from p2y or from a previous (possibly unfinished) yambo run.

AiiDA workflows: automated GW
The yambo-AiiDA package provides automated workflows that capture the knowledge of an experienced user in performing e.g.GW calculations within the plasmon-pole approximation, accepting minimal inputs such as a DFT calculation or a crystal structure, and returning as outputs a set of quasi-particle corrections.The yambo-AiiDA plugin repository hosts four AiiDA workflows [128] of increasing complexity and abstraction: YamboRestart, YamboWf, YamboConvergence and YamboFullConvergence, that perform different but mutually interdependent tasks, with the latter depending on the former in the listed order.
YamboRestart is a low level AiiDA workflow that takes a DFT calculation (or a prior yambo calculation) as input, performs GW or BSE calculations, and returns the results.YamboRestart interacts directly with the yambo plugin, coping with common failures that may occur during a yambo GW run such as insufficient maximum walltime and out-of-memory issues: the workflow adjusts the scheduler options as well as the parallelization choices accordingly and resubmits the calculations.
YamboWf is a higher-level AiiDA workflow that uses YamboRestart and the Quantum ESPRESSO-AiiDA plugin to manage end-to-end a GW calculation from the DFT step to the completion of the yambo run.In contrast to YamboRestart, which starts from an already existing calculation (either DFT or yambo), YamboWf does not need to start from any calculation and performs all steps, including all necessary DFT, data interfaces, and yambo calculations.
YamboConvergence is built on top of YamboWf and automates the convergence of QP corrections (by focusing on the quasiparticle gap) with respect to a single parameter.A one-dimensional line search in the parameter space is used.The convergence is determined by comparing a series of the most recent calculations (four of them are used by default), and ensuring the change between all four successive calculations is less than the convergence tolerance.The deviation from convergence is estimated by fitting the gap to a function of the form f (x) = c+ a x+b .YamboFullConvergence iterates the above procedure over the main variables governing the convergence of GW calculations, namely the k-point grid, the number of Gvectors used to represent the response function (χ 0 cutoff), and the number of bands included in the sum-overstates for both the polarizability and the correlation selfenergy.Additionally, the possibility to further reduce the FFT grids with respect to the one used at the DFT level is also considered.A beta version of this workflow has been made available on GitHub for testing and finetuning of the algorithm.

C. Test-suite and benchmark-suite
A new important tool introduced to improve and stabilize the development of the yambo code is the test-suite.The yambo test-suite is stored in a dedicated repository (yambo-tests) on GitHub and contains a series of tests which can be run in an automated manner.The repository is freely accessible after registering as a "yambo user" on the GitHub.While the test-suite is mainly aimed at developers, users can also benefit from accessing its in-put and reference files and automatically checking if their compiled version of yambo works properly.
The test-suite is governed by using a Perl script, driver.pl.This script uses internal Perl modules to perform several tasks: it automatically compiles the yambo code (a precompiled version can also be used), it runs the code and checks the output against reference files stored in the test-suite repository.
The code can be run in serial, parallel with OpenMP threads and checking parallel I/O and/or parallel linear algebra.At least two different groups of tests are available: smaller (and faster) tests which are run on a daily basis and longer tests which are used for a deeper testing of the code before a release.
The same driver can also be used to run yambo benchmarks.Benchmarks tests are a particular group of materials that, describing complex nano-structures (a 1D polymer or carbon-based ribbon) or a water cell, require a large number of reciprocal space vectors and/or k-points.As a consequence these systems are suitable to be executed using a large number of cores on parallel machines.In this case the test-suite can collect the results and loop on different parallel configurations testing their performances.More importantly the test-suite organizes the results in machine dependent folders that can be, eventually, post-analyzed.
The results of the night runs of the test-suite are publicly available on the web-page [137] and can be inspected without having access to the machines that run the tests.This is very useful in order for any development to reproduce a specific error to be fixes.

X. CONCLUSIONS AND PERSPECTIVES
This paper describes the main development lines of the yambo project since the 2009 reference paper [22].Yambo is a scientific code supported and continuously developed by a collaborative team of researchers.The long list of authors of this work attests to the involvement of numerous experienced and young developers in addition to the four founders [22].
The yambo team currently comprises a balance of renowned scientists, with long-standing experience in abinitio approaches, and young researchers.We welcome students and post-docs with new ideas.This combination makes possible the growth of a software suite which is formally rigorous and able to address topics at the frontiers of materials science.By exploiting the power of many body perturbation theory at equilibrium and out-of-equilibrium within a state-of-the-art ab initio framework, the code is able to make predictions of the electronic and optical properties of novel materials, and moreover to provide interpretation of cutting-edge experiments ranging from ultrafast electron dynamics to nonlinear optics.
The involvement of parallel computing experts (two members of the Italian National Supercomputing Center CINECA co-authored this paper, for example) ensures that the code is also efficient and portable to the latest supercomputing architectures.As a result, new features added to the code immediately benefit from the native parallelized environment.
The modular structure of the code and the interface to external supporting software (AiiDA and Yambopy) complete the picture providing the end-user with a wealth of tools that cover the actual preparation, calculation and post-processing of data.The yambo suite thus provides all the ingredients for an advanced and computationally powerful approach to theoretical and computational material science.
What lies in yambo's future?We expect that the future development of yambo will be driven by the need to interpret new experiments.This will be achieved through the implementation of advanced computational algorithms and physical methodologies and will increasingly exploit interoperability with other software.Projects under current development include extension of GW to start from hybrid functionals, the possibility to use ultrasoft pseudopotentials, alternative schemes to avoid empty states, BSE at finite q, and incorporation of exciton-phonon coupling, to name just a few.We hope that all of these new developments will become available to general users in the near future.The code's efficiency will be continuously improved in order to tackle problems that remain computationally cumbersome.We expect that parts of yambo will be restructured in order to adapt to hybrid technologies (e.g.GPU units) and to fully exploit the computational power of the future computer architectures and future pre-and "exascale" machines.Developments are (and hopefully will be) also driven by the participation in European initiatives and projects.At present yambo is part of a user-based European infrastructure [25] and a member of the suite of codes selected for the exascale transition [24].
In conclusion, yambo is a lively community project characterized by a continuous technical and methodological development.The substantial development between the 2009 reference paper [22] and today demonstrates its enormous potential.The aim is to provide the scientific community with a tool to perform cutting edge simulations in a computationally efficient environment.

XI. ACKNOWLEDGMENTS
The yambo team acknowledges financial support from a number of sources, notably including H2020 funding, as follows.This work was in part supported by the project MaX -MAterials at the eXascale -by the European Union H2020-EINFRA-2015-1 and H2020-INFRAEDI-2018-1 programs (Grant No. 676598 and No. 824143, respectively).This work was also supported by the European Union H2020-INFRAIA-2014-2015 initiative (Grant No. 654360, project NFFA -Nanoscience Foundries and Fine Analysis).We also would like to acknowledge sup-In Eq. (C3), the parameter ¯ χ0 denotes the extrapolar energy for the polarizability, while N b is the number of conduction band states included in the calculation.Finally, as in sec.III, f s is the spin occupation factor, while n and m are band indexes.
FIG.1:The yambo project combines cutting edge computational materials science within a beyond-DFT framework with high performance algorithms, tools, and libraries

FIG. 2 :
FIG.2: Main software components and their general function.

5 FIG. 4
FIG. 4: e-e linewidths (Γ nk ) and lifetimes (τ nk ) of selected dbands of copper.Different level of approximations are shown together with the experimental data (diamonds with error bars).The calculated lifetimes are: Full line; G0W0.Dotted line: OMS G0W0.Dashed line: OMS G1W0.(reprinted with permission from Ref 59.Copyright (2001) by the American Physical Society).
FIG. 5: (Color online).Effect of the G-terminator on the convergence vs number of bands included in the self-energy on the VBM and CBM GW-corrections for a bulk Si described in a supercell.

FIG. 7 :
FIG.7:(Color online).GW band structure of monolayer WS2 including spin-orbit coupling and using 48×48×1 kpoints grid for the self-energy.The orange lines represent Wannier-interpolated bands obtained from 7 QP energies corresponding to a 6×6×1 grid (black dots), while the red dots shows the QP energies of the full 48×48×1 grid.
FIG. 8: (color online) Optical absorption spectrum of monolayer hBN in a √ 3× √ 3×1 supercell.The red line refers to an iterative solution using the Haydock solver.The blue shaded region corresponds to a SLEPC calculation where only the first two excitons were included.The inset shows the intensity of the exciton wave function corresponding to the main peak, based on the latter calculation (the hole position is fixed above a nitrogen atom and the resulting electron distribution is displayed).

FIG. 9 :
FIG. 9: (color online) Optical absorption spectrum of a WSe2 (panel a) from Ref. 86, compared with Excitonic DOS (panel b).Calculations are performed including SOC.In the Excitonic DOS both the bright A and the dark A * exciton are visible as a change in the slope (the dashed lines are a guide for the eyes, while the vertical continuous lines mark the energy positions of the excitons).Only the bright A exciton is instead visible in the absorption.

FIG. 10 :
FIG. 10: Exciton wave functions Ψ λ of states λ = 3 (a) and λ = 8 (b) of bilayer hBN (only the layer where the electron density is non-negligible is shown).The intensity of Ψ λ is shown in the top frames.Its phase is displayed in the lower frames for two inversion-symmetrical positions of the hole (r h and I(r h )).The hole is always fixed above a nitrogen atom of the layer not shown.Adapted from Ref. [91].

FIG. 11 :
FIG. 11: (a) Spectral functions (SFs) of few selected electronic states of trans-polyethylene.In the inset, the last four occupied bands are shown.The red line marks the k-point at which the corresponding SFs are presented.The selected states are marked with dots.(b) Two-dimensional plot of the SFs of trans-polyacetylene.The range of values of A(k,ω) are given in terms of dimensionless quantity Z nk .(reprinted with permission of the European Physical Journal (EPJ) fromRef.[101]and with the permission of American Physical Society from Ref.[105]).

FIG. 12 :
FIG. 12: Quasiparticle linewidths of bulk silicon calculated by using the GW approximation for the e-e scattering (green boxes) and the Fan approximation for the e-p scattering (red circles).The two gray areas denote the energy regions VBM − Eg and CBM + Eg.Adapted from Ref.[107].

FIG. 13 :
FIG. 13: Single-layer MoS2: (a) Electron-phonon correction of the eigenvalues ε nk for several temperatures.(b) Electronphonon widths of several temperatures and electronic density of states (black line).

FIG. 15 :
FIG. 15: [Color online] Time dependent polarization in hBN, panel (a), obtained obtained solving Eq. 16 within the HSEX approximation.In panel (b) its Fourier transform, red circles, matches the absorption computed within BSE, black line.(reprinted with permission from Ref. 114.Copyright (2011) by the American Physical Society.)

−7 s ) FIG. 16 :
FIG. 16: [Color online] Top panel: schematic representation of real-time simulation for the non-linear response.Bottom panel: magnitude of χ (2) (−2ω, ω, ω) for bulk CdTe calculated within the QPA (black triangles) and TDH (red circles).Each point corresponds to a real-time simulation at the given laser frequency.Comparison is made with experimental results from Refs.115, 116 (blue stars and hexagons).(reprinted with permission from Ref. 117.Copyright (2013) by the American Physical Society.)

FIG. 18 :
FIG. 18: yambo parallel performance.Upper panel: chemical structure of the precursor polymer of a chevron-like Graphene nanoribbon.Lower panel: yambo speedup of the linear response (χ0) and self-energy (Σc) kernels during a GW run.The scaling (obtained using 16 MPI tasks per node an 4 OpenMP threads/task) is shown up to 1000 Intel KNL nodes on Marconi at Cineca, A2 partition, corresponding to a computational partition of about 3 PetaFlops.The dashed line indicates the ideal scaling slope.(Adapted with permission from [127].Copyright (2017) by the Royal Society of Chemistry).
FIG. 19: Structure of the yambopy package.The qepy and schedulerpy packages are distributed as part of yambopy.

TABLE I
: Illustrative list of some of the configuration command line options.More options are available and can be listed by using ./configure--help.

TABLE II :
Command line options for the various yambo tools.
Si atoms, 72 occupied Effect of the X-terminator on the convergence (vs number of bands included in the response function) of the VBM and CBM GW-corrections for a bulk Si described in a supercell.
[42]r online).Convergence plots of GW-corrected data for the VBM of a TiO2 NW (27 atoms, 108 occupied states) as a function of the number of bands included in the calculation.Response and self-energy terminators are simultaneously applied.Calculations have been performed using the same number of bands for the polarizability and the self-energy.The black line show the usual GW convergence with no corrections.Coloured lines are obtained applying the method of Ref.[42]with different values of the extrapolar energy, ranging from 1.0 to 3.0 Ha above the last explicitly calculated KS state.