A GPU code for analytic continuation through a sampling method

We here present a code for performing analytic continuation of fermionic Green’s functions and self-energies as well as bosonic susceptibilities on a graphics processing unit (GPU). The code is bas ...


Introduction
The last two decades have seen a drastic increase of computational studies in solid state physics. These not only comprehend works based on density functional theory [1], but also more involved works based on many-body theory [2]. Solving the many-body problem usually involves the calculation of Green's functions and self-energies. At finite temperature, it becomes more convenient to perform this calculation at imaginary times, or equivalently at imaginary energies. Although this makes it possible to perform more efficient simulations, the physical observables one is interested in are still defined at real times, or equivalently at real energies. Hence, one needs to perform analytic continuation of Green's functions and selfenergies in the complex plane. Unfortunately, this results in an ill-posed problem, whose solution may be difficult to obtain without a priori knowledge on the function itself. For these reasons, a variety of numerical methods for performing the analytic continuation exist. The most celebrated methods include the maximum entropy method [3][4][5][6][7][8][9][10], the Padé approximant method [11][12][13][14], the Singular Value Decomposition [15,10], the non-negative Tikhonov method [16], the non-negative leastsquare method [17], stochastic regularization methods [18] and sampling methods [19][20][21][22].
In this article we focus on the approach proposed by Mishchenko et al. in Ref. [19], where the analytic continuation is approximated with a sum of rectangles to be determined via stochastic sampling. Although this method has raised high expectations in the scientific community, its applicability has been so far limited by the high computational cost. This requires that one needs to perform the analytic continuation on supercomputers facilities, which are not always accessible. Moreover, one has often to experience long queuing times, which makes the data analysis particularly inconvenient. It would be desirable to be able to work directly on a common laptop. To fulfill this need, we present here a code for the analytic continuation through Mishchenko's method for graphics processing units (GPUs). At the moment the code is applicable to fermionic Green's functions and self-energies, as well as to bosonic susceptibilities. For typical problems it is possible to perform several analytic continuations in a few minutes on a common laptop, which is going to significantly extend the number of researchers interested in Mishchenko's approach. Moreover, some of the presented numerical routines may find applicability for other problems requiring a stochastic sampling.

Problems and background
This program provides an approximate solution to the illposed Fredholm integral equation where G, K are known functions and ρ is unknown. For our purposes G(x) can be a fermionic Green's function, a fermionic self-energy or a bosonic susceptibility with the symmetry that its spectral function is odd. For the first two, x is either the imaginary time τ or the imaginary fermionic (Matsubara) frequency iω n . For a bosonic function x can only be the imaginary bosonic frequency iω n (extensions to τ are planned in the future). The function K in Eq. (1) is known as the kernel, and for the three possibilities above can take the form: where the subscripts f and b stay for fermionic and bosonic respectively. The wanted unknown function ρ(ω) is the spectral function, defined as a function of real energies ω.

Mishchenko's stochastic optimization method
To obtain a solution ρ(ω) to Eq. (1), A.S. Mishchenko et al. [19] have developed an approach based on the Monte Carlo method. The main idea is to approximate the true ρ(ω) withρ(ω), which is a sum of R rectangles, then calculateG(τ ) for that particularρ(ω) and compare the result to the known values of G(τ ) to get an indication of how adequate the guess ρ(ω) was. The rectangles are finally updated in a Monte Carlo fashion, as described below. More in detail,ρ(ω) can be written as where the sum runs over R rectangles, which are defined as Here {P i } = {h i , w i , c i } represents width, height and position (referred to the middle point). A set of rectangles is called a configuration and corresponds to a givenρ(ω). The latter can be transformed into the correspondingG(x) using Eq. (1). Given the fact that the rectangle function (1) gives The integral within the sum can either be solved analytically or calculated numerically depending on the kernel K . In a practical situation the function to continue is known at a finite set of N points. To express the deviation betweenG(x) obtained from configurationρ(ω) and the true G(x) that is known at N points, one can use the following deviation measure [19]: Minimizing Eq. (6) numerically is an ill-conditioned problem but by generating M independent solutions and taking the average the noise of the independent solutions averages out [19]. The implemented algorithm starts with an initial configuration and iteratively applies random changes to the configuration to reduce the deviation D[ρ(ω)] in Eq. (6). The full method can be separated into elementary, local and global updates of the configurations: Elementary update: A single alteration C of a configuration such as e.g. changing the width of a rectangle or shifting its center. Local update: A series of E consecutive elementary updates: Each elementary update is either accepted or discarded according to a probability function. The result of the local update is the configuration of the series that has the lowest deviation D[ρ(ω)]. Global update: A series of L local updates, each including E elementary updates, starting from a randomly The average of all independent solutions of the global updates, as in Eq. (7), provides the final approximation to ρ(ω).

Software architecture
The developed software is a GPU accelerated C++ code for the stochastic optimization method outlined above. All GPU related operations are based on CUDA, which is a parallel computing platform and programming model for NVIDIA graphic cards.
The software structure and interactions are shown in Fig. 1. The main computational library is accessed through the function compute which is defined in mish.cuh. This function takes three input arguments, i.e. an input structure and two float arrays, which contain the output spectral function ρ and its argument ω. The details of the input structure are described in the source code.

Software functionalities
The standard and recommended way of using the present software is through the program main.cu, located in the source folder. This program reads two input files, control.in and infloa.in. The former specifies both algorithm-specific and problem-specific parameters, while the latter contains the input function for a finite number of points (τ or iω n ). If one intends to perform analytical continuation of a self-energy, first the Hartree-Fock term should be removed, then the data should be normalized to get analogous asymptotics as for a Green's function. After the continuation, the self-energy spectrum has to be renormalized with the same factor as the input. For a bosonic function the input Matsubara data has to be divided by G(iω n = 0), while the output function has to be multiplied with −ωG(iω n = 0)/2. The file infloa.in requires input data in the column format x, Re[G(x)], Im[G(x)]. For real valued input functions, the imaginary part is simply left out. The file out.dat contains the resulting spectral density function from Eq. (7) in the column format ω, ρ(ω). The output is space separated and can be viewed directly by e.g. gnuplot.

Implementation details
The parallelization of the algorithm is done in two layers. Firstly, each particular solution of Eq. (7) can be generated independently. This is known as an embarrassingly parallel task. The second layer is parallelization of selected tasks within the computation of one particular solution. This specifically includes the computation of Eq. (5). The sum for each x can be computed independently. Eq. (6) is also done in parallel as well as various memory transfer related tasks.
The main CUDA kernel, which does the majority of the computations, is executed using a one dimensional CUDA grid with size equal to the number of global updates. The CUDA block size is set to be equal to the number of input Green's function points. This means that the grid and block size are fixed depending on the input parameters. A drawback with this is that the occupancy of the kernel might be low if the number of input points is low. This also means that there exist a maximum amount of input points which is limited by the hardware, typically 512 or 1024 input points. However, in our experience, this is not an issue in real world cases.
The significant data is stored in global memory on the GPU in single precision and the total GPU memory usage can exceed hundreds of megabytes. The algorithm implementation is heavy on memory load/store operations on the GPU which is the current main performance bottleneck.

Illustrative examples
The software is tested through an accurate comparison with existing CPU implementations of the stochastic optimization method [19], and a very good agreement is found. Illustrative examples can be provided for the analytic continuation of known functions. This implies first converting a known spectral function to a Green's function through Eq. (1) and then performing the analytic continuation through our software. Fig. 2 shows a comparison between the analytic continuation obtained by means of our code versus the exact result for two simple test-cases consisting of (fermionic) Gaussian functions. Although these tests involve only symmetric functions with a limited number of peaks (one or two), the provided software reconstructs their shapes very well in both cases. A much more demanding test is presented in Fig. 3(a), for the atomic-like selfenergy of a Sm atom in a cluster containing seven Sm atoms, as described in Ref. [14]. This is the most demanding test for the analytic continuation, since it involves resolving several narrow peaks at short distances from each other. Our software, as well as CPU versions of Mishchenko's method, captures only the two main peaks close to zero energy but fail to resolve the high energy peaks. One should not think that these issues are proper of the Mishchenko's method alone. Other techniques as well have severe problems in reproducing the high energy peaks, unless very high precision data are provided, which are not always accessible. In Fig. 3(b) we instead present an example of analytical continuation of a bosonic susceptibility for a noninteracting electron system on a cubic lattice in random phase approximation, as described in Ref. [23].
Our software will be mostly applied to perform the analytic continuation of data plagued by numerical noise, e.g. from quantum Monte-Carlo simulations [24]. Illustrative examples for the Hubbard model on an infinite-dimensional Bethe lattice    at half-filling for U/W = 2.5 and β = 150 are reported in Fig. 4. Here U is the strength of the local Coulomb repulsion, W is half the bandwidth for the non-interacting system, and β is the inverse temperature. For these parameters the fermionic Green's function has two solutions; an insulating one and a metallic one. Continuations for both solutions are reported in Fig. 4. Given that the exact analytic form of the solution is not known, we compare our results to the maximum entropy method [3][4][5]. The agreement is in general good. The maximum entropy method seems to have a slightly higher resolving power, but requires a priori information on the shape of the function, while the stochastic optimization method is completely unbiased. In fact, one of the most convenient applications of the stochastic optimization method is to provide an initial estimate of the true spectral function, which can then be used as an input for other techniques, e.g. as a model function for the maximum entropy method.

Performance
To our knowledge, the existing implementations of the stochastic optimization method by Mishchenko et al. are not very well optimized and thus, not suitable for speedup mea- surements of GPU versus CPU. One simple metric to measure the parallel efficiency is by looking at the total execution time while varying the global updates (number of particular solutions). The total amount of computations increases linearly with each added global update. Since we scale the number of GPU cores used with the number of global updates, a perfectly parallel execution would require a constant runtime, regardless of the number of global updates. This feature is tested for two different GPUs, and the results are shown in Fig. 5. The GTX 670 is a faster and more efficient GPU compared with the 610M. Fig. 5 shows that the GTX 670 scales better when increasing the number of global updates. This indicates that the implementation scales very well when increasing the parallel capacity.

Conclusions
We have presented a GPU implementation of the stochastic optimization method by Mishchenko et al. [19] to perform the analytic continuation of Green's functions. Our software makes it possible to obtain results with sufficient accuracy (i.e. a stochastic sampling which is large enough) on a common laptop, without requiring the access to a supercomputer. For a typical problem with 64 input points, in only 25 s our implementation can do about 200 elementary updates for 200 local updates and a total of 128 global updates, which corresponds to a total of 5 120 000 elementary updates. We are confident that this code can contribute to increase the usage of the stochastic optimization method among scientists working in many-body theory.