Distributed and multi-core computation of 2-loop integrals

. For an automatic computation of Feynman loop integrals in the physical region we rely on an extrapolation technique where the integrals of the sequence are obtained with iterated/repeated adaptive methods from the Q UADPACK 1D quadrature package. The integration rule evaluations in the outer level, corresponding to independent inner integral approximations, are assigned to threads dynamically via the OpenMP runtime in the parallel implementation. Furthermore, multi-level (nested) parallelism enables an efﬁcient utilization of hyperthreading or larger numbers of cores. For a class of loop integrals in the unphysical region, which do not suffer from singularities in the interior of the integration domain, we ﬁnd that the distributed adaptive integration methods in the multivariate P AR I NT package are highly efﬁcient and accurate. We apply these techniques without resorting to integral transformations and report on the capabilities of the algorithms and the parallel performance for a test set including various types of two-loop integrals.


Introduction
The integral for an n-dimensional scalar Feynman diagram with L loops and N internal lines can be represented in Feynman parameter space as Here the functions C and D are polynomials determined by the topology of the corresponding Feynman diagram [1] and we consider n = 4. Section 2 of this paper outlines a multi-core algorithm for iterated/repeated integration and extrapolation, which allows for parallelization at multiple levels of the integration scheme.A parameter ε is introduced in the integrand denominator and decreased according to a geometric sequence.The corresponding integrals are evaluated by iterated integration, using a one-dimensional adaptive program from the QUADPACK package [2].The ǫ-algorithm [3,4] is applied for convergence acceleration of the integral sequence as ε tends to zero.
Section 3 gives a brief description of the distributed adaptive algorithm in the PARINT multivariate integration (cubature) package [5].Written in C and layered over MPI [6], PARINT consists of parallel adaptive, Monte Carlo and quasi-Monte Carlo methods.Not unlike the QUADPACK routines, these are implemented as tools for automatic integration, where the user defines the integrand function and the domain, and specifies a relative and absolute error tolerance for the computation (t r and t a , respectively).Denoting the integral by If over the domain D, it is then the objective to return an approximation Qf and absolute error estimate , or indicate with an error flag that the requested accuracy cannot be achieved.When a relative accuracy (only) needs to be satisfied we set t a = 0 (and vice-versa).
Note that, after removing the δ-function in (1) and eliminating one of the variables in view of N j=1 x j = 1, the integration is over a d-dimensional simplex of the form (4) with d = N − 1.
The multivariate adaptive approach yields excellent results for a set of two-loop box integrals specified by Laporta [7].On the other hand, the iterated scheme with extrapolation has been shown to succeed for various classes of integrals in the physical region [8,9,10,11,12], where multivariate adaptive integration fails.
Results for the Laporta diagrams are reported in Sections 4 and 5 of this paper.The treatment of the sample integrals does not rely on integral transformations as proposed in [12] to improve the integrand behavior.We obtained most of the numerical approximations and timings on the HPCS (High Performance Computational Science) cluster at WMU.For this purpose we used 16-core cluster nodes with Intel Xeon E5-2670, 2.6GHz processors and 128GB of memory, and the cluster's Infiniband interconnect for message passing via MPI.
Some sample sequential results were also collected from runs on a PC node at minamivt005.kek.jp with 6-core Intel Xeon X5680@3.33GHzCPU, and on a 2.6GHz Intel Core i7 Mac-Pro with 4 cores.For the inclusion of OpenMP [13] multi-threading compiler directives in the iterated integration code (based on the Fortran version of QUADPACK) we used the (GNU) gfortran compiler and the Intel Fortran compiler, with the flags -fopenmp and -openmp, respectively.PARINT and its integrand functions were compiled with gcc.Code sections of the sample integrands are supplied in Appendix A.

Parallel iterated integration and extrapolation
The numerical integration over the interval [α j , β j ], 1 ≤ j ≤ d in (3) can be performed with the 1D adaptive integration code DQAGE from the QUADPACK package [2] in each coordinate direction.Via an input parameter to DQAGE we select the 15-point Gauss-Kronrod rule pair for the integral (and error) approximation on each subinterval.
where the w k and x (k) , 1 ≤ k ≤ K(= 15), are the weights and abscissae of the local rule scaled to the interval [a, b] and applied in the x j -direction.For j = 1 this is the outer integration direction.The function evaluation is itself an integral in the x j+1 , . . ., x d -directions for 1 ≤ j < d, and is computed by the method(s) for the inner integrations.For j = d, (6) is the evaluation of the integrand function Note that successive coordinate directions may be combined into layers in the iterated integration scheme.Furthermore, the error incurred in any inner integration will contribute to the integration error in all of its subsequent outer integrations [14,15,16].
Since the F () evaluations on the right of ( 5) are independent of one another they can be evaluated in parallel.Important benefits of this approach include that (i) the granularity of the parallel integration is large, especially when the inner integrals F () are of dimension two or greater; (ii) the points where the function F is evaluated in parallel are the same as those of the sequential evaluation; i.e., apart from the order of the summation in (5), the parallel calculation is essentially the same as the sequential one.This important property facilitates the debugging of parallel code.
As another characteristic the parallelization does not increase the total amount of computational work.
In addition, the memory required for the procedure is determined by (the sum of) the amounts of memory needed for the data pertaining to the subintervals incurred in each coordinate direction (corresponding to the length of the recursion stack for a recursive implementation).Consequently the total memory increases linearly as a function of the dimension d.
For a numerical extrapolation with the ǫ-algorithm [3,4] we generate a sequence of integral values I(ε ℓ ), using a geometric progression of ε ℓ which tends to zero with increasing ℓ (see also [8,10]).The extrapolation results given here are obtained with a version of the ǫ-algorithm code (DQEXT) from QUADPACK.Note that the accuracy of the extrapolated result is generally limited by the accuracy of the input sequence, which in turn is determined in PARINT by the user-prescribed error tolerance for the integration.
The input parameters of DQAGE include the maximum number (limit) of subdivisions allowed in the partitioning of the given interval by the adaptive algorithm.A related output parameter is the error code iflag, which returns 1 to indicate that the number of subdivisions reached its maximum, limit, upon termination (and is 0 otherwise).In an iterated integration it is sometimes beneficial to keep track of the number of times limit is reached throughout the integration on each particular level, as an indication of the difficulty of the integrand in different coordinate directions.

PARINT parallel adaptive integration
Layered over MPI [6] for distributed computing, the PARINT adaptive integration executable launches a user-specified number of processes which assume the role of controller and workers.The integration domain is divided initially among the workers.Each on its own part of the domain, the workers engage in an adaptive partitioning strategy similar to that of DQAGE, DCUHRE [17] and HALF [18] by successive bisections, and thus each generate a local priority queue of subregions as a task pool.The priority queue is implemented as a max-heap keyed with the estimated integration errors over the subregions, so that the subregion with the largest estimated error is stored in the root of the heap.However, if the user specifies a maximum size for the heap structure on the worker, the task pool is stored as a deap or double-ended heap which allows deleting of the maximum as well as the minimum element efficiently (in order to maintain the size of the data structure once it reaches its maximum).
Each task consists of the subdivision (bisection) of the associated subregion (generating two children regions), integration over the children, deletion of the parent region (root of the heap) and insertion of the children back into the heap.The bisection of a region is performed perpendicularly to the coordinate direction in which the integrand is found to vary the most, according to fourth-order differences computed Table 1.Times for N = 6 and N = 7 parallel iterated integration (HPCS cluster)  in each direction [17,18].The integration rule on a subregion is determined by the type of region and the user-selected polynomial degree.The available cubature rules in PARINT include a set of rules for the d-dimensional cube [19,20,21], the Quadpack rules for d = 1, and a set of rules for the d-dimensional simplex [22,23].The latter are the simplex rules by Grundmann and Möller [23] (which were also found by de Doncker [24] via extrapolation).
An important mechanism of the distributed integration algorithm is the load balancing strategy, which is receiver-initiated and geared to keeping the loads on the worker task pools balanced, particularly in cases of irregular integrand behavior.The message passing is performed in a non-blocking and asynchronous manner, and permits overlapping of computation and communication.The user has the option of turning load balancing on or off, as well as allowing or dis-allowing the controller to also act as a worker.Optionally the installation can be configured to use long doubles instead of doubles.
The multivariate partitioning (sequentially or in parallel) exhibits a dimensional effect (exponential as a function of d) with respect to the total number of regions generated and thus the required total memory.To some extent this is helped by the distribution of the subregions over the participating processors' local memories in the parallel system.
As a result of the asynchronous processing and message passing on MPI, PARINT executes on a hybrid platform (multi-core and distributed) by assigning multiple processes to each node.However, it is generally not possible to repeat an execution exactly.The parallelization also leads to breaking loss or extra work performed at the end of the computation, due to the time elapsed while workers continue to generate subregions after the termination message has been issued but before receiving it.This may increase the total amount of work (compared to the sequential work), particularly when the number of processes is large.

Results of multi-threaded iterated integration
We consider five two-loop box diagrams from Laporta [7], given in Fig 1 .Table 1 gives an overview of the problem specifications, the execution times for a varying number of threads, and the absolute accuracy achieved (|Error|) with respect to the values C 0 pp.46-47 in [7] for the diagrams in  2.
The multi-threading is invoked in the outer (x 1 ) integration in (3), except for the entry denoted (15,15) which represents nested parallelization applied in the outer two (x 1 and x 2 ) integrations.It emerges that the latter outperforms the single level parallelization, showing a more efficient use of the available cores.The integrations are performed over the simplex S d for d = N − 1 (see ( 4)).Appendix A.1-5 provides code segments for the integrands corresponding to Fig 1(a-e).The integrand evaluation in the code corresponds to an expression of the real part integrand where the factor εC in the denominator of (1) is replaced by ε for the extrapolation as ε → 0. Note also that the factor 1/(4π) nL/2 is omitted.
The problem specifications in Table 1 include: the number of internal lines N, the relative error tolerance t r requested for the 1D integrations of the iterated scheme, the maximum value limit for the number of subdivisions of the adaptive integration procedure in each coordinate direction, and the starting value of ε = ε 0 for the extrapolation sequence.For the sequence we use where the length of the sequence (ℓ max ) is around 15.In view of the trade-off between the amount of work done and the accuracy achieved, we choose smaller ε 0 and larger limit for higher accuracy work.

Diagram with five internal legs
For a sequential run on minamivt005, with target relative accuracy of 10 −6 for the 1D integrations, and the maximum number of subdivisions limit = 10, 10, 20, 20 in the x 1 , x 2 , x 3 and x 4 coordinate directions, respectively, extrapolation with the ǫ-algorithm on a sequence of 15 integrals for ε ℓ = 1.2 −15−ℓ , 0 ≤ ℓ < 15 converges with relative error 4 × 10 −6 in a total elapsed time of 30.3 seconds.
A more accurate computation leads to the convergence results in Table 2, where in successive columns the following are given: ℓ, ε ℓ , the integration time, the integral approximation corresponding to ε ℓ and the extrapolated value for C 0 .The integration time listed is that of a sequential run on a node of the HPCS cluster.The target 1D relative accuracy for the integrations was 10 −10 , ε 0 = 1.2 −20 and the maximum number of subdivisions limit = 30, 30, 50, 50 in the successive coordinate directions.The extrapolation converges within eight to nine digits which agree with Laporta's result.In view of the small computation times of the individual integrals, parallelization is not highly beneficial.Some speedup is observed for up to four threads with a parallel time of 445s, compared to 536s for two threads, and 843s for one thread (i.e., the sequential time of Table 2. Parallel speedup is defined as the ratio of sequential to parallel time.

Diagram with six internal legs
In test runs for a relative tolerance of 10 −8 , and allowed maximum number of subdivisions limit = 10 in each coordinate direction, extrapolation on the sequence with ε ℓ = 1.2 −10−ℓ , 0 ≤ ℓ < 15 converges to Laporta's result within 4 to 5 digits, in about 1,156 seconds on a node of the HPCS cluster for the sequential computation (764 seconds on Mac Pro).Timings for limit = 10 and 15, and for a varying number of threads are given in columns 2 and 3 of Table 1.
We achieved a higher accuracy (differing 1.8 × 10 −10 from Laporta's result, i.e., with a relative accuracy of 6.5 × 10 −10 ), as specified in column 4 of Table 1.Note how the time decreases overall from 37,381 seconds for the sequential run to 3,238 seconds with the parallel version, yielding a speedup of about 12.6.The convergence results in Table 3 and execution times per iteration, using 15 threads, were obtained on the HPCS cluster.Corresponding to C 0 = C 0 (ℓ), the last column represents |C 0 (ℓ) − C 0 (ℓ − 1)| which serves for the purpose of gauging the convergence.Since it is based on only two consecutive elements it may not be a good estimate of the error.The ǫ-algorithm routine DQEXT in QUADPACK produces a more sophisticated (and more conservative) error estimate.

Diagrams with seven internal legs
The integrands for the ladder diagram of By examining the iflag parameters, the output from the execution reveals that the maximum number of subdivisions is reached particularly toward the inner integrations.For this problem, reversing the order of the integration variables as shown in the function definition in Appendix A.3, improves the performance dramatically.Keeping all other inputs the same, the sequential execution time decreases to 4,151 seconds while the accuracy of the result is maintained.Parallel times obtained when varying the number of threads (on HPCS cluster nodes) are given in Table 1 (column of Fig 1(c)).A speedup of about 14.7 is achieved with nested parallelism.

Two-loop ladder box diagram Fig 1(d). For the diagram of Fig 1(d) a preliminary minamivt005
run with limit = 5 in all directions, relative integration tolerance 10 −6 and 15 integrations starting at ε = 1.2 −10 , gave the result 0.1036438 which, compared to Laporta's listed value of 0.10364072, has an absolute accuracy of about 3 × 10 −6 (relative accuracy 3 × 10 −5 ), after an execution time of 7,848 seconds.The iflag parameters indicated difficult behavior in the inner integrations for this problem.By changing the order of the integration variables as listed in the code segment of Appendix A.4 for the integrand function, while keeping all other inputs unchanged, the program runs in 1,072 seconds on Mac Pro and delivers the extrapolation result 0.1036451 (with relative error 4.2 × 10 −5 ).
We produced a more accurate result and timed the executions with limit = 14 in all directions, relative 1D integration tolerance 10 −8 , for 17 integrations starting at ε 0 = 1.2 −14 .For the timing results see Table 1 (column of Fig 1(d)).The extrapolated integral, 0.1036407236, has a relative error of 2.5×10 −8 .Note the overall gain in performance by the parallelization, from 15,317 seconds (sequential) to 1,217 seconds (parallel), or a speedup of about 12.6.

Two-loop crossed box diagram Fig 1(e).
A sequential run on minamivt005 with limit = 5 in all directions, relative integration tolerance 10 −6 , ε 0 = 1.2 −10 generated the result 0.85354533e-01 at ℓ = 15, in 2,901 seconds.The integration variables were set as: (6).With the order of the integration variables changed to that of the code segment in Appendix A.5 for the integrand, the program returned the result 0.85355139e-01 at ℓ = 14, and 0.85349174e-01 at ℓ = 15, in a total time of 801 seconds on Mac Pro.
Higher accuracy runs were performed with relative integration tolerance 10 −11 , limit = 14 and ε 0 = 1.2 −13 , resulting in 0.85351402e-01 at ℓ = 14 in 5,032 seconds, and 0.85351395e-01 at ℓ = 15 in 5,838 seconds (total).Table 4 gives an overview of the convergence and the times for the individual iterations using nested multi-threading (15,15).As another comparison, since the problem is actually unphysical for the specified kinematics, it is possible to set ε = 0 in the integrand.With 1D relative error tolerance 10 −8 and limit = 15, this gives 0.853513981536e-01 (with absolute error estimate of the outer integration 2.2e-09) in 4,383 seconds, using (15,15) nested multi-threading.Laporta's method did not solve this problem at the time of the writing of his article.

PARINT results
PARINT contains a parallel adaptive method with load balancing for d-dimensional rectangular regions roughly based on the subdivision strategy of [19,20,21] and for simplex regions [22,25], with results provided by d-dimensional integration rules of degrees 7 and 9 for the cube, and 3, 5, 7 and 9 for the simplex.For the problems at hand, some testing reveals that it is more efficient for PARINT to transform the integral from the (unit) simplex to the (unit) cube, followed by an integration over the cube.We use the transformation (x(1), . . ., x(d)) Since the integrals with kinematics specified by Laporta [7] are unphysical we set ε = 0, or eps = 0 in the integrand codes of Appendix A.1-5.However, care needs to be taken to handle boundary singularities in case the integration rule employs function evaluations on the boundary.For example, if x 1 = x 4 = x 5 = 0 (and eps = 0) in the definition of fc in A.1, then cc = dd = 0. We set the integrand to 0 to avoid a divide-by-0 floating point exception.Up to 64 MPI processes were spawned on up to 4 cluster nodes, using 16 procs (slots) per node.When referring to numbers of integrand evaluations, million and billion are abbreviated by "M" and "B", respectively.Since the times vary when the same run is repeated, the plots show timings usually averaged over three or four runs for the same p.The times T p are expressed in seconds, as a function of the number of MPI processes p, 1 ≤ p ≤ 64.
Table 5 gives a brief overview of pertinent test specifications, T 1 , T p and the speedup S p = T 1 /T p for p = 64.For example, the times of the N = 5 diagram decrease from 32.6s at p = 1 to 0.74s at p = 64 for t r = 10 −10 (reaching a speedup of 44); whereas the N = 7 crossed diagram times range between 27.6s and 0.49s for t r = 10 −7 (with speedup exceeding 56).
For the two-loop crossed box problem we also ran PARINT in long double precision.The results for an increasing allowed maximum number of evaluations and increasingly strict (relative) error tolerance t r (using 64 processes) are given in Table 6, as well as the corresponding double precision results.Listed are: the integral approximation, actual relative error estimate E r , number of function evaluations reached   and time taken in long double precision, followed by the relative error estimate, number of function evaluations reached and running time in double precision.Allowing for 16 processes per node and up to 64 processes total, the MPI host file has four lines of the form nx slots=16 where nx represents a selected node name.The running time is reported (in seconds) from the PARINT executable, and comprises all computation not inclusive the process spawning and PARINT initialization.For a comparable number of function evaluations, the time using long doubles is slightly less than twice the time taken using doubles.The iflag parameter returns 0 when the requested accuracy is assumed to be achieved, and 1 otherwise.Reaching the maximum number of evaluations results in abnormal termination with iflag = 1.
The integral approximation for the double computation is not listed in Table 6; it is consistent with the long double result within the estimated error (which appears to be over-estimated).Using doubles the program terminates abnormally for the requested relative accuracy of t r = 10 −10 .Normal termination is achieved in this case around 239B evaluations with long doubles.

Concluding remarks
In this paper we explored parallel integration methods for the automatic computation of a class of two-loop box integrals [7]: (i) multi-threaded iterated/repeated integration with extrapolation, and (ii) distributed adaptive multivariate integration.With regard to (i) we demonstrated the capabilities of the extrapolation for convergence acceleration, and of the parallelization on multiple levels of the iterated integration scheme to speed up the execution.
For physical kinematics we rely on calculating a sequence of integrals for extrapolation as the parameter ε decreases [8,9,10,11,12], and it has generally not been possible to calculate the integrals for small enough ε using adaptive partitioning of the multivariate domain.However, the test cases specified in [7] allow setting ε = 0 in the integrand, and it emerges that the distributed adaptive approach of PARINT solves these problems very efficiently (although iterated integration has an advantage with respect to low memory usage).PARINT executes on multiple nodes, and multiple cores per node.Future work in this area will address, for example, the distributed (non-adaptive) quasi-Monte Carlo and Monte Carlo methods in the current PARINT package, parallel random number generators and utilizing GPUs and other accelerators for the computations.
Future work on the iterated strategy will include improved error handling across the levels of the iterated scheme.Furthermore, the parallelization of the iterated strategy can be extended to handle infrared divergence as presented in [27].Higher-dimensional problems can be attempted with iterated integration by combining some successive 1D into multivariate layers as was done in [8].Evaluating the points in the 2D (or 3D) outer layer on GPUs may yield an alternative in some cases, if the structure of the computation is sufficiently similar for all evaluation points [26].In view of the computationally intensive nature of the underlying problems, roundoff error guards and suitable summation techniques need to be considered.Finally, the multi-threaded iterated integration can be incorporated within the distributed setting of PARINT on MPI [6].
Fig 1(b-d).For Fig 1(e) we compare with the result from PARINT executed in long double precision (cf., Section 5 below).Detailed results for the N = 5 diagram of Fig 1(a) are listed in Table

Fig 2
presents timing results obtained with PARINT on the HPCS cluster, corresponding to the five diagrams in Fig 1.

Table 5 .
Test specifications and range of times in Fig 2(a-d)

Table 6 .
PARINT double and long double results for crossed diagram Fig 1(e)