A GPU compatible quasi-Monte Carlo integrator interfaced to pySecDec

The purely numerical evaluation of multi-loop integrals and amplitudes can be a viable alternative to analytic approaches, in particular in the presence of several mass scales, provided suﬃcient accuracy can be achieved in an acceptable amount of time. For many multi-loop integrals, the fraction of time required to perform the numerical integration is signiﬁcant and it is therefore beneﬁcial to have eﬃcient and well-implemented numerical integration methods. With this goal in mind, we present a new stand-alone integrator based on the use of (quasi-Monte Carlo) rank-1 shifted lattice rules. For integrals with high variance we also implement a variance reduction algorithm based on ﬁtting a smooth function to the inverse cumulative distribution function of the integrand dimension-by-dimension. Additionally, the new integrator is interfaced to py SecDec to allow the straight-forward evaluation of multi-loop integrals and dimensionally regulated parameter integrals. In order to make use of recent advances in parallel computing hardware, our integrator can be used both on CPUs and CUDA compatible GPUs where available.


Introduction
High energy particle physics is in an era where the current underlying theory, the Standard Model (SM), is very well tested experimentally, as well as consistent and therefore predictive from a theoretical point of view. This means that we can control the SM predictions very well, and so should be able -at least in principle -to identify physics beyond the SM even if it is showing up only in small deviations.
In practice, there are several obstacles when trying to increase the precision of theoretical predictions. Focusing on problems accessible to perturbation theory, a major obstacle is the fast increase in complexity of the calculation as the number of loops and the number of kinematic scales increases. Despite the remarkable progress that has been achieved in the analytic calculation of multi-loop amplitudes and integrals in the last few years, analytical approaches are only at the beginning of a journey into largely unexplored mathematical territory if the function class of the results goes beyond multiple polylogarithms (MPLs), typically involving elliptic or hyper-elliptic functions, see e.g. [1][2][3][4][5][6][7][8][9][10][11][12][13].
On the other hand, (semi-)numerical approaches do not necessarily become less efficient if the result leaves the class of MPLs. This is one of the reasons why it is important to develop numerical methods which are fast and accurate enough to provide results where analytic approaches are at their limits. Sector decomposition [14][15][16][17] is an example of such a method; other recent semi-numerical methods are described e.g. in Refs. [18][19][20][21][22][23][24][25][26]. Sector decomposition is a procedure which can be applied to dimensionally regulated integrals in order to factorize singularities in the regulator. The resulting finite parameter integrals, which form the coefficients at each order in the regulator, can then be numerically integrated. There are several public implementations of sector decomposition [27][28][29][30][31][32][33][34][35]. Recently, an analytical method, based on sector decomposition followed by a series expansion in the Feynman parameters and analytic integration, has been worked out in Ref. [36].
Currently, in the publicly available sector decomposition tools, numerical integration mostly relies on either deterministic integration rules for integrals of low dimensionality or Monte Carlo integration, as implemented in the Cuba library [37,38]. However, the integration error for Monte Carlo integration scales only like 1/ √ n, where n is the number of samples, which limits the accuracy that can be obtained in a given integration time. To improve on this, a different integration method has to be chosen. One such method is the quasi-Monte Carlo (QMC) method [39], where the integration error scales like 1/n or better, rather than 1/ √ n. In order for this scaling to be achieved, the integrand functions need to fulfil certain requirements. The QMC method discussed herein was first applied to functions produced by the sector decomposition algorithm in Ref. [40], where it was shown practically that the conditions for 1/n or better scaling are usually met and the good performance of Graphics Processing Units (GPUs) when evaluating such functions was also demonstrated. An application of quasi-Monte Carlo methods to two-and three-loop integrals also has been presented in Ref. [41]. The QMC method, implemented to run on GPUs, has already been applied successfully to phenomenological applications involving multi-scale two-loop integrals including Higgs-boson pair production [42,43] and H+jet production [44] at NLO.
In this work, we present a new stand-alone QMC integrator capable of utilizing multiple cores of Central Processing Units (CPUs) and multiple Graphics Processing Units (GPUs). We also present a new version of the program pySecDec which makes available our QMC implementation as an additional integrator. Furthermore, we present and implement a method for combining the QMC integration with importance sampling. We emphasize that our QMC implementation can also be straightforwardly used outside of the pySecDec context.
The outline of the paper is as follows. In Section 2 we give an overview on the QMC method as implemented in our program and describe our variance reduction procedure. Section 3 is dedicated to the stand-alone usage of the QMC integrator library, we also describe the design of the library and some basics regarding the use of GPUs. In Section 4 we explain the usage of the QMC integrator within pySecDec and describe various examples. Section 5 is dedicated to profiling the QMC method and our implementation. After we conclude in Section 6, we provide detailed API documentation in Appendix.

Quasi-Monte Carlo integration
Our aim is to numerically compute the multiple integral of a function f : R d → R or f : R d → C over a d-dimensional unit hypercube [0, 1] d , (1) In this section we will briefly introduce the concept of quasi-Monte Carlo integration and state the most relevant results and formulae. The study of QMC integration has produced a vast amount of literature, for a more thorough review we refer the reader to the existing mathematical literature, for example Refs. [39,45] and references therein.
Unlike Monte Carlo integrators, quasi-Monte Carlo (QMC) integrators are based on a predominantly deterministic numerical integration. An unbiased estimateQ n,m [f ] of the integral I[f ] can be obtained from the following (QMC) cubature rule, known as a rank-1 shifted lattice (R1SL) rule [39]: The rank of the rule denotes the minimal number of generating vectors required to generate the lattice rule. In this work we will A reliable estimate of the integral can be obtained even without random shifts provided that the lattice is sufficiently large, however, the random shifts allow the remaining error to be estimated. More precisely, an unbiased estimate of the meansquare error can be obtained from the random shifts of the lattice according to In typical applications only 10-20 random shifts are required to obtain a reliable estimate of the error. In Fig. 1 an example shifted lattice is shown. In the left panel a single lattice is displayed. The zeroth point is shifted from the origin by the random shift vector ∆ 0 . Further points are generated by adding z/n and wrapping back into the unit square as necessary. The lattice displayed contains a total of n = 55 points. In the right panel, three additional shifted lattices are displayed. They are generated by shifting the original lattice and can be used to produce an estimate of the integration error using Eq. (3).
The classical theoretical error bounds on QMC rules take the form of a product where t i are the cubature points generated by the generating vector(s), D is the discrepancy of the point set and V is the variation of f . The discrepancy depends only on the points and the variation depends only on the integrand. If f can be differentiated once with respect to each variable then it can be proven that for a particular choice of cubature points (or, equivalently, a particular generating vector) QMC methods converge as O((log n) d /n).
This error bound grows exponentially with dimension, seemingly implying that QMC integration is not useful in a large number of dimensions.
However, by working with weighted function spaces, it can be shown that the error bound can be independent of the dimension provided that the variables of the integrand f have some varying degree of importance. In the modern literature, error bounds have been studied in terms of the product where e γ is the worst case error in a weighted function space with weights γ and ∥f ∥ γ is the norm of f in the weighted space.
Following Ref. [39] we will discuss two function spaces (Sobolev spaces and Korobov spaces) that allow important properties of the QMC to be proven. In both cases we will state theorems from the literature that bound the worst case error for a rank-1 shifted lattice rule in the corresponding function space. By definition, the worst case error for a shifted lattice rule in the weighted function space is the largest possible error for any function with norm less than or equal to 1, Here we use the notation e γ (z, ∆) in place of e γ (t 0 , . . . , t n−1 ) to refer to the worst case error of the point set generated by z (the generating vector) and ∆ (the random shift vector Note that the norm depends only on the mixed first derivative because we never differentiate more than once with respect to a particular variable. It can be shown [39,46] that for functions belonging to such a space a R1SL rule exists for which the shift averaged worst case error is given by for all λ ∈ (1/2, 1]. Here ζ is the Riemann zeta function and ψ is where u(h) := {j ∈ {1, . . . , d} : h j ̸ = 0} andf (h) are the Fourier coefficients of the integrand, given bŷ For functions belonging to a Korobov space with smoothness α the shift averaged worst case error is given by for all λ ∈ (1/(2α), 1]. The best convergence rate is obtained when λ → 1/(2α), which yields a convergence close to O(n −α ) independently of d (for suitably chosen weights). Functions which are smooth but not periodic can be periodized by an integral transform as described in Section 2.3. This can improve the rate of convergence of quasi-Monte Carlo integration but may also increase the variance (or norm) of the function, especially in high dimensions.

Generating vectors
The convergence of the rank-1 lattice rule given in Eq. (2) depends on the choice of the generating vector z and in particular the worst case errors given in Eqs. (8) and (10) can only be achieved with specific choices of z. An efficient algorithm to construct good generating vectors z is the component-by-component construction [47], where a generating vector in d dimensions is obtained from a d −1 dimensional one by selecting the additional component such that the worst-case error is minimal. This allows to construct the generating vectors with a cost of O(d n log n).
We provide generating vectors for different fixed lattice sizes n, which have been obtained for a Korobov space with product More details on the generating vectors provided with the QMC library are given in Appendix A.3.

Transformations
Lattice rules perform particularly well for continuous and smooth functions which are periodic with respect to each variable. Sector decomposed functions are typically continuous and smooth but not periodic. However, they can be periodized by a suitable change of variables x = φ(u), where φ(u) = (φ(u 1 ), . . . , φ(u d )), ω d (u) = d ∏ j=1 ω(u j ) and In practice, the periodizing transform may be specified in terms of the weight function, ω, in which case the change of variables is given by We have implemented the following periodizing transformations: • Korobov transforms [48][49][50], • Sidi transforms [51], • Baker's transform [52].
The Korobov transform is defined by the polynomial weight function The weight parameters r 0 , r 1 are usually chosen to be equal. The behaviour of the integrand near the endpoints should be taken into account when choosing the weight parameters r, as the variance of the integral can depend critically on their choice.
Asymmetric Korobov transforms with r 0 ̸ = r 1 can be beneficial in cases where the integrand approaches a singularity near one of the endpoints, while the other endpoint does not exhibit any singular behaviour. Sidi transforms [49,51] are trigonometric integral transforms with a weight proportional to (sin πu) r : The Sidi transforms may be used to periodize an integrand in a similar manner to the Korobov transforms. One potentially negative feature of the Sidi transforms is that several trigonometric functions need to be computed for each sample of the integrand. This can increase the cost (in terms of machine operations) considerably, especially for relatively simple integrands.
The baker's transformation [52] (also called ''tent transformation''), given by can be applied to achieve close to O(n −2 ) convergence for nonperiodic integrands. The transform periodizes the integrand by mirroring rather than forcing it to a particular value on the integration boundary. Naively the fact that the transform is discontinuous might lead us to expect a poor asymptotic scaling (due to the fact that the transform is not smooth). However, an analysis based on considering the transform as a modification of the lattice rather than of the integrand allows the convergence of O(n −2 ) to be proven. In a moderate number of dimensions (d ≳ 9) the baker's transform typically does not increase the variance of the integrand as much as the Korobov and Sidi transforms. Therefore, although it has a slower convergence rate, the baker's transform can still prove useful. A critically important point to consider when choosing a periodization strategy is the number of dimensions in which the integration will be performed. In particular, applying a periodizing transform can increase the variance of the integrand exponentially with its dimension d. Although it is possible to construct rank-1 lattice rules whose worst case error is independent of d (or depends at most polynomially on d), increasing the variance of the integrand can spoil the convergence of the quasi-Monte Carlo integration [50]. For integrands in a relatively low number of dimensions (d ≲ 8) the increase in variance caused by higher weight (r ≳ 3) periodizing transforms can be counteracted by the improved smoothness of the integrand which leads to an improved asymptotic scaling behaviour with the number of lattice points n.

Variance reduction
By applying a variable transformation y = p(x) to a onedimensional integral the integration becomes trivial if p ′ (x) ∝ f (p(x)) −1 . While it is usually not possible to find a transformation fulfilling this condition exactly, it is possible to find approximations to it. This leads to an integrand with reduced variance, which can significantly improve the convergence of the integration when using numerical integration techniques. A well known method to apply variance reduction to multi-dimensional integrals is the Vegas algorithm [53], where the above procedure is applied to each integration variable separately, with the remaining variables integrated out. In the algorithm of Ref. [53], for each integration variable x, the transformation p(x) is constructed as a strictly increasing, piecewise linear function, such that p ′ (x) resembles the shape of |f (p(x))| −1 . However, this procedure leads to discontinuities in p ′ (x), which spoil the smoothness of the integrand and thus the scaling of the numerical integration when directly applying this algorithm in combination with QMC integration. Instead, we use the ansatz to parametrize the variance reducing transformation. The parameters a i are obtained via a fit to the inverse of the cumulative distribution function (CDF), The ansatz in Eq. (19) is chosen such that p(0) = 0 and p(1) = 1. The parameters a 2 and a 3 are required to be positive, and a 0 ∈ [1.001, ∞), a 1 ∈ (−∞, −0.001] such that no singularities are introduced within the domain of integration by the transforms. The parameters are optimized by sampling the integrand with a lattice of given size to numerically obtain an estimate of the CDF for each integration parameter and applying a non-linear leastsquares fit using the routines implemented in the GNU Scientific Library [54]. We find that the ansatz in Eq. (19) works well for typical functions obtained by sector decomposition. While this ansatz in principle can be applied to other integrals as well, we expect that for other functions it can be beneficial to modify it to improve the fit of the CDF of the corresponding integrand.

Installation
If you wish to use the integrator with your own code rather than within pySecDec, then it is available as a c++11 singleheader header-only library at https://github.com/mppmu/qmc. Download the header and include it in your project. Since the QMC is a header only c++ template library it does not need to be separately configured and built.
In order to build the header as part of your project you will need: • A c++11 compatible c++ compiler. • The GNU Scientific Library (GSL), version 2.5 or greater. • (Optional GPU support) A CUDA compatible compiler (typically nvcc). • (Optional GPU support) CUDA compatible hardware with Compute Capability 3.0 or greater.
Simply include it in your project, ensure that it can be found by your compiler (using compiler include path specifiers if necessary) and then build your project, linking against the GSL.

Minimal example
In this section we provide examples of the usage of the integrator as a stand-alone package. The usage within the c++ interface to pySecDec is similar while the usage via the python interface to pySecDec differs significantly. Both uses within pySecDec are described in Section 4.2.
The code of a minimal program demonstrating the usage of the integrator is shown in Fig. 2. In this example, the 3-dimensional where <arch> is the architecture of the target GPU or com-pute_30 for just-in-time compilation (see the Nvidia nvcc manual for more details). The compile flag -x cu explicitly specifies the language of the input files as CUDA, rather than letting the compiler choose a default based on the file name suffix. The compile flag -Xptxas -O0 disables optimization of the code by the PTX assembler, as of CUDA 9.2 we found rare cases where code optimization led to wrong results. The flag -Xptxas -disable-optimizer-constants disables the use of the optimizer constant bank which can be exhausted for large integrands, it is not strictly necessary to pass this flag for simple examples.
In Fig. 2, on lines 4-13, a functor my_functor, containing the function to be integrated is defined and instantiated. On line 19 the QMC integrator is instantiated with a Korobov transform of weight 3. The MAXVAR variable controls the maximum number of integration variables over which a particular instance of the QMC integrator can integrate, it should be set to a value equal to or larger than the maximum number of integration parameters present in any functor that will be passed to the instance of the QMC integrator. On line 20 the functor instance is passed to the integrate function of the integrator, this will trigger the numerical integration. The integrator returns a result struct containing the integral and its uncertainty, which are printed on lines 21-22. The CUDA function execution space specifiers __host__ and __device__ on line 7 are present only when compiling with GPU support. This is controlled by the presence on line 6 of the __CUDACC__ macro which is automatically defined by the compiler during CUDA compilation.

Usage
We envisage two typical use case scenarios for the QMC library: (1) The user knows relatively little about the integrand but wishes to know the result with a specific relative and/or absolute accuracy. (2) The user has a reasonable idea how the QMC performs on their integrand and wishes to obtain a result as quickly as possible.
We discuss these use cases in turn. A description of all public fields and member functions is given in Appendix.

Case 1
In order to evaluate an integral to a specific relative and/or absolute accuracy without significant human input, the following QMC integrator member variables are relevant: epsrel, epsabs, maxeval along with the member function integrate.
Firstly, the QMC must be initialized with a suitable integral transform and the user must decide whether to use the variance reduction methods described in Section 2.4. If nothing is known about the periodicity and variance of the integrand, we would typically recommend using a Korobov weight 3 transform if the integrand lives in less than 9 dimensions (otherwise the baker transform may be more suitable) and no variance reduction. If the QMC does not produce even a rough estimate of the integral (∼ 20% error) with a moderate lattice size then the variance reduction procedure may prove useful and the fit function should be specified as shown in Fig. 3.
The user can then set the epsrel and epsabs fields to the desired accuracy. In addition, the parameter maxeval ensures that the integration terminates in a reasonable time, even if the desired accuracy cannot be reached. The integration terminates once any of the three conditions is met. What constitutes a suitable value of maxeval depends on the complexity of the integrand (in terms of floating point operations), the hardware available for computing the integral and the time the user is willing to wait for a result.
Finally, the integrate function can be called on the input function. In Fig. 3 we display the above steps in code (for a 5-dimensional real integrand named my_integrand).
If a fit function has been provided, the QMC library will evaluate evaluateminn lattice points and use them as input to the fitting and variance reduction procedure as described in Section 2.4. The QMC library will then apply the selected periodizing transform to the fitted function. If no fit function has been provided the QMC will apply the periodizing transform to the input function and proceed to the next step directly.
In the next step, a total of minm randomly shifted copies of the smallest possible lattice greater than minn in size will be sampled and used to estimate the integration error. If the required error goal has not been reached the result will be discarded and a larger lattice will be selected and computed. This procedure will be repeated as necessary until the desired error goal is reached or maxeval function evaluations have been performed; at which point the integration will terminate and the last result obtained will be returned.
If, during the iteration, the QMC requires a lattice larger than can be produced with the available generating vectors it will instead select the largest lattice and attempt to reduce the integration error by adding random shifts. In this case the QMC will achieve only Monte Carlo O(n −1/2 ) scaling. If an acceptable result cannot be achieved with Monte Carlo scaling then the user is advised to compute and supply additional (larger) generating vectors as described in Appendix A.3.
Note that, unlike some other integration algorithms, the results from all but the last iteration have no effect on the final result. It is therefore always more efficient to directly evaluate a lattice that gives an acceptable integration error rather than asking the library to try to find a suitable lattice size by iterating.

Case 2
If the user has a reasonable idea how the QMC performs on their integrand, for example by studying similar integrands or evaluating their integrand with a small lattice, then a result can most quickly be obtained by setting the parameters minn and calling the member function integrate. In order to ignore the default error goals epsrel and epsabs, the parameter maxeval should be set to 1. In Fig. 4 we display the above steps in code (for a 5-dimensional real integrand named my_integrand).
The QMC library will evaluate minm randomly shifted copies of the smallest possible lattice with at least minn points and return the result. If this result is not satisfactory then the user can increase minn and retry the integration. In order to estimate what lattice size is suitable it is sometimes useful to investigate the scaling behaviour of the integrand by evaluating several different lattices. Note that, as can be seen in Fig. 14, the scaling of the QMC is quite 'noisy' in the sense that very similarly sized lattices can produce estimates of the integral with errors that differ by an order of magnitude or more. This behaviour can hinder straightforward attempts to estimate the scaling behaviour of an integrand.

Usage on GPUs
In order to use the QMC library on CUDA enabled GPUs the user must ensure that their integrand functor can be evaluated on the chosen device. This usually entails taking the following steps: • Ensuring that the c++ language features used in the integrand function are supported by the relevant CUDA device.
• Designing the integrand function so that it does not need to access data that will be stored only in host memory.
• Marking the call operator of the integrand functor __host__ __device__, as shown in the examples above.
For the purpose of monitoring GPU usage and debugging we have found the following tools provided by Nvidia, and distributed with the CUDA toolkit, to be useful: • nvidia-smi, a top like management and monitoring utility for Nvidia GPU devices.
In most cases the usage of GPUs within the QMC is straightforward, however, the attentive user may notice that the program behaves in a slightly different manner than when using only CPUs. Let us discuss some of the most prominent features of CUDA devices which can affect the usage of the QMC library.
The Nvidia kernel mode driver must be running and connected to the GPU device before any user interaction with that device can take place. If the kernel mode driver is not already running and connected to the target GPU the invocation of a program that interacts with the GPU will cause the driver to load and initialize the GPU. This will incur a start up cost of 1-3 s per GPU. For short running integration jobs this cost can be a significant fraction of the integration time. On Windows, the kernel mode driver is loaded at start up and kept loaded until shut down, however, by default the time-out detection and recovery (TDR) feature will cause driver reload and should be disabled (we refer to the latest Nvidia documentation). Similarly, under Linux, if an X-like process is run from start up to shut down it will usually initialize and keep alive the kernel mode driver. However, if no long-lived X-like client is kept running (for example in many HPC environments) the kernel mode driver will initialize and de-initialize the target GPU each time a GPU application starts and stops. As of CUDA 9.2 the Nvidia recommended way to circumvent the delay due to starting and stopping the kernel mode driver is to run the Nvidia Persistence Daemon (we refer to the latest Nvidia documentation).
When compiling with nvcc we strongly recommend to look up and enter the real architecture of the graphics card in use, e.g. -arch=sm_70. If a virtual architecture is specified, the device code is just-in-time compiled for the real architecture on a single core, which may become the dominant fraction of the runtime.
For initial tests, the virtual architecture compute_30, which is the oldest supported in CUDA version 9.2, should be compatible with most GPUs that are currently in use. For more information we refer to the nvcc manual.

Design and implementation
In order to numerically integrate a single function, the QMC integrator library can concurrently utilize multiple multi-threaded CPUs as well as multiple CUDA hardware accelerators, provided they all belong to a single system. To achieve reasonable performance on heterogeneous systems a receiver-initiated central work queue load balancing algorithm is utilized.
The load balancing algorithm consists of the following steps: • A central work queue is initialized by the main thread.
• The main thread then spawns cputhreads worker threads if −1 is listed in devices and additionally one worker thread per GPU is listed in devices. • Each worker requests work from the work queue and when the work is completed continues to request work until the queue is cleared, at which point the workers terminate.
The disadvantage of this algorithm is that the central work queue must be atomically locked to ensure work is not repeated. This can impact performance significantly when a quick-to-evaluate integrand is computed using a large number of cores and/or CUDA devices. The advantage of this design is that even if the performance of the workers differs vastly (for example a worker computing on a single CPU core compared to a worker distributing work to a powerful GPU) the workload is reasonably balanced provided that the work packages are not so large as to leave a poorly performing worker with so much work that it finishes significantly later than all others.
In order to utilize massively parallel CUDA hardware, threads assigned to provide work to an accelerator will request significantly more work from the queue per access than workers assigned to a single CPU core. The amount of work (number of ''work packages'') requested at once by a worker assigned to a CUDA device is controlled by the product of cudablocks and cudathreadsperblock, while workers assigned to a CPU core always request only a single ''work package''.
At the time of release the default values of parameters affecting the load balancing are usually a reasonable choice for most feasible integrands and existing hardware. Naturally, as the state of the art advances and computer hardware evolves we may alter these default values in future QMC releases.

Usage of the integrator library within pySecDec
Here we describe briefly the installation and usage of pySecDec, focusing on the usage of the QMC integrator. For more details we refer to the manual https://secdec.readthedocs.io and to the examples distributed with pySecDec.

Installation
Before installing pySecDec, make sure that recent versions of numpy (http://www.numpy.org/) and sympy (http://www. sympy.org/) are installed. The pySecDec program (which includes the QMC integrator library) can be downloaded from https:// github.com/mppmu/secdec/releases. To install pySecDec, perform the following steps tar -xf pySecDec-<version>.tar.gz cd pySecDec-<version> make <copy the highlighted output lines into your .bashrc> The make command will automatically build further dependencies in addition to pySecDec itself. Further notes on the installation procedure are summarized in the online documentation https://secdec.readthedocs.io. To get started, we recommend to read the section ''getting started" in the online documentation.

Usage
Depending on the availability, it is possible to use the program with CPUs, GPUs or a combination of both.

Using CPUs only
The basic steps can be summarized as follows: (1) Write or edit a python script to define the integral, the replacement rules for the kinematic invariants, the requested order in the regulator and some other options, see e.g. the example examples/easy/generate_easy.py. (2) Run the script generate_easy.py using python. This will generate a subdirectory according to the name specified in the script.
(3) Type make -C <name>, where <name> is your chosen name. This will create the c++ libraries. (4) Write or edit a python script to perform the numerical integration using the python interface, see e.g. examples/easy/integrate_easy.py. Make sure that the QMC integrator is chosen in that file.

Using GPUs and CPUs
When using GPUs, steps (1), (2) and (4) of the previous Section 4.2.1 are the same. The only difference is in the compilation of the sector files (3) Type CXX=nvcc SECDEC_WITH_CUDA=<arch> make -C <name>, where <name> is your chosen name and <arch> is the argument forwarded to nvcc as -arch=<arch>. This will create the c++ libraries.
The compute capability <arch> is specific to each graphics card. The parameter <arch> can either be a suitable virtual architecture or a real architecture. We strongly recommend to look up and enter the real architecture of the graphics card in use, e.g. sm_70. If a virtual architecture is specified, the device code is just-in-time compiled for the real architecture on a single core, which may become the dominant fraction of the runtime. For first tests however, the virtual architecture compute_30, which is the oldest supported in CUDA version 9.2, should be compatible with most GPUs that are currently in use. For more information refer to the nvcc manual.

Basic usage
The basic usage of pySecDec is illustrated in the example easy. The slightly modified easy_cuda example shows how to compile and run the easy example on all available GPUs using either the python or the c++ interface. The generate file generate_easy.py shown in Fig. 5 is identical for both examples, easy and easy_cuda. The integrate file integrate_easy.py differs by the optional lines that select the QMC integrator. Choosing the QMC integrator as shown in Fig. 5 will make pySecDec use all CPU cores as well as all available GPUs.
In order to use GPUs, the code should be compiled with Nvidia's nvcc compiler. It is also possible to use non-CUDA compilers, though this will disable GPU support. The commands to run the examples are (a) using CPUs and GPUs: python generate_easy.py CXX=nvcc SECDEC_WITH_CUDA=compute_30 make -C easy python integrate_easy.py or (b) using CPUs only: python generate_easy.py make -C easy python integrate_easy.py How to use the QMC integrator with GPU support via the c++ interface is shown in the example easy_cuda. Note that above we have set the compute capability to compute_30. Please read the remarks about the compute capability in the previous section before running more complicated examples on the GPU.

Pentabox
The example pentabox_fin calculates a fully massless twoloop five-point function in the physical region with d = 6 − 2ϵ, see Fig. 8.
The pentabox is a master integral occurring in the calculation of 2 → 3 scattering at two loops. In 4 − 2ϵ dimensions, sector decomposition produces poles of order ϵ −5 at intermediate  stages. The 6 − 2ϵ dimensional version we investigate here is finite and therefore a more suitable master integral for numerical evaluation. This is an example where the use of a fit function considerably improves the convergence.

Elliptic 2-loop integral
The example elliptic2L_physical calculates a planar twoloop four-point function with one off-shell leg and a massive loop in the physical region, see Fig. 9. This diagram enters the NLO corrections to Higgs+ jet production and contains elliptic structures. The analytical result in the Euclidean region is given in Ref. [55]. While a numerical result for this integral already has been given in Ref. [35], the purpose of this example is to demonstrate that the number of correct digits which can be obtained using the QMC integrator cannot be reached in a reasonable amount of time using Monte Carlo integration.

Hyperelliptic 2-loop integral
The example hyperelliptic calculates a non-planar twoloop four-point function with three different masses and all propagators massive in the physical region, see Fig. 10. This integral is special since it is extremely hard to compute analytically, but is easily accessible numerically.

4-loop form factor example
The example formfactor4L calculates a four-loop threepoint integral in d = 6 − 2ϵ, see Fig. 11. Its analytic result is given in Eq. (7.1) of Ref. [57]. This example demonstrates the power of pySecDec to perform an efficient sector decomposition, even for integrals with many loops and internal propagators. Furthermore, it is a prime example to show how the QMC algorithm works for a larger number of integration dimensions (in this case 11 dimensions).
Since the integral has only one scale, the latter can be factorized. For better comparison with Ref. [57], we set the scale to −1 and the prefactor to (Γ (d/2 − 1)) 4

. Note that a factor (iπ
where L is the number of loops, is part of the integral measure used in pySecDec, such that the prefactor corresponds to Eq. (2.4) of Ref. [57].
To achieve an approximate 1/n scaling behaviour, the Baker transform had to be applied to the integrand. For this 11-dimen sional parameter integral, the Baker transform is superior to the Korobov transform as it does not increase the variance of the integrand. For details we refer to Ref. [58].

6-loop bubble
The bubble6L example consists of the 6-loop 2-point integral shown in Fig. 13. The pole coefficients are given analytically in Eq. (A3) of Ref. [59] (at p 2 = −p 2 E = −1, where p E is the external momentum in Euclidean space). The pySecDec symmetry finder reduces the number of sectors from more than 14 000 to 8774. We also note that the decomposition method 'geometric' needs to be used, as the method 'iterative' leads to an infinite recursion. The analytic result is given by The pySecDec result at p 2 = −1 obtained with the QMC integrator using minn=10**7, maxeval=1, transform='baker', fitfunction='polysingular' reads B num.
(32) leading to a relative precision of 10 −9 for n ≈ 10 9 . With a weight parameter α = 3, we obtain slightly larger errors for small n, but due to a scaling with approximately O(n −2 ), a relative precision of 10 −14 can be reached with n ≈ 10 9 . We note that the expected O(n −3 ) asymptotic scaling is not observed for lattices with n ≈ 10 8 and that, due to the use of double precision arithmetic, the integration error does not decrease when choosing even larger lattice sizes. The plot also shows that increasing the lattice size does not always lead to a corresponding improvement of the integration error. Instead, for individual lattices, the integration error can be significantly larger than that obtained from a lattice of similar size. We observe this effect for nearly all integrals, but which lattices lead to relatively large uncertainties depends on the integrand.

Scaling behaviour
The right-hand plot of Fig. 14 shows the results of an integral contributing to the NLO QCD corrections in Higgs + jet production [44] and has been selected as an example showing only slow For lattice sizes n ≲ 10 6 , we observe that the integration error only scales with n −1/2 and is larger than the true result of the integral. For larger lattice sizes, however, we find the expected O(n −1 ) scaling of the integration, allowing us to obtain the result with a precision better than 0.1%. Combining the QMC with importance sampling, the integration error for small lattice sizes is reduced by about a factor of 3 and improvements by more than a factor of 10 can be seen for large n. This example shows that sampling the integrand with a lattice of sufficient size is required to obtain the desired scaling of the QMC integration. We want to point out that for loop integrals it is possible to change the basis of required integrals using integration-by-part identities [60,61]. In many cases, this allows one to find a basis of integrals with an improved convergence of the numerical integration. For the results presented in Ref. [44], the integral discussed above was not used and, instead, it was possible to find an integral basis, where evaluating the corresponding integrals with n ≈ 10 6 was sufficient to obtain accurate results.

Timings for test functions
For comparison of the performance of numerical integrators, Genz [62] has introduced a test suite, consisting of six integrand functions with different features: 1. Oscillatory

Corner peak
6. Discontinuous The test functions help to identify how well an integrator handles oscillatory functions, multiple periodic peaks, one peak anywhere in the integration region, one peak at the end of the integration region, C ∞ functions, a continuous function whose derivatives are not continuous and finally a discontinuous function.
The integration region for all test integrands is the unit hypercube. The parameters w i can be chosen randomly and should not affect the rate of convergence as long as 0 ≤ w i ≤ 1. On the contrary, the positive parameters c i > 0 should affect the convergence behaviour, raising the complexity of the integral when ∥c∥ 1 is increased.
We integrate each of the above functions with the parameters: Number of dimensions: Requested relative accuracy: Maximum number of samples: In Table 1 we show for each integrator the average number of digits obtained (calculated by comparing to the analytic result) and the time taken. Note that the Cuba integrators and the QMC (CPU) instance do not make use of the GPUs, whilst the integrator denoted QMC makes use of all CPUs and GPUs. Some of Cuba's algorithms sample the integrand serially for at least part of the numerical integration, which can greatly increase the time required to reach n max evaluations. On the contrary, the QMC always samples in parallel and can usually make good use of all cores and devices. Furthermore, with the settings suggested by the test suite demo (distributed with Cuba), we find that our test bed machine is not well loaded by the Cuba integrators. The load produced by the Vegas algorithm of Cuba can be increased by up to a factor of 10 by increasing the nstart and/or nincrease settings and altering the nbatch setting.
The timings for the Divonne integrator of Cuba are omitted from Table 1. The Divonne algorithm, as implemented in Cuba, Table 1 Number of correct digits computed (time in seconds) for the evaluation of the test integrands using the QMC and the integrators implemented in Cuba (Vegas, Suave and Cuhre). We reiterate the warning given in Ref. [37] that the comparison chart should be interpreted with care. In particular, we emphasize that the test integrands appearing in the test suite, by virtue of their simplicity, bear few similarities to integrands for which numerical integration is typically applied. For this suite of functions the Cuhre routine as implemented in Cuba performs best, this is due to the particular functions chosen for the test suite and is usually not the case when applying the integrators to sector decomposed functions. The QMC performs reasonably on the example functions, often beating the number of digits obtained by any of the other Cuba integrators and taking less time. The QMC performs worse, as expected, when integrating functions which are not smooth, in particular, the C 0 and discontinuous functions. When utilizing GPUs the QMC typically takes around 1 s to compute the samples (compared to 30-80 s for the CPU) regardless of the actual number of function evaluations. This indicates that the number of samples to be computed in these examples is too small to fully saturate the GPUs.

Timings for loop integrals
In Table 3, the timings for several of the examples described in Section 4 are compared using the QMC on CPUs & GPUs, the QMC on CPUs and Vegas as implemented in the Cuba library. The timings are performed on a machine with 2 x Intel Xeon Gold 6140 CPU @ 2.30 GHz CPUs (36 cores, 72 threads) and 4 x Nvidia Tesla V100 GPUs. The times reported in Table 3 correspond to the wall clock times for running the integration via the python interface of pySecDec. In particular, the numerical integration of all orders reported for the examples given in Section 4.3 is included in the timings. The integrands are summed before integration (together=True).
The timings given in Table 3 are obtained with the same parameters of the QMC as stated in Section 4.3 except for the number of samples (minn). The maxeval parameter is set to 1 such that the QMC does not iterate. Vegas is also run with a fixed number of function evaluations (maxeval) while the error goals epsrel and epsabs are set to 10 −100 such that they do not trigger. The real and the imaginary part are integrated separately with Vegas (real_complex_together=False).
A special situation is encountered when integrating the 4-loop form factor with Vegas. The first output in verbose mode (flags=2) is printed to the screen only after about fifteen minutes. We suspect this long startup time is due to the rather large (879 MB) size of the dynamic pylink library in combination with the parallelization using fork as implemented in the Cuba integrators library.
It is generally faster to obtain many significant digits with the QMC integrator than with the Vegas integrator, especially when GPUs are available. For low-precision results however, Vegas can sometimes be faster.

Conclusions
We have presented a quasi-Monte Carlo integrator (QMC) which can be used both with GPUs and CPUs as a stand-alone library or within the pySecDec program. We have described the implementation of the QMC, based on a rank-1 shifted lattice rule, and We have presented a novel approach to combine the QMC integration with importance sampling. We have investigated how the O(1/n) scaling of the error estimate depends on the dimension and form of the integrand, in particular on the transformation used to achieve a periodic integrand. In agreement with Refs. [40,41], we have demonstrated that rank-1 shifted lattice rules can considerably outperform integrators based on the Monte Carlo method. We also confirm that, in many cases, the use of GPUs (rather than CPUs) can lead to a speed-up of an order of magnitude or more. This implies that the number of accurate digits which can be computed in a reasonable amount of time using our implementation is often beyond that which can be reached using Vegas-like Monte Carlo integration. It should be noted, however, that the functions produced by sector decomposition are typically continuous and smooth enough to achieve O(1/n) scaling, while this is not necessarily the case for other integrands, as they occur for example in NNLO phase space integrals based on analytic subtraction of doubly unresolved real radiation.
We believe that the method presented here, along with the easy-to-use, publicly available implementation, can boost the numerical evaluation of multi-loop amplitudes with several mass scales to an unprecedented level of automation, speed and accuracy.
The stand-alone version of the QMC integrator is publicly available at https://github.com/mppmu/qmc. The new version of pySecDec is available at https://github.com/mppmu/secdec/ releases and the online documentation can be found at https: //secdec.readthedocs.io. the European Union through the ERC Advanced Grant MC@NNLO (340983).

Appendix. API documentation
The QMC class has 7 template parameters: • T the return type of the function to be integrated (assumed to be a real or complex floating point type) • D the argument type of the function to be integrated (assumed to be a floating point type) • M the maximum number of integration variables of any integrand that will be passed to the integrator • P an integral transform to be applied to the integrand before integration • F a function to be fitted to the inverse cumulative distribution function of the integrand in each dimension, used to reduce the variance of the integrand (default: fitfunctions::None::template type) • G a c++11 style pseudo-random number engine (default: std::mt19937_64) • H a c++11 style uniform real distribution (default: std::uniform_real_distribution<D>) Internally, unsigned integers are assumed to be of type U = unsigned long long int. Typically the return type T and argument type D are set to type double (for real numbers), std::complex<double> (for complex numbers on the CPU only) or thrust::complex<double> (for complex numbers on the GPU and CPU). In principle, the QMC library supports integrating other floating point types (e.g. quadruple precision, arbitrary precision, etc.), though they must be compatible with the relevant STL library functions or provide compatible overloads.
To integrate alternative floating point types, first include the header(s) defining the new type into your project and set the template arguments of the class T and D to your type. The following standard library functions must be compatible with your type or a compatible overload must be provided: • sqrt, abs, modf, pow • std::max, std::min If your type is not intended to represent a real or complex type number then you may also need to overload functions required for calculating the error resulting from the numerical integration, see the files src/overloads/real.hpp and src/overloads/complex.hpp.
Example 9_boost_minimal_demo demonstrates how to instantiate the QMC with a non-standard type (boost::multiprecision::cpp_bin_float_quad). To compile this example you will need the boost library available on your system.

A.1. Public fields
Logger logger A wrapped std::ostream object to which log output from the library is written.
To write the text output of the library to a particular file, first #include <fstream>, create a std::ofstream instance pointing to your file then set the logger of the integrator to the std::ofstream. For example to output very detailed output to the file myoutput.log:

U minn
The minimum lattice size that should be used for integration. If a lattice of the requested size is not available then n will be the size of the next available lattice with at least minn points.  The functor func must define its dimension as a public member variable number_of_integration_variables. Calls: get_next_n. template <typename I> samples<T,D> evaluate(I& func) Evaluates the functor func on a lattice of size greater than or equal to evaluateminn. The samples are returned in a samples struct with the following members: • std::vector<U> z -the generating vector of the lattice used to produce the samples • U n -the size of the lattice used to produce the samples • D get_x(const U sample_index, const U inte-gration_variable_index) -a function which returns the argument (specified by integration _variable_index) used to evaluate the integrand for a specific sample (specified by sample_index).
The functor func must define its dimension as a public member variable number_of_integration_variables. Calls: get_next_n. template <typename I> typename F<I,D,M>:: transform_t fit(I& func) Fits a function (specified by the type F of the integrator) to the inverse cumulative distribution function of the integrand dimension-bydimension and returns a functor representing the new integrand after this variance reduction procedure.

A.3. Generating vectors
We offer generating vectors for different lattice sizes n, and also for different maximal dimensions s: s ≤ 6 or s ≤ 100.
The generating vectors which are distributed with the version described in this paper are summarized in Table A.1. We used the so-called Component-By-Component (CBC) construction [47], computed using partly D. Nuyens' fastrank1pt.m tool [63] and, for very large lattice sizes, our own CBC tool based on the FFTW algorithm [64]. The generating vectors distributed with the code are produced for Korobov spaces with smoothness α = 2, in the notation of Ref. [65] we use: • Kernel ω(x) = 2π 2 (x 2 − x + 1/6), • Weights γ i = 1/s for i = 1, . . . , s, • Parameters β i = 1 for i = 1, . . . , s.
The generating vectors used by the QMC can be selected by setting the integrator's generatingvectors member variable. Example (assuming an integrator instance named integrator): i n t e g r a t o r . g e n e r a t i n g v e c t o r s = i n t e g r a t o r s : : g e n e r a t i n g v e c t o r s : : cbcpt_dn2_6 ( ) ; If you prefer to use custom generating vectors and/or 100 dimensions and/or 15173222401 lattice points are not enough, you can supply your own generating vectors. Compute your generating vectors using another tool then put them into a map and set generatingvectors. For example, to instruct the QMC to use only two generating vectors (z = (1, 3) for n = 7 and z = (1, 7) for n = 11) the generatingvectors map would be set as follows:

A.5. Fit functions
The fit function used by the QMC can be selected when constructing the QMC. These functions are used to approximate the inverse cumulative distribution function of the integrand dimension-by-dimension. Example (assuming a real type integrator instance named integrator): i n t e g r a t o r s : : Qmc<double , double, 1 0 , i n t e g r a t o r s : : transforms : : Korobov <3 >:: type , i n t e g r a t o r s : : f i t f u n c t i o n s : : P o l y S i n g u l a r : : type > i n t e g r a t o r ; instantiates an integrator which reduces the variance of the integrand by fitting a PolySingular type function before integration. Possible fit functions are shown in Table A.3.