Parallel cross interpolation for high-precision calculation of high-dimensional integrals

We propose a parallel version of the cross interpolation algorithm and apply it to calculate high-dimensional integrals motivated by the Ising model in quantum physics. In contrast to mainstream approaches, such as Monte Carlo and quasi Monte Carlo, the samples calculated by our algorithm are neither random nor arranged on a regular lattice. Instead, we calculate the given function along individual dimensions (modes) and use these data to reconstruct its behaviour in the whole domain. The positions of the calculated univariate fibres are chosen adaptively for the given function. The required evaluations can be executed in parallel both along each mode (variable) and over all modes. To demonstrate the efficiency of the proposed method, we apply it to compute high-dimensional Ising susceptibility integrals, which arise from asymptotic expansions for the spontaneous magnetisation in the two-dimensional Ising model of ferromagnetism. We observe strong superlinear convergence of the proposed method, while the MC and qMC algorithms converge sublinearly. Using multiple precision arithmetic, we also observe exponential convergence of the proposed algorithm. Combining high-order convergence, almost perfect scalability up to hundreds of processes, and the same flexibility as MC and qMC, the proposed algorithm can be a new method of choice for problems involving high-dimensional integration, e.g. in statistics, probability, and quantum physics.


Introduction
High-dimensional integrals often occur in quantum mechanics [1], in statistics and probability, e.g. expectations with multivariate probability distributions [2], inverse problems with uncertainty [3], and many more. Analytical formulae for them are rarely available, hence numerical methods become the mainstream approach. Unfortunately, high-dimensional integrals are notoriously difficult for numerical methods as well. A naïve approach, based on a tensor product of one-dimensional quadrature rules, requires a total number of function evaluations N that grows exponentially with the problem dimension d, exceeding the capabilities of modern computers for d ≳ 10. This behaviour, known as the curse of dimensionality, motivates the development of special methods for integration in higher dimensions. Currently the most popular methods are the Monte Carlo (MC) quadrature [4], quasi Monte Carlo (qMC) [5][6][7][8], Markov chain Monte Carlo [2], and their derivatives such as multilevel Monte Carlo methods [9][10][11][12]. These algorithms are rigorously studied and many theoretical results are available, including error bounds which typically do not depend on the problem dimension d for problems of interest. Unfortunately, MC and qMC methods converge slowly: the relative accuracy ε depends on the number of function evaluations N_eval as ε ∼ N_eval^{−α}, where the convergence rate is α = 0.5 for MC and 0.5 ⩽ α ⩽ 1 for qMC. The numerical costs therefore grow quickly when higher precision is required, making calculations expensive, prohibitively long, or impossible. Methods based on Smolyak's sparse grids [13][14][15] are often used to mitigate, but cannot fully remove, the curse of dimensionality.
In this paper we consider the problem of numerical integration of a multivariate function over a simple tensor-product domain such as the free space R^d or the hypercube [0, 1]^d. We follow the naïve approach and use a tensor product of univariate quadrature rules, hence reducing the problem to calculation and summation over the entries of a multi-dimensional array (which we call a tensor). To overcome the curse of dimensionality, we approximate the whole array based on a few entries from it, but avoid calculating the whole array. To achieve this, we develop and use a parallel version of the tensor cross interpolation algorithm proposed by one of the authors in [16]. This algorithm interpolates the given array in the tensor train (TT) decomposition [17,18], essentially performing a separation of variables. The array entries are evaluated along one-dimensional lines, or fibres, each of which is formed by freezing all indices of the multivariate function and varying only one. The lines intersect, forming crosses, and at the positions of each cross the constructed approximation interpolates the data exactly. The positions of the crosses, and hence the nodes of the quadrature rule, are chosen adaptively for the given function, following the maximum-volume method [19,20]. Using the interpolant, various observables, including the integral of the function, can be computed in time linear in d.
Essentially, the proposed algorithm reconstructs all n^d values of the function f(x_1, …, x_d) on a tensor product n × ⋯ × n quadrature grid from a number of samples that is linear in d and adapted specifically to f. This adaptivity allows the proposed algorithm to locate important samples (e.g. areas of concentration of density) and reach faster convergence than mainstream numerical methods, such as MC and qMC, where the positions of the samples are either not optimised or are optimal only for a wide class of functions. For the family of Ising integrals considered in the numerical experiments section of this paper, the proposed algorithm demonstrates high-order convergence of the order of ε ∼ N_eval^{−7}, clearly outperforming MC and qMC. Using multiple precision arithmetic, we were able to compute an integral in more than a thousand dimensions to more than a hundred decimal digits, observing exponential convergence of the proposed method. As a flexible and non-intrusive algorithm, it can become a new method of choice for problems involving numerical integration in higher dimensions.
Data-sparse algorithms based on tensor product decompositions (canonical polyadic [21], Tucker [22], tensor train (TT) [17] or Hierarchical Tucker (HT) [23]) have a long history of development [24][25][26][27], with applications in quantum physics and chemistry [28][29][30][31][32], signal processing [33,34], plasma modelling [35], stochastics and uncertainty quantification [36,37], and fractional calculus [38,39]. However, scalable high-performance implementation of tensor product algorithms is a relatively new area of research. A straightforward idea is to parallelise dense tensor algebra in the computation of factors of a decomposition [40]. However, this typically requires all-to-all communications, which limits the scalability. Another strategy is to parallelise a tensor decomposition over different factors, or dimensions. One of the first examples of such an approach was the parallel density matrix renormalisation group (DMRG) algorithm [41] for ground state computations in quantum physics. In the mathematical community this research direction started with a dimension-parallel linear solver [42] and cross algorithms in the HT format [43]. The main difficulty of parallelisation over the dimension is the need for algorithmic modifications, since state-of-the-art tensor algorithms were designed in an intrinsically sequential way. Ideally, such modifications should not compromise numerical stability or convergence for the sake of parallel efficiency.
In this paper we develop a parallel version of the TT cross interpolation algorithm [16]. The parallel algorithm is adaptive and converges at the same rate as the sequential version. It involves only local communications with constant loading of processes, and demonstrates almost perfect scaling up to the ultimate partitioning where each process is responsible for a single direction (mode, variable). Moreover, further speedup can be achieved using OpenMP parallelisation of the tensor algebra in each process.
The rest of the paper is organised as follows. In Section 2 we recall the cross interpolation method for matrices and provide the necessary definitions. In Section 3 we discuss how the matrix interpolation can be applied to high-dimensional arrays (tensors). We compare the existing methods and explain why the cross interpolation algorithm proposed by one of the authors in [16] seems to be the most suitable for parallelisation over the dimensions. We then present the parallel version of this algorithm. In Section 4 we explain how the cross interpolation algorithm can be applied to numerical integration. We also introduce more formally the MC and qMC methods. In Section 5 we introduce the Ising susceptibility integrals which will be our main numerical example in this paper. We demonstrate that the proposed method achieves high-order (sometimes exponential) convergence, while the convergence of MC and qMC remains sublinear. In the conclusion, we briefly summarise the results of this paper and discuss some challenges and potential directions for future work.

Cross interpolation of matrices
Cross interpolation is based on a simple observation: for a given m × n matrix A = [A(i, j)], i = 1, …, m, j = 1, …, n, a rank-r interpolation can be recovered from r of its columns J = {J(t)}, t = 1, …, r, and r of its rows I = {I(s)}, s = 1, …, r, as follows:

Ã(i, j) = A(i, J) [A(I, J)]^{−1} A(I, j).    (1)

Remark 1 (Compression). To compute the right-hand side we use only the elements of the selected columns A(i, J(t)) and rows A(I(s), j). Other elements of A are not required to construct Ã, and we can avoid calculating them. Thus, evaluation and storage of Ã requires (mr + nr − r²) matrix elements, which is more cost-efficient than working with the whole matrix. Due to the shape of the locus of computed entries, shown in Fig. 1, this decomposition is known as skeleton [44], pseudoskeleton (if the exact inverse is replaced with, say, a pseudo-inverse) [45], or cross [46].

Notation for matrices and submatrices
Eq. (1) is understood element-wise, i.e. it holds for all possible values of the free indices i and j. According to the matrix multiplication rule, the summation is performed over the summation indices from the sets I and J. Note that the sums are taken over the indices which are repeated in the formula, cf. Einstein's summation convention [47]. The notation A(I, J) refers to the submatrix on the intersection of rows I and columns J, mimicking the intuitive syntax of programming languages like Fortran90, Matlab, R and Julia, where a vector of indices can be passed into an array to select a subsection of it, e.g. A(1:2,1:3) for a 2 × 3 leading submatrix of A. We can also use the full index sets {1, …, m} and {1, …, n} to refer to full columns and rows, so that the approximant Ã in (1) is a product of three matrices, Ã = A(1:m, J) [A(I, J)]^{−1} A(I, 1:n). Embracing this notation, we keep the same letter A for all three factors of the cross interpolation. Compared to the CGR notation [45,48] or CUR notation [49], our notation in (1) highlights that the factors of the cross decomposition are submatrices of the given matrix A, which distinguishes it from the SVD, QR and LU factorisations.

Fig. 1. Cross interpolation for matrices: the full matrix A is approximated by a low-rank decomposition Ã based on a small number of columns and rows computed in A. Note that the approximation (1) is exact at the positions of the computed rows and columns.
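As a sanity check of (1), the skeleton decomposition can be sketched in a few lines of NumPy. The matrix size, the rank, and the random choice of index sets below are all illustrative; an exact-rank matrix is recovered exactly from any non-singular cross.

```python
import numpy as np

rng = np.random.default_rng(0)

# A low-rank test matrix: rank 3, size 50 x 40 (illustrative values).
m, n, r = 50, 40, 3
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

# Some choice of r rows I and r columns J (here random; in practice
# they are chosen by a pivoting strategy such as maxvol).
I = rng.choice(m, size=r, replace=False)
J = rng.choice(n, size=r, replace=False)

# Cross interpolation (1): A_tilde = A(:, J) [A(I, J)]^{-1} A(I, :).
A_tilde = A[:, J] @ np.linalg.solve(A[np.ix_(I, J)], A[I, :])

# The approximation interpolates A exactly on the chosen rows and
# columns, and recovers an exact-rank matrix everywhere.
assert np.allclose(A_tilde[I, :], A[I, :])
assert np.allclose(A_tilde[:, J], A[:, J])
assert np.allclose(A_tilde, A)
```

Only the entries in the selected rows and columns are needed to form `A_tilde`, which is the compression property of Remark 1.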

Maximum volume principle
The approximation A ≈ Ã is exact at the positions of the computed rows I and columns J, which is why we call it an interpolation. For other entries the mismatch between A and Ã can be arbitrarily large in general, because the approximation Ã does not use any information about the matrix A apart from its few chosen columns and rows. Theoretical upper error bounds can be obtained based on additional properties of the matrix, e.g. when A = [f(x_i, y_j)], i = 1, …, m, j = 1, …, n, is generated by an asymptotically smooth function [50].
However, the quality of the cross approximation Ã depends critically on the choice of good positions (I, J) for the cross. Good theoretical estimates are available for the maximum-volume cross, i.e. such that A(I, J) has the largest possible volume among all submatrices of this size. The maximum-volume principle for matrix approximation was first proposed in [19,20,48], and the estimates were later generalised to other norms [51,52] and rectangular submatrices [53,54].
Unfortunately, the search for a maximum-volume submatrix is NP-hard [55] and cheaper alternative algorithms are required for practical calculations with large matrices.

Practical algorithms for matrix cross interpolation
When the matrix A is available in full, reliable algorithms for low-rank approximation are available, such as the famous singular value decomposition (SVD) [56], and the faster rank-revealing QR [57] and LU [58] algorithms. However, these approaches are infeasible for very large-scale matrices, e.g. those coming from high-dimensional problems, when even O(mn) costs become prohibitive.
To compute a sufficiently good cross with sublinear costs, the incomplete cross approximation algorithm [46] was proposed, which increases the volume of the intersection matrix by alternating updates of the rows I and columns J. If the set of columns J is fixed, there is a combinatorial number of possible row sets I to compare. To keep the costs feasible, the rows I are updated one by one with a greedy algorithm first suggested by Donald Knuth [59]. Greedy updates of rows, shown in Alg. 1, continue until the volume is large enough. Then the rows are fixed and the columns are updated, and the algorithm alternates until vol A(I, J) is sufficiently large. The details of this maxvol algorithm for matrix cross interpolation are given in [20,46].

Algorithm 1 One step of the practical row selection algorithm [59]
Input: Sets (I, J) of the interpolation (1)
1: Compute B = A(1:m, J) [A(I, J)]^{−1}
2: Find the maximum-modulus entry (i⋆, s⋆) ← arg max_{i,s} |B(i, s)|
3: If |B(i⋆, s⋆)| > 1, replace the row I(s⋆) ← i⋆

A conceptually simpler adaptive cross approximation (ACA) algorithm [60] follows a greedy optimisation approach, increasing the interpolating sets by one column and one row at a time. It can be seen as Gaussian elimination with partial column pivoting [61], which is computationally cheap but may result in an exponential amplification 2^r of the error. A more conservative complete pivoting is believed to be numerically stable [62], but involves a search through all matrix elements, and thus is more expensive. A good alternative is rook pivoting [63], which searches for a pivot (i⋆, j⋆) that is dominant in its own row and column:

|A(i⋆, j⋆) − Ã(i⋆, j⋆)| ⩾ |A(i, j⋆) − Ã(i, j⋆)| and |A(i⋆, j⋆) − Ã(i⋆, j⋆)| ⩾ |A(i⋆, j) − Ã(i⋆, j)| for all i, j.    (2)

Rook pivoting avoids the exponential deterioration of the error [64,65] and in practice seems to have the same asymptotic complexity as partial pivoting, thus combining the best of both worlds. We use rook pivoting in combination with random pivoting, as shown in Alg. 2.

Algorithm 2 One step of the matrix cross interpolation algorithm
Input: Sets (I, J) of the interpolation (1)
1: Pick a random set of samples L = {(i, j)} and choose the one with the largest error, (i⋆, j⋆) ← arg max_{(i,j)∈L} |A(i, j) − Ã(i, j)|
2: repeat   % column and row partial pivoting updates
3:   i⋆ ← arg max_{i} |A(i, j⋆) − Ã(i, j⋆)|
4:   j⋆ ← arg max_{j} |A(i⋆, j) − Ã(i⋆, j)|
5: until the rook condition (2) is met or the computational budget is exhausted

Remark 3 (Accuracy). Algorithm 2 does not access all elements of the matrix and is therefore heuristic, i.e. its accuracy cannot be guaranteed in general.
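A minimal Python sketch of such a cross search, combining random starting samples with rook-style alternating row/column pivoting. The function name, the stopping threshold, and the test matrix are our own illustrative choices, not taken from [16]; the residual is recomputed from scratch for clarity rather than efficiency.

```python
import numpy as np

def cross_rook(get, shape, max_rank, n_samples=8, seed=0):
    """Greedy cross search with random starts and rook pivoting.

    `get(i, j)` returns entries of an implicitly given matrix; only
    sampled entries and the selected crosses are ever evaluated.
    """
    rng = np.random.default_rng(seed)
    m, n = shape
    I, J = [], []

    def residual(i, j):
        # Error A(i, j) - A~(i, j) of the current interpolation (1).
        if not I:
            return get(i, j)
        C = np.array([get(i, jj) for jj in J])
        R = np.array([get(ii, j) for ii in I])
        G = np.array([[get(ii, jj) for jj in J] for ii in I])
        return get(i, j) - C @ np.linalg.solve(G, R)

    for _ in range(max_rank):
        # Start from the largest error among a few random samples.
        samples = [(int(rng.integers(m)), int(rng.integers(n)))
                   for _ in range(n_samples)]
        i, j = max(samples, key=lambda p: abs(residual(*p)))
        # Rook pivoting: alternate row and column argmax of the error
        # until the pivot dominates both its row and its column.
        while True:
            i_new = max(range(m), key=lambda ii: abs(residual(ii, j)))
            j_new = max(range(n), key=lambda jj: abs(residual(i_new, jj)))
            if (i_new, j_new) == (i, j):
                break
            i, j = i_new, j_new
        if abs(residual(i, j)) < 1e-8:
            break                      # nothing significant left to interpolate
        I.append(i)
        J.append(j)
    return I, J

# Example: an (implicitly accessed) rank-2 matrix; two crosses suffice.
M = np.outer(np.arange(1.0, 9.0), np.arange(1.0, 7.0)) \
    + np.outer(np.arange(8.0), np.ones(6))
I, J = cross_rook(lambda i, j: M[i, j], M.shape, max_rank=4)
M_tilde = M[:, J] @ np.linalg.solve(M[np.ix_(I, J)], M[I, :])
assert np.allclose(M_tilde, M)
```

Note that the rook search only ever evaluates one row and one column of the residual per iteration, so the cost stays far below the mn entries a complete-pivoting search would touch.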
Algorithm 2 is written in a very general way and many details are clearly improvable. For example, the choice of L = {(i, j)} for the initial sampling can be optimised to ensure i ∉ I and j ∉ J, since the error A − Ã is zero at the positions of the cross. A variety of other heuristic tricks have been proposed, e.g. Mahoney et al. [49] suggest estimating the column and row norms of A and sampling (i, j) ∈ L with probabilities proportional to these norms. The focus of this paper is not the 'best heuristic' for the matrix case, but the extension to high-dimensional problems. We refer the reader to [66] for a review of matrix low-rank approximation algorithms.

Notation for tensors and multi-indices
We consider an array A = [A(i_1, …, i_d)] with d indices i_k, k = 1, …, d, which are also called dimensions or modes. Each index assumes values i_k ∈ I_k = {1, …, n_k}, where n_k is called the mode size. Such arrays are called tensors in the numerical linear algebra (NLA) community [61], although we do not differentiate between upper and lower indices, as is customary for tensors in mathematical physics [47]. The total storage required for A grows exponentially with the dimension, prohibiting work with the full A for large d. Hence, tensor product representations are required for all practical calculations with tensors.
At the heart of tensor product formats lies the idea of separation of indices. Consider grouping the indices i_1, …, i_k together and separating them from the group i_{k+1}, …, i_d, thus reshaping the tensor into a matrix

A^{k}(i_1 … i_k; i_{k+1} … i_d) = A(i_1, …, i_d),

called the kth matricization or unfolding of the tensor. As in (1), the equation is understood element-wise for all possible values of all indices, i.e. A^{k} differs from A only by 'shape'. Rows and columns of A^{k} are enumerated by the multi-indices i_{⩽k} = (i_1, …, i_k) and i_{>k} = (i_{k+1}, …, i_d). To separate the row and column (multi-)indices i_{⩽k} and i_{>k}, we apply the matrix interpolation formula (1) to A^{k}, yielding

A(i_1, …, i_d) ≈ A(i_{⩽k}, I_{>k}) [A(I_{⩽k}, I_{>k})]^{−1} A(I_{⩽k}, i_{>k}).    (3)

Here (I_{⩽k}, I_{>k}) indicate the positions of the r_k rows and columns of the interpolation cross in the unfolding A^{k}.
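In NumPy, with its default C index ordering, the kth unfolding is just a reshape; a small illustrative check (the tensor and index values below are arbitrary):

```python
import numpy as np

# A 4-dimensional tensor with mode sizes n1, ..., n4 (illustrative).
n = (2, 3, 4, 5)
A = np.arange(np.prod(n)).reshape(n)

# The k-th unfolding A^{k} groups (i1, ..., ik) into rows and
# (ik+1, ..., id) into columns; with C ordering this is a plain reshape.
k = 2
A_k = A.reshape(np.prod(n[:k]), np.prod(n[k:]))

# Entries are unchanged, only the 'shape' differs.
i = (1, 2, 3, 4)
row = i[0] * n[1] + i[1]   # multi-index (i1, i2) flattened
col = i[2] * n[3] + i[3]   # multi-index (i3, i4) flattened
assert A[i] == A_k[row, col]
```

The reshape copies no data, which is why switching between a tensor and its unfoldings is essentially free.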

Tensor train format
The use of element-wise notation allows us to drop the superscript for the unfolding, because the dimensions of matrices and tensors are unambiguous from the range of the variables within. Hence, Eq. (3) can be simplified to

A(i_1, …, i_d) ≈ A(i_{⩽k}, I_{>k}) [A(I_{⩽k}, I_{>k})]^{−1} A(I_{⩽k}, i_{>k}),    (4)

which emphasises the separation of the left and right groups of indices. By continuing the separation process, we arrive at the decomposition where all indices i_k are isolated:

A(i_1, …, i_d) ≈ Ã(i_1, …, i_d) = A(i_1, I_{>1}) [A(I_{⩽1}, I_{>1})]^{−1} A(I_{⩽1}, i_2, I_{>2}) [A(I_{⩽2}, I_{>2})]^{−1} ⋯ A(I_{⩽d−1}, i_d).    (5)

This formula is a direct generalisation of the skeleton/cross interpolation (1) to the tensor case and is therefore called the skeleton/cross tensor decomposition [18]. Note that the factors of the cross decomposition are constructed from fibres A(I_{⩽k−1}, i_k, I_{>k}) of the given tensor, while in a general tensor train (TT) decomposition [17] we deal with general factors. The TT decomposition is itself a particular case of the more general Hierarchical Tucker (HT) decomposition [23,67]. Cross approximation algorithms are available for the HT format [68,69], as well as for more specialised tensor formats, including the Tucker [22] and canonical polyadic [21] decompositions.
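For a separable (TT-rank-1) tensor, formula (5) can be verified directly with single-element nested index sets: every factor is a fibre of A and every intersection matrix is 1 × 1. The vectors and the choice of index 0 below are illustrative.

```python
import numpy as np

# Separable (TT-rank-1) tensor A(i, j, k) = u(i) v(j) w(k).
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])
w = np.array([6.0, 7.0, 8.0, 9.0])
A = np.einsum('i,j,k->ijk', u, v, w)

# Single-element nested crosses: freeze all but one index at 0.
G1 = A[:, 0, 0]     # fibre A(i1, I>1)
M1 = A[0, 0, 0]     # 1x1 intersection A(I<=1, I>1)
G2 = A[0, :, 0]     # fibre A(I<=1, i2, I>2)
M2 = A[0, 0, 0]     # 1x1 intersection A(I<=2, I>2)
G3 = A[0, 0, :]     # fibre A(I<=2, i3)

# Cross decomposition (5): fibres joined through inverted intersections.
A_tilde = np.einsum('i,j,k->ijk', G1 / M1, G2 / M2, G3)
assert np.allclose(A_tilde, A)
```

Here 3 + 2 + 4 = 9 evaluated entries reconstruct all 24 entries of the tensor, which is the compression promised by Remark 4.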
Remark 4 (Compression). The right-hand side of (5) involves only the fibres and intersection matrices of the original tensor, i.e. O(dnr²) elements, where n = max_k n_k and r = max_k r_k; this is linear in the dimension d.

In general, the tensor cross decomposition (5) is an approximation, but not an interpolation formula for A. The following result [16, Theorem 4] provides a sufficient condition for (5) to be called a tensor cross interpolation.
Theorem 1 (Interpolation, see [16]). If the crosses (I_{⩽k}, I_{>k}) are nested, i.e.

I_{⩽k} ⊂ I_{⩽k−1} × I_k and I_{>k} ⊂ I_{k+1} × I_{>k+1}, k = 1, …, d − 1,    (6)

then formula (5) interpolates the evaluated entries of the tensor,

Ã(I_{⩽k−1}, i_k, I_{>k}) = A(I_{⩽k−1}, i_k, I_{>k}), k = 1, …, d.

Theorem 1 cannot be reversed, i.e. nestedness of the indices is not necessary for interpolation, as shown by the following.

Theorem 2 (Exact Recovery of an Exact-rank Tensor). If rank A^{k} = r_k for all k = 1, …, d − 1, and all submatrices A(I_{⩽k}, I_{>k}) are non-singular, then formula (5) recovers the original tensor exactly, Ã = A.

This theorem was first proven in [18] with the additional requirement of nestedness. In the exact-rank case the requirement of nestedness can be relaxed, as we show here.

Proof. If rank A^{k} = r_k, the rank-r_k interpolation (3) recovers the unfolding A^{k} exactly, which means that (4) is also exact,

A(i_1, …, i_d) = A(i_{⩽k}, I_{>k}) [A(I_{⩽k}, I_{>k})]^{−1} A(I_{⩽k}, i_{>k}).    (7)

We start by applying (7) with k = 1, yielding

A(i_1, …, i_d) = A(i_1, I_{>1}) [A(I_{⩽1}, I_{>1})]^{−1} A(I_{⩽1}, i_2, …, i_d).    (8)

Writing (7) with k = 2, we get

A(i_1, i_2, …, i_d) = A(i_1, i_2, I_{>2}) [A(I_{⩽2}, I_{>2})]^{−1} A(I_{⩽2}, i_3, …, i_d).

This is true for all i_1 ∈ I_1, so it is also true for i_1 ∈ I_{⩽1}, because I_{⩽1} ⊂ I_1. Reducing (8) to A(I_{⩽1}, i_2, …, i_d) and plugging it into the previous equation, we obtain

A(i_1, …, i_d) = A(i_1, I_{>1}) [A(I_{⩽1}, I_{>1})]^{−1} A(I_{⩽1}, i_2, I_{>2}) [A(I_{⩽2}, I_{>2})]^{−1} A(I_{⩽2}, i_3, …, i_d).

Continuing in the same way for k = 3, …, d − 1, we complete the proof. □

Algorithm 3 Left-to-right sweep of the ALS maxvol cross approximation algorithm [18]
Input: Sets (I_{⩽k}, I_{>k}) of the interpolation (5)
1: for k = 1, …, d − 1 do
2:   Update I_{⩽k} by the maxvol algorithm (cf. Alg. 1) applied to the matrix A(I_{⩽k−1}, i_k, I_{>k})
3: end for

If the unfoldings A^{k} are only approximately low-rank, a good choice of the crosses (I_{⩽k}, I_{>k}) is important to ensure an accurate approximation in (5). If all (I_{⩽k}, I_{>k}) are maximum-volume submatrices in the respective unfoldings A^{k}, the accuracy bounds are extended from the matrix case [19,20,48] to the tensor case [16, Theorem 1].
Inspired by the maximum volume principle, we will now discuss practical algorithms for computation of sufficiently good crosses for the tensor cross interpolation.

Practical algorithms for tensor cross interpolation
In this section we provide a brief overview of tensor cross interpolation algorithms for the TT format and compare them.

ALS maxvol algorithm [18]

The algorithm in the pioneering paper [18] is a direct generalisation of the matrix cross interpolation algorithm from [46] to the tensor case. Starting from a selection of crosses (I_{⩽k}, I_{>k}), it updates them one by one using the maximum-volume principle. The left-to-right sequence of updates, called a sweep, is shown in Alg. 3. It is followed by a similar right-to-left sweep, and the algorithm sweeps back and forth through the TT cores until convergence. This pattern of updates is often referred to as ALS, from alternating least squares or alternating linear scheme, although the abbreviation is often used in a broader sense. The main limitation of this algorithm is that it cannot update the ranks r_k of the interpolation, and therefore its success relies on two assumptions, both of which are not easy to ensure in practice:

1. the ranks r_k of the interpolation Ã are not underestimated, so that a good accuracy of Ã ≈ A is achievable; and
2. the ranks r_k of the interpolation Ã are not overestimated, so that non-singular submatrices A(I_{⩽k}, I_{>k}) can be chosen at the initialisation step.

DMRG maxvol algorithm [70]
To allow rank adaptation, we can consider a superblock

A(I_{⩽k−1} i_k, i_{k+1} I_{>k+1}),

i.e. a matrix whose rows are enumerated by the pairs (i_{⩽k−1}, i_k) and whose columns are enumerated by the pairs (i_{k+1}, i_{>k+1}). If we can compute the superblock in full, its low-rank decomposition can be computed by standard algorithms, e.g. the SVD [56]. This allows us to adapt the rank r_k in accordance with the desired accuracy, as shown in Alg. 4.

Algorithm 4 Left-to-right sweep of the DMRG maxvol cross approximation algorithm [70]
Input: Sets (I_{⩽k}, I_{>k}) of the interpolation (5), accuracy threshold ε
1: for k = 1, …, d − 1 do
2:   Compute the superblock A(I_{⩽k−1} i_k, i_{k+1} I_{>k+1}) in full
3:   Truncate its SVD at the accuracy threshold ε, adapting the rank r_k
4:   Update (I_{⩽k}, I_{>k+1}) by the maxvol algorithm applied to the SVD factors
5: end for

Algorithm 5 Left-to-right sweep of the DMRG greedy cross interpolation algorithm [16]
Input: Sets (I_{⩽k}, I_{>k}) of the interpolation (5)
1: for k = 1, …, d − 1 do
2:   Apply Alg. 2 to the superblock A(I_{⩽k−1} i_k, i_{k+1} I_{>k+1}) to find a new pivot (i⋆_{⩽k}, i⋆_{>k})
3:   Expand the index sets, I_{⩽k} ← I_{⩽k} ∪ {i⋆_{⩽k}} and I_{>k} ← I_{>k} ∪ {i⋆_{>k}}
4: end for

Density matrix renormalisation group (DMRG) [71] and related matrix product states (MPS) [72,73] algorithms were developed in the quantum physics community to find the ground state of a quantum spin system. The ranks of the ground state are not known in advance, which makes rank adaptation crucial for the success of the method. When the DMRG/MPS format was rediscovered in numerical linear algebra as the TT format [17], it was applied to a variety of problems including signal processing [33,34], partial and fractional differential equations [38,74,75], modelling of ionospheric plasma [35] and simulation of NMR [30]. Tailoring the DMRG framework to compute the interpolation and integration of high-dimensional functions is yet another example of the power and flexibility of algorithms, which can be understood, analysed and applied beyond the boundaries of the area where they were discovered.
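The rank-adaptive step can be sketched as follows: factor a (fully computed) superblock by a truncated SVD, choosing the bond rank from the decay of the singular values. The relative-tolerance rule and the test matrix below are common illustrative choices, not necessarily those of [70].

```python
import numpy as np

def split_superblock(B, eps):
    """One DMRG-style step: factor the superblock B into two TT cores,
    choosing the bond rank r adaptively from the singular values."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # tail[r] = Frobenius norm of the discarded singular values s[r:].
    tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
    # Smallest rank whose truncation error is below eps * ||B||_F.
    r = max(1, int(np.sum(tail > eps * tail[0])))
    return U[:, :r], s[:r, None] * Vt[:r], r

# A superblock of numerical rank 2: two dominant directions plus noise.
rng = np.random.default_rng(1)
B = (np.outer(rng.standard_normal(30), rng.standard_normal(20))
     + np.outer(rng.standard_normal(30), rng.standard_normal(20))
     + 1e-10 * rng.standard_normal((30, 20)))
left, right, r = split_superblock(B, eps=1e-6)
assert r == 2
assert np.linalg.norm(left @ right - B) <= 1e-6 * np.linalg.norm(B)
```

Because the rank is read off the singular values rather than fixed in advance, the sweep can grow or shrink r_k as the desired accuracy dictates.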
Remark 6 (Nestedness in Alg. 4). Similarly to Alg. 3, the DMRG Alg. 4 does not preserve the nestedness (6) during the sweep, but recovers it at the end of each sweep. Therefore, the output of Alg. 4 interpolates the initial tensor at all positions (I_{⩽k−1}, i_k, I_{>k}), k = 1, …, d. Unfortunately, Alg. 4 is moderately expensive: it evaluates O(dn²r²) points of the given tensor and interpolates only O(dnr²) of them.

DMRG greedy algorithm [16]
Calculation of the superblock A(I_{⩽k−1} i_k, i_{k+1} I_{>k+1}) requires O(r²n²) function evaluations. This may be too expensive, particularly when we aim for high precision and hence employ large mode sizes n_k for accurate quadratures and expect large ranks r_k to achieve an accurate interpolation (5). To reduce the costs we can replace the maxvol optimisation step with a greedy cross interpolation step, as proposed in [16] and shown in Alg. 5 and Fig. 2. The algorithm sweeps back and forth through the tensor train (5) and attempts to add one cross to each set (I_{⩽k}, I_{>k}) at a time.
Remark 7 (Nestedness in Alg. 5). By construction, Alg. 5 preserves the nestedness (6) at each internal step of the sweep. The output of Alg. 5 interpolates the initial tensor at all positions of the evaluated crosses (I_{⩽k−1}, i_k, I_{>k}), k = 1, …, d.

To the best of our knowledge, Alg. 5 is one of the fastest tensor interpolation algorithms currently available in the public domain. Like all algorithms considered in this section, Alg. 5 allows trivial parallelisation along each mode, which means that the O(n) tensor entries forming each fibre can be evaluated in parallel. However, Alg. 5 also maintains nestedness at each internal step, which makes it suitable for parallelisation over all modes. Indeed, since no particular step violates the nestedness, all rank-one updates can be performed in parallel, as will be explained in the following section.

Dimension parallel tensor cross interpolation algorithm
The traditional ALS algorithm is carried out sequentially over the tensor factors. However, it has been noticed that this dependence is more technical than essential, and concurrency in ALS-type algorithms is an area of active research. As observed in [41], the DMRG algorithm for ground state computations can be executed in parallel over subsets of TT blocks with only a little deterioration of the convergence. Later, a dimension-parallel version of HT-ALS for linear equations was developed [42]. In the non-adaptive HT-Cross method the samples and the factors can also be computed in parallel [43].
In this section we show that the adaptive Alg. 5 allows a natural parallelisation over the dimensions. From Line 3 of Alg. 5 we see that two consecutive steps k and k + 1 are connected by only one new pivot i⋆_{⩽k}, which expands the left index set I_{⩽k}. For the sake of scalability we can accept a slight restriction of the search space and replace the expanded index sets I⋆_{⩽k−1} with the sets I_{⩽k−1} taken from the previous sweep, see Fig. 3 (top). This allows us to search for new pivots in Line 2 of Alg. 5 in a superblock A(I_{⩽k−1} i_k, i_{k+1} I_{>k+1}) with the old index sets, which can be done in parallel over all bonds k = 1, …, d − 1.
Different processes find their new pivots (i⋆_{⩽k}, i⋆_{>k}) independently, and then communicate them and expand the index sets before the next whole sweep, rather than at each internal step as in Alg. 5. Since the superblocks owned by different processes overlap only for neighbouring processes (e.g. the index i_k belongs only to the (k − 1)th and kth superblocks), only the neighbouring processes need to communicate: the multi-index i⋆_{⩽k} is sent from the kth to the (k + 1)th process, and i⋆_{>k+1} is sent from the (k + 1)th to the kth process, see Fig. 3 (bottom).
Although the proposed restriction might potentially lead to a different (sub-optimal) pivot selection, we observed no noticeable change of convergence between the sequential and parallel versions in our numerical experiments.
If fewer than d − 1 processes are available, each process can be given several consecutive superblocks. The algorithm then becomes similar to the parallel DMRG algorithm of S. White [41], see Alg. 6: each process performs the sequential sweep of Alg. 5 over its local part of the TT decomposition, and afterwards the neighbouring processes exchange new pivots in exactly the same way as described above.
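The assignment of consecutive superblocks to processes can be sketched as a contiguous, near-balanced partition of the d − 1 bonds. This is a plain-Python illustration of the bookkeeping only, not an MPI implementation; the function name is our own.

```python
def bond_ranges(d, P):
    """Split the d - 1 bonds (superblocks) k = 0, ..., d - 2 into P
    contiguous, near-balanced half-open ranges, one per process."""
    bonds = d - 1
    base, extra = divmod(bonds, P)
    ranges, start = [], 0
    for p in range(P):
        size = base + (1 if p < extra else 0)   # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges

# With d = 10 and P = 4 processes, the 9 bonds are split as 3+2+2+2.
assert bond_ranges(10, 4) == [(0, 3), (3, 5), (5, 7), (7, 9)]

# After each sweep, process p exchanges new pivots only with its
# neighbours: the rightmost left pivot goes to process p + 1, and the
# leftmost right pivot goes to process p - 1.
```

Because the ranges are contiguous, each process shares exactly one bond boundary with each neighbour, which is what limits the communication to nearest neighbours.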
Remark 9. This dimension-parallel procedure can be hybridised with multi-threaded local computations, which consist of the evaluation of different samples in Alg. 2 and additional linear algebra operations.
Assuming a balanced splitting over P processes, we conclude that each process performs O(dnr²/P) evaluations of tensor elements and O(dnr³/P) additional operations. Moreover, the tuples i⋆_{⩽k} and i⋆_{>k} consist of at most d − 1 integers, which need to be communicated with the neighbours using two messages in each of the r iterations, resulting in a total communication volume of O(dr). Convergence checks require a global communication between all processes, amounting to O(r log P) single-word messages in total.
The parallelisation over the modes proposed in Alg. 6 can scale well for a number of processes P ≲ d. It requires only a small number of global communications and lends itself well to distributed-memory 'cluster' architectures and an MPI-based implementation. In contrast, the parallelisation along each mode

Algorithm 6 Dimension parallel DMRG greedy cross interpolation algorithm
Input: Sets (I_{⩽k}, I_{>k}) of the interpolation (5)
1: Deduce the range [k_beg, k_end) of superblocks belonging to the process p
2: Perform the sweep of Alg. 5 over the local superblocks k = k_beg, …, k_end − 1
3: Exchange the new pivots (i⋆_{⩽k}, i⋆_{>k}) with the neighbouring processes
requires all workers to access the shared block of memory where the fibre (or superblock) is stored. Hence, this level of parallelisation is best suited for shared-memory architectures, such as the cores and/or threads of a CPU/GPU, and an OpenMP-based implementation. It scales efficiently when the number of cores/threads sharing the same memory is T ≲ n.
In our algorithm we combine both MPI and OpenMP parallelisation to achieve the best performance.

High-dimensional integration
In this section we review quadrature rules for numerical integration in high dimensions. We aim at computing the integral of a continuous function f(x) over the rectangular domain [0, 1]^d,

I = ∫_{[0,1]^d} f(x) dx.

The exact integral is approximated by a quadrature

Ĩ = Σ_{i=1}^{N_eval} w_i f(x_i),

where the N_eval nodes {x_i} and weights {w_i} are chosen to reduce the error |I − Ĩ|. Below we consider several examples of quadrature rules.
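As a one-dimensional building block for the tensor product rules discussed below, a Gauss–Legendre rule mapped to [0, 1] can be obtained directly from NumPy (the function name is our own):

```python
import numpy as np

def gauss_legendre_01(n):
    """n-point Gauss-Legendre rule mapped from [-1, 1] to [0, 1]."""
    t, w = np.polynomial.legendre.leggauss(n)
    return (t + 1) / 2, w / 2

# An n-point rule integrates polynomials up to degree 2n - 1 exactly.
t, w = gauss_legendre_01(5)
assert np.isclose(np.sum(w), 1.0)                 # weights sum to |[0, 1]|
assert np.isclose(np.sum(w * t ** 9), 1.0 / 10)   # integral of x^9 over [0, 1]
```

The affine map halves the weights because the interval length shrinks from 2 to 1.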

Full tensor product quadratures
One of the simplest strategies is to rely on an appropriate one-dimensional quadrature rule (e.g. Gauss–Legendre, tanh-sinh), defined by the nodes {t_i}, i = 1, …, n, in [0, 1] and the weights {w_i}, i = 1, …, n. The tensor product quadrature approximates each of the one-dimensional integrals independently,

Ĩ = Σ_{i_1=1}^{n} ⋯ Σ_{i_d=1}^{n} w_{i_1} ⋯ w_{i_d} f(t_{i_1}, …, t_{i_d}).    (9)

The main advantage of the tensor product quadrature is the fast convergence in n, which stems from the fast convergence of the one-dimensional Gauss–Legendre rule; for example, for analytic integrands the one-dimensional error decays exponentially in n. To make the calculations in high dimensions feasible and benefit from the fast convergence of the Gauss–Legendre quadrature, we may replace the full tensor F = [f(t_{i_1}, …, t_{i_d})] in (9) with a cheaper approximation, reducing the complexity from exponential in d to a manageable polynomial cost. Specifically, in this paper we replace F with the tensor product interpolation (5), computed e.g. by Alg. 6,

F(i_1, …, i_d) ≈ F̃(i_1, …, i_d) = F(i_1, I_{>1}) [F(I_{⩽1}, I_{>1})]^{−1} F(I_{⩽1}, i_2, I_{>2}) ⋯ F(I_{⩽d−1}, i_d).    (10)

Plugging (10) into (9), we can split the summations and treat each mode individually, breaking the curse of dimensionality. The result is now given as a product of (2d − 1) matrices,

Ĩ = [Σ_{i_1} w_{i_1} F(i_1, I_{>1})] [F(I_{⩽1}, I_{>1})]^{−1} [Σ_{i_2} w_{i_2} F(I_{⩽1}, i_2, I_{>2})] [F(I_{⩽2}, I_{>2})]^{−1} ⋯ [Σ_{i_d} w_{i_d} F(I_{⩽d−1}, i_d)].    (11)

In practical calculations, we adapt the interpolation (10) with a step of Alg. 6 and re-calculate the approximate integral using (11), as explained in Alg. 7. The stopping criterion in Alg. 7 is based on internal convergence, i.e. we consider the result accurate to the desired precision ε when the desired number of leading digits no longer changes after an update cycle or (better yet) several sequential updates. We also terminate the calculations if all applications of Alg. 6 in one or several sequential iterations of Alg. 7 fail to find a new pivot (i.e. an element with a significantly large residual).
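The benefit of the factorised summation (11) is easiest to see for a separable integrand, where the d-dimensional sum collapses into a product of one-dimensional sums. The integrand below is an illustrative rank-1 example, so the full n^d tensor is never needed:

```python
import numpy as np

# 10-point Gauss-Legendre rule mapped to [0, 1].
t, w = np.polynomial.legendre.leggauss(10)
t, w = (t + 1) / 2, w / 2

d = 100
# f(x) = prod_k exp(-x_k) is separable (TT rank 1), so the tensor
# product sum (9) collapses into the d-th power of a one-dimensional
# sum: O(dn) work instead of the O(n^d) entries of the full tensor.
one_dim = np.sum(w * np.exp(-t))          # 1D quadrature for e^{-x} on [0, 1]
I_approx = one_dim ** d
I_exact = (1.0 - np.exp(-1.0)) ** d
assert np.isclose(I_approx, I_exact, rtol=1e-10)
```

For a non-separable integrand the same collapse is performed factor by factor in (11), with the interpolation (10) supplying the low-rank structure.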
We discovered in numerical experiments that the proposed algorithm converges faster if it is applied to a tensor in which the function values are pre-multiplied by the quadrature weights,

F_w(i_1, …, i_d) = w_{i_1} ⋯ w_{i_d} f(t_{i_1}, …, t_{i_d}).

This often leads to lower TT ranks and/or a lower error compared to the approximation of f(t_{i_1}, …, t_{i_d}) if the function has a complicated structure near the boundaries. In this case, the boundary elements, multiplied by small cumulative products of the quadrature weights, become less influential on both the quadrature and the cross interpolation algorithm.

Complexity
After r steps of Alg. 7 we obtain the interpolation (10) with TT ranks not exceeding r, constructed from O(dnr²) evaluations of the integrand.
Evaluation of the integral by (11) requires O(dnr²) + O(r³) operations. When the calculations are performed in parallel on a distributed-memory platform, the matrix multiplications in (11) have to be arranged in a balanced-tree way to reduce the parallel depth of the algorithm from linear to logarithmic in the number of processors. Although a single evaluation of (11) does not increase the asymptotic cost estimate of the interpolation, re-calculation of the integral at each step of Alg. 7 increases the costs to O(dnr³) + O(r⁴). This term may become dominant if the function evaluation requires O(d) operations and r ≫ d, in which case we can update the integral at each step using the Sherman–Morrison–Woodbury formula [77][78][79], reducing the numerical costs to those of the interpolation.

Accuracy
To provide an analytic estimate for the integration error |I − Ĩ|, we need to answer the following questions:
1. Is the function f(x) separable, i.e. does it admit an accurate representation in the TT format with moderate ranks? This question was answered for particular classes of functions, see for example [80-82], where the rank r required to reach the accuracy ε of the best separable approximation is shown to grow only logarithmically, r = O(log ε^{−1}).
2. Is it possible to reach the same asymptotic convergence rate if the function is represented by the interpolation (10) with the best choice of interpolation crosses (I_{⩽k}, I_{>k})? This question is answered positively by the analysis in [16] and [83].
3. Is it possible to reach the same asymptotic convergence with the interpolation crosses (I_{⩽k}, I_{>k}) found by a practical algorithm with (low) polynomial complexity in d, r and n? The heuristic nature of Alg. 6 does not allow us to answer this question in this work. We hope that further research will lead to tensor product algorithms combining fast practical convergence with better theoretical properties.

Monte Carlo and quasi Monte Carlo techniques
The Monte Carlo quadrature is a statistical method based on the central limit theorem. It introduces random nodes x_j, drawn independently and uniformly from [0, 1]^d, and the integral is approximated by the average of the function values at these nodes, with all weights equal:

Ĩ = (1/N_eval) Σ_{j=1}^{N_eval} f(x_j).   (12)

The integration error depends on the variance of f(x) (treated as a random field after randomisation of the coordinates x), ε ∼ sqrt(var f) · N_eval^{−0.5}. Provided that the variance is independent of the dimension, so is the error. However, the decay rate N_eval^{−0.5} is often prohibitively slow, especially if high accuracy is needed.
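A minimal sketch of the MC quadrature (12) and its CLT-based error estimate, assuming a separable test integrand whose exact integral over [0, 1]^d is (e − 1)^d:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 16
exact = (np.e - 1.0) ** d

for N in (10**3, 10**5):
    x = rng.random((N, d))                      # N uniform nodes in [0, 1]^d
    vals = np.prod(np.exp(x), axis=1)           # f(x) = prod_k exp(x_k)
    estimate = vals.mean()                      # equal weights 1/N
    error_bar = vals.std(ddof=1) / np.sqrt(N)   # ~ sqrt(var f) * N**-0.5
    print(N, abs(estimate / exact - 1.0), error_bar / exact)
```

Increasing N by a factor of 100 shrinks the error bar only by a factor of 10, which is the slow N^{−0.5} decay discussed above.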
The quasi Monte Carlo quadrature replaces random nodes by carefully constructed deterministic ones. Firstly, one constructs a deterministic lattice rule, defined by a generating vector q = (q_1, ..., q_d). The lattice is optimised to minimise the worst-case error component by component [7,84]. The quadrature nodes are then computed as shifted multiples of the generating vector,

x_j = frac( (j/N_eval) q + s ),   j = 0, 1, ..., N_eval − 1.   (13)

Here s = (s_1, ..., s_d) is a vector of random shifts, distributed uniformly on [0, 1], and frac(x) denotes the fractional part of x.
Standard qMC rules provide a convergence rate O(N_eval^{−α}) with 0.5 ⩽ α ⩽ 1. Under certain assumptions on the function, the rate can be proven to be close to 1, and the constant to be independent of d. There exist higher-order qMC rules [8] which can achieve faster convergence, but at the price of more sophisticated lattice construction algorithms and stronger assumptions on the function.
The shifts s make the quadrature (13) unbiased, and they also allow us to estimate the quadrature error. We repeat the qMC experiments using the same generating vector q but S different shifts. Thus we obtain S sets of nodes (13), and use (12) to calculate the estimators Ĩ_j, j = 1, ..., S. The error can then be estimated as the empirical standard deviation,

σ̂ = sqrt( 1/(S(S−1)) Σ_{j=1}^{S} (Ĩ_j − Ī)² ),   Ī = (1/S) Σ_{j=1}^{S} Ĩ_j.   (14)

For the MC experiment we estimate the standard deviation in a similar way by repeating the experiments S times.
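A sketch of a randomly shifted rank-1 lattice rule and the shift-based error estimate. The Korobov-type generating vector below is a hypothetical toy choice for illustration only, not one of the CBC-optimised vectors q_20, q_26 used in the paper.

```python
import numpy as np

rng = np.random.default_rng(7)
d, N, S = 8, 2**12, 16
# Toy Korobov-type generating vector q_k = 17**k mod N (an assumption)
q = np.mod(17 ** np.arange(d), N)
exact = (np.e - 1.0) ** d            # integral of prod_k exp(x_k)

estimates = np.empty(S)
for k in range(S):
    s = rng.random(d)                                 # random shift
    j = np.arange(N)[:, None]
    x = np.mod(j * q[None, :] / N + s[None, :], 1.0)  # frac(j*q/N + s)
    estimates[k] = np.prod(np.exp(x), axis=1).mean()

qmc_estimate = estimates.mean()                       # unbiased estimator
sigma = estimates.std(ddof=1) / np.sqrt(S)            # error estimate (14)
```

Averaging over S independent shifts both removes the bias of the fixed lattice and yields the empirical standard deviation used as the dashed error lines in the figures.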

Ising integrals
To demonstrate the efficiency of the proposed approach, we apply tensor product interpolation to calculate high-dimensional integrals of the so-called Ising class [85]. They are motivated by the famous 2D Ising model, explaining spontaneous magnetisation in ferromagnetic materials. It describes a ferromagnet as a rectangular M × N grid of spin-1/2 particles, where each spin σ_{i,j} can be observed in one of two possible states, σ_{i,j} ∈ {+1/2, −1/2} = {↑, ↓}. The energy of a configuration σ = {σ_{i,j}}, i = 1, ..., M, j = 1, ..., N, in a magnetic field H is the sum of the next-neighbour interaction term and the response to the magnetic field:

E(σ) = −J Σ_{⟨(i,j),(i′,j′)⟩} σ_{i,j} σ_{i′,j′} − H Σ_{i,j} σ_{i,j},

where J > 0 is the interaction constant and the first sum runs over pairs of neighbouring spins.
The probability of each configuration is given by the Gibbs measure exp(−E(σ)/kT)/Z, where T denotes the temperature, k is the Boltzmann constant, and Z(T, H) = Σ_σ exp(−E(σ)/kT) is known as the partition function. Assuming temperature and volume are constant, the Helmholtz free energy of the system is F = −kT log Z(T, H), and the energy per particle is f(T, H) = lim_{M,N→∞} F(T, H)/(MN). We may be interested in the spontaneous magnetisation m_0(T) = −∂f/∂H|_{H=0} and the zero-field magnetic susceptibility χ_0(T) = −∂²f/∂H²|_{H=0}. Susceptibility is particularly interesting as it relates to long-distance spin-spin correlation and hence can explain collective behaviour in a ferromagnetic system connected by only next-neighbour interactions, as shown in Fig. 4.
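The model can be explored by brute force for tiny grids. A sketch under the assumptions J = 1, kT = 1 and free boundary conditions (the paper's integrals arise from the asymptotic analysis of the infinite lattice, not from such enumeration):

```python
import itertools
import math

def ising_energy(spins, M, N, J=1.0, H=0.0):
    """E(sigma) = -J * sum over neighbour bonds - H * sum of spins."""
    E = 0.0
    for i in range(M):
        for j in range(N):
            s = spins[i][j]
            if i + 1 < M:
                E -= J * s * spins[i + 1][j]   # vertical neighbour bond
            if j + 1 < N:
                E -= J * s * spins[i][j + 1]   # horizontal neighbour bond
            E -= H * s                          # response to magnetic field
    return E

def partition_function(M, N, kT=1.0, H=0.0):
    """Z = sum over all 2**(M*N) configurations of exp(-E/kT)."""
    Z = 0.0
    for conf in itertools.product((0.5, -0.5), repeat=M * N):
        spins = [conf[i * N:(i + 1) * N] for i in range(M)]
        Z += math.exp(-ising_energy(spins, M, N, H=H) / kT)
    return Z

Z = partition_function(2, 2)   # 16 configurations of a 2 x 2 grid
```

The exponential number of configurations makes such enumeration hopeless beyond toy sizes, which is exactly why the asymptotic expansions and the integrals D_d below are needed.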
The 2D Ising model was first solved by Lars Onsager in 1944. Onsager also announced a formula for the spontaneous magnetisation, but never published its derivation; the solution for the magnetisation was published by Yang [86]. The susceptibility was calculated by Wu, McCoy, Tracy and Barouch [87] in the form of the asymptotic expansion (15), where T_c denotes the critical (Curie) temperature, which for the square isotropic lattice is given by kT_c = 2/ln(1 + √2), and ± refers to T → T_c from above (+) or below (−). The coefficients of the asymptotic expansion are given as the infinite series (16), where the D_d's are (d−1)-dimensional integrals, written out in (18) [85]. Bailey et al. [85] first suggested the computational / experimental mathematics approach to this problem: they took up the challenge to calculate the D_d's numerically with high accuracy and then use the inverse symbolic calculator [88] to conjecture their values in closed form as linear combinations of physically relevant constants. The integrals C_d and E_d were introduced as 'structurally similar' but simpler versions of D_d, in the assumption that their evaluation may lead to useful insights. Indeed, all C_d's were analytically reduced to two-dimensional integrals and resolved numerically to extreme precision [85]. Based on the numerical results, Bailey and co-authors were able to conjecture and then prove that lim_{d→∞} C_d = 2e^{−2γ}, where γ is the Euler-Mascheroni constant [85]. Using dimension reduction techniques, they calculated D_d and E_d for d ⩽ 4 in closed form in terms of the Riemann zeta function and the Dirichlet L-function, and conjectured the analytic value for E_5.
Ten years later, Erik Panzer developed a code to symbolically evaluate all E_d's in terms of alternating multizeta functions [89], which he used to calculate the integrals E_1 to E_8, thus confirming the conjecture of Bailey et al. for E_5. The complexity of the symbolic evaluation grows rapidly with d, e.g. E_8 required 28 CPU hours and 30 GB of memory. Hence, this method cannot be seen as a practical way to evaluate E_d's for d > 10. Also, no clear indication is given in [89] on whether the proposed method can be applied to the more complex integrals D_d.
To the best of our knowledge, apart from the results listed above, no further progress has been made in the calculation of Ising integrals, meaning that accurate evaluation of the Ising susceptibility integrals D_d remains an open problem for all but relatively small d.
In this work we pick up the baton and apply the proposed Alg. 6 to calculate the D_d's with high accuracy for d ≲ 1000. However, we first verify our algorithm by applying it to calculate the values of the C_d's, treating them as high-dimensional integrals, and checking the accuracy against the results calculated by Bailey et al. in [85].

Experiment setup for double-, quadruple-and high-precision calculations
Following Bailey et al. [85], we evaluate the integrals numerically using a tensor product of one-dimensional Gauss-Legendre quadratures, as explained in Section 4.2. The number of quadrature points in each direction, n, is chosen adaptively to reach the desired accuracy. Since the functions A_d and B_d are infinitely smooth, the Gauss-Legendre quadrature for C_d, D_d and E_d converges exponentially, and we can expect the number of accurate digits to grow linearly with n.
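A one-dimensional sketch of this behaviour, using the hypothetical test integrand exp(x) on [0, 1]: doubling n roughly doubles the number of accurate digits until machine precision is reached.

```python
import numpy as np

exact = np.e - 1.0                  # integral of exp(x) over [0, 1]
errors = []
for n in (2, 4, 8):
    t, w = np.polynomial.legendre.leggauss(n)   # rule on [-1, 1]
    approx = 0.5 * w @ np.exp(0.5 * (t + 1.0))  # mapped to [0, 1]
    errors.append(abs(approx - exact))
print(errors)   # exponential decay of the quadrature error in n
```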
The proposed parallel algorithm is implemented by the authors in Fortran.
Double-precision calculations use the GNU Fortran compiler with the BLAS and Lapack libraries from Intel MKL.
For quadruple-precision calculations we compile the same code with the compiler option -fdefault-real-8, which increases the default size of double-precision reals to 16 bytes and hence the precision to approximately 33 decimal digits. We compiled the reference implementations of the BLAS and Lapack libraries with the same option to reach quadruple precision in the whole calculation.
For high-precision calculations we use the MPFUN2015 library [90,91]. We use the version of MPFUN2015 that utilises the MPFR library,4 which is several times faster than the version implemented fully in Fortran. We had to rewrite the reference implementations of the necessary BLAS routines to use the mp_real data type offered by MPFUN. The code itself was compiled using the same compilers and options as for the double-precision calculations. The MPFUN2015 library was set up to provide an accuracy of 120 decimal digits.
The experiments were performed on two computers:
• at the University of Bath: this research made use of the Balena High Performance Computing (HPC) Service. Each node of Balena contains an Intel Xeon E5-2650 v2 CPU with 16 cores, running at 2.6 GHz. A single job can occupy up to 32 nodes for 5 days.
• at the University of Brighton: the development, testing and numerical experiments were made possible by the use of a dedicated workstation. The workstation has two Intel Xeon E5-2650 v4 CPUs with 12 cores and 2 threads each, running at 2.2 GHz. It is also equipped with 0.5 TB of operating memory, which proved essential for the large-scale calculations reported below.

Verification and benchmarking of the cross interpolation algorithm
Bailey et al. [85] found an analytic transformation converting the (d−1)-dimensional integrals C_d to a two-dimensional form; using this simple representation, they calculated the C_d's to 1000 decimal digits for d ⩽ 1024. We compute C_1024 directly as a (d−1)-dimensional integral using the proposed tensor product interpolation algorithm, and compare the numerical result with the one obtained by Bailey et al. [85]. The comparison is shown in Fig. 5. For double- and quadruple-precision calculations we observe the expected stagnation at the level of 15 and 32 decimal digits, respectively. When multiple-precision calculations are used, the proposed algorithm seemingly provides exponential convergence for the integral C_1024. As we can see in Fig. 5, the observed convergence of the relative accuracy ε agrees well with the assumption ε ∼ exp(−√N_eval). Since the number of samples evaluated by the cross interpolation algorithm is N_eval ∼ dnr², and d, n remain constant, this allows us to conjecture that ε ∼ exp(−r), i.e. the relative accuracy improves exponentially with the average TT rank r. This observation shows that the underlying multivariate function admits an accurate low-rank representation and, more importantly, that using the proposed algorithm a good interpolation can be constructed and the integral can be approximated much faster than by other currently known techniques, such as the MC and qMC algorithms. Based on this example, we are optimistic that the challenging theoretical questions discussed in Section 4.2.3 can be answered positively for the functions considered in Ising susceptibility theory, and hopefully for a wider class of functions as well.
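The step from ε ∼ exp(−√N_eval) to ε ∼ exp(−r) can be written out explicitly; here c denotes an unspecified proportionality constant:

```latex
N_{\mathrm{eval}} \approx c\,d\,n\,r^2
\quad\Longrightarrow\quad
\sqrt{N_{\mathrm{eval}}} = \sqrt{c\,d\,n}\;r,
\qquad\text{hence}\qquad
\exp\!\bigl(-\sqrt{N_{\mathrm{eval}}}\bigr)
  = \exp\!\bigl(-\sqrt{c\,d\,n}\;r\bigr)
  \sim \exp(-r),
```

since d and n are held fixed during the experiment, so the factor √(c d n) is merely a constant in the exponent.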
It should be noted that although the use of quadruple- and multiple-precision calculations comes at a small extra cost in terms of the number of points (it is sufficient to double the mode size n to double the number of accurate digits), it leads to a significant overhead in terms of CPU time, since quadruple- and multiple-precision calculations are not optimised to the same degree as native double-precision calculations and BLAS libraries. This is the reason why we report the convergence behaviour both as a function of the number of evaluated points and as a function of the CPU time, as shown in Fig. 5.

Fig. 6. Integral C_1024 calculated by TT cross interpolation (Alg. 6), Monte Carlo (MC), and quasi Monte Carlo (qMC). The cross interpolation algorithm uses a tensor product of one-dimensional Gaussian quadrature rules with n = 33 points for double-precision and n = 65 points for quadruple-precision calculations. The qMC algorithm uses lattice generating vectors q_20 and q_26 minimising the worst-case error on 2^20 and 2^26 points, respectively. Solid lines: errors of the numerical methods verified against the result of Bailey et al. [85]. Dashed lines: relative standard deviation estimates (14) of MC and qMC with the number of repetitions S = 16. Left: relative accuracy w.r.t. the number of function evaluations N_eval. Right: relative accuracy w.r.t. total CPU time.

Convergence and comparison with quasi Monte Carlo
In Fig. 6 the proposed algorithm is compared with the state-of-the-art Monte Carlo (MC) and quasi Monte Carlo approaches (see Section 4.3). For the MC quadrature we use uniformly distributed samples on [0, 1]^d.
For the qMC algorithm particular care must be taken when choosing the lattice. Frances Kuo's website (http://web.maths.unsw.edu.au/~fkuo/) provides a large collection of pre-generated lattices, which were generated by optimising the worst-case error with product weights k^{−2}, motivated by stochastic PDEs. The integrals considered in this paper do not exhibit the same decay, hence we decided to use a lattice with equal weights. We used the component-by-component (CBC) algorithm from Dirk Nuyens's website and constructed the generating vectors q_20 and q_26 by minimising the worst-case error on 2^20 and 2^26 points, respectively. Notice that the lattice generated from q_20 starts repeating when the number of points exceeds 2^20, leading to a visible stagnation of the q_20 quadrature error in Fig. 6. This is why we created the lattice q_26, which remains convergent and allows us to scale the computations up to billions of points. It has to be noted that optimising a lattice is rather expensive: the CBC algorithm took several days to produce q_26 (this cost is not included in further analysis).
As in the previous subsection, we calculate C_1024 and compare our results against the 1000-digit accurate value computed in [85]. These errors are plotted as solid lines in Fig. 6. We also show by dashed lines the relative empirical standard deviation for the MC and qMC algorithms, as described in (14). Notice that the true error exhibits higher fluctuations for different N_eval, although the overall convergence trend coincides with that of the standard deviation.
We see that the MC method converges with the rate N_eval^{−0.5}, as expected from the CLT, while the qMC method (with q_26) exhibits a higher rate N_eval^{−0.7}. The TT decomposition has a much richer approximation capacity and provides sub-exponential convergence, as also shown in Fig. 5. When all calculations are performed in double precision, TT cross interpolation is always faster than the MC and qMC methods. Switching to quadruple precision increases the TT time significantly, since we lose the optimisations of the Intel MKL library, but the rapid convergence still makes it the fastest method for high accuracy.
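The rates quoted here can be read off from (N_eval, ε) data by a log-log least-squares fit; a sketch on synthetic data constructed with rate 0.7 (an assumption of this illustration):

```python
import numpy as np

def empirical_rate(N, err):
    """Estimate alpha in err ~ N**(-alpha) from a log-log linear fit."""
    slope, _intercept = np.polyfit(np.log(N), np.log(err), 1)
    return -slope

N = np.array([1e3, 1e4, 1e5, 1e6])
err = 3.0 * N ** (-0.7)        # synthetic errors decaying with rate 0.7
rate = empirical_rate(N, err)
```

On real data the fitted slope fluctuates with the error (cf. the dashed vs solid lines in Fig. 6), so it is best computed over a wide range of N_eval.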

Evaluation of Ising susceptibility integrals
Now we attempt to compute the original Ising susceptibility integrals D_d given by (18). Computing D_d's for large d is much more challenging than evaluating the C_d's, for two reasons. Firstly, each evaluation of the integrand takes O(d) operations for C_d, but O(d^2) for D_d. Secondly, all C_d's can be analytically reduced to two-dimensional integrals, while for D_d's the reduction performed in [85] only decreases the dimensionality by one in special cases. Using a combination of analytic transforms and Gaussian tensor-product quadratures, Bailey and collaborators calculated D_5 to 500 decimal digits using 18 h on 256 CPUs of IBM Power5 nodes at the Lawrence Berkeley National Laboratory. They also produced D_6 to almost 100 decimal digits. Using the qMC algorithm, they calculated D_7 and D_8 to 5 decimal digits. Further integrals D_d were not made available.

We apply the proposed tensor interpolation algorithm to calculate the D_d's in the original form (18) as (d − 1)-dimensional integrals. We use the quadruple-precision version of the code and aim to calculate the integrals D_8, D_16, D_32, ..., D_1024 to about 30 decimal digits, as measured by the internal convergence. The convergence plots are shown in Fig. 7. The convergence rate is approximately of order 7 for all considered integrals; noting a slight bend of the curve for D_256, we are hopeful that exponential convergence would have been revealed if the calculations had been allowed to run longer and reach higher accuracy. The values of the D_d's shown in Fig. 7 clearly decay exponentially. This was noted by Bailey et al., who proved [85, Thm. 3] that O(14^{−d}) ⩽ D_d ⩽ O(4^{−d}). They conjectured that D_d ∼ ∆^{−d} as d → ∞, and estimated ∆ ≈ 5 based on the few values of D_d available to them. Based on our values of D_128 and D_256 shown in Fig. 7, we improve this estimate to ∆ ≈ 5.0792202086636783360436879567820. (20)

Evaluation of susceptibility coefficients
We can now come back to the asymptotic expansion of the magnetic susceptibility (15) and the coefficients C_{0,±} and C_{1,±} represented via the sums (16). Noting that D_d ∼ ∆^{−d} with ∆ given by (20), the terms in the sums Σ_± decay geometrically, D_d/(2π)^d ∼ (2π∆)^{−d} ≈ (31.9)^{−d}. Since the sums run over odd d, each next term is approximately 1000 times smaller than the previous one. Hence, if we want to evaluate Σ_+ to, say, 50 decimal digits, we can start with D_1 = 2, then evaluate D_3 to 47 digits (noting that the analytical value has been known since [85]), calculate D_5 to 45 digits, and so on. We used the proposed algorithm to evaluate the D_d's for d ⩽ 45 with sufficient precision (judged by the internal convergence of the algorithm), and obtained the sums to 50 decimal digits. As noted in [85], these coefficients were also computed by Nickel [92] to 40 accurate decimal digits by high-order integration of the differential equation for the Painlevé function of the third kind in [87, Eq. (2.36)]. Bailey et al. [85], using the values of D_d obtained by the quasi Monte Carlo algorithm, were able to match the results of Nickel to 20 decimal digits. Using the tensor product interpolation Algorithm 6, we are able to reach higher accuracy for the D_d's and confirm that the first 40 digits of our results fully agree with those obtained by Nickel.
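The digit budget in this argument can be sketched as follows. The step of 3(d − 1)/2 digits is a rough estimate from the factor (2π∆)² ≈ 1018 between consecutive odd-d terms; the figures quoted in the text (47 digits for D_3, 45 for D_5) keep a digit of safety margin relative to this estimate.

```python
# Rough precision budget: the d-th term D_d/(2*pi)**d (odd d) is ~1000x
# smaller than the previous one, so it needs ~3*(d-1)/2 fewer digits
# than the target accuracy of the sum.
target = 50
budget = {d: target - 3 * (d - 1) // 2 for d in (1, 3, 5, 7, 9)}
print(budget)
```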

Performance and scalability
In Fig. 8 we benchmark the algorithm for different numbers of processes and threads using MPI, OpenMP and hybrid parallelisation. The first two lines in Fig. 8 (left) show the CPU time for the OpenMP-only parallelisation of local computations (i.e. essentially Alg. 5 with no dimension parallelisation), and for the MPI-only approach, where all local computations are performed in one thread but different chunks of the TT decomposition are assigned to different processes (Alg. 6). The hybrid approach always uses T = 16 threads for local operations and different numbers of processes P for the parallelisation over the dimension. In Fig. 8 (left) we report the product of the number of processes and the number of threads in each process.
The integral D_32 is taken over 31 variables, which means there are 30 rank-update steps to perform simultaneously (recall that the TT ranks separate the dimensions, so there is one rank fewer than dimensions). Hence, our MPI parallelisation can scale only up to 30 processes. Using the hybrid framework, we accelerate the computation further, up to the maximum of 512 cores available on the Balena cluster per job. We notice very good scaling, since the cost of communicating O(rd + r log P) bytes is much smaller than the cost of computing O(dnr²/P) tensor elements. A slight deviation from linear scaling for the largest numbers of processes is due to load imbalance, as different TT blocks pick up different ranks in the course of the cross algorithm. This is demonstrated further in Fig. 8 (right), where we approximate a function for the D_512 integral. The maximal number of processes, 510, allows us to use only T = 1 OpenMP thread, and instead we vary the number of MPI processes P in the entire range. We see that the time is closer to the perfect scaling due to better balancing when each process owns more TT blocks. Even better scaling could be expected for D_{2^p+2} integrals, where the same number of TT blocks could be assigned to each of 2^p processes. Nevertheless, even in a deliberately unbalanced situation (which is the more practical one), the algorithm scales almost linearly up to the maximum computing capacity available on the given machine.
Finally, we should note that even though the proposed Alg. 6 enjoys fast convergence, the numerical costs remain quite high. For example, the calculation of D_1024 to 18 decimal digits (see Fig. 7) took about 4 days on 512 cores of the Balena supercomputer at the University of Bath, consuming approximately a megawatt hour of energy. Based on our preliminary experiments with qMC, and assuming that the convergence rate ε ∼ N_eval^{−0.7} does not deteriorate, we estimate that to reach the same accuracy with qMC we would need approximately 10^13 years of calculations and 10^9 terawatt hours of energy, which exceed the age of the Universe (≈ 1.3 · 10^10 years) and the annual world energy consumption (≈ 1.5 · 10^5 TWh in 2014) by roughly three and four orders of magnitude, respectively.

Conclusion
The problem of high-dimensional integration is a particularly important and challenging area. Motivated by risk simulation in finance and engineering, this problem was actively researched, resulting in the Monte Carlo Metropolis algorithm [4], considered one of the top 10 algorithms of the 20th century [93]. The use of random samples in the MC algorithm allows one to break away from tensor-product quadratures and hence avoid the curse of dimensionality, seemingly inevitable in higher dimensions. The flexibility and simplicity of MC were spoiled by its slow convergence, motivating further development, until the arrival of the quasi Monte Carlo algorithm [5,6]. QMC lattices can be optimised for a class of functions (e.g. those appearing from stochastic PDEs [7,8]) and demonstrate faster convergence, which currently makes qMC the method of choice in the areas of sPDEs, finance and risk modelling, engineering, etc. However, the convergence is still not very fast, particularly considering that in practice many end users may make sub-optimal choices when selecting or creating the qMC lattice for their problems, leading to excessive numerical costs and an increasing negative effect of the HPC industry on the environment. The curse of dimensionality therefore turns into a challenge of precision. Although admittedly many practical problems (e.g. in the areas of stochastic inference or machine learning) do not require precision above one or two decimal digits, many applications (e.g. engineering, theoretical quantum physics, quantum computations) need the answer to be precise to ten(s) or hundred(s) of decimal digits, which cannot be achieved (or leads to excessive costs in terms of energy and CPU time) using mainstream MC/qMC approaches. In this paper we address this challenge by developing a new algorithm based on tensor decompositions. We are pleased to see that the idea of the decompositional approach to matrix computations [94], which was also recognised as a top 10 algorithm of the 20th century [93], can break the curse of dimensionality, arguably one of the main challenges of numerical mathematics since the 1960s [95] and till the present day.
Tensor product algorithms have undergone very rapid development during the last 15 years, including progress in theory, publicly available algorithmic implementations, and growth of the areas of application. Using the idea of separation of variables, tensor methods give new hope of lifting the curse of dimensionality and drastically reducing the computational burden associated with high-dimensional problems in a number of areas, from quantum physics and chemistry to stochastics, signal processing and data analysis. In this paper we applied the tensor cross interpolation algorithm [16] to reconstruct the behaviour of a given high-dimensional function from a few samples and to integrate it numerically. Our research proposes a new step in the development of tensor product algorithms, combining the algorithmic power provided by data-sparse low-rank tensor product representations with an efficient parallel implementation utilising the potential of modern HPC systems.
The Ising susceptibility integrals, which we use in this paper to demonstrate the efficiency of the proposed method, are important not only because of their applications in the quantum theory of ferromagnetism [87], but also as a convenient benchmark for testing and comparing numerical algorithms and analytic approaches. Bailey, Borwein and Crandall [85] approached this problem from many different directions, and their results mark the state of the art of what can be achieved using the algorithms and methods of the 20th century. This is not an easy competition, and we are pleased that our algorithm stands up to it: we are able to reproduce the values calculated in [85] and also to improve the precision of physically relevant integrals from 5-6 to 18-32 decimal digits in dimensions d ≲ 1000. Using the multiple precision library developed by David Bailey [91], we were able to reach a precision of over 100 decimal digits, which revealed sub-exponential convergence of our algorithm, ε ∼ exp(−√N_eval), for one of the considered integrals. The potential to converge sub-exponentially w.r.t. the number of function evaluations clearly distinguishes the proposed method from the MC/qMC algorithms, which usually demonstrate sublinear convergence ε ∼ N_eval^{−α} with 0 ⩽ α ⩽ 1.
The use of multiple precision arithmetic comes at a significant price. Even though MPFUN2015 [91] and other arbitrary-precision libraries [90] are well optimised, the lack of optimisation at the CPU level and of vectorisation at the level of BLAS operations slows the calculations down, increasing the challenge of high precision. From the programming point of view, extra steps are needed to rewrite the BLAS and Lapack functions in multiple precision.
Although this problem is mitigated in more modern languages (such as Matlab, Python and Julia), they do not always provide enough control of the parallelisation at both the distributed-memory (MPI) and shared-memory (OpenMP) levels. This is why, for the development and demonstration stage, we decided to implement the algorithm in Fortran, although it is clear that further work is required to provide simpler user-friendly interfaces to the high-level languages mentioned above.
The context of numerical integration is particularly convenient because the final answer is simply a number, allowing us to objectively evaluate and compare the quality of different algorithms on the given problem. The high accuracy of the produced results shows that, for the considered examples, the proposed method is superior to the MC and qMC algorithms. However, it must be noted that tensor cross interpolation does not just compute the integral: it reconstructs the whole function in the high-dimensional tensor-product domain and represents it in TT form. When this compact representation of the function is available, it can be post-processed (e.g. interactively) to produce projections, nonlinear functionals (e.g. high-order moments), etc. This approach can be compared to computing with functions using Chebyshëv polynomials [96], and integrating Chebyshëv interpolation together with tensor cross interpolation seems a natural direction for further work, continuing the existing work in two and three dimensions [97,98].
The most important direction for the development of this work is without doubt the application of the proposed method to a larger variety of applications. Many problems motivating precise high-dimensional integration are listed in [90]; we can extend this list by mentioning applications in multivariate probability [99], stochastics [37,100], and optimal control [101,102]. We are hopeful that the proposed tensor cross interpolation algorithm will demonstrate fast convergence in these applications and eventually become a method of choice for high-dimensional integration.

Fig. 1. Cross interpolation for matrices. The full matrix A is approximated by a low-rank decomposition Ã based on a small number of computed columns and rows.

Remark 5 (Nestedness in Alg. 3). The nestedness condition (6) is not preserved during the sweep in Alg. 3. Consider the moment when the left-to-right sweep reaches position k in the train and replaces the previous rows I_{⩽k} with the updated rows I*_{⩽k}. The nestedness I*_{⩽k} ⊂ I*_{⩽k−1} × I_k is ensured by construction, so the nestedness of rows is maintained from the left side up to the current active core. However, the nestedness of rows in the right part of the train is lost, I_{⩽k+1} ⊄ I*_{⩽k} × I_{k+1}, because I_{⩽k+1} has not yet been updated. The nestedness is recovered when the sweep reaches the end of the train, so the output Ã of Alg. 3 interpolates the given tensor A.

Fig. 3. Parallel version of the cross interpolation algorithm. Top: excluding i*_{⩽k−1} and i*_{>k+1} from the row and column sets during the pivot search disentangles different steps in Alg. 5, with a mild effect on the pivoting efficiency. Bottom: searching for pivots in different superblocks in parallel implies local data overlap and next-neighbour communication between processors.

Fig. 4. Two-dimensional Ising model shown as a square lattice of interacting spins. Normally, one would expect to observe individual spins in both states σ_{i,j} ∈ {↑, ↓} with equal probability (as on the left panel). Spins would also align with the direction of an external magnetic field (as shown on the right panel). Surprisingly, ferromagnets also exhibit collective large-distance behaviour (e.g. spontaneous magnetisation) at H = 0 for sub-critical temperatures T < T_c. The theoretical explanation of this fact was first proposed by Lars Onsager in 1944.


Fig. 5. Convergence of cross interpolation for the calculation of C_1024 in double, quadruple and multiple precision. The cross interpolation algorithm uses a tensor product of one-dimensional Gaussian quadrature rules with n = 33 points for double-precision, n = 65 points for quadruple-precision and n = 257 points for multiple-precision calculations. The results are verified against the 1000-digit result reported in [85]. The relative accuracy is shown w.r.t. the number of function evaluations (left) and w.r.t. CPU time (right). We can clearly see that the proposed method converges exponentially.

Fig. 7. Evaluation of the Ising susceptibility integrals D_d given by (18). The results are computed by the cross interpolation algorithm in quadruple precision using a product of one-dimensional Gaussian quadrature rules with n = 129 points (for D_8 to D_256) and n = 65 points (for D_512 and D_1024). Left: convergence of the cross interpolation algorithm measured by the relative internal convergence, as a function of the total CPU time spent on the calculation. Right: values of the D_d's calculated by the proposed algorithm.

Fig. 8. Left: strong scaling for D_32 for different numbers of processes P and numbers of threads T, quadruple precision with n = 129. Right: strong MPI scaling (1 OpenMP thread per process) for D_512, double precision with n = 33.