Separable projection integrals for higher-order correlators of the cosmic microwave sky: Acceleration by factors exceeding 100

We study the optimisation and porting of the "Modal" code on Intel(R) Xeon(R) processors and/or Intel(R) Xeon Phi(TM) coprocessors using methods which should be applicable to more general compute bound codes. "Modal" is the simulation and analysis pipeline used by the Planck satellite experiment for constraining general non-Gaussian models of the early universe via the bispectrum (or three-point correlator) of the cosmic microwave background radiation. We focus on the hot-spot of the code, which is the projection of bispectra from the end of inflation to the spherical shell at decoupling which defines the CMB we observe today. This code involves a three-dimensional inner product between two functions, one of which requires an integral, on a non-rectangular domain containing a sparse grid. We show that by employing separable methods this calculation can be reduced to a one-dimensional summation plus two integrations, reducing the overall dimensionality from four to three. The introduction of separable functions also solves the issue of the non-rectangular sparse domain, allowing efficient vectorisation and load balancing. This separable method can become unstable in certain scenarios, so we present a discussion of the optimisation of both approaches. By making bispectrum calculations competitive with those for the power spectrum (or two-point correlator) we are now able to consider joint analysis for cosmological science exploitation of new data. We demonstrate speed-ups of over 100x, arising from a combination of algorithmic improvements and architecture-aware optimizations targeted at improving thread and vectorization behaviour. The resulting hybrid MPI/OpenMP code is capable of executing on clusters containing Intel(R) Xeon(R) processors and/or Intel(R) Xeon Phi(TM) coprocessors, with a strong-scaling efficiency of 98.6% on up to 16 nodes. We find that a single coprocessor outperforms two processor sockets by a factor of 1.3x and that running the same code across a combination of processors and coprocessors improves performance-per-node by a factor of 3.38x.


Introduction
The current best explanation for the origin of our universe is the inflationary big bang scenario, in which it is believed that a period of exponential expansion created the large, flat, empty universe we see today. In addition, this model predicts that quantum fluctuations in the energy during this time were stretched to galactic scales, forming the seeds from which all structure, from planets and stars through to super-clusters of galaxies, grew. The statistics of these fluctuations give a window onto the dynamics at play during the birth of our universe. In particular, any deviation of these fluctuations from a Gaussian (i.e. normal) distribution would be direct evidence of interesting new physics. The Modal pipeline [4], developed at the University of Cambridge and used by the Planck satellite experiment [1,2], is the only general method for constraining these non-Gaussianities from the currently available data. Its main strength is that by using a general mode expansion it can constrain thousands of theoretical predictions simultaneously: the naïve problem of computing the bispectra would require a few times 10^22 floating point operations, which is challenging even for the world's largest supercomputers. By using an appropriate basis tuned for the theoretical models of interest, the Modal team have created a fast and efficient way to probe cosmological data for hints of new physics in the early universe.
This paper investigates the optimisation and modernisation of Modal, as part of an effort to accelerate it using Intel(R) Xeon Phi(TM) coprocessors. The existing MPI-level parallelism in the original code is not sufficient to enable efficient utilization of this hardware, and we show that moving to a hybrid MPI/OpenMP implementation can significantly improve performance. For portability reasons, we avoid making any significant code changes that would benefit only one particular hardware platform, and thus the high-level programming languages and techniques that we use apply equally well to Intel(R) Xeon(R) processors. When compared to the original code on 2× Intel(R) Xeon(R) processor sockets, our code optimisation efforts deliver speed-ups of 1765× on an Intel(R) Xeon Phi(TM) coprocessor and 833× on 2× Intel(R) Xeon(R) processor sockets in the 2D case; and 108× on an Intel(R) Xeon Phi(TM) coprocessor and 83.9× on 2× Intel(R) Xeon(R) processor sockets in the 3D case. These speed-ups are large enough to significantly impact the rate of scientific discovery at COSMOS, and enable liberal use of the Modal calculation as part of future Monte Carlo pipelines - something that had previously been considered infeasible.
The rest of this paper is organized as follows: Section 2 discusses related work; Section 3 provides an introduction to the two Modal routines being optimised - the full three-dimensional calculation on the sparse non-rectangular domain, and the fast two-dimensional version on a dense rectangular domain - and also gives a high-level introduction to Intel(R) Xeon Phi(TM) coprocessors; Section 4 details the optimization and modernization of Modal; Section 5 presents a detailed performance study of the final application, demonstrating its scalability within a node and across multiple nodes; and finally, Section 6 concludes the paper and discusses potential new uses for the accelerated version of the code.

Related Work
A number of previous studies have investigated the use of Intel(R) Xeon Phi(TM) coprocessors to accelerate other scientific codes [14,13,11,8,3], and there are many similarities between the optimizations we discuss here and those explored in other domains. However, we note that many of these studies were performed before the standardisation of OpenMP 4.0 (and thus often rely on manual vectorization via architecture-specific intrinsics). This is also the first paper (to the best of our knowledge) to explore the use of Intel(R) Xeon Phi(TM) coprocessors for this specific application domain.

Direct integration
The primary concern of this paper is the efficient calculation of a three-dimensional inner product which performs the projection of a set of primordial basis bispectra, defined at the end of inflation, into a set of basis bispectra on the spherical shell defined by the CMB. This inner product is a weighted sum over all multipole triples (ℓ1, ℓ2, ℓ3), with a weight h²_{ℓ1 ℓ2 ℓ3} proportional to the square of the Wigner 3J symbol ( ℓ1 ℓ2 ℓ3 ; 0 0 0 ). The Wigner 3J symbol is a geometric factor related to the projection onto the 2-sphere of the CMB, and it has two important properties. Firstly, it is zero if the three ℓ_i, when treated as lengths, are unable to form a triangle (the triangle condition); secondly, it is zero whenever ℓ1 + ℓ2 + ℓ3 is odd (the parity condition). These two conditions complicate the evaluation of the sums over the ℓ_i, as the region is non-rectangular and, inside the allowed region, all odd combinations must be excluded; the resulting issues with load balancing and vectorisation are discussed later. When both conditions are met, h² has an exact expression in terms of factorials, which in turn can be calculated accurately using the Gosper approximation; the resulting expression is written in terms of L = ℓ1 + ℓ2 + ℓ3 and L_i = L − 2ℓ_i. This inner product needs to be calculated between our projected primordial basis Q̄, which unavoidably involves a radial integral along a line of sight from primordial times until now, and our late-time basis Q. For the estimation and projection to be tractable we must form our basis functions from products of one-dimensional basis functions, q̄ and q respectively, which must be symmetrised over the ℓ_i. Writing the inner product out explicitly in terms of these 1D functions gives Eqn. 4, in which θ is a top-hat-like function that is 1 when the triangle and parity conditions are met and 0 elsewhere, and in which there is a known mapping between n and ijk. Evaluation of Eqn. 4 (where v, C and q̄ have been precomputed) is performed by "Modal 3D" (note, however, that since we have a radial integral in addition to the three summations, the calculation is in fact four-dimensional for each of the n²_max independent matrix elements). For full details of the origin of Eqn. 4 see the Appendix. Typical values for the problem size are n_max ≈ 2000, ℓ_max ≈ 2000 and ≈ 200 points needed to calculate the integral over r using splines.

Separable integration
One significant simplification that can be made in some cases is to note that the Wigner 3J symbol can be written as an integral over three Legendre polynomials, so that h² takes the form

h²_{ℓ1 ℓ2 ℓ3} ∝ (2ℓ1 + 1)(2ℓ2 + 1)(2ℓ3 + 1) ∫_{−1}^{1} dμ P_{ℓ1}(μ) P_{ℓ2}(μ) P_{ℓ3}(μ).

This automatically preserves the triangle and parity conditions, allowing us to work on a domain which is both rectangular and dense. Using this allows us to write Eqn. 4 in the much simpler form

Γ_{mn} ∝ ∫ r² dr ∫ dμ [ P_{iī}(r, μ) P_{jj̄}(r, μ) P_{kk̄}(r, μ) + 5 perms ],

where each P (defined in Eqn. 7) is a one-dimensional sum over ℓ combining the 1D basis functions q̄(r, ℓ) and q(ℓ) with the Legendre polynomial P_ℓ(μ) and the associated ℓ-dependent weights. This simplified expression, Eqn. 6, is solved by "Modal 2D". One consequence of trying to retain the step-like conditions of h² while using an integral form is that the calculation is very sensitive and must be done to very high precision. Fortunately, as the integrand is polynomial, we can use Gauss-Legendre quadrature, which is in principle exact. However, we found that standard libraries for calculating the weights and abscissas were not sufficiently accurate, and we had to use a specialist implementation optimised for this purpose [6] before the calculation proved stable for specific choices of Q and Q̄. We note that this stability is not sufficient to allow us to reverse the order of the r and μ integrations; the μ integration must always be carried out first.
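For reference, the exactness property being relied on is the standard Gauss-Legendre result: with N nodes μ_k and weights w_k,

∫_{−1}^{1} f(μ) dμ = Σ_{k=1..N} w_k f(μ_k)   exactly, for any polynomial f of degree ≤ 2N − 1,

so, since each P is a finite sum of Legendre polynomials of degree at most ℓ_max, the μ integrand in Eqn. 6 is a polynomial of degree at most 3ℓ_max and N ≥ (3ℓ_max + 1)/2 nodes suffice for exactness (up to rounding), provided the weights and abscissas themselves are computed accurately.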
It is this issue with stability that leads us to retain both implementations of the calculation. The 2D version is employed for any bispectra where bases can be chosen so that the integral representation of h² is stable, with any remaining cases being calculated by the slower but robust 3D version. We found that the majority of bispectra we are interested in can use the 2D version, but there are some very well motivated cases that cannot, for example the bispectrum induced by gravitational lensing.
The original code is written in C and parallelized with MPI. The 2D variant is parallelized over the loop of all the inner products ⟨Q̄_n, Q_m⟩ (the entries of Γ), whereas the 3D variant is parallelized over the n and ℓ_1 indices in Eqn. 4. Throughout the rest of this paper, "Modal" without a 2D or 3D suffix can be assumed to refer to both variants of the application.

Intel(R) Xeon Phi(TM) Coprocessors
An Intel(R) Xeon Phi(TM) coprocessor consists of many (≈ 60) low-frequency in-order cores which share a coherent memory. Each of these cores can run up to four hardware threads, which is useful for hiding the latencies of memory accesses and multi-cycle instructions in the absence of out-of-order execution. Unlike the "hyperthreads" on Intel(R) Xeon(R) processors, a single thread on Intel(R) Xeon Phi(TM) cannot issue instructions on back-to-back cycles - it is therefore necessary to use at least two threads per core to fully utilize the hardware.
Each core additionally has support for 512-bit SIMD instructions from the Initial Many-core Instruction (IMCI) set. Executing a fully vectorized fused multiply add (FMA) every cycle amounts to 16 double precision FLOPs per cycle per core, or a theoretical peak of ≈1 TFLOP/s in double precision.
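As a rough worked example (assuming ≈60 active cores and a core clock of ≈1.05 GHz, which is typical for KNC parts; the exact values for our system are given in Table 1):

peak ≈ 60 cores × 1.05 GHz × (8 lanes × 2 FLOPs per FMA) per cycle ≈ 1.0 TFLOP/s in double precision,

where the factor of 16 per core per cycle comes from one fused multiply-add acting on 8 packed doubles in a 512-bit register.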
The coprocessor is physically mounted on a PCIe card with its own GDDR memory and Linux operating system. Since the cores are x86-based and run Linux, the coprocessor is amenable to existing parallel programming languages and libraries such as MPI and OpenMP - compiling existing codes for the coprocessor is often very simple, but extracting performance may require some significant algorithmic restructuring to expose sufficient parallelism [7].

Experimental Setup
We use the cosmic supercomputer at the University of Cambridge, which is an SGI* UV2000 system consisting of 28 Intel(R) Xeon(R) sockets, 24 Intel(R) Xeon Phi(TM) coprocessors, and 1.6 TB of RAM. The specification of a single node of cosmic is given in Table 1. The host processors are based on the Intel(R) microarchitecture previously code-named "Sandy Bridge", and the coprocessors on the microarchitecture previously code-named "Knights Corner". For the sake of brevity, we refer to these two microarchitectures henceforth as "SNB" and "KNC", respectively.
All experiments (unless otherwise stated) were performed using all of the cores available within a node, running the maximum number of threads supported - 4 threads per core on KNC and 1 thread per core on SNB (hyper-threading is disabled on the host) - with the thread affinity on Intel(R) Xeon Phi(TM) set using KMP_AFFINITY=close,granularity=fine. The Intel(R) Xeon Phi(TM) is used in offload mode, and no additional flags are passed to the compiler beyond those used for the host: -O3 -xHost -mcmodel=medium -restrict -align -fno-alias -qopenmp. All experiments are repeated 5 times, and we present the average (mean), in order to account for system noise and any nondeterministic performance effects arising from threading.
Profiling and analysis of the code was performed using Intel(R) VTune(TM) Amplifier XE and the optimization reporting functionality of the Intel(R) C compiler. For both forms of the algorithm, we measure the performance in terms of the number of loop iterations executed by the whole code per second. Since the amount of work per loop iteration is different in the two cases, we differentiate between them by referring to them as 2D and 3D iterations per second (2D it/s and 3D it/s) respectively. Henceforth we will also adopt a convenient notation for displaying the performance figures for SNB and KNC side-by-side as a tuple - (SNB, KNC) it/s.

Optimization and Modernization
Prior to exploring any algorithmic changes, hardware-specific optimizations or code modernizations, we carried out some high-level refactoring of the two Modal variants to improve their performance. Specifically, we replaced a number of abstract "getter" functions used to retrieve array values from other compilation modules with inlined direct array accesses, removing function call overheads and providing the compiler with more optimization freedom. Some occurrences of routines from the GNU Scientific Library (GSL) were also replaced with equivalent, pre-optimized, routines from the Intel(R) Math Kernel Library (MKL).
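The following sketch illustrates the kind of change involved; the function and array names are ours for illustration and are not taken from the Modal source.

/* Before: values fetched through an opaque cross-module "getter". */
static double basis_store[64];
double get_basis_value(int i, int j) { return basis_store[i * 8 + j]; }

double sum_before(void) {
    double s = 0.0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            s += get_basis_value(i, j);   /* call overhead; hides the access pattern */
    return s;
}

/* After: a direct, restrict-qualified array access that the compiler can
   inline, reorder and vectorize. */
double sum_after(const double *restrict basis) {
    double s = 0.0;
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            s += basis[i * 8 + j];
    return s;
}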
The purpose of performing such a high-level refactoring before engaging in other optimization activities is twofold: first, it ensures that the baseline performance of the code is representative of the performance of the original algorithm (as opposed to only its implementation); second, it exposes more accurately the incremental performance improvements from our subsequent changes to the code. After our refactoring, the SNB performance of the 3D variant of the algorithm increased by 3.27×; the 2D variant, however, showed negligible improvement.

Listing 1: Code snippet for gamma_pt in Modal2D.

    sum1 = s1 + s2 + s3 + s4 + s5 + s6;
    result += glwgt[j] * sum1;
  }
  ...
  return result;
}

Modal2D
The hotspot in the Modal2D code is a function called gamma_pt (see Listing 1), which evaluates the inner-most integrand in Eqn. 6 and is called for each mn point in the Γ_{mn} matrix. Since the size of Γ is O(1000²), there is a significant amount of exploitable parallelism present.

Threading
The original code has the loop over mn points as a nested loop (n as the outer loop, m as the inner loop), and distributes the outer loop across processors using MPI. We opt to extend this by distributing both the inner loop and the MPI subdomain of the outer loop across cores within a node using OpenMP. Since the amount of work performed for each iteration of the m loop is the same, and the calculation of each mn point is separable and independent of all others, this is achieved simply through the application of an omp parallel for collapse(2) pragma, as sketched below.
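A minimal sketch of the resulting loop structure (names are illustrative; gamma_pt here stands in for the real evaluation of one entry of Γ):

/* Stand-in for the real evaluation of one (m, n) entry of Gamma. */
static double gamma_pt(int m, int n) { return (double)(m + n); }

void compute_gamma_2d(int n_lo, int n_hi, int n_terms,
                      double *restrict gamma /* n_terms x n_terms, row-major */)
{
    /* Each (m, n) entry is independent, so the rectangular loop nest can be
       flattened with collapse(2) and shared across all threads in the node. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int n = n_lo; n < n_hi; n++)        /* this rank's MPI sub-domain */
        for (int m = 0; m < n_terms; m++)
            gamma[n * n_terms + m] = gamma_pt(m, n);
}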

Vectorization
The code in its original form does not express the algorithm in a way that is conducive to compiler auto-vectorization, and the Intel(R) compiler is unable to vectorize the rsl loop nest because of the indirection arising from the pvec and qvec arrays. Since the rs loop nest has a hard-coded trip count of 9 iterations, it is practical to simply unroll this loop nest and compute the 9 sums simultaneously, thereby replacing pvec and qvec with scalars and removing the indirection. Loop unrolling also gives the compiler more freedom to re-order the instructions, which is particularly important for an in-order core like that of KNC. Following these code transformations the compiler is able to auto-vectorize the loop, resulting in dramatic speed-ups of 2.65× on SNB and 7.0× on KNC - a performance of (5.2, 11.32) 2D it/s. Tuning the data layout to assist vectorization (i.e. ensuring data alignment) and hoisting some repeated calculations out of the inner-most loop increases performance further, to (5.3, 12.0) 2D it/s.
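Schematically, the transformation looks like the following (this is illustrative only; the real loop body in gamma_pt is more involved, and the 9 terms are abbreviated to 3 here):

/* Before: a tiny inner loop whose indirection (idx[t]) prevents the
   compiler from vectorizing over l. */
double inner_before(int lmax, const int idx[9],
                    double *const pvec[], double *const qvec[])
{
    double result = 0.0;
    for (int l = 0; l < lmax; l++)
        for (int t = 0; t < 9; t++)
            result += pvec[idx[t]][l] * qvec[idx[t]][l];
    return result;
}

/* After: the short loop is unrolled by hand, the indirection is hoisted
   into plain pointers, and the l loop auto-vectorizes. */
double inner_after(int lmax,
                   const double *restrict p0, const double *restrict q0,
                   const double *restrict p1, const double *restrict q1,
                   const double *restrict p2, const double *restrict q2)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;   /* ...one accumulator per term */
    for (int l = 0; l < lmax; l++) {
        s0 += p0[l] * q0[l];
        s1 += p1[l] * q1[l];
        s2 += p2[l] * q2[l];
    }
    return s0 + s1 + s2;
}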

Algorithmic Improvements
A significant issue with the original algorithm is that the value P_{iī}(x, μ) is calculated repeatedly from Eqn. 7. Looking at the sizes of the dimensions of Eqn. 7 - 0 ≤ i ≤ ∛N_terms, 0 ≤ x ≤ 216, and 0 ≤ μ ≤ 3000 - if we were to precalculate P_{iī} for all values of i, μ, and x then the resultant array would require only 1 GB of storage (in double precision), which is small enough to remain within the 8 GB of GDDR available on KNC. Pre-computing P_{iī} in this way reduces the algorithmic complexity significantly, yielding dramatic speed-ups of 278× on KNC and 300× on SNB - giving a new performance of (1649.3, 3344.5) 2D it/s. Following the introduction of the new algorithm, it is necessary to re-examine the hotspots and any assumptions about achievable performance. Since the size of the look-up table we introduce is larger than the size of cache, and its access pattern is too irregular to benefit from blocking, our changes shift Modal2D from compute- to memory-bound. Furthermore, running the full number of threads per core causes contention for entries in each core's Translation Look-aside Buffer (TLB) - somewhat counterintuitively, we can increase performance by a further 5% by reducing the number of threads used per core to 2.
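In outline (array sizes and names are illustrative, and eval_P stands in for the expensive evaluation of Eqn. 7), the look-up table is filled once and then indexed from the hot loops:

#include <stdlib.h>

static double eval_P(int i, int x, int mu)     /* stand-in for Eqn. 7 */
{
    return (double)i + 0.001 * x + 1e-6 * mu;
}

double *precompute_P(int ni, int nx, int nmu)
{
    double *table = malloc((size_t)ni * nx * nmu * sizeof *table);
    if (!table) return NULL;
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < ni; i++)
        for (int x = 0; x < nx; x++)
            for (int mu = 0; mu < nmu; mu++)
                table[((size_t)i * nx + x) * nmu + mu] = eval_P(i, x, mu);
    return table;
}

/* Inside the hot loops, the repeated evaluation becomes a single load. */
static inline double lookup_P(const double *table, int nx, int nmu,
                              int i, int x, int mu)
{
    return table[((size_t)i * nx + x) * nmu + mu];
}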

Results
The graph in Figure 1 shows the speed-up resulting from each of our optimizations and modernizations. For the original code following re-factoring and the introduction of threads (Version 1) we see that KNC is out-performed by two SNB sockets. However, tuning to ensure efficient use of vectors (Versions 2-4) is sufficient to invert this relationship, and highlights the importance of achieving high SIMD efficiency on KNC. The most significant speed-ups arise from algorithmic change (Version 5) on both processor and coprocessor. On KNC we also show that using fewer threads per core can improve performance further.

Modal3D
In the original Modal3D code, the majority of the time is spent in the loop structure shown in Listing 2. The outer loop (n) iterates over the late-time modes, and for each late-time mode we compute the inner product with all the primordial modes at each valid angular momentum. The constraints on the valid angular momentum modes give rise to the triangular ℓ1 ℓ2 ℓ3 loop.

Threading
As with Modal2D, the original code is parallelised with MPI only over the n loop, and introducing threading via OpenMP is a necessary step to improving utilization of KNC's many-core architecture. However, in the 3D variant there are multiple candidate loops that could be parallelised with OpenMP. The inner-most m loop best matches the approach taken in the 2D case, but would be an unwise choice here (since the calculation of x and z would not benefit from parallelization). Threading the ℓ1 ℓ2 ℓ3 loops instead would be a better idea, if not for the fact that the loop space is triangular; threading only one of the loops would lead either to work imbalance or to an iteration count too small to fully utilize all of the available cores.
Ideally we would like to flatten the whole ℓ1 ℓ2 ℓ3 space into a single loop, similar to the effect of OpenMP's collapse(3) construct. Use of this construct is illegal here, since the number of iterations of the inner loops depends on the indices of the outer loops [12]. We experimented with many different approaches to achieving good load balance using standard OpenMP: spawning separate OpenMP tasks for groups of loop iterations; recursively partitioning the iteration space; and using dynamic scheduling with a threaded outer loop. In all cases we found that the overheads of these methodologies were significant, the load imbalance was not necessarily adequately addressed, and cache behaviour suffered due to the inability to specify thread affinities. The Intel(R) Threading Building Blocks [15] (TBB) library provides task-affinity functionality that may have assisted us with this last point, but we have two reasons for not using it: first, to keep Modal as a C (rather than C++) application; and second, to ensure that our threaded version did not depend heavily on non-standard language features and libraries.
Instead, we carve up the iteration space manually across tasks, and assign contiguous "chunks" of loop iterations to each thread. This scheme (demonstrated in Figure 2, and sketched in code below) allows us to achieve very good load balancing at the expense of a small one-time setup overhead. Note that this method is scalable beyond threads - we can manually carve up the total mn ℓ1 ℓ2 ℓ3 iteration space first across MPI ranks, then across sockets/coprocessors and finally across threads - which allows us to improve the multi-node scalability of the code.
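A simplified sketch of the chunking idea follows. The loop ordering and the exact form of the triangle and parity tests here are illustrative; the point is only the flattening of the valid iterations into equal, contiguous chunks.

#include <omp.h>

static void do_work(int l1, int l2, int l3)    /* stand-in for the real kernel */
{
    (void)l1; (void)l2; (void)l3;
}

static long count_valid(int lmax)
{
    long n = 0;
    for (int l1 = 0; l1 <= lmax; l1++)
        for (int l2 = l1; l2 <= lmax; l2++)
            for (int l3 = l2; l3 <= l1 + l2 && l3 <= lmax; l3++)
                if ((l1 + l2 + l3) % 2 == 0)   /* parity condition */
                    n++;
    return n;
}

void process_triangle(int lmax)
{
    const long total = count_valid(lmax);      /* one-time setup cost */
    #pragma omp parallel
    {
        const int  t  = omp_get_thread_num();
        const int  nt = omp_get_num_threads();
        const long lo = total * (long)t / nt;  /* this thread's contiguous chunk */
        const long hi = total * (long)(t + 1) / nt;
        long idx = 0;
        for (int l1 = 0; l1 <= lmax; l1++)
            for (int l2 = l1; l2 <= lmax; l2++)
                for (int l3 = l2; l3 <= l1 + l2 && l3 <= lmax; l3++) {
                    if ((l1 + l2 + l3) % 2 != 0) continue;
                    if (idx >= lo && idx < hi)
                        do_work(l1, l2, l3);
                    idx++;
                }
    }
}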
While restructuring the loop for parallelism, we brought the n loop inside the ℓ1 ℓ2 ℓ3 nest to reduce the amount of redundant work per thread. However, this increases the amount of temporary storage required by each thread to store its own partial result of Γ_{mn}. The size of this storage grows with the square of the number of terms, and is O(10 MB) for a typical problem size. Assuming 4 threads per core, the amount of temporary storage far exceeds the 512 KB capacity of KNC's L2 cache, and we see performance degradation due to thrashing effects. If all the threads on one core collaborate to read and write the same array, then the size of this temporary storage is reduced by a factor of 4 - and while it still doesn't fit in the cache, blocking the loop to compute B iterations between accesses can amortize the bandwidth overheads.
Collaboration between threads in this way requires a "nested" parallelism model, with a 2-tier hierarchy of threads (e.g. C groups of 4 threads, where C is the number of cores). A naïve implementation of nested parallelism in OpenMP is not practical here, because the cost of creating and destroying new teams of threads for each iteration of the inner loop is prohibitively expensive. Instead, we adopt an approach where the threads in both tiers are spawned together, and work is assigned to threads based on a combination of their team id (i.e. tid / 4) and local worker id (i.e. tid % 4) - all threads execute the same set of ℓ1 ℓ2 ℓ3 loop iterations as the other threads in their group, but work on different eigenmodes inside the angular momentum loop. Although this manual nesting approach gives us a great deal of flexibility, it depends on a custom on-core barrier (such as the one shown in Listing 3) to synchronize only the threads in a particular group. There are two reasons to use a local barrier instead of the global barrier provided by OpenMP: first, a global barrier scales as O(log2(4C)), whereas a local barrier has a constant overhead of O(4); and second, a global barrier can easily lead to accidental deadlocks, since all threads in all groups must be present at all barriers.

Algorithmic Change
The bottleneck left in the code after the loop modernisation is the integration routine called by calculate_xint. In the original code, integration was performed using the function gsl_spline_eval_integ from GSL, but this was replaced with a faster equivalent from Intel(R) MKL. We find that replacing this functionality with a numerical integration routine is much faster still and less memory-intensive, since we do not need to store a cubic spline nor solve a global system.
For the results reported here, we use a simple application of the Trapezium rule, combined with hand-selected sampling points to improve accuracy in areas of interest. This scheme has sufficient numerical accuracy for us to obtain a physically meaningful answer, within a few percent of the answer calculated by Intel(R) MKL. Other numerical integration routines (e.g. Gaussian quadrature) may be required for other corner cases, and we leave this investigation to future work.
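A minimal sketch of the scheme (the choice of sample points x[] is where the hand-selection enters; denser spacing in the regions of interest directly improves accuracy there):

/* Trapezium rule on a non-uniform grid: x[0..n-1] are the (hand-selected)
   sample points and f[0..n-1] the integrand evaluated at those points. */
static double trapezium(const double *x, const double *f, int n)
{
    double sum = 0.0;
    for (int i = 1; i < n; i++)
        sum += 0.5 * (x[i] - x[i - 1]) * (f[i] + f[i - 1]);
    return sum;
}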
This leaves two bottlenecks in the code - the partial reduction stage in calculate_gamma_3d, and the calculate_xint routine. The reduction step is actually just a matrix multiplication, Γ_{mn} = Σ_ℓ P_{ℓm} X_{ℓn}, and thus it can be replaced with a call to the BLAS-3 DGEMM routine from Intel(R) MKL, which is cache-blocked and vectorized efficiently out-of-the-box. We call MKL from only a single thread in each of our nested thread groups, and empirically derive the best blocking factor B to be 64.
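In outline, the partial reduction then becomes a single accumulating GEMM (Γ += Pᵀ X); the matrix dimensions and storage below are illustrative:

#include <mkl.h>

/* Gamma(m,n) += sum_l P(l,m) * X(l,n), i.e. Gamma += P^T * X.
   P is L-by-M, X is L-by-N and Gamma is M-by-N, all stored row-major. */
void accumulate_gamma(int L, int M, int N,
                      const double *P, const double *X, double *Gamma)
{
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                M, N, L,
                1.0, P, M,        /* lda: row length of P */
                     X, N,        /* ldb: row length of X */
                1.0, Gamma, N);   /* beta = 1.0 accumulates the partial result */
}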

Results
The graph in Figure 3 shows the speed-up resulting from each of our optimizations and modernizations. For the original code following re-factoring and the introduction of threads (Versions 1-3) we already see significant gains on the host (3.27×). Note that we do not provide KNC results until the introduction of OpenMP (Version 4), since we are using the offload model and offloading work to a single KNC thread does not make sense. Tuning the threading behaviour (Versions 5 and 6) continues to improve performance on both platforms but, as in the case of Modal2D, more significant speed-ups are only possible with algorithmic change - in this case, a change in integration method (Version 7). Further tuning of vectorization behaviour (Versions 8 and 9) provides a small boost in performance, and tuning the prefetch distance (Version 10) finally pushes the performance of KNC ahead of SNB.

Listing 3: Example implementation of a fast, on-core barrier.
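The sketch below shows one way such an on-core, sense-reversing barrier can be written (assuming 4 threads per group and C11 atomics; this is illustrative rather than necessarily the exact implementation used in Modal):

#include <stdatomic.h>

#define THREADS_PER_CORE 4

/* One barrier object per core; only that core's 4 threads ever touch it. */
typedef struct {
    atomic_int count;   /* threads that have arrived in the current phase */
    atomic_int sense;   /* flips each time the barrier opens              */
} core_barrier_t;

static void core_barrier_init(core_barrier_t *b)
{
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, 0);
}

/* local_sense is thread-private and must start at 0. */
static void core_barrier_wait(core_barrier_t *b, int *local_sense)
{
    const int my_sense = 1 - *local_sense;
    if (atomic_fetch_add(&b->count, 1) == THREADS_PER_CORE - 1) {
        atomic_store(&b->count, 0);         /* last arrival resets and ...      */
        atomic_store(&b->sense, my_sense);  /* ... releases the other threads   */
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;                               /* spin: cheap within a shared core */
    }
    *local_sense = my_sense;
}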
Since the same code can be recompiled and run on both processor and coprocessor, very little effort was required to make the code heterogeneous. We use the asynchronous offload features of Intel(R) Language Extensions for Offload (LEO) to offload a fraction (empirically derived to be 71%) of the total problem to KNC, and process the remainder on SNB. On a single node of cosmic, using KNC in addition to SNB gives a 3.5× speed-up over using SNB only.
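In outline, the split uses LEO's signal/wait clauses so that the host computes its share while the coprocessor works on the offloaded fraction; the 71% fraction and all names below are illustrative:

/* Marked so that it can be compiled for, and called on, the coprocessor. */
__attribute__((target(mic)))
void compute_range(long lo, long hi, double *gamma)
{
    for (long i = lo; i < hi; i++)
        gamma[i] = (double)i;              /* stand-in for the real per-point work */
}

void compute_heterogeneous(long total, double *gamma)
{
    const long split = (long)(0.71 * total);   /* empirically derived fraction */
    char sig;                                  /* tag for the asynchronous offload */

    /* Start the coprocessor on its share; only its slice is copied back. */
    #pragma offload target(mic:0) signal(&sig) out(gamma : length(split))
    compute_range(0, split, gamma);

    /* The host works on the remainder concurrently ... */
    compute_range(split, total, gamma);

    /* ... and then waits for the coprocessor's slice to arrive. */
    #pragma offload_wait target(mic:0) wait(&sig)
}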

Performance Study
The new implementations of the 2D and 3D variants of Modal are optimized, but they are not optimal. In this section, we analyze and discuss the remaining performance bottlenecks at both single- and multi-node scale. Figures 4a and 4b show how the performance of Modal2D and Modal3D scales with the number of cores on SNB and KNC, respectively. The number of threads per core on KNC is fixed at the number that gives the highest performance in each case - 2 and 4 threads per core for Modal2D and Modal3D, respectively.

Single Node
For Modal2D, we see that scaling tapers off with increasing core count, reaching a maximum speed-up of 32.4×. This is not for want of parallel work, nor due to high synchronization costs, but rather because the cores are competing for limited memory bandwidth. Running with all 59 cores on KNC, the bandwidth from GDDR to L2 was measured (by Intel(R) VTune(TM) Amplifier XE) to be 147.5 GB/s on average - very close to its peak STREAM [9] bandwidth of 165 GB/s (see Table 1). The fact that we are hitting close to peak shows that there is little room left to further tune the computation performed by Modal2D; any additional work on the 2D variant will need to focus on improving cache locality or algorithmic complexity, not vectorization.
For Modal3D, we see much better scaling with increasing core count (close to linear). This is to be expected, since the code is very compute intensive. Furthermore, the groups of threads, once spawned, require no communication with each other until the reduction stage at the very end of the calculation. The scaling behaviour suggests that, unlike the 2D variant, Modal3D is compute bound - running with all 59 cores on KNC, the number of vector instructions issued per cycle was measured by Speedometer [10] to be 39.8% of peak. The majority of this inefficiency is due to transfers between the L2 and L1 caches; although there is a large amount of data re-use within a core, the number of streams per thread is too high for the prefetchers to effectively hide the latency of L2 accesses.

Scaling to Multiple Nodes
Figures 5a and 5b show how performance scales with the number of cosmic nodes for Modal2D and Modal3D, respectively. For Modal2D we show results for SNB and KNC alone, while for Modal3D we also show results for a "hybrid" model running 30% of the work on SNB and 70% of the work on KNC. In all cases, we place a single MPI rank per node. The reader is reminded that a single cosmic node contains only a single SNB socket, and a single KNC coprocessor (see Table 1).
For both the 2D and 3D variants, we see scaling that is fairly close to linear. There are two reasons for this: first, the only communication required between tasks occurs right at the start of a run (to agree on the task decomposition) and right at the end (to reduce the final gamma matrix); second, there is a significant amount of work (601 iterations in the 2D case, and 2 × 10^9 iterations in the 3D case) to be split between MPI ranks, so we do not reach a scale where communication costs begin to dominate execution time. Note that although the single-node scaling of Modal2D is inhibited by hardware limitations, this is not the case here - when scaling to multiple nodes, the total available memory bandwidth also scales accordingly.

Scaling with Problem Size

Figures 6a and 6b show how performance scales with problem size for Modal2D and Modal3D, respectively. As in the previous section, we again report performance for three different configurations: SNB alone; KNC alone; and hybrid execution across both processor and coprocessor. Across all variants and problem sizes, KNC beats SNB. Focussing on the 2D variant, the gap between KNC and SNB is largest for large problems. For the two small problems, with 50 and 101 terms, the execution time is dominated by the precompute stage of the calculation, which does not scale well with increasing numbers of threads and is thus relatively expensive on KNC. For the larger problems, such as those with 601 and 1001 terms, the execution time is dominated by the real computation of Γ_{mn}, which is highly scalable on KNC; KNC's advantage over SNB rises to 4.1× for the 1001-term case.

For the 3D variant, the gap in execution times between SNB and KNC is more consistent, owing to the large amount of parallelism present even in small problems as a result of our fine-grained decomposition of the loops. All the times in Figure 6b are for the same ℓ1 ℓ2 ℓ3 space of 18,712,695 iterations, but for different numbers of terms. The amount of computation required in each of Modal3D's functions scales at a different rate with respect to the problem size, and this is reflected in the execution times. The number of eigenmodes that must be integrated scales linearly, but the size of the Gamma matrix, and thus the overhead of reducing it, scales quadratically. For all the problem sizes we have tested, integration remains the top hotspot, which is why the scaling is close to what one would expect, but the increasing cost of the other functions is noticeable.

Conclusions
We have presented an optimization study for the early universe simulation and analysis code "Modal", covering the 2D and 3D versions of its algorithm. These codes represent two common computational challenges: the evaluation of a multi-dimensional integral/sum on a dense rectangular domain, and on a domain which is neither dense nor rectangular. As such, the optimisation steps detailed here should be applicable to any similar compute bound code.
Through a combination of algorithmic improvements, the introduction of thread-level parallelism, and exposing opportunities for hardware vectorization, we have achieved significant whole-application speed-ups: 1765× on KNC and 833× on 2× SNB in the 2D case; and 108× on KNC and 83.9× on 2× SNB in the 3D case. In both cases, the greatest source of speed-up is algorithmic change, and the increased amount of exploitable parallelism it brings. Although still significant, hardware-specific tuning of the new algorithms yields less than 10× improvement in performance.
Our use of standard programming languages ensures that code changes benefit not only the new Intel(R) Xeon Phi(TM) coprocessors, but also Intel(R) Xeon(R) processors - and performance improvements are expected to persist on the next generation of processors and coprocessors (codenamed "Knights Landing"). Investing in code optimization and modernization today can deliver significantly greater gains than waiting for future advances in processor technology, and will ensure that we maximise the science done on any given architecture.
Configurations: see Table 1. Tests performed by DAMTP, University of Cambridge. For more complete information visit http://www.intel.com/performance. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice Revision #20110804

Intel, Xeon, Xeon Phi and VTune are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.

A Physics of the Modal code

A.1 Theory
We observe the temperature T of the cosmic microwave background (CMB) on a distant sphere, and we can represent the anisotropies in this temperature, ΔT/T, as

ΔT/T (n̂) = Σ_{ℓm} a_{ℓm} Y_{ℓm}(n̂),

where Y_{ℓm}(n̂) are the usual spherical harmonics with multipoles ℓ, m satisfying −ℓ ≤ m ≤ ℓ.
Most quantitative cosmology has been developed using the two-point correlator or power spectrum, defined from an average over the azimuthal multipole m,

C_ℓ = 1/(2ℓ+1) Σ_m ⟨ a_{ℓm} a*_{ℓm} ⟩.

However, we are interested in the new information contained in the three-point correlator or bispectrum B_{ℓ1ℓ2ℓ3}, obtained by averaging ⟨ a_{ℓ1m1} a_{ℓ2m2} a_{ℓ3m3} ⟩ over orientations, as the CMB is assumed isotropic. This is the spherical representation of the 2D CMB bispectrum of triangles on the sky. One of our key goals is to connect this to the 3D primordial bispectrum B(k1, k2, k3) generated in the early universe; here, B(k1, k2, k3) is defined in terms of wavenumbers k_i and must be projected forward using transfer functions describing the evolution of the Universe to predict the late-time B_{ℓ1ℓ2ℓ3}. The modal method is designed to constrain the bispectrum. Integrating the bispectrum gives the skewness of a distribution (i.e. how much the distribution leans to one side) and, since the bispectrum is zero for a Gaussian distribution, this is an effective test of non-Gaussianity. The issue is that the bispectrum is a full three-dimensional quantity, and so calculating and measuring it is nontrivial. If we wish to constrain a theory which predicts a bispectrum B(k1, k2, k3) at the end of inflation, we first need to evolve it forward to today via convolution with the transfer functions Δ_ℓ(k) and a geometric factor h related to the projection onto a 2-sphere, which will be defined below. This is a 7-dimensional calculation and is impossible in practice. The modal method simplifies this by decomposing the inflationary bispectrum into a set of specially chosen basis functions Q̄_n for which this calculation can be dramatically simplified. These Q̄_n are defined as

Q̄_n(k1, k2, k3) = (1/6) [ q̄_i(k1) q̄_j(k2) q̄_k(k3) + q̄_j(k1) q̄_k(k2) q̄_i(k3) + q̄_k(k1) q̄_i(k2) q̄_j(k3) + q̄_k(k1) q̄_j(k2) q̄_i(k3) + q̄_j(k1) q̄_i(k2) q̄_k(k3) + q̄_i(k1) q̄_k(k2) q̄_j(k3) ],

where the relation between n and ijk is defined via a pre-determined one-to-one mapping which is optimised for convergence. The mapping is both non-analytic and sparse (so not all, or even most, ijk triples correspond to an n) and is read in from a pre-calculated list. With this basis we then have B̄ ≡ (k1 k2 k3)² B = Σ_n ᾱ_n Q̄_n, where B̄ is the signal-to-noise weighted version of B. The particular form of the basis functions allows the projection to be calculated simply, and we have

Q̄_{n ℓ1 ℓ2 ℓ3} ≡ (1/6) ∫ r² dr [ q̄_i(r, ℓ1) q̄_j(r, ℓ2) q̄_k(r, ℓ3) + 5 permutations ],   (14)

q̄_i(r, ℓ) ≡ ∫ dk k² q̄_i(k) Δ_ℓ(k) j_ℓ(kr),   (15)

and now B_{ℓ1ℓ2ℓ3} = Σ_n ᾱ_n Q̄_{n ℓ1 ℓ2 ℓ3}, so the convolution is effectively 2-dimensional. However, this form still proves difficult to use for estimation because of the radial integral over r (i.e. the distance from inflation to now, along a line of sight). It is more efficient to use a second basis at the time of observation,

Q_{n ℓ1 ℓ2 ℓ3} = (1/6) [ q_i(ℓ1) q_j(ℓ2) q_k(ℓ3) + q_j(ℓ1) q_k(ℓ2) q_i(ℓ3) + q_k(ℓ1) q_i(ℓ2) q_j(ℓ3) + q_k(ℓ1) q_j(ℓ2) q_i(ℓ3) + q_j(ℓ1) q_i(ℓ2) q_k(ℓ3) + q_i(ℓ1) q_k(ℓ2) q_j(ℓ3) ],

and to project a signal-to-noise weighted version of the Q̄ into this new basis, using the weight factors v_ℓ = (2ℓ + 1)^(1/6). If we define an inner product ⟨A, B⟩ over the multipoles (designed to mimic the signal-to-noise structure of the estimator, and weighted by the previously mentioned geometric factor h), then Γ can be defined in terms of inner products of the basis functions Q̄_n and Q_m. Thus, for optimising the calculation of Γ, we only need to focus on optimising the evaluation of these inner products.
The majority of the calculation time is spent evaluating the second inner product, due to the radial integral inherent in Q̄, so we will restrict our attention to that part alone, ⟨Q̄_n, Q_m⟩, which is Eqn. 4. The reader is referred to [5] and [4] for a full explanation of the physics.