Comparison of distributed memory algorithms for X-ray wave propagation in inhomogeneous media.

Calculations of X-ray wave propagation in large objects are needed for modeling diffractive X-ray optics and for optimization-based approaches to image reconstruction for objects that extend beyond the depth of focus. We describe three methods for calculating wave propagation with large arrays on parallel computing systems with distributed memory: (1) a full-array Fresnel multislice approach, (2) a tiling-based short-distance Fresnel multislice approach, and (3) a finite difference approach. We find that the first approach suffers from internode communication delays when the transverse array size becomes large, while the second and third approaches have similar scaling to large array size problems (with the second approach offering about three times the compute speed).


Introduction
Diffraction limited storage rings are providing the next advance in X-ray brightness from quasitime-continuous synchrotron light sources [1]. These allow one to combine the high penetrating power and short wavelength of X-rays for nanoscale imaging of increasingly large specimens. Due to the overlap of features in a single view of an extended object, one must use tomography to obtain a 3D view of a specimen from a series of 2D projection images. However, as the transverse spatial resolution δ res is improved, the depth of focus (DOF) decreases according to [2,3] where θ is the numerical aperture of the imaging optic, and 0.61 = 1.22/2 comes from the Airy function for circular optics. Because of the depth of focus, features at different depths in an extended specimen are no longer sharply viewed in a single projection image. One way to overcome this limitation is to move to an optimization-based approach to image reconstruction, where one constructs a guess of the 3D object, calculates wavefield propagation through the object leading to an exit wave (and subsequently to predicted image intensities), and then adjusts the guess of the object until the difference between predicted and measured image intensities is minimized. Variations of such an approach have been demonstrated in electron microscopy [4,5], light microscopy [6][7][8], and X-ray microscopy [9][10][11][12].
In order to accurately represent the forward problem, these approaches all require one to implement computational wavefield modulation and propagation through a complex 3D object, and thus determine the complex exit wave leaving the object. This propagation is usually done with a multislice approach [13][14][15], where for one illumination direction one treats the object as

Algorithms
We consider here the forward problem of how to calculate the exit wave leaving a heterogeneous refractive index distribution for a large object. In the full-array Fresnel multislice approach, this is done by a sequence of wavefield propagation calculations for one guess of the object, which in turn, is nested within the iterative adjustment of the guess of the object to minimize the difference between predicted and measured image intensities. Because the calculation of wavefield propagation in a heterogeneous medium is undertaken repeatedly, we are interested in approaches that minimize computational time.
Specialized hardware has been developed that can calculate (10,000) 2 pixel holograms in just 100 ms [30], though this hardware only solves one step in the overall optimization problem, as noted above. Using a single workstation with a graphical processing unit (GPU), a solid-state drive (SSD) for rapid transfer of partial data to and from limited random access memory (RAM), and efficient tiling strategies as will be discussed in Sec. 2.2 below, single instances of short-distance wavefield propagation of (131,072) 2 pixel arrays have been demonstrated with an impressive calculation time of 3.6 minutes [27]. Doing calculations like this in a shorter computational time, and within the context of an optimization-based image reconstruction approach, can be achieved if one utilizes distributed memory parallelism in high performance computing clusters. These clusters typically consist of nodes that are connected by high-bandwidth, low-latency interconnects (with many advances on the latest supercomputers [31]), and protocols such as the message passing interface (MPI) for distribution and coordination of parallelized operations [32].
Because of our interest in X-ray microscopy applications, we limit ourselves to considering the refractive effect of an inhomogeneous medium. Most transmission imaging in X-ray microscopy is either done at soft X-ray photon energies around light element K absorption edges (0.2-0.8 keV), or energies of 2-15 keV where one obtains good penetration while still maintaining reasonable contrast from microscopic features [3,33]. In our energy range of interest, X-ray interactions are well described by a complex refractive index n of n(x, y, z) = 1 − δ(x, y, z) − iβ(x, y, z) (2) where we have used the sign convention appropriate for writing forward wave propagation in the propagation directionẑ as exp[−iknz] with k = 2π/λ. The phase shifting part δ and absorptive part β of the refractive index are well tabulated [34], and are typically in the range of 10 −3 -10 −7 , with δ β in most cases. When representing an object in a 3D array with slice thickness ∆ z along the illumination directionẑ, the net refractive effect of the j th slice is determined by leading to an advance in the phase ϕ of for the slice. In visible light one might want to use the mean refractive indexn for propagation to the plane of the next slice, and the refractive index variations n(x, y, z) −n the calculation of Eq. (3) for the effect of inhomogeneities within a slice. However, the small values of δ and β for X-rays make that unnecessary.

Full-array Fresnel multislice
As noted above, full-array Fresnel multislice is a well-known approach developed first in electron microscopy [13,14] and subsequently applied in visible light optics [15] and in X-ray microscopy [18][19][20][21] studies. With our particular interest in X-ray optics, this approach has been shown to produce results for very thick optics that are equivalent to those provided by coupled-wave equations [22], allowing for the simulation of the focusing properties of combinations of diffractive X-ray optics [35] as well as accurate modeling of the forward problem for image recovery of objects extending beyond the depth of focus limit [11,12]. What the approach leaves out is the ability to account for backscattered waves [36], but this effect is weak in X-ray interactions with non-crystallline media. Starting with a wave ψ s incident on a slice, we first apply the phase advance and magnitude reductions of Eqs. (4) and (5) produced by the slice, giving as the modulated wavefield, where represents pointwise multiplication. We then propagate this modulated wavefield to the exit plane of this slice, giving a downstream wavefield ψ d (x, y) of F represents a Fourier transform and F −1 its inverse, and (u, v) are the transverse coordinates in the Fourier transform domain. The reciprocal space Fresnel propagation kernel is preferred (rather than the equivalent kernel in real space) for short propagation distances to avoid aliasing artifacts [37,38]. One then has N z = t/∆ z (9) slices for the overall calcuation for a specimen with thickness t. This leads to Algorithm 1 for full-array Fresnel multislice. How thin should the slices be in the multislice method? Based on Eq. (1), one would expect to require ∆ z ≤ DOF. One comparison tested the full-array Fresnel multislice method (which can model arbitrary refractive index distributions) against coupled wave equation methods (which can be applied to easily defined, regular structures) [22]. This comparison used a parameterization that (in hindsight) is equivalent to the Klein-Cook parameter Q [39] of for X-ray volume gratings with period 2∆ r and thickness z aligned to the propagation directionẑ. When Q is well below 1, one can use simple scalar diffraction to describe the effects of the grating, whereas Q ≥ 1 corresponds to the case where volume grating effects become pronounced. If we limit the slice thickness to a value such that Q ≤ 0.5 and assume 1 − δ 1, the slice thickness ∆ z should be kept below a value z K-C of or z K-C 0.32δ 2 res /λ instead of DOF 5.4δ 2 res /λ. However, this is an extreme case that applies to regular gratings at the Nyquist limit, aligned to the beam propagation directionẑ. In practice, a good approach is to start with ∆ z = DOF, and reduce it towards ∆ z = z K-C while watching for asymptotic convergence of the exit wave. Algorithm 1: Algorithm for implementation of full-array Fresnel multislice.
The refractive modulation step of Eq. (6) is a per pixel operation (i.e., each pixel in the output array depends only on the corresponding pixel in the input array), leading it to be trivially parallelizable, and one could even distribute the set of δ j (x, y) and β j (x, y) from all depth planes to the appropriate nodes prior to initiating the calculation. However, the Fourier transform of Eq. (7) is a whole-2D-array operation, so it is not trivially parallelizable. While there is considerable activity on developing efficient large-array parallel FFT implementations [40][41][42], inter-node communication requirements still set performance limits [43]. This motivates the use of other approaches for carrying out the operation of Eq. (7).

Tiling-based short-distance Fresnel multislice
In the mathematical definition of a discrete Fourier transform, the value of one input plane pixel affects all pixels in the transform, and vice versa. However, in short-distance wavefield propagation, information is localized due to the finite angle θ = λ/(2∆ r ) of first-order diffraction from the finest features that are Nyquist sampled when using a pixel size of ∆ r [44,45]. For a propagation distance z prop , this means that Nyquist-sampled diffraction information at the subsequent slice is contained within a radius r 1 of where the identity tan(sin −1 (x)) = x/ √ 1 − x 2 has been used; the approximate result applies to our case because the pixel size ∆ r is much larger than the X-ray wavelength. However, this does not incorporate the reality that interference fringes from weak features taper off in amplitude at large transverse distances. An alternative criterion is to consider diffraction from a half-edge, which can be characterized using the Cornu spiral [46] in terms of a dimensionless parameter w = r 2 2/(λz prop ). This gives as an expression for the transverse distance r 2 for a given propagation distance z prop . In half-edge Fresnel diffraction from a fully opaque object, one reaches the 8 th dark fringe, where the intensity modulation is down to 8%, at a value of w = 5.61. Since X-ray microscopy usually involves weak phase objects, their effect at this transverse distance will be quite small; as a result, we use w = 5.61 to give r 2 = 3.97 λz prop (14) as a reasonable transverse distance beyond which there should be little effect from neighboring features.
Recognizing the limited transverse extent of diffraction from upstream features, one can use a tiling approach to parallelize the short-distance Fresnel multislice calculation [27]. In this approach, a large 2D array is split into a large series of much smaller tiles with buffer zones around their edges as shown in Fig. 1. One can then propagate these tiles separately, discard the buffer zones, and recombine the tiles to form the large 2D array at the downstream plane. These tiles can be sized to fit GPU memory [27] or other specialized hardware [30]. They can also be distributed to nodes on a high performance computing cluster, which is the approach we use here. Consider a large input wavefield array ψ s (N x , N y ), and a refractive index array The number of slices is N z from Eq. (9), with ∆ z no larger than the depth of focus DOF of Eq. (1) and possibly as small as z K-C of Eq. (11). The tiles will have dimensions ψ s (N x,tile , N y,tile ) so that there are N x /N x,tile and N y /N y,tile tiles in the x and y dimensions, respectively. To any interior tile, one must add a buffer zone of physical width r 2 from Eq. (14), or pixel width to each side of the tile with information from neighboring tiles as shown in Fig. 1. This allows one to account for diffraction from features at the edge of nearby tiles into the field of view of the particular tile being processed. For a multislice calculation, one can choose between two alternative approaches to tiling-based Fresnel diffraction using these arrays: • 2D tiling: In this approach, at each slice one divides ψ s (N x , N y ) into tiles ψ s (N x,tile + 2N buffer , N y,tile + 2N buffer ), and also the refractive index distribution n = 1 − δ − iβ for slice j into tiles of δ j (N x,tile + 2N buffer , N y,tile + 2N buffer ) and β j (N x,tile + 2N buffer , N y,tile + 2N buffer ).
In this case, r 2 (Eq. (14)) and N buffer (Eq. (15)) are calculated for the thickness of one slice, or z = ∆ z . The tiles with their buffer zones are distributed to nodes. The refractive index modulation is then applied using Eq. (6), after which propagation by the slice thickness z = ∆ z is carried out using Eq. (7) to yield ψ d (N x,tile + 2N buffer , N y,tile + 2N buffer ). The buffer zone is then stripped, and ψ d (N x,tile , N y,tile ) is then returned. The various tiles of ψ d (N x,tile , N y,tile ) are then used to form the full input wavefield ψ s (N x , N y ) entering the next slice, and the process is repeated. Because the free space propagation step of Eq. (7) is carried out over a small distance z = ∆ z ranging between z = DOF and z = z K-C , this approach has the advantage of requiring a smaller buffer zone size N buffer . However, at each of the N z slices of the calculation, it requires collecting the ψ d (N x,tile , N y,tile ) tiles from the computational nodes to re-form the full array ψ d (N x , N y ) which then is used to distribute the set of ψ s (N x,tile + 2N buffer , N y,tile + 2N buffer ), δ j+1 (N x,tile + 2N buffer , N y,tile + 2N buffer ), and β j+1 (N x,tile + 2N buffer , N x,tile + 2N buffer ) tile arrays to the computational nodes.
• 3D tiling: In this approach, at the outset one divides the input wavefield ψ 0 (N x , N y ) into tiles ψ 0 (N x,tile + 2N buffer , N y,tile + 2N buffer ) with r 2 (Eq. (14)) and thus N buffer (Eq. (15)) calculated using z = t, the total sample thickness (a much larger value than the slice thickness ∆ z ). One also generates 3D tilings of the refractive index arrays δ j=1...N z (N x,tile + 2N buffer , N y,tile + 2N buffer ) and β j=1...N z (N x,tile + 2N buffer , N y,tile + 2N buffer ) for all the N z slices, which are then distributed to computational nodes. The multislice calculation through all N z slices can then be calculated on each node, after which the buffer zone is removed and the wavefield exiting the sample at each tile position is returned as ψ d (N x,tile , N y,tile ) so that the overall specimen exit wave ψ d (N x , N y ) can be assembled. This approach has the advantage of not requiring any data transfer between computational nodes during the multislice calculation, but it involves each node carrying out its calculations on a larger array due to the increased size of N buffer with z = t. Fig. 1. For tiling-based short-distance Fresnel multislice, one can use a tiling approach to split a large 2D array of dimension N x × N y into a set of smaller arrays, each of size N x,tile × N y,tile , so that these smaller arrays can be processed on separate computational nodes. When doing so, one must add a buffer zone of physical width r 2 (Eq. (14)), and pixel width N buffer = r 2 /∆ x (Eq. (15)), to each side of the tile with information from neighboring tiles. This accounts for diffraction from features at the edge of nearby tiles coming into the field of view of the tile being processed.
We use the 3D tiling approach, as described in Algorithm 2. Before conducting numerical experiments using the choice of r 2 = 3.97 λz prop as in Eq. (14), we conducted a validation test using a 256 3 voxel object that was also used in another publication [12]. The object array contains a hollow capillary tube positioned in the middle. We propagated a plane wave through the object, with the object divided into four 3D tiles of 128 × 128 × 256 in a 2 × 2 grid. This way, each tile has a part of the non-vacuum object filling up to its edge; when the buffer zone width is too small, diffraction fringes of the object would wrap around and reenter from the opposite side, causing errors compared to the result given by full-array Fresnel multislice (the reference). We repeated the propagation simulation to sweep the value of w/ √ 2 from 1 to 8, leading to the results shown in Fig. 2. When using w/ √ 2 = 4, which is very close to the value of 3.97 that we have chosen, the mean-squared-error (MSE) of the wavefield moduli between the output of tiling-based propagation and the reference falls to about 1 × 10 −9 . Given that the variance of the reference modulus is 4 × 10 −6 , this is a negligible error and should lead to sufficiently accurate results.  14) gives good results (a mean squared error of 10 −9 compared to a variance in the reference modulus of 4 × 10 −6 ). Shown here is the result of using tiling-based short-distance propagation through a 256 3 voxel object as used in another publication [12]. The object was split into 4 128 × 128 tiles with the "seams" of the tiles running across the object, and buffer zones are added around each tile.

Algorithm 2:
Algorithm for implementing the 3D tiling aproach to tiling-based shortdistance Fresnel multislice. Variable names with lower case n, namely n x,tile and n y,tile , denote the number of tiles in x and y. O g and O l are respectively the global and local (tile) object function, which symbolize both δand β. Similarly, ψ g and ψ l represent global and local wavefields.
describes the propagation of a wave ψ with wavenumber k = 2π/λ through an inhomogeneous medium with refractive index n(x, y, z). To simplify its solution, the wave ψ is separated into two parts: a part u(x, y, z) that varies in the weak refractive medium, and a part exp[−ikz] that is an unmodified forward-propagating wave in the propagation directionẑ. This gives Given the weak X-ray refractive index of Eq. (2), we can assume ∂ 2 u/∂z 2 ∂ 2 ψ/∂z 2 and approximate Eq. (16) with the parabolic wave equation [47,48] If we make the definitions we can write Eq. (18) as The expression of Eq. (21) presents a linear second order parabolic differential equation that describes a boundary value problem. Given that we know u(x, y, z s ) (at the source plane) and require u(x, y, z d ) (at the destination plane), it is more appropriate to rewrite Eq. (21) as an initial value problem [23] so that the equation being solved for at each plane is elliptic. Note that while a more recent formulation of an equivalent to Eq. (22) exists [24], the expression of Eq. (22) is sufficiently accurate for our purposes given the fact that we work at the hard X-ray energy regime. The formulation of Eq. (21) has also been used in prior studies of X-ray wave propagation in thick zone plates [49] and waveguides [23]. We can rewrite Eq. (21) as The expression of Eq. (22) can be discretized by the use of finite difference methods. Traditionally, the space derivatives are evaluated using a central difference scheme and the time integration is performed via implicit methods (where we have defined time to be the coordinate along the propagation axis). As noted in Sec. 1, this finite difference method has been shown to outperform the full-array Fresnel multislice algorithm when comparing compute time for the same degree of accuracy on single node computers [23,24]. The general Helmholtz equation problem is known to be challenging to solve using finite difference methods [50]. Previous implementations have favored methods that only require tridiagonal matrix inversions using the Thomas algorithm [51]. For one-dimensional systems, the Crank-Nicolson method [52] has been used, while two-dimensional problems have been tackled using Alternating Direction Implicit schemes [51,53] where the wave is propagated along one axis at a time to generate the familiar tridiagonal system of equations. The main disadvantage of ADI is poor scalability to large-scale problems [54].
Instead of formulations that require tridiagonal inversions, we employ iterative solvers along with preconditioners to enable the use of the Crank-Nicholson method for both one-and twodimensional problems. As expanded upon in the implementation section, we are not required to program these algorithms since we express the problem using PETSc [28,29] which allows us to compose scalable solvers.
The recent availability of high-level discrete adjoint frameworks [55,56] offers another approach for the optimization problem. These frameworks allow one to access the sensitivity of the parameters necessary for an optimization-based inverse-problem reconstruction algorithm. These automatically generated adjoint solvers utilizing the same mode of parallelism as the equations, and potentially run faster than the forward problem (owing to the properties of adjoints and the fact that the adjoints are implemented as a series of linear solves). This is in contrast to the approach used by algorithmic differentiation [57,58], which operates on low-level operations and therefore does not offer quite as high performance.

Implementation
The full-array Fresnel multislice (Sec. 2.1) and finite difference (Sec. 2.3) algorithms described above have been implemented using the 3.13.1 release of the PETSc/TS framework [28,29] which is designed to support scalable solvers for partial differential equations (with code available [59]). PETSc supports distributed memory computing using the Message Passing Interface (MPI) [60] as well as the use of graphical processing units. The tiling-based short-distance Fresnel multislice algorithm (Sec. 2.2) was implemented in Python (with code available [61]) using the mpi4py package [62,63] for distributed memory parallelism, and the scientific Python stack SciPy [64] for multithread parallelism for each MPI task. All algorithms used the HDF5 library [65] for parallel disk I/O.
PETSc was installed on workstations and clusters using the spack package manager [66] with the Intel Compiler Collection to take full advantage of the underlying hardware. The Intel Math Kernel Library was chosen as the BLAS/LAPACK implementation for optimal performance for all algorithms.
Initial development and debugging was done on a Linux-based workstation "xrmlite." Algorithm composition and tuning for optimal distributed memory performance were carried out on the cluster "bebop," while final scaling studies were performed using the supercomputer "theta," both at Argonne National Laboratory. The characteristics of these systems are listed in Table 1. PETSc does not use multi-threading, but benefits from higher memory bandwidth, which is available on the KNL processor at high process counts. The tiling-based short-distance Fresnel multislice method prefers the number of ranks per node to be a perfect square, which in this case, happens to match the maximum number of physical cores. Thus, for tests of all three approaches on "theta," we set the CPU affinity to "depth", used one thread per rank, and one thread per core. The terminology for the configuration options is given in [67]. We used the balsam workflow manager [68] to pack multiple jobs for queue submission.

Full-array Fresnel multislice
The full-array Fresnel multislice algorithm was implemented using the PETSc [28,71] framework which provides data structures for scalable and efficient linear algebra [72]. The PETSc application programmer interface (API) conceptualizes the fast Fourier transform (FFT) as a matrix multiplication by an "FFT" matrix, where FFT(x) is a matrix multiply A * x, but the A matrix is never explicitly constructed. Behind the scenes, the FFT is executed by the FFTW library [73] on CPUs. This matrix multiply can, however, only be performed on a specific class of vectors, since FFTW has its own requirements for distribution of data. PETSc also includes functionality to either create vectors that conform to the FFTW format (with the correct data distribution and padding), or the ability to scatter data from a regular MPI vector to a FFT-compatible vector.
We choose the "FFTW format" as the data structure for all of the vectors that are used in the full-array Fresnel multislice algorithm. This frees us from the tedious task of performing explicit data restructuring to switch between having the wave be FFTW-aligned and having it be distributed as a regular array for dot products (corresponding to Eq. (6)). The only downside to this approach is the poor scalability of distributed memory FFT for a large number of MPI ranks [40,41].
The above implementation of the multislice algorithm makes it straightforward to carry out the functions described in Algorithm 1 using the PETSc API.

Tiling-based short-distance Fresnel multislice
We used a hybrid programming model combining the message-passing interface (MPI) and multi-threading to implement the tiling-based short-distance Fresnel multislice algorithm. After propagation, the buffer zone of size N buffer on each edge of a wavefield tile is discarded, and the valid region of the wavefield is written directly into the correct position of the output array. The output array is stored in an HDF5 file that is accessed in parallel by all ranks. The HDF5 [65] library was accessed via the Python interface h5Py [74,75]. Distributed memory programming was done via the mpi4Py package [62,63] which provides Python bindings to the MPI standard. Fast Fourier transforms (FFTs) were performed using the Intel-processor-optimized package mkl-fft [76] via its NumPy bindings.

Finite difference
For the finite difference approach, the TS ODE/DAE [29] integrator library (distributed as part of PETSc/TAO) provides a wide variety of scalable solvers for ordinary differential equations (ODEs) and differentiable-algebraic equations (DAEs), obviating the need to write explicit time integration algorithms. Therefore, we chose to implement the finite difference problem as a linear time-step (TS) object in PETSc. To manage the distributed memory grid, we used PETSc's data-management distributed-array (DMDA) object [28] which is designed for optimal performance when using logically rectangular grids (it re-orders the memory mapping to suit typical differential equation solver operations). As mentioned earlier, previous implementations [23,24] of parabolic wave equation solvers relied on algorithms that were not easily scalable to distributed parallel compute nodes. PETSc enables using a wide variety of preconditioners and Krylov solvers which can be tuned to the problem at hand, thus allowing us to design an algorithm with superior performance and scaling characteristics for parallel computing.
The discretization in space was performed via the central differencing scheme for space derivatives, and the time integration was performed by the TS object using either a first or second-order implicit method as dictated by the needs to the problem being solved. Because our eventual goal is to go from solving the forward propagation problem for a particular object guess (the forward problem), to solving for the object (the inverse problem), maintaining large propagation sizes per step is important as this ensures a minimal size of the refractive index grid while still accurately modelling the diffraction phenomenon. For this reason, we did not test explicit methods such as Euler or non-adaptive Runge-Kutta, as these are unstable at large step sizes. While a one-stage second order implicit method (known as the implicit midpoint method) gives the same result as a two-stage second order implicit method (with endpoint, known as Crank-Nicholson), the two-stage method is significantly faster. Therefore, we used the Crank-Nicholson scheme [52] in PETSc.
We used a GMRES linear solver [77,78] with preconditioning determined by the nature of diffracting object. When the object has some order to its structure (such as for simulating the focusing of thick Fresnel zone plates), algebraic multigrid preconditioning [79][80][81] was used. When the object of interest is better characterized as being irregular, we observed that an additive Schwarz preconditioner [82,83] was faster. For elliptic problems, one-level additive Schwarz methods are known to be non-optimal with increasing problem size (hence one needs multigrid methods). Depending on the ratio of the time-step size to the subdomain size squared for parabolic problems, it is possible to show the algorithms are optimal [84]; that is, the coarser level solvers of multigrid are not needed. This phenomena is similar to the fact one can replace the full-array Fresnel multislice method with the tiling-based short-distance Fresnel multislice method.
The general idea of multigrid schemes arises from the observation that low frequency residuals are challenging to eliminate using classical relaxation-based preconditioning schemes. Thus multigrid preconditioning works by transferring the residual to a coarser grid (where the residual now contains high frequency components), solving for which gives an estimate for the error which is then transferred back to the fine grid. Classical geometric multigrid preconditioning [85] uses interpolation operators based on the grid geometry to generate the coarse grids. However, algebraic multigrid preconditioning requires no information about the grid, and constructs coarse grids based on the system of equations being solved [80,81]. The selection of coarse grids (akin to graph partitioning) and construction of interpolation operators (with a Galerkin process) together form the "setup" phase. These coarse grids and interpolation operators can be reused for subsequent applications [80,81] of the preconditioner, thereby amortizing the cost of the setup phase. Algebraic multigrid preconditioning has been shown to work well for discretized Helmholtz operators [86].
The general idea of domain decomposition schemes is to split the task of solving the system of equations (arising from the partial differential equation discretization) from one large domain into smaller overlapping domains [82,83,87]. These sub-blocks are then solved independently and these solutions are then iteratively combined. In particular, we used the restricted additive Schwarz method as a preconditioner, which has been shown to improve performance when compared to using it as a solver [87].

Wavefield convergence metric
In order to numerically compare the step size sensitivity of each method, we measured the minimum number of slices N z,min that each method takes to yield a converged result. In the limit of taking thinner slices (that is, as the number of slices N z is increased towards N z,∞ ), multislice calculations converge on an exit wave ψ d with magnitudes A ∞,i and phases φ ∞,i at pixel positions i. However, if one were to calculate convergence using the phase φ ∞,i of the exit wave ψ d , one would possibly need to use phase unwrapping from the complex wavefield, which can be time-consuming and prone to error. The problem can be circumvented by calculating the root-mean-square (RMS) difference of the complex wavefields. Suppose the magnitude of the wavefield calculated using N z steps is A n,i at pixel i, while the converged magnitude one would obtain using an infinite number of infintitessimally thin slices is A ∞,i ; also, suppose the phases in the two cases are φ n,i and φ ∞,i , respectively. The complex wavefield RMS difference is given by Because phase contrast dominates in hard X-ray imaging, we can assume A n,i A ∞,i Ā everywhere, withĀ exp[−kβt] using the Lambert-Beer law of I = I 0 exp[−2kβt] with the X-ray refractive index n = 1 − δ − iβ [3], andβ indicating the spatial average within the inhomogenous specimen. Since cos(x) ≈ 1 − x 2 /2, Eq. (23) reduces tō This approximation is illustrated in Fig. 3. Since the RMS phase error ξ φ represents the standard deviation of a Gaussian distribution, the net reduction in the summation of amplitudes from many . When obtaining a particular measure of the phase difference φ n,i − φ ∞,i from a complex valuez = A exp[iφ] on the real (Re) and imaginary (Im) plane, one could obtain erroneous values in the case shown where the phase before convergence is reported as π − n,i while the phase after convergence is reported as −(π − ∞,i ), one would obtain an erroneous phase difference φ n,i − φ ∞,i of near 2π. Calculating the RMS difference between complex wavefields (Eq. (24)) circumvents this problem by measuring the end-to-end distance between the red and blue vectors at individual pixels i, a result that does not require phase wrapping. When the moduli A n,i and A ∞,i are similar, the average modulus of the green vector (labeled here asĀ ξ φ using Eq. (24)) is approximately linearly related to the RMS average ofĀ |φ n,i − φ ∞,i | subtended by the blue and red vectors.
waves [3,88,89] is given by exp[−ξ 2 φ /2]. Therefore if we set the requirement we see that we obtain errors in the unattenuated amplitude (Ā 1) of a wave no greater than 4.8%, since exp[−(0.05(2π)) 2 /2] = 0.952. This is far more stringent than the usual Rayleigh quarter wave criterion for tolerance of phase errors. We therefore judge convergence by decreasing the slice thickness ∆ z (and thus increasing the number of slices N z for a given specimen thickness t) until further decreases in ∆ z lead to changes in ξ φ of less than (0.05)2π = 0.31 in accordance with Eq. (25). This gives us a measurement for the number of slices n C required to reach convergence of which we will report for various tests of calculating X-ray wave propagation through thick inhomogeneous media. For cases where the inhomogeneous object is surrounded by a featureless outer border (such as is the case for circular zone plates within a rectangular array), the calculation of the RMS amplitude error ξ complex Ā ξ φ shown below will be from the feature-containing region, with featureless regions excluded.

Experiments
Our goal is to understand the characteristics of the full-array Fresnel multislice, tiling-based short-distance Fresnel multislice, and finite difference methods for propagating large area X-ray wavefields through thick inhomogeneous media. To do this, we carried out numerical tests using two different diffracting objects: a Fresnel zone plate thick enough that waveguide effects become apparent (Sec. 5.1), and a X-ray microtomography reconstruction of a charcoal specimen scaled to match the conditions of nanoscale imaging (Sec. 5.2). In order to understand the relative scattering power of these objects, we calculated the object's RMS phase deviation as whereδ(z) refers to a uniform object with the phase shifting part of the refractive index set to the weighted mean of the refractive indices of the same slice (with axial coordinate z). For our calculations, we assumed a photon energy of E = 15 keV (giving λ = 0.0827 nm), and a transverse calculation grid size or pixel size of ∆ x = 2 nm. Assuming that half-period features can be as small as δ res = ∆ x , one can use Eq. (1) to find a calculation depth of focus of DOF = 26.0 nm, and Eq. (11) to find that the thickness at which the Klein-Cook parameter becomes Q = 0.5 is given by z K-C = 1.54 nm (so that one can assume scalar diffraction, without waveguide effects, within one slice thickness). However, one can in fact use larger slice thicknesses ∆ z and still meet our convergence criteria ξ φ ≤ 0.05(2π) from Eq. (25), as will be shown below.

Fresnel zone plate test object
Fresnel zone plates (Fig. 4) are widely used as X-ray nanofocusing optics [3], since they offer normal incidence mounting and easy energy tunability. For conventional Fresnel zone plates, the spatial resolution is given by δ res = 1.22dr N where dr N is the width of the outermost, finest zone, and for gold zone plates at E = 15 keV a thickness along the X-ray beam directionẑ of about t = 3 µm is required to achieve focusing efficiencies that can in theory be as high as about 25%.
Since we wish to test the ability of different wave propagation calculation methods to account for waveguide effects in thick structures, we chose to simulate a zone plate with a finest outermost zone width of dr N = 20 nm and a thickness of t = 30.81 µm, giving a Klein-Cook parameter (Eq. (10)) of Q = 10 for the zone width rather than the pixel size. For this zone plate, the depth of focus DOF corresponding to the spatial resolution of δ res = 1.22dr N is DOF = 38.7 µm (Eq. (1)), while the distance over which the zone width would produce a value for the Klein-Cook parameter of Q = 0.5 is z K-C = 2.3 µm (Eq. (11)). The magnitude of the exit wave for a zone plate averaged over the central region (≈ 25% side length) isĀ ≈ 0.43 for a plane wave input with a magnitude of unity. Thus, threshold for convergence (Eq. (24)) for the zone plate test object is Within this central region, the diffractive power (Eq. (27)) was calculated to be σ φ = 13.60. Because we wanted to explore the scaling of our calculation with increasing array size (N x , N y ) with constant pixel size ∆ x = 2 nm, at each array size we generated a zone plate with the above minimum zone width dr N and thickness t, but with a diameter d equal to 80% of the array size, or d = 0.8N x ∆ x . We used partial voxel filling of the zone plate material's refractive index to handle the cases where the boundary of a zone plate zone was within a voxel [22]. Since there is no variation of the zone structure along the direction of propagationẑ, we only need to store one two-dimensional array for each (N x , N y ) array which greatly simplifies storage issues. When using a plane wave ψ s for illumination, we would be able to propagate the exit wave ψ d to the focus position f given by which we have done in other studies [22,35]. However, for our present purposes, we just wish to find the minimum number of slices that lead to convergence of the exit wave to approximately the same value obtained with using a much larger number of slices in the calculation: that is, n C as described by Eq. (26).

Porous aluminum test object
As noted above, a Fresnel zone plate provides a structure that can be extended axially to the case where waveguide effects come into play. However, a zone plate is also a highly regular structure, whereas more general specimens in X-ray microscopy are quite irregular. For a test object that more accurately represents objects that are imaged at synchrotron sources, we used part of an X-ray tomographic reconstruction of an activated charcoal sample acquired in a previous study [90], and now available as the dataset "activated-charcoal" in TomoBank [91]. This 4 mm diameter specimen was imaged using 25 keV X-rays, with a reconstruction pixel size of 0.6 µm, resulting in object slices of 6613 × 6613 pixels, and a total number of 4198 object slices (or a 6613 × 6613 × 4198 voxel array). We then generated a baseline phantom from a subvolume from this array in a manner we now describe (and which is illustrated in Fig. 5). Each object slice in the original reconstruction had a slight ring artifact near the rotation axis center due to imperfect alignment of the data on the reconstruction rotation axis, and these rings could have contributed to a cylindrical waveguide artifact in the final phantom. Therefore, a 2448 × 2448 × 4198 voxel subregion was selected that did not include this ring artifact. This subregion was then replicated into a 2 × 2 grid in the plane of the object slices, with pyramid blending [90,92] used at the tile overlaps, and the outermost 15% of the object slice area blended out to vaccum (that is, to a specimen density of zero). This resulted in a 4096 × 4096 × 4198 voxel volume, which was then rotated so that the original tomographic rotation axis became the beam propagation direction. Since the multislice propagation slice thickness ∆ z is usually much larger than the transverse pixel size ∆ x , we then selected 51 tomographic reconstruction object slices (each separated from its neighbor by 50 slices, out of the center 51 · 50 = 2550 slices in the 4198 slice direction) from this volume. This way, the selected slices are sufficiently different to avoid waveguide effects. This yielded a 4096 × 4096 × 51 voxel array, with 4096 × 4096 pixels in the transverse direction and 51 pixels along the beam propagation directionẑ. Finally, the "baseline" phantom in our numerical study is assumed to contain 500 slices of thickness ∆ z = 147.46/500 = 0.295 µm along the beam propagation directionẑ. This (N x × N y × nz) = (4096 × 4096 × 500) voxel baselone object was formed by looping the 51 object slices back and forth (i.e., arranging the slices in the order 1, 2, . . ., 50, 51, 50, 49, . . ., 2, 1, 2, . . . and its repeat). For convergence tests where N z was varied to find n C of Eq. (26) in both Fresnel multislice methods, a smaller number of slices were obtained by linear interpolation of the baseline object along the propagation directionẑ, leading to a modified phantom of 4096 × 4096 × N z voxels. The process used to generate the porous aluminum phantom object (right). A larger scale tomographic reconstruction of an activated charcoal specimen (left) was used as the data source. From the 4198 tomographic reconstruction slices of 6613 × 6613 pixels each, a 2448 × 2448 × 51 voxel subregion was selected through all slices to avoid ring artifacts near the rotation axis. This subregion was then replicated into a 4 × 4 grid in the plane of the object slices, with pyramid blending used at the tile overlaps and the edges blended out to vaccum (that is, to a specimen density of zero). The resulting 4096 × 4096 × 51 voxel array was then rotated so that the original data rotation axis (veritcal, at left) became the beam propagation directionẑ in the phantom object at right, after which both the pixel size and the contrast of the object were modified to yield the porous aluminum phantom object.
The original tomographic reconstruction was acquired using absorption contrast, whereas we wished to simulate a complex object. Therefore we normalized the absorption map so that it had a mean occupancy of 1, and multiplied it by the tabulated value [93] of δ + iβ (Eq. (2)) for aluminum at 15 keV, effectively giving each voxel a different fractional filling with aluminum. The histogram of the resulting densities shown in Fig. 6 reveals that this led to some pixels having unphysically high densities (which we realized after our convergence and scaling tests were complete), but this only serves to make the object have slightly larger refractive properties with no impact on measuring convergence or calculating speeds of the algorithms tested. We also added to each slice 10-20 disk-shaped gold particles with radii ranging from 5 to 20 pixels (10-40 nm), and a thickness of one slice (or ∆ z = 0.295 µm) to create strongly scattering features. This was accomplished by replacing the δ + iβ of aluminum with that of gold at those positions. The total object thickness was set to t = 147.5 µm so that the object gave the same diffractive power σ φ = 13.60 (Eq. (27)) as the zone plate object. The object thus created had a magnitude of its exit wave of A = 0.78 averaged over the central region, leading to a threshold for convergence (Eq. (24)) of We refer to this test specimen as the porous aluminum phantom. Fig. 6. Histogram of voxel densities of the porous aluminum test object. By setting the average occupancy of the charcoal test object to 1 and then multiplying by the refractive index of aluminum, the porous aluminum test object was inadvertently created with voxel densities exceeding the actual density of aluminum. This means that the test object was more strongly refracting than a true aluminum object would be, but this does not affect our measurement of the convergence or timing properties of the algorithms tested.

Convergence results
Our first test was to compare the convergence of the three algorithms as a function of the number of slices N z = t/∆ z (Eq. (9)). For this test, we used the porous aluminum test object with thickness t = 147.5 µm and (4096) 2 transverse pixels. The object was re-sampled along the propagation directionẑ to vary ∆ z and thus N z . The exit wave was then calculated with the three algorithms of Sec. 2 and successful convergence was measured usingĀ ξ φ of Eqs. (24) and (25) using the value [Ā ξ φ ] Al = 0.245 of Eq. (30). As seen in Fig. 7, the Fresnel multislice approaches gave similar values for the minimum number of slices n C = 84 for full-array Fresnel multislice and n C = 90 for tiling-based short-distance Fresnel multislice. The corresponding maximum slice thicknesses of ∆ z = 1.76 and ∆ z = 1.64 µm, respectively, are both well beyond the minimum pixel value of DOF = 0.026 µm noted above. For the finite-difference algorithm, a much larger minimum number of slices of n C = 352 was required, corresponding to ∆ z = 0.42 µm for this irregular object. We also used the porous aluminum object to test the performance of the finite difference algorithm as a function of the transverse array size N x × N y , with results shown in Fig. 8. As the array size was decreased from 4096 2 to 512 2 transverse pixels, the minimum number of slices decreased from n C = 352 to n C = 96, with the n C = 96 result corresponding to a slice thickness of ∆ z = 1.54 µm which is more similar to what is required for the Fresnel multislice approaches.
We then tested the three algorithms on the t = 30.81 µm thick Fresnel zone plate test object, where one would expect to see the finest zone width dr N = 20 nm give rise to depth of focus effects (Eq. (1)) at DOF = 38.7 µm and waveguide effects (Eq. (11)) at z K-C = 2.3 µm. We tested the convergence of all three algorithms as a function of transverse array size as shown in Fig. 9, where the zone plate diameter d (and thus focal length f ) was adjusted to fill 80% of the transverse array size in each case. With this highly regular object, the finite difference algorithm required fewer slices to converge (n C = 8, giving ∆ z = 3.85 µm) while the two Fresnel multislice algorithms required more slices (n C = 21 and ∆ z = 1.47 µm for full-array Fresnel multislice, and n C = 23 and ∆ z = 1.34 µm for tiling-based short-distance Fresnel multislice). Since the finite difference method has been shown to converge quickly in calculations of X-ray waveguides [23,24], it is not surprising that it performs better with the regular structure of thick zone plates. All three cases required slice thicknesses that are within a factor of 2 of the Klein-Cook thickness Fig. 8. Convergence of the finite difference approach as a function of transverse array size for the porous aluminum object. As in Fig. 7, the indicated size of transverse array (ranging from 512 2 to 4096 2 pixels) was extracted from the object, and the total object thickness t = 147.5 µm was bilinearly sampled along the propagation directionẑ to vary N z . For each array size, the minimum number of slices n C (Eq. (26)) was calculated using the convergence threshold [Ā ξ φ ] Al = 0.245 of Eq. (30). As can be seen, the finite difference method converges more quickly with smaller transverse arrays, reaching n C = 96 (with slice thickness ∆ z = 1.54 µm) at 512 2 transverse grid size with this irregular object.
z K-C = 2.3 µm estimated using Eq. (11), and all three cases had convergence properties that did not depend on the transverse grid size.

Scaling results
Having established the convergence of each approach, we then considered performance scalings on parallel computing systems as required for future image reconstruction problems with increasingly large array sizes. Because of the need to adjust transverse array sizes for these tests, we used the Fresnel zone plate test object described in Sec. 5.1, where the zone plate diameter was 80% of the transverse array size. Based on the results shown in Fig. 9, we used n C = 8 slices for the finite difference approach, n C = 21 for the full-array Fresnel multislice approach, and n C = 23 slices for the tiling-based short-distance Fresnel multislice approach. We carried out these tests on the compute system "theta" with properties described in Table 1. The tiling-based short-distance Fresnel multislice approach requires the number of ranks per node to be a perfect square, so we were able to use 64 MPI ranks per compute node, matching the number of compute cores available on each node of "theta." The finite difference approach includes a "setup" phase that constructs a new preconditioner object each time a new calculation size is encountered (for the algebraic multigrid preconditioner, this involves setting up the coarse grids and interpolation matrices as described in Sec 3). This preconditioner can then be reused for subsequent applications on the same array size. In our tests below, a preconditioner was constructed for the first propagation step and used, as is, for all the following steps. In an optimization context where the time-stepping (TS) object (which Fig. 9. Convergence test of the three algorithms for a Fresnel zone plate as a highly regular test object. In all cases, the zone plate thickness was t = 30.81 µm and the minimum zone width was dr N = 20 nm, but the diameter d (and thus focal length f ) of the zone plate was adjusted to match 80% of the transverse array size for 4096 2 , 16384 2 , and 65536 2 transverse pixels, respectively. Using the convergence threshold [Ā ξ φ ] zp = 0.135 (Eq. (28)) for this object to find the minimum number of slices n C (Eq. (26)), all three algorithms had minimum slice numbers n C that were independent of transverse array size and that were within a factor of 2 of the thickness z K-C = 2.3 µm (Eq. (11)) at which waveguide effects would be expected for this specimen. The finite difference method required fewer slices with n C = 8 and ∆ z = 3.85 µm, while the two Fresnel multislice methods required slightly more slices (n C = 21 and ∆ z = 1.47 µm for full-array Fresnel multislice, and n C = 23 and ∆ z = 1.34 µm for tiling-based short-distance Fresnel multislice).
handles the time integration, as described in Sec 3) is re-used, this setup cost would amount to a negligible amount of total runtime.
Our first test was to look at the total computation time for a fixed transverse array size of 32768 2 pixels while increasing the number of computational nodes used. In the case of 100% parallel computing efficiency, this so-called "strong scaling" test would be carried out in the time it takes one node to do the entire problem divided by the number of nodes used. As seen in Fig. 10, the full-array Fresnel multislice approach shows only a modest decrease in compute time when more nodes are used, until at 64 nodes and above the calculation time begins to increase, rather than decrease. The "strong scaling" details shown in Fig. 11 show why: the time for doing a fast Fourier transform (FFT) increases with these many nodes as internode communication speed limits outweigh the gains offered by using more cores when carrying out FFTs on large arrays. Because the tiling-based short-distance Fresnel multislice approach gives each node a subarray to work on with communication only at the calculation start (when each node is given its data) and end (when the full exit wave is assembled), it has improved scaling properties especially in terms of FFT time as shown in the bottom row of Fig. 11. The finite difference method implemented using PETSc takes a longer time even though a smaller number of slices n C are required (Fig. 9), but using more nodes for the same problem gives a more rapid relative decrease in calculation time (that is, better "strong scaling") until there is no further gain when using more than about 100 nodes for this problem size.
Because our ultimate goal is to use parallel computing to calculate X-ray propagation through increasingly large objects, our next test was to consider array sizes that scaled up with increasing numbers of nodes used. Based on the observation in Figs. 10 and 11 that FFT performance decreases when each node is required to work with array sizes smaller than 4096 2 , we used an array size of (4096 √ N nodes ) 2 in this "weak scaling" test. We increased the number of nodes N nodes by factors of 4 (1, 4, 16, . . .) for the full-array Fresnel multislice method in order to have radix-2 array sizes, and also for the finite difference method; since the tiling-based short-distance Fig. 10. Time for calculating the exit wave from the zone plate test object as a function of the number of nodes used. This "strong scaling" test was done with a constant transverse grid size of N x × N y = 32768 2 pixels on the computational cluster "theta" (see Table 1), and using the number of slices n C each algorithm required for convergence to the error tolerance [Ā ξ φ ] zp of Eq. (28) (the resulting values of n C were consistently within 1 or 2 slices of the values shown in Fig. 9). While the finite difference method takes the longest amount of time with a small number of nodes, it benefits the most from increased parallelization so that the calculation time drops significantly by the time 128 nodes are employed. The full-array Fresnel multislice method shows only a modest time decrease as more nodes are employed, until at 64 nodes the calculation time begins to increase due to the requirement for considerable data communication between nodes. Because the tiling-based short-distance Fresnel multislice approach allows each node to proceed through to the exit wave plane before inter-node communication is again required, it takes the least time but after 64 nodes one again sees a slight increase in calculation time if additional nodes are used. Note that 64 nodes corresponds to a transverse array size of 4096 2 pixels per node. Further details on this "strong scaling" test are provided in Fig. 11.  Fig. 11. Further details on the "strong scaling" test results shown in Fig. 10. These tests were of the zone plate test object on a N x × N y = 32768 2 pixel transverse grid. For each of the three calculation methods, we show at top the speedup versus the number of nodes used (with a a linear "perfect scaling" trend showing up as a curved line on this log-linear plot). This shows that the finite difference method has the best scaling to calculation speedup with increased number of nodes. At bottom we show the time required for key operations in the various methods: the time required for a fast Fourier transform (FFT) in the full-array Fresnel multislice and tiling-based short-distance Fresnel multislice methods, and the time for problem setup and then problem solution for the finite difference method. With the full-array FFT approach, the advantage of having more processors is outweighed by data communication overhead when 64 or more nodes are used.
Fresnel multislice approach requires splitting a large transverse array into an integer number of tiles in each direction, for that method, we increased N nodes by a series of perfect squares (1 2 , 2 2 , 3 2 , . . .). As shown in Fig. 12, the total compute time for the full-array Fresnel multislice approach increases dramatically when the transverse array size goes beyond N x × N y = 32768 2 , with Fig. 13 making it clear that this is due to the time of calculating the FFT of a single large array with lots of internode data communication. The tiling-based short-distance Fresnel multislice approach is the fastest in this test, with the FFT time essentially unaffected by overall transverse array size as shown in Fig. 13; this is as expected given that in this approach each node works only on its local "tile" of the larger array. Once again, the finite difference approach is slower than the tiling-based short-distance Fresnel multislice approach, but it also shows more favorable scaling efficiency with larger problem size as shown in Fig. 13.
While the largest transverse array size used in the scaling tests described above was N x × N y = 65536 2 , we also carried out one calculation using the tiling-based short-distance Fresnel multislice approach on a 131072 2 pixel array. This was done on "theta" using 256 nodes and 64 ranks per node. This calculation took 12 seconds for data reading and tile division, 99 seconds for Fig. 12. Time for calculating the exit wave for the zone plate test object as a function of increasing the transverse array size along with the number of nodes, with each node given a transverse grid size of N x × N y = 4096 2 (leading to a net array size of 65536 2 for 256 nodes, as indicated just below the top of the plot). For each algorithm, the number of slices n C was as required for convergence to the error tolerance [Ā ξ φ ] zp of Eq. (28), giving values of n C that were in all cases within 1 or 2 slices of the values shown in Fig. 9. This "weak scaling" test shows that both the finite difference and tiling-based short-distance Fresnel multislice approaches scale well as the problem size increases with the number of nodes used, consistent with the "strong scaling" test results of Fig. 10. With the full-array Fresnel multislice approach, the time required for data communication between nodes for full-array FFTs means that even with many nodes available large problems require considerably more time to compute. writing the array out to an HDF5 file, and 25 seconds for a series of n C = 22 slices, or about 1.1 seconds per slice. Recognizing that the scale of compute power available from "theta" makes this a completely unfair comparison, this is much faster than the 3.6 minute calculation time reported for the equivalent of 1 slice on a standard workstation [27]. We also carried out a tiling-based short-distance Fresnel multislice calculation on a 4096 2 zone plate test object broken into four tiles, with these tiles divided between two graphical processing units (GPUs) on the compute server "xrmlite" (Table 1). In this case, it took 3.68 seconds for CPU-GPU communication, and 0.14 seconds for the FFT calculation, confirming the benefits of GPU based parallelism for FFTs (as expected [94]). Therefore, distributed GPU parallelism is a viable approach for the tiling-based short-distance Fresnel multislice algorithm when the tiles fit in GPU memory.
While the main components of PDE solvers involve sparse linear algebra that are also memorybound due to their low arithmetic intensity [28], optimal design and implementation of solvers for good performance on GPUs [95] remains a challenge that is being currently addressed [96,97]. In particular PETSc's algebraic multigrd (GAMG) can execute the solve phase entirely on the GPUs and there is ongoing development on performing the setup phase on the GPUs as well Fig. 13. Further details on the "weak scaling" test results shown in Fig. 12. These tests were of the zone plate test object with a constant array size of N x × N y = 4096 2 per node, leading to a net array size of 65536 2 for 256 nodes. The top row shows the scaling efficiency for each of the three algorithms, which is the completion time compared to the 1 node result divided by the number of nodes used. The bottom row shows the time for key operations in each method: a fast Fourier transform or FFT for the Fresnel multislice approaches, and problem setup and solution for the finite difference method. As can be seen, the full-array Fresnel multislice approach has especially poor "weak scaling" performance due to the need for internode communcation at each slice position, while the tiling-based short-distance Fresnel multislice approach offers better parallel performance. The finite difference approach takes a longer time, but with less of a decrease in efficiency for larger transverse array size. [97]. These performance benefits are passed onto users and do not require changes to application codes. Thus, we note that our finite difference solver will be able to use GPU parallelism when available with minimal code changes.

Conclusion
To reach the full potential of X-ray microscopy of combining high penetration in thick samples with nanoscale spatial resolution, it will become necessary to compute wave propagation through inhomogeneous media with a very large array size. This can be used both for calculating the performance of X-ray focusing optics, and also for creating a forward model that can be used for image reconstruction using numerical optimization approaches.
Two approaches used thus far in X-ray microscopy are the finite difference method [23,24], and the Fresnel multislice method [18][19][20][21][22]. We have implemented the finite difference method using the PETSc package [28,29], which offers efficient scaling on distributed memory compute systems. For the Fresnel multislice approach, we have used the fast Fourier transform interfaces built into PETSc for the full-array Fresnel multislice algorithm. We have also used mpi4py to implemented a parallelized version of the tiling-based short-distance Fresnel multislice approach that has been developed first for digital holography [27]. All three algorithms can compute moderately large transverse array problems in a reasonable time. In contrast to multislice methods, the finite difference approach requires fewer slices n C for convergence when applied to the highly-regular, strong-waveguide-effect zone plate test object (Fig. 9), with more slices, n C required for the irregular porous aluminum test object. Additionally, for an irregular object, the finite difference approach requires fewer slices to convergence at larger pixel sizes (Fig. 8), which is opposite to the behavior of full-array Fresnel multislice approach. This behavior of the finite difference approach could be used for robust reconstructions at downsampled resolutions.
With the full-array Fresnel multislice approach, one begins to suffer from internode communication bottlenecks at array sizes of N x × N y = 32768 2 on the compute cluster "theta" described in Table 1. The finite difference method and the tiling-based short-distance Fresnel multislice approach offer much better scaling to large array sizes, as shown in Figs. 12 and 13, with the latter approach requiring roughly a third less compute time. We have also tested the tiling-based short-distance Fresnel multislice approach on array sizes as large as 131072 2 with good results. Together, these approaches show that parallelized software packages on high performance compute clusters allow one to calculate X-ray wave propagation in inhomogeneous media with reasonable execution times for very large array sizes. The combination of advances in bright X-ray sources [1] and in high performance computing therefore makes it possible to contemplate nanoscale X-ray imaging of macro-sized objects. The methods tested here show that the forward problem of simulating the exit wave for a guess of the object is computationally tractable, allowing for its use in an optimization approach for object reconstruction.