Higher-order wide-angle split-step spectral method for non-paraxial beam propagation

We develop a higher-order method for non-paraxial beam propagation based on the wide-angle split-step spectral (WASSS) method previously reported [Clark and Thomas, Opt. Quantum. Electron., 41, 849 (2010)]. The higher-order WASSS (HOWASSS) method approximates the Helmholtz equation by keeping terms up to third-order in the propagation step size, in the Magnus expansion. A symmetric exponential operator splitting technique is used to simplify the resulting exponential operators. The HOWASSS method is applied to the problem of waveguide propagation, where an analytical solution is known, to demonstrate the performance and accuracy of the method. The performance enhancement gained by implementing the HOWASSS method on a graphics processing unit (GPU) is demonstrated. When highly accurate results are required the HOWASSS method is shown to be substantially faster than the WASSS method. © 2013 Optical Society of America OCIS codes: (000.4430) Numerical approximation and analysis; (080.1510) Propagation methods; (080.1753) Computation methods; (350.5500) Propagation. References and links 1. G. R. Hadley, “Multistep method for wide-angle beam propagation,” Opt. Lett. 17, 1743–1745 (1992). 2. K. Q. Le, R. Godoy-Rubio, P. Bienstman, and G. R. Hadley, “The complex Jacobi iterative method for threedimensional wide-angle beam propagation,” Opt. Express 16, 17021–17030 (2008). 3. Y. Y. Lu and P. L. Ho, “Beam propagation method using a [(p−1)/p] Padé approximant of the propagator,” Opt. Lett. 27, 683–685 (2002). 4. A. Sharma and A. Agrawal, “New method for nonparaxial beam propagation,” J. Opt. Soc. Am. B 21, 1082–1087 (2004). 5. A. Sharma and A. Agrawal, “Non-paraxial split-step finite-difference method for beam propagation,” Opt. Quantum. Electron. 38, 19–34 (2006). 6. C. D. Clark and R. Thomas, “Wide-angle split-step spectral method for 2D or 3D beam propagation,” Opt. Quantum. Electron. 41, 849–857 (2010). 7. M. Guizar-Sicairos and J. C. Gutiérrez-Vega, “Computation of quasi-discrete Hankel transforms of integer order for propagating optical wave fields,” J. Opt. Soc. Am. A 21, 53–58 (2004). #187594 $15.00 USD Received 25 Mar 2013; revised 24 May 2013; accepted 14 Jun 2013; published 25 Jun 2013 (C) 2013 OSA 1 July 2013 | Vol. 21, No. 13 | DOI:10.1364/OE.21.015815 | OPTICS EXPRESS 15815 Distribution A: Approved for public release, distribution unlimited (approval given by Public Affairs Office TSRL-PA-12-0068 8. W. Magnus, “On the exponential solution of differential equations for a linear operator,” Comm. Pure Appl. Math. 7, 649–673 (1954). 9. M. Bauer, R. Chetrite, K. Ebrahimi-Fard, and F. Patras, “Time-ordering and a generalized Magnus expansion,” Lett. Math. Phys. 103, 331–350 (2012). 10. M. J. Adams, An Introduction to Optical Waveguides (Wiley, 1981).


Introduction
Beam propagation methods (BPM) are a large class of numerical methods for solving the scalar Helmoltz equation, and are popular for simulating guided waves and laser beams, as they are typically both fast and efficient.Early methods made use of the paraxial approximation, which greatly simplifies the problem by reducing the propagation equation to first order.However, these methods were severely limited in their application.Any beam profile containing spatial frequencies with angles greater than a few degrees, with respect to the propagation axis, incur significant phase errors.Many methods have been developed to drop the paraxial approximation and include wide-angle waves.In several of these, the Helmholtz equation is formally rewritten as a first-order differential equation which includes the square root of an operator.The square root operator is either approximated using real or complex Padé approximants and a finitedifference or iterative method is used to solve the equation [1,2], or the analytical solution is found which results in an exponential of the square root of an operator that is approximated with a Padé approximant [3].Recently, Sharma extended an operator-splitting technique used on the paraxial wave equation to the non-paraxial wave equation (Helmholtz equation) [4].The splitting allows diffraction and the refractive index variations to be handled separately.Various numerical methods can be used once the operator has been split, such as collocation or finitedifference [4,5].
In a recent publication, we described a numerical beam propagation method that represents the beam profile in the basis of the eigenvectors of the Laplacian operator and uses a symmetric operator splitting technique to account for the refractive index variations, known as the wideangle split-step spectroscopic (WASSS) method [6].In general, the method provided a two-fold speedup to the finite-difference method reported by Sharma [4] This improvement could be increased by use of a fast Fourier transform algorithm.Here we develop a higher-order WASSS (HOWASSS) method that extends the approximation to higher-order, providing a more efficient method when high accuracies are required.We apply the HOWASSS method to the problem of waveguide propagation to demonstrate the performance and accuracy of the method.An additional performance enhancement is obtained by implementing the method on a graphics processing unit (GPU) using compute unified device architecture (CUDA) technology from NVIDIA ™ .

Formulation
Beam propagation in a medium with a non-uniform refractive index is described by the scalar Helmholtz equation, where ψ (z, r) is the complex scalar electric field, z is a Cartesian coordinate in the direction of propagation, r are the transverse coordinates, and ∇ 2 r is the transverse Laplace operator.As usual, k 0 is the free space wavenumber, n (z, r) is the refractive index distribution, and n2 = min n 2 (z, x) .For simplicity we will only consider the two-dimensional (2D) case, however generalization to three dimensions (3D) follows quite trivially.In two dimensions, the electric field is denoted ψ (z, x).Note that x is not necessarily a Cartesian coordinate but could, for example, be a radial coordinate.Expanding ψ (z, x) in terms of the eigenfunctions of the transverse Laplace operator we write where Multiplying the 2D form of Eq. ( 1) by φ * j (x), integrating over the transverse coordinate, and applying Eqs. ( 3) and ( 4) we are left with To discretize the problem space, let N x and N z be the number of discrete grid points in x and z respectively, and let i, j ∈ [1, N x ], and l ∈ [1, N z ].We let x i = iΔx + x 0 and z l = lΔz + z 0 , where Δx and Δz are the corresponding step sizes.The matrix N (z) is defined by The discrete eigentransform matrix, S [6], transforms a vector into the spectral basis, while S −1 transforms vectors back into the spatial basis.In the case of Cartesian coordinates with hard boundary conditions this matrix becomes the discrete Fourier transform matrix, and in cylindrical coordinates with azimuthal symmetry and hard boundary conditions, S takes the form of a discrete Hankel transform [7].We also must define the constant matrix M with elements Notice that if λ i > k 0 n then M i, j becomes imaginary.If we allow eigenvalues such that λ i > k 0 n, then the trigonometric functions appearing in P will become hyperbolic (see Eq. ( 25)), making the method numerically unstable.For this reason, we will limit N x to satisfy λ i < k 0 n.However, we must include at least enough eigenfrequencies to sufficiently represent the initial conditions, otherwise large errors will be incurred.Using these definitions Eq. ( 5) becomes Now if we define we can write the Helmholtz equation as a first order vector differential equation The exact solution to Eq. ( 11) is given by a Magnus expansion [8], The first two terms of the expansion are Here [H (t 1 ) , H (t 2 )] is the usual commutator.Note that Eq. ( 11) is essentially the timedependent Schrödinger equation, in which context the Magnus expansion is more well-known as the time-ordered exponential operator [9].However, we will refer to Eq. ( 12) as a Magnus expansion to be consistent with discussions concerning initial value problems of the form of Eq. ( 11).H (z) can be written as the sum of two matrices, one that is constant and one that depends on z, Using this and the trapezoid rule to approximate the integral in Eq. ( 13) we obtain, To approximate Ω 2 (z l+1 , z l ) we first need to compute the commutator.After making use of Eq. ( 15) and simplifying we find Note that [H 2 (z 1 ) , H 2 (z 2 )] = 0. We apply the trapezoid method twice to approximate the double integral in Eq. ( 14) and obtain The next term in the Magnus series will contribute a factor of (Δz) 3 as it involves a triple integral.Keeping terms up to (Δz) 3 Because this contains the exponential of a dense matrix, which is difficult to handle numerically, we will split this exponential into a form that will be easier to work with via the symmetric exponential splitting technique exp First, we split the Δz terms in the exponential from the (Δz) 2 terms.Then we split the terms involving H 1 from those involving H 2 , and we are left with where Each operator P, Q (z), and C (z 1 , z 2 ) can be expressed by a 2 × 2 block matrix where the matrices within each operator are diagonal.This allows the exponential operators to be simply evaluated by expanding the exponential into its power series Now all matrices inside of functions are diagonal making them quite simple to numerically evaluate.In this form the physical interpretation of these operators is most transparent.P is the operator corresponding to a propagation of the pulse through a homogenous medium with an index of refraction of n.Q(z) takes into account the difference n(z, x) − n, while C(z 1 , z 2 ) depends on the change in the refractive index over the step taken.Thus, in this sense we would like to draw the analogy that Q(z) acts like the constant term of a Taylor's series expansion, and C(z 1 , z 2 ) acts in a manner similar to the first derivative term in such a series.We can save some additional matrix-vector multiplications by combining

Numerical example
To test our method we will simulate beam propagation through a two-dimensional (Cartesian) symmetric Epstein-layer waveguide tilted at an angle θ from the positive z axis [5].Hard boundary conditions where ψ (z, x 0 ) = 0 and ψ z, x f = 0 are assumed.The eigenfunctions of the transverse Laplace operator are then given by The resulting eigentransform matrix is given by the discrete Fourier matrix This particular Fourier matrix was chosen because the hard boundary conditions at x 0 and x f are automatically satisfied.We do not need to include these points in our grid, defined by where x is a shifted coordinate to align the waveguide in the center of our computational region to make the hard boundary conditions as negligible as possible, Δn is the height of the refractive index shift, w is the width of the waveguide.The shifted coordinate is given by x The initial electric field is given by the zeroth order mode of the Epstein-layer waveguide Here, i is the unit imaginary number, not to be confused with the counting index used elsewhere.
W and K 0 are given by The exact solution for this tilted Epstein-layer waveguide is [10] ψ e (z, x) = sech W 2 x cos (θ ) − z sin (θ ) To measure the error we use the correlation factor, as this provides a measure for both the profile shape and amplitude of the beam [5,6].C and CUDA versions of both the WASSS and HOWASSS methods were implemented.The code was compiled using nvcc version 4.2 for CUDA code and gcc version 4.6.3 for the C code.The code was run on a workstation equipped with an Intel Xeon X5960 CPU, which is a hyperthreaded six-core CPU clocked at 3.47 GHz with 48 GB of RAM, and a NVIDIA QUADRO 6000 GPU, which has 448 CUDA cores at 574 MHz core clock speed (750 MHz memory clock speed) and 6 GB of dedicated GPU DDR5 RAM.
The simulation was run using the parameters listed in Table 1.The value of k 0 was selected to ensure that we could include 1000 transverse modes while still keeping M real-valued.When the waveguide was aligned (θ = 0 • ) roughly an order of magnitude decrease in the error was observed with the HOWASSS method over the WASSS method (see Fig. 1).However, the improvement was more pronounced when we tilted the waveguide at an angle of 50 degrees, especially for large step sizes (see Fig. 2). Figure 3 shows the calculated electric field next to the exact solution to further illustrate the accuracy of the HOWASSS method.The simulations shown in Figs. 1, 2, and 3 were run using double precision accuracy numbers on the GPU.Both methods were capable of producing accurate results using single precision arithmetic, until about Δz = 0.05 μm.At this point, round-off error began to affect the results of the HOWASSS method due to the increased number of matrix-vector multiplications required in each time step.The WASSS method does not suffer as quickly from this problem because it is computationally more simple.The limit where round-off error begins to affect the double precision code was not observed for either method.

Stability
By virtue of being a higher order method, the HOWASSS method has improved stability over the WASSS method (see Fig. 4).As we increase the tilt of the waveguide, the WASSS method performs more poorly as the angle increases, and at large step sizes even shows signs of being a bit unstable.However, the HOWASSS method actually becomes more accurate at larger angles.
Traditionally, beam propagation methods struggle to obtain accurate results when there are rapid changes in the index of refraction.To simulate a rapidly changing index of refraction we increase the change in the index between the waveguide and the surrounding medium, Δn, while keeping the width of the waveguide fixed.We see that at a 50 • waveguide tilt the WASSS method actually becomes unstable with a large stepsize, while the HOWASSS is able to remain stable for at least an additional order of magnitude increase in Δn.However, both methods do loose accuracy as the change in the refractive index becomes steeper, but by decreasing Δz accurate results can still be obtained in a reasonable run time.This same analysis was done for a 0 • tilted waveguide and the results were quite similar and so are not shown here.

Speed
Computationally, the HOWASSS method requires the multiplication of matrices with vectors.If we restrict N x so that the matrix M is real then, for the example in Section 3, all the matrices will be real, however the vector A will be complex.In general, the eigenfunctions and the refractive index could be complex.However, for simplicity we will assume these to be real.Diagonal matrix multiplication is equivalent to element-by-element vector multiplication.Hence, multiplying an N x × N x diagonal matrix by an N x element complex vector requires 2N x operations.Multiplying an N x × N x dense matrix by an N x element complex vector is equivalent to performing 2N x dot products, each requiring N x multiplications and N x − 1 additions, for a total of 2N x (2N x − 1) = 4N 2 x − 2N x operations.Applying P requires a total of 4 diagonal matrix-vector multiplications for a total of 8N x operations.Applying Q (z l+1 ) Q (z l ) requires 2 diagonal matrix-vector multiplications, 2 dense matrix-vector multiplications, 2 vector-vector additions, and 1 scalar-vector multiplication for a total of 8N 2 x +6N x = 2N x (4N x −3) operations.Applying C (z l+1 , z l ) requires 4 dense matrix-vector multiplications, 1 vector-vector addition, and 2 element-by-element exponential operations (assumed to only be 1 operation per element) for a total of 16N 2 x − 2N x = 2N x (8N x − 1) operations.P is applied four times, Q (z l+1 ) Q (z l ) is applied twice, and C (z l+1 , z l ) is applied once each step, giving a total of 32N 2 x + 42N x = 2N x (16N x + 21) operations.Following the same logic for the WASSS method, we find that it requires 8N 2 x + 6N x = 2N x (4N x + 3) operations.Note that this operation count differs from the one reported by Clark and Thomas [6] because we are counting both multiplications and additions in this calculation.
Furthermore, to leading order in N x , the HOWASSS method is approximately a factor of 4 slower for a given Δz (see Fig. 5).However, the initial hypothesis was that a significant improvement in accuracy may lead to a more efficient algorithm.To test this hypothesis, we executed the HOWASSS and WASSS methods using the GPU implementation with various step sizes, holding all other aspects of the problem constant.Figure 6 summarizes the result.For a propagation length of 100 micrometers, and a waveguide tilt angle of 50 degrees, we show the compute time per micron of propagation as a function of the maximum absolute error.For two different values of N x , the efficiency at constant error is better for the HOWASSS method, as indicated by shorter compute times.In fact, for error values smaller than about 10 −4 , the HOWASSS method is much better, with efficiency rapidly exceeding an order of magnitude for absolute error values of less than 10 −6 .The data presented in this paper was generated using double precision arithmetic.This does not result in a significant difference in the run time when the code is run on a CPU.However, GPUs are intrinsically designed to deal with single-precision arithmetic.In order to compute one double-precision number the GPU must perform two single-precision calculations and some additional overhead to combine the result, leading to, at a minimum, a factor of two speed-up when single-precision numbers are used.This speed-up also depends on the specific graphics card used, as some have less capability than others when it comes to double precision arithmetic.Rounding errors can affect the HOWASSS method if N z is large, around 2000 for the specific example considered in this paper.If the application does not require extreme accuracy, then a substantial speed-up can be achieved by using single-precision arithmetic on the GPU.

Parallelization
Both the WASSS and HOWASSS methods readily lend themselves to parallelization.For our implementation we chose to use both OpenMP, to make use of multi core CPUs, and NVIDIA compute unified device architecture (CUDA) to make use of the processing power of the graphics processing unit (GPU).Modern GPUs possess orders of magnitude more computational power than the typical CPU.However, to completely utilize this power the algorithm must possess a very high level of parallelism to completely saturate the many processing units of the GPU.
Unfortunately, neither the WASSS nor the HOWASSS method are ideal for utilizing the full power of the GPU because many of the matrix-vector multiplications involve diagonal matrices.Faster than their dense counterparts, these computationally reduce to simple elementby-element vector multiplication.While this is a perfectly parallel operation, it does not offer the large number of independent calculations needed to saturate the GPU.These element-byelement vector multiplications suffer additionally from the fact that they require two memory reads and one memory write to compute only one multiplication.On the GPU reading and writing to memory is much slower than math operations.Given this, there is still a significant speed-up when the code is run on the GPU versus on a single core of the CPU (see Fig. 5).This speed-up was accomplished with little attention payed to optimization of the code.With further optimization an additional factor of two or more would be likely.Additionally, extending this technique to even higher-orders might yield additional improvements.

Generalizations to other coordinate systems, boundary conditions, and higher dimensions
Both the WASSS and HOWASSS methods can be generalized to different coordinate systems or different boundary conditions.In fact the only complication is that the eigenfunctions and eigenvalues must be known.For a more detailed discussion on this topic see Clark and Thomas [6].

Conclusion
We have presented a method that is more efficient than previous non-paraxial beam propagation methods.The method casts the analytic solution of the Helmholtz equation as a Magnus expansion.Keeping terms up to (Δz) 3 in the Magnus expansion we use a symmetric operator splitting technique in order to analytically reduce the exponential matrices into a more simple form.The solution to the Helmholtz equation is then approximated via straightforward matrix multiplication.We have demonstrated the method in a simple geometry with 2D Cartesian coordinates with hard boundary conditions.The results obtained show our higher-order approach significantly improves the overall efficiency, when measured as compute time per distance of propagation, in cases where high accuracy results are required.The method can be easily extended to more generalized coordinate systems, higher dimensions, and various boundary conditions, provided that the eigenfunctions and eigenvalues of the transverse Laplace operator can be found for that geometry.

#Fig. 1 .
Fig.1.Error as a function of propagation distance for an aligned waveguide with N x = 1000 for the (a) WASSS and (b) HOWASSS methods.Note that we have sampled the error every 0.5 mm, and not at every z step calculated, because the rapid oscillations would make the graph difficult to read otherwise.

Fig. 2 .||Fig. 3 .
Fig. 2. Error as a function of propagation distance for a waveguide rotated 50 degrees with N x = 1000 for the (a) WASSS and (b) HOWASSS methods.

Fig. 4 .Fig. 5 .
Fig. 4. (a) The maximum error obtained as a function of waveguide tilt angle showing that the HOWASSS method actually gains a small amount of accuracy at larger angles.(b) The maximum error obtained as a function of waveguide depth, Δn.

#Fig. 6 .
Fig. 6.Comparison of the HOWASSS method and WASSS method compute time per propagation distance with N x = 1000 and N x = 2000.Both methods were run on the GPU.