Memory sparing, fast scattering formalism for rigorous diffraction modeling

The basics and algorithmic steps of a novel scattering formalism suited for memory sparing and fast electromagnetic calculations are presented. The formalism, called ‘S-vector algorithm’ (by analogy with the known scattering-matrix algorithm), allows the calculation of the collective scattering spectra of individual layered micro-structured scattering objects. A rigorous method of linear complexity is applied to model the scattering at individual layers; here the generalized source method (GSM) resorting to Fourier harmonics as basis functions is used as one possible method of linear complexity. The concatenation of the individual scattering events can be achieved sequentially or in parallel, both having pros and cons. The present development will largely concentrate on a consecutive approach based on the multiple reflection series. The latter will be reformulated into an implicit formalism which will be associated with an iterative solver, resulting in improved convergence. The examples will first refer to 1D grating diffraction for the sake of simplicity and intelligibility, with a final 2D application example.

Scattering formalisms are needed for the rigorous diffraction modeling of complex elements, particularly for grating computation, where a widely adopted procedure is the reduction of Maxwell's equations from a system of partial differential equations to a system of ordinary differential equations by projection onto a set of basis functions [3,4,20]. These basis functions are of the type of forward and backward functions (up and down functions, called up and down amplitudes in this paper), decomposing the field into forward and backward propagating, as well as exponentially decreasing and increasing, fractions. Considering these forward and backward functions at certain points in the modulated region and taking into account the boundary conditions, a 1D boundary value problem is obtained. A memory sparing, fast solution of this boundary value problem for a given cluster of scattering items without sacrifice of flexibility is the subject of [18,19], and this paper gives the rationale and the detailed description of the solution, as well as application examples. Here, 'fast' means a complexity linear or almost linear (N N ln( )) with respect to the number N of considered diffraction orders.
There are various established procedures for the scattering based solution of 1D boundary value problems [10,11,[21][22][23]. In the case of the formulation of the 1D boundary value problem with respect to up and down functions, recognized and popular scattering-based algorithms are the S-matrix algorithm [21,22] and the Bremmer series method [23] or rather the 'multiple reflection series method' [24][25][26] (see below).
The S-matrix algorithm operates with reflection and transmission matrices assigned to a stack of layers and requires matrix inversions. It yields reliable results but is of O(N 3 )-complexity [21,22]. After having been used in early publications [27,28], the S-matrix algorithm and related variants spread widely in grating computation, particularly in the nineties [21,22,[29][30][31].
The Bremmer series method [23], also called 'multiple reflection series method' when applied to grating diffraction [24][25][26], uses a different approach: it converts a 1D boundary value problem into a series of initial value problems, and has an explicit, adding-up character. The ith summand in the series corresponds to an i-fold reflected wave. The Bremmer series method operates with internal reflections throughout, and does not eliminate field quantities from the inner observational planes. It has first been formulated for wave propagation in a 1D inhomogeneous medium by Bremmer [23], where the author also proved that the series is a formal solution of the Helmholtz equation. Convergence criteria for the Bremmer series have been given in [32][33][34]. Bremmerlike series converging faster than the Bremmer series have been formulated in [35][36][37][38].
The first use of the Bremmer series in the more general Sluijter-formulation [38] applied to diffraction orders for grating diffraction calculation was published in 1991 by Pai and Awada [24] and called 'multiple reflection series'. The propagation method used at the single scattering layers was RCWA [24]. Neviere and Montiel combined the multiple reflection series with the differential method [25,26]. For this, they divided gratings into layers and inserted infinitesimally thin slices of a homogeneous basis medium between these layers. The diffraction computations at the single layers were conducted by the differential method. The multiple reflection series was then applied for the calculation of the scattering behaviour of the entire grating from the scattering behaviour of these single layers. Li investigated the Bremmer-or, rather, the multiple reflection series-theoretically in [39] and proved the formal equivalence between this series and the S-matrix propagation algorithm: he established the validity of the multiple reflection series as a solution of Maxwell's equations to the grating problem.
We have realized, however, that the Bremmer series method is not an end in itself, and that the above-described development may just be the base for the origination of a highly efficient and flexible scattering formalism, implementing the following two algorithmic key processes: • We diffract the vectors of the up and down functions at the scattering items by a fast method such as [12][13][14][15][16] (here: GSM [16]) instead of utilizing a slow method computing entire S-matrix blocks at O N 3 ( )-complexity and storing them as a preliminary measure as in [24][25][26]. This leads to O N N ln ( ( ))-complexity of the entire method.
• We derive the implicit counterpart to the explicit multiple reflection series. This allows the application of an iterative solver like GMRES (Generalized Minimal Residual Method) [40,41] to the derived underlying linear equation. GMRES converges reliably and quickly in contrast to the explicit multiple reflection series, often exhibiting poor convergence properties [24][25][26]. These poor convergence properties have been the main reason why the use of this procedure has not been common up to now-in contrast to the S-matrix algorithm.
Denominating the above algorithmic construction 'S-vector algorithm' by analogy with the 'S-matrix algorithm' emphasizes that the calculations take place only for a single incident vector of basis functions. The implicit variant of the S-vector algorithm promises to be often more efficient than the S-matrix algorithm of O N 3 ( )-complexity [21,22,27,28], as well as more capable and flexible than the sole use of a fast integral equation-based method like GSM [12][13][14][15][16], which we demonstrate by examples in this manuscript. This paper has the following structure: • Chapter 2 presents the S-vector algorithm in explicit and implicit formulations, concentrating mainly on consecutive procedures in the framework of this publication. This chapter applies to 1D as well as 2D diffraction. • Chapter 3 contains numerical investigations of the convergence behaviour as well as some potential applications. The examples will first refer to 1D grating diffraction for the sake of simplicity and intelligibility, with a final 2D application example. • Chapter 4 concludes.

The scattering-vector algorithm
This chapter will start with a definition of the diffraction problem. Then, the S-vector algorithm is described-at first in parallel, then in detail in a more sophisticated sequential way. Figure 1 illustrates the geometry of the diffraction problem; it does this in 1D for sake of clarity, but the development below applies to both 1D and 2D structures. Light is incident solely from above onto a grating in the x-y-plane. The homogeneous medium on top of the grating is the medium of incidence and also called 'cladding' (cld). The homogeneous medium below the grating is named 'substrate' (sub). The diffracting structure is divided into scattering layers along the z-direction.

Definition of the scattering problem
Inside these layers-meaning for z z z j j 1 < < + -GSM [15,16] is employed in the framework of the S-vector algorithm as one representative of a fast, rigorous scattering calculation method for reason of its availability. GSM works with the concept of a basis medium; in [15,16], this is a homogeneous medium specified by the permittivity j  and the permeability j m within layer j. The basis functions (up and down functions, also called up and down amplitudes) inside the jth layer are defined with respect to this basis medium. In the GSM case, the basis functions are plane waves and the eigenfunctions of the basis medium. Basis functions of such kind have already been applied in [3,21,26] between layers computed by the conventional differential method. The up and down amplitudes are denoted in this chapter by  [15,16]. The entries of z a( ) are the basis functions decomposing the field into upward propagating (and-where occurring-in +z-direction exponentially decreasing) harmonics and downward propagating (and-where occurringin −z-direction exponentially decreasing) harmonics in the basis medium ; j j  m ( ) in the layer j. For z 0  , the basis medium is the substrate and for z h  the cladding.
One basis medium ; j j  m ( ) per layer j is defined in this paper. With a scattering-based formulation of the diffraction problem (see (2)-(5) below) being possible for non-magnetic as well as magnetic scatterers, the S-vector algorithm can be applied in both of these cases. The choice of this basis medium does not affect the correctness of the results, but has an influence on the number of GSM iterations required at the considered layer-see also chapter 3 and [43]. At each border between adjacent layers, the separation of the field into up and down functions as a requirement for the scattering-based calculations has to be enabled-similarly to [26]. This is accomplished by the theoretical insertion of an infinitesimally thin slice of the basis medium applied also inside the adjacent layer below and serves as an auxiliary device here. At the border between two basis media, the field amplitudes are scattered according to the Fresnel equations. The segmentation of the grating into scattering layers (scattering items) can be seen from figure 1: a diffracting region z z z j j 1 < < + together with the lower interface between two adjacent basis media will be condensed into one scattering item here. The uppermost scattering item additionally contains the jump between cladding and adjacent basis medium. The corresponding up and down amplitudes in two adjacent planes z j , z j 1 + can be related by scattering-matrix blocks (S-matrix blocks). ) the transmission matrix block for incidence from above. The order of the superscripts of the S-matrix blocks is arbitrary here; for sake of uniformity, the smaller superscript will be on the right throughout.
Quantities referring to the n x th diffraction order with respect to the x-direction and the n y th diffraction order with respect to the y-direction are labelled by the indices n n , holds, with N x the maximum diffraction order considered in the x-direction and N y the maximum diffraction order considered in the y-direction. In the case of 1D diffraction, N 0 y = and n y = 0 hold for all diffraction orders. The parameter N N N 2 1 2 1 x y + + ≔ ( ) · ( ) denotes the number of diffraction orders in total. There are two k-vectors per diffraction order in the jth basis medium defined by j  , ; , with ω as angular frequency.
The expounded boundary value problem can be summarized by the following system of linear equations: at the substrate, 5 onto a grating of height h. The grating is periodic in the x-direction with period p x , translation invariant in the y-direction in the 1D case or periodic in the y-direction with period p y in the 2D case and is divided into layers along the integration direction z.

S
a a a at layer 0, 10 a a a at layer 0, 11 which can be given to an iterative solver, evaluating (6)- (11) in parallel in each iteration. This parallel procedure has first been described in [18] and then been conducted in [44]. If one CPU-or better one GPU for each layer-is available, the iterative solution of this system of equations is a good option; the computations at the layers within one iteration can then be conducted simultaneously. Note that iterative solvers are applied here at two levels: at a lower level, the single matrix-vector products S a j j j , is not set up explicitly, each of these products means a reflection or transmission computation at layer j by a fast, iterative method such as GSM for a vector of incident harmonics. At a higher level, the boundary value problem (6)-(11) between substrate and cladding is solved iteratively.
In the case of a significantly lower number of processors, one may think of alternative procedures; moreover, an important point is that the convergence when solving (6)-(11) would be rather slow in the case of a large number of layers since these equations relate only quantities in neighbouring planes to each other in one iteration. In the case of s layers, s 1 iterations would e.g. be required until a s  ( ) had an effect on a 1  ( ) . One way out involves the use of an iterative procedure evaluating (2)-(5) consecutively. This means updating a s 1 is established already by one iteration. Parallel calculations at different layers do not take place here; however, there may be sufficient parallelization opportunities at the single layers-depending on the chosen method. This outlined procedure could be named 'multiplicative (Gauss-Seidel) Schwarz-algorithm applied to up and down functions' [11,42,45]. It is the foundation of the Bremmer series [23], or rather the multiple reflection series [24][25][26] (see next section). Unconventional at this stage is the combination of the division of a grating into layers to be computed separately with linear complexity by any fast method or any combination of such methods and the search for the up and down amplitudes between these layers by an iterative solver; the matrix blocks in (2)-(5) stand for an instruction how to propagate the up and down amplitudes.
In this paper, we call the solution of (6)-(11) by parallel calculations of linear complexity at the layers 'parallel S-vector algorithm' and the solution of (2)-(5) by consecutive calculations of linear complexity at the layers combined with immediate updates of the unknowns (see next two sections) 'sequential S-vector algorithm'. The remaining part of the present paper is focussed on the more sophisticated and potentially more efficient sequential S-vector algorithm (which can be derived from (2)-(5)); a numerical comparison between parallel and sequential variants will be given in a future paper.

The explicit formulation of the sequential S-vector algorithm
The stack of scattering layers defined above can be treated by the Bremmer series-or, rather, the multiple reflection series -method. This procedure can be looked up in [23] (Bremmer series method) and its application to grating diffraction in [24,26], where the employed series has been called 'multiple reflection series'. The mathematical equivalence to the S-matrix algorithm under certain conditions and a comparison with it are presented in [39]. The equations (19)- (21) in [39] (listed also below in (12)-(18)) give a concise definition of the multiple reflection series. In appendix A, we present a detailed example of an efficient implementation-also in order to make the reader familiar with our notation and way of thinking.
First, some terminology is introduced. A 'path' is defined here by a sequence of scattering events of an incident harmonic at scattering layers until it exits the grating. A 'scattering event' is defined by an operation S a j j j , Illustratively, a scattering computation can then be understood as a summation over paths. This way of thinking is illustrated in figure 2. The Bremmer series or multiple reflection series method aims at a summation over paths in an efficient, stepwise way and is based on sequential downward and upward propagation.
In view of minimizing the memory demand, the focus is subsequently on the computation of a j The summands here are sorted by an increasing number of reflections. They can be calculated recursively: 'P' stands for 'propagation'. Equations (12)-(18) specify the multiple reflection or Bremmer series. The series (12) converges if there is a norm   · and a real constant 46]; an analogous criterion holds for (13). According to our experience, this is usually fulfilled if at least a constant fraction of the light is coupled out of the grating in each upward or downward propagation step-if numerical errors do not play a role. In practice, the application of a sufficiently accurate and stable propagation method is essential when using the series (12)- (13); this is one motivation for the introduction of the respective implicit formulation (see next section). Our implementation of the multiple reflection seriesbased propagation scheme utilized by the S-vector algorithm in the subsequent parts of this paper is described in appendix A. It has to be emphasized that we do not compute the S-matrix blocks in the framework of the S-vector algorithm in contrast to the multiple reflection series method in [24][25][26]. As well, we do not apply matrix-vector multiplications in case of the S-vector algorithm (see appendix A), which would cost N O 2 ( ) operations. Instead, these matrixvector products have to be understood as instructions to calculate the vector of transmitted or reflected amplitudes at a given layer from a vector of given incident amplitudes by means of a fast rigorous method-e.g. GSM [15,16]. This is the novelty of the expounded explicit procedure at this point compared to the previous procedures described in [24][25][26].
In order to accomplish a more concise notation, alleviating our next steps, the following abbreviations are introduced. Skipping the superscript of the vector a means referencing all planes j s The vector given at the beginning is then written as and the vector of the down amplitudes after step 0 (see figure 2 and appendix A) as The sub-matrix P j l propagates them downwards to the observational planes j, j s where the components of a i ,2  are given by (16) and the components of a in agreement with the arrangement of the sub-vectors in (19).
holds. The matrix P ↑ for 'upward propagation' is defined analogously:  figure 2). The matrix R ↑↓ reflects the down amplitudes at z j for j s 1 1  at the corresponding scattering layers: An efficient implementation of the multiple reflection series ,intm,0 ,start The application of P • has been postponed here to the respective subsequent step, in contrast to (12)- (16).
Since the up and down amplitudes are mutually dependent, we eliminate a Some investigations on the convergence of the Bremmer series are given in [32][33][34], and some practical experiences in the use of this series for rigorous grating computation e.g. in [24]. We expound the explicit formulation here, since we utilize it for the derivation of the implicit variant in the next section, as well as for numerical experiments in chapter 3. We recommend the use of the implicit variant, and will show its superiority over the explicit variant by numerical experiments in chapter 3. The complexity connected with (44) )), depending on the fast method employed for the scattering calculations at the single layers. In this context, it has to be emphasized again that all matrices (operators) here are defined by their action on the considered vectors. They are not calculated explicitly, but stand for a job instruction how to propagate the up and down amplitudes by a fast method. Having

The implicit formulation of the sequential S-vector algorithm
The calculation of C a can be done in a more efficient and viable way than in (42) by switching to an implicit procedure: Since the application of a direct solver, computing the inverse of an explicitly specified matrix at a clearly higher than linear complexity, would not match the concept of vector-based fast calculations here, (52) is solved iteratively. Any iterative solver can be tried. Since GMRES [40,41] is an alreadyestablished solver for the solution of scattering problems in the case of GSM [15,16], it will be applied in the framework of this paper. Compared to other iterative solvers, GMRES has a reliable, optimal convergence-although this is at the expense of a memory demand proportional to the number of iterations [40,41]. Once a ,intm  has been computed, all other quantities can be computed in the same way, as described in the last paragraph of the preceding section by (45)-(49).
Concluding, the implicit formulation (52) is also very memory sparing: instead of both up and down amplitudes at the layer boundaries, as in e.g. (6)- (11), only half of them (merely down amplitudes) are stored during the iterative solution. A summary of the algorithmic steps required by the approach (52) is listed in appendix B. For the sake of completeness, we mention that there is an equivalent to (52) when choosing C a a ,intm ,intm instead of a ,intm  as vector of unknowns, which is obtained from (52) by multiplication with C ↑↓ . The completion, which is required for incidence from both sides, is obtained by the application of the superposition principle.
To our knowledge, the reformulation of the Bremmer (multiple reflection) series method in an implicit way, and its solution by an iterative solver, is new.
We call the solution of (52) by an iterative solver 'implicit variant of the sequential S-vector algorithm' in this paper-provided that the calculations at the single layers are conducted by a method of linear complexity in time and memory; it is an alternative to the solution of (6)-(11) by an iterative solver, which we call 'implicit variant of the parallel S-vector algorithm '.
In what follows, we give the merits of the implicit formulations.
The implicit formulation (52) can also be derived directly from (2)-(5) by inserting a j  ( ) sequentially for decreasing j into the other equations and then inserting a j  ( ) sequentially for increasing j into the other equations; in contrast to the multiple reflection series, this does not require the expandability of an inverse matrix into a series and the conditions connected with this.
If GMRES is taken as iterative solver, the convergence is ensured as long as the matrix on the left side of (52) is invertible (and the computational accuracy is sufficient), since GMRES yields the exact solution of a linear system of equations of full rank, if the number of iterations is equal to the number of unknowns and rounding errors are absent [40,41]. Fortunately, the problem (52) is usually well-conditioned due to its scattering-based formulation. Hence, the number of iterations required for a certain accuracy is normally much lower than the number of unknowns, and rounding errors play a minor role, so that even single precision computations are sufficient. In contrast to this, obtaining results of a desired accuracy is not guaranteed when employing the explicit procedure (44) and truncating the series in practice.
A more subtle advantage of the implicit formulation (52) compared to the explicit procedure (44) is a higher accuracy of the results and, especially, a better stability of the method. The implicit formulation (52) requires propagation of the light only once between cladding, substrate and back in one iteration. So, the propagation is conducted only over a very limited distance. In contrast to this, the explicit procedure (44) requires multiple-fold propagation between substrate and cladding-meaning propagation over a long distance. Since the propagation or integration errors at the layers accumulate, this can lead to a large total error. A very accurate and stable propagation procedure is therefore crucial in this case. We think that the divergence of the multiple reflection series in [26] is caused by such integration errors (although we have not tried to reproduce this computation). We are convinced that this divergence results from a combination of insufficient stability of the applied numerical integration method at the layers with the multiple reflection series (44) as an explicit procedure that leads to a strong error accumulation. Having always combined the S-vector algorithm with GSM [15,16], such divergences are not part of our wealth of experience.

Numerical examples on the basis of the sequential S-vector algorithm and discussion
This chapter investigates the convergence of the sequential S-vector algorithm on the basis of numerical examples (first examples for the parallel S-vector algorithm are given in [44]). Subsequently, two examples for the beneficial use of the superior, reliably converging implicit formulations are given. GSM [15,16] as one representative for a fast, rigorous method is used for the computations at the single scattering layers.
The examples presented here refer to glass gratings. We have also applied the combination of the S-vector algorithm with GSM to gratings of higher refractive index contrast, such as gratings consisting of bumps of chromium and grooves of air, and observed convergence. Yet, GSM is less efficient and therefore less suited for gratings of high refractive index contrast and metal gratings than for gratings of moderate refractive index contrast, since GSM needs a higher number of iterations in such cases. This motivates work on efficient methods for gratings of high refractive index contrast.
With the presented examples referring to non-magnetic scatterers, we choose j v a c m m = (permeability of the vacuum) there.
The computation results for diffraction at 1D gratings have here been checked by comparison with the differential method [3,4] and the ones for diffraction at a 2D grating (beam splitter) by RCWA. For the implicit formulation (52), we have also made comparisons with the sole application of GSM to gratings (without the S-vector algorithm). Here, the differences in the outgoing diffracted complex amplitudes between use of and no use of the S-vector algorithm are roughly of the order of the residuals with which the equations are solved iteratively. Consequently, the use of the implicit formulations does not falsify results, and so the implicit formulations are correct.
All computations here are conducted using double precision variables. However, all inversions are performed merely with single precision: A residual equal to or smaller than 10 −7 is also applied to GSM as termination criterion; 10 −7 is roughly in the order of float precision. The specified termination criteria are sufficient here due to the stability of the employed methods. All inversions here are executed by GMRES as one representative of an iterative solver. The zero vector is taken as initial guess in case of all presented examples.

Convergence behaviour
This subsection investigates the convergence behaviour of the sequential S-vector algorithm under various conditions, and especially compares the explicit formulation (44) with the implicit one (52). The test cases are depicted in figure 3, and are listed below.
(i) The first test case concerns almost perpendicular incidence onto a 1D Ronchi phase grating, which is frequently applied in interferometric setups. The vacuum wavelength is 632.8 nm; vac l = in the case of fused silica as a grating material, this requires a height of h 692.2 nm = of the modulated region. The angle of incidence is 0.7 here, and would avoid an undesired back reflection in practical applications. The modulated region is divided into s=2 layers of thickness h 2 346.1 nm = . Between these layers, there are infinitesimally thin slices of a basis medium (also called background medium) as auxiliary device-see section 2.1 and [26]. The refractive index of the basis medium above a layer j is n 1 1.4571 2 1.2286 ). Generally, a good choice for the refractive index of the basis medium is the average refractive index. here, and in the number of layers, which is s=8. Their thickness is ) and the bumps of air (refractive index n 1 1 = ). The maximum diffraction order is N 40 x = (N y is 0) and the length of period is p 1000 nm x = . TE-polarization is considered. We consider these three test cases in order to investigate the convergence of the S-vector algorithm variants under various conditions: the first case represents the scenario of a low number of rather strongly scattering layers, the second case the scenario of a large number of thin, rather weakly scattering layers and the third case the scenario of a large number of thick, rather strongly scattering layers. A comparison between test cases 1 and 2 allows practical recommendations whether to divide a structure of given thickness into many thin (weakly scattering) layers or few thick (strongly scattering) ones. Test case 3 is listed as an extreme case where two features slowing down the convergence are combined: a large number of layers as well as strong reflections at these layers due to their large thickness. Here, the explicit variant (44) arrives at its limits in practice.
All GSM computations applied to entire gratings as well as single layers have been conducted using a 4th-order integration technique based on midpoint rule and extrapolation (step sizes z 692.  [43]. The convergence behaviour of the two variants (44), (52) is presented in figures 4-6, each for all three test cases and TEpolarization. (We made the investigations for TM-polarization for the considered examples as well; here, the convergence for TM-polarization turned out to be slightly better, which we attribute to smaller reflection coefficients at the layer boundaries for TM-polarization.) In the case of the explicit variant (44), the residual defined in (53) is plotted as a function of the truncation number i max . In the case of the implicit variant (52), the GMRES residual defined in (54) is plotted as a function of the iteration number. Regarding the explicit variant (44) and the implicit one (52), an increment of one on the x-axis means another single-fold application of C ↓↑↑↓ or C  - respectively. Table 1. Diffraction efficiencies as a function of the diffraction order in reflection and transmission for TE-polarization in case of the grating and numerical parameters depicted in figure 3 for h 692.2 nm = . The results have been obtained by a combination of the S-vector algorithm with GSM at the single layers, and checked by the differential method [3,4].  Table 2. Diffraction efficiencies as a function of the diffraction order in reflection and transmission for TE-polarization in case of the grating and numerical parameters depicted in figure 3 for h 2768.8 nm = . The results have been obtained by a combination of the S-vector algorithm with GSM at the single layers, and checked by the differential method [3,4].  A comparison between the explicit (44) and the implicit variant (52), both involving the same matrix C ↓↑↑↓ , shows a much better convergence behaviour of the implicit technique, particularly in test case 3 ( figure 6). The implicit procedure converges super-linearly, whereas the explicit procedure converges maximally linearly, and exhibits a trend towards stagnation in some places. All in all, the benefit of the implicit variant can be seen, as expected.
Next, the division of a grating of given height h into a small number and a large number s of layers are compared with each other. This refers to the comparison of test cases 1 with 2 and thus of figure 4 with 5. The comparison indicates that the division of a grating of given height h into a large number s of layers results in a deteriorated convergence behaviour of all considered variants (44), (52) compared to a division into a small number s of layers. The physical interpretation of this is that the waves are reflected multiple times at the layers before leaving the grating, particularly in the case of a high number of scattering layers. This motivates the choice of a small s 2  (meaning possibly thick layers) andwhere required-the use of a recursive procedure (division of single possibly thick layers in figure 1 into sub-layers to be computed separately and to be combined by the S-vector algorithm to layers). Finally, the effect of the grating thickness h on the convergence is observed. This refers to the comparison of test cases 1 and 3. The scattering layers are identical in both cases, but their number s is different and thus also the height of the entire grating. The comparison of figure 4 with 6 shows that a larger number of scattering layers (or rather a larger height h) results in slower convergence. The physical interpretation here is similar to the one given in the preceding paragraph. So, the efficiency of the applied combination of GSM with the S-vector algorithm has a tendency to decrease with increasing grating thickness (the sole application of GSM has such a tendency here, too). Yet, gratings of a thickness in the order of a few wavelengths are rather rare in practice-apart from waveguides (or rather fibers), where modal methods are a good choice.
All in all, the implicit variant of the S-vector algorithm converges reliably and very well.

The S-vector algorithm as a means of enhancing the flexibility of integral equation-based methods
This section outlines some potential applications of the S-vector algorithm beyond a sole reduction of the memory demand. A feature of the S-vector algorithm is its flexibility and generality; the requirements for its application are the presence of two or more scattering items and the representation of the field at or between these items in terms of forward and backward functions-e.g. plane waves, modal basis functions, spherical waves etc. Hence, it is particularly suited for the combination of scattering layers of different material and shape or of a layer with scattering particles on top of it.
In the framework of this paper, GSM [15,16] based on an integral equation is applied for the computations at the layers. Methods based on integral equations have mainly been connected with the application of equidistant sampling [12][13][14][15][16] for reasons of simplicity and efficiency until now. The radiation operators in [12][13][14][15][16] have a Toeplitz structure, then. Here, the S-vector algorithm can be used for an adaptation of numerical parameters like the integration step sizee.g. a smaller step size for metallic layers than for dielectric ones. It can be used, as well, for precise adaption of the integration mesh to the thicknesses of the layers. Besides, it allows the individual choice of the basis medium for each layer.
The following simulation example refers to a 1D 4-level phase grating with layers of different thicknesses due to fabrication errors (figure 7). Application of equidistant meshing between substrate and cladding with a constant step size of a few nanometers seems not advisable here, since the heights of the stairs differ from each other. This motivates an individual meshing at each layer, and a combination of the layers by the S-vector algorithm. Moreover, all GSM computations at each layer are conducted twice here with different step sizes z D (see figure 7 on the left, as well as the corresponding caption) and the results are extrapolated as in the previous subsection for an accuracy enhancement. The basis medium is specified for each layer separately (see refractive indices n b in figure 7). The segmentation of the grating into scattering items can be seen from figures 1 and 7: the uppermost scattering layer is comprised of the transition between cladding and basis medium of refractive index n b,3 , the uppermost binary profile and the transition from the basis medium of refractive index n b, 3   profile and the transition from the refractive index n b,1 to the substrate.
The number of S-vector iterations is merely seven for both polarizations, using the implicit formulation (52). The diffraction efficiencies for the grating depicted in figure 7 are listed in table 3.
The main purpose of the preceding sections was the investigation and explanation of basic features and aspects of the S-vector algorithm; this took place on the basis of 1D examples. In the next section, we turn to an application example in 2D in combination with computations on a graphic card. The method applied at the layers remains GSMbut now in the formulation for 2D diffraction [16].

The S-vector algorithm as memory sparing technique facilitating GPU computations of 2D grating diffraction
In recent years, graphic card (GPU) computations have become an appealing alternative to central processor (CPU) computations. Yet, a critical point even nowadays is memory limitation on GPUs (4 GB in case of the GPU used in this paper). A higher memory demand necessitates the use of the RAM on the motherboard, and therefore time-consuming data exchanges between GPU and motherboard. In the S-vector implementation utilized in this section, we conduct the computationally heavy GSM computations at the single layers of a 2D grating on a GPU and use the RAM only when the GPU memory becomes insufficient. These are the places where the RAM usage increases rapidly in tables 4 and 5. The S-vectors are always stored in the RAM, hardly occupying any memory (see table 5 for 51 2 and 101 2 diffraction orders). The implicit formulation is applied and tested.
The subsequent example shows that the memory demand of GSM can be strongly reduced by the S-vector algorithm. Normal incidence of light of the wavelength 1000 vac l = nm onto a beam splitter of 2×2 pixels is considered (see figure 8). The incident electric field at h 0; 0; ( ) is parallel to the y-axis. The pixels as well as the substrate consist of glass of n 1.5 sub = . The periods are p p 7.5294 x y = = μm. The height h=1000 nm would effect a phase shift of π in the ray optical limit at the glass pixels, and thus a diffraction efficiency of 0 in the zeroth order. The rigorously computed diffraction efficiency in the zeroth order is 0.02016; the efficiency in the orders (−1, −1), (−1, 1), (1, −1), (1, 1) is 0.15655 each, summarizing TE-and TM-portions.
We conduct the numerical integration with 120 and with 150 sampling points equally distributed over the height h and extrapolate for the limit of an infinite number of sampling points. Subsequently, we compare the sole use of GSM (see table 4) with the S-vector algorithm combined with GSM (see table 5) for an increasing number of diffraction orders. The number of S-vector layers is three here, each one containing 150/3=50 and 120/3=40 sampling points along the z-direction.
The first line of table 4 shows a very high number of GMRES iterations (above 300) required by GSM. In combination with a high number of integration sampling points (150 and 120), the memory demand of GSM is very high. So, the RAM is utilized to some extent already for 51 2 diffraction orders (per polarization). Only the circumstance that our workstation has 64 GB RAM allows the computation of 151 2 diffraction orders; 201 2 orders cannot be computed any more. The 4th line of table 4 demonstrates that the computation time increases faster than linearly with the number of diffraction orders, due to an increasing use of the RAM and time-consuming data exchanges between GPU and RAM. This occurs already from the beginning of our test with 51 2 diffraction orders. The first line of the second table shows a much lower number of iterations required by GSM at a single layer (not more than 100 iterations, since the layers have only a thickness of h 3). In combination with a much lower number of integration sampling points (50 and 40), the memory demand is lower here by about a factor of nine, compared to the sole use of GSM. Hence, the RAM utilization is much lower. No RAM is used for GSM computations at single layers for up to 151 2 diffraction orders. The column belonging to 101 2 diffraction orders shows that the RAM occupancy by the S-vector algorithm is negligible here; the respective down amplitudes are considered merely in the two planes between the three layers in figure 8. A computation with 501 2 orders (per polarization) is possible. This is a considerable step towards the computation of large 2D objects. The deviation of the sum of efficiencies from 1 is merely 7 10 8 here, which is in the order of the residuals of 10 −7 with which all the systems of equations are solved here. This is further evidence for the correctness of the methods applied here. For comparison, we have conducted an RCWA-computation of the 2×2-beam splitter with 51 2 diffraction orders on a CPU (not using any GPU), which may be considered as a standard. The memory demand has been 13.7 GB here and the computation time 1.8 hours. A computation with 101 2 diffraction orders has not even been possible on our workstation, due to a lack of memory.

Conclusion
A novel scattering formalism called 'S-vector algorithm' has been presented in its consecutive variants (explicit and implicit). The new formalism can be considered as a fast, iterative counterpart to the S-matrix algorithm of O N 3 ( )-complexity, and as a complement to existing fast scattering formalisms based on integral equations.
The new formalism differs from the multiple reflection series [24][25][26] by two significant improvements: firstly, the vectors of up and down functions are diffracted at the scattering items by a fast method (here: GSM) instead of a slow method computing entire S-matrix blocks with O N 3 ( )-complexity as a preliminary measure. This results in O N N ln ( ( ))-complexity of the method overall. Secondly, the implicit counterpart to the multiple reflection series has been derived. This allows the application of an iterative solver like GMRES, and ensures convergence (provided that the method for the scattering computations at the single scattering items converges). The very good convergence, particularly in the case of a low number of layers, also suggests iterative solvers different from GMRES here.
The computational load of the entire calculation procedure is proportional to the number of S-vector iterations, as well as to the number of iterations at the single layers; it is dominated by the complexity of the method employed at the single scattering objects, which should ideally be linear with a small constant. The memory demand of the S-vector algorithm is linear with respect to the number of considered basis functions, and proportional to the number of iterations of which the result is stored by the iterative solver.
Some examples for the use of the presented method have been given here: the S-vector algorithm allows a drastic reduction of the memory consumption (at the expense of a higher number of operation counts) by division of gratings into layers to be computed separately. This makes it especially appealing for graphic card computations; in the case of computations of large 2D gratings (e.g. with 501 2 harmonics) on graphic cards, the memory demand is indeed the limiting criterion. Another application example is non-equidistant meshing in combination with higher-order integration. The expounded method also allows the combination of scattering objects of different shape (e.g. spherical ones), size or material, suggesting utilization of different methods or the adaptation of numerical parameters to these objects. Such parameters are e.g. integration step size and basis medium in the case of use of GSM.
Summarizing, the S-vector algorithm extends the possibilities of fast grating computation by enabling more flexibility. We think that the proposed method will gradually become more important with an increasing variety of fast methods available, and a growing demand for scattering computations at complex-shaped, large scattering objects.

Acknowledgments
The first author is grateful to Prof N Lindlein, supervisor of his PhD thesis at the University of Erlangen-Nuremberg, and his group for the constant support as well as the funding of his PhD thesis. The commitment of Prof Emeritus O Parriaux, Saint-Etienne University, and Prof Gérard Granet, Clermont-Ferrand University, in reviewing the manuscript is acknowledged. This work was supported by the LABEX MANUTECH-SISE (ANR-10-LABX-0075) of Université de Lyon, within the program 'Investissements d'Avenir' (ANR-11-IDEX-0007) operated by the French National Research Agency (ANR). Table 3. Diffraction efficiencies as a function of the diffraction order in reflection and transmission for TE-and TM-polarization in case of the grating and the numerical parameters depicted in figure 7. The results have been obtained by a combination of the S-vector algorithm with GSM at the single layers and been checked by the differential method [3,4].  . All other a j ,1  ( ) will follow from the algorithm due to (A.8) and increasing j.
For all j from j=1 to j s 1 = -: • add buffered Step 2: Propagation of the double-fold reflected amplitudes downwards to the substrate (see . All other a j ,2  ( ) will follow from the algorithm due to (A.12) and decreasing j.
For all j from j s 1 =to j=1: • add buffered fulfilling j s 1 1   -).