All-at-once solution of linear wave equations

Efficient parallel-in-time methods for hyperbolic partial differential equation problems remain scarce. Here we investigate an approach based on circulant preconditioned generalised minimal residual (GMRES) for the monolithic block Toeplitz equations which arise from constant time-step discretizations. We present theoretical results which guarantee convergence in a number of iterations independent of the number of time-steps and demonstrate the potential utility of the approach with numerical results employing several different finite difference schemes of varying orders of accuracy.


INTRODUCTION
Several techniques now exist for parallel-in-time computation, that is computational approaches which reduce wall-clock time for evolutionary problems by employing computations acting in parallel along the temporal domain: we point the interested reader to the excellent review on the topic written by Gander, 1 as well as to the more recent one by Ong and Schroder. 2 In particular, for problems arising from the discretization of parabolic partial differential equations (PDEs), a number of schemes have been suggested, the principal ones being based on the Parareal 3 or PFASST 4 algorithms, and their acceleration using multigrid ideas. [5][6][7] Examples of successful applications of these methods to hyperbolic problems are however still scarce, with some notable exceptions, [8][9][10] although research in this direction is underway. [11][12][13] Other suggested approaches have employed a so-called monolithic or all-at-once approximation, where equations that define solutions at many or all time levels are considered; the contribution from this article falls into that category. Monolithic systems arising from the approximation of constant-coefficient, time-evolving differential equations present a clear structure when discretized using fixed time steps: their constituent blocks, in fact, are the same down each diagonal. When dealing with periodic problems, this gives rise to a block circulant matrix, whose structure can be directly employed for time-parallelization, 14 since it can be diagonalized via fast Fourier transform (FFT). When solving an initial value problem, instead, the circulant structure is broken, but we retain a block Toeplitz monolithic system: in this case, we can still exploit a circulant approximation in order to accelerate the iterative solution of such a system, 15,16 which is the approach we pursue in this article.
The literature provides theoretical results which indicate the potential of this approach for parallel-in-time computation, 16 and preliminary results from simple parallel implementations bear out the optimal parallel efficiency of this approach. 17 Intriguingly, this efficiency still seems to hold even for a particular discretization of the wave equation. 17 Given that regular time-stepping might be considered more appropriate for wave problems, we further investigate this and related approaches in this article. We show how a result of Chan, Potts, and Steidl on circulant preconditioning 18 applies to this situation and establishes in several cases that only a small number of generalised minimal residual (GMRES) iterations is required for convergence/termination of the method. This number is independent of the number of time steps. An alternative approach due to Gander et al. 19,20 is applicable when irregular time steps are chosen; by contrast to the methods described in this article, such an approach might be particularly relevant for parabolic problems.
In Section 2, we describe the all-at-once preconditioned Krylov iteration method, as well as the main theorem employed in our work, 18 together with its consequences. In Section 3 we give numerical results for hyperbolic problems. We conclude this article in Section 4.

CIRCULANT PRECONDITIONING FOR TOEPLITZ SYSTEMS
In the literature, a variety of circulant preconditioners for Toeplitz matrices have been proposed, 21 and indeed, for the simple test cases considered, some of these approaches coincide. In this article we exploit some of the results from Chan, Potts, and Steidl: 18 we therefore follow most of the notation used in their article, and describe the ω-circulant preconditioner defined therein.

ω-Circulant preconditioner
Let us introduce a 2π-periodic generating function f(x), with the associated Toeplitz matrix A_N(f) ∈ R^{N×N}, that is,

A_N(f) = [ a_{j-k}(f) ]_{j,k=1}^{N}, \qquad a_k(f) = \frac{1}{2\pi} \int_0^{2\pi} f(x)\, e^{-ikx}\, dx,   (1)

where the a_k(f) are the Fourier coefficients of f(x). In our work we are mostly concerned with banded Toeplitz matrices with a small band, which have trigonometric polynomials as generating functions, of the form

f(x) = \sum_{k=-s_2}^{s_1} a_k e^{ikx}.   (2)

In this case, the Fourier coefficients (1) are well defined, and coincide with the coefficients of the polynomial (2); moreover, the summation extrema s_1 and s_2 in (2) directly denote the number of sub- and superdiagonals of A_N(f), respectively, so that this matrix has the following structure:

A_N(f) = \begin{bmatrix}
a_0 & \cdots & a_{-s_2} & & \\
\vdots & \ddots & & \ddots & \\
a_{s_1} & & \ddots & & a_{-s_2} \\
 & \ddots & & \ddots & \vdots \\
 & & a_{s_1} & \cdots & a_0
\end{bmatrix}.   (3)

We further request that f(x) has no zeroes on the N equispaced nodes

x_{N,l} = w + \frac{2\pi l}{N}, \qquad l = 0, \ldots, N-1,   (4)

where w ∈ [0, 2π/N) is a free parameter which gives some flexibility in ensuring that we steer away from zeroes of the generating function: the ω in the name of the original preconditioner 18 is given by ω = e^{iNw}. To simplify our notation, in this section we limit ourselves to the choice w = 0 (i.e., ω = 1), but the treatment would remain largely unchanged otherwise. We also introduce the diagonal matrix D_N(f) ∈ R^{N×N}, which contains evaluations of f on the nodes x_{N,l} defined in (4),

D_N(f) = diag( f(x_{N,0}), \ldots, f(x_{N,N-1}) ),   (5)

as well as the discrete Fourier transform operator,

F_N = \frac{1}{\sqrt{N}} \left[ e^{2\pi i jk / N} \right]_{j,k=0}^{N-1}.   (6)

The ω-circulant matrix M_N(f) associated with A_N(f) is then defined as

M_N(f) = F_N D_N(f) F_N^*.   (7)

In our case, characterizing M_N(f) is particularly simple:

[ M_N(f) ]_{j,k} = a_{j-k}(f) + a_{j-k+N}(f) + a_{j-k-N}(f),   (8)

with the convention that a_m(f) = 0 whenever m lies outside the band; that is, in order to recover M_N(f), the off-diagonal elements of A_N(f) are summed to those N diagonals away. Notice that in our particular setting, where the band of A_N(f) is small (specifically, smaller than N/2), and ω = 1, the ω-preconditioner coincides with the Strang circulant preconditioner. 15 Formula (7) exposes a particularly favorable property of circulant matrices: they can be trivially diagonalized via Fourier transform.
Inverting M_N(f), then, amounts to applying a fast Fourier transform (FFT), inverting a diagonal matrix, and applying the inverse FFT: a task of overall complexity O(N log N). This argument can be similarly extended to block circulant matrices, in which case we would achieve block-diagonalization. Therein also lies the concurrency of the algorithm: solving a (block) diagonal system is a task which can be trivially parallelized. Being simple to invert, then, (7) definitely satisfies the first requirement of a preconditioner. As a second requirement, we would also hope for it to effectively accelerate the convergence of iterative methods. We address this in the following.
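The construction and O(N log N) inversion of such a circulant can be checked directly in a few lines. The sketch below (Python/NumPy; the banded stencil is an arbitrary illustrative stand-in for a generating function, not one of the discretizations considered later) builds the Strang-type circulant by wrapping the band around, then solves a system with it using one FFT, one diagonal solve, and one inverse FFT, verifying the result against a dense solve.

```python
import numpy as np

N = 64
# Illustrative banded stencil: diagonal 4, two subdiagonals (1, -1), one superdiagonal (0.5).
# Any band whose generating function has no zeroes on the unit circle would do.
band = {0: 4.0, 1: 1.0, 2: -1.0, -1: 0.5}

# First column of the circulant M: off-diagonal entries are wrapped N diagonals away.
c = np.zeros(N)
for k, a in band.items():
    c[k % N] += a

# Dense circulant for reference: M[i, j] = c[(i - j) mod N].
i, j = np.indices((N, N))
M = c[(i - j) % N]

rng = np.random.default_rng(0)
b = rng.standard_normal(N)

# Solve M y = b via FFT: the eigenvalues of M are fft(c), so the whole solve
# costs one FFT, one componentwise division, and one inverse FFT.
y_fft = np.fft.ifft(np.fft.fft(b) / np.fft.fft(c)).real
y_dense = np.linalg.solve(M, b)
err = np.max(np.abs(y_fft - y_dense))
print(err)
```

The same three operations, applied blockwise, are what make the block circulant preconditioner cheap to apply later on.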

Convergence of circulant-preconditioned Krylov methods
There is a choice of iterative methods one can use, paired with the preconditioner (7) (or slight modifications thereof), to solve a system of equations involving (3). For example, 16 one could symmetrize the Toeplitz system via a Hankel matrix, and use MINRES with the absolute value of (7) as preconditioner. An alternative 18 consists in solving the normal equations using the conjugate gradient method, preconditioned with M_N(|f|²). In both cases, strong theoretical bounds guarantee fast convergence of the methods used. We are mostly concerned with the convergence of circulant preconditioners applied to GMRES. While convergence results are harder to come by in this case, 22 some experiments 23,24 hint at it being a more effective iterative solver than the other available choices.
In this article, we use a result from Chan, Potts, and Steidl, 18 which deals with the application of (7) as a right-preconditioner for GMRES iterations of a system involving (3). Adapted to our case, the theorem states the following:

Theorem 1. Let f be a trigonometric polynomial such as (2), satisfying (4), and generating the Toeplitz matrix A_N(f), with associated ω-circulant preconditioner M_N(f). Then,

A_N(f) M_N(f)^{-1} = I_N + R_N(s_1 + s_2),   (10)

where I_N ∈ R^{N×N} is the identity matrix, while R_N(s) ∈ R^{N×N} identifies a matrix of low rank s.
Proof. In our situation, the proof simplifies significantly. We can split

A_N(f) = M_N(f) + R_N(s_1 + s_2),   (11)

since, by the characterization (8), A_N(f) and M_N(f) differ only in their corner entries, which are confined to the first s_1 and last s_2 rows; so that

A_N(f) M_N(f)^{-1} = I_N + R_N(s_1 + s_2) M_N(f)^{-1} = I_N + R_N(s_1 + s_2),   (12)

where the last equality holds because multiplication by an invertible matrix cannot increase the rank. □

Remark 1. Provided that the entries in a perturbed matrix Ã_N(f) which do not adhere to the Toeplitz structure are limited to the first s_1 and last s_2 rows, Theorem 1 still holds. This is particularly useful if we consider that we are dealing with the solution of discrete space-time systems. In this framework, in fact, we are often forced to modify the equations corresponding to the first few temporal nodes, be it in order to accommodate initial conditions on the derivative, and/or in order to kick-start high-order discretizations in time: this has the effect of perturbing the Toeplitz structure of the system, but not to an extent that would break the result of Theorem 1, as described in more detail in Section 3.
As a note, the original theorem 18 is more general, and deals with rational generating functions, f(x) = g(x)/h(x), with g and h trigonometric polynomials of degrees d_1 and d_2, respectively. In that case, the original theorem states that the rank of the resulting perturbation of the identity is equal to max{d_1, d_2}.
It is possible to recover our version of Theorem 1 by noticing that we can rewrite (2) as f(x) = \left( \sum_{k=0}^{s_1+s_2} a_{-s_2+k} e^{ikx} \right) / e^{i s_2 x}, so that max{d_1, d_2} = max{s_1 + s_2, s_2} = s_1 + s_2. Notably, Theorem 1 gives us a theoretical guarantee that GMRES with M_N(f) as a right-preconditioner will converge to the solution after at most s_1 + s_2 + 1 iterations, at least in exact arithmetic: if the band of A_N(f) is small enough, then (7) provides an excellent preconditioner. Examples of its effectiveness, as well as a discussion of the cases in which this theorem fails to come to our aid, are described in Section 3.
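The termination bound can be observed numerically. The sketch below (Python/NumPy; the banded Toeplitz matrix is an arbitrary illustrative example with s_1 = 2 and s_2 = 1, not one of the discretizations used later) runs a bare-bones right-preconditioned GMRES with the Strang circulant, and checks that the residual drops to round-off level within s_1 + s_2 + 1 iterations.

```python
import numpy as np

N, s1, s2 = 64, 2, 1
coeffs = {0: 4.0, 1: 1.0, 2: -1.0, -1: 0.5}   # a_k for k in [-s2, s1]; illustrative values

# Banded Toeplitz matrix A and its Strang circulant M (band wrapped around).
i, j = np.indices((N, N))
A = np.zeros((N, N))
c = np.zeros(N)
for k, a in coeffs.items():
    A[i - j == k] = a
    c[k % N] += a
lam = np.fft.fft(c)                            # eigenvalues of M
M_inv = lambda v: np.fft.ifft(np.fft.fft(v) / lam).real

rng = np.random.default_rng(1)
b = rng.standard_normal(N)

def gmres_right(A, M_inv, b, maxit, tol=1e-10):
    """Right-preconditioned GMRES (Arnoldi + small least-squares), no restarts."""
    beta = np.linalg.norm(b)
    V = [b / beta]
    H = np.zeros((maxit + 1, maxit))
    for it in range(maxit):
        w = A @ M_inv(V[it])
        for i2 in range(it + 1):               # modified Gram-Schmidt
            H[i2, it] = V[i2] @ w
            w = w - H[i2, it] * V[i2]
        H[it + 1, it] = np.linalg.norm(w)
        e1 = np.zeros(it + 2); e1[0] = beta
        y, *_ = np.linalg.lstsq(H[:it + 2, :it + 1], e1, rcond=None)
        res = np.linalg.norm(e1 - H[:it + 2, :it + 1] @ y)
        if res < tol or H[it + 1, it] < 1e-14:
            return it + 1, res
        V.append(w / H[it + 1, it])
    return maxit, res

iters, res = gmres_right(A, M_inv, b, maxit=20)
print(iters, res)
```

Since AM^{-1} is the identity plus a matrix of rank at most s_1 + s_2 = 3, the Krylov iteration terminates after at most 4 steps.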

ALL-AT-ONCE SOLUTION OF LINEAR DIFFERENTIAL EQUATIONS
In this section, we illustrate how simple systems arising from the discretization of differential equations are (close-to-)Toeplitz in nature, which makes them amenable to preconditioning using (7). We also introduce a few test cases, and the methods used in our experiments to recover their numerical solution. The setup follows the one presented by Goddard and Wathen. 17

An ordinary differential equation
As a proof of concept, useful for introducing some of the notation employed, we consider the simplest linear second-order differential equation,

u''(t) = u(t), \quad t \in (0, T], \qquad u(0) = \bar{u}_0, \quad u'(0) = \bar{u}_1,   (13)

where ū_0 and ū_1 ∈ R represent the initial values of the solution itself and of its derivative, respectively.

Discretization aspects
To discretize our temporal domain, we choose a uniform grid with step size Δt = T∕(N_t − 1). Furthermore, to approximate (13) we consider Störmer-Verlet schemes (SV) of various orders: this is a family of multistep methods specifically tailored to the solution of second-order differential equations [25]. The general formulation of a kth order SV scheme is given by

\nabla^2 u_{n+1} = \Delta t^2 \sum_{j=0}^{k-1} \alpha_j \nabla^j f_{n+1},   (14)

where the right-hand side of the system f_n simply corresponds to u_n in our case; ∇ identifies the backward difference operator (so that ∇u_{n+1} = u_{n+1} − u_n and ∇²u_{n+1} = u_{n+1} − 2u_n + u_{n−1}), while the coefficients α_j are picked in such a way as to ensure the desired level of accuracy [25].
Notice that for all cases but k = 2, (14) defines an implicit method. For ease of notation, we identify discretization schemes of different accuracies by reporting their order as a subscript to their acronym. We can see that choosing SV_1 corresponds to approximating the second-order derivative using a first-order backward difference formula, which produces the following recurrence relation:

u_n − 2u_{n−1} + u_{n−2} = \Delta t^2 u_n, \quad n = 1, \ldots, N_t − 1, \qquad (2 − \Delta t^2) u_1 − 2 u_0 = 2\Delta t\, \bar{u}_1.   (17)

Notice that we use a ghost-node approach to include the initial condition on the derivative at t = 0; that is, we introduce a fictitious node u_{−1}, approximate the first derivative using the central difference (CD) formula u_1 − u_{−1} = 2Δt ū_1, and substitute into the first equation of (17) with n = 1 to recover the second equation. We can gather the Equations (17) together and build the system

A^{SV_1} u = (\Phi^{SV_1} − \Delta t^2 \Phi_r^{SV_1}) u = g^{SV_1},   (18)

where u = [u_0, u_1, …, u_{N_t−1}]^T collects the values of the solution at each temporal node. The operator Φ^{SV_1} ∈ R^{N_t×N_t} represents the discretization of the time derivative (including the perturbation from the ghost-node approach), and has the following structure:

\Phi^{SV_1} = \begin{bmatrix}
1 & & & & \\
-2 & 2 & & & \\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1
\end{bmatrix}.   (19)

The operator Φ_r^{SV_1} stems from the evaluation of the right-hand side in (14), and it reduces to Φ_r^{SV_1} = I_{N_t}, the identity matrix (up to its first row, which is zeroed out so as to impose the initial condition on the solution exactly), for SV_1. The effect of the initial conditions is included in the right-hand side:

g^{SV_1} = [\bar{u}_0,\; 2\Delta t\, \bar{u}_1,\; 0, \ldots, 0]^T.   (20)

Notice how imposing the initial conditions breaks the Toeplitz structure of the matrix (19), and hence that of the system A^{SV_1}: the second element in the second row is different from the others on the diagonal. However, we can still assemble the circulant preconditioner disregarding this perturbation, as

M^{SV_1} = \tilde\Phi^{SV_1} − \Delta t^2 \tilde\Phi_r^{SV_1},   (21)

where Φ̃^{SV_1} is our circulant approximation to Φ^{SV_1}, built as follows:

\tilde\Phi^{SV_1} = \begin{bmatrix}
1 & & & 1 & -2 \\
-2 & 1 & & & 1 \\
1 & -2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & -2 & 1
\end{bmatrix}.   (22)

As hinted in Remark 1, the statement of Theorem 1 still remains valid using M^{SV_1} as a preconditioner for A^{SV_1}, even if the system is not strictly Toeplitz: in fact, we only perturbed the second row of a matrix with a band of size 2.
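To make the time-stepping view of this scheme concrete, the sketch below (Python; it assumes, purely for illustration, the model problem u'' = u with u(0) = 1 and u'(0) = 0, whose exact solution is cosh(t), and one possible reading of the first-order recurrence with a ghost-node start) advances the solution by forward recursion and checks the expected first-order convergence as the time step is halved.

```python
import numpy as np

def solve_sv1(T, Nt, u0, du0):
    """First-order scheme for the model problem u'' = u: backward second
    difference equated to the right-hand side at the newest node, with a
    ghost-node start. Model problem and scheme reading are illustrative
    assumptions, not necessarily the article's exact setup."""
    dt = T / (Nt - 1)
    u = np.zeros(Nt)
    u[0] = u0
    u[1] = (2 * u0 + 2 * dt * du0) / (2 - dt**2)   # ghost-node start
    for n in range(2, Nt):
        # u_n - 2 u_{n-1} + u_{n-2} = dt^2 u_n, solved pointwise for u_n
        u[n] = (2 * u[n - 1] - u[n - 2]) / (1 - dt**2)
    return u

T, u0, du0 = 1.0, 1.0, 0.0          # exact solution: cosh(t)
e1 = abs(solve_sv1(T, 201, u0, du0)[-1] - np.cosh(T))
e2 = abs(solve_sv1(T, 401, u0, du0)[-1] - np.cosh(T))
ratio = e1 / e2                      # ~2 for a first-order method
print(e1, e2, ratio)
```

Halving Δt roughly halves the final-time error, consistent with the first-order accuracy of the scheme (the second-order ghost-node start does not limit it).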
We also experiment using higher order discretizations. For orders >2, this requires recovering the values of the solution at the first few time steps in some alternative way, in order to kick-start the multistep method [26, chap. 5.9.3]. This is due both to the stencil generally becoming larger as the order increases, and to the fact that the ghost-node approach used in (17) is only second-order accurate, and would end up polluting the global accuracy of higher order schemes. There are a number of ways in which the kick-starting procedure might be conducted (generally resorting to self-starting schemes to solve for the first few unknowns), which end up disrupting the Toeplitz structure of the space-time system. For simplicity, in our experiments we directly evaluate the analytical solution at the necessary nodes; however, in order to mimic the impact that the perturbation stemming from a kick-starting procedure might have on the performance of the circulant preconditioner, we also modify the first few equations in the system by leaving only the contribution from the main diagonal. For example, using SV_5 gives rise to the space-time system

A^{SV_5} u = (\Phi^{SV_5} − \Delta t^2 \Phi_r^{SV_5}) u = g^{SV_5},   (23)

where Φ^{SV_5} and Φ_r^{SV_5} play the same roles as Φ^{SV_1} and Φ_r^{SV_1} in (19), but both have their first four rows perturbed, as such:

(\Phi^{SV_5})_{j,:} = (I_{N_t})_{j,:}, \qquad (\Phi_r^{SV_5})_{j,:} = 0, \qquad j = 1, \ldots, 4.   (24)

The circulant preconditioner is then assembled as in (21), starting from the circulant approximations to the operators in (24), that is,

M^{SV_5} = \tilde\Phi^{SV_5} − \Delta t^2 \tilde\Phi_r^{SV_5},   (25)
where Φ̃^{SV_5} is the same as Φ̃^{SV_1}, while Φ̃_r^{SV_5} is built disregarding the perturbations from the underlying Toeplitz structure of Φ_r^{SV_5}. Finally, the right-hand side of the space-time system is built as

g^{SV_5} = [u_0, u_1, u_2, u_3, 0, \ldots, 0]^T,   (26)

assuming the values at the first few instants, u_j = u(jΔt) (with j = 0, …, 3), are given. Notice that Remark 1 still applies, and that the result of Theorem 1 remains valid even in this case, since we perturbed the first four rows of a matrix where s_1 = 4 and s_2 = 0. Similar considerations hold for the systems corresponding to higher order discretizations, which are built following an analogous procedure.

Results
In Table 1 we collect the convergence results from the application of (7) as a right-preconditioner for GMRES, used to recover the solution of the systems (18), (23), and others built using SV discretizations of various orders. The theoretical bound provided in Theorem 1 is respected reasonably well: the discrepancies are attributable to round-off errors, and seem more pronounced for higher order schemes and more refined grids. The ω-preconditioner is further tuned by picking the parameter w in (4) as w = π∕N_t, which (at least for the lowest order schemes) ensures that the sampling nodes stay farthest away from the roots of the corresponding generating function; this in turn attains the desirable property of minimizing the condition number of the preconditioner. However, we deem it relevant to point out that, while this article was under review, a much more detailed analysis of the impact of this parameter on the effectiveness of the preconditioner was presented. 27

A hyperbolic partial differential equation
Ultimately, our interest resides in applying this method to the solution of hyperbolic partial differential equations. As a test case of this class, we consider the wave equation, with Dirichlet boundary conditions:

\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2}, \quad (x, t) \in (0, L) \times (0, T], \qquad u(0, t) = u(L, t) = 0, \qquad u(x, 0) = \bar{u}_0(x), \quad \frac{\partial u}{\partial t}(x, 0) = \bar{u}_1(x).   (27)

Discretization aspects
Most of the notation used in Section 3.1 can be extended to the PDE case, but we need to introduce an appropriate discretization of the spatial operator. To this purpose, we choose once again a uniform grid spanning the interior of the spatial domain, with spacing Δx = L∕(N_x + 1), and we consider a CD scheme for the second-order spatial derivative. Here too we make use of approximations of different orders of accuracy, to match the temporal discretizations. Furthermore, the Dirichlet boundary conditions in (27) are imposed naturally. This causes some inconvenience when using high-order schemes: in that case, in fact, we cannot use symmetric stencils anymore as we approach the boundaries, since we would require information from nodes outside the domain. We circumvent this issue by using, for those few nodes, finite differences of the same order of accuracy, but defined on a slanted stencil. 28 For example, for the second- and fourth-order-accurate discretizations, we obtain the following matrices:

K_{CD_2} = \frac{1}{\Delta x^2} \begin{bmatrix}
-2 & 1 & & \\
1 & -2 & 1 & \\
 & \ddots & \ddots & \ddots \\
 & & 1 & -2
\end{bmatrix}, \qquad
K_{CD_4} = \frac{1}{12 \Delta x^2} \begin{bmatrix}
-15 & -4 & 14 & -6 & 1 & \\
16 & -30 & 16 & -1 & & \\
-1 & 16 & -30 & 16 & -1 & \\
 & \ddots & \ddots & \ddots & \ddots & \\
 & & -1 & 16 & -30 & 16 \\
 & & 1 & -6 & 14 & -4 & \!\!-15
\end{bmatrix},   (28)

where the coefficients multiplying the known boundary values are absorbed into the right-hand side. We point out that the scheme stemming from choosing SV_2 for the temporal discretization and CD_2 for the spatial one is also known as the explicit leapfrog method [26, chap. 10.2.2], which is conditionally stable. This can be shown via von Neumann analysis [26, chap. 9.6]: assuming the solution is composed of a single Fourier mode, u_j^n = G^n e^{ikx_j}, for a certain growth factor G and frequency k, and plugging this into the scheme, we obtain the recurrence relation

G − 2 + G^{-1} = r^2 \left( e^{ik\Delta x} − 2 + e^{-ik\Delta x} \right) = 2 r^2 c_2.   (29)

We are interested in the roots of the associated characteristic polynomial, given by the solutions of

g^2 − 2(1 + r^2 c_2)\, g + 1 = 0,   (30)

where we named c_2 = cos(kΔx) − 1 ∈ (−2, 0), and r = Δt∕Δx, in order to reduce notation. Its roots, denoted as g_±, satisfy the relationships

g_+ g_− = 1   (31a)

and

g_+ + g_− = 2(1 + r^2 c_2).   (31b)
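The orders of accuracy of the interior central difference stencils used above can be verified numerically. The sketch below (Python; the test function sin(x) and the evaluation point are arbitrary choices) applies the second- and fourth-order stencils for the second derivative and checks that the errors shrink by factors of roughly 4 and 16, respectively, when the grid spacing is halved.

```python
import numpy as np

def cd2(f, x, h):
    # Second-order central difference for f''(x): stencil (1, -2, 1)
    return (f(x - h) - 2 * f(x) + f(x + h)) / h**2

def cd4(f, x, h):
    # Fourth-order central difference for f''(x): stencil (-1, 16, -30, 16, -1)/12
    return (-f(x - 2*h) + 16*f(x - h) - 30*f(x) + 16*f(x + h) - f(x + 2*h)) / (12 * h**2)

f, d2f = np.sin, lambda x: -np.sin(x)    # exact second derivative of sin is -sin
x0, h = 0.3, 0.1

err2_h  = abs(cd2(f, x0, h)     - d2f(x0))
err2_h2 = abs(cd2(f, x0, h / 2) - d2f(x0))
err4_h  = abs(cd4(f, x0, h)     - d2f(x0))
err4_h2 = abs(cd4(f, x0, h / 2) - d2f(x0))
print(err2_h / err2_h2, err4_h / err4_h2)   # ~4 and ~16
```

The observed ratios confirm the O(Δx²) and O(Δx⁴) truncation errors of the two interior stencils.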
We want to ensure that |g_±| ≤ 1 ∀k, and that roots falling on the unit circle are simple. This can only be guaranteed if these come as a complex conjugate pair of modulus 1 (otherwise they would violate (31a)). Consequently, we can translate (31a) into requesting |ℜ(g_±)| = |1 + r²c_2| ≤ 1, which only holds if r² ≤ 1, given the range of c_2. For the fourth-order discretization, a similar reasoning can be applied. The roots of the characteristic polynomial associated with using SV_4 in time and CD_4 in space are given by

g_± = (1 + r^2 c_4) \pm \sqrt{ (1 + r^2 c_4)^2 − 1 },   (32)

this time with c_4 = (16 cos(kΔx) − cos(2kΔx) − 15)∕12 ∈ (−8∕3, 0). The conditions on its roots resemble those in (31a); for stability, we then need to impose

r^2 \le \frac{3}{4}.   (33)

For higher order methods the analysis gets more complex, as increasing the size of the stencil increases the order of the associated characteristic polynomial, but they too require constraints on the size of r in order to ensure stability. To avoid these restrictions, we also introduce another discretization, namely an implicit version of the leapfrog method, 29 which we consider only to second-order accuracy, and which we denote by CI_2. The only difference with respect to the explicit leapfrog scheme lies in the fact that the spatial operator is averaged over three consecutive time steps, giving rise to the following recurrence relation:

\frac{u_j^{n+1} − 2u_j^n + u_j^{n-1}}{\Delta t^2} = \left( K_{CD_2} \frac{u^{n+1} + 2u^n + u^{n-1}}{4} \right)_j.   (34)

Even in this case, von Neumann analysis comes to our aid in establishing that the method is unconditionally stable. After some algebraic manipulations, we can show that the characteristic polynomial of interest this time is given by

(2 − r^2 c_2)\, g^2 − 2(2 + r^2 c_2)\, g + (2 − r^2 c_2) = 0,   (35)

with c_2 ∈ (−2, 0) defined as before. Similar considerations as in the case above show that its roots are a complex conjugate pair falling on the unit circle whenever |2 + r²c_2| < |2 − r²c_2|; given that c_2 is always negative, this holds ∀r² > 0. We are finally ready to assemble the monolithic space-time system, by opportunely combining the matrices introduced above.
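These stability statements can be spot-checked by computing the growth factors directly. The sketch below (Python/NumPy) forms the characteristic polynomial of the explicit leapfrog scheme, g² − 2(1 + r²c₂)g + 1 = 0, and an implicit counterpart corresponding to the three-level averaging described in the text (an assumption on the exact weights), then checks that the explicit roots stay on the unit circle for r ≤ 1 but not for r > 1, while the implicit ones do so even for large r.

```python
import numpy as np

k_dx = np.linspace(1e-3, np.pi, 200)     # sample k*dx over (0, pi]
c2 = np.cos(k_dx) - 1.0                  # in (-2, 0)

def max_growth_explicit(r):
    # Largest |root| of g^2 - 2(1 + r^2 c2) g + 1 = 0 over all frequencies
    gmax = 0.0
    for c in c2:
        roots = np.roots([1.0, -2.0 * (1.0 + r**2 * c), 1.0])
        gmax = max(gmax, np.max(np.abs(roots)))
    return gmax

def max_growth_implicit(r):
    # Largest |root| of (2 - r^2 c2) g^2 - 2(2 + r^2 c2) g + (2 - r^2 c2) = 0
    gmax = 0.0
    for c in c2:
        a = 2.0 - r**2 * c
        roots = np.roots([a, -2.0 * (2.0 + r**2 * c), a])
        gmax = max(gmax, np.max(np.abs(roots)))
    return gmax

g_exp_stable   = max_growth_explicit(0.9)    # r <= 1: expect |g| = 1
g_exp_unstable = max_growth_explicit(1.1)    # r > 1: a growing mode appears
g_imp_large_r  = max_growth_implicit(10.0)   # implicit: |g| = 1 for any r
print(g_exp_stable, g_exp_unstable, g_imp_large_r)
```

The explicit scheme develops a growing mode as soon as r exceeds 1, while the implicit variant keeps all growth factors on the unit circle regardless of r, in line with the unconditional stability claimed above.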
For CI_2, or SV_k temporal discretizations of a given order k, we do so in the following blockwise fashion:

A^{SV_k} = \Phi^{SV_k} \otimes I_{N_x} − \Delta t^2\, \Phi_r^{SV_k} \otimes K_{CD_{2\lceil k/2 \rceil}}, \qquad
A^{CI_2} = \Phi^{SV_2} \otimes I_{N_x} − \frac{\Delta t^2}{4} \left( \tilde{I}_{N_t} + 2 I_{N_t}^{-1} + I_{N_t}^{-2} \right) \otimes K_{CD_2},   (36)

where ⊗ represents the Kronecker product, ⌈·⌉ is the ceiling function, while I_{N_t}^{-i} is the lower shift matrix, containing ones only on the ith subdiagonal. Also, Ĩ_{N_t} is a slight modification of the identity matrix, containing a 2 in the second row: this is done to account for the initial condition on the derivative. Notice how the part of A^{CI_2} referring to the spatial discretization (the term containing the Kronecker product with K_{CD_2}) is made of three adjacent blocks per block-row, corresponding to an averaging over three consecutive instants. The corresponding right-hand sides g^{SV_k} and g^{CI_2} are assembled in a similar way to what was done in Section 3.1, collecting the contributions of the (spatially discretized) initial data ū_0 and ū_1 in the entries associated with the first few time levels, as in (20) and (26). Notice that, since we defined the monolithic systems discretizing (27) via Kronecker products with the (quasi) Toeplitz matrices of Section 3.1, we still retain a (quasi) block Toeplitz structure even for the PDE case. We can then build our block circulant preconditioner.

Block circulant preconditioning
Analogously to what was done in Section 3.1, the following block circulant preconditioners are built for the systems (36):

M^{SV_k} = \tilde\Phi^{SV_k} \otimes I_{N_x} − \Delta t^2\, \tilde\Phi_r^{SV_k} \otimes K_{CD_{2\lceil k/2 \rceil}}, \qquad
M^{CI_2} = \tilde\Phi^{SV_2} \otimes I_{N_x} − \frac{\Delta t^2}{4} \left( I_{N_t} + 2 \hat{I}_{N_t}^{-1} + \hat{I}_{N_t}^{-2} \right) \otimes K_{CD_2},   (38)

where Î_{N_t}^{-i} is the circular lower shift matrix, that is, the circulant counterpart to I_{N_t}^{-i}, containing ones both on the ith subdiagonal and on the (N_t − i)th superdiagonal. As described in Section 2, these preconditioners can be (block) diagonalized using a (block) FFT. For instance, for SV_2 we have

M^{SV_2} = (F_{N_t} \otimes I_{N_x})\, \hat{A}^{SV_2}\, (F_{N_t}^* \otimes I_{N_x}), \qquad \hat{A}^{SV_2} = \mathrm{blkdiag}_{n=0,\ldots,N_t-1} \left( \lambda_n I_{N_x} − \Delta t^2 \mu_n K_{CD_2} \right),   (39)

where λ_n and μ_n denote the eigenvalues of the circulant factors Φ̃^{SV_2} and Φ̃_r^{SV_2}, respectively. These are recovered by evaluating the associated generating functions on the diagonals of the matrices Λ_{N_t}^n = diag(1, ω_{N_t}^n, …, ω_{N_t}^{(N_t−1)n}), with ω_{N_t} = e^{2πi∕N_t}, which contain the N_t-th roots of unity sampled at frequency n. In order to apply the inverse of (39) to a space-time vector b ∈ R^{N_x N_t}, we then need to perform the following sequence of operations:

1. Apply F_{N_t}^* ⊗ I_{N_x}. This corresponds to computing a total of N_x different FFTs on the coefficients of b. Each of these FFTs can be performed independently. The signals that need to be Fourier transformed, denoted as s_n ∈ R^{N_t}, are each of length N_t and are recovered by collecting every N_x-th coefficient of b, starting from the nth coefficient:

(s_n)_m = b_{m N_x + n}, \qquad m = 0, \ldots, N_t − 1, \quad n = 0, \ldots, N_x − 1.

Basically, we are tracking how the solution at each spatial node evolves in time, and applying a Fourier transform to each of these evolutions. The FFTs produce the transformed signals ŝ_n = F_{N_t}^* s_n. Their coefficients are then rearranged in order to recover, as a final result, the vector b̂ with entries

\hat{b}_j = (\hat{s}_{j \,\%\, N_x})_{\lfloor j / N_x \rfloor},

where ⌊·⌋ and % denote the floor and modulo operators, respectively.

2. Invert Â^{SV_2}. As already pointed out, this matrix is block diagonal. As a consequence, this step requires solving N_t independent systems (the blocks on the diagonal), using the corresponding chunks of b̂, that is, [b̂_{nN_x}, …, b̂_{(n+1)N_x−1}]^T, n = 0, …, N_t − 1, as right-hand sides. Looking at (39) we can see how each of these solves requires inverting a complex-shifted discrete Laplacian operator.

3. Apply F_{N_t} ⊗ I_{N_x}. The last step is equivalent to the first, except N_x inverse FFTs (iFFTs) must be performed.

Cost analysis of circulant preconditioning versus time-stepping
Canonically, approximate solutions to (27) are recovered via time-stepping, which corresponds to applying block forward substitution to space-time matrices such as (36). This is an inherently sequential procedure which, at least for implicit methods, requires inverting a shifted Laplacian at each time step. Assuming inverting the Laplacian operator has an associated cost of C(N_x), the overall procedure has a complexity of O(N_t C(N_x)). Given the availability of optimal solvers for this kind of operator, 30 we can expect C(N_x) to scale linearly with N_x, with spatial parallelization eventually coming to our aid in reducing the required computational time. Conversely, our procedure dramatically increases the global cost of solving systems with the coefficient matrix (36). For each GMRES iteration, in fact, we need to multiply the space-time residual by the matrix (36), an operation which by itself has a complexity of O(N_t N_x). On top of this, applying the preconditioner involves N_x independent FFTs and iFFTs, of overall complexity O(N_x N_t log(N_t)), as well as inverting N_t systems involving a complex-shifted Laplacian. The latter can safely be assumed to have a cost comparable to C(N_x) (in fact, optimal solvers are available for this operator as well 31,32), which makes the application of the preconditioner a procedure with an overall complexity of O(N_x N_t log(N_t) + N_t C(N_x)). Unlike time-stepping, however, solution via GMRES with circulant preconditioning exposes parallelism along the temporal domain, as stated in Section 2.1. This can be exploited to reduce the computational time of each step of the GMRES iteration. Firstly, when multiplying by the space-time matrix, if we have N_t processors available, each assigned to a specific instant (or block-row of the matrix), then the computational time necessary is effectively reduced from O(N_t N_x) to O(N_x), if we assume that time-to-solution is directly proportional to complexity.
Here we are neglecting the fact that, if each processor contains information only pertaining to a specific instant, a certain amount of communication must occur in order to perform the multiplication by the space-time matrix. This overhead is however limited, since the space-time matrix has a small band of size s_1 + s_2, implying that this exchange of information only involves s_1 + s_2 − 1 processors at most, and in particular its cost per processor is independent of N_t. Secondly, when inverting the complex-shifted Laplacians, time similarly reduces from O(N_t C(N_x)) to O(C(N_x)), since again each system can be inverted independently (and no communication must occur among processors). Thirdly, a similar reasoning can be followed when applying the Fourier transforms: not only are the FFTs on each of the N_x signals independent, meaning that any additional processors for spatial parallelization can still be employed to attack the N_x factor in the O(N_x N_t log(N_t)) complexity, but parallelization over time is also perfectly feasible. Most of the available FFT algorithms involve steps consisting of operations acting over the whole signal vector: 33,34 the operations on each element are independent and can hence be parallelized, 35,36 ideally dropping the computational time from O(N_x N_t log(N_t)) to O(N_x log(N_t)). The actual performance of the parallel FFT depends heavily on how communication among processors is organized, and on the architecture of the system considered: 37 since the FFT represents the bottleneck in the application of the circulant preconditioner, any efficient parallel implementation will need to be fine-tuned to take this into account. This is however beyond the scope of this article, which rather focuses on the efficacy of the preconditioner when applied to space-time systems.
To summarize, we have on the one hand time-stepping: a sure-fire direct procedure which recovers the solution to a space-time system in a number of operations scaling as ∼N t C(N x ); on the other, we have an iterative procedure whose computational time per iteration scales as ∼ C(N x ) + N x log(N t ). The difference between the two regimes allows some room for speed-up, provided that the number of necessary iterations remains bounded, and particularly that it does not scale with N x , nor with N t . Even in the PDE case, we can still invoke Theorem 1 to secure ourselves from the latter, since it confirms that the maximum theoretical number of iterations to convergence is independent of the number of time steps taken. Unfortunately, though, Theorem 1 does not shield us from the former, since in principle the number of iterations can grow with the size of the blocks composing the matrices (36), that is, the number of spatial nodes N x considered in the discretization. In practice, however, and particularly under certain regimes, this seems not to be the case. Details on this are shown in the following.

Results
In this section, we report the results from the application of GMRES to the solution of systems involving (36) as coefficient matrix, using the corresponding block circulant matrices (38) as right-preconditioners. In particular, and in light of the considerations made in the previous paragraph, we are interested in its performance in terms of number of iterations to convergence. For all of our experiments, GMRES is set to a tolerance of 10^{−10}, starting from a null initial guess. We consider the unit square, T = L = 1, as the domain for the PDE (27). As an initial condition, we choose a shifted Gaussian,

\bar{u}_0(x) = e^{-(x - L/2)^2 / (2\sigma^2)} − c,

where c is picked so that ū_0(0) = ū_0(L) = 0, and σ² = 0.002; for the derivative, we simply take ū_1(x) = 0. The solutions to the systems involving the complex-shifted Laplacian are recovered at each iteration using the backslash command in MATLAB. The code used for the experiments is publicly available in its Git repository. 38 We test the two different schemes presented above, for a variety of orders of accuracy. Results are presented in Table 2 for SV discretizations of the temporal derivative, and in Table 3 for CI_2 discretizations. The actual number of iterations to convergence remains well below the limit indicated by Theorem 1 for all cases considered. We can notice, however, that this number is still far from independent of the spatial mesh size. The convergence behavior when increasing N_t is somewhat hard to grasp, but shows a slight increase in the iteration count as N_t gets larger. Increasing N_x has a similar impact, and also slows convergence. This is expected from the statement of Theorem 1: N_x determines the size of the blocks composing (7), and hence directly affects the rank of the perturbation (10). Increasing the order of accuracy of the time discretizations used also has a negative effect on performance. This too is expected from Theorem 1: in this case it is the actual number of blocks in (10), rather than their dimension, that is directly affected.
The SV_1 scheme (top-left in Table 2) deserves a separate discussion: this scheme produces a solution which presents a noticeable degree of numerical diffusion, as reported in the literature 17 and shown in Figure 1. There, we can see how the solution profile with SV_1 (at the bottom) is visibly smeared out, contrasting with the result from leapfrog (top and middle).

TABLE 2 Number of iterations to convergence for GMRES, right-preconditioned with (7), applied to the solution of systems arising from the discretization of (27). Note: Different numbers of spatial and temporal nodes considered (N_x and N_t, respectively). Time derivative approximated using SV, and space derivative approximated using CD of matching orders of accuracy. The rightmost column in each table reports the theoretical upper bound from Theorem 1; a cross identifies a simulation which did not converge due to memory requirements becoming too severe. Values for N_x > N_t are not reported, as they violate the CFL condition, and give rise to unstable solutions: convergence results are very poor in those cases. The different colors in the cell backgrounds provide an indication of the accuracy of the solution for the test case considered: the L² error in space-time with respect to the analytical solution decreases as we move to darker shades, becoming <10^{−4}, <10^{−6}, and <10^{−8}, respectively.

TABLE 3 Number of iterations to convergence for GMRES, right-preconditioned with (7), applied to the solution of systems arising from the discretization of (27). Note: Different numbers of spatial and temporal nodes considered (N_x and N_t, respectively). Time derivatives approximated using CI_2, space derivatives using CD with various accuracies. The rightmost column in each table reports the theoretical upper bound from Theorem 1. The different shades indicate the accuracy with respect to the analytical solution as in Table 2.
TABLE 4 Number of iterations to convergence for GMRES, right-preconditioned with (7), applied to the solution of systems arising from the discretization of (27), with different initial conditions.

The difference is clear by comparing the shapes of the final negative peaks in the three surface plots, or by looking at how the heatmap at the bottom appears more blurred. The fact that diffusion is present seems to aid the convergence of the scheme, a feature shared with other parallel-in-time integration methods. 39,40 In fact, we can see from Table 2 that the number of iterations to convergence remains reasonably small, even when the meshes are refined. However, the recovered solution is of very poor quality: even with the finest meshes used, the error with respect to the analytical solution remains above 10^{−4}. The background colors of the cells in Table 2 draw a clearer picture of the performance of each scheme, by providing an indication of the accuracy of the recovered approximation: moving to darker shades, the numerical solution reaches an approximation error of 10^{−4}, 10^{−6}, and 10^{−8}, respectively, although at around 10^{−8} the convergence of the high-order schemes tends to stall, likely because round-off errors become relevant, which makes the results less informative in this last range. The first lesson we can extract from this is that there is no advantage in using a low-order scheme over a high-order one, so long as they share the same stencil: by comparing the tables relative to SV_2 and SV_4 (two schemes both with a stencil of size 3), we can see that SV_4 achieves a much smaller error than SV_2 on the same meshes, and often even with a smaller number of iterations. The latter is possibly due to the negative impact of the extra perturbation introduced into the Toeplitz system when accounting for the initial conditions, since we used a ghost-node approach for SV_2.
As the order of the method increases, it becomes harder to make such strong statements, but if the stencil grows together with the order, then low-order, small-stencil schemes seem to be preferable. For example, SV 4 reaches an error of 10 −6 with 25 iterations on a 640 × 960 mesh, while SV 5 needs 65 iterations (roughly 2.5 as many) on a mesh with 1/4 as many degrees of freedom, and SV 7 uses 312 (or roughly 12.5 as many) iterations on a mesh with only 1/16 as many degrees of freedom. Given how the cost per iteration scales linearly in N x but logarithmically in N t (as per the analysis conducted in the previous paragraph), SV 5 is of comparable cost with respect to SV 4 , but SV 7 far exceeds it, and is hence not worth employing for the same target accuracy.

TA B L E 3 Number of iterations to convergence for GMRES, right-preconditioned with
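The O(N_t log N_t) factor in the per-iteration cost quoted above comes from FFT-based solves with circulant matrices. As a minimal illustration (scalar case; function name and example values are ours, not from the paper), a circulant system can be solved in O(n log n) time because the FFT diagonalizes it:

```python
import numpy as np

def circulant_solve(c, b):
    """Solve C x = b where C is circulant with first column c.

    The Fourier matrix diagonalizes any circulant matrix, with
    eigenvalues fft(c), so the whole solve costs O(n log n).
    """
    return np.fft.ifft(np.fft.fft(b) / np.fft.fft(c))

# Sanity check against a dense matrix-vector product.
n = 8
c = np.zeros(n)
c[0], c[1], c[-1] = 3.0, -1.0, -0.5       # an invertible example
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
b = np.arange(1.0, n + 1.0)
x = circulant_solve(c, b).real            # real data -> real solution
```

In the block setting of the paper, the same idea applies blockwise: an FFT along the time direction decouples the system into N_t independent spatial problems.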
The nature of the initial conditions also has an impact on the speed of convergence. We test this by running a series of experiments in which the Gaussian used as an initial condition is progressively narrowed. As shown in Table 4, the number of iterations to convergence increases as the Gaussian becomes steeper, regardless of the discretization used. This is not entirely surprising: with steeper profiles, the range of relevant modes in the solution extends to higher frequencies, and we can expect to require a richer Krylov subspace to approximate them properly. An extreme example of the converse is obtained by picking a single mode, of any frequency, as the initial condition. This gives rise to a stationary wave, which is particularly simple to track, since the spatial and temporal components of the solution factorize: the solution at each instant is nothing but the initial condition, suitably rescaled. In this case, the circulant preconditioner is extremely effective, as it builds a Krylov subspace which captures the degrees of freedom of the solution in very few iterations: only 2, for all schemes of order of accuracy 2 or smaller, as already reported in the literature. 17,24 This is, however, a peculiar feature of the circulant preconditioner, limited to this very specific combination of parameters: if we vary the type of spatial discretization (or add a forcing term, or change the initial guess for the GMRES algorithm), we can no longer expect such dramatic convergence, as shown in the first row of Table 4.

CONCLUSION
Parallel-in-time computational methods for evolutionary equations have been studied over a long period, and several successful schemes now exist which are effective in speeding up the solution of parabolic PDEs. By contrast, hyperbolic PDE problems appear to present a bigger challenge for parallelization.
Here we have investigated a recently suggested approach based on circulant preconditioning for the block Toeplitz all-at-once (or monolithic) system that arises when a constant time step is used to discretize the linear wave equation. The use of regular time steps seems more natural for such wave problems than for dissipative problems. We have shown how a result of Chan, Potts, and Steidl guarantees convergence (termination) of GMRES with this preconditioning methodology in a small number of iterations, independent of the number of time steps, in ideal cases. Parallel application of such iterative schemes has been well covered in the literature. We have also presented numerical results which bear out this theory and demonstrate its potential utility even where the exact structures are broken, for example by the application of initial conditions. Further work will be required to generalize to variable-coefficient and nonlinear hyperbolic equations. In both cases, the structure of the space-time system A_N is bound to stray further from that of a (block) Toeplitz matrix. One way to still make use of the circulant preconditioner (8) is to design an effective projection operator which maps A_N back to the space of (block) Toeplitz matrices. Possibly the most straightforward choice is to average the entries of A_N along each diagonal. This is by no means the only option, however, and identifying the most suitable one remains a matter for future research.
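To make the diagonal-averaging idea concrete, the following sketch (function name is ours) projects a dense scalar matrix onto the nearest Toeplitz matrix in the Frobenius norm; the block case would average blocks along each block diagonal rather than scalar entries:

```python
import numpy as np

def toeplitz_projection(A):
    """Frobenius-optimal Toeplitz approximation of a square matrix,
    obtained by averaging the entries along each diagonal."""
    n = A.shape[0]
    T = np.zeros((n, n))
    for k in range(-(n - 1), n):
        avg = np.diagonal(A, offset=k).mean()   # mean of diagonal k
        i = np.arange(max(0, -k), min(n, n - k))
        T[i, i + k] = avg                       # broadcast it back
    return T

# A Toeplitz matrix is a fixed point of the projection ...
T0 = np.array([[1.0 / (1 + abs(i - j)) for j in range(6)] for i in range(6)])
assert np.allclose(toeplitz_projection(T0), T0)

# ... and a perturbed Toeplitz matrix projects at least as close
# to its perturbation as the unperturbed matrix does.
rng = np.random.default_rng(1)
A = T0 + 0.01 * rng.standard_normal((6, 6))
P = toeplitz_projection(A)
assert np.linalg.norm(A - P) <= np.linalg.norm(A - T0) + 1e-12
```

The resulting Toeplitz matrix can then be approximated by a circulant in the usual way, recovering a preconditioner of the form (8).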