Minimal residual space-time discretizations of parabolic equations: Asymmetric spatial operators

We consider a minimal residual discretization of a simultaneous space-time variational formulation of parabolic evolution equations. Under the usual 'LBB' stability condition on pairs of trial and test spaces we show quasi-optimality of the numerical approximations without assuming symmetry of the spatial part of the differential operator. Under a stronger LBB condition we show error estimates in an energy-norm which are independent of this spatial differential operator.


INTRODUCTION
This paper is about the numerical solution of parabolic evolution equations in a simultaneous space-time variational formulation. Compared to classical timestepping schemes, simultaneous space-time methods are much better suited for a massively parallel implementation (e.g. [NS19,vVW20]), allow for local refinements in space and time (e.g. [SY18,GS19,SvVW21,vVW21]), and produce numerical approximations from the employed trial spaces which are quasi-best.
The standard bilinear form that results from a space-time variational formulation is non-coercive, which makes it difficult to construct pairs of discrete trial and test spaces that inherit the stability of the continuous formulation. For this reason, in [And13] R. Andreev proposed to use minimal residual discretizations. They have an equivalent interpretation as Galerkin discretizations of an extended self-adjoint, but indefinite, mixed system having as secondary variable the Riesz lift of the residual of the primal variable w.r.t. the PDE.
For pairs of trial and test spaces that satisfy a Ladyzhenskaya-Babuška-Brezzi (LBB) condition, it was shown that w.r.t. the norm on the natural solution space, being an intersection of Bochner spaces, the Galerkin solutions are quasi-best approximations from the selected trial spaces. This LBB condition was verified in [And13] for 'full' and 'sparse' tensor products of various finite element spaces in space and time. The sparse tensor product setting was then generalized in [SvVW21, Proposition 5.1] to allow for local refinements in space and time whilst retaining (uniform) LBB stability.
A different minimal residual formulation of first order system type was introduced in [FK21], see also [GS21]. Here the various residuals are all measured in L_2-norms, meaning that they do not have to be introduced as separate variables, and the resulting bilinear form is coercive.
Closer in spirit to [And13] are the space-time methods presented in [Ste15, LMN16, BEEN19], in which error bounds are presented w.r.t. mesh-dependent norms. In [Dev20, SZ20] space-time variational methods are presented that lead to coercive bilinear forms based on fractional Sobolev norms of order 1/2. A first order space-time DPG formulation of the heat equation is presented in [DS20].
A restriction imposed in [And13], as well as in the other mentioned references apart from [BEEN19, GS21], is that the spatial part of the PDO is not only coercive but also symmetric. In [SW20] we could remove the symmetry condition for the analysis of a related Brézis-Ekeland-Nayroles (BEN) ([BE76, Nay76]) formulation of the parabolic PDE. In the current work, we prove that also for the minimal residual (MR) method the symmetry condition can be dropped. So for both MR and BEN we show that under the aforementioned LBB condition the Galerkin approximations are quasi-optimal, where the bound on the error in the numerical approximation for BEN improves upon the one from [SW20].
The error bounds for both MR and BEN degrade for increasing asymmetry. This is not an artefact of the theory but is confirmed by numerical experiments. Under a stronger LBB condition on the pairs of trial and test spaces, however, we will prove that the MR and BEN approximations are quasi-best w.r.t. a continuous, i.e., 'mesh-independent', energy-norm, uniformly in the spatial PDO.
We present numerical tests for the evolution problem governed by the simple PDE ∂_t − ε∂_x² + ∂_x + e·Id on (0, 1)² with initial and boundary conditions, where e is either 0 or 1. For the case that homogeneous Dirichlet boundary conditions are prescribed at the outflow boundary x = 1, the results for very small ε illustrate that quasi-optimal approximations do not necessarily mean accurate approximations. Indeed the error in the computed solution is large because of the unresolved boundary layer. The minimization of the error in the energy norm of least squares type causes a global spread of the error along the streamlines. We tackled this problem by imposing these boundary conditions only weakly.
1.1. Organization. In Sect. 2 we recall the well-posed space-time variational formulation of the parabolic problem and study its conditioning. Under the usual LBB condition, in Sect. 3 we show quasi-optimality of the MR method without assuming symmetry of the spatial differential operator. A similar result is shown for BEN in Sect. 4. Known results concerning the verification of this LBB condition are summarized in Sect. 5, together with results about optimal preconditioning. In Sect. 6 we equip the solution space with an energy norm, and, under a stronger LBB condition, show error estimates for MR and BEN which are independent of the spatial differential operator. We present an a posteriori error estimator which, under an even stronger LBB condition, is efficient and, modulo a data-oscillation term, is reliable.
In Sect. 7 we apply the general theory to the example of the convection-diffusion problem. We give pairs of trial and test spaces which satisfy the 2nd and 3rd mentioned LBB conditions. Finally, in Sect. 8 we present numerical results for the MR method in the simple case of a one-dimensional spatial domain.
To solve the problems caused by an unresolved boundary layer, we modify the method by imposing a boundary condition weakly.
1.2. Notations. In this work, by C ≲ D we will mean that C can be bounded by a multiple of D, independently of parameters which C and D may depend on. Obviously, C ≳ D is defined as D ≲ C, and C ≂ D as C ≲ D and C ≳ D.
For normed linear spaces E and F, by L(E, F) we will denote the normed linear space of bounded linear mappings E → F, and by Lis(E, F) its subset of boundedly invertible linear mappings E → F. We write E ↪ F to denote that E is continuously embedded into F. For simplicity only, we exclusively consider linear spaces over the scalar field R.

WELL-POSED VARIATIONAL FORMULATION
Let V, H be separable Hilbert spaces of functions on some "spatial domain" such that V ↪ H with dense embedding. Identifying H with its dual, we obtain the Gelfand triple V ↪ H ≃ H′ ↪ V′. We use ⟨·, ·⟩ to denote both the scalar product on H × H as well as its unique extension to the duality pairing on V × V′ or V′ × V, and denote the norm on H by ‖·‖.
For a.e. t ∈ I := (0, T), let a(t; ·, ·) denote a bilinear form on V × V such that for any η, ζ ∈ V, t ↦ a(t; η, ζ) is measurable on I, and such that, for some λ ∈ R, for a.e. t ∈ I, a(t; ·, ·) is bounded on V × V (2.1) and satisfies the Gårding inequality a(t; η, η) ≳ ‖η‖²_V − λ‖η‖² (2.2). With A(t) ∈ Lis(V, V′) being defined by (A(t)η)(ζ) := a(t; η, ζ), given a forcing function g and an initial value u_0, we are interested in solving the parabolic initial value problem of finding u such that u′(t) + A(t)u(t) = g(t) (t ∈ I), u(0) = u_0 (2.3). In a simultaneous space-time variational formulation, the parabolic PDE reads as finding u from a suitable space X of functions of time and space such that the PDE, tested against all v from another suitable space Y of functions of time and space, is satisfied. One possibility to enforce the initial condition is by testing it against additional test functions. A proof of the following result can be found in [SS09], cf. [DL92, Ch. XVIII, §3] and [Wlo82, Ch. IV, §26] for slightly different statements.
where for t ∈ Ī, γ_t : u ↦ u(t, ·) denotes the trace map. That is, assuming g ∈ Y′ and u_0 ∈ H, finding u ∈ X such that (2.4) holds is a well-posed simultaneous space-time variational formulation of (2.3).
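For orientation, the standard choice of the spaces X and Y and of the space-time bilinear form can be sketched as follows (a sketch following the setting of [SS09]; the precise definitions are as in that reference):

```latex
X := L_2(I;V)\cap H^1(I;V'), \qquad Y := L_2(I;V),
\qquad
(Bu)(v) := \int_I \langle u'(t),v(t)\rangle + a\bigl(t;u(t),v(t)\bigr)\,dt,
```

with the initial condition enforced by additionally testing γ_0 u = u_0 against H; well-posedness then expresses that (B, γ_0) ∈ Lis(X, Y′ × H).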
we will always assume that, besides (2.1), (2.2) is valid for λ = 0, i.e., for a.e. t ∈ I, a(t; ·, ·) is coercive, and equip Y with the 'energy' scalar product ⟨·, ·⟩_Y := (A_s·)(·), with norm being, thanks to (2.1) and (2.5), equivalent to the standard norm on Y. Equipping Y′ with the resulting dual norm, A_s ∈ Lis(Y, Y′) is an isometric isomorphism, and so for f ∈ Y′ we have ‖f‖²_{Y′} = (A_s^{−1}f)(f). The norm ‖·‖_X is, thanks to X ↪ C(Ī; H), equivalent to the standard norm on X. In addition, we define the energy-norm on X by |||·|||_X := (‖B·‖²_{Y′} + β‖γ_0·‖²)^{1/2}, which, thanks to Theorem 2.1, is indeed a norm on X.
Proposition 2.2. With α := ‖A_a‖_{L(Y,Y′)}, for 0 ≠ w ∈ X the norms ‖w‖_X and |||w|||_X are equivalent, with equivalence constants depending only on α, so that, in particular, both norms are equal when A_a = 0.
Proof. Using that (2.6) holds for w, v ∈ X, we find that, for w ∈ X and any η ≠ 0, Young's inequality shows one of the bounds of the statement. From this, again by Young's inequality, by solving η² from 1 − η² = (1 − η^{−2})α² + 1 the other bound follows.
Remark 2.3. Because ‖·‖_Y is defined in terms of the symmetric part A_s of the spatial differential operator A, α = ‖A_a‖_{L(Y,Y′)} is a measure for the relative asymmetry of A.

MINIMAL RESIDUAL (MR) METHOD
Let (X^δ, Y^δ)_{δ∈∆} be a family of closed, proper, non-zero subspaces of X and Y, respectively. For δ ∈ ∆, let E^δ_X and E^δ_Y denote the trivial embeddings X^δ → X and Y^δ → Y, which we often write explicitly for clarity. We assume that (3.1) and (3.2) hold. Furthermore, for efficiency reasons we assume to have available a preconditioner K^δ_Y satisfying (3.3).
defines an equivalent norm on Y^δ, and our Minimal Residual approximation u^δ ∈ X^δ of the solution u ∈ X of (2.4) is defined by (3.4), for some constant β ≥ 1. Later we will see that, thanks to (3.2) and (3.3), the discrete inf-sup constant is positive (even uniformly in δ ∈ ∆)¹, which implies that (3.4) has a unique solution. The numerical approximation (3.4) was proposed in [And13], and further investigated in [SW20]. In both these references the analysis of the MR method was restricted to the case that A_a = 0. The introduction of the parameter β ≥ 1 allows to appropriately weight both terms in the least squares minimization. The solution u^δ of the MR problem is the solution of the resulting Euler-Lagrange equations (3.6), a useful representation when no efficient preconditioner is available and one has to resort to K^δ_Y. With "projected" or "approximate" optimal test space Z^δ := ran T^δ|_{X^δ}, a third equivalent formulation of (3.4) (see e.g. [DG11], [BS14, Prop. 2.2], [DG14]) is finding u^δ ∈ X^δ which solves the Petrov-Galerkin system (3.9). Note that (3.9) avoids the 'normal equations' (3.6). It will allow us to derive a quantitatively sharp estimate for the error in u^δ. From (3.3) and (3.5), one infers the following.

Theorem 3.1. Under conditions (3.1), (3.2), and (3.3), the solution u^δ ∈ X^δ of (3.6) exists uniquely, and satisfies a quasi-optimal error bound. (In particular, u^δ is the best approximation to u from X^δ when γ^{∂_t}_∆ = r_∆ = R_∆ = 1 and α = 0.)

¹ This follows by combining (3.13), (3.15), and (3.16).
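The algebraic structure of the MR method can be sketched as follows (illustrative code, not the paper's implementation: Bh, MY, G are random stand-ins for the discretized space-time operator, the Gram matrix of Y^δ, and the trace at t = 0). It checks that the normal equations and the self-adjoint indefinite mixed system, whose secondary variable is the Riesz lift of the residual, have the same primal solution:

```python
import numpy as np

# All matrices below are random stand-ins (assumed names, for illustration).
rng = np.random.default_rng(0)
m, n, k = 12, 5, 3
Bh = rng.standard_normal((m, n))          # plays the role of the discretized B
A = rng.standard_normal((m, m))
MY = A @ A.T + m * np.eye(m)              # SPD Gram ('Riesz') matrix of Y^delta
G = rng.standard_normal((k, n))           # plays the role of the trace gamma_0
g, u0, beta = rng.standard_normal(m), rng.standard_normal(k), 2.0

# normal equations: minimize ||Bh u - g||^2_{MY^{-1}} + beta ||G u - u0||^2
MYinv = np.linalg.inv(MY)
N = Bh.T @ MYinv @ Bh + beta * (G.T @ G)
u = np.linalg.solve(N, Bh.T @ MYinv @ g + beta * (G.T @ u0))

# equivalent self-adjoint, indefinite mixed system with the Riesz lift
# lam = MY^{-1}(g - Bh u) of the residual as secondary variable
S = np.block([[MY, Bh], [Bh.T, -beta * (G.T @ G)]])
sol = np.linalg.solve(S, np.concatenate([g, -beta * (G.T @ u0)]))
lam, u_mixed = sol[:m], sol[m:]

assert np.allclose(u, u_mixed)
assert np.allclose(lam, MYinv @ (g - Bh @ u))
```

Eliminating the secondary variable from the mixed system recovers exactly the normal equations, which is why a preconditioner for the SPD system suffices for MR.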
Remarks 3.2. One can verify an explicit expression which clarifies the behaviour of the bound from Theorem 3.1 in terms of α and γ^{∂_t}_∆. For α = 0 (and β = 1), the estimate from Theorem 3.1 equals the one found in [SW20, Thm. 3.7 & Rem. 3.8].
Proof. Let u be the solution of (2.4), i.e., g = Bu and u_0 = γ_0 u. The mapping P^δ ∈ L(X, X) from u to the solution u^δ ∈ X^δ of (3.4) or, equivalently, (3.6) or (3.9), is a projector onto X^δ which, by our assumption X^δ ∉ {0, X}, is unequal to 0 or Id.
The pair (λ, u), with u equal to the unique solution of (2.4), is the unique solution of (4.1). As we have seen in (2.6), this system is equivalent to (4.2). The formulation (4.2) of the parabolic equation can alternatively be derived from the application of the Brézis-Ekeland-Nayroles variational principle ([BE76, Nay76], cf. also [And12, §3.2.4]), which generalizes beyond the linear, Hilbert space setting.
Given δ ∈ ∆, we consider the Galerkin discretization of (4.3), i.e., (4.5). Remark 4.1. When X^δ ⊆ Y^δ, the solutions of BEN and MR are equal. Indeed, (3.14) shows that in this case the operator at the left-hand side of (4.5) equals the operator in (3.6), and one deduces that also the right-hand sides agree.
In contrast to MR, with BEN, however, it is not possible to replace ((E^δ_Y)′ A_s E^δ_Y)^{−1} by a general preconditioner as in (3.7) and still obtain a quasi-best approximation to (λ, u) from Y^δ × X^δ. This can be understood by noticing that replacing A_s^{−1} in (4.2) by another operator changes the solution, whereas this is not the case in (4.1). So for the iterative solution of BEN one has to operate on the saddle point system (4.4) instead of on a symmetric positive definite system as with MR, see (3.6).
On the other hand, with BEN it is not needed that X δ ⊆ Y δ , as we will see below.
The applicability of BEN for the case that A_a ≠ 0 was already demonstrated in [SW20]. The following result gives a quantitatively better error bound.
Theorem 4.2. Under the sole condition (3.2), the solution ū^δ ∈ X^δ of (4.5) exists uniquely, and satisfies a quasi-optimal error bound. Proof. With g = Bu and u_0 = γ_0 u, using B = C + A_s and γ′_0 γ_0 = γ′_T γ_T − (C + C′), the right-hand side of (4.5) can be rewritten. Since the involved mappings are projectors onto X^δ and Y^δ, respectively, the latter being orthogonal, corresponding identities hold for any v ∈ Y^δ and w ∈ X^δ. For w ∈ X, we have the corresponding bound by Proposition 2.2. Since (G(δ)·)(·) is symmetric positive semi-definite, we conclude by following the lines starting at the second line of (3.15), in particular showing that (E^δ_X)′ G(δ) E^δ_X is invertible.
The theorem follows by combining the above estimates.
In the final subsection of this section we will briefly comment on the construction of preconditioners at the Y-side, i.e., condition (3.3), and at the X-side. The preconditioner K^δ_Y is used for the reduction of the saddle-point system to the elliptic system (3.6), and as an ingredient for building a preconditioner for the saddle-point system (4.4), whereas K^δ_X can be applied for preconditioning (3.6), and as the other ingredient to construct a preconditioner for (4.4).
Since inf-sup conditions of the type γ^δ > 0, or inf_{δ∈∆} γ^δ > 0, will be encountered more often in this work, in an abstract setting we recall their relation with the existence of certain Fortin interpolators, see [DSW21, Thm. 3.10].

Proposition 5.1. If there exists a Fortin interpolator Q as in (5.2), then G ≥ 1/‖Q‖_{L(B,B)}. Conversely, if G > 0, then there exists a Q as in (5.2), which is a projector, and ‖Q‖_{L(B,B)} ≤ 2 + 1/G.

5.1. 'Full' tensor product case. Concerning the verification of inf_{δ∈∆} γ^δ > 0, we start with the easy case of X^δ and Y^δ being 'full' tensor products of approximation spaces in time and space (as opposed to sparse tensor products, see below). With Y_t := L_2(I) and X_t := H^1(I), for Z ∈ {X, Y} let (Z^δ_t)_{δ∈∆} and (Z^δ_x)_{δ∈∆} be families of closed subspaces of Z_t and V, respectively, and let Z^δ := Z^δ_t ⊗ Z^δ_x. Assuming (5.3) and (5.4), a tensor product argument shows that inf_{δ∈∆} γ^δ > 0. Obviously, (5.3) is true when (d/dt)X^δ_t ⊆ Y^δ_t, which however is not a necessary condition. For example, when X^δ_t is the space of continuous piecewise linears w.r.t. some partition of I, and Y^δ_t is the space of continuous piecewise linears w.r.t. a once dyadically refined partition, an easy computation ([And13, Prop. 6.1]) shows that γ^δ_t ≥ √3/4. Considering, for a domain Ω ⊂ R^d and Γ ⊂ ∂Ω, H = L_2(Ω) and V = H^1_{0,Γ}(Ω) := {v ∈ H^1(Ω) : v|_Γ = 0}, H^1(Ω)-stability of the L_2(Ω)-orthogonal projector onto Lagrange finite element spaces X^δ_x = Y^δ_x is an extensively studied subject. In view of Proposition 5.1, taking F to be the Riesz map H → H′ viewed as a mapping V → V′, this stability implies (5.4). For finite element spaces w.r.t. shape regular quasi-uniform partitions into, say, d-simplices, where Γ is the union of faces of T ∈ T, this stability follows easily from direct and inverse estimates. It is known that this stability also holds true for (shape regular) locally refined partitions when they are sufficiently mildly graded.
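The cited temporal inf-sup constant can be checked numerically. The following sketch (illustrative code, not from the paper) computes γ^δ_t = inf_w sup_v ∫_I w′v dt / (‖w′‖_{L_2(I)}‖v‖_{L_2(I)}) for continuous piecewise linears on a uniform partition against continuous piecewise linears on its dyadic refinement, as the square root of the smallest eigenvalue of a generalized eigenproblem:

```python
import numpy as np
from scipy.linalg import eigh

def hats(nodes):
    """P1 hat functions on a 1D mesh, returned as callables on arrays."""
    def phi(j):
        def f(x):
            y = np.zeros_like(x)
            if j > 0:
                m = (x >= nodes[j - 1]) & (x <= nodes[j])
                y[m] = (x[m] - nodes[j - 1]) / (nodes[j] - nodes[j - 1])
            if j < len(nodes) - 1:
                m = (x > nodes[j]) & (x <= nodes[j + 1])
                y[m] = (nodes[j + 1] - x[m]) / (nodes[j + 1] - nodes[j])
            return y
        return f
    return [phi(j) for j in range(len(nodes))]

n = 8
coarse = np.linspace(0, 1, n + 1)        # partition underlying X_t
fine = np.linspace(0, 1, 2 * n + 1)      # once dyadically refined, for Y_t

# 2-point Gauss quadrature on every fine element (exact for all products below)
gp = np.array([-1.0, 1.0]) / np.sqrt(3.0)
xq = np.concatenate([(a + b) / 2 + (b - a) / 2 * gp
                     for a, b in zip(fine[:-1], fine[1:])])
wq = np.full(xq.size, (fine[1] - fine[0]) / 2)

eps = 1e-7                               # FD step; quad points avoid the nodes
dX = np.array([(f(xq + eps) - f(xq - eps)) / (2 * eps) for f in hats(coarse)])
Yv = np.array([f(xq) for f in hats(fine)])

S = (dX * wq) @ dX.T                     # S_ij = int phi_i' phi_j'
M = (Yv * wq) @ Yv.T                     # mass matrix of the test space
B = (Yv * wq) @ dX.T                     # B_ij = int phi_j' psi_i

S1, B1 = S[1:, 1:], B[:, 1:]             # quotient out constants: fix w(0) = 0
A = B1.T @ np.linalg.solve(M, B1)        # realizes the sup over v (Riesz lift)
gamma = np.sqrt(eigh(A, S1, eigvals_only=True).min())
print(f"gamma = {gamma:.4f}")            # the text cites the bound sqrt(3)/4
```

The sup over v is realized by the L_2-projection of w′ onto the test space, so γ² is the smallest eigenvalue of the pencil (B′M^{−1}B, S) after deflating constants.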
In [GHS16], it is shown that in two space dimensions the meshes generated by newest vertex bisection satisfy this requirement, see also [DST20] for extensions.

5.2. Sparse tensor product case. As shown in [And13, Prop. 4.2], these results for full tensor products extend to sparse tensor products, where for Z ∈ {X, Y} the (Z^δ_t)_{δ∈∆} and (Z^δ_x)_{δ∈∆} are nested sequences of closed subspaces.

5.3. Time-slab partition case.
Another extension of the full tensor product case is given by the following. Let (X̄^δ, Ȳ^δ)_{δ∈∆} be a family of pairs of closed subspaces of X and Y for which the inf-sup condition holds uniformly, i.e., γ̄^δ ≥ γ̄_∆ > 0. Then if, for δ ∈ ∆, X^δ and Y^δ are such that, for some finite partition of I into time slabs, they restrict on each slab to pairs from this family, then γ^δ ≥ γ̄_∆ > 0, as one easily verifies by writing ∫_I ⟨du/dt, v⟩ dt = ∑_i ∫_{t_{i−1}}^{t_i} ⟨du/dt, v⟩ dt. An example of this 'time-slab partition' setting will be given in Sect. 7. Thinking of the X̄^δ as being finite element spaces, notice that the condition X^δ ⊂ X will require that possible 'hanging nodes' on the interface between different time slabs do not carry degrees of freedom.

5.4. Generalized sparse tensor product case. Finally, we informally describe a 'generalized' sparse tensor product setting that allows for local refinements driven by an a posteriori error estimator. For Z ∈ {X, Y}, let the nested sequences of closed subspaces Z_x ⊂ · · · ⊂ V be equipped with hierarchical bases, meaning that the basis on each level extends that of the previous level. Let us consider the usual case that the diameter of the support of a hierarchical basis function with level i is ≂ 2^{−i}, and let us assign to each basis function φ on level i > 0 one (or a few) parents with level i − 1 whose supports intersect the support of φ. We now let (Z^δ)_{δ∈∆} be the collection of all spaces that are spanned by sets of product hierarchical basis functions, which sets are downward closed (or lower) in the sense that if a product of basis functions is in the set, then so are all their parents in both directions. Note that the sparse tensor product spaces are included in this collection, but that it contains many more spaces.
Under conditions on the hierarchical bases for Z^{δ_0}_t ⊂ Z^{δ_1}_t ⊂ · · · ⊂ Z_t for Z ∈ {X, Y}, which should be of wavelet-type, in [SvVW21] it is shown that to any X^δ one can assign a Y^δ with dim Y^δ ≲ dim X^δ, such that γ^δ ≳ 1 holds.
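The downward-closedness condition on the index sets can be sketched as follows (illustrative code; a simplified model with a single parent per level in each direction, not the paper's hierarchical-basis setup):

```python
def is_downward_closed(pairs):
    """Check downward-closedness of a set of (time-level, space-level)
    index pairs, in the simplified model of one parent per level and
    direction (an illustration only)."""
    s = set(pairs)
    return all((i == 0 or (i - 1, j) in s) and (j == 0 or (i, j - 1) in s)
               for (i, j) in s)

# the sparse tensor product index set {(i, j) : i + j <= L} is downward closed
L = 3
sparse = [(i, j) for i in range(L + 1) for j in range(L + 1) if i + j <= L]
assert is_downward_closed(sparse)
assert not is_downward_closed([(0, 0), (1, 1)])   # parents (0, 1), (1, 0) missing
```

Adaptive refinement then only ever adds indices whose parents are already present, so every space visited stays in the admissible collection.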

5.5. Preconditioners.
Moving to condition (3.3), obviously we would like to construct K^δ_Y such that it is not only a uniform preconditioner, i.e., it satisfies (3.3), but also such that its application can be performed in O(dim Y^δ) operations. In the full tensor product case, after selecting bases for Y^δ_t and Y^δ_x, the construction of K^δ_Y boils down to tensorizing approximate inverses of the 'mass matrix' in time, which does not pose any problems, and the 'stiffness matrix' in space. For V = H^1(Ω) (or a subspace of the aforementioned type), it is well-known that by taking a multigrid preconditioner as the approximate inverse of the stiffness matrix the resulting K^δ_Y satisfies our needs. A straightforward generalization of this construction of K^δ_Y applies to spaces Y^δ that correspond to the time-slab partitioning approach.
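The tensorized application can be performed without ever assembling the Kronecker product, via the standard vec trick; a minimal sketch with toy stand-in matrices (Kt, Kx play the roles of the approximate inverses in time and space):

```python
import numpy as np

# Sketch: applying a tensorized preconditioner K = Kt (x) Kx without forming
# the Kronecker product. Kt, Kx stand in for (approximate) inverses of the
# temporal mass matrix and the spatial stiffness matrix; here small dense
# toy matrices for illustration only.
rng = np.random.default_rng(0)
nt, nx = 5, 7
Kt = np.linalg.inv(np.diag(rng.uniform(1, 2, nt)))            # toy temporal part
Kx = np.linalg.inv(np.diag(rng.uniform(1, 2, nx)) + 0.1 * np.eye(nx))  # toy spatial part
v = rng.standard_normal(nt * nx)

# vec trick: (Kt (x) Kx) vec(V) = vec(Kx V Kt^T), with column-major vec
V = v.reshape((nx, nt), order="F")
w_fast = (Kx @ V @ Kt.T).flatten(order="F")   # O(nt*nx*(nt+nx)) work
w_full = np.kron(Kt, Kx) @ v                  # reference, O((nt*nx)^2)
assert np.allclose(w_fast, w_full)
```

With a multigrid cycle in place of the explicit inverse for the spatial factor, the cost of one application is linear in dim Y^δ.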
Finally, for the efficient iterative solution of (3.6) or (4.4), one needs a self-adjoint K^δ_X = (K^δ_X)′ ∈ Lis((X^δ)′, X^δ) whose norm and the norm of its inverse are uniformly bounded, and whose application can be performed in O(dim X^δ) operations. For the full and generalized sparse tensor product settings such preconditioners have been constructed in [And16] and [SvVW21], respectively.

ROBUSTNESS
The quasi-optimality results presented in Theorems 3.1 and 4.2 for MR and BEN degenerate when α = ‖A_a‖_{L(Y,Y′)} → ∞. Aiming at results that are robust for α → ∞, we now study convergence w.r.t. the energy-norm |||·|||_X on X. On its own this change of norms turns out not to be helpful. Replacing ‖·‖_X by |||·|||_X in Theorems 3.1 and 4.2, and adapting their proofs in an obvious way, yields for MR the same upper bound for |||u − u^δ|||_X / inf_{w∈X^δ} |||u − w|||_X as we found for ‖u − u^δ‖_X / inf_{w∈X^δ} ‖u − w‖_X (for u ∉ X^δ), whereas instead of Theorem 4.2 we arrive at an only slightly more favourable bound which is, however, still far from being robust.
In order to obtain robust bounds, instead of the condition γ^{∂_t}_∆ > 0 (i.e., (3.2)) we now impose the condition γ^C_∆ > 0, which, when considering a family of operators A, we would like to hold uniformly for α → ∞.
We conclude that for a family of (asymmetric) operators A, robustness w.r.t. |||·|||_X is obtained when (γ^C_∆)^{−1} is uniformly bounded for α = ‖A_a‖_{L(Y,Y′)} → ∞. A family for which this will be realized is presented in Sect. 7.

6.1. A posteriori error estimation. In particular because for α = ‖A_a‖_{L(Y,Y′)} → ∞ meaningful a priori error bounds for inf_{w∈X^δ} |||u − w|||_X will be hard to derive, it is important to have (robust) a posteriori error bounds.
If for w ∈ X^δ and u the solution of (2.4) the stated conditions hold, then the a posteriori error estimator is an efficient and, modulo the data oscillation term e^δ_osc(g), reliable estimator of the error |||u − w|||_X. If sup_{δ∈∆} ‖Q^δ_B‖_{L(Y,Y)} and max(R_∆, 1)/min(r_∆, 1) are bounded uniformly in α → ∞, then this estimator is even robust.
Remark 6.2. In view of Proposition 5.1, the aforementioned assumptions can equivalently be formulated in terms of the existence of suitable Fortin interpolators. In applications the conditions γ^{∂_t}_∆ > 0, γ^C_∆ > 0, and γ^B_∆ > 0 are increasingly difficult to fulfill.
To have a meaningful reliability result, in addition we would like to find a Q^δ_B as above such that, for sufficiently smooth g, the term e^δ_osc(g) is asymptotically, i.e., for the 'mesh-size' tending to zero, of equal or higher order than the approximation error inf_{w∈X^δ} |||u − w|||_X. We will realize this in the setting that will be discussed in Sect. 7.2. We assume |Γ| > 0 when the latter ess inf is zero, so that (2.1) and (2.5) are valid. In this setting, the operators A_a, A_s = A_s(ε), and so A = A(ε) = A_s(ε) + A_a, are given by (7.1).
Let (T^δ)_{δ∈∆}, (T^δ_S)_{δ∈∆} be families of such partitions of Ω which are uniformly shape regular (which for d = 1 should be read as satisfying a uniform K-mesh property), and where T^δ_S is a refinement of T^δ of some fixed maximal depth in the sense that |T| ≂ |T′| for T^δ_S ∋ T ⊂ T′ ∈ T^δ, so that dim T^δ_S ≂ dim T^δ. On the other hand, fixing a q ≥ 1, we require that the refinement from T^δ to T^δ_S is sufficiently deep that it permits the construction of a projector P^δ_q for which (7.2)–(7.3) hold. As shown in [DSW21, Lemma 5.1 and Rem. 5.2], regardless of the refinement rule (e.g. red-refinement or newest vertex bisection) that is (recursively) applied to create (T^δ_S)_{δ∈∆} from (T^δ)_{δ∈∆}, there is a refinement of some fixed depth that suffices to satisfy (7.3) as well as (7.4). Condition (7.4) is stronger than (7.2), and will be relevant in Sect. 7.2 on robust a posteriori error estimation. For d ∈ {1, 2, 3} and q ∈ {1, 2, 3}, and both newest vertex bisection and red-refinement, it was verified that it is sufficient that the aforementioned depth creates in the space S^{0,q}_{T^δ_S,0} an additional number of degrees of freedom interior to any T ∈ T^δ that is greater than or equal to the binomial coefficient (q+d choose q).
Remark 7.1. To satisfy conditions (7.2)–(7.3), generally a smaller number of degrees of freedom interior to any T ∈ T^δ suffices. For d = 2 = q, in [DSW21, Appendix A] it was shown that in order to satisfy (7.2)–(7.3) it is sufficient to create T^δ_S from T^δ by one red-refinement, which creates only three such degrees of freedom, whereas to satisfy (7.3)–(7.4) six additional interior degrees of freedom are needed.
We show robustness of MR and BEN in a time-slab partition setting.

Proof. As follows from Proposition 5.1, the statement inf_{ε>0} γ^C_∆(ε) > 0 is implied by the existence of a suitable uniformly bounded Fortin interpolator Q^δ, which we construct below,
where we recall that, thanks to b being constant, Y = L_2(I; H^1_{0,Γ}(Ω)) is equipped with the energy-norm from Sect. 2. It holds that Q^δ, defined per time slab (t_{i−1}, t_i) × Ω, satisfies (7.6). Indeed its uniform boundedness w.r.t. the energy-norm on Y follows from the boundedness of Q^δ_x w.r.t. both the L_2(Ω)- and H^1(Ω)-norms. Using (7.7), one verifies the third condition in (7.6).
We next construct Q^δ_x.
We take Q̄^δ_x as a modified Scott-Zhang quasi-interpolator onto S^{0,q}_{T^δ_S,0} ([GL01, Appendix]). The modification consists in setting the degrees of freedom on Γ to zero. When applied to a function from H^1_{0,Γ}(Ω) it equals the original Scott-Zhang interpolator ([SZ90]), but thanks to the modification it is uniformly bounded w.r.t. L_2(Ω), and so ‖Q̄^δ_x‖_{L(L_2(Ω),L_2(Ω))} is uniformly bounded.
With h_δ P^δ_q h_δ^{−1} ∈ L(L_2(Ω), L_2(Ω)), h_δ^{−1} P^δ_q h_δ ∈ L(L_2(Ω), L_2(Ω)), and Q̄^δ_x ∈ L(H^1_{0,Γ}(Ω), H^1_{0,Γ}(Ω)) all being uniformly bounded, and ‖·‖_{H^1(Ω)} ≂ ‖h_δ^{−1}·‖_{L_2(Ω)} on S^{0,q}_{T^δ_S,0}, we infer the uniform boundedness of Q^δ_x. Next we consider the case of variable b and e. The scaling argument that was applied directly below Theorem 2.1 shows that it is no real restriction to assume that ess inf(e − ½ div_x b) > 0. Although we will not be able to show inf_{ε>0} γ^C_∆(ε) > 0, this inf-sup condition will be valid modulo a perturbation which can be dealt with using Young's inequality similarly as in the proofs of Theorems 3.1 and 4.2. It will result in ε- (and β-) robust quasi-optimality results for MR and BEN similar to those for constant b and constant e ≥ 0.

Theorem 7.3. Consider the situation of Theorem 7.2, but now without the assumption of b and e being constants. Assume b ∈ W^1_∞(I × Ω)^d, ess inf(e − ½ div_x b) > 0, and, only for the case that b is time-dependent, an additional condition on b. Then the quasi-optimality bounds for MR and BEN hold uniformly in ε > 0 and β ≥ 1.
Proof. As in the proof of Theorem 6.1, we follow the proofs of Theorems 3.1 (MR) and 4.2 (BEN). We only need to adapt the derivation of a lower bound for the expression in the second line of (3.15). With ξ := ess inf(e − ½ div_x b), it holds that ξ‖·‖_Y ≤ ‖·‖_{L_2(I×Ω)} ≤ (1/√ξ)‖·‖_Y. Let b_δ be the piecewise constant vector field defined by taking the average of b over each prismatic element (t_{i−1}, t_i) × T, by (7.11). An application of the inverse inequality on the family of spaces (S^{0,q}_{T,0})_{T∈∆} shows that for some constant L > 0 the corresponding bound holds for w ∈ X^δ. Because (7.7) is also valid for piecewise constant b, and only dependent on ‖e − ½ div_x b‖_{L_∞(I×Ω)}/ξ, the proof of Theorem 7.2 shows that for some constant γ > 0, for w ∈ X^δ, a lower bound holds involving ‖·‖²_{L_2(I;H^1_{0,Γ}(Ω))} + ε²‖·‖²_{L_2(I;H^1_{0,Γ}(Ω))} + ε‖γ_T·‖² + (1 − ε)‖γ_0·‖². Thus even for a general smooth u, √ε times the approximation error cannot be expected to be smaller than h²_δ. Since for g ∈ L_2(I; H^1(Ω)) ∩ H^2(I; L_2(Ω)) the data-oscillation term is of order h²_δ ([DSW21, Thm. 5.6]), we conclude that E^δ(w; g, u_0, β) from (6.4) is an efficient and, modulo the above satisfactory data-oscillation term, reliable a posteriori estimator of the error in w in |||·|||_X-norm.

NUMERICAL TEST
We tested the minimal residual (MR) method applied to the parabolic initial value problem with the singularly perturbed 'spatial component' as given in (7.1). We considered the simplest case where I = Ω = (0, 1), b = 1, and e is either 0 or 1, and X^δ = S^{0,1}_{I^δ} ⊗ S^{0,1}_{T^δ,0}, where I^δ = T^δ is a uniform partition of the unit interval with mesh size h_δ. Taking always (K^δ_Y)^{−1} = (E^δ_Y)′ A_s E^δ_Y, we considered two choices of the test space: (i) a choice which for any fixed ε > 0 gives γ^{∂_t}_∆ > 0 (Sect. 5.1), so that the MR approximations are quasi-optimal approximations from the trial space w.r.t. ‖·‖_X (Thm. 3.1), or (ii) Y^δ built from S^{−1,1} spaces w.r.t. T^δ_S, where T^δ_S is a uniform partition with mesh-size h_δ/3, which even gives inf_{ε>0} γ^C_∆(ε) > 0 (Thm. 7.2), so that the MR approximations are quasi-optimal approximations from the trial space w.r.t. the energy-norm |||·|||_X also uniformly in ε > 0 (Thm. 6.1). Remark 4.1 shows that in these cases the BEN and MR methods give the same solution.
As discussed in Sect. 7.2, for the case that e = 0 it is natural to take the weight β = ε^{−1}. Unlike with e = 0, for e = 1 and 0 ≠ v ∈ Y the energy-norm √((A_s v)(v)) does not tend to zero for ε ↓ 0 but converges to ‖v‖_{L_2(I×Ω)}. In view of this there is no reason to let β tend to infinity for ε ↓ 0, and we took β = 1.
For Y^δ as in (ii), in Sect. 7.2 it was shown that for (e, β) = (0, ε^{−1}) it holds that inf_{ε>0} γ^B_∆(ε) > 0, and more specifically that the a posteriori error estimator E^δ(w; g, u_0, β) from (6.4) is an efficient and, modulo a data-oscillation term which is at least of equal order, reliable estimator of the error |||u − w|||_X. Therefore, to assess our numerical results, we used Y^δ as in Option (ii) for error estimation, even when solving with Y^δ as in (i).
For (e, β) = (1, 1), we numerically observed that for our model problems the a posteriori error estimator E^δ(w; g, u_0, β) computed with Y^δ as in (ii) is efficient and reliable: knowing that the estimator equals |||u − w|||_X for Y^δ = Y, we saw that further overrefinement of the test space Y^δ never increased the estimated error by more than a percent. So again, regardless of whether we took Y^δ as in Option (i) or (ii), we used Y^δ as in (ii) to compute E^δ(w; g, u_0, β).
In the experiments below, we choose ε = 1, 10^{−1}, 10^{−3}, 10^{−6}; to compare different values of ε, we show the estimated error divided by an accurate approximation of √(‖g‖²_{Y′} + β‖u_0‖²), which is equal to the |||·|||_X-norm of the exact solution.
The left of Figure 2 shows the relative estimated error progression of Option (ii) as a function of dim X^δ; as Option (i) again suffers from degradation for small ε (with results very similar to the left of Figure 1), we omit a graph of its error progression. Its right shows the discrete solution at h_δ = 1/512 and ε = 10^{−6}. The solution resembles the pure transport solution quite well, with the exception of a small artefact near x = t = 0. FIGURE 2. Solving the internal layer problem with Option (ii). Left: relative estimated error progression as function of dim X^δ for different diffusion rates ε. Right: solution at h_δ = 1/512 and ε = 10^{−6}.
8.3. Boundary layer problem. We choose u_0(x) := sin(πx) and g = 0, select (e, β) = (1, 1), and set homogeneous Dirichlet boundary conditions on ∂Ω, i.e., Γ = {0, 1}. Due to the condition on the outflow boundary, the problem is ill-posed in the limit ε = 0; hence for ε small, the solution has a boundary layer at x = 1. Figure 3 shows that the method fails to make progress until the boundary layer is resolved at h_δ ≲ ε. Figure 4 shows two discrete solutions at h_δ = 1/512 computed for Option (ii). We see that for ε = 10^{−3}, the boundary layer is resolved and the solution resembles the pure transport solution quite well, with the exception of a small artefact near x = t = 1. For ε = 10^{−6} though, the boundary layer cannot be resolved with the current (uniform) mesh, and the solution is completely wrong. For ε ↓ 0, the squared energy-norm of the error in an approximation w approaches ‖(∂_t + b·∇_x)w‖²_{L_2(I×Ω)} + ‖u_0 − γ_0 w‖²_{L_2(Ω)}. As a result, for streamlines that hit the outflow boundary, the method 'chooses' to smear the unavoidably large error caused by the layer along the whole streamline, resulting in a globally bad approximation. This is a well-known phenomenon when using a least squares method to approximate a solution that has a sharp layer or a shock.

8.4. Imposing outflow boundary conditions weakly. One common work-around for the problem caused by the boundary layer is to refine the mesh strongly towards this layer. An alternative is to impose the Dirichlet boundary condition at the outflow boundary only weakly, see e.g. the references [CDW12, BS14, CEQ14, CFLQ14] where this approach has been applied with least squares methods for stationary convection-dominated convection-diffusion problems. Without having a rigorous analysis, we tried this second approach by computing, with Y^δ as in Option (ii), u^δ := argmin_{w∈X̄^δ} ‖(E^δ_Y)′(B E^δ_X w − g)‖²_{(Y^δ)′} + β‖γ_0 E^δ_X w − u_0‖² + ε‖w(·, 1)‖²_{L_2(I)}.
Here, X̄^δ denotes the space X^δ after removing the Dirichlet boundary condition at x = 1. Figure 5 shows the resulting error progression, which is robust in ε, as well as the minimal residual solution at h_δ = 1/512 and ε = 10^{−6}; it resembles the pure transport solution quite well, and does not suffer from the artefact present at the right of Figure 4.
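In matrix form, the weak imposition only adds one more quadratic term to the least-squares functional; a minimal sketch with random stand-in matrices (assumed names: Bh for the discretized operator on the enlarged trial space, G0 for the trace at t = 0, G1 for the trace on the outflow boundary x = 1):

```python
import numpy as np

# Sketch of the penalized functional
#   ||Bh w - g||^2_{MY^{-1}} + beta ||G0 w - u0||^2 + eps ||G1 w||^2
# and its normal equations; all matrices are illustrative stand-ins.
rng = np.random.default_rng(2)
m, n, k = 15, 7, 4
Bh = rng.standard_normal((m, n))
Aq = rng.standard_normal((m, m))
MY = Aq @ Aq.T + m * np.eye(m)            # SPD Gram matrix of the test space
MYinv = np.linalg.inv(MY)
G0 = rng.standard_normal((k, n))          # trace at t = 0
G1 = rng.standard_normal((1, n))          # trace on the outflow boundary
g, u0 = rng.standard_normal(m), rng.standard_normal(k)
beta, eps = 1.0, 1e-3

N = Bh.T @ MYinv @ Bh + beta * (G0.T @ G0) + eps * (G1.T @ G1)
w = np.linalg.solve(N, Bh.T @ MYinv @ g + beta * (G0.T @ u0))

def J(v):
    """The penalized least-squares functional."""
    r = Bh @ v - g
    return r @ MYinv @ r + beta * np.sum((G0 @ v - u0) ** 2) + eps * np.sum((G1 @ v) ** 2)

d = rng.standard_normal(n)
assert J(w) <= J(w + 0.1 * d) and J(w) <= J(w - 0.1 * d)   # w minimizes J
```

Since the penalty only modifies the normal matrix by a rank-one SPD term, the solver and preconditioner for the unpenalized MR system can be reused unchanged.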