1 Introduction

Convection–diffusion equations are a basic model to describe the distribution of a scalar quantity in fluids. Besides modeling the heat distribution in a room (energy balance), they can describe the concentration of drugs in blood and the propagation of chemical substances in water (mass balance) to name just a few. Mathematically speaking, they are given in a bounded domain \(\Omega \subset {\mathbb {R}}^d\), \(d\in \{2,3\}\), with polyhedral Lipschitz boundary \(\Gamma = \Gamma _{\textrm{D}} \cup \Gamma _{\textrm{N}}\) with \(\Gamma _{\textrm{D}} \cap \Gamma _{\textrm{N}} = \emptyset \). The steady-state convection–diffusion–reaction problem with homogeneous Neumann boundary conditions on \(\Gamma _{\textrm{N}}\) then reads as follows: Find a sufficiently smooth function u such that

$$\begin{aligned} \begin{array}{rcll} -{\varepsilon }\Delta u + {{\varvec{b}}}\cdot \nabla u + cu &{} = &{} f &{} \text { in } \Omega , \\ u &{} = &{} g &{} \text { on } \Gamma _{\textrm{D}}, \\ {\varepsilon }\nabla u \cdot {{\varvec{n}}}&{} = &{} 0 &{} \text { on } \Gamma _{\textrm{N}}, \end{array} \end{aligned}$$
(1)

where \({\varepsilon }>0\) is the diffusion coefficient, the convection field is denoted by \({{\varvec{b}}}\in [W^{1,\infty }(\Omega )]^d\), \(c\in L^\infty (\Omega )\) describes the reaction coefficient, and \(f\in L^2(\Omega )\) models sources. On the Dirichlet boundary \(\Gamma _{\textrm{D}}\) Dirichlet conditions g are prescribed and the outer unit normal vector on the boundary of \(\Omega \) is denoted by \({{\varvec{n}}}\). At the inflow boundary \(\Gamma _- = \{ {{\varvec{x}}}\in \Gamma \;:\;{{\varvec{b}}}({{\varvec{x}}})\cdot {{\varvec{n}}}({{\varvec{x}}})< 0 \}\), Dirichlet boundary conditions have to be prescribed, i.e., \(\Gamma _- \subset \Gamma _{\textrm{D}}\).

In many applications the convective transport dominates the diffusive one. Then, the characteristic feature of solutions of Eq. (1) are layers, which are thin regions with a very large gradient. The thickness of layers is usually so small that they cannot be resolved on feasible grids. This situation, which is mathematically expressed by \({\varepsilon }\ll h\Vert {{\varvec{b}}}\Vert _{L^\infty (\Omega )}\), where h is the (local) mesh size, is called the convection-dominated regime. It is a typical property of multiscale problems that very important features of the solution cannot be resolved. Consequently, convection–diffusion–reaction problems are usually multiscale problems, with the layers being the subgrid scales. It is well known that in this case the discrete solution to Eq. (1) obtained by classical numerical schemes, like the central finite difference method and Galerkin finite element method, exhibits huge so-called spurious oscillations, i.e., unphysical values such as negative concentrations or an unreasonable amount of energy, e.g., see [1,2,3,4].

There have been many contributions concerning discontinuous Galerkin (DG) methods for discretizing second order elliptic boundary value problems in the last decades, e.g., see the monographs [5,6,7], even though they were already invented in 1973 in [8]. One advantage is that, compared with conforming finite elements, hp-refinement on both simplicial and also polygonal and polyhedral meshes can be performed very easily, e.g., see [9, 10]. With respect to convection–diffusion equations, DG methods with a standard upwind flux are stable discretizations in the convection-dominated regime. It was shown in [6, 11,12,13] that they even control the streamline derivative without needing an additional term, as it is the case for conforming finite elements. Furthermore, DG methods are known to produce very sharp layers in the convection-dominated regime. But on the other hand these methods also have the flaw of producing large over- and undershoots [4, 14, 15].

A computational cheap way to significantly reduce spurious oscillations in a post-processing step are so-called slope limiters. In a first step, they identify so-called troubled-cells where over- and undershoots occur and, in a second step, replace the solution locally by a polynomial of lower degree. The solution is usually replaced by a constant approximation [16, 17] or a (at most) linear one [5, 18]. This approach is typical for appropriate numerical methods for multiscale problems in the sense that different schemes are applied for the different scales. Using low order finite elements in a vicinity of layers is fine since error bounds for higher order elements contain the norm of the solution in a higher order Sobolev space and these norms scale (locally at layers) with inverse powers of the diffusion coefficient. The power increases with the order of the Sobolev space and finally there is a huge constant in the error bound such that it cannot be expected to obtain a better accuracy in a neighborhood of layers when using higher order elements there. In [14, 15] several of these methods have been numerically investigated for convection-dominated convection–diffusion equations. These methods share the advantage that they are computationally cheap, keep the higher order approximation away from layers, and most important that most of the methods also reduce the oscillations significantly, but not completely [14, 15].

Within the last decades the interest in deep neural networks as a form of deep learning has risen sharply. Thanks to their ability to be universal function approximators [19,20,21, Chapter 6.4.1] and classifiers [22], they have also been used to detect troubled-cells. In 2018, Ray and Hesthaven have trained a multilayer perceptron (MLP) to identify troubled-cells for one-dimensional scalar and systems of conservation laws [23]. They have observed that their MLP detector can mimic a classical limiter but without the need of fine-tuning a parameter. Their results have been extended by the authors to two-dimensional problems in [24]. Liu et al. have trained a convolutional neural network (CNN) based shock detector for Euler’s gas equations and saw that their detector was significantly faster than classical ones [25]. In 2018 and 2020, Han Veiga and Abgrall have trained a MLP detector based on data from a Runge–Kutta DG scheme and tested how well it can then be transferred to a residual distribution (RD) scheme without retraining it [26, 27]. Again, they have used scalar transport equations and Euler’s gas equations. Morgan et al. have trained and tested a MLP detector with a Lagrangian RD method to detect troubled-cells in the two-dimensional Taylor–Green vortex and Triple-point vortex [28]. Beck et al. have trained a CNN based limiter in 2020 for a DG spectral element method to approximate the solution of Euler’s gas equations [29]. Their limiter is able to both detect cells and also locate the position of the shock inside the cell. As it can be seen, all these results are for different numerical schemes for the discretization of scalar or hyperbolic systems of equations, mainly Euler’s gas equations. Neural networks have been applied also with respect to other aspects of the numerical solution of partial differential equations, e.g., see [30,31,32,33,34].

The goal of this paper consists in performing a first step in systematically exploring the behavior of MLPs for classifying mesh cells for the numerical solution of convection–diffusion–reaction problems. In contrast to the above mentioned papers, this manuscript studies an elliptic boundary value problem. On the one hand, slope limiting processes in the hyperbolic and the elliptic case are quite similar: Based on local features of the numerical solution, slope limiting algorithms try to detect the cells where spurious oscillations are present. Indeed, many slope limiting techniques that were proposed for elliptic problems are either initially constructed for hyperbolic ones or are modified versions of methods for such problems. Therefore, in general it could be expected that similar architectures of the neural networks can be employed in the elliptic case as in the hyperbolic counterparts. This is also the motivation why in the present work multilayer perceptron models are used as in [23, 24, 26,27,28]. Note that with the same argument it should also be possible to use CNN-based architectures as done in [25, 29], but for the sake of brevity, this approach is postponed to future work. Moreover, due to the close relationship between slope limiters for hyperbolic and elliptic problems, it is not surprising that many features that are used as the input to the MLPs in this work are also considered in [23, 24, 26,27,28], e.g., the integral mean of the cell and its neighbors, and values at interface midpoints.

On the other hand, numerical methods for hyperbolic (or convection-dominated parabolic) problems apply often small time steps. Then, the starting solution, which is (nearly) free of spurious oscillations, can be utilized to reduce such oscillations in the solution of the next time step, since it can be expected that both solutions are in some sense not that much different. This strategy is used, e.g., in the construction of limiters for flux corrected transport schemes, e.g., see [35] for a description. However, there is no such starting solution for elliptic problems. Slope limiting for elliptic problems has to be applied to a solution that was computed with some (stabilized) discretization and which possesses usually notable spurious oscillations.

As already mentioned, the present work focuses on the elliptic boundary value problem given in Eq. (1). It is well known that for computing an accurate solution of Eq. (1) without or with only small (acceptable) spurious oscillations, one has to treat the subgrid scale layer regions differently than the large scale regions, see the recent survey paper [35]. This first step aims to figure out whether or not there are MLPs that work well for a situation where the result is already principally known, just to have a kind of benchmark situation to compare with. This means, this first step should be considered as a proof of concept. The considered situation is the above described reduction of spurious oscillations in DG methods using limiters. To this end, a limiter is going to be trained with data that is obtained by applying classical limiters to the lowest order discrete solution of a standard benchmark problem defined in [36]. Several architectures are tested and it will be investigated how well the resulting limiter can be applied to higher order solutions and to another benchmark problem, the so-called Hemker example from [37]. Since standard limiters are already quite efficient, there is no need to replace them by the MLP-based limiter for increasing the efficiency. However, using the MLP-based limiter might be appealing in practice, since it combines good properties of various traditional limiters and it is not necessary that the user chooses explicitly a concrete limiter, together with the necessary parameters. But one should be aware that the MLP-based limiter depends implicitly on the parameters of traditional limiters that were used in the training process. Our principal intention consists in using the results and experience obtained in this proof of concept study in situations that require to perform different algorithms for different mesh cells and where efficient methods are not known. Such a situation is the choice of local parameters in stabilized discretizations of convection–diffusion–reaction problems. Currently available approaches, based on solving non-convex optimization problems, e.g., [38, 39], are rather time-consuming.

The remainder of the paper is structured as follows: in Sect. 2, both the standard DG method for discretizing equation (1) and relevant classical slope limiters are introduced. Section 3 follows with explaining the multilayer perceptron model and how the data is created with which the MLP limiter is trained. Several architectures are trained in Sect. 4 and are tested numerically. The paper concludes with a short summary and outlook. All data and code created and used for this work can be found at www.doi.org/10.20347/40vd-f944 [40].

In what follows the usual notation is used for Lebesgue and Sobolev spaces and their respective norms. The inner product in \(L^2(\Omega )\) is denoted by \((\cdot ,\cdot )\), a norm of a space X is denoted by \(\Vert \cdot \Vert _X\) and a seminorm by \(\vert \cdot \vert _X\).

2 Discontinuous Galerkin Methods and Slope Limiter for Convection–Diffusion Equations

2.1 Discontinuous Galerkin Methods

Equation (1) can be transformed to its weak formulation in a standard way which then reads as follows: Find \(u\in H^1(\Omega )\) such that \(u=g\) on \(\Gamma _{\textrm{D}}\) and

$$\begin{aligned} \left( {\varepsilon }\nabla u,\nabla v\right) + \left( {{\varvec{b}}}\cdot \nabla u + cu, v\right) = (f,v){} & {} \forall v \in H^1_{\textrm{D},0}(\Omega ), \end{aligned}$$
(2)

where \(H^1_{\textrm{D},0}(\Omega ) {:}{=}\{ v\in H^1(\Omega ) \;:\;v = 0 \text { on } \Gamma _{\textrm{D}} \}\). Under the assumptions that

$$\begin{aligned} c- \frac{1}{2} \nabla \cdot {{\varvec{b}}}\ge 0,{} & {} \Gamma _{\textrm{D}} \ne \emptyset ,{} & {} {{\varvec{b}}}\cdot {{\varvec{n}}}\ge 0 \text { on } \Gamma _{\textrm{N}}, \end{aligned}$$

by applying the Lax–Milgram Lemma it can be proven that problem (2) possesses a unique weak solution, e.g., see [1, Section III.1.1].

To introduce the DG discretization of (2), some notation needs to be fixed. Let \(\{ {\mathcal {T}}_h \}\), \(0<h\), be a quasi-uniform family of decompositions of \({\overline{\Omega }}\) into simplicial or quadrilateral/hexahedral meshes such that for any h it holds \({\overline{\Omega }} = \cup _{K\in {\mathcal {T}}_h} K\) and the cells have pairwise disjoint interiors. As usual the triangulations should be admissible, see [41, p. 38, p. 51], i.e., among other properties, each facet of a mesh cell that lies on \(\Gamma \) is either contained in \(\Gamma _{\textrm{D}}\) or \(\Gamma _{\textrm{N}}\). The set of all facets is denoted by \({\mathcal {E}}_h{:}{=}\cup _{K\in {\mathcal {T}}_h}{\mathcal {E}}_h(K)\), where \({\mathcal {E}}_h(K)\) is the set of all facets \(E\subset \partial K\) of a cell K. Furthermore, this set can be decomposed into the set of all interior facets \({\mathcal {E}}_h^\textrm{I}\) and boundary facets \(\partial {\mathcal {E}}_h{:}{=}{\mathcal {E}}_h\cap \partial \Omega \). The inflow boundary facets are called \({\mathcal {E}}_h^{-}{:}{=}\Gamma _{-}\cap {\mathcal {E}}_h\), the set of Dirichlet boundary facets is denoted by \({\mathcal {E}}_h^\textrm{D}{:}{=}\partial {\mathcal {E}}_h\cap \Gamma _\textrm{D}\) and the notation \({\mathcal {E}}_h^\textrm{ID}{:}{=}{\mathcal {E}}_h^\textrm{I} \cup {\mathcal {E}}_h^\textrm{D}\) is set. In addition to that, \(\vert K\vert \) denotes the d-volume and \(h_K\) the diameter of a cell \(K\in {\mathcal {T}}_h\) and \(h{:}{=}\max _{K\in {\mathcal {T}}_h}h_K\) is defined. Due to the regularity of the family of triangulations there exists a constant \(C>0\) such that for all \({\mathcal {T}}_h\) and \(K\in {\mathcal {T}}_h\) it holds \(h_E\le h_K\le Ch_E\), where \(h_E\) is the diameter of a facet \(E\in {\mathcal {E}}_h(K)\).

If there exists a facet \(E\in {\mathcal {E}}_h(K_i)\cap {\mathcal {E}}_h(K_j)\) that is shared by the cells \(K_i,K_j\in {\mathcal {T}}_h\), then the cells are called neighbors. Under the assumption that there is a fixed numbering of the mesh cells \(K_0\), \(K_1\), ...\(\in {\mathcal {T}}_h\), the unit normal vector \({{\varvec{n}}}_E\) on a facet \(E\in {\mathcal {E}}_h\) is defined by

$$\begin{aligned} {{\varvec{n}}}_E {:}{=}{\left\{ \begin{array}{ll} {{\varvec{n}}}_K, &{}\text { if }E\in \partial {\mathcal {E}}_h\cap {\mathcal {E}}_h(K) \text { for a } K\in {\mathcal {T}}_h,\\ {{\varvec{n}}}_{K_i}, &{}\text { if } K_i \text { and } K_j \text { are neighbors along facet } E \text { and } i < j, \end{array}\right. } \end{aligned}$$

where \({{\varvec{n}}}_K\) denotes the outward unit normal vector of the cell \(K\in {\mathcal {T}}_h\).

The below defined DG space is a subspace of the broken Sobolev space

$$\begin{aligned} H^k({\mathcal {T}}_h) = \{ v\in L^2(\Omega ) \;:\;v\vert _K \in H^k(K) \text { for any } K\in {\mathcal {T}}_h \} \supset H^k(\Omega ), \qquad k\in \mathbb {N}, \end{aligned}$$

that is equipped with its piecewise-defined norm and semi-norm

$$\begin{aligned} \Vert v\Vert _{H^k({\mathcal {T}}_h)}^2 {:}{=}\sum _{K\in {\mathcal {T}}_h} \Vert v\Vert _{H^k(K)}^2, \quad |v|_{H^k({\mathcal {T}}_h)}^2 {:}{=}\sum _{K\in {\mathcal {T}}_h} |v|_{H^k(K)}^2. \end{aligned}$$

Given a fixed \(p\in \mathbb {N}\), the finite element space is defined by

$$\begin{aligned} V_{h,p} {:}{=}\left\{ v_h\in L^2(\Omega ) \,:\, v_h\vert _K \in {\mathcal {R}}_{p}(K) \text { for any } K\in {\mathcal {T}}_h \right\} \subset H^k({\mathcal {T}}_h), \end{aligned}$$

where \({\mathcal {R}}_{p}(K){:}{=}P_p(K)\) is the space of polynomials of at most degree p on simplicial mesh cells and \({\mathcal {R}}_{p}(K){:}{=}Q_p(K)\) is the tensor product space of polynomials of at most degree p on quadrilateral/hexahedral cells.

Both \(H^k({\mathcal {T}}_h)\) and \(V_{h,p}\) contain functions that are discontinuous along interior facets \(E\in {\mathcal {E}}_h\). Hence, a given function \(v\in V_{h,p}\) itself is not well-defined on E but its jump \(\llbracket v \rrbracket \) and average \(\left\langle v \right\rangle \) can be defined for any \({{\varvec{x}}}\in E\) by

$$\begin{aligned} \llbracket v \rrbracket ({{\varvec{x}}}) {:}{=}{\left\{ \begin{array}{ll} v\vert _{K_i}({{\varvec{x}}}) - v\vert _{K_j}({{\varvec{x}}}), &{}\text { if }K_i \text { and } K_j \text { are neighbors along facet } E \text { and } \\ &{}\ i < j,\\ v\vert _{ K}({{\varvec{x}}}), &{}\text { if } E\in \partial {\mathcal {E}}_h\cap {\mathcal {E}}_h(K) \text { for a } K\in {\mathcal {T}}_h, \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} \left\langle v \right\rangle ({{\varvec{x}}}) {:}{=}{\left\{ \begin{array}{ll} \frac{1}{2} (v\vert _{K_i}({{\varvec{x}}}) + v\vert _{K_j}({{\varvec{x}}})), &{}\text { if }K_i \text { and } K_j \text { are neighbors along facet } E \\ &{}\text { and } i \ne j,\\ v\vert _{K}({{\varvec{x}}}), &{}\text { if } E\in \partial {\mathcal {E}}_h\cap {\mathcal {E}}_h(K) \text { for a } K\in {\mathcal {T}}_h. \end{array}\right. } \end{aligned}$$

Finally, the DG discretization of (1) reads as follows: Find \(u_h\in V_{h,p}\) such that

$$\begin{aligned} a_{\textrm{DG}}(u_h, v_h) = f_{\textrm{DG}}(v_h)\quad \forall \ v_h\in V_{h,p}, \end{aligned}$$
(3)

where the bilinear form \(a_{\textrm{DG}}:\,H^1({\mathcal {T}}_h)\times H^1({\mathcal {T}}_h) \rightarrow \mathbb {R}\) is defined as \( a_{\textrm{DG}}(v, w) {:}{=}a_\varepsilon (v,w) + a_{bc}(v, w) \), where \(v,w\in H^1({\mathcal {T}}_h)\), with

$$\begin{aligned} \begin{aligned} a_{\varepsilon }(v, w) =&\sum _{K\in {\mathcal {T}}_h} \int _K \varepsilon \nabla v \cdot \nabla w\ \,\textrm{d}{{\varvec{x}}} \\ {}&- \sum _{E\in {\mathcal {E}}_h^{\textrm{ID}}} \varepsilon \int _{E} \Big (\left\langle \nabla v \cdot {{\varvec{n}}}_E \right\rangle \llbracket w \rrbracket +\kappa \left\langle \nabla w \cdot {{\varvec{n}}}_E \right\rangle \llbracket v \rrbracket \Big )\ \,\textrm{d}s \\ {}&+ \sum _{E\in {\mathcal {E}}_h^{\textrm{I}}} \frac{\sigma }{h_E} \int _E\llbracket v \rrbracket \llbracket w \rrbracket \ \,\textrm{d}s + \sum _{E\in {\mathcal {E}}_h^{\textrm{D}}} \frac{2\sigma }{h_E} \int _E v w\ \,\textrm{d}s \end{aligned} \end{aligned}$$
(4)

and

$$\begin{aligned} \begin{aligned} a_{bc}(v, w) =&\sum _{K\in {\mathcal {T}}_h} \int _K \big ( {{\varvec{b}}}\cdot \nabla vw + cvw\big ) \,\textrm{d}{{\varvec{x}}} - \sum _{E\in {\mathcal {E}}_h^{\textrm{I}}} \int _{E} {{\varvec{b}}}\cdot {{\varvec{n}}}_E \llbracket v \rrbracket \left\langle w \right\rangle \,\textrm{d}s \\ {}&+ \sum _{E\in {\mathcal {E}}_h^{\textrm{I}}} \int _E \frac{\eta }{2} \vert {{\varvec{b}}}\cdot {{\varvec{n}}}_E\vert \llbracket v \rrbracket \llbracket w \rrbracket \,\textrm{d}s - \sum _{E\in {\mathcal {E}}_h^-} \int _E {{\varvec{b}}}\cdot {{\varvec{n}}}_E vw\ \,\textrm{d}s. \end{aligned} \end{aligned}$$
(5)

The discrete right-hand side \(f_{\textrm{DG}}: \ H^1({\mathcal {T}}_h)\rightarrow \mathbb {R}\) of (3) is defined by

$$\begin{aligned} \begin{aligned} f_{\textrm{DG}}(w) =&\sum _{K\in {\mathcal {T}}_h} \int _K fw\ \,\textrm{d}{{\varvec{x}}} - \sum _{E\in {\mathcal {E}}_h^-} \int _E {{\varvec{b}}}\cdot {{\varvec{n}}}_E gw\ \,\textrm{d}s \\ {}&- \sum _{E\in {\mathcal {E}}_h^\textrm{D}} \varepsilon \kappa \int _E \nabla w\cdot {{\varvec{n}}}_E g\ \,\textrm{d}s +\sum _{E\in {\mathcal {E}}_h^\textrm{D}} \frac{2\sigma }{h_E} \int _E gw\ \,\textrm{d}s. \end{aligned} \end{aligned}$$
(6)

The discrete scheme (3) contains three user-chosen parameters. The parameter \(\kappa \) controls the symmetry of (4) where \(\kappa =1 \) corresponds to the symmetric interior penalty Galerkin (SIPG), \(\kappa = 0\) to the incomplete interior penalty Galerkin (IIPG), and \(\kappa = -1\) to the non-symmetric (NIPG) discretization of the Laplacian. The stability parameter \(\sigma \), also called penalty parameter, in (4) and (6) that is incorporated as in [13, Section 2.2] influences the coercivity of \(a_{\varepsilon }\): The bilinear form for the SIPG and IIPG method is coercive if \(\sigma \) is sufficiently large, where \(\sigma \) is proportional to \(\varepsilon \), and for the NIPG method it is coercive for any \(\sigma >0\), e.g., see [5, Chapter 2.7.1]. Last but not least, the stabilization parameter \(\eta \ge 0\) appearing in (5) has to be chosen by the user. The choice \(\eta = 0\) corresponds to a centered flux and \(\eta = 1\) to an upwind flux discretization across the facet E, e.g., see [6, p. 55, p. 65]. It can be proven that DG methods converge asymptotically with an optimal rate in the DG norm, with an optimal rate in the \(L^2\)-norm only for the SIPG variant and suboptimally for the IIPG and NIPG method, e.g., see [5,6,7, 13] and the references therein.

2.2 Slope Limiters

Slope limiters are a cheap post-processing technique to reduce spurious oscillations in the discrete solution. After the solution \(u_h\) of (3) is computed, the solution is adapted as follows:

  1. 1.

    Identify and mark cells in which the function might show spurious oscillations by

    1. (a)

      computing (cell wise) a set of features of the solution and

    2. (b)

      based on these features deciding whether to mark a cell.

  2. 2.

    Approximate the solution locally on the marked cells by a polynomial of lower degree.

These steps can be translated into mathematics by introducing some mappings. Let \({\mathcal {F}}_l:V_{h,p}\times {\mathcal {T}}_h\rightarrow \mathbb {R}^{n_l}\) be a function that maps locally a discrete function to \(n_l\in \mathbb {N}\) features, and \({\mathcal {M}}_l:{\mathbb {R}}^{n_l}\rightarrow \{ 0,1 \}\) be a decision maker function. The post-processing techniques can then be seen as mappings \(l:V_{h,p}\rightarrow V_{h,p}\) defined cell wise on a cell \(K\in {\mathcal {T}}_h\) for \(u_h\in V_{h,p}\) by

$$\begin{aligned} l(u_h)\vert _{K}{:}{=}{\left\{ \begin{array}{ll} u_h\vert _{K}, &{}\text { if }{\mathcal {M}}_l({\mathcal {F}}_l(u_h, {K}))= 0,\\ \Pi _{l,K}(u_h), &{}\text { else}, \end{array}\right. } \end{aligned}$$

where \(\Pi _{l,K}:V_{h,p}|_K\rightarrow {\mathcal {R}}_p(K)\) reconstructs the solution in marked cells.

Different methods differ only in their respective functions \({\mathcal {F}}_l\), \({\mathcal {M}}_l\), \(\Pi _{l,K}\). Several of these methods are described in detail and were tested numerically in [14, 15] and they will be briefly recalled here. Since for what comes later only \({\mathcal {M}}_l\) and \({\mathcal {F}}_l\) are important, \(\Pi _{l,K}\) is only mentioned in passing. For the sake of presentation, the methods are described in two dimensions on triangles, but they can be easily extended to three dimensions or to quadrilateral/hexahedral meshes.

It is worth to emphasize that for all the methods presented below, the mappings \({\mathcal {F}}_l\) and \({\mathcal {M}}_l\) act locally, i.e., they are defined cell wise. Especially \({\mathcal {F}}_l\) can be computed using only information of the discrete solution on a cell itself and possibly its direct neighbors, and globally defined quantities like a tolerance or reference values. The mapping \({\mathcal {M}}_l\) then takes cell wise features and returns locally the decision whether to mark a cell or not.

Since the numerical studies consider only two-dimensional problems, the individual slope limiters will be presented, for simplicity, only to this situation. As already noted in [14], their extension to three dimensions is usually straightforward.

\(\underline{LinTriaReco }\)

This method was proposed in [18] and described again in [5, pp. 103–104] and [14].

Let \(K\in {\mathcal {T}}_h\) be a simplicial cell with facets \(E_i\in {\mathcal {E}}_h(K)\), \(i=0,1,2\), and neighbors \(K_i\in {\mathcal {T}}_h\) along these edges. Using the notation \(m_i\), \(i=0,1,2\), for the edge mid points and \({\overline{u}}_{h,K}{:}{=}\int _K u_h\,\textrm{d}{{\varvec{x}}} / \vert K\vert \) for the integral mean of \(u_h\) in K, the feature mapping of LinTriaReco is defined by

$$\begin{aligned} {\mathcal {F}}_\textrm{LTR}(u_h,K){:}{=}\{ {\overline{u}}_{h,K_0},\ u_h\vert _K(m_0),\ {\overline{u}}_{h,K_1},\ u_h\vert _K(m_1),\ {\overline{u}}_{h,K_2},\ u_h\vert _K(m_2),\ {\overline{u}}_{h,K},\ tol \}, \end{aligned}$$
(7)

where \(tol\in {\mathbb {R}}\), \(tol\ll 1\) is a fixed positive tolerance. Hence, the number of features \(n_\textrm{LTR}\) of LinTriaReco is 8.

Let \(\left\lfloor a,b \right\rfloor {:}{=}\min \{ a,b \}\) and \(\lceil a,b \rceil {:}{=}\max \{ a,b \}\) for \(a,b \in {\mathbb {R}}\). The decision maker \({\mathcal {M}}_\textrm{LTR}\) is given by

$$\begin{aligned} {\mathcal {M}}_\textrm{LTR}({\mathcal {F}}_\textrm{LTR}(u_h,K)){:}{=}{\left\{ \begin{array}{ll} 1, &{}\text { if }{\mathcal {E}}_h(K)\cap \partial {\mathcal {E}}_h=\emptyset \ \wedge \\ &{}\ \big (u_h\vert _K(m_0)\notin \left[ \left\lfloor {\overline{u}}_{h,K_0},{\overline{u}}_{h,K} \right\rfloor -tol, \lceil {\overline{u}}_{h,K_0},{\overline{u}}_{h,K} \rceil +tol\right] \ \vee \ \\ &{}\ u_h\vert _K(m_1)\notin \left[ \left\lfloor {\overline{u}}_{h,K_1},{\overline{u}}_{h,K} \right\rfloor -tol, \lceil {\overline{u}}_{h,K_1},{\overline{u}}_{h,K} \rceil +tol\right] \ \vee \ \\ &{}\ u_h\vert _K(m_2)\notin \left[ \left\lfloor {\overline{u}}_{h,K_2},{\overline{u}}_{h,K} \right\rfloor -tol, \lceil {\overline{u}}_{h,K_2},{\overline{u}}_{h,K} \rceil +tol\right] \big )\\ 0, &{}\text { else.} \end{array}\right. } \end{aligned}$$
(8)

The tolerance tol is introduced to prevent marking of cells due to numerical round-off errors. Hence roughly speaking, LinTriaReco marks an interior cell K if for at least one edge the value of the solution at the edge midpoint is not between the cell averages of the function in the cell and the corresponding neighbor.

For the reconstruction \(\Pi _\textrm{LTR,K}\), three affine functions are constructed based on the cell averages of the discrete solution in the cell and its neighbors of which one is chosen, e.g., see [5, 14, p. 104].

\(\underline{ConstTriaReco }\)

This method was proposed in [14] and is a modification of LinTriaReco. Instead of evaluating the function at the edge midpoint the integral mean \({\overline{u}}_{h,K}^{_E}{:}{=}\int _E u_h\vert _K\,\textrm{d}s/h_E\) is used. Furthermore, for boundary edges \(E\in {\mathcal {E}}_h\cap \partial {\mathcal {E}}_h\), i.e., edges along which the cell has no neighbor, a virtual neighbor is constructed by mirroring the opposite vertex along the edge E. Then, on this virtual neighbor the discrete function is defined to be the continuation of \(u_h\vert _K\) which exists and is well-defined since \(u_h\vert _K\) is a polynomial of degree at most p. In this way every triangle has three neighbors and a cell average in each neighbor can be computed.

The feature mapping is then given by

$$\begin{aligned} {\mathcal {F}}_\textrm{CTR}(u_h,K){:}{=}\{ {\overline{u}}_{h,K_0},\ {\overline{u}}_{h,K}^{_{E_0}},\ {\overline{u}}_{h,K_1},\ {\overline{u}}_{h,K}^{_{E_1}},\ {\overline{u}}_{h,K_2},\ {\overline{u}}_{h,K}^{_{E_2}},\ {\overline{u}}_{h,K},\ tol \}, \end{aligned}$$
(9)

and hence \(n_\textrm{CTR}=8\).

ConstTriaReco ’s decision maker is then analogously defined by

$$\begin{aligned} {\mathcal {M}}_\textrm{CTR}({\mathcal {F}}_\textrm{CTR}(u_h,K)){:}{=}{\left\{ \begin{array}{ll} 1, &{}\text { if } {\overline{u}}_{h,K}^{_{E_0}}\notin \left[ \left\lfloor {\overline{u}}_{h,K_0},{\overline{u}}_{h,K} \right\rfloor -tol, \lceil {\overline{u}}_{h,K_0},{\overline{u}}_{h,K} \rceil +tol\right] \ \vee \ \\ &{}\ {\overline{u}}_{h,K}^{_{E_1}}\notin \left[ \left\lfloor {\overline{u}}_{h,K_1}, {\overline{u}}_{h,K} \right\rfloor -tol,\lceil {\overline{u}}_{h,K_1},{\overline{u}}_{h,K} \rceil +tol\right] \ \vee \ \\ &{}\ {\overline{u}}_{h,K}^{_{E_2}}\notin \left[ \left\lfloor {\overline{u}}_{h,K_2}, {\overline{u}}_{h,K} \right\rfloor -tol,\lceil {\overline{u}}_{h,K_2},{\overline{u}}_{h,K} \rceil +tol\right] \\ 0, &{}\text { else.} \end{array}\right. } \end{aligned}$$
(10)

To reconstruct the solution, \(\Pi _{\textrm{CTR},K}(u_h){:}{=}{\overline{u}}_{h,K}\) is used, which led often to good results in the numerical studies of [14].

\(\underline{ConstJumpMod }\)

A different approach is taken by ConstJumpMod that was proposed in [14] and improved in [15]. Based on the marking criterion of [16, 17] ConstJumpMod tries to approximate the local order of convergence along each edge and marks a cell if this order is smaller than some reference value.

Let \(0 < C_0\in {\mathbb {R}}\) be a positive constant, L be a characteristic length scale of the problem and \(u_0\) a characteristic scale of the solution. For each edge \(E\in {\mathcal {E}}_h\) the quantity

$$\begin{aligned} \alpha _E{:}{=}{\left\{ \begin{array}{ll} \left. \ln \left( \frac{1}{C_0 L u_0^2} \int _{ E }^{ } \llbracket u_h \rrbracket ^2 :\,\textrm{d} s \right) \Bigg / \ln \left( \frac{h_E}{L} \right) \right. , &{}\text { if }E\in {\mathcal {E}}_h^\textrm{I},\\ \alpha _{\textrm{ref}}, &{}\text { else, } \end{array}\right. } \end{aligned}$$

can be computed, where \(\alpha _{\textrm{ref}}\in {\mathbb {R}}\) is a fixed positive reference value. These values are used to define the feature mapping that is given by

$$\begin{aligned} {\mathcal {F}}_\textrm{CJM}(u_h,K){:}{=}\{ \alpha _{E_0}, \alpha _{E_1}, \alpha _{E_2}, \alpha _{\textrm{ref}} \} \end{aligned}$$
(11)

and it follows that \(n_{\textrm{CJM}}=4\).

To mark a cell K, the decision maker

$$\begin{aligned} {\mathcal {M}}_\textrm{CJM}({\mathcal {F}}_\textrm{CJM}(u_h,K)){:}{=}{\left\{ \begin{array}{ll} 1, &{}\text { if }\min _{i=0,1,2}{\alpha _{E_i}} < \alpha _{\textrm{ref}},\\ 0, &{}\text { else, } \end{array}\right. } \end{aligned}$$
(12)

is used. Note, to prevent having infinite values in the feature set before computing \({\mathcal {M}}_\textrm{CJM}\), these values can be replaced by \(\alpha _\textrm{ref}\) without changing the result of \({\mathcal {M}}_\textrm{CJM}\), which might be beneficial for the implementation.

The solution in the marked cells is again replaced by the cell integral mean, i.e., \(\Pi _{\textrm{CJM},K}(u_h){:}{=}{\overline{u}}_{h,K}\).

\(\underline{ConstJumpNorm }\)

Based on the previous approach, the method ConstJumpNorm was introduced in [15] that depends on the mean \(L^\infty (E)\)-norm of the jump of the function \(u_h\). The \(L^1\)- and \(L^2\)-norms have been investigated as well but significant differences could not be observed [15]. If this jump is larger than a fixed positive reference value \(0<\beta _{\text {ref}}\in {\mathbb {R}}\) the cell is marked.

To be precise, for each edge \(E\in {\mathcal {E}}_h\) the quantity

$$\begin{aligned} \beta _E{:}{=}{\left\{ \begin{array}{ll} \left. \left\| \llbracket u_h \rrbracket \right\| _{L^\infty (E)} \right. , &{}\text { if }E\in {\mathcal {E}}_h^\textrm{I},\\ 0, &{}\text { else, } \end{array}\right. } \end{aligned}$$

can be defined. Based on this quantity, the feature mapping

$$\begin{aligned} {\mathcal {F}}_\textrm{CJN}(u_h,K){:}{=}\{ \beta _{E_0}, \beta _{E_1}, \beta _{E_2}, \beta _{\textrm{ref}} \} \end{aligned}$$
(13)

can be computed, so that \(n_{\textrm{CJN}}=4\).

The decision maker function of ConstJumpNorm is given by

$$\begin{aligned} {\mathcal {M}}_\textrm{CJN}({\mathcal {F}}_\textrm{CJN}(u_h,K)){:}{=}{\left\{ \begin{array}{ll} 1, &{}\text { if }\max _{i=0,1,2}{\beta _{E_i}} \ge \beta _{\textrm{ref}},\\ 0, &{}\text { else, } \end{array}\right. } \end{aligned}$$
(14)

and the solution is approximated by \(\Pi _{\textrm{CJN},K}(u_h){:}{=}{\overline{u}}_{h,K}\).

3 Deep Neural Networks as Spurious Oscillations Detector

Deep learning techniques such as deep (neural) networks are a subpart of machine learning which try to approximate a possibly unknown function by learning it from data [21, p. 1–8]. In the following, multilayer perceptrons (MLPs) also known as feed forward neural networks are briefly introduced; see also [21, 42, Chapter 6] for detailed information.

Mathematically speaking, MLPs can be seen as functions that map an input domain \({\mathcal {X}}\) to some output domain \({\mathcal {Y}}\) by composing a sequence of functions \(g_1\), \(g_2\), ..., \(g_\ell \), i.e.,

$$\begin{aligned} x \mapsto g_\ell (g_{\ell -1}(\ldots g_1(x))\ldots )\in {\mathcal {Y}} \qquad (x\in {\mathcal {X}}). \end{aligned}$$

Here each \(g_i\), \(i=1,2,\ldots ,\ell \), also called ith layer has the form \(g_i(\bullet )=\sigma _i(W_i\bullet +b_i)\), where \(W_i\) is a rectangular matrix called weight matrix, \(b_i\) is a vector called bias vector, \(\sigma _i\) is a component wise defined nonlinear function called activation function. The first layer is called input layer, the last layer is called output layer and the layers in between hidden layers. In other words, starting with x as value(s) of the input layer each following layer takes as input all the output of the previous layer, also called nodes, performs an affine transformation and applies component wise an activation function. MLPs can be therefore characterized or rather parameterized by their corresponding weights, biases and activation functions, which is why they are often referred to as parameters. Different activation functions can be used, but what they all have in common is that they are nonlinear, which is crucial to approximate nonlinear functions [21, p. 168]. Possible choices are the sigmoid function \(\sigma (x)=1/\left( 1+e^{-x}\right) \), the rectified linear unit (ReLU) function \(\sigma (x)=\max \{ 0,x \}\) or the hyperbolic tangent \(\sigma (x)=\tanh (x)= (e^{x}-e^{-x}) / (e^{x}+e^{-x})\). To reach the goal that a MLP approximates a given function, the parameters need to be chosen accordingly. They are chosen in an optimization process that is called training.

Let \(F:{\mathcal {X}}\rightarrow {\mathcal {Y}}\) be the function that will be approximated by a MLP. During the training the parameters are optimized to minimize a given loss function \({\mathcal {L}}\) over a given finite data set \({\mathcal {D}}\subset {\mathcal {X}}\times {\mathcal {Y}}\) which consists of pairs \((x_i,y_i)\in {\mathcal {X}}\times {\mathcal {Y}}\) of features \(x_i\) and labels \(y_i=F(x_i)\).Footnote 1

In this work, different approaches are investigated to approximate (combinations of) the decision maker functions \({\mathcal {M}}_{\textrm{LTR}}\), \({\mathcal {M}}_{\textrm{CTR}}\), \({\mathcal {M}}_{\textrm{CJM}}\), and \({\mathcal {M}}_{\textrm{CJN}}\) by MLPs, see below. The concrete choice \(F={\mathcal {M}}_{\textrm{LTR}}\), \({\mathcal {X}}={{\,\textrm{Im}\,}}({\mathcal {F}}_{\textrm{LTR}})\subset {{\mathbb {R}}^8}\), \({\mathcal {Y}}=\{ 0,1 \}\),

$$\begin{aligned} {\mathcal {L}}(D){:}{=}- \frac{1}{N} \sum _{ i=1 }^{ N } y_i\ln ({\hat{y}}_i)+(1-y_i)\ln (1-{\hat{y}}_i), \end{aligned}$$
(15)

where N is the number of training data in \({\mathcal {D}}\) and \({\hat{y}}_i\) is the prediction of the MLP, may serve as a simple example and is used in Sect. 4.2. This loss function is usually called binary cross entropy loss.

During the training the parameters p are updated by

$$\begin{aligned} p\rightarrow p-\eta \nabla _p{\mathcal {L}}(p), \end{aligned}$$

where \(0<\eta \in {\mathbb {R}}\) is a positive step width, also called learning rate, and \(\nabla _p\) denotes the partial derivatives with respect to the parameters. In this work the minibatch stochastic gradient descent [21, 42, Chapter 8.1.3] is used together with the Adam algorithm [43] to adapt the step width automatically.

3.1 Generating the Data Set

To enable the MLP to approximate a given function training data is needed. As stated above, decision maker functions are approximated that take as input a feature vector of the solution on a single cell and return either 1 (mark the cell) or 0 (do not mark a cell). To generate training data the following problem is fixed.

Example 1

Let \(\Omega = (0,1)^2\) be the unit square and \({{\varvec{b}}}=(\cos (-\pi /3), \sin (-\pi /3))^T\), \(c=f=0\). On the whole boundary Dirichlet boundary conditions are prescribed, i.e., \(\Gamma _\textrm{D}=\partial \Omega \), by choosing

$$\begin{aligned} g = {\left\{ \begin{array}{ll} 1 &{} (y=1\wedge x>0) \text{ or } (x=0 \wedge y>0.75),\\ 0 &{} \text{ else }. \end{array}\right. } \end{aligned}$$

This example is a modification of a classical benchmark problem stated in [36] in which the discontinuity point of the Dirichlet boundary conditions is chosen slightly different to match the requirements of applying a DG method on a uniform grid.

For small diffusion coefficients \(\varepsilon \), the solution possesses two boundary layers at the outflow boundary and an interior layer in the direction of the convection, see Fig. 1 for a sketch of the solution.

Fig. 1
figure 1

Sketch of the solution to Example 1 for \(\varepsilon = 10^{-8}\) obtained with a nonlinear algebraic flux-corrected (AFC) finite element method with Kuzmin limiter, see [44]

Fig. 2
figure 2

Initial meshes for Example 1. The grid on the left-hand side is referred to as regular and the grid on the right-hand side as irregular grid

To generate training data, the discrete problem (3) can be solved on a series of uniformly refined meshes starting with the initial meshes depicted in Fig. 2. On each level, the discrete solution is calculated and afterwards on each cell the features of LinTriaReco, ConstTriaReco, ConstJumpMod, ConstJumpNorm and the corresponding labels are computed and stored, see Eqs. (7)–(13). Since a data point for each cell is created, a lot of data can be generated easily since the number of cells scales quadratically when the grid is refined. Here it comes in handy that the decision maker functions act locally.

The data is generated using discontinuous piecewise linear finite elements \(P_1\) for the above mentioned problem with a diffusion coefficient \(\varepsilon =10^{-8}\). The SIPG discretization is chosen with upwind stabilization, i.e., \(\kappa =1\) and \(\eta =1\) in Eqs. (4) to (6). Let \(n_0\) be the number of vertices in each cell, i.e., \(n_0=3\) on the triangular grids depicted in Fig. 2. Guided by [5, Chapter 2.7.1], the penalty parameter \(\sigma =2 \varepsilon n_0 (p + 1) (p + 2) / 2 = 18\varepsilon \) is used. All simulations are performed with ParMooN [45, 46] and the direct solver UMFPACK [47] is used to solve the linear systems of equations. In LinTriaReco and ConstTriaReco the tolerance tol is set to \(10^{-11}\). For ConstJumpMod the parameters \(C_0=1\), \(L=1.5\) and \(u_0=1\), and \(\alpha _\textrm{ref}=4\) are chosen. Lastly, the reference value \(\beta _{\textrm{ref}}\) is set to be the arithmetic mean of all \(\beta _E\) in ConstJumpNorm.

3.1.1 Rotation Invariance of the Data

Unfortunately, the features above described and defined in Eqs. (7), (9), (11) and (13) depend on the numbering of the edges and hence, so does the data. To overcome this problem, or in other words to introduce some sort of rotation invariance, each data point in the data set is stored three times, one for each particular counter clockwise numbering of the edges.

3.1.2 Magnitude Invariance

In [27] the authors have decided to scale the features to introduce some form of magnitude invariance. In contrast to this, here the features are not scaled. The reason for this is that all decision maker functions essentially compare the magnitude of a feature with other features. If features are scaled feature-wise as in [27], the ratio of the magnitude of features can be changed. In this way inconsistent data can be created, i.e., the label does not fit to the data anymore. To prevent this situation, a scaling of the features is therefore not applied.

3.2 Restricting the Data Set

Following the above describe procedure a lot of data can be generated, e.g., refining the regular grid nine times and the irregular grid eight times leads to 4.456.437 data points. Unfortunately, a lot of duplicates exist in the data, e.g., due to the fact that the solution of the problem is piecewise constant in huge parts of the domain and hence, the features can be equal. This can be the case for an individual limiter but also for any combination of limiters, e.g., also for all limiters at the same time. Our approach consists in removing the duplicates to prevent the MLP from learning a pattern specific to the duplicates and to prevent overfitting to the duplicates and hence, ending up in a MLP that does not generalize well to unseen data. That is, either the duplicates in the data of a single limiter are removed if a single limiter is learned, or duplicates of the data of all limiters if all limiters are learned at the same time, see also the numerical examples below.

After having removed the duplicates it can further be noted that there are for each limiter individually significantly less marked cells than unmarked cells, e.g., for LinTriaReco after removing the duplicates there are around 77% cells that are not marked and 23% marked cells. When inspecting the whole data set it can be seen that cells that are not marked by any limiter are more common (ca. 93.6%) than cells that are marked at least by one limiter (ca. 6.4%). It is well known that care has to be taken when it comes to such unbalanced data sets [48, Chapter 11.2]. To have a better balance between marked cells and unmarked cells, resp., in the distribution of the label combinations, the data set is further restricted to have either as many marked cells as cells that are not marked in the case that a single limiter is approximated or the amount of the combination where no limiter marks a cell is reduced to equal the amount of the second most occurring label combination in the case all limiters are approximated at a single time. The rows that are removed are chosen randomly using a fixed random seed to guarantee reproducibility.

3.3 Splitting the Data Set

Even after deleting duplicates and decreasing the amount of cells that are not marked, resp., the amount of the most occurring label combination, a lot of training data remains, e.g., 379.539 when all limiters are learned at the same time, and 260.436 if only ConstJumpNorm is considered. On the one hand, the more data exists the more likely it is that the network approximates the function that lies behind the data, but on the other hand, more training data increases the optimization time when the network is trained. Hence, it is recommended to have less data of higher quality, i.e., showing relevant features of the function, than more data with lower quality. In this data set it might be difficult to choose “good” data points a priori but it might be still useful to choose only a subset of the data points for performance reasons. Therefore, a subset of only 7500 data points is randomly chosen to be the training data for the networks.

Furthermore, to prevent overfitting of the data, another 1875 are chosen to be the validation data set, see also [21, Chapter 5.3] for an introduction to overfitting and validation sets. During the training the validation set is evaluated as well to see how well the network generalizes to unseen data. At some point the networks might only fit better the training data but they become worse on the validation data set, which is why the optimization can be stopped at this point to prevent overfitting.

Last but not least, a third set is introduced with which the trained networks are rated how well they work. The so-called test set consists of the validation set and all remaining data. After the training has finished the networks are applied on the test set to measure the overall performance of the networks.

3.4 Measuring the Performance of the Networks

To measure the performance of the trained networks the measures

$$\begin{aligned} \text {accuracy} \qquad&\textrm{acc}{:}{=}\frac{t_p+t_n}{t_p+f_p+t_n+f_n},\\ \text {precision} \qquad&\textrm{prc}{:}{=}\frac{t_p}{t_p+f_p},\\ \text {recall} \qquad&\textrm{rec}{:}{=}\frac{t_p}{t_p+f_n},\\ \end{aligned}$$

are used, where \(t_p\) denotes the true positive, \(t_n\) the true negative, \(f_p\) the false positive and \(f_n\) the false negative classifications. The measure accuracy is the ratio of correct classified data to all data, i.e., it measures how good the networks performs overall. While the second measure gives information about the proportion of positive classifications that was actually correct, recall answers what proportion of actual positives was identified as such. Since for reducing spurious oscillations it is worse to not detect a cell that should be marked than to mark a cell that should not, recall might be considered more important than precision. Therefore, the total rating \(r_{\textrm{tot}}\) of the limiters is set to be a weighted combination of the measures, namely

$$\begin{aligned} r_{\textrm{tot}}{:}{=}\frac{2}{5} \textrm{acc} + \frac{1}{5} \textrm{prc} + \frac{2}{5} \textrm{rec}, \end{aligned}$$

where \(\textrm{acc}\), \(\textrm{prc}\) and \(\textrm{rec}\) are computed based on the test set.

4 Numerical Studies

For the implementation of the MLP networks the open source library TensorFlow is used [49, 50]. As stated above, the finite element computations are performed with ParMooN [45, 46] and CppFlow [51] is used to open and deploy stored TensorFlow models in ParMooN. Note that the data and most parts of the code that are used in this section are publicly available at www.doi.org/10.20347/40vd-f944 [40].

4.1 Architecture of the MLPs

Table 1 Hyperparameters that are tested resulting in 648 different combinations

Given a specific mapping that should be approximated by a MLP, it is in general almost impossible to come up a priori with the optimal architecture of the MLP, i.e., the number of layers, activation functions, number of nodes per layer. To find a suitable architecture, in this work, different architectures are tested. Six different combinations of number of hidden layers and number of nodes which corresponds to the number of columns in the weight matrices \(W_i\), two different activation functions, three different batch sizes, six different learning rates and three different initializations of the parameters are used which are coded by different seeds, resulting in 648 different architectures that are investigated, see Table 1. Each combination therefore can also be identified by a number between one and 648 which is done below. The size of the input and the output layer are determined by the task to solve. While for all hidden layers the same activation function is used, i.e., one of the functions given in Table 1, the activation function for the output layer depends on the experiment and is therefore stated in the experiments below. The parameters of the layers are initialized using the Glorot normalized initialization [52] with different seeds for each layer based on the seeds given in Table 1. Also the loss function \({\mathcal {L}}\) depends on the experiment and hence is given below.

The networks are trained for at most 10000 epochs and the training is stopped earlier, if the loss of the validation set does not decrease for 100 epochs. The model with the best accuracy is then saved as trained model.

4.2 Learning Single Limiters

The first experiment figures out whether the individual feature mappings (8), (10), (12), and (14) can by approximated by a MLP. Since for all functions \({\mathcal {Y}}=\{ 0,1 \}\), the output layer consists of a single node and uses the sigmoid activation function. The size of the input layer is defined by the input of the decision maker functions, i.e., eight for LinTriaReco and ConstTriaReco, and four for ConstJumpMod and ConstJumpNorm. As loss the binary cross entropy loss \({\mathcal {L}}(D)\) given in (15) is applied. The data is prepared as described in Sects. 3.1 to 3.3 and the measures defined in Sect. 3.4 are used to measure the performance of the networks.

The total rating \(r_{\textrm{tot}}\) for the networks for each limiter is plotted in Fig. 3, where MLP (lim) denotes the MLP networks that approximate the decision maker function of lim. In general it can be seen that the results of MLP (LinTriaReco) and MLP (ConstTriaReco) look similar as well as the results of MLP (ConstJumpMod) and MLP (ConstJumpNorm). On the one hand, all architectures are able to approximate ConstJumpMod very well and ConstJumpNorm can be approximated by almost all architectures. On the other hand, LinTriaReco and ConstTriaReco cannot be approximated that well with the investigated architectures. The best total rating for MLP (LinTriaReco) and MLP (ConstTriaReco) is still around 0.253 worse than the mean of MLP (ConstJumpMod) and MLP (ConstJumpNorm), see also Table 2. As indicated by the standard deviation, the quality of the approximation of MLP (LinTriaReco) and MLP (ConstTriaReco) depends more on the choice of the architecture than of MLP (ConstJumpNorm), which in turn is more dependent than the approximation of MLP (ConstJumpMod). It can further be noted that in Fig. 3 there is a pattern indicating which architectures work worse for MLP (LinTriaReco) and MLP (ConstTriaReco). The Pearson correlation coefficients between the hyperparameters and the ratings of MLP (LinTriaReco) and MLP (ConstTriaReco) are given in Table 3. The correlation coefficients for MLP (ConstJumpMod) and MLP (ConstJumpNorm) are not investigated since almost all architectures lead to good results. It can be seen that the learning rate and the activation function have the largest impact. From the obtained results it can be deduced that the learning rate should not be chosen too small and after looking into the data it can be observed that the ReLU activation function works better than \(\tanh \). This might be a reason for the pattern in Fig. 3.

Table 2 Statistics about the total rating of the trained networks of Sect. 4.2
Fig. 3
figure 3

Rating for all architectures for all limiters for Sect. 4.2

Table 3 Pearson correlation coefficients between the hyperparameters and the total ratings from Sect. 4.2

4.3 Overcoming the Difficulties When Learning LinTriaReco and ConstTriaReco

As seen in the previous experiment, the decision maker functions of LinTriaReco and ConstTriaReco could not be approximated well and ConstJumpMod and ConstJumpNorm could be approximated by the chosen architectures if only the features of the respective limiter are used. This experiment investigates if enriching the feature set enables the architectures to predict the outcome of the decision maker function of LinTriaReco. The hope is that there is an implicit dependency between this enriched feature space and the outcome of LinTriaReco and that the MLPs can approximate this mapping better than \({\mathcal {M}}_{\textrm{LTR}}\) itself. To this end, the output layer consists again of a single node and the sigmoid activation function is used. The idea in this experiment is to use all features of LinTriaReco, ConstTriaReco, ConstJumpMod, and ConstJumpNorm. Hence, the input layer, in contrast to the previous experiment, is now larger and consists of \(n_{\textrm{LTR}}+n_{\textrm{CTR}}+n_{\textrm{CJM}}+n_{\textrm{CJN}}=24\) nodes. As a consequence, this experiment allows us to investigate whether there is a hidden dependency between the features of all these limiters and the label of LinTriaReco. Again the binary cross entropy loss (15) is applied to train the networks. The data is loaded, restricted and split as before and the measure \(r_{\textrm{tot}}\) is used to rate the trained MLPs.

Figure 4 shows the result for the tested architectures. All tested architectures have a similar good rating, i.e., it seems that there is a mapping from all features to the label of LinTriaReco that can be approximated with the used architectures. The best rating (0.979) is slightly worse than the best results for ConstJumpMod (0.999) and ConstJumpNorm (0.997) of the previous experiment but could increase the rating of LinTriaReco by around 0.235. Also all architectures are stable in the sense of producing similar good results as shown by the mean that is close to the best rating and the small standard deviation, see Table 4. Pearson correlation coefficients are not shown since all architectures are stable.

Altogether, the results of this and the former experiment seem to indicate that the features of LinTriaReco and ConstTriaReco might not be suited best for deciding whether to mark a cell or not. In contrast to this, the features of ConstJumpMod and ConstJumpNorm seem to provide better information and hence are more suited, since also the label of LinTriaReco can be approximated if these features are part of the input to the MLPs.

Table 4 Statistics about the total rating of the trained networks of Sect. 4.3
Fig. 4
figure 4

Results for all architectures from Sect. 4.3

4.4 Learning All Limiters Simultaneously Based on Vectors

The previous experiment raises the questions whether it is possible to learn all decision maker functions at once. The idea is to train a network that approximates the map from all features to all labels simultaneously, i.e., the size of the input layer is \(n_{\textrm{LTR}}+n_{\textrm{CTR}}+n_{\textrm{CJM}}+n_{\textrm{CJN}}=24\) and the output size is four, since we want to approximate four decision maker functions at once. In this sense the problem is a multi-label classification task. As in Sect. 4.2, the activation function of the output layer is set to be the sigmoid function such that the output of the network is in \([0,1]^4\). This means that by construction the networks return a vector of four predicted labels at once. Furthermore, the loss

$$\begin{aligned} {\mathcal {L}}(D) {:}{=}\frac{1}{4} \sum _{ j=1 }^{ 4}\left( - \frac{1}{N} \sum _{ i=1 }^{ N } y_{i,j}\ln ({\hat{y}}_{i,j})+(1-y_{i,j})\ln (1-{\hat{y}}_{i,j})\right) \end{aligned}$$

is used, where N is again the number of training data in \({\mathcal {D}}\), \(y_{i,j}\) is the jth component of the ith training data point and \({\hat{y}}_{i,j}\) is the jth component of the prediction \({\hat{y}}_i\) of the MLP. Comparing with Eq. (15), this loss is the average of the binary cross entropy loss of the four decision maker functions that are learned at once. The data is prepared following the procedure described in Sects. 3.1 to 3.3 and the total measure \(r_{\textrm{tot}}\) from Sect. 3.4 is used to rate the trained MLPs. The measures are evaluated element-wise and not vector-wise for the output of the MLPs.

Table 5 Statistics about the total rating of the trained networks of Sect. 4.4. Standard deviation is abbreviated by std

The results for all configurations are depicted in Fig. 5. It can be seen that almost all configurations are able to approximate all decision maker functions at once. As indicated also by Table 5 the best network achieves a total rating of around 0.95 which is slightly lower than the best rating in Sect. 4.2 but can still considered to be a good result. Both the mean of 0.938 that is close to the maximal total rating and the small standard deviation (0.007) indicate that there are only few configurations that work worse than the best one. For the sake of brevity and due to the small standard deviation, Pearson correlation coefficients are not shown. However, the learning rate has the largest impact with a coefficient of \(-\,0.116\), supporting the result from Sect. 4.2.

Fig. 5
figure 5

Results for all architectures from Sect. 4.4

4.5 Learning All Limiters Simultaneously Based on Classes

The multi-label problem of the previous section can be transformed to a multi-class problem. After preparing the data as before, every unique label combination is assigned to a unique number, e.g.,

$$\begin{aligned}{}[0, 0, 0, 0] \mapsto 0,{} & {} [0, 0, 0, 1] \mapsto 1,{} & {} [0, 0, 1, 0] \mapsto 2 \end{aligned}$$

and so on. This number j is then assigned to a probability vector, i.e., the components are non-negative and sum to 1, by mapping j to the vector \([0,\ldots ,0, 1,0,\ldots , 0]\) that has the 1 at the jth component. Every entry in this vector gives the probability that the input belongs to the respective class. The input layer size is again 24 and the output layer size is the number of classes which are in this experiment 12 since not all possible label combinations occur in the dataset. After splitting the data into training, validation, and test set it can be observed that not all label combinations occur in the training set since some combinations are very rare. The activation function of the output layer is chosen to be the softmax function \(\sigma (x)_i = \exp (x_i) / \sum _{ j=1 }^{ 12 } \exp (x_j)\), \(i=1,\ldots ,12\), such that the output is a probability vector. Hence, in contrast to the previous section, the output gives probabilities that the input belongs to the possible label combinations and does not return a specific combination. The usual loss for multi-class problems is used, namely the categorical cross entropy

$$\begin{aligned} {\mathcal {L}}(D) {:}{=}-\sum _{ i=1 }^{ N } \sum _{ j=1 }^{ 12 } y_{i,j}\ln ({\hat{y}}_{i,j}), \end{aligned}$$

where N is the number of training data points, \(y_{i,j}\) is the jth component of the ith training data vector and \({\hat{y}}_{i,j}\) the prediction of the network. Note that \(y_{i,j}\) is 0 for all except one entry where it is 1. To measure the performance of the networks the outputs of the networks are mapped back to label vectors by multiplying the probability of the classes with the corresponding label vectors of the classes and summing up the results. In other words, a weighted sum of all label vectors is computed where the weights correspond to the predicted probabilities. This procedure gives a vector of four labels that can be compared with the corresponding true labels in the test set, to identify the true positives, true negatives, false positives and false negatives. Afterwards the measures given in Sect. 3.4 can be computed.

In Fig. 6 the total rating of the trained networks is plotted. The results are similar to the results of Sect. 4.4 as also indicated by the values in Table 6. The best and the mean are negligibly smaller than the values obtained in the previous experiment. Hence, it does not matter whether to deal with the problem as a multi-label or a multi-class problem, at least in this particular setting with the used training set and the measures.

Fig. 6
figure 6

Results for all architectures from Sect. 4.5

Table 6 Statistics about the total rating of the trained networks of Sect. 4.5

4.6 Applying a MLP Limiter to Higher Polynomial Degrees

Until here the experiments have investigated how good the MLPs can approximate the data but they have not been applied to the DG solution of (other) convection-dominated convection–diffusion equations. To this end, the MLP from Sect. 4.4 with the best total rating is used, which has four hidden layers of 100 nodes and the hyperbolic tangent as activation function, and is trained with a learning rate of 0.0005, a batch size of 128, and is initialized with seed 42. After the DG solution of a convection–diffusion problem is solved, all features of the conventional limiters are calculated and the MLP is asked to predict the label combination given these features. A cell is finally marked if at least \(n\in [1,2,3,4]\) of the four predicted labels are true, i.e., larger than 0.5. If a cell is marked then the solution is locally replaced by its integral mean, i.e., \(\Pi _{\textrm{MLP},K}(u_h){:}{=}{\overline{u}}_{h,K}\) since this choice has produced the best results in [14] and [15]. This limiter is below called MLP limiter.

4.6.1 Determining the Minimum Number of Predicted Marks n

Since it is not a priori clear which value of n to choose, this is determined in a first step. The smaller n, the more cells are marked, which on the one hand hopefully leads to less spurious oscillations but on the other hand, marking too many cells might reduce the order of accuracy and leads to unnecessary computational overhead. Therefore, n should be chosen in such a way that enough but not too many cells are marked. To find the optimal n, Example 1 is used with exactly the same setting as in Sect. 3.1, i.e., the setting with which the data is created. Since the limiter is trained with this data it can be expected that it predicts the labels of the traditional limiter correctly in most of the cases. Since for Example 1 an analytical solution is not known, the discrete solution \(u_h\) cannot be compared against the exact solution. As in [14, 15], to assess the quality of the limited discrete solution therefore the measures

$$\begin{aligned} \textrm{osc}_{\textrm{max}}(u_h)= & {} \max _{(x,y)\in {{\overline{\Omega }}}} u_h(x,y)- u_{\textrm{max}} + u_{\textrm{min}}- \min _{(x,y)\in {{\overline{\Omega }}}} u_h(x,y), \\ \textrm{osc}_{\textrm{mean}}(u_h)= & {} \frac{1}{\vert {\mathcal {T}}_h\vert } \sum _{K\in {\mathcal {T}}_h} \bigg [\max \{0, \max _{(x,y)\in K}u_h(x,y)-u_{\textrm{max}}\}\\{} & {} + \max \{0,u_{\textrm{min}} -\min _{(x,y)\in K}u_h(x,y)\}\bigg ], \end{aligned}$$

are used to measure the maximal size and a mean value of spurious oscillations, where \(u_{\textrm{max}}\) and \(u_{\textrm{min}}\) are the largest and smallest value of the weak solution, resp., and \(\vert {\mathcal {T}}_h\vert \) denotes the number of cells in the triangulation. In Example 1 it is \(u_{\textrm{min}}=0\) and \(u_{\textrm{max}}=1\). To compute the desired quantities, \(u_h\) is evaluated at certain points, which are the points of the nodal functionals defining continuous \(P_p\) finite elements of the same order.

The results of the MLP limiter with \(n=1,2,3,4\) on both the regular and the irregular grid are shown in Fig. 7 and compared with ConstJumpMod and ConstJumpNorm, which are the classical limiters that work best for this problem [15].

Fig. 7
figure 7

Results of \(\textrm{osc}_\textrm{mean}\) and \(\textrm{osc}_\textrm{max}\) for Example 1 on the regular and irregular grid depicted in Fig. 2 for \(P_1\) finite elements with two classical limiters and various versions of the MLP limiter

It can be seen that, on the one hand, the MLP limiter with \(n=1\) behaves similarly to the one with \(n=2\), and, on the other hand, the limiters where \(n=3\) and \(n=4\) show almost no difference. While the former ones are as good as the classical ConstJumpMod and ConstJumpNorm limiters, the latter ones behave much worse, meaning they lead to larger mean and maximal oscillations. As a consequence, in what follows, the MLP limiter with \(n=2\) is used since it produces better results than the ones with \(n=3,4\) and should by definition mark less or the same amount of cells than the one with \(n=1\). Of course, it is not guaranteed that \(n=2\) is the optimal choice also in other scenarios, however this experiment indicates that it is a reasonable choice and the experiments below confirm the choice.

Fig. 8
figure 8

Results for measures for various limiters and various polynomial degrees obtained for Example 1 on the regular grid from Fig. 2

Fig. 9
figure 9

Results for measures for various limiters and various polynomial degrees obtained for Example 1 on the irregular grid from Fig. 2

4.6.2 Higher Polynomial Degrees

Since the MLP limiter is fixed it can be applied to other problems. To start, again Example 1 is used but the limiter is applied to the discrete solution obtained with finite elements with higher polynomial degrees, namely \(P_2\), \(P_3\), and \(P_4\) finite elements. The rest of the problem is not varied, i.e., \(\varepsilon =10^{-8}\), \(\kappa =1\), \(\eta =1\). As in Sect. 3.1 the penalty parameter is chosen to be \(\sigma = 2\varepsilon n_0 (p+1)(p+2)/2\), where again \(n_0\) denotes the number of vertices a cell has. Also the parameters used in the classical limiters are kept the same. The problems are solved on the series of uniformly refined grids starting as above with the initial meshes depicted in Fig. 2. In what follows Galerkin denotes the DG solution from Eq. (3) without being limited.

The results for the measures \(\textrm{osc}_{\textrm{mean}}\) and \(\textrm{osc}_{\textrm{max}}\) for the best conventional limiter as well as the MLP limiter on both types of meshes are shown in Figs. 8 and 9. It can be seen that the MLP limiter reduces both the mean and the maximal oscillations significantly compared to Galerkin. While on coarser grids it acts worse than ConstJumpNorm but better than ConstJumpMod, on finer grids all limiters almost coincide. A reason for this could be the fact that way more training data obtained on finer grids is available compared to data from coarser grids, since the number of available data scales exponentially with the number of the refinement.

4.7 Applying a MLP Limiter to the Hemker Problem

Finally, in this section the MLP limiter is applied to a different example, namely the Hemker benchmark problem. It was proposed in [37] and it is a very popular benchmark problem for convection-dominated convection–diffusion equations. It models the transport of energy from a body through a channel and shows many features of problems that are also relevant in applications. The structure of the solution is similar to the solution of the HMM example, e.g., it is constant in most regions, and hence there is hope that the MLP limiter is able to limit the solution in a reasonable way.

Example 2

(Hemker example) The problem is stated in \( \Omega = \{(-3,9)\times (-3,3)\} {\setminus } \{(x,y) : \ x^2+y^2\le 1 \}, \) and has the coefficients \({{\varvec{b}}}= (1,0)^T\), \(c=f=0\). If \(x=-3\) and at the circular boundary, Dirichlet boundary conditions are prescribed by setting

$$\begin{aligned} g=0 \text { if } x=-3,{} & {} g=1\text { at the circle.} \end{aligned}$$

Everywhere else homogeneous Neumann conditions are applied. The solution is sketched in Fig. 10 and takes values in [0, 1].

Fig. 10
figure 10

Sketch of the solution to Example 2 for \(\varepsilon = 10^{-8}\) obtained with a nonlinear algebraic flux-corrected (AFC) finite element method with Kuzmin limiter, see [44]

Fig. 11
figure 11

Initial mesh for Example 2 used in Sect. 4.7

As before, the diffusion coefficient is set to \(\varepsilon =10^{-8}\) and \(\kappa =1\), \(\eta =1\), and \(\sigma =\varepsilon n_0 (p + 1) (p + 2)\) are used as parameters in the DG method. The problem is solved on a series of grids starting from the one depicted in Fig. 11. The characteristic length scale for this problem is \(L=13.5\) and the remaining parameters of the limiter stay the same.

In Fig. 12 the results for both measures for the limited solution and the original solution for various polynomial degrees are shown. As before ConstJumpMod, ConstJumpNorm, and the MLP limiter are able to reduce the oscillations significantly compared to Galerkin. For \(P_1\) and \(P_2\) the MLP limiter is slightly worse than the traditional ones on coarser grids but on finer grids it behaves equally well. For \(P_3\) and \(P_4\) it is worse than the classical companions, except for the finest grid and except for \(P_4\) for \(\textrm{osc}_{\textrm{max}}\) on one coarser level. We observed that the MLP limiter is always better than LinTriaReco in both measures, better than ConstJumpNorm for \(\textrm{osc}_{\textrm{max}}\) for all polynomial degree, and for \(\textrm{osc}_{\textrm{mean}}\) for \(P_1\) and for \(P_2\) to \(P_4\) on the two finer levels, but these results are not presented for the sake of brevity. A visualization of the limited \(P_4\) solution with ConstJumpNorm and MLP on the second finest grid is shown in Fig. 13. It can be observed that the MLP limiter limits most of the cells correctly but forgets to mark some cells with undershoots. This is also is the reason why it has worse \(\textrm{osc}_{\textrm{mean}}\) and \(\textrm{osc}_{\textrm{max}}\) values compared to ConstJumpNorm.

Fig. 12
figure 12

Results for measures for various limiters and various polynomial degrees obtained for Example 2

Fig. 13
figure 13

Discrete \(P_4\) solution to Example 2 limited by the ConstJumpNorm limiter (left) and the MLP limiter (right)

5 Summary and Outlook

This paper is a contribution to deeper understanding how neural network based slope limiters can be created and applied. In contrast to previous papers, it was focused on constructing a multilayer perceptron model for limiting the discrete solution of an elliptic problem, namely convection–diffusion equations in the convection-dominated regime. It was shown how data from a lowest order discretization can be used to train a limiter that then can be applied to the discrete solution of higher order methods. The results have indicated that the limiter works almost equally well as classical methods for higher order methods for the same problem but somewhat worse than these methods when applied to the solution of a different problem.

These results are also in agreement with the findings of previous works that treat hyperbolic problems. In [24] the authors report that their MLP-based limiter works for some problems equally well and for some better than the classical reference approaches. The limiter constructed in [27] behaves equally well or slightly worse than traditional limiters. The authors furthermore discuss the lack of theoretical guarantees which indeed also cannot be provided in the present work. Finally, in [28] it is reported that the MLP-based limiter produces high-quality results even though it is not compared to traditional limiters.

In our opinion, there are four main conclusions to be drawn from the presented studies. First, it could be shown that it is possible to construct MLP-based limiters also for elliptic problems. Second, care has to be taken which features are chosen, since it was observed that features based on jumps are better suited than others. The other two conclusions are of importance for our future work and to reach such insight was actually a main motivation for performing the presented studies. First, it can be concluded that it is possible to classify mesh cells corresponding to subgrid scales and large scales on the basis of MLPs for the class of problems we are interested in. And finally, the concrete architecture of the MLP played only a minor role as long as no extreme values for the hyperparameters were chosen. Hence, the results were robust with respect to the choice of the MLP and pursuing a sophisticated approach for finding an appropriate MLP is not necessary.

As already mentioned in the introduction, we consider the presented study as a proof of concept to show that it is possible to construct MLPs which are capable of appropriately detecting subregions with subgrid scales for numerical solutions of convection–diffusion–reaction equations. Based on their prediction, spurious oscillations in the discrete solution could be reduced significantly in this study. We plan to use the obtained insights for developing MLP-based algorithms for choosing local parameters in stabilized discretizations of convection–diffusion–reaction problems. Among others, this process requires a classification of mesh cells into cells in subgrid scale regions and away of these regions. Since the choice of user-defined parameters is an issue in many numerical methods for many problems, a medium-range goal consists of extending successful MLP-based techniques to other kinds of boundary value or initial-boundary value problems.