1 Introduction

1.1 The primitive equations

The study of global weather prediction and climate dynamics depends largely on the dynamics of the atmosphere and oceans. Ocean currents transport warm water from low latitudes to higher latitudes, where the heat can be released into the atmosphere to balance the earth’s temperature. A widely accepted model describing the motion and state of the atmosphere and ocean is the Boussinesq system, a combination of the Navier–Stokes equations (NSE) with rotation and a heat (or salinity) transport equation. As a result of the extraordinary organization and complexity of the flow in the atmosphere and ocean, the full governing equations appear to be difficult to analyze, at least for the foreseeable future. In particular, the global existence and uniqueness of smooth solutions to the 3D NSE is one of the most challenging mathematical problems.

Fortunately, when studying oceanic and atmospheric dynamics at the planetary scale, the vertical scale (a few kilometers for the ocean, 10–20 km for the atmosphere) is much smaller than the horizontal scale (many thousands of kilometers). Accordingly, the large-scale ocean and atmosphere satisfy the hydrostatic balance, as supported by scale analysis, meteorological observations, and historical data. By virtue of this, the primitive equations (PEs, also called the hydrostatic Navier–Stokes equations) are derived from the Boussinesq system as the asymptotic limit of the small aspect ratio between the vertical and horizontal length scales [2, 29, 58, 59]. Because of their impressive accuracy, the following 3D viscous PEs are a widely used model in geophysics (see, e.g., [5, 32, 33, 38, 41, 55, 68, 72] and references therein):

$$\begin{aligned}&\partial _t V + V\cdot \nabla _h V + w\partial _z V -\nu _h \Delta _h V - \nu _z \partial _{zz} V +f_0 V^\perp + \nabla _h p = 0 , \end{aligned}$$
(1.1a)
$$\begin{aligned}&\partial _z p + T= 0, \end{aligned}$$
(1.1b)
$$\begin{aligned}&\nabla _h \cdot V + \partial _z w =0, \end{aligned}$$
(1.1c)
$$\begin{aligned}&\partial _t T + V\cdot \nabla _h T + w\partial _z T -\kappa _h \Delta _h T - \kappa _z \partial _{zz} T = Q . \end{aligned}$$
(1.1d)

Here the horizontal velocity \(V = (u, v)\), the vertical velocity w, the pressure p, and the temperature T are the unknown quantities, which are functions of the time and space variables (t, x, y, z). The 2D horizontal gradient and Laplacian are denoted by \(\nabla _h = (\partial _{x}, \partial _{y})\) and \(\Delta _h = \partial _{xx} + \partial _{yy}\), respectively. The nonnegative constants \(\nu _h, \nu _z, \kappa _h\) and \(\kappa _z\) are the horizontal viscosity, the vertical viscosity, the horizontal diffusivity and the vertical diffusivity coefficients, respectively. The parameter \(f_0 \in \mathbb {R}\) stands for the Coriolis parameter, Q is a given heat source, and the notation \(V^\perp = (-v,u)\) is used.

According to whether the system has horizontal or vertical viscosity, there are mainly four different models considered in the literature (some works also consider the anisotropic diffusivity).

Case 1:

PEs with full viscosity, i.e., \(\nu _h>0, \nu _z>0\): The global well-posedness of strong solutions in Sobolev spaces was first established in [10], and later in [49]; see also the subsequent article [52] for different boundary conditions, as well as [39] for some progress towards relaxing the smoothness requirements on the initial data by using the semigroup method.

Case 2:

PEs with only horizontal viscosity, i.e., \(\nu _h>0, \nu _z=0\): [12,13,14] consider horizontally viscous PEs with anisotropic diffusivity and establish global well-posedness.

Case 3:

PEs with only vertical viscosity, i.e., \(\nu _h=0, \nu _z>0\): Without the horizontal viscosity, the PEs are shown to be ill-posed in Sobolev spaces [71]. In order to obtain well-posedness, one can consider some additional weak dissipation [15], or assume that the initial data have Gevrey regularity and are convex [30], or are analytic in the horizontal direction [60, 66]. It is worth mentioning that whether smooth solutions exist globally or form singularities in finite time is still open.

Case 4:

Inviscid PEs, i.e., \(\nu _h=0, \nu _z=0\): The inviscid PEs are ill-posed in Sobolev spaces [37, 44, 71]. Moreover, smooth solutions of the inviscid PEs can form singularities in finite time [11, 19, 44, 81]. On the other hand, with either some special structure (the local Rayleigh condition) on the initial data in 2D, or real analyticity in all directions for general initial data in both 2D and 3D, local well-posedness can be achieved [7, 8, 31, 36, 53, 54, 63].

Other works consider stochastic PEs, that is, system (1.1) with additional random external forcing terms (usually characterized by generalized Wiener processes on a Hilbert space). For the existence and uniqueness of solutions to those systems, see [9, 22, 23, 34, 35, 40, 42, 43, 73, 77].

In this paper, we focus on Case 1 and Case 2, in which the well-posedness is established in Sobolev spaces. Case 1 is also assumed to have full diffusivity, while Case 2 is considered to have only horizontal diffusivity. The analysis of Case 3 and Case 4 requires rather different techniques, as those models are ill-posed in Sobolev spaces for general initial data, and is left for future work.

System (1.1) has been studied under some proper boundary conditions. For example, as introduced in [12, 31], we consider the domain to be \({\mathcal {D}}:= \mathcal M \times (0,1)\) with \(\mathcal M:= (0,1)\times (0,1)\) and

$$\begin{aligned} \begin{aligned}&V, w, p, T \text { are periodic in } (x,y,z) \text { with period } 1, \\&V \text { and } p \text { are even in } z, \text { and } w \text { and } T \text { are odd in } z. \end{aligned} \end{aligned}$$
(1.2)

Note that the space of periodic functions with such a symmetry condition is invariant under the dynamics of system (1.1), provided that Q is periodic in (x, y, z) and odd in z. When system (1.1) is considered in 2D space, the system is independent of the y variable. In addition to the boundary condition, one needs to impose the initial condition

$$\begin{aligned} (V,T)|_{t=0} = (V_0,T_0). \end{aligned}$$
(1.3)

We point out that there is no initial condition for w since w is a diagnostic variable and it can be written in terms of V (see (2.1)). This is different from the Navier–Stokes equations and Boussinesq system.

1.2 PINNs

Due to the nonlinearity and nonlocality of many PDEs (including the PEs), their numerical study is in general a hard task. A non-exhaustive list of numerical studies of the PEs includes [6, 16,17,18, 50, 51, 61, 67, 74, 75, 78] and references therein. In the past few years, deep neural networks have emerged as a promising alternative, but they require abundant data that cannot always be found in scientific research. Instead, such networks can be trained from additional information obtained by enforcing the physical laws. Physics-informed machine learning seamlessly integrates data and mathematical physics models, even in partially understood, uncertain, and high-dimensional contexts [47].

Recently, physics-informed neural networks (PINNs) have been shown to be an efficient tool in scientific computing and in solving challenging PDEs. PINNs, which approximate solutions to PDEs by training neural networks to minimize the residuals coming from the initial conditions, the boundary conditions, and the PDE operators, have gained a lot of attention and have been studied intensively. The study of PINNs can be traced back to the 1990s [28, 56, 57]. Very recently, [69, 70] introduced and illustrated the PINNs approach for solving nonlinear PDEs, which can handle both forward and inverse problems. For a much more complete list of recent advances in the study of PINNs, we refer the readers to [20, 47] and references therein. We also remark that the deep Galerkin method [76] shares a similar spirit with PINNs.

In addition to investigating the efficiency and accuracy of PINNs in solving PDEs numerically, researchers are also interested in rigorously establishing error estimates. In a series of works [24, 25, 64, 65], the authors studied the error analysis of PINNs for approximating several different types of PDEs. It is worth mentioning that, recently, generic bounds for PINNs were established in [26]. We remark that our work is devoted to establishing higher-order error estimates based on the higher-order regularity of the solutions to the PEs. The analysis requires nontrivial efforts and techniques due to the special characteristics of the PEs, and these results do not follow trivially from [26].

To set up the PINNs framework for our problem, we first review the PINN algorithm [69, 70] for a generic PDE with initial and boundary conditions: for \(x\in {\mathcal {D}}, y\in \partial {\mathcal {D}}, t\in [0,{\mathcal {T}}]\), the solution u satisfies

$$\begin{aligned} \text {PDE operator: }&{\mathcal {D}}[u](x,t)=0,\\ \text {Initial condition: }&u(x,0)=\phi (x),\\ \text {Boundary condition: }&\mathcal Bu(y,t) = \psi (y,t). \end{aligned}$$

The goal is to seek a neural network \(u_\theta \) where \(\theta \) represents all network parameters (see Definition 2.1 for details) so that

$$\begin{aligned} \text {PDE residual: }&\mathcal R_i[\theta ](x,t) = {\mathcal {D}}[u_\theta ](x,t),\\ \text {Initial residual: }&\mathcal R_t[\theta ](x,t) = u_\theta (x,0)-\phi (x),\\ \text {Boundary residual: }&\mathcal R_b[\theta ] = \mathcal Bu_\theta (y,t) - \psi (y,t), \end{aligned}$$

are all small. Based on these residuals, for \(s\in \mathbb N\) we define the generalization error for \(u_\theta \):

$$\begin{aligned} \mathcal {E}_G[s;\theta ]^2 = \int _0^{{\mathcal {T}}} (\Vert \mathcal R_i\Vert _{H^s( {\mathcal {D}})}^2 + \Vert \mathcal R_b\Vert _{H^s(\partial {\mathcal {D}})}^2) dt + \Vert \mathcal R_t\Vert _{H^s({\mathcal {D}})}^2. \end{aligned}$$
(1.4)

Notice that when \(u_\theta =u\) is the exact solution, all the residuals are zero and thus \(\mathcal {E}_G[s;\theta ]=0\). In practice, one uses numerical quadrature to approximate the integral appearing in \(\mathcal {E}_G[s;\theta ]\). We call the corresponding numerical quadrature approximation the training error \(\mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]\) (see Sect. 4.3 for details), which is also the loss function used during training. The terminology “physics-informed neural networks” is used in the sense that the physical laws coming from the PDE operator and the initial and boundary conditions lead to the residuals, which in turn give the generalization error and the training error (loss function). Finally, the neural network minimizes the loss function during training to obtain the approximation of the PDE solution. Note that in the literature, the analysis for PINN algorithms exists only for \(s = 1\).
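To fix ideas, the following minimal sketch (ours, written in PyTorch; it is not the implementation used for the experiments in Sect. 5) assembles a Monte Carlo analogue of the training error for \(s=1\), with the 1D heat equation \(u_t - u_{xx}=0\) under periodic boundary conditions standing in for the generic operator \({\mathcal {D}}[u]\) and \(\phi (x)=\sin (\pi x)\) as illustrative initial data. All names and hyperparameters are assumptions made for illustration only.

```python
import math
import torch

torch.manual_seed(0)

# u_theta: a small tanh network taking (x, t) -> u
net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def grad(f, v):
    # derivative of the batched scalar output f with respect to the input batch v
    return torch.autograd.grad(f, v, torch.ones_like(f), create_graph=True)[0]

def pde_residual(x, t):
    # interior residual R_i = u_t - u_xx for the stand-in heat equation
    u = net(torch.cat([x, t], dim=1))
    return grad(u, t) - grad(grad(u, x), x)

def training_error_sq(n=1024, s=1):
    # Monte Carlo analogue of E_T[s; theta; S]^2 with equal quadrature weights 1/n
    x = torch.rand(n, 1, requires_grad=True)
    t = torch.rand(n, 1, requires_grad=True)
    r_i = pde_residual(x, t)                       # PDE residual at interior points
    loss = (r_i ** 2).mean()
    if s >= 1:                                     # add |D^alpha R_i|^2 for |alpha| = 1
        loss = loss + (grad(r_i, x) ** 2).mean() + (grad(r_i, t) ** 2).mean()
    x0 = torch.rand(n, 1, requires_grad=True)      # initial residual u(x,0) - phi(x)
    r_t = net(torch.cat([x0, torch.zeros_like(x0)], dim=1)) - torch.sin(math.pi * x0)
    loss = loss + (r_t ** 2).mean()
    if s >= 1:
        loss = loss + (grad(r_t, x0) ** 2).mean()
    tb = torch.rand(n, 1)                          # boundary residual: periodic mismatch
    r_b = net(torch.cat([torch.zeros_like(tb), tb], dim=1)) \
        - net(torch.cat([torch.ones_like(tb), tb], dim=1))
    return loss + (r_b ** 2).mean()

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):                               # training: minimize the loss function
    opt.zero_grad()
    training_error_sq().backward()
    opt.step()
```

In the setting of this paper, the residuals (2.4)–(2.6) and the quadrature rules of Sect. 4.3 play the role of the sampled residuals and Monte Carlo weights above.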

For our problem, the residuals are defined in (2.4)–(2.6), the generalization error is defined in (2.7), and the training error is defined in (2.8). In order to measure how good the approximation \(u_\theta \) is, we use the total error \({\mathcal {E}}[s;\theta ]^2 = \int _0^{{\mathcal {T}}} \Vert u-u_\theta \Vert _{H^s}^2 dt\), which is defined in (2.10) for our problem. Our analysis applies to any \(s \in \mathbb N\).

In this work, we mainly want to answer two crucial questions concerning the reliability of PINNs:

Q1::

The existence of neural networks \((V_\theta ,w_\theta ,p_\theta ,T_\theta )\) such that the training error (loss function) \(\mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]<\epsilon \) for arbitrary \(\epsilon >0\);

Q2::

The control of total error \({\mathcal {E}}[s;\theta ]\) by the training error with large enough sample set \({\mathcal {S}}\), i.e., \({\mathcal {E}}[s;\theta ]\lesssim \mathcal {E}_{T}[s;\theta ;{\mathcal {S}}] + f(|{\mathcal {S}}|)\) for some function f which is small when \(|{\mathcal {S}}|\) is large.

An affirmative answer to Q1 implies that one is able to train the neural networks to obtain a small enough training error (loss function) at the end. An affirmative answer to Q2 guarantees that \((V_\theta ,w_\theta ,p_\theta ,T_\theta )\) can approximate the true solution arbitrarily well in some Sobolev norms as long as the training error \(\mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]\) is small enough and the sample set \({\mathcal {S}}\) is large enough. However, \(\mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]\) is not convenient for the analysis, while \(\mathcal {E}_G[s;\theta ]\) is more tractable as it is in integral form. As discussed in [24], one can, in turn, consider the following three sub-questions:

SubQ1::

The existence of neural networks \((V_\theta ,w_\theta ,p_\theta ,T_\theta )\) such that the generalization error \(\mathcal {E}_G[s;\theta ]<\epsilon \) for arbitrary \(\epsilon >0\);

SubQ2::

The control of total error by generalization error, i.e., \({\mathcal {E}}[s;\theta ]\lesssim \mathcal {E}_G[s;\theta ]\);

SubQ3::

The difference between the generalization error and the training error \(\Big |\mathcal {E}_G[s;\theta ]- \mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]\Big |< f(|{\mathcal {S}}|)\) for some function f which is small when \(|{\mathcal {S}}|\) is large.

Specifically, the answers to SubQ1 and SubQ3 lead to a positive answer to Q1, and the answers to SubQ2 and SubQ3 give the solution to Q2.

Our main contributions and results in this work are the following:

  • We establish the higher-order regularity result for the solutions to the PEs under Case 1 and Case 2; see Theorem 3.4. To the best of our knowledge, such a result for Case 1 was proven in [46], but it is new for Case 2. It is necessary because the smoothness of the solutions is required in order to perform the higher-order error analysis for PINNs.

  • We answer Q1 and Q2 (and SubQ1–SubQ3) for PINNs approximating the solutions to the PEs, which shows that PINNs are a reliable numerical method for solving the PEs; see Theorems 4.1, 4.2, 4.3, 4.4. Our estimates are all a priori, and the key strategy is to introduce a penalty term in the generalization error (2.7) and the training error (2.8). The introduction of such penalty terms is inspired by [4], where the authors studied PINNs approximating the 2D NSE. By virtue of Theorem 3.4, the solutions to the PEs in Case 1 and Case 2 exist globally, and are therefore bounded on any finite time interval. The penalty terms are introduced to control the growth of the neural network outputs and to make sure they stay in the target bounded set.

  • Rather than just considering the \(L^2\) norm in the errors [24, 65], i.e., \({\mathcal {E}}[s;\theta ], \mathcal {E}_G[s;\theta ], \mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]\) with \(s=0\), we use the higher-order \(H^s\) norm in \({\mathcal {E}}[s;\theta ], \mathcal {E}_G[s;\theta ], \mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]\) for \(s\in \mathbb N\). We prove that using the \(H^s\) norm in \(\mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]\) guarantees control of \({\mathcal {E}}[s;\theta ]\) at the same order s. The numerical performance in Sect. 5 further verifies our theory. Such results are crucial, as some problems do require higher-order estimates; for example, the Hamilton–Jacobi–Bellman equation requires an \(L^p\) error estimate with p large enough in order to be stable, see [80]. We refer the readers to [21] for more discussion of higher-order error estimates for neural networks. We believe that the higher-order error estimates developed in this work can be readily applied to other PDEs, for example, the Euler equations, the Navier–Stokes equations, and the Boussinesq system.

The rest of the paper is organized as follows. In Sect. 2, we introduce the notation and collect some preliminary results that will be used in this paper. In Sect. 3, we prove the higher-order regularity of the solutions to the PEs under Case 1 and Case 2. In Sect. 4, we establish the main results of this paper by answering Q1 and Q2 (through SubQ1–SubQ3) discussed above. In the end, we present some numerical experiments in Sect. 5 to support our theoretical results on the accuracy of the approximation under higher-order Sobolev norms.

2 Preliminaries

In this section, we introduce the notation and collect some preliminary results that will be used in the rest of this paper. The universal constant C that appears below may change from step to step, and we use the subscript to emphasize its dependence when necessary, e.g., \(C_r\) is a constant depending only on r.

2.1 Functional settings

We use the notation \(\varvec{x}:= (\varvec{x}',z) = (x, y, z)\in {\mathcal {D}}\), where \(\varvec{x}'\) and z represent the horizontal and vertical variables, respectively, and for a positive time \({\mathcal {T}}>0\) we denote by

$$\begin{aligned} \Omega = [0,{\mathcal {T}}]\times {\mathcal {D}}. \end{aligned}$$

Let \(\nabla =(\partial _x,\partial _y,\partial _z)\) and \(\Delta = \partial _{xx}+\partial _{yy}+\partial _{zz}\) be the three dimensional gradient and Laplacian, and \(\nabla _h = (\partial _{x}, \partial _{y})\) and \(\Delta _h = \partial _{xx} + \partial _{yy}\) be the horizontal ones. Let \(\alpha \in \mathbb {N}^n\) be a multi-index. We say \(\alpha \le \beta \) if and only if \(\alpha _i\le \beta _i\) for each \(i\in \{1,2,...,n\}\). The notation

$$\begin{aligned} |\alpha | = \sum \limits _{j=1}^n \alpha _j, \hspace{0.2in} \alpha ! = \prod \limits _{j=1}^n \alpha _j!, \hspace{0.2in} \left( {\begin{array}{c}\alpha \\ \beta \end{array}}\right) = \frac{\alpha !}{\beta !(\alpha -\beta )!} \end{aligned}$$

will be used throughout the paper. Let \(P_{m,n}=\{\alpha \in \mathbb N^n, |\alpha |=m \}\), and denote by \(|P_{m,n}|\) the cardinality of \(P_{m,n}\), which is given by \(|P_{m,n}| = \left( {\begin{array}{c}m+n-1\\ m\end{array}}\right) \). For a function f defined on an open subset \(U\subseteq \mathbb R^n\) and \(x=(x_1,x_2,...,x_n)\in U\), we denote the partial derivative of f with multi-index \(\alpha \) by \( D^\alpha f= \frac{\partial ^{|\alpha |}f}{\partial _{x_1}^{\alpha _1}\cdots \partial _{x_n}^{\alpha _n}}. \) Let

$$\begin{aligned} L^2(U)=\left\{ f: \int _U |f(x)|^2 dx < \infty \right\} \end{aligned}$$

be the usual \(L^2\) space associated with the Lebesgue measure restricted on U, endowed with the norm \( \Vert f\Vert _{L^2(U)} = (\int _{U} |f|^2 dx)^{\frac{1}{2}}, \) coming from the inner product \( \langle f,g\rangle = \int _{U} f(x)g(x) \;dx \) for \(f,g \in L^2(U)\). For \(r\in \mathbb N\), denote by \(H^r(U)\) the Sobolev spaces:

$$\begin{aligned} H^r(U) = \{f\in L^2(U): \Vert D^\alpha f\Vert _{L^2(U)}<\infty \text { for } |\alpha |\le r\}, \end{aligned}$$

endowed with the norm \( \Vert f\Vert _{H^r(U)} = (\sum \limits _{|\alpha |\le r}\int _{U} |D^\alpha f|^2 dx)^{\frac{1}{2}}. \) For more details about the Sobolev spaces, we refer the readers to [1]. Define

$$\begin{aligned} \begin{aligned}&\widetilde{V}_e:= \left\{ \varphi \in C^\infty ({\mathcal {D}}): \varphi \text { is periodic in } (x,y,z) \text { and even in } z, \, \int _0^1 \nabla _h\cdot \varphi (\varvec{x}', z)dz = 0 \right\} ,\\&\widetilde{V}_o:= \left\{ \varphi \in C^\infty ({\mathcal {D}}): \varphi \text { is periodic in } (x,y,z) \text { and odd in }z \right\} , \end{aligned} \end{aligned}$$

and denote by \(H^r_e({\mathcal {D}})\) and \(H^r_o({\mathcal {D}})\) the closure spaces of \(\widetilde{V}_e\) and \(\widetilde{V}_o\), respectively, under the \(H^r\)-topology. When \(r=0\), \(H^r_e({\mathcal {D}}) = L^2_e({\mathcal {D}})\) and \(H^r_o({\mathcal {D}}) = L^2_o({\mathcal {D}}).\) Note that in the 2D case, all notations need to be adapted accordingly by letting \(\mathcal M = (0,1)\). When the functional space and the norm are defined in the spatial domain \({\mathcal {D}}\), we frequently write \(L^2\), \(H^r\), \(\Vert \cdot \Vert _{L^2}\), and \(\Vert \cdot \Vert _{H^r}\) by omitting \({\mathcal {D}}\) when there is no confusion.

By virtue of (1.1c) and the boundary condition (1.2), one can rewrite w as

$$\begin{aligned} w(\varvec{x}',z) = -\int _0^z \nabla _h\cdot V(\varvec{x}', \tilde{z}) d\tilde{z}. \end{aligned}$$
(2.1)

Notice that since \(w(z=1)=0\), V satisfies the compatibility condition

$$\begin{aligned} \int _0^1 \nabla _h\cdot V(\varvec{x}', \tilde{z}) d\tilde{z} =0. \end{aligned}$$
(2.2)

By Cauchy–Schwarz inequality,

$$\begin{aligned} \begin{aligned} \Vert w\Vert ^2_{H^r({\mathcal {D}})}&= \sum \limits _{|\alpha |\le r} \int _{{\mathcal {D}}} \left| D^\alpha \int _0^z \nabla _h\cdot V(\varvec{x}', \tilde{z}) d\tilde{z} \right| ^2 d\varvec{x} \le C \Vert \nabla _h V\Vert ^2_{H^r({\mathcal {D}})}. \end{aligned} \end{aligned}$$
(2.3)
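As a quick numerical illustration of (2.1)–(2.2) (a sketch of ours, not part of the PINN scheme below, which trains a separate network \(w_\theta \) and penalizes the divergence residual in (2.4)), w can be diagnosed from samples of \(\nabla _h\cdot V\) on a vertical grid by cumulative trapezoidal integration; for a column whose vertical mean of \(\nabla _h\cdot V\) vanishes, the recovered w also vanishes at \(z=1\), in agreement with the compatibility condition. All names below are illustrative.

```python
import numpy as np

def diagnose_w(div_h_V, z):
    # w(x', z) = -int_0^z (nabla_h . V)(x', z') dz', via cumulative trapezoids on the
    # z-grid; div_h_V has shape (..., len(z)) with the last axis indexing the z levels
    dz = np.diff(z)
    panels = 0.5 * (div_h_V[..., 1:] + div_h_V[..., :-1]) * dz
    zeros = np.zeros(div_h_V.shape[:-1] + (1,))
    return -np.concatenate([zeros, np.cumsum(panels, axis=-1)], axis=-1)

# test column: div_h V = cos(2*pi*z) has zero vertical mean, so (2.2) holds and the
# diagnosed w satisfies w(z=0) = w(z=1) = 0 up to quadrature error
z = np.linspace(0.0, 1.0, 101)
w = diagnose_w(np.cos(2 * np.pi * z), z)
print(w[0], w[-1])   # both approximately 0
```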

2.2 Neural networks

We will work with the following class of neural networks introduced in [24].

Definition 2.1

Suppose \(R \in (0, \infty ]\), \(L, W \in \mathbb {N}\), and \(l_0, \ldots , l_L \in \mathbb {N}\). Let \(\sigma : \mathbb {R} \rightarrow \mathbb {R}\) be a twice differentiable activation function, and define

$$\begin{aligned} \Theta =\Theta _{L, W, R}:=\bigcup _{L^{\prime } \in \mathbb {N}, L^{\prime } \le L} \bigcup _{l_0, \ldots , l_L \in \{1, \ldots , W\}} X_{k=1}^{L^{\prime }}\left( [-R, R]^{l_k \times l_{k-1}} \times [-R, R]^{l_k}\right) . \end{aligned}$$

For \(\theta \in \Theta _{L, W, R}\), we define \(\theta _k:=\left( \mathcal {W}_k, b_k\right) \) and \(\mathcal {A}_k: \mathbb {R}^{l_{k-1}} \rightarrow \mathbb {R}^{l_k}: x \mapsto \mathcal {W}_k x+b_k\) for \(1 \le k \le L\) and define \(f_k^\theta : \mathbb {R}^{l_{k-1}} \rightarrow \mathbb {R}^{l_k}\) by

$$\begin{aligned} f_k^\theta (\eta )= {\left\{ \begin{array}{ll}\mathcal {A}_L^\theta (\eta ) &{} k=L \\ \left( \sigma \circ \mathcal {A}_k^\theta \right) (\eta ) &{} 1 \le k<L\end{array}\right. } \end{aligned}$$

Denote by \(u_\theta : \mathbb {R}^{l_0} \rightarrow \mathbb {R}^{l_L}\) the function that satisfies for all \(\eta \in \mathbb {R}^{l_0}\) that

$$\begin{aligned} u_\theta (\eta )=\left( f_L^\theta \circ f_{L-1}^\theta \circ \cdots \circ f_1^\theta \right) (\eta ). \end{aligned}$$

In our approach to approximating the system (1.1), we assign \(l_0= d+1\) and \(\eta =(\varvec{x}, t)\). The neural network that corresponds to parameter \(\theta \) and consists of L layers and widths \((l_0,l_1,...,l_L)\) is denoted by \(u_\theta \). The first \(L-1\) layers are considered hidden layers, where \(l_k\) refers to the width of layer k, and \(\mathcal W_k\) and \(b_k\) denote the weights and biases of layer k, respectively. The width of \(u_\theta \) is defined as the maximum value among \({l_0,\dots ,l_L}\).
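For concreteness, the following short sketch (our own illustration; the names are not taken from [24]) instantiates Definition 2.1 with \(\sigma =\tanh \): the parameters \(\theta =((\mathcal W_1,b_1),\ldots ,(\mathcal W_L,b_L))\) are drawn from \([-R,R]\), the hidden layers apply \(\sigma \circ \mathcal A_k\), and the output layer applies the affine map \(\mathcal A_L\) only.

```python
import numpy as np

def init_params(widths, R, rng):
    # theta = ((W_1, b_1), ..., (W_L, b_L)) with all entries drawn from [-R, R]
    return [(rng.uniform(-R, R, (widths[k], widths[k - 1])),
             rng.uniform(-R, R, widths[k]))
            for k in range(1, len(widths))]

def u_theta(theta, eta, sigma=np.tanh):
    # u_theta(eta) = (f_L o ... o f_1)(eta): sigma is applied after every layer except the last
    for k, (W, b) in enumerate(theta):
        eta = W @ eta + b                  # affine map A_k(eta) = W_k eta + b_k
        if k < len(theta) - 1:             # hidden layers only
            eta = sigma(eta)
    return eta

rng = np.random.default_rng(0)
# d = 3 space dimensions plus time gives l_0 = 4; two hidden layers of width 20; scalar output
theta = init_params([4, 20, 20, 1], R=1.0, rng=rng)
print(u_theta(theta, np.array([0.1, 0.2, 0.3, 0.0])))
```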

2.3 PINNs settings

We define the following residuals from the PDE system (1.1):

$$\begin{aligned}&\text {(PDE residuals)} \nonumber \\&\quad {\left\{ \begin{array}{ll} \mathcal R_{i,V}[\theta ]:= \partial _t V_\theta + V_\theta \cdot \nabla _h V_\theta + w_\theta \partial _z V_\theta +f_0 V_\theta ^\perp - \nu _h \Delta _h V_\theta - \nu _z \partial _{zz} V_\theta + \nabla _h p_\theta ,\\ \mathcal R_{i,p}[\theta ]:= \partial _z p_\theta + T_\theta ,\\ \mathcal R_{i,T}[\theta ] := \partial _t T_\theta + V_\theta \cdot \nabla _h T_\theta + w_\theta \partial _z T_\theta - \kappa _h \Delta _h T_\theta - \kappa _z \partial _{zz} T_\theta - Q,\\ \mathcal R_{i,div}[\theta ]:= \nabla _h\cdot V_\theta + \partial _z w_\theta , \end{array}\right. } \end{aligned}$$
(2.4)

the residuals from the initial conditions (1.3):

$$\begin{aligned}&\text {(initial residuals)} \nonumber \\&{\left\{ \begin{array}{ll} \mathcal R_{t,V}[\theta ] := V_\theta (t=0) - V_0,\\ \mathcal R_{t,T}[\theta ] := T_\theta (t=0) - T_0, \end{array}\right. } \end{aligned}$$
(2.5)

and for \(s\in \mathbb N\), the residuals from the boundary conditions:

$$\begin{aligned} \begin{aligned}&\text {(boundary residuals) }\mathcal R_{b}[s;\theta ]\\&:= \Big \{\sum \limits _{\varphi \in \{V_\theta ,w_\theta ,p_\theta ,T_\theta \}}\sum \limits _{|\alpha |\le s} \Big [\Big (D^\alpha \varphi (x=1,y,z)- D^\alpha \varphi (x=0,y,z)\Big )^2\\&+ \Big (D^\alpha \varphi (x,y=1,z)- D^\alpha \varphi (x,y=0,z)\Big )^2\\&+ \Big (D^\alpha \varphi (x,y,z=1)- D^\alpha \varphi (x,y,z=0)\Big )^2\Big ]\\&+\sum \limits _{|\alpha |\le s, \alpha _3=0} \Big [\Big (D^\alpha w_\theta (x,y,z=1)\Big )^2 + \Big (D^\alpha w_\theta (x,y,z=0)\Big )^2\Big ] \Big \}^{\frac{1}{2}}, \end{aligned}\nonumber \\ \end{aligned}$$
(2.6)

For \(s\in \mathbb N\), the generalization error is defined by

$$\begin{aligned} \begin{aligned}&\text {(generalization error) } \mathcal {E}_G[s;\theta ]\\&\qquad \qquad := \left( \mathcal {E}_G^i[s;\theta ]^2 + \mathcal {E}_G^t[s;\theta ]^2 +\mathcal {E}_G^b[s;\theta ]^2 + \lambda \mathcal {E}_G^p[s;\theta ]^2\right) ^{\frac{1}{2}}, \end{aligned} \end{aligned}$$
(2.7)

where

$$\begin{aligned} \begin{aligned}&\mathcal {E}_G^i[s;\theta ]^2:= \int _0^{{\mathcal {T}}}\left( \Vert \mathcal R_{i,V}[\theta ]\Vert _{H^s({\mathcal {D}})}^2 + \Vert \mathcal R_{i,p}[\theta ]\Vert _{H^s({\mathcal {D}})}^2 + \Vert \mathcal R_{i,T}[\theta ]\Vert _{H^s({\mathcal {D}})}^2 \right. \\&\quad \qquad \quad \left. + \Vert \mathcal R_{i,div}[\theta ]\Vert _{H^s({\mathcal {D}})}^2 \right) dt,\\&\mathcal {E}_G^t[s;\theta ]^2: =\Vert \mathcal R_{t,V}[\theta ]\Vert _{H^s({\mathcal {D}})}^2 + \Vert \mathcal R_{t,T}[\theta ]\Vert _{H^s({\mathcal {D}})}^2,\\&\mathcal {E}_G^b[s;\theta ]^2:= \int _0^{{\mathcal {T}}}\Vert \mathcal R_{b}[s;\theta ]\Vert _{L^2(\partial {\mathcal {D}})}^2 dt,\\&\mathcal {E}_G^p[s;\theta ]^2:= \int _0^{{\mathcal {T}}} \left( \Vert V_\theta \Vert _{H^{s+3}({\mathcal {D}})}^2 + \Vert p_\theta \Vert _{H^{s+3}({\mathcal {D}})}^2 + \Vert w_\theta \Vert _{H^{s+3}({\mathcal {D}})}^2 + \Vert T_\theta \Vert _{H^{s+3}({\mathcal {D}})}^2 \right) dt \end{aligned} \end{aligned}$$

and the training error is defined by

$$\begin{aligned}{} & {} \text {(training error) } \qquad \mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]\nonumber \\{} & {} := \left( \mathcal {E}_{T}^i[s;\theta ;{\mathcal {S}}_i]^2 + \mathcal {E}_{T}^t[s;\theta ;{\mathcal {S}}_t]^2 +\mathcal {E}_{T}^b[s;\theta ;{\mathcal {S}}_b]^2 + \lambda \mathcal {E}_{T}^p[s;\theta ;{\mathcal {S}}_i]^2\right) ^{\frac{1}{2}}, \end{aligned}$$
(2.8)

where

$$\begin{aligned} \begin{aligned}&\mathcal {E}_{T}^i[s;\theta ;{\mathcal {S}}_i]^2:= \sum \limits _{(t_n,x_n,y_n,z_n)\in {\mathcal {S}}_i}\sum \limits _{|\alpha |\le s}w_n^i\Big [\Big (D^\alpha \mathcal R_{i,V}[\theta ](t_n,x_n,y_n,z_n)\Big )^2\\&\qquad \qquad \qquad \qquad + \Big (D^\alpha \mathcal R_{i,p}[\theta ](t_n,x_n,y_n,z_n)\Big )^2\\&\qquad \qquad \qquad \qquad +\Big (D^\alpha \mathcal R_{i,T}[\theta ](t_n,x_n,y_n,z_n)\Big )^2 +\Big (D^\alpha \mathcal R_{i,div}[\theta ](t_n,x_n,y_n,z_n)\Big )^2 \Big ],\\&\mathcal {E}_{T}^t[s;\theta ;{\mathcal {S}}_t]^2: =\sum \limits _{(x_n,y_n,z_n)\in {\mathcal {S}}_t}\sum \limits _{|\alpha |\le s}w_n^t\Big [\Big (D^\alpha \mathcal R_{t,V}[\theta ](x_n,y_n,z_n)\Big )^2 \\&\qquad \qquad \qquad \qquad + \Big (D^\alpha \mathcal R_{t,T}[\theta ](x_n,y_n,z_n)\Big )^2 \Big ],\\&\mathcal {E}_{T}^b[s;\theta ;{\mathcal {S}}_b]^2: =\sum \limits _{(t_n,x_n,y_n,z_n)\in {\mathcal {S}}_b}w_n^b \Big ( R_{b}[s;\theta ](t_n,x_n,y_n,z_n)\Big )^2,\\&\mathcal {E}_{T}^p[s;\theta ;{\mathcal {S}}_i]^2:= \sum \limits _{(t_n,x_n,y_n,z_n)\in {\mathcal {S}}_i}\sum \limits _{|\alpha |\le s+3}w_n^i\Big [\Big (D^\alpha V_\theta (t_n,x_n,y_n,z_n)\Big )^2 \\&\qquad \qquad \qquad \qquad + \Big (D^\alpha p_\theta (t_n,x_n,y_n,z_n)\Big )^2+\Big (D^\alpha w_\theta (t_n,x_n,y_n,z_n)\Big )^2\\&\qquad \qquad \qquad \qquad +\Big (D^\alpha T_\theta (t_n,x_n,y_n,z_n)\Big )^2 \Big ]. \end{aligned} \end{aligned}$$
(2.9)

Here the quadrature points in space-time constitute the data sets \({\mathcal {S}}=({\mathcal {S}}_i, {\mathcal {S}}_t, {\mathcal {S}}_b)\), with \({\mathcal {S}}_i \subseteq [0,{\mathcal {T}}]\times {\mathcal {D}}\), \({\mathcal {S}}_t \subseteq {\mathcal {D}}\), \({\mathcal {S}}_b \subseteq [0,{\mathcal {T}}]\times \partial {\mathcal {D}}\), and \((w_n^i,w_n^t,w_n^b)\) are the quadrature weights, defined in (4.31) and (4.32), respectively. The terms \(\mathcal {E}_G^p\) and \(\mathcal {E}_{T}^p\) stand for the penalty terms. Finally, for \(s\in \mathbb N\), the total error is defined as

$$\begin{aligned} \text {(total error) } \qquad \qquad \qquad {\mathcal {E}}[s;\theta ]:= \Big (\int _0^{{\mathcal {T}}}\big (\Vert V-V_\theta \Vert ^2_{H^s} + \Vert T-T_\theta \Vert ^2_{H^s}\big )dt\Big )^{\frac{1}{2}}. \end{aligned}$$
(2.10)
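To illustrate how the interior residuals (2.4) can be evaluated in practice, the sketch below computes them by automatic differentiation at a batch of collocation points. It is a schematic example under our own naming conventions (PyTorch, unit physical constants, \(Q=0\)) and only assembles the \(s=0\) part of \(\mathcal {E}_{T}^i\) with equal weights; the higher-order derivatives, the initial and boundary residuals (2.5)–(2.6), and the penalty term \(\mathcal {E}_{T}^p\) are handled analogously.

```python
import torch

def make_net(n_out):
    return torch.nn.Sequential(
        torch.nn.Linear(4, 64), torch.nn.Tanh(),
        torch.nn.Linear(64, 64), torch.nn.Tanh(),
        torch.nn.Linear(64, n_out),
    )

# networks (V_theta, w_theta, p_theta, T_theta); inputs are (t, x, y, z)
V_net, w_net, p_net, T_net = make_net(2), make_net(1), make_net(1), make_net(1)

def d(f, v):
    # partial derivative of the batched output f with respect to the input batch v
    return torch.autograd.grad(f, v, torch.ones_like(f), create_graph=True)[0]

def pde_residuals(t, x, y, z, nu_h=1.0, nu_z=1.0, kap_h=1.0, kap_z=1.0, f0=1.0, Q=0.0):
    # interior residuals (2.4) evaluated at the collocation points (t, x, y, z)
    inp = torch.cat([t, x, y, z], dim=1)
    V, w, p, T = V_net(inp), w_net(inp), p_net(inp), T_net(inp)
    u, v = V[:, :1], V[:, 1:]
    mom = []
    for q in (u, v):                                      # horizontal momentum (1.1a)
        lap_h = d(d(q, x), x) + d(d(q, y), y)
        coriolis = -f0 * v if q is u else f0 * u          # f0 V^perp with V^perp = (-v, u)
        grad_p = d(p, x) if q is u else d(p, y)
        mom.append(d(q, t) + u * d(q, x) + v * d(q, y) + w * d(q, z)
                   - nu_h * lap_h - nu_z * d(d(q, z), z) + coriolis + grad_p)
    R_iV = torch.cat(mom, dim=1)
    R_ip = d(p, z) + T                                    # hydrostatic balance (1.1b)
    R_idiv = d(u, x) + d(v, y) + d(w, z)                  # divergence condition (1.1c)
    R_iT = (d(T, t) + u * d(T, x) + v * d(T, y) + w * d(T, z)
            - kap_h * (d(d(T, x), x) + d(d(T, y), y)) - kap_z * d(d(T, z), z) - Q)
    return R_iV, R_ip, R_iT, R_idiv

n = 256
t, x, y, z = (torch.rand(n, 1, requires_grad=True) for _ in range(4))
loss = sum((r ** 2).mean() for r in pde_residuals(t, x, y, z))   # s = 0 part of (2.8)
```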

Remark 1

  1. (i)

    In our setting, we impose periodic boundary conditions, require (V, p) to be even in z, and require (w, T) to be odd in z. The assumption of evenness and oddness allows us to perform the periodic extension in the z direction; it can be ignored when one only wants to control the total error by the generalization error. Note also that the boundary conditions \(w|_{z=0,1}=0\) and \(D^\alpha w|_{z=0,1}=0\) with \(\alpha _3=0\) have physical meaning and are essential in the error estimate. Therefore they are included in the boundary residuals \(\mathcal R_b[s;\theta ]\).

  2. (ii)

    The total error is defined only for V and T, as for the primitive equations they are the prognostic variables, while w and p are diagnostic variables that can be recovered from V and T.

  3. (iii)

    If the original PINN framework (1.4) were followed, one could first obtain a posteriori estimates for SubQ1–SubQ2, meaning that the constants would depend on certain norms of the outputs of the neural networks, and then make them a priori by requiring high regularity of the solution. For this approach, see, for instance, [24, Theorems 3.1 and 3.4, and Corollary 3.14]. We proceed in an alternative way, inspired by the approach proposed in [4]. That is, we include the additional terms \(\mathcal {E}_G^p\) and \(\mathcal {E}_{T}^p\) in the generalization and training errors, which allow these constants to be bounded directly by the PDE solution, and therefore achieve an a priori estimate for the total error.

3 Regularity of solutions to the primitive equations

We first give the definition of strong solutions to system (1.1) under Case 1. The following definition is similar to the ones appearing in [10, 12].

Definition 3.1

Let \({\mathcal {T}}>0\) and let \(V_0\in H_e^2({\mathcal {D}})\) and \(T_0\in H_o^2({\mathcal {D}})\). A couple (V, T) is called a strong solution to system (1.1) on \(\Omega = [0,{\mathcal {T}}]\times {\mathcal {D}}\) if

  1. (i)

    V and T have the regularities

    $$\begin{aligned}&V\in L^\infty (0,{\mathcal {T}}; H_e^2({\mathcal {D}}))\cap C([0,{\mathcal {T}}];H_e^1({\mathcal {D}})),{} & {} T\in L^\infty (0,{\mathcal {T}}; H_o^2({\mathcal {D}}))\cap C([0,{\mathcal {T}}];H_o^1({\mathcal {D}}))\\&(\nabla _h V,\nu _z V_z)\in L^2(0,{\mathcal {T}}; H^2({\mathcal {D}})), \quad{} & {} (\nabla _h T,\kappa _z T_z)\in L^2(0,{\mathcal {T}}; H^2({\mathcal {D}}))\\&\partial _tV\in L^2(0,{\mathcal {T}}; H^1({\mathcal {D}})), \quad{} & {} \partial _tT \in L^2(0,{\mathcal {T}}; H^1({\mathcal {D}})) ; \end{aligned}$$
  2. (ii)

    V and T satisfy system (1.1) a.e. in \(\Omega = [0,{\mathcal {T}}]\times {\mathcal {D}}\) and the initial condition (1.3).

Definition 3.2

A couple (V, T) is called a global strong solution to system (1.1) if it is a strong solution on \(\Omega =[0,{\mathcal {T}}]\times {\mathcal {D}}\) for any \({\mathcal {T}}>0\).

The theorem below is from [12] and concerns the global well-posedness of system (1.1) under Case 2.

Theorem 3.3

[12, Theorem 1.3] Suppose that \(Q=0\), \(V_0\in H_e^2({\mathcal {D}})\) and \(T_0\in H_o^2({\mathcal {D}})\). Then system (1.1) has a unique global strong solution (V, T), which depends continuously on the initial data.

Remark 2

Theorem 3.3 applies to Case 2. It can easily be extended to Case 1, i.e., \(\nu _h, \kappa _h, \nu _z,\kappa _z>0\) (see [12, Proposition 2.6]). Moreover, under Case 1, the solution (V, T) satisfies \(V\in L^2(0,{\mathcal {T}}; H_e^3({\mathcal {D}}))\) and \(T\in L^2(0,{\mathcal {T}}; H_o^3({\mathcal {D}}))\), as indicated in Definition 3.1. Theorem 3.3 is proved in [12] with \(d=3\), but it also holds when \(d=2\). The requirement \(Q=0\) can be replaced with Q being regular enough, for example, \(Q\in L^\infty (0,{\mathcal {T}}; H_o^2({\mathcal {D}}))\) for arbitrary \({\mathcal {T}}>0\).

In order to perform the error analysis for PINNs, we need to establish higher-order regularity for the solution (V, T), in particular, continuity in both the spatial and temporal variables. This is summarized in the theorem below.

Theorem 3.4

Let \(k, r\in \mathbb N\), \(d\in \{2,3\}\), \(r>\frac{d}{2}+2k\), \({\mathcal {T}}>0\), and denote \(\Omega = [0,{\mathcal {T}}]\times {\mathcal {D}}.\) Suppose that \(V_0\in H_e^r({\mathcal {D}})\), \(T_0\in H_o^r({\mathcal {D}})\), and \(Q\in C^{k-1}([0,{\mathcal {T}}]; H_o^r({\mathcal {D}}))\). Then system (1.1) has a unique global strong solution (V, T), which depends continuously on the initial data. Moreover, we have

$$\begin{aligned}&(V,T)\in L^\infty (0,{\mathcal {T}}; H^r({\mathcal {D}}))\cap C([0,{\mathcal {T}}];H^{r-1}({\mathcal {D}})), \end{aligned}$$
(3.1a)
$$\begin{aligned}&(\nabla _h V,\nu _z V_z, \nabla _h T,\kappa _z T_z)\in L^2(0,{\mathcal {T}}; H^r({\mathcal {D}}))\cap C([0,{\mathcal {T}}];H^{r-1}({\mathcal {D}})), \end{aligned}$$
(3.1b)
$$\begin{aligned}&(\partial _tV, \partial _t T)\in L^2(0,{\mathcal {T}}; H^{r-1}({\mathcal {D}})), \end{aligned}$$
(3.1c)

and

$$\begin{aligned}&(V,T)\in C^k\left( \Omega \right) ,\quad (w,\nabla p)\in C^{k-1}\left( \Omega \right) . \end{aligned}$$
(3.2)

To prove Theorem 3.4, we shall need the following lemma.

Lemma 3.5

[48, Lemma A.1]; see also [3]. Let \(s\ge 1\), and suppose that \(f, g\in H^s({\mathcal {D}})\). Let \(\alpha \) be a multi-index such that \(|\alpha |\le s\). Then

$$\begin{aligned} \left\| D^\alpha (fg) - f D^\alpha g \right\| _{L^2} \le C_s\left( \Vert f\Vert _{H^s} \Vert g\Vert _{L^\infty } + \Vert \nabla f\Vert _{L^\infty } \Vert g\Vert _{H^{s-1}} \right) . \end{aligned}$$

Proof of Theorem 3.4

We present the proof for \(d=3\); the case \(d=2\) follows similarly. Notice that in our setting \(r>3\). Let us first consider Case 1.

Case 1: full viscosity.

We start by showing that, for arbitrary fixed \({\mathcal {T}}>0\), we have

$$\begin{aligned} (V, T)\in C([0,{\mathcal {T}}]; H^3({\mathcal {D}}))\cap L^2(0,{\mathcal {T}}; H^4({\mathcal {D}})), \quad \text {and} \quad (\partial _tV, \partial _t T)\in L^2(0,{\mathcal {T}}; H^{2}({\mathcal {D}})). \end{aligned}$$

Let \(\alpha \) be a multi-index with \(|\alpha |\le 3\). Taking the \(D^\alpha \) derivative of system (1.1), then taking the inner product of (1.1a) with \(D^\alpha V\) and of (1.1d) with \(D^\alpha T\), and summing over all \(|\alpha |\le 3\), one has

$$\begin{aligned}&\frac{1}{2} \frac{d}{dt} \left( \Vert V\Vert _{H^3}^2 + \Vert T\Vert _{H^3}^2\right) + \nu _h \Vert \nabla _h V\Vert _{H^3}^2 + \nu _z \Vert \partial _z V\Vert _{H^3}^2 + \kappa _h \Vert \nabla _h T\Vert _{H^3}^2 + \kappa _z \Vert \partial _z T\Vert _{H^3}^2\nonumber \\&\quad = \sum \limits _{|\alpha |\le 3} \Big (-\left\langle D^\alpha (V\cdot \nabla _h V+wV_z), D^\alpha V \right\rangle - \left\langle D^\alpha (V\cdot \nabla _h T+wT_z), D^\alpha T \right\rangle \nonumber \\&\qquad - \left\langle D^\alpha \nabla _h p, D^\alpha V \right\rangle + \left\langle D^\alpha Q, D^\alpha T \right\rangle \Big ). \end{aligned}$$
(3.3)

By integration by parts, thanks to (1.2), (1.1b) and (1.1c), using the Cauchy–Schwarz inequality, Young’s inequality and (2.3), one arrives at the following:

$$\begin{aligned} \left\langle D^\alpha \nabla _h p, D^\alpha V \right\rangle= & {} -\left\langle D^\alpha p, D^\alpha \nabla _h \cdot V \right\rangle = \left\langle D^\alpha p, D^\alpha \partial _z w \right\rangle \nonumber \\= & {} -\left\langle D^\alpha \partial _z p, D^\alpha w \right\rangle = \left\langle D^\alpha T, D^\alpha w \right\rangle \nonumber \\\le & {} ~ \Vert T\Vert _{H^3} \Vert w\Vert _{H^3} \le \Vert T\Vert _{H^3} \Vert \nabla _h V\Vert _{H^3} \le \frac{1}{4} \nu _h \Vert \nabla _h V\Vert _{H^3}^2 + C_{\nu _h} \Vert T\Vert _{H^3}^2.\nonumber \\ \end{aligned}$$
(3.4)

By the Cauchy–Schwarz inequality and Young’s inequality, one deduces

$$\begin{aligned} \left\langle D^\alpha Q, D^\alpha T \right\rangle \le \Vert Q\Vert _{H^3} \Vert T\Vert _{H^3} \le \frac{1}{2} \Vert Q\Vert _{H^3}^2 + \frac{1}{2} \Vert T\Vert _{H^3}^2. \end{aligned}$$
(3.5)

Using Lemma 3.5, integration by parts, the boundary conditions and (1.1c), together with the Cauchy–Schwarz inequality, Young’s inequality and the Sobolev inequality, for all \(|\alpha |\le 3\) one has

$$\begin{aligned} \begin{aligned}&\left\langle D^\alpha (V\cdot \nabla _h V+wV_z), D^\alpha V \right\rangle + \left\langle D^\alpha (V\cdot \nabla _h T+wT_z), D^\alpha T \right\rangle \\ =&\left\langle D^\alpha (V\cdot \nabla _h V) - V\cdot D^\alpha \nabla _h V + D^\alpha (wV_z) - w D^\alpha V_z, D^\alpha V \right\rangle \\&+ \left\langle D^\alpha (V\cdot \nabla _h T) - V\cdot D^\alpha \nabla _h T + D^\alpha (wT_z) - w D^\alpha T_z, D^\alpha T \right\rangle \\&+ \underbrace{\left\langle V\cdot D^\alpha \nabla _h V + w D^\alpha V_z, D^\alpha V \right\rangle }_{=0} + \underbrace{\left\langle V\cdot D^\alpha \nabla _h T + w D^\alpha T_z, D^\alpha T \right\rangle }_{=0}\\ \le&C\Big (\Vert V\Vert _{H^3} \Vert \nabla _h V\Vert _{L^\infty } + \Vert \nabla V\Vert _{L^\infty } \Vert \nabla _h V\Vert _{H^{2}} \\&\quad + \Vert w\Vert _{H^3} \Vert V_z\Vert _{L^\infty } + \Vert \nabla w\Vert _{L^\infty } \Vert V_z\Vert _{H^{2}}\Big )\Vert V\Vert _{H^3}\\&+ C\Big (\Vert V\Vert _{H^3} \Vert \nabla _h T\Vert _{L^\infty } + \Vert \nabla V\Vert _{L^\infty } \Vert \nabla _h T\Vert _{H^{2}} + \Vert w\Vert _{H^3} \Vert T_z\Vert _{L^\infty } \\&\quad + \Vert \nabla w\Vert _{L^\infty } \Vert T_z\Vert _{H^{2}}\Big ) \Vert T\Vert _{H^3}\\ \le&C \Vert V\Vert _{H^3}^2 \Vert \nabla _h V\Vert _{H^3} + C \Vert V\Vert _{H^3} \Vert T\Vert _{H^3}^2 + C \Vert T\Vert _{H^3}^2 \Vert \nabla _h V\Vert _{H^3}\\ \le&\frac{1}{4} \nu _h \Vert \nabla _h V\Vert _{H^3}^2 + C_{\nu _h} (1+\Vert V\Vert _{H^3}^2+ \Vert T\Vert _{H^3}^2 ) (\Vert V\Vert _{H^3}^2+ \Vert T\Vert _{H^3}^2 ). \end{aligned} \end{aligned}$$
(3.6)

Note that we have applied Lemma 3.5 in the first inequality. Combining the estimates (3.3)–(3.6), we obtain

$$\begin{aligned} \begin{aligned}&\frac{d}{dt} \left( \Vert V\Vert _{H^3}^2 + \Vert T\Vert _{H^3}^2\right) + \nu _h \Vert \nabla _h V\Vert _{H^3}^2 + \nu _z \Vert \partial _z V\Vert _{H^3}^2 + \kappa _h \Vert \nabla _h T\Vert _{H^3}^2 + \kappa _z \Vert \partial _z T\Vert _{H^3}^2\\&\quad \le C_{\nu _h} (1+\Vert V\Vert _{H^3}^2+ \Vert T\Vert _{H^3}^2 )(\Vert V\Vert _{H^3}^2+ \Vert T\Vert _{H^3}^2 ) + C \Vert Q\Vert _{H^3}^2. \end{aligned} \end{aligned}$$

From Theorem 3.3 we know that \(V,T\in L^2(0,{\mathcal {T}}; H^3({\mathcal {D}}))\) for arbitrary \({\mathcal {T}}>0\). By the Gronwall inequality, for any \(t\in [0,{\mathcal {T}}]\),

$$\begin{aligned} \begin{aligned}&\Vert V(t)\Vert _{H^3}^2 + \Vert T(t)\Vert _{H^3}^2 + \int _0^t \left( \nu _h \Vert \nabla _h V(\tilde{t})\Vert _{H^3}^2 \right. \\&\left. \quad + \nu _z \Vert \partial _z V(\tilde{t})\Vert _{H^3}^2 + \kappa _h \Vert \nabla _h T(\tilde{t})\Vert _{H^3}^2 + \kappa _z \Vert \partial _z T(\tilde{t})\Vert _{H^3}^2 \right) d\tilde{t}\\&\le \left( \Vert V_0\Vert _{H^3}^2 + \Vert T_0\Vert _{H^3}^2 + C \int _0^{{\mathcal {T}}} \Vert Q(\tilde{t})\Vert _{H^3}^2 d\tilde{t} \right) \\&\quad \exp \left( \int _0^{{\mathcal {T}}} C_{\nu _h} (1+\Vert V(\tilde{t})\Vert _{H^3}^2+ \Vert T(\tilde{t})\Vert _{H^3}^2 ) d\tilde{t} \right) <\infty . \end{aligned} \end{aligned}$$

Therefore, we get

$$\begin{aligned}&(V, T)\in L^\infty (0,{\mathcal {T}}; H^3({\mathcal {D}}))\cap L^2(0,{\mathcal {T}}; H^4({\mathcal {D}})). \end{aligned}$$

Now for any \(|\alpha |\le 2\), taking the \(D^\alpha \) derivative of system (1.1) and then taking the inner product of (1.1a) and (1.1b) with \(\varphi \in \{f\in L^2({\mathcal {D}}): \nabla \cdot f =0\}\), one has

$$\begin{aligned} \begin{aligned} \left\langle D^\alpha \partial _t V, \varphi \right\rangle =&-\left\langle D^\alpha (V\cdot \nabla _h V), \varphi \right\rangle -\left\langle D^\alpha (w \partial _z V), \varphi \right\rangle + \nu _h \left\langle D^\alpha \Delta _h V, \varphi \right\rangle + \nu _z \left\langle D^\alpha \partial _{zz} V, \varphi \right\rangle \\&\quad - f_0 \left\langle D^\alpha V^\perp , \varphi \right\rangle - \left\langle D^\alpha \nabla _h p, \varphi \right\rangle - \left\langle D^\alpha p_z, \varphi \right\rangle - \left\langle D^\alpha T, \varphi \right\rangle . \end{aligned} \end{aligned}$$

By the Cauchy–Schwarz inequality, recalling that \(H^s\) is a Banach algebra when \(s>\frac{d}{2}\), we have

$$\begin{aligned} |\left\langle D^\alpha \partial _t V, \varphi \right\rangle | \le \left( C \Vert V\Vert _{H^3}^2 + C_{\nu _h,\nu _z,f_0} \Vert V\Vert _{H^4} + \Vert T\Vert _{H^2} \right) \Vert \varphi \Vert _{L^2}, \end{aligned}$$

where we have consecutively used integration by parts and \(\nabla \cdot \varphi =0\) to get \( \left\langle D^\alpha \nabla _h p, \varphi \right\rangle + \left\langle D^\alpha p_z, \varphi \right\rangle =0. \) Since the inequality above is true for any \(|\alpha |\le 2\), and the space \(\{f\in L^2({\mathcal {D}}): \nabla \cdot f =0\}\) is dense in \(L^2({\mathcal {D}})\), from the regularity of V and T, one deduces

$$\partial _t V\in L^2(0,{\mathcal {T}}; H^2({\mathcal {D}})).$$

A similar argument yields

$$\partial _t T\in L^2(0,{\mathcal {T}}; H^2({\mathcal {D}})).$$

Applying the Lions–Magenes theorem (see e.g. [79, Chapter 3, Lemma 1.2]), together with the regularity of V, T, \(\partial _t V\), \(\partial _t T\), we obtain

$$\begin{aligned} (V, T)\in C([0,{\mathcal {T}}];H^3({\mathcal {D}})). \end{aligned}$$

This completes the proof of (3.1) for \(r = 3\). The cases \(r = 4\) and higher are then obtained by repeating the same argument. Therefore, under Case 1, we achieve (3.1) and, moreover, \((V, T)\in C([0,{\mathcal {T}}];H^r({\mathcal {D}}))\).

Next, we show (3.2). Taking the horizontal divergence on equation (1.1a), integrating with respect to z from 0 to 1, and taking the vertical derivative on equation (1.1b) gives

$$\begin{aligned} \begin{aligned} \Delta p(\varvec{x}) =&-\int _0^1 \nabla _h \cdot \left( V\cdot \nabla _h V + w\partial _z V -\nu _h \Delta _h V - \nu _z \partial _{zz} V +f_0 V^\perp \right) (\varvec{x}',z) dz - T_z(\varvec{x})\\ =&-\int _0^1 \nabla _h \cdot \left( V\cdot \nabla _h V + w\partial _z V +f_0 V^\perp \right) (\varvec{x}',z) dz - T_z(\varvec{x}), \end{aligned} \end{aligned}$$
(3.7)

where the viscosity terms disappear due to (1.2) and (2.2). We first consider \(k=1\). Since \(r> \frac{d}{2}+2k\), we know \(H^{r-2}\) is a Banach algebra. Since \((V,T)\in C([0,{\mathcal {T}}];H^r({\mathcal {D}}))\), one has \( \Delta p \in C([0,{\mathcal {T}}];H^{r-2}({\mathcal {D}}))\), and thus \(\nabla p \in C([0,{\mathcal {T}}];H^{r-1}({\mathcal {D}}))\). This implies that \(\partial _t V\in C([0,{\mathcal {T}}];H^{r-2}({\mathcal {D}}))\) and therefore, \(V\in C^1([0,{\mathcal {T}}];H^{r-2}({\mathcal {D}}))\) and \(w\in C^1([0,{\mathcal {T}}];H^{r-3}({\mathcal {D}}))\). Moreover, since \(Q\in C^{k-1}([0,{\mathcal {T}}]; H^r({\mathcal {D}}))\), one has \(\partial _t T\in C([0,{\mathcal {T}}];H^{r-2}({\mathcal {D}}))\), consequently \(T\in C^1([0,{\mathcal {T}}];H^{r-2}({\mathcal {D}}))\).

When \(k= 2\), since \(H^{r-4}\) is a Banach algebra, we can take the time derivative on equation (3.7) and get that \(\Delta p_t \in C([0,{\mathcal {T}}];H^{r-4}({\mathcal {D}}))\), and therefore \(\nabla p \in C^1([0,{\mathcal {T}}];H^{r-3}({\mathcal {D}}))\). This implies \(\partial _t V \in C^1([0,{\mathcal {T}}];H^{r-4}({\mathcal {D}}))\) and therefore \(V\in C^2([0,{\mathcal {T}}];H^{r-4}({\mathcal {D}}))\). One can also get \(w\in C^2([0,{\mathcal {T}}];H^{r-5}({\mathcal {D}}))\) and \(T\in C^2([0,{\mathcal {T}}];H^{r-4}({\mathcal {D}})).\) By repeating the above procedure, one will obtain

$$\begin{aligned} \begin{aligned}&(V, T)\in \cap _{l=0}^k C^l([0,{\mathcal {T}}];H^{r-2l}({\mathcal {D}})),\\&w \in \cap _{l=0}^k C^l([0,{\mathcal {T}}];H^{r-2l-1}({\mathcal {D}})),\\&\nabla p \in \cap _{l=0}^{k-1} C^l([0,{\mathcal {T}}];H^{r-2l-1}({\mathcal {D}})). \end{aligned} \end{aligned}$$

Then by the Sobolev embedding theorem and \(r> \frac{d}{2} + 2k\), we know that \(H^{r-2l}({\mathcal {D}}) \subset C^{2k-2l}({\mathcal {D}})\) for \(0\le l \le k\) and \(H^{r-2l-1}({\mathcal {D}}) \subset C^{2k-2l-1}({\mathcal {D}})\) for \(0\le l \le k-1\). Therefore,

$$\begin{aligned} (V,T)\in C^k\left( \Omega \right) ,\quad (w,\nabla p)\in C^{k-1}\left( \Omega \right) . \end{aligned}$$

Case 2: only horizontal viscosity.

Under Case 2, the proof of (3.1) when \(r=3\) is more technically involved. The key difference is in the estimate of the nonlinear term.

When \(D^\alpha = \partial _{z}^3\), integration by parts yields

$$\begin{aligned} \begin{aligned}&\left\langle \partial _{z}^3(V\cdot \nabla _h V+wV_z), \partial _{z}^3 V \right\rangle + \left\langle \partial _{z}^3(V\cdot \nabla _h T+wT_z), \partial _{z}^3 T \right\rangle \\&=\left\langle \partial _{z}^3(V\cdot \nabla _h V) - V\cdot \partial _{z}^3\nabla _h V + \partial _{z}^3(wV_z) - w \partial _{z}^3 V_z, \partial _{z}^3 V \right\rangle \\&\quad + \left\langle \partial _{z}^3(V\cdot \nabla _h T) - V\cdot \partial _{z}^3\nabla _h T + \partial _{z}^3(wT_z) - w \partial _{z}^3 T_z, \partial _{z}^3 T \right\rangle \\&\quad + \underbrace{\left\langle V\cdot \partial _{z}^3\nabla _h V + w \partial _{z}^3 V_z, \partial _{z}^3 V \right\rangle }_{=0} + \underbrace{\left\langle V\cdot \partial _{z}^3\nabla _h T + w \partial _{z}^3 T_z, \partial _{z}^3 T \right\rangle }_{=0}\\&= \left\langle \partial _{z}^3(V\cdot \nabla _h V) - V\cdot \partial _{z}^3\nabla _h V, \partial _{z}^3 V \right\rangle + \left\langle \partial _{z}^3(V\cdot \nabla _h T) - V\cdot \partial _{z}^3\nabla _h T, \partial _{z}^3 T \right\rangle \\&\quad + \left\langle \partial _{z}^3(wV_z) - w \partial _{z}^3 V_z, \partial _{z}^3 V \right\rangle + \left\langle \partial _{z}^3(wT_z) - w \partial _{z}^3 T_z, \partial _{z}^3 T \right\rangle \\: =&I_1 + I_2 + I_3 + I_4. \end{aligned} \end{aligned}$$
(3.8)

Using Lemma 3.5, together with the Cauchy-Schwarz inequality, Young’s inequality, and the Sobolev inequality, one obtains

$$\begin{aligned} \begin{aligned} I_1 + I_2&= \left\langle \partial _{z}^3(V\cdot \nabla _h V) - V\cdot \partial _{z}^3\nabla _h V, \partial _{z}^3 V \right\rangle + \left\langle \partial _{z}^3(V\cdot \nabla _h T) - V\cdot \partial _{z}^3\nabla _h T, \partial _{z}^3 T \right\rangle \\&\le C\Big (\Vert V\Vert _{H^3} \Vert \nabla _h V\Vert _{L^\infty } + \Vert \nabla V\Vert _{L^\infty } \Vert \nabla _h V\Vert _{H^{2}} \Big )\Vert V\Vert _{H^3}\\&\qquad + C\Big (\Vert V\Vert _{H^3} \Vert \nabla _h T\Vert _{L^\infty } + \Vert \nabla V\Vert _{L^\infty } \Vert \nabla _h T\Vert _{H^{2}} \Big ) \Vert T\Vert _{H^3}\\&\le C \left( \Vert \nabla _h V\Vert _{H^{2}} + \Vert \nabla _h T\Vert _{H^{2}}\right) \left( \Vert V\Vert _{H^3}^2 + \Vert T\Vert _{H^3}^2\right) . \end{aligned} \end{aligned}$$
(3.9)

For the estimates of \(I_3\), we use (1.1c), Young’s inequality, the Hölder inequality and the Sobolev inequality, to achieve

$$\begin{aligned} \begin{aligned} I_3 =&\left\langle \partial _{z}^3(wV_z) - w \partial _{z}^3 V_z, \partial _{z}^3 V \right\rangle \\&\le ~ C \left| \left\langle \underbrace{(\nabla _h \cdot V)}_{L^\infty } \underbrace{\partial _{z}^3 V}_{L^2}, \underbrace{\partial _{z}^3 V}_{L^2} \right\rangle \right| + C \left| \left\langle \underbrace{(\nabla _h \cdot V_z)}_{L^6} \underbrace{\partial _{z}^2 V}_{L^3}, \underbrace{\partial _{z}^3 V}_{L^2} \right\rangle \right| \\&\quad + C \left| \left\langle \underbrace{(\nabla _h \cdot \partial _{z}^2 V)}_{L^2} \underbrace{V_z}_{L^\infty }, \underbrace{\partial _{z}^3 V}_{L^2} \right\rangle \right| \\ \le&C\Vert \nabla _h V\Vert _{H^2} \Vert V\Vert _{H^3}^2. \end{aligned} \end{aligned}$$
(3.10)

Similarly, one can deduce that

$$\begin{aligned} I_4 =&\left\langle \partial _{z}^3(wT_z) - w \partial _{z}^3 T_z, \partial _{z}^3 T \right\rangle \le C\Vert \nabla _h V\Vert _{H^2} \Vert T\Vert _{H^3}^2. \end{aligned}$$
(3.11)

Combining (3.8)–(3.11) yields

$$\begin{aligned}{} & {} \left\langle \partial _{z}^3(V\cdot \nabla _h V+wV_z), \partial _{z}^3 V \right\rangle + \left\langle \partial _{z}^3(V\cdot \nabla _h T+wT_z), \partial _{z}^3 T \right\rangle \nonumber \\{} & {} \quad \le C \left( \Vert \nabla _h V\Vert _{H^{2}} + \Vert \nabla _h T\Vert _{H^{2}}\right) \left( \Vert V\Vert _{H^3}^2 + \Vert T\Vert _{H^3}^2\right) . \end{aligned}$$
(3.12)

When \(|\alpha |\le 3\) and \(\alpha \ne (0,0,3)\), one has \( \Vert D^\alpha V\Vert _{L^2} \le \Vert \nabla _h V\Vert _{H^2} + \Vert V\Vert _{H^2}\) and \(\Vert D^\alpha T\Vert _{L^2} \le \Vert \nabla _h T\Vert _{H^2} + \Vert T\Vert _{H^2}. \) Therefore, an estimate replacing (3.6) for Case 2 is

$$\begin{aligned} \begin{aligned}&\left\langle D^\alpha (V\cdot \nabla _h V+wV_z), D^\alpha V \right\rangle + \left\langle D^\alpha (V\cdot \nabla _h T+wT_z), D^\alpha T \right\rangle \\ \le&C\Big (\Vert V\Vert _{H^3} \Vert \nabla _h V\Vert _{L^\infty } + \Vert \nabla V\Vert _{L^\infty } \Vert \nabla _h V\Vert _{H^{2}} + \Vert w\Vert _{H^3} \Vert V_z\Vert _{L^\infty } \\&\quad + \Vert \nabla w\Vert _{L^\infty } \Vert V_z\Vert _{H^{2}}\Big )(\Vert \nabla _h V\Vert _{H^2} + \Vert V\Vert _{H^2})\\&+ C\Big (\Vert V\Vert _{H^3} \Vert \nabla _h T\Vert _{L^\infty } + \Vert \nabla V\Vert _{L^\infty } \Vert \nabla _h T\Vert _{H^{2}} + \Vert w\Vert _{H^3} \Vert T_z\Vert _{L^\infty } \\&\quad + \Vert \nabla w\Vert _{L^\infty } \Vert T_z\Vert _{H^{2}}\Big ) (\Vert \nabla _h T\Vert _{H^2} + \Vert T\Vert _{H^2})\\ \le&C \Vert V\Vert _{H^3}^2 \Vert \nabla _h V\Vert _{H^2} + C\Vert \nabla _h V\Vert _{H^3}\Vert V\Vert _{H^3}(\Vert \nabla _h V\Vert _{H^2} + \Vert V\Vert _{H^2})\\&\qquad \qquad + C \Vert V\Vert _{H^3} \Vert T\Vert _{H^3} \Vert \nabla _h T\Vert _{H^2} + C\Vert \nabla _h V\Vert _{H^3}\Vert T\Vert _{H^3}(\Vert \nabla _h T\Vert _{H^2} + \Vert T\Vert _{H^2})\\ \le&\frac{1}{4} \nu _h \Vert \nabla _h V\Vert _{H^3}^2 + C_{\nu _h} (1+\Vert V\Vert _{H^2}^2+ \Vert T\Vert _{H^2}^2+\Vert \nabla _h V\Vert _{H^2}^2\\&\quad + \Vert \nabla _h T\Vert _{H^2}^2 ) (\Vert V\Vert _{H^3}^2+ \Vert T\Vert _{H^3}^2 ). \end{aligned}\nonumber \\ \end{aligned}$$
(3.13)

Combining (3.12), (3.13), (3.3)–(3.5) yields

$$\begin{aligned} \begin{aligned}&\frac{d}{dt} \left( \Vert V\Vert _{H^3}^2 + \Vert T\Vert _{H^3}^2\right) + \nu _h \Vert \nabla _h V\Vert _{H^3}^2 + \kappa _h \Vert \nabla _h T\Vert _{H^3}^2\\&\qquad \le C_{\nu _h} (1+\Vert V\Vert _{H^2}^2+ \Vert T\Vert _{H^2}^2+\Vert \nabla _h V\Vert _{H^2}^2+ \Vert \nabla _h T\Vert _{H^2}^2 )(\Vert V\Vert _{H^3}^2+ \Vert T\Vert _{H^3}^2 ) + C \Vert Q\Vert _{H^3}^2. \end{aligned} \end{aligned}$$

From Theorem 3.3, we know that \((V,T)\in L^\infty (0,{\mathcal {T}}; H^2({\mathcal {D}}))\) and \((\nabla _h V, \nabla _h T) \in L^2(0,{\mathcal {T}}; H^2({\mathcal {D}}))\) for arbitrary \({\mathcal {T}}>0\). Thanks to the Gronwall inequality, for any \(t\in [0,{\mathcal {T}}]\),

$$\begin{aligned}&\Vert V(t)\Vert _{H^3}^2 + \Vert T(t)\Vert _{H^3}^2 + \int _0^t \left( \nu _h \Vert \nabla _h V(\tilde{t})\Vert _{H^3}^2 + \kappa _h \Vert \nabla _h T(\tilde{t})\Vert _{H^3}^2 \right) d\tilde{t}\\&\le \left( \Vert V_0\Vert _{H^3}^2 + \Vert T_0\Vert _{H^3}^2 + C \int _0^{{\mathcal {T}}} \Vert Q(\tilde{t})\Vert _{H^3}^2 d\tilde{t} \right) \\&\qquad \times \exp \left( \int _0^{{\mathcal {T}}} C_{\nu _h} (1+\Vert V(\tilde{t})\Vert _{H^2}^2+ \Vert T(\tilde{t})\Vert _{H^2}^2 + \Vert \nabla _h V(\tilde{t})\Vert _{H^2}^2+ \Vert \nabla _h T(\tilde{t})\Vert _{H^2}^2) d\tilde{t} \right) <\infty . \end{aligned}$$

Therefore, we obtain

$$\begin{aligned}&(V, T)\in L^\infty (0,{\mathcal {T}}; H^3({\mathcal {D}})), \qquad (\nabla _h V, \nabla _h T) \in L^2(0,{\mathcal {T}}; H^3({\mathcal {D}})). \end{aligned}$$

Now, following a similar argument as in Case 1 and using the Lions–Magenes theorem, we obtain

$$\begin{aligned} (V,T, \nabla _hV, \nabla _h T)\in C([0,{\mathcal {T}}];H^{2}({\mathcal {D}})), \qquad (\partial _t V, \partial _t T)\in L^2(0,{\mathcal {T}}; H^{2}({\mathcal {D}})). \end{aligned}$$

Notice that we cannot achieve \((V,T) \in C([0,{\mathcal {T}}];H^{3}({\mathcal {D}}))\) since \((V,T) \not \in L^2(0,{\mathcal {T}}; H^{4}({\mathcal {D}}))\). The proof of (3.1) for \(r=4\) follows the same argument as in Case 1. By repeating this procedure, one eventually obtains (3.1) for Case 2.

Finally, as \((V,\nabla _h V, T, \nabla _h T)\in C([0,{\mathcal {T}}];H^{r-1}({\mathcal {D}}))\), one can repeat the argument from Case 1 and get

$$\begin{aligned} \begin{aligned}&(V, T)\in C([0,{\mathcal {T}}];H^{r-1}({\mathcal {D}})) \cap _{l=1}^k C^l([0,{\mathcal {T}}];H^{r-2l}({\mathcal {D}})),\\&w \in \cap _{l=0}^k C^l([0,{\mathcal {T}}];H^{r-2l-1}({\mathcal {D}})),\\&\nabla p \in \cap _{l=0}^{k-1} C^l([0,{\mathcal {T}}];H^{r-2l-1}({\mathcal {D}})). \end{aligned} \end{aligned}$$

This concludes the proof of (3.2).

4 Error estimates for PINNs

4.1 Generalization error estimates

In this section, we answer the question SubQ1 raised in the introduction: given \(\varepsilon > 0\), does there exist a neural network \((V_\theta ,w_\theta ,p_\theta ,T_\theta )\) such that the corresponding generalization error \(\mathcal {E}_G[s;\theta ]\) defined in (2.7) satisfies \(\mathcal {E}_G[s;\theta ]< \varepsilon \)? We have the following result.

Theorem 4.1

(Answer of SubQ1) Let \(d\in \{2,3\}\), \(n\ge 2\), \(k\ge 5\), \(r> \frac{d}{2} + 2k\), and \({\mathcal {T}}>0\). Suppose that \(V_0\in H_e^r({\mathcal {D}})\), \(T_0\in H_o^r({\mathcal {D}})\), and \(Q\in C^{k-1}([0,{\mathcal {T}}]; H_o^r({\mathcal {D}}))\). Then for any given \(\varepsilon >0\) and for any \(0\le \ell \le k-5\), there exist \(\lambda = \mathcal O(\varepsilon ^2)\) small enough and \(N>5\) large enough depending on \(\varepsilon \) and \(\ell \), and tanh neural networks \(\widehat{V}_i:= ( V_i^N)_\theta \), \(\widehat{w}:= w^N_\theta \), \(\widehat{p}:= p^N_\theta \), and \(\widehat{T}:= T^N_\theta \), with \(i=1,...,d-1\), each with two hidden layers, of widths at most and , such that the generalization error satisfies

$$\begin{aligned} \mathcal {E}_G[\ell ;\theta ] \le \varepsilon . \end{aligned}$$

Proof

For simplicity, we treat the case of \(d=3\). The proof of \(d=2\) follows in a similar way.

From Theorem 3.4, we know that system (1.1) has a unique global strong solution (V, T) and

$$\begin{aligned}&(V,T)\in C^k\left( \Omega \right) \subset H^{k}\left( \Omega \right) ,\\&(w,p,\nabla p)\in C^{k-1}\left( \Omega \right) \subset H^{k-1}\left( \Omega \right) . \end{aligned}$$

By applying Lemma A.1, for fixed \(N>5\), there exist tanh neural networks \(\widehat{V}_1\), \(\widehat{V}_2\), \(\widehat{w}\), \(\widehat{p}\), and \(\widehat{T}\), each with two hidden layers, of widths at most and such that for every \(1\le s \le k-1\), one has

$$\begin{aligned}&\Vert V_i - \widehat{V}_i\Vert _{H^{s}(\Omega )} \le C_{V} \frac{1+\ln ^{s}N}{N^{k-s}},\qquad \Vert w - \widehat{w}\Vert _{H^{s-1}(\Omega )} \le C_{w} \frac{1+\ln ^{s-1}N}{N^{k-s}},\nonumber \\&\Vert p - \widehat{p}\Vert _{H^{s-1}(\Omega )} \le C_{p} \frac{1+\ln ^{s-1}N}{N^{k-s}},\qquad \Vert T - \widehat{T}\Vert _{H^{s}(\Omega )} \le C_{T} \frac{1+\ln ^{s}N}{N^{k-s}}, \end{aligned}$$
(4.1)

where the constants \(C_{V}, C_{w}, C_{p}\) and \(C_{T}\) are defined according to Lemma A.1. Denote by \(\widehat{V} = (\widehat{V}_1,\widehat{V}_2)\) and \((\partial _1, \partial _2)=(\partial _x, \partial _y)\). For \(0\le \ell \le k-5\) and \(i=1,2\), one has

$$\begin{aligned}&\Vert \partial _t V_i - \partial _t \widehat{V}_i\Vert _{H^\ell (\Omega )} \le \Vert V_i - \widehat{V}_i\Vert _{H^{\ell +1}(\Omega )} \le C_{V} \frac{1+\ln ^{\ell +1}N}{N^{k-\ell -1}}, \nonumber \\&\Vert \partial _i p - \partial _i \widehat{p}\Vert _{H^{\ell }(\Omega )}, \, \Vert \partial _z p - \partial _z \widehat{p}\Vert _{H^{\ell }(\Omega )} \le \Vert p - \widehat{p}\Vert _{H^{\ell +1}(\Omega )} \le C_{p} \frac{1+\ln ^{\ell +1}N}{N^{k-\ell -2}}, \nonumber \\&\Vert \Delta _h V_i - \Delta _h \widehat{V}_i\Vert _{H^\ell (\Omega )}, \, \Vert \partial _{zz} V_i - \partial _{zz} \widehat{V}_i\Vert _{H^\ell (\Omega )} \le \Vert V_i - \widehat{V}_i\Vert _{H^{\ell +2}(\Omega )} \le C_{V} \frac{1+\ln ^{\ell +2}N}{N^{k-\ell -2}}, \nonumber \\&\Vert \nabla _h \cdot V+ \partial _z w - (\nabla _h \cdot \widehat{V} + \partial _z \widehat{w})\Vert _{H^{\ell }(\Omega )} \le \sum \limits _{i=1}^2\Vert V_i - \widehat{V}_i\Vert _{H^{\ell +1}(\Omega )} + \Vert w - \widehat{w}\Vert _{H^{\ell +1}(\Omega )} \nonumber \\&\hspace{150pt}\le (C_{V} +C_w) \frac{1+\ln ^{\ell +1}N}{N^{k-\ell -2}}, \nonumber \\&\Vert \partial _t T - \partial _t \widehat{T}\Vert _{H^\ell (\Omega )} \le \Vert T - \widehat{T}\Vert _{H^{\ell +1}(\Omega )} \le C_{T} \frac{1+\ln ^{\ell +1}N}{N^{k-\ell -1}},\nonumber \\&\Vert \Delta _h T - \Delta _h \widehat{T}\Vert _{H^\ell (\Omega )}, \, \Vert \partial _{zz} T - \partial _{zz} \widehat{T}\Vert _{H^\ell (\Omega )} \le \Vert T - \widehat{T}\Vert _{H^{\ell +2}(\Omega )} \le C_{T} \frac{1+\ln ^{\ell +2}N}{N^{k-\ell -2}}. \end{aligned}$$
(4.2)

For the nonlinear terms, we use the Sobolev inequality and (4.1); since \(0\le \ell \le k-5\), one has

$$\begin{aligned}&\Vert V\cdot \nabla _h V_i - \widehat{V} \cdot \nabla _h \widehat{V}_i\Vert _{H^\ell (\Omega )} \nonumber \\&\le \Vert (V- \widehat{V})\cdot \nabla _h V_i\Vert _{H^\ell (\Omega )} + \Vert \widehat{V} \cdot \nabla _h (V_i-\widehat{V}_i)\Vert _{H^\ell (\Omega )}\nonumber \\&\le C \Vert V_i\Vert _{W^{k-4,\infty }(\Omega )} \max _i\Vert V_i - \widehat{V}_i\Vert _{H^\ell (\Omega )} + C \Vert \widehat{V}\Vert _{W^{k-5,\infty }(\Omega )} \Vert V_i - \widehat{V}_i\Vert _{H^{\ell +1}(\Omega )}\nonumber \\&\le C\left( \Vert V\Vert _{W^{k-4,\infty }(\Omega )}+ \Vert \widehat{V}\Vert _{W^{k-5,\infty }(\Omega )}\right) C_{V} \frac{1+\ln ^{\ell +1}N}{N^{k-\ell -1}}\nonumber \\&\le C(\Vert V\Vert _{H^{k-2}(\Omega )} + C_V \frac{1+\ln ^{k-3}N}{N^3})C_V\frac{1+\ln ^{\ell +1}N}{N^{k-\ell -1}},\nonumber \\&\Vert w\partial _z V_i - \widehat{w} \partial _z \widehat{V}_i\Vert _{H^\ell (\Omega )} \le \Vert (w - \widehat{w})\partial _z V_i\Vert _{H^\ell (\Omega )} + \Vert \widehat{w} \partial _z (V_i-\widehat{V}_i)\Vert _{H^\ell (\Omega )}\nonumber \\&\le C \Vert V_i\Vert _{W^{k-4,\infty }(\Omega )} \Vert w - \widehat{w}\Vert _{H^\ell (\Omega )} + C \Vert \widehat{w}\Vert _{W^{k-5,\infty }(\Omega )} \Vert V_i - \widehat{V}_i\Vert _{H^{\ell +1}(\Omega )}\nonumber \\&\le C\left( \Vert V\Vert _{W^{k-4,\infty }(\Omega )}+\Vert \hat{w}\Vert _{W^{k-5,\infty }(\Omega )} \right) (C_w + C_{V})\frac{1+\ln ^{\ell +1}N}{N^{k-\ell -1}}\nonumber \\&\le C\left( \Vert V\Vert _{H^{k-2}(\Omega )}+\Vert w\Vert _{H^{k-3}(\Omega )} + C_w\frac{1+\ln ^{k-3}N}{N^{2}} \right) (C_w + C_{V})\frac{1+\ln ^{\ell +1}N}{N^{k-\ell -1}}. \end{aligned}$$
(4.3)

Here we have used (4.1) to obtain

$$\begin{aligned} C\Vert \widehat{V}\Vert _{W^{k-5,\infty }(\Omega )}&\le C\Vert \widehat{V}\Vert _{H^{k-3}(\Omega )} \le C(\Vert V - \widehat{V}\Vert _{H^{k-3}(\Omega )} + \Vert V\Vert _{H^{k-3}(\Omega )}) \\&\le C\left( \Vert V\Vert _{H^{k-2}(\Omega )} + C_{V} \frac{1+\ln ^{k-3} N}{N^3}\right) ,\\ C\Vert \widehat{w}\Vert _{W^{k-5,\infty }(\Omega )}&\le C\Vert \widehat{w}\Vert _{H^{k-3}(\Omega )}\le C(\Vert w - \widehat{w}\Vert _{H^{k-3}(\Omega )} + \Vert w\Vert _{H^{k-3}(\Omega )}) \\&\le C\left( \Vert w\Vert _{H^{k-3}(\Omega )} + C_{w} \frac{1+\ln ^{k-3} N}{N^2}\right) . \end{aligned}$$

A calculation similar to (4.3) yields

$$\begin{aligned} \begin{aligned}&\Vert V\cdot \nabla _h T - \widehat{V} \cdot \nabla _h \widehat{T}\Vert _{H^\ell (\Omega )}\\ \le&C\left( \Vert T\Vert _{H^{k-2}(\Omega )} + \Vert V\Vert _{H^{k-2}(\Omega )}+ C_V \frac{1+\ln ^{k-3}N}{N^3}\right) (C_V+C_T)\frac{1+\ln ^{\ell +1}N}{N^{k-\ell -1}},\\&\Vert w\partial _z T - \widehat{w} \partial _z \widehat{T}\Vert _{H^\ell (\Omega )}\\ \le&C\left( \Vert T\Vert _{H^{k-2}(\Omega )} + \Vert w\Vert _{H^{k-3}(\Omega )}+ C_w \frac{1+\ln ^{k-3} N}{N^2}\right) (C_w + C_{T})\frac{1+\ln ^{\ell +1}N}{N^{k-\ell -1}}. \end{aligned}\nonumber \\ \end{aligned}$$
(4.4)

Combining (4.2)–(4.4) gives

$$\begin{aligned}&\Big \Vert \partial _t \widehat{V} + \widehat{V} \cdot \nabla _h \widehat{V} + \widehat{w} \partial _z \widehat{V} + \nabla _h \widehat{p} + f_0 \widehat{V}^\perp - \nu _h \Delta _h \widehat{V} - \nu _z \partial _{zz} \widehat{V}\Big \Vert _{H^\ell (\Omega )} \nonumber \\&=\Big \Vert \left( \partial _t \widehat{V} + \widehat{V} \cdot \nabla _h \widehat{V} + \widehat{w} \partial _z \widehat{V} + \nabla _h \widehat{p} + f_0 \widehat{V}^\perp - \nu _h \Delta _h \widehat{V} - \nu _z \partial _{zz} \widehat{V}\right) \nonumber \\&\quad - \underbrace{\left( \partial _t V+ V \cdot \nabla _h V + w \partial _z V + \nabla _h p + f_0 V^\perp - \nu _h \Delta _h V - \nu _z \partial _{zz} V\right) }_{=0}\Big \Vert _{H^\ell (\Omega )}\nonumber \\&\le \Vert \partial _t \widehat{V} - \partial _t V\Vert _{H^\ell (\Omega )} \nonumber \\&\quad + \Vert \widehat{V} \cdot \nabla _h \widehat{V} - V \cdot \nabla _h V\Vert _{H^\ell (\Omega )} + \Vert \widehat{w} \partial _z \widehat{V} - w \partial _z V\Vert _{H^\ell (\Omega )} + f_0 \Vert \widehat{V} - V\Vert _{H^\ell (\Omega )}\nonumber \\&\quad + \Vert \nabla _h \widehat{p} - \nabla _h p\Vert _{H^\ell (\Omega )} + \nu _h \Vert \Delta _h \widehat{V} - \Delta _h V\Vert _{H^\ell (\Omega )} + \nu _z \Vert \partial _{zz} \widehat{V} - \partial _{zz} V\Vert _{H^\ell (\Omega )}\nonumber \\&\le C_{\nu _h,\nu _z,C_{V}, C_{p}, C_{w} , C_{T}}\left( 1+ \Vert V\Vert _{C^k(\Omega )} + \Vert T\Vert _{C^k(\Omega )} + \Vert w\Vert _{C^{k-1}(\Omega )} + \frac{\ln ^{k-3}N}{N^2}\right) \nonumber \\&\quad \frac{1+\ln ^{\ell +2}N}{N^{k-\ell -2}}. \end{aligned}$$
(4.5)

Similarly to (4.5), one can calculate that

$$\begin{aligned} \begin{aligned}&\Vert \partial _z \widehat{p} + \widehat{T}\Vert _{H^\ell (\Omega )} \le \Vert \partial _z \widehat{p} - \partial _z p\Vert _{H^\ell (\Omega )} + \Vert T-\widehat{T}\Vert _{H^\ell (\Omega )} \le (C_p+ C_T)\frac{1+\ln ^{\ell +1}N}{N^{k-\ell -2}},\\&\Vert \nabla _h \cdot \widehat{V} + \partial _z \widehat{w}\Vert _{H^\ell (\Omega )} \le \Vert V-\widehat{V}\Vert _{H^{\ell +1}(\Omega )} + \Vert w-\widehat{w}\Vert _{H^{\ell +1}(\Omega )} \le (C_{V} + C_w) \frac{1+\ln ^{\ell +1}N}{N^{k-\ell -2}}, \end{aligned}\nonumber \\ \end{aligned}$$
(4.6)

and

$$\begin{aligned} \begin{aligned} \Big \Vert \partial _t \widehat{T}&+ \widehat{V} \cdot \nabla _h \widehat{T} + \widehat{w} \partial _z \widehat{T} - \kappa _h \Delta _h \widehat{T} - \kappa _z \partial _{zz} \widehat{T} - Q\Big \Vert _{H^\ell (\Omega )}\\ \le&\Vert \partial _t \widehat{T} - \partial _t T\Vert _{H^\ell (\Omega )} + \Vert \widehat{V} \cdot \nabla _h \widehat{T} - V \cdot \nabla _h T\Vert _{H^\ell (\Omega )} + \Vert \widehat{w} \partial _z \widehat{T} - w \partial _z T\Vert _{H^\ell (\Omega )}\\&+ \kappa _h \Vert \Delta _h \widehat{T} - \Delta _h T\Vert _{H^\ell (\Omega )} + \kappa _z \Vert \partial _{zz} \widehat{T} - \partial _{zz} T\Vert _{H^\ell (\Omega )}\\ \le&C_{\kappa _h,\kappa _z,C_{V}, C_{w}, C_{T}}\left( 1+ \Vert V\Vert _{C^k(\Omega )} + \Vert T\Vert _{C^k(\Omega )} + \Vert w\Vert _{C^{k-1}(\Omega )} + \frac{\ln ^{k-3}N}{N^2}\right) \frac{1+\ln ^{\ell +2}N}{N^{k-\ell -2}}. \end{aligned}\nonumber \\ \end{aligned}$$
(4.7)

The combination of (4.5)–(4.7) implies that

$$\begin{aligned} \mathcal {E}_G^i[\ell ;\theta ]&= \Bigg [\int _0^{{\mathcal {T}}}\Big (\Vert \mathcal R_{i,V}[\theta ]\Vert _{H^{\ell }({\mathcal {D}})}^2 + \Vert \mathcal R_{i,P}[\theta ]\Vert _{H^{\ell }({\mathcal {D}})}^2\nonumber \\&\quad +\Vert \mathcal R_{i,T}[\theta ]\Vert _{H^{\ell }({\mathcal {D}})}^2+\Vert \mathcal R_{i,div}[\theta ]\Vert _{H^{\ell }({\mathcal {D}})}^2\Big )dt\Bigg ]^{\frac{1}{2}}\nonumber \\&\le C_{\nu _h,\nu _z,\kappa _h,\kappa _z,{\mathcal {T}},C_{V}, C_{p}, C_{w} , C_{T}}\nonumber \\&\quad \left( 1+ \Vert V\Vert _{C^k(\Omega )} + \Vert T\Vert _{C^k(\Omega )} + \Vert w\Vert _{C^{k-1}(\Omega )} + \frac{\ln ^{k-3}N}{N^2}\right) \frac{1+\ln ^{\ell +2}N}{N^{k-\ell -2}}. \end{aligned}$$
(4.8)

Next we derive the estimate for \(\mathcal {E}_G^t[\ell ; \theta ]\). Note that \((V,T, \widehat{V}, \widehat{T})\in C^k\left( \Omega \right) \) and \((w,p, \widehat{w}, \widehat{p})\in C^{k-1}\left( \Omega \right) \). By applying the trace theorem and the fact that \(\Omega \) is a Lipschitz domain, for \(0\le \ell \le k-3\) and \(|\alpha |\le \ell \), we have for \(\varphi \in \{V_1, V_2, w, T, p\}\):

$$\begin{aligned} \begin{aligned}&\Vert D^\alpha \varphi - D^\alpha \widehat{\varphi }\Vert _{L^{\infty }(\partial \Omega )}\le C\Vert D^\alpha \varphi - D^\alpha \widehat{\varphi }\Vert _{W^{1,\infty }(\Omega )} \le C\Vert \varphi - \widehat{\varphi }\Vert _{W^{\ell +1,\infty }(\Omega )}. \end{aligned} \end{aligned}$$

Therefore, for \(0\le \ell \le k-5\), one has

$$\begin{aligned} \mathcal {E}_G^t[\ell ;\theta ]&\le \sum \limits _{i=1}^2\Vert (V_0)_i - \widehat{V}_i(t=0)\Vert _{H^\ell ({\mathcal {D}})} + \Vert T_0 - \widehat{T}(t=0)\Vert _{H^\ell ({\mathcal {D}})}\nonumber \\&\le C\sum \limits _{|\alpha |\le \ell }\left( \sum \limits _{i=1}^2\Vert D^\alpha V_i - D^\alpha \widehat{V}_i\Vert _{L^{\infty }(\partial \Omega )} + \Vert D^\alpha T - D^\alpha \widehat{T}\Vert _{L^{\infty }(\partial \Omega )}\right) \nonumber \\&\le C\left( \sum \limits _{i=1}^2\Vert V_i - \widehat{V}_i\Vert _{H^{\ell +3}( \Omega )} + \Vert T - \widehat{T}\Vert _{H^{\ell +3}( \Omega )}\right) \le (C_{V}+C_{T}) \frac{1+\ln ^{\ell +3}N}{N^{k-\ell -3}}. \end{aligned}$$
(4.9)

Regarding the estimate for \(\mathcal {E}_G^b[\ell ; \theta ]\), for \(|\alpha |\le \ell \), thanks to the boundary condition (1.2),

$$\begin{aligned}&\Big |D^\alpha \widehat{V}_i(x=1,y,z)- D^\alpha \widehat{V}_i(x=0,y,z)\Big | \\ \le&\Big |D^\alpha \widehat{V}_i(x=1,y,z)- D^\alpha V_i(x=1,y,z)\Big | + \Big |D^\alpha \widehat{V}_i(x=0,y,z)- D^\alpha V_i(x=0,y,z)\Big |\\&+ \underbrace{\Big |D^\alpha V_i(x=1,y,z)- D^\alpha V_i(x=0,y,z)\Big |}_{=0}, \end{aligned}$$

and therefore,

$$\begin{aligned}&\left( \int _0^{{\mathcal {T}}} \Big |D^\alpha \widehat{V}_i(x=1,y,z)- D^\alpha \widehat{V}_i(x=0,y,z)\Big |^2 dt \right) ^{\frac{1}{2}} \\&\le C\Vert D^\alpha V_i -D^\alpha \widehat{V}_i\Vert _{L^2(0,{\mathcal {T}}; L^{\infty }(\partial {\mathcal {D}}))}\\&\le C_{{\mathcal {T}}}\Vert V_i - \widehat{V}_i\Vert _{H^{\ell +3}(\Omega )} \le C_{V} \frac{1+\ln ^{\ell +3}N}{N^{k-\ell -3}}. \end{aligned}$$

One can bound the other terms similarly, and finally get

$$\begin{aligned} \begin{aligned} \mathcal {E}_G^b[\ell ;\theta ]&= \left( \int _0^{{\mathcal {T}}}\Vert \mathcal R_{b}[\ell ; \theta ]\Vert ^2_{L^2(\partial {\mathcal {D}})} dt\right) ^{\frac{1}{2}} \le (C_{V}+ C_w + C_p + C_{T}) \frac{1+\ln ^{\ell +3}N}{N^{k-\ell -4}}. \end{aligned}\nonumber \\ \end{aligned}$$
(4.10)

Finally, we recall that

$$\begin{aligned} \mathcal {E}_G^p[\ell ;\theta ]^2 = \int _0^{{\mathcal {T}}}\left( \Vert \widehat{V}\Vert _{H^{\ell + 3}({\mathcal {D}})}^2 + \Vert \widehat{p}\Vert _{H^{\ell + 3}({\mathcal {D}})}^2 + \Vert \widehat{w}\Vert _{H^{\ell + 3}({\mathcal {D}})}^2 + \Vert \widehat{T}\Vert _{H^{\ell + 3}({\mathcal {D}})}^2 \right) dt. \end{aligned}$$

Thanks to (4.1) we compute, for example,

$$\begin{aligned} \int _0^{{\mathcal {T}}} \Vert \widehat{w}\Vert _{H^{\ell + 3}({\mathcal {D}})}^2 dt&\le C \int _0^{{\mathcal {T}}} (\Vert w\Vert _{H^{\ell + 3}({\mathcal {D}})}^2 + \Vert w-\widehat{w}\Vert _{H^{\ell + 3}({\mathcal {D}})}^2) dt \\&\le C_{{\mathcal {T}}}\left( \Vert w\Vert _{C^{\ell + 3}(\Omega )}^2 + C_{w} \left( \frac{1+\ln ^{\ell +3}N}{N^{k-\ell -4}}\right) ^2\right) . \end{aligned}$$

As the true solution \((V,p,w,T)\) exists globally and

$$\begin{aligned} (V,T)\in C^k\left( \Omega \right) , \quad (w,p)\in C^{k-1}\left( \Omega \right) , \end{aligned}$$

for any given \({\mathcal {T}}>0\) and \(m\le k\), there exists some constant \(C_{{\mathcal {T}}, m}\) depending only on the time \({\mathcal {T}}\) and m such that

$$\begin{aligned} \Vert V\Vert _{C^m(\Omega )}^2 + \Vert w\Vert _{C^{m-1}(\Omega )}^2 + \Vert p\Vert _{C^{m-1}(\Omega )}^2 + \Vert T\Vert _{C^m(\Omega )}^2 \le C_{{\mathcal {T}}, m}. \end{aligned}$$
(4.11)

Indeed, by [45, 46] one has that \(C_{{\mathcal {T}}, m} = C_m\) is independent of the time \({\mathcal {T}}\) for Case 1, while the analogous result for Case 2 remains open. Since \(\ell +3\le k-2\), (4.11) implies that

$$\begin{aligned} \mathcal {E}_G^p[\ell ;\theta ]^2 \le C_{{\mathcal {T}}, k-1} + C_{{\mathcal {T}}} (C_V+C_p+C_w+C_T) \left( \frac{1+\ln ^{\ell +3}N}{N^{k-\ell -4}}\right) ^2. \end{aligned}$$

Now by taking \(\lambda \le \frac{\varepsilon ^2}{2C_{{\mathcal {T}}, k-1}}\), we obtain that

$$\begin{aligned} \lambda \mathcal {E}_G^p[\ell ;\theta ]^2 \le \frac{\varepsilon ^2}{2} + \lambda C_{{\mathcal {T}}} (C_V+C_p+C_w+C_T) \left( \frac{1+\ln ^{\ell +3}N}{N^{k-\ell -4}}\right) ^2. \end{aligned}$$
(4.12)

Noticing that the bounds (4.8), (4.9) and (4.10) and the second part of (4.12) are independent of the neural networks’ parameterized solution \((\widehat{V}_i, \widehat{w}, \widehat{p}, \widehat{T})\), one can pick N large enough such that each of the bounds (4.8), (4.9) and (4.10) is at most \(\frac{\varepsilon }{4}\) and the second part of (4.12) is at most \(\frac{\varepsilon ^2}{4}\), to eventually obtain that

$$\begin{aligned} \mathcal {E}_G[\ell ;\theta ] \le \varepsilon . \end{aligned}$$

4.2 Total error estimates

This section is dedicated to answering SubQ2 raised in the introduction: given PINNs \((\widehat{V}, \widehat{w}, \widehat{p}, \widehat{T}):=( V_\theta , w_\theta , p_\theta , T_\theta )\) with a small generalization error \(\mathcal {E}_G[s;\theta ]\), is the corresponding total error \({\mathcal {E}}[s;\theta ]\) also small?

In the sequel, we only consider Case 2: \(\nu _z=0\) and \(\kappa _z=0\). The proof of Case 1 is similar and simpler. To simplify the notation, we drop the \(\theta \) dependence in the residuals, for example, writing \(\mathcal R_{i,V}\) instead of \(\mathcal R_{i,V}[\theta ]\), when there is no confusion.

Theorem 4.2

(Answer of SubQ2) Let \(d\in \{2,3\}\), \({\mathcal {T}}>0\), \(k\ge 1\) and \(r=k+2\). Suppose that \(V_0\in H_e^r({\mathcal {D}})\), \(T_0\in H_o^r({\mathcal {D}})\), and \(Q\in C^{k-1}([0,{\mathcal {T}}]; H_o^r({\mathcal {D}}))\), and let \((V,w,p,T)\) be the unique strong solution to system (1.1). Let \((\widehat{V}, \widehat{w}, \widehat{p}, \widehat{T})\) be tanh neural networks that approximate \((V,w,p,T)\). Denote by \((V^*, w^*, p^*, T^*)= (V-\widehat{V}, w-\widehat{w}, p-\widehat{p}, T-\widehat{T})\). Then for \(0\le s\le k-1\), the total error satisfies

$$\begin{aligned} {\mathcal {E}}[s;\theta ]^2 \le&\ C_{\nu _h,{\mathcal {T}}} \left( \mathcal {E}_G^2[s;\theta ] \left( 1+\frac{1}{\sqrt{\lambda }}\right) + \mathcal {E}_G[s;\theta ] C_{{\mathcal {T}},s+1}^{\frac{1}{2}}\right) \nonumber \\&\exp \left( C_{\nu _h,\kappa _h} \left( C_{{\mathcal {T}},s+1} + \frac{C\mathcal {E}_G[s;\theta ]^2}{\lambda }\right) \right) , \end{aligned}$$
(4.13)

where \(\lambda \) appears in the definition of \(\mathcal {E}_G\) given in (2.7), and \(C_{{\mathcal {T}},s+1}\) is a constant bound defined in (4.28).

Remark 3

  1. (i)

Thanks to the global existence of strong solutions to the PEs with horizontal viscosity (Theorem 3.4), the constant \(C_{{\mathcal {T}},s+1}\) is finite for any \({\mathcal {T}}>0\). This is in contrast to the analysis for the 3D Navier–Stokes equations [24], where the existence of global strong solutions is still unknown, and therefore the total error may become very large at some finite time \({\mathcal {T}}>0\) even if the generalization error is arbitrarily small.

  2. (ii)

The constants appearing on the right-hand side of (4.13) do not depend on the neural network solution \((\widehat{V}, \widehat{w}, \widehat{p}, \widehat{T})\). Thus, if, in line with Theorem 4.1, one sets \(\lambda \) such that \(\frac{\mathcal {E}_G[s;\theta ]}{\sqrt{\lambda }} = \mathcal O(1)\), the total error \({\mathcal {E}}[s;\theta ]\) is guaranteed to be small as long as \(\mathcal {E}_G[s;\theta ]\) is sufficiently small. In other words, this estimate is independent of the training process, and is therefore a priori.

  3. (iii)

In contrast with some previous works [24, 65] that only considered \(L^2\) total error estimates, our approach provides the higher-order total error estimate \(\int _0^{{\mathcal {T}}} \big (\Vert V^*(t)\Vert _{H^s}^2 + \Vert T^*(t)\Vert _{H^s}^2\big ) dt\). To obtain this estimate, it is necessary to control the residuals in the corresponding higher-order Sobolev norms. This is a significant improvement over previous approaches, as the higher-order total error estimate provides more accurate and detailed information about the behavior of the solution, which can be crucial for certain applications such as fluid dynamics or solid mechanics.

Proof of Theorem 4.2

Let \({\mathcal {T}}>0\) be arbitrary but fixed. Since \((V,w,p,T)\) is the unique strong solution to system (1.1), from (2.4) one has

$$\begin{aligned}&\partial _t V^* + V^*\cdot \nabla _h V^* + V^*\cdot \nabla _h \widehat{V} + \widehat{V}\cdot \nabla _h V^* + w^*\partial _z V^* + w^*\partial _z \widehat{V} + \widehat{w}\partial _z V^* \nonumber \\&\qquad + f_0 V^{*\perp } - \nu _h \Delta _h V^* + \nabla _h p^* = -\mathcal R_{i,V}, \end{aligned}$$
(4.14a)
$$\begin{aligned}&\partial _z p^* + T^* = - \mathcal R_{i,p}, \end{aligned}$$
(4.14b)
$$\begin{aligned}&\nabla _h \cdot V^* + \partial _z w^* = -\mathcal R_{i,div},\end{aligned}$$
(4.14c)
$$\begin{aligned}&\partial _t T^* + V^*\cdot \nabla _h T^* + V^*\cdot \nabla _h \widehat{T} + \widehat{V}\cdot \nabla _h T^* + w^*\partial _z T^* + w^*\partial _z \widehat{T} + \widehat{w}\partial _z T^* = -\mathcal R_{i,T}. \end{aligned}$$
(4.14d)

Moreover, we know that \((\widehat{V}, \widehat{p}, \widehat{w}, \widehat{T})\) are smooth and

$$\begin{aligned} \begin{aligned}&(V^*, V, T^*, T)\in L^\infty (0,{\mathcal {T}}; H^r({\mathcal {D}})) \subset L^\infty (0,{\mathcal {T}}; W^{k,\infty }({\mathcal {D}})),\\&(\nabla _h V^*, \nabla _h V, \nabla _h T^*,\nabla _h T, w^*, w)\in L^2(0,{\mathcal {T}}; H^r({\mathcal {D}})) \subset L^2(0,{\mathcal {T}}; W^{k,\infty }({\mathcal {D}})), \end{aligned}\nonumber \\ \end{aligned}$$
(4.15)

and from (3.7),

$$\begin{aligned} (p^*, p) \in L^\infty (0,{\mathcal {T}}; H^r({\mathcal {D}}))\subset L^\infty (0,{\mathcal {T}}; W^{k,\infty }({\mathcal {D}})). \end{aligned}$$

Using equation (2.1), we can rewrite \(w^*\) as,

$$\begin{aligned} \begin{aligned} w^* = -\widehat{w}(z=0) - \int _0^z \nabla _h \cdot V^*(\varvec{x}', \widetilde{z}) d\widetilde{z} - \int _0^z (\nabla _h \cdot \widehat{V} + \partial _z \widehat{w})(\varvec{x}', \widetilde{z}) d\widetilde{z}. \end{aligned} \end{aligned}$$

Since \(\widehat{w}\) is smooth enough, one has

$$\begin{aligned} \Vert w^*\Vert _{H^s} \le C(\Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})} + \Vert \nabla _h V^*\Vert _{H^s} + \Vert \mathcal {R}_{i,div}\Vert _{H^s}). \end{aligned}$$
(4.16)

For each fixed \(0\le s \le k-1\), let \(\alpha \) be a multi-index with \(|\alpha |\le s\). Applying \(D^\alpha \) to system (4.14), taking the inner product of (4.14a) with \(D^\alpha V^*\) and of (4.14d) with \(D^\alpha T^*\), and then summing over all \(|\alpha |\le s\) gives

$$\begin{aligned} \begin{aligned}&\frac{1}{2} \frac{d}{dt} \left( \Vert V^*\Vert _{H^s}^2 + \Vert T^*\Vert _{H^s}^2\right) + \nu _h \Vert \nabla _h V^*\Vert _{H^s}^2 + \kappa _h \Vert \nabla _h T^*\Vert _{H^s}^2\\ =&- \sum \limits _{|\alpha |\le s} \Big (\left\langle D^\alpha (V^*\cdot \nabla _h V^* + V^*\cdot \nabla _h \widehat{V} + \widehat{V}\cdot \nabla _h V^* \right. \\&\quad \left. + w^*\partial _z V^* + w^*\partial _z \widehat{V} + \widehat{w}\partial _z V^*), D^\alpha V^* \right\rangle \\&\quad - \left\langle D^\alpha (V^*\cdot \nabla _h T^* + V^*\cdot \nabla _h \widehat{T} + \widehat{V}\cdot \nabla _h T^* + w^*\partial _z T^* + w^*\partial _z \widehat{T} + \widehat{w}\partial _z T^*), D^\alpha T^* \right\rangle \\&\quad + \left\langle D^\alpha \nabla _h p^*, D^\alpha V^* \right\rangle + \left\langle D^\alpha \mathcal R_{i,V}, D^\alpha V^* \right\rangle + \left\langle D^\alpha \mathcal R_{i,T}, D^\alpha T^* \right\rangle \Big ). \end{aligned}\nonumber \\ \end{aligned}$$
(4.17)

We first estimate the nonlinear terms. Applying the Sobolev inequality, the Hölder inequality, and Young’s inequality, and combining with (4.16), one has

$$\begin{aligned} \begin{aligned}&\left| \left\langle D^\alpha (V^*\cdot \nabla _h V^* + w^* \partial _z \widehat{V} + V^*\cdot \nabla _h \widehat{V} + \widehat{V}\cdot \nabla _h V^*), D^\alpha V^* \right\rangle \right| \\ \le&C\Big (\Vert V^*\Vert _{H^s}\Vert \nabla _h V^*\Vert _{W^{s,\infty }} + \Vert w^*\Vert _{H^s}\Vert \partial _z \widehat{V}\Vert _{W^{s,\infty }} \\&\quad + \Vert V^*\Vert _{H^s}\Vert \nabla _h \widehat{V}\Vert _{W^{s,\infty }} + \Vert \nabla _h V^*\Vert _{H^s} \Vert \widehat{V}\Vert _{W^{s,\infty }}\Big )\Vert V^*\Vert _{H^s}\\ \le&C\Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})}^2 + C\Vert \mathcal {R}_{i,div}\Vert _{H^s}^2 + C(\Vert V^*\Vert _{W^{s+1,\infty }} \\&\quad + \Vert \widehat{V}\Vert _{W^{s+1,\infty }})\Vert V^*\Vert _{H^s}^2 + C \Vert \widehat{V}\Vert _{W^{s+1,\infty }} \Vert \nabla _h V^*\Vert _{H^s} \Vert V^*\Vert _{H^s}\\ \le&C_{\nu _h}\left( 1+\Vert V^*\Vert ^2_{W^{s+1,\infty }} + \Vert \widehat{V}\Vert ^2_{W^{s+1,\infty }} \right) \Vert V^*\Vert _{H^s}^2 + \frac{\nu _h}{6} \Vert \nabla _h V^*\Vert _{H^s}^2 \\&\quad + C\Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})}^2 + C\Vert \mathcal {R}_{i,div}\Vert _{H^s}^2. \end{aligned}\nonumber \\ \end{aligned}$$
(4.18)

Similarly for T, one can deduce

$$\begin{aligned} \begin{aligned}&\left| \left\langle D^\alpha (V^*\cdot \nabla _h T^* + w^* \partial _z \widehat{T} + V^*\cdot \nabla _h \widehat{T} + \widehat{V} \cdot \nabla _h T^* ), D^\alpha T^* \right\rangle \right| \\ \le&C\Big (\Vert V^*\Vert _{H^s}\Vert \nabla _h T^*\Vert _{W^{s,\infty }} \\&\quad + \Vert w^*\Vert _{H^s}\Vert \partial _z \widehat{T}\Vert _{W^{s,\infty }} + \Vert V^*\Vert _{H^s}\Vert \nabla _h \widehat{T}\Vert _{W^{s,\infty }} + \Vert \nabla _h T^*\Vert _{H^s} \Vert \widehat{V}\Vert _{W^{s,\infty }}\Big )\Vert T^*\Vert _{H^s}\\ \le&C_{\nu _h,\kappa _h}\left( 1+\Vert T^*\Vert ^2_{W^{s+1,\infty }} + \Vert \widehat{T}\Vert ^2_{W^{s+1,\infty }} + \Vert \widehat{V}\Vert ^2_{W^{s+1,\infty }} \right) (\Vert V^*\Vert _{H^s}^2 + \Vert T^*\Vert _{H^s}^2)\\&+ \frac{\nu _h}{6} \Vert \nabla _h V^*\Vert _{H^s}^2+ \frac{\kappa _h}{2} \Vert \nabla _h T^*\Vert _{H^s}^2 + C\Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})}^2 + C\Vert \mathcal {R}_{i,div}\Vert _{H^s}^2. \end{aligned}\nonumber \\ \end{aligned}$$
(4.19)

For the rest of the nonlinear terms, we provide the details for \(s\ge 1\). The case of \(s=0\) follows easily. By the triangle inequality, one has

$$\begin{aligned} \begin{aligned}&\left| \left\langle D^\alpha (\widehat{w} \partial _z V^* + w^* \partial _z V^*), D^\alpha V^* \right\rangle \right| = \left| \left\langle D^\alpha (w \partial _z V^*), D^\alpha V^* \right\rangle \right| \\&\le \left| \left\langle D^\alpha ( w \partial _z V^*) - w \partial _z D^\alpha V^*, D^\alpha V^* \right\rangle \right| + \left| \left\langle w \partial _z D^\alpha V^*, D^\alpha V^* \right\rangle \right| := |I_1| + |I_2|. \end{aligned}\nonumber \\ \end{aligned}$$
(4.20)

Using the Hölder inequality, one has

$$\begin{aligned} |I_1| = \left| \int _{{\mathcal {D}}} \sum \limits _{|\beta |=0,\beta < \alpha }^{s-1} \left( {\begin{array}{c}\alpha \\ \beta \end{array}}\right) D^{\alpha -\beta } w D^\beta \partial _z V^* \cdot D^\alpha V^* d\varvec{x}\right| \le C\Vert w\Vert _{W^{s+1,\infty }}\Vert V^*\Vert _{H^s}^2.\nonumber \\ \end{aligned}$$
(4.21)

By integration by parts and thanks to the boundary condition of w, we have

$$\begin{aligned} I_2 = - \frac{1}{2} \int _{{\mathcal {D}}} |D^\alpha V^*|^2 \partial _z w d\varvec{x} \le C \Vert w\Vert _{W^{s+1,\infty }} \Vert V^*\Vert _{H^s}^2. \end{aligned}$$
(4.22)

A similar calculation for T yields

$$\begin{aligned} \left| \left\langle D^\alpha (\widehat{w} \partial _z T^* + w^* \partial _z T^*), D^\alpha T^* \right\rangle \right| \le C\Vert w\Vert _{W^{s+1,\infty }}\Vert T^*\Vert _{H^s}^2. \end{aligned}$$
(4.23)

Next, by integration by parts, and using (4.14b) and (4.14c), we have

$$\begin{aligned}&\left\langle D^\alpha \nabla _h p^*, D^\alpha V^* \right\rangle = \int _0^1 \int _{\partial \mathcal M} D^\alpha p^* D^\alpha V^*\cdot \textbf{n} d\sigma (\partial \mathcal M) dz - \int _{\mathcal {D}} D^\alpha p^* D^\alpha \nabla _h \cdot V^* d\varvec{x}\\&= \int _0^1 \int _{\partial \mathcal M} D^\alpha p^* D^\alpha V^*\cdot \textbf{n} d\sigma (\partial \mathcal M) dz + \int _{\mathcal {D}} D^\alpha p^* D^\alpha ( \partial _z w^* + \mathcal R_{i,div}) d\varvec{x}\\&= \int _{\partial {\mathcal {D}}} D^\alpha p^* (D^\alpha V^*, D^\alpha w^*)^\top \cdot \textbf{n} d\sigma (\partial {\mathcal {D}}) + \int _{\mathcal {D}} D^\alpha p^* D^\alpha \mathcal R_{i,div} d\varvec{x} \\&\quad -\int _{\mathcal {D}} D^\alpha \partial _z p^* D^\alpha w^* d\varvec{x}\\&= \int _{\partial {\mathcal {D}}} D^\alpha p^* (D^\alpha V^*, D^\alpha w^*)^\top \cdot \textbf{n} d\sigma (\partial {\mathcal {D}}) + \int _{\mathcal {D}} D^\alpha p^* D^\alpha \mathcal R_{i,div} d\varvec{x} \\&\quad + \int _{\mathcal {D}} D^\alpha (T^* + \mathcal R_{i,p}) D^\alpha w^* d\varvec{x}\\&:= I_1 + I_2 + I_3. \end{aligned}$$

Since the domain \({\mathcal {D}} = \mathcal M\times (0,1)\) is Lipschitz and \((D^\alpha p^*, D^\alpha V^*, D^\alpha w^*) \in W^{1,\infty }({\mathcal {D}})\), the trace theorem and the Hölder inequality yield

$$\begin{aligned} \begin{aligned} I_1&\le C\left( \Vert D^\alpha p^*\Vert _{L^\infty ( \partial {\mathcal {D}})} + \Vert D^\alpha V^*\Vert _{L^\infty ( \partial {\mathcal {D}})} + \Vert D^\alpha w^*\Vert _{L^\infty ( \partial {\mathcal {D}})}\right) \Vert \mathcal R_{b}[s]\Vert _{L^1(\partial {\mathcal {D}})}\\&\le C \left( \Vert D^\alpha p^*\Vert _{W^{1,\infty }} + \Vert D^\alpha V^*\Vert _{W^{1,\infty }} + \Vert D^\alpha w^*\Vert _{W^{1,\infty }}\right) \Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})}\\&\le C \left( \Vert p^*\Vert _{W^{s+1,\infty }} + \Vert V^*\Vert _{W^{s+1,\infty }} + \Vert w^*\Vert _{W^{s+1,\infty }}\right) \Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})}. \end{aligned}\nonumber \\ \end{aligned}$$
(4.24)

By the Cauchy–Schwarz inequality, we have

$$\begin{aligned} \begin{aligned} I_2 \le C \Vert p^*\Vert _{H^{s}} \Vert \mathcal R_{i,div}\Vert _{H^{s}}. \end{aligned} \end{aligned}$$
(4.25)

Thanks to (4.16), by the Cauchy–Schwarz inequality and Young’s inequality,

$$\begin{aligned} \begin{aligned} I_3&\le \Vert w^*\Vert _{H^s}\Vert \mathcal {R}_{i,p}\Vert _{H^s} + \Vert T^*\Vert _{H^s}(\Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})} + \Vert \nabla _h V^*\Vert _{H^s} + \Vert \mathcal {R}_{i,div}\Vert _{H^s})\\&\le \Vert w^*\Vert _{H^s}\Vert \mathcal {R}_{i,p}\Vert _{H^s}+ \Vert T^*\Vert _{H^s}(\Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})} + \Vert \mathcal {R}_{i,div}\Vert _{H^s}) + C_{\nu _h} \Vert T^*\Vert ^2_{H^s} \\&\quad +\frac{\nu _h}{6} \Vert \nabla _h V^*\Vert _{H^s}^2. \end{aligned}\nonumber \\ \end{aligned}$$
(4.26)

For the remaining two terms in (4.17), applying the Cauchy–Schwarz inequality gives

$$\begin{aligned} \begin{aligned} \left\langle D^\alpha \mathcal R_{i,V}, D^\alpha V^* \right\rangle + \left\langle D^\alpha \mathcal R_{i,T}, D^\alpha T^* \right\rangle&\le \Vert \mathcal R_{i,V}\Vert _{H^s} \Vert V^*\Vert _{H^s} + \Vert \mathcal R_{i,T}\Vert _{H^s} \Vert T^*\Vert _{H^s}\\&\le \Vert \mathcal R_{i,V}\Vert _{H^s} \Vert V^*\Vert _{W^{k,\infty }} + \Vert \mathcal R_{i,T}\Vert _{H^s} \Vert T^*\Vert _{W^{k,\infty }}. \end{aligned}\nonumber \\ \end{aligned}$$
(4.27)

Combining the estimates (4.17)–(4.27), one gets

$$\begin{aligned} \begin{aligned}&\frac{d}{dt} \left( \Vert V^*\Vert _{H^s}^2 + \Vert T^*\Vert _{H^s}^2\right) + \nu _h \Vert \nabla _h V^*\Vert _{H^s}^2 + \kappa _h \Vert \nabla _h T^*\Vert _{H^s}^2\\&\le C_{\nu _h,\kappa _h}\big (1+\Vert V^*\Vert ^2_{W^{s+1,\infty }}+\Vert T^*\Vert ^2_{W^{s+1,\infty }} + \Vert \widehat{V}\Vert ^2_{W^{s+1,\infty }} \\&\quad + \Vert \widehat{T}\Vert ^2_{W^{s+1,\infty }} + \Vert w\Vert ^2_{W^{s+1,\infty }} \big )(\Vert V^*\Vert _{H^s}^2 + \Vert T^*\Vert _{H^s}^2)\\&\quad + C_{\nu _h}\big (\Vert \mathcal R_{i,div}\Vert _{H^s}+ \Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})} + \Vert \mathcal R_{i,p}\Vert _{H^s} \\ {}&\quad + \Vert p^*\Vert _{W^{s+1,\infty }} + \Vert V^*\Vert _{W^{s+1,\infty }} + \Vert w^*\Vert _{W^{s+1,\infty }} + \Vert T^*\Vert _{W^{s+1,\infty }} \big )\\&\quad \times \big (\Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})} + \Vert \mathcal R_{i,div}\Vert _{H^s} \\&\quad + \Vert \mathcal R_{i,V}\Vert _{H^s} + \Vert \mathcal R_{i,T}\Vert _{H^s} + \Vert \mathcal R_{i,p}\Vert _{H^s} \big )\\&:= C_1(\Vert V^*\Vert _{H^s}^2 + \Vert T^*\Vert _{H^s}^2) + C_2 \big (\Vert \mathcal R_{b}[s]\Vert _{L^2(\partial {\mathcal {D}})} + \Vert \mathcal R_{i,div}\Vert _{H^s} \\ {}&\quad + \Vert \mathcal R_{i,V}\Vert _{H^s} + \Vert \mathcal R_{i,T}\Vert _{H^s} + \Vert \mathcal R_{i,p}\Vert _{H^s} \big ). \end{aligned} \end{aligned}$$

Thanks to Gronwall’s inequality, we obtain that, for any \(t\in [0,{\mathcal {T}}]\),

$$\begin{aligned} \Vert V^*(t)\Vert _{H^s}^2 + \Vert T^*(t)\Vert _{H^s}^2 \le&\left( {\mathcal {E}}_G^t[s;\theta ](t)^2 + \int _0^{t} C_2(\tilde{t})({\mathcal {E}}_G^i[s;\theta ] + {\mathcal {E}}_G^b[s;\theta ])(\tilde{t}) d \tilde{t}\right) \\&\exp \left( \int _0^{t} C_1(\tilde{t})d\tilde{t}\right) . \end{aligned}$$

Integrating in t over \([0,{\mathcal {T}}]\),

$$\begin{aligned} \int _0^{{\mathcal {T}}} \left( \Vert V^*(t)\Vert _{H^s}^2 + \Vert T^*(t)\Vert _{H^s}^2\right) dt&\le C_{{\mathcal {T}}}\left( {\mathcal {E}}_G[s;\theta ]^2 + {\mathcal {E}}_G[s;\theta ] \int _0^{{\mathcal {T}}} C_2(t) dt\right) \\&\exp \left( \int _0^{{\mathcal {T}}} C_1(t)d t\right) . \end{aligned}$$

The derivation so far yields constants \(C_1\) and \(C_2\) that depend on the PINN approximation \((\widehat{V}, \widehat{w}, \widehat{p}, \widehat{T})\); if these quantities are large, the total error may not be under control. To overcome this issue, we next derive upper bounds for \(C_1\) and \(C_2\) that are independent of the outputs of the neural networks.

First, we apply the triangle inequality to bound

$$\begin{aligned} \Vert \varphi ^*\Vert _{W^{s+1,\infty }} \le \Vert \varphi \Vert _{W^{s+1,\infty }} + \Vert \widehat{\varphi }\Vert _{W^{s+1,\infty }}, \quad \varphi \in \{V,p,w,T\}. \end{aligned}$$

This implies

$$\begin{aligned} C_1 \le C_{\nu _h,\kappa _h}\left( 1+\Vert V\Vert ^2_{W^{s+1,\infty }}+\Vert T\Vert ^2_{W^{s+1,\infty }} + \Vert w\Vert ^2_{W^{s+1,\infty }} + \Vert \widehat{V}\Vert ^2_{W^{s+1,\infty }} + \Vert \widehat{T}\Vert ^2_{W^{s+1,\infty }} \right) . \end{aligned}$$

As the true solutions V, T and w satisfy (4.15), there exists a bound \(C_{{\mathcal {T}},s+1}\) such that

$$\begin{aligned} \int _0^{{\mathcal {T}}} \left( \Vert V\Vert ^2_{W^{s+1,\infty }}+\Vert T\Vert ^2_{W^{s+1,\infty }}+ \Vert w\Vert ^2_{W^{s+1,\infty }} \right) dt\le C_{{\mathcal {T}},s+1} . \end{aligned}$$
(4.28)

For the PINN approximations \(\widehat{V}\) and \(\widehat{T}\), by the Sobolev inequality we have

$$\begin{aligned} \int _0^{{\mathcal {T}}} \left( \Vert \widehat{V}\Vert ^2_{W^{s+1,\infty }} + \Vert \widehat{T}\Vert ^2_{W^{s+1,\infty }}\right) dt&\le C \int _0^{{\mathcal {T}}} \left( \Vert \widehat{V}\Vert ^2_{H^{s+3}} + \Vert \widehat{T}\Vert ^2_{H^{s+3}}\right) dt \le C\mathcal {E}_G^p[s;\theta ]^2 \\&\le \frac{C\mathcal {E}_G[s;\theta ]^2}{\lambda }. \end{aligned}$$

Therefore,

$$\begin{aligned} \int _0^{{\mathcal {T}}} C_1(t) dt \le C_{\nu _h,\kappa _h} ( C_{{\mathcal {T}},s+1} + \frac{C\mathcal {E}_G[s;\theta ]^2}{\lambda }). \end{aligned}$$

Similarly, \(C_2\) can be bounded as

$$\begin{aligned} \int _0^{{\mathcal {T}}} C_2(t) dt \le C_{{\mathcal {T}}} \left( \int _0^{{\mathcal {T}}} C^2_2(t) dt\right) ^{\frac{1}{2}} \le C_{\nu _h,{\mathcal {T}}} \left( \mathcal {E}_G[s;\theta ] + C_{{\mathcal {T}}, s+1}^{\frac{1}{2}} + \frac{C\mathcal {E}_G[s;\theta ]}{\sqrt{\lambda }}\right) . \end{aligned}$$

This implies that

$$\begin{aligned} {\mathcal {E}}[s;\theta ]^2 \le C_{\nu _h,{\mathcal {T}}}&\left( \mathcal {E}_G^2[s;\theta ] (1+\frac{1}{\sqrt{\lambda }}) + \mathcal {E}_G[s;\theta ] C_{{\mathcal {T}},s+1}^{\frac{1}{2}}\right) \nonumber \\&\exp \left( C_{\nu _h,\kappa _h} \left( C_{{\mathcal {T}},s+1} + \frac{C\mathcal {E}_G[s;\theta ]^2}{\lambda }\right) \right) . \end{aligned}$$
(4.29)

4.3 Controlling the difference between generalization error and training error

In this section, we will answer question SubQ3: given PINNs \((\widehat{V}, \widehat{w}, \widehat{p}, \widehat{T}):=( V_\theta , w_\theta , p_\theta , T_\theta )\), can the difference between the corresponding generalization error and the training error be made arbitrarily small?

This question can be answered using well-established results on numerical quadrature rules. Given \(\Lambda \subset \mathbb {R}^d\) and \(f\in L^1(\Lambda )\), choose quadrature points \(y_m\in \Lambda \) and quadrature weights \(w_m>0\) for \(1\le m\le M\), and consider the approximation \(\frac{1}{M}\sum _{m=1}^M w_m f(y_m) \approx \int _\Lambda f(y)dy. \) The accuracy of this approximation depends on the chosen quadrature rule, the number of quadrature points M, and the regularity of f. In the PEs case, the problem is low-dimensional (\(d\le 4\)), which allows using standard deterministic numerical quadrature points. As in [24], we consider the midpoint rule: for \(l>0\), we partition \(\Lambda \) into \(M \sim (\frac{1}{l})^d\) cubes of edge length l and denote by \(\{y_m\}_{m=1}^M\) the midpoints of these cubes. The formula and accuracy of the midpoint rule \(\mathcal Q_M^\Lambda \) are then given by

$$\begin{aligned} \mathcal Q_M^\Lambda [f]:= \frac{1}{M}\sum _{m=1}^Mf(y_m), \qquad \left| \int _\Lambda f(y) dy - \mathcal Q_M^\Lambda [f]\right| \le C_f M^{-2/d} \le C_f l^2, \end{aligned}$$
(4.30)

where \(C_f \le C\Vert f\Vert _{C^2}\). We remark that for high-dimensional PDEs, one needs mesh-free methods, such as Monte Carlo sampling, to choose the points in \(\Lambda \) and avoid the curse of dimensionality; see, for instance, [76].
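As a concrete illustration of (4.30), the following minimal NumPy sketch (our own, not taken from the reference implementation) evaluates the composite midpoint rule on the unit cube \([0,1]^d\) and exhibits the \(\mathcal O(M^{-2/d})\) decay of the quadrature error for a smooth integrand.

```python
import numpy as np
from itertools import product

def midpoint_rule(f, d, n_per_dim):
    """Composite midpoint rule Q_M[f] for the integral of f over [0,1]^d, cf. (4.30).

    f          : callable mapping an array of shape (M, d) to shape (M,)
    d          : dimension of the integration domain
    n_per_dim  : sub-cubes per coordinate direction, so M = n_per_dim ** d
    """
    l = 1.0 / n_per_dim                                   # edge length of each sub-cube
    mids_1d = (np.arange(n_per_dim) + 0.5) * l            # midpoints along one coordinate
    pts = np.array(list(product(mids_1d, repeat=d)))      # all M midpoints y_m, shape (M, d)
    return np.mean(f(pts))                                # (1/M) * sum_m f(y_m)

# Error decays like M^{-2/d} = l^2: test with f(y) = exp(y_1 + y_2) on [0,1]^2.
f = lambda y: np.exp(y[:, 0] + y[:, 1])
exact = (np.e - 1.0) ** 2
for n in (4, 8, 16, 32):
    print(n ** 2, abs(midpoint_rule(f, 2, n) - exact))
```

Doubling the number of sub-intervals per coordinate approximately quarters the error, consistent with the \(\mathcal O(l^2)\) rate in (4.30).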

Let us fix the sample sets \({\mathcal {S}}=({\mathcal {S}}_i,{\mathcal {S}}_t,{\mathcal {S}}_b)\) appearing in (2.9) as the midpoints in each corresponding domain, that is,

$$\begin{aligned} \begin{aligned}&{\mathcal {S}}_i:= \{(t_n,x_n,y_n,z_n): (t_n,x_n,y_n,z_n) \text { are midpoints of cubes of edge length}\,l_i \\&\qquad \qquad \text { in } [0,{\mathcal {T}}]\times {\mathcal {D}} \},\\&{\mathcal {S}}_t:= \{(x_n,y_n,z_n): (x_n,y_n,z_n) \text { are midpoints of cubes of edge length}\,l_t \\&\qquad \qquad \text { in } {\mathcal {D}} \},\\&{\mathcal {S}}_b:= \{(t_n,x_n,y_n,z_n): (t_n,x_n,y_n,z_n) \text { are midpoints of cubes of edge length}\, l_b \\&\qquad \qquad \text { in } [0,{\mathcal {T}}]\times \partial {\mathcal {D}} \}, \end{aligned}\nonumber \\ \end{aligned}$$
(4.31)

and the edge lengths and quadrature weights satisfy

$$\begin{aligned} l_i \sim | {\mathcal {S}}_i|^{-\frac{1}{d+1}}, \quad l_t \sim | {\mathcal {S}}_t|^{-\frac{1}{d}}, \quad l_b \sim | {\mathcal {S}}_b|^{-\frac{1}{d}}, \quad w_n^i = \frac{1}{| {\mathcal {S}}_i|}, \quad w_n^t = \frac{1}{| {\mathcal {S}}_t|}, \quad w_n^b = \frac{1}{| {\mathcal {S}}_b|}.\nonumber \\ \end{aligned}$$
(4.32)
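For concreteness, the interior and initial sample sets in (4.31), with the weights (4.32), can be generated as in the sketch below (our own illustration for \(d=2\), \({\mathcal {D}}=[0,1]^2\), and \({\mathcal {T}}=1\)); the boundary set \({\mathcal {S}}_b\) is built analogously on each face of \([0,{\mathcal {T}}]\times \partial {\mathcal {D}}\).

```python
import numpy as np
from itertools import product

def midpoints(bounds, n_per_dim):
    """Midpoints of a uniform partition of the box `bounds` into n_per_dim
    sub-intervals per coordinate (equal-edge cubes when the box is a cube).

    bounds : list of (lo, hi) pairs, one per coordinate
    Returns an array of shape (n_per_dim ** len(bounds), len(bounds)).
    """
    axes = [lo + (hi - lo) * (np.arange(n_per_dim) + 0.5) / n_per_dim
            for lo, hi in bounds]
    return np.array(list(product(*axes)))

T_final, n = 1.0, 10
S_i = midpoints([(0.0, T_final), (0.0, 1.0), (0.0, 1.0)], n)   # interior points (t, x, z)
S_t = midpoints([(0.0, 1.0), (0.0, 1.0)], n)                   # initial-time points (x, z)
w_i, w_t = 1.0 / len(S_i), 1.0 / len(S_t)                      # quadrature weights as in (4.32)
```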

Based on the estimate (4.30), we have the following theorem concerning the control of the generalization error by the training error.

Theorem 4.3

(Answer of SubQ3) Suppose that \(k\ge 5\) and \(0\le s\le k-5\), and let \(n=\max \{k^2,d(s^2-s+k+d)\}\). Consider the PINNs \(( V_\theta , w_\theta , p_\theta , T_\theta )\) with generalization error \(\mathcal {E}_G[s;\theta ]\) and training error \(\mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]\). Then their difference is controlled by the size of \({\mathcal {S}}\):

$$\begin{aligned} \Big |\mathcal {E}_G[s;\theta ]^2 - \mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]^2\Big |\le C_{i} |{\mathcal {S}}_i|^{-\frac{2}{d+1}} + C_{t} |{\mathcal {S}}_t|^{-\frac{2}{d}} + C_{b} |{\mathcal {S}}_b|^{-\frac{2}{d}}, \end{aligned}$$

where

$$\begin{aligned} C_{i}&= C_s N^{(3d+s+7)(6s+24)}(\ln N)^{6(s+4)^2} +C_Q + \lambda C_s N^{(3d+s+8)(6s+30)}(\ln N)^{6(s+5)^2},\\ C_{t}&= C_{b} = C_s N^{(3d+s+5)(6s+12)}(\ln N)^{6(s+2)^2}. \end{aligned}$$

Proof

Following from (4.30) and the fact that \(( V_\theta , w_\theta , p_\theta , T_\theta )\) are smooth functions, one obtains the estimate

$$\begin{aligned}&\left| \int _0^{{\mathcal {T}}}\Vert \mathcal R_{i,V}[\theta ]\Vert _{H^s}^2 dt - \sum \limits _{(t_n,x_n,y_n,z_n)\in {\mathcal {S}}_i}\sum \limits _{|\alpha |\le s}w_n^i|D^\alpha \mathcal R_{i,V}[\theta ](t_n,x_n,y_n,z_n)|^2 \right| \\ \le&\sum \limits _{|\alpha |\le s} \left| \Vert D^\alpha \mathcal R_{i,V}[\theta ]\Vert ^2_{L^2(\Omega )} - \sum \limits _{(t_n,x_n,y_n,z_n)\in {\mathcal {S}}_i}w_n^i|D^\alpha \mathcal R_{i,V}[\theta ](t_n,x_n,y_n,z_n)|^2 \right| \\ \le&C \Vert \mathcal R_{i,V}[\theta ]\Vert _{C^{2}([0,{\mathcal {T}}]; C^{s+2}({\mathcal {D}}))}^2 |{\mathcal {S}}_i|^{-\frac{2}{d+1}}. \end{aligned}$$

Let \(\theta \in \Theta _{L, W, R}\) with \(L=3\). Since \(n=\max \{k^2,d(s^2-s+k+d)\}\), by Theorem 4.1 and Lemma A.1 we know that \(R=\mathcal O(N\ln (N))\) and \(W=\mathcal O(N^{d+1})\). Now, by virtue of Lemma A.2, we can compute that

$$\begin{aligned}&C \Vert \mathcal R_{i,V}[\theta ]\Vert _{C^{2}([0,{\mathcal {T}}]; C^{s+2}({\mathcal {D}}))}^2 |{\mathcal {S}}_i|^{-\frac{2}{d+1}} \\&\quad \le C_{s} (W^3 R^{s+4})^{6(s+4)} |{\mathcal {S}}_i|^{-\frac{2}{d+1}}\\&\quad \le C_s (N^{3d+3} N^{s+4}\ln ^{s+4}(N))^{6s+24} |{\mathcal {S}}_i|^{-\frac{2}{d+1}} \le C_s N^{(3d+s+7)(6s+24)}(\ln N)^{6(s+4)^2} |{\mathcal {S}}_i|^{-\frac{2}{d+1}}. \end{aligned}$$

Similarly, we obtain the following bounds:

$$\begin{aligned}&\Big |\mathcal {E}_G^i[s;\theta ]^2 - \mathcal {E}_{T}^i[s;\theta ;{\mathcal {S}}_i]^2 \Big |\le (C_s N^{(3d+s+7)(6s+24)}(\ln N)^{6(s+4)^2} +C_Q) |{\mathcal {S}}_i|^{-\frac{2}{d+1}},\\&\Big |\mathcal {E}_G^t[s;\theta ]^2 - \mathcal {E}_{T}^t[s;\theta ;{\mathcal {S}}_t]^2 \Big |\le C_s N^{(3d+s+5)(6s+12)}(\ln N)^{6(s+2)^2} |{\mathcal {S}}_t|^{-\frac{2}{d}},\\&\Big |\mathcal {E}_G^b[s;\theta ]^2 - \mathcal {E}_{T}^b[s;\theta ;{\mathcal {S}}_b]^2 \Big |\le C_s N^{(3d+s+5)(6s+12)}(\ln N)^{6(s+2)^2} |{\mathcal {S}}_b|^{-\frac{2}{d}}. \end{aligned}$$

Finally for the penalty term we have

$$\begin{aligned} \Big |\lambda \mathcal {E}_G^p[s;\theta ]^2 - \lambda \mathcal {E}_{T}^p[s;\theta ;{\mathcal {S}}_i]^2 \Big |\le \lambda C_s N^{(3d+s+8)(6s+30)}(\ln N)^{6(s+5)^2} |{\mathcal {S}}_i|^{-\frac{2}{d+1}}. \end{aligned}$$

Remark 4

The above result is an a priori estimate in the sense that \(\Big |\mathcal {E}_G[s;\theta ]^2 - \mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]^2\Big |\) is controlled by a quantity purely depending on the neural network architectures before training.

From the result above, we see that, in order to have a small difference between \(\mathcal {E}_G[s;\theta ]\) and \(\mathcal {E}_{T}[s;\theta ]\), one needs to pick sufficiently large sample sets \({\mathcal {S}}=({\mathcal {S}}_i,{\mathcal {S}}_t, {\mathcal {S}}_b)\), of the order of:

$$\begin{aligned} |{\mathcal {S}}_i|&= \mathcal O\Big (\max \{\lambda N^{(3d+s+8)(6s+30)(d+1)} (\ln N)^{6(d+1)(s+5)^2},\nonumber \\&N^{(3d+s+7)(6s+24)(d+1)} (\ln N)^{6(d+1)(s+4)^2} \}\Big ), \nonumber \\ |{\mathcal {S}}_b|&= |{\mathcal {S}}_t| = \mathcal O\left( N^{(3d+s+5)(6s+12)d} (\ln N)^{6d(s+2)^2}\right) . \end{aligned}$$
(4.33)

4.4 Answers to Q1 and Q2

Combining the answers to SubQ1–SubQ3, i.e., Theorems 4.1–4.3, we have the following theorem, which answers Q1 and Q2.

Theorem 4.4

Let \(d\in \{2,3\}\), \({\mathcal {T}}>0\), \(k\ge 5\), \(r> \frac{d}{2} + 2k\), \(0\le s\le k-5\), \(n=\max \{k^2,d(s^2-s+k+d)\}\), and assume that \(V_0\in H_e^r({\mathcal {D}})\), \(T_0\in H_o^r({\mathcal {D}})\), and \(Q\in C^{k-1}([0,{\mathcal {T}}]; H_o^r({\mathcal {D}}))\). Let (VwpT) be the unique strong solution to system (1.1).

  1. (i)

    (answer of Q1) For every \(N>5\), there exist tanh neural networks \((V_\theta ,w_\theta ,p_\theta ,T_\theta )\) with two hidden layers, of widths at most and such that for every \(0\le s\le k-5\),

    $$\begin{aligned} \mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]^2 \le C\left( \frac{(\lambda +1)(1+\ln ^{2s+6} N)}{N^{2k-2s-8}} + \lambda \right) + C_{i} |{\mathcal {S}}_i|^{-\frac{2}{d+1}} + C_{t} |{\mathcal {S}}_t|^{-\frac{2}{d}} + C_{b} |{\mathcal {S}}_b|^{-\frac{2}{d}}, \end{aligned}$$

where the constants \(C, C_{i}, C_b, C_t\) are defined based on the constants appearing in Theorems 4.1 and 4.3. For arbitrary \(\varepsilon >0\), one can first choose \(\lambda \) small enough, and then either take N small and k large enough so that \(2k-2s-8\) is large enough, or take N large enough and \(| {\mathcal {S}}_i|, | {\mathcal {S}}_b|, | {\mathcal {S}}_t|\) large enough according to (4.33), so that \(\mathcal {E}_{T}[s;\theta ;{\mathcal {S}}]^2 < \varepsilon \).

  2. (ii)

    (answer of Q2) For \(0\le s \le 2k-1\), the total error satisfies

    $$\begin{aligned} {\mathcal {E}}[s;\theta ]^2 \le&C_{\nu _h,{\mathcal {T}}} \Big ((\mathcal {E}_{T}^2[s;\theta ] + C_{i} |{\mathcal {S}}_i|^{-\frac{2}{d+1}} + C_{t} |{\mathcal {S}}_t|^{-\frac{2}{d}} + C_{b} |{\mathcal {S}}_b|^{-\frac{2}{d}}) (1+\frac{1}{\sqrt{\lambda }})\\&\hspace{1.5cm}+ \left( \mathcal {E}_{T}[s;\theta ] + (C_{i} |{\mathcal {S}}_i|^{-\frac{2}{d+1}} + C_{t} |{\mathcal {S}}_t|^{-\frac{2}{d}} + C_{b} |{\mathcal {S}}_b|^{-\frac{2}{d}})^{\frac{1}{2}}\right) C_{{\mathcal {T}},s+1}^{\frac{1}{2}}\Big )\\&\times \exp \left( C_{\nu _h,\kappa _h} \left( C_{{\mathcal {T}},s+1} + \frac{C(\mathcal {E}_{T}[s;\theta ]^2+ C_{i} |{\mathcal {S}}_i|^{-\frac{2}{d+1}} + C_{t} |{\mathcal {S}}_t|^{-\frac{2}{d}} + C_{b} |{\mathcal {S}}_b|^{-\frac{2}{d}})}{\lambda }\right) \right) , \end{aligned}$$

    where the constants are defined based on the constants appearing in Theorems 4.2 and 4.3. In particular, when the training error is small enough and \(C_{i} |{\mathcal {S}}_i|^{-\frac{2}{d+1}} + C_{t} |{\mathcal {S}}_t|^{-\frac{2}{d}} + C_{b} |{\mathcal {S}}_b|^{-\frac{2}{d}}\) is small enough, the total error will be small.

Proof

The proof follows from the proofs of Theorems 4.1–4.3.

Remark 5

Theorem 4.4 implies that setting \(s>0\) in the training error (2.8) guarantees control of the total error (2.10) in the \(H^s\) norm. However, it may be unnecessary to use \(s>0\) in (2.8) if one only needs to bound (2.10) in the \(L^2\) norm. See the next section on the numerical experiments for more details.

4.5 Conclusion and discussion

In this paper, we investigate the error analysis for PINNs approximating the viscous PEs. We answer Q1 and Q2 by giving positive results for SubQ1–SubQ3. In particular, all the estimates we obtain are a priori estimates, meaning that they depend only on the PDE solution, the choice of neural network architectures, and the sample size in the quadrature approximation, but not on the actual trained network parameters. Such estimates are crucial when designing a model architecture and selecting hyperparameters prior to training.

Our key step in obtaining such a priori estimates is to include a penalty term in the generalization error \(\mathcal {E}_G[s;\theta ]\) and the training error \(\mathcal {E}_{T}[s;\theta ]\). This idea is inspired by [4] in the study of PINNs for the 2D NSE. In fact, such a priori estimates hold for both the 2D NSE and the 3D viscous PEs for any time \({\mathcal {T}}>0\), but are only true on a short time interval for the 3D NSE. The main reason is that the global well-posedness of the 3D NSE remains open and is one of the most challenging mathematical problems. To bypass this issue and still get a priori estimates for the 3D NSE, [24] instead requires high regularity \(H^k\) with \(k>6(3d+8)\) for the initial condition.

Our result is the first to provide higher-order error estimates under the mild regularity assumption \(H^k\) with \(k \ge 5\) on the initial condition. In particular, we prove that one can control the total error \({\mathcal {E}}[s;\theta ]\) for \(s\ge 0\), provided that \(\mathcal {E}_{T}[s;\theta ]\) with the same order s is used as the loss function during training.

5 Numerical experiments

For the numerical experiments, we consider the 2D case and set \(f_0 = 0\). System (1.1) reduces to

$$\begin{aligned}&u_t + uu_x + wu_z -\nu _h u_{xx} - \nu _z u_{zz} + p_x = 0 , \end{aligned}$$
(5.1a)
$$\begin{aligned}&\partial _z p + T= 0, \end{aligned}$$
(5.1b)
$$\begin{aligned}&u_x + w_z =0, \end{aligned}$$
(5.1c)
$$\begin{aligned}&T_t + uT_x + wT_z -\kappa _h T_{xx} - \kappa _z T_{zz} = Q. \end{aligned}$$
(5.1d)
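To illustrate how the interior residuals of (5.1) enter the training error, the following is a minimal PyTorch-style sketch, written under our own assumptions rather than taken from the implementation used below (which is based on DeepXDE): the module `net`, mapping \((t,x,z)\) to \((u,w,p,T)\), and the source callable `Q_fn` are hypothetical placeholders.

```python
import torch

def deriv(out, var):
    """First derivative of the field `out` with respect to the coordinate tensor `var`."""
    return torch.autograd.grad(out, var, grad_outputs=torch.ones_like(out),
                               create_graph=True)[0]

def interior_residuals(net, Q_fn, t, x, z, nu_h, nu_z, kappa_h, kappa_z):
    """Pointwise residuals of (5.1a)-(5.1d) at collocation points (t, x, z)."""
    t = t.clone().requires_grad_(True)
    x = x.clone().requires_grad_(True)
    z = z.clone().requires_grad_(True)
    u, w, p, T = net(t, x, z)                      # hypothetical network outputs

    u_t, u_x, u_z = deriv(u, t), deriv(u, x), deriv(u, z)
    u_xx, u_zz = deriv(u_x, x), deriv(u_z, z)
    T_t, T_x, T_z = deriv(T, t), deriv(T, x), deriv(T, z)
    T_xx, T_zz = deriv(T_x, x), deriv(T_z, z)
    p_x, p_z = deriv(p, x), deriv(p, z)
    w_z = deriv(w, z)

    r_V = u_t + u * u_x + w * u_z - nu_h * u_xx - nu_z * u_zz + p_x                     # (5.1a)
    r_p = p_z + T                                                                       # (5.1b)
    r_div = u_x + w_z                                                                   # (5.1c)
    r_T = T_t + u * T_x + w * T_z - kappa_h * T_xx - kappa_z * T_zz - Q_fn(t, x, z)     # (5.1d)
    return r_V, r_p, r_div, r_T
```

The \(L^2\) (\(s=0\)) or \(H^1\) (\(s=1\)) training error is then obtained by evaluating these residuals (and, for \(s=1\), their first derivatives) at the quadrature points and weights described in Sect. 4.3.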

We consider the following Taylor-Green vortex as the benchmark on the domain \((x,z) \in [0, 1]^2\) and \(t \in [0, 1]\):

$$\begin{aligned} {\left\{ \begin{array}{ll} u = -\sin (2\pi x) \cos (2\pi z) \exp (-4\pi ^2(\nu _h+\nu _z)t)\\ w= \cos (2\pi x) \sin (2\pi z) \exp (-4\pi ^2(\nu _h+\nu _z)t)\\ p = \frac{1}{4} \cos (4\pi x) \exp (-8\pi ^2(\nu _h+\nu _z)t) + \frac{1}{2\pi } \cos (2\pi z) \exp (-4\pi ^2\kappa _zt)\\ T = \sin (2\pi z) \exp (-4\pi ^2 \kappa _zt)\\ Q = \pi \cos (2\pi x) \sin (4\pi z) \exp (-4\pi ^2 (\nu _h+\nu _z + \kappa _z ) t). \end{array}\right. } \end{aligned}$$
(5.2)

In this section, we show the performance of PINNs approximating system (5.1) using the benchmark solution (5.2) with \(\nu _z = \nu _h = \kappa _z = \kappa _h = 0.01\) (Case 1) and \(\nu _z = \kappa _z = 0\), \(\nu _h = \kappa _h = 0.01\) (Case 2). We report results obtained by setting \(s=0\) (\(L^2\) residuals) or \(s=1\) (\(H^1\) residuals) in the training error (2.8) with \(\lambda = 0\) during training, and compare the resulting \(L^2\) and \(H^1\) total errors (2.10).
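For reproducibility, the benchmark fields (5.2) and the corresponding source term can be evaluated directly; the sketch below is our own NumPy transcription of (5.2), together with the two parameter settings used in the experiments (the value of \(\kappa _h\) does not enter the formulas in (5.2)).

```python
import numpy as np

def taylor_green(t, x, z, nu_h, nu_z, kappa_z):
    """Exact benchmark fields (5.2) on (x, z) in [0,1]^2, t in [0,1]."""
    e_v = np.exp(-4 * np.pi**2 * (nu_h + nu_z) * t)          # decay factor of the velocity
    e_T = np.exp(-4 * np.pi**2 * kappa_z * t)                # decay factor of the temperature
    u = -np.sin(2 * np.pi * x) * np.cos(2 * np.pi * z) * e_v
    w = np.cos(2 * np.pi * x) * np.sin(2 * np.pi * z) * e_v
    p = 0.25 * np.cos(4 * np.pi * x) * e_v**2 + np.cos(2 * np.pi * z) * e_T / (2 * np.pi)
    T = np.sin(2 * np.pi * z) * e_T
    Q = (np.pi * np.cos(2 * np.pi * x) * np.sin(4 * np.pi * z)
         * np.exp(-4 * np.pi**2 * (nu_h + nu_z + kappa_z) * t))
    return u, w, p, T, Q

# Case 1: full viscosity/diffusivity; Case 2: horizontal viscosity/diffusivity only.
case1 = dict(nu_h=0.01, nu_z=0.01, kappa_z=0.01)   # kappa_h = 0.01 (not needed in (5.2))
case2 = dict(nu_h=0.01, nu_z=0.0,  kappa_z=0.0)    # kappa_h = 0.01
```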

Remark 6

The reason to set \(\lambda = 0\) is twofold: evaluating the \(H^{s+3}\) norms would slow down the algorithm, since many higher-order derivatives must be computed, and we observe in the experiments that the total error is already sufficiently small without including the \(\mathcal {E}_{T}^p\) term in (2.8). In other words, the inclusion of \(\mathcal {E}_{T}^p\) is rather technical and mainly serves to provide a priori error estimates rather than a posteriori ones.

Fig. 1

\(L^2\) and \(H^1\) errors as a function of t between the PINN solutions and the Taylor-Green vortex benchmark (5.2) with \(\nu _z = \nu _h = \kappa _z = \kappa _h = 0.01\) (Case 1). The notations \(s=0\) and \(s=1\) represent the \(L^2\) and \(H^1\) residuals, respectively, used in the training.

For the PINN architecture, we make use of four fully connected multilayer perceptrons (MLPs), one for each of the unknown functions \((u,w,p,T)\), where each MLP consists of 2 hidden layers with 32 neurons per layer. In all cases, the quadrature for minimizing the training error is computed at equally spaced points using the midpoint rule, taking 5751 points in the interior of the spatial domain, 1024 points at the initial time \(t=0\), and 544 points on the spatial boundary. Following the midpoint rule, the quadrature weights in (2.9) are \(w^i_n = 1/5751\), \(w^t_n = 1/1024\), and \(w^b_n = 1/544\) for all n. We use the Adam algorithm with learning rate \(10^{-4}\) to minimize the training residuals, and the activation function for all networks is the hyperbolic tangent.
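As a rough sketch of this setup (in plain PyTorch, under our own assumptions; the actual experiments use the DeepXDE library, see the end of this section), the four tanh MLPs and the optimizer can be assembled as follows; the helper `training_error` in the commented loop is a placeholder for the loss (2.8).

```python
import torch
import torch.nn as nn

def make_mlp(in_dim=3, width=32, out_dim=1):
    """One fully connected tanh MLP with two hidden layers of `width` neurons."""
    return nn.Sequential(
        nn.Linear(in_dim, width), nn.Tanh(),
        nn.Linear(width, width), nn.Tanh(),
        nn.Linear(width, out_dim),
    )

# One network per unknown function (u, w, p, T); inputs are (t, x, z).
nets = {name: make_mlp() for name in ("u", "w", "p", "T")}
params = [q for net in nets.values() for q in net.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)

# Schematic training loop: evaluate the interior, initial and boundary residuals at the
# midpoint quadrature points with weights 1/5751, 1/1024 and 1/544, assemble the training
# error (2.8) for the chosen order s, and take an Adam step.
# for step in range(num_steps):
#     loss = training_error(nets, interior_pts, initial_pts, boundary_pts, s=s)  # placeholder
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```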

Figures 1 and 2 depict the \(L^2\) and \(H^1\) errors of the PINNs after training with the residuals for \(s = 0\) and \(s = 1\). Since the pressure p is only defined up to a constant, we plot the \(L^2\) errors of \(\partial _x p_\theta \) and \(\partial _z p_\theta \) rather than the \(L^2\) or \(H^1\) error of \(p_\theta \) itself. Tables 1 and 2 give the absolute and relative total errors \({\mathcal {E}}[s;\theta ]\) defined in (2.10) with \(s=0\) or \(s=1\), for PINNs trained using the \(L^2\) or \(H^1\) residuals.

Fig. 2

\(L^2\) and \(H^1\) errors as a function of t between the PINN solutions and the Taylor-Green vortex benchmark (5.2) with \(\nu _z = \kappa _z = 0\), \(\nu _h = \kappa _h = 0.01\) (Case 2). The notations \(s=0\) and \(s=1\) represent the \(L^2\) and \(H^1\) residuals, respectively, used in the training.

As shown in the figures and tables, the \(L^2\) errors of the PINNs trained with the \(H^1\) residuals are not significantly better than those trained with the \(L^2\) residuals. We attribute this to the fact that the solutions are already learned well under the \(L^2\) residuals, which are sufficient to control the \(L^2\) error. However, as expected from the analysis in Sect. 4, we observe a noticeably smaller error in the \(H^1\) norm (and in the \(L^2\) norm for \(\partial _x p_\theta \) and \(\partial _z p_\theta \)) when training with the \(s=1\) residuals, indicating that the derivatives of the unknown functions are learned better under the \(H^1\) residuals.

Table 1 Case 1. Absolute and relative total error \({\mathcal {E}}[s;\theta ]\) for \(s = 0, 1\) for the PINNs trained by minimizing the training error \({\mathcal {E}}_T[r; \theta ;{\mathcal {S}}]\) for \(r=0, 1\)
Table 2 Case 2. Absolute and relative total error \({\mathcal {E}}[s;\theta ]\) for \(s = 0, 1\) for the PINNs trained by minimizing the training error \({\mathcal {E}}_T[r; \theta ;{\mathcal {S}}]\) for \(r=0, 1\)

The PINNs and loss functions for the training error were all implemented using the DeepXDE library [62]. The code for the results in this paper can be found at https://github.com/alanraydan/PINN-PE.