Inverse Dirichlet weighting enables reliable training of physics informed neural networks

Suryanarayana Maddu; Dominik Sturm; Christian L Müller; Ivo F Sbalzarini

doi:10.1088/2632-2153/ac3712

1. Introduction

Data-driven modeling has emerged as a powerful and complementary approach to first-principles modeling. It was made possible by advances in imaging and measurement technology, computing power, and innovations in machine learning, in particular neural networks. The success of neural networks in data-driven modeling can be attributed to their powerful function approximation properties for diverse functional forms, from image recognition [1] and natural language processing [2] to learning policies in complex environments [3].

Recently, there has been a surge in exploiting the versatile approximation properties of neural networks for surrogate modeling in scientific computing and for data-driven modeling of physical systems [4, 5]. These applications hinge on the approximation properties of neural networks for Sobolev-regular functions [6, 7]. The popular physics informed neural networks (PINNs) rely on knowing a differential equation model of the system in order to solve a soft-constrained optimization problem [5, 8]. Thanks to their mesh-free character, PINNs can be used for both forward and inverse modeling in domains including material science [9–11], fluid dynamics [8, 12–14] and turbulence [15, 16], biology [17], medicine [18, 19], earth science [20], mechanics [21] and uncertainty quantification [22], as well as for solving stochastic [23], high-dimensional [24] and fractional differential equations [25].

The wide application scope of PINNs rests on their parameterization as deep neural networks, which can be tuned by minimizing a weighted sum of data-fitting and equation-fitting objectives. However, finding network parameters that impartially optimize for several objectives is challenging when objectives impose conflicting requirements. In addition, PINNs suffer from spectral bias [26–29], potentially resulting in large discrepancies between convergence rates of different objectives during training [30]. For multi-task learning (MTL) problems, this issue has been addressed by various strategies for relative weighting of conflicting objectives, e.g. based on uncertainty [31], gradient normalization [32] or Pareto optimality [33]. The PINN community in parallel has developed heuristics for loss weighting with demonstrated gains in accuracy and convergence [34, 35]. However, there is neither consensus on when to use such strategies in PINNs, nor are there optimal benchmark solutions for objectives based on differential equations.

Here, we fill this gap by characterizing training pathologies of PINNs and providing criteria for their occurrence. We mathematically explain these pathologies by showing a connection between PINNs and Sobolev training, and we propose a strategy for loss weighting in PINNs based on the Dirichlet energy of the task-specific gradients. We show that this strategy reduces optimization bias and protects against catastrophic forgetting. We evaluate the proposed inverse Dirichlet weighting by comparing with Sobolev training in a case where provably optimal weights can be derived, and by empirical comparison with two conceptually different state-of-the-art PINN weighting approaches.

2. Training a physics informed neural network is a multi-objective optimization problem

An optimization problem involving multiple tasks or objectives can be written as:

$\begin{equation} \textrm{minimize } \mathcal{L}_k(\boldsymbol{\theta}^\textrm{sh}, \boldsymbol{\theta}^k), \qquad k = 1,\ldots ,K. \end{equation} \tag{ 1 }$

The total number of tasks is given by K. $\boldsymbol{\theta}^\textrm{sh}$ are shared parameters between the tasks, and $\boldsymbol{\theta}^k$ are the task-specific parameters. For conflicting objectives, there is no single optimal solution, but Pareto-dominated solutions can be found using heuristics like evolutionary algorithms [36]. Alternatively, the multi-objective optimization problem can be converted to a single-objective optimization problem using scalarization, e.g. as a weighted sum with real-valued relative weights λ_k associated with each loss objective $\mathcal{L}_k$ :

$\begin{equation} \min_{\boldsymbol{\theta}^\textrm{sh}, \boldsymbol{\theta}^1,\ldots ,\boldsymbol{\theta}^K} \sum_{k = 1}^{K}\lambda_k \mathcal{L}_{k} (\boldsymbol{\theta}^\textrm{sh}, \boldsymbol{\theta}^k). \end{equation} \tag{ 2 }$

Often, the λ_k are dynamically tuned along training epochs [33]. The weights λ_k intuitively quantify the extent to which we aim to minimize the objective $\mathcal{L}_k$ . Large λ_k favor small $\mathcal{L}_k$ , and vice-versa [37]. For K > 2, manual tuning of the λ_k and grid search become prohibitively expensive for large parameterizations as common in deep neural networks.

The loss function of a PINN is a scalarized multi-objective loss, where each objective corresponds to either a data-fidelity term or a constraint derived from the physics of the underlying problem [4, 5]. A typical PINN loss contains five objectives:

$\begin{align} \mathcal{L}(\boldsymbol{u}_{\boldsymbol{\theta}}, \boldsymbol{u})& = \frac{\lambda_\textrm{1}}{N_\textrm{int}} \sum_{i = 1}^{N_\textrm{int}} \Big \vert \partial_t \boldsymbol{u}_{\boldsymbol{\theta}} - \mathcal{F} (\boldsymbol{u}_{\boldsymbol{\theta}}(t_{i}, \boldsymbol{x}_i^{\Omega}), \Sigma ) \Big \vert^2~\nonumber\\ & \quad + \frac{\lambda_2}{N_\mathcal{B}} \sum_{i = 1}^{N_\mathcal{B}} \Big \vert \mathcal{B} (t_{i}, \boldsymbol{x}_i^{\partial \Omega}) \Big \vert^2 + \frac{\lambda_3}{N_\textrm{I}} \sum_{i = 1}^{N_\textrm{I}} \Big \vert \mathcal{I} (t = 0 , \boldsymbol{x}_i^{\Omega}) \Big \vert^2~\nonumber\\ & \quad + \frac{\lambda_4}{N_\textrm{int}} \sum_{i = 1}^{N_\textrm{int}} \Big \vert \mathcal{H} (\boldsymbol{u}_{\boldsymbol{\theta}}(t_{i}, \boldsymbol{x}_i^{\Omega}) ) \Big \vert^2 + \frac{\lambda_5}{N_\textrm{int}} \sum_{i = 1}^{N_\textrm{int}} \Big \vert \boldsymbol{u}_{\boldsymbol{\theta}}(t_{i}, \boldsymbol{x}_i^{\Omega}) - \boldsymbol{u}_{i} ) \Big \vert^2 . \end{align} \tag{ 3 }$

We distinguish the neural network approximation $\boldsymbol{u}_{\boldsymbol{\theta}}(t_i,\boldsymbol{x}_i^\Omega)$ from the training data $\boldsymbol{u}_{i} = \boldsymbol{u}(t_i, \boldsymbol{x}_i^\Omega)$ sampled at the $N_\textrm{int}$ space-time points $(t_{i},\boldsymbol{x}_i^{\Omega})$ in the interior of the solution domain Ω, $N_\mathcal{B}$ points $(t_{i},\boldsymbol{x}_i^{\partial \Omega})$ on the boundary $\partial \Omega$ , and $N_\textrm{I}$ samples of the initial state at time t = 0. The λ₁, λ₂, and λ₃ weight the residuals of the physical equation $\partial_t \boldsymbol{u} = \mathcal{F} (\boldsymbol{u}, \Sigma )$ with right-hand side $\mathcal{F}$ and coefficients Σ, the boundary condition $\mathcal{B}$ , and the initial condition $\mathcal{I}$ , respectively. The objective with weight λ₄ imposes an additional constraint $\mathcal{H}$ on the state variable u , such as, e.g. divergence-freeness ( $\nabla \cdot \boldsymbol{u} = 0$ ) of the velocity field of an incompressible flow. The weight λ₅ penalizes deviations of the neural network approximation from the available measurement data.

PINNs can be used to solve both forward and inverse modeling problems of ordinary differential equations (ODEs) and partial differential equations (PDEs). In the forward problem, a numerical approximation to the solution of an ODE or PDE is to be learned. For this, $\boldsymbol{\theta}^\textrm{sh}$ are the networks parameters and $\boldsymbol{\theta}^{k}$ is an empty set. In the inverse problem, the coefficients of an ODE or PDE are to be learned from data such that the equation describes the dynamics in the data. Then, the task-specific parameters $\boldsymbol{\theta}^{k}$ correspond to the coefficients ( $\boldsymbol{\xi} \subseteq \Sigma$ ) of the physical equation that shall be inferred. During training of a PINN, the parameters $(\boldsymbol{\theta}^\textrm{sh}, \boldsymbol{\theta}^{k})$ are determined by minimizing the total loss $\mathcal{L}$ for given data.

2.1. Sobolev training is a special case of PINN training

Like most deep neural networks, PINNs are usually trained by first-order parameter update based on (approximate) loss gradients. The update rule can be interpreted as a discretization of the gradient flow, i.e.

$\begin{align} \boldsymbol{\theta} (\tau + 1) & = \boldsymbol{\theta} (\tau) - \eta (\tau) \sum_{k = 1}^{K}\lambda_k \nabla_{\boldsymbol{\theta}} \mathcal{L}_{k}. \end{align} \tag{ 4 }$

The learning rate $\eta(\tau)$ depends on the training epoch τ, reflecting the adaptive step size used in optimizers like Adam [38]. The gradients are batch gradients $\nabla_{\boldsymbol{\theta}} \mathcal{L}_k = \frac{1}{\vert \textrm{B} \vert} \sum_{i \in \textrm{B}} \nabla \ell_k (\boldsymbol{u}_{\boldsymbol{\theta}}^{i}, \boldsymbol{u}_i)$ , where $\textrm{B}$ is a mini-batch, created by randomly splitting the N training data points into $N/\vert \textrm{B}\vert$ distinct partitions, and $\ell_k$ is one summand (for one data point i) of the loss $\mathcal{L}_k$ corresponding to the kth objective. For batch size $\vert \textrm{B}\vert = 1$ , mini-batch gradient descent reduces to stochastic gradient descent algorithm.

For approximating a function $\boldsymbol{u}(\boldsymbol{x})$ from data u _i, adding derivatives $\mathcal{D}_{\boldsymbol{x}}^m$ , $m\geqslant 1$ , of the state variable into the training loss has been shown to improve data efficiency and generalization power of neural networks [7]. Minimizing the resulting loss

$\begin{equation} \sum_{i = 1}^{N} \Bigg[ \lambda_0~\vert \boldsymbol{u}_{\boldsymbol{\theta}} (\boldsymbol{x}_i) - \boldsymbol{u}(\boldsymbol{x}_i) \vert^2 + \sum_{k = 1}^{K} \lambda_k \vert \mathcal{D}_{\boldsymbol{x}}^k \boldsymbol{u}_{\boldsymbol{\theta}} (\boldsymbol{x}_i) - \mathcal{D}_{\boldsymbol{x}}^k \boldsymbol{u}(\boldsymbol{x}_i) \vert^2~\Bigg ] \end{equation} \tag{ 5 }$

using the update rule in equation (4) can be interpreted as a finite approximation to the Sobolev gradient flow of the loss functional. Without loss of generality, we drop the dependence of the state variable on time, writing $\boldsymbol{u}_i = \boldsymbol{u}(\boldsymbol{x}_i)$ . The neural network is then trained with access to the (K + 2)-tuples $\{(\boldsymbol{x}_i, \boldsymbol{u}(\boldsymbol{x}_i), \mathcal{D}_{\boldsymbol{x}}^1~\boldsymbol{u}(\boldsymbol{x}_i),\ldots,\mathcal{D}_{\boldsymbol{x}}^K \boldsymbol{u}(\boldsymbol{x}_i)) \}_{i = 1}^N$ instead of the usual training set $\{(\boldsymbol{x}_i, \boldsymbol{u}(\boldsymbol{x}_i))\}_{i = 1}^N$ . Such Sobolev training is an instance of scalarized multi-objective optimization with each objective based on the approximation of a specific derivative. The weights are usually chosen as $\lambda_0 = \ldots = \lambda_\textrm{K} = 1$ [7].

The additional information from the derivatives introduces an inductive bias, which has been found to improve data efficiency and network generalization [7]. This is akin to PINNs, which use derivatives from the physical equation model to achieve inductive bias. Indeed, the Sobolev loss in equation (5) is a special case of the PINN loss from equation (3) with each objective based on a PDE with right-hand side $\mathcal{F} = \mathcal{D}_{\boldsymbol{x}}^k \boldsymbol{u}$ , $k = 1,\ldots ,K$ .

2.2. Vanishing task-specific gradients are a failure mode of PINNs

The update rule in equation (4) is known to lead to stiff learning dynamics for loss functionals whose Hessian matrix has large positive eigenvalues [34]. This is the case for Sobolev training, even with K = 1, when the function $\boldsymbol{u}(\boldsymbol{x})$ is highly oscillatory, leading us to the following proposition, proven in appendix B.1: Stiff learning dynamics in first-order parameter update

Proposition 2.2.1. For Sobolev training on the tuples $\{\boldsymbol{x}_i, \boldsymbol{u}(\boldsymbol{x}_i), \partial_{\boldsymbol{x}}^m \boldsymbol{u}(\boldsymbol{x}_i) \}_{i = 1}^{N}, m \geqslant 1$ , with total loss $\mathcal{L} = \sum_{i = 1}^{N} \left( {\lambda_0} \vert \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_i) - \boldsymbol{u}(\boldsymbol{x}_i) \vert^2~\right. + \left. {\lambda_1} \vert \mathcal{D}_{\boldsymbol{x}}^m \boldsymbol{u}_{\boldsymbol{\theta}} (\boldsymbol{x}_i) - \mathcal{D}_{\boldsymbol{x}}^m \boldsymbol{u}(\boldsymbol{x}_i)\vert^2\right)$ , the rate of change of the residual of the dominant Fourier mode $\widetilde{\boldsymbol{r}}(\boldsymbol{k}_0) = \widetilde{\boldsymbol{u}}_{\boldsymbol{\theta}} (\boldsymbol{k}_0) - \widetilde{\boldsymbol{u}} (\boldsymbol{k}_0)$ using first-order parameter update is:

$\begin{align*} \left\vert \frac{\mathrm{d}\widetilde{\boldsymbol{r}}({\boldsymbol{k}_0})}{\mathrm{d}\tau} \right\vert & = \frac{\eta({\tau})}{O(\boldsymbol{k}_0) } \left[ {\lambda_{0}} O(1) + {\lambda_{1}} O(\boldsymbol{k}_0^{2m}) \right] \left\vert \frac{\partial \widetilde{\boldsymbol{r}}({\boldsymbol{k}_0})}{\partial \boldsymbol{\theta}} \right\vert , \end{align*}$

where $O(\cdot)$ is the Bachmann-Landau big-O symbol. For ${\lambda_0} = {\lambda_1} = 1$ , this leads to training dynamics with slow time scale $\mathrm{d}/\mathrm{d}t \in O(\boldsymbol{k}_0^{-1})$ and fast time scale $\mathrm{d}/\mathrm{d}t \in O(\boldsymbol{k}_0^{2m-1})$ for $\widetilde{\boldsymbol{r}}({\boldsymbol{k}_0}) \in O(1)$ .

This proposition suggests that losses containing high orders of derivatives m, dominant high frequencies k ₀, or a combination of both exhibit dominant gradient statistics. This can lead to biased optimization of the dominant objectives with large gradients at the expense of the other objectives with relatively smaller gradients. This is akin to the well-known vanishing gradients phenomenon across the layers of a neural network [39], albeit now across different objectives of a multi-objective optimization problem. We therefore call this phenomenon vanishing task-specific gradients.

Proposition 2.2.1 also admits a qualitative comparison with the phenomenon of numerical stiffness in PDEs of the form

$\begin{equation} \frac{\partial {\widetilde{\boldsymbol{u}}}}{\partial t} = \mathbf{L} \widetilde{\boldsymbol{u}} + \mathbf{F}(\widetilde{\boldsymbol{u}}, t) . \end{equation} \tag{ 6 }$

If the real parts of the eigenvalues of L, $c = \mathrm{Re}(\lambda(\mathbf{L})) \gg 1$ , then the solution of the above PDE has fast modes $\widetilde{\boldsymbol{u}} \in O(1)$ with $\mathrm{d}/\mathrm{d}t \in O(c)$ and slow modes $\widetilde{\boldsymbol{u}} \in O(1/c)$ with $\mathrm{d}/\mathrm{d}t \in O(1)$ . Therefore, high frequency modes evolve on shorter time scales than low frequency modes. The rate amplitude for spatial derivatives of order m is $\mathrm{d}/\mathrm{d}t \in O(\boldsymbol{k}^{-m})$ for the wavenumber k [40]. Due to this analogy, we call the learning dynamics of a neural network with vanishing task-specific gradients stiff. Stiff learning dynamics leads to discrepancy in the convergence rates of different objectives in a loss function.

Taken together, vanishing task-specific gradients constitute a likely failure mode of PINNs, since their training amounts to a generalized version of Sobolev training.

3. Strategies for balanced PINN training

Since PINN training minimizes a weighted sum of multiple objectives (equation (3)), the result depends on appropriately chosen weights λ_k. The ratio between any two weights $\lambda_i/\lambda_j$ , i ≠ j, defines the relative importance of the two objectives during training. Suitable weighting is therefore quintessential to finding a Pareto-optimal solution that balances all objectives.

3.1. Weighting based on mean gradient statistics

While ad hoc manual tuning is still commonplace, it has recently been proposed to determine the weights as moving averages over the inverse gradient magnitude statistics of a given objective as [34]:

$\begin{align} \hat{\lambda}_k (\tau) & = \frac{\max \{\vert \nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_\textrm{1} (\tau) \vert \} }{\lambda_k (\tau) \overline{ \vert \nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_k (\tau) \vert}},\nonumber\\ \lambda_k (\tau + 1) & = \alpha \lambda_k(\tau) + (1-\alpha) \hat{\lambda}_k(\tau), \end{align} \tag{ 7 }$

where $\nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_\textrm{1}$ is the gradient of the residual, i.e. of the first objective in equation (3), and α is a user-defined learning rate (here always α = 0.5). All gradients are taken with respect to the shared parameters across tasks, and $| \cdot |$ is the component-wise absolute value. The overbar signifies the algebraic mean over the vector components. The maximum in the numerator is over all components of the vector. While this max/avg weighting has been shown to improve PINN performance [34], it does not completely alleviate the problem of vanishing task-specific gradients.

3.2. Inverse Dirichlet weighting

Instead of determining the loss weights proportional to the inverse average gradient magnitude, we here propose to use weights based on the gradient variance. In particular, we propose to use weights for which the variances over the components of the back-propagated weighted gradients $\lambda_k \nabla_{\boldsymbol{\theta}} \mathcal{L}_k$ become equal across all objectives, thus directly preventing vanishing task-specific gradients. This can be achieved in any (stochastic) gradient-descent optimizer by using the update rule in equation (4) with weights

$\begin{align} \hat{\lambda}_k (\tau) & = {\frac{\gamma(\tau)}{\textrm{std}\{\nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_k (\tau) \}}}, \nonumber\\ \lambda_k (\tau + 1) & = \alpha \lambda_k(\tau) + (1-\alpha) \hat{\lambda}_k(\tau), \end{align} \tag{ 8 }$

where $\gamma(\tau) = \max_{t = 1,\ldots ,K} \left( \textrm{std}\{\nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_t (\tau) \} \right)$ , and $\textrm{std}\{\cdot\}$ is the empirical (sample) standard deviation over the vector components. Under the assumption that the back-propagated gradients are normally distributed, the weighting strategy in equation (8) leads to balanced gradient distributions with variance $\gamma(\tau)^2$ , i.e.

$\begin{equation*} \frac{\gamma(\tau)}{\textrm{std}\{\nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_k (\tau) \}} \mathcal{N} \left( \mu_k, {\textrm{Var}}\{ \nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_k (\tau) \} \right) = \mathcal{N} \left( \mu_k, \gamma(\tau)^2~\right), \end{equation*}$

$\mathcal{N}\left( \cdot \right)$ is the normal distribution with mean $\mu_k, k = {1,\ldots ,K}$ , and given variance.

For optimizers that are invariant to diagonal scaling of the gradients, such as the popular Adam optimizer [38], the weights in equation (8) can be efficiently computed as the inverse of the square root of the Dirichlet energy of each objective, i.e.

$\begin{equation*} \textrm{std}\{\nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_k (\tau) \} \propto \sqrt{\int_{\Omega_{\boldsymbol{\theta}}} \vert \nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_k (\tau) \vert^2 \,\mathrm{d}\boldsymbol{\theta} \, .} \end{equation*}$

We therefore refer to this weighting strategy as inverse Dirichlet weighting.

The weighting strategies in equations (7) and (8) are fundamentally different. Equation (7) weights objectives in inverse proportion to certainty as quantified by the average gradient magnitude. Equation (8) uses weights that are inversely proportional to uncertainty as quantified by the variance of the loss gradients. Equation (8) is also different from previous uncertainty-weighted approaches [31] because it measures the training uncertainty stemming from variance in the loss gradients rather than the uncertainty from the model's observational noise.

3.3. Gradient-based multi-objective optimization

We compare weighting-based scalarization approaches with gradient-based multi-objective approaches using Karush-Kuhn-Tucker (KKT) local optimality conditions to find a descent direction that decreases all objectives. Specifically, we adapt the multiple gradient descent algorithm (MGDA) [33] to PINNs. This algorithm leverages the KKT conditions

$\begin{align} \exists \boldsymbol{\lambda} & = \{\lambda_k \} \in \mathbb{R}_+^K \nonumber \\ \textrm{s.t. } \sum_{k = 1}^{K} \lambda_k \nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_k (\boldsymbol{\theta}^\textrm{sh}, \boldsymbol{\theta}^k) &= 0~\textrm{~~and } \: \sum_{k = 1}^{K} \lambda_k = 1 \end{align} \tag{ 9 }$

to solve an MTL problem. The MGDA is guaranteed to converge to a Pareto-stationary point [41]. Given the gradients $\nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_k$ of all objectives $k = 1,\ldots ,K$ with respect to the shared parameters, we find weights λ_k that satisfy the above KKT conditions. The resulting λ leads to a descent direction that improves all objectives [33], see also figure A.1. We adapt the MGDA to PINNs as described in appendix A.2. We refer to this adapted algorithm as pinn-MGDA.

**Figure A.1.** Multiple gradient descent algorithm. Illustration of multi-objective optimization for minimizing two objectives $f_1(x)$ and $f_2(x)$ . The red curve corresponds to the Pareto-optimal front. The MGDA algorithm searches for a descent direction (arrow) that minimizes all objectives and is close to the Pareto-optimal front. The gray region corresponds to the feasible objective space.
Download figure:
Standard image High-resolution image

**Figure A.1.** Multiple gradient descent algorithm. Illustration of multi-objective optimization for minimizing two objectives $f_1(x)$ and $f_2(x)$ . The red curve corresponds to the Pareto-optimal front. The MGDA algorithm searches for a descent direction (arrow) that minimizes all objectives and is close to the Pareto-optimal front. The gray region corresponds to the feasible objective space.
Download figure:
Standard image High-resolution image

4. Results

As we posit above, PINNs are expected to fail for problems with dominant high frequencies, i.e. with high k ₀ or high derivatives m, as is typical for multi-scale PDE models. We therefore present results of numerical experiments for such cases and compare the approximation properties of PINNs with different weighting schemes. This showcases training failure modes originating from vanishing task-specific gradients.

We first consider Sobolev training of neural networks, which is a special case of PINN training, as discussed in section 2.1. Due to the linear nature of the Sobolev loss (equation (5)), this test case allows us to compute ε-optimal analytical weights (see appendix A.1), providing a baseline for our comparison. Second, we consider a real-world example of a nonlinear PDE, using PINNs to model active turbulence.

In the first example, we specifically consider the Sobolev training problem with target function derivatives up to fourth order, i.e. K = 4, and total loss

$\begin{equation} \sum_{i = 1}^{N} \Bigg[ \Big \vert \boldsymbol{u}_{\boldsymbol{\theta}} (\boldsymbol{x}_i) - \boldsymbol{u}(\boldsymbol{x}_i) \Big\vert^2 + \sum_{k = 1}^{4} \lambda_k \Big \vert \hat{\xi}_k \mathcal{D}_{\boldsymbol{x}}^k \boldsymbol{u}_{\boldsymbol{\theta}} (\boldsymbol{x}_i) -\xi_k \mathcal{D}_{\boldsymbol{x}}^k \boldsymbol{u}(\boldsymbol{x}_i) \Big \vert^2~\Big ]. \end{equation} \tag{ 10 }$

The neural network is trained on the data $\boldsymbol{u}(\boldsymbol{x}_i)$ and their derivatives. To mimic inverse-modeling scenarios, in addition to producing an accurate estimate $\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_i)$ of the function, the neural network is also tasked to infer unknown scalar pre-factors $\boldsymbol{\hat{\xi}} = (\hat{\xi}_1,\hat{\xi}_2,\hat{\xi}_3,\hat{\xi}_4)$ with true values chosen as $\xi_{k} = 1$ . This mimics scenarios in which unknown coefficients of the PDE model also need to be inferred from data. For this loss, ε-optimal weights can be analytically determined (see appendix A.1) such that all objectives are minimized in an unbiased fashion [42]. This provides the baseline against which we benchmark the different weighting schemes.

4.1. Inverse Dirichlet weighting avoids vanishing task-specific gradients

We characterize the phenomenon of vanishing task-specific gradients, demonstrate its impact on the accuracy of a learned function approximation, and empirically validate Proposition 2.2.1. For this, we consider a generic neural network with five layers and 64 neurons per layer, tasked with learning a 2D function that contains a wide spectrum of frequencies and unknown coefficients $\boldsymbol{\hat{\xi}} = (\hat{\xi}_1,\hat{\xi}_2,\hat{\xi}_3,\hat{\xi}_4)$ using Sobolev training with the loss from equation (10). The details of the test problem and the training setup are given in appendix B.

We first confirm that in this test case vanishing task-specific gradients occur when using uniform weights, i.e. when setting $\lambda_k = 1$ for all $k = 0,\ldots , 4$ . For this, we plot the gradient histogram of the five loss terms in figure 1(A). The histograms clearly show that training is biased in this case to mainly optimize the objective containing the highest derivative k = 4. All other objectives are neglected, as predicted by proposition 2.2.1. This leads to a uniformly high approximation error on the test data (training vs. test data split 50:50, see appendix B), as shown in figure 2(B), as well as a uniformly high error in the estimated coefficients ξ (figure 2(C)). When determining the weights using the Pareto-seeking pinn-MGDA, the accuracy considerably improves (figures 2(B) and (C)). However, the phenomenon of vanishing task-specific gradients is still present, as seen in figure 1(B), this time favoring the objective for k = 0 and neglecting the derivatives. Max/avg weighting as proposed in reference [34] and given in equation (7) improves the accuracy for higher derivatives (figure 2(B)), but still suffers from unbalanced gradient distributions (figure 1(C)), which leads to a uniformly high error in the estimated coefficients ξ (figure 2(C)). The inverse Dirichlet weighting proposed here and detailed in equation (8) shows balanced gradient histograms (figure 1(D)) closest to those observed when using ε-optimal analytical weights (figure 1(E)). This leads to orders of magnitude better accuracy in both the function approximation and the coefficient estimates, as shown in figures 2(B) and (C), reaching the performance achieved when using ε-optimal weights.

**Figure 1.** Loss gradient histograms illustrating vanishing task-specific gradients. The gradient distributions of the different objectives ( $k = 0,\ldots ,4$ , see main text) are shown in different colors (inset legend) at training epoch 5000. The panels correspond to different λ weighting schemes as named. The histograms show normalized back-propagated weighted gradients for each objective, computed as $\nabla_\theta \mathcal{L}_k^{\ast} = \lambda_k \nabla_\theta \mathcal{L}_k / \max \{\vert \lambda_k \nabla_\theta \mathcal{L}_k \vert \}$ for $k = 0,\ldots ,4$ and $\vert \cdot \vert$ is the component-wise absolute value.
Download figure:
Standard image High-resolution image

**Figure 1.** Loss gradient histograms illustrating vanishing task-specific gradients. The gradient distributions of the different objectives ( $k = 0,\ldots ,4$ , see main text) are shown in different colors (inset legend) at training epoch 5000. The panels correspond to different λ weighting schemes as named. The histograms show normalized back-propagated weighted gradients for each objective, computed as $\nabla_\theta \mathcal{L}_k^{\ast} = \lambda_k \nabla_\theta \mathcal{L}_k / \max \{\vert \lambda_k \nabla_\theta \mathcal{L}_k \vert \}$ for $k = 0,\ldots ,4$ and $\vert \cdot \vert$ is the component-wise absolute value.
Download figure:
Standard image High-resolution image

**Figure 2.** Approximation accuracy and convergence for Sobolev training using different weighting strategies. (A) One instance of the function u given by equation (B4), visualized on a $128 \times 128$ regular Cartesian grid. (B) The relative L2 error of the learned function approximation, i.e. $\| \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_i) - \boldsymbol{u}(\boldsymbol{x}_i)\|_2/\|\boldsymbol{u}(\boldsymbol{x}_i) \|_2$ , across all grid nodes x _i of the test data $\boldsymbol{u}(\boldsymbol{x}_i)$ for different objective weighting methods (symbols, see inset legend). (C) The relative L1 error in the estimated model coefficients, i.e. $\| \boldsymbol{\xi} - \boldsymbol{\hat{\xi}} \|_1/\|\boldsymbol{\xi}\|_1$ , for the different objective weighting methods. (D) The power spectrum E(k) of the true function u (solid black line) compared with those of the learned estimators $\boldsymbol{u}_{\boldsymbol{\theta}}$ for the different weighting methods (colored lines with symbols). All results are shown after training for 20 000 epochs using the Adam optimizer. The colored bands show to the standard deviation over 5 repetitions for different realizations (i.e. RNG seeds) of the random function u (see equation (B4)).
Download figure:
Standard image High-resolution image

$128 \times 128 $ — **Figure 2.** Approximation accuracy and convergence for Sobolev training using different weighting strategies. (A) One instance of the function u given by equation (B4), visualized on a $128 \times 128$ regular Cartesian grid. (B) The relative L2 error of the learned function approximation, i.e. $\| \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_i) - \boldsymbol{u}(\boldsymbol{x}_i)\|_2/\|\boldsymbol{u}(\boldsymbol{x}_i) \|_2$ , across all grid nodes x _i of the test data $\boldsymbol{u}(\boldsymbol{x}_i)$ for different objective weighting methods (symbols, see inset legend). (C) The relative L1 error in the estimated model coefficients, i.e. $\| \boldsymbol{\xi} - \boldsymbol{\hat{\xi}} \|_1/\|\boldsymbol{\xi}\|_1$ , for the different objective weighting methods. (D) The power spectrum E(k) of the true function u (solid black line) compared with those of the learned estimators $\boldsymbol{u}_{\boldsymbol{\theta}}$ for the different weighting methods (colored lines with symbols). All results are shown after training for 20 000 epochs using the Adam optimizer. The colored bands show to the standard deviation over 5 repetitions for different realizations (i.e. RNG seeds) of the random function u (see equation (B4)).
Download figure:
Standard image High-resolution image

This is confirmed when looking at the power spectrum of the function approximation $\boldsymbol{u}_{\boldsymbol{\theta}}$ learned by the neural network in figure 2(D). Inverse Dirichlet weighting leads to results that are as good as those achieved by optimal weights in this test case, in particular when approximating high frequencies, as predicted by proposition 2.2.1. A closer look at the power spectra learned by the different weighting strategies also reveals that the max/avg strategy fails to capture high frequencies, whereas the pinn-MDGA performs surprisingly well (figure 2(D)) despite its unbalanced gradients (figure 1(B)). This may be problem-specific, as the weight trajectories taken by pinn-MDGA during training (figure B.1) are fundamentally different from the behavior of the other methods. This could be because the pinn-MGDA uses a fundamentally different strategy, purely based on satisfying Pareto-stationarity conditions (equation (9)). As shown in figure B.2, this leads to a rapid attenuation of the Fourier spectrum of the residual $\vert \tilde{\boldsymbol{r}}(\boldsymbol{k})\vert$ over the first few learning epochs in this test case, allowing pinn-MGDA to escape the stiffness defined in proposition 2.2.1.

**Figure B.1.** λ trajectories for different weighing strategies. The dynamic weights $\lambda(\tau)$ over the training epochs τ for a single instance of Sobolev training of the neural networks for different methods: (A) pinn-MGDA, (B) ε-optimal, (C) inverse Dirichlet, (D) max/avg weighting. Different trajectories in the same plots correspond to the weights $\lambda_k, k = {1,\ldots ,K}$ , associated with different objectives. We observe that the relative weights $(\lambda_i/\lambda_j) \forall (i,j) = {1,\ldots ,K}$ between objectives for the inverse Dirichlet and ε-optimal weights are of the same order of magnitude: $\lambda_0/\lambda_1~\approx 10$ , $\lambda_0/\lambda_2~\approx 10^2~\ldots 10^3$ , $\lambda_0/\lambda_3~\approx 10^3~\ldots 10^4$ , $\lambda_0/\lambda_4~\approx 10^4~\ldots 10^5$ (shown by red double-headed arrows).
Download figure:
Standard image High-resolution image

**Figure B.1.** λ trajectories for different weighing strategies. The dynamic weights $\lambda(\tau)$ over the training epochs τ for a single instance of Sobolev training of the neural networks for different methods: (A) pinn-MGDA, (B) ε-optimal, (C) inverse Dirichlet, (D) max/avg weighting. Different trajectories in the same plots correspond to the weights $\lambda_k, k = {1,\ldots ,K}$ , associated with different objectives. We observe that the relative weights $(\lambda_i/\lambda_j) \forall (i,j) = {1,\ldots ,K}$ between objectives for the inverse Dirichlet and ε-optimal weights are of the same order of magnitude: $\lambda_0/\lambda_1~\approx 10$ , $\lambda_0/\lambda_2~\approx 10^2~\ldots 10^3$ , $\lambda_0/\lambda_3~\approx 10^3~\ldots 10^4$ , $\lambda_0/\lambda_4~\approx 10^4~\ldots 10^5$ (shown by red double-headed arrows).
Download figure:
Standard image High-resolution image

**Figure B.2.** Evolution of the Fourier spectrum of the function residual $\widetilde{r}(\boldsymbol{k})$ . The Fourier spectrum of the residual tracked over the training epochs for different methods (inset text). The color intensity corresponds to the magnitude of the residual, $\vert \widetilde{r}(\boldsymbol{k}) \vert$ .
Download figure:
Standard image High-resolution image

**Figure B.2.** Evolution of the Fourier spectrum of the function residual $\widetilde{r}(\boldsymbol{k})$ . The Fourier spectrum of the residual tracked over the training epochs for different methods (inset text). The color intensity corresponds to the magnitude of the residual, $\vert \widetilde{r}(\boldsymbol{k}) \vert$ .
Download figure:
Standard image High-resolution image

In summary, we observe stark differences between different weighting strategies. Of the strategies compared, only the inverse Dirichlet weighting proposed here is able to avoid vanishing task-specific gradients, performing on par with the ε-optimal solution. The pinn-MGDA seems to be able to escape, rather than avoid, the stiffness caused by vanishing task-specific gradients. We also observe a clear correlation between having balanced task-specific gradient distributions and achieving good approximation and estimation accuracy.

4.2. Inverse Dirichlet weighting enables multi-scale modeling using PINNs

We investigate the use of PINNs in modeling multi-scale dynamics in space and time where vanishing task-specific gradients are a common problem. As a real-world test problem, we consider meso-scale active turbulence as it occurs in living fluids, such as bacterial suspensions [43, 44] and multi-cellular tissues [45]. Unlike inertial turbulence, active turbulence has energy injected at small length scales comparable to the size of the actively moving entities, like bacteria or individual motor proteins. The spatio-temporal dynamics of active turbulence can be modeled using the generalized incompressible Navier–Stokes equation [43, 44]:

$\begin{align} \frac{\partial \boldsymbol{u}}{\partial t} + \xi_0 (\boldsymbol{u} \cdot \nabla \boldsymbol{u}) & = - \nabla p + \xi_1~\nabla \vert \boldsymbol{u}\vert^2 - \alpha \boldsymbol{u} - \beta \vert \boldsymbol{u}\vert^2~\boldsymbol{u} \nonumber \\ &\qquad +\Gamma_0~\Delta \boldsymbol{u} - \Gamma_2~\Delta^2~\boldsymbol{u}\\ \nabla \cdot \boldsymbol{u} & = 0~,\nonumber \end{align} \tag{ 11 }$

where $\boldsymbol{u}(t,\boldsymbol{x})$ is the mean-field flow velocity and $p(t,\boldsymbol{x})$ the pressure field. The coefficient ξ₀ scales the contribution from advection and ξ₁ the active pressure. The α and β terms correspond to a quartic Landau-type velocity potential. The higher-order terms with coefficients Γ₀ and Γ₂ model passive and active stresses arising from hydrodynamic and steric interactions, respectively [44]. Figure C.2 illustrates the rich multi-scale dynamics observed when numerically solving this equation on a square for the parameter regime $\Gamma_0 \lt 0$ and $\Gamma_2 \gt 0$ [43] (see appendix C for details on the numerical methods used). An important characteristic of equation (11) is the presence of disparate length and time scales in addition to fourth-order derivatives and nonlinearities in u , which jointly lead to considerable numerical stiffness [40]. Given such large dominant wavenumbers k ₀, any data-driven approach, forward or inverse, is expected to encounter significant learning pathologies as per proposition 2.2.1.

**Figure C.1.** Loss trajectories for the reconstruction problem using different activation functions. (A) Relative L2 error trajectories along the training epochs for reconstructing the horizontal component u. (B) Relative L2 error trajectories for reconstructing the vorticity field ω. (C) Relative L2 error trajectories for reconstructing the Laplacian of the vorticity field, $\Delta \omega$ . The problem was set up in a square domain with $N_\textrm{int} = 192\,200$ .
Download figure:
Standard image High-resolution image

**Figure C.1.** Loss trajectories for the reconstruction problem using different activation functions. (A) Relative L2 error trajectories along the training epochs for reconstructing the horizontal component u. (B) Relative L2 error trajectories for reconstructing the vorticity field ω. (C) Relative L2 error trajectories for reconstructing the Laplacian of the vorticity field, $\Delta \omega$ . The problem was set up in a square domain with $N_\textrm{int} = 192\,200$ .
Download figure:
Standard image High-resolution image

**Figure C.2.** Training domain for the active turbulence problem. Left: training domain inside the dotted square used for forward solution in the square domain. Right: training domain interior of the dotted annular outline used for forward solution in the annular geometry.
Download figure:
Standard image High-resolution image

We demonstrate these pathologies by applying PINNs to approximate the forward solution of equation (11) in 2D for the parameter regime explored in reference [43] in both square (figure C.2) and annular (figure 3(E)) domains. The details of the simulation and training setups are given in appendix C. The loss function includes the terms as given in equation (3), including the PDE residual, initial and boundary conditions, and the divergence-freeness constraint.

**Figure 3.** Forward modeling of active turbulence using PINNs. (A) and (B) Average relative L2 prediction errors for the vorticity field $\omega = \nabla \times \boldsymbol{u}$ of the active turbulence PDE in equation (11) in square (A) and annular (B) domains using different weighting strategies for PINN training (symbols, see legend). The dashed vertical lines mark epochs at which the learning rate $\eta (\tau)$ is adjusted. (C) Convergence of the different PINN predictions of the flow vorticity to the ground-truth numerical solution in a square domain for different grid resolutions. Dashed lines visualize integer convergence rates. The PINN solution is extracted at epoch 7000. The colored bands show the standard deviations over three different initializations of the neural network weights. (D) Power spectrum E(k) of the turbulent flow velocity field (solid black line) compared with the power spectra of the approximations learned by PINNs with different weighting (see inset legend) at epoch 500. (E) Visualization of the ground-truth vorticity field in an annular domain obtained by numerically solving equation (11) as detailed in appendix C. (F)–(I) Spatial distribution of the relative prediction errors in the vorticity field when using PINNs with the weighting strategies named in the panel titles.
Download figure:
Standard image High-resolution image

**Figure 3.** Forward modeling of active turbulence using PINNs. (A) and (B) Average relative L2 prediction errors for the vorticity field $\omega = \nabla \times \boldsymbol{u}$ of the active turbulence PDE in equation (11) in square (A) and annular (B) domains using different weighting strategies for PINN training (symbols, see legend). The dashed vertical lines mark epochs at which the learning rate $\eta (\tau)$ is adjusted. (C) Convergence of the different PINN predictions of the flow vorticity to the ground-truth numerical solution in a square domain for different grid resolutions. Dashed lines visualize integer convergence rates. The PINN solution is extracted at epoch 7000. The colored bands show the standard deviations over three different initializations of the neural network weights. (D) Power spectrum E(k) of the turbulent flow velocity field (solid black line) compared with the power spectra of the approximations learned by PINNs with different weighting (see inset legend) at epoch 500. (E) Visualization of the ground-truth vorticity field in an annular domain obtained by numerically solving equation (11) as detailed in appendix C. (F)–(I) Spatial distribution of the relative prediction errors in the vorticity field when using PINNs with the weighting strategies named in the panel titles.
Download figure:
Standard image High-resolution image

Figures 3(A) and (B) show the average (over different initializations of the neural network weights) relative L2 errors of the vorticity field over training epochs when using different PINN weighting strategies. As predicted by proposition 2.2.1, the uniformly weighted PINN fails completely both in the square (figure 3(A)) and in the annular geometry (figure 3(B)). Optimization is entirely biased towards minimizing the PDE residual, neglecting the initial, boundary, and incompressibility conditions. This imbalance is also clearly visible in the loss gradient distributions shown in figure C.5(A). Comparing the approximation accuracy after 7000 training epochs confirms the findings from the previous section with inverse Dirichlet weighting outperforming the other methods, followed by pinn-MGDA and max/avg weighting. Inverse Dirichlet weighting is found to assign adequate weights for the initial and boundary conditions relative to the equation residual (see figures C.3(A) and (D)).

**Figure C.3.** λ trajectories for different weighting strategies for the active turbulence problem. (A)–(C) correspond to the λ trajectories for the forward solution of the active turbulence problem in the square domain (see left figure C.2). (D)–(F) correspond to the λ trajectories for the forward solution in the annular geometry (see right figure C.2). The columns correspond to the different weighting strategies as named on the top left corner of each column.
Download figure:
Standard image High-resolution image

**Figure C.4.** Evolution of the Fourier spectrum of the function residual $\widetilde{r}(\boldsymbol{k})$ . Top row: the Fourier spectrum of the residual for the horizontal velocity component u. Bottom row: the Fourier spectrum of the residual for the vertical velocity component v. The columns correspond to different methods (inset text). The color intensity corresponds to the magnitude of the residual, $\vert \widetilde{r}(\boldsymbol{k}) \vert$ .
Download figure:
Standard image High-resolution image

**Figure C.5.** Loss gradient histograms illustrating vanishing task-specific gradients for the active turbulence problem. The weighted gradient distributions of the different objectives (insets on the right) at training epoch 5000 for the forward problem in a squared domain. The panels correspond to different λ weighting schemes as named.
Download figure:
Standard image High-resolution image

**Figure C.6.** Catastrophic forgetting in sequential PINN training of active turbulence from noisy measurement data. (A) Test error in the velocity field u during the three-step sequential training described in the main text with different weighting strategies (colors). The solid vertical lines at epochs 1000 and 2000 mark the introduction of the divergence constraint and PDE residual, respectively. The dashed vertical lines mark epochs at which the learning rate $\eta(\tau)$ is adjusted. (B) Average relative L1 errors in the inferred values for the coefficients ξ₀ (ξ₁ is absorbed into the effective pressure), α, β, Γ₀, and Γ₂ of equation (11) over training epochs. (C) Average relative L2 error for inferring the latent pressure gradient $\nabla p$ across training epochs. The simulated data u is corrupted with additive Gaussian noise $\hat{\boldsymbol{u}} = \boldsymbol{u} + 0.25~\boldsymbol{\epsilon}$ , where $\boldsymbol{\epsilon}\sim\mathcal{N}(0,\psi^2)$ is a vector of element-wise independent and identically distributed Gaussian random numbers with mean zero and variance $\psi^2 = \textrm{Var}\{u_1,\ldots ,u_N \}$ .
Download figure:
Standard image High-resolution image

**Figure C.6.** Catastrophic forgetting in sequential PINN training of active turbulence from noisy measurement data. (A) Test error in the velocity field u during the three-step sequential training described in the main text with different weighting strategies (colors). The solid vertical lines at epochs 1000 and 2000 mark the introduction of the divergence constraint and PDE residual, respectively. The dashed vertical lines mark epochs at which the learning rate $\eta(\tau)$ is adjusted. (B) Average relative L1 errors in the inferred values for the coefficients ξ₀ (ξ₁ is absorbed into the effective pressure), α, β, Γ₀, and Γ₂ of equation (11) over training epochs. (C) Average relative L2 error for inferring the latent pressure gradient $\nabla p$ across training epochs. The simulated data u is corrupted with additive Gaussian noise $\hat{\boldsymbol{u}} = \boldsymbol{u} + 0.25~\boldsymbol{\epsilon}$ , where $\boldsymbol{\epsilon}\sim\mathcal{N}(0,\psi^2)$ is a vector of element-wise independent and identically distributed Gaussian random numbers with mean zero and variance $\psi^2 = \textrm{Var}\{u_1,\ldots ,u_N \}$ .
Download figure:
Standard image High-resolution image

**Figure C.7.** Inferred effective pressure. The top row shows the prediction of the horizontal component of the effective pressure gradient $\partial_x p^*$ given the velocity field u . The bottom row shows the corresponding point-wise relative error of the prediction. Different columns correspond to different weighting strategies as per the titles.
Download figure:
Standard image High-resolution image

**Figure C.7.** Inferred effective pressure. The top row shows the prediction of the horizontal component of the effective pressure gradient $\partial_x p^*$ given the velocity field u . The bottom row shows the corresponding point-wise relative error of the prediction. Different columns correspond to different weighting strategies as per the titles.
Download figure:
Standard image High-resolution image

**Figure C.8.** Inferred effective pressure from noisy measurement data. The top row shows the prediction of the horizontal component of the effective pressure gradient $\partial_x p^*$ given the noisy corrupted velocity field $\hat{\boldsymbol{u}} = \boldsymbol{u} + 0.25~\boldsymbol{\epsilon}$ , where $\boldsymbol{\epsilon}\sim\mathcal{N}(0,\psi^2)$ is a vector of element-wise independent and identically distributed Gaussian random numbers with mean zero and variance $\psi^2 = \textrm{Var}\{u_1,\ldots ,u_N \}$ . The bottom row shows the corresponding point-wise relative error of the prediction. Different columns correspond to different weighting strategies as per the titles.
Download figure:
Standard image High-resolution image

**Figure C.9.** Computational aspects of dynamic weighing for the active turbulence problem. (A) Training times per epoch for the inverse problem in primitive variables with 35 mini-batch updates per epoch of batch size $\vert\textrm{B}\vert = 4096$ . (B) Training times per epoch for the forward problem in vorticity-streamline formulation with 49 mini-batch updates per epoch of batch size $\vert\textrm{B}\vert = 4000$ . The higher computational cost in (B) stems from using more training points and from the additional derivative that need to be computed for evaluating the vorticity. The profiling was done on a Nvidia GTX 1080 graphics card with 8 GB VRAM.
Download figure:
Standard image High-resolution image

**Figure C.9.** Computational aspects of dynamic weighing for the active turbulence problem. (A) Training times per epoch for the inverse problem in primitive variables with 35 mini-batch updates per epoch of batch size $\vert\textrm{B}\vert = 4096$ . (B) Training times per epoch for the forward problem in vorticity-streamline formulation with 49 mini-batch updates per epoch of batch size $\vert\textrm{B}\vert = 4000$ . The higher computational cost in (B) stems from using more training points and from the additional derivative that need to be computed for evaluating the vorticity. The profiling was done on a Nvidia GTX 1080 graphics card with 8 GB VRAM.
Download figure:
Standard image High-resolution image

In figure 3(C), we compare the convergence of the learned approximation $\boldsymbol{u}_{\boldsymbol{\theta}}$ to a ground-truth numerical solution for different spatial resolutions h of the discretization grid. The errors scale as $O(h^r)$ with convergence orders r as identified by the dashed lines. While the pinn-MGDA and max/avg weighting achieve linear convergence (r = 1), the inverse Dirichlet weighted PINN converges with order r = 4. The uniformly weighted PINN does not converge at all and is therefore not shown in the plot. The evolution of the Fourier spectra $| \widetilde{r}(\boldsymbol{k}) |$ of the residuals of both velocity components $(u,v) = \boldsymbol{u}$ shown in figure C.4 confirm the rapid solution convergence achieved by inverse Dirichlet weighted PINNs.

Inverse Dirichlet weighting also perfectly recovers the power spectrum E(k) of the active turbulence velocity field, as shown in figure 3(D). This confirms that inverse Dirichlet weighting enables PINNs to capture high frequencies in the PDE solution, corresponding to fine structures in the fields, by avoiding the stiffness problem from proposition 2.2.1. Indeed, the PINNs trained with the other weighting methods fail to predict small structures in the flow field, as shown by the spatial distributions of the prediction errors in figures 3(F)–(I) compared to the ground-truth numerical solution in figure 3(E).

In Appendices D and E, we provide results for two additional examples: approximating the solution of the 2D Poisson equation (see figure D.1) on a square with a large spectrum of frequencies, and learning the spatio-temporal dynamics of curvature-driven level-set flow (see figures E.1 and E.2) developing small structures. In both of these additional multi-scale examples we notice similar behavior, with inverse Dirichlet weighting outperforming the other approaches either in terms of accuracy or convergence rate.

**Figure D.2.** Effect of the choice of optimizer on the performance of weighting strategies. Left: comparison between relative L2 errors for different weighing strategies for increasing wavenumbers k using the RMSprop optimizer. Right: the same comparison when using the Adagrad optimizer. The line markers and colors follow the legend of figure D.1.
Download figure:
Standard image High-resolution image

**Figure E.1.** Curvature-driven level-set problem. (A) Relative L2 error for for different methods for the curvature-driven level-set flow. (B) The Power spectrum of the PINN level-set function approximation after 2000 epochs compared with the Power spectrum of the ground truth, shown as solid black line.
Download figure:
Standard image High-resolution image

**Figure E.2.** Solution snapshot of curvature-driven level-set flow. Top row: zero contour of the level-set function visualized at simulation time t = 0.01 after 2000 epochs of training. Different columns correspond to different weighing strategies: (A) uniform weighing, (B) pinn-MGDA, (C) inverse Dirichlet, (D) max/avg. Bottom row: The corresponding point-wise error of the PINN approximation of the level-set function in comparison to the ground truth solution.
Download figure:
Standard image High-resolution image

In summary, these results suggest that avoiding vanishing task-specific gradients can enable the use of PINNs in multi-scale problems previously not amenable to neural-network modeling. Moreover, we observe that seemingly small changes in the weighting strategy can lead to significant differences in the convergence order of the learned approximation.

4.3. Inverse Dirichlet weighting protects against catastrophic forgetting

Artificial neural networks tend to 'forget' their parameterization when sequentially training for multiple tasks [46], a phenomenon referred to as catastrophic forgetting. Sequential training is often used, e.g. when learning to de-noise data and estimate parameters at the same time, like combined image segmentation and denoising [47], or when learning solutions of fluid mechanics models with additional constraints like divergence-freeness of the flow velocity field [16].

In PINNs, catastrophic forgetting can occur, e.g. in the frequently practiced procedure of first training on measurement data alone and adding the PDE model only at later epochs [48]. Such sequential training can be motivated by the computational cost of evaluating the derivatives in the PDE using automatic differentiation over the network, so it seems prudent to pre-condition the network for data fitting and introduce the residual later in order to regress for the missing network parameters.

Using the active turbulence test case described above, we investigate how the different weighting schemes compared here influence the phenomenon of catastrophic forgetting. For this, we consider the inverse problem of inferring unknown model parameters and latent fields like the effective pressure $p^*$ from flow velocity data u _i. We train the PINNs in three sequential steps: (1) first fitting flow velocity data, (2) later including the additional objective for the incompressibility constraint $\nabla \cdot \boldsymbol{u} = 0$ , (3) finally including also the objective for the PDE model residual. This creates a setup in which catastrophic forgetting occurs when using uniform PINN weights, as confirmed in figure 4(A) (arrows). As soon as additional objectives (divergence-freeness at epoch 1000 and equation residual at epoch 2000) are introduced, the PINN loses the parameterization learned on the previous objectives, leading to large and increasing test error for the field u .

**Figure 4.** Catastrophic forgetting in sequential PINN training of active turbulence. (A) Test error in the velocity field u during the three-step sequential training described in the main text with different weighting strategies (colors, see legend). The solid vertical lines at epochs 1000 and 2000 mark the introduction of the incompressibility constraint and PDE residual, respectively. The dashed vertical lines mark epochs at which the learning rate $\eta(\tau)$ is adjusted. (B) Average relative L1 errors in the inferred values for the coefficients ξ₀ (ξ₁ is absorbed into the effective pressure $p^*$ , see appendix C), α, β, Γ₀, and Γ₂ of equation (11) over training epochs. (C) Average relative L2 error in the inferred latent effective pressure gradient $\nabla p^*$ across training epochs.
Download figure:
Standard image High-resolution image

Dynamic weighting strategies can protect PINNs against catastrophic forgetting by adequately adjusting the weights whenever an additional objective is introduced. This is confirmed in figure 4(A) with inverse Dirichlet and pinn-MGDA only minimally affected by catastrophic forgetting. The max/avg weighting strategy [34], however, does not protect the network quite as well. When looking at the test errors for the learned model coefficients (figure 4(B)) and the accuracy of the estimated latent pressure field $p(t,\boldsymbol{x})$ (figure 4(C)), for which no training data is given, the inverse Dirichlet weighting proposed here outperforms the other approaches. Snapshots of the reconstructed pressure fields using different strategies are shown in figure C.7. Results for inference from noisy data are shown in figures C.6 and C.8.

In summary, inverse Dirichlet weighting can protect a PINN against catastrophic forgetting, making sequential training a viable option. In our tests, it offered the best protection of all compared weighting schemes.

5. Conclusion and discussion

PINNs have rapidly been adopted by the scientific community for diverse applications in forward and inverse modeling. It is straightforward to show that the training of a PINN amounts to a MTL problem, which is sensitive to the choice of regularization weights. Trivial uniform weights are known to lead to failure of PINNs, empirically found to exacerbate for physical models with disparate scales or high derivatives.

As we have shown here, this failure can be explained by analogy to numerical stiffness of PDEs when understanding Sobolev training of neural networks [7] as a special case of PINN training. This connection enabled us to prove a proposition stating that PINN learning pathologies are caused by dominant high-frequency components or high derivatives in the physics prior. The asymptotic arguments in our proposition inspired a novel dynamic adaptation strategy for the regularization weights of a PINN during training, which we termed inverse Dirichlet weighting.

We empirically compared inverse Dirichlet weighting with the recently proposed max/avg weighting for PINNs [34]. Moreover, we looked at PINN training from the perspective of multi-objective optimization, which avoids a-priori choices of regularization weights. We therefore adapted to PINNs the state-of-the-art MGDA for training multi-objective artificial neural networks, which has previously been proven to converge to a Pareto-stationary solution [33]. This provided another baseline to compare with. Finally, we compared against analytically derived ε-optimal static weights in a simple linear test problem where those can be derived, providing an absolute baseline. In all comparisons, and across a wide range of problems from 2D Poisson to multi-scale active turbulence and curvature-driven level-set flow, inverse Dirichlet weighting empirically performed best. We also found that all weighting strategies are in general agnostic to the choice of the activation function, with inverse Dirichlet weighting outperforming the other strategies irrespective of the activation function chosen.

The inverse Dirichlet weighting proposed here can be interpreted as an uncertainty weighting with training uncertainty quantified by the batch variance of the loss gradients. This is different from previous approaches [31] that used weights quantifying the model uncertainty, rather than the training uncertainty. The proposition proven here provides an explanation for why the loss gradient variance may be a good choice, and our results confirmed that weighting based on the Dirichlet energy of the loss objectives outperforms other weighting heuristics, as well as Pareto-front seeking algorithms like pinn-MGDA. The proposed inverse Dirichlet weighting also performed best at protecting a PINN against catastrophic forgetting in sequential training, where loss objectives are introduced one-by-one, making sequential training of PINNs a viable option in practice.

While we have focused on PINNs here, our analysis equally applies to Sobolev training of other neural networks, and to other multi-objective losses, as long as training is done using an explicit first-order (stochastic) gradient descent optimizer, of which Adam [38] is the most commonly used. In all our examples, we found that the Adam optimizer leads to better convergence than other first-order update rules like RMSprop or Adagrad (see figure D.2). However, hybrid strategies can also be envisioned. For instance when the optimizer has converged close to a local optimum, the uncertainty in the gradients is small relative to the mean gradient component [49]. In such scenarios, inverse Dirichlet weighting can lead to large optimization step sizes that plateau the learning (see figure E.1). A hybrid strategy could then be beneficial, combining initial inverse Dirichlet weighting with, e.g. LBFGS-based fine-tuning close to a local optimum [16].

Taken together, we have presented a connection between PINNs and Sobolev training in order to explain a likely failure mode of PINNs, and we have proposed a simple yet effective strategy to avoid training failure. The proposed inverse Dirichlet weighting strategy is easily included into any first-order optimizer at almost no additional computational cost (see figure C.9). It has the potential to improve the accuracy and convergence of PINNs by orders of magnitude, to enable sequential training with less risk of catastrophic forgetting, and to extend the application realm of PINNs to multi-scale problems that were previously not amenable to data-driven modeling.

Acknowledgments

This work was supported by the German Research Foundation (DFG) – EXC-2068, Cluster of Excellence 'Physics of Life', and by the Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, funded by the Federal Ministry of Education and Research (BMBF). The present work was also partly funded by the Center for Advanced Systems Understanding (CASUS), which is financed by Germany's Federal Ministry of Education and Research (BMBF) and by the Saxon Ministry for Science, Culture and Tourism (SMWK) with tax funds on the basis of the budget approved by the Saxon State Parliament.

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/mosaic-group/inverse-dirichlet-pinn.

Appendix A

A.1. Analytically optimal weights

For Sobolev loss functions based on elliptic PDEs, optimal weights λ_k that minimize all objectives in an unbiased way can be determined by ε-closeness analysis [42]. ε -closeness

Definition 1. A candidate solution u is called ε-close to the true solution $\hat{\boldsymbol{u}}$ if it satisfies

$\begin{equation} \frac{\vert \mathcal{D}^{\boldsymbol{\beta}} \boldsymbol{u} ({\boldsymbol{x}}) -\mathcal{D}^{\boldsymbol{\beta}} \hat{\boldsymbol{u}} ({\boldsymbol{x}})\vert}{\vert \mathcal{D}^{\boldsymbol{\beta}} \hat{\boldsymbol{u}}({\boldsymbol{x}}) \vert} \leqslant \varepsilon \end{equation} \tag{ A1 }$

for any multi-index $\boldsymbol{\beta} \in \mathbb{N}^{{d}}$ and $\boldsymbol{x} = (x_1,\ldots ,x_d) \in \Omega \subset \mathbb{R}^d$ , thus $\mathcal{D}^{\boldsymbol{\beta}} \boldsymbol{u} (\boldsymbol{x}) = \frac{\partial^{\vert \boldsymbol{\beta} \vert} \boldsymbol{u}(\boldsymbol{x})}{\partial x_1^{\beta_1}\partial x_2^{\beta_2}\ldots \partial x_d^{\beta_d}}$ with $\vert \boldsymbol{\beta}\vert = \sum_{i = 1}^{d} \beta_i$ .

For the loss function typically used in Sobolev training of neural networks, i.e.

$\begin{equation*} \mathcal{L}_k( \cdot ) = \int_{\Omega} \left( \mathcal{D}_{{\boldsymbol{x}}}^{k} \hat{\boldsymbol{u}} - \mathcal{D}_{{\boldsymbol{x}}}^{k} \boldsymbol{u}\right)^2~\mathrm{d} \boldsymbol{x}. \end{equation*}$

We can use ε-closeness analysis to show that

$\begin{equation} \mathcal{L}_{k} \leqslant \varepsilon^2~\int_{\Omega} \left( \vert \xi_k \mathcal{D}_{\boldsymbol{x}}^k (\boldsymbol{\hat{u}}) \vert \right)^2~\mathrm{d} \boldsymbol{x} \, . \end{equation} \tag{ A2 }$

Using this inequality, we can bound the total loss as

$\begin{equation} \mathcal{L}(\boldsymbol{u}) \leqslant \varepsilon^2~\sum_{k = 1}^{K} \lambda_k \int_{\Omega} \left( \vert \xi_k \mathcal{D}_{\boldsymbol{x}}^k (\boldsymbol{\hat{u}}) \vert \right)^2~\mathrm{d} \boldsymbol{x} = \varepsilon^2~\sum_{k = 1}^{K} \lambda_k \mathcal{I}_k, \end{equation} \tag{ A3 }$

with $\mathcal{I}_k = \int_{\Omega} \left( \vert \xi_k \mathcal{D}_{\boldsymbol{x}}^k (\boldsymbol{\hat{u}}) \vert \right)^2\! \mathrm{d} {\boldsymbol{x}}$ . This bound can be used to determine the weights $\{\lambda _k \} _{k = 1}^K$ that result in the smallest ε:

$\begin{align} \mathcal{L}_k(\boldsymbol{u}) & \leqslant \frac{1}{\lambda_k} \mathcal{L}(\boldsymbol{u}) \leqslant \varepsilon^2~\lambda_k \mathcal{I}_k \:\: \forall k = {1,\ldots ,K}, \nonumber\\ \mathcal{L}(\boldsymbol{u}) & \leqslant \varepsilon^2~\min \{\lambda_1~\mathcal{I}_1,\ldots ,\lambda_K \mathcal{I}_K \}. \end{align} \tag{ A4 }$

The weights $\lambda_k, \forall k \in {1,\ldots ,K}$ , that lead to the tightest upper bound on the total loss in equation (A4) can be found by solving the maximization-minimization problem described in following corollary: A.1.1.

Corollary A maximization-minimization problem can be transformed into a piecewise linear convex optimization problem by introducing an auxiliary variable z, such that

$\begin{align*} \textrm{maximize} \:\:z \qquad\quad & \quad \\ \textrm{subject~to } \mathbf{1}^\top \boldsymbol{\lambda}^{*} = 1; \\ z \leqslant \lambda_k \mathcal{I}_k\,, \quad k = {1,\ldots ,K} . \end{align*}$

The analytical solution to this problem exists and is given by $\lambda_k^* = \frac{\Pi_{j\neq k} \mathcal{I}_j}{\sum_{k = 1}^{K} \Pi_{j\neq k} \mathcal{I}_j}, \quad k = 1,\ldots ,K$ .

For a Sobolev training problem with loss given by equation (5), weights that lead to ε-optimal solutions are computed using

$\begin{equation*} \mathcal{I}_k = \int_{\Omega} \left(\vert \xi_k \mathcal{D}_{\boldsymbol{x}}^k (\boldsymbol{\hat{u}}) \vert \right)^2\! \mathrm{d} \boldsymbol{x} .\end{equation*}$

A.2. PINN multiple gradient descent algorithm (pinn-MGDA)

We adapt the MGDA to PINNs as illustrated in figure A.1 by formulating the conditions for Pareto stationary as stated in equation (9) as a quadratic program:

$\begin{equation} \underset{\boldsymbol{\lambda} \in \mathbb{R}^K}{\textrm{minimize }} \frac{1}{2} \boldsymbol{\lambda}^\top \boldsymbol{Q} \boldsymbol{\lambda} \,\,\,\textrm{ s.t. } \,\,\,\lambda_k \geqslant 0 \: \forall k \in [K] \,\,\,\textrm{ and } \,\,\,\sum_{k = 1}^K \lambda_k = 1, \end{equation} \tag{ A5 }$

where $\boldsymbol{Q} = \boldsymbol{U}^\top\boldsymbol{U}$ . The matrix $\boldsymbol{U} \in \mathbb{R}^{\vert \textrm{B} \vert \times K}$ is given by

$\begin{equation*} \boldsymbol{U} = \begin{bmatrix} \vdots \:\: \qquad \qquad \vdots \qquad \qquad \:\: \vdots \:\: \\ \nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_1 \:\: \nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_2 \:\: \ldots \:\: \nabla_{\boldsymbol{\theta}^\textrm{sh}} \mathcal{L}_K \\ \vdots \:\: \qquad \qquad \vdots \qquad \qquad \:\: \vdots \:\: \\ \end{bmatrix}. \end{equation*}$

We solve the quadratic program in equation (A5) using the Frank-Wolfe algorithm.

Appendix B.: Sobolev training

B.1. Proof of proposition 2.2.1

Let us assume the Fourier spectrum of the signal is a delta function, i.e. $\widetilde{u}(k) = \Gamma_0~\delta (k - k_0)$ , with k₀ being the dominant wavenumber. The residual at training time (t) is given as $r(\cdot,t) = u_\theta - u$ , and its corresponding Fourier coefficients for the wavenumber k is given by $\widetilde{r}(k,t) = \widetilde{u}_{\theta}(k,t) - \widetilde{u}(k)$ . The learning rate of the residual for the dominant frequency is given as,

$\begin{align} \Big \vert \frac{\partial \widetilde{r}(k_0,\tau)}{\partial \tau} \Big \vert & = \Big \vert \frac{\partial \widetilde{u}_{\theta}(k_0,\tau)}{\partial \theta} \Big \vert \Big \vert \frac{\partial \theta}{\partial \tau} \Big \vert \nonumber \\ & = \Big \vert \eta \frac{\partial \widetilde{u}_{\theta}(k_0)}{\partial \theta} \Big \vert \Big \vert \frac{\partial \mathcal{L}}{\partial \theta} \Big \vert \nonumber \\ & = \Big \vert \eta \frac{\partial \widetilde{u}_{\theta}(k_0)}{\partial \theta} \Big \vert \Big \vert \lambda_{{0}} \frac{\partial \mathcal{L}_{{0}}}{\partial \theta} + \lambda_{{1}} \frac{\partial \mathcal{L}_{{1}}}{\partial \theta} \Big \vert , \end{align} \tag{ B1 }$

where $\mathcal{L}_0~{ = \sum_{i = 1}^{N} \vert \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_i) - \boldsymbol{u}(\boldsymbol{x}_i) \vert^2}$ and $\mathcal{L}_1~{ = \sum_{i = 1}^{N} \vert \mathcal{D}_{\boldsymbol{x}}^m \boldsymbol{u}_{\boldsymbol{\theta}} (\boldsymbol{x}_i) - \mathcal{D}_{\boldsymbol{x}}^m \boldsymbol{u}(\boldsymbol{x}_i)\vert^2}$ . Equation (B1) follows from the gradient flow of the parameters $\partial_\tau \theta \approx \eta \partial_\theta \mathcal{L}$ as described in equation (4). From Parseval's Theorem [50], we can compute the magnitude of the gradients for both objectives $\mathcal{L}_{{0}}$ and $\mathcal{L}_{{1}}$ with respect to the network parameters θ.

$\begin{align} \frac{\partial \mathcal{L}_{{0}}}{\partial \theta} & = 2~\sum_{k = -N/2}^{N/2-1} \Big \vert \textbf{Re}\Big [ \widetilde{u}_{\theta} (k) - \widetilde{u}(k) \Big ] \frac{\partial \widetilde{u_\theta} (k) }{\partial \theta} \Big \vert \nonumber \\ & \leqslant 2~\Big \vert \Gamma_0~\frac{\partial \widetilde{u}_{\theta}(k_0)}{\partial \theta} \Big \vert + 2~\sum_{k = -N/2}^{N/2-1} \Big \vert \widetilde{u}_{\theta}(k) \frac{\partial \widetilde{u}_{\theta} (k)}{\partial \theta } \Big \vert \nonumber \\ &\approx O(1) \Big \vert \frac{\partial \widetilde{r}(k_0)}{\partial \theta } \Big \vert . \end{align} \tag{ B2 }$

The second step in the above calculation involves triangular inequality and the assumption that the Fourier spectrum of the signal is a delta function. In a similar fashion, we can repeat the same analysis for the Sobolev part of the objective function as follows,

$\begin{align} \frac{\partial \mathcal{L}_{{1}}}{\partial \theta} & = 2~\sum_{k = -N/2}^{N/2-1} \Big \vert \textbf{Re} \Big [ (ik)^m \big (\widetilde{u}_{\theta} (k) - \widetilde{u}(k) \big ) \Big ] \frac{\partial (ik)^m \widetilde{u}_{\theta} (k) }{\partial \theta} \Big \vert \nonumber \\ & = 2~\sum_{k = -N/2}^{N/2-1} \Big \vert \textbf{Re} \Big [ (ik)^{2m} \big (\widetilde{u}_{\theta} (k) - \widetilde{u}(k) \big) \Big ] \frac{\partial \widetilde{u}_{\theta} (k) }{\partial \theta} \Big \vert \nonumber \\ & \leqslant 2~\Big \vert \Gamma_0~\textbf{Re} \big [ (ik_0)^{2m} \big ] \frac{\partial \widetilde{u}_{\theta}(k_0)}{\partial \theta} \Big \vert + 2~\sum_{k = -N/2}^{N/2-1} \Big \vert \textbf{Re} \big [(ik)^{2m} \big ] \widetilde{u}_{\theta}(k) \frac{\partial \widetilde{u}_{\theta} (k)}{\partial \theta} \Big \vert \nonumber \\ & \approx O (k_0^{2m}) \Big \vert \frac{\partial \widetilde{r}(k_0)}{\partial \theta} \Big \vert . \end{align} \tag{ B3 }$

By substituting equations (B2) and (B3) into equation (B1), we get,

$\begin{align*} \Big \vert \frac{\partial \widetilde{r}(k_0,\tau)}{\partial \tau} \Big \vert & = \Big \vert \eta \frac{\partial \widetilde{u}_{\theta}(k_0)}{\partial \theta} \Big \vert \left( \lambda_{{0}} O(1) \Big \vert \frac{\partial \widetilde{r}(k_0)}{\partial \theta} \Big \vert + \lambda_{{1}} O (k_0^{2m}) \Big \vert \frac{\partial \widetilde{r}(k_0)}{\partial \theta} \Big \vert \right) \\ & = \frac{\eta}{O(k_0)} \Big[ \lambda_{{0}} O(1) + \lambda_{{1}} O (k_0^{2m}) \Big] \Big \vert \frac{\partial \widetilde{r}(k_0)}{\partial \theta} \Big \vert. \end{align*}$

We use the fact that the spectral rate of the network function residual is inversely proportional to the wavenumber, i.e. $\partial \widetilde{u}_\theta (k)/\partial \theta = O\left(k^{-1}\right)$ [26].

B.2. Sobolev training setup

We use a neural network $f_{\theta}: (x,y) \rightarrow (u)$ with five layers and 64 neurons per layer and $\sin(\cdot)$ activation function. The target function is a sinusoidal forcing term of the form,

$\begin{equation} u(x,y) = \sum_{i = 1}^{M} A_x^i \cos(2\pi l_x^i x/L + \phi_x^i) A_y^i \sin(2\pi l_y^i y/L + \phi_y^i), \end{equation} \tag{ B4 }$

where M = 20 and $L = 2\pi$ is the size of the domain. The parameters $A_x, A_y$ are drawn independently and uniformly from the interval $[-5,5]$ , and similarly $\phi_x, \phi_y$ are drawn from the interval $[0, 2\pi]$ . The parameters $l_x, l_y$ which set the local length scales are sampled uniformly from the set $\{1,2,3,4,5\}$ . The target function $u(x,y)$ is given on a 2D spatial grid of resolution $128 \times 128$ covering the domain $[0, 2\pi] \times [0, 2\pi]$ . The grid data is then divided into a $50/50$ train and test split which corresponds to 8192 training points used. We train the neural network for 20 000 epochs using the Adam optimizer [38]. Each epoch is composed of two mini-batch updates of batch size $\vert \textrm{B} \vert = 4096$ . The initial leaning rate is chosen as $\eta = 10^{-3}$ and decreased by a factor of 10 after 10 000 and 15 000 epochs.

Appendix C.: Forward and inverse modeling in active turbulence

The ground truth solution of the 2D incompressible active fluid is computed in the vorticity formulation which is given as follows,

$\begin{equation} \frac{\partial \omega}{\partial t} + \xi_0~\boldsymbol{u} \cdot \nabla \omega = -(\alpha + \beta \vert \boldsymbol{u}\vert^2) \omega + \beta \nabla \vert \boldsymbol{u} \vert^2~\times \boldsymbol{u} + \Gamma_0~\Delta \omega - \Gamma_2~\Delta^2~\omega .\end{equation} \tag{ C1 }$

The simulation is conducted with a pseudo-spectral solver, where the derivatives are computed in the Fourier space and the time integration is done using the Integration factor method [40, 51]. The equations are solved in the domain $[0, 2\pi] \times [0, 2\pi]$ with a spatial resolution of $256 \times 256$ and a time-step dt = 0.0001. The incompressibility is enforced by projecting the intermediate velocities on to the divergence-free manifold. The simulations in the square domain are performed with periodic boundary conditions, where the values of the parameters are chosen as follows: $\xi_0 = 3.5, \alpha = -1, \beta = 0.5, \Gamma_0 = -0.045, \Gamma_2 = \vert \Gamma \vert^3$ [43].

C.1. Forward problem setup

We study the forward solution of the active turbulence problem using PINNs in the $(\omega-\psi)$ formulation. We employ a neural network $f_{\theta}: (x,y,t) \rightarrow (u,v)$ with seven layers and 100 neurons per layer and ${\sin}$ -activation functions. For choosing the activation function we conducted extensive numerical experiments using different activation functions for a simple function reconstruction problem with the objective

$\begin{equation*} \mathcal{L}(\boldsymbol{u}, \boldsymbol{u}_\theta, \theta) = \frac{1}{N_\textrm{int}} \sum_{i = 1}^{N_\textrm{int}} \Big \vert \boldsymbol{u}_{\boldsymbol{\theta}}(t_i, \boldsymbol{x}_i^{\Omega}) - \boldsymbol{u}_{i} ) \Big \vert^2, \end{equation*}$

where $\boldsymbol{u}_{i} = \boldsymbol{u}(t_i, \boldsymbol{x}_i^{\Omega})$ . In figure C.1 we clearly see that the sin activation function leads to the best approximation accuracy for the velocities and their associated derivatives. The vorticity ω is directly computed from the network outputs using automatic differentiation. For the solution in the square domain, we select a sub-domain $[\frac{\pi}{4},\frac{3\pi}{4}] \times [\frac{\pi}{4},\frac{3\pi}{4}]$ with 50 time steps (dt = 0.01) accounting to a total time of 0.5 units in simulation time. We choose $N_\mathrm{I} = 4096$ points for the initial conditions, $N_\mathrm B = 12\,800$ boundary points and a total of $N_\textrm{int} = 192200$ residual points in the interior of the square domain. The network is trained for 7000 epochs using the Adam optimizer with each epoch consisting of 49 mini-batch updates with a batch size $\vert\textrm{B}\vert = 4000$ . The velocities are provided at the boundaries of the training domain.

For the forward simulation in the annular geometry, the domain is chosen with the inner radius of $r_\textrm{inner} \approx 0.245$ units and the outer radius $r_\textrm{outer}\approx 0.85$ units. We choose $N_\mathrm{I} = 3536$ points for the initial conditions, $N_\mathrm B = 14\,400$ boundary points and a total of $N_\textrm{int} = 162\,400$ residual points in the interior of the domain. The network training is again performed over 7000 epochs using Adam with each epoch consisting of 41 mini-batch updates with a batch size $\vert\textrm{B}\vert = 4000$ . Dirichlet boundary conditions are enforced at the inner and outer rims of the annular geometry by providing the velocities. The network is optimized subject to the loss function

$\begin{equation*} \mathcal{L}(\boldsymbol{u}, \theta) = \lambda_1~\mathcal{L}_\textrm{boundary} + \lambda_2~\mathcal{L}_\textrm{res} + \lambda_3~\mathcal{L}_\textrm{div} + \lambda_4~\mathcal{L}_\textrm{init}. \end{equation*}$

Here, $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ correspond to the weights of the objectives for boundary condition, equation residual, divergence, and initial condition, respectively.

C.2. Inverse problem setup

For studying the inverse problem we resort to the primitive formulation $(u,v,p^{*})$ of the active turbulence model, with $p^{*} = -\nabla p+\xi_1\nabla |\boldsymbol{u}|^2$ as the effective pressure. We used a neural network $f_\theta:(x,y,t)\rightarrow (u,v,p^*)$ with five layers and 100 neurons per layer and sin-activation functions. Given the measurement data of flow velocities $\{u_i, v_i \}_{i = 1}^{N}$ at discrete points, we infer the effective pressure $p^{*}$ along with the parameters $\Sigma = \{\xi_0, \alpha, \beta, \Gamma_0, \Gamma_2 \}$ . The total of $204\,800$ points, sampled from the same domain as for the forward solution in the square domain, are utilized with a $70/30$ train and test split, which corresponds to $143\,360$ training points used. The network is trained for a total of 7000 epochs using the Adam optimizer with each epoch consisting of 35 mini-batch updates with a batch-size $\vert\textrm{B}\vert = 4096$ . We optimize the model subject to the loss function

$\begin{equation*} \mathcal{L}(\boldsymbol{u}, \theta) = \lambda_1~\mathcal{L}_\textrm{fit} + \lambda_2~\mathcal{L}_\textrm{res} + \lambda_3~\mathcal{L}_\textrm{div}, \end{equation*}$

with $\lambda_1, \lambda_2, \lambda_3$ corresponding to the data-fidelity, equation residual and the objective that penalises non-zero divergence in velocities, respectively. The training is done in a three-step approach, with the first 1000 epochs used for fitting the measurement data (pre-training, $\lambda_1 = 1, \lambda_2 = \lambda_3 = 0$ ), the next 1000 epochs training with an additional constraint on the divergence ( $\lambda_1~\neq 0, \lambda_2~\neq 0 , \lambda_3 = 0$ ), and finally followed by the introduction of the equation residual ( $\lambda_1~\neq 0, \lambda_2~\neq 0 , \lambda_3~\neq 0$ ). At every checkpoint introducing a new task, we reset all λ_k for the computation of $\hat{\lambda}_k$ (see equation (7) main text) of the max/avg variant to $\lambda_k = 1$ . For both, forward and inverse problems, we choose an initial learning rate $\eta = 10^{-3}$ and decrease it by a factor of 10 after 3000 and 6000 epochs (see figure C.3).

Appendix D.: 2D poisson problem

We consider the problem of solving the 2D Poisson equation given as,

$\begin{equation} \Delta u = -2\omega^2~\cos(\omega x) \sin(\omega y), \end{equation} \tag{ D1 }$

with the parameter ω controlling the frequency of the oscillatory solution $u(x,y)$ . The analytical solution $u(x,y) = \cos(\omega x)\sin(\omega y)$ is computed on the domain $[0,1] \times [0,1]$ with a grid resolution of $100 \times 100$ . The boundary condition is given as $\mathcal{B}(x,y) = \cos{(\omega x)}\sin{(\omega y)}$ , $x,y\in\partial\Omega$ . We set up the PINN with two objective functions one for handling boundary conditions and the other to compute the residual with the loss

$\begin{equation} \mathcal{L}(u, \theta) = \lambda_1~\mathcal{L}_\textrm{res} + \lambda_2~\mathcal{L}_\textrm{boundary} . \end{equation} \tag{ D2 }$

We set $N_\mathcal{B} = 400$ sampled uniformly on the four boundaries, and randomly sample $N_\textrm{int} = 2500$ points from the interior of the domain. We choose a network $f_\theta: (x,y) \rightarrow (u)$ with five layers and 50 nodes per layer and the tanh activation function is used. The neural network is trained for 30 000 epochs using the Adam optimizer. We choose the initial learning rate $\eta = 10^{-3}$ and decrease it by a factor of 10 after 10 000, 20 000 and 30 000 epochs. In figure D.1 we show the relative L2 error for different weighting strategies across changing wavenumber ω. We notice a very similar trend as observed in the previous problems, with inverse Dirichlet scheme performing in par with the analytical ε optimal weighting strategy, followed by max/avg and pinn-MGDA. Using the Poisson example, we also compare the effect of the choice of the optimizer on the performance of weighting strategies. In figure D.2, we show the performance of weighting strategies using the RMSprop optimizer (left) and Adagrad (right). We find that the inverse Dirichlet strategy outperforms the other strategies in both cases. Especially for Adagrad, we find our weighting strategy to be the most accurate by an order of magnitude. Interestingly, static ε-optimal weights fail to capture the solution for high wavenumbers when using Adagrad (see figure D.2 right).

Appendix E.: Curvature-driven level set

The level-set method [52] is a popular method used for implicit representation of surfaces. In contrast to explicit methods for interface capturing, level-set methods offer a natural means to handle complex changes in the interface topology. For this reason, level-set methods have found extensive applications in computer vision [53], multiphase and computational fluid dynamics [54], and for solving PDEs on surfaces [55]. For this example, we consider the curvature driven flow problem that evolves the topology of a complex interface based on its local curvature, and can be tailored for image processing tasks like image smoothing and enhancement [53, 56]. Specifically, we look at the problem of a wound spiral which relaxes under its own curvature [52]. The parametrization of the initial spiral is given by $\theta = 2~\pi D \sqrt{s}$ with $s = (k+a)/(np+a)$ . The point locations of the zero contour can be computed as,

$\begin{align*} x_s & = x_c + m \left( D/(1+D) \sqrt{s} \,\cos(\theta) \right), \\ y_s & = y_c + m \left( D/(1+D) \sqrt{s}\, \sin(\theta) \right) .\end{align*}$

The zero level set function can then be computed using the distance function $d(\boldsymbol{x}) = \Vert \boldsymbol{x}- \boldsymbol{x}_p\Vert$ as $\phi(\boldsymbol{x},t = 0) = d(\boldsymbol{x}) - w$ with w being the width of the spirals. The distance function $d(\boldsymbol{x})$ measures the shortest distance between two points x and x _p. Given the level set function $\phi(\boldsymbol{x})$ , we can compute the 2D curvature κ as,

$\begin{equation} \kappa = \frac{\phi_{xx} \phi_y \phi_y - 2\phi_{xy} \phi_x \phi_y + \phi_{yy} \phi_x \phi_x}{\left( \phi_x^2 + \phi_y^2~\right)^{3/2}} .\end{equation} \tag{ E1 }$

The evolution of the level set function driven by the local curvature κ and is computed using the PDE,

$\begin{equation*} \frac{\partial \phi}{\partial t} = \kappa \vert \nabla \phi \vert {} .\end{equation*}$

The simulation was conducted on an $128 \times 128$ resolution spatial grid covering the domain $[-2,2] \times [-2,2]$ using the finite-difference method with necessary reinitialization to preserve the sign distance property of the level-set function φ. The spiral parameters are chosen to be $np = 400, D = 2.5, a = 3, m = 2$ [57]. Time integration is performed using a second-order Runge-Kutta scheme with a step size of dt = 0.001.

E.1. Training setup

The training data is generated from uniformly sampling in the domain $[-1.5,1.5] \times [-1.5,1.5]$ with a resolution of $100 \times 100$ in space and 80 time snapshots separated with time-step dt = 0.001. This lead to $N_\mathrm I = 10\,000, N_\mathrm B = 32\,000$ and $N_\textrm{int} = 768\,000$ points in the interior of the space and time domain. The network is trained for 2500 epochs using the Adam optimizer with each epoch consisting of 187 mini-batch updates with batch size $\vert\textrm{B}\vert = 4096$ . The neural network $f_\theta: (x,y,t) \rightarrow (\phi)$ with 5 layers and 64 nodes is chosen for this problem. We found that using the ELU activation function resulted in the best performance. The total loss is given as,

$\begin{equation} \mathcal{L}(\phi) = \lambda_1~\mathcal{L}_\textrm{boundary} + \lambda_2~\mathcal{L}_\textrm{res} + \lambda_3~\mathcal{L}_\textrm{init}{} .\end{equation} \tag{ E2 }$

We choose the initial learning rate as $\eta = 10^{-3}$ and decrease it after 1500 and 2000 epochs by a factor of 10.

Appendix F.: Network initialization and parameterization

Let $\mathcal{X}\in\mathbb{R}^{N \times d_\textrm{inp}}$ be the set of all input coordinates to train the neural network, where N denotes the total number of data points and $d_\textrm{inp}$ the dimension of the input data, i.e. $d_\textrm{inp} = 3$ corresponding to $(x,y,t)$ . Moreover, $\boldsymbol{X}_\textrm{B}\in\mathbb{R}^{\vert\textrm{B}\vert \times d_\textrm{inp}}$ will denote a single batch, by randomly subsampling $\mathcal{X}$ into $N/\vert\textrm{B}\vert$ subsets. We then define an n-layer deep neural network with parameters θ as a composite function $f_{\boldsymbol{\theta}}:\mathbb{R}^{\vert\textrm{B}\vert \times d_\textrm{inp}}\rightarrow\mathbb{R}^{\vert\textrm{B}\vert \times d_\textrm{out}}$ , where $d_\textrm{out}$ corresponds to the dimension of the output variable. The output is usually computed as

$\begin{equation} f_{\boldsymbol{\theta}}(\boldsymbol{X}_\textrm{B}) = T_n(\sigma(T_{n-1}(\sigma(\dots\sigma(T_1(\boldsymbol{X}_\textrm{B})))))), \end{equation} \tag{ F1 }$

with the affine linear maps $T_i(\boldsymbol{z}_{i-1}) = \boldsymbol{z}_{i-1}\boldsymbol{W}^T_i + \boldsymbol{b}_i$ , $i = 1,\dots,n$ , $\boldsymbol{z}_0 = \boldsymbol{X}_\textrm{B}$ , and output $\boldsymbol{Y}_\textrm{B} = \boldsymbol{z}_{n-1}\boldsymbol{W}^T_n + \boldsymbol{b}_n$ . Here, $\boldsymbol{W}_i\in\mathbb{R}^{q_i \times q_{i-1}}$ describes the weights of the incoming edges of layer i with q_i neurons, $\boldsymbol{b}_i\in\mathbb{R}^{q_i}$ contains a scalar bias for each neuron and $\sigma(\cdot)$ is a non-linear activation function. The network parameters $\boldsymbol{W}_i, \boldsymbol{b}_i\subset\boldsymbol{\theta}$ are initialized using normal Xavier initialization [39] with $\boldsymbol{W}_i\sim\mathcal{N}(0,\Psi^2)$ , where $\mathcal{N}$ is a Gaussian distribution with zero mean and standard deviation

$\begin{equation} \Psi = g\cdot\sqrt{\frac{2}{q_i+ q_{i-1}}}, \end{equation} \tag{ F2 }$

and $\boldsymbol{b}_i = \boldsymbol{0}$ . The gain g was chosen to be g = 1 for all activation functions except for tanh with a recommended gain of $g = \frac{5}{3}$ . Before feeding $\boldsymbol{X}_\textrm{B}$ into the first layer of the neural network, we normalize it with respect to the training data $\mathcal{X}$ as

$\begin{equation} \hat{\boldsymbol{X}}_\textrm{B} = \frac{\boldsymbol{X}_\textrm{B}-\boldsymbol{\mu}}{\boldsymbol{\psi}}, \end{equation} \tag{ F3 }$

where $\boldsymbol{\mu} = ((\overline{\mathcal{X}_1},\dots,\overline{\mathcal{X}_{d_\textrm{inp}}}))$ and $\boldsymbol{\psi} = ((\sqrt{\textrm{Var}\{\mathcal{X}_1\}},\dots, \sqrt{\textrm{Var}\{\mathcal{X}_{d_\textrm{inp}}\}}))$ are the column-wise mean and standard deviations of the training data $\mathcal{X}$ .

F.1. Weights computation

We initialize all weights to $\lambda_k = 1,k = 1,\dots,K$ . For all experiments we update the dynamic weights at the first batch of every fifth epoch. The moving average is then performed with α = 0.5. The mini-batches $\boldsymbol{X}_\textrm{B}$ are randomized after each epoch, such that for every update of the weights, λ_k will be computed with respect to a different subset $\boldsymbol{X}_\textrm{B}$ of all training points $\mathcal{X}$ . Furthermore, batching is only performed over the interior points N_int. As $N_\mathrm{I},N_\mathrm{B}\ll N_{\textrm{int}}$ , the points for initial and boundary conditions do not require batching and remain fixed throughout the training.

Inverse Dirichlet weighting enables reliable training of physics informed neural networks

Article metrics

Submit

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Training a physics informed neural network is a multi-objective optimization problem

2.1. Sobolev training is a special case of PINN training

2.2. Vanishing task-specific gradients are a failure mode of PINNs

3. Strategies for balanced PINN training

3.1. Weighting based on mean gradient statistics

3.2. Inverse Dirichlet weighting

3.3. Gradient-based multi-objective optimization

4. Results

4.1. Inverse Dirichlet weighting avoids vanishing task-specific gradients

4.2. Inverse Dirichlet weighting enables multi-scale modeling using PINNs

4.3. Inverse Dirichlet weighting protects against catastrophic forgetting

5. Conclusion and discussion

Acknowledgments

Data availability statement

Appendix A

A.1. Analytically optimal weights

A.2. PINN multiple gradient descent algorithm (pinn-MGDA)

Appendix B.: Sobolev training

B.1. Proof of proposition 2.2.1

B.2. Sobolev training setup

Appendix C.: Forward and inverse modeling in active turbulence

C.1. Forward problem setup

C.2. Inverse problem setup

Appendix D.: 2D poisson problem

Appendix E.: Curvature-driven level set

E.1. Training setup

Appendix F.: Network initialization and parameterization

F.1. Weights computation

Inverse Dirichlet weighting enables reliable training of physics informed neural networks

Article metrics

Submit

Share this article

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Training a physics informed neural network is a multi-objective optimization problem

2.1. Sobolev training is a special case of PINN training

2.2. Vanishing task-specific gradients are a failure mode of PINNs

3. Strategies for balanced PINN training

3.1. Weighting based on mean gradient statistics

3.2. Inverse Dirichlet weighting

3.3. Gradient-based multi-objective optimization

4. Results

4.1. Inverse Dirichlet weighting avoids vanishing task-specific gradients

4.2. Inverse Dirichlet weighting enables multi-scale modeling using PINNs

4.3. Inverse Dirichlet weighting protects against catastrophic forgetting

5. Conclusion and discussion

Acknowledgments

Data availability statement

Appendix A

A.1. Analytically optimal weights

A.2. PINN multiple gradient descent algorithm (pinn-MGDA)

Appendix B.: Sobolev training

B.1. Proof of proposition 2.2.1

B.2. Sobolev training setup

Appendix C.: Forward and inverse modeling in active turbulence

C.1. Forward problem setup

C.2. Inverse problem setup

Appendix D.: 2D poisson problem

Appendix E.: Curvature-driven level set

E.1. Training setup

Appendix F.: Network initialization and parameterization

F.1. Weights computation