Towards fast weak adversarial training to solve high dimensional parabolic partial differential equations using XNODE-WAN

Due to the curse of dimensionality, solving high dimensional parabolic partial differential equations (PDEs) has been a challenging problem for decades. Recently, a weak adversarial network (WAN) proposed in (Zang et al., 2020) offered a flexible and computationally efficient approach to tackle this problem, defined on arbitrary domains, by leveraging the weak solution. WAN reformulates the PDE problem as a generative adversarial network, where the weak solution (primal network) and the test function (adversarial network) are parameterized by multi-layer deep neural networks (DNNs). However, it is not yet clear whether DNNs are the most effective model for the parabolic PDE solutions, as they do not take into account the fundamentally different roles played by the time and spatial variables in the solution. To reinforce this difference, we design a novel so-called XNODE model for the primal network, which is built on the neural ODE (NODE) model with additional spatial dependency to incorporate the a priori information of the PDEs and serve as a universal and effective approximation to the solution. The proposed hybrid method (XNODE-WAN), which integrates the XNODE model within the WAN framework, leads to a significant improvement in the performance and efficiency of training. Numerical results show that our method can reduce the training time to a fraction of that of the WAN model.


Introduction
Applying classical numerical methods (finite element, finite difference, finite volume, etc.) to solve high dimensional partial differential equations (PDEs) has been highly challenging due to the notorious curse of dimensionality [1,2]. Traditional numerical methods, used as PDE solvers, lead to an exponential growth of computing cost with respect to the dimension of the PDE problem. The application of neural networks to high dimensional data, including image classification [3] and natural language processing [4], has empirically proven to be successful. Recently, applying deep learning models to tackle high dimensional PDE problems, which has achieved superior empirical results [1,5,6,7,8], has been a growing area of research. Relevant numerical analysis that takes into account approximation rates, generality and optimization theory can also be found in many works, see [9,10,2,11].
As an important class of PDEs, parabolic PDEs have been studied extensively and applied in a variety of fields, including acoustic propagation [12], heat conduction [13], image denoising [14], derivative pricing [15], etc. In [16,17], multi-layer deep neural networks (DNNs) are used to approximate the solution to parabolic problems thanks to their universality. However, DNNs do not take into account the fundamental difference in the roles played by the time and spatial variables in the parabolic solution, and hence may not be efficient in capturing the spatio-temporal dependence of the parabolic solution. To this end, we propose a novel so-called XNODE model for the solution to reinforce the spatio-temporal dependence. It effectively incorporates the a priori information of the PDEs through modified neural ordinary differential equations (NODEs) [18].
More specifically, we consider the following parabolic PDE defined on an arbitrary bounded domain D ⊂ [0, T] × R^d under mild conditions:

∂_t u − Σ_{i,j=1}^d ∂_i(a_ij ∂_j u) + Σ_{i=1}^d b_i ∂_i u + c(u, t, x) − f(t, x) = 0 for (t, x) ∈ D,
u(t, x) = g(t, x) on ∂D,
u(0, x) − h(x) = 0 on Ω(0),    (1)

where a_ij, b_i, f : D → R, c : R × D → R and h : Ω(0) → R are given; c can be a non-linear function with respect to its first argument. Let Ω(t) := {x | (t, x) ∈ D} denote the spatial domain of D when restricting time to be t. Note that D can be not only a time-independent domain, i.e. D = [0, T] × Ω for some Ω ⊂ R^d, but also a time-varying domain, i.e. one where Ω(t) changes over time t.
The motivation of the XNODE model lies in the key observation that, for a fixed spatial variable x, a parabolic PDE solution ũ = u(·, x) evolves as an ODE solution, which naturally leads to employing the NODE network [18]. A NODE network builds a general model by leveraging an ODE solver and parameterising the vector field as a neural network, and can be regarded as a continuous version of the residual neural network. The NODE model, recently introduced in [18], soon became a top-pick model for (continuous) time series modelling because of its efficiency and scalability. As opposed to classical neural network models, the gradient calculation of NODE models via the adjoint method has constant memory cost.
Therefore, to model the parabolic PDE solution efficiently, the above observation leads to the proposed XNODE model, which is built by introducing an additional dependency on the spatial variable x into the NODE model. More specifically, the XNODE model takes any spatial variable x and the initial data h from (1) as the input, and returns the output of the NODE with vector fields dependent on the spatial variable x to predict the solution ũ = u(·, x). In contrast to DNNs, which treat the time and spatial variables equally, the proposed XNODE model utilizes the NODE to capture the temporal dependency of the solution path ũ, while the spatial dependence is embedded into the vector field of the NODE model. In the XNODE model, the initial condition h of the PDE is incorporated effectively through the initial condition of the NODE. Moreover, we introduce efficient data sampling and path construction with the XNODE method to ensure efficiency for predicting the solution on both time-independent domains and time-varying domains.
The formulation of the PDE problem using deep learning can be broadly divided into two classes: (1) optimisation problems and (2) min-max problems. For the first category, learning the PDE solution is translated into minimizing a loss function that quantifies the performance of the estimated solution. For example, one can use the mean square error of the governing equations together with the boundary and initial conditions as the loss function, see e.g. [19,16,7]. The least squares approach is known for its generality. However, the evaluation of high order derivatives is required in the cost functional if the governing equation is directly penalized [16,7]. Another commonly used cost functional is based on the energy or minimum principle [20,21,22]. Although the energy functional often involves only lower order derivatives, i.e. no higher derivative needs to be computed, the energy based method is restricted to those differential operators for which a minimum principle holds. The second category utilizes generative adversarial networks (GANs) to solve the PDEs through a min-max game formulation. The weak adversarial network (WAN) [17] is a typical example of this type. By leveraging the weak formulation of the PDE solution, WAN converts the task of finding the weak solution into a saddle point problem with the solution pair (u, φ), where u is the solution to the PDE and φ is a test function. Compared with the least squares method of the first category, WAN can be applied to a large general class of PDEs, including cases where the strong solution does not exist, as long as the weak solution is well defined. It can also be applied to cases where a minimum principle for the PDE does not hold. Although WAN involves a min-max optimization, it may bring computational benefits resulting from only requiring lower order derivative computation.
Based on the above considerations, we propose the hybrid XNODE-WAN method by incorporating the XNODE model into the WAN framework [17], as an enhancement of WAN to learn the solutions to parabolic PDEs more efficiently. In the WAN framework, there are a primal network and an adversarial network to approximate the solution (u) and the test function (φ) respectively, which are both approximated by DNNs [17]. In our proposed XNODE-WAN method, we replace the DNN with the XNODE model for the primal network of WAN, as the XNODE model is able to better model the spatio-temporal dependency of the solution and encode a priori information of the PDE, which leads to reduced training time and faster convergence. Like many other deep learning methods for PDE solvers [17,7,22], the proposed XNODE-WAN method is mesh-free. It can be applied to a general class of spatio-temporal domains, whose spatial domain can be irregular and/or time-varying.
Numerical results show that our proposed method significantly outperforms WAN in terms of training efficiency for both time-independent and time-varying domains. Moreover, such improvement of the XNODE-WAN model is consistent regardless of the dimension of the spatial variable, which demonstrates a better scalability for high dimensional PDEs.
The paper is structured as follows. Section 2 gives a brief introduction to the WAN framework [17]. It is followed by Section 3, which starts with preliminaries of the NODE model and then introduces the proposed XNODE model for the primal solution and the hybrid XNODE-WAN method for the parabolic PDE solution on general time-space domains. Numerical results are provided in Section 4. Section 5 concludes the paper.

Weak Adversarial Network (WAN) framework
In this section, we provide a brief summary of the theoretical underpinning and a general framework of the weak adversarial network for solving linear and non-linear parabolic PDEs of the form (1).

The weak formulation
We consider the PDE problem (1) on a time-dependent bounded domain denoted by D ⊂ [0, T] × R^d with Ω(t) = {x | (t, x) ∈ D}, where we impose typical assumptions such as the uniformly parabolic property of the differential operator. We assume further that a_ij ∈ L^∞(D), f ∈ L^2(D) and h ∈ L^2(Ω(0)).
When the problem (1) is linear and posed on a time-independent domain, i.e. D = [0, T] × Ω, the well-posedness of equation (1) can be ensured, given that the coefficients are uniformly elliptic [23].
In general, we consider the time-space domain D ⊂ [0, T] × R^d, and define the associated bilinear form ⟨A[u], φ⟩ and the corresponding linear operator. For ease of notation, H^1_0(D), L^2(D) and L^2(∂D) denote the usual Sobolev and Lebesgue spaces on D, where ∇f(t, ·) is the weak derivative of f(t, ·).
From integration by parts, it is easy to check that the weak solution u to (1) satisfies the variational problem ⟨A[u], v⟩ = 0 for all test functions v ∈ H^1_0(D). This leads to a new interpretation of u as a solution to the following minimization problem:

inf_{w} sup_{v ∈ H^1_0(D), v ≠ 0} |⟨A[w], v⟩|^2 / ||v||^2_{L^2(D)},    (3)

where ||v||_{L^2(D)} is the L^2 norm of v [17]. Note that for a given w, the optimal test function v is the one attaining the supremum. In (3) we reformulate the original parabolic PDE into an inf-sup type optimisation problem, which sets the foundation for the WAN framework.

WAN framework
In this subsection, we introduce the WAN framework initially proposed in [17] (cf. Algorithm 3 in [17]) for solving parabolic PDEs by leveraging the weak formulation in Section 2.1. The main idea is to formulate the problem as the inf-sup problem (3), where the PDE solution u and the test function φ are parameterised by deep neural networks u_θ and φ_η respectively, with θ, η being the trainable model weights. In [17], a deep neural network is employed for both u and φ. The primal network will be replaced by our proposed XNODE model, which will be discussed in Section 3.
The loss function of [17], which is also used in our method, consists of the interior loss, the boundary loss and the initial loss, where α, γ are hyperparameters acting as penalty weights. Note that the interior loss L_int is based on the weak formulation (3) of the solution, to quantify the discrepancy between the true solution u and the estimator u_θ over the interior. Due to the different norms and the log operator applied in the loss function, the coefficients α and γ should be chosen numerically by cross-validation to ensure efficiency. The evaluation of the three loss terms can be done through Monte-Carlo simulation on the domain. We create a collection of points for this evaluation at every iteration of our algorithm. As shown in [17], for any bounded domain D, at each iteration we uniformly sample N_n points {(t_i, x_i)}_{i=1}^{N_n} over the interior of the domain. Similarly, we conduct uniform sampling on the boundary ∂D, with the corresponding number of sampling points denoted by N_b. Let L̂_int(θ, η), L̂_bdry(θ) and L̂_init(θ) denote the corresponding Monte-Carlo estimators of L_int(θ, η), L_bdry(θ) and L_init(θ) based on the sampling points. We then define the empirical loss function accordingly; Eq. (6) is the loss function of the min-max game used in WAN to learn the PDE solution.
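To make the structure of the empirical loss concrete, the following sketch assembles the three Monte-Carlo terms. The interfaces (`weak_form`, the point sets, and the constant stand-in functions in the toy check) are illustrative assumptions rather than the paper's actual networks, and the interior term uses the operator-norm form of (3).

```python
import numpy as np

def empirical_loss(u, phi, weak_form, g, h, pts_int, pts_bdry, pts_init,
                   alpha=1.0, gamma=1.0):
    """Monte-Carlo sketch of L(theta, eta) = log(L_int) + alpha*L_bdry + gamma*L_init.
    `weak_form(u, phi, pts)` returns pointwise values of the integrand of <A[u], phi>;
    all interfaces here are illustrative assumptions, not the paper's exact API."""
    a = weak_form(u, phi, pts_int).mean()          # MC estimate of <A[u], phi>
    phi_sq = (phi(pts_int) ** 2).mean()            # MC estimate of ||phi||^2_{L2}
    L_int = a ** 2 / phi_sq                        # weak-formulation interior loss
    L_bdry = ((u(pts_bdry) - g(pts_bdry)) ** 2).mean()   # boundary mismatch
    L_init = ((u(pts_init) - h(pts_init)) ** 2).mean()   # initial-condition mismatch
    return np.log(L_int) + alpha * L_bdry + gamma * L_init

# Toy check with constant stand-ins: L_int = 1 and L_bdry = L_init = 0, so the loss is 0.
pts = np.random.default_rng(0).random((10, 3))
u_fn = lambda p: p[:, 0]
loss = empirical_loss(u_fn, lambda p: np.ones(len(p)),
                      lambda u, phi, p: np.ones(len(p)),
                      u_fn, u_fn, pts, pts, pts)
```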
While our method uses the same loss function, due to the XNODE model we adapt the above point sampling strategies to time-independent domains and time-dependent domains respectively. One may refer to Sections 3.2.1 and 3.2.2.
The algorithm of WAN is outlined in Algorithm 1, where the lines with additional comments will be modified in our algorithm in Section 3. A list of notation explanations can be found in Table B.2.

Algorithm 1 WAN Algorithm
Input: domain D, tolerance ε > 0, N_D/N_∂D/N_0: number of sampled points on the domain/boundary/initial condition, K_u/K_φ: number of solution/adversarial network parameter updates per iteration, α/γ ∈ R+: the weight of the boundary/initial loss, and the hyperparameters for all DNNs involved
1: initialize the network parameters θ and η
2: generate point sets X_D, X_∂D and X_0
3: while the loss exceeds ε do
4:   # update weak solution network parameter
5:   for k = 1 to K_u do
6:     compute ∇_θ L(θ, η); gradient calculation step
7:     update θ
8:   end for
9:   # update test network parameter
10:  for k = 1 to K_φ do
11:    compute ∇_η L(θ, η); gradient calculation step
12:    update η
13:  end for
14:  generate point sets X_D, X_∂D and X_0; random point sampling step for the next run
15: end while
Output: the weak solution u_θ : D → R
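The alternating update structure of Algorithm 1 can be sketched on a toy saddle problem. The finite-difference gradients and the quadratic loss below are placeholders (assumptions) standing in for backpropagation on the actual empirical loss.

```python
import numpy as np

def fd_grad(f, x, h=1e-6):
    """Central finite-difference gradient (placeholder for backprop/adjoint)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def wan_loop(loss, theta, eta, k_u=5, k_phi=3, lr_u=0.05, lr_phi=0.05,
             n_iter=200, tol=1e-6):
    """Alternating min-max updates in the spirit of Algorithm 1 (toy sketch)."""
    for _ in range(n_iter):
        for _ in range(k_u):                  # update weak solution parameters (descent)
            theta = theta - lr_u * fd_grad(lambda t: loss(t, eta), theta)
        for _ in range(k_phi):                # update adversarial parameters (ascent)
            eta = eta + lr_phi * fd_grad(lambda e: loss(theta, e), eta)
        # (in the real algorithm, fresh collocation points are sampled here)
        if abs(loss(theta, eta)) < tol:
            break
    return theta, eta

# Toy saddle: the minimum over theta is at 1.0, the maximum over eta at 0.5.
saddle = lambda th, et: (th[0] - 1.0) ** 2 - (et[0] - 0.5) ** 2
theta, eta = wan_loop(saddle, np.array([0.0]), np.array([0.0]))
```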

The XNODE-WAN method
In this section, we introduce a novel so-called XNODE model for the solution u to the parabolic PDE problem (1) on arbitrary spatio-temporal domains. It can be conveniently incorporated within the WAN framework by replacing the deep neural network for the primal solution with the XNODE model, to achieve superior training efficiency. We refer to our method as XNODE-WAN in the remainder of the paper.
The main motivation of the proposed XNODE model comes from the observation that, with the spatial variable fixed, the parabolic problem (1) reduces to an ODE problem, which can be efficiently modelled using the neural ODE model (NODE) [18]. Hence, the proposed XNODE model captures the additional dependency on the spatial coordinates and incorporates the a priori information of the parabolic PDEs and initial conditions effectively. To facilitate the efficiency of XNODE in approximating the primal solution over the domain, we propose the use of multiple constant paths traversing the domain as the point sampling strategy.
For the rest of this section, we start with a brief introduction to the conventional NODE model. Then we proceed to discuss how to adapt the NODE model to approximate the parabolic solution by introducing an additional spatial state variable, i.e., the XNODE model. This is followed by a subsection on using the XNODE-WAN model for solving time-independent domain problems, which involves incorporating the XNODE model within the WAN framework. We then generalize the proposed methodology to time-varying domains (see Figure 1 for illustrative examples of time-independent domains and time-varying domains respectively). The red squares there represent the inserted points that act as the initial points for the paths that cannot be traced back to the initial time.

The neural ODE (NODE) model
The NODE method [18], which fuses concepts of ODEs and neural networks, is well known for offering benefits to time series and density modeling. One of its main benefits lies in the computational feasibility brought by the adjoint sensitivity method, in which one solves an ODE to compute the gradients with constant memory cost.
More specifically, NODE models a continuous function mapping an input h_0 ∈ R^n to a path (h(t))_{t ∈ [0,T]} by defining the ODE model as follows: for every input h_0,

dh(t)/dt = N^vec_θ(h(t), t), h(0) = h_0.

Here, the vector field N^vec_θ is a parameterized neural network with trainable weights θ. The generality of the NODE model mainly comes from the fact that the vector field can be universally approximated by a neural network. By Picard's theorem, the initial value problem yields a unique solution provided N^vec_θ is Lipschitz continuous, which is guaranteed by using finite weights and the ReLU or tanh activation functions [18].
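A minimal sketch of a NODE forward pass, using a fixed-step explicit Euler solver purely for self-containment (real implementations use adaptive solvers and the adjoint method for gradients), is:

```python
import numpy as np

def node_forward(vec_field, h0, t_grid):
    """Euler discretization of dh/dt = N_vec(h, t), h(0) = h0. A sketch only:
    real NODE implementations use adaptive ODE solvers and the adjoint method."""
    h = np.asarray(h0, dtype=float)
    out = [h.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = h + (t1 - t0) * vec_field(h, t0)   # one explicit Euler step
        out.append(h.copy())
    return np.stack(out)

# vec_field(h, t) = -h gives h(t) close to h0 * exp(-t).
traj = node_forward(lambda h, t: -h, np.array([1.0]), np.linspace(0.0, 1.0, 101))
```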
The details of the forward evaluation of h and the derivative calculation of h with respect to θ can be found in [18]. Note that the dependency of the NODE output h on the input is passed solely through the initial condition of the ODE model, i.e., h_0.

XNODE-WAN

Time-independent domain
Consider the time-independent domain [0, T] × Ω, where Ω ⊂ R^d is bounded. For a fixed spatial variable x ∈ Ω, the map ũ : t → u(t, x) can be viewed as a solution to the following ODE problem derived from PDE (1):

dũ(t)/dt = F(ũ, t; x), ũ(0) = h(x),    (8)

where F is a function depending on (ũ, t; x). Note that although F is not observable, due to the unknown partial derivatives ∂_i u(t, x), it can be universally approximated by a neural network provided that F is continuous. This observation motivates us to use the NODE model [18] for the primal solution ũ for each given x.

XNODE Model: Similar to the NODE model, we employ a deep neural network N^vec to approximate the vector field F in (8). To construct a universal model for the primal solution ũ, we suggest in (8) that the spatial variable x should be incorporated into the vector field N^vec as an additional dependent variable. More specifically, the XNODE model maps an arbitrary input x to a function o_x : [0, T] → R, defined as the output of the NODE model with the x-dependent vector field. More precisely, for every t ∈ [0, T], we have

h(t) = N^init_θ1(x) + ∫_0^t N^vec_θ2(h(s), s, x) ds,  o_x(t) = L_θ(h(t)),

where N^vec_θ2 and N^init_θ1 are neural networks fully parameterized by θ_2 and θ_1, for the vector field and the initial condition of the hidden state h respectively, and L_θ is a linear trainable layer with weights θ. We denote by Θ = (θ_1, θ_2, θ) the set of all trainable model parameters of the proposed XNODE model. We note that the output of XNODE, o, is a function living in an infinite dimensional space. In practice, however, the model output has to be finite dimensional for numerical computation. Hence, given any time partition Π_T = (t_i)_{i=1}^{n_T} of [0, T], the corresponding output of XNODE is given by XNODE_Θ(x, Π_T) := o_x(Π_T).
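Under these definitions, a toy numerical sketch of the XNODE forward pass might look as follows. The single-layer networks, hidden width, random untrained weights and Euler discretization are illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

class XNODE:
    """Minimal XNODE sketch: an ODE hidden state whose vector field also takes the
    spatial input x, with the initial state produced by N_init(x). The layer
    shapes, single-layer networks and Euler solver are illustrative assumptions."""

    def __init__(self, d, hidden=8):
        self.W_init = rng.normal(size=(hidden, d)) / np.sqrt(d)
        self.W_vec = rng.normal(size=(hidden, hidden + d + 1)) / np.sqrt(hidden + d + 1)
        self.w_out = rng.normal(size=hidden) / np.sqrt(hidden)

    def __call__(self, x, t_grid):
        h = np.tanh(self.W_init @ x)                 # N_init: x -> h(0)
        out = [self.w_out @ h]                       # linear readout L_theta
        for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
            z = np.concatenate([h, x, [t0]])         # x-dependent vector field input
            h = h + (t1 - t0) * np.tanh(self.W_vec @ z)
            out.append(self.w_out @ h)
        return np.array(out)                         # approximates u(t_i, x), t_i in Pi_T

model = XNODE(d=3)
u_vals = model(np.ones(3), np.linspace(0.0, 1.0, 11))
```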
In essence, our proposed XNODE model is built upon the NODE model, though it incorporates an additional spatial variable to model the high dimensional latent process of the PDE solution ũ. The uniqueness of XNODE follows from that of NODE, by considering x as an additional model parameter. Similarly, the adjoint sensitivity of XNODE is inherited from NODE, which allows us to compute gradients with low memory cost. We also establish the universal approximation property of XNODE in Theorem 2 in the Appendix.
To effectively use the initial condition of the PDE, u(0, x) = h(x), the hidden state at the initial time, h(0), is given by a neural network, without compromising the discriminative capacity of the model. As long as L_θ ∘ N^init can approximate the identity map, which is guaranteed by the universality of the neural network, the model is able to achieve a satisfactory fit of the initial condition.
Point sampling. At each iteration, as in the WAN Algorithm 1, to compute L̂_int, L̂_init and L̂_bdry defined in (5), one needs to sample points {(t_i, x_i)}_i uniformly on the interior and the boundary of the domain [0, T] × Ω respectively and estimate the solution u evaluated at the grid. Recall that the output of XNODE_Θ(x, Π_T) approximates the solution u evaluated at the points Π_T × {x}. Therefore, unlike the sampling strategy of WAN, XNODE only requires sampling independently and uniformly on the time domain and the spatial domain. Let us denote the sampled time partition, the collocation points of the spatial domain Ω and those of the domain boundary ∂Ω by Π_T, S_r and S_b respectively, with S_r := {x^r_j ∈ Ω}_{j=1}^{N_r} and S_b := {x^b_j ∈ ∂Ω}_{j=1}^{N_b}. Note that the proposed sampling method (e.g. see the solid red points in Figure 1a for an illustration) implicitly generates a uniformly sampled grid of [0, T] × Ω in a more economical way, by recycling the uniformly sampled spatial points at each sampled time.

Weak solution prediction. For any point x ∈ Ω, we can construct the corresponding constant spatial path (i.e. X(t) ≡ x, ∀t ∈ [0, T]; see e.g. a solid line in Figure 1a). The output XNODE_Θ(x, Π_T) then provides a discrete approximation of the weak solution u evaluated at {(x, t_i)}_{t_i ∈ Π_T} along this constant spatial path. By repeating this procedure on each point x^r_i ∈ S_r, we can generate a collection of constant paths (X^r_i)_{i=1}^{N_r} such that X^r_i ≡ x^r_i. Similarly, we generate a collection of constant paths (X^b_i)_{i=1}^{N_b}. In total, we obtain the corresponding (XNODE_Θ(X_i, Π_T))_{x_i ∈ S_r ∪ S_b} to approximate the solution u over the grid Π_T × (S_r ∪ S_b). The algorithm of using the XNODE model to approximate the weak solution network is given in Algorithm 2.

• In line 1, the DNN network for the primal solution u_θ is replaced by the XNODE network.
• In line 2, the point sampling strategy is updated as described before.
For each iteration, our algorithm implements such sampling to generate a set of random collocation data points over the domain.
• In line 6 and line 12, the adjoint method is used to compute the derivatives of the XNODE model, instead of the standard gradient computation of the DNN network in WAN.
The full XNODE-WAN algorithm is outlined in Algorithm 5 in Appendix.
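The constant-path sampling of Section 3.2.1 can be sketched as below, assuming for illustration that Ω is the unit hyper cube [0, 1]^d; the function name and the face-snapping boundary sampler are our own simplifications, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_collocation(d, n_t, n_r, n_b, T=1.0):
    """Sampling sketch for a time-independent domain [0, T] x [0, 1]^d (the hyper
    cube is an assumption for illustration): one shared time partition Pi_T is
    reused across all sampled spatial points, so each x induces a constant path."""
    t_part = np.sort(rng.uniform(0.0, T, size=n_t))       # shared time partition Pi_T
    x_int = rng.uniform(0.0, 1.0, size=(n_r, d))          # interior points S_r
    x_bdry = rng.uniform(0.0, 1.0, size=(n_b, d))         # boundary points S_b:
    axes = rng.integers(0, d, size=n_b)                   # snap one random coordinate
    x_bdry[np.arange(n_b), axes] = rng.integers(0, 2, size=n_b)  # onto a cube face
    return t_part, x_int, x_bdry

t_part, x_int, x_bdry = sample_collocation(d=3, n_t=10, n_r=20, n_b=5)
```

Pairing each spatial sample with the shared partition yields the grid Π_T × (S_r ∪ S_b) described above.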

Time-dependent domain
In this subsection, we consider the more general case D ⊂ [0, T] × R^d, which allows the spatial domain to be time-dependent. Let Ω(t) denote the spatial domain of D restricted to any t ∈ [0, T], i.e. Ω(t) = {x | (t, x) ∈ D}. When Ω(t) changes over t ∈ [0, T], we call D a time-varying domain. The proposed XNODE model in Section 3.2.1 cannot be directly applied to a time-varying domain, as the constant path X ≡ x ∈ Ω(0) might go outside the domain D at some time s ∈ [0, T], i.e. (s, x) ∉ D (see e.g. the sample point marked in • in Fig. 1b).
To circumvent this problem, we shall divide each constant path X ≡ x into multiple sub-paths and only keep the sub-paths remaining within the domain D. To be more specific, we first define the entry and exit points on a constant path X.
Let X denote a constant path taking the value x, i.e. X(t) ≡ x, ∀t ∈ [0, T]. For i = 1, 2, ..., the i-th entry point and exit point of the constant path X, denoted by τ^(i)_X and τ̄^(i)_X, are defined recursively as follows:

τ^(i)_X := inf{t ≥ τ̄^(i−1)_X : (t, x) ∈ D},  τ̄^(i)_X := inf{t > τ^(i)_X : (t, x) ∉ D},

with τ̄^(0)_X := 0 and the convention inf ∅ = +∞. Let N_x denote the total number of finite entry points τ^(i)_X.
An illustration example is given in Fig. 2, with the blue solid dots along the blue lines indicating where the entry and exit points locate for the given constant path X.
For ease of notation, when there is no ambiguity, we omit the subscript X in τ_X for the rest of the paper. When restricting the constant path X ≡ x to the time interval from the i-th entry point τ^(i) to the i-th exit point τ̄^(i), we obtain a constant sub-path denoted by X_i = X|_{[τ^(i), τ̄^(i)]}, whose time-augmented path takes values in the closure of D (e.g. see the solid red lines in Fig. 1b).
We assume that for the given domain D, the entry and exit points of every constant path X ≡ x can be computed and N_x is finite. For any given time partition Π_T and spatial point x, we propose the following way to assign the collocation of time points from Π_T to each sub-path X_i for the XNODE model. We define a set Π^{x,i}_T, composed of the i-th entry point τ^(i) and all the elements of Π_T belonging to the interval (τ^(i), τ̄^(i)]; Π^{x,i}_T is the set of time collocation points associated with the sub-path X_i. Fig. 2 presents an example of computing the entry and exit points and constructing the time collocation points of each constant sub-path.
Example 3.1 In Fig. 2, the shaded grey region represents the domain D. For the given constant path X ≡ x, marked as the solid blue line, one can compute two pairs of finite entry and exit points, denoted by {(τ^(i), τ̄^(i))}_{i=1}^2, which correspond to two constant sub-paths within the domain D. Let Π_T = (t_i)_{i=1}^8 denote a time partition of length 8 in Figure 2. The time collocation points of the first and second sub-paths are Π^{x,1}_T and Π^{x,2}_T respectively.

Now we are ready to outline the proposed modification of Algorithm 2 for the time-varying domain as follows.
Point sampling: First of all, we find the minimum time-independent domain covering D, i.e. D̄ = [0, T] × Ω_max, where Ω_max = ∪_{t∈[0,T]} Ω(t). We adopt the point sampling method in Section 3.2.1 to generate the uniform grid covering [0, T] × Ω_max, i.e. Π_T × S_r, where S_r is sampled uniformly from Ω_max. Secondly, for every x ∈ S_r, we compute {(τ^(i), τ̄^(i))}_{i=1}^{N_x}. We then construct the x-valued constant sub-paths and determine the time collocation points associated with each sub-path as described above. The algorithm for computing Π^{x,i}_T is given in Algorithm 3. To sample the boundary points of the general domain D, we use the same method as WAN [17], i.e. uniformly sample a number of points from the boundary ∂D.

Figure 2: Illustration of the construction of Π^{x,1}_T and Π^{x,2}_T for the given domain, presented as a shaded region. The solid line presents the constant path X ≡ x. The time coordinates of the blue solid dots are the entry and exit points, while the blue circles represent {x} × Π_T, the samples along the path. There are two points, (t_4, x) and (t_5, x), along the path but not in the shaded domain, which are filtered out when constructing sub-paths within the domain.

Algorithm 3 Compute the constant sub-paths and the corresponding time collocation points
1: compute the entry and exit points {(τ^(i), τ̄^(i))}_{i=1}^{N_x}
2: for i from 1 to N_x do
3:   initialize Π^{x,i}_T as an empty list
4:   add τ^(i) to Π^{x,i}_T
5:   for t in Π_T with t ∈ (τ^(i), τ̄^(i)] do
6:     add t to Π^{x,i}_T
7:   end for
8: end for

Weak solution prediction. For each x ∈ S_r, we have N_x constant sub-paths defined on the sub-time intervals [τ^(i), τ̄^(i)] with corresponding discrete time collocation points Π^{x,i}_T. Therefore we can estimate the weak solution u evaluated at {x} × Π^{x,i}_T by XNODE(x, Π^{x,i}_T) for i ∈ {1, ..., N_x}. Note that the initial condition of the XNODE model for each constant sub-path X_i is well defined, as u(τ^(i), x) is given by either the initial condition or the boundary condition.
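Algorithm 3 can be rendered in Python roughly as follows; this is a sketch, and the representation of entry/exit pairs as a list of tuples is our own choice, not the paper's data structure.

```python
import numpy as np

def subpath_collocations(entry_exit, t_partition):
    """Sketch of Algorithm 3: for each pair (tau^(i), tau-bar^(i)), the collocation
    set Pi^{x,i}_T is tau^(i) followed by every t in Pi_T with
    tau^(i) < t <= tau-bar^(i)."""
    colls = []
    for tau, tau_bar in entry_exit:
        pts = [tau] + [t for t in t_partition if tau < t <= tau_bar]
        colls.append(np.array(pts))
    return colls

# Two sub-paths on a rounded uniform partition of [0, 1].
t_part = np.round(np.linspace(0, 1, 11), 10)
colls = subpath_collocations([(0.0, 0.35), (0.6, 1.0)], t_part)
```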
We now present the modified algorithm of XNODE for the time-dependent domain in Algorithm 4, on top of Algorithm 3.

Numerical Results
In this section, we conduct several numerical experiments on parabolic PDEs defined on time-independent domains and time-varying domains respectively. We compare the performance of the proposed XNODE-WAN algorithms (Algorithm 2 and Algorithm 4) with the baseline WAN algorithm [17]. In particular, following the numerical example in [17], we consider PDE examples in the form of a d-dimensional nonlinear diffusion-reaction equation (Eq. (11)) defined on a bounded domain, where f, g and h may vary from example to example. In Section 4.2, we consider PDE (11) on two time-independent domains D = [0, 1] × Ω, where Ω is a hyper cube or a unit ball. Note that our example where Ω is a hyper cube is the same as that in Section 4.2.6 of [17]. We then investigate the scalability of our proposed method with respect to high spatial dimensions in Section 4.3, followed by a sensitivity analysis in Section 4.4. At the end of this section, we provide an example on a time-varying domain. The numerical implementation of our work is available at https://github.com/paulvoliva/XNODE-WAN-PDE-solver.git.

Experimental setup
To quantitatively assess the accuracy and efficiency of the method, we consider the following test metrics:

1. The relative error: ||u_θ − u||_{L^2} / ||u||_{L^2}, where u_θ and u are the approximate and exact solutions of the problem, and || · ||_{L^2} denotes the L^2 norm on D.

2. Time per epoch: the average time elapsed in each epoch of the training process.

3. N_ε: for any error tolerance level ε > 0, the minimum number of epochs such that the relative error of the trained model is no more than ε.

4. T_ε: given any error tolerance level ε > 0, the minimum time (in seconds) such that the relative error of the trained model is no more than ε; we also refer to this as the time at convergence.
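Metric 1 can be estimated on sampled evaluation points as a simple Monte-Carlo ratio; the helper below is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def relative_l2_error(u_pred, u_true):
    """Monte-Carlo estimate of ||u_pred - u_true||_{L2} / ||u_true||_{L2},
    evaluated on sampled points (a sketch of test metric 1)."""
    return np.sqrt(np.mean((u_pred - u_true) ** 2)) / np.sqrt(np.mean(u_true ** 2))

# A 10% pointwise perturbation of a constant solution gives a 10% relative error.
err = relative_l2_error(np.array([1.1, 0.9]), np.array([1.0, 1.0]))
```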
For a fair comparison, we choose the same hyperparameters for both WAN and XNODE-WAN, with the learning rate of the adversarial network τ_η = 0.04 (see Table B.2 for a summary of the notation used in the models and algorithms). Note that we set N_D = N_r · n_T and N_∂D = N_b · n_T for WAN to ensure a fair comparison. Besides, we keep the network architecture of the adversarial network (test function φ_η) the same for both WAN and XNODE-WAN. However, as the training performance of each method may be sensitive to the learning rate, the learning rate of the primal network is adjusted for the XNODE and DNN separately. By hyper-parameter tuning, the learning rates of XNODE-WAN and WAN for the primal network are chosen to be 0.015 and 0.00005 respectively. Note that when increasing the learning rate of the WAN algorithm beyond 0.00005, WAN may suffer from either divergence or slower training. One may refer to Section 4.4 for the effects of the learning rate. We use the above hyper-parameter setting for all the following numerical examples.

Time-independent domains
In this subsection, we investigate the performance of the proposed method on a time-independent domain D = [0, 1] × Ω. To demonstrate that our method is applicable to a general spatial domain Ω, we consider two examples of Ω: (1) a hyper cube and (2) a unit ball.

Example 4.1 We consider the same example as in Section 4.2.6 of [17], with Ω = [0, 1]^5 and with f, g, h specified so that the exact solution is known in closed form. The stopping criterion for both algorithms is set to achieve a relative error tolerance level ε = 10^{-3}. We also run 50 trials on randomly selected data to measure the relative L^2 error of the obtained models. We report the mean and the standard deviation of the relative error of XNODE-WAN and WAN in Table C.3, which demonstrates that the XNODE-WAN model significantly outperforms WAN in terms of computational time and iteration number. In particular, to reach the same stopping criterion, the XNODE-WAN model saves 99.8% in time and takes less than 1.5 minutes for training in this case. Furthermore, XNODE-WAN and WAN need 165 and 15,463 iterations respectively, whereas the time used per iteration is comparable. In addition, Figure 3a gives the evolution of the relative errors over time for both models. Clearly the proposed model converges much faster than WAN. Figures 3b and 3d show the comparison between the XNODE-WAN and WAN estimators of the PDE solution u(T, x) at convergence, together with their respective absolute point-wise errors on the (x_1, x_2) domain. They show that the out-of-sample point-wise absolute error of our method is smaller in magnitude than that of WAN. The superior performance of our method over WAN is also verified by a smaller relative error (1.7% ± 0.0114% vs. 2.6% ± 0.0144%) with statistical significance.
Finally, in Figure 3e, we show that when XNODE-WAN reaches the stopping criterion at 74s, the WAN solution is still far from the true solution, with a large error. For the unit ball domain, we again implement both models, and the results are shown in Figure 4 with the same stopping criterion, i.e. a relative L^2 error of ε = 10^{-3}. The evolution of the relative L^2 error for the hypersphere domain is similar to that in Example 4.1. In particular, the XNODE-WAN model achieves the targeted error tolerance within 100s, while it takes longer than 3,000s for WAN to reach the stopping criterion.
This example justifies the advantage of using XNODE-WAN for solving parabolic PDEs with an irregular spatial domain Ω.

Scalability analysis
In this subsection, we investigate the scalability of the proposed XNODE-WAN model with respect to the spatial dimension. It is easy to verify the exact solution to (11) in this setting; note that in Example 4.3, ||u||_{L^2} = 1 regardless of the problem dimension d. For both methods, we choose the numbers of points N_r, N_b = 800d and ε = 10^{-2}.
We investigate the performance of our proposed XNODE-WAN method for different spatial dimensions d in terms of the relative error evolution over training, the total training time, and the training time per epoch, visualized in Figure 5. Figure 5a shows that even for the high dimension d = 64, our method is able to converge to the relative error tolerance ε = 10^{-2} within 10^2 seconds.
Figure 5b further reports the total training time and the training time per epoch for each dimension.

Sensitivity analysis
In this subsection, we aim to investigate the sensitivity of our proposed XNODE model with respect to the network architecture of the hidden field N^vec_{θ2} and the learning rate τ_θ. Here we choose the same PDE setting as defined in Section 4.3 and set d = 5. We follow [17] and choose the error tolerance level ε = 0.005.
To investigate the effect of the architecture of the hidden field N^vec_{θ2} on the predictive performance of XNODE-WAN, we consider different numbers of neurons per layer (hidden dimension) and numbers of layers (depth). When the hidden dimension is as high as 18 or 22, our method achieves satisfactory results. We also observe that the predictive performance of our method improves when the depth is increased to 7 or above. This suggests that our method usually works well as long as the network architecture is sufficiently rich. However, an over-complicated network should be discouraged, as it may prolong the training time and cause overfitting. Furthermore, we consider different learning rates for both our method and WAN and compare the corresponding performance. Figure 6b shows that, independently of the depth of the hidden field, a learning rate of magnitude 10^{-5} is needed for WAN to achieve a reasonable loss, while XNODE-WAN allows a much wider range of learning rates, from 10^{-5} to 10^{-2}. In practice, we conduct a grid search to select the optimal learning rate for each method: among the learning rates achieving satisfactory accuracy, we choose the one with the least training time. This justifies the different learning rates used for XNODE-WAN and WAN.
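The selection rule described above, pick the learning rate that reaches the target accuracy in the least training time, can be sketched as follows. Here `train_fn` is a hypothetical callable that trains a model at a given learning rate and reports the achieved relative error and the elapsed training time (an assumption for illustration, not the paper's interface):

```python
def grid_search_lr(train_fn, lrs, tol):
    """Among learning rates whose training run reaches relative error
    <= tol, return the one with the smallest wall-clock training time."""
    best_lr, best_time = None, float("inf")
    for lr in lrs:
        err, elapsed = train_fn(lr)
        # keep only runs that hit the accuracy target, then minimise time
        if err <= tol and elapsed < best_time:
            best_lr, best_time = lr, elapsed
    return best_lr
```

Returning `None` when no learning rate meets the tolerance makes the failure case explicit to the caller.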
Note that, to illustrate the idea, in this example we only consider d = 1.
In this example, we choose the error tolerance level ε = 10^{-2} to train both the XNODE-WAN and WAN models. The performance comparison results are shown in Figure 7. In particular, from Figure 7b we observe that both models are able to learn the solution on the time-varying domain to a satisfactory level: the heatmap in Figure 7b shows that both achieve an error below 0.04 for most of the region at convergence. However, as shown in Figure 7a, similarly to the time-independent case, XNODE-WAN converges much faster than WAN in terms of the relative error, as reflected by the widening relative error gap between the two models over time until the end of XNODE-WAN training.
As expected, to solve the PDE problem on a time-varying domain, both WAN and XNODE-WAN may require more computational time to train than their time-independent counterparts. This is evidenced by the fact that the computational cost of each model on the low-dimensional time-varying domain example (Example 4.4) is even higher than that on the higher dimensional time-invariant domain (see Figure 3a). For our method, this is mainly a result of the additional complexity of the constant sub-path construction in Algorithm 4.
More importantly, for this time-varying domain example, our proposed model again consistently outperforms in terms of efficiency: WAN requires significantly longer (5,000s) to converge than XNODE-WAN (less than 100s) in Figure 7a. In part, this can be explained by the fact that the XNODE-WAN model incorporates the boundary condition into the model (Eq. (10)), so its weak solution estimates evaluated at the entry points on the boundary have little error. This anchors our model at more places than the WAN model, making it better suited to a time-varying domain. This example justifies the advantage of using XNODE-WAN for solving PDEs on a time-varying domain.

Conclusion
In this paper, we propose the novel XNODE model as a universal and effective network for parabolic PDE solutions, which leverages the a priori information of the PDE through NODEs. To tackle high dimensional parabolic PDE problems on arbitrary bounded domains, incorporating the XNODE model into WAN as the primal network leads to significantly faster training and better scalability. Our XNODE-WAN method consistently outperforms WAN in terms of computational time on both time-independent and time-dependent domains.

Proof. In this proof, we constrain the XNODE models to those that have hidden dimension 1 and the identity map as their output layer. Therefore F̃ satisfies the following ODE, similar to (A.4):

dF̃(t, x)/dt = N^vec_{θ1}(F̃, t; x), F̃(0, x) = N^init_{θ2}(h(x)), (A.6)

where the activation functions (e.g. ReLU) are chosen such that the vector field is uniformly Lipschitz continuous. The well-posedness of the XNODE model is a direct consequence of Picard's theorem; so is that of the function F, due to the Lipschitz continuity assumption of its vector field F. Thanks to the universality of neural networks, for every ε > 0, there exist two neural networks with a sufficiently large number of hidden neurons and layers, denoted by N^vec.
Figure 1: An illustration of constant paths for spatial-time domains: (a) a time-independent domain; (b) a time-dependent domain. (a) The red stars are the initial time points and the points are all the points at which we evaluate the data; note that they are aligned in the time dimension. (b) The red squares represent the inserted points that act as the initial points for the paths that cannot be traced back to the initial time.

Algorithm 2: XNODE network for modelling the weak solution u on a constant domain [0, T] × Ω
Input: sampled sets of points S_r = {x^r_j ∈ Ω}_{j=1}^{N_r} and S_b = {x^b_j ∈ ∂Ω}_{j=1}^{N_b}; time partition Π_T = (t_i)_{i=0}^{n_T} of [0, T]; XNODE parameter set Θ = (θ_1, θ_2, θ)
Output: O, the approximation of u at S_r × Π_T and S_b × Π_T
1: initialize O as an empty vector
2: for x in S_r ∪ S_b do
3:   compute o_x = XNODE_Θ(x, Π_T)  ▷ approximate u at {x} × Π_T
4:   set O = concatenate(O, o_x)
5: end for
6: return O

XNODE-WAN Algorithm: Finally, we obtain the XNODE-WAN model by replacing the DNN network for the primal solution in the WAN algorithm (see Algorithm 1) with the XNODE network. In particular, we adapt the WAN algorithm in the following steps:
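The inner call o_x = XNODE_Θ(x, Π_T) on line 3 of Algorithm 2 can be sketched with a simple explicit Euler discretisation of the underlying NODE. This is a minimal sketch: `init_net`, `vec_field` and `output_layer` are hypothetical stand-ins for N^init_{θ2}, N^vec_{θ1} and the output layer, and the actual model may use a more accurate ODE solver:

```python
import numpy as np

def xnode_forward(x, t_grid, init_net, vec_field, output_layer):
    """Integrate the hidden state h along the time partition with an
    explicit Euler scheme, reading out u(t_i, x) at every node."""
    h = init_net(x)                     # hidden state at t_0
    outs = [output_layer(h)]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = h + (t1 - t0) * vec_field(h, t0, x)  # one Euler step of the NODE
        outs.append(output_layer(h))
    return np.array(outs)

def algorithm2(points, t_grid, init_net, vec_field, output_layer):
    """Outer loop of Algorithm 2: evaluate u on (S_r ∪ S_b) × Π_T."""
    return np.stack([xnode_forward(x, t_grid, init_net, vec_field,
                                   output_layer) for x in points])
```

Note how the spatial point x enters both the initial condition and the vector field, which is the extra spatial dependency distinguishing XNODE from a plain NODE.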

Figure 2: An illustration of the entry and exit points of a constant path (see Definition 3.1) and the corresponding sub-paths Π^{x,1}_T and Π^{x,2}_T for the given domain, presented as a shaded region. The solid line represents the constant path X ≡ x. The time coordinates of the blue solid dots are the entry and exit points, while the blue circles represent the samples x × Π_T along the path. Clearly, there are two points (t_4, x) and (t_5, x) along the path but not in the shaded domain, which are filtered out when constructing sub-paths within the domain.

Algorithm 4: XNODE network for modelling the weak solution u on a time-dependent domain D
Input: sampled set of points S_r = {x^r_j ∈ Ω_max}_{j=1}^{N_r}; time partition Π_T = (t_i)_{i=0}^{n_T} of [0, T]; XNODE parameter set Θ = (θ_1, θ_2, θ)
Output: O, the approximation of u at S_r × Π_T
1: initialize O as an empty vector
2: for x in S_r do
3:   compute (Π^{x,i}_T)_{i=1}^{N_x} with the input arguments D, x and Π_T through Algorithm 3
4:   initialize o_x as a zero vector of length N_x
5:   for i from 1 to N_x do
6:     compute o_{x,i} = XNODE_Θ(x, Π^{x,i}_T)  ▷ approximate u at {x} × Π^{x,i}_T
7:     add o_{x,i} to o_x in the corresponding locations based on Π^{x,i}_T
8:   end for
9:   set O = concatenate(O, o_x)
10: end for
11: return O
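The sub-path construction that Algorithm 4 delegates to Algorithm 3 amounts to splitting the constant path into maximal runs of consecutive time nodes lying inside D, with each run's first node acting as an entry point and its last as an exit point. A minimal sketch, where `inside(t, x)` is a hypothetical membership test for the domain D:

```python
def subpaths(x, t_grid, inside):
    """Along the constant path X = x, keep the time nodes with
    (t, x) in D and split them into maximal consecutive runs."""
    runs, current = [], []
    for t in t_grid:
        if inside(t, x):
            current.append(t)       # extend the current sub-path
        elif current:
            runs.append(current)    # the path left D: close the run
            current = []
    if current:
        runs.append(current)        # close a run ending at t = T
    return runs
```

Nodes outside D (such as (t_4, x) and (t_5, x) in Figure 2) are simply dropped, and each returned run is one sub-path Π^{x,i}_T fed to the XNODE call on line 6.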

(a) The relative error over time; (b) XNODE-WAN solution and its error at convergence (74s); (c) true solution u(T, x); (d) WAN solution and its error at convergence (5,865s); (e) WAN solution and its error at 74s

Figure 3: Performance comparison between XNODE-WAN and WAN in Example 4.1. In the right subplots of (b), (d) and (e), the error is the pointwise absolute error |u(T, x) − û(T, x)|, where û is the model's estimated solution. Images (b)-(e) only show the 2D slice x_3 = x_4 = x_5 = 0, as the true solution u of this example only depends on the first two coordinates.

Figure 4: Performance comparison of XNODE-WAN and WAN in Example 4.2, where Ω is a 5-dimensional unit ball. (a) The evolution of the relative L^2 error of both the XNODE-WAN and WAN methods over time; (b) the pointwise absolute error |u(T, x) − û(T, x)|, where û is the model trained by each method until hitting the stopping criterion. All images only show the 2D slice x_3 = x_4 = x_5 = 0, as the true solution u of this example only depends on the first two coordinates.

As shown in Figure 5b, to achieve the error tolerance 10^{-2} for a dimension as high as d = 64, the training time of WAN is about 4,000s, roughly 40 times that of XNODE-WAN. It appears that, as N_r and N_b are chosen proportional to d, the average training time increases approximately linearly in the dimension d for XNODE-WAN. Clearly, for the same dimension d, the training time per epoch of WAN is longer than that of XNODE-WAN, especially for large values of d. In total, Figure 5 shows that XNODE-WAN has greater potential for scalability to higher dimensional PDEs.
(a) The relative error over time (b) Total computation time and training time per epoch

Figure 5: Scalability results of Example 4.3 in terms of the spatial dimension d. (a) The evolution of the relative L^2 error of the XNODE-WAN method over time (0 to 10^3 s) for d ∈ {4, 16, 64}; (b) the total computational time (left) and average training time per epoch (right) of XNODE-WAN and WAN until they respectively achieve the error tolerance. Here we use ε = 10^{-2} for each spatial dimension d ∈ {4, 8, 16, 32, 64}.
(a) XNODE-WAN: depths against hidden dimensions (b) XNODE-WAN (right) and WAN (left): the heatmap of relative error against the network depth and learning rate

Figure 6: The relative error plots for the sensitivity analysis. All these experiments are implemented under the same setting of hyperparameters as in Section 4.3.

4.5. Time-varying domain

Example 4.4 We now consider the PDE problem on the time-varying domain D ⊂ [0, 1] × [−1, 1] defined as follows (see Figure 7a for its shape),

(a) The relative error over time; (b) error at convergence for the XNODE-WAN and WAN solutions

Figure 7: Performance comparison of XNODE-WAN and WAN on a time-dependent domain in Example 4.4. (a) The evolution of the relative L^2 error of both the XNODE-WAN and WAN methods over time; (b) the pointwise absolute error |u(t, x_1) − û(t, x_1)|, where û is the model trained by each method until hitting the stopping criterion.

Table 1: Performance comparison between XNODE-WAN and WAN on Example 4.1.
Table C.3: Performance comparison between XNODE-WAN and WAN on various domains for Example 4.2 and Example 4.4