Tackling the Curse of Dimensionality with Physics-Informed Neural Networks

The curse-of-dimensionality taxes computational resources heavily with exponentially increasing computational cost as the dimension increases. This poses great challenges in solving high-dimensional PDEs, as Richard E. Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerically partial differential equations (PDEs) in high dimensions, such computations are prohibitively expensive, and true scaling of general nonlinear PDEs to high dimensions has never been achieved. We develop a new method of scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method, called Stochastic Dimension Gradient Descent (SDGD), decomposes a gradient of PDEs into pieces corresponding to different dimensions and randomly samples a subset of these dimensional pieces in each iteration of training PINNs. We prove theoretically the convergence and other desired properties of the proposed method. We demonstrate in various diverse tests that the proposed method can solve many notoriously hard high-dimensional PDEs, including the Hamilton-Jacobi-Bellman (HJB) and the Schr\"{o}dinger equations in tens of thousands of dimensions very fast on a single GPU using the PINNs mesh-free approach. Notably, we solve nonlinear PDEs with nontrivial, anisotropic, and inseparable solutions in 100,000 effective dimensions in 12 hours on a single GPU using SDGD with PINNs. Since SDGD is a general training methodology of PINNs, it can be applied to any current and future variants of PINNs to scale them up for arbitrary high-dimensional PDEs.


Introduction
The curse-of-dimensionality (CoD) refers to the computational and memory challenges when dealing with highdimensional problems that do not exist in low-dimensional settings.The term was first introduced in 1957 by Richard E. Bellman [7,20], who was working on dynamic programming, to describe the exponentially increasing costs due to high dimensions.Later, the concept was adopted in various areas of science, including numerical partial differential equations (PDEs), combinatorics, machine learning, probability, stochastic analysis, and data mining.In the context of numerically solving PDEs, an increase in the dimensionality of the PDE's independent variables tends to lead to a corresponding exponential rise in computational costs needed for obtaining an accurate solution.This poses a considerable challenge for all existing algorithms.
Physics-Informed Neural Networks (PINNs) are highly practical in solving PDE problems [54].Compared to other methods, PINNs can solve various PDEs in general form, enabling mesh-free solutions and handling complex geometries.Additionally, PINNs leverage the interpolation capability of neural networks to provide accurate predictions throughout the domain, unlike other methods for high-dimensional PDEs, which can only compute the value of PDEs at a single point or are restricted to certain PDE types [1,21,52].Due to neural networks' flexibility and expressiveness, in principle, the set of functions representable by PINNs have the ability to approximate high-dimensional PDE solutions.However, PINNs may also fail in tackling CoD due to insufficient memory and slow convergence when dealing with high-dimensional PDEs.For example, when the dimension of PDE is very high, even just using one collocation point can lead to an insufficient memory error due to the massive high-dimensional derivatives in the PDE.Unfortunately, the aforementioned high-dimensional and high-order cases often occur for very useful PDEs, such as the Hamilton-Jacobi-Bellman (HJB) equation in stochastic optimal control, the Fokker-Planck equation in stochastic analysis and high-dimensional probability, and the Black-Scholes equation in mathematical finance, etc.These PDE examples pose a grand challenge to the application of PINNs in effectively tackling real-world large-scale problems.
To scale up PINNs for arbitrary high-dimensional PDEs, we propose a training method of PINNs by decomposing and sampling gradients of PDE's and PINN's residual, which we name stochastic dimension gradient descent (SDGD).It accelerates the training of PINNs while significantly reducing the memory cost, thereby realizing the training of any high-dimensional PDEs with acceleration.Specifically, we scale up and speed up PINNs by decomposing the computational and memory bottleneck, namely the gradient of the residual loss that contains massive PDE terms the neural tangent kernel to demonstrate the global convergence of PINNs.Mishra et al. [49] derived an estimate for the generalization error of PINNs, considering both the training error and the number of training samples.In a different approach, Shin et al. [56] adapted the Schauder approach and the maximum principle to provide insights into the convergence behavior of the minimizer as the number of training samples tends towards infinity.They demonstrated that the minimizer converges to the solution in both C 0 and H 1 .Furthermore, Lu et al. [43] employed the Barron space within the framework of two-layer neural networks to conduct a prior analysis on PINN with the softplus activation function.This analysis is made possible by drawing parallels between the softplus and ReLU activation functions.More recently, Hu et al. [26] employed the Rademacher complexity concept to measure the generalization error for both PINNs and the extended PINNs variant (XPINNs [33] and APINNs [25]).

Machine Learning PDEs Solvers in High-Dimensions
Numerous attempts have been made to tackle high-dimensional PDEs by deep learning to overcome the curse of dimensionality.In the PINN literature, [60] showed that to learn a high-dimensional HJB equation, the L ∞ loss is required.The authors of [23] proposed to parameterize PINNs by the Gaussian smoothed model and to optimize PINNs without back-propagation by Stein's identity, avoiding the vast amount of differentiation in high-dimensional PDE operators to accelerate the convergence of PINNs.Separable PINN [12] considers a per-axis sampling of residual points in high-dimensional spaces, thereby reducing the computational cost and enlarging the batch size of residual points in PINN via tensor product.However, this method [12] focuses on acceleration on only modest (less than 4) dimensions and is designed mainly for increasing the number of collocation points in 3D PDEs, whose separation of sampling points becomes intractable in high-dimensional spaces.Han et al. [21,22] proposed the DeepBSDE solver for high-dimensional PDEs based on the classical BSDE method, and they used deep learning to approximate unknown functions.Extensions of the DeepBSDE method were presented in [2,10,24,30,35].Becker et al. [6] solved high-dimensional optimal stopping problems via deep learning.The authors of [1] combined splitting methods and deep learning to solve high-dimensional PDEs.Raissi [52] leveraged the relationship between high-dimensional PDEs and stochastic processes whose trajectories provide supervision for deep learning.Despite the effectiveness of the methods in [1,21,52,22], they can only be applied to a certain restricted class of PDEs.Wang et al. [61,62] proposed tensor neural networks with efficient numerical integration and separable structures for solving high-dimensional Schrödinger equations in quantum physics.Zhang et al. [66] proposed using PINNs for solving stochastic differential equations by representing their solutions via spectral dynamically orthogonal and bi-orthogonal methods.Zang et al. [65] proposed a weak adversarial network that solves PDEs using the weak formulation.The deep Galerkin method (DGM) [58] optimizes networks to satisfy the high-dimensional PDE operator, where the derivatives are estimated via Monte Carlo.The deep Ritz method (DRM) [63] considers solving high-dimensional variation PDE problems.
In the numerical PDE literature, there have been attempts to scale numerical methods to high dimensions, e.g., proper generalized decomposition (PGD) [11], multi-fidelity information fusion algorithm [51], and ANOVA [47,67].Darbon and Osher [13] proposed a fast algorithm for solving high-dimensional Hamilton-Jacobi equations if the Hamiltonian is convex and positively homogeneous of degree one.The multilevel Picard method [3,4,5,31,32] is another approach for approximating solutions of high-dimensional parabolic PDEs, which reformulates the PDE problem as a stochastic fixed point equation, which is then solved by multilevel and nonlinear Monte-Carlo.Similar to Beck et al. [1] and Han et al. [21], the multilevel Picard method can only output the solution's value on one test point and can only be applied to a certain restricted class of parabolic PDEs.

Stochastic Gradient Descent
This paper accelerates and scales up PINNs based on the idea of stochastic gradient descent (SGD).In particular, we propose the aforementioned SDGD method.SGD is a standard approach in machine learning for handling large-scale data.As an iterative optimization algorithm, it updates the model's parameters by computing the gradients on a small randomly selected subset of training examples, known as mini-batches.This randomness introduces stochasticity, hence enabling faster convergence and efficient utilization of large datasets, making SGD a popular choice for training deep learning models.Among the theoretical guarantees of SGD, the works in [14,40,48] are remarkable milestones.Specifically, the condition presented in [40] stands as a minimum requirement within the general optimization literature, specifically about the Lipschitz continuity of the gradient estimator.Their proof can also be extended to establish convergence for our particular case, necessitating the demonstration of an upper bound on the Lipschitz constant of our gradient estimator.On the other hand, [14,48] adopted a classical condition that entails a finite upper bound on the stochastic gradient's variance.In this case, it suffices to compute an upper bound for the variance term.While [40] assumed a more relaxed condition, the conclusion is comparably weaker, demonstrating proven convergence but with a notably inferior convergence rate.In the theoretical analysis presented in section 4 of this paper, we will utilize the aforementioned tools to provide convergence guarantees for our SDGD, emphasizing SDGD's stochastic gradient variance and PINNs convergence.

Physics-Informed Neural Networks (PINNs)
In this paper, we focus on solving the following partial differential equations (PDEs) defined on a domain Ω ⊂ R d : where L and B are the differential operators for the residual in Ω and for the boundary/initial condition on Γ. PINN [54] is a neural network-based PDE solver via minimizing the following boundary L b (θ) and residual loss L r (θ) functions.
where λ b , λ r > 0 are the weights for balancing the losses.n b , n r are the number of boundary points {x b,i } n b i=1 ⊂ Γ and residual points {x r,i } nr i=1 ⊂ Ω, respectively.u θ is the neural network model parameterized by θ.

Methodology for High-Dimensional PDEs
We first adopt the simple high-dimensional second-order Poisson's equation for illustration; then, we move to the general case covering a variety of high-dimensional PDEs.

Introductory Case of the High-Dimensional Poisson's Equation
We consider a simple high-dimensional second-order Poisson's equation for illustration: where u is the unknown exact solution, and u θ is our PINN model parameterized by θ.The memory scales quadratically as d increases due to the growing network size at the first input layer of u θ and the increasing number of second-order derivatives.So, for extremely high-dimensional PDEs containing many second-order terms, using only one collocation point can lead to insufficient memory, whose memory cost cannot be further reduced in the traditional SGD over collocation points scenario.The huge memory cost is also incurred by the large input size at the first layer of u θ , which is the same as the PDE dimensionality.However, the memory problem is solvable by inspecting the residual loss on the collocation point x: • Thus, our method SDGD can solve high-dimensional PDEs arbitrarily since its minimal computational cost will grow much more slowly than conventional SGD as the dimension d increases.Empirically, it is interesting to see how the two SGDs affect convergence.In particular, B-point + D-term with the same B × D quantity has the same memory cost, e.g., 50-point-100-terms and 500-point-10-terms.

General Case
Now, we move from the simple Poisson's equation to more general cases.We are basically performing the following decomposition on the PDE differential operator/PDE terms: where L i are the N L decomposed PDE terms.We have the following examples: . This observation also highlights the significance of decomposition methods for high-dimensional Laplacian operators.The computational bottleneck in many high-dimensional second-order PDEs often lies in the high-dimensional Laplacian operator.
• Consider the d-dimensional Fokker-Planck (FP) equations, which are ubiquitous in science and engineering.
It is originally derived from statistical mechanics to describe the evolution of stochastic differential equations (SDEs).Recently, a real-world application of SDE and its associated FP equation is for generative modeling [59].Furthermore, it also generalizes the Black-Scholes equation in mathematical finance, the Hamilton-Jacobi-Bellman equation in optimal control, and Schrödinger equation in quantum physics. where ) is the probability density function of the stochastic process x(t), F i (x) is the i-th component of the drift coefficient vector F (x), and D ij (x) is the (i, j)-th component of the diffusion coefficient matrix D(x).Then, our decomposition over the PDE terms is where we denoted • The d-dimensional bi-harmonic equation: is the Laplacian operator.For this high-dimensional fourth-order PDE, we can decompose as follows for the PDE operator/terms: • For other high-dimensional PDEs, their complexity arises from the dimensionality itself.Therefore, it is always possible to find decomposition methods that can alleviate this complexity.
We go back to the algorithm after introducing these illustrative examples; the computational and memory bottleneck is the residual loss: The gradient with respect to the model parameters θ for training the PINN is Subsequently, we are sampling the red part to reduce memory cost and to accelerate training.Our algorithm can be summarized in Algorithm 1.

Extensions: Gradient Accumulation and Parallel Computing
This subsection introduces two straightforward extensions of our Algorithm 1 for further speed-up and scale-up.Gradient Accumulation.Since our method involves SDGD in large-scale problems, we must reduce the batch size to fit it within the available GPU memory.However, a very small batch size can result in significant gradient variance.In such cases, gradient accumulation is a promising idea.Specifically, gradient accumulation involves sampling different index sets I 1 , I 2 , • • • , I n , computing the corresponding gradients grad I k (θ) for k = 1, 2, • • • , n, and then averaging them as the final unbiased stochastic gradient for one-step optimization to effectively increase the batch size.The entire process can be implemented on a single GPU, and we only need to be mindful of the tradeoff between computation time and gradient variance.
Parallel Computing.Just as parallel computing can be used in machine learning to increase batch size and accelerate convergence in SGD, our new SDGD also supports parallel computing.Recall that the stochastic gradient by sampling the PDE terms randomly is given by grad I (θ) in equation (15).We can compute the above gradient for different index sets I 1 , I compute L i u θ (x).If i ∈ I, then keep the gradient with respect to θ; else, we detach it from the GPU to save memory.After detachment, the term will not be involved in the costly backpropagation, and its gradient with respect to θ will not be computed, which saves GPU memory costs.Compute the unbiased stochastic gradient used to update the model as below: 7: end while 3.4 Further Speed-up via Simultaneous Sampling in Forward and Backward Passes In this section, we discuss how to accelerate further large-scale PINN training based on our proposed memory reduction methods to make it both fast and memory efficient.
Although the proposed methods in the previous Algorithm 1 can significantly reduce memory cost and use SDGD to obtain some acceleration through gradient randomness, the speed is still slow for particularly large-scale problems because we need to calculate each term {L i u(x)} N L i=1 in the forward pass one by one, and we are only omitting some these terms in the backward pass to scale up.For example, in extremely high-dimensional cases, we may need to calculate thousands of second-order derivatives of PINN, and the calculation speed is clearly unacceptable.
To overcome this bottleneck, we can perform the same unbiased sampling on the forward pass to accelerate it while ensuring that the entire gradient is unbiased.I.e., we only select some indices for calculation in the forward pass and select another set of indices for the backward pass, combining them into a very cheap yet unbiased stochastic gradient.
Mathematically, consider the full gradient grad(θ) again given in equation (14).We choose two random and independent indices sets I, J ⊂ {1, 2, • • • , N L } for the sampling of the backward and the forward passes, respectively.The corresponding stochastic gradient is: Since I, J are independent, the gradient estimator is unbiased: E I,J grad I,J (θ) = grad(θ).Our algorithm can be summarized in Algorithm 2.

Algorithm 2
Training Algorithm for scaling-up and further speeding-up by sampling forward and backward passes.compute L i u θ (x) and keep the gradient with respect to θ. for j ∈ J do 7: compute L j u θ (x) and detach it to save memory cost.Compute the unbiased stochastic gradient used to update the model given in equation (16).10: end while Trading off Speed with Gradient Variance.Obviously, since we introduce more randomness, the gradient variance of this method may be larger, and the convergence may be sub-optimal.However, at the initialization stage of the problem, an imprecise gradient is sufficient to make the loss drop significantly, which is the tradeoff between convergence quality and convergence speed.We will demonstrate in experiments that this method is particularly effective for extremely large-scale PINN training.compute L i u θ (x) and keep the gradient with respect to θ. Compute the stochastic gradient used to update the model grad 7: end while

Additional Speed-up via Sampling Only Once
In the previous section, we introduced Algorithm 2, which aims to accelerate SDGD by randomly sampling forward and backward passes.To ensure unbiased gradients, we independently sample the forward and backward passes.However, this leads to additional computational overhead.If we used the same sample for both the forward and backward passes, our computational cost would be even lower than Algorithm 2, resulting in further acceleration.Nonetheless, the gradients obtained in this manner are biased due to the correlations between the forward and backward pass samples, and the batch size for sampling dimension |I| controls the bias: a larger batch size |I| leads to smaller bias.Fortunately, in practice, this approach often converges rapidly.We introduce this in Algorithm 3.
The empirical success of Algorithm 3 can be attributed to the bias-speed tradeoff and its relatively low bias in practice.Algorithm 3 is the fastest due to the sampling only once.It can converge well thanks to the low bias in practice, as long as the batch size for sampling dimension |I| is not too small, which will be experimentally demonstrated in Section 5.1.2.

Theory
In this section, we analyze the convergence of our proposed Algorithms 1 and 2.

SDGD is Unbiased
In previous sections, we have shown that the stochastic gradients generated by our proposed SDGD are unbiased: Theorem 4.1.The stochastic gradients grad I (θ) and grad I,J (θ) in our Algorithms 1 and 2, respectively, parameterized by index sets I, J, are an unbiased estimator of the full-batch gradient grad(θ) using all PDE terms, i.e., the expected values of these estimators match that of the full-batch gradient, E I [grad I (θ)] = E I,J [grad I,J (θ)] = grad(θ).
Proof.We prove the theorem in A.1.

SDGD Reduces Gradient Variance
In this subsection, we aim to show that SDGD can be regarded as another form of SGD over PDE terms, serving as a complement to the commonly used SGD over residual points.Specifically, |B|-point + |I|-term where |B|, |I| ∈ Z + with the same |B| × |I| quantity has the same memory cost, e.g., 50-point-100-terms and 500-point-10-terms.In particular, we shall demonstrate that properly choosing the batch sizes of residual points |B| and PDE terms |I| under the constant memory cost (|B| × |I|) can lead to reduced stochastic gradient variance and accelerated convergence compared to previous practices that use SGD over points only.
We assume that the total full batch is with N r residual points {x n } Nr n=1 and N L PDE terms L = N L i=1 L i , then the loss function for PINN optimization is 2 , where we normalize over both the number of residual points N r and the number of PDE terms N L , which does not impact the directions of the gradients.More specifically, the original PDE is The reason for normalization lies in the increase of both dimensionality and the number of PDE terms.As these increase, the scale of the residual loss and its gradient becomes exceptionally large, leading to the issue of gradient explosions.Therefore, we normalize the loss by the dimensionality and the number of PDE terms to mitigate this problem.Additionally, during testing, we focus on relative errors, and normalization here is, in essence, a normalization concerning dimensionality following the relative error we care about.The full batch gradient is given by If the random index sets I ⊂ {1, 2, • • • , N L } and B ⊂ {1, 2, • • • , N r } are chosen, then the SDGD gradient with |B| residual points and |I| PDE terms using these index sets is where Theorem 4.2.For the random index sets (B, I) where where V computes the variance of a random variable, C 1 , C 2 , C 3 are constants independent of B, I but dependent on model parameters θ.
Proof.We prove the theorem in A.2.
Theorem 4.2 focuses on the gradient variance after fixing the current model parameter θ, i.e., given θ at the current epoch, we investigate the SDGD gradient variance at the next optimization round.
Intuitively, the variance of the stochastic gradient tends to decrease as the batch sizes (|B|, |I|) increase, and it converges to zero for |B|, |I| → ∞, where we considered sampling with replacement to make the theorem clear.Further, the SDGD gradient variance can be rewritten as |B||I| , where B is the set for sampling residual points and I is that for sampling PDE terms/dimensions.C 1 is the variance from points, and therefore, sampling more points, i.e., increasing |B|, can reduce it.Similarly, C 2 is the variance from PDE terms/dimensions.C 3 is related to the variance due to the correlation between sampling points and PDE terms/dimensions.This verifies that SDGD can be regarded as another form of SGD over PDE terms, serving as a complement to the commonly used SGD over residual points.|B|-point + |I|-term with the same |B| × |I| quantity has the same memory cost.According to Theorem 4.2, under the same memory budget, i.e., |B| × |I| is fixed, then there exists a particular choice of batch sizes |B| and |I| that minimizes the gradient variance, in turn accelerating and stabilizing convergence.This is because the stochastic gradient variance V B,I [g B,I (θ)] is a function of finite batch sizes |B| and |I|, which therefore can achieve its minimum value at a certain choice of batch sizes.
Let us take the extreme cases as illustrative examples.The first extreme case is when the PDE terms have no variance, meaning that the terms ∂ ∂θ L i u θ (x n ) are identical for all i after we fix n, i.e., C 2 = 0, and the SDGD variance becomes C1|I|+C3 |B||I| .In this case, if a memory budget of, for example, |B||I| = 100 units is given, the optimal choice would be to select 100 points and one PDE term with the minimum variance, i.e., we obtain the lowest SDGD variance by choosing |B| = 100 and |I| = 1.Choosing more PDE terms would not decrease the gradient variance and would be less effective than using the entire memory budget for sampling points.Conversely, if the points have no variance in terms of the gradient they induce, i.e., C 1 = 0, the optimal choice would be one point and 100 PDE terms, i.e., |I| = 100 and |B| = 1.In practice, the selection of residual points and PDE terms will inevitably introduce some variance.Therefore, in the case of a fixed memory cost, the choice of batch size for PDE terms and residual points involves a tradeoff.Increasing the number of PDE terms reduces the variance contributed by the PDE terms but also decreases the number of residual points, thereby increasing the variance of the points.Conversely, decreasing the number of PDE terms reduces the variance from the points but increases the variance from the PDE terms.Thus, there exists an optimal selection strategy to minimize the overall gradient variance since the choices of batch sizes are finite.

Gradient Variance Bound and Convergence of SDGD
To establish the convergence of unbiased stochastic gradient descent, we require either Lipschitz continuity of the gradients [40] or bounded variances [14,48], with the latter leading to faster convergence.To prove this property, we need to take the following steps.Firstly, we define the neural network serving as the surrogate model in PINN.

Definition 4.1. (Neural Network). A deep neural network (DNN) u
the input, and W l ∈ R m l ×m l−1 is the weight matrix at l-th layer with d = m 0 and m L = 1.The parameter vector θ is the vectorization of the collection of all parameters.We denote h as the maximal width of the neural network, i.e., h = max(m L , • • • , m 0 ).
For the nonlinear activation function σ, the residual ground truth R(x), we assume the following: Assumption 4.1.We assume that the activation function is smooth and |σ (k) (x)| ≤ 1 and σ (k) is 1-Lipschitz, for all 0 ≤ k ≤ n, where n is the highest order of the PDE under consideration, σ (k) denotes the kth order derivative of σ, e.g., the sine and cosine activations.We assume that |R(x)| ≤ R for all x ∈ Ω where R is a constant.
Remark 1.The sine and cosine functions naturally satisfy the aforementioned conditions.As for the hyperbolic tangent (tanh) activation function, when the order n of the PDE is determined, there exists a constant C n such that the activation function C n tanh(x) satisfies the given assumptions.This constant can be absorbed into the weights of the neural network.This is because both the tanh function and its derivatives up to order n are bounded.Assumption 4.1 gives the boundedness of the activation function σ and the residual ground truth R(x).Furthermore, to establish the stochastic gradient variance bounds for SDGD, we also need to assume the boundedness of the PDE differential operators L i where L = N L i=1 L i , as follows.
Assumption 4.2.We assume the highest-order derivative in the PDE operator The motivation is that x is in a compact domain, i.e., the finite set of training residual/collocation points, since the training is finished in finite time.Due to Lemma 4.1, high-order derivatives of the neural network, i.e., ∂ n u θ (x) ∂x n , can also be bounded in a compact domain by bounded optimization trajectory assumption stated later in Theorem 4.3, i.e., the parameter θ is assumed to be bounded during optimization.Combining these guarantees the PINN residual's and SDGD-PINN gradient's boundedness throughout the optimization, which can lead to convergence thanks to bounded gradient variance [14,48].Assumption 4.2 is realistic since most PDE coefficients are constant or continuous functions.
where M (l) = max {∥W l ∥, 1}, n is the derivative order, the norms are all vector or matrix 2 norms and vec denotes vectorization of the high-order derivative tensor, and h is the maximal width of the neural network, i.e., Proof.The proof is presented in A.3.
Remark 2. Intuitively, given the order of the PDE and the neural network structure, it is observed that both of them are finite.Consequently, the norms of the higher-order derivatives of the network are necessarily bounded by a quantity related to its weight matrices, the number of layers, the width, the PDE order, and the network input x.
From the boundedness of the derivatives of the neural network in PINN given by Lemma 4.1, coupled with our assumed boundedness of the PDE operator in Assumption 4.2, we can infer that the PINN residual is also bounded, i.e., the boundedness of the numerical values obtained when the PDE operator acts on the neural network of PINN, which can further ensure bounded stochastic gradient variance during SDGD optimization.
Next, we define the SGD trajectory, and we will demonstrate the convergence of the SGD trajectory based on the stochastic gradients provided by our Algorithms 1 and 2. Definition 4.2.(SGD Trajectory) Given step sizes (learning rates) {γ n } ∞ n=1 , and the initialized PINN parameters θ 1 , then the update rules of SGD using Algorithms 1 and 2 are respectively.Here, the stochastic gradient produced by Algorithm 1 g B,I (θ) is given in equation ( 18) , and the stochastic gradient produced by Algorithm 2 is where B is the index set sampling over the residual points, while I and J sample the PDE terms in the backward and forward passes, respectively.
Since the PINN optimization landscape is generally nonconvex, we consider SDGD's convergence to the regular minimizer defined as follows.Kawaguchi [37] demonstrates that such a "regular minimizer" can be good enough in practice.
The following theorem demonstrates the convergence rate of SDGD in PINN training.The statement follows Theorem 4 in Mertikopoulos et al. [48] and the proof is to check if the conditions in that theorem can be satisfied in our setting.
Theorem 4.3.(Convergence of SDGD under Bounded Gradient Variance) Assume Assumptions 4.1 and 4.2 hold, fix some tolerance level δ > 0, suppose that the SGD trajectories given in equations (22,23) are bounded, i.e., ∥W n l ∥ ≤ M (l) for all epoch n where the collection of all {W n l } L l=1 is θ n , and that the regular minimizer of the PINN loss is θ * as defined in Definition 4.3, and that the SGD step size follows the form γ n = γ/(n + m) p for some p ∈ (1/2, 1] and large enough γ, m > 0, then 1.There exist neighborhoods U and U 1 of θ * such that, if θ 1 ∈ U 1 , the event occurs with probability at least 1 − δ, i.e., P(E Proof.The proof is presented in A.4. Remark 3. The convergence rate of stochastic gradient descent is O(1/n p ), where O represents the constant involved, including the variance of the stochastic gradient.A larger variance of the stochastic gradient leads to slower convergence, while a smaller variance leads to faster convergence.The selection of the learning rate parameters γ and m depends on the specific optimization problem itself.For example, in a PINN problem, suitable values of γ and m ensure that the initial learning rate is around 1e-4, which is similar to the common practice for selecting learning rates.
Remark 4. The theorem is mainly due to Theorem 4 in Mertikopoulos et al. [48], which is proved via checking if the required conditions are satisfied.The convergence result is local and requires that the initialized neural network parameter θ 1 in the first epoch be inside the neighborhood U 1 .A proper neural network weight/parameter initialization, e.g., Xavier initialization [15] used in our experiment, can mitigate this.This approach avoids the problem of initializing weights with too large or too small values, thus ensuring that the loss and gradient do not explode to infinity or vanish to zero during forward and backward passes, contributing to a small initial loss function value, which signifies a starting point close to a local minimum and stable gradient and fast convergence.

Discussion on Batch Sizes Selection
We have introduced our SDGD algorithms with three batch sizes: |B| for the batch size of residual points, |I| for the dimensions included in the backward pass, and |J| for the dimensions included in the forward pass; see the stochastic gradients in equations (17,18,24).In this subsection, we further discuss how to choose batch sizes.Firstly, both computational cost and memory consumption expand with the increase in batch sizes |B|, |I|, |J|.Additionally, SDGD's stochastic gradient variance diminishes with larger batch sizes, facilitating better convergence.Therefore, it represents a tradeoff between computational load and performance.Further, memory consumption is directly linked to |B|•|I|, where |B| represents the number of residual points and |I| involves the number of dimensions participating in the backward pass.|J| has nothing to do with the memory cost since it is for the forward pass detached from the GPU after computation.
Users should determine the appropriate batch size based on their GPU memory capacity.Given a specified memory constraint, i.e., once the |B| • |I| value is determined, we recommend a balanced selection to ensure similarity between the two batch sizes, i.e., |I| ≈ |B|.Extremely unbalanced cases, such as setting one batch size to a minimum of 1, are not advised.We explain the negative effect of unbalanced batch sizes as follows.If the batch size for residual points, |B|, is too small, it can hinder the GPU from leveraging its advantage in large-batch parallel computing.The parallel nature of neural network inferences for each residual point cannot be used, resulting in slower execution due to the exceptionally small batch size of points |B|.Meanwhile, if one reduces the batch for sampling dimension |I| too much, we cannot reuse the computation graphs for the derivatives along different dimensions, leading to a significant slowdown.Specifically, taking the high-dimensional Laplacian operator as an example, the neural network's second-order derivative with respect to the j-th dimension is given by where For all dimensions j, the computations of Φ l (x) and Ψ l (x) are shared.Thus, computing multiple derivatives on one computational graph is faster than computing them from scratch without reusing the shared part.Hence, we recommend sampling an appropriate dimension batch size, such as 100 or more, and we avoid choosing a too small |I| to reuse the computation graph properly.
As for the choice of |J| in Algorithm 2, we recommend setting |J| ≈ |I|.This is because both |I| and |J| represent the dimension batch size, ensuring that the computational load for both forward and backward passes is roughly equal and the variances of the two samplings are similar.Although |J| does not affect memory consumption, setting it excessively large is not advisable, as it would resemble Algorithm 1, being memory-efficient but suffering from slow computation due to excessive forward pass workload.

Comparison with Latest Work
Recently, a line of work has been proposed for high-dimensional PINN, including Random Smoothing PINN (RS-PINN) [28], Hutchinson trace estimation (HTE) PINN [27], and Score-PINN [29].We compare these three latest papers with our SDGD proposed in this paper.Generally speaking, we demonstrate that SDGD is still more general, universal, applicable, easily implementable, contributory to high-order and high-dimensional PINN's theory, and tends to have lower variance in most scenarios despite these latest follow-ups.

Comparison with RS-PINN
RS-PINN [28] is another way to speed up and scale up PINN for high-dimensional PDEs, first proposed by He et al. [23].However, we show that RS-PINN [28] is worse than SDGD-PINN in several aspects.First, RS-PINN cannot deal with PDEs with more than second-order terms since the corresponding variance reduction form cannot be derived, while SDGD applies to arbitrary high-dimensional high-order PDEs.Second, RS-PINN smoothes the network, and thus, it can only model smooth PDE solutions, i.e., RS-PINN is not a universal approximator.On the other hand, SDGD employs a regular neural network and is universal.Third, RS-PINN requires sampling, which leads to bias when dealing with nonlinear PDEs like the Sine-Gordon equation, as pointed out in Section 4.3 of RS-PINN paper [28].Fourth, we compare RS-PINN with SDGD-PINN in Table 5 in this paper, where He et al. [23] is the RS-PINN.However, we demonstrate that SDGD is much faster than RS-PINN.
Overall, the bias, over-smoothness, and limitation on lower-than-second-order derivatives threaten RS-PINN's application.SDGD is better in these senses.SDGD also outperforms RS-PINN significantly in numerical experiments.
1. Gradient variance: We first compare the variance of SDGD and HTE theoretically.Overall, SDGD and HTE have distinct pros and cons, but SDGD will be more stable in most high-dimensional PDEs than HTE.
Hu et al. [27] gave examples where SDGD/HTE can be better and cases where they perform the same in their Section 3.3.2,which is the comprehensive and fair comparison showing that HTE is not necessarily better than SDGD.
We further demonstrate that SDGD's variance can be more stable than HTE in most extremely high-dimensional problems.Using the setting of second-order PDEs in Section 3.3.2 of HTE [27], SDGD and HTE both estimate the Hessian trace stochastically.The HTE paper shows that HTE's variance comes from the off-diagonal terms, whose number grows quadratically with dimensionality.SDGD's variance comes from the variability of on-diagonal terms, whose number grows linearly with dimension.Hence, with growing PDE dimensionality, HTE tends to suffer from more variance due to a faster-growing number of off-diagonal terms.In other words, the number of off-diagonal terms injecting noise into HTE's stochastic gradient grows quadratically fast, while the number of on-diagonal terms injecting noise into SDGD only grows linearly.
To conclude, regarding gradient variance, SDGD and HTE have distinct pros and cons, but SDGD will be more stable in most high-dimensional PDEs than HTE.
2. Applicability and generality: Next, we compare the applicability of SDGD and HTE.SDGD can be applied to arbitrary high-dimensional and high-order PDEs thanks to the general decomposition of PDE operators i=1 L i based on dimensionality.In contrast, HTE can only be used for first-order, second-order, and biharmonic PDEs, which enables the Tensor-Vector Product proposed in HTE.
More specifically, HTE [27]'s Section 3.5.2acknowledges HTE's limitation.In contrast, we are always emphasizing SDGD's generality to arbitrary PDEs throughout this paper.
In fact, PyTorch [50] is still more widely used than JAX [8] and has minimal requirements on the computer system.Hence, SDGD is more implementable than HTE.
Additionally, regarding implementation, SDGD and HTE are compared in JAX implementation using Section 5.1's setting in the HTE paper [27].We implemented other experiments (Sections 5.2, 5.3, and 5.4) in PyTorch since our baselines are also implemented in PyTorch.Thus, we keep the deep learning package the same for fair comparison and reproducible research.
4. Numerical experiments: Meanwhile, we compare HTE and SDGD experimentally.As shown by Table 1 in the HTE paper [27], if both are implemented in JAX [8] and under fair comparison with the same memory and speed budgets, they perform similarly.So, SDGD and HTE are comparable in this case.
But, as mentioned before, SDGD is more applicable and implementable than HTE, and SDGD generally has a low variance with the growing PDE dimension.
5. Theoretical contribution: SDGD's Section 4 provides a convergence theory for general high-order and highdimensional PINN.This theoretical contribution can also be applied to bound the HTE gradient variance.It can also guarantee future work on general high-order and high-dimensional PINN.
Overall, SDGD is better than HTE, especially regarding the variance in growing PDE dimension, generality, and implementation.

Comparison with Score-PINN
Score-PINN [29] is an application that works on high-dimensional Fokker-Planck (FP) equations by adopting the speed-up and scale-up techniques we have developed, i.e., SDGD can be applied to it.It is not proposing any general methodology of high-dimensional PDE solver, but it's novelty is to adapt existing techniques to FP equation to solve the numerical issue due to extremely small solution value.
In sum, Score-PINN [29] is an application paper that uses SDGD or HTE for a specific class of FP equations.It is not a methodology paper like SDGD.Score-PINN is also based on SDGD.

Experiments
In this section, we conduct extensive experiments to demonstrate the stable and fast convergence of SDGD over several nonlinear PDEs.In particular, we compare our PINN-based method with other non-PINN mainstream approaches for solving high-dimensional PDEs [1,21,52], especially in terms of accuracy.We also demonstrate SDGD's convergence in 100,000-dimensional nonlinear PDEs with complicated exact solutions.All the experiments in the main text are done on an NVIDIA A100 GPU with 80GB memory using PyTorch implementation [50].

SDGD Can Deal with Inseparable and Effectively High-Dimensional PDE Solutions
Herein, we test the ability of SDGD to deal with linear/nonlinear PDEs with nonlinear, inseparable, and effectively high-dimensional PDE exact solutions in PINN's residual loss.Specifically, we consider the following exact solution: where c i ∼ N (0, 1).Here the term 1 − ∥x∥ 2 2 is for a zero boundary condition on the unit sphere and for preventing the boundary from leaking the information of the solution, as we want to test SDGD's ability to fit the residual where sampling over dimensions is employed.Furthermore, the following PDEs are considered in the unit ball B d , all of which are associated with a zero boundary condition on the unit sphere: where g(x) = ∆u exact (x).
We would like to emphasize that the PDEs and the exact PDE solutions used here are highly nontrivial and challenging.
• Nonlinear, inseparable, and effectively high-dimensional PDE exact solution: This PDE cannot be reduced to a lower-dimensional problem, and the exact solution of the PDE cannot be decomposed into lower-dimensional functions.It is worth noting that in the exact solution, all pairs of variables, both x i and x i+1 , exhibit pairwise interactions and are highly coupled.
• Anisotropic and nontrivial PDE exact solution: In addition, due to the random coefficients c i ∼ N (0, 1) along different dimensions, the exact solution is anisotropic, which poses an additional challenge to SDGD.The exact solution is also highly nontrivial; its value's standard deviation is large in high-dimensional spaces.
• PDEs with different levels of nonlinearities: We test both linear and nonlinear PDEs to verify SDGD's robustness.
• Zero boundary condition to test SDGD on the residual part: In a d-dimensional problem, the boundary is typically (d − 1)-dimensional.Therefore, if the boundary conditions are nontrivial, the boundary loss will leak information at a rate of (d − 1)/d.To assess the impact of sampling the dimension in the residual loss in SDGD, we maintain a zero boundary condition to eliminate its interference.This also makes the PDE more challenging, as we have no knowledge of the exact solution from the boundary.
• Unable to solve via traditional methods: These elliptic PDEs are traditionally solved by finite difference or finite element methods, which suffer from the curse of dimensionality, whose costs grow exponentially with respect to the dimension.The related work mentioned before, e.g., [1,21,54], is mainly tailored for time-dependent parabolic PDEs.Thus, traditional methods directly fail in these cases.
The training details are as follows.The model is a 4-layer multi-layer perceptron network with 128 hidden units, which is trained via Adam [38] for 10K epochs, with an initial learning rate 1e-3, which linearly decays to zero at the end of the optimization.We select 100 random residual points at each Adam epoch (i.e., |B| = 100) and 20K fixed testing points uniformly from the unit ball.We utilize Algorithm 3 with a minibatch of 100 dimensions to randomly sample the second-order derivatives along various dimensions in the Laplacian operator in all the PDEs, i.e., |I| = 100.For vanilla PINN baseline [54], we use full dimension gradient descent with |I| equals the PDE dimensionality.We adopt the following model structure to automatically satisfy the zero boundary condition, which helps the model avoid boundary loss and avoid sampling the boundary points [45]: where u θ (x) is the neural network and u SDGD θ (x) is the boundary-augmented model.The hard constraint technique for PINN is popular to avoid the additional boundary loss [45].We repeat our experiment 5 times with 5 independent random seeds.[54] and SDGD (Ours) across different PDEs under various high dimensions.Vanilla PINNs [54] run out-of-memory for an A100 GPU with 81252MB memory in more than 5,000 D, while SDGD scales up to 100,000 D. SDGD's errors are similar to those of vanilla PINNs in relatively lower dimensions (100 D, 1,000 D, and 5,000 D) while being much faster and more memory-efficient.

Vanilla
The test relative L 2 errors and the convergence time are shown in Table 1, while the convergence curve of SDGD is shown in Figure 1.The main observations from the results are as follows.
• Stable performances: SDGD demonstrates remarkable stability across various dimensions and different nonlinear PDEs, with relative L 2 errors testing at approximately 1e-3.Furthermore, given that we have provided the same exact solution for each PDE and have kept the training and testing point samples fixed by using the same random numbers, the convergence curves of SDGD for different PDEs are remarkably similar.This is especially noticeable for Poisson and Allen-Cahn, where the convergence curves exhibit minimal distinctions.This highlights the stability of SDGD in the face of PDE nonlinearity.
• Linear computational cost: Furthermore, SDGD's memory and optimization time costs exhibit linear growth with respect to dimensionality.For instance, when transitioning from 10K dimensions to 100K dimensions, both memory and time costs increase by less than a factor of 10.
• SDGD scales up PINNs to extremely high dimensions: In the case of 100K dimensions, we opted for |I| = 100 dimensions for SDGD.If we were to employ the conventional full-batch gradient descent across all dimensions, even with just one residual point, the memory cost would surpass SDGD's current setting by a factor of 10.This greatly exceeds the 80GB memory limit of an A100 GPU.Thus, SDGD allows for the scalability of PINN to handle large-scale PDE problems.
• Smoother convergence curve in higher dimensions: We observe that the convergence curves of SDGD are smoother in high-dimensional scenarios, which could be attributed to the tendency of high-dimensional functions to be smoother thanks to the blessing of dimensionality.
• Comparison with vanilla PINNs [54]: With regards to vanilla PINNs, we note a considerable decrease in speed and a swift escalation in memory usage with the growth of dimensionality.This occurs due to the necessity of calculating the entire second-order Laplacian operator in regular PINNs.This leads to a quadratic rise in computational expense with dimension due to the growing network size at the input layer and the number of second-order derivatives in the high-dimensional Laplacian operator.Consequently, when dealing with 10,000 dimensions, regular PINNs surpass the memory capacity of an A100 GPU of 81252 MB.Moreover, in scenarios where vanilla PINNs can handle dimensions less than 5,000 D, the performance of SDGD is comparable, with the added benefits of significant acceleration and reduced memory consumption.In 100 D, since our batch size for dimension is |I| = 100, SDGD and vanilla PINN are the same algorithm.These results demonstrate that SDGD scales up and speeds up PINNs in very high dimensions.
In sum, SDGD with PINN can tackle the CoD in solving nonlinear high-dimensional PDEs with inseparable and complicated solutions.Table 2: SDGD results for the 100,000-dimensional Sine-Gordon equation with anisotropic solutions under various batch sizes.

Ablation
In this section, we investigate the impact of the batch sizes of residual points |B| and dimensions sampled |I| in SDGD on its convergence results.We adopt the same experimental settings of 100,000-dimensional Sine-Gordon equations and hyperparameter choices as in Section 5.1, except we control these two batch sizes |B| and |I| in Algorithm 3. The results are presented in Table 2, where we manipulate the batch sizes in eight different cases with (|I|, |B|) = (100, 100) or (1000, 10) or (10, 1000) or (10000, 1) or (1, 10000) or (100, 10) or (10,100) or (10,10).We remark that the total batch size |I| • |B| = 10 4 for the first five cases.So, they should have similar memory costs according to Section 4.4.For the other three cases with (|I|, |B|) = (100, 10) or (10,100) or (10,10), we remark that we chose even smaller total batch sizes.For each case, we report the test relative L 2 error, speed (second per iteration), total running time for 10K iterations, and GPU memory cost in MB.
• Firstly, regarding performance, the convergence results of the final relative L 2 error are quite consistent among various settings, indicating that SDGD is robust to the choice of batch size.Particularly, smaller batch sizes like (|I| = 100, |B| = 10), (|I| = 10, |B| = 100), and (|I| = 10, |B| = 10) converge only slightly worse than larger ones but with faster speeds, suggesting that selecting fewer points and dimensions is sufficient when dealing with high-dimensional PINNs.
• As for memory consumption, let us consider the first five settings; if the total batch size (the numerical value of |I| • |B| = 10 4 for these five cases) remains the same, the memory usage is generally comparable.When selecting more points, additional memory is consumed due to the inherent memory usage of each point, but this constitutes only a small fraction of the total memory.
• Consider the five cases with the same total batch size |I| • |B| = 10 4 .We observed that extremely unbalanced batch sizes, e.g., "(|I|, |B|) = (10000, 1)" and "(|I|, |B|) = (1, 10000)", result in particularly slow speeds.Conversely, in cases where a more balanced batch size is chosen, e.g., "(|I|, |B|) = (100, 100)", the computation tends to run faster.If the batch size for residual points |B| is excessively small, the advantage of GPU parallel computing with large batches of residual points cannot be utilized, leading to a slowdown in speed.If the batch size for dimensions |I| is too small, it sacrifices the reuse of the computational graph for calculating the second-order derivatives along each dimension, resulting in slower execution, consistent with our analysis in Section 4.4.
• Furthermore, in similarly extreme scenarios, selecting an exceptionally small batch size for the sampled dimension results in even slower speeds, i.e., the case "(|I|, |B|) = (1, 10000)" is much slower.In this example, the slowdown due to not reusing the computation graph for each dimension's second-order derivatives is much more obvious.Decreasing the batch size too much for either residual points or dimensions is not ideal, and which one has a greater negative impact depends on the specific setting and model architecture.Also, the case "(|I|, |B|) = (10, 1000)" is much slower than the case '(|I|, |B|) = (1000, 10)" points to the same fact.
In summary, this section demonstrates the stability of SDGD concerning batch size.It highlights that achieving good convergence results in high-dimensional PINNs can be accomplished with a small batch of randomly selected points and dimensions using SDGD.We also validate the guideline for batch size selection in Section 4.4.We have demonstrated the rapid convergence of SDGD on complex nonlinear PDEs using Algorithm 3 and its stability in particularly high dimensions across various nonlinear PDEs.Next, we will compare the speed and efficacy of Algorithms 1, 2, and 3. Given the stability of SDGD across various PDEs, we will employ the example of the nonlinear Sine-Gordon PDE in various dimensions, as discussed in Section 5.1.We have maintained identical experimental hyperparameters to focus specifically on comparing the algorithms as in the previous section.Furthermore, for the batch sizes of the three algorithms, including the batch sizes for residual/collocation points (|B|), PDE terms/dimensions in the forward pass (|J|), and PDE terms/dimensions for the backward pass (|I|), we consistently chose |B| = |I| = |J| = 100, ensuring a fair comparison.

Ablation
The results are shown in Table 3.The memory cost of the three algorithms is the same because their values of |I| and |B| are equivalent, representing the number of residual points and dimensions involved in the backward pass, respectively.Regarding speed, Algorithm 1 necessitates a forward pass for all dimensions.Despite being memory-efficient, its speed is compromised by the massive forward pass computation, particularly in extremely high dimensions.Algorithm 3 is slightly faster than Algorithm 2. This is attributed to the fact that the former samples only once, utilizing the same set of dimensions for both forward and backward passes.In contrast, the latter requires two separate sets of dimensions for these passes.Regarding performance, Algorithm 1 slightly outperforms Algorithm 2, which, in turn, is slightly better than Algorithm 3.This is because we employ more randomness to accelerate at the expense of introducing a certain degree of gradient variance.Nevertheless, Algorithm 3, the fastest in terms of speed, proves to be quite effective, with marginal differences compared to more precise algorithms.

Nonlinear Fokker-Planck PDEs and Comparison with Strong Baselines
Next, we present further results comparing our method based on PINN and other strong non-PINN baselines [1,21,52] for several nonlinear PDEs without analytical solutions.Thus, we adopt other methods' settings and evaluate the model on one test point, where we show that SDGD-PINNs are state-of-the-art (SOTA) models that still outperform other competitors.Concretely, the following nonlinear PDEs are considered: • Allen-Cahn equation.

Method
Table 4: Results of various methods on the Allen-Cahn, semilinear heat, and sine-Gordon PDEs with different dimensions.The evaluation metric is all relative L 1 error on one test point.SDGD (Ours) is the state-of-the-art (SOTA) method.
with the initial condition u(x, t = 0) = 5/(10 + 2∥x∥ 2 ).We aim to approximate the solution's true value on the one test point The reference values of these PDEs on the test point are computed by the multilevel Picard method [5,31] with sufficient theoretical accuracy.The exact solutions of these nonlinear PDEs are not solvable analytically and are highly nontrivial and nonseparable.The residual points of PINN are chosen along the trajectories of the stochastic processes that those PDEs correspond to: More training details are as follows.The model is a 4-layer multi-layer perceptron network with 1024 hidden units, which is trained via Adam [38] for 10K epochs, with an initial learning rate 1e-3 which decays exponentially with exponent 0.9995.We discretize the time into a stepsize of 0.015 and select boundary and residual points from |B| = 100 random SDE trajectories at each Adam epoch.We assign unity weight to the residual loss and 20 weight to the boundary loss, where in the latter, we fit the initial condition and its first-order derivative concerning x.We utilize Algorithm 2 with a minibatch of |I| = |J| = 10 dimensions to randomly sample the second-order derivatives along various dimensions in the Laplacian in all the PDEs.We repeat our experiment 5 times with 5 independent random seeds.
The results for these three nonlinear PDEs are shown in Table 4. PINN surpasses other methods in all dimensions and for all types of PDEs.Whether it is predicting the entire domain or the value at a single point, PINN is capable of handling both scenarios.This is attributed to the mesh-free nature of PINN and its powerful interpolation capabilities.

SDGD Accelerates PINN's Adversarial Training in HJB PDEs
In Wang et al. [60], it was demonstrated that the class of Hamilton-Jacobi-Bellman (HJB) equation [21,60] could only be effectively solved using adversarial training, which approximates the L ∞ loss.Specifically, they consider the HJB equation with linear-quadratic-Gaussian (LQG) control: where g(x) is the terminal condition/cost to be chosen.We can use the exact solution for testing and benchmarking PINNs' performances: We choose the following cost functions • Logarithm cost: following many previous papers [21,23,52,60] to obtain the HJB-Log case.We demonstrate that SDGD is the state-of-the-art (SOTA) high-dimensional PDE solver, where SDGD outperforms these previous methods on this classical benchmark PDE, especially the closely related PINN-based approach in [23,60].
• Rosenbrock cost: where c 1,i , c 2,i ∼ Unif[0.5, 1.5].This nonconvex cost function will lead to a PDE exact solution that is anisotropic and asymmetric along dimensions, highly coupled and inseparable.Notable, in the exact solution, all pairs of variables, both x i and x i+1 , exhibit pairwise interactions and are highly coupled.The corresponding HJB PDE has an effective dimension of d + 1, i.e., it can not be reduced to low-dimensional subproblems.This case is called the HJB-Rosenbrock case.
The solution's integration in HJB PDE cannot be analytically solved, necessitating Monte Carlo integration for approximation.We use the relative L 2 error approximating u as the evaluation metric for this equation.
In this example of the HJB equation with LQG control, we decompose the residual prediction of PINN along each dimension as follows: Despite its ability to maintain a low maximal memory cost during training, adversarial training approximating the L ∞ loss is widely recognized for its slow and inefficient nature [55], which poses challenges in applying it to high-dimensional HJB equations with LQG control.Adversarial training involves optimizing two loss functions: one for the PINN parameters, minimizing the loss through gradient descent, and another for the residual point coordinates, maximizing the PINN loss to approximate the L ∞ loss.This adversarial minimax process, known as adversarial training, is computationally demanding, often requiring multiple rounds of optimization before the resulting residual points can effectively optimize the PINN parameters.
The current state-of-the-art (SOTA) research by Wang et al. [60] and He et al. [23] has successfully scaled adversarial training to 250 dimensions.In this study, we use SDGD to enhance the scalability and efficiency of adversarial training of PINN in high-dimensional HJB equations.
The implementation details are given as follows.We use a four-layer PINN with 1024 hidden units, trained by an Adam optimizer [38] with a learning rate = 1e-3 at the beginning and decay linearly to zero.We modified the PINN structure to make the terminal condition automatically satisfied [45]: where g(x) is the terminal condition.Thus, we only need to focus on the residual loss.We conduct training for a total of 10,000 epochs.In all dimensions, we employ Algorithm 3 with |I| = 100 batches of PDE terms (dimensions) in the loss function for gradient descent and Algorithm 3 with |I| = 10 batches of PDE terms (dimensions) in the loss function for adversarial training.For both our methods and the baseline methods proposed by Wang et al. [60] and He et al. [23], we randomly sample |B| = 100 residual points per epoch, and 20K test points, both sampling from the distribution t ∼ Unif[0,1], x ∼ N (0, I d×d ).For generating the reference value for testing, we conduct a Monte Carlo simulation with 1e5 samples to estimate the integral in the exact solution.We repeat our experiment 5 times with 5 independent random seeds.
The results of adversarial training on the two HJB equations are presented in Table 5.For instance, in the 250D HJB-LQG case, achieving a relative L 2 error of 1e-2 using [60] requires a training time of 1 day.In contrast, our approach achieves a superior L 2 error in just 3.5 hours.In the 1000D case, full batch GD's adversarial training is exceedingly slow, demanding over 5 days of training time.However, our method effectively mitigates this issue and attains a relative L 2 error of 5.852E-3 in 217 minutes.Moreover, our approach enables adversarial training even in higher-dimensional HJB equations, specifically 10,000 and 100,000 dimensions, with resulting relative L 2 errors of approximately 5E-2.For the HJB-Rosenbrock case, our method is also consistently better than others, demonstrating SDGD's strong capability to deal with a complicated, inseparable, effectively high-dimensional, non-convex, and highly coupled solution.
Furthermore, although the HJB equation is also parabolic and non-PINN traditional methods like [1,21,22] can be used, we test on 20,000 points.SDGD with PINN trains a surrogate model to predict these 20,000 points after one training process.However, these traditional methods can only predict one point after one training instance, so they require 20,000 rounds of separate inferences, which is too costly.HJB-Log Results for u Wang et al. [60] He et al. [ In all dimensions, our method significantly outperforms the other two baselines in terms of accuracy and speed.The baselines, employing full batch gradient descent, experience slower speed and increased memory requirements in higher-dimensional scenarios.In contrast, our approach utilizes Algorithm 3, which employs stochastic gradient descent over PDE terms, resulting in faster computation and relatively lower memory consumption.

Schrödinger Equation
Here, we scale up the Tensor Neural Networks (TNNs) [61,62] for solving very high-dimensional Schrödinger equations.Due to its unique characteristics, the Schrödinger equation cannot be addressed by the traditional PINN framework and requires a specialized TNN network structure and corresponding loss function for high-dimensional eigenvalue problems.However, our method can still be flexibly integrated into this framework for scaling up Schrödinger equations, demonstrating the versatility of our approach.In contrast, other methods specifically designed for high-dimensional equations exhibit much greater limitations.

Tensor Neural Network (TNN) for Solving Schrödinger Equation
In general, the Schrödinger equation can be viewed as a high-dimensional eigenvalue problem. where is the unknown solution and λ is the unknown eigenvalue of the high-dimensional eigenvalue problem.This problem can be addressed via the variational principle: The integrals are usually computed via the expensive quadrature rule, e.g., Gaussian quadrature: where N is the index set for the quadrature points.The size N grows exponentially as the dimension increases, incurring the curse of dimensionality.Hence, traditional methods for solving Schrödinger equations suffer from the curse of dimensionality due to the exponential growth of the grid size in the quadrature for numerical integral in the equation's variational form.So, traditional methods cannot be used even in the lowest 100-dimensional case in our experiment.To overcome this issue, the Tensor Neural Network (TNN) adopts a separable function network structure to reduce the computational cost associated with integration: where x i is the ith dimension of the input point x, and ϕ j (x i ; θ i ) is the jth output of the sub-wavefunction network parameterized by θ i , and p is the predefined rank of the TNN model.The integral by quadrature rule of the TNN can be calculated efficiently: where N i is the index set for the ith dimensional numerical integral.Thus, the numerical integrals can be computed along each axes separately to reduce the exponential cost to linear cost.We refer the readers to [61,62] for more details.With the efficient quadrature achieved by TNN's separable structure, the loss function to be minimized is where Φ(x; θ) is the neural network wave function with trainable parameters θ.The primary computational bottleneck of TNNs for this problem lies in the first-order derivatives in the loss function, while other terms only contain zero-order terms with lower computational and memory costs.Since TNN is a separable architecture, the d-dimensional Schrödinger equation requires computing the first-order derivatives and performing quadrature integration for each of the d subnetworks.Consequently, the computational cost scales linearly with the dimension.In comparison to previous examples of PINN solving PDEs, TNN incurs larger memory consumption for first-order derivatives due to the non-sharing of network parameters across different dimensions, i.e., d independent first-order derivatives of d distinct subnetworks are required, resulting in increased computational requirements.
Mathematically, the computational and memory bottleneck is the gradient with respect to the first-order term of the neural network wave function Φ(x; θ) By employing the proposed SDGD approach, we can simplify the entire gradient computation: where I ⊂ {1, 2, • • • , d} is a random index set.Therefore, we can sample the entire gradient at the PDE term level to reduce memory and computational costs, and the resulting stochastic gradient is unbiased, i.e., E I [grad I (θ)] = grad(θ) ensuring the convergence of our method.The gradient accumulation can also be done for large-scale problems by our method; see Section 3.3 for details.
It is important to note that our method is more general compared to the traditional approach of SGD solely based on quadrature points.In particular, in extremely high-dimensional cases, even computing the gradient for a single quadrature point may lead to an out-of-memory (OOM) error.This is because the traditional approach requires a minimum batch size of d PDE terms and 1 quadrature point, where d is the dimensionality of the PDE.However, our method further reduces the computational load per batch to only 1 PDE term and 1 quadrature point.

Experimental Setup
We consider the Coupled Quantum Harmonic Oscillator (CQHO) potential function: , where all pairs of variables, both x i and x i+1 , exhibit pairwise interactions and are highly coupled.The original problem is defined on an infinite interval.Therefore, we truncate the entire domain to [−5, 5] d .The exact smallest eigenvalue is λ = d i=1 1 − cos iπ d+1 .We use the same model structure and hyperparameters as Wang et al. [61].For this problem, we test the effectiveness of the Gradient Accumulation method in Section 3.3 based on our further decomposition of PINN's gradient.
The test metric we report is the L 1 relative error of the minimum eigenvalue, given by |λtrue−λapprox| |λtrue| . Due to the diminishing nature of the eigenfunctions in high dimensions, we only report the accuracy of the eigenvalues.It is important to note that the eigenvalues carry physical significance as they represent the ground state energy.

Experimental Results
On the CQHO problem, our gradient accumulation method achieves training in 2 * 10 4 dimensions within the limited GPU memory in Figure 2. In all dimensions, SDGD achieves a stable relative error of less than 1e-6.Note that gradient accumulation performs theoretically the same as full batch GD, so we do not compare their results, but the former can save GPU memory, thus scaling up PDEs to higher dimensions.These results demonstrate the scalability of our method for PINN and PDE problems and highlight its generality, making it applicable to various physics-informed machine-learning problems.
During the early stages of optimization, we observe quick convergence of TNNs.Since the model parameters are initialized randomly, there is more room for improvement.At this point, the gradients of the loss function with respect to the parameters can be relatively large.Larger gradients result in more significant updates to the parameters, leading to faster convergence.
Despite the satisfactory convergence results, these problems demonstrate one shortcoming of our proposed approach.Our method is primarily designed to address the computational bottleneck arising from high-dimensional differentiations in PDEs.However, if the problem itself is large-scale and does not stem from differential equations, our method cannot be directly applied.Specifically, in the case of the Schrödinger equation, due to the separable nature of its network structure, we employ a separate neural network for each dimension.This results in significant memory consumption, even without involving differentials.Additionally, in this problem, we cannot sample the points of integration / residual points because there is an integral term in the denominator.If we sample the integral in the denominator, the resulting stochastic gradient will be biased, leading to poor convergence.Therefore, for such problems, designing an improved unbiased stochastic gradient remains an open question.

Conclusions
CoD has been an open problem for many decades, but recent progress has been made by several research groups using deep learning methods for specific classes of PDEs [1,21,52].Herein, we proposed a general method based on the mesh-free PINNs method by introducing a new type of stochastic dimension gradient descent or SDGD as a generalization of the largely successful SGD method over subsets of the training set.In particular, we claim the following advantages of the SDGD-PINNs method over other related methods: Generality to Arbitrary PDEs.Most existing approaches [1,21,52] are restricted to specific forms of parabolic PDEs.They leverage the connection between parabolic PDEs and stochastic processes, employing Monte Carlo simulations of stochastic processes to train their models.Consequently, their methods are limited to a subset of PDEs.In contrast, our approach is based on PINNs capable of handling arbitrary PDEs.Notably, we can address challenging cases such as the wave equation, biharmonic equation, Schrödinger equation, and other similar examples, while [1,21,52] cannot.
Prediction on the Entire Domain.Furthermore, existing methods [1,21,52] can only predict the values of PDE solutions at a single test point during each training instance.This limitation arises from the necessity of setting the starting point of the stochastic process (i.e., the test point for the PDE solution) before conducting Monte Carlo simulations.Consequently, the computational cost of their methods [1,21,52] increases greatly as the number of test points grows.In contrast, our approach, based on PINNs, allows for predictions across the entire domain in a single training instance, achieving a "once for all" capability.
Dependency on the Mesh.Moreover, precisely because these methods [1,21,52] require Monte Carlo simulations of stochastic processes, they necessitate temporal discretization.This introduces additional parameters and requires a sufficiently high discretization precision to obtain accurate solutions.For PDE problems that span long durations, these methods undoubtedly suffer from the burden of temporal discretization.However, our PINN-based approach can handle long-time PDEs effortlessly.Since PINNs are mesh-free methods, the training points can be randomly sampled.The interpolation capability of PINNs ensures their generalization performance on long-time PDE problems.
PINNs can also be adapted for prediction at one point.Although PINNs offer advantages such as being mesh-free and capable of predicting the entire domain, they still perform remarkably well in settings where other methods focus on, i.e., single-point prediction.Specifically, if our interest lies in the value of the PDE at a single point, we can exploit the relationship between the PDE and stochastic processes.This single point corresponds to the initial point of the stochastic process, and the PDE corresponds to a specific stochastic process, such as the heat equation corresponding to simple Brownian motion.In this case, we only need to use the snapshots obtained from the trajectory of this stochastic process as the residual points for PINN.
We have proved the necessary theorems that show SDGD leads to an unbiased estimator and that it converges under mild assumptions, similar to the vanilla SGD over collocation points and mini-batches.SDGD can also be used in physics-informed neural operators such as the DeepONet [44,16] and the FNO [41,42] and can be extended to general regression and possibly classification problems in very high dimensions.
The potential limitation of SDGD is that it requires a relatively larger batch size for the dimension when dealing with extremely high-dimensional PDEs.A too-small batch size may incur high stochastic gradient variance that slows down the convergence.In addition, for PDEs not following the boundary+residual forms, like the Schrödinger equation, a special loss function is generally required, which may cause the core memory bottleneck to not be in the dimension that SDGD focuses on.This causes the minimum memory that SDGD can obtain to grow faster with dimensions.
In summary, the new SDGD-PINN framework is a paradigm shift in the way we solve high-dimensional PDEs as it can tackle arbitrary nonlinear PDEs, of any dimension providing high accuracy at extremely fast speeds.

A Proof
A.1 Proof of Theorem 4.1 The unbiasedness of the stochastic gradient in Algorithm 1 can be derived as follows: The unbiasedness of the stochastic gradient in Algorithm 2 can be derived as follows: A.2 Proof of Theorem 4.2 Lemma A.1.Suppose that we have N fixed numbers a 1 , a 2 , • • • , a N and we choose k random numbers {X i } k i=1 from them, i.e., each X i is a random variable and X i = a n with probability 1/N for all n ∈ {1, 2, • • • , N }, and X i are independent.Then the variance of the unbiased estimator Proof of Lemma A.1.Since the samples X i are i.i.d., Proof of Theorem 4.2.We assume that the total full batch is with N r residual points {x i } Nr i=1 and N L PDE terms L = N L i=1 L i , then the residual loss of the PINN is where we normalize over both the number of residual points and the number of PDE terms.The full batch gradient is: From the viewpoint of Algorithm 1, the full batch gradient is the mean of N r N L terms: where we denote The stochastic gradient generated by Algorithm 1 is given by We have the variance From high-level perspective, V B,J [g B,J (θ)] should be a function related to |B| and |J|.Thus, under the constraint that |B| • |J| is the same for different SGD schemes, we can choose B and J properly to minimize the variance of SGD to accelerate convergence.
The entire expectation can be decomposed into four parts (1) i = i ′ , j = j ′ , (2) i ̸ = i ′ , j = j ′ , (3) i = i ′ , j ̸ = j ′ , (4) i ̸ = i ′ , j ̸ = j ′ .For the first part where we denote N L j=1 g i,j (θ) 2 which is independent of B, J.In the second case, since the ith sample and the i ′ th sample are independent for i ̸ = i ′ , we have (62) In the third case, due to the same reason, In the last case, where M (l) = max {∥W l ∥, 1}, the norms are all vector or matrix 2 norms and vec denotes vectorization of the high-order derivative tensor, and h is the maximal width of the neural network, i.e., h = max(m L , • • • , m 0 ).
Due to Assumption 4.2, the values generated by the PDE operator acting on the neural network and its derivative with respect to model parameters θ (the gradient for optimization) will also fall into a bounded domain throughout the optimization, whose closure is compact.
Since we bound the gradient produced by physics-informed loss functions, we can further bound the gradient variance during PINN training using SDGD.
For the stochastic gradient produced by Algorithm 1, V B,J (g B,J (θ)) ≤ where g i,j (θ) := (Lu θ (x i ) − R(x i )) ∂ ∂θ L j u θ (x i ) .For the stochastic gradient produced by Algorithm 2, V B,J,K (g B,J,K (θ)) ≤ where g i,j,k (θ Since |L j u θ (x i )| and ∂ ∂θ L j u θ (x i ) can be bounded throughout the optimization process thanks to Lemma 4.1 and Assumption 4.2, the gradient variance during optimization is universally bounded and finite, i.e., the gradient variance bounds are agnostic of the epoch n.
Thus, we complete the proof by applying Theorem 4 in Mertikopoulos et al. [48].

1: while NOT CONVERGED do 2 :
Choose random indices I, J ⊂ {1, 2, • • • , N L }, where we can set |I| ≪ N L to minimize memory cost and |J| ≪ N L to further speed up.

Algorithm 3 2 :
Training Algorithm for scaling-up and more speeding-up by sampling both the forward and backward passes only once.1: while NOT CONVERGED do Choose random indices I ⊂ {1, 2, • • • , N L }, where we can set |I| ≪ N L to minimize memory cost and to further speed up.

Lemma 4 . 1 .
(Bounding the Neural Network Derivatives) Consider the neural network defined as Definition 4.1 with parameters θ, depth L, and width h, then the neural network's derivatives can be bounded as follows.
, • • • , I n on various machines simultaneously and accumulate them to form a larger-batch stochastic gradient.Algorithm 1 3:

Table 1 :
Relative L 2 error, time, and memory costs for vanilla PINNs Study 1: Effect of the Batch Sizes for Residual Points and Sampled Dimensions

Table 5 :
Relative L 2 error results, running time across different dimensions for the baselines and ours, for two HJB equations requiring adversarial training of PINNs.