Application and effects of physics-based and non-physics-based regularizations in artificial intelligence-based surrogate modelling for highly compressible subsurface flow

Artificial intelligence (AI)-based surrogate reservoir models (SRMs) can provide computationally feasible and accurate approximations to numerical simulations. An AI-based SRM is trained to learn a set of parameters that significantly reduces its variance. This can be done by either supervised or semi-supervised learning. The latter involves regularization of the model's parameters using non-physics-based, physics-based or a combination of both regularization terms. Effective enforcement of the physics-based and non-physics-based regularizations can significantly reduce the variance of AI-based SRMs, yet little study has been reported on the application and effects of these regularization terms. Moreover, for highly compressible subsurface flow, where strong nonlinearities exist, well-constructed composite AI-based architectures and regularizations are necessary for learning. This paper applies and studies the effects of various regularization terms for highly compressible subsurface flow, and it proposes unique and effective techniques in AI-based surrogate development and training. The learning utilizes the discretized domain and boundary physics, with derivatives obtained from both finite difference methods (FDM) and algorithmic differentiation (AD). The regularizations are enforced partly as a hard constraint in the network architecture, using a trainable layer, and partly as soft constraints, using a multi-cost function. The soft constraints exploit a tank material balance and time-discretization numerical errors, in addition to the domain, boundary and non-physics-based L2 regularization terms. The timely-trained AI-based surrogate predictions agree with those obtained from a numerical simulator. The regularization terms separately contribute to improved learning. The non-physics-based L2 norm, if used in the right order of magnitude, improves grid block predictions.
The tank material balance regularization term constrains the AI-based surrogate parameters to the net domain accumulation, ensuring its reliability. The trainable hard enforcement layer enforces the initial condition and improves the predictions compared to other hard enforcement techniques. The discretized domain equation and time-discretization numerical errors allow for learning of variable timesteps, which give the best rounding-truncation error trade-off and improve the predictions compared to those of fixed timesteps. The AI-based surrogate, effectively trained by semi-supervision, can be reliably used as a state-dependent model in domain analyses such as sensitivity studies and data assimilation.


Introduction
Highly compressible fluids are encountered in many subsurface systems, and accurately simulating their flow behaviour using new frontiers like AI is essential for exploitation from, or storage in, these systems. Fluid flow in porous media is governed by a partial differential equation (PDE), and for highly compressible fluids this PDE is strongly nonlinear. The PDE solutions can be computationally expensive for the numerical simulator and can delay decision-making.
A surrogate model can provide a computationally feasible and reliable approximation to the numerical simulations; it gives a fit-for-purpose model for quickly obtaining the multiple flow realizations of the subsurface domain without much sacrifice to accuracy or details (Memon et al., 2013; Mohaghegh et al., 1996). A surrogate's only significant time expense is during the training phase, after which predictions are made in seconds. Surrogate models developed and used in subsurface modelling can be categorized into projection-based reduced order models (ROM) (Cardoso and Durlofsky, 2010; Markovinović and Jansen, 2006; Xiao et al., 2019, 2022) and data-driven models.
Data-driven surrogate modelling involves empirical approximation by either a response surface or an artificial neural network (ANN). The data-driven models do not require intrusive access to high-fidelity numerical simulator codes; their estimators are calibrated with datasets containing sufficient training examples. A training example consists of a feature vector and a label. The feature vector is an instance in the domain, e.g., the space and time characteristics, and the label is a collection of true responses at the instance. The calibration, called machine learning (ML), is performed by supervised, unsupervised or weakly-supervised methods. Supervised learning involves learning on a labelled dataset (Hastie et al., 2009; Dridi, 2021). Unsupervised learning involves identifying patterns in datasets that are neither labelled nor classified (Ghahramani, 2003). Weakly-supervised learning involves learning by either incomplete, inexact or inaccurate supervision (Zhou, 2018). Semi-supervision, a kind of incomplete supervision, is of interest in this paper. It involves using an incomplete dataset, in which a small amount is labelled while the rest is unlabelled. However, the unlabelled training examples can utilize the labelled subdataset for learning (Chapelle et al., 2009; Zhou, 2018; Van Engelen and Hoos, 2020). Fig. 1 is a description of the learning process.
Deep learning architectures can handle high-dimensional parameters and can overcome the curse of dimensionality, i.e., the volume of data needed to estimate an arbitrary function increases exponentially with the number of input parameters. The exact mechanism by which deep learning (DL) algorithms break this curse remains a fundamental question. The manifold hypothesis (Fefferman et al., 2013) is one theory that justifies how DL can handle this curse; it implies that within real-world high-dimensional data there lies a low-dimensional manifold embedded in the high-dimensional space. DL techniques can discover and utilize this low-dimensional manifold. Hierarchically local compositional functions in deep convolutional networks, where all the constituent functions are local and have bounded, small dimensionality, can approximate high-dimensional data without the curse of dimensionality, even without weight sharing (Poggio and Liao, 2018).
In AI-based surrogate modelling, learning follows an inductive approach to reduce bias and variance. The bias error is due to erroneous assumptions in the AI-based surrogate structure, and the variance is the sensitivity of the surrogate to changes in input. An optimum surrogate model is one that yields low bias and low variance errors. The bias can be reduced by changing the ANN architecture, e.g., increasing the number of trainable parameters by adding more transformation layers. The variance is reduced by either supervised learning or regularization. Regularization involves using non-physics-based or physics-based penalties to constrain the surrogate's parameters when learning on a sparse dataset; this is a semi-supervised method. A sparse dataset can occur when the training examples are few; the additional simulation and processing time associated with generating more training examples can defeat the timeliness of the surrogate. A sparse dataset can also occur when there are sufficient inexpensive training examples that are incompletely labelled, i.e., a small amount is labelled and the rest is unlabelled. The regularized parameters prevent overfitting and also reduce the likelihood of exploding gradients during training. Non-physics-based regularization has been applied for improved AI-based surrogate modelling of gas and gas-condensate subsurface flow (Molokwu and Jamiolahmady, 2022).
Physics-based regularization can significantly reduce the surrogate model's variance by utilizing the domain and boundary physics, as in the physics-informed neural network (PINN) (Raissi and Karniadakis, 2018; Raissi, 2018; Sun et al., 2020; Tartakovsky et al., 2020) or the theory-guided convolutional neural network (TgCNN) (Wang et al., 2021). Effective implementation of physics-based regularization can compensate for dataset incompleteness. Physics-based regularization of neural networks has been used to learn linear and nonlinear PDEs and hidden fluid relationships in fluid flow problems with limited amounts of labelled datasets (Kadeethum et al., 2020; Raissi, 2018; Meng and Karniadakis, 2020), and without labelled datasets (Sun et al., 2020). Physics-based methods have also been applied in surrogate reservoir modelling to learn the theories of subsurface hydrocarbon flow in linear and nonlinear PDEs, solved using FDM for single-phase and multiphase flow of slightly compressible fluids (oil and water) (Wang et al., 2021, 2022).
Physics-based regularization requires the evaluation of derivatives and their higher orders. These derivatives can be evaluated algorithmically (AD) or numerically (FDM). AD allows for exact evaluation using software computing libraries; however, it can introduce significant overhead when computing higher-order derivatives, as the gradients obtained by the chain rule are stored in memory. AD can adversely affect computation time if the network graph is deep. FDM uses finite differences for faster evaluation of the space-time derivatives; it requires that the space-time domain is discretized, resulting in round-off (floating-point precision) and truncation errors. ANNs can be used to learn the numerical errors associated with the discretization, which are then reduced as part of the physics-based losses. Learning the associated errors yields an optimal discretization step at each time point, which is used during training.
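The FDM route described above can be sketched in a few lines; the snippet below is a generic NumPy illustration of a second-order central difference (not the paper's implementation):

```python
import numpy as np

def fd_second_derivative(p, dx):
    """Second-order central finite difference d2p/dx2 on a uniform grid.

    Returns values at interior points only; the truncation error is O(dx**2),
    while round-off grows as dx shrinks -- the trade-off discussed in the text.
    """
    return (p[2:] - 2.0 * p[1:-1] + p[:-2]) / dx**2

# On a quadratic, the central difference is exact up to round-off.
x = np.linspace(0.0, 1.0, 11)
d2p = fd_second_derivative(x**2, x[1] - x[0])  # all entries ~ 2.0
```

Unlike AD, this evaluation never builds a computational graph, which is why it stays cheap for deep architectures at the cost of discretization error.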
Highly compressible fluids, due to their nonlinear PDE, require deeper network architectures for the evaluation of pressure-dependent fluid properties. The FDM can be used as a timely solution for the evaluation of the derivatives; however, this could result in non-convergence if the AI-based surrogate is not properly enforced and the discretization steps are not carefully selected. This paper investigates the application and effects of applying various physics-based regularization terms, including the non-physics-based L2 norm regularization term, in AI-based surrogate modelling for highly compressible subsurface flow. The training is done by semi-supervision, i.e., only the initial and boundary subdatasets are labelled (with the initial pressure and the boundary (well) production constraints, respectively) while the remaining, larger set of training examples is unlabelled. This paper uniquely proposes the following for effective learning:
1. A custom three-module network architecture, comprised of pressure, fluid property and timestep modules. The pressure module utilizes FDM while the fluid property module uses AD in computing its derivatives. The timestep module allows for variable timesteps by learning the combined discretized domain PDE and time-discretization numerical errors.
2. A tank material balance regularization term, in addition to the domain, boundary and L2 norm regularization terms.
3. A trainable layer for hard enforcement of the initial condition and output refinement.
4. Time-discretization numerical errors derived from two Euler differences.
The paper is organized as follows: the physics-based and non-physics-based regularizations are described in the methodology. In the next section, the custom network architecture is explained. In the case study section, the implementation of the AI-based surrogate on a dry gas, multi-well reservoir with varying operating conditions, including a shut-in, is presented. In the results section, the performance of the AI-based surrogate is compared with results obtained from a numerical reservoir simulator for a permeability-time dataset of 146 × 48 realizations, and the effects of the various regularization terms are presented. The paper concludes with a summary of the main findings.

Physics-based equations for single-phase compressible fluid flow in porous media
The general equation describing the flow of a single-phase compressible fluid (i.e., gas) in a porous medium is obtained by combining the continuity equation and Darcy's transport equation (Aziz and Settari, 1979): ṁ is the rate of mass accumulation in a sink or source (negative if a source) and Vb is the infinitesimal control bulk volume, both located in the fluid flow domain. Dividing through by the fluid density at standard conditions ρsc, then expanding the right-hand side (RHS) of Eq. (1), yields Eq. (2), the domain solution, where qsc is the sink/source rate at standard conditions. The domain permeability k is a constitutive relation between the pressure gradient and the flow velocity distribution in the domain; k can be represented by a tensor or assumed isotropic. The PDE in Eq. (2) is linear if the fluid's viscosity μ and formation volume factor B are assumed constant and the derivative term is represented by a constant fluid compressibility; this is the assumption for slightly compressible subsurface flow, e.g., oil flow. However, for highly compressible subsurface flow, the fluid properties and their derivatives are not constant but functions of pressure. The PDE for highly compressible subsurface flow has variables and derivatives which are strongly pressure-dependent, resulting in a nonlinear PDE and a composite (deeper) neural network architecture. Solving Eq. (2) requires specifying the boundary conditions, namely the Dirichlet and Neumann boundary conditions (Abou-Kassem et al., 2006). Combining the domain and boundary equations yields the physics-based losses. In this work, the physics-based losses are evaluated by computing the space- and time-dependent first-order and second-order partial derivatives in Eq. (2) using FDM. AD, a superior technique in shallow computational graph architectures, is used in computing the pressure-dependent (fluid) derivatives. The time-discretization errors are reduced by the timestep ANN. Table 1 shows the different differentiation schemes used in the evaluation of the derivatives.

Domain equation
A 2D reservoir with uniform thickness, anisotropic rock properties, zero dip and a highly compressible fluid is considered.The reservoir is penetrated by a group of wells.
Evaluating the domain PDEs by FDM involves the specification of fixed finite dimensions Δx, Δy, Δz and a timestep Δt. For the 2D reservoir model, Δz is the reservoir thickness h. A schematic of the discretized reservoir is shown in Fig. 2. The 2D domain loss MSE|DOM for a given timestep is built from the residual error of the domain; its terms include a regularization weight for the domain loss, the grid block bulk volume Vb, the number of collocation points (grid blocks) in the flow domain and the time-discretization round-off and truncation errors. a and b are field-unit conversion constants with values of 0.001127 and 5.615, respectively. Evaluating Eq. (7) at an unknown time level n + 1 gives the centred-in-distance, backward-in-time formulation. This is a stable finite difference scheme, as the associated numerical errors do not grow when the computations are carried forward (Peaceman, 2000). The time level n + 1 is used in this paper. Evaluating Eq. (7) requires specifying a small advancement (timestep) and recomputing the AI-based graph. The evaluation, a forward pass, is performed twice to obtain the training losses. This is a less expensive process when the network computational graph is static, i.e., the flow of data, its size and the function evaluation types are known before the start of training. The nonlinearities introduced by the compressibility, viscosity and formation volume factor terms are locally linearized in space. The local linearization approximates the properties at the grid block edges and time level. Different linearization schemes have been used to evaluate the pressure-dependent terms, e.g., single-point upstream weighting, average function value weighting and average pressure value weighting (Abou-Kassem et al., 2006). In this work, the average function value weighting, fi±1/2 linearized implicitly, is used for the local linearization of the pressure-dependent terms. The directional x-transmissibility across a grid face is computed using the interblock permeability obtained by harmonic averaging; Eqs. (11)-(14) can be applied to compute the directional y-terms by changing the parameter axis. The compressibility term is evaluated using a chord slope, i.e., the pressure derivative of the pressure-dependent fluid term taken between two time levels; it affects the convergence of numerical simulations and can also affect the physics-based gradients propagated by a machine learning model. The chord slope is computed by polyharmonic spline layers in the fluid property module. These layers have radial basis functions as their neurons, are rapidly pretrained to observed fluid properties and are highly memory-efficient.
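As an illustration of the harmonic averaging step, a minimal sketch follows; the distance-weighted form is the standard interblock average, with symbols assumed rather than copied from the paper's Eqs. (11)-(14):

```python
def harmonic_perm(k1, k2, dx1, dx2):
    """Distance-weighted harmonic average of permeability across a grid face.

    k1, k2 are the permeabilities of the two adjacent blocks and dx1, dx2
    their sizes along the flow axis. The harmonic form follows series flow
    between block centroids, so the lower permeability dominates the face.
    """
    return (dx1 + dx2) / (dx1 / k1 + dx2 / k2)
```

For equal-sized blocks of 100 mD and 300 mD this gives 150 mD, well below the arithmetic mean of 200 mD, reflecting the series-flow restriction.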
V.C. Molokwu et al.
The pretraining maps smooth functional models to discrete pressure, volume and temperature (PVT) tables. These tables are noisy in most practical cases, and if obtained from equation-of-state (EOS) models, the models are in some cases not simple and can diverge or give erroneous results. The memory efficiency of this layer allows for AD, thereby obtaining accurate representations of the chord slope.

Neumann and Dirichlet boundary conditions
The Neumann boundary condition (NBC) specifies the normal derivative or flux of the potential function (i.e., of the PDE solution) on the boundary Γ to be a constant or zero (Mazumder, 2016; Venkateshan and Swaminathan, 2014). The NBC can be specified either at the inner boundaries or at the outer edges. In a Neumann inner boundary condition, the well (or group of wells) is under a constant rate at the sand face, and Darcy's law (steady state) can be applied. Similarly, in a Neumann outer boundary condition specification, the outer boundary is under a certain or zero flux at any time. The Dirichlet boundary condition (DBC) specifies the value that the solution function of the PDE must have on the boundary Γ. The DBC can be at the inner or the outer boundaries. In a Dirichlet inner boundary condition, the well (or group of wells) is under constant pressure at the sand face at any time. Similarly, in a Dirichlet outer boundary condition, the outer boundary is under constant pressure at any instance of time, e.g., an aquifer or adjoining compartment (Idorenyin, 2016). Under steady state, the net flow across the inner boundary is zero; the finite difference for the well grid block under steady state follows, and Eq. (16) can be evaluated at the well grid block as Eq. (17). Eq. (17), expressed as the inner boundary condition loss MSE|IBC, involves a binary matrix having the same dimension as the domain; it specifies the grid block locations penetrated by the group of wells. The binary matrix allows for multiple well connections existing in the grid domain. The rate qsc is a dataset label, which does not require an expensive process to generate. For a constant-rate well, qsc is the initial rate of the well. Eq. (17) can be solved to relate the rate to the difference between the well grid block (centroid) pressure and a flowing bottom-hole pressure (Peaceman, 1978).
Eq. (19) can be rearranged to evaluate the flowing bottom-hole pressure for a constant-rate system using the equivalent radius of the well block, i.e., the radius at which the steady-state flowing pressure for the actual well is equal to the numerically calculated pressure for the well block. The equivalent radius for an anisotropic system is given by Peaceman (1983) as req = 0.28 [(ky∕kx)^1∕2 Δx² + (kx∕ky)^1∕2 Δy²]^1∕2 ∕ [(ky∕kx)^1∕4 + (kx∕ky)^1∕4]. The outer boundary residual error R|OBC, if any, can be computed by expressing the flow across the outer boundaries using auxiliary grid blocks, as shown in Fig. 3. For a no-flow outer boundary, the auxiliary (or image) grid blocks have centred pressures equal to those of the real grid blocks located around the boundary. The outer boundary residual error using the auxiliary grid blocks is a subset of the domain loss.
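The Peaceman (1983) equivalent radius can be evaluated directly; the sketch below implements the published anisotropic formula (function name and units are illustrative):

```python
import math

def peaceman_equiv_radius(dx, dy, kx, ky):
    """Peaceman (1983) equivalent well-block radius for an anisotropic grid.

    r_eq = 0.28 * sqrt( sqrt(ky/kx) * dx**2 + sqrt(kx/ky) * dy**2 )
               / ( (ky/kx)**0.25 + (kx/ky)**0.25 )

    For an isotropic square block this reduces to 0.28 * dx / sqrt(2),
    i.e. roughly 0.198 * dx.
    """
    num = math.sqrt(math.sqrt(ky / kx) * dx**2 + math.sqrt(kx / ky) * dy**2)
    den = (ky / kx) ** 0.25 + (kx / ky) ** 0.25
    return 0.28 * num / den
```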

Initial condition
This is a special case of the Dirichlet boundary condition, specifying the solution at a time boundary of zero and for all spatial locations. The AI-based surrogate requires supervision for learning the initial condition (IC) loss. The dataset label for this loss is the initial pressure p(t=0), which can be obtained without an expensive process. AI-based learning on this loss conditions the network's trainable weights to recognize the initial state of the system. The IC residual error R|IC is computed from the pressure solution at t = 0 estimated by the AI-based SRM. The residual in Eq. (22) can be fully satisfied, i.e., MSE|IC = 0, by hard enforcement of the network pressure output. Hard enforcement is a way to improve the quality of predictions while relaxing the requirements for labelled data (Márquez-Neila et al., 2017). A trainable hard enforcement layer is proposed; this results in an updated pressure solution, which could be mapped to the Ei-function solution for an infinite-acting radial flow as shown in Eq. (23), in which the network term is the output from the AI-based SRM and is multiplied by a scaling function based on a constantly changing input variable. A permeability-time-dependent scaling function is used. The input normalization limits are set to −1 and 1, respectively. The updated pressure output in Eq. (23) ensures the initial condition loss R|IC is always zero at t = 0. The exponent in the scaling function can be set as 0.5 or made a trainable parameter in the layer. When made trainable, this parameter is estimated by the CEDNN optimizer updates, subject to a constraint; the constraint is necessary to prevent exploding gradients associated with power functions in deep learning. The effects of the trainable and non-trainable states of this parameter are shown in the discussion section.
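A minimal sketch of the hard-enforcement idea follows; the scaling form (1 − e^(−t))^β is an assumed stand-in for the paper's permeability-time-dependent scaling function, chosen only because it vanishes at t = 0:

```python
import numpy as np

def hard_enforced_pressure(net_out, p_init, t, beta_h=0.5):
    """Hard enforcement of the initial condition in the output layer.

    The raw network output is multiplied by a scaling function that is zero
    at t = 0, so the prediction reduces exactly to p_init there and the IC
    residual is identically zero. beta_h may be fixed (e.g. 0.5) or exposed
    as a trainable parameter, as the text describes.
    """
    scale = (1.0 - np.exp(-t)) ** beta_h  # illustrative; vanishes at t = 0
    return p_init + scale * net_out
```

Because the constraint is built into the architecture, the IC loss needs no term in the multi-cost function.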

Tank material balance equation
A material balance equation is adapted as a physics-based loss to accelerate the learning of the reservoir physics by the AI-based surrogate.
Setting the divergence term in Eq. (2) to zero results in a tank model. Material balance checks are used in numerical simulators to ensure that the net tank accumulation is consistent with a proposed rate. The material balance check is an important regularization term when the time domain is sparsely sampled. This is usually the case, as it would be memory intensive to sample the time domain at very fine intervals; e.g., regular intervals of 1 day for a year yield 365 time points, which, when combined with spatial locations, could take a longer training time. The time domain, however, can be sparsely sampled while specifically considering time points at which the system undergoes significant rate changes, e.g., shut-in or well-opening time points. This yields fewer time points, which can be timely trained. A stable form of the material balance error for a given timestep Δt is given in Eq. (25), which accumulates the loss linearly and has a similar order of magnitude to the other physics-based losses, thereby preventing exploding gradients. The material balance loss MSE|MB applies a regularization weight λMB to the material balance error.
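The tank material-balance check can be sketched as a residual that a consistent pressure field drives to zero; the symbols, units and residual form below are assumptions, not the paper's exact Eq. (25):

```python
import numpy as np

def tank_mb_residual(p, p_init, vb, phi, inv_B, q_sc, dt):
    """Illustrative tank material-balance residual over one timestep dt.

    Net accumulation: change in (pore volume * 1/B) summed over all grid
    blocks, compared against the well offtake q_sc * dt. inv_B is a callable
    returning 1/B at a given pressure. A pressure field consistent with the
    proposed rates drives the residual to zero.
    """
    accumulation = np.sum(vb * phi * (inv_B(p_init) - inv_B(p)))
    offtake = np.sum(q_sc) * dt
    return accumulation - offtake
```

Squaring and weighting such a residual (by a factor like λMB) gives a scalar loss in the same spirit as the paper's MSE|MB term.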

Numerical errors
Solving the PDEs by FDM yields numerical errors: round-off and truncation errors. The round-off error is due to the approximation of the numerical computations and is affected by the precision of the DL algorithm. The truncation error is due to the elimination of higher-order Taylor series terms in the discretized space-time. The numerical errors can cause numerical diffusion, a situation where the discretized domain experiences a higher diffusivity than the physical (real) domain. The numerical errors, as artefacts, diffuse similarly to the pressure transient. The magnitude of the numerical dispersion is affected by the finite difference scheme, the pressure diffusivity (mobility-storativity ratio), and the space and time steps (Peaceman, 2000; Zhu et al., 2019). When the physical diffusivity is small, e.g., in low-permeability systems, the finite difference solutions can contain significant numerical errors, as numerical diffusion has a greater effect compared to the physical diffusivity. The AI-based surrogate will have numerical errors inherent in its learning when FDM are used to solve the PDE. These numerical effects cannot be eliminated but can be reduced using smaller discretization steps, a stable finite difference scheme in which the numerical errors do not grow as the computations are carried forward in time, or mobility weighting in two-phase flow.
Grid sensitivity can be performed as a pretraining operation, using a numerical simulator, to select an optimum grid size with reduced numerical effects. Fig. 4 shows the numerical effects on the cumulative gas production and average pressure for a 2900 ft × 2900 ft dry gas subsurface domain, which is simulated for different grid resolutions and different uniform permeability values. Each grid resolution corresponds to the number of grid blocks in the domain: 9², 19², 29², 39², 49² and 59². The highest and lowest resolutions correspond to grid block sizes of 49.15 ft and 322.22 ft, respectively.
Successive increases of the grid resolution cause deviations in the simulated responses; these deviations are more pronounced in the low-permeability systems (Fig. 4A and B) than in the high-permeability one, where they are almost negligible (Fig. 4C). In Fig. 4, considering the date 01-June and the incremental recoveries between the lowest and highest resolutions, the high-permeability case (k = 300 mD) yields an incremental gas production and recovery of 1700 Mscf and 0.0623%, respectively. As k is decreased by a factor of 100, the incremental values become 135 000 Mscf and 4.95%, respectively, which is greater than those of the high-permeability system. A further reduction by a factor of 100 reduces the incremental values to 11 680 Mscf and 0.428%, respectively; however, these are still greater than those of the highest-permeability system (Fig. 5). The reduced incremental quantities are due to the very low diffusivity, which has been affected by the low value of the permeability, i.e., k = 0.03 mD.
The space-discretization numerical effects cannot be eliminated, but they can be reduced by proper grid selection. A cut-off resolution can be established where successive increments have little effect on the cumulative recovery. In Fig. 5, a cut-off resolution of ≥ 39² can be selected for the low-permeability cases; this can be lowered for the high-permeability case. However, the cut-off affects the training time. High cut-off resolutions take longer training times than low ones; this is due to the increased DL operations, such as movements of convolution filters, in the high cut-off resolutions compared to the low ones. The optimal grid size will be a trade-off between the space-discretization numerical errors and the cost of training the AI-based surrogate.
The numerical errors due to time discretization can be minimized by learning an average advancement (timestep) at each training time point. The total error in Eq. (27) consists of the round-off (computing) errors and the truncation (discretization) error. The truncation error using the backward Euler in time is a first-order error O(Δt). The round-off and truncation errors due to time discretization are bounded by a truncation error bound in the time domain. If it is assumed that the set of round-off errors is of an absolute magnitude ε, a machine precision value or epsilon, then the total error term follows Eq. (30). In this paper, the single-precision floating-point format (float32) is used in all nodal computations of the AI-based surrogate. In Eq. (30), for a small timestep, the error contribution due to rounding is large while that due to truncation is small. Conversely, with a large timestep, the error contribution due to truncation becomes significant while that due to rounding is insignificant. Since the contribution due to rounding is inversely proportional to the timestep, for numerical computations with low-to-medium numerical precision, as obtained in machine learning computations, timestep round-off errors are expected.
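The rounding-truncation trade-off described above can be visualized with an assumed error model (the coefficients are illustrative, not the paper's Eq. (30)); the total error has an interior minimum at an intermediate timestep:

```python
import numpy as np

def total_time_error(dt, eps=np.finfo(np.float32).eps, m_bound=1.0):
    """Illustrative total time-discretization error.

    A rounding term ~ eps/dt that grows as dt shrinks, plus a first-order
    truncation term ~ dt that grows with dt. The best timestep balances
    the two contributions.
    """
    return eps / dt + 0.5 * m_bound * dt

# Scan candidate timesteps; the minimizer sits near sqrt(2*eps/m_bound).
dts = np.logspace(-6, 0, 400)
best_dt = dts[np.argmin(total_time_error(dts))]
```

In float32 (eps ≈ 1.2e-7) the minimizer falls around 5e-4, illustrating why very small timesteps are counterproductive at low precision.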
Eq. (30) is formulated as a soft constraint and minimized jointly with the residual error of the domain. The formulation is obtained by applying two Euler approximations. Combining Eqs. (31) and (32) yields the truncation error for the timestep and its error bound in Eq. (33). Eq. (33) is substituted into Eq. (30) to yield the total numerical error in Eq. (34). The terms in Eq. (34) are the time-discretization numerical errors, evaluated using the timestep and CEDNN modules. In Eq. (34), the quantity of interest, the pressure two time points ahead, requires an additional forward evaluation of the AI-based model. This is an added computational cost; however, it can be estimated from the existing AI-based forward evaluations using linear extrapolation.
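A sketch of that extrapolation step, assuming the standard linear form (the paper's exact expression is not reproduced here):

```python
def extrapolate_p_next(p_n, p_n1, dt1, dt2):
    """Linear extrapolation of the pressure two time points ahead.

    Reuses the two existing forward evaluations p_n (at t) and p_n1
    (at t + dt1) instead of running a third network pass: the slope
    (p_n1 - p_n)/dt1 is carried forward over the next interval dt2.
    Exact on pressures that vary linearly in time.
    """
    return p_n1 + (dt2 / dt1) * (p_n1 - p_n)
```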

Non-physics-based equation
The non-physics-based loss is computed as the squared Euclidean norm ‖θ‖₂² of the parameters in the model (Loshchilov and Hutter, 2019a), where θ is the set of model parameters (i.e., weights and biases) of the AI-based SRM. The non-physics-based loss, known as the ridge penalty or L2 regularization, reduces the effect of multicollinearity that could occur between model parameters in regression models (Golam Kibria and Lukman, 2020; Hoerl and Kennard, 1970). In highly correlated model parameters, the variance is high and can yield unsatisfactory predictions. The variance determines the model's performance when a dataset outside the training dataset is used. The ridge penalty is a non-zero value, which shrinks the neural network trainable parameters and also prevents exploding gradients that can occur during training. In linear regression problems, the ridge hyperparameter λ′ is added along the diagonals of the ordinary least squares (OLS) estimator. The OLS estimator solves the least squares problem and depends on a labelled training dataset. For the semi-supervised learning process, the least squares problems are the domain and boundary losses.
In ML, λ′ must be tuned, as it can adversely affect the optimization process if its magnitude becomes too large; usually, a small or decaying value of this weight is applied. A decaying weight, applied inline in a first-order optimizer, can be used in place of the ridge hyperparameter. Fig. 6 shows the effect of using the non-physics-based regularization.
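A sketch of the ridge penalty and the decoupled (decaying) weight alternative of Loshchilov and Hutter; parameter names are illustrative:

```python
import numpy as np

def l2_penalty(params, lam):
    """Ridge (L2) penalty: lam * ||theta||_2^2 summed over all trainable arrays."""
    return lam * sum(np.sum(w ** 2) for w in params)

def sgd_step_decoupled_wd(w, grad, lr, lam):
    """One first-order update with decoupled weight decay (Loshchilov &
    Hutter, 2019): the decay shrinks the weights directly rather than
    being folded into the loss gradient, which keeps the effective decay
    independent of the gradient magnitude."""
    return w - lr * grad - lr * lam * w
```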

Total loss
The total loss is a weighted average of the physics-based and non-physics-based losses. The weighted physics-based loss MSE|PHY, given by Eq. (37), is the summation of all soft constraints to be minimized by the optimizer. The initial condition loss MSE|IC is not included in Eq. (37), as it is fully enforced in the neural network architecture by the hard enforcement layer. The weighted non-physics-based loss, Eq. (36), is added to the weighted physics-based loss to give the total loss MSE|T.
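The multi-cost combination can be sketched as a weighted sum; the weight names below are illustrative hyperparameters, not the paper's symbols:

```python
def total_loss(domain_mse, ibc_mse, mb_mse, l2_term,
               w_dom=1.0, w_ibc=1.0, w_mb=1.0):
    """Weighted multi-cost function: domain, inner-boundary and material-
    balance soft constraints plus the L2 penalty. The IC loss is absent
    because it is hard enforced in the architecture rather than penalized.
    """
    physics = w_dom * domain_mse + w_ibc * ibc_mse + w_mb * mb_mse
    return physics + l2_term
```

Balancing the weights so that all terms share a similar order of magnitude is what keeps any single constraint from dominating the gradients.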

Artificial neural network development and training
The proposed architecture is a custom three-module ANN, with each module trained independently. The custom modules are defined as follows:
1. A convolutional encoder-decoder neural network (CEDNN), with a trainable hard-enforced output layer and encoder-decoder skip connections. This module determines the pressure solution at a given time.
2. A fully connected deep neural network with residual connections (ResNet) to determine the advancement (timestep) at a given time.
3. A pretrained fluid-property layer for the prediction of pressure-dependent fluid properties.
The CEDNN is preferred as the main network for learning the pressure solution, as its convolutional layers can capture the local dependencies between neighbouring grid blocks during feature extraction. In addition, the similarity in the spatial dimensions of the main input and output layers allows for easy implementation of the domain and boundary losses during training. The CEDNN module can be compared to the standard U-Net architecture, as it encodes latent variables by convolution and decodes outputs by transposed convolution. In contrast to the standard U-Net, the internal architecture of the CEDNN module permits real square dimensions; e.g., a discretized domain with a centrally located sink or source requires odd grid dimensions and padding operations at convolution operations, which can be achieved, without any further transformation, using the custom-built CEDNN module. A comparison of the standard U-Net and the CEDNN module is shown in Table 2.
The CEDNN has some features similar to the encoder-decoder structure in Wang et al. (2021); however, it uses a set of skip connections for transfer learning and a trainable hard enforcement technique, and it is coupled with other trainable network modules for effective learning. A schematic of the custom three-module AI-based architecture is shown in Fig. 7; the details of the network are given in Table 3. The described network module in Table 3 has a 2D input dimension of 39 × 39.
In Table 2, the CEDNN uses a real square domain shape, as against the power-of-two shape used in standard U-Net architectures. The CEDNN module uses strided convolution during encoding. A strided convolution can match or slightly outperform a non-strided convolution followed by downsampling (averaging or max-pooling) operations (Springenberg et al., 2015); also, the absence of a separate downsampling operation can improve memory efficiency. The skip connections, which allow for transfer learning between an encoder-decoder pair, are added rather than concatenated. The addition is a layer-layer arithmetic operation that does not change the decoder's inner dimension, whereas the concatenation operation increases the decoder's inner dimension, which then has to be resized back by linear projection. The concatenation and subsequent linear projection can be computationally expensive for the physics-based training. The addition of skip connection layers has yielded excellent results in the standard ResNet (He et al., 2016).
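The shape effect of addition versus concatenation can be checked directly; a tiny NumPy illustration, with the feature-map shape chosen to match Table 3's layer 2:

```python
import numpy as np

# Feature maps shaped [batch, height, width, depth]
dec = np.zeros((1, 20, 20, 48))  # decoder feature map
enc = np.ones((1, 20, 20, 48))   # matching encoder feature map

added = dec + enc                             # additive skip: depth unchanged
concat = np.concatenate([dec, enc], axis=-1)  # concatenation: depth doubles

print(added.shape, concat.shape)  # (1, 20, 20, 48) (1, 20, 20, 96)
```

The doubled depth of the concatenated tensor is what would force the extra linear projection mentioned above.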
In Table 3, the encoder (layers 1, 2, 3 and 4) performs convolution operations to encode latent features from the main input. Encoding is done progressively by the strided convolution operation; it involves moving a feature detector (filter or kernel) across the domain with strides greater than one. The 2D input tensor of the convolution operation follows the shape definition [batch size, height, width, depth]. For the 2D reservoir, the second and third indexes (i.e., height and width) of the main input tensor represent the number of grid blocks in the x and y axes respectively. The depth describes the number of filters required for each convolution operation. This convention is followed throughout this paper. The relationship between the input and output spatial dimensions for each convolutional layer is given by:

L_c,out = ⌊(L_c,in − f + 2p)/s⌋ + 1

where L_c,in is the input length (height or width) of the convolutional layer; f is the dimension of the kernel filter used to extract the features from the input; s is the stride, indicating how many steps the filter is moved during convolution; p is the input padding, indicating the number of grid blocks to be added at the boundary or edges during the convolution operation.

Table 3
Inner architecture of the custom three-module artificial neural network.

Module 1
Layer 0: Input (features): permeability, time; concatenation (axis = −1); output shape = (39, 39, 2)
Layer 1: Convolution 1: kernel filters = 32; kernel size = (3, 3); stride = 1; skip connection 1; activation function = Swish; output shape = (37, 37, 32); zero padding; output shape = (39, 39, 32)
Layer 2: Convolution 2: kernel filters = 48; kernel size = (5, 5); stride = 2; skip connection 2; activation function = Swish; output shape = (18, 18, 48); zero padding; output shape = (20, 20, 48)

Each convolution operation reduces the output spatial dimension (height and width) and increases the number of output filters (depth). The depth of each convolution operation d_e is increased progressively, where d_0 is the depth of the first convolution operation and e indexes the encoding depth sequence; the CEDNN module has a depth sequence (d_e)_{e=0}^{3} = (d_0, d_1, d_2, d_3). The output from the encoder is a smaller spatial dimension, larger depth (latent) representation of the main input. The latent representation is passed on to the bottleneck layer 5 for further transformation using a linear projection (nonlinear transformation by an activation function is not used in the bottleneck layer). The decoder, as described in Table 3 with layers 6, 7 and 8, takes the encoded (latent) variable and progressively increases its resolution via an up-sampling technique called transposed convolution (Zeiler and Fergus, 2014). Transposed convolution involves the use of trainable filters, similar to a convolution, but generates a feature map higher in dimension than the original feature. Transposed convolution first increases the original feature map tensor by insertion of zeros at regular intervals before performing the convolution operation. The input-output shape relationship using the
transposed convolution operation is given by:

L_t,out = s(L_t,in − 1) + f − 2p

The output from the last transposed convolutional layer has a resolution similar to that of the encoder's input. Skip connections are also added to allow for direct propagation of information between encoder-decoder pairs. Skip connections can reduce the degradation problem observed in deep neural network (DNN) architectures (He et al., 2016) and allow for the recovery of fine spatial detail that could otherwise be lost during the successive convolution operations in the encoding layers. Padding and then convolution operations are performed in layer 9 to resize the final output if its spatial dimension is not equal to that of the main input. A custom layer algorithmically determines the number of padded blocks on all sides of the domain; a constant padding using the reflection technique is used. The hard enforcement layer uses the permeability-time main input variables to develop a transforming function for enforcing the initial condition and refining the layer 9 outputs.
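Both shape relationships can be sketched as small helper functions (integer floor division is assumed for the convolution case):

```python
def conv_out_len(l_in, f, s=1, p=0):
    # L_c,out = floor((L_c,in - f + 2p) / s) + 1
    return (l_in - f + 2 * p) // s + 1

def tconv_out_len(l_in, f, s=1, p=0):
    # L_t,out = s * (L_t,in - 1) + f - 2p  (inverse of conv_out_len)
    return s * (l_in - 1) + f - 2 * p

# Table 3, layer 2: 39 -> 18 with a 5x5 kernel and stride 2,
# recovered by the matching transposed convolution: 18 -> 39
print(conv_out_len(39, 5, s=2), tconv_out_len(18, 5, s=2))  # 18 39
```

The round trip 39 → 18 → 39 is what lets the decoder restore the encoder's real square resolution without extra projections.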
The hard-enforced pressure output is separately connected to the fluid property network (module 2) as shown in Fig. 7. The fluid property module, rapidly pretrained, determines the pressure-dependent fluid properties during training. A regularizer could also be applied to the PVT module; however, this is not considered. The main input features are also separately passed to the ResNet (module 3) to determine an average timestep for the FDM (Fig. 7). The ResNet module follows the standard ResNet architecture using fully connected layers.
The AI-based input features are scaled to the range [−1, 1] using the linear normalization technique. The Swish activation function, with hyperparameter β = 1, is used in all the trainable hidden layers. The linear activation function is used as the transformation function in the CEDNN module prior to hard enforcement, and a truncated LiSHT activation is used in the timestep module prior to grid-scale averaging. Swish activation has been shown to reduce the internal covariate shift during the training of DNNs (Ioffe and Szegedy, 2015; Ramachandran et al., 2017). The LiSHT can tackle the dying gradient problem and the non-utilization of large negative input values in DNNs (Roy et al., 2019); this can be useful during the weighting of the timestep, ensuring that gradients associated with negative values are propagated.
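The scaling and the two activations can be sketched as follows (the clipping limits on LiSHT follow the 0.1-10 day timestep bounds quoted later in the paper; the exact truncation form is an assumption):

```python
import numpy as np

def scale_minmax(x, lo=-1.0, hi=1.0):
    # Linear normalization of input features to [lo, hi]
    x = np.asarray(x, dtype=float)
    return lo + (hi - lo) * (x - x.min()) / (x.max() - x.min())

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def truncated_lisht(x, lo=0.1, hi=10.0):
    # LiSHT: x * tanh(x), which is non-negative and keeps gradient
    # information from large negative inputs; clipped to assumed limits
    return np.clip(x * np.tanh(x), lo, hi)
```

Note that LiSHT is an even function, so negative pre-activations still contribute positive timestep weight rather than being zeroed out as with ReLU.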
The network's initial weights are obtained using the Xavier (Glorot) normal initializer, while its biases are initialized to zero. The Xavier normal initializer draws an initial weight set from a truncated normal distribution with a mean of 0 and a standard deviation of √(2/(n_in + n_out)), where n_in is the number of input nodes to a given layer and n_out is the number of output nodes of the same layer. To allow for a standard comparison of regularization effects, the Xavier normal initialization is obtained from a stateless (deterministic) pseudo-random algorithm, such that for a given seed, and for multiple initializations, the exact same sequence of weight sets is generated. One hundred weight initializations are generated within five seconds, and at each instance the global norm of the weight set is estimated. The accepted weight initialization is the weight set with the median norm.
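A sketch of this median-norm initialization procedure (NumPy; plain rather than truncated normal draws are used here for brevity, and seed handling is illustrative):

```python
import numpy as np

def xavier_normal(n_in, n_out, seed):
    # Glorot normal: zero mean, std = sqrt(2 / (n_in + n_out)),
    # drawn from a deterministic (seeded) generator
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def median_norm_init(n_in, n_out, n_trials=100, base_seed=0):
    # Generate n_trials deterministic weight sets and keep the one
    # whose global (Frobenius) norm is the median
    ws = [xavier_normal(n_in, n_out, base_seed + i) for i in range(n_trials)]
    norms = np.array([np.linalg.norm(w) for w in ws])
    return ws[int(np.argsort(norms)[len(ws) // 2])]
```

Because the generator is stateless with respect to the seed, repeated calls reproduce the same accepted weight set, which is what makes the regularization comparisons fair.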
The pressure output and its numerical derivatives from module 1, the fluid properties and their algorithmic derivatives from module 2, and the average timestep from module 3 are used to compute the physics-based losses. The AI-based model is tuned by: equally weighting the physics-based losses using a weighting value of 1/N_PHY; setting a small weight for the non-physics-based loss, obtained either from a hyperparameter search, the published literature or the experience of the analyst; using a trainable parameter in the hard enforcement layer; and using a trainable timestep module. A first-order optimizer is used for training the network; the adaptive estimate of moments with weight decay (AdamW), wrapped with a multi-optimizer (Jin et al., 2016), is used for updating the model parameters at each step. One step corresponds to the forward and backward passes of a batched dataset. The weight decay of the optimizer is applied only to the non-physics-based loss, and the non-physics-based loss weight decays in the same manner as the learning rate. As deep neural network architectures are well known for their high non-convexity (Chen and Wang, 2014), multiple minima and maxima could exist in the loss manifold; therefore, using the model parameters obtained just at the end of training might not be an optimal solution. The optimized model parameters are instead selected from a collection of model parameters watched during the last 25% of the total epochs. The validation loss is taken as the weighted domain and boundary losses, including the material balance loss.
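Since the learning rate and the non-physics-based loss weight are stated to decay in the same manner, both can be driven by one schedule; a hedged sketch, assuming an exponential-decay form (the paper's exact schedule equations are not reproduced here):

```python
def decayed(v0, rate, step, decay_steps):
    # Exponential decay applied identically to the learning rate and the
    # non-physics-based loss weight (schedule form assumed for illustration)
    return v0 * rate ** (step / decay_steps)

lr0, w0 = 1e-3, 5e-4                 # illustrative starting values
lr_now = decayed(lr0, 0.5, 10, 10)   # learning rate after one decay period
w_now = decayed(w0, 0.5, 10, 10)     # L2 weight after the same period
```

Tying the two decays together keeps the relative strength of the non-physics-based term roughly constant as the optimizer's step size shrinks.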

Stochastic generation of the permeability field
The log-normal distribution is assumed to be the basic type of distribution of geological variables like permeability in a naturally occurring permeable medium. The log-normal distribution is the probability distribution of a random variable X whose logarithm, ln X, follows a normal distribution. The standard normal variate is Z = (X − μ_X)/σ_X, which can be expressed for the log-normal distribution as Z = (ln X − μ_lnX)/σ_lnX. The mean and standard deviation of the logarithm, μ_lnX and σ_lnX, are related to the mean and standard deviation, μ_X and σ_X, by σ²_lnX = ln(1 + σ²_X/μ²_X) and μ_lnX = ln μ_X − σ²_lnX/2. One method of generating a stochastic realization of a property field is the Karhunen-Loeve expansion (KLE). The KLE represents a stochastic process as an infinite linear combination of orthogonal functions. In the KLE, a stochastic process {X(t), t ∈ T} is represented via a sequence of independent simple random variables {ξ_i, i ∈ N}. The corresponding kernel, defined in the reproducing kernel Hilbert space, is C(s, t) = Cov(X(s), X(t)). Assuming {φ_i(t)} is a series of basis (orthogonal) functions derived from certain integral equations, the stochastic process is given by X(t) = Σ_i √λ_i ξ_i φ_i(t), with the independent random variables ξ_i = (1/√λ_i) ∫_T X(t) φ_i(t) dt, where {ξ_i} is independent and identically distributed with zero mean and unit variance. The eigenpairs satisfy ∫_T C(s, t) φ_i(t) dt = λ_i φ_i(s), where {φ_i, i ∈ N} is the set of eigenfunctions and {λ_i, i ∈ N} is the set of eigenvalues. Further reading on the KLE can be found in the following literature: Lu and Zhang (2004), Wang (2008) and Wang et al. (2021). The KLE is used in this work to generate stochastic processes from a prescribed covariance function. The log-transformed stochastic field is characterized by its covariance function, the normally distributed mean and standard deviation of the log-normal permeability, and the correlation lengths in the x and y axes (with η_x = η_y in this work).
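A sketch of the moment conversion and a small truncated KLE sampler; the 1D grid and the exponential covariance form are illustrative assumptions, not the paper's exact discretization:

```python
import numpy as np

def lognormal_params(mean, std):
    # sigma_ln^2 = ln(1 + (sigma/mu)^2);  mu_ln = ln(mu) - sigma_ln^2 / 2
    var_ln = np.log(1.0 + (std / mean) ** 2)
    return np.log(mean) - 0.5 * var_ln, np.sqrt(var_ln)

def kle_realization(coords, corr_len, mu_ln, sigma_ln, n_modes, seed=0):
    # Truncated KLE of ln k on a 1D grid with an assumed exponential
    # covariance C(s, t) = sigma_ln^2 * exp(-|s - t| / corr_len)
    rng = np.random.default_rng(seed)
    d = np.abs(coords[:, None] - coords[None, :])
    cov = sigma_ln ** 2 * np.exp(-d / corr_len)
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_modes]    # keep the n_modes largest
    xi = rng.standard_normal(n_modes)         # i.i.d. N(0, 1) coefficients
    log_k = mu_ln + vecs[:, idx] @ (np.sqrt(np.maximum(vals[idx], 0.0)) * xi)
    return np.exp(log_k)                      # permeability field, mD
```

With the case-study statistics (mean 3 mD, standard deviation 1 mD), `lognormal_params(3.0, 1.0)` gives the Gaussian parameters of ln k that the KLE samples around.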

Case study
A dry gas reservoir with dimensions of 2900 ft × 2900 ft × 80 ft is considered. The model has a no-flow outer boundary with four wells operating under different production constraints of limiting rates and minimum bottomhole pressure, including a shut-in, as given in Table 4. The reservoir permeability is stochastically generated using the KLE, following a log-normal distribution with a mean and standard deviation of 3 mD and 1 mD respectively. The reservoir domain is discretized using a 39 × 39 × 1 grid. This is chosen as the optimum grid dimension considering the associated numerical errors (Fig. 1) and the computing cost of the convolutional layers. Fig. 8 shows the gridded domain with the production wells; Fig. 9 is a matrix plot of some permeability realizations.

Training dataset generation
200 permeability realizations are generated, each for 48 time points, yielding a 200 × 48 training dataset. The 48 time points are sparsely sampled from a time length of 365 days. The first 10 points are selected at a 1-day interval for the first 10 days, while the remaining 38 points are selected at a 10-day interval from the 10th day till the end, including two time points at the start and end of the shut-in. 54 permeability realizations are used for training and the remaining are used for testing. The fluid property module is rapidly pretrained to the PVT properties: formation volume factor and viscosity. The AI-based surrogate is trained for 600 epochs. Other model properties and hyperparameters are shown in Table 4.

Test dataset generation

A 146 × 48 permeability-time dataset is used for evaluating the performance of the AI-based surrogate predictions. This test dataset is labelled with grid block pressures obtained from a numerical simulator, which has the same reservoir, fluid and production characteristics. The numerical simulator is separately run using the test permeability realizations, each for 365 days at a maximum timestep of 1 day; the simulator is constrained to this small timestep to ensure that convergence errors due to time discretization are reduced. The grid block pressures, finely simulated in time, are extracted from the simulator restarts at the 48 time points and then mapped to the test dataset features.

Artificial intelligence-based surrogate reservoir model results and discussion
The accuracy of the predictions is evaluated using the percent error for each grid block, and the relative L1 and L2 errors for the domain at a given time. Fig. 10A and B show the comparison between the predicted grid block pressures at two wells and those obtained from the numerical simulator. Fig. 11 shows 2D images (snapshots) of the pressures using a test permeability realization (mean k = 2.39 mD) at four prediction times: [260, 290, 310, 340] days. In Fig. 11, the third column shows the percent errors for the grid blocks.
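The three metrics can be sketched as follows (the definitions are assumed to be the standard relative-norm forms, with the domain norms taken over all grid blocks at a given time):

```python
import numpy as np

def percent_error(pred, obs):
    # Grid block percent error
    return 100.0 * (pred - obs) / obs

def rel_l1(pred, obs):
    # Relative L1 error over the domain at a given time
    return np.sum(np.abs(pred - obs)) / np.sum(np.abs(obs))

def rel_l2(pred, obs):
    # Relative L2 error over the domain at a given time
    return np.linalg.norm(pred - obs) / np.linalg.norm(obs)
```

Because both relative errors are normalized by the observed field, they are comparable across realizations with different pressure levels, which is what makes the histogram comparisons below meaningful.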
In Fig. 10A and B and Fig. 11, the predictions of the tuned AI-based surrogate are in good agreement with those of the numerical simulator; the pressure distributions around the wells are captured by the surrogate. The regularization weights of the tuned AI-based surrogate and the relative error statistics using the 146 × 44 test realizations are given in Table 5. Fig. 12 shows the relative L1 and L2 errors of the tuned AI-based surrogate using the test realizations.
In Table 5, the non-physics-based regularization weight is set to 0.0005ω₀ after training on other limits: 0.0ω₀ and 0.05ω₀. The physics-based regularization terms are equally weighted, and the hard layer parameter and timesteps are made trainable. In Fig. 12A and B, the relative errors follow a lognormal distribution, i.e., the low-error class intervals are denser than the high-error ones and the modal relative error tends towards zero. This is a good indication that the tuned model's variance has been significantly reduced. A model with good generalization will have its error histogram positively skewed on a linear bin scale and bell-shaped on a logarithmic bin scale.
The effects of changing the non-physics-based regularization weight, the material balance regularization weight, the hard enforcement layer variable and the timestep variability are studied. These are studied by performing one-variable-at-a-time analysis. In studying these effects, the relative L1 and L2 errors are visualized using histograms with logarithmic binning, which allow for comparison over a wide range of relative errors.

Effect of changing the non-physics-based regularization weight
The non-physics-based regularization term can further constrain the grid block pressures to those of the numerical reservoir simulator. Fig. 13 shows the grid-predicted pressures and percent errors for different non-physics-based regularization weights using the test permeability realization (mean k = 2.39 mD) at four prediction times: [260, 290, 310, 340] days. Three different non-physics-based regularization weights are considered: 0 × 10⁻⁴ω₀ (Fig. 13A), 5 × 10⁻⁴ω₀ (Fig. 13B) and 500 × 10⁻⁴ω₀ (Fig. 13C). In each subplot, the first and second columns represent the predicted and observed pressure responses, and the third column shows the percent error of the predictions. Fig. 14 shows the relative L1 and L2 error histograms for the different non-physics-based regularization weights, and Table 6 shows their relative error statistics.
As shown in Figs. 13 and 14, increasing or decreasing the weight affects the predictions; the colour of the third column in Fig. 13C darkens compared to those of Fig. 13A and B. The darkening effect is due to increased residual errors. The ranges of the relative error histograms first decrease, then increase as the non-physics-based regularization weight is increased from zero (Fig. 14). This effect is clearly visualized in the plot of the mean relative L1 and L2 errors for the different non-physics-based regularization weights (Fig. 15 and Table 6). There exists an optimum weight for which the AI-based surrogate gives the lowest relative errors; this lies in the search space 0 < ω < 0.01ω₀.
In general cases, starting values of this weight can be obtained from the published literature, e.g., Loshchilov and Hutter (2019b), and the experience of the data analyst. The search can be performed between the extremes {ω ∈ ℝ | 0 < ω ≤ 0.1ω₀}. In Fig. 10A and B, the tuned AI-based SRM is very reliable in predicting the flow realizations, i.e., grid block pressures for the unseen permeability realizations at the given training times. This is because the unseen permeability realizations follow a Gaussian distribution whose characteristic features can be discovered by the DL convolutional layers. The AI-based SRM can also be used to predict the flow realizations using the given permeability realizations at unseen times, i.e., outside the training data time points; however, it can become less reliable when the prediction times are 25% more than the maximum training time. The optimum non-physics-based regularization weight improves the predictions in the unseen time region, up to 50% more than the maximum training time (360 days), without relying on additional training time points in this region (Fig. 16B).

Effect of changing the material balance regularization weight
Two cases are considered: no material balance weight, and the material balance loss equally weighted with the other physics-based losses. Fig. 17 shows the relative L1 and L2 error histograms for the different material balance regularization weights using the test realizations, and Table 7 shows their relative error statistics.
As shown in Fig. 17, for the case of no material balance regularization weight (ω_MBC = 0), the relative L1 and L2 error histograms are negatively skewed, with a very high mean and standard deviation. Utilizing the material balance regularization weight yields bell-shaped (on logarithmic binning) relative error histograms whose mean and standard deviation are significantly reduced compared to those with no regularization (Table 7). This weight accelerates the learning by updating the trainable parameters with additional gradients obtained from the tank material balance.
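For a dry gas tank, the constraint such a loss term could enforce has the standard p/Z material-balance form; a hedged sketch (symbols and usage are illustrative, not the paper's exact formulation):

```python
def gas_mbe_residual(p_avg, z, p_i, z_i, g_initial, g_produced):
    # Dry-gas tank material balance: Gp = G * (1 - (p/Z) / (p_i/Z_i)).
    # The residual below would be squared and averaged into an
    # MSE_MBC-style loss term during training.
    return g_produced - g_initial * (1.0 - (p_avg / z) / (p_i / z_i))

# Consistent state: half of p/Z depleted -> half of G produced
print(gas_mbe_residual(2500.0, 1.0, 5000.0, 1.0, 1000.0, 500.0))  # 0.0
```

Because the residual couples the average (tank) pressure to cumulative offtake, its gradient pushes the whole predicted pressure field towards mass-consistent solutions, complementing the grid-local domain loss.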
Fig. 18 shows the total loss trends for the different material balance regularization weights. The learning progresses, as observed by the decreasing total loss, when the material balance regularization weight is set to zero; however, the updated model parameters are poorly constrained, yielding a high model variance at the end of training. This high model variance is reflected in the mismatch between the predicted well grid block pressures and those obtained from the numerical simulator (Fig. 19A). The addition of the material balance regularization weight reduces the model's variance, yielding an optimal solution by the end of training (Fig. 19B). For an AI-based surrogate employing FDM, the material balance regularization term is important for validating the predictability of the surrogate. The material balance regularization term is a zero-dimension regularizer; it averages the spatial variation of grid block fluxes. This term cannot be used alone or jointly weighted with the inner boundary conditions, as this results in a high model variance (Fig. 20A-C).

Effect of using the trainable hard enforcement layer
The hard enforcement layer, which has a trainable exponent, is compared to the cases of a nontrainable parameter and the nontrainable hard enforcement of Wang et al. (2021). Fig. 21 shows the relative L1 and L2 error histograms for the different hard enforcement techniques using the test realizations, and Table 8 shows their relative error statistics.
As shown in Fig. 21, the trainable hard enforcement layer results in bell-shaped histograms having error limits smaller than those of the nontrainable enforcement techniques. In Table 8, the trainable hard enforcement significantly reduces the mean and standard deviation of the relative L1 and L2 errors compared to those obtained by the other nontrainable hard enforcement techniques. The trainable layer enforces the initial condition whilst improving the predicted pressures. The trainable exponent, initially set to 0.5778, is optimized to 0.159 by the end of the training.

Effect of using the trainable timestep module
The trainable timestep module allows for varying timesteps in the FDM. Three fixed timesteps (0.1 day, 1 day and 10 days) and a trainable timestep are considered. The fixed timesteps are constant values for all sampled space-time points, while the trainable timestep is an averaged value that varies over the sampled space-time points. Fig. 22 shows the relative L1 and L2 error histograms for the fixed and variable timesteps using the test realizations, and Table 9 shows their relative error statistics.
As shown in Fig. 22A and B, using the 0.1-day timestep and float32 computation causes the relative errors to be negatively skewed; the mean and standard deviation of the relative errors are higher compared to those of the other timesteps (Table 9). As seen in Eq. (30), for small timesteps the round-off errors due to machine precision become significant, hindering the learning process. The 10-day timestep can be suitable for learning, reducing the round-off errors but with a likely growth of truncation errors; this is observed in its low relative error statistics, which are nevertheless greater than those of the 1-day and variable timesteps (Table 9). The 1-day timestep can reduce the truncation errors and improve the learning, as shown by its lower mean relative errors compared to those of the 0.1-day and 10-day timesteps (Table 9); however, the round-off errors are still significant at early time points. The variable (trainable) timesteps, obtained from the truncated LiSHT output activation with limits 0.1 < Δt < 10, give the best rounding-truncation error trade-off for all space-time points; the mean relative L1 and L2 errors are reduced by 25% and 49% respectively compared to the 1-day mean errors.
The minimum and maximum grid block timesteps using three test permeability realizations, and the average timesteps using the 146 test permeability realizations, are shown in Fig. 23. These are the optimized (i.e., after training on the training dataset) timestep statistics for each time point if the domain PDE were to be solved by the AI-based SRM using the test data and FDM; however, the AI-based SRM during testing only performs a forward pass without evaluation of the domain and boundary equations. In Fig. 23, the timestep statistics do not show significant deviations when the test permeability realizations are varied; considering the 146-realization test dataset, the maximum deviation in the average timesteps is about 0.5 day (pale blue region in Fig. 23). However, the average timestep values change with time; they are higher at early time periods and decrease over time to an average value of 2 days. The higher timesteps, as determined by the timestep module at early periods, are required to offset the ML round-off errors and produce the required offtake rates.

Comparison of the artificial intelligence -based surrogate and numerical simulator performance
The comparison is based on training and predictions performed on an NVIDIA RTX 3060 graphics processing unit (GPU), and numerical simulations performed on a Core i7-11700K central processing unit (CPU). The total training time for the 54 × 48 training dataset, trained for 600 epochs using a batch size of 144 data rows, is 42 min. The AI-based surrogate prediction runtime for the 146 × 48 test dataset is 292 s. In contrast, it takes 2929 s (48.8 min) to run the numerical simulator using the same test dataset; the runtime ratio of the AI-based surrogate predictions to that of the numerical simulator for the test dataset is 0.1. Fig. 24A shows the different runtimes of the numerical simulator and the AI-based surrogate during prediction and training. In situations where hundreds or even thousands of permeability-time realizations are required, using a similar machine and ANN architecture, a runtime ratio ≤ 0.1 will favour the AI-based surrogate over the numerical simulator (Fig. 24A). The only significant time expense of the AI-based surrogate is during the training; however, this is a fixed cost that becomes discounted for large test datasets (Fig. 24B). The training time can be reduced using advanced computing architectures, carefully increasing the batch size and more efficient graph parallelization, e.g., distributed training across multiple GPUs or cloud nodes, swapping GPU to CPU usage during peak GPU memory, further simplification of some arithmetic operations to prevent recopying data, and large-batch optimization algorithms.

Conclusions
This paper investigates the application and effects of physics-based and non-physics-based regularizations in AI-based surrogate modelling for highly compressible subsurface flow. The AI-based surrogate is constructed as a custom three-module neural network with discriminative training; it is trained by exploiting the theoretical directions of fluid flow using a combination of FDM and AD on an incompletely-labelled dataset, i.e., only the initial condition and wells' subdatasets need labels of initial pressure and well constraints respectively.
The advantages of this workflow are the effective utilization of modular AI-based network architectures for the simulation of flow behaviour, including fluid property learning and time stepping; the grid sensitivity analysis, performed as a pretraining operation, ensures proper selection of an optimum AI-based grid size with reduced space-discretization numerical effects and training cost. Furthermore, the learning utilizes additional regularizations for improved predictions, including the L2 norm, tank material balance, trainable hard enforcement layer and time-discretization numerical errors. The predictions of the AI-based surrogate for unseen permeability-time realizations in a dry gas reservoir with multiple wells operating under different constraints, including a shut-in, agree with those obtained from a numerical simulator. The AI-based predictions are timely, with an AI-based-model-to-numerical-simulator computational time ratio of 0.1, based on the test realizations and computing infrastructure.
The study shows that the non-physics-based regularization term, if used in the right order of magnitude, improves grid block predictions; the material balance regularization term is important for constraining the AI-based model's trainable parameters towards optimal solutions in fewer epochs. A nonoptimal solution can be obtained, even as the total loss is minimized, when the material balance regularization term is not used; also, using it without the domain loss yields nonoptimal solutions. The trainable exponent used in the hard enforcement layer fully satisfies the initial condition and refines the grid block pressures; this is more effective than the other hard enforcement techniques. The timestep module, which learns variable timesteps for solving the discretized domain PDE during training, improves the predictions compared to those of fixed timesteps; it gives the best rounding-truncation error trade-off by effectively minimizing the combined discretized domain PDE and time-discretization numerical errors.
The limitation of this workflow is the time expense during pretraining and training. The pretraining costs arise during the grid sensitivity analysis and stochastic data generation. The training cost, which is the most significant, is due to the deep learning convolution operations, the physics-based computations and their derivatives, and backpropagation within the network architecture. These costs could be a challenge in multiphase, 3D flow problems and finer grids, as the number of trainable parameters increases, in orders, with the quantities of interest (e.g., multiphase flow) or the grid dimension (e.g., 3D and finer grids). Nevertheless, with advanced computing and memory architectures, proper batch size selection, large-batch optimization, graph parallelization and effective model hard enforcements, these costs can be reduced. The training cost also depends on the experience of the data analyst, as the network hyperparameters, e.g., learning rates and regularization weights, and the hard enforcement techniques can require experience to set.
The effectively regularized and timely AI-based surrogate can be used as a state-dependent function in techniques like sensitivity analysis and data assimilation.

Fig. 6. Training and validation loss histories for an ANN with non-physics-based regularization.

Fig. 7. A schematic of the custom three-module artificial neural network.

C_ln = covariance function; μ_ln = normally distributed mean of the log-normal permeability; x = flattened spatial locations in the x and y directions, x = ([x], [y]); η = correlation length in the x or y axis (in this work, η_x = η_y); σ_ln = normally distributed standard deviation of the log-normal permeability.

Fig. 14. Relative (A) L1 and (B) L2 error histograms for the different non-physics-based regularization weights using the test permeability realizations.

Fig. 17. Relative (A) L1 and (B) L2 error histograms for the different material balance regularization weights using the test realizations.

Fig. 18. Total loss trends for the different material balance regularization weights.

Fig. 21. Relative (A) L1 and (B) L2 error histograms for the different hard enforcement techniques using the test realizations.

Fig. 22. Relative (A) L1 and (B) L2 error histograms for the fixed and variable timesteps using the test realizations.

Fig. 23. Minimum and maximum grid block timesteps using three test permeability realizations, and average timesteps using the 146 test permeability realizations.

Fig. 24. Runtimes of the numerical simulator and AI-based surrogate using the training and test realizations, including (A) 54 training and 146 test permeability realizations and (B) 54 training and 540 projected test permeability realizations.

Nomenclature
B = formation volume factor of the fluid, rb/Mscf
C_ln = covariance function
c_f = formation compressibility, psi⁻¹
d = encoding or decoding depth sequence
d_e = depth of each convolution operation
f = convolution filter dimension
h = domain thickness, ft
h_c = completion ratio
i = iteration counter
K = covariance kernel
k_x = x-directional permeability, mD
k_y = y-directional permeability, mD
L_t,in = input length (height or width) of each transposed convolution layer
L_t,out = output length (height or width) of each transposed convolution layer
L_c,in = input length (height or width) of each convolution layer
L_c,out = output length (height or width) of each convolution layer
m = mass accumulation, lbm
ṁ = rate of mass accumulation, lbm/D
MSE_MBC = weighted tank material balance loss
MSE_PHY = weighted physics-based loss
MSE_T = total loss
MSE_NP = weighted non-physics-based loss
MSE_Ω = weighted domain loss
MSE_IBC = weighted inner boundary condition loss
MSE_IC = weighted initial condition loss
MSE_OBC = weighted outer boundary condition loss
n_in = number of input nodes to a layer
n_out = number of output nodes from a layer
N_PHY = number of physics-based loss terms
N_x = number of grid blocks in the x-direction
N_y = number of grid blocks in the y-direction
N_θ = number of trainable parameters in the neural network
N_b = number of rows per batch
N_r = number of rows in the training data
N_e = number of epochs
N_i = number of iterations during training
N_i,e = number of iterations per epoch
p = input padding of the convolution/deconvolution
P = pressure, psia
P_i = initial pressure, psia
P̂ = predicted pressure, psia
q_i = initial source or sink rate at standard conditions, Mscf/D
r_o = Peaceman radius, ft

Table 1
Differentiation schemes used in evaluation of the derivatives.

Table 2
Comparison of the standard U-Net and CEDNN architecture.

Table 4
Input parameters used in training the AI-based surrogate reservoir model.

Table 5
Regularization weights of the tuned AI-based surrogate and relative error statistics using the test realizations.

Table 7
Material balance regularization weights and their relative error statistics using the test realizations.

Table 8
Hard enforcement techniques and their relative error statistics using the test realizations.

Table 9
Relative error statistics for the test realizations using the different timesteps.
Fig. 15. Mean relative L1 and L2 errors for the different non-physics-based regularization weights.