A Two-Stage Preconditioner for Multiphase Poromechanics in Reservoir Simulation

Many applications involving porous media--notably reservoir engineering and geologic applications--involve tight coupling between multiphase fluid flow, transport, and poromechanical deformation. While numerical models for these processes have become commonplace in research and industry, the poor scalability of existing solution algorithms has limited the size and resolution of models that may be practically solved. In this work, we propose a two-stage Newton-Krylov solution algorithm to address this shortfall. The proposed solver exhibits rapid convergence, good parallel scalability, and is robust in the presence of highly heterogeneous material properties. The key to success of the solver is a block-preconditioning strategy that breaks the fully-coupled system of mass and momentum balance equations into simpler sub-problems that may be readily addressed using targeted algebraic methods. Numerical results are presented to illustrate the performance of the solver on challenging benchmark problems.


Introduction
Coupled simulations of multiphase fluid flow, transport, and geomechanics are useful for predicting performance of many subsurface systems. These simulations solve a set of nonlinear, time-dependent partial differential equations (PDEs) enforcing mass conservation of each fluid phase and linear momentum balance for the mixture. For this system, fully-implicit time discretization is widely preferred for its unconditional stability, allowing large timesteps. The major computational bottleneck, however, is that one must solve a sequence of large, ill-conditioned linear systems to advance the solution. Robust and scalable solvers are therefore needed to enable large-scale simulations on high performance computing platforms. In this paper, we propose an efficient preconditioning strategy for multiphase poromechanics that addresses the inherent ill-conditioning of the discrete system using a mixture of physics-based insight and flexible algebraic methods.
Over the last four decades, a rich literature focusing on the development of efficient iterative solvers for fullyimplicit simulation of complex multiphase flow and transport-without mechanics-has developed within the reservoir simulation community. The most popular approach is based on Krylov subspace methods combined with a Constrained Pressure Residual (CPR) multistage preconditioning technique [1,2]. In essence, the CPR preconditioner uses separate stages to damp error modes associated with (i) the nearly-hyperbolic behavior of the saturation (or composition) variables and (ii) the nearly-elliptic character of the pressure variables. The standard reservoir simulator implementation [3][4][5][6][7][8][9] couples Incomplete LU (ILU) factorizations, effective for the hyperbolic part, with an algebraic multigrid (AMG) method, well-suited for the elliptic part. A variant of CPR preconditioning that replaces AMG by a multiscale finite volume (MSFV) or finite element (MSFE) solver was proposed in [10]. Using a multigrid reduction (MGR) approach [11], the CPR method has been generalized in a broader multigrid framework and successfully applied to multiphase flow and transport with phase transitions in [12,13]. Recent solution alternatives to CPR can be found in [14,15].
In the present work, a preconditioning scheme for fully-implicit multiphase fluid flow, transport, and geomechanics is proposed. The algorithm relies on successive sparse Schur-complement approximations to reduce the coupled system into easier sub-problems for which algebraic methods can be applied. We build on fixed-stress partitioning [20,32,36,37] and CPR [1] concepts to construct a novel multi-stage preconditioning scheme well-suited to the problem at hand. The end result is a modular, easy-to-implement, and highly scalable solver.
The paper is organized as follows. Sections 2 and 3 introduce the mathematical model and discretization scheme. The nonlinear and linear solution algorithms are described in detail in Sections 4 and 5. Weak and strong scaling results are presented in Section 6 to demonstrate performance and robustness of the proposed preconditioner in a parallel setting. We then end with a few concluding remarks regarding future work.

Governing Equations
As a simple but representative model system, we consider a displacement-saturation-pressure formulation of multiphase poroelasticity [42] assuming isothermal, quasi-static, two-phase flow of immiscible fluids. We limit the discussion to small-strain kinematics. Let Ω = Ω ∪ Γ denote a closed domain in R 3 occupied by a poroelastic medium, with Ω and Γ an open set and its boundary, respectively. For the application of boundary conditions, let Γ be partitioned as Here, subscripts u and f denote solid displacement and fluid flow boundary conditions, respectively, while superscripts D and N indicate Dirichlet or Neumann conditions. Let n Γ denote the outer normal vector. Wetting and non-wetting fluid phases are identified by subscript w and nw, respectively. The requisite set of governing equations consists of one conservation law for linear momentum and two conservation laws for the mass balance of each fluid phase. Let u, s , and p denote the solid displacement vector, saturation, and pressure of fluid phase = {w, nw}. Since the two fluids fill the voids, the saturations must satisfy the closure condition ={w,nw} s = 1. In the following, the wetting fluid phase saturation is selected as a primary unknown and will be denoted by lower case s without subscript. Capillarity effects are not considered, leading to the simplification p w = p nw = p. This is a frequent assumption in many reservoir engineering applications, but we emphasize that it can have a significant impact on the nature of the governing equations and solver strategy.
Given a finite time interval I = (0, t max ] the strong form of the initial/boundary value problem may be stated as follows: on Ω × I (mass balance for the wetting fluid phase), (1b) on Ω × I (mass balance for the non-wetting fluid phase), subject to boundary conditions and initial conditions In Eq. (1a), the Cauchy total stress tensor is given by with C dr the fourth-order drained elasticity tensor, b the Biot coefficient, and 1 the second-order unit tensor; ρg is a body force due to the self-weight of the multiphase mixture, with g the gravity vector; is the density of the mixture, with ρ s , ρ w , and ρ nw the density of the solid, the wetting, and the non-wetting fluid phase, respectively; φ is the porosity; and ∇ s and ∇· are the symmetric gradient and divergence operator, respectively. In Eqs. (1b), m w = (φρ w s), w w = (ρ w v w ), and q α w ≥ 0, α = {I, P}, denote mass per unit volume, mass flux, and mass injection (I) and production (P) sources per unit volume for the wetting phase. Here, v w is the volumetric flux. It follows the traditional multiphase flow extension of Darcy's law [43], with λ w = k rw /µ w the phase mobility, µ w the viscosity, k rw the relative permeability factor, κ the absolute permeability tensor, g the gravitational acceleration, z the elevation above a datum, and ∇ the gradient operator. In Eqs. (1c), m nw = (φρ nw (1 − s)), w nw = (ρ nw v nw ), q I nw ≥ 0, and q P nw ≥ 0 are the counterpart of m w , w w , q I w , and q P w for the non-wetting fluid phase.
The formulation is completed with constitutive equations and equations of state to specify the following dependencies: φ = φ(u, p), ρ s = ρ s (p), ρ = ρ (p), µ = µ (p), k r = k r (s), = {w, nw}. The porosity change from a reference porosity φ 0 is modeled in incremental form as [42] where v = trace(∇ s u) is the volumetric strain, and K dr is the drained (skeleton) bulk modulus. Specific forms for the remaining relative permeability and fluid compressibility relationships used in our simulations are detailed with the numerical examples in Section 6. The absolute permeability κ is here modeled as a constant field and exhibits no stress-dependence, though this additional form of solid-fluid coupling is of frequent interest in applications and often motivates the desire for a fully-coupled solver in the first place.

Discrete Formulation
A variety of choices are available for the spatial discretization of the PDEs governing multiphase poromechanics. Continuous Finite Element (FE) based formulations-the most popular technique employed in consolidation modeling, e.g. [44][45][46][47]-have been traditionally used for solving Biot's equations. However, a discontinuous interpolation for the pressure field may be more appropriate in the presence of highly heterogeneous permeability fields. In addition, element-wise mass conservation can also be a fundamental requirement, in particular in geoscience applications. Therefore, formulations that combine FE approaches for mechanics with a Finite Volume (FV) method for fluid flow and transport are often adopted by modelers [32,36,48,49]. Some authors have recently started investigating facecentered and cell-centered finite-volume methods also for the mechanical subproblem [50][51][52][53][54] to account for mechanical effects when flow processes are of predominant importance. Also, in the context of single-phase poromechanics, locally mass-conservative approaches have been proposed using mixed three-field (displacement-velocity-pressure) or four-field (stress tensor-displacement-velocity-pressure) formulations, respectively [e.g., 27,29,30,[55][56][57][58][59][60][61][62][63][64][65][66].
We partition the domain using a computational mesh T h . The mesh consists of disjoint, nonoverlapping cells such that Ω ≈ Ω h = i K i with K i ∈ T h . A unique direction is associated with each mesh face f through a unit normal vector n f . For an internal face we let n f point in the direction of increasing global cell indices-i.e. for f shared by cells K and L, with L greater than K, n f = n K where n K is the outward normal unit vector to cell K's boundary. For a boundary interface we assume n f = n h Γ . In the case of a non-planar face, a plane characterized by a mean normal n f is associated to f .
The numerical solution of the system of governing equations (1) relies on a mixed FE-FV spatial discretization and a fully-implicit time-marching scheme (backward Euler) [67][68][69]. Without loss of generality, we assume homogeneous prescribed displacement (ū = 0) and wetting and non-wetting fluid phase mass fluxes (w w =w nw = 0). Time integration is carried out by partitioning the time interval I into n ∆t subintervals I n = (t n−1 , t n ], n = 1, . . . , n ∆t , with ∆t n = (t n − t n−1 ) the timestep size. The mass time derivatives in Eqs. (1b)-(1c) are discretized using discrete mass increments, while the remaining terms are approximated at the end-of-step time t n . This leads to the following mesh-dependent, fully discrete, weak version of (1): using the discrete spaces, with C 0 (Ω) and L 2 (Ω) the space of continuous and square Lebesgue-integrable functions on Ω and Ω, respectively, Q 1 (K) the space of trilinear polynomials in K, and P 0 (K) the piecewise constant space in K. In (5b)-(5c), F f ,n denotes a discrete approximation to the inter-cell mass flux for the fluid phase, such that F f ,n ≈ − f w ,n · n f dΓ, with the summation taken over faces in T h that do not belong to the (impervious) boundary Γ N,h f . The symbol · f denotes the jump of a quantity across a face f in T h . For an internal face, χ f = (χ |L − χ |K ), with χ |L and χ |K the restriction of χ on cells K and L sharing f , respectively. For a face belonging to the domain boundary, the jump expression reduces to χ f = −χ |K .
Discrete approximations for the displacement, pressure, and saturation field are given by with {η i }, {ψ j }, and {χ k } bases for U h , S h , and P h , respectively. The nodal displacement components {u i,n }, cellcentered wetting phase saturations {s j,n }, and cell-centered pressures {p k,n } are collected in algebraic vectors u n ∈ R n u , s n ∈ R n s , and p n ∈ R n p .
Remark 3.1. It should be noted that the Q 1 − P 0 − P 0 space adopted here is not automatically inf-sup stable. If the fluid and solid constituents are nearly incompressible, and if the permeability or timestep are very small, the resulting discrete system may allow for non-physical pressure modes, and pressure checkerboarding may be observed. To address this issue, we include additional pressure jump stabilization terms to the mass balance equations following the macroelement stabilization scheme proposed in [70]. Techniques for stabilizing mixed discretizations of multiphase poromechanics are not well studied to date, however, and will be the subject of a separate contribution. Whenever sufficiently large timesteps or compressible constituents are used, spurious pressure modes are avoided, and such conditions are frequently satisfied in practice.
A linear two-point flux approximation (TPFA) scheme is adopted for the intercell numerical fluxes F f ,n . For an internal face f shared by cells K and L the fluxes are computed as where the density ρ upw ,n and mobility λ upw ,n are evaluated using single-point upstream weighting (SPU) according to the sign of the potential Φ f ,n , i.e. the so-called geometric part of the -fluid phase mass flux [71]. In (8), Υ f is the transmissibility coefficient between cell K and L, z K and z L are the elevation for cell K and L evaluated at the respective centroids, and f is the mass density averaged at face f as [72] f ,n = ,n + ρ L ,n )/2, if phase appears in both K and L, ρ K ,n , if phase appears only in K, ρ L ,n , if phase appears only in L, 0, otherwise.
For a boundary face, F f ,n is still computed using Eqs. (8)- (9), with the neighboring cell index L being replaced by the face index. See Appendix A for details concerning the computation of the transmissibility coefficient Υ f . Remark 3.2. The linear TPFA scheme is the industry standard for reservoir simulation because of its robustness, ease of implementation, and monotonicity-preserving properties. However, the consistency of the two-point scheme is not guaranteed for arbitrary grid and permeability configurations. Therefore, the linear TPFA approach may lead to inaccurate results for highly distorted grids or full-permeability tensors. In such cases more sophisticated discretization methods like multipoint and/or nonlinear schemes are needed. For details on recent developments for heterogeneous anisotropic diffusion problems see [73][74][75] and references therein.
The source terms in Eqs. (5b)-(5c) are used to capture the effect of injection and production wells, modeled as line sources. These sources are governed by a suitable well model-typically derived by assuming radial flow in the immediate vicinity of the wellbore-relating well control parameters, such as bottomhole pressure, to mass flow rates through the wellbore [76]. In this work, we assume each well segment is vertical and connected to the centroid of a cell. In particular, for a cell K connected to a well the source terms are computed as [77] with Φ K,W ,n = WI[(p bh,n + ρ ,n gz bh ) − (p K,n + ρ ,n gz K )], (11) wherep bh,n is the bottom hole pressure, δ (x − x K ) is the Dirac function with x K the cell centroid position vector, and WI is the well index. In the present work all wells are assumed to operate under specified bottom-hole pressure conditions for simplicity. Clearly, the condition Φ K,W ,n < 0 must hold true to have fluid flow from the reservoir into a production well. Similarly, injection wells require Φ K,W ,n > 0. Note that injection and production wells differ in the treatment of the phase mobilities. In Eq. (10b), the sum of phase mobilities is used for injector wells so that saturation conditions are reflected in the wellbore. This is the standard approach adopted in reservoir simulators when no wellbore crossflow occurs [78]. The well index depends on the well properties, local permeability, and cell dimensions. For a hexahedral κ-orthogonal grid-i.e., when the principal axes of the permeability tensor are aligned with the grid-WI is given as where κ x , κ y , and κ z are the diagonal (principal) components of κ; h x , h y , and h z are the cell dimensions; r w is the well radius; and s k is the well skin factor.

Newton-Krylov Solver
The discrete form of the multiphase poromechanical problem (1) is obtained by introducing expressions (7), (8), and (10) into the weak form (5). This leads to a system of nonlinear equations at time t n , for the latest solution vector x n = {u n , s n , p n }. The solution x n−1 is known from the previous timestep. Starting from an initial guess x 0 n , Newton's method is used to drive the norm of the combined residual vector to below a specified relative tolerance, r k / r 0 < τ . At each iteration k, the Newton update is given by with Jacobian matrix A k = ∂r k /∂x n . In the current context, the linearization of three coupled governing equations leads to a Jacobian system characterized by an inherent 3 × 3 block structure, Detailed expressions for the residual vectors and sub-matrices in A are given in Appendix B. The linear solution of the Jacobian system (15) at each Newton iteration is the most memory-demanding and timeconsuming computational kernel in a multiphase poromechanics simulation. It is also the most difficult component to address in a scalable way on large parallel platforms. In this work, the system (15) is solved iteratively using a nonsymmetric Krylov subspace solver-specifically the generalized minimal residual (GMRES) method [79]. For large problems, practical convergence of a Krylov method is only possible provided that an effective preconditioner is available. Using right-preconditioning, we introduce a preconditioning operator M −1 and work with the modified system In the next section, we describe an efficient and scalable two-stage preconditioner that can address both the illconditioning and coupled nature of the system at hand.
Remark 4.1. For simplicity of presentation, we have restricted the discussion to a model system with two immiscible phases. Nevertheless, the approach devised here is naturally extendible to a larger number of phases and components. In this case, the saturation unknowns s n should be understood to include all non-pressure, cell-based unknowns, e.g. phase compositions.
Remark 4.2. Multiphase flow introduces strong nonlinearities, and a naïve implementation of Newton's method may fail to converge when large timesteps are taken. More robust simulators include adaptive timestepping and/or globalization strategies to expand the neighborhood of convergence. Here, we apply a simple backtracking line search whenever a Newton step fails to satisfy the necessary descent conditions. We remark that smart strategies to address the intrinsic nonlinearities in these systems remain an active area of research.
Remark 4.3. The system of residual equations (13) stem from the discretization of different governing equations that are not dimensionally consistent with one another. As a result, it is quite easy for the resulting discrete equations to be poorly scaled with respect to one another, causing a variety of problems at the linear and nonlinear solver level. A good choice of dimensionally consistent units can typically avoid many of these problems. As an extra safeguard, we also introduce a simple block-row scaling to ensure that the flow and mechanics equations have roughly equal magnitude before entering the solution routines.
Remark 4.4. The Jacobian systems (15) is solved to a desired linear tolerance. Thus two tolerances must be selected: the Newton tolerance τ and the Krylov tolerance η. While a constant Krylov tolerance can be employed, this choice will often lead to oversolving when Newton iterates are far from the converged solution. That is, significant work will be invested in solving system (15) to a stringent tolerance, even though a much cruder solution is often acceptable for early Newton iterates. A more efficient strategy is to adapt the linear solver tolerance to reflect the observed nonlinear convergence profile-often referred to as an inexact Newton method. In this way, linear solver work is minimized in early Newton iterates. A stringent linear solver tolerance is then used in later iterates to maintain the rapid nonlinear convergence behavior of Newton's method in a neighborhood of the solution. In particular, a good approach is the forcing strategy proposed in [80]. Given parameters γ ∈ (0, 1], ω ∈ (1, 2], and η 0 ∈ [0, 1], set where η k is the relative Krylov tolerance adopted at Newton step k. In practice, an additional safeguard is also included to prevent the tolerance from becoming too small far away from the solution, The specific choices γ = 0.9, ω = 2, and η 0 = 10 −1 work well in practice. As the primary focus of the current work is the effectiveness of the linear solver and proposed preconditioning strategy, however, in the numerical examples below we disable this adaptive tolerance and instead set a fixed linear solver tolerance of η = 10 −6 . This stringent tolerance leads to oversolving, but it makes the assessment of linear solver performance more straightforward.

Two-Stage Preconditioner
The preconditioner operator M −1 for the Jacobian matrix A is constructed using two ideas. The first component is a particular block factorization of the system matrix, which allows us to break the coupled multi-physics problem into simpler sub-problems. The second component is sparse Schur-complement approximation, allowing us to circumvent certain dense operators which would otherwise appear during the block factorization process.

Mechanics / Flow Partitioning
For clarity of presentation, it is helpful to first re-partition the Jacobian system into a 2 × 2 form in which flow variables (saturations and pressures) are grouped and separated from the mechanics variables (displacements): A common approach for constructing preconditioning operators for multiphysics systems is to begin with a block factorization of the matrix [81,82], The operator S ff is the Schur-complement of A uu in A. In anticipation of developments to come, we remark that this is a "first-level" Schur-complement, corresponding to elimination of one field from the original 3 × 3 system. The second-level Schur-complement, stemming from elimination of two fields, will appear later in the discussion. The guiding idea is to use the inverse of the lower-triangular factor as a template, and seek a preconditioning This strategy is motivated by the observation that L −1 A = U. A property of triangular matrices is that their eigenvalues appear on the diagonal, from which we conclude that the matrix U has a single eigenvalues λ = 1 with multiplicity n.
A direct application of L −1 would therefore provide perfect conditioning. In a practical setting, however, we can only approximate L −1 and so the construction of a useful scheme requires additional work. The first challenge arises in the treatment of the Schur-complement operator. It may be expanded as This operator is dense and difficult to work with due to the presence of A −1 uu . As a first step, we therefore seek to replace the exact S ff with a sparse approximation.
A first simplification comes from an examination of the block A us . This operator represents the impact of saturation changes on the momentum balance equations. Specifically, a change in saturation changes the mixture density appearing in equation (1a), which will impact the displacement field. In general, however, this coupling is quite weak, particularly if the saturation field evolves slowly in time or if phase density differences are small. For preconditioning purposes, we simply assume A us ≈ 0 and eliminate the dense terms from the first block column of S ff .
For the remaining dense terms, we rely on a physically-motivated approach that has proven effective for singlephase poromechanics, the fixed-stress approximation. This approach was first introduced as an operator splitting method for sequential solution of poromechanical systems [32,[36][37][38]. The sequential method was later reinterpreted in [20] as a block-preconditioned Richardson iteration involving a particular Schur-complement approximation. The authors showed that much better performance could be achieved by embedding this same preconditioning strategy within a Krylov iteration. Here, we consider the multiphase extension of this idea.
Let σ v = (K dr v −bp) denote the mean total stress. Assuming the mean total stress remains fixed over an increment, i.e. ∆σ v = 0, the volumetric strain can be expressed in terms of the pressure increment as Substituting expression (23) in the constitutive relationship (4) during the linearization process removes the explicit dependence of the porosity on the displacement vector, with the resulting discrete mass balance equations uncoupled from the linear momentum balance. Using this decoupling argument for computing an approximation to the exact Schur complement yields where the entry (i, j) of matrices D (fs) sp and D (fs) pp are given by Noting the cell-wise constant interpolation used for both pressure and saturation, the matrices D sp and D pp are diagonal. Expanding the partial derivatives, the diagonal entries for an element K i ∈ T h with volume V i are The approximate Schur-complement defined in (24) is therefore sparse, with a storage pattern corresponding to the original finite-volume stencil of the mass balance equations. The fixed stress assumption only modifies the matrix diagonals associated with mass accumulation terms.
Remark 5.1. For finite-volume based simulation of multiphase problems that account for a larger number of phases (or compositions), the matrix D (fs) sp will have a block-diagonal structure with blocks having dimensions n K s × 1, where n K s denotes the total number of non-pressure degrees-of-freedom per element. Remark 5.2. The relationship between volumetric strain and fluid pressure defined by equation (23) is an approximation based on the assumption that the loading path is primarily volumetric. The actual relationship will depend both on material properties and on the specific loading conditions for a given problem. While the modulus K dr will provide good convergence in all cases, it may not be the optimal value for any given problem. For example, under onedimensional loading, a uniaxial modulus will more accurately approximate the coupling condition. See, for example, recent work in [37,62] on using different fixed stress assumptions to improve convergence rates.
Remark 5.3. An algebraic generalization of the fixed-stress approximation was proposed in [83] based on a row-sum lumping (RSL) strategy. Using a probing vector e = {1, . . . , 1} T ∈ R n p , the entries of the diagonal matrices defined in (25) can be computed as Here, diag(·) indicates construction of a diagonal matrix using an input vector. Computing matrices D (rsl) sp and D (rsl) pp in this manner requires two solves with A uu up to a desired accuracy, and four matrix-vector multiplications. If a good preconditioner for A uu is available, it can be used to approximate the action of A −1 uu in the construction. Note that it is not necessary to update these vectors at each nonlinear iteration if the relevant Jacobian blocks do not change significantly. An alternative algebraic construction for the diagonal approximation D (fs) pp was presented and analyzed in [31] for single-phase poromechanics using an element-by-element approach.
Returning now to the template M −1 ≈ L −1 , the top level preconditioning operator is defined as where two sub-preconditioners have been identified, corresponding to a mechanics preconditioner acting on the displacement components, and a flow preconditioner acting on the combined saturation and pressure components. We will describe particular forms for them in subsequent sections. In general, though, we remark that the sparse Schur complement system can be viewed as the discretization of a standard multiphase flow problem without mechanics, where the fixed stress assumption provides a particular form for the rock-compressibility term. Therefore, much of the existing knowledge on preconditioning multiphase flow problems may be brought to bear in designing M −1 ff . The matrix A uu is an elasticity operator, and many existing tools also exist for constructing M −1 uu . Therefore, the block factorization strategy provides a convenient and modular framework for assembling a coupled multiphase poromechanics solver from highly-tuned component pieces.

Mechanics Preconditioner
The elasticity block A uu is an elliptic operator that is well suited to multigrid techniques [84]. Here, we take a simple approach known as the separate displacement component approximation, introduced in [85]. If the displacement degrees of freedom are ordered by coordinate direction, then the resulting matrix has its own 3 × 3 structure, Rather than building a preconditioner using the complete matrix, we consider the sparser approximation in which the off-diagonal blocks have been eliminated. Using Korn's inequality, one can show that A (SDC) uu is spectrally equivalent to A uu [86,87]. Note that this approximation breaks down in the incompressible elasticity limit, Poisson ratio ν → 0.5, though this is not a concern for most geologic materials. We then construct an algebraic multigrid preconditioner from the block-diagonal approximation, which can be thought of as three scalar AMG preconditioners, one for each coordinate direction. More specific details regarding the AMG implementation are summarized with the numerical examples below. The mechanics preconditioner is the most expensive operator to construct during the solver setup phase. For an elastic model, however, the matrix A uu is constant during the entire simulation. This AMG preconditioner therefore only needs to be initialized once, and its setup cost is quickly amortized over many solver calls. This is not the case for the flow preconditioner, which is typically re-initialized within every Newton iteration.

Flow Preconditioner
The preconditioning operator for the approximate Schur complement S (fs) ff is built following the Constrained Pressure Residual (CPR) approach [1,2,5]. This is a multistage strategy that combines preconditioners with different features in a multiplicative fashion. In general, given a matrix B in R n×n , a multiplicative multistage preconditioner M −1 B for B can be written as [5] M −1 with n st the number of stages, M −1 i the ith preconditioner for B, and I the identity matrix in R n×n . The design of the CPR multistage approach-the industry standard for fully-implicit simulation of multiphase flow-is guided by the underlying nature of the governing equations. Fields that exhibit near-elliptic behavior, such as pressure, are characterized by long-range (global) coupling. They require multilevel preconditioners to reduce error modes that are smooth across the computational domain. Conversely, fields with near-hyperbolic character, such as saturation, display short-range (local) coupling. Cheaper strategies can be employed to remove the associated error modes. Standard CPR is implemented as a two-stage method that combines a global (M −1 G ) preconditioner with a local preconditioner (M −1 L ). The overall preconditionioning operator M −1 ff is formally In practice, M −1 ff is applied through a sequence of vector-multiply operations. The derivation of M −1 G also relies on a block LU-factorization, this time of the approximate Schur-complement Note that S pp is the "second-level" Schur complement, obtained after recursive elimination of both displacements and saturations from the original 3 × 3 system A. We again seek to sparsify the Schur-complement, this time by introducing diagonal approximations for A ss and A ps . Two common strategies are True-IMPES reduction and Quasi-IMPES reduction. True-IMPES reduction seeks to mimic the IMplicit-Pressure Explicit-Saturation (IMPES) time discretization scheme [71] on the linear level. A column-sum lumping operation is used to lump off-diagonal entries-containing derivatives of the flux terms with respect to the saturation degrees of freedom-into the diagonal. This can be computed using the matrix transpose as with e = {1, . . . , 1} ∈ R n s . A simpler alternative is the Quasi-IMPES reduction, which extracts the diagonal entries from the respective matrices, without lumping in off-diagonal terms, A ps ≈ D (cpr) ps = diagm(A ps ). (37b) To clarify notation, diag(·) indicates construction of a diagonal matrix using an input vector, while diagm(·) indicates construction of a diagonal matrix using an input matrix. In practice, the diagonal matrices are simply stored as a vector of their diagonal entries. Note that in a more general formulation where multiple saturation or compositional variables are present, the CPR reduction leads to block-diagonal matrices with small, dense blocks corresponding to degrees of freedom coupled in each cell. See [5] for details. With these diagonal approximations, the second-level Schur complement may be approximated as Note that this strategy again preserves the matrix sparsity pattern stemming from the original finite-volume stencil. The result is a near-elliptic pressure system that is well suited to AMG preconditioning. As before, we then use the inverse of the lower triangular factor appearing in (35) as a preconditioner template. This leads to the concrete Algorithm 1 Application of the preconditioner M −1 In equation (39), we note that the matrix A ps could also be replaced here with its diagonal approximation. We have generally found better performance using the complete matrix, however, despite the larger cost for a matrix-vector multiply operation. The global preconditioner M −1 G corrects low-frequency errors associated with the pressure unknowns, but it only performs a simple (Jacobi) correction to the saturation unknowns using D (cpr) ss . The second-stage preconditioner M −1 L complements this first stage by smoothing out the remaining error modes. These residual errors are primarily associated with steep saturation gradients, and are highly localized. As a result, single-level and processor-local preconditioning strategies can be quite effective.
To construct the second stage, the matrix S (fs) ff is viewed as a monolithic system. Let P be a permutation matrix that reorders the partitioned vector of saturation and pressure unknowns into an interleaved ordering-that is, degrees of freedom are re-ordered as {{s 1 , p 1 }, {s 2 , p 2 }, ..., {s n , p n }}. This permutation creates a sparsity pattern in which dense 2 × 2 blocks appear for each grid cell, with these cell blocks connected by the TPFA finite volume stencil. This hierarchical structure also appears in the more general case when multiple saturation or compositional variables exist. The size of the dense blocks corresponds to the total number of unknowns per cell. Block versions of relaxation or incomplete factorization preconditioners are therefore appealing, as dense multiplication and inversion operations can be applied to the small blocks.
where the input matrix is the interleaved version of the partitioned flow system. Option 2 uses one sweep of processorlocal, pointwise ILU(k) (iluk) on the interleaved system.

Final Algorithm
Using the developments above, the concrete implementation of the preconditioning strategy can be summarized as follows. In the simulator, linear solver calls are broken into two phases: a setup phase and an iteration phase. During the setup phase, auxiliary vectors and matrices associated with the fixed stress and CPR reductions are assembled.
Then sub-preconditioners M −1 uu , M −1 pp , and M −1 L can be constructed. The elasticity preconditioner M −1 uu is the most expensive to build, but it only needs to be initialized once during the entire simulation. Once the necessary operators are assembled, a non-symmetric Krylov solver (e.g. GMRES) may be called with a preconditioning routine implementing the strategy proposed here. As an aid to the reader, Algorithm 1 summarizes the concrete steps necessary to apply the preconditioner M −1 to an input vector. We observe that steps 1-5 implement a block-lower-triangular preconditioner involving two AMG sub-preconditioners and one Jacobi (diagonal) preconditioner. Steps 6-7 then implement a second stage correction to the saturation and pressure fields using the cheap local smoother.
Remark 5.4. While this work focuses on "monolithic" solver strategy using a preconditioned Krylov iteration, we remark that a "sequential" solver strategy can be obtained easily by simply embedding the proposed preconditioner within a Richardson iteration.
Step 1 of the preconditioner implements a correction to the displacement field, while steps 2-8 implement a subsequent update to the flow variables. One would then iterate back and forth between mechanics and flow in a sequential fashion within the fixed point iteration. See [20] for further details. In general, however, the Krylov approach will provide superior convergence properties.

Numerical Examples
We present two numerical examples to test the performance of the proposed preconditioner: (1) a weak scaling study using a simple, synthetic configuration; and (2) a strong scaling study using a more realistic, highlyheterogeneous reservoir. The code used for this study, Geocentric, relies heavily on the deal.ii Finite Element Library [89] for discretization functionality. Algebraic multigrid, relaxation, and incomplete factorization sub-preconditioners are provided by Hypre [90]. All numerical results were obtained on a cluster of Intel Xeon E5-2695 processors. Each computational node contains two 18-core processors sharing 128 GiB of memory, with Intel Omni-Path interconnects between nodes. We use pure MPI-based parallelism.
In general, the performance of the proposed algorithm is heavily dependent on the specific design of the subpreconditioners. The numerical results below are based on the following choices: For the separate-displacementcomponent elasticity operator, we use an AMG preconditioner with one level of aggressive coarsening, one sweep of hybrid forward 1 -Gauss-Seidel [91] for the down cycle and one sweep of hybrid backward 1 -Gauss-Seidel for the up cycle. The coarsest grid is solved directly with Gaussian elimination. For the first-stage flow system, we use quasi-IMPES reduction. The pressure system uses a standard scalar AMG approach with a Hybrid Modified Independent Set (HMIS) coarsening strategy [92]. Similar to the elasticity block, we use one sweep of hybrid forward 1 -Gauss-Seidel for the down cycle, one sweep of hybrid backward 1 -Gauss-Seidel for the up cycle, and a direct solve with Gaussian elimination for the coarsest grid. For the second stage correction, we explore two options. For the first benchmark in §6.1 we use three sweeps of hybrid block Gauss-Seidel. For the second benchmark in §6.2 we use one sweep of processor-local, pointwise ILU(0). Details of the material and parameter values used in both examples are summarized in Table 1.
In both examples, fluid phases follow the density model where c l is the fluid compressibility and ρ o l is the phase density at a reference pressure p o . This reference pressure is taken as initial formation pressure. For the rock, solid grains are intrinsically incompressible with Biot coefficient b = 1, maximizing the coupling strength between the mechanical and flow problems. The relative permeability relationships are where s wr and s nwr are the wetting and non-wetting residual saturations. At t = 0, the formations are assumed to be in hydrostatic and lithostatic equilibrium, with a uniform saturation s(x, 0) = s wr . In both examples, target well pressures are not applied instantaneously, but rather ramped up linearly over a fixed time period (1 day) to relax the stiffness of the nonlinear problem. We use a telescoping time-stepping scheme, where simulation timesteps grow exponentially from an initial value to a maximum value, after which they remain constant for the remainder of the simulation.

Staircase Benchmark
For the first benchmark, we consider the synthetic configuration illustrated in Figure 1. Injection and production wells are placed at opposite ends of a high-permeability channel that winds its way in a "staircase" spiral through a low permeability host rock. The reservoir channel has a 1000:1 permeability contrast with surrounding seal units, with isotropic permeability in both regions. All external boundaries are assumed to be no flow. For mechanics, the side and bottom boundaries have a zero normal-displacement constraint (roller boundary), while the top surface is allowed to deform freely. The water injector and fluid production wells completely perforate the channel. The injection (respectively, production) well reaches a target bottomhole pressure of 5 MPa above (below) initial formation pressure after 1 day of buildup (drawdown). An illustrative snapshot of the simulation results is presented in Figure 1. While the geometry is artificial, the problem is designed to exhibit strong coupling and provides a straightforward description for comparison with other codes.
The results of a weak-scaling study using this configuration are presented in Table 2. We begin with an 88k degree-of-freedom mesh run on one core. The mesh is then uniformly refined and run on eight cores, preserving (approximately) the same number of unknowns per core. This process is continued out to 512 cores and 42M degrees of freedom. We then repeat the process with 88k degrees-of-freedom on 2 cores and scale out to 1024 cores. One can also get an indication of strong scaling performance by comparing results in Tables 2a and 2b for the same problem size but different core counts.
Because the problem is nonlinear, we observe an increase in the average number of Newton iterations per timestep as the mesh is refined. As desired, however, the number of GMRES iterations per linear solve is relatively insensitive to the problem size, showing only minor growth with refinement. Using more complex smoothers, we can further drive down the iteration count but at the cost of overall timing performance. Table 2 also presents timing results. Both the setup and solver iteration phases exhibit some growth, though even at large processor counts the solve time is quite satisfactory. We note that the actual work per processor in this problem is quite small, and the timing growth is a symptom of communication costs in the AMG setup and solver iteration phases. In general, however, the overall framework provides a robust basis for scalable performance.

SPE10-based Benchmark
As a more realistic test case, we consider a water flood in a heterogeneous reservoir. The setup is based on Model 2 of the 10 th SPE Comparative Solution Project [88] but equipped with poroelastic mechanical behavior. In order to provide realistic mechanical boundary conditions, we add 300-m thick overburden and underburden layers to the original SPE10 reservoir. A sketch of the simulated domain is shown in Figure 2. The computational grid consists of 60 × 220 × 252 cells. The total number of degrees of freedom is n total = 16,730,559 (n u = 10,077,759; n s = 3,326,400; and n p = 3,326,400). The original SPE10 anisotropic permeability distribution and relative permeability curves are used-details provided in [88]-while the porosity field has been slightly modified so that the lowest porosity for a cell is thresholded at 1% to remove extremely low porosity values present in the original data files. An isotropic permeability value equal to 0.01 mD is assigned to the overburden and underburden layers, and porosity is set to 1%. Homogeneous Young's modulus E = 5000 MPa, Poisson's ratio ν = 0.25, and Biot's coefficient b = 1.0 are assumed everywhere. All boundaries are impervious to fluid flow and constrained to have zero normal displacement, except for the top which is allowed to deform freely. Five wells penetrate the entire reservoir as shown in Figure 2. The injector well, located in the middle of the domain, and the production wells, located at the corners, operate at a constant +5 and -5 MPa overpressure, respectively. Note that these well conditions differ from the original SPE10 specification, which include rate-controlled wells. Overall, 100 days of injection are simulated. Table 3 reports results of a strong-scaling study in which the total problem size is kept fixed but allocated across an increasing number of cores. The GMRES iterations to convergence are well-controlled and mostly insensitive to the parallel partitioning. Good overall timing efficiency is observed with only a mild decrease from ideal performance for the 576 core case. The results suggest that one can use the proposed solver strategy and large processor counts to efficiently drive down long simulation times even for challenging heterogeneous problems.

Conclusion
In this work, we have presented a preconditioning scheme suitable for fully-implicit simulation of multiphase flow and transport with geomechanics. We have used physics-based arguments to break the inherently coupled problem into modular sub-problems for which "black-box" algebraic methods can be effective. This modularity also provides significant flexibility to test other approaches. For example, while we have used a CPR-like strategy for the reduced multiphase flow problem, research into effective reservoir simulation preconditioners remains an active area of research. If an alternative flow preconditioner has been identified, the partitioning framework here provides a simple strategy for adding poromechanical physics.
Regarding future work, a number of avenues would be interesting to explore. First, it is well known that the coupling strength between different PDEs evolves with both time and timestep size. Strong coupling is often observed at early time and small timesteps, while weaker coupling is observed at late time and large timesteps. One might  consider an adaptive solver strategy where the coupling strength between various quantities is detected based on the magnitude of coupling contributions in the preconditioner. When sufficient "uncoupling" is detected between one or more fields via these coupling indicators, one might switch to a cheaper solver strategy. Also, this work has focused almost exclusively on the solver strategy at the linearized equation level. There are likely significant performance improvements to be gained by embedding similar ideas at the nonlinear level. Finally, while the physics-based splitting strategies used here are effective, they are PDE-specific and cannot be immediately extended to other coupled systems. For example, the addition of thermal or fractured reservoir behavior would change the underlying PDEs and render the current solver strategy incomplete. There is therefore significant room for both new physics-based solvers as well as algebraic-coupling methods that can be (to the extent possible) agnostic as to the underlying physics.

Appendix B. Finite element and finite volume vectors and matrices
Repeated indices in the expressions given in this appendix do not imply summation.