A novel block non-symmetric preconditioner for mixed-hybrid finite-element-based flow simulations

In this work we propose a novel block preconditioner, labelled Explicit Decoupling Factor Approximation (EDFA), to accelerate the convergence of Krylov subspace solvers used to address the sequence of non-symmetric systems of linear equations originating from flow simulations in porous media. The flow model is discretized blending the Mixed Hybrid Finite Element (MHFE) method for Darcy's equation with the Finite Volume (FV) scheme for the mass conservation. The EDFA preconditioner is characterized by two features: the exploitation of the system matrix decoupling factors to recast the Schur complement and their inexact fully-parallel computation by means of restriction operators. We introduce two adaptive techniques aimed at building the restriction operators according to the properties of the system at hand. The proposed block preconditioner has been tested through an extensive experimentation on both synthetic and real-case applications, pointing out its robustness and computational efficiency.


Introduction
Numerical modelling of fluid flow in porous media is a key requirement for a wide number of applications in subsurface hydrology and petroleum engineering. In general, computer simulators are fundamental tools for the proper management and exploitation of aquifer systems, as well as oil and gas fields. The growing demand for a higher accuracy of the simulation, assisted by the increasing availability of computational and storage resources, leads to a continuous development and refinement of virtual simulators. The degree of approximation of the overall numerical model is defined, first of all, by the underlying mathematical model, but also by the selected discretization scheme. Discretization schemes should handle effectively non K-orthogonal unstructured grids, as well as highly heterogeneous and anisotropic rock/fluid properties, frequently introduced as full-tensors in the model (see, for instance, [1] about the numerical issues, related to abrupt changes in permeability, in node control volume finite element discretizations).
The Mixed Hybrid Finite Element (MHFE) method, and in general the whole class of Mixed Finite Element (MFEM) methods [2], coupled with the Finite Volume (FV) method, has been gaining a growing popularity in recent years. The enforcement of the mass balance at the elemental level, the continuity of normal fluxes across internal faces of the discretized domain, the accuracy of the velocity and pressure fields, the possibility of handling either structured or unstructured grids and the elegant treatment of full tensor fluid properties [3,4,5] have made the MHFE method attractive for several applications, such as contaminant transport [6,7,8], energy storage [9], poromechanics [10,11] and, of course, single-phase [12], variably saturated [13,14], multi-phase [15,16,17,18,19] and more recently compositional [20] flow problems. However, the MHFE method exhibits some critical aspects as well, for instance the violation of the Discrete maximum principle [21] and the higher number of unknowns per element, as compared to other schemes like the FV or the classical Finite element methods. In this regard, much effort was devoted to try to

MHFE-FV model of single-phase flow in porous media
The set of equations governing the single-phase flow in porous media consists of the mass conservation and Darcy's law. The monolithic solution approach addresses these equations simultaneously by means of a fully-implicit coupling.
where the symbol ∇ indicates the gradient operator, ∇· the divergence operator and() the derivative with respect to time. In equation (1a), γ is the fluid specific weight and K is the conductivity tensor, assumed to be symmetric and positive definite (SPD). The gravitational term is here neglected. The specific storage coefficient c in equation (1b) can be expressed as c = γ(α+φβ) where α is the soil compressibility, φ the medium porosity and β the fluid volumetric compressibility [50]. The solution to the system of equations (1a) and (1b) is a well-posed problem provided that a set of appropriate initial and boundary conditions is supplied: for assigned functions p 0 :Ω → R, p : Γ p × T → R, and v n : Γ v × T → R. In equation (2c), n denotes the outer unit normal vector to Γ v .

Discretization of the governing equations
The model domain is partitioned into non-overlapping hexahedral elements, which accommodate, as shown in Figure 1, two types of pressure unknowns, located on each face barycentre, π, and on the element centroid, p E . The former act the part of Lagrange multipliers and express the face average pressure, whereas the latter represents the average elemental value.
Let E h and F h be the collections of elements and faces of the discretized domain, respectively. In our modelling approach, equation (1a) is discretized by means of the MHFE method, using the RT 0 space to approximate the velocity v and the P 0 space for the pressurep and Lagrange multipliersπ: M h = π |π ∈ L 2 (Ψ),π = p on Γ p ,π| F ∈ P 0 (F), ∀F ∈ F h , where L 2 (Ω) and L 2 (Ψ) denote the spaces of square Lebesgue-integrable functions on the domain Ω and the union of elemental faces Ψ, respectively, and H(div, Ω) is the Sobolev space of square integrable vector functions with square integrable divergence in Ω [2]. The V h trial space for 3-D problems is generated by local piecewise trilinear vector functions, η E i (x, y, z), defined per each face i of element E [5,51]. Such functions exhibit two basic properties [4]:

The MHFE-FV system of equations
The solution to the model problem is achieved by solving the system of equations (14) at each time step, along with the strong enforcement of the continuity of fluxes across the faces of the grid: Along the boundary, equation (16) allows also to apply Neumann conditions in a strong form, just by substituting the RHS accordingly and dropping q E i . Notice that equation (16) uses the fluxes as expressed in (12), unlike equation (14). Finally, it is implicitly assumed that π E i = π E i due to continuity reasons. The resulting system exhibits a 2 × 2 block structure, which is solved in a fully-implicit framework, with two types of unknowns: where , π n+1 and p n+1 gather the face and element pressure unknowns, and f π and f p are the relevant components of the known term. In equation (17), the Lagrange multipliers and the element pressure unknowns are coupled by means of the rectangular blocks A πp and A pπ . As to the properties of A, this matrix has a flipped generalized saddle-point structure, it is sparse, non-symmetric and usually ill-conditioned. In particular, A pp has a symmetric structure, though it is not symmetric, A ππ is a symmetric negative definite matrix, and A πp ±A T pπ . As mentioned before, in the context of single-phase flow, system (17) is linear.

The Explicit Decoupling Factor Approximation preconditioner
Solving accurately and efficiently the sequence of linear systems (17) arising from a MHFE-FV unsteady flow simulation is the major purpose of this study. Iterative Krylov subspace solvers are mandatory to address the large-size and sparse systems of equations that stem from real-world 3-D models, especially for the low memory requirements and better scalability as compared to direct solvers [28]. When the system matrix is non-symmetric, the Bi-Conjugate Gradient Stabilized (Bi-CGStab) [54] or the Generalized Minimal Residual (GMRES) [55] methods are usually the selected algorithms. However, improving their performance by supplying an appropriate preconditioning operator P −1 is key in order to guarantee a fast and smooth convergence.
It is well-known that an effective preconditioner is an operator whose application to a vector should resemble as much as possible that of the inverse A −1 of the system matrix [56,57]. Therefore, a good starting point for the design of our preconditioner is to consider A −1 and take advantage of its block structure, as it is usually done in saddle-point and general block problems [48,49,58,59]. The block LDU decomposition of the system matrix reads: where S = A pp − A pπ A −1 ππ A πp is the so-called Schur complement. The exact inverse of A in a factorized form reads: where the two decoupling factors are defined as: The decoupling factors F and G are also used to compute the Schur complement as: with H = GA ππ F.

6
Considering equations (20), F and G can be computed explicitly by solving two independent sets of multiple right-hand side (MRHS) systems: Of course, such an operation cannot be performed exactly because F and G are dense, hence proper approximations have to be introduced. The key feature of the proposed approach, denoted as Explicit Decoupling Factor Approximation (EDFA) preconditioner, is the computation of sparse explicit approximations for F and G, F and G, respectively, by means of proper restriction operators. The approximate decoupling factors F and G are used to compute a sparsified Schur complement S : Recalling equation (19), the final algebraic expression of the EDFA preconditioner reads: where A −1 ππ and S −1 are inexact applications of the inverse of the leading block A ππ and the approximate Schur complement, respectively.
Remark 3.1. The approximate decoupling factors F and G are used only for the computation of S and do not replace the relevant terms in the triangular factors in equation (24). In this sense, the EDFA algorithm can be regarded as a member of the class of mixed constraint preconditioners [58,60], where a twofold approximation for the inverse of the leading block is inherently introduced. Similarly, it can be also viewed as an example of application in a nonsymmetric context of the multigrid reduction framework, e.g. [61,62], where face and elemental pressures play the role of fine and coarse nodes, respectively, and F and G are approximations of the optimal restriction and prolongation operators from the fine to the coarse grid.
The approximation of the decoupling factors F and G is performed by solving the sequence of MRHS systems (22a) and (22b) inexactly in properly restricted subspaces. For the sake of simplicity, we refer to system (22a), but the same developments can be easily extended to (22b). The m-th system to be solved reads: where A ππ = A T ππ for symmetry reasons, g (m),T = G T e (m) , a (m),T pπ = A T pπ e (m) and e (m) is the m-th vector of the canonical basis of R N e , which plays the role of restriction operator over columns. The minus sign has been introduced at both sides of (25) to obtain an SPD problem, since A ππ is negative definite. Let us now consider the set Q = {1, . . . , N f } ⊂ N and a sequence of (possibly overlapping) subsets Q (m) ⊆ Q, whose size is |Q (m) | = s (m) , with m = 1, . . . , N e . The m-th restriction operator over rows, is expressed as: where f is the -th column vector of the canonical basis of R N f and Q (m) i is the i-th member of Q (m) . The application of the operator R (m) r to equation (25) leads to the following system ( Figure 2): where A (m) ππ = R (m) r A ππ R (m),T r is a symmetric restriction of A ππ to the entries in the rows and columns with indices in Q (m) and g (m),T = R (m) r g (m),T is the restriction of the m-th row of G to the entries in the columns with indices in Q (m) . Since 7

−A (m)
ππ is a symmetric square submatrix of the SPD matrix −A ππ , it is guaranteed to be SPD as well. The sequence of systems (28) can be inexpensively solved by an inner direct solver, provided that the cardinality of Q (m) is small enough.
The restricted vector g (m) obtained from the solution of system (28) is an approximation of the m-th row of the exact decoupling factor G and inherits an optimal property, as stated by the following result.
Proposition 3.1. Let A ∈ R n×n be SPD and R ∈ R m×n (m < n) be a restriction operator from R n to R m . Then, for any right-hand side vector b ∈ R n , the solution x ∈ R m to the restricted system: is such that the error e = A −1 b − R T x has minimal energy norm with respect to the A-inner product.
Proof. The energy norm of e with respect to A reads: The contribution under square root in (30) is a quadratic function Φ(x) : R m → R + : which has a unique minimum in R m being A SPD. Hence: Condition (32) applied to equation (31) immediately yields: which completes the proof.
Remark 3.2. Proposition 3.1 guarantees that the restricted vector g (m) is the best approximation of g (m) that can be computed for the components selected by the set Q (m) , in the sense of the energy norm with respect to the A ππ -inner product. Hence, an accurate selection of such components, so as to identify the most important ones for each column, is fundamental for the quality of the approximation G and, similarly, of F and H.
Finally, the assemblage of the N e contributions g (m) from equation (28), prolonged back to R N f , gives rise to the approximate factor G. Recalling that restrictions and prolongations are dual operators, G is easily obtained as: Operating similarly for equation (22b), we obtain: where f (m) are the solution of the N e restricted SPD systems: with a (m) πp = A πp e (m) . Of course, the restriction operators R (m) r can be the same as for G or based on a different sequence of subsets W (m) ⊆ Q.
As observed in Remark 3.2, the sequence of subsets Q (m) , W (m) , along with their size s (m) , affects the density of the approximate decoupling factors and governs the effectiveness of the EDFA preconditioner. In fact, the entries of Q (m) and W (m) are the indices of the non-zero entries computed for the m-th row of G and column of F, respectively. To be effective, the sets Q (m) and W (m) should roughly identify for each row of G and column of F the largest entries of G and F. This key operation is carried out by means of two techniques, referred to as static and dynamic in the sequel, aimed at selecting the most influential entries expected in G and F.
First of all, for the sake of simplicity we use a single sequence of sets Q (m) for both decoupling factors. A natural initial guess for Q (m) is the set of indices of the non-zero entries belonging to the columns of A T pπ , which is denser than A πp . Such pattern is referred to as Q (m) A pπ . Figure 2 schematically shows how the restricted systems can be retrieved from the global one using the set Q (m) A pπ . The two strategies for computing Q (m) starting from Q (m) A pπ are as follows.
1. Static technique. The non-zero entries of Q (m) A pπ can be derived by the discretization. In particular, the non-zeros lying in the m-th row of A pπ identify the faces of the cells connected with the m-th element, as illustrated in Figure 3. Notice that the front and right elements have been removed for the sake of readability, being the overall patch symmetric along the three principal directions. The central element (red-filled faces) is the m-th cell, which is connected to six adjacent elements, and the colored faces correspond to the indices of the nonzero entries in the m-th row of A pπ . Note that, depending on whether the grid is structured or unstructured, the patterns are different. This is a direct consequence of the structure of the elemental matrices B E,−1 of equation (11), which derives from the mutual relationships among the basis functions of the RT 0 space for hexahedral elements. For a regular hexahedron, in fact, matrix B E,−1 is block-diagonal, while this property is no longer valid for a general-shaped hexahedron. Since the solution of the system (25) can be physically interpreted as the face pressures induced by the fluid fluxes related to the pressure gradients occurring in neighboring cells, the pattern Q (m) A pπ can be extended by adding the connection to faces belonging to close cells where the pressure perturbation is expected to propagate. From an algebraic viewpoint, the static technique is based on partitioning the problem domain into overlapping subregions built around each cell and keeping the face connections.
is obtained and used to expand Q (m) (0) by incorporating the indices of the largest components of r (m) (0) , thus obtaining Q (m) (1) . The process can be iterated to obtain Q (m) (2) , Q (m) (3) , etc., until a certain exit criterion is met. The same procedure applies to the system (36).   (37) is not expensive to compute. In fact, the matrix A ππ R (m),T r,0 is the restriction of A ππ to the columns with indices in Q (m) (0) . However, such columns are sparse and contain only the connection of a face with the faces of the two sharing cells. Hence, the only non-zero entries of r (m) (0) correspond to the indices of the faces of a set of neighboring elements. In practice, the dynamic strategy automatically selects the most significant entries among a subset of potential indices that should resemble the one associated with the static strategy.
Remark 3.4. The use of the prolonged residual to select the most significant entries to be retained is strictly related to the symmetry and positive definiteness of −A ππ . In fact, r (m) (0) is the direction of the gradient of the quadratic form associated to −A ππ , whose absolute minimum is the exact solution to (25). Therefore, the dynamic strategy can be also regarded as an incomplete steepest descent process, where only the largest contributions to the gradient direction are taken into account.
Remark 3.5. The EDFA preconditioner, in both the static and dynamic variants, exhibits the remarkable feature that its computation is embarrassingly parallel. In fact, the row-and column-wise approach, used to tackle the restricted solution to the MRHS systems (22a) and (22b), allows to solve each single linear system independently of the others. All the available processing units can be assigned batches of systems that are approximately solved at the same time, with a full and effective exploitation of the most modern computational architectures.

Implementation details
The static and dynamic variants of the EDFA preconditioner require a set of user-specified elements to be properly set up.
The static technique needs the sets Q (m) ⊆ {1, 2, . . . , N f } for m = 1, . . . , N e , which correspond to the indices of faces connected to a certain cell. The level of such a connection, i.e., the neighbours, or the neighbours of the neighbours, and so on, is defined by means of a domain partition into overlapping subregions built around each cell. These subregions are defined on the basis of physical considerations related to the expected directions of fluxes.
The dynamic variant can be regarded as fully algebraic and requires a set of user-specified parameters controlling the enlargement of the initial set Q (m) (0) defined for m = 1, . . . , N e . Assuming Q (m) (0) = Q (m) A pπ , the selected user-defined parameters are: • n add : maximum number of entries added to Q (m) (k−1) at the k-th step of the dynamic procedure; • n ent : total maximum number of new entries added to Q (m) (0) . The iterative process continues until n ent has been reached. Alternatively, it is also possible to set a maximum number of iterations, it max , instead of n ent .
The computation of F, G and S = A pp − H, with either the static or dynamic technique, is followed by a check of the non-zero entries size. Pre-and post-filtration techniques are implemented with the purpose of further sparsifying the approximate Schur complement by discarding those entries whose absolute value is smaller than a user-defined tolerance, namely τ filt , relative to the Euclidean norm of the corresponding row. Performing pre-and/or post-filtration produces an additional cost in the preconditioner set-up, which might be anyway beneficial at the application stage. With the aim at preventing possible breakdowns in the inexact application of S −1 , all the diagonal entries are preserved irrespective of the dropping threshold.
Recalling that A pp is the only block of A changing during a transient simulation, the preconditioner set-up can be split into two stages. The first one, which can be carried out only once at the beginning of the simulation and then recycled at every system solution, consists of the computation of H, i.e., the most time demanding operation, a pre-filtration of G and F, if needed, and the inner preconditioner for the inexact application of A −1 ππ . The second one, performed at the beginning of each time step, includes the update of S = A pp − H, the post-filtration, if required, and the computation of the inner preconditioner for the inexact application of S −1 . In summary, Algorithms 1 and 2 provide an overview of the sequence of operations needed to compute the first and second stage of the EDFA preconditioner in both its variants. (27)) 5: Perform Pre-filtration on g (m),T and f (m) with tolerance τ filt , if required 9:  (27)) 15: 18: Counter of the total number of new entries added to Q (m) (0) and total number of swipes 19: while n prog < n ent and k < it max do 20: k ← k + 1 21: n = min(n add , n ent − n prog ) 22: Obtain Q (m) (k) by adding to Q (m) (k−1) at most n new indices associated with the largest components of |r (m) (k−1) |

Numerical results
The computational performance of the EDFA preconditioner is investigated in both synthetic and real-world reservoir applications. Four test cases are considered, with grid consisting of four layers taken from the SPE10 model [63]  and comprising 51,741 elements and 171,070 faces, for a total of 222,811 unknowns. The scenario being tested, depicted in Figure 4, represents a reservoir with a producer located in the centre and four injectors, one at each corner. The wells intercept the full thickness of the reservoir. The initial water pressure in the reservoir is uniform and equal to 140 bar, with the producer and injectors pumping at a constant pressure of 100 and 200 bar, respectively. Different variants of the model domain have been considered. In Test 1 and 3, the grid is cartesian with a regular hexahedral discretization (Figure 4a), whereas in Test 2 and 4 the planar structure has been deformed into a dome (Figure 4b). The resulting grids are, therefore, structured and unstructured, respectively. In all tests, porosity spans the interval [2.6E-5,0.5] with a spatial distribution that follows the SPE10 benchmark properties, rock (α) and water (β) compressibilities are 4.67E-5 1 bar and 4.84E-5 1 bar , respectively, and the water specific weight (γ) is 0.101 bar m . A summary of the test cases and their main properties is reported in Table 1. Test 1, which is characterized by a homogeneous isotropic hydraulic conductivity in the form of a diagonal tensor is aimed at introducing the operative principles of the proposed preconditioner variants. A sensitivity analysis is carried out on the patterns selected for the static variant and on the two governing user-specified parameters for the dynamic technique. Then, the EDFA preconditioner is employed in a transient simulation to evaluate the effect of time, and in particular the size of the time step, ∆t, on its performance. Test 2 preserves the same hydraulic properties as Test 1, but highlights the influence of an unstructured mesh in the optimal setting of the preconditioner. Finally, Tests 3 and 4 investigate the efficiency and robustness of the EDFA preconditioner in challenging real-world conditions. Specifically, Test 3 exhibits a highly heterogeneous and anisotropic conductivity distribution, as derived from the properties of the SPE10 model and expressed in the form of a diagonal tensor. On the contrary, the dome reservoir in Test 4 is characterized by a heterogeneous and isotropic conductivity field with a full tensor, obtained by extending the horizontal conductivity values (K x,y ) to the vertical direction (K z ) and rotating the principal axes of the resulting tensor so as to follow the curvature of the dome reservoir. The sensitivity analysis on the EDFA preconditioner performance is carried out for the system at steady state, then the overall performance is investigated in full-transient simulations.
Bi-CGStab [54], with the null vector used as initial guess, is elected as Krylov subspace method to solve the sequence of non-symmetric linear systems (17). The exit criterion for the iteration count relies on the reduction of the 2-norm of the relative residual below a prescribed threshold τ, i.e., ||r k || 2 /||r 0 || 2 ≤ τ, where k is the iteration number and τ = 10 −8 . The computational performance of the preconditioned Bi-CGStab solver is monitored by using the following indicators: (i) the iteration count, n it , (ii) the preconditioner density, µ, defined as where the function nnz() provides the number of non-zeros stored for a sparse matrix, and (iii) the CPU time split into t p 0 , t p and t s , needed to perform the first and second stage of the EDFA preconditioner setup (Algorithm 1 and 2, respectively) and to iterate to convergence. We denote by t t = t p + t s the total time associated with the solution of the linear system in a single time step. For the transient simulations, we consider also the Courant-Friedrichs-Lewy (CFL) number, which is defined as [64,65]: where Q E is the water flux through the E-th element during a time step of size ∆t. Specifically, two measures are reported depending on the type of analysis: where n step is the number of temporal steps in the simulation. The size of the time steps is dynamically adjusted during the transient simulations in order to stabilize the pressure change between two consecutive time steps. The underlying criterion relies on the maximum pressure difference at the two previous steps, ∆p max = max E (p E n − p E n−1 ), and a user defined goal for the pressure change, ∆p T , to define the optimal size of the next time step: where ∆t mult is a predefined multiplicative factor and ∆t max the maximum time step length. A relaxation factor can be introduced also in equation (41) [20]. The number of non-zeros of matrix A and its submatrices for the four test cases is reported in Table 1. Notice that it depends only on the grid type. Both the solver and the preconditioner are implemented in Matlab. For the inexact application of both A −1 ππ and S −1 we use the incomplete factorizations with partial fill-in degree already available in Matlab. Of course, other powerful strategies, also more prone to a fully parallel implementation, can be used and will be considered in future developments of the algorithm. For the computation of F and G, the parfor operator has been exploited. All numerical tests were carried out on an Intel R Core TM i7 Quad-Core at 2.9 GHz with 16 GB of RAM.

Test 1: Planar reservoir with homogeneous and isotropic hydraulic conductivity
First, we introduce a set of patches of cells associated with the face pattern connection used in the static EDFA variant. The native pattern of the m-th column, previously introduced in Figure 3a for a structured grid, can be statically enlarged by considering the patches A, B and D (Figures 5a, 5b and 5d), assuming that the flux is mainly oriented along the principal conductivity axes. By distinction, the connections of patterns C and E (Figures 5c and 5e) assume the presence of significant fluxes through all directions, as it might be for instance expected in the case of a full conductivity tensor. Similarly, a significant permeability anisotropy could suggest privileging one direction with respect to the orthogonal ones.
The main results from the application of the EDFA preconditioner in its static variant to Test 1 (homogeneous and isotropic conductivity with a diagonal tensor) are reported in Table 2. Run 0, denoted as base, is taken as benchmark for the following considerations, since it refers to the performance of the EDFA preconditioner with the original A T pπ non-zero pattern. Expanding such initial pattern, by using the predefined connections A, B, C, D and E (Figures 5a -5e) is indeed beneficial, as observed in runs 1 to 5. Considering the reduction in the total CPU time t t per time step as evaluation criterion, the best results are achieved by patterns E, B and A. Specifically, the use of pattern E allows to reduce n it by a factor 1.88 and t t by 1.77, while increasing the preconditioner density by only 1.15. Runs 6 and 7 show the results obtained by applying pre-and post-filtration to pattern E. Pre-and post-filtration introduce a further sparsification of the approximate Schur complement, which is expected to decrease the application cost of the EDFA preconditioner at the cost of a slight increase in the iteration count. In this case, such a strategy does not appear to pay off, with the performance substantially getting back to Pattern A at a larger set-up cost.
As to the dynamic technique, Figure 6 shows the results of a sensitivity analysis carried out on the two userspecified parameters n ent and n add , governing the expansion of the initial pattern Q (m) (0) = Q (m) A pπ , versus the number of iterations to converge n it , the preconditioner density µ, the time to compute the pre-processing stage of the preconditioner t p 0 and the total CPU time t t per time step. All the possible settings therein allow to accelerate the convergence compared to the base case of Table 2. The most interesting results are located in the blue to light-blue area in Figures 6a and 6d, characterized by values of n ent between 4 and 12. Such an interval was also confirmed by the outcome of the static technique, in particular runs 1 and 2, where the number of new entries added per column is 6 and 12, respectively, with a very similar overall performance. Figure 6b and 6c are self-explanatory; the higher n ent the denser the preconditioner and the lower n add the more iterations are needed to expand the native sparsity pattern, and thus the higher t p 0 . It is interesting to provide the dynamic technique with a physical interpretation, so as to visually locate the position of the faces associated with the entries connected by the dynamically-formed optimal patterns. This analysis confirms the connection between the static and dynamic strategies, possibly inspiring the selection of better static connections from the the dynamic patterns. In particular, we focus on the pairs (n add , n ent ) equal to (1,4) and (2,6). Notice in Figure 7 that the newly added faces are primarily located following the main flow direction. This is consistent with the physical principle which the static technique relies on. In the scenario of Test 1, in fact, the water flow is essentially horizontal, since the wells penetrate the full thickness of the reservoir, with a principal component along the y axis. These patterns are uniform throughout the grid.   Finally, the effect of the time step size on the EDFA preconditioner in a full transient simulation is assessed in Table 3. The preconditioner is built by means of the dynamic technique, with the setting (1,4) for the pair (n add , n ent ). The range investigated spans the interval [0.01,1000] days, with the associated χ ∞ parameter up to almost 600. Notice how the number of iterations grows progressively as ∆t increases, where the steady state condition can be regarded approximately as an upper limit.

Test 2: Dome reservoir with homogeneous and isotropic hydraulic properties
The evolution from a structured to an unstructured grid introduces new challenges for the design of efficient nonzero patterns due to the modification of the native stencils of blocks A ππ and, most of all, A T pπ , which is accompanied by the increase in the number of non-zeros of those two blocks, as shown in Table 1. Specifically, A T pπ is 1.9 times denser than with the planar mesh. In fact, the face-to-element connection building the non-zero pattern of the typical column of A T pπ moves from Figure 3a to Figure 3b. This modification has a very relevant effect on the numerical performance of the base case (see Table 4), which does not converge after 2,000 iterations. Enlarging the face-to-element connections with patterns A, B and D (runs 1, 2 and 4), however, improves very rapidly the quality of S . By comparing Figure 3b with 5a, 5b and 5d, the adoption of patterns A, B and D consists in a simultaneous expansion and contraction of A T pπ 's column non-zero pattern, since at most 24 entries are discarded and others are added in a variable number. It is a sort of implicit moderate expansion accompanied by a significant filtration. Applying pre-or post-filtration when adopting pattern B, which means in practice sparsifying for the second time the relevant pattern, is not beneficial as proved in runs 6 and 7. Notice that the overall preconditioner density is generally higher than in Test 1 (see Table 2 for a comparison).
As to the dynamic technique, the same strategy, consisting of the expansion and contraction of the original A T pπ pattern, has been followed. Figure 8 reports the outcomes of a sensitivity analysis on the post-filtration tolerance for different values of n ent , ranging from 10 to 18 with n add = 4. The outcomes are expressed in terms of iteration count to converge (8a) and total solution time per time step (8b). The best performance is achieved for τ filt =1.E-3, where the number of iterations to converge is similar to that obtained with pattern B in Table 4. An effective way to limit the post-filtration cost consists of applying it to H as a pre-processing effort. This strategy is successfully investigated in Figures 8c and 8d. Notice that, while the number of iterations remains approximately the same or even improved, the CPU time per time step is halved, thus making the dynamically-formed preconditioner competitive to the best statically-derived one (run 2 in Table 4).

Test 3: Plain reservoir with heterogeneous and anisotropic hydraulic conductivity
Introducing a heterogeneous and anisotropic conductivity field in the planar reservoir application worsens the conditioning properties of the associated problem, as shown by the increase in the number of iterations of the base case (Table 5) with respect to Test 1 ( Table 2). Like in Test 1, patterns A and E (runs 1, 5) provide improved results that can slightly benefit from further sparsification of the approximate Schur complement. Specifically, pre-and postfiltration are characterized barely by the same performance (runs 6 to 9) with pre-filtration a little more efficient in this application.
A sensitivity analysis on the dynamic parameters n ent and n add is shown in Figures 9a and 9b in terms of number of iterations to converge n it and total solution time per time step t t , respectively. All the analyzed combinations succeed in accelerating the convergence with respect to the base case (run 0 in Table 5). The most efficient settings are located in the bottom left portion of graph 9b in the interval 2 ≤ n ent ≤ 8, where t t is minimized. Specifically, the fastest convergence is achieved with the setting (6,6) for the pair (n add , n ent ), where n it = 267 and t t = 8.98 s. In the same figure notice how t t grows with n ent , depending only partially on n add .
Finally, a direct solver, such as Matlab backslash operator, and Bi-CGStab accelerated by a global preconditioner, such as a threshold-based ILU(τ) of A as available in Matlab, are used to benchmark the performance of the proposed EDFA preconditoner in the steady state case. The relevant outcomes are conveyed in Table 6, where the memory peak reached during the solving phase with Matlab backslash replaces the preconditioner density as a measure of the solver memory footprint. In both cases, the solution time t t , as well as the memory requirements, are by far higher than those obtained with the EDFA preconditioner (see Table 5 and Figures 9a and 9b for reference).

Test 4: Dome reservoir with full tensor heterogeneous and isotropic hydraulic conductivity
In this final application, the dome-structured reservoir is characterized by a full-tensor heterogeneous conductivity field, which is obtained by rotating the element local axes x and y, following the curvature of the domain [66]: where K E is the element diagonal conductivity matrix and R E xy = R y (−θ E y )R x (−θ E x ) is the overall rotation matrix. θ E x and θ E y are the average rotations of the domain surface, within element E, around the global Cartesian reference system as 18  per the right-hand rule.
The main results for the static strategy are provided in Table 7. The base case performance (run 0) is not satisfactory and the only pattern yielding an appreciable acceleration is D (runs 1-5). Moreover, in two cases the solver has not converged yet after 2,000 iterations. Filtration here appears to be mandatory and effective, as shown by runs 6 and 7, where a significant post-filtration (τ filt =1.E-2) on factor H has been applied to patterns D and B. This confirms that the native patterns seem to involve too many connections, which turn out to be not significant to capture the complicated flux nature of this test case. On the other hand, pre-filtration (runs 8 and 9) does not represent a consistent alternative, due to the larger preconditioner density, which leads to an increase in the application costs even for a smaller number of iterations.
As to the dynamic technique, the sensitivity analysis on n ent and n add in Figures 9c and 9d reveals that there exists a wide blue area characterized by competitive settings, both in terms of number of iterations and total solving time per step. In this regard, the most attractive portion of the graphs remains the bottom left one. Table 8 aims at assessing the effect of filtration on the dynamically-formed preconditioner obtained with the settings n ent = 6 and n add = 1. The outcomes of post-filtration, applied to both S and H in runs 1 and 2 respectively, are here reported to emphasize the greater efficiency of the latter strategy. Although performing post-filtration on H gives a higher density preconditioner, the filtration is anticipated in the pre-processing stage of the preconditioner set-up, while preserving the quality in the approximation of S . Post-filtration on H (run 2), in fact, allows to approximately halve the density  of the original preconditioner (run 0) to the benefit from a 38% reduction in the total solution time. Pre-filtration, on the other hand, seems not to be effective (run 3). Notice that, in this application, combinations where n ent = n add , i.e., all the prescribed new entries are subsumed at once, give a bad quality S , hence at least two steps of the dynamic procedure are recommended. Despite the high density, the unfiltered dynamically-formed preconditioner in Table 8 is by far more competitive than the corresponding static alternative in Table 7 (runs 0-5). An explanation comes from the analysis of the dynamic pattern representation vs the position within the dome-grid and the flux distribution, as provided in Figure 11. The dynamic technique is capable to flexibly catch and exploit the possibly complex physics of the fluxes behind the problem. Due to the dome structure of the domain and the heterogeneity of the hydraulic conductivity, the fluid fluxes are not uniformly distributed in the domain. As an effect, the resulting pattern is not the same throughout the grid and can be hardly guessed, hence a statically designed homogeneous pattern is not well-suited to the specific requirements   Table 7), V 2 → Dynamic strategy with n ent = 6, n add = 1 (run 0 in Table 8), V 3 → as V 2 with post-filtration on H and τ filt =1.E-3 (run 2 in Table 8) and V 4 → Static strategy with pattern D, post-filtration on H and τ filt =1.E-2 (run 6 in Table 7).
of this application.
In conclusion, the performance of the EDFA preconditioner is evaluated in a full-transient simulation, reproducing the exploitation of a reservoir, initially undisturbed, under the conditions mentioned in Section 4, i.e., four injectors (one at each corner) and a producer (in the centre) operating at a constant pressure. The overall computational times are displayed in Figure 10, where a comparison among a selection of four variants of the EDFA preconditioner is proposed. The performance of the basic setting (V 1 , run 0 in Table 7) is used as benchmark to test the advantage provided by the most efficient dynamic (V 2 and V 3 , runs 0 and 2 in Table 8) and static strategies (V 4 , run 6 in Table 7). The simulated time interval is 130 days with 200 time steps. The other relevant settings are: ∆t max = 5 d, ∆p T = 5 bar, ∆t mult = 1.1. The best performance is given by variant V 3 , which proved to be the most competitive one also in the previous steady state analysis. Nevertheless, the difference with V 2 and V 4 is not as significant as the steady state analysis might have suggested. In any case, all trials V 2 , V 3 and V 4 clearly outperform V 1 , reducing the total time to about one fourth.

Discussion
Achieving a fast and cheap solution to the sequence of systems (17) that stem from a full MHFE-FV singlephase flow simulation is the main objective of this paper. Given the peculiar properties of system (17), which is characterized by a non-symmetric generalized saddle-point structure and usually ill-conditioned blocks, an original preconditioning strategy, denoted as EDFA, has been specifically designed to accelerate the convergence of Krylov subspace solvers. The key features of the proposed approach are twofold: (i) the exploitation of the decoupling factors, G and F, of the system matrix A block LDU decomposition to recast the Schur complement avoiding the inversion of the leading block (equation (21)); (ii) the inexact computation of these factors by solving two independent sets of MRHS systems (25), by means of a combination of restriction and prolongation operators based on the non-zero pattern of G and F. Such an approximation turns out to be optimal, with respect to A ππ -inner product, for the selected non-zero pattern. The ability to recognize the most representative entries of G and F is key to obtain high quality approximations at a workable sparsity and achieve a fast convergence. This is the motivation behind the introduction of two strategies, namely static and dynamic, aimed at selecting the sparsity pattern of the two decoupling factors by customizing the face-to-element connections contained in the columns of A T pπ , depending on the properties of the problem at hand. A physics-based concept underlies the static technique, since the entries of the generic column of the decoupling factors correspond to the face pressure unknowns resulting from the fluid flow arising in a certain small and compact partition of the physical domain. Expanding such partition in order to incorporate the most relevant entries is a way for enlarging effectively the native sparsity pattern. The size of the augmented partition, as well as the location of the faces associated with the newly-added entries, give rise to a plethora of possible combinations. As an example, in this paper we considered five general-purpose prototypes (Figures 5a-5e). This strategy turns out to be effective when the modeller has a robust idea of the flux distribution and the resulting pattern can be more or less uniformly extended throughout the domain (compare for instance Figures 7 and 11).
On the other hand, a fully algebraic framework is introduced to define a dynamic variant, where n ent new entries to the initial pattern are progressively added to an initial guess at the locations corresponding to the largest components of the prolonged residual (equation (37)). Compared to the static variant, the dynamic technique is computationally more demanding because, at every step of the pattern construction, a static solution is needed, however this cost can be easily amortized during a transient simulation and take advantage of an almost ideal parallel degree. Furthermore, this techniques is more flexible, since it is capable to implicitly capture the physics behind the problem without the modeller being aware of it and might be used in a black-box fashion as well. Pre-and post-filtration techniques have been introduced with the twofold purpose of controlling the density of S and improving the quality of its inexact inverse application by removing possibly detrimental near-zero entries.
The extensive experimental phase of Section 4 helped understand the potential of the EDFA preconditioner in different settings, according to the grid type and hydraulic properties, but also suggests indications about default optimal settings. All the tests revealed that less than 12 new entries are actually needed in addition to the original column patterns, but it is the grid type, i.e., whether structured or unstructured, that mostly influences the optimal EDFA preconditioner set-up. With a structured grid, the suggested static patterns in Figure 5 seem to be appropriate, and especially E and A as shown in Tables 2 and 5. By comparing Figures 5a and 5e, notice that the physical structure of these patterns is similar as the elements involved in the their definition is the same. By distinction, for the dynamic variant optimal results have been obtained by setting n ent between 4 and 10, with n add ≈ n ent . Filtration is not strictly necessary, even though some good results were obtained with pre-filtration and τ filt between 1.E-4 and 1.E-3 (Table 5).
Conversely, with an unstructured grid, patterns B and D turned out to be the winning choice for the static technique, whereas for the dynamic one it is advisable to set n add < n ent < 10 ( Figure 9d). A significant post-filtration with 1.E-3 ≤ τ filt ≤ 1.E-2 proved to be effective to accelerate, or even to allow for, convergence (see for instance Tables 4 and 7). Post-filtration on H, rather than on S , should be preferred.
Finally, Table 3 and Figure 10 showed that the preconditioner set-up for steady-state conditions plays the role of a worst-case scenario. In particular, the smaller ∆t, the better the accuracy of S (equation (21)). The reason for this behavior comes from the structure of A pp (equation (A.4)), where the diagonal entries depend on the inverse of ∆t. Therefore, when ∆t is small, they tend to prevail over the other contributions in S , and A pp becomes diagonally dominant. This observation suggests a different set-up strategy of the preconditioner, because building F and G with the optimal settings at steady state might be too conservative. The two decoupling factors can be approximated more than once during a full-transient simulation, since usually ∆t increases as steady state is approached, and cheaper approximations can be effective at the initial stages.

Conclusions
In this paper, we introduce a novel preconditioning technique, denoted as EDFA, for the solution of the sequence of non-symmetric block linear systems arising from the original MHFE-FV discretization of flow problems in porous media developed in [20]. The proposed method is based on the approximation of the decoupling factors of the system matrix by using appropriate restriction operators for the sake of the Schur complement computation. The experimental phase proved its robustness and reliability in different settings, depending on the structure of the grid and the properties of the hydraulic conductivity tensor. The EDFA preconditioning strategy exhibits several attractive features, in particular: 1. Since the process for building the preconditioner is based on the solution of the set of independent MRHS systems (25), the set-up stage is inherently parallel and can fully exploit the architecture of modern computing platforms; 2. The overall set-up stage can be split in a two-step procedure, where the first stage is performed only at the beginning of a full-transient simulation and the second stage at each time step; 3. The largest computational cost, associated with the approximation of G and F, is concentrated in the first stage, so it can be effectively amortized during a full simulation.
Both the static and dynamic variants proved to be overall efficient. A winner does not stand out clearly, even though it might be better to rely on the dynamic technique when the flux distribution is highly variable throughout the domain and it is hard to define a uniform static pattern prototype. All the same, it is appreciable the relatively easy setting up and cheapness of the static variant and the flexibility of the dynamic. Research is currently ongoing to develop a fully parallel C++ implementation of the proposed solver and extend the formulation to multi-phase MHFE-FV reservoir models.

Acknowledgment
This publication was supported by the National Priorities Research Program grant NPRP10-0208-170407 from Qatar National Research Fund. g : k → p = g(k) be the function that converts the global matrix index k into the local one p. The expressions for the A sub-blocks read: