An approximate block factorization preconditioner for mixed-dimensional beam-solid interaction

This paper presents a scalable physics-based block preconditioner for mixed-dimensional models in beam-solid interaction and their application in engineering. In particular, it studies the linear systems arising from a regularized mortar-type approach for embedding geometrically exact beams into solid continua. Due to the lack of block diagonal dominance of the arising 2 x 2 block system, an approximate block factorization preconditioner is used. It exploits the sparsity structure of the beam sub-block to construct a sparse approximate inverse, which is then not only used to explicitly form an approximation of the Schur complement, but also acts as a smoother within the prediction step of the arising SIMPLE-type preconditioner. The correction step utilizes an algebraic multigrid method. Although, for now, the beam sub-block is tackled by a one-level method only, the multi-level nature of the computationally demanding correction step delivers a scalable preconditioner in practice. In numerical test cases, the influence of different algorithmic parameters on the quality of the sparse approximate inverse is studied and the weak scaling behavior of the proposed preconditioner on up to 1000 MPI ranks is demonstrated, before the proposed preconditioner is finally applied for the analysis of steel-reinforced concrete structures in civil engineering.


Introduction
Originating from evolution in nature or human design processes, thin fibers embedded into solid continua can enhance the constitutive or functional properties of systems in science, engineering, and biomedicine.Applications can be found in different fields: In civil engineering for example, steel-reinforced concrete is used to amplify the load bearing capacity of concrete structures such as bridges.In aerospace engineering, fiber-reinforced composite materials are often used due to their unique combination of a high stiffness, but low specific weight.In biological tissues, collagen fibers are distributed throughout the arterial walls of the circulatory system.For all these application areas, finite element simulations can provide Figure 1: Spectrum of modeling techniques for fibers embedded into three-dimensional solids [55] detailed insight in the system's behavior and potentially assist in improving or optimizing the system's design.While various mathematical models of fiber-enhanced continua are available in literature, their efficient solution on parallel computing clusters has not been studied in detail.To this end, this paper sets out to develop an efficient and scalable multi-level block preconditioning framework for a penalty-regularized mixed-dimensional approach recently proposed by Steinbrecher et al. [55,57].Figure 1 illustrates the range of modeling techniques for fully embedded fibers in solid bulk volumes.On the one hand, homogenized formulations as depicted in Figure 1(a) incorporate all fiber information into the bulk constitutive law, usually leading to anisotropic formulations with preferential directions along the fiber orientation [3,49].On the other end of the spectrum, both bulk field and fibers are resolved as three-dimensional (3D) continua, cf. Figure 1(c), which allows the reuse of existing constitutive models and finite element technology from classical 3D computational solid dynamics.This approach enables the analysis of very detailed micro-mechanical features of individual fibers and the incorporation of advanced physical effects at the fiber-solid interface, though it comes at significant computational cost.Finally, to unify the high model quality of fully resolved models with the efficiency of homogenized models, fibers can be represented by dimensionally reduced structural models such as beams or trusses, which are embedded at arbitrary positions into a 3D solid domain (cf. Figure 1(b)).In such mixed-dimensional 1D/3D models, the bulk field still portrays the same effects as in the fully resolved 3D model, but fibers are now reduced to a computationally efficient one-dimensional (1D) representation, making such models good candidates to study fiber-enhanced continua at large scale.
For the enforcement of coupling conditions in mixed-dimensional models, embedded mesh techniques are required.Although the imposition of constraints through Lagrange multipliers is well established in computational solid mechanics for both boundary-fitted meshes [46,20] and embedded meshes [6,24] as well as in computational contact mechanics [47,48,45,62], the construction of stable Lagrange multiplier spaces for mixed-dimensional beam-solid coupling still poses an open research question.Consequently, mixed-dimensional models for solid problems so far either use a penalty regularization to enforce the fibersolid coupling constraints [17,55,57,54,27], employ a variationally consistent overlapping domain decomposition approach [28], or directly link embedded fibers to the surrounding volume discretization via the extended finite element method (XFEM) [32,4].Similarly, mixed-dimensional 1D/3D models are also available for other types of physics, e.g., in fluid-beam interaction (FBI) [23,22,33] to counteract 3D/3D fluid-solid interaction (FSI) models such as [37,38].
Naturally, computational benefits of mixed-dimensional 1D/3D models are expected, since beam models require much fewer degrees of freedom (DOFs) than solid models to represent the embedded fibers.In [22], the mixed-dimensional approach reduces the number of DOFs for fiber modeling by 96% while keeping the L 2 error of the bulk field below 1.5%.Yet, the solution process of mixed-dimensional models at large scale is not widely studied in literature, but will be tackled in this contribution.In particular, when using Krylov methods to iteratively solve the arising linear systems, suitable preconditioners are required to sufficiently improve the spectral properties of the linear system [51].A few approaches can be found in literature: Block diagonal preconditioners for saddle point systems arising from 1D/2D coupling are investigated in [30].A simplified model problem of 1D/3D coupling yielding a 3 × 3 saddle point system is studied in [29], where a fractional Laplacian is used to approximate the Schur complement and a block diagonal preconditioner is employed for the coupled problem.An additive multigrid preconditioner for the arising fractional Laplacian is proposed in [5].With mixed-dimensional 1D/3D couplings as one application area of interface-driven multi-physics problems, uniform convergence, parameter robustness and scalability have been achieved through a suitable subspace splitting and custom smoothers in [9].Based on the framework of operator preconditioning [35], robust preconditioners have been proposed for 1D/3D domains coupled with Lagrange multipliers for applications in micro-circulation [15].For 2 × 2 systems, the use of Ruge-Stüben algebraic multigrid (AMG) methods to solve a Schur complement equation for the 3D bulk domain, while tackling the 1D domain by a direct solver, is briefly discussed in [10].For penalty-based 1D/3D models and in particular for beam models serving as 1D models, preconditioners are yet to be developed.
In this paper, we will focus on our prior work on a regularized mortar-type embedding of geometrically exact nonlinear beams into 3D solids [55,57] and will devise scalable preconditioners for the arising systems of linear equations.We will study key properties of the linear systems, in particular their loss of both diagonal dominance and block diagonal dominance due to the penalty contributions, and design a preconditioner that is mostly agnostic to these challenges.To this end, we will interpret the system matrix containing solid and beam contributions as a 2 × 2 block matrix and employ approximate block factorizations to arrive at a Block-LU preconditioner.For the approximate inversion of the beam sub-block of the coupled system matrix, we will construct a sparse approximate inverse (SPAI) [21], which will not only allow us to explicitly form an approximate Schur complement, but will also serve as a smoother within the application of the Block-LU preconditioner.The original SPAI algorithm will be equipped with filtering and static enrichment algorithms to amplify its performance and robustness.To achieve scalability on parallel computing clusters, we will tackle the Schur complement equation itself with AMG methods from Trilinos/MueLu [7].We will then study the computational performance and demonstrate weak scalability, robustness under changes of physical parameters, and applicability to practical use cases in a series of numerical experiments and investigate savings in wall clock time due to the reuse of the preconditioner throughout the entire load step.In sum, this contribution builds upon our prior work [55,57] and equips these models with scalable iterative solvers to facilitate their efficient application to large models and systems with thousands of embedded fibers.Furthermore, this contribution constitutes the first presentation of a scalable, preconditioned iterative solver for truly 1D/3D models applied to beam-solid coupling with weak scalability demonstrated on a distributed memory cluster using the message passing interface (MPI) on up to 1000 MPI ranks.
The remainder of this manuscript is organized as follows: Section 2 introduces the underlying mechanical problem of mixed-dimensional couplings of slender fibers embedded into solid continua and outlines the finite element discretization and the resulting linear systems.Relevant properties of these linear systems are then discussed in Section 3. The design of the preconditioner and its building blocks, in particular the SPAI algorithm, will be detailed in Section 4 along with a brief comparison to existing methods from literature.In Section 5, we will study the numerical properties of individual components of the preconditioner and assess its performance and scalability when applied to academic and engineering test cases.Section 6 will summarize our findings and hint at future research directions.

Mixed-dimensional modeling of fiber/solid systems
Since this manuscript concerns itself with preconditioner development for the mixed-dimensional modeling of the interaction of solid continua with slender fibers, we only give a brief introduction into the governing equations, discretization and coupling approach presented in [55,57].For a broader overview of the fundamentals of beam-solid interaction, the interested reader is referred to [58].

Pure solid problem
The 3D solid bodies considered in this work are modeled as hyperelastic Boltzmann continua.The weak form of the equation of elastostatics describing the deformation of the solid body is given as with u S being the solid displacement field, S and E representing the second Piola-Kirchhoff stress tensor and the Green-Lagrange strain tensor, b denoting the body forces, t standing for the external traction field, and δ indicating virtual, but kinematically admissible quantities in line with the concept of virtual work.Furthermore, Ω S is the solid domain and Γ S is the Neumann boundary of the solid domain.In this work, all quantities denoted with an (•) S are associated with the solid continuum.For the spatial discretization of the solid domain, we employ displacement-based isoparametric finite elements interpolated by Lagrange polynomials resulting in the following discretized linear system to be solved in every iteration of the Newton solver.Therein, f S s represents the nonlinear residual vector and K S ss denotes its linearization, i.e., the solid stiffness matrix.The discrete displacement vector and its increment are given by d S and ∆d S , respectively.

Coupled beam-solid system with Simo-Reissner beam formulation
In the scope of this publication, Simo-Reissner (SR) and torsion-free Kirchhoff-Love (TF) beam theories are considered.With respect to the coupled beam-solid problem, the SR beam theory results in the most general coupling formulation.We first describe the coupled problem based on a SR beam formulation and defer the use of TF beam formulations to Section 2.3.
The general weak form for a 1D Cosserat continuum (applicable to SR and TF beam theory) is given by where δΠ B int,(•) denotes the variation of the internal elastic energy function and δW B ext is the external virtual work acting on the beam.All quantities denoted with the superscript (•) B are associated with beam contributions.The internal elastic energy for a SR beam, i.e., a shear-deformable beam with six local modes of deformation, reads Here, Γ represents the material deformation, Ω the material curvature, and C F as well as C M are the respective constitutive matrices for the translational and rotational deformation modes.We refer the interested reader to [40,41] for more details on the SR beam theory and its finite element discretization.The weak form of the beam is defined on the undeformed 1D centerline domain Ω B .The spatial interpolation of the beam finite elements is tailored to the particular beam model in order to ensure objectivity of the discrete formulation.The interpolation of the beam centerline position employs C 1 -continuous thirdorder Hermite shape functions [41].An objective interpolation of the finite cross section orientations is a non-trivial task and results in a nonlinear and deformation-dependent interpolation strategy.For a more detailed description of this topic, the reader is referred to [41] and the references therein.The resulting linearized system of a pure SR beam problem reads To clarify the following equations and simplifications in the TF case, the global discrete beam centerline degrees of freedom d B are gathered together as are the global beam orientational degrees of freedom θ B .Therefore, the beam stiffness matrix is written as a 2 × 2 block system with the individual blocks K B ab with indices a, b ∈ {r, θ}.Similarly, the residual is split into positional and rotational contributions f B r and f B θ , respectively.
To fully embed the beam in the solid matrix, the following six local coupling constraints are defined: Here, r B is the beam centerline position vector, x S is the solid position vector, and ψ BS denotes the (pseudo-)rotation vector describing the relative rotation between a suitable orthonormal triad field in the solid domain and the beam cross section orientation.The constraints given in (1) are referred to as the positional coupling constraints, since they enforce the position of the beam cross section centroid to be coupled to the underlying solid.In a similar manner, ( 2) is referred to as rotational coupling since a vanishing relative rotation enforces the beam cross section orientation to be coupled to the solid domain.For a more elaborate discussion on the coupling constraints, the interested reader is referred to [55,57].The coupling constraints (1) and ( 2) are enforced via a Lagrange multiplier method.The total Lagrange multiplier potential reads where λ V and λ R are the Lagrange multiplier fields introduced to enforce the positional and rotational coupling, respectively.Accordingly, quantities denoted with (•) V and (•) R refer to positional and rotational coupling, respectively.For the spatial interpolation of the Lagrange multiplier fields, we resort to the mortar-type approach and software implementation of [55,57].There it is shown that a linear interpolation of the Lagrange multiplier field with a subsequent penalty regularization of the saddle point system results in a stable coupling formulation for our envisioned application range.The resulting global discrete Lagrange multiplier vectors for positional and rotational coupling are denoted with λ V and λ R , respectively.Variation of the coupling potential (3) and insertion of the spatially discretized quantities results in the weak form of the coupling formulation.With that we can now state the global discretized residual vector for the coupled problem: Here, g V and g R are the constraint residual vectors for positional and rotational coupling, respectively.Moreover, f (•) mt,(•) are the discrete coupling force contributions required to enforce the action of the Lagrange multipliers on the beam and solid degrees of freedom.We resort to a penalty regularization with the penalty parameters ϵ V > 0 and ϵ R > 0 to express the Lagrange multiplier unknowns λ V and λ R in terms of beam and solid unknowns, i.e., Therein, the diagonal matrices V V and V R are scaling matrices to scale the regularized equations in order to pass patch-test like problems, cf.[55].With the penalty regularization, the Lagrange multipliers are no longer unknowns.Thus, we can state the final linearized system to be solved in every Newton iteration: Here, D and M are the so called mortar matrices for positional coupling, which only depend on the reference configuration, i.e., they are constant.The matrices Q ab with a, b ∈ {s, θ, λ} are the coupling matrices for rotational coupling, which depend on the current configuration.We refer to the original publications [55,57] for details of the linearization procedure.

Coupled beam-solid system with torsion-free (TF) beam formulation
The torsion-free Kirchhoff-Love (TF) beam formulation (see [40,41]) represents a special case of the Simo-Reissner beam theory, where the assumptions of vanishing shear and torsion deformations are incorporated in the beam model and the resulting finite element formulation can be described solely by displacement degrees of freedom.The assumptions are valid for fibers with high slenderness ratios, a double symmetric cross section and a straight centerline in the reference configuration.For such fibers, the TF beam formulation results in an even more efficient numerical model than the SR formulation presented in the previous section.The internal energy of a TF beam reads with E denoting the Young's modulus, A the cross section area, I the moment of inertia, ε the axial tension and κ the scalar curvature, respectively.The TF beam formulation requires a C 1 -continuous centerline interpolation of centerline positions r, which is realized with third-order Hermite polynomials [40].
As the only field of unknowns along the TF beam centerline is a translation field, the total Lagrange multiplier potential of the coupling constraints in (3) simplifies to i.e., there is no rotational coupling for TF beams.The same penalty regularization as stated in (4) is employed, resulting in the following global linearized system : Again, for a more detailed information on the derivations, the interested reader is referred to [55].It should be noted, that the coupling contributions in (7), i.e., D and M, are constant due to the coupling taking place in the reference configuration.

Characteristics of linear systems arising in beam-solid coupling
To be able to construct efficient algebraic block preconditioning techniques to accelerate the convergence of the outer Krylov solver, certain matrix properties of the underlying linear systems Ax = b specified in ( 6) and ( 7) are of particular interest.A short explanation regarding conditioning, block diagonal dominance, symmetry and sparsity pattern is given below.For the sake of simplicity, the block systems in ( 6) and ( 7) are both abbreviated with the compact notation throughout the remainder of this manuscript, where we have grouped the unknowns based on their physical meaning, i.e., being associated with the beams or the background solid.The concrete identification of individual matrix blocks in (8) with the linear systems from Sections 2.2 and 2.3 depends on the employed beam theory.In the case of a SR beam theory, the individual blocks are defined such that (8) represents (6) with b B = (∆(d B ) T , (∆θ B ) T ) T and b S = ∆d S .For the TF case, (8) represents (7) with b B = ∆d B and b S = ∆d S .Generally speaking, A denotes the matrix block containing the beam stiffness matrices K B ab as well as stiffness and penalty contributions of the coupling constraints.In similar fashion, the sub-block C refers to the sum of the solid stiffness matrix K S ab and the respective interaction terms.The off-diagonal sub-blocks B T 1 and B 2 represent the remaining coupling terms between both fields, respectively.

Ill-conditioning due to penalty regularization
Due to the discretization and coupling approach introduced in Section 2, the linear system of equations suffers from ill-conditioning, which directly originates from the penalty parameters ϵ V and ϵ R , that are steering the strength of the interaction between solid and fibers.Naturally, larger values of ϵ V and ϵ R lead to a more accurate constraint enforcement, however enlarge the eigenvalue spectrum of the matrix and, thus, worsen the conditioning problems.To show this exemplarily, a small eigenvalue study is done for test cases I and IV introduced later in Section 5.1 with all parameters fixed despite ϵ V and ϵ R as shown in Table 1.For both beam models, an increase of the penalty parameters results in an increasing maximum eigenvalue λ max of the overall system, while the minimum eigenvalue λ min remains constant.This results in growing condition number estimates given by the well-established definitions λ max /λ min and ∥A∥ 1 A −1 1 .Despite the bad conditioning, sub-block A is still nonsingular, being an essential requirement for factorizations of the block matrix A.

Loss of block diagonal dominance
Independent of the actual choice of the beam model within beam-solid interaction, the matrices in the arising linear systems in ( 6) and ( 7) exhibit a 2 × 2 block structure based on a physically motivated grouping of unknowns into beam unknowns x B and solid unknowns x S , respectively.This becomes particularly evident in the unified notation of (8).For a closer look at the arising matrices, we adopt the concept of block diagonal dominance for block matrices from [19]: be a square block matrix with N R block rows and block columns, respectively.With a given matrix norm ∥(•)∥, we assume H to only contain nonsingular matrix blocks H ιι on its main diagonal, i.e., det In general, the matrices in ( 6) and ( 7) do not satisfy the conditions for block diagonal dominance as outlined in Definition 3.1.For illustration purposes, we consider the mixed-dimensional modeling approach from Section 2 in the practical case of fiber-reinforced solids, where the stiffness of the fibers is much higher than the stiffness of the embedding solid, i.e., E B ≫ E S .To this end, we assume a fixed geometry and mesh, constant material parameters, and fiber and solid constitutive properties satisfying E B ≫ E S .Since the projection operators D and M solely depend on the mesh, the only variable parameters left are ϵ V ≫ 0 and ϵ R ≫ 0. With increasing penalty parameters ϵ V and ϵ R , the norm of the off-diagonal matrix blocks increases, too.In addition, the inversion of the diagonal matrix blocks results in denser matrices with a rapid growth of the norm and, thus, decreasing values on the right-hand side of inequality (9).For positional coupling, the block diagonal dominance property of the matrix becomes harder to achieve with an increasing penalty parameter ϵ V , especially for ϵ V ≈ E B as recommended for practical computations.The same holds true for the rotational coupling contributions for the recommended choice ϵ R ≈ E B R 2 with R being the radius of the beam along the centerline (see [54]).To support this argument, we assess the property of block diagonal dominance for the first block row in (8), i.e., specifically the contribution of A, by anticipating a small numerical example, in particular test case I introduced later in Section 5.1.Therein, the off-diagonal matrix block B 1 exhibits a norm ∥B 1 ∥ = 1.0153, while the main diagonal block A's contribution evaluates to A −1 −1 = 3.9202 • 10 −8 , hence violating the condition outlined in (9).
Due to this lack of block diagonal dominance, conventional block preconditioning methods based on block Jacobi or block Gauß-Seidel schemes are not applicable without major convergence problems (or even divergence) as already evidenced in [11,14].Independently, the individual matrix blocks on the main diagonal lose their diagonal dominance as well due to the penalty contributions.Hence, conventional relaxation-based smoothers cannot be applied on individual blocks either.

Potential loss of symmetry
The symmetry of A is governed by the beam formulation at hand: TF beam models always result in a symmetric A, while all other beam models yield a non-symmetric beam sub-block A. For pure positional coupling as applied for TF beam models, the off-diagonal matrix block are symmetric, i.e., B 1 = B 2 .In contrast, the additional coupling of rotational degrees of freedom for SR beams introduces non-symmetric off-diagonal terms.

Sparsity pattern
Of special interest is the matrix graph J (A) of the sub-block A related to the beam problem.Since we restrict ourselves so far to cases where the embedded fibers do not interact with each other but only with the surrounding solid, the matrix A features a block diagonal sparsity structure, where the size of the small blocks depends on the number of beam finite elements used to discretize each individual fiber.We will illustrate and study the sparsity pattern J (A) for different test cases in Section 5.1, in particular in Figure 4.

Block preconditioning for beam-solid interaction
The construction of a preconditioner, that captures the coupling interactions properly and is tailored to the specific matrix properties, is crucial for an efficient and scalable solution process.Preconditioners based on approximate block factorizations have been shown to be suited for similar problem types such as contact problems [62], incompressible flow [42,18], FSI [31,26], or magneto-hydro dynamics [11,43,44].

Preconditioning based on a block factorization of the system matrix
Due to the lack of block diagonal dominance and the ill-conditioning of the matrix discussed in Section 3, we resort to a block factorization preconditioner.Specifically, we perform a block factorization into a lower triangular matrix L, a diagonal matrix D, and an upper triangular matrix U.The LDU decomposition of the system matrix reads To this end, the preconditioning matrix P based on the above factorization is given by Expressing the application of the preconditioner as a fixed-point iteration over the index k, one application of the preconditioner yields It is usually executed via a predictor-corrector scheme by first solving an equation related to the beam contribution, afterwards one related to the Schur complement, and lastly a correction step to the beam solution.To this end, after having to form an explicit representation of the Schur complement, a total of three linear systems have to be solved in every application of the preconditioner.The overall algorithm is given as Algorithm 1.
Since an exact version of a block factorization preconditioner is hard to achieve due to its immense computational cost, the following sections are devoted to the construction of an approximate block factorization preconditioner and its application to mixed-dimensional beam-solid interaction.
Algorithm 1: Full block factorization preconditioner for fiber-solid coupling Procedure Preconditioner(k max ) // Form explicit Schur complement

Explicit sparse inverse approximation
The first major step of the preconditioner calculation consists of finding an explicit approximation of the Schur complement S := C − B 2 A −1 B T 1 with A denoting an easy-to-invert approximation of A. Both quality and computational cost of the preconditioner are mainly governed by the choice A and S to approximate A and S, respectively.In traditional Schur complement based block preconditioners, the inverse A −1 ≈ A −1 appearing due to the block factorization and also in the Schur complement calculation itself is often approximated by some diagonal matrix, since the inversion of a diagonal matrix comes at a negligible computational cost.The most simple approach is to base the inverse approximation on the diagonal part of the M × M matrix A resulting in A = diag (a ii ) , i = 1, . . ., M. Another well-known approach takes the row sums of A [60], reading However, such simple diagonal approximations cannot be used in the present scenario since A lacks diagonal dominance, cf.Section 3.2.Since A resembles the sub-block related to the beam equations, which themselves yield a block diagonal structure of A, a more sophisticated explicit approximation scheme of the inverse can be applied taking this particular sparsity structure into account.
Although in general the inverse of the sparse matrix A cannot be expected to be sparse as well, explicit SPAIs aim at creating an explicit matrix representation A * of the approximation of the exact inverse A −1 , that itself is still a sparse matrix.Ideally, nnz (A * ), the number of non-zeros of A * , does not exceed nnz (A), since a matrix-vector product with A * must be performed at each Krylov iteration [21].Although incomplete LU factorizations fall into that category, they require considerable effort to parallelize [13].
To take advantage of the block diagonal structure of A, we pursue a fully parallelizable approach to construct an explicit sparse approximate inverse A * of the matrix A based on the minimization of the Frobenius norm of the residual matrix AA * − I, see [21,53].By choosing an appropriate sparsity pattern J (A * ) for the SPAI A * from the set J describing all known patterns, the following least-squares problem needs to be solved: min We represent the minimization procedure to compute A * by the operation A * ← M (A, J (A * )).To this end, (10) requires to select an appropriate sparsity pattern J (A * ) and a practical approach to the minimization of the Frobenius norm in a distributed memory environment, which we will address in Sections 4.2.1 and 4.2.2, respectively.

Selection of a sparsity pattern for the sparse approximate inverse calculation
The main challenge in (10) is the selection of a sparsity pattern J (A * ) to be used as input into the minimization procedure.An appropriate pattern needs to contain enough information of the inverse by retaining high values, but should also act as a filter to remove small entries in order to reduce fill-in.Straightforward approaches are based on a static sparsity pattern selection and include the choices J (A * ) := J (A) and J (A * ) := J A T , which are easy to obtain, but do not guarantee a good approximation quality.Especially in cases with a partially known sparsity structure of the inverse, e.g., for block-diagonal matrices such as A in (8), more advanced static selections are able to deliver a satisfying approximation, yet not requiring dynamic pattern selection approaches as proposed in [21].
In this work, we follow the static pattern selection proposed in [12] and use powers of a sparsified version J (A) of the graph J (A) of the original matrix A to obtain an enriched sparsity pattern to be used for the minimization in (10).First, J (A) is obtained through a thresholding of J (A) based on the entries in A and using a drop-off tolerance σ.We represent this thresholding by the filter operation delivering individual entries j (A) i j of the filtered graph via The additional Jacobi scaling in (11) simplifies the thresholding and choosing of σ if A is poorly scaled.In a second step, a refined sparsity pattern J A ℓ is calculated by taking powers ℓ of the sparsified graph J (A), reading The matrix A ℓ is never calculated explicitly, but the powers are directly computed on its sparsified graph J (A).

Evaluation of the Frobenius norm on a parallel computer
Due to the 2-norm compatibility of the Frobenius norm, the problem can be decoupled into a sum of Euclidian norms, reading min which can be solved independently for each row i = 1, . . ., M of A * .Since the matrix A is usually stored in a row-wise distribution, where each parallel process stores a subset of all rows of A, this decomposition renders the method, besides an initial communication step, inherently parallel.To this end, calculating a row of the SPAI means solving a small least-squares problem by applying a dense QR factorization. Pre-Filtering Post-Filtering

Practical algorithm
In practice, one usually does not solve (10) directly, but rather combines it with some pre-and postoperations, foremost the graph computation from Section 4.2.1 and the handling of the Frobenius norm outlined in Section 4.2.2.In addition, it appears beneficial to perform a post-filtering of A * by dropping all entries with a * i j < σ ∀i, j = 1, . . ., M to further reduce fill-in of A −1 and, thus, the cost of applying the preconditioner [12].The post-filtering will be denoted by the operator F A (A * , σ).
We summarize all the steps to compute the sparse approximate inverse A * in Figure 2. Therein, optional steps are marked in gray, while the mandatory computation of A * is highlighted in orange.User-given data to configure the individual steps is depicted in circles, while computed input and output data is put into rectangular boxes.The arrows indicate the flow of data between the individual steps.
Even though the method can be steered quite effectively by the refinement level ℓ and threshold tolerance σ, there are still several problems to avoid.The method can still produce rather dense matrices with poor approximation quality.Depending on the threshold parameter σ, an aggressive dropping of values might result in a loss of information, which again results in a poor result and the approximation of the inverse might even be singular.

One-level approach for the predictor and corrector step
As outlined in Algorithm 1, the first and last linear equation to be solved during the application of the block preconditioner resemble the predictor step and the corrector step respectively.Both use the beam matrix A to update the beam unknowns.Following the assumptions made in Section 3.2, traditional smoothing approaches are not applicable to approximate the solutions of the linear systems in ( 12) and ( 13) without a major loss in convergence properties.
Based on the explicit SPAI calculation for the approximation of the Schur complement, a rather good approximation A −1 is already available.Following [8], the sparse approximate inverse is reused as a smoother by applying the fixed-point iteration over index m to (12), reading and to (13), reading A comparison to more traditional smoothers, e.g., Jacobi and Gauß-Seidel methods, can also be found in the original publication [8].
Remark 4.1 (Challenges for multilevel methods).Expanding the solution procedure of the predictor and corrector steps to an AMG approach to solve the beam-related equation is challenging, as there are still several open questions around the construction of AMG hierarchies for beam models: The beams are represented as 1D elements, for which a suitable coarsening scheme to form aggregates for coarser multigrid (MG) levels is still part of active research.Even with appropriate coarsening, fibers discretized with only a few elements would quickly form single-node aggregates being suboptimal for the restrictor and prolongator construction as well as effectively stalling the coarsening process.The addition of rotational components into the beam equation introduces another difficulty, as nodes with a different number of DOFs exist, hindering the application of conventional aggregation strategies.To project and, thus, treat important error modes correctly on coarser levels, a proper near nullspace has to be constructed, which strongly depends on the underlying beam formulation.For the use cases presented in this manuscript, which mostly feature a small number of beam elements used to discretize a fiber, the SPAI smoother from ( 14) and ( 15) proved to be very competitive in terms of approximation quality of the inverse, computational efficiency, robustness, and weak scaling behavior, see Section 5.

Multilevel approach for the Schur complement step
The computationally most demanding part of the preconditioning algorithm involves the solution procedure for the Schur complement equation given as Due to the explicit sparse approximation of the inverse of A used to form the approximate Schur complement S := C − B 2 A −1 B T 1 , the resulting matrix S is rather dense compared to calculations using one of the diagonal approximation approaches for A given in Section 4.1.The inverse of the Schur complement is approximated by a standard aggregation-based AMG method.As level smoother, a one-level domain decomposition with overlap δ and an incomplete LU factorization with fill-in p and thresholding τ of small entries is applied, often abbreviated as ILUT [50].To accelerate convergence, a smoothing of the tentative prolongation operator basis functions can be done, also known as smoothed-aggregation algebraic multigrid (SA-AMG) [61].This however leads to a higher fill-in of the coarse system matrix representations increasing the computational cost of the preconditioner setup.Especially the Galerkin product for the calculation of the coarse level operator and the incomplete LU factorization smoother are negatively influenced by the additional fill-in.Furthermore, the operator smoothing is classically based on a Jacobi method, which heavily relies on diagonal dominance.Thus, in some cases it makes sense to skip the prolongator smoothing and take advantage of the robustness of plain-aggregation algebraic multigrid (PA-AMG) by applying aggregate-wise constant basis functions [59].Caused by the additional mixed-dimensional coupling terms in the bulk field's matrix, a similar increase in the fill-in of the coarse level operators of a Ruge-Stüben AMG method has been observed in [10].

Approximate block factorization preconditioner for beam-solid interaction
The components described in Sections 4.1 -4.4 are now put together to tailor the block preconditioner from Algorithm 1 to the specifics of mixed-dimensional beam-solid interaction, yielding the final preconditioner summarized in Algorithm 2. In a pre-computation step, an explicit representation A −1 of the SPAI Algorithm 2: Approximate block factorization preconditioner for fiber-solid coupling x S,k // Prediction step: solve for x B,k+ 1 2 with SPAI smoother for m = 1 . . .m max do x B,k+ 1 2 ,m+1 = x B,k+ 1 2 ,m + A −1 r B,k end // Schur complement step: solve for x S,k+1 with AMG Sx S,k+1 = r S,k − B 2 x B,k+ 1 2 // Correction step: solve for x B,k+1 with SPAI smoother for m = 1 . . .m max do x B,k+1,m+1 = x B,k+1,m + A −1 r B,k − B T 1 x S,k+1 end end return x B,k max x S,k max of the beam matrix A is formed and also used to compute an approximation S to the Schur complement.
Based on the sweep index k of the preconditioner, the main computation loop consists of the three steps: First, we predict the beam unknowns x B,k+ 1 2 by using the SPAI as a smoother.Then, we solve the Schur complement equation for the solid unknowns x S,k+1 using an AMG method.Finally, we again use the SPAI as a smoother to correct the beam unknowns to their final values x B,k+1 .
In terms of computational effort, the computation of the SPAI as outlined in Section 4.2 comes at a certain cost, however is perfectly parallelizable and is used at three steps in Algorithm 2: once in the approximation of the Schur complement and twice to update the beam solution in the predictor and corrector step.A single application of SPAI as a smoother for the predictor or corrector step boils down to a sparse matrix-vector multiplication.

Comparison to existing methods in literature
Having discussed all the details of our proposed preconditioner, we now want to outline its commonalities and differences compared to preconditioners available in the literature.In particular, we will discuss the work in [30,29,5,10,15].
In contrast to our work, the preconditioners from [30,29,5,15] are all tailored to saddle point systems.In the construction of a block diagonal preconditioner, they use only the diagonal part of an LDU factorization of the original matrix.In [29,5,15], the arising Schur complement is approximated by a spectrally equivalent fractional Laplacian.AMG is used to tackle matrices arising from the 3D bulk field, while the embedded 1D domains are always handled by a direct solver.
The case of 2 × 2 systems as they arise for example from a penalty regularization is covered in [10].Due to the low cost of inverting the matrix of the 1D domain, they use an exact factorization to represent the Schur complement.
Similar to [10], we base the construction of the Schur complement on the original block matrix, however avoid the exact inversion of the beam sub-block A and rather use its SPAI A −1 .In line with all available preconditioners, the matrix block associated with the 3D bulk discretization is tackled by an AMG method.In our work though, the matrix arising from the 1D discretization is never explicitly inverted or factorized, but also treated in an approximate fashion (i.e., using the SPAI as a smoother, cf.Section 4.3) to facilitate large numbers of embedded fibers as well as finely resolved fiber discretizations.Regarding the underlying physical problems and applications, existing work is concerned with transport problems on the 1D domain, while our work is the first preconditioner tailored to multi-dimensional partial differential equations (PDEs) on the 1D domain, in particular geometrically exact beam models with up to nine DOFs per mesh node depending on the actual beam model at hand.

Numerical experiments
We present numerical examples to illustrate the influence of the different algorithmic parameters of the SPAI computation proposed in Section 4.2, to study and demonstrate the weak scaling behavior of the proposed preconditioner, to investigate the robustness of the proposed method regarding material parameters and geometric properties and to showcase its applicability to practical problems in civil engineering.All computations are done with our in-house multi-physics code 4C [1], which is built upon the Trilinos project [25,2].All preconditioning operations are done through the multigrid package MueLu [7] and its dependencies within the Trilinos project.For the generation of the beam geometries, we rely on MeshPy [56].

Numerical study of the sparse approximate inverse calculation for the beam sub-block
A major component of a Schur complement based preconditioner is the approximation of the inverse appearing in the Schur complement calculation itself.In the presented approach, the combination of the drop-off tolerance σ of small values and the allowed fill-in (indirectly described by the refinement level ℓ) during the sparse approximate inverse calculation have a great influence on the quality of the sparse approximate inverse A * and, thus, on the convergence behavior of the block preconditioning method.In the following, we investigate test cases with different sparsity patterns of the sub-matrix A and study the impact of different choices of both σ and ℓ on the quality of the sparse approximate inverse as well as on the convergence behavior of the preconditioned linear solver.
To this end, we consider a simple 3D beam-solid interaction problem as shown in Figure 3: A solid cube with edge length l S = 1 m is filled with randomly placed straight fibers, such that the beam-solid volume ratio is V B /V S ≈ 0.2% and that the fibers do not stick out of the solid volume.The cube is clamped at its bottom surface and loaded with a constant tensile load of q = 1 N/m 2 at its top face.The solid is modeled by a St.-Venant-Kirchhoff material (Young's modulus E S = 1 N/m 2 , Poisson's ratio ν S = 0.3) and discretized by first-order hexahedral finite elements.The fibers are represented by either torsion-free Kirchhoff-Love beams (TF) or Simo-Reissner beam elements (SR) using the following parameters: Young's modulus E B = 10 N/m 2 , radius R = 0.005 m and length l B = 0.25 m.In addition, a Poisson's ratio of ν B = 0.0 is used in the q e 1 e 2 e 3 Figure 3: Geometry and setup for the numerical study of the sparse approximate inverse calculation: a solid cube with edge length l S = 1 m is randomly filled with fibers of the same length l B = 0.25 m, clamped at its bottom and loaded with a distributed external load q = 1 N/m 2 .cases with SR beam models.The coupling conditions are enforced with penalty parameters ϵ V = 10 N/m 2 and ϵ R = 10 Nm/m, if applicable.For the calculation of the sparse approximate inverse, the matrix block A describing the contribution of the fibers is of particular interest.Therefore, six test cases with different beam formulations and varying number of beam finite elements per fiber are set up to trigger different sparsity patterns J (A).In the test cases I and IV, the fibers are discretized by just one beam element, resulting in a block-diagonal matrix with fully populated sub-blocks.For test cases II and V, four beam elements are used per fiber, resulting in bigger and sparser sub-blocks.Test cases III and VI mix the other scenarios by randomly using between one and four beam elements per fiber.All test cases and their respective matrix sizes and number of non-zeros of J (A) are summarized in Table 2.The resulting sparsity patterns are illustrated exemplarily for the test cases I-III in Figure 4.
The overall simulation is of quasi-static nature and imposes the total load over the course of two load steps, which is sufficient for investigating the key features of the linear solver like the iteration count and setup/solve timings.The nonlinear solver converges, if the nonlinear residual ∥f∥ 2 drops below 10 −6 and if the full displacement increment ∥∆d∥ 2 is smaller than 10 −8 .In each nonlinear iteration, a linear system is solved using a preconditioned GMRES method [52].The preconditioner is configured as follows: the number of sweeps through the preconditioner is set to k = 1, and the number of iterations for the SPAI smoother is chosen as m = 1.The Schur complement equation is solved with a SA-AMG scheme with an ILUT smoother with overlap δ = 1, fill-in level p = 1 and drop-off tolerance τ = 10 −4 .The linear solver is assumed to be converged if the relative residual ∥r n ∥ 2 / r 0 2 falls below 10 −8 .All simulations are done in serial on a single processor.
Using SPAIs, the convergence of the linear solver is tightly related to the parameters chosen for the SPAI calculation.In Table 3, different combinations of the drop-off tolerance σ and refinement level ℓ are given for each of the six test cases.Each value pair represents the largest possible drop-off value σ possible for a fixed refinement level ℓ, such that the linear solver still converges.For certain values of ℓ, no convergence could be a achieved at all for some test cases, even with a very small drop-off tolerance σ.In these situations, an appropriate value of σ is chosen, such that only explicit zero values are dropped from the matrix graph to still be able to make a fair comparison with the other test cases regarding the error norm and number of non-zero entries in the approximate inverse.In addition, the number of non-zero entries of the filtered graph J (A) used as starting point for the sparsity pattern construction as well as the number of non-zero entries of the graph of the inverse approximation J (A * ) are given.The number of non-zeros for the unfiltered graph, nnz (J (A)), is illustrated in Table 2 for comparison.The behavior of the iterative linear solver is assessed by the averaged number of iterations per nonlinear solver step and three timings concerning the preconditioner setup time T setup , the time spent for solving the linear system T solve , and the total time T total = T setup + T solve .The overall quality of the inverse approximation A * is quantified by the error norm relative to the exact inverse A −1 .
In test case I, the dense sub-blocks of the block-diagonal sparsity pattern of A already lead to the exact graph of the inverse, which makes the approximation rather simple.For the given parameter combinations the upper bound of number of non-zeros to exactly compute A −1 is quickly reached resulting in low iteration counts and error norms.For the given problem, the setup timings are nearly identical, with only the solver timings for the first parameter combination taking a bit longer due to a not fully populated sparsity pattern.
The second test case is based on rather sparse sub-blocks of bigger size compared to test case I. Without a pattern refinement, the linear solver did not converge as the approximation quality of the inverse is not sufficient.For higher refinement levels, the number of non-zeros in J (A * ) quickly increases resulting in convergence of the method with still acceptable timings.Yet, there is not a lot of flexibility in choosing the drop-off tolerance σ to still retain convergence.
As test case III is a combination of the first two problems, the resulting behavior is a mix of these.Using the initial sparsity graph for the inverse approximation results in 20 linear solver iterations until convergence, which explains the high solving time T solve .On the other hand, the setup time T setup is comparable to the other tests.For increased refinement levels ℓ, a similar behavior as in test cases I and II is observed.
For test case IV, the beam elements are switched to a Simo-Reissner formulation, which contributes additional rotational degrees of freedom into A. In contrast to test case I, this results in already sparse sub-blocks for using one beam element per fiber.Therefore, using just the graph J (A) results in a poor approximation of the sparsity pattern of the inverse and, thus, leads to no convergence.Using higher refinement levels ℓ to enrich the input graph for the SPAI computation quickly heals this problem and even allows one to work with a more aggressive dropping scheme, i.e., using larger values for σ.
Test cases V and VI show a similar behavior and only converge with ℓ = 3.The additional beam elements used per fiber increase the block size of each sub-block and thus leave more room for possible sparse approximations for the inverse.The static approach for choosing an appropriate sparsity pattern for the inverse presented in Section 4.2.3 is still able to produce good approximations, thus leading to convergence of the linear solver, but only for rather dense representations of A * .
In conclusion, a robust parameter combination is highly problem dependent and necessary to achieve convergence of the linear solver.In the presented cases, a refinement level of ℓ = 3 and a drop-off tolerance of σ = 10 −9 to get rid of small values polluting the sparsity pattern showed to be sufficient to enable convergence even for challenging cases.

Weak scaling behavior
To study the performance of the proposed block preconditioner also for large-scale examples and on parallel computing clusters, we now conduct a weak scaling study.The problem setup is similar to test case I from Section 5.1.To guarantee that large problems exhibit the same fiber distribution at least in parts of the domain, the meshes are setup as follows: We first create the geometry and mesh for the largest problem by placing a cube with edge length 10 m inside a cartesian frame of reference, such that the cube's center of mass coincides with the origin O and its edges are oriented along the cartesian axes.Then, the cube is filled with randomly positioned and oriented straight fibers with length l B = 0.25 m and radius R = 0.005 m.Only fibers, which are fully contained in the cube, are considered.This problem will be solved on 1000 MPI ranks.For smaller problems, the geometry is cut out of this initial cube.Also for the cut out problem, only fully contained fibers are considered.Figure 5 shows the intersections of the series of cubes with each coordinate plane spanned by basis vectors e ξ and e η , ξ, η ∈ {1, 2, 3}, ξ η.This process not only yields an almost constant beam-solid volume ratio for all problem sizes, but also guarantees that larger meshes are just extensions of the smaller meshes.The load per processor is kept constant at around 50k degrees of freedom.Meshing details are given in Table 4.
The boundary conditions are identical to test case I from Section 5.1, i.e., the bottom surface is clamped and the top surface loaded with a constant tensile load of q = 1 N/m 2 .The solid is again modeled by a St.-Venant-Kirchhoff material (Young's modulus E S = 1 N/m 2 , Poisson's ratio ν S = 0.3) and discretized with first-order hexahedral finite elements.The fibers are modeled using torsion-free Kirchhoff-Love beam elements with Young's modulus E B = 10 N/m 2 .The positional coupling condition is enforced with a penalty parameter of ϵ V = 10 N/m 2 .
The overall simulation is again of quasi-static nature and imposes the total load over the course of two load steps.The nonlinear solver converges, if the nonlinear residual ∥f∥ 2 drops below 10 −6 and the full displacement increment ∥∆d∥ 2 is smaller than 10 −8 .For the solution of the linear system arising in each nonlinear iteration, a preconditioned GMRES method is applied with the proposed preconditioner.Hereby, the parameters are set as follows: to increase the robustness of the smoothers in the parallel setting, the number of sweeps through the preconditioner is changed to k = 3 as well as the number of iterations for the SPAI smoother to m = 3 (in contrast to the parameter choice in Section 5.1).The SPAI computation for the beam sub-matrix A uses a drop tolerance σ = 10 −8 and a refinement level ℓ = 2 to enrich the sparsity pattern.This choice is inspired by the results of the test case I in Section 5.1.Due to the inherently parallel nature of the SPAI computation (cf.Section 4.2.2), it has a marginal influence on the weak scalability compared to the other components of the overall preconditioner, especially the AMG components.Therefore this part of the algorithm is kept constant without parameter variation.The Schur complement equation is solved with an aggregation-based AMG method.Coarsening is performed until the number of unknowns on the coarsest level drops below 6500.The AMG hierarchy is traversed using a V-cycle with level transfer operators arising from either PA-AMG or SA-AMG.On all but the coarsest level, the level smoother is chosen as ILUT with overlap δ = 1, fill-in level p = 1 and drop-off tolerance τ = 10 −4 .The coarse level is solved with a direct solver using the distributed memory version of SuperLU [34].The outer GMRES solver is assumed to be converged if the relative residual ∥r n ∥ 2 / r 0 2 falls below 10 −8 .The scaling study is run on our in-house cluster (16 nodes with 2x Intel Xeon Cascade Lake CPUs with 26 cores, 20 nodes with 2x Intel Xeon Skylake CPUs with 24 cores, 1312 cores in total, Mellanox Infiniband Interconnect).The overall weak scaling performance is quantified by the averaged number of iterations per nonlinear Newton iteration and the timings for setting up the preconditioner T setup , for solving the linear system T solve , and the total solver time T total = T setup + T solve .Since the setup of the preconditioner is expected to be expensive, we also examine the option of reusing the preconditioner throughout all Newton steps of a load step with the aim to reduce T setup and, thus, also reduce T total .
Figure 6 summarizes the results of the weak scaling study.We first discuss the case where the preconditioner is built in every Newton step.Looking at the iteration counts in Figure 6(a), the SA-AMG method delivers iteration counts independent of the problem size (with mostly 13 iterations per solve), while PA-AMG exhibits an increase in iterations by a factor of 2× (from 26 to 52 iterations) when increasing the problem size by 120×, i.e., from mesh ID 1 to mesh ID 9, cf.Table 4. Regarding the setup time T setup required to build the preconditioner shown in Figure 6(b), PA-AMG is more than twice as fast as SA-AMG due to the smaller support of PA-AMG interpolation functions and, thus, less fill-in in coarse level operators.In contrast, the time to solve the linear system is more than 2× smaller for SA-AMG due to the better approximation properties of smoothed interpolation functions in SA-AMG, cf. Figure 6(c).When looking at the combined time T total = T setup + T solve as shown in Figure 6(d), both types of transfer operators result in very similar timings.
With a moderate increase of T setup for PA-AMG for an increasing number of parallel processes, the increase in solver time T solve as well as total time T total of the PA-AMG scheme appears to be directly linked to the number of iterations required to achieve the desired tolerance of the iterative linear solver.In contrast, SA-AMG requires a rather constant number of iterations for all problem sizes and spends most of its time in the preconditioner setup, thus resulting in total solver timings T total that are dominated by the preconditioner setup time.This hints at potential savings when setting up the preconditioner once and then reusing it to solve multiple subsequent linear systems, e.g., through the course of a Newton scheme.
We now look at the option of building the preconditioner only in the first Newton step and then reusing it throughout an entire load step.In the present study, each load step requires two Newton steps, hence we expect to save 50% of T setup and hope to not worsen in terms of iteration numbers and solver time T solve .When looking at the setup time in Figure 6(b), setup costs for both SA-AMG and PA-AMG are reduced by ≈ 50% as expected.Furthermore, the iteration numbers as well as the solver time T solve stay nearly the same for both SA-AMG and PA-AMG, cf.Figures 6(a) and 6(c).Overall, the reuse of the preconditioner positively impacts the total solver time T total with the best option being SA-AMG with reuse of the preconditioner, which appears to be roughly 30% faster than the variant without reuse of the preconditioner.We note that the actual benefit of reusing the preconditioner depends on the number of nonlinear solver iterations per load step: The more Newton steps are required, the greater savings are to be expected from reusing the preconditioner.
In conclusion, this example shown weak scalability of the proposed preconditioner.The iteration numbers remain perfectly constant for SA-AMG.Due to the overlap δ = 1 of the ILUT smoother in the MG hierarchy, the setup time T setup increases by a factor of ≈ 5× when increasing the problem size by 120×, i.e., from mesh ID 1 to mesh ID 9, cf.Table 4.We stress that the choice of transfer operators has a great influence on the total weak scalability of the preconditioner.In the presented test cases, the best scalability in terms of the iteration count were obtained by SA-AMG transfer operators and when reusing the preconditioner though all Newton steps of a load step.In sum, SA-AMG appears as the method of choice to demonstrate weak scalability and keep the iteration count low.For more complex application scenarios however, it might be beneficial to fall back to PA-AMG due to its reduced fill-in during coarsening.Given the intricate nature of beam-solid applications and their arising systems of linear equations, we deem the present weak scaling behavior acceptable and adequate.

Robustness of the preconditioner under varying physical parameters
To assess the preconditioner's robustness, we now study its behavior w.r.t.iteration numbers of the linear solver under changes of critical physical parameters.Therefore, a composite plate with four fiber layers is considered, where the beam-solid stiffness ratio as well as the beam radius are varied.This allows to cover a wide range of possible parameter combinations for the coupled problem.
The geometrical setup is identical to the composite plate presented in our prior work [55] with a length of 2 m, a width of 1 m and a total thickness of t = 0.04 m, where two layers are oriented in 45 • and −45 • angles, respectively.The solid bulk domain is modeled as St.-Venant-Kirchhoff material with fixed constitutive properties (E S = 1 GPa, ν S = 0.3).The embedded fibers are modeled as torsion-free Kirchhoff-Love beams, where we vary the beam radius R ∈ {0.001 m, 0.002 m, 0.004 m, 0.008 m} and the beam's Young's modulus E B ∈ {2 GPa, 8 GPa, 32 GPa, 128 GPa, 256 GPa, 512 GPa}.The examined beam-solid stiffness ratios E B /E S span a wide array of practical applications, ranging from natural fiber composites with low ratios to steel-reinforced concrete and carbon fiber composites with high ratios.The penalty parameter is chosen as ϵ V = E B to properly enforce the positional coupling constraints.Considering boundary conditions, the left side of the plate is fixed and a distributed tensile load q is applied to the right side.The problem setup and the deformed configuration (exemplary for E B = 8 GPa and R = 0.004 m) are depicted in Figure 7.For a more detailed description of the problem, the reader is referred to [55].Since the overall compound plate stiffness changes with the different parameter combinations described above, the load q is adapted such that the axial deformations of the plate are the same for each parameter combination and match the example shown in [55].The overall simulation is of quasi-static nature and the load is applied incrementally over the course of 10 load steps.The nonlinear solver converges for all test cases if the nonlinear residual ∥f∥ 2 drops below 10 −6 and the displacement increment ∥∆d∥ 2 falls below 10 −6 .For the solution of the linear system arising in each nonlinear iteration, a preconditioned GMRES method is applied with the proposed preconditioner.Hereby, the parameters for the block method are similar to test case I from Section 5.1: We set the number of applications of the preconditioner per linear solver iteration to k = 1, build the preconditioner once per load step and then reuse it in every nonlinear iteration of this load step.Due to the rather long fibers appearing in this example, the settings for the SPAI smoother are changed to m = 3.In addition, the SPAI computation for the beam sub-matrix A uses a drop tolerance σ = 10 −8 and an increased refinement level ℓ = 4 to handle the larger individual beam sub-blocks properly.Similar to before, the Schur complement equation is solved with an aggregation-based AMG method.Coarsening is performed until the number of unknowns on the coarsest level drops below 6500.The AMG hierarchy is traversed using a V-cycle with level transfer operators arising from SA-AMG.On all but the coarsest level, the level smoother is chosen as ILUT with overlap δ = 1, an increased fill-in level p = 2.5 (again necessary due to the length of the fibers and their discretization with sometimes more than ten beam elements, resulting in a denser Schur complement matrix) and drop-off tolerance τ = 10 −4 .The coarse level is solved with a direct solver using the distributed memory version of SuperLU [34].The linear solver is assumed to be converged if the relative residual ∥r n ∥ 2 / r 0 2 falls below 10 −6 .All simulations are done in serial on a single processor.
Averaged iteration counts of the linear solver are shown in Figure 8.At global scope, the number of iterations appears to be independent of the stiffness ratio E B /E S as well as the geometric ratio R/t.This is particularly true for a larger beam radius, i.e., R/t = 1/5.While smaller beam radii are expected to be more challenging to handle, their impact on the solver performance is very limited: the increase in iterations is very small compared to the case of R/t = 1/5.Intermediate geometric ratios R/t = {1/10, 1/20} exhibit slight changes in the iteration number under an increasing stiffness ratio E B /E S , though neither outliers nor a trend towards increasing iteration numbers has been observed.Considering the smallest beam radius, i.e., R/t = 1/40, the iteration count remains rather constant for all but the largest stiffness ratio.Only the largest stiffness ratio E B /E S = 512 GPa in combination with the smallest radius results in an outliner in iteration numbers that is slightly above all other cases.The challenges of small radii for the iterative solvers have also been reported in [15].Overall, this study reveals a satisfying robustness of the proposed preconditioner in relevant application scenarios.

Application: Concrete wall with steel reinforcements
To show the applicability of the proposed block preconditioner to real-world problems, we study an example from civil engineering, in particular the loading of a steel-reinforced concrete wall.The problem setup and its dimensions are shown in Figure 9.
Since we are focusing on the performance of the linear solver and the proposed preconditioner, we restrict ourselves in this example to an idealized, fully elastic constitutive behavior of concrete.Therefore, it suffices to model the solid with a St.-Venant-Kirchhoff material (E S = 30 GPa, ν S = 0.3).In this study, we, thus, refrain from using more elaborate constitutive models, that also cover inelastic effects, such as the Drucker-Prager model [16], which could serve as a smeared, phenomenological model for concrete undergoing damage or crack initiation.Yet, we study a complex reinforcement design: The initially curved reinforcement bars are modeled as Simo-Reissner beams (E B = 210 GPa, ν B = 0.0) with a fiber cross section radius of R = 0.005 m.The bottom and left side of the wall are clamped, restricting the displacement of the solid and fibers.Additionally, a distributed load of −3 • 10 7 N/m 2 is applied on the top surface in e 2 direction and 1.5 • 10 5 N/m 2 on the back surface in e 3 direction.The reinforced concrete wall is discretized with first-order hexahedral finite elements for the solid domain and Simo-Reissner beam elements for the reinforcement fibers, respectively.We study this problem for three different mesh sizes (coarse, medium, fine), where each refinement quadruples the total number of unknowns.Details on these meshes including the resulting number of degrees of freedom for each field are given in Table 5 along with the computational resources, i.e., number of MPI ranks n proc , used for each mesh.The penalty parameters for positional and rotational couplings are set to ϵ V = 2.1 • 10 11 N/m 2 and ϵ R = 5.25 • 10 6 Nm/m, respectively.We perform a quasi-static simulation and impose the total load over the course of four load steps.
Exemplarily, Figure 10 depicts the sparsity pattern of the system matrix for the coarse mesh.The 2 × 2 blocking as introduced in (8) is highlighted through colors.The beam sub-block A (orange) internally exhibits block diagonal structure as illustrated in Figure 4, which allows for an efficient construction of the SPAI, since the sparsity pattern of the inverse can be estimated very well based on the block-diagonal structure of A. Arising from the penalty contributions assembled into C, the solid sub-block C (white) contains many entries far away from its diagonal, exemplifying the attested lack of diagonal dominance, cf.Section 3.2.Both beams and solid are connected through the coupling blocks B T 1 and B 2 (gray), which are the main reasons for the loss of block diagonal dominance, again see Section 3.2.For all simulations, the nonlinear solver converges if the nonlinear residual ∥f∥ 2 drops below 10 −8 and the displacement increment ∥∆d∥ 2 falls below 10 −8 .To get an idea of the effectiveness of the proposed preconditioner in an application scenario, we compare different methods to solve the arising linear system in each Newton step, in particular a direct solver (by applying the distributed memory version of SuperLU [34]), a naive approach represented by a GMRES solver with an incomplete LU (ILU) preconditioner, and a GMRES solver with the proposed block preconditioner from Section 4.Where applicable, we also study the effect of reusing the preconditioner over multiple invocations of the linear solver within a single load step to better amortize the potentially expensive setup of the block preconditioner.In case of GMRES, we assume convergence of the linear solver, if the full relative residual norm ∥r n ∥ 2 / r 0 2 drops to at least 10 −6 .
In case of the block preconditioner, we apply k = 3 sweeps of the proposed preconditioner within each GMRES iteration.The SPAI computation for the beam sub-matrix A uses a drop tolerance σ = 10 −8 and a refinement level ℓ = 4 to enrich the sparsity pattern.For the predictor and corrector step, the described SPAI smoother is applied with m = 3 sweeps.The Schur complement equation is tackled with a PA-AMG hierarchy with a maximum size of the coarse level problem of 6500 unknowns, which results in three MG levels for all meshes.Prolongator smoothing (as proposed by [61] and usually beneficial for problems in solid mechanics) is explicitly disabled to reduce the fill-in of the coarse level matrices; see [59] for a detailed comparison.An ILUT method with δ = 1, p = 2.5, and τ = 10 −4 is used as level smoother.The coarse level equations are solved directly using the distributed memory version of SuperLU [34].
The solver options and their iteration counts and timings are summarized in Table 5.Each of the four load steps requires five Newton iterations to reach convergence of the nonlinear solver.The reported iteration counts and timings have been averaged over all load steps and Newton steps, such that the numbers now give a good estimate for the cost of a single invocation of the linear solver.For the direct solver, the coarse and medium mesh could be solved, while the fine mesh was infeasible, i.e., one load step taking more than three days of wall clock time on a cluster, such that we do not report the final result.For the out-of-the-box iterative solver using a GMRES method preconditioned with an incomplete LU factorization, the iterative solver was not able to reach convergence within 1000 iterations.Only GMRES with the proposed block preconditioner from Section 4 was able to solve all three levels of mesh refinement.Moreover, the iterative method outperforms the direct solver in the sense that it is roughly 9× faster than the direct method for the coarse discretization.When moving to the medium-sized mesh, the discrepancy in solver timings increases even further: the direct method requires a total solver time T total = 5366.3s for a single solve, while the total solver time T total of the preconditioned GMRES only takes 163.5 s, resulting in a speed-up of approximately 33×.Increasing the problem size even further makes the application of a direct method infeasible, as memory consumption and computing time become prohibitively high, leaving the iterative approach as the sole viable option.In all working cases, the linear iterations for the preconditioned iterative method per Newton step remain almost constant over the whole simulation for each problem and in addition also stay almost constant for each mesh size.For the coarse and medium meshes, approximately 25 iterations are necessary to achieve the desired tolerance, whereas the large setup requires 22 iterations on average.Still, the discrepancy between T setup and T solve is rather large for the iterative method, as most of the computation time is spent in the construction of the factorization of the ILUT level smoother.To better amortize the expensive setup cost, one can build the preconditioner only once per load step and then reuse it for each Newton step with the goal of decreasing the overall simulation time.This shows to be an effective option, as it reduces the setup time T setup by a factor of 5× in each case, which is perfectly in line with using the preconditioner five times, but only building it once per load step.Due to the reuse, the preconditioner is not perfectly fitting the system matrix anymore, occasionally resulting in a slight increase in iteration numbers (#iter) and solver time T solve .Overall though, the total solver time T total is reduced, which also manifests itself in the speed-up factors of 33× and 112× in Table 5 for the coarse and medium mesh.The infeasibility of a direct solver for the fine mesh prevents the calculation of a speed-up factor (n/a), however clearly testifies to the beneficial impact of the proposed preconditioner on the solvability of large-scale application examples.For both rebuilding or reusing the preconditioner, the iteration counts appear to be independent of the mesh size also in this example from engineering practice.
Overall, this example demonstrates not only the applicability of the proposed preconditioner in engineering problems, but also its benefits in terms of efficiency and speed-up, ultimately enabling the analysis of large and complex fiber-solid systems, which have not been accessible with existing linear solvers so far.

Concluding remarks
In this paper, we have proposed a physics-based multi-level block preconditioner for the scalable solution of mixed-dimensional models in beam-solid interaction, specifically tailored to systems with many independent fibers being embedded into a solid domain.The regularized mortar-type coupling approach leads to 2 × 2 block systems exhibiting particular properties, most prominently the lack of block diagonal dominance stemming from the penalty terms on the off-diagonal coupling blocks, which render classical block relaxation preconditioners inapplicable.To precondition an outer Krylov solver, we utilize an approximate block factorization to enable the tackling of the individual blocks and their coupling within the preconditioner.To this end, we exploit the beam-related sub-block's sparsity structure resembling a block diagonal matrix to explicitly construct a sparse approximate inverse (SPAI) by solving a minimization problem over the Frobenius norm on a given sparsity pattern, which in practice is decomposed into row-wise minimizations to be solved in parallel.To increase its robustness and approximation quality, the SPAI computation is equipped with pre-processing steps such as a filtering of small entries and a static enrichment of the sparsity pattern as well as a post-filtering of small entries.This approximation has then not only been used for the explicit formation of an approximation to the Schur complement, but also as a smoother in the Block-LU's prediction and correction steps.To solve the Schur complement equation, we have employed an AMG hierarchy.Due to the Schur complement's fill-in stemming from the SPAI matrix as well as the penalty contributions, an ILUT factorization with fill-in p, threshold τ, and overlap δ serves as level smoother on all levels except for the direct solver on the coarse level.All building blocks of the proposed preconditioner have been implemented in Trilinos and are available as open-source software to the entire scientific community.
We have studied the influence of the SPAI algorithm's parameters and have found, that a static enrichment of the graph J (A) of at least ℓ = 2 greatly improves the quality of the approximation as well as the performance of the iterative solver, while more enrichment might be required for particularly challenging problems.Regarding the scaling behavior, we were able to demonstrate weak scalability up to 1000 MPI ranks.While the iteration count is completely independent of the problem size and number of MPI ranks for SA-AMG, the setup and solver time exhibit a minor increase with an increasing problem size.This is mainly attributed to the use of an ILUT level smoother within the Block-LU's Schur complement step.We have demonstrated the robustness of the preconditioner by showing that the iteration counts remain constant when changing critical physical parameters such as the stiffness ratio between fibers and bulk field or the fiber radii.Finally, we have investigated an application example from civil engineering, in particular a steelreinforced concrete wall, and have compared the performance of the proposed multi-level preconditioner to established one-level preconditioners and direct solvers.Even for a coarse mesh, the one-level preconditioner failed to converge.The proposed multi-level preconditioner delivered a speed-up by a factor up to 112× compared to the direct solver on small and medium sized meshes, whereas the application of a direct method on the finest mesh was not feasible anymore, leaving the proposed preconditioned iterative method as the only working option.Again, the iteration count is independent of the mesh size.While intractable for existing solvers, the proposed preconditioner enables the analysis of mixed-dimensional fiber-solid systems with complex reinforcement structures for the first time.Considering computational performance, the option of building the preconditioner only once per load step and then reusing it in every iteration of the nonlinear solver has shown to cut down the setup time T setup at the expected rate, while still delivering a strong preconditioning effect, such that the convergence of the linear solver is not impeded and the total solver time T total is reduced significantly.
In future work, the proposed preconditioner and its building blocks can be extended to other types of beam-solid interaction phenomena such as the coupling of beams onto a solid's surface [54] or the contact between beams and solid bodies [58].Similarly, other mixed-dimensional multi-physics systems such as fiber-fluid interaction are likely to be amenable to such a methodology.It seems worthwhile to use semistructured grids to discretize the solid domain, if applicable, which in turn allow one to further enhance its computational performance in many application scenarios [36].Performance and scalability bottlenecks associated with the evaluation of the beam-solid coupling terms on a distributed-memory parallel computing cluster need to be addressed, e.g., following ideas from mortar methods for contact mechanics to improve data locality and load balancing [39].within the project "IFB-Individuelle Fließfertigung für Betonfertigteile" (project number DIK-2201-0016 / DIK0411/03).The authors gratefully acknowledge the computing resources provided by the Data Science & Computing Lab at the University of the Bundeswehr Munich.Some sketches in this work have been created using the Adobe Illustrator plug-in LaTeX2AI (https: //github.com/isteinbrecher/latex2ai).
Embedded fibers in continuum (c) Fully resolved 3D model

Figure 2 :
Figure 2: Optional (gray) and mandatory (orange) steps of the SPAI computation and its flow of information with computed data (in boxes) and user parameters (in circles)

Figure 4 :
Figure 4: Partial visualization of the sparsity structure J (A) of matrix A for test cases I-III

3 Figure 5 :
Figure 5: Intersections of all cubes (IDs 1-9) of the weak scaling studies with planes spanned by basis vectors e ξ and e η , ξ, η ∈ {1, 2, 3}, ξ η in the cartesian frame of reference.Orientation of the cutting planes is sketched in the top left.

[ 45 • 3 qFigure 7 :
Figure 7: Model problem for the robustness study.The left figure is adapted from our previous work [55], permissions granted under the Creative Commons (CC BY) license.

40 Figure 8 :
Figure 8: Robustness study of the preconditioning regarding the beam to solid stiffness ratio and plate thickness to beam radius ratio: Iteration numbers are not affected by changes in physical parameters.

Table 1 :
Minimum and maximum eigenvalues λ min , λ max of the given block system with condition number estimates for different beam models and varying positional and rotational penalty parameters ϵ V , ϵ R

Table 2 :
Matrix size of A and number of non-zeros of the graph J (A) for the six different test cases

Table 3 :
Comparison of the influence of different values of σ and ℓ on the number of linear solver iterations and timings

Table 4 :
Mesh refinement schedule for the weak scaling study

Table 5 :
Comparison of averaged linear solver timings per nonlinear iteration for a steel-reinforced concrete wall (methods: Direct -direct solver based on an LU factorization, Naive -a naive-preconditioned GMRES solver using an ILU factorization as preconditioner, Block -a GMRES solver using the block preconditioner proposed in Section 4)