A Note on Massively Parallel Implementation of FETI for the Solution of Contact Problems

The paper deals with a solution of large multibody contact problems using massively parallel computers and domain decomposition methods. These methods can solve the problems discretized by billions of nodal variables at the cost nearly proportional to the number of variables using up to thousands cores before the communication costs start to dominate the computational costs. The paper describes the ingredients essential for efficient massively parallel implementation that increases the parallel scalability beyond the limit mentioned above. The improvements were enhanced into a new software package PERMON which is based on PETSc. The performance of the algorithm is demonstrated on the solution of an academic benchmark discretized by nearly billion of nodal variables.


Introduction
We are interested in the solution of large multibody contact problems such as those arising in the design of electromagnetic brakes [3].A powerful tool for the solution of these problems are the Finite Element Tearing and Interconnecting (FETI) methods that were originally introduced by Farhat and Roux [2] for the solution of large linear systems arising from the discretization of elliptic partial differential equations.The key ingredient of FETI is the decomposition of the spatial domain into non-overlapping subdomains that are glued by Lagrange multipliers.After elimination of the primal variables, the original problem is reduced to a small, relatively well conditioned equality constrained Quadratic Programming (QP) problem, so it can be solved by a suitable iterative method.
A unique feature of FETI is its numerical scalability, i.e., its capability to solve the discretized problems at the cost nearly proportional to the number of variables.Moreover, FETI also enjoys the parallel scalability -if each subdomain is assigned to a core, then the time of the solution can be reduced nearly proportionally to the number of available cores until the communication costs start to dominate the cost of the solution.
If the FETI procedure is applied to the contact problems, the duality not only reduces the dimension of the original problem, but it also turns all inequality constraints into bound constraints.Though the resulting QP has not only the equality constraints, but also the non-negativity constraints, the resulting problem can still be solved with the optimal complexity [1].
Let us mention that the optimality results mentioned above are rather qualitative; the estimates contain unknown constants, so it is not clear whether we can observe the scalability in the solution of realistic problems.The point of this paper is to present some implementation details that improve the performance of the algorithm and extend the domain of their parallel scalability.The improvements were implemented into a novel software package based on PETSc that is called PERMON toolbox.PERMON uses both our in-house software and efficient open source libraries.The core solver layer consists of two modules: PermonFLLOP for FETI and PermonQP for QP.The numerical and parallel scalability is demonstrated on the solution of an academic benchmark discretized by nearly a billion variables.

TFETI Method
Our development is based on a variant of FETI that we call Total-FETI (TFETI) [9].TFETI uses Lagrange multipliers to enforce the Dirichlet boundary conditions so that all subdomains are floating and the kernels of the associated subdomain stiffness matrices are known a priori.
The original discretized primal problem has the following form: The primal global distributed objects denote the stiffness matrix, its kernel, equality and inequality constraint matrices and the load vector where diag means a block-diagonal matrix consisting of the diagonal blocks in the parentheses, and K s , R s , B s and f s denote the local objects associated with s-th subdomain.
The primal problem Eq. ( 1) is transformed into the dual one, The corresponding dual objects are where K † denotes a left generalized inverse of K, i.e., a matrix satisfying KK † K = K, λ satisfies G λ = e, while the solution is looked for in the form λ = λ + λ, and ρ > 0 is the regularization parameter.The favorable spectral properties of the Hessian are given by orthogonal projectors

QP Problems
QP problems resulting from the application of the TFETI method to variational inequalities can be solved by the MPRGP (Modified Proportioning with Reduced Gradient Projections) algorithm and SMALBE (SemiMonotonic Augmented Lagrangian algorithm for Bound and Equality constraints) algorithm, both developed by Dostal [1].These algorithms have the rate of convergence given by the bounds on the spectrum of the Hessian matrix.This operator contains three projector applications that can be implemented using two CP solutions.In combination with TFETI, these algorithms were proved to enjoy both numerical and parallel scalability.
The rate of convergence does not depend on penalization parameter ρ, but it is bounded by the constant given by the ratio H h : with h and H denoting the discretization and decomposition parameter, respectively [2].Efficiency of TFETI method could be further improved by application of preconditioning in face.

Massively Parallel Implementation
Parallelization of FETI/TFETI can be implemented using SPMD technique.The distribution of primal matrices is quite straightforward as every subblock reflects a subdomain.The matrices K and R have a favourable block-diagonal layout and can be implemented using a block-diagonal matrix composite type where blocks are ordinary sequential matrices and every core holds an array of blocks associated with its subdomains.
While the dual problem is solved iteratively, the auxiliary problems related to the application of an unassembled system matrix are solved directly.
The first auxiliary problem is the stiffness matrix pseudoinverse application.It is parallelizable without any data transfers because of a favourable blockdiagonal structure so that core factorizes the regularized local blocks (subdomain stiffness matrices).
The cost of the factorization can be reduced by increasing the number of subdomains.On the other hand, this increases the null space dimension and the number of Lagrange multipliers on the subdomains interfaces (dual dimension).It can hardly be solved sequentially on the master core for large scale problems and becomes a bottleneck, see Fig. 1.
The second one is the CP solution appearing in the application of the projectors.However, this problem does not possess such a favourable structure suitable for parallel processing and some communication is necessary.
An effect of a choice of the LU direct solver on the Cray XE6 machine HECToR (PETSc, MUMPS, SuperLU) on performance of the TFETI massively parallel implementation in PERMON library can be found in [11].

K † Action
In the case of the pseudoinverse, each core regularizes a subdomain stiffness matrix using fixing nodes and factorizes it in the preprocessing phase.The application of K † then consists of purely local backward and forward substitutions in each CG iteration.A nice time reduction can be reached by the decomposition of the domain into more subdomains, see [10].For this purpose it was necessary to implement special matrix formats enabling handling an array of diagonal blocks associated with more subdomains on one computational core, see Fig. 2 and Fig. 3.

6.
(GG T ) −1 Action Firstly, it is necessary to assemle the natural coarse space matrix and CP matrix in parallel, see Fig. 4 and Fig. 5.The action time and level of communication depend primarily on the implementation of the CP solution GG T x = y, which can be hardly solved sequentially on the master core for large scale problems.We have suggested and compared several strategies of CP solution: 1. directly using LU factorization, 2. iteratively using PCG, 3. applying explicit inverse of GG T , 4. orthonormalization performed by iterative classical version of Gram-Schmidt process, if matrix G T has orthonormal columns then CP is eliminated because (GG T ) −1 = I).
The second strategy does not seem suitable because it harms robustness of the FETI method.The third strategy, i.e., explicit inverse assembling + dense matrix-vector products could be effective depending on the number of subdomains and the expected number of CP actions, for results with direct explicit inverse computation see [12].The fourth orthonormalization strategy with the classical Gramm-Schmidt method fails when the nullspace is large enough (thousands) because the roundoff errors become an issue.The modified or iterative Gramm-Schmidt methods have better numerical properties but worse scalability.
Let us describe in more details the first strategy using parallel direct solvers (MUMPS [13], SuperLU_DIST [14]) for factorization + forward/backward substitutions.The CP dimension is not large enough to justify the fully parallel approach; in this case, communication takes over computation.The significant efficiency improvement can be achieved by means of the partial of this CP solution.The approach splits cores of the global "world" communicator into groups of cores called subcommunicators in MPI; the number of these subcommunicators is N r (number of cores doing redundant work), i.e., the number of cores in each subcommunicator is equal to N c /N r , with N c denoting total number of cores.This CP matrix is factorized in parallel in subcommunicators.The application of (GG T ) −1 then consists of backward and forward substitutions which are not local and not negligible amount of the communication is needed in each CG iteration.Parallel approach has a big advantage consisting in the reduction of memory requirements for the CP solution; there are practically no memory limits as more and more cores can be engaged into the subcommunicators.
However, solving large CPs gets very complicated for tens or hundreds of thousands of subdomains, even if the best techniques are employed.This motivates our research in projector-less TFETI method for contact problems avoiding the CP solution at all, see [15].

PERMON Toolbox
The QP algorithms and FETI methods are implemented in our software package PERMON based on PETSc.PERMON (Parallel, Efficient, Robust, Modular, Object-oriented, Numerical) [4] and [5] is a collection of software libraries, uniquely combining QP and DDM.There are two core modules in PERMON: Per-monQP and PermonFLLOP.They are built on the top of PETSc [6] and [7], using mainly its linear algebra part.They extend PETSc with new specific functionality, algorithms for large scale sparse QP problems and DDM of the FETI type.The same coding style is used so that the users familiar with PETSc can utilize them with minimal effort.Among the main applications are contact problems of mechanics.
PermonQP provides a base for solving the QP problems.It includes data structures, transforms, algorithms, and supporting functions for QP.PermonQP is available for free under the FreeBSD open source license.
PermonFLLOP (FETI Light Layer on Top of PETSc) is an extension of PermonQP that adds support for DDM of the FETI type.PermonFLLOP is currently under preparation for publishing.The "gluing" signed Boolean matrix B g is constructed based on l2g as described in [8].The subdomain nullspace matrix R s is assembled using the rigid body modes assembled from the mesh nodal coordinates [9].
PermonFLLOP passes the global primal data to Per-monQP and calls a specific series of QP transforms provided by PermonQP, resulting in the bound and equality constrained QP which is then solved with the QPSSolve function.Note that PermonFLLOP allows more than one subdomain per core, i.e. an array of K s and f s is passed per subdomain.The achieved numerical, weak parallel, strong parallel and memory scalability are reported in [10].

Numerical Experiments
The As a model of 3D linear elasticity contact problem, we considered an elastic cube with the bottom face fixed, the top one loaded with a given vertical surface force directed downwards, and the right one in contact with a rigid obstacle, see Fig. 6.Young modulus is E = 2 • 10 5 [MPa] and Poisson ratio µ = 0.33.
The action of the stiffness matrix pseudoinverse K † was implemented using the LU factorization of the regularized K using the MUMPS library and SuperLU_DIST was used for the CP solution in MPI subcommunicators.Each subdomain was assigned to one core.The computations were carried out with decompositions into 512, 1024, 2048, 4096, 8192 and 10648 subdomains.The stopping criterion was defined by the relative precision of the projected gradient and the feasibility error equal to 10 −4 and compared with the Euclidean norm of the dual linear term.The results of computations are depicted in Fig. 7.

Conclusion
We demonstrated the numerical scalability and weak scalability of the FETI based algorithm for contact problems using Salomon up to 702 millions of unknowns and 10648 subdomains (cores).The results document the efficiency of the implemented matrix formats, parallel direct solvers, and other ingredients (as e.g.adaptive expansion steplength in MPRGP).The results were obtained by means of TFETI + SMALBE + MPRGP implemented in PERMON toolbox.

Fig. 1 :
Fig. 1: Increase in number of subdomains leads to a decrease in size of the local stiffness matrices and improvement of bounds at the cost of the CP matrix size increasetimes needed by individual actions.

Fig. 7 :
Fig. 7: Cube contact linear elasticity problem -numerical and weak parallel scalability highlight on Salomon.