High-performance dune modules for solving large-scale, strongly anisotropic elliptic problems with applications to aerospace composites

The key innovation in this paper is an open-source, high-performance iterative solver for high-contrast, strongly anisotropic elliptic partial differential equations, implemented within dune-pdelab. The iterative solver exploits a robust, scalable two-level additive Schwarz preconditioner, GenEO (Spillane et al., 2014). The development of this solver has been motivated by the need to overcome the limitations of commercially available modelling tools for structural analysis simulations in aerospace composite applications. Our software toolbox dune-composites encapsulates the mathematical complexities of the underlying packages within an efficient C++ framework, providing an application interface to our new high-performance solver. We illustrate its use on a range of industrially motivated examples, which should enable other scientists to build on and extend dune-composites and the GenEO preconditioner for use in their own applications. We demonstrate the scalability of the solver on more than 15,000 cores of the UK national supercomputer Archer, solving an aerospace composite problem with over 200 million degrees of freedom in a few minutes. This scale of computation brings composites problems that would otherwise be unthinkable into the feasible range. To demonstrate the wider applicability of the new solver, we also confirm its robustness and scalability on SPE10, a challenging benchmark in subsurface flow/reservoir simulation.

Licensing provisions: BSD 3-clause

Programming language: C++

Nature of problem: dune-composites is designed to solve anisotropic linear elasticity equations for anisotropic, heterogeneous materials, e.g. composite materials. To achieve this, our contribution also implements a new preconditioner in dune-pdelab.

Solution method: The anisotropic elliptic partial differential equations are solved via the finite element method. The resulting linear system is solved via an iterative solver with a robust, scalable two-level overlapping Schwarz preconditioner: GenEO.


Introduction
Across the physical sciences, elliptic partial differential equations (PDEs) naturally arise as mathematical models of the equilibrium state of a system. Classical examples include the distribution of temperature in a body, the flow of fluid in a porous medium and, the particular application of interest in this paper, the equilibrium of forces acting on a material. The most widely used approach to solve such PDEs is the finite element method [1], which results in a sparse system of equations. For systems which exhibit multiple scales and large spatial variations in model parameters, the system of equations can be very ill-conditioned and extremely large, e.g. containing more than 10^7 degrees of freedom. The design of robust and scalable solvers that do not require laborious tuning, and that are capable of exploiting the power of modern distributed computers, is therefore essential. In this paper, we describe the design and implementation of a robust two-level additive Schwarz preconditioner for parallel Krylov-based iterative solvers in dune-pdelab [2]. By also creating a bespoke module for the analysis of composite structures, dune-composites, we demonstrate the capabilities of this new solver on industrially motivated aerospace composite problems with over 200 million degrees of freedom. To demonstrate the wider applicability of the new solver, we also show its robustness and scalability on the challenging, classical SPE10 benchmark [3,4] in subsurface flow/reservoir simulation.

Motivating computational challenge in aerospace composites
Scientific advances in aerospace composite design and materials offer exciting engineering opportunities, making them the material of choice for many modern aircraft (e.g. Airbus A350, Boeing 787). However, composite manufacturers face huge challenges in designing and making complex components quickly enough to remain commercially competitive. There is a growing realisation in both academia and industry that to meet ambitious global growth targets, composites manufacturers should 'reduce time, cost and risk to market through the use of validated simulation packages' [5]. Currently, simulation capabilities allowing high-fidelity, full-scale analysis of a composite structure are not openly available. But why is this? What makes the analysis of large-scale composite structures so challenging?
When we apply classical finite element (FE) analysis to a composite structure, the problem reduces to finding a vector of displacements u^{(i)} ∈ R^3 at each of the N nodes within an FE mesh.
This leads to the sparse system of FE equations [1]

Aũ = b,     (1)

where A is the global stiffness matrix and b is the load vector arising from the applied boundary conditions or loading. In solving the linear system (1), we face two significant mathematical challenges:
• Scale of calculations. Composite materials are manufactured from thin fibrous layers, less than 1 mm thick, separated by even thinner resin interfaces of thickness less than 0.05 mm, yet entire component parts are generally several metres long (Fig. 1). To resolve stresses and accurately predict failure, several elements need to be placed through each layer [6]. Naturally, this means that the number of nodes N is very large. As an example, in this paper we model a 1 m section of a wing box (Section 5.3), while resolving the resin interfaces, giving in total 200 million degrees of freedom. Solving linear systems of this size requires specialised, parallel solvers. Current industry-standard tools, such as Abaqus [7], are not able to deal with these problem sizes, largely due to limitations of the parallel solvers employed.
• Material anisotropy. Central to the benefits of composite structures is the inclusion of directional fibres, which gives them an excellent weight-to-stiffness ratio under a particular loading. This means that there is a large contrast (∼ 1 : 40) in mechanical properties within a single layer of composite, between the fibre direction(s) and those directions dominated by the stiffness of the matrix material (typically a toughened epoxy resin). In the FE discretisation, this leads to a stronger coupling between degrees of freedom in the fibre direction than in the orthogonal directions. The fibrous layers are stacked with different fibre orientations to form a laminate, adding an additional level of complexity. The fibre directions act as stiff constraints on the deformation, whilst the weak connections give rise to low-energy mechanisms within the structure. Mathematically, this causes significant numerical challenges in solving Eq. (1) via iterative solvers, since the system is very ill-conditioned. For such cases, classical iterative solvers (required to address Challenge 1) converge very slowly.
The usual approach to tackle both these challenges is to apply parallel iterative solvers to a preconditioned version of Eq. (1). The task then is to develop an operator M^{-1}, which is computationally cheap to construct, such that M^{-1}Aũ = M^{-1}b is better conditioned. The most widely used preconditioners for iterative solvers for (1) in both commercial and scientific FE codes are Algebraic Multigrid (AMG) methods [8,9]. They have demonstrated excellent scalability for a broad class of problems over thousands of processors, and have the advantage of working only on the matrix equations (1), so that they can be applied 'black-box'. As a preconditioner, AMG constructs the matrix M by repeatedly coarsening the full matrix A through recursive aggregation over the degrees of freedom. The aggregation process is algebraic and based on the fact that the solution at two neighbouring nodes will be similar if they are 'strongly connected'. The success of an AMG preconditioner depends on this aggregation process. As discussed above, the connectivity of degrees of freedom within a laminate is very complex, even for a simple laminate, and in our numerical experiments the performance of all AMG preconditioners that we tested was prohibitively poor. In particular, this includes the off-the-shelf AMG used in the commercial software Abaqus [7], as well as tailored aggregation strategies in the AMG preconditioners provided through dune-istl [8] and Hypre [9].
This initial testing of AMG preconditioners highlighted the need for the development and implementation of a preconditioner that is robust with respect to problem size, material heterogeneity and anisotropy for large-scale composite applications. Over the last decade, there has been significant effort in the domain decomposition community to develop scalable and robust preconditioners suitable for parallel computation. One such preconditioning approach is provided by the additive Schwarz framework [10]. The domain is decomposed into overlapping subdomains, which in our case each correspond to one processor, and the subdomain's local stiffness matrix is inverted on each processor using a direct solver. This 'one-level' approach is not sufficient for very large problems, and global information in the form of a coarse space must be added. In dune-composites, we use GenEO [11] to construct a coarse space by combining low-energy eigenvectors of the local subdomain stiffness matrices using a partition of unity. The resulting preconditioner leads to almost optimal scaling with respect to problem size and number of processors, allowing us to successfully tackle large, industrially important problems with over 200 million degrees of freedom.

The contributions of this paper
dune-composites is a high-performance composite FE package built on top of Dune (Distributed and Unified Numerics Environment), an open source modular toolbox for solving partial differential equations (PDEs) with grid-based methods [12][13][14].
Based on the core Dune philosophy, dune-composites is written in C++ and exploits modern inheritance and templating programming paradigms. It is open-source and publicly available at https://dune-project.org/modules/dune-composites/. The package provides a codebase with the following key features:
• implementation of and interface to a novel, robust preconditioner called GenEO [6,11] for parallel Krylov solvers, which exhibits excellent scalability over thousands of processors on Archer, the UK national HPC system. Since release 2.6, the preconditioner is provided as part of dune-pdelab, available at https://dune-project.org/modules/dune-pdelab/ (initially it had been developed within the dune-composites module);
• interfaces to handle composite applications, including stacking sequences, complex part geometries, defects and non-standard boundary conditions, such as multi-point constraints or periodicity;
• mesh stabilisation strategies to support reduced integration [15], as well as a new 20-node 3D serendipity element (with full integration), implemented to overcome shear locking of standard FEs;
• interfaces to other state-of-the-art parallel solvers (and preconditioners) in dune-istl [8] and Hypre [9];
• a code structure which supports both engineering end-users and those requiring flexibility to extend any aspect of the code in a modular way to introduce new applications, solvers or material models.
The purpose of this paper is to highlight the novel mathematical aspects of the code and document its structure. We illustrate its use through a range of industrially motivated examples, which should enable other scientists to build on and extend dune-composites for use in their own applications. We begin by outlining the mathematical formulation of the new robust preconditioner and its implementation on a distributed memory computer in Section 3. We then provide details of the structure and salient features of the code in Section 4. Finally, in Section 5, through a series of example problems, we provide details of how to implement, build and run your own applications. We also use these examples as an opportunity to demonstrate the computational efficiency of dune-composites.

Preliminaries: Anisotropic elasticity equations and their finite element discretisation
A composite structure occupies the domain Ω ⊂ R^3 with boundary Γ and unit outward normal vector n ∈ R^3. At each point x ∈ Ω we define a vector-valued displacement u(x) : Ω → R^3. In each of the three global directions, the boundary may contain a Dirichlet component Γ_D^{(i)} and a Neumann component Γ_N^{(i)}, i = 1, 2, 3. Let σ_ij denote the Cauchy stress tensor and f(x) : Ω → R^3 the body force per unit volume. The infinitesimal strain tensor is defined as the symmetric part of the displacement gradients,

ε_ij(u) = (1/2) (∂u_i/∂x_j + ∂u_j/∂x_i),     (3)

and it is connected to the Cauchy stress tensor via the generalised Hooke's law

σ_ij = C_ijkl(x) ε_kl,     (4)

where C_ijkl(x) is a symmetric, positive-definite fourth-order tensor. A composite laminate is made up of a stack of composite layers (or plies, ∼ 0.2 mm thick), separated by very thin layers of resin (∼ 15 µm). A single composite layer is modelled as a homogeneous orthotropic elastic material, characterised in general by 9 parameters and a vector of orientations θ. Resin interfaces are assumed isotropic, defined by just 2 scalar (Lamé) parameters. The fibres are aligned in local coordinates and can be rotated in any direction using standard tensor rotations; for more details see e.g. [16].
Given functions h_i : Γ_D^{(i)} → R and g_i : Γ_N^{(i)} → R, prescribing the Dirichlet and Neumann boundary data (for each component), we seek the unknown displacement field u(x) which satisfies the force equilibrium equations

−∂σ_ij/∂x_j = f_i in Ω,     (5)

and the boundary conditions

u_i = h_i on Γ_D^{(i)} and σ_ij n_j = g_i on Γ_N^{(i)}, i = 1, 2, 3,     (6)

as well as Eqs. (3) and (4). We define the function space for each component of displacement u_i to be V_i := {v ∈ H^1(Ω) : v = 0 on Γ_D^{(i)}}, leading to the weak formulation of (5): find u, with u_i = h_i on Γ_D^{(i)}, such that

a(u, v) := ∫_Ω C_ijkl(x) ε_kl(u) ε_ij(v) dx = ∫_Ω f_i v_i dx + Σ_{i=1}^{3} ∫_{Γ_N^{(i)}} g_i v_i ds =: ℓ(v), for all v ∈ V := V_1 × V_2 × V_3.     (7)

We consider the discretisation of the variational equations (7) with conforming FEs on a mesh T_h on Ω. Let V_h ⊂ V denote the restriction of V onto an FE space on T_h and seek an approximation u_h ∈ V_h such that

a(u_h, v_h) = ℓ(v_h), for all v_h ∈ V_h.     (8)

We block together displacements from all three space dimensions, so that u_h^{(i)} ∈ B = R^3 denotes the vector of displacement coefficients containing all space components associated with the ith basis function. We introduce the (vector-valued) FE basis for V_h defined by the spanning set of (vector-valued) shape functions {Φ_1, . . . , Φ_N}. These are the normal scalar shape functions φ_i, repeated for each displacement component. Therefore the vector displacement at a point is given by u_h(x) = Σ_{i=1}^{N} u_h^{(i)} φ_i(x). The choice of basis converts (8) into a symmetric positive-definite (spd) system of algebraic equations

Aũ = b,     (9)

where the blocks in the global stiffness matrix and in the load vector, for any i, j = 1, . . . , N, are given by

A_ij = [a(φ_j e_l, φ_i e_k)]_{k,l=1,...,3} ∈ R^{3×3} and b_i = [ℓ(φ_i e_k)]_{k=1,...,3} ∈ R^3,     (10)

with e_k the kth unit vector, and ũ = [u_h^{(1)}, . . . , u_h^{(N)}]^T is the block vector of unknown FE coefficients. System (9) is assembled element-wise from (7), using Gaussian integration.

A robust, scalable, parallel iterative solver for composite structures
The key innovation of dune-composites, as a software package, is the design and implementation of a highly robust, scalable parallel iterative solver for composite applications. This solver is applicable to a more general class of problems and has been made available in the dune-pdelab module. In this section, we provide the mathematical and implementation details of the new solver. Apart from new types of FEs that had to be implemented in Dune, the remainder of the package largely provides interfaces to handle the set-up for complex composite applications.

Krylov subspace methods preconditioned with two-level additive Schwarz methods
In dune-composites we use preconditioned Krylov subspace methods, both sequentially and in parallel, as provided by Dune's 'Iterative Solver Template Library' dune-istl [8]. Krylov subspace methods are iterative solvers which construct a sequence of approximations ũ^{(k)} ∈ ũ^{(0)} + K_k(A, r) in the k-dimensional Krylov subspace

K_k(A, r) := span{r, Ar, A^2 r, . . . , A^{k−1} r},

where r = b − Aũ^{(0)} is the initial residual. The simplest Krylov subspace method for a symmetric positive-definite matrix A is the Conjugate Gradient method (CG), first introduced by Hestenes and Stiefel (1952). In each step, the approximate solution ũ^{(k)} is updated by adding the search direction d^{(k)}, scaled by a factor chosen to minimise the energy norm over the space K_k. The search directions are chosen to be A-orthogonal to all previous directions, i.e. ⟨d^{(k)}, Ad^{(k′)}⟩ = 0 for k′ < k. The method iterates until the residual norm (or 'energy') ∥r^{(k)}∥ reduces below a user-defined tolerance. Importantly, the convergence rate of CG depends on the spectral properties of the matrix A, see e.g. [17].
In particular, the number of iterations required can be bounded in terms of the square root of the condition number κ, defined as the ratio between the largest and smallest eigenvalues of A. A large value of κ, as usually seen in composites, indicates that the system Aũ = b is ill-conditioned. This means that ũ is very sensitive to small changes in b. For such cases, iterative solvers converge very slowly, or not at all, particularly as the problem size increases. A remedy is to precondition the system, that is, to construct an operator M^{-1} which is computationally cheap to build and apply (in parallel), such that M^{-1}Aũ = M^{-1}b is better conditioned and CG converges quickly.
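As a point of reference before the additive Schwarz construction below, the short sketch that follows shows the generic pattern for a preconditioned CG solve with dune-istl. It is illustrative only: a simple sequential SSOR preconditioner stands in for the additive Schwarz/GenEO preconditioners developed in the remainder of this section, and the block types, tolerance and iteration limit are assumptions rather than dune-composites defaults.

```cpp
// Illustrative sketch of a preconditioned CG solve with dune-istl.
// The SSOR preconditioner is only a placeholder for the Schwarz/GenEO
// preconditioners described in the text.
#include <dune/common/fmatrix.hh>
#include <dune/common/fvector.hh>
#include <dune/istl/bcrsmatrix.hh>
#include <dune/istl/bvector.hh>
#include <dune/istl/operators.hh>
#include <dune/istl/preconditioners.hh>
#include <dune/istl/solvers.hh>

using Block  = Dune::FieldMatrix<double, 3, 3>;              // 3x3 displacement blocks
using Matrix = Dune::BCRSMatrix<Block>;
using Vector = Dune::BlockVector<Dune::FieldVector<double, 3>>;

void solveWithCG(const Matrix& A, Vector& u, Vector b)
{
  Dune::MatrixAdapter<Matrix, Vector, Vector> op(A);         // wrap A as a linear operator
  Dune::SeqSSOR<Matrix, Vector, Vector> prec(A, 1, 1.0);     // placeholder preconditioner M^{-1}
  Dune::CGSolver<Vector> cg(op, prec,
                            1e-5,                            // residual reduction
                            500,                             // maximum number of iterations
                            2);                              // verbosity
  Dune::InverseOperatorResult stats;
  cg.apply(u, b, stats);                                     // note: b is overwritten
}
```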
In dune-composites our main preconditioner is a two-level additive Schwarz method. To construct this method we partition our domain Ω into a set of non-overlapping subdomains Ω′_j, j = 1, . . . , N, resolved by T_h, and extend each Ω′_j by one or more layers of elements to obtain an overlapping decomposition into subdomains Ω_j, as shown in Fig. 2.

Remark. In dune-composites the user can define the initial non-overlapping decomposition (or a default is used); the overlapping process is handled by Dune's parallel structured grid class Dune::YaspGrid [12].

Let V_h,0(Ω_j) denote the space of FE functions on Ω_j that vanish on the interior subdomain boundary ∂Ω_j \ ∂Ω. Any function v ∈ V_h,0(Ω_j) is mapped onto V_h by the prolongation operator R_j^T : V_h,0(Ω_j) → V_h, which extends v by zero outside Ω_j. We therefore note that the restriction operator R_j : V_h → V_h,0(Ω_j) is its transpose. In matrix form, the restriction and prolongation operators are denoted R_j and R_j^T, respectively. This allows us to define the subdomain stiffness matrices restricted to V_h,0(Ω_j),

A_j := R_j A R_j^T.     (11)

In practice, we do not compute A_j from A via this double matrix product. Instead, we can equivalently assemble A_j directly from the bilinear form (8) on Ω_j with homogeneous Dirichlet boundary conditions on all artificial interior subdomain boundaries, i.e. all points x ∈ ∂Ω_j that do not lie on the global boundary ∂Ω. The one-level additive Schwarz method can then be defined as a preconditioner of (9) via the operator

M_AS,1^{-1} := Σ_{j=1}^{N} R_j^T A_j^{-1} R_j.     (12)

Here, in the subscript, AS denotes additive Schwarz and 1 denotes a one-level method. In this case, the preconditioner M_AS,1^{-1} approximates the inverse operator A^{-1} by a sum of local solves on overlapping subdomains, with homogeneous Dirichlet boundary conditions on interior boundaries. We will see in the numerical examples to follow that, for large problems, a single-level method is not sufficient, leading to stagnation (high iteration counts) of the iterative solver. This stagnation is caused by a few very small eigenvalues in the spectrum of the preconditioned problem, due to the lack of a global exchange of information in the single-level preconditioner. A classical remedy is the introduction of a coarse grid problem that couples all subdomains at the second level [10]. To define our coarse problem, we introduce a coarse space V_H ⊂ V_h (defined below). We denote the restriction from the fine to the coarse space by the operator R_H : V_h → V_H, with matrix representation R_H. The two-level additive Schwarz preconditioner (in matrix form) is given by

M_AS,2^{-1} := R_H^T A_H^{-1} R_H + Σ_{j=1}^{N} R_j^T A_j^{-1} R_j, where A_H := R_H A R_H^T.     (13)

Two natural questions arise:
• What is a good choice of coarse space V_H for composite applications?
• How do we construct A_H efficiently on a distributed memory computer without assembling A directly?

A robust coarse space via generalised eigenproblems in the overlaps (GenEO)
The ideal coarse space would capture the global low energy modes of A that jeopardise the performance of Krylov solvers. Specifically, in the two-level additive Schwarz setting, the modes not captured by the local solves are of interest. Yet, to compute those low-energy modes explicitly would be more expensive than inverting A itself. Instead, the global low-energy modes can be approximated by stitching together local (optimal) approximations. These local approximations are solutions of specific Generalised Eigenproblems in the Overlaps, hence named GenEO, defined below. Importantly the local eigenproblems are independent and can trivially be computed in parallel. The robustness of GenEO has been proven for isotropic elasticity problems, Spillane et al. [11], and numerically verified by the authors for anisotropic variants [6].
The construction of the GenEO coarse space has two key steps: the definition of the generalised eigenproblems on the subdomains and the stitching together of the resulting local eigenmodes from each subdomain to form a global basis. This stitching process by means of partition of unity (PoU) operators is also incorporated in the local eigenproblems, therefore we construct the PoU operators first.
For each subdomain Ω_j, we denote by Ω_j^o the overlap, i.e. the subset of Ω_j which belongs to at least one other subdomain.

Definition 3.2 (Partition of Unity). The family of operators Ξ_j : V_h(Ω_j) → V_h,0(Ω_j), j = 1, . . . , N, where V_h(Ω_j) denotes the space of restrictions of functions in V_h to Ω_j, is called a partition of unity (PoU) if

Σ_{j=1}^{N} R_j^T Ξ_j (v|_{Ω_j}) = v, for all v ∈ V_h.

It follows from this definition that, restricted to Ω_j \ Ω_j^o, each Ξ_j has to be the identity operator. In the overlaps, the choice of Ξ_j is not unique. The simplest approach is to define Ξ_j(v) such that each coefficient of the FE function v is scaled by the inverse of the number of subdomains the corresponding degree of freedom belongs to (see [11] for details). However, we also provide a smoother PoU, as defined by Sarkis [18], in our implementation, but observe little difference in the performance of the overall solver (at most one iteration); we therefore keep the presentation here as simple as possible.
Given this set of local PoU operators Ξ_j(·), we can construct the GenEO coarse space as follows. In particular, we can define the local generalised eigenproblems that (once collected from each subdomain) provide the basis of the GenEO coarse space. The following definition can be rigorously motivated from theoretical considerations and we refer again to [11]. For each subdomain Ω_j, j = 1, . . . , N, we define the generalised eigenproblem: find p ∈ V_h(Ω_j) and λ ∈ R such that

a_{Ω_j}(p, v) = λ a_{Ω_j^o}(Ξ_j p, Ξ_j v), for all v ∈ V_h(Ω_j),     (14)

where, for any D ⊂ Ω, the bilinear form a_D is defined like a in (7) with the integral restricted to D.

Definition 3.3 (GenEO Coarse Space).
For each subdomain Ω_j, let p_k^j be the eigenfunctions from (14) with associated eigenvalues λ_k^j in ascending order. Then, for some choice of m_j ∈ N, the GenEO coarse space is defined as

V_H := span{ R_j^T Ξ_j (p_k^j) : k = 1, . . . , m_j ; j = 1, . . . , N }.     (15)

The only parameter that remains to be chosen is the number of eigenmodes m_j to be included from each subdomain Ω_j. In order to ensure robustness and scalability of the solver, the condition number of the preconditioned system needs to be bounded from above, independently of N, h and of the material properties. It has been shown in [11] that

κ(M_AS,2^{-1} A) ≤ C ( 1 + max_{j=1,...,N} 1/λ^j_{m_j+1} ),

where C > 0 is a constant depending only on the geometry of the subdomains and where λ^j_{m_j+1} is the lowest eigenvalue whose eigenfunction is not added to the coarse space on Ω_j.
Thus, the desired robustness can be achieved by including all eigenfunctions in the coarse space whose eigenvalues are below an a priori chosen threshold. A particular threshold that turns out to provide an effective black-box choice for m_j, and that also depends only on the geometry of the subdomain partition, is to include all eigenmodes whose eigenvalue lies below the ratio of the overlap width to the diameter of the subdomain Ω_j. This simple threshold can be scaled by a constant factor, thus also scaling the condition bound of the preconditioned system by the same factor. As the number of iterations of the Krylov solver depends directly on the condition number, this allows us to balance the time spent in the iterative solver with the time spent on setting up the preconditioner.
The number of eigenfunctions that are used in the coarse space is problem-specific, but it turns out that for strongly structured coefficient distributions only a small number is typically sufficient. We will see in Section 5 that the calculation of these local eigenmodes is not prohibitively expensive, while yielding excellent condition numbers and, due to the independence of the individual eigenproblems, parallel scalability.
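As a small illustration of the threshold rule discussed above, the sketch below counts how many local eigenmodes enter the coarse space on one subdomain. The function name and the way the threshold is supplied are hypothetical, not taken from the dune-composites sources.

```cpp
// Illustrative sketch: pick m_j by including every local GenEO eigenvalue
// below a geometry-based threshold (e.g. a multiple of overlap width / subdomain diameter).
#include <cstddef>
#include <vector>

std::size_t numberOfCoarseModes(const std::vector<double>& lambda,  // eigenvalues, ascending
                                double threshold)
{
  std::size_t m = 0;
  while (m < lambda.size() && lambda[m] < threshold)
    ++m;                                                            // this eigenmode enters V_H
  return m;                                                         // this is m_j
}
```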

Implementation of GenEO on a high performance computer
The two-level additive Schwarz preconditioner with GenEO coarse space is implemented within a collection of header files, which are located in dune-pdelab from the releases/2.6 branch onwards, and in the folder solvers/geneo/ of dune-composites in prior releases. Here we describe our implementation. We are aware of only one other high-performance implementation of GenEO, which can be found in the package HPDDM; details are provided in Jolivet et al. [19,20].
It is a main goal of such an implementation to fully exploit the excellent parallel scalability promised by the method's construction and theoretical properties. Therefore, each process j will be assigned to subdomain Ω j and only store relevant fine-level operators and functions in the form of local restrictions to Ω j .
Further, per-subdomain stiffness matrix and eigenproblem solves will be run in parallel on the respective processes. Only scalable nearest-neighbour communication is needed, with the exception of setting up the coarse matrix, which consequently requires particular care.

Partition of Unity (PoU) operator
The partition of unity operator as defined in Definition 3.2 is stored locally on each processor (see Fig. 2). In practice, the partition of unity operator Ξ_j is represented as a diagonal matrix X^(j). In the simplest case, each diagonal entry of X^(j) is set to one divided by the number of subdomains containing the associated degree of freedom, except on the subdomain boundary, where entries are set to zero. Therefore, if v^(j) is a vector containing the local coefficients of an FE function on Ω_j, applying the PoU amounts to the matrix-vector product X^(j) v^(j). Such a PoU can be generated using existing parallel data structures in Dune by adding a vector of ones and by enforcing both global and subdomain boundary conditions before and after communication. The implementation of the PoU operators is within the header file geneo/partitionofunity.hh, under the function standardPartitionofUnity(...). As the choice of partition of unity operator is not unique, we also provide the PoU in [18], which is implemented in the same header file under the function sarkisPartitionofUnity(...). This gives a 'smoother' PoU operator, which is, however, restricted to equally distributed subdomain sizes. Under testing, we noted no significant difference in the performance of the preconditioner when changing between the two different PoU operators.
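To illustrate the standard PoU just described, the following sketch builds the diagonal of X^(j) from a per-degree-of-freedom subdomain multiplicity. It is a simplified stand-in, not the standardPartitionofUnity(...) implementation, and the input containers are assumptions.

```cpp
// Illustrative sketch of the standard partition of unity: each diagonal entry of X^(j)
// is 1 / (number of subdomains owning the DoF), and zero on the subdomain boundary.
#include <cstddef>
#include <vector>

std::vector<double> standardPoUDiagonal(const std::vector<int>& multiplicity,
                                        const std::vector<bool>& onSubdomainBoundary)
{
  std::vector<double> xi(multiplicity.size(), 0.0);
  for (std::size_t i = 0; i < multiplicity.size(); ++i)
    if (!onSubdomainBoundary[i])
      xi[i] = 1.0 / multiplicity[i];   // interior and overlap DoFs
  return xi;                           // diagonal of X^(j)
}
```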

Subdomain eigenproblems
The local generalised eigenvalue problems (14) can be rewritten in matrix form as follows: find eigenpairs (λ_k^j, p_k^j), with the eigenvalues in ascending order and ∥p_k^j∥ = 1, such that

A_{Ω_j} p_k^j = λ_k^j X^(j) A_{Ω_j^o} X^(j) p_k^j,

where A_{Ω_j} and A_{Ω_j^o} denote the stiffness matrices corresponding to the bilinear forms a_{Ω_j}(·, ·) and a_{Ω_j^o}(·, ·), respectively. They are solved using ARPACK++ [21].
A customised wrapper has been developed to convert Dune data structures into a suitable format for ARPACK++ [21]. In order to regularise the problem, we employ ARPACK++'s shift-and-invert spectral transformation mode and, since we are interested in the smallest m_j eigenvalues, we choose a small shift factor. The global coarse basis vectors Φ_1, . . . , Φ_{N_H} are obtained from the local eigenvectors by applying the PoU operator, i.e.

Φ_{i(j,k)} := X^(j) p_k^j,     (16)

padding the rest of the global vector (outside Ω_j) with zeros.
Here, (j, k) ↦ i(j, k) is a one-to-one mapping from the local numbering of the eigenvectors on Ω_j to a global numbering, with N_H = Σ_{j=1}^{N} m_j denoting the total number of coarse basis vectors.
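The construction of a global coarse basis vector from a local eigenvector, as just described, amounts to a scaling by the PoU followed by zero-padding. The following sketch makes this explicit; the index map and container types are illustrative assumptions, not the dune-pdelab data structures.

```cpp
// Illustrative sketch: Phi_{i(j,k)} = X^(j) p_k^j, extended by zero outside Omega_j.
#include <cstddef>
#include <vector>

std::vector<double> coarseBasisVector(const std::vector<double>& eigenvector,        // p_k^j
                                      const std::vector<double>& pouDiagonal,        // diag(X^(j))
                                      const std::vector<std::size_t>& localToGlobal, // DoF map on Omega_j
                                      std::size_t globalSize)
{
  std::vector<double> phi(globalSize, 0.0);                    // zero outside Omega_j
  for (std::size_t i = 0; i < eigenvector.size(); ++i)
    phi[localToGlobal[i]] = pouDiagonal[i] * eigenvector[i];   // apply X^(j)
  return phi;
}
```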

Coarse space assembly
The parallel assembly of the coarse system A_H = R_H A R_H^T is not trivial in practice, since process j only has local access to the rows and columns of A associated with degrees of freedom on subdomain Ω_j. We denote this submatrix Ã_j. Note that Ã_j differs from the matrix A_j in (11), since no artificial Dirichlet conditions are imposed on the interior subdomain boundary. Let j(i) denote the index of the subdomain whose local eigenvector gives rise, via (16), to the coarse basis vector Φ_i. Due to the local support of the basis functions, one can break down the global matrix product into local products,

(A_H)_{iℓ} = Φ_i^T ( Ã_{j(ℓ)} Φ_ℓ ),

with a slight abuse of notation, denoting the local parts of the global vectors Φ_i and Φ_ℓ restricted to Ω_{j(i)} again by Φ_i and Φ_ℓ.
In the implementation, the matrix-vector product in the bracket is local, whereas the scalar product requires communication of (parts of) the vector Φ_ℓ from processor j(ℓ) to processor j(i). This avoids having to communicate the local matrices Ã_j. We note that the locality of the basis functions implies Φ_i^T Ã_{j(ℓ)} Φ_ℓ = 0 whenever Ω_{j(i)} ∩ Ω_{j(ℓ)} = ∅. Therefore, the parallel assembly of A_H requires only communication between processes assigned to overlapping subdomains. This locality of communication, which is demonstrated in Fig. 2 (middle), can be exploited to set up A_H as a banded sparse matrix. Its parallel communication is implemented within geneo/multicommdatahandle.hh, allowing basis functions to be passed between all processes at the same time and therefore making best use of the available bandwidth. Combining sparsity and efficient communication, linear complexity in the basis size can be achieved for this step.
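To illustrate the purely local character of one coarse entry, the sketch below evaluates Φ_i^T (Ã Φ_ℓ) with the local submatrix stored in a minimal CSR format. The real implementation works on Dune's parallel data structures and first communicates the overlapping part of Φ_ℓ, so this is a simplified stand-in.

```cpp
// Illustrative sketch: one coarse-matrix entry (A_H)_{il} = Phi_i^T ( A~ Phi_l ),
// computed from local data only once Phi_l is available locally.
#include <cstddef>
#include <vector>

struct CSRMatrix {
  std::vector<std::size_t> rowPtr;   // size = number of rows + 1
  std::vector<std::size_t> colIdx;
  std::vector<double>      val;
};

double coarseEntry(const CSRMatrix& Atilde,
                   const std::vector<double>& phi_i,
                   const std::vector<double>& phi_l)
{
  double entry = 0.0;
  for (std::size_t r = 0; r + 1 < Atilde.rowPtr.size(); ++r) {
    double Av = 0.0;                                            // (A~ phi_l)_r
    for (std::size_t k = Atilde.rowPtr[r]; k < Atilde.rowPtr[r + 1]; ++k)
      Av += Atilde.val[k] * phi_l[Atilde.colIdx[k]];
    entry += phi_i[r] * Av;                                     // phi_i^T (A~ phi_l)
  }
  return entry;
}
```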
Since the number of coarse degrees of freedom per process (m_j on process j) is small, after assembly we distribute the resulting global coarse matrix A_H to all processors and duplicate the coarse solve on all processors, to avoid fine-grained communication. The communication of A_H is achieved directly with MPI calls, as the dune-pdelab communication infrastructure is not designed for such global operations.
For the restriction of a distributed vector v_h, representing a function v_h ∈ V_h, it follows that each coarse coefficient (R_H v_h)_i = Φ_i^T v_h can be computed locally on process j(i). Each process executes its associated subdomain solve as well as the coarse space solve (redundantly). Where possible, we use the sparse direct solver UMFPack [22]. For very large problems and a large number of parallel processes, the coarse space becomes too large and must itself be solved with a preconditioned iterative solver; in that case we use, by default, preconditioned CG with the BoomerAMG [9] preconditioner. It is important to note that in such cases, since the coarse solve is inexact, the preconditioner for the (overall) Krylov method is no longer stationary. It is therefore necessary to switch from standard preconditioned CG to a flexible Krylov solver; in our case we use Flexible GMRES as provided by dune-istl [8].
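The redundant coarse solve described above can be sketched as follows, assuming UMFPack is available through dune-istl. The concrete matrix and vector types, and the way A_H has been broadcast beforehand, are assumptions and differ from the actual dune-pdelab implementation.

```cpp
// Illustrative sketch: every rank holds a full copy of A_H and solves the coarse
// problem redundantly with the sparse direct solver UMFPack from dune-istl.
#include <dune/common/fmatrix.hh>
#include <dune/common/fvector.hh>
#include <dune/istl/bcrsmatrix.hh>
#include <dune/istl/bvector.hh>
#include <dune/istl/umfpack.hh>

using CoarseMatrix = Dune::BCRSMatrix<Dune::FieldMatrix<double, 1, 1>>;
using CoarseVector = Dune::BlockVector<Dune::FieldVector<double, 1>>;

void redundantCoarseSolve(const CoarseMatrix& AH, CoarseVector& uH, CoarseVector bH)
{
  Dune::UMFPack<CoarseMatrix> direct(AH, /*verbose=*/0);  // factorise A_H (once per setup)
  Dune::InverseOperatorResult stats;
  direct.apply(uH, bH, stats);                            // identical solve on every rank
}
```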

Using and extending dune-composites
In this section we provide an overview of the dune-composites code, sufficient to enable other scientists to leverage the framework. The code is structured so that little additional knowledge of Dune and/or C++ is required to apply the code within the existing functionality. Fig. 3 shows the code structure. A user can extend any of the functionality, e.g. implement a new solver, define a more complex nonlinear problem (e.g. cohesive zones) or introduce new types of elements.

Defining a Model
At the highest level, an analysis is defined by a user-defined Model class, derived from the provided baseStructuredGridModel class, shown in green in Fig. 3.
This class defines all the key variables, functions and classes which describe the analysis, as well as storing any variables that are required for postprocessing or later calculations. A base model class is provided, which can be inherited by each example. This provides default variables and functions, so that the user need only overwrite those functions which deviate from the base class. The Model class also defines the general loading on the structure and the boundary conditions; these include Dirichlet and Neumann conditions, but also thermal loading and multi-point constraints. Periodic boundary conditions are defined within the grid data structure, using Model::LayerCake(). Examples of user-defined Model classes for a series of applications are provided in Section 5, which follows.

Internals of dune-composites
The functions provided in the subfolder /Setup provide support for the geometric setup of the grid geometry, material properties and boundary conditions. This includes the composite layering (or stacking sequence), the structural geometric shape of the component and, in some cases, adding a perturbation to the geometry to form a defect (see for example Section 5.1). Because of the uniform layering and planar anisotropy in composite laminates, our first version has focused on structured, overlapping grid implementations using Dune::YaspGrid and Dune::GeometryGrid. The Dune::GeometryGrid functionality allows us to apply any continuous transformation to the basic Cartesian mesh provided by Dune::YaspGrid. The development of an unstructured grid implementation using Dune::UGGrid [23] is described in Section 7 on future work.
The folder /Driver contains the key functions and classes which relate to the FE calculations beyond what is available directly from Dune::PDELab. In particular, these include all element calculations, the definitions of new FEs, solvers and preconditioners. The functions and classes are split between three folders:
• /localOperators defines the weak form of the equations to be solved on an element, along with any support functions. In our case, for the anisotropic linear elasticity equations, we define the local operator linearelasticity.hh, which returns the element stiffness matrix, load vector and residual as defined by Eq. (7).
• /FEM defines specialist finite elements beyond those defined by Dune::PDELab. In our case, these are the family of serendipity elements [24]. The use of these elements is then defined explicitly in the Driver class, where the FE space is set up on the grid.
• /Solvers defines specialist solvers and preconditioners beyond those defined in Dune::PDELab and dune-istl [8]. The Driver uses the solver defined by the Model class, via the templated class function Model::solve(). By default, as defined by baseStructuredGridModel, for parallel calculations we use a CG Krylov solver, preconditioned with either a one- or two-level additive Schwarz method. Two-level methods use GenEO as the coarse space, as long as ARPACK++ is available; if not, the coarse space consists of only the zero energy modes [10]. In sequential mode, in particular for the coarse and the local solves, the sparse direct solver UMFPack [22] and iterative CG preconditioned with AMG, as provided by Dune::PDELab::ISTLBackend_SEQ_CG_AMG_SSOR, are also available. In /Solvers/hypre we provide a wrapper to the external parallel solvers provided by hypre [25], including BoomerAMG [9].
The analysis is defined via a Driver. This is a class which provides the complete solution procedure, from the setup of the grid, the definition of a finite element space, assembly of the global stiffness matrix and load vector, and finally, calling the solver, followed by the PostProcessing routines, as defined by the Model class. In this version, we provide two driver classes for the examples considered in Section 5: /FEMDriver/linearStatic.hh and /FEMDriver/ThermalStatic.hh. The folder /PostProcessing contains various classes and functions for the postprocessing of results. By default, after the solution has been calculated the driver initiates certain postProcessing steps, in particular the calculation of stresses, the creation of the necessary data for any plots and finally the calculation of any further quantities of interest. All of these routines can be modified by the user.

dune-composites -examples
In this section, we introduce and demonstrate the functionality of dune-composites through a series of examples. The examples are based on a laminate of 12 composite plies separated by 11 isotropic resin interface layers, giving a total thickness of T = 2.98 mm. In each of the examples, we discretise the geometry with quadratic, 20-node serendipity elements (with full Gaussian integration). For the base mesh, which will be refined, we take 20 elements in the x-direction, 5 in the y-direction and, through the thickness, 2 per composite layer and 1 per interface layer. This gives a total of 3,500 elements, with 13,608 degrees of freedom.
The geometry, the stacking sequence and the initial finite element mesh are introduced into a model by overwriting the base class function Model::LayerCake(), with the following user defined function.
Here the geometry and grid are defined by a file ''stackingSequences/example1.csv''.
In these first two examples, we demonstrate a very simple setup and run on a single processor as well as on a few processors. We consider a cantilever beam with a uniform pressure of 0.01 MPa applied to the top face, fully clamped (u = 0) at one end. All other boundary conditions are assumed to be homogeneous Neumann conditions, i.e. σ_ij n_j = 0.

Example 1a: A flat composite plate -getting started
Our first study computes the maximum vertical deflection of the cantilever beam as our quantity of interest. We use the default sequential and parallel solvers. On a single processor (e.g. with the call ./Example1a), the sparse direct solver UMFPack [22] is used. Otherwise, if more than one processor is used (e.g. with the call mpirun -np 8 ./Example1a), the equations are solved with CG, preconditioned with a one-level additive Schwarz preconditioner as defined by (12), using UMFPack as the local solver on each subdomain.
As output, the quantity of interest is printed to the screen (Maximum Vertical Displacement in Example01a = 1.23992 mm). Furthermore, the data for plots of the laminate stacking sequence, the solution (deformation) and the stress field are generated and provided in three files named Example01a_xxx.vtu. The stacking sequence and solution are shown in Fig. 4.

Example 1b: A flat composite plate -testing preconditioners (up to 32 cores)
In Example01b, we test our new preconditioner GenEO on up to 32 processors. In this example we also demonstrate the inclusion of a failure criterion. To do this, we change the quantity of interest to be the pressure q = q⋆ in the boundary condition (21) at which the laminate fails according to the Camanho criterion [27], defined by a functional F(σ) of the element stress state. We apply the Camanho criterion only in the resin-rich interface layers, and we say that failure occurs at a load q⋆ if max_{x∈Ω_Inter} F(σ(x)) = 1. However, since the problem is linear, it suffices to solve only one problem with an arbitrary load q. The failure load is then given by q⋆ := q / max_{x∈Ω_Inter} F(σ(x)). Expression (23) is implemented in the file PostProcessing/FailureCriterion/Camanho.hh within the code. Within linearStaticDriver, by default, the stress field (per element) is stored within the container stress_mech (a 6 × 1 vector). To compute q⋆, the Camanho functional is first calculated in each element, using the function Dune::Composites::Camanho(). In this simple test, we note that failure initiates due to high through-thickness stresses in the interface between layers (σ_13 and σ_23) as the laminate bends.
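Since the exact form of the Camanho functional F(σ) is not reproduced here, the sketch below takes it as a callable argument and only illustrates the linear rescaling q⋆ = q / max F described above; all names are hypothetical stand-ins for the corresponding dune-composites routines.

```cpp
// Illustrative sketch of the rescaling q* = q / max_{interface elements} F(sigma):
// solve once with an arbitrary load q, then rescale with the maximum of the
// failure functional over the resin interface elements.
#include <algorithm>
#include <array>
#include <functional>
#include <vector>

double failureLoad(double q,
                   const std::vector<std::array<double, 6>>& interfaceStresses,   // per element
                   const std::function<double(const std::array<double, 6>&)>& F)  // Camanho functional
{
  double maxF = 0.0;
  for (const auto& sigma : interfaceStresses)
    maxF = std::max(maxF, F(sigma));   // evaluate F in each interface element
  return q / maxF;                     // failure load q*
}
```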
For further engineering discussion of the failure of composites under the Camanho criterion, we point the reader to the original paper [27] and to [6,26]. We use this test example to demonstrate the influence of the GenEO coarse space on the parallel iterative solver. For the first experiment we use 16 processors. Fig. 5 (left) shows the first nine non-zero energy modes of a subdomain with no global Dirichlet boundary. Linear combinations of these functions, together with the zero energy modes (six rigid body translations and rotations), provide a good low-dimensional representation of the system on that subdomain, consisting of those modes most easily energetically excited. Fig. 5 (right) shows the influence of the coarse space on the number of iterations for the preconditioned Krylov solver (pCG), comparing no coarse space (one-level additive Schwarz), only the zero energy modes (ZEM) and the GenEO coarse space. The need for a coarse space is clear; with no coarse space, even in this simple test case we observe the well-documented stagnation phenomenon for iterative solvers [10] between iterations 10 and 100. With a coarse space (ZEM or GenEO), the convergence shows no stagnation and is much faster, close to optimal with GenEO.
Next, we want to study the robustness of GenEO as a function of the number of subdomains, in comparison to one-level additive Schwarz (AS) and ZEM. We consider a fixed-size problem and increase the number of subdomains. We note that the tests can be run with mpirun -np 16 ./Example01b or mpirun -np 16 ./Example01bBoomerAMG. In each case we record the condition number, the dimension of the coarse space dim(V_H) (if applicable) and the number of CG iterations to achieve a residual reduction of 10^{-5}. The results are summarised in Table 1. We see that the iteration counts (and the condition number estimates) increase steadily with the number of subdomains when no coarse space is used. The condition number estimate is still fairly large if only the zero energy modes are used, and the iteration counts also increase steadily with the number of subdomains. In contrast, the iterations and the condition number estimates remain constant for the GenEO preconditioner.
We also add a comparison with BoomerAMG [9] for this test problem. BoomerAMG provides a large number of parameters to fine-tune. We retained the defaults for most parameters (HMIS coarsening without aggressive refinement levels and a hybrid Gauss-Seidel smoother). We used blocked aggregation with block size 3, as recommended for elasticity problems. A strong threshold of 0.75 was chosen after testing values in the range from 0.4 to 0.9. Due to a lower setup cost with this parameter setting, the BoomerAMG solver is faster in actual CPU time, but the numbers of iterations, albeit also constant, are more than 10 times larger.
For more complex geometries, BoomerAMG does not perform very well and, in our tests, it does not scale beyond about 100 cores in composite applications.

Example 2: Corner unfolding -validation & performance comparison with Abaqus (up to 32 cores)
This example is motivated by the industrial challenge of certifying the corner-bend strength of a wing spar as its corner unfolds due to the internal fuel pressure in an aeroplane wing. We use this example to demonstrate the validity of the results of dune-composites by comparing the computed stresses with those given by Abaqus. We also make a cost comparison between the two software packages up to 32 cores.

Table 1
Demonstration of the performance of different preconditioners for Example01b for fixed problem size (30,000 DOFs) but increasing number of subdomains: number of pCG iterations (it), coarse space dimension dim(V_H) and an estimate of the condition number κ of the preconditioned system matrix M^{-1}A, for one-level additive Schwarz, ZEM and GenEO.

The model setup is shown in Fig. 6. We consider the curvilinear coordinate system (s, r, ℓ), where s is around the radius, r is outwards (or normal) to the laminate and ℓ runs along the length of the sample. For our particular test, the two limbs of the coupon are of length L = 3 mm and border a corner with a radius of R = 6.6 mm. The width is taken to be W = 15 mm. The 12 plies and the 11 interfaces have the same properties as in Example01, but the stacking sequence is slightly different; it is specified in stackingSequences/example2.csv. Furthermore, we apply a resin treatment of 2 mm to the free edges of the laminate, as shown in Fig. 6 (left). The advantages of edge treatment have been shown in [26]: it reduces conservatism in the design of aircraft structures, as well as making the analyses more reliable by eliminating stress singularities at the free edges. Away from the points of contact, a standard four-point bend test, as detailed in ASTM D6415 [28], generates a pure moment on the corner. To simulate such a moment, all degrees of freedom at the boundary of one limb are clamped. At the other end, all nodes are tied with a multi-point constraint, through which a running moment of 96.8 Nmm/mm is applied. This is achieved by applying an offset load from the mid-plane of the laminate as a Neumann boundary condition, implemented with the user-defined function evaluateNeumann().
The finite element mesh consists of 56 × 56 columns of hexahedral 20-node serendipity elements (element C3D20R in Abaqus [7]) in its local ℓ and s coordinates. In the r direction, each (fibrous and resin) layer is discretised into 6 elements, leading to an overall number of 432,768 elements. All of the geometry and mesh parameters are defined in stackingSequences/example2.csv. To ensure a sufficient resolution of the strong gradients of the solution at the free edges and at the material discontinuities, the mesh is graded along the width towards both free edges, and in the radial direction towards each of the fibre-resin interfaces. Grading is defined by the ratio between the largest and smallest elements in the mesh, called the bias ratio, chosen to be 400 between the centre and the edge in the ℓ direction and 10 between the layer centres and interfaces in the r direction. The specification of the geometry and of the mesh grading is defined in the function gridTransformation().
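To make the notion of the bias ratio concrete, the sketch below generates one-dimensional node positions whose element sizes follow a geometric progression with a prescribed largest-to-smallest ratio. It illustrates the grading concept only and is not the gridTransformation() used in this example.

```cpp
// Illustrative sketch: n elements on [0,1] with geometrically decreasing sizes towards x = 1
// (e.g. towards a free edge or a material interface), such that
// (largest element size) / (smallest element size) equals the prescribed bias ratio.
#include <cmath>
#include <vector>

std::vector<double> gradedNodes(int n, double biasRatio)   // requires n >= 2
{
  const double q = std::pow(biasRatio, 1.0 / (n - 1));     // common ratio between neighbouring elements
  std::vector<double> x(n + 1, 0.0);
  double h = 1.0;                                          // unnormalised size of the first element
  for (int k = 1; k <= n; ++k) { x[k] = x[k - 1] + h; h /= q; }
  for (double& xi : x) xi /= x[n];                         // rescale so that the last node is at 1
  return x;
}
```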
In Fig. 7 (left and middle), we compare the radial stress (denoted by σ_r) and the through-thickness shear stress (denoted by τ_sℓ) recovered from dune-composites and Abaqus. We see good agreement between the two codes. There are two small differences: Abaqus uses reduced integration while our example uses full integration, and the stresses are not recovered in an identical way from the displacements in the two codes. In Fig. 7 (right) we see the absolute cost and the parallel scalability of the sparse direct solver in Abaqus and the iterative CG solver in dune-composites for a fixed total problem size, i.e. a strong scaling test. The red and blue curves show one-level and two-level overlapping Schwarz methods respectively, both of which perform better than the sparse direct solver (green) in Abaqus. However, both codes show optimal parallel scalability up to 32 cores. In dune-composites the problem is decomposed into 8 subdomains for 1-8 cores, distributed evenly to the available cores.
On 16 and 32 cores, each core is assigned exactly one subdomain, i.e. the number of subdomains is 16 and 32, respectively. The local problems on each subdomain are solved using the sparse direct solver UMFPack [22]. The simulations in Abaqus use a parallel sparse direct solver, based on a parallel multi-frontal method similar to [29]. Abaqus's iterative solver, which is based on CG preconditioned with ML [30] (another black-box AMG preconditioner), does not converge in a reasonable time for this problem. Therefore, the computational gains observed here are really the difference between using a direct and a robust iterative solver. Importantly, we note that the parallel sparse direct solver available in Abaqus does not scale beyond 64 cores [7], making it unsuitable for problems much bigger than the one considered here, and reinforcing the need for robust iterative solvers and therefore for dune-composites as a package.

Example 3: Large composite structure - parallel efficiency of dune-composites (up to 15,360 cores)
The industrially motivated problem described in this section is to assess the strength of a wingbox with a small, localised wrinkle defect. Wrinkle defects, which can form during the manufacturing process [31,32], occur at the layer scale. They lead to strong local stress concentrations [6,33], causing premature failure. Naturally, good mesh resolution around the defect is required, leading to very large problem sizes. We reiterate that this example serves to represent structural-scale modelling. Therefore, sub-structural phenomena, such as stiffening of the upper and lower covers, are not modelled here. The specific layer sequencing has been chosen, using a discrete optimiser, to ensure that each laminate is balanced and symmetric with no bend-twist coupling, whilst maximising the number of continuous orientations around the wingbox. Transitions between each of the stacking sequences are achieved over a relatively short segment of 5 cm, and the chosen stacking sequence is in no way optimised for strength in these regions, as considered for example by Dillinger et al. [34]. In practice, this change of stacking sequence is easy for the user to specify, using a .csv file specifying different Regions for each segment of the wingbox and providing the required stacking sequence for each. The wingbox geometry is again achieved by specifying a gridTransformation(), which now becomes slightly more complex in order to handle each of the different regions. To create the closed curve of this wingbox, periodic boundary conditions are imposed.
In this application, we consider two forms of loading. Firstly, an internal pressure of 0.109 MPa, arising from the fuel, is applied to the internal surface. Secondly, a thermal pre-stress induced by the manufacturing process is imposed, using the user-defined function evaluateHeat() (see Fig. 9).
We approximate the influence of the ribs that constrain the wingbox in the y direction, by clamping all degrees of freedom at one end, whilst tying all other degrees of freedom at the other end using a multipoint constraint. Elements to be included in the multipoint constraint are marked with the user-defined function isMPC(FieldVec& x). A localised wrinkle defect is introduced into one of the corner radii, as shown in Fig. 8. The defect is introduced by adapting the function gridTransformation().
The wrinkle geometry is defined by a random field, parameterised by a Karhunen-Loève expansion. The actual parametrisation of the wrinkle is chosen to match an observed defect in a CT scan of a real corner section. Further details of this methodology are provided in Sandhu et al. [33, Sec. 3].
We first carry out a weak scaling experiment, increasing the problem size proportionally to the number of cores used. For iterative solvers that scale optimally with respect to problem size and with respect to the number of cores, the computational time should remain constant. To scale the problem size as the number of cores N_cores grows, we refine the mesh, doubling the number of elements as we double the number of cores. The number of elements for each setup is detailed in Table 2, separately listing the number of elements across the spar and the cover, around the corners and along the length of the wingbox. The defective corner, denoted R_d, contains twice as many elements as the other three corners, denoted by R_nd. Table 2 also details the resulting number of degrees of freedom, iteration numbers for the preconditioned CG, an estimate of the condition number of the preconditioned system matrix M_AS,2^{-1}A, the dimension of the coarse space dim V_H, as well as the total run time. Fig. 10 (left) shows that the weak scaling of the iterative CG solver in dune-composites with the GenEO preconditioner is indeed almost optimal, up to at least 15,360 cores (the limiting capacity available on Archer for our experiments). We also include a more detailed subdivision of the computational time into Setup Time (for the assembly of the FE stiffness matrix and for the construction of the GenEO coarse space) and Iteration Time (for the preconditioned CG iteration). Both scale almost optimally. This test demonstrates the capability of increasing the size of the tests at a nearly constant run time and thus of solving a problem with 200 million degrees of freedom in just over 13 minutes.
Next, we carry out a small strong scaling experiment. The mesh is that of Setup 5 in Table 2, and the results of the strong scaling test are given in Fig. 10 (right) and in Table 3. We see that the iterative CG solver in dune-composites with the GenEO preconditioner scales almost optimally to at least 11,320 cores, with the time taken approximately halving as the number of cores is doubled.
Again, both the Setup and the Iteration scale optimally.

Subsurface flow application: Strong scaling for the SPE10 benchmark
In this section, we apply the GenEO solver to an elliptic partial differential problem outside of composites modelling, demonstrating its scalability and robustness on a subsurface flow problem in a highly heterogeneous medium. A challenging test case in this field is provided by the SPE10 benchmark [3,4].
(Figure captions from Section 5.3: Fig. 8 (right) is a close-up plot of the corner of the wingbox using plotProperties(), which shows the wrinkle and the interlacing of the different stacking sequences in the corner, cover and spar regions; Fig. 10 (right) is a strong scaling test using Setup 5 in Table 2, with the dashed line showing perfect scaling, as summarised in Table 3.)

Table 2
Details of the six setups and results used in the weak scaling test. In all of the tests, we used two layers of 20-node serendipity elements per fibrous layer and only one layer of elements in each of the interface layers. The number of elements per core was fixed at 2808.

For the SPE10 benchmark we solve the second-order elliptic pressure equation

−∇ · (K(x)∇u) = f in Ω,     (26)

subject to the boundary conditions u(x) = 0 on Γ_D and −K(x)∇u · n = 0 on Γ_N = ∂Ω \ Γ_D.
The SPE10 dataset gives a spatially varying permeability tensor K(x) = diag(K_x(x), K_y(x), K_z(x)) ∀x ∈ Ω. Fig. 11 (left) shows the permeability field; it is constant in each cell, but varies strongly over the domain. The parameters K_x and K_y vary from 6.65×10^{-4} to 2.0×10^{4}, and the parameter K_z varies from 6.65×10^{-8} to 6.0×10^{3}. We define the function space for the pressure u as V := {v ∈ H^1(Ω) : v(x) = 0, x ∈ Γ_D}, and choose the finite element space V_h ⊂ V to be the set of continuous, piecewise linear functions on T_h. The finite element discretisation of (26) then reads: find u_h ∈ V_h such that

∫_Ω K(x) ∇u_h · ∇v_h dx = ∫_Ω f v_h dx, for all v_h ∈ V_h.     (28)

By defining u_h = Σ_{i=1}^{N} u^{(i)} φ_i(x), we again obtain a sparse system of equations Au = b, where u = [u^{(1)}, u^{(2)}, . . . , u^{(N)}]^T is the vector of pressures at each cell vertex, which is assembled element-wise from (28) using standard Gaussian integration. The source term in our experiments is assumed to be uniform, f ≡ 1.
In Fig. 11 (right) and Table 4 we show a small strong scaling experiment performed with this challenging setup. The parameter contrast for this benchmark is of the order of 10^{11}. Nevertheless, we see that the iterative CG solver in dune-composites with the GenEO preconditioner scales almost optimally to at least 256 cores. At 512 cores, with only around 2000 elements per core, the strong scaling begins to break down due to the communication overhead. Due to the layered structure of the material parameters, our domain decomposition is two-dimensional; each subdomain includes the full extent of the domain in the z-direction. We used a minimal overlap of only one element. Table 4 also details the number of cores, the size of the coarse space dim V_H, iteration numbers for the preconditioned CG, as well as the time spent in CG iterations, the setup time and the total run time.

Discussion & future developments
In this paper, we describe the new high-performance package dune-composites, designed to solve massive finite element problems for the anisotropic linear elasticity equations. The paper provides the mathematical foundations of the methods, their implementation within a state-of-the-art software platform on modern distributed memory computer architectures, and details of how to set up a problem and carry out an analysis, illustrated via a series of increasingly complex examples. In addition, we demonstrate the scalability of the new solver on over 15,000 cores of the UK national supercomputer Archer, solving industrially motivated problems with over 200 million degrees of freedom within minutes. This scale of computation brings composites problems that would otherwise be unthinkable into the feasible range.
The disadvantage of dune-composites as a package over commercial counterparts is currently its limited functionality for more general problems; this includes complex geometries, unstructured grids and nonlinear problems. This is the first release of dune-composites, and therefore the functionality is naturally still limited, but it will increase over time, driven by the industrial questions we seek to solve as a community of developers. There are currently four key areas of active development:
• Multiscale methods: There is a strong connection between coarse spaces for domain decomposition methods, as developed and implemented within the GenEO preconditioner, and multiscale discretisation methods, such as generalised multiscale FEs (GMsFE) [35]. In fact, the GenEO coarse space provides a natural multiscale method. Current research [36] is aiming to provide this functionality as an embedded solution scheme within dune-composites.
• Nonlinear mechanics: Currently, the analysis implemented in the package is linear (static and thermal). Naturally, modelling failure propagation in composites is a nonlinear problem. Current research on implementing nonlinear material models includes cohesive zone models and ply damage models, for integration within the existing framework.
• Uncertainty quantification: A significant motivation for developing efficient robust solvers is to enable stochastic studies in which many simulations of a model with different parameters are required. The package has already been used to study the effects of wrinkle defects in composite strength [33], and ongoing research is exploring Bayesian parameter estimation using multilevel Markov chain Monte Carlo methods [37,38].
• GUI development: A current limitation of the package is that its application to new models or geometries requires a basic knowledge of C++ and command line programming. Active development with software engineers is seeking to provide a simple Graphical User Interface (GUI) to enable application-focused research with dune-composites without extensive programming experience.