The DUNE Framework: Basic Concepts and Recent Developments

This paper presents the basic concepts and the module structure of the Distributed and Unified Numerics Environment and reflects on recent developments and general changes that have happened since the release of the first Dune version in 2007 and the main papers describing that state [1, 2]. This discussion is accompanied by a description of various advanced features, such as coupling of domains and cut cells, grid modifications such as adaptation and moving domains, high order discretizations and node level performance, non-smooth multigrid methods, and multiscale methods. A brief discussion of current and future development directions of the framework concludes the paper.


Introduction
The Distributed and Unified Numerics Environment Dune [1,2] is a free and open source software framework for the grid-based numerical solution of partial differential equations (PDEs) that has been developed for more than 15 years as a collaborative effort of several universities and research institutes. In its name, the term "distributed" refers to distributed development as well as distributed computing. The enormous importance of numerical methods for PDEs in applications has led to the development of a large number of general (i.e. not restricted to a particular application) PDE software projects. Many of them will be presented in this special issue and an incomplete list includes AMDIS [3], deal.II [4], FEniCS [5], FreeFEM [6], HiFlow [7], Jaumin [8], MFEM [9], Netgen/NGSolve [10], PETSc [11] and UG4 [12].
The distinguishing feature of Dune in this arena is its flexibility combined with efficiency. The main goal of Dune is to provide well-defined interfaces for the various components of a PDE solver, for which specialized implementations can then be provided. Dune is not built upon one single grid data structure or one sparse linear algebra implementation. All these components are meant to be exchangeable. This philosophy is based on a quote from The Mythical Man-Month: Essays on Software Engineering by Frederick Brooks [13, p. 102]: "Sometimes the strategic breakthrough will be a new algorithm … Much more often, strategic breakthrough will come from redoing the representation of the data or tables. This is where the heart of a program lies." This observation has led to the design principles of Dune stated in [2]: i. Keep a clear separation of data structures and algorithms by providing abstract interfaces algorithms can be built on. Provide different, special purpose implementations of these data structures.
ii. Employ generic programming using templates in C++ to remove any overhead of these abstract interfaces at compile-time. This is very similar to the approach taken by the C++ standard template library (STL).
iii. Do not reinvent the wheel. This approach allows us also to reuse existing legacy code from our own and other projects in one common platform.
Another aspect of flexibility in Dune is to structure the code as much as possible into separate modules with clear dependencies.

The Dune ecosystem
The modular structure of Dune is implemented by conceptually splitting the code into separate, interdependent libraries. These libraries are referred to as Dune-modules (not to be confused with C++-modules and translation units). The Dune project offers a common infrastructure for hosting, developing, building, and testing these modules. However, modules can also be maintained independently of the official Dune infrastructure.
The so-called Dune core modules are maintained by the Dune developers and each stable release provides consistent releases of all core modules. These core modules are:
Dune-Common: Basic infrastructure for all Dune modules, such as build scripts and dense linear algebra.
Dune-Geometry: Reference elements, element transformations and quadrature rules.
Dune-Grid: Abstract interface for general grids featuring any dimension, various element types, conforming and nonconforming grids, local hierarchical refinement, and parallel data decomposition. Two example implementations are provided.
Dune-ISTL: Abstract interfaces and implementations for sparse linear algebra and iterative solvers, allowing compile-time known blocking.
Dune-LocalFunctions: Finite elements on the reference element.
While the core modules build a common, agreed upon foundation for the Dune framework, higher level functionality based on the core modules is developed by independent groups, and there exist concurrent implementations of some features focusing on different aspects. These additional modules can be grouped into the following categories:
Grid modules provide additional implementations of the grid interface and so-called meta grids, implementing additional functionality on top of another grid implementation.
Discretization modules provide full-fledged implementations of the finite element method based on the core modules. The most important ones are Dune-Fem, Dune-Fufem and Dune-PDELab.
Extension modules provide additional functionality interesting for all Dune users which is not yet part of the core modules. Examples are Python bindings, a generic implementation of grid functions and support for system testing.
User modules are all other Dune modules; they usually provide application implementations for specific research projects as well as new development features that are not yet used by a large number of other Dune users but may become extension modules over time. Among the additional modules there are some so-called staging modules which are considered to be of wider interest and may be proposed to become part of the core in the future.
The development of the Dune software takes place on the Dune gitlab instance (https://gitlab.dune-project.org) where users can download and clone all openly available Dune git repositories. They can create their own new projects, discuss issues, and open merge requests to contribute to the code base. The merge requests are reviewed by Dune developers and others who want to contribute to the development. For each commit to the core and extension modules, continuous integration tests are run to ensure code stability.

Dune core modules and re-usable concepts
In this section we want to give an overview of the central components of Dune, as they are offered through the core modules. The modules are described as of Dune version 2.7 (which corresponds to the current master at the time of writing). The focus is on those components modeling mathematical abstractions needed in a finite element method. We will discuss the Dune-Grid and Dune-ISTL modules in detail, explain the basic ideas of the Dune-LocalFunctions and Dune-Functions modules, and discuss how the recently added Python support provided by the Dune-Python module works. While Dune-Common offers central infrastructure and foundation classes, its main purpose is that of a generic C++ toolbox, and we will only briefly introduce it when discussing the build system and infrastructure in general.

The Dune grid interface – dune-grid
The primary objects of interest when solving partial differential equations (PDEs) are functions f : Ω → R, where the domain Ω is a (piecewise) differentiable d-manifold embedded in R^w, w ≥ d, and R = R^m or R = C^m is the range. In grid-based numerical methods for the solution of PDEs the domain Ω is partitioned into a finite number of open, bounded, connected and nonoverlapping subdomains Ω_e, e ∈ E, with E the set of elements, satisfying ⋃_{e∈E} cl(Ω_e) = cl(Ω) and Ω_e ∩ Ω_e′ = ∅ for e ≠ e′.
This partitioning serves three separate but related tasks: i. Description of the manifold. Each element e provides a diffeomorphism (an invertible differentiable map) µ_e : Ω̂_e → Ω_e from its reference domain Ω̂_e ⊂ R^d to the subdomain Ω_e ⊂ R^w. It is assumed that the maps µ_e are continuous and invertible up to the boundary ∂Ω̂_e. Together these maps give a piecewise description of the manifold.
ii. Computation of integrals. Integrals can be computed by partitioning and transformation: ∫_Ω f(x) dx = Σ_{e∈E} ∫_{Ω̂_e} f(µ_e(x̂)) dµ_e(x̂). Typically, the reference domains Ω̂_e have a simple shape that is amenable to numerical quadrature.
iii. Approximation of functions. Complicated functions can be approximated subdomain by subdomain, for example by multivariate polynomials p_e(x) on each subdomain Ω_e.
The goal of Dune-Grid is to provide a C++ interface to describe such subdivisions, from now on called a "grid", in a very general way. In addition Task iii. requires further information to associate data with the constituents of a grid. The grid interface can handle arbitrary dimension d (although naive grid-based methods become inefficient for larger d), arbitrary world dimension w as well as different types of elements, local grid refinement and parallel processing.

Grid entities and combinatorial properties
Our aim is to separate the description of grids into a geometrical part, mainly the maps µ e introduced above, and a combinatorial part describing how the elements of the grid are constructed hierarchically from lower-dimensional objects and how the grid elements are glued together to form the grid.
The combinatorial description can be understood recursively over the dimension d. In a one-dimensional grid, the elements are edges connecting two vertices, and two neighboring elements share a common vertex. In the combinatorial description of a grid the position of a vertex is not important, but the fact that two edges share a vertex is. In a two-dimensional grid the elements might be triangles and quadrilaterals, which are made up of three, respectively four, edges. Elements could also be polygons with any number of edges. If the grid is conforming, adjacent elements share either a complete common edge together with its two vertices, or just a single vertex.
In a three-dimensional grid elements might be tetrahedra or hexahedra consisting of triangular or quadrilateral faces, or other types up to very general polyhedrons.
In order to facilitate a dimension-independent description of a grid we call its constituents entities. An entity e has a dimension dim(e), where the dimension of a vertex is 0, the dimension of an edge is 1 and so on. In a d-dimensional grid the highest dimension of any entity is d and we define the codimension of an entity as codim(e) = d − dim(e).
We introduce the subentity relation ⊆ with e′ ⊆ e if e′ = e or e′ is an entity contained in e, e.g. a face of a hexahedron. The set U(e) = {e′ : e′ ⊆ e} denotes all subentities of e. The type of an entity, type(e), is characterized by the graph (U(e), ⊆) being isomorphic to that of a specific reference entity ê ∈ Ê (the set of all reference entities). A d-dimensional grid is now given by all its entity sets E_c of codimension 0 ≤ c ≤ d. Entities of each set E_c are represented by a different C++ type, with the codimension c as a template parameter. In particular we call E_d the set of vertices, E_{d−1} the set of edges, E_1 the set of facets and E_0 the set of elements. Grids of mixed dimension are not allowed, i.e. for every e′ ∈ E_c, c > 0, there exists e ∈ E_0 such that e′ ⊆ e. We refer to [1,14] for more details on the formal properties of a grid.
Dune provides several implementations of grids, all implementing the Dune-Grid interface. Algorithms can be written generically to operate on different grid implementations. We now provide some code snippets to illustrate the Dune-Grid interface. First we instantiate a grid:

    const int dim = 4;
    using Grid = Dune::YaspGrid<dim>;
    Dune::FieldVector<double, dim> len;
    for (auto& l : len) l = 1.0;
    std::array<int, dim> cells;
    for (auto& c : cells) c = 4;
    Grid grid(len, cells);

Here we selected the YaspGrid implementation providing a d-dimensional structured, parallel grid. The dimension is set to 4 and given as a template parameter to the YaspGrid class. Then the arguments for the constructor are prepared: the length of the domain per coordinate direction and the number of elements per direction. Finally, a grid object is instantiated. Construction is implementation specific; other grid implementations might read a coarse grid from a file.
Grids can be refined in a hierarchic manner, meaning that elements are subdivided into several smaller elements. The element to be refined is kept in the grid and remains accessible. More details on local grid refinement are provided in Section 4.1 below. The following code snippet refines all elements once and then provides access to the most refined elements in a so-called GridView:

    grid.globalRefine(1);
    auto gv = grid.leafGridView();

A GridView object provides read-only access to the entities of all codimensions in the view. Iterating over the entities of a certain codimension is done by the following snippet using a range-based for loop:

    const int codim = 2;
    for (const auto& e : entities(gv, Dune::Codim<codim>{}))
      if (!e.type().isCube())
        std::cout << "no cube" << std::endl;

In the loop body the type of the entity is accessed and tested for being a cube (here of dimension 2 = 4 − 2). Access via more traditional names is also possible, e.g. vertices(gv), edges(gv) and elements(gv). In addition, an element gives access to its subentities of a given codimension; this corresponds to iterating over U(e) ∩ E_c for a given e ∈ E_0.

Geometric aspects
Geometric information is provided for e ∈ E_c by a map µ_e : Ω̂_e → Ω_e, where Ω̂_e is the domain associated with the reference entity ê of e and Ω_e is its image on the manifold Ω. Usually Ω̂_e is one of the usual shapes (simplex, cube, prism, pyramid) for which numerical quadrature formulae are available. However, the grid interface also supports arbitrary polygonal elements; in that case no maps µ_e are provided and only the measure and the barycenter of each entity are available. Additionally, the geometry of intersections Ω_e ∩ Ω_e′ with (d−1)-dimensional measure for e, e′ ∈ E_0 is provided as well.
Working with geometric aspects of a grid requires working with positions, e.g. x̂ ∈ Ω̂_e, functions such as µ_e, and matrices. In Dune these follow the DenseVector and DenseMatrix interfaces; the most common implementations are the class templates FieldVector and FieldMatrix, providing vectors and matrices with compile-time known size built on any data type having the operations of a field:

    A.mv(x, y);         // y = A x
    A.usmv(0.5, x, y);  // y += alpha * A x with alpha = 0.5

An entity e (of any codimension) offers the method geometry() returning (a reference to) a geometry object which provides, among other things, the map µ_e : Ω̂_e → Ω_e mapping a local coordinate in its reference domain Ω̂_e to a global coordinate in Ω_e. Additional methods provide e.g. the barycenter and the volume of Ω_e. They are used in the following code snippet to approximate the integral over a given function u using the midpoint rule:

    double integral = 0.0;
    for (const auto& e : elements(gv))
      integral += u(e.geometry().center()) * e.geometry().volume();

For more accurate integration Dune provides a variety of quadrature rules which can be selected depending on the reference element and quadrature order. Each rule is a container of quadrature points having a position and a weight. The code snippet below computes the integral over a given function with a fifth order quadrature rule on any grid in any dimension. It illustrates the use of the global() method on the geometry, which evaluates the map µ_e for a given (quadrature) point. The method integrationElement() on the geometry provides the measure arising in the transformation formula of the integral.
    double integral2 = 0.0;
    using QR = Dune::QuadratureRules<Grid::ctype, Grid::dimension>;
    for (const auto& e : elements(gv))
    {
      auto geo = e.geometry();
      const auto& quadrature = QR::rule(geo.type(), 5);
      for (const auto& qp : quadrature)
        integral2 += u(geo.global(qp.position()))
                   * geo.integrationElement(qp.position())
                   * qp.weight();
    }

An intersection I describes the intersection Ω_I = ∂Ω_{i(I)} ∩ ∂Ω_{o(I)} of two elements i(I) and o(I) in E_0. Intersections can be visited from each of the two elements involved. The element from which I is visited is called the inside element i(I) and the other one is called the outside element o(I). Note that I is not necessarily an entity of codimension 1 in the grid; in this way Dune-Grid allows for nonconforming grids. In a conforming grid, however, every intersection corresponds to a codimension 1 entity. For an intersection three maps are provided: the first map describes the domain Ω_I by a map from a corresponding reference element, and the other two maps describe the embedding of the intersection into the reference elements of the inside and outside element, see Figure 1. Intersections Ω_I = ∂Ω_{i(I)} ∩ ∂Ω with the domain boundary are treated in the same way except that the outside element is omitted.

Attaching data to a grid
In grid-based methods, data such as degrees of freedom in the finite element method is associated with geometric entities and stored in containers, such as vectors, external to the grid. To that end, the grid provides an index for each entity that can be used to access random-access containers. Often there is data only for entities of a certain codimension and geometrical type (identified by its reference entity). Therefore we consider subsets of entities having the same codimension and reference entity: E_{c,ê} = {e ∈ E_c : e has reference entity ê}.
The grid provides bijective maps index_{c,ê} : E_{c,ê} → {0, …, |E_{c,ê}| − 1} enumerating all the entities in E_{c,ê} consecutively, starting with zero. In simple cases where only one data item is to be stored for each entity of a given codimension and geometric type this can be used directly to store data in a vector, as shown in the following example:

    const auto& indexset = gv.indexSet();
    Dune::GeometryType gt(Dune::GeometryTypes::cube(dim)); // encodes (c, ê)
    std::vector<double> volumes(indexset.size(gt));        // allocate container
    for (const auto& e : elements(gv))
      volumes[indexset.index(e)] = e.geometry().volume();

Here, the volumes of the elements in a single element type grid are stored in a vector. Note that a GeometryType object encodes both the dimension and the geometric type, e.g. simplex or cube. In more complicated situations an index map for entities of different codimensions and/or geometry types needs to be composed of several of the simple maps. This leaves the layout of degrees of freedom in a vector under user control and allows the realization of different blocking strategies. Dune-Grid offers several classes for this purpose, such as the MCMGMapper. When a grid is modified through adaptive refinement, coarsening or, in the parallel case, load balancing, the index maps may change as they are required to be consecutive and zero-starting. In order to store and access data reliably when the grid is modified, each geometric entity is equipped with a global id, given by a map globalid from the set of all entities to a set I of unique identifiers. The map globalid is injective and persistent, i.e. globalid(e) does not change under grid modification when e is in the old and the new grid, and globalid(e) is not reused when e was in the old grid and is not in the new grid. There are very weak assumptions on the ids provided by globalid: they need not be consecutive, in fact they need not even be numbers. Here is how element volumes would be stored in a map:

    const auto& globalidset = gv.grid().globalIdSet();
    using GlobalId = Grid::GlobalIdSet::IdType;
    std::map<GlobalId, double> volumes2;
    for (const auto& e : elements(gv))
      volumes2[globalidset.id(e)] = e.geometry().volume();

The type GlobalId represents I and must be sortable and hashable. This requirement makes it possible to store data in, for example, an std::map or std::unordered_map.
The typical use case is to store data in vectors using an index set while the grid is in a read-only state, and to copy only the necessary data to a map using the global id set when the grid is being modified. Since using a std::map may not be the most efficient way to store data, a utility class PersistentContainer<Grid, T> exists that implements the strategy outlined above for arbitrary types T. To allow for optimization, this class can be specialized by grid implementations, using structural information to improve performance.

Grid refinement and coarsening
Adaptive mesh refinement based on a posteriori error estimation is an established and powerful technique to reduce the computational effort in the numerical solution of PDEs, see e.g. [15]. Dune-Grid supports the typical estimate-mark-refine paradigm as illustrated by the following code example:

    const int dim = 2;
    using Grid = Dune::UGGrid<dim>;
    auto pgrid = std::shared_ptr<Grid>(
        Dune::GmshReader<Grid>::read("circle.msh"));

Here the UGGrid implementation is used in dimension 2. In each refinement iteration those elements with a diameter larger than a desired value given by a function h are marked for refinement. The method adapt() actually modifies the grid, while preAdapt() determines grid entities which might be deleted and postAdapt() clears the information about new grid entities. Between preAdapt() and adapt() data from the old grid needs to be stored using persistent global ids, and between adapt() and postAdapt() this data is transferred to the new grid. In order to identify elements that may be affected by grid coarsening and refinement, the element offers two methods: mightVanish(), typically used between preAdapt() and adapt(), returns true if the entity might vanish during the grid modifications carried out in adapt(); afterwards, isNew() returns true if an element was newly created during the previous adapt() call. How an element is refined when it is marked for refinement is specific to the implementation. Some implementations offer several ways to refine an element. Furthermore, some grids may refine non-marked elements in an implementation specific way to ensure certain mesh properties, e.g. conformity. For the implementation of data restriction and prolongation, a geometryInFather() method provides the geometric mapping between parent and child elements.
Grid refinement is hierarchic in all currently available Dune-Grid implementations. Each entity is associated with a grid level. After construction of a grid object all its entities are on level 0. When an entity is refined, the entities resulting from this refinement, also called its direct children, are added on the next higher level. Each level-0-element and all its descendants form a tree. All entities on level l are the entities of the level l grid view. All entities that are not refined are the entities of the leaf grid view. This is illustrated in Figure 2. Each GridView provides its own IndexSet and thus allows one to associate data with the entities of a single level or with all entities in the leaf view.

Parallelization
Parallelization in Dune-Grid is based on three concepts: i. data decomposition, ii. the message passing paradigm and iii. single-program-multiple-data (SPMD) style programming. As for the refinement rules in grid adaptation, the data decomposition is implementation specific but must adhere to certain rules: i. The decomposition of the codimension 0 entities E_0 into sets E_{0,r} assigned to process rank r forms a (possibly overlapping) partitioning. ii. When process r stores a codimension 0 entity e then it also stores all its subentities, i.e. e ∈ E_{0,r} ∧ f ⊆ e, f ∈ E_c ⇒ f ∈ E_{c,r} for c > 0.
iii. Each entity is assigned a partition type attribute partitiontype(e) ∈ {interior, border, overlap, front, ghost} with the following semantics: iii.a. Codimension 0 entities may only have the partition types interior, overlap or ghost. The interior codimension 0 entities E_{0,r,interior} = {e ∈ E_{0,r} : partitiontype(e) = interior} form a nonoverlapping partitioning of E_0. Codimension 0 entities with partition type overlap can be used like regular entities, whereas those with partition type ghost only provide limited functionality (e.g. intersections may not be provided).
iii.b. The partition type of entities with codimension c > 0 is derived from the codimension 0 entities they are contained in. For any entity f ∈ E_c, c > 0, set Σ_0(f) = {e ∈ E_0 : f ⊆ e} and Σ_{0,r}(f) = Σ_0(f) ∩ E_{0,r}. Roughly speaking, f is interior if all elements in Σ_{0,r}(f) are interior, border if Σ_{0,r}(f) contains interior as well as other elements, and overlap, front or ghost otherwise, depending on the surrounding elements. Two examples of typical data decomposition models are shown in Figure 3: variant a) on the left, with interior/overlap codimension 0 entities, is implemented by YaspGrid; variant b) on the right, with the interior/ghost model, is implemented by UGGrid and ALUGrid.
To illustrate SPMD style programming we consider a simple example. Grid instantiation is done by all processes r with identical arguments and each process stores its respective grid partition E_{c,r}.
Parallel computation of an integral over a function using the midpoint rule is illustrated by the following code snippet:

    double integral = 0.0;
    for (const auto& e : elements(gv, Dune::Partitions::interior))
      integral += u(e.geometry().center()) * e.geometry().volume();
    integral = gv.comm().sum(integral);

In the range-based for loop we specify in addition that the iteration is restricted to interior elements. Thus, each element of the grid is visited exactly once.
After each process has computed the integral on its elements a global sum (reduce followed by a broadcast) produces the result which is now known by each process.
Data on overlapping entities E_{c,r} ∩ E_{c,s} stored by two processes r ≠ s can be communicated with the abstraction CommDataHandleIF, which describes which information is sent for each entity and how it is processed. Communication is then initiated by the method communicate() on a GridView. Although all current parallel grid implementations use the message passing interface (MPI) in their implementation, the user never has to make explicit MPI calls. Thus, an implementation could also use shared memory access to implement the Dune-Grid parallel functionality. Alternatively, multithreading can be used within a single process by iterating over grid entities in parallel. This has been implemented in the Exa-Dune project [16,17] and in Dune-Fem [18], but a common interface concept is not yet part of the Dune core functionality.

List of grid implementations
The following list gives an overview of existing implementations of the Dune-Grid interface, their properties, and the Dune module they are implemented in. Where noted, the implementation wraps access to an external library. A complete list can be found on the Dune web page https://dune-project.org/doc/grids/.
AlbertaGrid (dune-grid) Provides simplicial grids in two and three dimensions with bisection refinement based on the ALBERTA software [19].
ALUGrid (dune-alugrid) Provides a parallel unstructured grid in two and three dimensions using either simplices or cubes. Refinement is nonconforming for all element types and conforming for simplices [20].
CpGrid (opm-grid) Provides an implementation of a corner point grid, a nonconforming hexahedral grid, which is the standard in the oil industry https://opm-project.org/.
FoamGrid (dune-foamgrid) Provides one and two-dimensional grids embedded in three-dimensional space including non-manifold grids with branches [23].
UGGrid (dune-grid) Provides a parallel, unstructured grid with mixed element types (triangles and quadrilaterals in two, tetrahedra, pyramids, prisms and hexahedra in three dimensions) and local refinement. Based on the UG library [24].
YaspGrid (dune-grid) A parallel, structured grid in arbitrary dimension using cubes. Supports non-equidistant mesh spacing and periodic boundaries.
Metagrids use one or more implementations of the Dune-Grid interface to provide either a new implementation of the Dune-Grid interface or new functionality altogether. Some examples: GeometryGrid (dune-grid) Takes any grid and replaces the geometries of all entities e by the concatenation µ_geo ∘ µ_e, where µ_geo is a user-defined mapping, see Section 4.1.2.
PrismGrid (dune-metagrid) Takes any grid of dimension d and extends it by a structured grid in direction d + 1 [25].
GridGlue (dune-grid-glue) Takes two grids and provides a projection of one on the other as a set of intersections [26,27], see Section 4.2.1.
MultiDomainGrid (dune-multidomaingrid) Takes a grid and provides possibly overlapping sets of elements as individual grids [28], see Section 4.2.2.
SubGrid (dune-subgrid) Takes a grid and provides a subset of its entities as a new grid [29].
IdentityGrid (dune-grid) Wraps all classes of one grid implementation in new classes that delegate to the existing implementation. Ideal to write a new metagrid.
CartesianGrid (dune-metagrid) Takes an unstructured quadrilateral or hexahedral grid (e.g. ALUGrid or UGGrid) and replaces the geometry implementation with a strictly Cartesian geometry implementation for performance improvements.
FilteredGrid (dune-metagrid) Takes any grid and applies a binary filter to the entity sets for codimension 0 provided by the grid view of the given grid.
SphereGrid (dune-metagrid) A meta grid that provides the correct spherical mapping for geometries and normals for underlying spherical grids.
Major developments in the Dune-Grid interface
Version 1.0 of the Dune-Grid module was released on December 20, 2007. Since then a number of improvements were introduced, including the following: • The methods center(), volume() and centerUnitOuterNormal() on Geometry and Intersection were introduced to support finite volume methods on polygonal and polyhedral grids.
• GridFactory provides an interface for portably creating initial meshes. GmshReader uses that to import grids generated with gmsh.
• The Dune-Geometry module was introduced as a separate module to provide reference elements, geometry mappings, and quadrature formulae independent of Dune-Grid.
• Automatic type deduction using auto makes heavily template-based libraries such as Dune more convenient to use.
• Initially, the Dune-Grid interface tried to avoid copying objects for performance reasons: many methods returned const references to internal data and copying was disallowed. With copy elision becoming standard, copyable lightweight entities and intersections were introduced. Given an Entity of codimension c, to obtain the geometry one used to write

    using Geometry = typename Grid::template Codim<c>::Geometry;
    const Geometry& geo = entity.geometry();

whereas in newer Dune versions one can simply write

    const auto geo = entity.geometry();

using both automatic type deduction and the fact that objects are copyable. A performance comparison discussing references vs. copyable grid objects can be found in [30,31]. In order to save memory when storing entities, the entity seed concept was introduced.
• Range-based for loops for entities and intersections made iteration over grid entities very convenient. With newer Dune versions this is simply

    for (const auto& element : Dune::elements(gridView))
    {
      const auto geo = element.geometry();
      // ...
    }

• UGGrid and ALUGrid became Dune modules instead of being external libraries. This way they can be downloaded and installed like other Dune modules.

The template library for iterative solvers – Dune-ISTL
Dune-ISTL is the linear algebra library of Dune. It consists of two main components: first, it offers a collection of different vector and matrix classes; second, it features different solvers and preconditioners. While the grid interface consists of fine grained interfaces and relies heavily on static polymorphism, the abstraction in Dune-ISTL contains both coarse grained and fine grained interfaces.

Concepts behind the Dune-ISTL interfaces
A major design decision in Dune-ISTL was influenced by the observation that linear solvers can significantly benefit from the inherent structure of PDE discretizations. For example, a DG discretization leads to a block structured matrix for certain orderings of the unknowns; the same holds for coupled diffusion-reaction systems with many components. Making use of this structure often improves the convergence of the linear solver, reduces memory consumption and improves memory throughput.
Dune-ISTL offers different vector and matrix implementations and many of these can be nested. The whole interface is fully templatized with respect to the underlying scalar data type (double, float, etc.), also called the field type. Examples of such nested types are Dune::BlockVector<Dune::FieldVector<std::complex<double>,2>>, a dynamic block vector consisting of N blocks of vectors of static size 2 over the field of complex numbers, i.e., a vector in (C^2)^N.
Dune::BCRSMatrix<Dune::BCRSMatrix<double>>, a (sparse) matrix whose entries are sparse matrices with scalar entries. Such matrices might, for example, arise from a Taylor-Hood discretization of the Navier-Stokes equations, where we obtain a 4 × 4 block matrix of sparse matrices.
In particular, it is not necessary to use the same field type for matrices and vectors, as the library allows for mixed-precision setups with automatic conversions and determination of the correct return type of numeric operations. In order to allow efficient access to individual entries in matrices or vectors, these matrix/vector interfaces are static and make use of compile-time polymorphism, mainly duck typing techniques as used in the STL 2 . Solvers, on the other hand, can be formulated at a very high level of abstraction and are perfect candidates for dynamic polymorphism. Dune-ISTL defines abstract interfaces for operators, scalar products, solvers, and preconditioners. A solver, like LoopSolver or CGSolver, is parameterized with the operator, possibly a preconditioner, and usually the standard Euclidean scalar product. The particular implementations, as well as the interface, are strongly typed on the underlying vector types, but besides this it is possible to mix and match different solvers and preconditioners dynamically at runtime. While linear operators are most often stored as matrices, the interface only requires that an operator can be applied to a vector and thus also allows for implementing on-the-fly operators for implicit methods; this drastically reduces memory consumption and increases arithmetic intensity, thus overcoming performance limitations due to slow memory access. The benefit of on-the-fly operators is highlighted in section 4.5. It should be noted that many strong preconditioners, like the Dune::Amg::AMG, have stricter requirements on the operator and need access to the full matrix.
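The idea that a Krylov solver only needs operator applications and scalar products can be illustrated with a short, self-contained sketch. This is plain Python, not the Dune API; all names are invented for illustration. A conjugate gradient solver is parameterized only by an "apply" callable, and the operator applies a 1D Laplacian stencil on the fly without ever assembling a matrix:

```python
# Conjugate gradients parameterized only by an operator-apply callable;
# the solver never sees a matrix, only apply_op and scalar products.
def cg(apply_op, b, x, tol=1e-10, maxit=1000):
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    r = [bi - Axi for bi, Axi in zip(b, apply_op(x))]   # initial residual
    p = list(r)
    rr = dot(r, r)
    for _ in range(maxit):
        Ap = apply_op(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]
        rr_new = dot(r, r)
        if rr_new < tol * tol:
            break
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

# On-the-fly operator: 1D Laplacian stencil [-1, 2, -1], never stored.
def laplace_1d(v):
    n = len(v)
    return [2 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]

n = 50
b = [1.0] * n
x = cg(laplace_1d, b, [0.0] * n)
residual = max(abs(bi - Axi) for bi, Axi in zip(b, laplace_1d(x)))
```

The same solver could be handed a stored-matrix apply instead; nothing in the iteration changes, which is exactly the exchangeability the operator interface provides.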
The interface design also offers a simple way to introduce parallel solvers. The parallelization of Dune-ISTL's data structures and solvers differs significantly from that of other libraries like PETSc. While many libraries rely on a globally consecutive numbering of unknowns, we only require a locally consecutive numbering, which allows for fast access into local containers. Global consistency is then achieved by choosing appropriate parallel operators, preconditioners, and scalar products. Note that the linear solvers do not need any information about the parallel data distribution, as they only rely on operator (and preconditioner) applications and scalar product computations, which are hidden behind the high-level abstractions introduced above. This allows for a fully transparent switch between sequential and parallel computations.

A brief history
The development of Dune-ISTL began nearly in parallel with Dune-Grid around 2004, and it was included in the 1.0 release of Dune in 2007. The serial implementation was presented in [32]. One goal was to make Dune-ISTL usable standalone, without Dune-Grid. Hence a powerful abstraction of parallel iterative solvers based on the concept of parallel index sets was developed, as described in [33]. As a first showcase, an aggregation-based parallel algebraic multigrid method for continuous and discontinuous Galerkin discretizations of heterogeneous elliptic problems was added to the library, see [34,35]. It was one of the first solvers scaling to nearly 295,000 processors on a problem with 150 billion unknowns, see [36].

Feature overview and recent developments
The concepts outlined above are implemented using several templatized C++ structures. Linear operators that do not perform their computations on the fly will often use an underlying matrix representation. Dune-ISTL offers dense matrices, of either dynamic size or size known at compile time, as well as several sparse (block) matrix formats. For a comprehensive list see Table 1. The corresponding vector classes can be found in Table 2.
The most important building blocks of the iterative solvers in Dune-ISTL are the preconditioners. Together with the scalar product and the linear operator, they govern whether a solver runs sequentially or in parallel. By convention, the names of sequential preconditioners start with Seq. By wrapping them in a BlockPreconditioner, which needs information about the domain decomposition in terms of a parallel index set, all of them can be turned into so-called hybrid preconditioners, see [37], commonly used by parallel (algebraic) multigrid methods. A hybrid preconditioner is essentially a block Jacobi method in which each block consists of the unknowns of one processor's partition, i.e., each unknown is associated with exactly one processor for the preconditioner. Instead of solving the blocks exactly, only one step of the sequential preconditioner is applied in each iteration. A list of preconditioners provided by Dune-ISTL is given in Table 3. The third column indicates whether a preconditioner is sequential (s), parallel (p), or both. For simple preconditioners that do not need to store a decomposition, a recursion level can be given to the class; these are marked with "yes" in the last column. The level indicates where the inversion of the matrix block happens: for a BCRSMatrix<FieldMatrix<double,n,m>>, a level of 0 leads to the inversion of the scalar values inside the small dense matrices, whereas a level of 1 inverts the FieldMatrix. The latter variant, which leads to a block preconditioner, is the default. All of the listed preconditioners can be used in the iterative solvers provided by Dune-ISTL. Table 4 contains a list of these together with the direct solvers; the latter are only wrappers around well-established external libraries.
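As a toy illustration of the block Jacobi idea behind hybrid preconditioners, the following self-contained Python sketch (not the Dune API; all names and sizes are made up) applies one block-Jacobi sweep per iteration of a preconditioned Richardson method. The partitioning into 2×2 blocks plays the role of the per-processor partitions; for brevity each block is treated by an exact small solve rather than a sweep of an inner preconditioner:

```python
# Matrix-vector product for a dense matrix stored as a list of rows.
def matvec(A, v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

# One block-Jacobi application: solve each 2x2 diagonal block of A
# against the corresponding residual entries.
def block_jacobi(A, r, block_size=2):
    z = [0.0] * len(r)
    for s in range(0, len(r), block_size):
        a, b = A[s][s], A[s][s + 1]
        c, d = A[s + 1][s], A[s + 1][s + 1]
        det = a * d - b * c
        z[s] = (d * r[s] - b * r[s + 1]) / det
        z[s + 1] = (-c * r[s] + a * r[s + 1]) / det
    return z

# Diagonally dominant tridiagonal test matrix.
n = 8
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 3.0
    if i > 0:
        A[i][i - 1] = -1.0
    if i < n - 1:
        A[i][i + 1] = -1.0

# Preconditioned Richardson iteration: x <- x + M^{-1} (b - A x).
b = [1.0] * n
x = [0.0] * n
for _ in range(200):
    r = [bi - Axi for bi, Axi in zip(b, matvec(A, x))]
    z = block_jacobi(A, r)
    x = [xi + zi for xi, zi in zip(x, z)]
residual = max(abs(bi - Axi) for bi, Axi in zip(b, matvec(A, x)))
```

In the real hybrid setting the blocks are far larger (one per processor) and are only treated approximately, but the structure of the iteration is the same.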
Recently, Dune-ISTL has seen a lot of new development, both regarding usability and features. We now briefly discuss some noteworthy improvements.
i. From the start, Dune-ISTL was designed to support nested vector and matrix structures. However, the nesting recursion always had to end in FieldVector and FieldMatrix, respectively. Scalar entries had to be written as vectors of length 1 or matrices of size 1×1. Exploiting modern C++ idioms, scalar values are now supported directly to end the recursion. In other words, it is now possible to write

  Dune::BCRSMatrix<double>

Internally, this is implemented using the Dune::IsNumber traits class to recognize scalar types. Note that the indirections needed internally to implement the transparent use of scalar and blocked entries are completely optimized away by the compiler.
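The trait-terminated recursion of item i can be mimicked in a few lines of plain Python (a conceptual analogue, not the Dune API): a recursive reduction over arbitrarily nested blocks that stops at scalar entries, with an isinstance check playing the role of Dune::IsNumber:

```python
import math
import numbers

def squared_sum(entry):
    """Recurse through arbitrarily nested blocks; stop at scalar entries."""
    if isinstance(entry, numbers.Number):  # plays the role of Dune::IsNumber
        return abs(entry) ** 2
    return sum(squared_sum(sub) for sub in entry)

# A flat structure with scalar entries, like BCRSMatrix<double> ...
flat = [[3.0, 4.0]]
# ... and a blocked one, like BCRSMatrix<FieldMatrix<double,1,1>>.
blocked = [[[[3.0]], [[4.0]]]]

norm_flat = math.sqrt(squared_sum(flat))
norm_blocked = math.sqrt(squared_sum(blocked))
```

Both calls go through the same generic code; the scalar case simply ends the recursion one level earlier, just as the C++ implementation dispatches on IsNumber at compile time.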
ii. As discussed in the concepts section, operators, solvers, preconditioners, and scalar products offer only coarse-grained interfaces. This allows the use of dynamic polymorphism. To enable full exchangeability of these classes at runtime we introduced abstract base classes and now store shared pointers to these base classes. With this change it is now possible to configure the solvers at runtime. Additionally, most solvers can now be configured using a Dune::ParameterTree object, which holds configuration parameters for the whole program. A convenient solver factory is currently under development, which will complete these latest changes. For example, the restarted GMRes solver used to be constructed as

  Dune::RestartedGMResSolver<V> solver(op, preconditioner,
      reduction, restart, maxit, verbose);

where reduction, restart, maxit, and verbose are scalar parameters that the user usually wants to change often to tweak the solver. Now these parameters can be specified in a section of an INI-style file like:

  [GMRES]
  reduction = 1e-8
  maxit = 500
  restart = 10
  verbose = 0

iii. From a conceptual point of view, Dune-ISTL was designed to support vectors and matrices with varying block structure since the very first release. In practice, it took a long time to fully support such constructs: only with the language features of C++11 did it become possible to implement the classes MultiTypeBlockVector and MultiTypeBlockMatrix in a fully featured way. These classes implement dense block matrices and vectors whose entries may have different types. Before that, such structures were only supported via the Boost library and were much harder to use.
The user can now easily define matrix structures like

  using namespace Dune;
  using Row0 = MultiTypeBlockVector<
      Matrix<FieldMatrix<double,3,3>>,
      Matrix<FieldMatrix<double,3,1>>>;

Such a matrix type would be natural, e.g., for a Taylor-Hood discretization of a three-dimensional Stokes or Navier-Stokes problem, combining a velocity vector field with a scalar pressure.
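The runtime configuration of item ii via INI sections can be mirrored with Python's standard configparser module. This self-contained sketch (not a Dune API) only shows how such a section maps to typed solver parameters; the section and parameter names are the ones from the GMRes example above:

```python
import configparser

# The same INI-style section a Dune::ParameterTree would read.
ini = """
[GMRES]
reduction = 1e-8
maxit = 500
restart = 10
verbose = 0
"""

config = configparser.ConfigParser()
config.read_string(ini)
gmres = config["GMRES"]

# Typed access to the parameters, as a solver factory would use them.
reduction = gmres.getfloat("reduction")
maxit = gmres.getint("maxit")
restart = gmres.getint("restart")
verbose = gmres.getint("verbose")
```

The point in both settings is the same: the scalar tuning parameters move out of the source code and into a configuration file that can be edited without recompiling.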
iv. With the Dune 2.6 release an abstraction layer for SIMD-vectorized data types was introduced. This abstraction layer provides functions for transparently handling SIMD data types, as provided by libraries such as Vc 3 [38,39] or VectorClass [40], as well as scalar data types like double or std::complex.
The layer consists of free-standing functions, for example Simd::lane(int l, VT& v), where v is of vector type VT and Simd::lane gives access to the l-th entry of the vector. Operators like + or * are overloaded and applied component-wise. The results of boolean expressions are also vectorized and return data types with bool as scalar type. To handle these values Dune offers functions like Simd::cond, Simd::allTrue, or Simd::anyTrue. The Simd::cond function has the semantics of the ternary operator ?:, which cannot be overloaded. Such a function is necessary because if-else expressions might take different branches in different lanes, which contradicts the principle of vectorization.
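The lane-wise semantics can be mimicked with plain Python lists. The function names below follow the Dune SIMD layer, but the implementation is purely illustrative (one list element per lane, not actual SIMD):

```python
def lane(l, v):
    """Simd::lane: access the l-th entry of a vectorized value."""
    return v[l]

def cond(mask, a, b):
    """Simd::cond: per-lane ternary, replacing the non-overloadable ?:."""
    return [ai if mi else bi for mi, ai, bi in zip(mask, a, b)]

def any_true(mask):
    """Simd::anyTrue: reduce a vectorized boolean."""
    return any(mask)

def all_true(mask):
    """Simd::allTrue: reduce a vectorized boolean."""
    return all(mask)

x = [1.0, -2.0, 3.0, -4.0]
mask = [xi < 0 for xi in x]               # vectorized comparison, one bool per lane
abs_x = cond(mask, [-xi for xi in x], x)  # branch-free absolute value per lane
```

Note how the absolute value is computed without any if-else branch: both alternatives are evaluated for all lanes and cond merely selects, which is exactly what vectorized hardware does with blend instructions.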
Using this abstraction layer it is possible to construct solvers in Dune-ISTL that support multiple right-hand sides. This is achieved by using the vectorized type as the field_type in the data structures. For example, using the type Vec4d provided by VectorClass, the vector type is constructed as

  Dune::BlockVector<Dune::FieldVector<Vec4d,1>>

It can be interpreted as a tall-skinny matrix in R^{N×4}. Using these data types has multiple advantages:
• Explicit use of vectorization instructions - Modern CPUs provide SIMD vectorization instructions that can perform the same instruction on multiple data simultaneously. It is difficult for the compiler to make use of these instructions automatically; with the above approach we employ them explicitly.
• Better utilization of memory bandwidth - The application of the operator or the preconditioner is in most cases limited by the available memory bandwidth, i.e., the runtime of these operations depends on the amount of data that must be transferred to or from memory. With our vectorization approach the matrix has to be loaded from memory only once to calculate k matrix-vector products.
• Reduction of communication overhead - On distributed systems the cost of sending a message is modeled as αD + β, where D is the amount of data, α is the transfer cost per unit of data (the inverse bandwidth), and β is the latency. When using vectorized solvers, k messages are fused into a single message, so the cost is reduced from k(αD + β) to kαD + β.
• Block Krylov methods - Block Krylov methods are Krylov methods for systems with multiple right-hand sides. In every iteration the energy error is minimized over the search directions of all lanes, which reduces the number of iterations needed to achieve a given residual reduction.
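The communication-cost argument from the list above can be checked with a few lines of arithmetic. The numbers below are made up purely for illustration; only the model αD + β and the fusion of k messages come from the text:

```python
# Cost model for one message: alpha * D + beta
alpha = 1e-9   # transfer cost per byte (assumed value)
beta = 1e-6    # latency per message (assumed value)
D = 8 * 1024   # bytes per right-hand side (assumed value)
k = 4          # number of fused right-hand sides / SIMD lanes

separate = k * (alpha * D + beta)  # k individual messages
fused = k * alpha * D + beta       # one fused message carrying all k payloads
saving = separate - fused          # exactly (k - 1) * beta
```

The data volume is unchanged; what fusion removes is k - 1 latency terms, which is significant whenever messages are small and latency-dominated, as is typical for the inner products of Krylov methods.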

Finite element spaces on discretization grids
While Dune focuses on grid-based discretization methods for PDEs, its modular design explicitly avoids any reference to ansatz functions in the interfaces of the grid and linear algebra modules discussed so far. Instead, the corresponding interfaces and implementations are contained in separate modules.

Local function spaces
The Dune-LocalFunctions core module contains interfaces and implementations for ansatz functions on local reference domains. In terms of finite element discretizations, this corresponds to the finite elements defined on reference geometries. Again following the modular paradigm, this is done independently of any global structures like grids or linear algebra, such that the Dune-LocalFunctions module depends on neither the Dune-Grid nor the Dune-ISTL module. The central concept of the Dune-LocalFunctions module is the LocalFiniteElement, which is defined along the lines of the finite element definition in [41]. There, a finite element is a triple (D, Π, Σ) consisting of the local domain D, a local function space Π, and a finite set of functionals Σ = {σ_1, . . . , σ_n}, which induces a basis λ_1, . . . , λ_n of the local ansatz space via σ_i(λ_j) = δ_ij.
Each LocalFiniteElement provides access to its polyhedral domain D by exporting a GeometryType. The exact geometry of the type is defined in the Dune-Geometry module. The local basis functions λ i and the functionals σ i are provided by each LocalFiniteElement by exporting a LocalBasis and LocalInterpolation object, respectively. Finally, a LocalFiniteElement provides a LocalCoefficients object. The latter maps each basis function/functional to a triple (c, i, j) which identifies the basis function as the j-th one tied to the i-th codimension-c face of D.
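The triple (D, Π, Σ) can be made concrete with a toy realization of the 1D first-order Lagrange element on the reference interval D = [0, 1]. This is plain Python illustrating the LocalBasis / LocalInterpolation split, not the actual Dune classes:

```python
# LocalBasis: the nodal basis lambda_0, lambda_1 of P1 on [0, 1].
basis = [lambda x: 1.0 - x, lambda x: x]

# Sigma: point evaluations at the two vertices, sigma_i(f) = f(x_i).
nodes = [0.0, 1.0]

def interpolate(f):
    """LocalInterpolation: coefficients of f w.r.t. the dual basis."""
    return [f(x) for x in nodes]

# Duality check sigma_i(lambda_j) = delta_ij.
duality = [[basis[j](nodes[i]) for j in range(2)] for i in range(2)]

# Interpolating an affine function reproduces it exactly.
coeffs = interpolate(lambda x: 3.0 * x + 1.0)
value = sum(c * b(0.25) for c, b in zip(coeffs, basis))  # = 3.0 * 0.25 + 1.0
```

In the actual interface, the duality relation is what ties the LocalBasis and LocalInterpolation objects of a LocalFiniteElement together, and the LocalCoefficients object would additionally tie each of the two functionals to a vertex (codimension-1 face in 1D) of D.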

Global function spaces
While the core module Dune-LocalFunctions provides interfaces and implementations of function spaces on local reference elements, a related infrastructure for global function spaces on a grid view that are defined in terms of their local element restrictions (denoted global finite element spaces in the following) is not contained in the Dune core so far. Instead, several concurrent/complementary discretization modules, like Dune-Fem, Dune-PDELab, and Dune-Fufem, used to provide their own implementations. In order to improve interoperability, an interface for global function spaces was developed as a joint effort in the staging module Dune-Functions. The aim is to provide a common foundation on which higher-level discretization modules can build.
It can often be useful to use different bases of the same global finite element space for different applications. For example, a discretization in the space P_2 will often use a classical Lagrange basis of the second-order polynomials on the elements, whereas hierarchical error estimators make use of the so-called hierarchical P_2 basis. As a consequence, the Dune-Functions module does not use the global finite element space itself but its basis as the central abstraction. This is represented by the concept of a GlobalBasis in the interface.
A fundamental feature adopted from the Dune-PDELab module is to provide a flexible interface for hierarchically structured products of global finite element spaces. In the following we will repeatedly illustrate this using the k-th order Taylor-Hood space as example. The latter is the product space (P_{k+1} × P_{k+1} × P_{k+1}) × P_k, where k ≥ 1 and P_j is the space of continuous functions that are piecewise polynomials of order j on the elements of a given grid view. Notice that the Taylor-Hood space has a natural hierarchical structure: it is the product of the space (P_{k+1} × P_{k+1} × P_{k+1}) for the velocity with the space P_k for the pressure, where the former is again the 3-fold product of the space P_{k+1} for the individual velocity components. Similar product spaces arise, e.g., in non-isothermal phase field models, primal plasticity, mixed formulations of higher-order PDEs, multi-physics problems, and many other applications. Common to all these examples is that the global product space can be viewed as a tree of spaces, where the root denotes the full space, inner nodes denote intermediate product spaces, and the leaf nodes represent elementary spaces that are not considered as products themselves. This is illustrated for the first-order Taylor-Hood space in Figure 4.
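The tree view of product spaces can be sketched in a few lines of plain Python. The node kinds mirror the composite/power construction used below, but the dimensions are invented counts of basis functions, chosen only to make the tree concrete:

```python
# Minimal space-tree model: leaves are elementary spaces, inner nodes
# are products of their children.
def leaf(dim):
    return {"kind": "leaf", "dim": dim}

def power(child, exponent):
    return {"kind": "power", "children": [child] * exponent}

def composite(*children):
    return {"kind": "composite", "children": list(children)}

def dimension(node):
    """Dimension of a product space = sum of child dimensions."""
    if node["kind"] == "leaf":
        return node["dim"]
    return sum(dimension(c) for c in node["children"])

# First-order Taylor-Hood: (P2 x P2 x P2) x P1; assume 65 P2 basis
# functions and 25 P1 basis functions on some example grid (made up).
taylor_hood = composite(power(leaf(65), 3), leaf(25))
total = dimension(taylor_hood)
```

The root is the full Taylor-Hood space, the power node is the velocity product, and the leaves are the scalar P2 and P1 spaces, matching the tree of Figure 4.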
The Dune-Functions module on the one hand defines an interface for such nested spaces, and on the other hand provides implementations of a set of elementary spaces together with a mechanism for the convenient construction of product spaces. For example, the first-order Taylor-Hood space on a given grid view can be constructed using

  using namespace Dune::Functions;
  using namespace Dune::Functions::BasisFactory;
  auto basis = makeBasis(gridView,
      composite(
          power<3>(lagrange<2>()),
          lagrange<1>()));

Here, lagrange<k>() creates a descriptor of the Lagrange basis of P_k. The interface of a GlobalBasis is split into several parts. All functionality that relates to the basis as a whole is directly provided by the basis, whereas all functionality that can be localized to grid elements is accessible via a so-called LocalView obtained using basis.localView(). Binding a local view to a grid element using localView.bind(element) will then initialize (and possibly pre-compute and cache) the local properties. To avoid costly reallocation of potential internal caches, the same local view can also be rebound to another element.
Once constructed and bound to an element, a LocalView provides access to the restriction of all basis functions whose support overlaps the bound-to element. Since the global basis has a tree structure, the localized basis also forms a tree. This local tree is accessible using localView.tree() and its children can be obtained using the child(...) method. Finally, each leaf node of the local tree provides access to a LocalFiniteElement as introduced in the previous section.
Additionally to providing access to shape functions via the localized leaf nodes, the basis also has to provide global indices of all basis functions in order to associate them with corresponding entries in coefficient vectors. The mapping of shape functions to global indices is done in several stages. First, all shape functions of a LocalFiniteElement are enumerated, which assigns a unique index to each shape function within the LocalFiniteElement and thus within the corresponding leaf node. Next, the per-LocalFiniteElement-unique indices of a leaf node can be mapped to per-LocalView-unique indices using leafNode.localIndex(i). The resulting indices enumerate all basis functions on the bound-to element uniquely. Finally, the local per-LocalView-unique indices can be mapped to globally unique indices using localView.index(j). For example, the global index of the i-th shape function for the d-th velocity component of the Taylor-Hood basis on a given element is obtained by chaining these mappings and addressing the velocity subtree with the static index constant Dune::Indices::_0, which encodes the index 0 at compile time. Compile-time indices are needed because direct children in a composite construction may have different types.
The size of the basis, i.e., the number of basis functions, can be obtained using basis.size(). While all local indices are flat and zero-based, global indices can in general be multi-indices, which allows efficient access to hierarchically structured containers. As a consequence, providing a single number as size is often not appropriate. However, since the multi-indices are guaranteed to provide consecutive, zero-based entries (called digits in the following), the set of all multi-indices of a GlobalBasis forms the leaves of an index tree. To explore this tree, basis.size(prefix) with a given prefix multi-index provides the size of the next digit following the prefix, or, equivalently, the number of direct children of the (possibly interior) node denoted by the prefix. Consistently, basis.size() provides the number of entries appearing in the first digit. If a basis only provides flat indices, this corresponds to the total number of basis functions.
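The size(prefix) semantics can be modelled with nested lists standing in for the index tree. The sizes below are made up for a Taylor-Hood-like basis with 3 velocity components of 10 unknowns each and 4 pressure unknowns; only the prefix semantics mirrors the interface:

```python
# Index tree: [ velocity block [3 components x 10 indices], pressure block [4] ]
index_tree = [[list(range(10)) for _ in range(3)], list(range(4))]

def size(tree, prefix=()):
    """Number of direct children of the node addressed by the prefix."""
    node = tree
    for digit in prefix:
        node = node[digit]
    return len(node)

root_size = size(index_tree)                  # entries of the first digit
velocity_components = size(index_tree, (0,))  # children of the velocity block
component_size = size(index_tree, (0, 1))     # unknowns of one component
pressure_size = size(index_tree, (1,))        # pressure unknowns
```

A full multi-index like (0, 1, 7) addresses one leaf, i.e., one basis function; size applied to successively longer prefixes walks down the tree one digit at a time.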
The way shape functions are associated to indices can be influenced according to the needs of the used discretization, algebraic data structures, and algebraic solvers. In principle an elementary basis provides a pre-defined set of global indices which are often flat and zero based. When defining more complex product space bases using composite and power, the way global indices of the respective direct children are combined can be influenced by providing additional arguments to these functions. Possible strategies are, for example, to prepend or append the number of the child to the global index within the child, or to increment the global indices to get consecutive flat indices.
Additionally, Dune-Functions provides utility functions for working with finite element functions: makeDiscreteGlobalBasisFunction, which constructs the finite element function (with given range type) obtained by weighting the basis functions with the coefficients stored in a suitable vector, and

  interpolate(basis, vector, function);

which computes the interpolation of a given function, storing the result as a coefficient vector with respect to a basis. For instance, one can interpolate a function into the pressure degrees of freedom only, and later construct the velocity vector field as a function; the latter can, e.g., be used to write a subsampled representation in the VTK format. A detailed description of the GlobalBasis interface, the available elementary basis implementations, the mechanism to construct product spaces, the rule-based combination of global indices, and the basis-related utilities can be found in [42]. The type-erasure based polymorphic interface of global functions and localizable grid functions, as e.g. implemented by makeDiscreteGlobalBasisFunction, is described in [43].

Python interfaces for Dune
Combining easy-to-use scripting languages with state-of-the-art numerical software has been a continuous effort in scientific computing for a long time. For the solution of PDEs, the pioneering work of the FEniCS team [5] inspired many others, e.g. [44,45], to also provide Python scripting for high-performance PDE solvers usually written in C++.
Starting with the 2.6 release in 2018, Dune can also be used from within the Python scripting environment. The Dune-Python staging module provides i. a general concept for exporting realizations of polymorphic interfaces as used in many Dune modules and ii. Python bindings for the central interfaces of the Dune core modules described in this section. These bindings make rapid prototyping of new numerical algorithms easy, since they can be implemented and tested within a scripting environment. Our aim was to keep the Python interfaces as close as possible to their C++ counterparts, so that translating the resulting Python algorithms to C++ to maximize the efficiency of production code is as painless as possible. Bindings are provided using [46]. We start with an example demonstrating these concepts, revisiting the examples given in Section 3:

  integral = 0.0
  for e in gv.elements:
      integral += u(e.geometry.center) * e.geometry.volume

A few changes were made to make the resulting code more Pythonic, i.e., the use of class attributes instead of class methods, but the only major change is that the function returning the grid object in fact returns the leaf grid view and not the hierarchic grid structure. Notice that the lifetime of the underlying grid is managed automatically by Python's internal reference counting mechanism. The hierarchical grid can be obtained using a class attribute, i.e., to refine the grid globally:

  gv.hierarchicalGrid.globalRefine(1)

Other interface classes and their realizations have also been exported, so that, for example, the more advanced quadrature rules used in the previous sections can also be used in Python:

  from dune.geometry import quadratureRule
  integral2 = 0.0
  for e in gv.elements:
      geo = e.geometry
      quadrature = quadratureRule(e.type, 5)
      for qp in quadrature:
          integral2 += u(geo.toGlobal(qp.position)) \
              * geo.integrationElement(qp.position) * qp.weight

Again, the changes relative to the C++ code are mostly cosmetic or due to restrictions imposed by the Python language. While staying close to the original C++ interface facilitates rapid prototyping, it can also lead to a significant loss of efficiency. A very high level of efficiency was never a central target during the design of the Python bindings to Dune; to achieve this, a straightforward mechanism is provided to call Dune algorithms written in C++. Nevertheless, we made some changes to the interface and added a few extra features to improve the overall efficiency of the code. The two main strategies are to reduce the number of calls from Python to C++ by, for example, not returning single objects for a given index but iterable structures instead, and to introduce an extended interface taking vectors of arguments to allow for vectorization.
Consider, for example, the C++ interface methods geometry.corners() and geometry.corner(i) on the Geometry class, which return the number of corners of the element and their coordinates in physical space, respectively. Using these methods, loops would read as follows:

  auto center = geometry.corner(0);
  for (std::size_t i = 1; i < geometry.corners(); ++i)
      center += geometry.corner(i);
  center /= geometry.corners();

To reduce the number of calls into C++, we decided to slightly change the semantics of method pairs of this type: the plural version now returns an iterable object, while the singular version still exists in its original form. So in Python the above code snippet can be written as follows:

  corners = geometry.corners
  center = corners[0]
  for c in corners[1:]:
      center += c
  center /= len(corners)

As discussed above, quadrature loops are an important ingredient of most grid-based numerical schemes. As the code snippet at the beginning of this section shows, this requires calling methods on the geometry for each point of the quadrature rule, which again can lead to a significant performance penalty. To overcome this issue, we provide vectorized versions of the methods on the geometry class, so that the above example can be implemented more efficiently:

  import numpy
  from dune.geometry import quadratureRule

  u = lambda x: numpy.exp(numpy.sqrt(sum(x * x)))
  integral3 = 0.0
  for e in gv.elements:
      hatxs, hatws = quadratureRule(e.type, 5).get()
      weights = hatws * e.geometry.integrationElement(hatxs)
      integral3 += numpy.sum(u(hatxs) * weights, axis=-1)

The following list gives a short overview of changes and extensions we made to the Dune interface while exporting it to Python. As pointed out, some changes are required due to language restrictions in Python or to make the resulting interface more Pythonic; other changes were made to make writing efficient code possible, e.g., via vectorization, as discussed above.
• Since global is a keyword in Python we cannot export the global method on the Geometry directly. So we have exported global as toGlobal and for symmetry reasons local as toLocal.
• Some methods take compile-time static arguments, e.g., the codimension argument for entity.subEntity<c>(i). These had to be turned into dynamic arguments, so in Python the subEntity is obtained via entity.subEntity(i, c).
• In many places we replaced methods with properties, i.e., entity.geometry instead of entity.geometry().
• Methods returning a bool specifying whether other interface methods will return valid results are not exported (e.g. neighbor on the intersection class). Instead, None is returned to indicate an invalid call (e.g. to outside).
• Some of the C++ interfaces contain pairs of methods where the method with the plural name returns an integer (the number of items) and the singular version takes an integer i and returns the i-th item. The plural version was turned into a range-returning method in Python, as discussed above.
• In C++, free-standing functions can be found via argument-dependent lookup. As Python does not have such a concept, we converted those free-standing functions to methods or properties. Examples are elements, entities, intersections, or localFunction.
• A grid in Dune-Python is always the LeafGridView of the hierarchical grid. To work with the actual hierarchy, i.e., to refine the grid, use the hierarchicalGrid property. Level grid views can also be obtained from that hierarchical grid.
• In contrast to C++, partitions are exported as objects of their own. The interior partition, for example, can be accessed by

  partition = grid.interiorPartition

The partition, in turn, also exports the method entities and the properties elements, facets, edges, and vertices.
• An MCMGMapper can be constructed using the mapper method on the GridView class passing in the Layout as argument. The mapper class has an additional call method taking an entity, which returns an array with the indices of all DoFs (degrees of freedom) attached to that entity. A list of DoF vectors based on the same mapper can be communicated using methods defined on the mapper itself and without having to define a DataHandle.
A big advantage of using the Python bindings for Dune is that non-performance-critical tasks, e.g., pre- and postprocessing, can be carried out within Python, while the time-critical parts can easily be carried out in C++. To make it easy to call functions written in C++ from within a Python script, Dune-Python provides a simple mechanism. Assume, for example, that the above quadrature of e^{|x|} was implemented in a C++ template function integral, contained in the header file integral.hh and using the Dune interface as described in Section 3.1.1. We can now call this function from within Python using

  integral4 = algorithm.run('integral', 'integral.hh', gv)

Note that the correct version of the template function integral will be exported based on the C++ type of the gv argument, i.e., YaspGrid<4>.
With the mechanism provided in the Dune-Python module, numerical schemes can first be implemented and tested within Python and can then be translated to C++ to achieve a high level of efficiency. The resulting C++ functions can be easily called from within Python making it straightforward to simply replace parts of the Python code with their C++ counterparts.
In addition to the features described so far, the Dune-Python module provides general infrastructure for adding bindings to other Dune modules. Details are given in [47]. We will demonstrate the power of this feature in Section 4.1 where we also use the domain specific language UFL [48] to describe PDE models.

Build system and testing
Starting with the 2.4 release, Dune has transitioned from its Autotools build system to a new, CMake-based build system. This follows the general trend in the open source software community to use CMake. The framework is split into separate modules; each module is treated as a CMake project in itself, with the build system managing inter-module dependencies and propagation of configuration results. In order to simplify the inter-module management, there is a shell script called dunecontrol (part of Dune-Common) that resolves dependencies and controls the build order.
In the CMake implementation of the Dune build system, special emphasis has been put on testing. Testing has become increasingly important, with the development model of the Dune core modules relying heavily on continuous integration. In order to lower the barrier for adding tests as much as possible, a one-line solution in the form of a CMake convenience function dune_add_test has been implemented. Further testing infrastructure is provided in the module Dune-TestTools [49], which allows the definition of system tests. These system tests describe samples of framework variability, covering both compile-time and run-time variations.
More information on the Dune CMake build system can be found in the Sphinx-generated documentation available on the Dune website.

Selected advanced features with applications
After having discussed the central components of Dune and their recent changes, we now want to highlight some advanced features. The following examples all showcase features of Dune extension modules in conjunction with the core modules.

Grid modification
In this section we discuss two mechanisms of modifying grids within the DUNE framework: dynamic local grid adaptation and moving domains. In particular, dynamic local grid adaptation is of interest for many scientific and engineering applications due to the potentially large savings in computational cost. However, parallel dynamic local grid adaptation in particular is technically challenging, and not many PDE frameworks offer a seamless approach. We will demonstrate in this section how the general concepts described in Section 3.1 for grid views and adaptivity provided by the core modules are used to solve PDE problems on grids with dynamic refinement and coarsening. Especially important for these concepts is the separation of topology, geometry, and user data provided by the grid interface.
To support grid modification the Dune-Fem module provides two specialized GridViews: AdaptiveGridView and GeometryGridView. Both are based on a given grid view, e.g., the LeafGridView, and replace certain aspects of the implementation. In the first case, the index set is replaced by an implementation that provides additional information that can be used to simplify data transfer during grid refinement and coarsening. In the second case the geometry of each element is replaced by a given grid function, e.g., by an analytic function or by some discrete function over the underlying grid view. The advantage of this meta grid view approach is that any algorithm based on a Dune grid view can be used without change, while, for example, the data transfer during grid modification can be handled transparently by a specialized algorithm that exploits the features of the meta grid view.
As previously described in Section 3.1.4, the Dune grid interface offers the possibility to dynamically refine and coarsen grids if the underlying grid implementation offers these capabilities. Currently, there are two implementations that support parallel dynamic grid adaptation including load balancing, namely UGGrid and ALUGrid; in addition, AlbertaGrid supports grid adaptation but cannot be used for parallel computations.
A variety of applications make use of the Dune grid interface for adaptive computations, for example adaptive discontinuous Galerkin computations of compressible flow such as the Euler equations [54] or atmospheric flow [55]. A number of applications focus on hp-adaptive schemes, e.g. continuous Galerkin approximations of Poisson-type problems [56], or discontinuous Galerkin approximations of two-phase flow in porous media [57,58,59,60] and conservation laws [61]. Other works consider, for example, the adaptive solution of the Cahn-Larché system using finite elements [62].
In this section we demonstrate the capabilities of the Dune grid interface and its realizations, making use of the Python bindings of the Dune-Fem module. We show only small parts of the Python code here; the full scripts are part of the tutorial [63].
To this end we solve the Barkley model, which is a system of reaction-diffusion equations modeling excitable and oscillatory media. The model is often used as a qualitative model in pattern-forming systems like the Belousov-Zhabotinsky reaction and other systems that are well described by the interaction of an activator and an inhibitor component [64].

In its simplest form the Barkley model is given by
\[
\partial_t u = \Delta u + \frac{1}{\varepsilon}\, u (1-u) \Bigl( u - \frac{v+b}{a} \Bigr), \qquad
\partial_t v = u - v,
\]
with an activator u, an inhibitor v, a small time scale parameter ε, and model parameters a, b. For the time discretization the stiff reaction term is linearized around the current iterate, so that the implicit part takes the form m(u^n, v^n) u^{n+1}. Note that u, v are assumed to be between zero and one, so m(u^n, v^n) > 0. We end up with a linear, positive definite elliptic operator defining the solution u^{n+1} given u^n. In the following we will use a conforming Lagrange approximation with quadratic basis functions. To handle possibly nonconforming grids we add interior penalty DG terms as discussed in [56]. The slow reaction h(u, v) can be solved explicitly, leading to a purely algebraic equation for v^{n+1}. The initial data is piecewise constant, chosen in such a way that a spiral wave develops.
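To make the time stepping idea concrete, here is a heavily simplified, self-contained 1d finite difference sketch of such a scheme. The paper uses quadratic finite elements with grid adaptation in 2d/3d and treats the fast reaction semi-implicitly; for brevity this sketch treats both reactions explicitly and only the diffusion implicitly, and all names and parameter values are purely illustrative:

```python
# Simplified 1d sketch of a Barkley time step: implicit diffusion,
# explicit reactions. Parameter values are illustrative only.
a, b, eps = 0.75, 0.02, 0.02
D, dt, n = 1.0, 0.01, 100
h = 1.0 / n

def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system by the Thomas algorithm."""
    m = len(diag)
    c, d = [0.0] * m, [0.0] * m
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, m):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < m - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def step(u, v):
    # explicit fast reaction f(u,v) = u(1-u)(u - (v+b)/a) / eps
    f = [ui * (1.0 - ui) * (ui - (vi + b) / a) / eps for ui, vi in zip(u, v)]
    rhs = [ui + dt * fi for ui, fi in zip(u, f)]
    # implicit diffusion: (I - dt D d^2/dx^2) u^{n+1} = rhs, Neumann boundary
    r = dt * D / h**2
    lower = [-r] * n
    upper = [-r] * n
    diag = [1.0 + 2.0 * r] * n
    diag[0] = diag[-1] = 1.0 + r
    u_new = thomas(lower, diag, upper, rhs)
    # explicit slow reaction h(u,v) = u - v
    v_new = [vi + dt * (ui - vi) for ui, vi in zip(u, v)]
    return u_new, v_new

# piecewise constant initial data (a front, as a 1d stand-in for the spiral)
u = [1.0 if i < n // 2 else 0.0 for i in range(n)]
v = [0.0] * n
for _ in range(50):
    u, v = step(u, v)
```

In the actual tutorial code the same structure appears on the PDE level: one linear solve per time step for u and a pointwise algebraic update for v.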
Figure 5: The evolution of u for different times using non-conforming grid adaptation in 2d with quadrilaterals.

Figure 6 shows results using the quadratic Lagrange basis and a conforming simplicial grid with bisection refinement. Figure 7 shows the same example for 3d grids, using a bi-linear Lagrange basis for a non-conforming hexahedral grid in Figure 7a and a quadratic Lagrange basis on a conforming simplicial grid with bisection refinement in Figure 7b.
Details on the available load balancing algorithms and parallel performance studies for the Dune-ALUGrid package can be found in [20], [65], and [66], and for UGGrid in [24].

Moving grids
In this section we touch on another important topic for modern scientific computing: moving domains. Typically this is supported by moving the nodes of the computational grid. In Dune this can be done in a very elegant way. The presence of an abstract grid interface allows the construction of meta grids, where only parts of the grid implementation are re-implemented while the original grid implementation stays untouched. Thus meta grids provide an elegant way of adding features while keeping the code base modular. Dune-Grid contains such meta grid implementations; for moving grids one can use GeometryGrid (see also Section 3.1.6), which allows moving the nodes of the grid by providing an overloaded Geometry implementation. Another, slightly easier way is to only overload the geometries of a grid view, which is, for example, done in Dune-Fem.
Both approaches re-implement the reference geometry mapping. In GeometryGrid an external vector of nodes providing the positions of the computational grid is used, while for GeometryGridView a grid function, i.e., a function which is evaluated on each entity in reference coordinates, provides the mapping of the coordinates. The advantage of both approaches is that the implementation of the numerical algorithm does not need to be changed at all: the new grid or grid view follows the same interface as the original implementation. A moving grid can now be realized by modifying this grid function.
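The meta grid idea can be sketched in a few lines of plain Python. All class names here are invented for illustration and are not the actual Dune interface; the point is only that an algorithm written against the grid view interface runs unchanged on the mapped view:

```python
# Minimal sketch of the meta grid view idea. Names are invented for
# illustration; this is not the Dune / Dune-Fem API.

class Element:
    def __init__(self, corners):
        self.corners = corners          # two 1d coordinates

    def volume(self):
        a, b = self.corners
        return abs(b - a)

class GridView:
    """An equidistant 1d grid on [0, 1] with n elements."""
    def __init__(self, n):
        self.n = n

    def elements(self):
        h = 1.0 / self.n
        return [Element([i * h, (i + 1) * h]) for i in range(self.n)]

class GeometryGridView:
    """Meta grid view: replaces every geometry by a mapped one,
    everything else is forwarded to the underlying view."""
    def __init__(self, base, mapping):
        self.base = base
        self.mapping = mapping

    def elements(self):
        return [Element([self.mapping(x) for x in e.corners])
                for e in self.base.elements()]

def total_volume(gridView):
    # a "numerical algorithm" written only against the grid view interface
    return sum(e.volume() for e in gridView.elements())

base = GridView(10)
moved = GeometryGridView(base, lambda x: 2.0 * x)   # move all nodes

print(total_volume(base))    # approximately 1.0
print(total_volume(moved))   # approximately 2.0
```

A moving grid then amounts to exchanging the mapping between time steps while all assembly code stays untouched.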
To demonstrate this feature of Dune we solve a mean curvature flow problem, which is a specific example of a geometric evolution equation where the evolution is governed by the mean curvature H. One real-life example is the evolution of soap films over time, but the flow is also applied to other problems such as image processing. Assume we are given a reference surface Γ̄ such that we can write the evolving surface in the form Γ_t = X(t, Γ̄). It is now possible to show that the vector-valued function X = X(t, x̄) with x̄ ∈ Γ̄ satisfies
\[
\partial_t X = - H \nu,
\]
where H is the mean curvature of Γ_t and ν is its outward pointing normal.
We use the following time-discrete approximation, as suggested in [67]: find U^n such that
\[
\int_{\Gamma^n} \frac{U^n - \mathrm{id}}{\Delta t} \cdot \varphi
+ \nabla_{\Gamma^n} U^n \colon \nabla_{\Gamma^n} \varphi \;\mathrm{d}\sigma = 0
\]
for all test functions φ. Here U^n parametrizes Γ^{n+1} ≈ Γ_{t^{n+1}} over Γ^n := Γ_{t^n}, and Δt is the time step.
In the example considered here the workflow can be set up as follows. First one creates a reference grid and a corresponding quadratic Lagrange finite element space to represent the geometry of the mapped grid. Other successful applications of this meta grid concept for moving domains can be found, for example, in [68], where the compressible Navier-Stokes equations are solved in a moving domain, and in [25], where free surface shallow water flow is considered.
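As a tiny consistency check of the time discretization idea, note that under mean curvature flow a circle of radius r shrinks according to r' = -1/r, with exact solution r(t) = sqrt(r0^2 - 2t). The following sketch illustrates only this time stepping aspect, not the surface finite element discretization used in the paper:

```python
# Mean curvature flow shrinks a circle of radius r at speed H = 1/r,
# so r(t) = sqrt(r0^2 - 2 t) and the circle vanishes at t = r0^2 / 2.
import math

def evolve_radius(r0, dt, nsteps):
    r = r0
    for _ in range(nsteps):
        r = r - dt / r        # explicit step for r' = -H = -1/r
    return r

r0, T = 1.0, 0.3
nsteps = 3000
r_num = evolve_radius(r0, T / nsteps, nsteps)
r_exact = math.sqrt(r0**2 - 2.0 * T)
```

For the full surface problem the same principle applies: each step solves a linear problem on the current surface Γ^n and then moves the grid through the grid function, exactly as described above.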

Grid coupling and complex domains
In recent years Dune has gained support for different strategies to handle couplings of PDEs on different subdomains. One can distinguish three different approaches to describe and handle such different domains involved in a multiphysics simulation. As an important side effect, the last approach also provides support for domains with complex shapes.
Figure 9: Coupling of two unrelated meshes via a merged grid: Intersecting the coupling patches yields a set of remote intersections, which can be used to evaluate the coupling conditions.

i. Coupling of individual grids: In the first approach, each subdomain is
treated as a separate grid, and meshed individually. The challenge is then the construction of the coupling interfaces, i.e., the geometrical relationships at common subdomain boundaries. As it is natural to construct nonconforming interfaces in this way, coupling between the subdomains will in general require some type of weak coupling, like the mortar method [69], penalty methods [70,71], or flux-based coupling [72].
ii. Partition of a single grid: In contrast, one may construct a single host grid that resolves the boundaries of all subdomains. The subdomain meshes are then defined as subsets of elements of the host grid. While the construction of the coupling interface is straightforward, generating the initial mesh is an involved process, if the subdomains have complicated shapes. As the coupling interfaces are conforming (as long as the host grid is), it is possible to enforce coupling conditions in strong form.
Figure 11: Construction of two cut-cell subdomain grids from a Cartesian background grid and a level-set geometry representation: cut-cells are constructed by intersecting a background cell with the zero iso-surface of the level set.

iii. Cut-cell grids: The third approach is similar to the second one, and again involves a host grid. However, it is more flexible because this time the host grid can be arbitrary, and does not have to resolve the boundaries
of the subdomains. Instead, subdomain grids are constructed by intersecting the elements with the independent subdomain geometry, typically described as the 0-level set of a given function. This results in so-called cut-cells, which are fragments of host grid elements. Correspondingly, the coupling interfaces are constructed by intersecting host grid elements with the subdomain boundary.
It is important to note that the shapes of the cut-cells can be arbitrary and the resulting cut-cell grids are not necessarily shape-regular. As a consequence, it is not possible to employ standard discretization techniques. Instead, a range of different methods like the unfitted discontinuous Galerkin (UDG) method [73,74] and the CutFEM method [75] have been developed for cut-cell grids.
All three concepts for handling complex domains are available as special Dune modules.

Dune-Grid-Glue -Coupling of individual grids
When coupling simulations on separate grids, the main challenge is the construction of coupling operators, as these require detailed neighborhood information between cells in different meshes. The Dune-Grid-Glue module [26,27], available from https://dune-project.org/modules/dune-grid-glue, provides infrastructure to construct such relations efficiently. Neighborhood relationships are described by the concept of RemoteIntersections, which are closely related to the Intersections known from the Dune-Grid module (Section 3.1.2): Both keep references to the two elements that make up the intersection, store the shape of the intersection in world space, and store the local shapes of the intersection when embedded into either of the two elements. However, a RemoteIntersection is more general than its Dune-Grid cousin: First of all, the two elements do not have to be part of the same grid object, or even the same grid implementation. There is also no requirement for the two elements to have the same dimension, which allows mixed-dimensional couplings like the one in [76].
Constructing the set of remote intersections for a pair of grids first requires the selection of two coupling patches. These are two sets of entities that are known to be involved in the coupling, like a contact boundary, or the overlap between two overlapping grids. Coupling entities can have any codimension. In principle all entities of a given codimension could always be selected as coupling entities, but it is usually easy and more efficient to preselect the relevant ones.
There are several algorithms for constructing the set of remote intersections for a given pair of coupling patches. Assuming that both patches consist of roughly N coupling entities, the naive algorithm will require O(N^2) operations. This is too expensive for many situations. Gander and Japhet [77] proposed an advancing front algorithm with expected linear complexity, which, however, slows down considerably when the coupling patches consist of many connected components, or contain too many entities not actually involved in the coupling. Both algorithms are available in Dune-Grid-Glue. A third approach using a spatial data structure and a run-time of O(N log N) still awaits implementation.
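In 1d the two available algorithms are easy to sketch: a remote intersection is a nonempty overlap of one element from each grid; the naive variant tests all pairs, while the advancing front variant sweeps along the line. The names below are invented for illustration and do not correspond to the Dune-Grid-Glue API:

```python
# Sketch: remote intersections of two unrelated 1d grids over [0, 1].
# Each grid is a sorted list of breakpoints; a remote intersection is
# (element index in grid1, element index in grid2, overlap interval).

def naive_merge(grid1, grid2):
    """All pairs, O(N^2)."""
    out = []
    for i in range(len(grid1) - 1):
        for j in range(len(grid2) - 1):
            lo = max(grid1[i], grid2[j])
            hi = min(grid1[i + 1], grid2[j + 1])
            if hi > lo:
                out.append((i, j, lo, hi))
    return out

def sweep_merge(grid1, grid2):
    """Advancing front along the line, O(N): the 1d analogue of the
    Gander/Japhet algorithm mentioned in the text."""
    out, i, j = [], 0, 0
    while i < len(grid1) - 1 and j < len(grid2) - 1:
        lo = max(grid1[i], grid2[j])
        hi = min(grid1[i + 1], grid2[j + 1])
        if hi > lo:
            out.append((i, j, lo, hi))
        # advance whichever element ends first
        if grid1[i + 1] <= grid2[j + 1]:
            i += 1
        else:
            j += 1
    return out

grid1 = [k / 7 for k in range(8)]   # 7 elements
grid2 = [k / 5 for k in range(6)]   # 5 unrelated elements
```

Two grids with m and n elements sharing only the endpoints produce m + n - 1 remote intersections, which both algorithms find in the same order.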
A particular challenge arises in the case of parallel grids, as the partitions of the two grids are in general unrelated. Dune-Grid-Glue can also compute the set of RemoteIntersections in parallel codes, using additional communication. For details on the algorithm and on how to handle couplings in the parallel case we refer to [27].
As an example we implement the assembly of mortar mass matrices using Dune-Grid-Glue. Let Ω be a domain in R^d, split into two parts Ω_1, Ω_2 by a hypersurface Γ, as in Figure 9. On Ω we consider an elliptic PDE for a scalar function u, subject to the continuity conditions
\[
u_1 = u_2, \qquad \nabla u_1 \cdot n = \nabla u_2 \cdot n \qquad \text{on } \Gamma,
\]
where u_1 and u_2 are the restrictions of u to the subdomains Ω_1 and Ω_2, respectively, and n is a unit normal of Γ. For a variationally consistent discretization, the mortar method discretizes the weak form of the continuity condition
\[
\int_\Gamma (u_1 - u_2)\, w \; ds = 0, \qquad (2)
\]
which has to hold for a space of test functions w defined on the common subdomain boundary. Let Ω_1 and Ω_2 be discretized by two independent grids, and let {θ^1_i} and {θ^2_i} be nodal basis functions for these grids, respectively. We use the nonzero restrictions of the {θ^1_i} on Γ to discretize the test function space. Then (2) has the algebraic form
\[
M_1 \bar u_1 = M_2 \bar u_2
\]
with mortar mass matrices
\[
(M_1)_{ij} = \int_\Gamma \theta^1_i\, \theta^1_j \; ds, \qquad
(M_2)_{ij} = \int_\Gamma \theta^1_i\, \theta^2_j \; ds.
\]
The numbers n_{Γ,1} and n_{Γ,2} denote the numbers of degrees of freedom on the interface. Assembling these matrices is not easy, because M_2 involves shape functions from two different grids. For the implementation, assume that the grids on Ω_1 and Ω_2 are available as two Dune grid view objects gridView1 and gridView2, of types GridView1 and GridView2, respectively. The code first constructs the coupling patches, i.e., those parts of the boundaries of Ω_1, Ω_2 that are on the interface Γ. These are represented in Dune-Grid-Glue by objects called Extractors. Since we are coupling on the grid boundaries, which have codimension 1, we need two codimension-1 extractors, one for each grid view. Next, we need to compute the set of remote intersections from the two coupling patches. The different algorithms mentioned above are implemented in objects called "mergers". The most appropriate one for the mortar example is called ContactMerge, and it implements the advancing front algorithm of Gander and Japhet.
The entire code to construct the remote intersections for the two trace grids at the interface Γ is

    using GlueType = GridGlue::GridGlue<Extractor1, Extractor2>;
    // Backend for the computation of the remote intersections
    auto merger = std::make_shared<GridGlue::ContactMerge<dim, double>>();
    GlueType gridGlue(domEx, tarEx, merger);
    gridGlue.build();
The gridGlue object is a container for the remote intersections. These can now be used to compute the two mass matrices M_1 and M_2. Let mortarMatrix1 and mortarMatrix2 be two objects of some (deliberately) unspecified matrix type. We assume that both are initialized and all entries are set to zero. The assembly then loops over all remote intersections and integrates the products of nodal basis functions from the two sides. After these loops, the objects mortarMatrix1 and mortarMatrix2 contain the matrices M_1 and M_2, respectively.
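The structure of these loops can be sketched in plain Python for a 1d interface Γ = [0, 1] with piecewise linear basis functions. The data structures and the loop over "remote intersections" (interval overlaps) are deliberately simplistic and not the Dune-Grid-Glue API:

```python
# Sketch: mortar mass matrices M1, M2 on a 1d interface discretized by
# two unrelated grids, assembled per remote intersection.

def hat(nodes, i, x):
    """P1 nodal basis function theta_i of the grid given by 'nodes'."""
    if i > 0 and nodes[i - 1] <= x <= nodes[i]:
        return (x - nodes[i - 1]) / (nodes[i] - nodes[i - 1])
    if i < len(nodes) - 1 and nodes[i] <= x <= nodes[i + 1]:
        return (nodes[i + 1] - x) / (nodes[i + 1] - nodes[i])
    return 0.0

def remote_intersections(g1, g2):
    out = []
    for i in range(len(g1) - 1):
        for j in range(len(g2) - 1):
            lo, hi = max(g1[i], g2[j]), min(g1[i + 1], g2[j + 1])
            if hi > lo:
                out.append((i, j, lo, hi))
    return out

def mortar_matrices(g1, g2):
    n1, n2 = len(g1), len(g2)
    M1 = [[0.0] * n1 for _ in range(n1)]
    M2 = [[0.0] * n2 for _ in range(n1)]
    gauss = [(-1 / 3**0.5, 1.0), (1 / 3**0.5, 1.0)]   # exact for quadratics
    for e1, e2, lo, hi in remote_intersections(g1, g2):
        mid, half = (lo + hi) / 2, (hi - lo) / 2
        for xi, w in gauss:
            x, weight = mid + half * xi, w * half
            # only basis functions of the two cut elements are nonzero here
            for i in (e1, e1 + 1):
                for j in (e1, e1 + 1):
                    M1[i][j] += weight * hat(g1, i, x) * hat(g1, j, x)
                for j in (e2, e2 + 1):
                    M2[i][j] += weight * hat(g1, i, x) * hat(g2, j, x)
    return M1, M2

g1 = [k / 4 for k in range(5)]   # nonmatching trace grids on Gamma
g2 = [k / 3 for k in range(4)]
M1, M2 = mortar_matrices(g1, g2)
```

Because the intersections partition each element, quadrature on the overlaps integrates products of basis functions from both grids exactly, which is precisely the role the remote intersections play in the C++ assembly.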
The problem gets more complicated when Γ is not a hyperplane. The approximation of a non-planar Γ by unrelated grids will lead to "holes" at the interface, and the jump u_1|_Γ − u_2|_Γ is not well-defined anymore. This situation is usually dealt with by identifying the two discrete interfaces Γ^h_1 and Γ^h_2 with a homeomorphism Φ, and replacing the second mass matrix by
\[
(M_2)_{ij} = \int_{\Gamma^h_1} \theta^1_i \, (\theta^2_j \circ \Phi) \; ds.
\]
Only a few changes have to be made to the code to implement this. First of all, the vertical predicate class has to be exchanged for something that correctly finds the curved coupling boundaries. The construction of the extractor and GridGlue objects remains unchanged. The extra magic needed to handle the mapping Φ is completely concealed in the ContactMerge implementation, which does not rely on Γ^h_1 and Γ^h_2 being identical. Instead, if there is a gap between them, a projection in normal direction is computed automatically and used for Φ.

Dune-MultiDomainGrid -Using element subsets as subdomains
The second approach to the coupling of subdomains is implemented in the Dune-MultiDomainGrid module, available at https://dune-project.org/modules/dune-multidomaingrid. This module allows structuring a given host grid into different subdomains. It is implemented in terms of two cooperating grid implementations, MultiDomainGrid and SubDomainGrid: MultiDomainGrid is a meta grid that wraps a given host grid and extends it with an interface for setting up and accessing subdomains. It also stores all data required to manage the subdomains. The individual subdomains are exposed as SubDomainGrid instances, which are lightweight objects that combine the information from the host grid and the associated MultiDomainGrid. SubDomainGrid objects present a subdomain as a regular Dune grid. A MultiDomainGrid inherits all capabilities of the underlying grid, including features like h-adaptivity and MPI parallelism. Extensions of the official grid interface allow obtaining the associated entities in the fundamental mesh and the corresponding indices in both grids.

Dune-TPMC -Assembly of cut-cell discretizations
The main challenge for cut-cell approaches is the construction of appropriate quadrature rules to evaluate integrals over the cut-cell and its boundary. We assume that the domain is given implicitly by a discrete level set function Φ_h, such that Φ_h(x) < 0 if x ∈ Ω^{(i)}. The goal is then to compute a polygonal representation of the cut-cell and a decomposition into sub-elements, such that standard quadrature can be applied on each sub-element. This makes it possible to evaluate weak forms on the actual domain, its boundary, and the internal skeleton (when employing DG methods).
The Dune-TPMC library implements a topology preserving marching cubes (TPMC) algorithm [78], assuming that Φ_h is given as a piecewise multilinear scalar function (i.e. a P1 or Q1 function). The fundamental idea is the same as that of the classical marching cubes algorithm known from computer graphics. Given the signs of the vertex values, the library identifies the topology of the cut-cell. In certain ambiguous cases additional points in the interior of the cell need to be evaluated. From the topological case the actual decomposition is retrieved from a lookup table and mapped according to the real function values.
Evaluating integrals over a cut-cell domain using Dune-TPMC. We look at a simple example to learn how to work with cut-cell domains. As stated, the technical challenge regarding cut-cell methods is the construction of quadrature rules. We consider a circular domain of radius 1 in 2d and compute its area using numerical quadrature. The scalar function Φ(x) = |x|^2 − 1, x ∈ R^d, describes the domain boundary as the iso-surface Φ = 0 and the interior as Φ < 0. We now compute the local volume by quadrature over the cut-cell e|_{Φ<0}. In order to evaluate the integral we use the TpmcRefinement to construct snippets, on which we can use standard quadrature rules. This gives us a convergent approximation of π. Unsurprisingly, we obtain O(h^2) convergence of the quadrature error, as the geometry is approximated by a polygonal domain.
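The following self-contained sketch mimics this procedure in plain Python: every cell of a Cartesian grid is clipped against the zero level set (with the edge crossings found by linear interpolation), and the areas of the resulting polygonal snippets are summed. The real Dune-TPMC library instead uses its lookup-table based decomposition; all function names here are invented for illustration:

```python
# Sketch of 2d cut-cell quadrature: area of the unit disc given
# implicitly by phi(x) = |x|^2 - 1 < 0.
import math

def phi(p):
    return p[0]**2 + p[1]**2 - 1.0

def clip(poly):
    """Keep the part of the polygon where phi <= 0 (Sutherland-Hodgman
    style, edge crossings by linear interpolation of phi)."""
    out = []
    for k in range(len(poly)):
        p, q = poly[k], poly[(k + 1) % len(poly)]
        fp, fq = phi(p), phi(q)
        if fp <= 0.0:
            out.append(p)
        if (fp < 0.0) != (fq < 0.0):
            t = fp / (fp - fq)      # linear root along the edge
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def area(poly):
    """Shoelace formula."""
    a = 0.0
    for k in range(len(poly)):
        p, q = poly[k], poly[(k + 1) % len(poly)]
        a += p[0] * q[1] - q[0] * p[1]
    return 0.5 * a

def disc_area(n):
    """Sum the clipped snippet areas on an n x n grid over [-1.2, 1.2]^2."""
    h = 2.4 / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            x0, y0 = -1.2 + i * h, -1.2 + j * h
            cell = [(x0, y0), (x0 + h, y0), (x0 + h, y0 + h), (x0, y0 + h)]
            snippet = clip(cell)
            if len(snippet) >= 3:
                total += area(snippet)
    return total
```

Since both the chord approximation of the arc and the interpolated edge crossings underestimate the disc, the computed area converges to π from below at second order in h.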

Non-smooth multigrid
Various interesting PDEs from application fields such as computational mechanics or phase-field modeling can be written as nonsmooth convex minimization problems with certain separability properties. For such problems, the module Dune-TNNMG offers an implementation of the Truncated Nonsmooth Newton Multigrid (TNNMG) algorithm [79,80]. Assume that R^N decomposes into blocks, R^N = R^{N_1} × · · · × R^{N_m}, and call R_i : R^N → R^{N_i} the canonical restriction to the i-th block. Typically, the factor spaces R^{N_i} will have small dimension, but the number of factors m is expected to be large. We consider a strictly convex and coercive objective functional
\[
J(v) = J_0(v) + \sum_{i=1}^m \varphi_i(R_i v) \qquad (3)
\]
with a convex C^2 functional J_0 : R^N → R, and convex, proper, lower semicontinuous functionals ϕ_i : R^{N_i} → R ∪ {+∞}. Given such a functional J, the TNNMG method alternates between a nonlinear smoothing step and a damped inexact Newton correction. The smoother solves local minimization problems
\[
v_k = \arg\min_{v \in v_{k-1} + V_k} J(v) \qquad (4)
\]
in the subspaces V_k ⊂ R^N of all vectors that have zero entries everywhere outside of the k-th block. The inexact Newton step typically consists of a single multigrid iteration for the linearized problem, but other choices are possible as well.
For this method global convergence has been shown even when using only inexact local smoothers [80]. In practice it is observed that the method degenerates to a multigrid method after a finite number of steps, and hence multigrid convergence rates are achieved asymptotically [79].
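The interplay of the two steps can be sketched for the simplest case of blocks of size one with ϕ_i = λ|·|, so that the local problems are solved by soft thresholding. The code is a plain Python illustration, not the Dune-TNNMG API, and it solves the linearized problem on the inactive set exactly where the full method would apply one multigrid sweep:

```python
# TNNMG-flavoured iteration for J(v) = 1/2 v^T A v - b^T v + lam * sum_i |v_i|.

def solve(A, b):
    """Dense Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def smoother(A, b, lam, v):
    """Exact coordinate-wise minimization (nonlinear Gauss-Seidel)."""
    n = len(v)
    for i in range(n):
        r = b[i] - sum(A[i][j] * v[j] for j in range(n) if j != i)
        v[i] = max(abs(r) - lam, 0.0) * (1.0 if r > 0 else -1.0) / A[i][i]
    return v

def newton_correction(A, b, lam, v):
    """Newton step on the inactive set, truncated at the active set."""
    I = [i for i, vi in enumerate(v) if vi != 0.0]
    if not I:
        return v
    AI = [[A[i][j] for j in I] for i in I]
    rI = [b[i] - sum(A[i][j] * v[j] for j in range(len(v)))
          - lam * (1.0 if v[i] > 0 else -1.0) for i in I]
    d = solve(AI, rI)
    # damping: the correction must not cross zero (stay in the smooth regime)
    t = 1.0
    for i, di in zip(I, d):
        if di != 0.0 and (v[i] + t * di) * v[i] < 0.0:
            t = min(t, -v[i] / di * 0.999)
    for i, di in zip(I, d):
        v[i] += t * di
    return v

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, -8.0, 0.3]
lam = 0.5
v = [0.0, 0.0, 0.0]
for _ in range(50):
    v = smoother(A, b, lam, v)
    v = newton_correction(A, b, lam, v)
```

Once the smoother has found the correct sign pattern, the truncated Newton step solves the remaining smooth problem exactly, which mirrors the observation above that the method asymptotically behaves like a (multigrid) iteration for the linearized problem.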
The Dune-TNNMG module, available from https://git.imp.fu-berlin.de/agnumpde/dune-tnnmg, offers an implementation of the TNNMG algorithm in the context of Dune. The coupling to Dune is very loose: as TNNMG operates on functionals in R^N only, there is no need for it to know about grids, finite element spaces, etc. The Dune-TNNMG module therefore only depends on Dune-ISTL and Dune-Solvers.

Numerical example: small-strain primal elastoplasticity
The theory of elastoplasticity describes the behavior of solid objects that can undergo both temporary (elastic) and permanent (plastic) deformation. In its simplest (primal) form, its variables are a vector field u : Ω → R^d of displacements, and a matrix field p : Ω → Sym^{d×d}_0 of plastic strains. These strains are assumed to be symmetric and trace-free [81]. Displacements u live in the Sobolev space H^1(Ω, R^d), and (in theories without strain gradients) plastic strains live in the larger space L^2(Ω, Sym^{d×d}_0). Therefore, the easiest space discretization employs continuous piecewise linear finite elements for the displacement u, and piecewise constant plastic strains p.
Implicit time discretization of the quasistatic model leads to a sequence of spatial problems [82,81]. These can be written as minimization problems for an increment functional of the form
\[
J(u, p) = \frac{1}{2} \begin{pmatrix} u \\ p \end{pmatrix}^{\!T} A \begin{pmatrix} u \\ p \end{pmatrix}
- b^T \begin{pmatrix} u \\ p \end{pmatrix}
+ \sigma_c \sum_{i} \Bigl( \int_\Omega \theta_i \, dx \Bigr) \| p_i \|_F \qquad (5)
\]
which do not involve a time step size because the model is rate-independent.
Here, u and p are the finite element coefficients of the displacement and plastic strains, respectively, and A is a symmetric positive definite matrix. The number σ_c is the yield stress, and b is the load vector. The functions θ_1, ..., θ_{n_2} are the canonical basis functions of the space of piecewise constant functions, and ‖·‖_F : Sym^{d×d}_0 → R is the Frobenius norm. In the implementation, trace-free symmetric matrices p_i ∈ Sym^{d×d}_0 are represented by vectors of length (d+1)d/2 − 1.
By comparing (5) to (3), one can see that the increment functional (5) has the required form [82]. By a result of [83], the local nonsmooth minimization problems (4) can be solved exactly.
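Restricted to a single plastic strain block, the local problems reduce to minimizing a quadratic plus a multiple of the Euclidean norm (representing the Frobenius norm), which block soft thresholding solves in closed form. A sketch, where the scalar a, the vector g, and the threshold c stand in for the corresponding diagonal block of A, the local residual, and σ_c times the basis integral:

```python
# Exact solution of  min_p  1/2 a |p|^2 - g . p + c |p|  (a > 0, c >= 0)
# by block soft thresholding.
import math

def block_soft_threshold(a, g, c):
    norm_g = math.sqrt(sum(gi * gi for gi in g))
    if norm_g <= c:
        return [0.0] * len(g)        # the nonsmooth term pins p at zero
    scale = (norm_g - c) / (a * norm_g)
    return [scale * gi for gi in g]

# a trace-free symmetric 3x3 matrix has 5 independent components
p = block_soft_threshold(2.0, [3.0, -1.0, 0.5, 0.0, 2.0], 1.5)
```

For p ≠ 0 the stationarity condition a p − g + c p/|p| = 0 holds exactly, which is the sense in which the local problems (4) admit an exact solve.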
The implementation used in [82] employs several of the recent hybrid features of Dune-Functions and Dune-ISTL. The pair of finite element spaces for displacements and plastic strains for a three-dimensional problem forms a tree of function spaces. The corresponding linear algebra data structures must combine block vectors with block size 3 and block vectors with block size 5. Hence the vector data type definition is

    using DisplacementVector = BlockVector<FieldVector<double, dim>>;
    using PlasticStrainVector = BlockVector<FieldVector<double, nPlasticStrainComponents>>;
    using Vector = MultiTypeBlockVector<DisplacementVector, PlasticStrainVector>;

and this is the type used by Dune-TNNMG for the iterates. The corresponding matrix type combines four BCRSMatrix objects of different block sizes in a single MultiTypeBlockMatrix, and the multigrid solver operates directly on this type of matrix. We show a numerical example of the TNNMG solver for a three-dimensional test problem. Note that in this case the space Sym^{d×d}_0 is 5-dimensional, and therefore isomorphic to R^5. Let Ω be the domain depicted in Figure 12a, with bounding box (0, 4) × (0, 1) × (0, 7). We clamp the object at Γ_D = (0, 4) × (0, 1) × {0} and apply a time-dependent normal load. The material parameters are taken from [84]. Figure 12b shows the evolution of the plastification front at several points in time. See [82] for more detailed numerical results and performance measurements.

Multiscale methods
There has been a tremendous development of numerical multiscale methods in the last two decades, including the multiscale finite element method (MsFEM) [85,86,87], the heterogeneous multiscale method (HMM) [88,89,90], the variational multiscale method (VMM) [91,92,93], and the local orthogonal decomposition (LOD) [94,95,96]. More recently, extensions to parameterized multiscale problems have been presented, such as the localized reduced basis multiscale method (LRBMS) [97,98,99] or the generalized multiscale finite element method (GMsFEM) [100,101,102]. In [103,104] we have demonstrated that most of these methods can be cast into a general abstract framework that may then be used for the design of a common implementation framework for multiscale methods, which has been realized in the Dune-Multiscale module [105]. In the following, we concentrate on an efficient parallelized implementation of the MsFEM within the DUNE software framework.

Multiscale model problem
As a model problem we consider heterogeneous diffusion. Given a domain Ω ⊂ R^n, n ∈ N_{>0}, with a polygonal boundary, an elliptic diffusion tensor A ∈ (L^∞(Ω))^{n×n} with microscopic features, and an f ∈ L^2(Ω), we define our model problem as: find u ∈ H̊^1(Ω) such that
\[
\int_\Omega A \nabla u \cdot \nabla v \; dx = \int_\Omega f v \; dx \qquad \text{for all } v \in \mathring{H}^1(\Omega). \qquad (6)
\]
For the discretization of Equation (6) we require a regular partition T_H of Ω with elements T and a nested refinement T_h of T_H with elements t, and choose associated piecewise linear finite element spaces U_H and U_h. We assume that U_h is sufficiently accurate. By A_h we denote a suitable piecewise polynomial approximation of A, and for T ∈ T_H we call U(T) an admissible environment of T if it is connected, if T ⊂ U(T) ⊂ Ω, and if it is the union of elements of T_h. Admissible environments will be used for oversampling. In particular, T is an admissible environment of itself.
The MsFEM in Petrov-Galerkin formulation with oversampling is defined in the following. The typical construction of an explicit multiscale finite element basis is already indirectly incorporated in the method. Also note that for U(T) = T we obtain the MsFEM without oversampling.
Let now U_H = {U(T) | T ∈ T_H} denote a set of admissible environments of elements of T_H. We call R_h(u_H) ∈ U_h ⊂ H̊^1(Ω) the MsFEM approximation of u if u_H ∈ U_H solves the coarse scale system together with the local corrector problems
\[
\int_{U(T)} A_h \bigl( \nabla \Phi_H + \nabla Q_{h,T}(\Phi_H) \bigr) \cdot \nabla \phi_h \; dx = 0
\qquad \text{for all } \phi_h \in \mathring{U}_h(U(T)) \qquad (7)
\]
for all T ∈ T_H, where Ů_h(U(T)) is the underlying fine scale finite element space on U(T) with zero boundary values on ∂U(T). Since we are interested in a globally continuous approximation, i.e. R_h(u_H) ∈ U_h ⊂ H̊^1(Ω), we still need a conforming projection P_{H,h} which maps the discontinuous parts Q_{h,T}(Φ_H)|_T to an element of U_h. If P_{H,h} denotes such a projection, we define
\[
\tilde{Q}_h(\Phi_H) := \sum_{T \in T_H} \chi_T \, P_{H,h}\bigl(Q_{h,T}(\Phi_H)\bigr)
\]
with indicator function χ_T. For a more detailed discussion and analysis of this method we refer to [87].

Implementation and parallelization
Our implementation of the general framework for multiscale methods (Dune-Multiscale, [105]) is an effort born out of the Exa-Dune project [16,106,107] and is built using the Dune Generic Discretization Toolbox (Dune-GDT, [108]) and Dune-XT [109], as well as the Dune core modules described in Section 3.
To maximize CPU utilization we employ multi-threading to dynamically load balance work items inside one CPU without expensive memory transfer or cross-node communication. This effectively reduces the communication/overlap region of the coarse grid in a scenario with a fixed number of available cores. Within Dune we decided to use Intel's Thread Building Blocks (TBB) library as our multithreading abstraction.
Let us now consider an abstract compute cluster comprised of a set of processors P, where a set of cores C_{P_i} = {C^j_{P_i}} is associated with each P_i ∈ P and a collection of threads t_{C_j} = {t^k_{C_j}} with each core. For simplicity, we assume here that these numbers are the same across P.

Figure 13: Non-overlapping hybrid macro grid distribution of T_H for P = P_0, ..., P_3, with the hatched area symbolizing sub-distribution over the threads t_{C_j}, and a zoomed fine scale sub-structure of U_{h,T} for U(T) = T (left). Overlapping macro grid distribution of T_H for P = P_0, ..., P_3 (right).
Since we are interested in globally continuous solutions in U_H, we require an overlapping distribution T_{H,P_i} ⊂ T_H where cells can be present on multiple P_i. Furthermore, we denote by I_i ⊂ T_{H,P_i} the set of inner elements, characterized by T_H ∈ I_i ⇒ T_H ∉ I_j for all j ≠ i. The first important step in the multiscale algorithm is to solve the cell corrector problems (7) for all U(T_H), T_H ∈ I_i. These are truly locally solvable in the sense of data independence with respect to neighbouring coarse cells. We build upon extensions to the Dune-Grid module made within Exa-Dune, presented in [110], that allow us to partition a given GridView into connected ranges of cells. The assembler was modified to use TBB such that different threads iterate over different partitions in parallel (Fig. 13).
For each T_H we create a new structured Dune::YaspGrid to cover U(T_H). Next we need to obtain Q_{h,T}(Φ_H) for all J coarse scale basis functions. After discretization this actually means assembling only one linear system matrix and J different right hand sides. The assembly is handled by Dune::GDT::SystemAssembler, which is parametrized by an elliptic operator GDT::Operators::EllipticCG and corresponding right hand side functionals GDT::LocalFunctional::Codim0Integral. The SystemAssembler performs loop fusion by merging cell-local operations of any number of input functors. This makes it possible to perform the complete assembly in one single sweep over the grid, using a configurable amount of thread parallelism.
Since the cell problems usually contain only up to about 100,000 elements, it is especially efficient to factorize the assembled system matrix once and then backsolve for all right hand sides. For this we employ the UMFPACK [111] direct solver from the SuiteSparse library and its abstraction through Dune-ISTL. Another layer of abstraction on top of that in Dune-XT allows us to switch to an iterative solver at run-time, should we exceed the suitability constraints of the direct solver. After applying the projections P_{H,h} to get Q̃_h(Φ_H), we discretize Eq. (7), which yields a linear system in the standard way. Since this is a system with degrees of freedom (DoF) distributed across all P_i, we need to select an appropriate iterative solver. Here we use the implementation of the bi-conjugate gradient stabilized method (BiCGSTAB) in Dune-ISTL, preconditioned by an Algebraic Multigrid (AMG) solver, see Section 3.2. We note that the application of the linear system solver is the only step in our algorithm that requires global communication. This allows the overall algorithm to scale with high parallel efficiency in setups with few coarse grid cells per rank, as demonstrated in Figure 14.
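The "factorize once, backsolve often" idea can be sketched with a tiny dense LU decomposition in plain Python; the actual implementation uses the sparse UMFPACK factorization through Dune-ISTL:

```python
# Factorize the shared cell problem matrix once, then reuse the cheap
# forward/backward substitutions for every right hand side.

def lu_factor(A):
    """LU with partial pivoting; returns (packed LU, permutation)."""
    n = len(A)
    LU = [row[:] for row in A]
    piv = list(range(n))
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(LU[r][c]))
        LU[c], LU[p] = LU[p], LU[c]
        piv[c], piv[p] = piv[p], piv[c]
        for r in range(c + 1, n):
            LU[r][c] /= LU[c][c]
            for k in range(c + 1, n):
                LU[r][k] -= LU[r][c] * LU[c][k]
    return LU, piv

def lu_solve(LU, piv, b):
    """Forward/backward substitution reusing the factorization."""
    n = len(b)
    x = [b[p] for p in piv]
    for r in range(1, n):                       # forward: L y = P b
        x[r] -= sum(LU[r][k] * x[k] for k in range(r))
    for r in range(n - 1, -1, -1):              # backward: U x = y
        x[r] = (x[r] - sum(LU[r][k] * x[k] for k in range(r + 1, n))) / LU[r][r]
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
LU, piv = lu_factor(A)                          # factorize once ...
rhs_list = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, -1.0, 0.5]]
solutions = [lu_solve(LU, piv, b) for b in rhs_list]   # ... backsolve J times
```

The factorization costs O(n^3) once, while each additional right hand side only costs an O(n^2) substitution, which is exactly why one matrix with J right hand sides is preferable to J separate solves.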
While the Dune-Multiscale module can function as a standalone application to apply the method to a given problem, it is also possible to use it as a library in other modules (see for example Dune-MLMC [112]). Run-time parameters like the problem to solve, oversampling size, micro and macro grid resolution, number of threads per process, etc. are read from a structured INI-style file or passed as an XT::Common::Configuration object. New problems with associated data functions and computational domains can easily be added by defining them in a new header file. The central library routine to apply the method to a given problem, with nulled solution and prepared grid setup, is very concise as it follows the mathematical abstractions discussed above:

    LocalProblemSolver(problem, coarse_space, localgrid_list).solve_for_all_cells();
    CoarseScaleOperator elliptic_msfem_op(problem, coarse_space, localgrid_list);
    elliptic_msfem_op.apply_inverse(coarse_msfem_solution);
    //! projection and summation
    identify_fine_scale_part(problem, localgrid_list, coarse_msfem_solution, coarse_space, solution);
    solution->add(coarse_msfem_solution);

Sum factorization for high order discretizations to improve node level performance

In this last example we showcase how Dune is used to develop HPC simulation code for modern hardware architectures. We discuss some prevalent trends in hardware development and how they affect finite element software. Then a matrix-free solution technique for high order discretizations is presented and its node level performance on recent architectures is shown. This work was implemented in Dune-PDELab and was originally developed within the Exa-Dune project. The complexity of the performance engineering efforts has led to a reimplementation and continued development in Dune-codegen.
With the end of frequency scaling, performance increases on current hardware rely on an ever-growing amount of parallelism in modern architectures. This includes a drastic increase of CPU floating point performance through instruction level parallelism (SIMD vectorization, superscalar execution, fused multiply-add). However, memory bandwidth has not kept up with these gains, severely restricting the performance of established numerical codes and leaving them unable to saturate the floating point hardware. Developers need to both reconsider their choice of algorithms and adapt their implementations in order to overcome this barrier. For example, in traditional FEM implementations the system matrix is assembled in memory and the sparse linear system is solved with efficient solvers based on sparse matrices. Optimal complexity solvers scale linearly in the number of unknowns. Despite their optimal complexity, these schemes cannot leverage the capabilities of modern HPC systems as they rely on sparse matrix vector products of the assembled system matrix, which have very low arithmetic intensity and are therefore inherently memory-bound.
One possible approach to leverage the capabilities of current hardware is to directly implement the application of the sparse matrix on a vector. This direct implementation shifts the arithmetic intensity into the compute-bound regime of modern CPUs. Given an optimal complexity algorithm on suitable discretizations, it is possible to compute the matrix-vector product faster than the entries of an assembled system matrix can be loaded from main memory. Such optimal complexity algorithms make use of a technique called sum factorization [113], which exploits the tensor product structure of finite element basis functions and quadrature formulae. Given polynomial degree k and minimal quadrature order, it reduces the computational complexity of one operator application from O(k^{2d}) to O(k^{d+1}) by rewriting the evaluation of finite element functions as a sequence of d tensor contractions. To compute local contributions of the operator it is necessary to have access to the 1d shape functions and the quadrature rule that were used in the tensor-product construction of the 2d or 3d variants. Although the implementation is hard-coded to this tensor-product structure and cannot use the 2d or 3d shape functions directly, it obtains the 1d shape functions from Dune-LocalFunctions. This keeps the implementation fairly generic and allows an easy switch between different polynomial degrees and polynomial representations (e.g. Lagrange or Legendre polynomials).
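The tensor contraction structure can be sketched for the 2d case. This is an illustrative example (the function and variable names are not from Dune-PDELab): evaluating a tensor-product finite element function at all m × m quadrature points via two successive 1d contractions, costing O(mn(m + n)) instead of the naive O(m²n²):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum-factorized evaluation of a 2d tensor-product FE function at
// tensor-product quadrature points. B(q,i) holds the value of 1d shape
// function i at 1d quadrature point q; coeffs(i,j) is the coefficient of
// the basis function phi_i(x) * phi_j(y). The d = 2 successive contractions
// generalize to d dimensions, giving the O(k^{d+1}) operator application.
using Mat = std::vector<std::vector<double>>; // dense row-major helper

Mat evaluateSumFactorized(const Mat& B, const Mat& coeffs) {
  const std::size_t m = B.size(), n = B[0].size();
  // First contraction over the y-index j: T(i,q2) = sum_j coeffs(i,j) B(q2,j)
  Mat T(n, std::vector<double>(m, 0.0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t q2 = 0; q2 < m; ++q2)
      for (std::size_t j = 0; j < n; ++j)
        T[i][q2] += coeffs[i][j] * B[q2][j];
  // Second contraction over the x-index i: U(q1,q2) = sum_i B(q1,i) T(i,q2)
  Mat U(m, std::vector<double>(m, 0.0));
  for (std::size_t q1 = 0; q1 < m; ++q1)
    for (std::size_t q2 = 0; q2 < m; ++q2)
      for (std::size_t i = 0; i < n; ++i)
        U[q1][q2] += B[q1][i] * T[i][q2];
  return U; // U(q1,q2) = u(x_{q1}, y_{q2})
}
```

Note that only the 1d matrix B enters the computation, which is why access to the 1d shape functions and quadrature rule suffices.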
In order to fully exploit the CPU's floating point capabilities, an implementation needs to maximize its use of SIMD instructions. In our experience, general purpose compilers are not able to sufficiently autovectorize this type of code, especially as the loop bounds of tensor contractions depend on the polynomial degree k and are thus not necessarily an integer multiple of the SIMD width. Explicit SIMD vectorization is a challenging task that requires both an algorithmic idea of how to group instructions and possibly rearrange loops, and a technical realization. In the following we apply a vectorization strategy developed in [114]: batches of several sum factorization kernels arising from the evaluation of finite element functions and their gradients are parallelized using SIMD instructions. In order to achieve portability between different instruction sets, code is written using a SIMD abstraction layer [40]. This however requires the innermost loops of finite element assembly to be rewritten using SIMD types. With Dune-PDELab's abstraction of a local operator, these loops are typically located in user code. This led to the development of Dune-codegen, which will be further described in Section 5. Figure 15 shows node level performance numbers for a Discontinuous Galerkin finite element operator for the diffusion reaction equation on an Intel Haswell node. The code performs at roughly 40% of the theoretical peak performance of 1.17 TFlop/s (10^12 floating point operations per second) on this 32 core node. Discontinuous Galerkin discretizations benefit most from this compute-bound algorithm, as they allow minimizing memory transfers by omitting the costly setup of element-local data structures, operating directly on suitably blocked global data structures instead. A dedicated assembler for DG operators, Dune::PDELab::FastDGAssembler, is now available in Dune-PDELab.
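The batching idea can be sketched with a hand-rolled fixed-width vector type. This is a stand-in for a real SIMD abstraction layer such as the one used in Dune [40]; `Vec4` and `contract` are illustrative names, not library API:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Minimal stand-in for a SIMD abstraction layer: a fixed-width vector type
// on which the compiler can emit SIMD instructions. Each lane carries the
// data of one independent kernel in the batch.
struct Vec4 {
  std::array<double, 4> lane{};
  Vec4& operator+=(const Vec4& o) {
    for (int l = 0; l < 4; ++l) lane[l] += o.lane[l];
    return *this;
  }
  friend Vec4 operator*(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (int l = 0; l < 4; ++l) r.lane[l] = a.lane[l] * b.lane[l];
    return r;
  }
};

// One 1d tensor contraction executed for four sum factorization kernels at
// once, e.g. function values plus the three gradient components. The loop
// bounds m, n depend on the polynomial degree, but the SIMD lanes stay fully
// occupied because the batching happens across kernels, not across loop
// iterations.
void contract(const double* B, const Vec4* in, Vec4* out,
              std::size_t m, std::size_t n) {
  for (std::size_t q = 0; q < m; ++q) {
    Vec4 acc{};
    for (std::size_t i = 0; i < n; ++i) {
      Vec4 b;
      b.lane.fill(B[q*n + i]);  // broadcast the scalar shape function value
      acc += b * in[i];
    }
    out[q] = acc;
  }
}
```

Batching across kernels rather than across the contraction loop sidesteps the remainder-loop problem caused by loop bounds that are not multiples of the SIMD width.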
Instead of gathering/scattering data from global memory into element-local data structures, it operates on views onto the global data. This avoids unnecessary copy operations and index calculations. The assembler is essential to achieve the presented node level performance, but can also be beneficial for traditional DG implementations.
It is worth noting that iterative solvers based on this kind of matrix-free operator evaluation require the design of preconditioners that preserve the low memory bandwidth requirements while ensuring good convergence behavior, as the per-iteration speedup would otherwise be lost to a much higher number of solver iterations. We studied matrix-free preconditioning techniques for Discontinuous Galerkin problems in [115]. This matrix-free solution technology has been used for an advanced application with the Navier-Stokes equations in [116].

Current development trends in Dune
Dune is under constant development and new features are added regularly. We briefly highlight four topics that are the subject of current research and development.

Asynchronous communication
Communication overhead is expected to become an even greater problem in future HPC systems, as the number of processes will increase. Therefore, it is necessary to use asynchronous communication. A first attempt to establish asynchronous communication in Dune was demonstrated in [18,20].
With the MPI 3.0 standard, an official interface for asynchronous communication was established. Based on this standard, and as part of the Exa-Dune project, we are currently developing high-level abstractions in Dune for such asynchronous communication, following the future-promise concept also used in the C++ standard library. An MPIFuture object encapsulates the MPI_Request as well as the corresponding memory buffer. Furthermore, it provides methods to check the state of the communication and to access the result.
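The future-promise pattern behind MPIFuture can be sketched without MPI. In this illustration a thread stands in for the network, and `startHaloExchange`/`overlappedStep` are hypothetical names; the real MPIFuture wraps an MPI_Request plus its receive buffer instead of a `std::future`:

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

// Post a (simulated) asynchronous communication and return a future that
// owns the buffer, mirroring how MPIFuture ties the MPI_Request to the
// memory it writes into.
std::future<std::vector<double>> startHaloExchange(std::vector<double> sendBuf) {
  return std::async(std::launch::async, [buf = std::move(sendBuf)]() mutable {
    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // "network" latency
    for (double& v : buf) v *= 2.0;  // pretend a neighbor rank doubled our data
    return buf;
  });
}

double overlappedStep(const std::vector<double>& local) {
  auto fut = startHaloExchange(local);  // 1. post the communication
  double interior =                     // 2. overlap with local work
      std::accumulate(local.begin(), local.end(), 0.0);
  std::vector<double> halo = fut.get(); // 3. wait, then safely use the buffer
  return interior + std::accumulate(halo.begin(), halo.end(), 0.0);
}
```

Tying the buffer's lifetime to the future prevents the classic asynchronous-MPI bug of the buffer being freed or overwritten while the request is still in flight.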
Thereby the fault-tolerance with respect to soft- and hard-faults occurring on remote processes is improved as well. We are following the recommended way of handling failures by throwing exceptions. Unfortunately, this concept integrates poorly with MPI. An approach for propagating exceptions through the entire system and handling them properly, using the ULFM functionality proposed in [117,118], can be found in [119].
Methods like pipelined CG [120] overlap global communication and operator application to hide communication costs. Such asynchronous solvers will be incorporated in Dune-ISTL, along with the described software infrastructure.

Thread parallelism
Modern HPC systems exhibit different levels of concurrency. Many numerical codes are now adopting the MPI+X paradigm, meaning that they use internode parallelism via MPI and intranode parallelism, i.e. threads, via some other interface. While early works were based on OpenMP and pthreads, for example in [18], the upcoming interface changes in Dune will be based on the Intel Threading Building Blocks (TBB) library to handle threads. Up to now the core modules do not use multi-threading directly, but the consensus on a single library ensures interoperability among different Dune extension modules.
In the Exa-Dune project several numerical components like assemblers or specialized linear solvers have been developed using TBB. As many developments of Exa-Dune are proofs of concept, they cannot be merged into the core modules immediately, but we plan to port the most promising approaches to mainline Dune. Noteworthy features include mesh partitioning into entity ranges per thread, as used in the MsFEM code in Section 4.4, the block-SELL-C-σ matrix format [121] (an extension of the work of [122]) and a task-based DG-assembler for Dune-PDELab.
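The entity-range partitioning can be sketched with plain threads. This is an illustrative reduction over per-element contributions (the Exa-Dune prototypes use TBB tasks, and `assembleParallel` is a hypothetical name):

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Thread-parallel assembly over disjoint entity ranges: the element range is
// split among threads, each thread accumulates into its own partial result,
// and the partials are combined afterwards. Disjoint ranges plus per-thread
// buffers mean no locks or atomics are needed in the hot loop.
double assembleParallel(const std::vector<double>& elementContrib,
                        unsigned nThreads) {
  std::vector<double> partial(nThreads, 0.0);
  std::vector<std::thread> workers;
  const std::size_t n = elementContrib.size();
  for (unsigned t = 0; t < nThreads; ++t)
    workers.emplace_back([&, t] {
      const std::size_t begin = n * t / nThreads;
      const std::size_t end = n * (t + 1) / nThreads;
      for (std::size_t e = begin; e < end; ++e)  // each thread owns its range
        partial[t] += elementContrib[e];         // no synchronization needed
    });
  for (auto& w : workers) w.join();
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

A real assembler accumulates into vectors or matrices rather than a scalar, where shared degrees of freedom on range boundaries additionally require colored partitions or thread-local buffers.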

C++ and Python
Combining easy-to-use scripting languages with state-of-the-art numerical software has been a continuous effort in scientific computing for a long time. While much of the development of mathematical algorithms still happens in Matlab, there is increasing use of Python for such efforts, also in other scientific disciplines. For the solution of PDEs, the pioneering work of the FEniCS team [5] inspired many others, e.g. [44,45], to also provide Python scripting for high performance PDE solvers usually coded in C++. As discussed in Section 3.4, Dune provides Python bindings for central components like meshes, shape functions and linear algebra. Dune-Python also provides infrastructure for exporting static polymorphic interfaces to Python using just-in-time compilation and without introducing a virtual layer, thus not incurring any performance losses when objects are passed between different C++ components through the Python layer. Bindings are now being added to a number of modules like the Dune-Grid-Glue module discussed in Section 4.2.1, and further modules will follow soon.

DSLs and code-generation
Code-generation techniques make it possible to use scripting languages while maintaining high efficiency. Using a domain-specific language (DSL), the FEniCS project first introduced a code generator to automatically generate efficient discretization code in Python. The Unified Form Language (UFL) [5,48] is an embedded Python DSL for describing a PDE problem in weak form. UFL is now used by several projects, in particular Firedrake [45], and we have also started adopting this input in several places in Dune. For example, UFL can now be used for generating model descriptions for Dune-Fem [63], as demonstrated in Section 4.1. Another effort is the currently developed Dune-codegen module, which aims to make performance optimizations developed in the Exa-Dune project accessible to the Dune community.
In Section 4.5 we highlighted how highly tuned matrix-free higher-order kernels can achieve 40% of peak performance on modern architectures. While Dune offers the necessary flexibility, this kind of optimization is hard to implement for average users. To overcome this issue and improve sustainability, we introduced a code generation toolchain in [123], using UFL as our input language. From this DSL, a header file containing a performance-optimized LocalOperator class is generated. The LocalOperator interface is Dune-PDELab's abstraction for local integration kernels. The design decisions for this code generation toolchain are discussed in detail in [124]. The toolchain achieves near-optimal performance by applying structural transformations to an intermediate representation based on [125]. A search space of SIMD vectorization strategies is explored from within the code generator through an autotuning procedure. This work has led to the development of Dune-codegen, which also offers further optimizations, like block-structured meshes, similar to the concepts described in [126], or extruded meshes, as in [127,128]. This is an ongoing effort and still in early development.