ArcFVDSL, a DSEL Combined to HARTS, a Runtime System Layer to Implement Ef ﬁ cient Numerical Methods to Solve Diffusive Problems on New Heterogeneous Hardware Architecture

— Nowadays, some frameworks like Arcane and Dune offer a number of advanced tools to deal with the complexity related to parallelism, meshes and linear solvers. However, they do not handle the high level complexity related to discretization methods and physical models. Generative programming and Domain Speci ﬁ c Languages (DSL) are key technologies allowing to write code with a high level expressive language and take advantage of the ef ﬁ ciency of generated code with low level services. DSL may be embedded in host languages like Python or C++ . Such languages, named in that case Domain Speci ﬁ c Embedded Languages (DSEL), are applied for instance in frameworks like Fenics or Feel++ which are dedicated to the domain of Finite Element (FE) methods and Galerkin methods. ArcFVDSL is a DSEL developed on top of the Arcane framework, aiming to implement various lowest order methods (Finite-Volume (FV), Mimetic Finite Difference (MFD), Mixed Hybrid Finite Volume (MHFV), etc.) for diffusive problems on general meshes. In this paper, we present various implementations of different complex academic problems. We focus on the capability of the language to allow the description and the resolution of these problems with several lowest-order methods. We illustrate the bene ﬁ ts of such technology combined to runtime system tools like Heterogeneous Abstract RunTime System (HARTS) and its ability to handle seamlessly new heterogeneous architectures with multi-core processors enhanced by General Purpose computing


INTRODUCTION
Industrial simulation softwares have to manage: (i) the complexity of the underlying physical models, usually expressed in terms of a Partial Differential Equation (PDE) system completed with algebraic closure laws, (ii) the complexity of the numerical methods used to solve the PDE system, and finally (iii) the complexity of the low level computer science services required to have efficient software on modern hardware.Robust and effective Finite Volume (FV) methods as well as advanced programming techniques need to be combined in order to fully benefit from massively parallel architectures (implementation of parallelism, memory handling, design of connections).Moreover, the above methodologies and technologies have become more and more sophisticated and too complex to be handled by physicists alone.Today, this complexity management becomes a key issue for the development of scientific software.
A number of existing frameworks already offer advanced tools to deal with the complexity related to parallelism.They hide hardware complexity and provide low level algorithms dealing directly with hardware specificities for performance reasons.They often offer services to manage mesh data services and linear algebra services which are key elements for efficient parallel software.However, all these frameworks often provide only partial answers to the problem as they only deal with hardware complexity and low level numerical complexity like linear algebra.The complexity related to discretization methods and physical models lacks tools to help physicists develop complex applications.
New paradigms for scientific software must be developed to help them seamlessly handle the different levels of complexity so that they can focus on their specific domain.
Generative programming, component engineering and Domain Specific Languages (DSL) are key technologies to make the development of complex applications easier to physicists, hiding the complexity of numerical methods and low level computer science services.These paradigms allow to write code with a high level expressive language and take advantage of the efficiency of generated code for low level services close to hardware specificities.Their application to scientific computing has been limited so far to Finite Element (FE) methods, for which a unified mathematical framework has been existing for a long time.Such kinds of DSL have been developed for FE or Galerkin methods in projects like Freefem, Getdp, Getfem++, Sundance, Feel++ and Fenics.In these projects, they are embedded in host languages like Python or C++ and are named Domain Specific Embedded Languages (DSEL).
Over the last few years, we have extended this kind of approach to lowest-order methods to solve the PDE systems resulting from geo modeling applications.Indeed, a recent consistent unified mathematical framework which allows a unified description of a large family of these methods has emerged.It enables then, as for FE methods, the design of a high level language inspired from the mathematical notation.Such languages help then physicist to implement their application writing the mathematical formulation at a high level, hiding the complexity of numerical methods and low level computer science services guaranty of high performance.We have developed such a language, that we have embedded in the C++ language, on top of Arcane plateform [1].We have used the Boost Proto library [2], a powerful framework providing tools to design this DSEL.In [3] we have presented ArcFVDSL, our DSEL aiming to implement various lowest-order methods (Finite-Volume, Mimetic Finite Difference, Mixed Hybrid Finite Volume, etc.) for diffusive problems on general meshes.In [4] we have presented technical details on how this language has been designed with the Boost Proto library.In this paper, we focus on the capability of the language combined to runtime system tools like Heterogeneous Abstract RunTime System (HARTS) [5], to handle seamlessly new heterogeneous architectures with multi-core processors enhanced by General Purpose computing on Graphics Processing Units (GP-GPU).We present the performance results of various implementations of different academic problems on different kinds of heterogeneous hardware architecture.
In Section 1, we briefly present the mathematical framework, our DSEL relies upon.We present the C++ concepts used to define the back end and the front end of the language and explain how to parse expressions representing bilinear and linear forms.We show how our framework enables to generate useful algorithms to solve the discrete problems.In Section 2, we present how by introducing HARTS, a runtime layer to handle new heterogeneous hardware architecture, we have improved the generated algorithms to take advantage of multi-core architectures.
In Section 4, we present some performance results on various academic test cases that we have implemented with our framework.

ARCFVDSL, A DSEL IN C++ TO SOLVE DIFFUSIVE PROBLEMS WITH LOWEST-ORDER METHODS
As for FE/DG (Finite Element/Discontinuous Galerkin) methods, a unified mathematical framework which allows a unified description of a large family of lowest-order methods has recently emerged [6].The key idea is to reformulate the method at hand as a (Petrov)-Galerkin scheme based on a possibly incomplete, broken affine space.This is done by introducing a piecewise constant gradient reconstruction, which is used to recover a piecewise affine function starting from cell (and possibly face) centered unknowns.
For example, considering the following heterogeneous diffusion model problem: The continuous weak formulation reads: In this framework, for a given partition T h of X, a specific lowest-order method is defined by (i) selecting a trial function space U h ðT h Þ and a test function space ) and a linear form b h (v h ).Solving the discrete problem consists then in finding u h 2 U h such that: The definition of a discrete function space U h is based on four main ingredients: -T h the mesh representing X, S h a submesh of T h where 8S 2 S h ; 9T S 2 T h ; S & T S ; -V h the space of vector of degrees of freedom with components indexed by the mesh entities (cells, faces or nodes); -G h a linear gradient operator that defines for each vector v h 2 V h a constant gradient on each element of S h ; and -r h the broken gradient operator.
Using the above ingredients, we can define for all Usually three kinds of submesh S h are considered: T h the mesh itself, P h the submesh with pyramidal subcells built regarding the faces of T h and N h the submesh with subcells built regarding the nodes of T h .We denote T h the space of degrees of freedom with components indexed by cells and F h the space of degrees of freedom with components indexed only by faces.Usually the following choices are considered: With this framework, the model problem (1) can be solved with various methods: the cell centered Galerkin (ccG) method and the G-method with cell unknowns only; the hybrid finite volume method with both cell and face unknowns that recover the Mimetic Finite Difference (MFD) and Mixed Hybrid Finite Volume (MHFV) family.
For example, the hybrid finite volume method recovers the SUSHI scheme [7][8][9][10].The discrete space with hybrid unknowns is then obtained with: (i) where the linear residual operator r h : T h Â F h !P 0 d ðP h Þ is defined as follows: for all T 2 T h and all F 2 F T , This method with hybrid unknowns reads: and r h broken gradient on P h .This unified framework allows the design of a high level language close to the mathematical notation.Such a language enables to express the variational discretized formulation of PDE problem with various methods, each of them defining specific bilinear and linear forms.Algorithms are then generated to solve the problems, evaluating the forms representing the discrete problem.The front end of the language is based on concepts (mesh, function space, test trial functions, differential operators) close to their mathematical counterpart.They are linked to low level structures, the back end of the language, representing meshes, scalar arrays indexed by mesh entities, algebraic objects (vectors, matrices, linear operators).For theses structures we use frameworks like Arcane [6] or ALIEN a framework to handle various linear solver packages.Linear and bilinear forms are represented by expressions built with the terminals of the language linked with unary, binary operators (+, À, * , /, dot(.,.)) and with free functions like grad (.), div (.), integrate (.,.).The purpose of theses expressions is (i) to express the variational discretized formulation of the problem, (ii) to generate algorithms which consist in evaluating these expressions to build global linear systems which are solved to find the solution of the problem.The generative mechanism of our framework is based on the Boost Proto framework [2] and is described in detail [4].
For our diffusion model problem (1), such a DSEL will for instance achieve to express the variational discretized formulation (9) with the programming counterpart shown in Listing 1.The bilinear form a sushi h defined by ( 9) has the programming counterpart given in Listing 1 and the corresponding expression tree is detailed in Figure 1.
Listing 2 is a generic assembly algorithm consisting in iterating on each entity of the mesh group and in evaluating the test and trial expression on each entity.For such evaluation, different kinds of context objects are defined.The structure EvalContext<ItemT> enables to compute the linear combination objects, some generalized stencils or local vectors indexed by DOF.These objects are returned by the evaluation of test or trial expressions.When they are associated to a binary operator tag, they lead to a bilinear contribution, a local matrix contributing to the global linear system.In the same way the evaluation of a linear form expression with a linear context leads to the construction of the right hand side of a global linear system.
Once the global linear system is built, it can be solved with a linear system solver provided by the linear algebra layer ALIEN.

HARTS, AN ABSTRACT OBJECT ORIENTED RUNTIME SYSTEM LAYER FOR HETEROGENEOUS ARCHITECTURE
In the previous section we have presented a generative framework, based on a DSEL that enables to describe numerical methods at a high level, and to generate C++ codes with back-end objects with a generative mechanism.
In this section, we show how we can introduce various kind of parallelism in the generated codes with the runtime system layer HARTS presented in details [5].This layer is aimed to handle, in a unified way, different levels of parallelism.As in most existing runtime system frameworks, it is based on: (i) an abstract architecture model that enables us to describe in a unified way most of present and future heterogeneous architectures with static and runtime information on the memory, network and computational units; (ii) an unified parallel programming model based on tasks that enables us to implement parallel algorithms for different architectures; (iii) an abstract data management model to describe the processed data, its placement in each memory level and the various ways to access it from the each computation unit.The use of HARTS as illustrated in Figure 2 has several advantages: 1. it enables to clearly separate the implementation of the numerical layer from the implementation of the runtime system layer; 2. it enables to take into account the evolution of hardware architecture with new extensions and new concepts implementation, limiting in that way the impact on the numerical layer based on the DSEL generative layer.Most of the algorithms that could be parallelized rely on this layer that bridges the gap between our DSEL and the low level Application Programming Interface (API) used to execute algorithms on various computational units.For example, this layer has been used: 1. to create an abstract API to manipulate algebraic objects like matrices and vectors with standard Basic Linear Algebra Subprograms (BLAS) 1 and 2 operations; 2. to enhance the assembly part of linear systems, while linear and bilinear expressions are evaluated on collections of mesh items.Numerical low level algorithms can often been described as sequences of matrices and vectors algebraic operations.Expression tree for the bilinear form defined in Listing 1.
Expressions are in light gray, language terminals in dark gray.
Layer architecture.
Such sequences with a great number of floating points operations can be expensive, and the way to optimize them are diverse with respect to the hardware specificities.It is important for the generative framework to have a high level unified way to express such sequences independently of low level optimisations.We have for this purpose defined an abstract algebraic API detailed in Listing 3 aiming: (i) to hide hardware specificities, (ii) to manage memory allocation and locality, (iii) to manage parallel loops, (iv) to provide most BLAS 1 and 2 functionalities, (v) to provide tools to split vectors and manage vector views and range iterators.
class We have seen in Section 1 that the evaluation of bilinear and linear expressions with linear context objects leads to algorithms as in Listing 2. These algorithms consist in iterating on mesh entities, in computing local matrices and vectors which are assembled in a global matrix and right hand side vector.With new hardware architectures, such algorithms can be parallelized in a number of different ways as long as the concurrency on global data like the global linear system is managed.To handle the variety of low level systems, HARTS provides functionalities like HARTS::parallelForeach() (see Listing 3) to iterate on collection of mesh entities (nodes, faces, cells) and to apply lambda functions in parallel.Using graph coloring techniques, we can manage concurrency on shared data without locks.For that we partition the collections of mesh entities in sub collections of mesh entities with disjoined connectivities, then we apply on each sub collection any lambda functions, avoiding in that way any concurrent access to shared data (Shared degrees of freedom during the assembly phase).Listing 4 shows how the generic assembly algorithm of Listing 2 can be written delegating the parallelism to the runtime system layer.
The discrete space for the ccG method is obtained with: (i) where T g h is a linear trace reconstruction operator on the faces of T h .

Let for all ðu
The method reads: We can compare to the following programming counterpart: The discrete variational hybrid method reads [14]:

Stokes Problem
The countinuous strong formulation reads: where u : X !R d is the vector-valued velocity field, p : X !R is the pressure, and f : X !R d is the forcing term.
The countinuous variational formulation reads: Following [15], we consider a discretization based on the spaces The momentum diffusion is discretized by the bilinear form a h 2 LðU h Â U h ; RÞ such that where, for all w h 2 U h , the Cartesian components of w h are denoted by (w h,i ) i2{1,. ..,d} .The velocity-pressure coupling hinges on the bilinear form b h 2 LðU h Â P h ; R): The discrete divergence operator associated to b h is not surjective with choice of spaces (10).The stability of the velocity-pressure coupling can be recovered by penalizing pressure jumps via the bilinear form s h 2 LðP h Â P h ; RÞ such that The discrete problem reads: Find (u h , p h ) 2 U h 9 P h such that, for all (v h , q h ) 2 U h 9 P h , We can compare to the following programming counterpart:

PERFORMANCE RESULTS
In this section, we present some performance results obtained on the heterogeneous diffusive problem.We focus on the two most expensive parts, the global linear system assembly and the linear system resolution.In Figure 3, we can notice that, even if the linear system assembly is not trivial to parallelize due to concurrent access matrix and vector entries, we can obtain not so bad accelerations.For the linear system resolution, our generic layer ALIEN gives us access to various linear solver packages that can perform on multi-core nodes or on GP-GPU.In Figure 4, we see the solver performance up to 16 cores on multi-cores architecture.We can also see how we can even take advantage of the power of GP-GPU with the dedicated solver package for GP-GPU.

CONCLUSION AND PERSPECTIVE
We have presented ArcFVDSL, a DSEL developed on top of the Arcane framework.This high level language enables to implement various lowest-order methods for diffusive problems on general meshes.It hides low level optimisations for new heterogeneous architecture.We have shown how this generative framework combined to the HARTS layer can generate efficient codes and allows to handle seamlessly architectures with multi-core processors enhanced by GP-GPU.This combination turns to be a good solution to provide high level tools to implement new complex numerical methods preserving the performance of the generated code on heterogeneous hardware architectures.In the future, we plan to introduce at the numerical level, new features in the language to handle for example non linear models.At the hardware level, we plan to introduce new optimisations through the HARTS layer to take advantage on new many integrated cores architectures using for instance Intel Xeon Phi processors.Linear solver performance.

REFERENCES1
Figure 3Linear system assembly performance.