PeriPy -- A High Performance OpenCL Peridynamics Package

This paper presents a lightweight, open-source and high-performance Python package for solving peridynamics problems in solid mechanics. The development of this solver is motivated by the need for fast analysis tools to achieve the large number of simulations required for `outer-loop' applications, including sensitivity analysis, uncertainty quantification and optimisation. Our Python software toolbox utilises the heterogeneous nature of OpenCL so that it can be executed on any platform with CPU or GPU cores. We illustrate the package's use through a range of industrially motivated examples, which should enable other researchers to build on and extend the solver for use in their own applications. Step improvements in execution speed and functionality over existing techniques are presented. A comparison between this solver and an existing OpenCL implementation in the literature is presented, tested on benchmarks with hundreds of thousands to tens of millions of nodes. We demonstrate the scalability of the solver on the GeForce RTX 2080 Ti GPU from NVIDIA, and the memory-bound limitations are analysed. In all test cases, the implementation is between 1.4 and 10.0 times faster than a similar existing GPU implementation in the literature. In particular, this improvement has been achieved by utilising local memory on the GPU.


Introduction
Modelling the initiation and propagation of fractures remains a central and classical scientific challenge in computational mechanics. The complex nonlinearities, stress singularities and multiple scales involved with such problems mean that the approaches are extremely computationally intensive for any problem of industrial interest, and there are significant modelling uncertainties. Peridynamic theory, introduced by Silling in 2000 [1], provides a promising theoretical framework for developing robust numerical models that can be applied to fracture simulation across a broad range of materials [2,3]. In this contribution, we develop a scalable, high-performance solver exploiting modern Graphics Processing Unit (GPU) / Central Processing Unit (CPU) architecture, which is readily extended, to ensure that the computational demands of peridynamics simulators do not limit their adoption and future development for real-world problems.
Peridynamics is a non-local theory of continuum mechanics that is well suited to modelling crack initiation and propagation. The peridynamic model defines the material behaviour at a point in a continuum body as an integral equation of the surrounding displacement. This is in contrast to the classical theory of solid mechanics, where material behaviour at a given point is defined by partial differential equations. The classical theory is only valid if the body under analysis has a spatially continuous and twice differentiable displacement field. Spatial derivatives are not defined across discontinuities, and additional techniques are required for fracture modelling. Peridynamic theory does not include spatial derivatives and remains valid across discontinuities. Fracture is an emergent behaviour of the peridynamic governing equation, and no additional assumptions or techniques are required. There are two primary formulations of peridynamic theory: bond-based [1] and state-based theory [4]. In the original bond-based theory, pairwise particles interact through a single bond. It is an oversimplification to assume that any pair of particles interact only through a central potential that is independent of all other local conditions. For example, this assumption limits the Poisson ratio to a fixed value. Silling et al. [4] later introduced a generalised state-based theory that overcomes the limitations of the original theory.
Peridynamics has been successfully applied to a wide range of material failure problems, and Javili et al. [2] performed a thorough review of peridynamic theory and its many applications for static and dynamic problems across a range of materials. Diehl et al. [3] recently reviewed benchmark experiments for the validation of peridynamic models. The authors identified 39 publications that compare numerical predictions from peridynamic simulations against experimental data. Validation has been restricted to relatively simplistic element-level benchmark tests, such as the Split Hopkinson Pressure Bar test, Kalthoff-Winkler experiment, three-point flexural tests and anchor bolt pullout. Validation through large-scale experiments is still needed.
The computationally demanding nature of peridynamics is a major barrier to tackling large-scale industry-motivated problems. Peridynamic models are computationally expensive due to the non-local nature of the governing equations. Simulating large problems of interest, on the order of hundreds of thousands to tens of millions of particles, requires vast computational resources. Reducing the computational expense of peridynamic models is essential for wider adoption and further development. The high computational expense of peridynamic simulations is a major factor preventing detailed validation studies. This is manifested as follows: (1) The vast majority of papers only consider two-dimensional problems.
(2) Very few papers address large-scale industry-motivated problems. (3) The majority of papers calibrate model parameters against a single experiment, and only a few address further validation [3]. The availability of a fast solver would enable the development of 'outer-loop' applications that require a large number of simulations, including sensitivity analysis, uncertainty quantification [5,6], optimisation [7,8] and convergence [9] studies. It is essential to quantify the uncertainty in model outputs for peridynamics to become a viable method for addressing industry-motivated problems.
A number of strategies are available to minimise computational effort: (1) employ a multi-scale discretisation scheme and refine the mesh in areas of interest [10,11]; (2) adaptively refine the mesh at crack tips [12,9,13,14]; (3) couple peridynamic and finite element meshes [15,16]; and (4) exploit parallel computing techniques and multi-thread processing. Peridynamics lends itself well to parallel processing techniques because the forces of the bonds and particles can be computed in parallel. Peridynamics is particularly suited to parallelisation when compared to other particle methods, such as Smoothed Particle Hydrodynamics (SPH), because the neighbourhood of nodes remains unchanged throughout the simulation, and so a computationally expensive neighbour search over nodes only needs to be performed once for any given model. The next three paragraphs explore (4), the many existing peridynamics codes that have been developed on distributed, shared and massively parallel GPU memory architectures, respectively.
Message Passing Interface (MPI) is the dominant parallel programming model for developing highly scalable codes on distributed memory platforms. MPI-based codes use spatial decomposition, in which the physical space is decomposed into multiple subdomains, and each subdomain is assigned to a unique compute node (each compute node has its own local memory). This approach is used for problems with large memory requirements. A number of peridynamic solvers have been developed that utilise MPI [17,18,19]. EMU, developed by Silling [17], was the first computational peridynamics code. EMU is a parallel MPI-based Fortran code [20]. Peridynamics has been implemented within the classical molecular dynamics package LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator). The peridynamics package in LAMMPS is known as PD-LAMMPS [18]. This was the first open-source peridynamics code. A more recent and capable package is the open-source code Peridigm [19]. Peridigm is the only available open-source code that is specifically designed for peridynamic simulations. Peridigm is written in C++ and requires many dependencies [21].
The number of cores per compute node has rapidly increased in recent years, and shared memory approaches have been widely adopted. OpenMP (Open Multi-Processing application programming interface) is the standard for developing shared memory multi-threaded code. OpenMP has been used in multiple works [22,23,24]. Parallelisation with OpenMP is efficient and simple to implement but is constrained by the memory of a single compute node. Hybrid applications can be developed for large-scale clusters, where MPI is used for parallelism across compute nodes and OpenMP is used for parallelism within a multi-core compute node. Hybrid MPI/OpenMP approaches have been employed by Dalla Barba et al. [25] and Ha [26]. HPX (High Performance ParallelX) has recently been employed by Diehl et al. [27] to develop scalable peridynamics code.
The use of GPU codes to speed up peridynamics simulations has been limited. Calculation of the bond forces is ideal for GPU acceleration, as each bond calculation can be done on one of the many concurrent threads in a GPU. Mossaiby et al. (2017) [28] demonstrated that, by replacing the outer loop over nodes with a compute kernel executing on each node, a speed increase of a factor of 40 to 90 over optimised sequential C++ code and of 3 to 6 over OpenMP parallel code can be achieved. The current work improves on the work by Mossaiby et al. [28] by additionally parallelising over bonds and utilising a binary parallel reduction of the bond forces into the nodal force density, exploiting the execution model and memory model of OpenCL, respectively. This paper introduces PeriPy, a fast, open-source package with a user interface in Python. The code is hosted on GitHub at https://github.com/alan-turing-institute/PeriPy and is fully tested. PeriPy is easy to use, and the reader can get started with the latest documentation at https://peripy.readthedocs.io. Please see Appendix A for the installation instructions. PeriPy is 1.4-10.0x faster than the Mossaiby et al. OpenCL implementation [28] and therefore opens up the possibility of 'outer-loop' applications, including uncertainty quantification, optimisation and feature recognition. PeriPy has support for both regular and irregular mesh files, composite and interface material models, and arbitrary n-linear 'microelastic' damage models, and simulates force- or displacement-controlled boundary conditions and initial conditions. The user interface allows arbitrary subsets of particles to be easily measured for their state variables. Output files can be viewed in ParaView. Various 'partial volume correction' algorithms, 'surface correction' algorithms and 'micromodulus functions' are included. Various explicit integrators are included, and the code is easily extended to define other higher-order and/or adaptive integrators.
It is hoped that this package will become the industry standard and that other researchers will contribute to its further development. An industry-standard solver is important for a number of reasons: (1) to reduce unnecessary development time; (2) so that research is accessible; (3) so that research is repeatable. Section 2 provides a recap of bond-based peridynamics theory, while Section 3 provides an overview of the code design and the factors that have helped provide the speedup over existing codes. Section 4 provides benchmark tests that demonstrate the speedup over existing codes and the code performance on a modern GPU up to tens of millions of nodes, over thousands of time-steps and with varying node family size. Section 5 provides a validation of PeriPy against existing codes through an engineering test case, the Kalthoff-Winkler experiment. Section 6 demonstrates the use of the code with a problem from industry, the simulation of notched concrete beam fracture. The experiment involves outer-loop optimisation over peridynamics constitutive model parameters using experimental data from a concrete beam test. The point estimate of the constitutive parameters is tested on experimental data, the first validation of its kind in the literature. Section 7 demonstrates the use of the code with another problem from industry, the simulation of a reinforced concrete beam, and shows a visualisation of the cracking pattern. Section 8 outlines future avenues for developing PeriPy, and Section 9 concludes the paper.

Preliminaries: Bond-based peridynamics equations and their discretisation
In this section, we briefly review bond-based peridynamics continuum theory and its numerical discretisation. The implementation presented is readily applicable to state-based theory [4,29], and this connection is briefly discussed in Sec. 8.2.

Peridynamic continuum model
Our starting point, which is general to all formulations of peridynamic models, is the continuum description. A deformable body (see Fig. 1) occupies the domain D ⊂ R^d (for d = 2 or 3). The deformation (displacement) at some position x ∈ D is described by the vector-valued function u(x, t) : D → R^3, for some time t ∈ R^+. The relative distance between two points x and x' ∈ D in the undeformed and deformed configurations is denoted by ξ = x' − x and ξ + η, respectively, where η = u(x', t) − u(x, t) is the relative displacement. The acceleration of the body at x ∈ D is described by the peridynamic equation of motion

ρ(x) ü(x, t) = ∫_{H_x} f(η, ξ) dV_{x'} + b(x, t),   (1)

where ρ(x) is the material density, f(η, ξ) is a pair-wise force function (the force vector per unit volume squared that the body at x' exerts on the body at x) and b(x, t) denotes a body force. Perhaps most importantly in peridynamic theory, H_x defines the horizon, a subdomain of the undeformed space H_x ⊂ D, over which forces within the material interact. This finite length scale over which forces in a material interact means that peridynamics defines a non-local continuum model of the material.
The constitutive behaviour of the material is encapsulated within the force function f. Silling [30] calculated the pairwise force function for an elastic material, defined in terms of a scalar micropotential w, such that

f(η, ξ) = ∂w/∂η (η, ξ).   (2)

This representation leads to a pairwise force function of the form

f(η, ξ) = f̂(|ξ + η|, ξ) (ξ + η)/|ξ + η|,   (3)

where f̂ is an associated scalar-valued function that contains the constitutive model.

Constitutive model
To define the constitutive model, the micropotential function for linear elastic materials is defined as

w(η, ξ) = c(|ξ|) s² |ξ| / 2.   (4)

Here, s = (|ξ + η| − |ξ|)/|ξ| is the stretch between two points in the material, while c(|ξ|) is the micro-modulus function and refers to the elastic bond stiffness between two points. For the purposes of this paper, we use constitutive laws that have been derived for a constant micro-modulus function. The use of a non-constant micro-modulus function may provide a number of benefits, such as reduced surface effects [31] and improved numerical convergence [32,33]. We have chosen to implement two micro-modulus functions in PeriPy: constant and conical, the latter of which is explored by Ha and Bobaru [34]. The code can be easily extended to other formulations of c(·).
In this paper, we consider examples using the prototype micro-elastic brittle (PMB) model [30], which relates the bond stretch s to a scalar-valued force projected in the direction of the bond,

f̂(s, t, ξ) = (18K / (πδ⁴)) s µ(t, ξ).   (5)

Here, K defines the material bulk modulus, δ the horizon, s the stretch (see (4)) and µ a damage parameter, which controls the onset of brittle failure. In the undamaged/linear regime, this expression was derived by equating the strain energy density of a peridynamic body under uniform dilation with classical elasticity theory [30]. For a pure shear state, there is also an equivalent strain energy density, and a second expression for c can be determined. Equating these two expressions for c, it is found that, for the simple PMB peridynamic formulation, the Poisson's ratio ν is limited to 1/4 in the general 3D case and for plane strain, and to 1/3 [35] in 2D plane stress conditions. This restriction on the Poisson's ratio can and has been addressed by introducing state-based peridynamic theory, which has been considered extensively by Silling et al. [4]. However, it is not discussed further in this paper or used within the examples since, while the initial formulation of the ordinary state-based model is more involved, the resulting computational implementation is equivalent, albeit with additional parameters to define a material response (see Section 8.2). As introduced in (5), brittle failure is captured in the constitutive model by introducing a history-dependent scalar-valued (boolean) function

µ(t, ξ) = 1 if s(t', ξ) < s₀ for all 0 ≤ t' ≤ t, and 0 otherwise.   (6)

This (non-smooth) parameter allows bonds to irreversibly break when they are stretched beyond a predefined limit, defined by the critical stretch of the bond,

s₀ = √(5G_F / (6Eδ)),   (7)

where E = 3K(1 − 2ν) is Young's modulus and G_F the fracture energy release rate of the material. Silling and Askari [30] derived this value by separating a peridynamic body into two halves and thereby breaking all the bonds that initially crossed the fracture surface.
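For illustration, the PMB law (5) with its irreversible damage parameter (6) can be sketched as a scalar Python helper. This is a hypothetical function for exposition, not PeriPy's kernel; the prefactor 18K/(πδ⁴) is the standard 3D constant-micromodulus value:

```python
import numpy as np

def pmb_bond_force(xi, eta, bulk_modulus, horizon, critical_stretch, mu):
    """Pairwise PMB bond force density for a single bond.

    xi  : initial bond vector x' - x
    eta : relative displacement u(x', t) - u(x, t)
    mu  : 1.0 while the bond is intact, 0.0 once broken (irreversible)
    Returns the force density vector and the updated damage parameter.
    """
    c = 18.0 * bulk_modulus / (np.pi * horizon**4)  # 3D constant micromodulus
    y = xi + eta                                    # deformed bond vector
    stretch = (np.linalg.norm(y) - np.linalg.norm(xi)) / np.linalg.norm(xi)
    if stretch >= critical_stretch:                 # bond breaks irreversibly
        mu = 0.0
    # scalar force projected along the deformed bond direction
    return c * stretch * mu * y / np.linalg.norm(y), mu
```

Returning the updated µ makes the history dependence of (6) explicit: once a bond has broken, the caller must pass µ = 0 for all subsequent time-steps.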
By determining the energy required to break a single bond, the energy per unit fracture area for complete separation of a peridynamic body can be determined. The energy release rate G F is a material parameter and can be determined experimentally. The PMB model is shown in Fig. 2.
Remark: Although in this paper we limit the examples to modelling quasi-brittle homogeneous isotropic materials, PeriPy can model heterogeneous materials, such as composites. Mehrmashhadi et al. [36] randomly assigned bonds as matrix, fibre or interface bonds so that the volume fraction matched the average volume fraction of peridynamic nodes, thereby modelling a composite material that is homogeneous on length scales larger than the horizon. This is possible in PeriPy by defining a bond classification function as a function of the points x and x'. The bond stiffness and other damage model parameters for the matrix, fibre or interface bonds are passed at model instantiation or, alternatively, at the time of simulation, which allows for outer-loop black-box optimisation of constitutive parameters with respect to some objective function, as demonstrated in Section 6. Since the relative angle between a bond and the fibre orientation captures the fibre direction, peridynamic theory has an inherent advantage in modelling fibre-reinforced composite laminates. Xu et al. [37] classified the bonds as either fibre or matrix bonds to reproduce the anisotropy of a fibre-reinforced composite lamina. PeriPy is also capable of modelling anisotropic materials, which may be modelled through a dependence of f on the initial direction of the bond ξ. Due to the inherent anisotropy of a fibre-reinforced composite, Hu et al. [38] assigned distinct micro-elastic moduli to the bonds, classified as either longitudinal or transverse, to describe the anisotropy. In their work, they used a micromodulus function to introduce the dependence of the interaction strength on the bond length. Micromodulus functions are included in PeriPy, and the code is easily extended for users to implement their own micromodulus function, such as the one used by Hu et al. The design choices and features of PeriPy are discussed in more detail in Section 3.6.
The continuum-level damage in a peridynamic body can be measured by introducing the local damage

φ(x, t) = 1 − ∫_{H_x} µ(t, ξ) dV_{x'} / ∫_{H_x} dV_{x'}.   (8)

This defines the ratio of broken bonds to original bonds for all points in the horizon of x. Therefore, 0 ≤ φ ≤ 1, with 0 representing virgin material and 1 representing complete disconnection of a point from all of the points with which it was initially connected. Completely separating the two halves of the body across a fracture surface requires breaking all the bonds that initially connected points in the opposite halves, and results in a damage of φ ≈ 0.5 at those points on the fracture surface.
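In the discrete setting, the damage at a node reduces to a ratio over its original family. A minimal sketch (hypothetical names, assuming the padded neighbour-list convention described in Section 3, where broken or padded bond slots are marked with −1):

```python
import numpy as np

def local_damage(nlist_row, family_size):
    """Discrete local damage at one node: the fraction of its original
    family_size bonds that have broken.

    nlist_row: neighbour-list row for the node; broken (or padded)
    bond slots are marked with -1, intact bonds hold a node index.
    """
    intact = np.count_nonzero(np.asarray(nlist_row) != -1)
    return 1.0 - intact / family_size
```

Because padding slots never held a bond, the original family size must be recorded separately from the row length when computing the ratio.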

Spatial discretisation
As before, let a body in its initial configuration occupy a domain D ⊂ R^d. Different discretisation approaches can be used to solve a problem numerically in peridynamics [39]. Computationally, using the mesh-free approach of Silling and Askari [30], the material can be represented by nodes (or particles) in the index set G := {i | x_i ∈ D}. Here, the total number of nodes is denoted by n = |G|, the initial coordinates by x_i ∈ D, the displacement vector by u_i(t) ∈ R^d for t ∈ [0, T], and the associated lumped volume for that node by dV_i. Furthermore, a given node is subject to an externally applied force density b_i ∈ R^d.
As mentioned above, the peridynamic formulation is a non-local model. The i-th node interacts with all nodes within the index set

H_i := {j ∈ G | |x_j − x_i| ≤ δ, j ≠ i},   (9)

which is called the family of node x_i. The user-defined parameter δ ≥ 0 is, as before, the horizon, which describes a characteristic length scale specific to the mechanics of the material.
The discretised peridynamics equation of motion approximates the integral in Eq. (1) with the sum

ρ_i ü_i(t) = Σ_{j ∈ H_i} f(η_ij, ξ_ij) β_ij dV_j + b_i(t),   (10)

where a partial volume correction β_ij is introduced to improve the accuracy of the spatial integration. For simplicity, this work does not use any partial volume correction algorithms, but PeriPy implements the partial volume algorithm proposed by Hu et al. [32]. Seleson [33] provides a detailed analysis of the various methods available for partial volume correction and concludes that this algorithm outperforms both the approximation of the partial volumes by their full volume (β_ij = 1) and the algorithm that is commonly used in peridynamics software (e.g. PD-LAMMPS [40]).
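In serial Python, the sum in (10) is the nested loop over nodes and families that the OpenCL implementation of Section 3 parallelises. A minimal sketch (hypothetical names; bond_force stands in for the pairwise force function, and β_ij is taken as 1):

```python
import numpy as np

def nodal_force_density(coords, u, nlist, bond_force, volumes, body_force):
    """Serial evaluation of the right-hand side of the discretised
    equation of motion: each node's force density is the sum of bond
    force densities over its family, plus the external force density."""
    force = np.array(body_force, dtype=float)  # start from b_i
    for i, row in enumerate(nlist):
        for j in row:
            if j == -1:                    # padded slot or broken bond
                continue
            xi = coords[j] - coords[i]     # initial bond vector
            eta = u[j] - u[i]              # relative displacement
            force[i] += bond_force(xi, eta) * volumes[j]
    return force
```

Note that each bond contributes independently of every other bond, which is exactly the property that makes the inner computation parallelisable over bonds on a GPU.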
An energy dissipation term, known as dynamic relaxation, can be added to (10) so that the solution converges to a steady-state solution [22]. The method relies on the introduction of an artificial damping term in the equation of motion,

ρ_i ü_i(t) = F_i(t) − c_d u̇_i(t),   (11)

where the resultant force on node i, F_i(t) = Σ_{j ∈ H_i} f(η_ij, ξ_ij) β_ij dV_j + b_i(t), has been defined for brevity, and c_d is the damping coefficient.

Time integration
Simulating the evolution of the model from its initial conditions requires a numerical time integration scheme. The choice is between explicit and implicit schemes [41]. Explicit schemes are most commonly used in peridynamics solvers because they are more suited to parallel computation, as no solution of a system of equations is required. The specific method chosen was the velocity-Verlet method. Velocity-Verlet leads to improvements in numerical accuracy in node dynamics simulations [42], is simple to implement and has been widely used in other implementations of peridynamics solvers. Velocity-Verlet is a second-order symplectic integrator that has minimal computational memory requirements, which is important for GPUs since they are inherently memory constrained by the amount of global memory, or VRAM. The velocity-Verlet scheme can be written as follows:

u̇_i(t + Δt/2) = u̇_i(t) + (Δt/2) ü_i(t),
u_i(t + Δt) = u_i(t) + Δt u̇_i(t + Δt/2),

followed by

u̇_i(t + Δt) = u̇_i(t + Δt/2) + (Δt/2) ü_i(t + Δt).

Here, ü_i(t) is the acceleration term evaluated using the equation of motion (10). The velocity-Verlet algorithm does not provide a prescription for including velocity-dependent forces as a result of dynamic relaxation. To extend this algorithm to include velocity-dependent forces, we follow Groot and Warren (1997) [42], where the equation of (damped) motion (11) is written as

ü_i(t + Δt) = (1/ρ_i) [F_i(t + Δt) − c_d u̇_i(t + Δt/2)],

where the half-step velocity is u̇_i(t + Δt/2) = u̇_i(t) + (Δt/2) ü_i(t).
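One step of the undamped velocity-Verlet scheme can be sketched in Python (hypothetical names; the acceleration callback stands in for evaluating (10), and the same code works for scalars or NumPy arrays):

```python
def velocity_verlet_step(u, v, a, dt, acceleration):
    """Advance displacement u, velocity v and acceleration a by one
    time-step dt. acceleration(u) evaluates the equation of motion at
    the new displacements."""
    v_half = v + 0.5 * dt * a           # half-step velocity
    u_new = u + dt * v_half             # full-step displacement
    a_new = acceleration(u_new)         # acceleration at t + dt
    v_new = v_half + 0.5 * dt * a_new   # full-step velocity
    return u_new, v_new, a_new
```

Because the scheme is symplectic, the energy error remains bounded over long simulations rather than drifting, which is one reason it is favoured for explicit dynamics.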

Surface correction factors
The derivation of the peridynamic bond stiffness c is based on the assumption that the horizon of a material point is contained within the bulk, H_i ⊂ D. This is not true for nodes within a distance δ of the body edge, which do not possess a full family. Consequently, the stiffness of the nodes contained within the bulk and nodes on the edge of a body differs, resulting in significant error between peridynamic solutions and classical analytical methods. This is called the peridynamic surface effect. Le and Bobaru [43] presented a detailed comparison of the most common surface effect correction techniques. They found that the Volume method, proposed by Bobaru et al. [44], is one of the most effective techniques for correcting surface effects. The method ensures that nodes under homogeneous deformation have the same strain energy density values when located near the surface or in the bulk. The micromodulus of a bond ξ_ij is corrected as c_ij,corrected = λ_ij c_ij, where

λ_ij = 2V₀ / (V_i + V_j),

with V_i = Σ_{j ∈ H_i} dV_j the neighbourhood volume of node i and V₀ the volume of a full (bulk) horizon.
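These factors only depend on the initial configuration, so they can be precomputed once on the host. A minimal sketch, under the assumption that the Volume method takes the form λ_ij = 2V₀ / (V_i + V_j) with V_i the summed family volume of node i and V₀ the volume of a full bulk horizon (hypothetical names; nlist rows are padded with −1):

```python
import numpy as np

def volume_correction_factors(nlist, volumes, v0):
    """Per-bond stiffness correction factors for the Volume method.

    Assumes lambda_ij = 2*V0 / (V_i + V_j): bulk nodes (V_i = V0) get
    lambda = 1, surface nodes with incomplete families get lambda > 1,
    stiffening their bonds to compensate for the missing material.
    """
    family_volume = np.array(
        [sum(volumes[j] for j in row if j != -1) for row in nlist])
    lam = np.ones(nlist.shape)
    for i, row in enumerate(nlist):
        for k, j in enumerate(row):
            if j != -1:
                lam[i, k] = 2.0 * v0 / (family_volume[i] + family_volume[j])
    return lam
```

Since λ_ij is symmetric in i and j, the corrected bond stiffness seen from either end of the bond is identical, which preserves the pairwise (Newton's third law) structure of the bond force.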

Implementation of PeriPy in Python and OpenCL
It has been proposed that, if a peridynamic code has few dependencies, is easy to use and, above all, is scalable and sufficiently fast for real-world engineering problems, then the user base of bond-based peridynamics will increase, and new research in bond-based peridynamics can utilise a solver with a known, robust standard. To that end, we have developed an open-source GPU (Graphics Processing Unit) code that uses OpenCL [45,46] for GPU acceleration and a Python front-end for user inputs. Details of installation (via pip) and documentation can be found in Appendix A. The GPU communicates with the host device (a CPU) via the OpenCL API. The front-end of the code is written in Python in order to make it accessible for research, and as a result the Python API pyopencl [47] was chosen.
Preliminary testing showed that the calculation of bond forces and their reduction to a single node force are the most expensive components of solving any peridynamics problem numerically. Testing showed that these two steps account for 93-98% of the run time over a range of test setups. This is because of the need to calculate the bond force density of each bond in each node family in Eq. (10). Our implementation therefore targets the acceleration of these two critical steps in the peridynamic solver.
In a serial peridynamics solver, a nested for-loop over all nodes and families is used to obtain the force density contributions of each bond f_ij, with each of the elements of this nested for-loop calculated in series. In fact, the bond forces f_ij can readily be calculated in parallel, as the calculation of the contribution to force density of one bond is independent of another. Therefore, calculation of the bond forces is ideal for GPU acceleration since each bond calculation can be done on one of the many concurrent threads in a GPU. Peridynamics is particularly suited to parallelisation when compared to other particle methods, such as Smoothed Particle Hydrodynamics (SPH), because the family H_i remains unchanged throughout the simulation, and thus a neighbour search over nodes only needs to be performed once for any given model; for each subsequent simulation, the family can be read from a file stored to disk.
A function written in OpenCL (using C) that replaces loops with functions executing at each point in a problem domain is called a compute kernel, or kernel. This is the central idea behind OpenCL. Mossaiby et al. (2017) [28] demonstrated that, by replacing the outer loop over nodes with a kernel for each node executing in parallel, a speed increase of a factor of 40 to 90 over optimised sequential C++ code and of 3 to 6 over OpenMP parallel code can be achieved. The current work improves on the work of Mossaiby et al. [28] by pairing a parallel computation of each bond force with a binary parallel reduction of the bond forces into the nodal force density, exploiting the execution and memory models of OpenCL, respectively. We elaborate on this step below and give a brief introduction to OpenCL with an example.

Introduction to OpenCL with binary parallel reduction example
We now introduce the language and concepts of OpenCL by way of explaining one of the optimisations used to speed up the code in the current work: binary parallel reduction, a common and important data-parallel primitive [48].

Execution model
In mesh-free peridynamics, the bond force density contributions are summed to calculate the resultant force on each particle, as shown on the right-hand side of Eq. (10). In PeriPy, the bond force density contributions for each node, for each Cartesian direction, are stored in an N × 1 C array and summed to a Cartesian resultant force vector using a binary parallel reduction algorithm. A pictorial representation of this algorithm for a single node and Cartesian direction, with N = 4, is shown in Fig. 3. The idea of this algorithm is to use sequential addressing [48] to perform an addition on two values that are a stride apart and propagate this result downwards, at each step reducing the stride by a factor of two, so that the final result is inserted into the first space of a local memory cache. The pseudocode for binary parallel reduction is shown in Algorithm 1 in the Appendix and is executed using the OpenCL kernel shown in Listing 1, also in the Appendix.
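The sequential-addressing scheme can be mimicked on the host in a few lines of NumPy, which is a useful way to check a kernel's logic before running it on a device (a sketch for illustration; N must be a power of two):

```python
import numpy as np

def binary_parallel_reduce(values):
    """Sum N values (N a power of two) by repeatedly halving a stride,
    mirroring the sequential-addressing reduction each work-group
    performs in local memory."""
    cache = np.array(values, dtype=float)  # stands in for local_cache
    stride = cache.size // 2
    while stride > 0:
        # in the kernel, work-items with local_id < stride each perform
        # one of these additions in parallel, separated by a barrier
        cache[:stride] += cache[stride : 2 * stride]
        stride //= 2
    return cache[0]  # the result ends up in local_cache[0]
```

The loop body touches contiguous halves of the cache, which is the property that makes sequential addressing free of the bank conflicts suffered by interleaved-addressing reductions [48].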
The OpenCL execution model [46] describes how kernels execute. When a kernel is submitted for execution by the host (as defined in Section 3), an index space is defined. An instance of the kernel, called a work-item, executes for each point in this index space. The work-item can therefore be identified by its point in the index space, called the global_id. Returning to the binary parallel reduction example in Listing 1, to create the index space, each bond in the sparse two-dimensional index set {H 1 , ..., H n } is assigned a global_id ∈ [0, global_size), where global_size = nN .
In OpenCL, work-items are organised into work-groups. The work-groups provide a coarse decomposition of the index space. The local_size is the number of work-items per work-group. A work-group is identified by its point in this coarse index space, called the group_id. Thus, each and every node may be assigned a group_id if each work-group contains local_size = N work-items, as is the case in PeriPy. Work-items are assigned a unique local_id ∈ [0, N) within a work-group such that global_id = group_id × local_size + local_id. Calculations on the work-item are then for the local_id-th bond of the group_id-th node. The result of executing Listing 1 is that N work-items (for each work-group) execute in parallel to reduce the bond forces to the node force of a single node, which is stored in local_cache[0]. Of course, there must be a local_cache for each work-group that stores the bond forces for the work-group's node. This is explained by the OpenCL memory model [46] in the next section. N must clearly be a power of two so that a binary parallel reduction can be performed. Therefore, N = min{2^k | 2^k ≥ max_i |H_i|} is the minimum power of two that is larger than (or equal to) the maximum number of neighbours of any one node, such that each bond is assigned a global_id, and the constants that are associated with each bond (for example, the stiffness_corrections factors) can be stored in a dense global_size C array. Finally, the first work-item in the work-group, local_id = 0, is used to assign this value in the body_force array, and the reduction is complete. It is important to note that not only is each work-item in a work-group executed in parallel, but many work-groups can be executed in parallel as well. This means we are using the execution model to parallelise over bonds as well as nodes.
As well as using the execution model to perform the binary parallel reduction algorithm, we use it to calculate the other computationally expensive part, the bond forces. Each work-item is responsible for calculating the bond force of one bond, which is presented in the pseudocode of Algorithm 2 in the Appendix. This is in contrast to the work of Mossaiby et al. [28], who assign an index space only over nodes, which is presented in pseudocode in Algorithm 3, also in the Appendix. In the kernel for Algorithm 2, work-items and work-groups are assigned a unique global_id and group_id, respectively, in exactly the same way as in Listing 1. The index of the child node is j = nlist[global_id], where nlist is the neighbour list, an n × N C array whose i-th row contains H_i. The rows of nlist are padded with values of −1 where |H_i| < N. Note that node indices are taken from the set of non-negative integers, so −1 is a safe choice for denoting a broken bond. The global_id-th bond is broken by setting nlist[global_id] = −1 and the bond force f_ij = 0. Using Algorithm 2, it is possible to achieve particular optimisations that result in a significant speedup in the peridynamics code, as explained in Sections 3.3-3.5.
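The padded neighbour list only needs to be built once per model, before any simulation. A brute-force Python sketch of its construction (hypothetical names; PeriPy's actual neighbour search may differ):

```python
import numpy as np

def build_nlist(coords, horizon, max_neighbours):
    """Build an n x N neighbour list padded with -1.

    coords: (n, 3) array of node coordinates. max_neighbours (N) must
    be at least the largest family size; in PeriPy it is additionally
    rounded up to a power of two for the binary parallel reduction.
    """
    n = len(coords)
    nlist = -np.ones((n, max_neighbours), dtype=np.intc)
    for i in range(n):
        # all nodes within the horizon of node i, excluding i itself
        distances = np.linalg.norm(coords - coords[i], axis=1)
        family = np.flatnonzero((distances <= horizon) & (np.arange(n) != i))
        nlist[i, :len(family)] = family
    return nlist
```

Because −1 marks both padding and (later) broken bonds, a kernel work-item can use a single guard, `if (nlist[global_id] != -1)`, to skip dead slots.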
In OpenCL, the global_size = nN must be divisible by the local_size = N because work-items must be divided evenly between work-groups. Recall that there is a one-to-one mapping between group_ids and nodes, and each work-group is responsible for calculating the resultant node force of a single node. Work-groups are executed in parallel on a device's compute units. There is no way to synchronise the timings of work-groups since, in an application, there are usually more work-groups than hardware to execute them concurrently. As a result, they will be queued to run as compute units become available. Due to the limited number of compute units on any GPU, we can expect the performance of PeriPy to scale linearly with the number of nodes. At smaller problem sizes, the overhead due to memory transfer between the GPU and controlling CPU dominates, while at larger problem sizes the time required for actual computation comprises most of the total wallclock time. This is seen to be the case in Fig. 4, which shows the performance of PeriPy for different values of N. All of the benchmarking tests in Figs. 4 and 7-9 were run on an NVIDIA 2080 Ti GPU for various problem sizes n. The benchmarking setup is detailed in Section 4. Fig. 4 shows tests for beams that have been discretised with a regular cubic tensor grid, consistent with the majority of the existing literature. The beams were put under some basic boundary conditions, and the tests were displacement controlled, although this does not affect the simulation time. The stable time-step condition derived by Silling and Askari [30] was used. Note that N can be chosen independently of n. This is because the horizon radius δ is selected so that it has a value δ > m∆x, where m is a small positive constant (usually chosen to be π) [49] and ∆x is the mesh spacing. The choices of m and δ are discussed in Bobaru and Hu [50] and Henke and Shanbhag [51].
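The choice of group size follows directly from the family sizes; a minimal sketch (group_size is our illustrative name; the 122-neighbour count for δ = π∆x on a cubic grid is the number of integer lattice points within radius π, excluding the node itself):

```python
def group_size(max_family_size):
    """Smallest power of two greater than or equal to the largest
    family size max_i |H_i| (illustrative helper, not PeriPy's API)."""
    N = 1
    while N < max_family_size:
        N *= 2
    return N

# On a regular cubic grid with horizon delta = pi * dx, each bulk node
# has up to 122 neighbours, so the rows of nlist are padded up to N = 128.
assert group_size(122) == 128
assert group_size(64) == 64
```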
In the limit of the horizon δ approaching zero, the peridynamic solution should converge to the classical continuum elasticity solution. Bobaru and Hu [50] and Zhao et al. [52] note that the size of the horizon should be defined by the smallest geometrical feature; see also Silling and Askari [30] and Silling and Madenci [53].

Memory model
The memory regions of an OpenCL device and their relationship to the OpenCL platform model [46] are summarised in Fig. 5. The most basic building block of the OpenCL platform model is the processing element: a virtual scalar processor, one of the many concurrent 'threads' on the device. A work-item may execute on one or more processing elements. A compute unit is composed of one or more processing elements and local memory. An OpenCL device has one or more compute units, and a work-group executes on a single compute unit.
Compute device memory is one or more memories attached to the compute device that are directly available to kernels executing on OpenCL devices. Compute device memory consists of four named address spaces or memory regions:
• Global Memory: permits read/write access to all work-items in all work-groups running on any device within a context. Work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached depending on the capabilities of the device.
• Constant Memory: A region of global memory that remains constant during the execution of a kernel instance. The host allocates and initialises memory objects placed into constant memory.
• Local Memory: A memory region local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group.
• Private Memory: A region of memory private to a work-item. Variables defined in one work-item's private memory are not visible to another work-item.
Local and private memories are always associated with a particular compute unit and processing element, respectively. Local and private memories are much faster than global memory. Therefore, it is imperative to exploit the different memory regions in GPU-accelerated programming. Consider again the binary parallel reduction example in Listing 1. The local memory cache was used to store the three (one for each Cartesian direction) N × 1 C arrays containing the bond forces. The summation is performed using the higher bandwidth of the local memory; the bond forces are never stored in the slower-to-access global memory. The global and constant memories are shared between all compute devices, compute units and processing elements. In the binary parallel reduction algorithm, the body_force array is stored in global memory. In the bond force algorithm, the stiffness_corrections are constant and so are stored in the constant memory. In PeriPy, a single compute device is used, which can be either the GPU or CPU. The four named address spaces available to a device do not overlap. This is a logical relationship, however, and an implementation may choose to let these disjoint named address spaces share physical memory.

On the GPU, the private memory accessible to each work-item is small, only O(10) (32-bit) words per work-item, and very fast [54]. Non-pointer variables are stored in the private memory, such as those used to assign the node indices i = group_id and j = nlist[global_id] and the relative displacements ξ_ij. Private memory is used to do fast floating-point arithmetic, such as evaluating the bond force. Each work-group contains many work-items that share its local memory. This memory is larger (O(10) kB) than the private memory of a work-item and is used in PeriPy to store the bond forces. The largest memory on a GPU is the global memory, which resides in the high-speed GPU memory (VRAM).

Hardware considerations
There are a number of hardware-dependent considerations that users of PeriPy should note. These are: Memory bandwidth: Memory transfers to the CPU are expensive, and thus the whole application runs on the GPU until completion unless intermediate writes are performed.
Local memory size: For the vast majority of 3D problems in the literature, the local_size = N is usually 64, 128 or 256, depending on m, and, as mentioned above, is a power of two so that a binary parallel reduction can be performed. This works well since NVIDIA cards are multiprocessor units that create, manage, schedule and execute threads in groups of 32 parallel threads called warps [55], so any local_size that is a multiple of 32 should result in good thread utilisation. These are also the recommended local_size values for Intel CPUs [56], in the event that the code is to be run on a CPU. The amount of local memory used by a work-group can be the limiting factor on the number of work-groups that can be executed in parallel [46], as each compute unit offers a limited amount of shared local memory and registers. As a result, the hardware limit on the value of N is CL_KERNEL_WORK_GROUP_SIZE for the kernel that calculates bond forces, which is hardware dependent. On modern cards, this number is well above the value needed for peridynamics simulations, so it is not a problem.
Global memory size: The upper limit on the problem size possible for peridynamics simulations on GPUs is governed by the size of the global memory. The total number of bytes of the arrays allocated or stored on the GPU must be less than the size of the global memory. The global memory is used to store the state variables, such as the displacement u[i] = u_i and nlist arrays. If a simulation uses stiffness correction factors, they are stored in an n × N double C array in the global memory. The constants, such as node coordinates coords[i] = x_i and volumes vols[i] = dV_i, are stored in a region of global memory called the constant memory, which remains constant during the execution of a kernel. The host allocates and initialises memory objects placed into constant memory, and work-items have read-only access to these objects. These values can be used to approximate the maximum problem size n for a given amount of global memory. The upper limits on n for the NVIDIA 2080 Ti GPU with 11 GB of VRAM are shown in Table 1. These values define the upper bound for the problem size in the benchmarking results in Section 4. The approximate calculations used to determine the maximum problem size in PeriPy are shown in Table B.5 in the Appendix.
Precision: Single precision performance (in FLOPS) on GPUs is better than double precision performance [57]. Our tests showed that using single precision with PeriPy is between 6.7 and 8.6 times faster than using double precision on our GPU. Now that the memory model has been explained, the optimisations of the current work are demonstrated.
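The global-memory bound on n described above can be sketched as a rough estimate (illustrative only, not PeriPy's exact accounting in Table B.5; the choice of arrays and byte counts below are our assumptions, with single-precision state variables and int32 neighbour lists):

```python
def approx_max_nodes(global_memory_bytes, N, use_stiffness_corrections=True,
                     bytes_per_float=4, bytes_per_int=4):
    """Roughly estimate the largest problem size n that fits in global
    memory. Per node we assume: one row of the n-by-N int neighbour
    list, six n-by-3 float arrays (displacement, velocity, acceleration,
    force, body force, coordinates), one float nodal volume, and
    optionally one row of an n-by-N double stiffness-correction array.
    """
    per_node = N * bytes_per_int            # nlist row
    per_node += 6 * 3 * bytes_per_float     # six n-by-3 state/constant arrays
    per_node += bytes_per_float             # nodal volume
    if use_stiffness_corrections:
        per_node += N * 8                   # n-by-N double array
    return global_memory_bytes // per_node

# e.g. an 11 GB card with N = 128 still fits several million nodes
# even with the large stiffness-correction array:
n_max = approx_max_nodes(11 * 1024**3, N=128)
assert n_max > 5_000_000
```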

Optimisation 1: Kernel fusing
Each kernel launch has some overhead, which includes the allocation of memory for the first launch, copying the values of the input parameters to constant memory, indexing any local or global arrays used within the kernel and calculating intermediate variables, such as the bond stretch s. By fusing the kernels for calculation and summation of bond forces, the extra overhead is eliminated and the total run time reduced. It is also possible to fuse the kernel that checks whether a bond has exceeded the critical stretch into the bond force calculations kernel, which avoids the need to calculate the bond stretch s twice as well as the need to store the bond forces in an n × N C array in global memory, which would be accessed by the separate check-bonds kernel. Since there is a work-item for each bond, there are no race conditions that result from fusing these kernels.
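The saving from fusing can be illustrated with a NumPy sketch (illustrative only: the real implementation is an OpenCL kernel, and the linear force law below is a placeholder). The stretch is computed once, and breaking bonds and zeroing their forces happen in the same pass, so no bond-force array needs to be stored for a separate check-bonds kernel:

```python
import numpy as np

def fused_bond_kernel(stretch, stiffness, critical_stretch):
    """One fused pass over all bonds: break bonds that exceed the
    critical stretch and compute the (placeholder, linear) bond force,
    without materialising an intermediate array for a second kernel."""
    broken = stretch > critical_stretch
    force = np.where(broken, 0.0, stiffness * stretch)
    return force, broken

s = np.array([0.001, 0.02, 0.005])
force, broken = fused_bond_kernel(s, stiffness=1e9, critical_stretch=0.01)
assert broken.tolist() == [False, True, False]
assert force[1] == 0.0  # broken bond carries no force
```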
The check for broken bonds is applied directly after a displacement update and before the calculation of bond forces. It is important to ensure that all of the bond forces are updated for a given time-step before the time integration is performed to avoid race conditions. Therefore, it is not possible to fuse the time integration kernel with the bond force kernel. Splitting the time integration and bond force kernels guarantees that bond forces are calculated with nodal displacements from the same time step, and vice versa. The time integration kernel for velocity-Verlet is shown in Algorithm 4 in the Appendix.
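The split between the time integration and bond force kernels can be sketched for velocity-Verlet in its kick-drift-kick form (a common formulation, assumed here; whether Algorithm 4 uses exactly this ordering is not stated above). The bond-force evaluation sits between the two half-steps, so forces are always computed from displacements of the same time step:

```python
import numpy as np

def half_step_kick_drift(u, v, a, dt):
    """First kernel: half-kick the velocity, then drift the
    displacement (runs before the bond-force kernel)."""
    v_half = v + 0.5 * dt * a
    u_new = u + dt * v_half
    return u_new, v_half

def half_step_kick(v_half, a_new, dt):
    """Second kernel: finish the velocity update with the acceleration
    computed from the *new* displacements."""
    return v_half + 0.5 * dt * a_new

# Sanity check on a unit harmonic oscillator (a = -u): velocity-Verlet
# should conserve the energy 0.5*v^2 + 0.5*u^2 to O(dt^2).
u, v = np.array([1.0]), np.array([0.0])
a = -u
dt = 1e-3
for _ in range(1000):
    u, v_half = half_step_kick_drift(u, v, a, dt)
    a = -u                      # stands in for the bond-force kernel
    v = half_step_kick(v_half, a, dt)
energy = 0.5 * v[0]**2 + 0.5 * u[0]**2
assert abs(energy - 0.5) < 1e-6
```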

Optimisation 2: Parallel reduction of bond forces onto nodal forces
The implementation of binary parallel reduction was explained in Section 3.1. A reduction is needed to sum the bond forces into the nodal force. The binary parallel reduction facilitates the parallelisation over bonds as opposed to nodes, which leads to the most significant optimisation of the current work: minimising thread divergence.

Optimisation 3: Thread divergence
The most significant performance consideration relates to the execution of threads, or work-items [58]. Recall that work-items are organised into work-groups. Work-groups can be executed by the GPU in any order, but work-groups are not necessarily execution units themselves. Instead, work-groups are partitioned into warps for execution. In the current generation of NVIDIA devices, each warp consists of 32 threads. If a work-group consists of more than 32 work-items, the work-group is partitioned into multiple warps based on the work-item ID (e.g. local_id). All work-items in a warp should follow the same instruction path. Otherwise, work-items execute different instructions, which is called thread divergence and reduces performance significantly. To avoid thread divergence, work-groups should be organised so that warps follow the same control paths. In the case of Algorithm 3 in the Appendix, thread divergence occurs when, as is most often the case, the number of non-broken bonds differs between nodes. The warp will not complete until all work-items in that warp have completed the for loop starting at line 5, and many threads in the warp are under-utilised during that time. However, in PeriPy (Algorithm 2 in the Appendix), the parallelisation is over bonds, and so the time that threads are under-utilised is limited to the time it takes to calculate a single bond rather than a number of bonds.
The problem of thread divergence has not been eliminated in Algorithm 2. For example, if some work-items in a warp execute the if branch of the if-then-else construct on lines 4-16 (calculating the bond force) while others follow the else path (the bond is broken), the GPU can no longer execute the work-items concurrently, and multiple passes are required. It is not possible both to prevent thread divergence and to fuse the kernel that checks if a bond has exceeded the critical stretch into the kernel that calculates the bond force, as both the if and the else paths of the control loop will be executed by many work-items. Clearly, there is a trade-off in the performance gain between kernel fusing and preventing thread divergence. Preliminary tests showed that kernel fusing at the cost of thread divergence in this manner is the faster approach across all problem sizes, by two orders of magnitude for large problem sizes n > 10^6. This is because it avoids the memory constraints of storing bond forces in an n × N C array in global memory at every time step, the bandwidth constraints of accessing this array in a separate check-bonds kernel, and the need to calculate the bond stretch s twice.

Figure 6: The class diagram for PeriPy, which is designed to be easily extended with new integrators, correction algorithms or discretisations. The functionality of the code is defined in two classes: Integrator, the Abstract Base Class (ABC) that is extended by a concrete integrator class such as EulerCL, and Model, which defines the geometry (mesh and node volumes), connectivity (neighbour list and number of neighbours for each node), damage model, material types, correction algorithms and boundary conditions.
Increasing n by a factor of 2 will double the number of warps that need to be scheduled and executed, and will thus decrease performance by a factor of 2. The number of warps that need to be executed per node will be the dominant factor in per-node performance. Increasing N by a factor of 2 will double the number of warps that need to be executed and thus decrease performance by approximately a factor of 2. This relationship is approximate since not all warps take equal time to execute; for example, warps that contain only broken or 'padding' bonds take much less compute time than warps with non-broken bonds.
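This warp-counting argument can be made explicit with a back-of-envelope sketch (32 threads per warp is assumed, as on current NVIDIA hardware; warps_to_execute is an illustrative name):

```python
def warps_to_execute(n, N, warp_size=32):
    """Total warps scheduled per time step when one work-item handles
    one bond: n work-groups of N work-items, warp_size threads each."""
    return n * N // warp_size

base = warps_to_execute(100_000, 128)
assert warps_to_execute(200_000, 128) == 2 * base  # doubling n doubles warps
assert warps_to_execute(100_000, 256) == 2 * base  # doubling N doubles warps
```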

Overview of Code Design
The class diagram for PeriPy is presented in Fig. 6 and shows the peripy.model.Model class, which allows users to define a bond-based peridynamics model for composite materials with non-linear damage models defined by an arbitrary number of linear splines, such as the trilinear damage model illustrated in Fig. 12. Surface correction factors, partial volume correction factors, different micromodulus functions and boundary conditions can be readily included by supplying the relevant keyword arguments to the model class. The model is defined by a damage model, the set of keyword arguments described above, a set of initial conditions (coordinates, connectivity and, optionally, node state vectors), force and/or displacement controlled boundary conditions and one or more sets of nodes at which the user has chosen to measure a datum. The design of the code has facilitated the following features. Readily extended integrators: When instantiating the peripy.model.Model class, an explicit integrator that inherits from the peripy.integrators.Integrator Abstract Base Class is required. PeriPy has not yet been extended to implicit integration methods, and contributions are welcome. Examples of the explicit integrators included are: peripy.integrators.Euler, peripy.integrators.EulerCL, peripy.integrators.EulerCromerCL and peripy.integrators.VelocityVerletCL. The suffix CL denotes an integrator that utilises OpenCL kernels. As mentioned earlier, the CL integrators can be utilised on any CPU or GPU platform due to the heterogeneous nature of OpenCL. Regular and irregular meshes: PeriPy uses the python package meshio [59] to read input meshes of various formats, where the vertices of the meshes are the peridynamic nodes. PeriPy can take cuboidal tensor grid meshes as input, which are often employed so that the nodal volumes dV_i can be easily and accurately calculated.
Regular or irregular tetrahedral grids may also be used, with the only complication being the need to define a volume associated with each node. This may be accomplished using computational geometry techniques, such as Delaunay triangulation or Voronoi diagrams [60], which are not included in the package. Instead, if a tetrahedral mesh is used and exact nodal volumes are not supplied through the keyword argument volume of the model class peripy.model.Model, PeriPy will approximate the nodal volumes by a weighted sum of the volumes of the tetrahedra of which the node is a vertex. Specifically, if a node is a vertex of a tetrahedron, it is assigned 1/4 of the volume of that tetrahedron. The volumes of the tetrahedra are calculated using the standard formula involving a dot product and cross product of the differences of the vertex position vectors. With this method, the total sum of the nodal volumes is equal to the total volume of the mesh, and as the volumes go to zero the absolute error is reduced. In the case of a regular cuboid grid that is split into tetrahedra, as the mesh elements are congruent, this method results in exact volumes in the bulk. However, on the boundaries there will always be a significant error in the volumes when compared to the Voronoi diagram method. This results in surface nodes that have different effective stiffnesses from the bulk and will cause surface effects extraneous to the 'peridynamics surface effect' explored by Le and Bobaru [43]. It is therefore recommended that, if an irregular mesh is used and the exact volumes are known, they should be supplied through the keyword argument volume. While both irregular and regular tetrahedral meshes are possible with PeriPy, for the purposes of this paper regular cuboidal tensor grid meshes were used, in keeping with the majority of the existing literature. Composite materials: Composite materials and structures can be defined and supplied to the peripy.model.Model class.
General boundary conditions: General boundary conditions can be supplied, and the magnitudes of the boundary conditions can be changed over time. The directions of the boundary conditions cannot be changed over time, other than along the directions in which they were initially defined. This means that, while most commonly used boundary conditions are possible, torsion, for example, is not currently possible; however, contributions are welcome. Reduced overhead: The neighbour list is generated using the scipy.spatial.KDTree class [61]. In PeriPy, the arrays that are constructed during the instantiation of the peripy.model.Model, such as the neighbour list, can be written to an HDF5 [62] file and read back by the peripy.model.Model through keyword arguments. This reduces the overhead needed to simulate an example. Running simulations: Simulations are conducted with Model.simulate. The peridynamics parameters of a damage model, node states, initial conditions and boundary conditions can be changed without reinstantiating the peripy.model.Model by setting them as keyword arguments to the peripy.Model.simulate method. In this way, multiple simulations with different parameters can be conducted with minimal overhead cost, so that 'outer loop' applications, including sensitivity analysis, uncertainty quantification and optimisation, may be performed.
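The nodal-volume approximation for tetrahedral meshes described above (each vertex receives 1/4 of each incident tetrahedron's volume, with tetrahedron volumes from the scalar triple product) can be sketched as follows; tet_volume and nodal_volumes are illustrative helpers, not PeriPy's API:

```python
import numpy as np

def tet_volume(a, b, c, d):
    """Volume of a tetrahedron from its vertex positions, using the
    standard dot/cross product (scalar triple product) formula."""
    return abs(np.dot(b - a, np.cross(c - a, d - a))) / 6.0

def nodal_volumes(vertices, tets):
    """Assign each node 1/4 of the volume of every tetrahedron of
    which it is a vertex; the nodal volumes then sum to the total
    mesh volume."""
    vols = np.zeros(len(vertices))
    for tet in tets:
        v = tet_volume(*(vertices[i] for i in tet))
        for i in tet:
            vols[i] += v / 4.0
    return vols

verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
vols = nodal_volumes(verts, [(0, 1, 2, 3)])
assert abs(vols.sum() - 1.0 / 6.0) < 1e-12   # total equals the mesh volume
```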

Initiating a simulation from an intermediate state:
A simulation can be restarted from an intermediate state by supplying a first_step argument and the state variables for that time step to the simulate method. This could be useful, for example, when changing parameters that do not affect the elastic response of the model but do affect crack nucleation and propagation, in which case the simulation states would be saved in the linear elastic region of a quasi-static response to an increasing load.

Output visualisation in ParaView:
The simulate method will write the states to a mesh file that is easily visualised (e.g. in ParaView [63]). Writes occur at a user-specified frequency, given as a number of time steps. The other outputs of the method are a tuple of the state vectors and a dictionary object containing the average displacement, velocity and acceleration, as well as the sum of the forces and body forces, for each of the writes (over time), for each unique tip_type, which is defined as a set of nodes the user has chosen to measure a datum for, as defined by the is_tip function. This could be, for example, the set of nodes at the tip or midspan of a structure.
For a comprehensive explanation and usage of all classes, methods and arguments, please see the latest full documentation at https://peripy.readthedocs.org.

Benchmarking
This section compares the performance of the solver against two existing solvers. The first is an OpenMP implementation by Hobbs [64]. All of its functions are written in C and use OpenMP, and the C functions are called directly by MATLAB. This requires compilation of the C code into a MEX function (MATLAB executable). The compilation used gcc-6.3.0 for Linux, and the MATLAB version used was R2018b. The computations were performed using a single Skylake node with 32 OpenMP threads on the Cambridge Service for Data Driven Discovery (CSD3) Peta4 platform, operated by the University of Cambridge Research Computing Service. The second solver is the OpenCL implementation by Mossaiby et al. [28]. The source code provided by Mossaiby et al. [28] was executed by us alongside the OpenMP and PeriPy codes. Both OpenCL solver computations were run on an NVIDIA 2080 Ti with 11 GB GDDR6 RAM and 4352 CUDA cores, and pyopencl==2020.1 was used.
The results show a 1.4-2.0x performance gain over the OpenCL solver, with up to a 10.0x performance increase for problems with a small number of nodes n. The results also show a 3.7-7.3x performance gain over the OpenMP solver. Linear compute scaling across problem size n (number of nodes) and family size N is demonstrated. There is a negligible difference in the performance of PeriPy as a result of using stiffness correction factors. The performance difference of using an irregular mesh rather than a regular mesh is shown and an explanation provided.

Comparing across problem size, and horizon distance
The performance of PeriPy across problem size, with and without stiffness correction factors, is shown in Fig. 4. The performance depends on the number of warps that need to be executed on the device. The performance is linear with the number of nodes n and approximately linear with the family size N, for the reasons discussed in Section 3.5.
The performance with irregular meshes is shown in Fig. 7. For the same N, performance for irregular meshes is faster than for regular meshes. This is because, in irregular meshes, the range of the family size |H_i| across the node index i is larger than in a regular mesh, so the group_size, N = min{2^k : 2^k ≥ max_i |H_i|}, is typically 2 times larger than for a regular mesh with equivalent average spacing ∆x. As a result, there are whole warps in the irregular mesh case for which no bond force calculations are required, and these execute much faster than warps for which a bond force calculation is required. Consequently, the performance for an N = 64 irregular mesh closely matches that of N = 32 with a regular mesh; the performance for an N = 128 irregular mesh closely matches that of N = 64 with a regular mesh, and so on, as shown in Table 2. The effect on performance of using stiffness correction factors (i.e. through a micromodulus function, partial volume correction or surface corrections) is negligible, except at small problem sizes. At smaller problem sizes, the overhead due to memory transfer between the GPU and controlling CPU dominates, and copying the stiffness correction factor C array buffer onto the GPU impacts performance. At larger problem sizes, the time required for actual computation comprises most of the total wallclock time, and the only change in the computation is the indexing of a constant stiffness correction factor array, which does not significantly impact performance.

Comparing against existing solvers
The performance of PeriPy compared to the OpenCL solver by Mossaiby et al. [28] is shown for the group_size N = 128 in Fig. 8a, and, with the added comparison against the OpenMP solver by Hobbs, for the group_size N = 256 in Fig. 8b. The speedups identified in Section 3 result in PeriPy being 1.4-2.0 times faster than the OpenCL solver by Mossaiby et al. [28] and 3.7-7.3 times faster than the OpenMP solver by Hobbs. As explained in Section 3.5, the speed increases are a result of (1) kernel fusing of the calc_bond_force and check_bonds kernels and (2) parallelisation over bonds (as opposed to nodes), which facilitates the most significant performance enhancement, which is (3) minimisation of thread divergence. The relative effect of the speedups on performance and a detailed explanation are provided below. Fig. 9 shows the effect of binary parallel reduction (PeriPy BPR) on performance when compared to a serial reduction (PeriPy SR). The solver of Mossaiby et al. [28] does not include any of the optimisations, while PeriPy SR has optimisation (1) but not (2) or (3). The speedup due to (1) is significant and is shown by the difference between PeriPy SR and Mossaiby et al.
The logical implication of the results in Fig. 9a is that the summation algorithm, parallel or serial, does not actually yield much performance gain in and of itself. Rather, the binary reduction facilitates the parallelisation over bonds without needing to store a huge bond-force array in GPU global memory at every time step in order to perform a summation of bond forces. The majority of the time savings for regular meshes is a result of kernel fusing.
However, it is for the irregular mesh, as shown in Fig. 9b, that the speedups due to (2) and (3), which are implemented in PeriPy BPR, are fully realised. These speedups are primarily due to reducing thread divergence to a practical minimum by parallelising over bonds rather than nodes. Consider, as is the case in PeriPy SR, parallelising over nodes. Each work-item, corresponding to a node, has to calculate the force of every non-broken bond of that node. It is often the case for irregular meshes that the number of non-broken bonds varies significantly between nodes. Therefore, parallelising over nodes results in significant thread divergence, which depends on the spread of the number of non-broken bonds across the nodes. In PeriPy BPR, we reduce the thread divergence as much as possible by parallelising over bonds. The only thread divergence left is whether or not the bond force for a bond is calculated, and hence we achieve the minimum possible thread divergence for the problem, which results in the speedup. This explains why the performance of PeriPy for irregular meshes is much better than that of Mossaiby et al. [28] and the PeriPy SR implementation.
It can also be seen from Fig. 9 that, for small problems, where the overhead due to memory transfer between the GPU and controlling CPU dominates, PeriPy BPR is up to 10 times faster than Mossaiby et al. and PeriPy SR. The difference between PeriPy SR and PeriPy BPR is that the parallelisation is over bonds rather than nodes (2), and so forces are calculated in local rather than global memory, reducing communication time, and the thread divergence is also reduced (3).

Validation of PeriPy through engineering test cases
In this section we demonstrate the use of PeriPy with explicit schemes for a 3D dynamic test. In testing, PeriPy was also validated against other solvers for simple test cases.

Example 1: Kalthoff-Winkler experiment
The Kalthoff-Winkler experiment [65,66] is a classical benchmark problem for dynamic fracture [67,68,69]. The geometry and boundary conditions used for the simulation are illustrated in Fig. 10. The thickness of the specimen is 9 mm. For a plate made of 18Ni1900 steel subject to impact loading at a speed of v_0 = 32 m s⁻¹, brittle fracture was observed, with cracks forming at 68° angles to the notches. We repeat the simulation first conducted by Silling [70] and repeated by Ren et al. [10], using the same material parameters as in the latter and comparing our results for validation purposes. The elastic modulus is E = 190 GPa, the density is ρ = 7800 kg m⁻³, Poisson's ratio is ν = 0.25 and the energy release rate is G_0 = 6.9 × 10⁴ J m⁻². Repeating the experiment of [10], impact loading was imposed by applying an initial velocity of v_0 = 22 m s⁻¹ to the first three layers of nodes in the domain, as shown in Fig. 10; all other boundaries are free. The plate is discretised with a node spacing ∆x = 1.5625 mm. Along the thickness of the plate, four layers of nodes are employed to match the simulation in [10]. The total number of nodes is n = 32,768.
We also repeat the simulation conducted by Mossaiby et al. [28] in the validation of their solver, using the same material parameters as in the study by Ren et al. [10], but, following Mossaiby et al. [28], the plate is discretised with a node spacing of ∆x = 1.0/2560 m, with 13 layers of nodes, giving a total number of nodes n = 1,713,933.
It should be noted here that, in order to prevent crack nucleation at the centre of the long edge away from the pre-cracks, a 'no failure zone' must be imposed on that edge, as can be seen in the simulation of [10]. If no 'no failure zone' is used, then a micro-elastic plastic damage model must be used, as mentioned by Silling [70]. The 'no failure zone' is simple to implement in PeriPy and results in a fracture pattern that matches the simulations by Ren et al. [10] and Mossaiby et al. [28], the latter of which is shown in Fig. 11 (right).
The crack propagation speed is given by v_c(t) = |x_tip(t + ∆t) − x_tip(t)|/∆t, where x_tip(t) is the position of the crack tip at time t. The Rayleigh wave speed c_R [71] is given by c_R = c_s(0.862 + 1.14ν)/(1 + ν), where c_s = √(µ/ρ) is the shear wave speed and µ is the shear modulus. The crack tip is measured using the corner detection algorithm developed by Harris [72] with the OpenCV python package [73]. Corner detectors can be unreliable, and higher-level visual routines must be designed to tolerate a significant number of outliers in the output of the corner detector. Outliers were filtered out if the resulting velocity was above the Rayleigh speed, and the results were averaged over slightly different values of the free parameter κ, which describes how 'edge-phobic' the corner detection algorithm is. Fig. 11 shows the evolution of the crack speed using this method. The speed was calculated every 5 × 10⁻⁶ s. For verification purposes, our results are compared with those of the peridynamic model in [10].
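As a sanity check, the wave speeds can be computed directly from the material parameters above (we assume the common Viktorov approximation for the Rayleigh speed; whether [71] uses exactly this form is our assumption):

```python
from math import sqrt

# 18Ni1900 steel parameters from the text above
E, rho, nu = 190e9, 7800.0, 0.25
mu = E / (2.0 * (1.0 + nu))                     # shear modulus
c_s = sqrt(mu / rho)                            # shear wave speed
c_R = c_s * (0.862 + 1.14 * nu) / (1.0 + nu)    # Viktorov approximation

assert 3100 < c_s < 3150    # roughly 3.1 km/s
assert 2800 < c_R < 2900    # roughly 2.9 km/s
```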
The crack starts to propagate at 24 × 10⁻⁶ s, the same time as in the simulation of Ren et al. [10]. The highest crack speed in the simulation measured by the corner detection algorithm was 1830 m s⁻¹, 65.4% of the Rayleigh speed, which is larger than Ren's value of 1500 m s⁻¹. Most other velocities are in good agreement. The crack angle at the kink was measured to be 65.3°, closely matching Ren's reported value of 65.7°.

Example 2: Outer-loop calibration of peridynamics model parameters using experimental force-deflection data
This section presents an optimisation problem to calibrate a bond-based trilinear damage model for a concrete beam simulation. Concrete is a quasi-brittle material and exhibits strain-softening behaviour in tension. Yang et al. [74] noted that the tension-softening part of the bilinear model was not based on the true softening behaviour of concrete and thus established a trilinear model within the bond-based peridynamic formulation, illustrated in Fig. 12. The shape of the softening branch in the trilinear model is consistent with the tension-softening behaviour of concrete. A generalised method for determining the parameters s_0, s_1 and s_c was proposed that requires only the elastic modulus, tensile strength and fracture energy. The position of the kink point in the softening curve is determined by the initial fracture energy, as proposed by Bažant [75]. However, the critical stretch of a peridynamic bond depends on the fracture energy G_F, and the selection of G_F is a major source of uncertainty in peridynamic models. Furthermore, concrete material properties are typically determined through testing of the concrete compressive strength f_c. Additional material parameters, such as the elastic modulus E, tensile strength f_t and fracture energy G_F, are rarely tested and must be determined from empirical formulas. At present, there is no standard theory relating concrete compressive strength to fracture energy. It is therefore likely that an 'outer-loop' data-driven method is required to calibrate the trilinear model against experimental data. PeriPy is a useful tool in such outer-loop applications, providing sufficiently fast model evaluations to enable optimisation routines. Following Yang et al. [74], we set the kink point to β = 1/4, and compare the values of the trilinear model of Yang et al. [74] to optimised parameters that are calibrated against experimental data. A schematic diagram of the experimental setup used [76] is illustrated in Fig. 13.
The peridynamic model was discretised using a ∆x = 5 mm mesh, giving a total of n = 3,645 peridynamic nodes. A fixed peridynamic horizon of δ = 15.7 mm was used, and no stiffness correction methods were applied in this simulation (see Example 3). A displacement-controlled loading scheme was used for the numerical simulations. The left-hand support was clamped in all three directions. Following the experimental setup closely, the beam was simply supported: the right-hand support was constrained not to move in the vertical and out-of-plane directions, and the central support was constrained not to move in the out-of-plane direction and had a prescribed displacement in the vertical direction. As noted by Mehrmashhadi et al. [36], applying imposed displacements to a single node results in a significant peridynamic surface effect. Boundary displacements were therefore applied over a two-by-two node cross-section outside the surface of the beam (except for the central point load, where the boundary was applied over a three-by-two node cross-section, for symmetry), as shown in Fig. 13 (right). The boundary nodes were set as 'no-failure zones' so that they cannot exceed the critical stretch, which would be nonphysical, and their bond stiffness was taken to be the same as the bulk material. The displacement boundary conditions were applied smoothly using a fifth-order polynomial displacement-time curve, such that the boundary nodes had an acceleration of zero at the first time step and during crack nucleation and growth. Dynamic damping was used to dampen oscillations in the beam. These steps increase the stability of the simulation and ensure that crack growth occurs under quasi-static loading conditions. The exact dynamic parameters used for the simulations and the velocity of the applied displacement in the vertical direction are shown in Table 3.
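The trilinear constitutive law described above can be sketched as a bond force-stretch function. This is an illustrative sketch, not the exact parameterisation of Yang et al. [74]: the function name and parameter values are ours, and β is here taken as the ratio of the bond force at the kink point s_1 to the peak elastic force c·s_0.

```python
def trilinear_force(s, c, s0, s1, sc, beta=0.25):
    """Bond force density for a trilinear softening model (sketch).

    Elastic up to s0, softens linearly to beta*c*s0 at the kink s1,
    then softens linearly to zero at the critical stretch sc.
    """
    if s <= s0:
        return c * s                       # linear elastic branch
    if s <= s1:
        # first softening branch: c*s0 at s0 down to beta*c*s0 at s1
        return c * s0 * (1.0 - (1.0 - beta) * (s - s0) / (s1 - s0))
    if s <= sc:
        # tail softening branch: beta*c*s0 at s1 down to zero at sc
        return beta * c * s0 * (sc - s) / (sc - s1)
    return 0.0                             # bond is broken beyond sc
```

The function is continuous at s_0, s_1 and s_c, which is what makes the softening branch shape consistent with the tension-softening behaviour of concrete.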
As per the original experimental paper [76], the CMOD (crack mouth opening displacement) was measured as the relative displacement in the x direction of two points, 25 mm on either side of the midpoint of the beam. To apply the notch, all bonds at least partially within the volume of the notch were broken as an initial condition. The force P was measured as the vertical resultant force in the bonds of the displacement-controlled particles. In this simple example, calibration of the material parameters was performed on the unnotched experiment using a simple binary search. To find a point estimate of the bond stiffness c, the gradient of the elastic part of the experimental force-CMOD curve was matched to the gradient of the simulated force-displacement curve to two significant figures. To calibrate the trilinear parameters s_0, s_1 and s_c, a loss function in the form of a mean squared error between the experimental data and the numerical solution was minimised until the parameters converged to two significant figures through successive iterations of binary searches for each parameter.
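One step of the calibration loop described above can be sketched as follows. The function name and tolerance are hypothetical; the sketch assumes the matched quantity (e.g. the elastic gradient of the force-CMOD curve) increases monotonically with the parameter over the bracket [lo, hi].

```python
def binary_search_calibrate(evaluate, target, lo, hi, rel_tol=5e-3):
    """Bracketing binary search for the parameter p at which
    evaluate(p) matches target, assuming evaluate is monotonically
    increasing. rel_tol ~ 5e-3 converges p to roughly two
    significant figures.
    """
    while (hi - lo) > rel_tol * abs(hi):
        mid = 0.5 * (lo + hi)
        if evaluate(mid) < target:
            lo = mid   # target lies in the upper half of the bracket
        else:
            hi = mid   # target lies in the lower half of the bracket
    return 0.5 * (lo + hi)
```

In the calibration above, evaluate(p) would run a full PeriPy simulation with parameter value p and return the quantity compared against experiment; successive 1-D searches over c, s_0, s_1 and s_c are then repeated until all parameters converge.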
The results for two possible calibrations, 'Point estimate UN 1' and 'Point estimate UN 2' (corresponding to two different minimum values of the loss function), for the unnotched (UN) case are shown in Fig. 14 (left). The calibrated model parameters were then tested on the fifth-notched (FN) beam geometry. The trilinear model parameters determined by calibration of the numerical simulation ('Point estimate UN 1' and 'Point estimate UN 2') are compared to the values determined by the trilinear model proposed by Yang et al. [74] in Table 4. The mean material properties of the experimental beam were used to calculate the trilinear model [74] parameters: compressive strength f_cm,cyl = 42.3 MPa; modulus of elasticity E = 37.0 GPa; splitting tensile strength f_t = 3.9 MPa. The density of the concrete mixture is ρ = 2346 kg m⁻³. The material fracture energy is determined empirically using the fib Model Code 2010 relation G_F = 73 f_cm^0.18, with f_cm in MPa and G_F in N m⁻¹.
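The fib Model Code 2010 relation can be evaluated directly (the function name is ours; the formula takes the mean compressive strength in MPa and returns G_F in N m⁻¹):

```python
def fib_fracture_energy(f_cm):
    """fib Model Code 2010 estimate of the material fracture energy
    G_F [N/m] from the mean compressive strength f_cm [MPa]."""
    return 73.0 * f_cm ** 0.18
```

For a mean strength of about 28 MPa this gives G_F ≈ 133 N m⁻¹, consistent with the value used for the reinforced beam in Example 3.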

Example 3: Predicting the failure load of a reinforced concrete beam
This section presents a large-scale industry-motivated problem: estimating the failure load of a reinforced concrete beam. Numerical and experimental results are compared for a reinforced concrete beam failing in shear, 'Beam 5' from the series of reinforced concrete beam tests by Leonhardt and Walther [77], commonly referred to as the Stuttgart Shear Tests. The beam is subjected to four-point loading, and shear reinforcement was not present between the applied loads so that the development of diagonal shear cracks could be studied. A schematic diagram of the experimental setup is illustrated in Fig. 15, overlaid with the fracture pattern observed towards the end of the test. The span of the member is L = 1950 mm, and the shear span is a_v = 810 mm. The effective depth d, the distance from the top compression fibre to the centre of the tensile reinforcement, is 270 mm. The cross-section is 320 mm × 190 mm, and the reinforcement ratio, the ratio between the area of steel A_s and the total cross-sectional area, is approximately 2.05%. Longitudinal reinforcement is provided in the form of two ribbed steel bars of 26 mm diameter. The reinforcing bars were manufactured from high-yield steel, for which we assume the following material properties: modulus of elasticity E_s = 208 000 MPa and yield strength f_y = 465 MPa. Since the beam fails in shear, the steel was assumed not to yield, and hence it could be modelled with a linear damage model. Input parameters for the concrete constitutive model were determined using empirical formulas from fib [78], which relate the experimentally measured cube compressive strength to common concrete properties. The following concrete properties were used: cube compressive strength f_cm,cube = 35 MPa, material fracture energy G_F = 133 N m⁻¹ and modulus of elasticity E_c = 30 500 MPa.
The peridynamic model was finely discretised using a ∆x = 5 mm mesh with δ = 15.7 mm. The total number of peridynamic nodes was n = 957,002. A displacement-controlled loading scheme was used for the numerical simulations. Following the experimental setup closely, the beam was supported such that the right-hand support was constrained not to move in the vertical and out-of-plane directions. The left-hand support was clamped. The central supports were constrained not to move in the out-of-plane direction and had a prescribed displacement in the vertical direction. For the reasons outlined in Section 6, the boundary displacements were applied over two-by-two node cross-sections outside the surface of the beam at the positions of the arrows and supports in Fig. 15. These nodes were set as 'no-failure zones' so that they cannot exceed the critical stretch, which would be nonphysical, and their bond stiffness is the same as the bulk material. The displacement boundary conditions were applied smoothly using a fifth-order polynomial displacement-time curve such that the boundary nodes had an acceleration of zero at the first time step and during crack nucleation and growth. This ensures that crack growth occurs under quasi-static loading conditions. The exact dynamic parameters used for the simulations and the maximum rate of the applied displacement in the vertical direction are shown in Table 3.
Figure 16: Comparison of load vs. midspan deflection for Beam 5 between experimental data and the bilinear peridynamic model results. The experimental data is from Leonhardt and Walther [77]. The deflection visualisation is amplified 50 times.
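The fifth-order polynomial displacement-time curve can be realised with the standard quintic smooth-step, which has zero velocity and zero acceleration at both ends of the ramp. This is a minimal sketch; the exact polynomial coefficients used in the simulations may differ.

```python
def quintic_ramp(t, t_total, u_max):
    """Quintic smooth-step displacement ramp:
    u(t) = u_max * (10*tau^3 - 15*tau^4 + 6*tau^5), tau = t / t_total.
    Both du/dt and d2u/dt2 vanish at t = 0 and t = t_total, so the
    boundary nodes start and finish the ramp with zero acceleration.
    """
    tau = min(max(t / t_total, 0.0), 1.0)  # normalised time, clamped to [0, 1]
    return u_max * tau**3 * (10.0 - 15.0 * tau + 6.0 * tau**2)
```

Because the displacement grows like tau³ near t = 0, the applied load is introduced gradually, which helps keep crack growth quasi-static.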
In this simulation, a bilinear damage model originally proposed by Gerstle et al. [79] was used. Zaccariotto et al. [80] determined the parameters of the bilinear damage model from the material fracture energy G_F and experimental load-displacement curves. The energy required to start the fracture process is defined as G_0, and k_r = s_c/s_0 = G_F/G_0. The stretch at the linear elastic limit, s_0, is determined from G_0, and k_r is determined from experimental load-displacement curves or through sensitivity studies. Stiffness correction factors were applied in this numerical simulation. The force P/2 was measured as the vertical resultant force in the bonds of the displacement-controlled particles. The midspan deflection was measured as the average deflection of the nodes at the top of the beam at the midspan (975 mm from the left-hand support). A comparison between the force-deflection curves generated from the model and the experiment is shown in Fig. 16, alongside inset plots showing the evolution of the fracture pattern in the simulations. Without specific calibration of the damage model parameterisation for this test, the numerical and experimental results show good qualitative agreement. The numerical solution underestimates the failure load and exhibits a more progressive damage response up to failure. The peridynamic model reproduces the expected asymmetric failure mode; note that in both cases, as the diagonal crack propagates into the compression zone, the characteristic rotation behaviour is seen. The final full crack is shown in a ParaView visualisation in Fig. 17. Further improvements to the modelling of this specific experiment could be achieved by refining the model representation, the application of loading and boundary conditions, and exploring different material models. Here, however, the example demonstrates the scale of peridynamic problems which can be readily solved using PeriPy.
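The bilinear law and the role of k_r can be sketched analogously to the trilinear case. This is an illustrative function, not the exact Gerstle et al. [79] parameterisation: the bond behaves elastically up to s_0 and then softens linearly to zero at the critical stretch s_c = k_r·s_0.

```python
def bilinear_force(s, c, s0, kr):
    """Bilinear softening bond force (sketch): linear elastic up to s0,
    then linear decay to zero at the critical stretch sc = kr * s0."""
    sc = kr * s0
    if s <= s0:
        return c * s                          # elastic branch
    if s <= sc:
        return c * s0 * (sc - s) / (sc - s0)  # softening branch
    return 0.0                                # bond is broken
```

Increasing k_r stretches out the softening branch, i.e. it raises the ratio G_F/G_0 of total to initiation fracture energy.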

Future developments
This section outlines some possible extensions to the code. The user is encouraged to file bug reports or feature requests and make contributions by opening a 'new issue' on GitHub (https://github.com/alan-turing-institute/PeriPy/issues).

Stiffness correction algorithms
The code has been developed to be easily extendable. Any partial volume algorithm, stiffness correction algorithm or micromodulus function can be applied with only minor changes to the code.

State-based Peridynamics
While PeriPy is currently a bond-based peridynamics solver, it may be extended using the same ideas to ordinary state-based (OSB) peridynamics. In OSB peridynamics, the bond force f_ij of a bond ξ_ij, j ∈ H_i, is not in general independent of another bond ξ_ik, k ∈ H_i, within the horizon. However, the individual components needed to calculate f_ij are calculated in two nested for-loops [81]: one to calculate a 'nodal dilation' using ξ_ij, j ∈ H_i, and η_ij, and another to calculate the nodal force, also using ξ_ij, j ∈ H_i, and η_ij. These two nested for-loops can be parallelised in the same way as the presented implementation, as each bond's dilation contribution depends only on the state of the parent and child nodes. Since the bond dilation contributions cannot be factorised out of the node force calculation, two kernels are required to calculate the node force, corresponding to the two aforementioned for-loops. Since this kernel contributes 93-98% of the overall compute time, we can expect an OSB implementation to be about two times slower than the presented bond-based implementation.
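The two-pass structure described above can be illustrated with a serial reference sketch. The callables dilation_weight and force_fn are hypothetical stand-ins for the OSB constitutive terms; on the GPU, each pass would become a separate kernel.

```python
def osb_two_pass(n_nodes, neighbours, dilation_weight, force_fn):
    """Schematic of the two-kernel OSB structure (serial reference).

    Pass 1: accumulate each node's dilation from all bonds in its horizon.
    Pass 2: accumulate bond forces, which require the dilation of BOTH the
    parent node i and the child node j -- hence the two separate passes.
    """
    # Pass 1 (kernel 1): nodal dilations
    theta = [0.0] * n_nodes
    for i in range(n_nodes):
        for j in neighbours[i]:
            theta[i] += dilation_weight(i, j)
    # Pass 2 (kernel 2): nodal forces using the completed dilation field
    force = [0.0] * n_nodes
    for i in range(n_nodes):
        for j in neighbours[i]:
            force[i] += force_fn(i, j, theta[i], theta[j])
    return theta, force
```

The inner loops over the horizon H_i are independent across bonds in each pass, which is why the same bond-parallel execution and reduction strategy as the bond-based implementation applies.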

Other
Further contributions to this code include extensions that have been discussed in the literature, such as a contact algorithm for applying the boundary conditions; multiscale methods; and coupled PD-FEM [82,24], which is a natural extension to PeriPy given the mesh compatibility. Finally, it is believed that probabilistic methods for quantifying uncertainty in peridynamic model parameters using real-life data will be necessary for the application of the peridynamic method to industry problems, and that PeriPy has opened up this avenue of research.

Conclusion
PeriPy, a lightweight, open-source and high-performance python package for solving peridynamics problems in solid mechanics on CPU/GPU platforms, is presented. The optimisation steps are discussed: namely, pairing a parallel computation of each bond force with a binary parallel reduction of the bond forces into the nodal force density, exploiting the execution and memory models of OpenCL respectively. In all cases, PeriPy outperforms the existing CPU and GPU implementations. In particular, the optimisations lead to a 1.4-2.0 times speed increase compared to the existing OpenCL solver and a 3.7-7.3 times speed increase compared to the existing OpenMP solver. The implementation is compared with the other OpenCL and OpenMP solvers on industry-motivated 3D benchmark problems. The results indicate a step improvement in the algorithmic speed of peridynamic solvers. This work enables the use of parameter estimation, uncertainty quantification and outer-loop simulations in peridynamics research.

Appendix A. Installation
Developers looking to contribute or make changes to the package should follow the installation instructions "Get started from the GitHub repository (for developers)" on the GitHub page https://github.com/alan-turing-institute/PeriPy.
For users looking to create their own examples, the code is available to install at https://pypi.org/project/peripy/.