Accelerating the force-coupling method for hydrodynamic interactions in periodic domains

The efficient simulation of fluid-structure interactions at zero Reynolds number requires the use of fast summation techniques in order to rapidly compute the long-ranged hydrodynamic interactions between the structures. One approach for periodic domains involves utilising a compact or exponentially decaying kernel function to spread the force on the structure to a regular grid where the resulting flow and interactions can be computed efficiently using an FFT-based solver. A limitation to this approach is that the grid spacing must be chosen to resolve the kernel and thus, these methods can become inefficient when the separation between the structures is large compared to the kernel width. In this paper, we address this issue for the force-coupling method (FCM) by introducing a modified kernel that can be resolved on a much coarser grid, and subsequently correcting the resulting interactions in a pairwise fashion. The modified kernel is constructed to ensure rapid convergence to the exact hydrodynamic interactions and a positive-splitting of the associated mobility matrix. We provide a detailed computational study of the methodology and establish the optimal choice of the modified kernel width, which we show plays a similar role to the splitting parameter in Ewald summation. Finally, we perform example simulations of rod sedimentation and active filament coordination to demonstrate the performance of fast FCM in application.


Introduction
Micron-scale fluid-structure interactions are present throughout a range of industrial processes involving colloidal particles and suspensions, as well as natural processes including those in cell-level biology. Notable examples from biology include the movement of fluid driven by flagella and cilia [1,2,3], the flexible and motile hair-like organelles protruding from cell surfaces, and the dynamical rearrangement of filament-motor protein complexes in the cell interior [4]. Many interesting examples also arise in engineering applications, including the rheological changes exhibited by flowing suspensions of particles and fibres [5,6,7,8], the self-assembly of structures in more advanced materials such as magnetorheological fluids [9], and the motion and interactions of exotic colloidal particles such as nanomotors and other types of phoretic particles [10]. A unifying feature of these systems is the presence of long-ranged hydrodynamic interactions that couple the motion of the structures that are present.
For microscopic objects moving through viscous fluids, viscous forces typically dominate over inertia, resulting in the fluid motion being governed by the linear and steady Stokes equations. This, along with the negligible effect of structure inertia, confers a linear relationship between the forces and velocities governing the motion of the structures. The matrix relating the velocities and forces is called the mobility matrix. While the relationship is linear, the mobility matrix is itself configuration dependent and dense due to the slow decay of the fluid velocity fields generated by the forced structures. In direct implementations of methods for particulate suspensions such as Brownian [11,12] and Stokesian dynamics [13], the application of the mobility matrix is performed by pairwise summation, an O(N^2) calculation where N is the number of degrees of freedom. This scaling for the computational cost is also present in direct implementations of other approaches such as the boundary element method [14], the rigid multiblob method [15], or the method of regularised Stokeslets [16,17], where multiple degrees of freedom are associated with each structure. Thus, for simulations involving many degrees of freedom, computational costs quickly grow prohibitive and approaches that circumvent pairwise summation need to be considered.
Reducing the computational cost often requires taking advantage of fast summation techniques, such as the fast multipole method [18,19], that provide the action of the mobility matrix without ever computing the mobility matrix directly. For periodic domains, such methods can be constructed around the fast Fourier transform (FFT). In this context, there are two related approaches. The first stems directly from classical Ewald summation [20], where the application of the inverse Stokes operator is split between sums in real and Fourier space. By introducing an appropriately chosen splitting function, rapid convergence of both sums can be ensured. The sum in real space can be interpreted as a local, pairwise correction to the changes to the mobility matrix introduced by the splitting function. In terms of computation, the sum in Fourier space can be performed for all degrees of freedom simultaneously, while the pairwise correction only needs to be performed for degrees of freedom in close proximity. This approach has been applied successfully for the evaluation of the interactions based on point-forces (Stokeslets) [21], as well as the Rotne-Prager-Yamakawa (RPY) tensor in the positively split Ewald (PSE) method [22], and has been incorporated into accelerated [23] and fast [24] Stokesian dynamics.
The second approach is to instead consider regular, localised force distributions that can be evaluated on a grid, where the Stokes equations can be solved using an FFT-based method. This is done, for example, with the immersed boundary method (IBM) [25,26] and the force-coupling method (FCM) [27,28,29], with a main difference between the two being the particular choice of function, or kernel, used to transfer the structure force to the grid. Additionally, with these methods the structure velocities are obtained by using the same kernels to interpolate, or volume average, the resulting fluid flow. The result is a positive-definite mobility matrix and a proper energy balance with the viscous dissipation in the surrounding fluid. As discussed in [22], a limitation of this approach is that the grid must be chosen to provide sufficient resolution of the kernel. As a result, computations can become expensive in cases where the separation of the structures is large compared to the kernel width. To emphasise this point, [22] showed that PSE utilises a maximum wavenumber approximately 3 times smaller than that needed for FCM.
In this paper, we develop a fast implementation of FCM which alleviates this limitation by eliminating the need for the grid to resolve the FCM kernel. We accomplish this by substituting the FCM kernel with a modified kernel of larger width, and correcting the resulting particle velocities to ensure errors are below a user-specified tolerance. This extends similar ideas developed in [12,30] for Stokeslet interactions, where regularised forces in real space, rather than an Ewald splitting function in Fourier space, are used to formulate the Ewald summation. Our modified kernel is carefully chosen to ensure that the resulting pairwise mobility converges exponentially to the standard FCM mobility as the particle separation distance increases. Additionally, the specific choice of modified kernel ensures positive splitting of the force-velocity mobility matrix, resulting in SPD matrices for both the modified mobility matrix and the local pairwise correction matrix. We also extend fast FCM to allow for torques and rotations. We present a GPU-based implementation of fast FCM and perform a number of tests to demonstrate how parameters can be tuned to optimise computational performance. In doing so, we find that in many cases fast FCM can be an order of magnitude faster than the standard FCM computation. We provide example simulations of rod sedimentation and active filament dynamics that demonstrate the effectiveness of fast FCM in application and as part of larger computations.

Force-coupling method: the force-velocity mobility matrix
We begin by reviewing FCM [27] for the motion of N particles of hydrodynamic radius a through a fluid with viscosity η whose flow is governed by the Stokes equations. Each particle n = 1, . . ., N is centred at Y_n and exerts the force F_n on the fluid. The fluid and particles occupy the domain Ω. The force on each particle n is transferred to the fluid using a Gaussian distribution, or kernel, ∆_n(x) = ∆(x − Y_n; σ), where ∆(x; σ) = (2πσ^2)^{−3/2} exp(−∥x∥^2/(2σ^2)). Accordingly, the Stokes system for the resulting fluid flow u(x) and pressure p(x) is, for x ∈ Ω, ∇p − η∇^2 u = J^†[F], ∇ · u = 0, where F is the 3N × 1 vector containing the components of the forces on all particles and J^†[·] is the spreading operator that transfers the forces on all particles to the fluid such that J^†[F](x) = Σ_n F_n ∆_n(x). The motion of the particles is determined using the kernel to locally average u(x). The velocity of particle m is then given by V_m = ∫_Ω u(x)∆_m(x) dx, recognising also that the interpolation operator J and the spreading operator are adjoints. The kernel width, σ, sets the hydrodynamic radius of the particles via a = σ√π. This particular choice of a recovers the Stokes drag law for a single particle in an unbounded domain [27].
The successive actions of spreading the particle forces to the fluid, solving for the resulting fluid velocity, and interpolating the fluid velocity to obtain the particle velocities provide the single linear relationship between the particle forces, F, and velocities, V = M^VF F, where M^VF is the 3N × 3N force-velocity mobility matrix. This can be written in terms of the operator J and L^{−1}, the inverse of the Stokes operator, as M^VF = J L^{−1} J^†.

Grid-based algorithm for FCM in periodic domains
For simulations of particles in periodic domains where Ω = [0, L_x) × [0, L_y) × [0, L_z), the FCM mobility matrix can be applied using an FFT-based algorithm described previously in [29,31]. Here, Ω is discretised using a uniform grid with spacing ∆x and size M_x × M_y × M_z, where M_x, M_y and M_z are the number of gridpoints in the x-, y-, and z-directions, respectively. The total number of grid points is then M = M_x M_y M_z. The algorithm consists of three main steps, each corresponding to the action of one of the continuous operators described above.
1. Applying J^†: The particle forces are communicated to the fluid by evaluating the Gaussian force distributions on the grid. Although the Gaussian is not compactly supported, it decays rapidly and its tails can be safely truncated in an error-controlled fashion. Thus, each Gaussian is supported locally on the grid by M_G × M_G × M_G gridpoints.
2. Applying the inverse Stokes operator, L^{−1}:
(a) The Fourier transform of the total force distribution on the grid is computed using the FFT.
(b) The fluid velocity in Fourier space is obtained by applying the inverse Stokes operator in Fourier space.
(c) The inverse Fourier transform of the fluid velocity is computed using the FFT.
3. Applying J: The particle velocities are obtained by numerically evaluating the integral appearing in (4) using the trapezoidal rule. Again, the sums are performed using the truncated Gaussians on a grid of size M_G × M_G × M_G.
The operation count associated with evaluating the force on the grid and interpolating the resulting flow is O(N M_G^3), while that associated with the application of the inverse Stokes operator is O(M log M). While this algorithm avoids the O(N^2) scaling associated with pairwise computation, when the grid is large compared to the number of particles, the cost of the FCM algorithm will exceed that of pairwise evaluation.
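The three grid-based steps above can be sketched compactly. The following is a minimal, illustrative Python/NumPy translation (not the paper's CUDA implementation): the function and parameter names are our own, the parameter values are arbitrary, and the Fourier-space solve uses the standard projection form of the periodic Stokes inverse, û = (I − k̂k̂ᵀ)f̂/(ηk²), with the k = 0 mode set to zero.

```python
import numpy as np

def _kernel(Y, dx, sigma, Mg, M):
    """Truncated Gaussian weights and wrapped grid indices around one particle."""
    offs = np.arange(Mg) - Mg // 2
    base = np.floor(Y / dx).astype(int)
    idx = base[:, None] + offs[None, :]          # (3, Mg) unwrapped indices
    x = idx * dx
    w = np.exp(-(x - Y[:, None])**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    W = w[0][:, None, None] * w[1][None, :, None] * w[2][None, None, :]
    return W, np.ix_(idx[0] % M, idx[1] % M, idx[2] % M)

def fcm_mobility_apply(Y, F, L=1.0, M=32, sigma=0.05, eta=1.0, Mg=10):
    """Sketch of the grid-based FCM steps: spread (J^dagger), Fourier-space
    Stokes solve (L^{-1}), interpolate (J). Parameter values are illustrative."""
    dx = L / M
    f = np.zeros((3, M, M, M))
    for n in range(len(Y)):                      # step 1: apply J^dagger
        W, ix = _kernel(Y[n], dx, sigma, Mg, M)
        for d in range(3):
            np.add.at(f[d], ix, F[n, d] * W)
    k1 = 2 * np.pi * np.fft.fftfreq(M, d=dx)     # step 2: apply L^{-1} via FFT
    kx, ky, kz = np.meshgrid(k1, k1, k1, indexing='ij')
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                            # guard the k = 0 mode
    fh = np.fft.fftn(f, axes=(1, 2, 3))
    kdotf = kx * fh[0] + ky * fh[1] + kz * fh[2]
    uh = np.stack([(fh[d] - kd * kdotf / k2) / (eta * k2)
                   for d, kd in enumerate((kx, ky, kz))])
    uh[:, 0, 0, 0] = 0.0                         # zero-mean flow
    u = np.real(np.fft.ifftn(uh, axes=(1, 2, 3)))
    V = np.zeros_like(F)
    for n in range(len(Y)):                      # step 3: apply J (trapezoidal rule)
        W, ix = _kernel(Y[n], dx, sigma, Mg, M)
        for d in range(3):
            V[n, d] = np.sum(u[d][ix] * W) * dx**3
    return V
```

Because the same truncated weights are used for spreading and interpolation, the discrete mobility in this sketch inherits the symmetry and positive definiteness of the continuous operator J L^{−1} J^†.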
The accuracy of this algorithm hinges on the grid spacing being sufficiently fine to resolve the Gaussian kernel used in J and J^†, i.e. ∆x < σ. Thus, the grid spacing, and hence the number of grid points, is determined by σ and consequently by the hydrodynamic radius, a. As a result, for simulations at low volume fraction, where the separation between particles is large compared to the particle size, the FFT and the application of the inverse Stokes operator will require excessive computational time. It is also worth noting that the grid-based approach can incur a high memory cost: flow data on a 1024 × 1024 × 1024 grid using double precision requires 25.8 GB of RAM.
The purpose of this paper is to formulate fast FCM by decoupling the grid spacing from the Gaussian width to facilitate more efficient computation across all particle volume fractions.We accomplish this by replacing the Gaussian kernel in the FCM algorithm by a modified kernel of larger width, and then correcting the resulting particle velocities by a pairwise computation.An essential aspect of this procedure is constructing a modified kernel that ensures the number of particle pairs requiring correction remains small while simultaneously having the kernel width as large as possible.

FCM pairwise mobility matrix
A key piece of information needed to formulate fast FCM is the FCM pairwise mobility matrix, M^VF_nm, which provides the contribution to the velocity of particle n due to the force on particle m. We can obtain an analytical expression for M^VF_nm using the expressions found in [27] for the fluid velocity generated by a Gaussian force distribution.
The flow generated by f(x) = F∆(x; σ), with ∆(x; σ) = (2πσ^2)^{−3/2} exp(−r^2/2σ^2) and r = ∥x∥, can be expressed as u(x) = S(x; σ)F, where S(x; σ) = S^{(1)}(x; σ) + S^{(2)}(x; σ) can be written in terms of the Oseen tensor. The expression for the pairwise mobility matrix for FCM follows directly from that for the flow. Namely, the entries of the mobility matrix that relate the velocity of particle n to the force on particle m are M^VF_nm = S(Y_n − Y_m; σ√2). We see that this is identical to the expression for the flow field, but with the envelope size replaced by σ√2. This slight modification is a result of the volume averaging. The validity of this expression is most easily established using Fourier integrals, as shown in Appendix A.
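The appearance of σ√2 can be traced to a convolution identity: volume averaging (with a Gaussian of width σ) the flow generated by spreading (with another Gaussian of width σ) amounts to convolving the two kernels, and the convolution of two Gaussians of width σ is a Gaussian of width σ√2. A quick numerical check of this identity using an FFT-based periodic convolution in one dimension (our own illustrative script, not from the paper):

```python
import numpy as np

# Convolving two Gaussians of width sigma yields a Gaussian of width sigma*sqrt(2).
# We verify this with a periodic FFT convolution on a 1-D grid; the domain size L
# is chosen much larger than sigma so that periodic images are negligible.
sigma, L, M = 0.05, 2.0, 512
dx = L / M
x = (np.arange(M) - M // 2) * dx                       # grid centred at x = 0
g = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
# Circular convolution of g with itself via the FFT, scaled by dx to
# approximate the continuous convolution integral.
conv = np.fft.fftshift(np.real(np.fft.ifft(np.fft.fft(g)**2))) * dx
s2 = sigma * np.sqrt(2.0)
g2 = np.exp(-x**2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)
assert np.max(np.abs(conv - g2)) < 1e-10               # widths match to roundoff
```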
Along with these expressions, it will also be useful to have at hand the flow generated by f(x) = H∇^2∆(x; σ), which takes the form u(x) = Q(x; σ)H, where Q(x; σ) = Q^{(1)}(x; σ) + Q^{(2)}(x; σ) can be decomposed into two terms. The expression (16) can also be written in an equivalent form in terms of the Oseen tensor.

Fast FCM
With the results from the previous sections established, we are in a position to formulate and justify the fast FCM framework. As is typically done in Ewald splitting, we decompose the mobility matrix into two parts, M^VF = M̃^VF + (M^VF − M̃^VF), and aim for the action of M̃^VF to be evaluated efficiently using a grid-based computation and the correction (M^VF − M̃^VF) to be applied pairwise, but only for a limited number of pairs whose separations are within a cut-off radius, i.e. (M^VF − M̃^VF) is sparse.
In standard Ewald splitting, the decomposition and aims are achieved by introducing the splitting function H(k; ξ), where k = |k| is the magnitude of the wavenumber k and ξ is the splitting parameter, into the Fourier transform of the inverse Stokes operator such that L^{−1} = L^{−1}H(k; ξ) + L^{−1}(1 − H(k; ξ)), with the two terms associated with M̃^VF and (M^VF − M̃^VF), respectively. The splitting function is selected to decay exponentially with increasing k, with the decay rate controlled by the splitting parameter, ξ. Common choices for H(k; ξ) include the Hasimoto [20] and Beenakker [32] splitting functions.
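For orientation, the Hasimoto splitting function referenced above is commonly stated in the Ewald-splitting literature in the following form (quoted here from general knowledge of that literature, not from this paper's elided equations):

```latex
H(k;\xi) = \left(1 + \frac{k^2}{4\xi^2}\right) e^{-k^2/(4\xi^2)}
```

Any splitting function with 0 < H(k; ξ) < 1 for k > 0 and exponential decay in k serves the same purpose.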
With fast FCM, we achieve a similar splitting by replacing the kernel ∆(x; σ) in FCM with a modified kernel ∆̃(x; Σ), which we use to compute the action of the approximate mobility M̃^VF using the standard, FFT-based FCM algorithm. Thus, to ensure that the cost of applying M̃^VF is less than that of M^VF, we must have Σ > σ so as to be able to use a coarser grid. At the same time, however, the choice of ∆̃(x; Σ) should yield an approximate pairwise mobility matrix for which ∥M̃^VF_nm − M^VF_nm∥ decays rapidly as the particle separation increases. This requirement ensures that the correction matrix M^VF − M̃^VF is sparse and the number of pairwise corrections needed to achieve a given tolerance is minimised.
With these requirements in mind, we utilise the modified kernel ∆̃(x; Σ) = (1 + ((σ^2 − Σ^2)/2)∇^2)∆(x; Σ), with Σ > σ. The inclusion of the operator (1 + ((σ^2 − Σ^2)/2)∇^2), as we will see, enables the exponential convergence of the pairwise mobility.
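In Fourier space the effect of the extra operator is easy to inspect: the Gaussian transforms to exp(−σ²k²/2), while the modified kernel transforms to (1 − ((σ² − Σ²)/2)k²) exp(−Σ²k²/2), which matches exp(−σ²k²/2) up to O(k⁴) as k → 0. The following short check (our own sketch; the function names are ours) verifies the fourth-order agreement numerically by confirming that halving k shrinks the mismatch by roughly 2⁴ = 16:

```python
import numpy as np

def delta_hat(k, s):
    """Fourier transform of a Gaussian kernel of width s: exp(-s^2 k^2 / 2)."""
    return np.exp(-0.5 * s**2 * k**2)

def delta_mod_hat(k, Sigma, sigma):
    """Transform of the modified kernel: the operator (1 + ((sigma^2 - Sigma^2)/2) Lap)
    becomes multiplication by (1 - ((sigma^2 - Sigma^2)/2) k^2) in Fourier space."""
    return (1.0 - 0.5 * (sigma**2 - Sigma**2) * k**2) * delta_hat(k, Sigma)

sigma, Sigma = 1.0, 2.0               # illustrative widths with Sigma > sigma
e1 = abs(delta_mod_hat(0.2, Sigma, sigma) - delta_hat(0.2, sigma))
e2 = abs(delta_mod_hat(0.1, Sigma, sigma) - delta_hat(0.1, sigma))
# O(k^4) mismatch: halving k should shrink the error by roughly a factor of 16.
assert 10.0 < e1 / e2 < 22.0
```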

The action of M̃^VF
Computing the action of M̃^VF follows the same steps as the standard FCM computation described in Section 2, with ∆_n(x; σ) replaced by ∆̃_n(x; Σ). Accordingly, we consider the Stokes system with the forces spread by the modified kernel and compute the flow field. Then, from the fluid velocity, we determine the approximate particle velocities by evaluating the corresponding volume averages.

The action of M^VF − M̃^VF
To correct these velocities, we utilise an analytical expression for the pairwise mobility. We derive this expression from the integral representation [31] of the pairwise mobility, which we then expand into a sum of integrals. Each integral can be related to the expressions for the FCM mobility matrices found in Section 2.2, with S(x; σ) and Q(x; σ) provided by (8)-(10) and (16)-(17), respectively; the last term is new. Taking the difference with the FCM pairwise mobility (14), we arrive at the expression for the correction matrix, (31). Along with providing the expression needed to correct the particle velocities, (31) reveals the important property that each term on the right-hand side decays exponentially as the particle separation ∥Y_n − Y_m∥ increases. The expression for (31) evaluated at ∥Y_n − Y_m∥ = 0, which is well-defined, is provided in Appendix B and is used to correct the self-mobility matrix for every particle. Note also that we assume that the domain size L is always sufficiently large that corrections are not needed between a particle and its own periodic images. As with standard Ewald splitting, rapid convergence is crucial to ensuring that the correction matrix is sparse. We also observe that the parameter Σ that controls the width of the modified kernel plays the same role as ξ in Ewald splitting. Accordingly, Σ will need to be chosen to optimise the overall performance of fast FCM.

Positive splitting
An additional desirable feature associated with this choice of ∆̃ is that it provides a positive splitting of the mobility matrix, namely that both M̃^VF and M^VF − M̃^VF are symmetric positive definite matrices. While positive splitting is not essential for the efficient evaluation of the hydrodynamic interactions, as discussed in [22], it is essential for the efficient computation of Brownian displacements when thermal fluctuations are to be included in the computation.
First, we notice that M̃^VF is positive definite by construction, due to the positive definiteness of the inverse Stokes operator and the spreading operator being the adjoint of the interpolation operator. It remains to show that M^VF − M̃^VF is positive definite.
We begin with the expression for the pairwise FCM mobility matrix in terms of the Fourier transforms (see also Appendix A) of ∆(x; σ) and the inverse Stokes operator. If we were to apply Ewald splitting directly, we would split this expression using H(k; ξ), and, as described in [22], the splitting is positive if 0 < H(k; ξ) < 1 for k > 0. To demonstrate that our modified kernel approach yields positive splitting, we find the corresponding splitting function and show that it satisfies this condition. Since the Fourier transform of the Gaussian is ∆(k; σ) = exp(−σ^2k^2/2), the modified kernel has transform ∆̃(k; Σ) = (1 + ((Σ^2 − σ^2)/2)k^2) exp(−Σ^2k^2/2). This allows us to express M^VF_nm in split form, where the fast FCM splitting function is H_FCM(k; Σ, σ) = (∆̃(k; Σ)/∆(k; σ))^2 = ((1 + ((Σ^2 − σ^2)/2)k^2) exp(−((Σ^2 − σ^2)/2)k^2))^2. Thus, since k ≥ 0 and Σ > σ, we have 0 < H_FCM(k; Σ, σ) ≤ 1, with equality only at k = 0. From this analysis, we also see how the approach of using a modified kernel corresponds to the standard Ewald splitting framework, namely that ∆̃(k; Σ) = (H_FCM(k; Σ, σ))^{1/2}∆(k; σ). Thus, it would have been possible to use one of the standard splitting functions, such as that used by Hasimoto [20], provided that H_H(k; ξ)^{1/2} exists for all k and its inverse Fourier transform can be found to evaluate its real-space representation on a grid. It is important to note that while this direct correspondence is available for periodic domains, in other domains such as channels, using a modified kernel in real space to perform the splitting, as done for Stokeslets [12,30], may be the only route to accelerating the computation.
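Writing b = (Σ² − σ²)/2, the splitting function reduces to H_FCM(k) = ((1 + bk²)e^{−bk²})², and since (1 + t)e^{−t} lies in (0, 1] for t ≥ 0, the positivity condition follows. A small numerical sanity check of this claim (an illustrative script of ours, with arbitrary σ and Σ):

```python
import numpy as np

# Fast-FCM splitting function implied by the modified kernel:
# H(k) = ((1 + b k^2) exp(-b k^2))^2 with b = (Sigma^2 - sigma^2)/2 > 0.
sigma, Sigma = 1.0, 2.5
b = 0.5 * (Sigma**2 - sigma**2)
k = np.linspace(1e-3, 10.0, 2000)
H = ((1.0 + b * k**2) * np.exp(-b * k**2))**2
assert np.all(H > 0.0) and np.all(H < 1.0)   # positive splitting: 0 < H < 1 for k > 0
assert np.all(np.diff(H) < 0.0)              # (1 + t) e^{-t} decreases for t > 0
```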

Extension of fast FCM to torques and angular velocities
As introduced in [28], FCM can be extended to include torques and particle rotations. Along with the force, F_n, the torque, T_n, on each particle n is also spread to the fluid, where J^† is given in (25). After solving for the fluid flow, the angular velocity, Ω_m, of particle m is found by applying N to the flow field. For particle rotations, the width of the Gaussian is σ_D = a/(6√π)^{1/3} to ensure FCM provides the correct value for the viscous torque on a single spherical particle.
With the addition of torques and angular velocities, the mobility relationship now involves M^VF, given by (14), along with the other 3N × 3N mobility submatrices. We may utilise the analytical expressions [27,28] for the vorticity induced by single FCM force and torque distributions located at the origin to obtain expressions for the pairwise mobility matrices. The vorticity resulting from a force distribution involves the cross-product matrix (−x×). For a torque, the vorticity is given in terms of P(x; σ_D) = P^{(1)}(x; σ_D) + P^{(2)}(x; σ_D). Without presenting the details, as the steps are similar to those used for the force-velocity pairwise mobility, the additional pairwise mobilities then follow. As done when applying the force-velocity mobility, we split the mobility into two parts such that the application of M̃ can be evaluated rapidly on a grid while M − M̃ is evaluated pairwise for a limited number of particle pairs. For the grid-based computation, we solve the Stokes system with J̃^† given by (25). As before, the resulting particle velocities follow from applying J̃ to the flow field, while the angular velocities are obtained by applying Ñ. Note that for Ñ, we simply replace the Gaussian with σ_D by one with Σ_D > σ_D. No additional term involving the Laplacian is required to ensure exponential convergence.
The real-space corrections are computed from the pairwise mobility relations, where R(x; σ) is provided in (53) and P(x; σ) is given by (56)-(57).
As a final point, we note that Σ_D, the width of the modified kernel appearing in Ñ, can be set independently of Σ. For convenience, however, in our computations below, when we have torques we take Σ_D = Σ.

Fast FCM Algorithm
To evaluate the action of the mobility matrix using fast FCM, we must combine the grid-based FFT computation described in Section 2.1 for FCM with an efficient scheme for evaluating the pairwise correction for particles whose separation distance is within a tolerance-related cutoff radius, R_c. To do this, we both sort the particles and create neighbour lists using the approach described below. Our approach has the dual effect of ensuring efficient computation of the pairwise corrections and a localisation in memory of nearby particles, allowing for efficient spreading and interpolation to the grid.
1. Spatial hashing: The first step involves dividing the domain into cells in order to group particles based on their positions within the domain. This division needs to be performed only once during the initialisation of the simulation. Here, the periodic domain is divided into m_x × m_y × m_z cells, where the number of cells in each direction is determined from m_x = ⌊L_x/R_c⌋ (and similarly for m_y and m_z). This number of cells guarantees that for all particles, any other particle within a distance R_c is either in the same cell or in one of the adjacent cells. The index for each cell is obtained by linearising the cell coordinates, e.g. x_c + m_x(y_c + m_y z_c), where x_c = 0, . . ., m_x − 1 indexes the cell in the x-direction, and y_c and z_c are defined similarly. At each timestep, each particle is assigned to a specific cell based on its location in the domain and the cell index is used as the particle's hash value.
2. Particle sorting and cell lists: All particle data arrays (positions, forces, torques) are then sorted based on their hash values, which we perform using 'sort by key.' This sorting facilitates efficient memory retrieval during grid operations, since particles positioned consecutively in memory will now have in common the grid points to which they spread and from which they interpolate. In addition, since particles in a cell are now contiguous in memory, the pairwise corrections for pairs in the same and adjacent cells can readily be evaluated by knowing the index of the first and the last particle in each cell.
3. Applying J̃^†: This step is the same as Step 1 in the FCM computation presented in Section 2.1, with J^† replaced by J̃^†.
4. Applying L^{−1}: This step is identical to Step 2 in Section 2.1.
5. Applying J̃: Again, this is the same as Step 3 in the FCM computation, with J replaced by J̃.
6. Pairwise correction: For each particle m, we determine its cell index and identify the adjacent cells. Then, using the start and stop indices, we compute the distance between particle m and the other particles in the same and adjacent cells. If the distance between particles m and n, denoted R_mn, satisfies R_mn < R_c, we apply the pairwise correction (31) using (65) and (67) to adjust the velocities of particles m and n accordingly.
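Steps 1, 2 and 6 can be prototyped compactly on the CPU. The sketch below (our own illustrative Python, not the paper's CUDA code; the names and the particular linear cell index are our choices) hashes particles to cells of edge at least R_c, sorts by key, computes start/stop indices per cell, and then collects all pairs within R_c using the minimum-image convention:

```python
import numpy as np
from itertools import product

def cell_list_pairs(Y, L, Rc):
    """Sketch of the fast-FCM neighbour search: hash particles to cells with
    edge length >= Rc, sort by hash ('sort by key'), then test distances only
    against particles in the same and adjacent cells."""
    m = max(1, int(np.floor(L / Rc)))            # cells per direction, edge >= Rc
    cell = np.floor(Y / (L / m)).astype(int) % m
    hsh = (cell[:, 2] * m + cell[:, 1]) * m + cell[:, 0]
    order = np.argsort(hsh, kind='stable')       # sort particle data by key
    Ys, cells, hs = Y[order], cell[order], hsh[order]
    # start/stop index of each cell's particles in the sorted arrays
    starts = np.searchsorted(hs, np.arange(m**3), side='left')
    stops = np.searchsorted(hs, np.arange(m**3), side='right')
    pairs = set()
    for i in range(len(Ys)):
        for off in product((-1, 0, 1), repeat=3):    # same and adjacent cells
            c = (cells[i] + off) % m
            h = (c[2] * m + c[1]) * m + c[0]
            for j in range(starts[h], stops[h]):
                if j == i:
                    continue
                d = Ys[j] - Ys[i]
                d -= L * np.round(d / L)             # minimum-image convention
                if d @ d < Rc * Rc:
                    a, b = int(order[i]), int(order[j])
                    pairs.add((min(a, b), max(a, b)))
    return pairs
```

A direct O(N²) double loop over all pairs gives the same pair set and serves as a correctness check; the cell list reduces the work to the O(Nϕ(R_c/a)³) scaling quoted below.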
The operation counts for applying J̃ and J̃^† will be identical to those for applying J and J^† in FCM, as will the operation count for applying L^{−1}. It is important to remember, however, that the grid used for fast FCM will be smaller. For a random dispersion of N FCM particles, the operation count for the pairwise correction will be O(Nϕ(R_c/a)^3), where ϕ is the volume fraction. Thus, the savings in computation time provided by fast FCM will broadly depend on the balance between the reduction in computation time due to the reduced grid size and the increased cost of the pairwise correction.

Implementation
In the remainder of the paper, we explore the performance of fast FCM using a C++/CUDA-based implementation that utilises the Graphics Processing Unit (GPU) to accelerate the computation. Our CUDA implementation is freely available on GitHub: https://github.com/racksa/cuFCM_demo.
In our implementation, specific attention is paid to many GPU-related considerations. We carefully organise the data access pattern for quantities, such as the flow field, that reside on the grid. This limits waiting times when fetching data from memory. We utilise the CUDA interface, which offers flexibility in manipulating hardware memory through the 'shared memory' hierarchy on the GPU. In doing so, we can take advantage of important features of GPU computing, including the ability to switch processing cores between tasks while waiting for memory communications, thereby reducing idle time. For a more detailed summary of the CUDA-specific choices used in our implementation, please see Appendix C.

Error control and parameter optimisation
We set the values of the computational parameters appearing in fast FCM to ensure that particle velocities are returned efficiently to within a user-specified tolerance. These parameters are:
1. M_G, the size of the grid support of ∆̃_n(x; Σ) in one direction,
2. Σ/∆x, the kernel width relative to the grid spacing,
3. Σ/σ, the kernel width relative to the FCM envelope width,
4. R_c/σ, the pairwise correction cutoff radius relative to the FCM envelope width.
The parameters M_G and Σ/∆x are selected to provide sufficient resolution of the grid-based computation associated with applying M̃^VF. The remaining parameters, Σ/σ and R_c/σ, are set to ensure that the particle velocity error is below the specified tolerance while balancing the computational effort between the grid-based and pairwise computations. One can see that for Σ/σ = 1, R_c/σ can be rather small. Thus, very few pairwise corrections will be required, but nothing has been gained with respect to the standard FCM computation. At the other extreme, where Σ/σ ≫ 1, the grid size will be greatly reduced, but a large R_c/σ will be required. In this case, an exceedingly large number of pairwise corrections will be needed to deliver errors below an acceptable tolerance. We seek, therefore, intermediate values of Σ/σ and R_c/σ so as to yield the optimal balance in computation time between the pairwise and grid-based computations.

Resolving the action of M̃^VF
We first determine the values of M_G and Σ/∆x that are needed to return the application of M̃^VF to within a desired tolerance. The parameter Σ/∆x controls the grid resolution of the kernel, and M_G sets the finite extent of the grid support of ∆̃(x; Σ). Hence, errors associated with M_G are linked to the truncation of the Gaussian kernel's tails. As the computational cost increases with Σ/∆x and M_G, it is important to ensure that these are set to the lowest values possible for a given tolerance.
Fig. 1(a) shows the error, ϵ, in applying M̃^VF for a random suspension of N = 64457 particles with volume fraction ϕ = 8% in a domain of size L × L × L for a range of Σ/∆x and M_G. To ensure any error incurred is associated with M̃^VF, the correction is applied to all particle pairs by setting R_c = L. Each particle n is subject to a random force, F_n, and the resulting velocity V_n for each n is determined. The error is computed relative to the exact velocities, U_n, generated using a computation with error tolerance 10^{−15}. We observe that ϵ decreases exponentially as Σ/∆x and M_G increase simultaneously. We notice also that the error contours are approximately rectangular. This indicates that minimum values of Σ/∆x and M_G are required to achieve a desired tolerance. These values are provided in Table 1, where we see both Σ/∆x and M_G approximately double in value as ϵ decreases from 10^{−2} to 10^{−8}.
Table 2: The slope of the line R_c/σ = λ_ϵ Σ/σ that provides the relationship between R_c and Σ needed to achieve tolerance ϵ, for random suspensions with different ϕ and a/L = 1/150.
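The error expression itself is not reproduced here; for illustration we assume a relative root-mean-square form consistent with the description above (this is a hypothetical stand-in, not the paper's exact definition):

```python
import numpy as np

def rel_rms_error(V, U):
    """Hypothetical error measure: relative RMS difference between fast-FCM
    velocities V and reference velocities U. The paper's exact definition is in
    its (elided) error equation; this stand-in only illustrates the idea."""
    return float(np.sqrt(np.sum((V - U)**2) / np.sum(U**2)))

U = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])   # toy reference velocities
assert rel_rms_error(U, U) == 0.0                   # exact agreement gives zero
assert 0.0 < rel_rms_error(U + 1e-3, U) < 1e-2      # small perturbation, small error
```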

Determining the optimal R_c/σ and Σ/σ
With values of M_G and Σ/∆x established for the grid-based application of M̃^VF, we move to determining values for the remaining two parameters, R_c/σ and Σ/σ. To do so, we must now consider the complete computation and assess how the inclusion of pairwise corrections affects the overall error. This will provide a relationship between the values of R_c/σ and Σ/σ needed to achieve a given ϵ. Then, using parameter values lying on these curves, we can find the combination of R_c/σ and Σ/σ that minimises computational time for a specified ϵ.
Fig. 1(b) shows the error for the random suspension with N = 64457 particles and volume fraction ϕ = 8% over a range of R_c/σ and Σ/σ. For all simulations performed in this paper, there is no overlap between particles. The computations were performed with Σ/∆x = 2.0 and M_G = 16 to ensure sufficient resolution of the grid-based computation for all values of ϵ. Based on these parameter values, the number of grid points ranges from M_x = 30 to M_x = 260, with M_x = M_y = M_z in all cases. We see that the error contours can be approximated by lines that pass through the origin with slopes that increase as the error decreases. Thus, for a given ϵ, we may write R_c/σ = λ_ϵ Σ/σ, where λ_ϵ is the slope of the line. Table 2 shows the values of λ_ϵ for different values of ϵ over a range of ϕ. We see that for a given ϵ, changing ϕ results in only a modest change in λ_ϵ.
While all R_c/σ and Σ/σ along these lines return values within the specified error tolerance, the computational time for different parameter values along the lines can vary widely. At low R_c/σ and Σ/σ, there will be a high computational cost incurred due to a very fine grid, while at high R_c/σ and Σ/σ, the pairwise computation will dominate the cost. Thus, the appropriate choice for R_c/σ and Σ/σ will be one along the desired tolerance curve that balances the computational costs of the grid-based and pairwise computations. Fig. 2 shows the computational cost, measured in particle time-steps per second (PTPS) [22], as a function of Σ/σ for random suspensions with different values of ϕ. The PTPS is the number of particles divided by the average time required to apply the mobility matrix. Thus, high values of PTPS correspond to lower computation times. We set the error tolerance to be ϵ = 10^{−4}, and using the values from Tables 1 and 2, we have Σ/∆x = 1.00, M_G = 10 and λ_ϵ = 5.5. We see that for each ϕ, PTPS attains a maximum value, and the value of Σ/σ where the peak is realised decreases as ϕ increases. The values of Σ/σ and R_c/σ for which the peak values occur are given in Table 3. The peak PTPS value is approximately 7 × 10^6 for all ϕ and exhibits a slight reduction at the lowest values of ϕ. For the most dilute case, ϕ = 0.0005, the peak occurs at Σ/σ ≈ 6. Thus, for this case, using fast FCM reduces the grid size by a factor of approximately 6^3. As the suspension becomes denser, the average particle separation decreases, resulting in the peak occurring at lower values of Σ/σ. Eventually, when the volume fraction becomes sufficiently high, the optimal computation is realised for Σ/σ ≈ 1, and the standard FCM computation (no pairwise corrections) becomes the most efficient choice.

Cost comparison with standard FCM
With the optimal parameters established, we can assess the performance of fast FCM relative to the standard FCM implementation. To do so, we compute the PTPS for fast and standard FCM for random suspensions with different ϕ, and determine ϕc, the value of ϕ above which the computational cost of fast FCM exceeds that of the standard implementation, i.e. PTPS_FFCM < PTPS_FCM. To ensure a fair comparison, the two methods share the same code base and code optimisations where possible: both use the same FFT package and gridding subroutines, and both run on the same device. We perform the comparison for a/L = 0.004, 0.008 and 0.012, with tolerance ϵ = 10^−4. Fig. 3(a) shows the PTPS ratio, PTPS_FFCM/PTPS_FCM, as a function of ϕ for the three values of a/L. In all cases, PTPS_FFCM/PTPS_FCM > 1 at low ϕ, followed by an overall decrease to PTPS_FFCM/PTPS_FCM < 1 as ϕ increases. As discussed above, at low ϕ, when the particles are on average further apart, fast FCM requires fewer grid points than standard FCM, and hence its PTPS is higher. As ϕ increases, the particles become more closely spaced and the grid sizes for the two algorithms become comparable, reaching near equality at ϕc. The value of ϕc increases from approximately 10% for a/L = 0.012 to approximately 18% for a/L = 0.004. For a fixed particle size, therefore, the cost of fast FCM relative to the standard computation decreases as the simulation domain grows. This dependence of ϕc on a/L is shown in Fig. 3(b). We observe that ϕc decreases linearly, ϕc ≈ −9.24(a/L) + 0.22, over the range of a/L that we are able to assess. Similar results are obtained when the tolerance is lowered to ϵ = 10^−6 (see Figs. 3(c) and 3(d)).

For a/L < 0.004, the grid for standard FCM becomes too large to fit in the memory of the GPU (NVidia RTX 2080 Ti), while for a/L > 0.012, the FCM grid is too small, leaving idle nodes on the GPU. For small a/L with ϕ now fixed, standard FCM again requires large M to ensure ∆x < a for sufficient resolution of the particles. This results in excessively costly grid computations, including very large memory overheads, that fast FCM mitigates through the pairwise correction. Before presenting results from simulations performed using fast FCM, it is important to note that the parameter values provided here are a general guide to controlling errors and minimising computation times when using fast FCM. Our results were compiled for simulations of random suspensions with different volume fractions on one type of device. The performance of fast FCM, and hence the optimal parameter values, will depend on the particle arrangement as well as the device on which the computation is performed. Users are therefore encouraged to explore how these parameters should be tuned to optimise fast FCM for their specific computation.

Simulations using fast FCM
In this section, we employ fast FCM as the hydrodynamic solver for simulations of rigid bodies and flexible filaments. In these simulations, fast FCM is used in a manner similar to the immersed boundary method [25] or the rigid multiblob method [15]: the immersed body is discretised into a number of FCM particles, which experience forces corresponding to the internal stresses of the structure. These forces can arise through constraints, or through a constitutive law relating structure deformation to internal stress. This allows us to examine the usefulness and performance of fast FCM as part of a larger, more complex computation.

Sedimentation of rod suspensions
We first employ fast FCM to simulate the sedimentation of a planar suspension of NR rigid rods. Following [33], each rod is constructed from 22 FCM particles stacked as alternating doublets, as shown in Fig. 4(a); with this construction, the total rod length is l = 14.14a. The position of rod n is Xn, and its orientation is given by the unit quaternion qn. The simulations are conducted in triply-periodic domains of dimensions L × L × 60a (see again Fig. 4). In the simulations, the rods are subject to gravity and to short-ranged, inter-rod repulsive forces, both imposed on the individual FCM particles making up the rods. For gravity, each FCM particle is subject to the constant force FG = F0 x, yielding the timescale T = ηl/F0. To prevent the rods from overlapping if they collide, we apply a pairwise barrier force [34] between the FCM particles making up the rods. As a result, we can write the total force and torque on rod n in terms of FBi, the total barrier force on particle i, and Nn, the set of FCM particle indices that comprise rod n. As the rods are rigid bodies, their motion is governed by equations for Un and Γn, respectively the translational and angular velocities of rod n. The symbol • is the quaternion product, which for quaternions p = (p0, p) and q = (q0, q) is p • q = (p0q0 − p ⋅ q, p0q + q0p + p × q). Computing Un and Γn for all n is done by solving a linear system involving, as before, the vectors containing the velocities and forces on all FCM particles. These equations enforce the condition that the motion of individual FCM particles is consistent with the rigid body motion of the rods to which they belong. In particular, note that fn and τn are given by (71) and (72), respectively, and F must be solved for as part of the problem.
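The quaternion product defined above can be implemented directly; the following sketch transcribes the formula from the text.

```python
import numpy as np

# The quaternion product used to evolve the rod orientations, exactly as
# defined in the text: for p = (p0, p) and q = (q0, q),
#   p . q = (p0*q0 - p.q,  p0*q + q0*p + p x q).
def quat_product(p, q):
    p0, pv = p[0], np.asarray(p[1:], dtype=float)
    q0, qv = q[0], np.asarray(q[1:], dtype=float)
    scalar = p0 * q0 - np.dot(pv, qv)
    vector = p0 * qv + q0 * pv + np.cross(pv, qv)
    return np.concatenate(([scalar], vector))
```

Because the product of unit quaternions is again a unit quaternion, composing rotations this way is compatible with a time-stepping scheme that preserves |qn| = 1.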
In the simulations, we discretise (73) and (74) using the second-order, geometric BDF scheme [35], which preserves the unit length of qn for all n. The resulting algebraic equations, along with (75)–(78), constitute a nonlinear system that we must solve to obtain the updated rod positions and orientations, as well as the F that enforces the rigid body constraints. To solve the system, we employ Broyden's method [36] with an initial approximation of the Jacobian based on a diagonal mobility matrix. At each Broyden iteration, fast FCM applies M_VF to F to determine V from the current solution. As the tolerance for Broyden's method is set to 10^−4, we also require that fast FCM return the velocities to within ϵ = 10^−4. At the start of the simulations, the rods are positioned on a square lattice and randomly oriented in the plane at z = 30a. As FG and FBi for all i lie in the (x, y)-plane and the rods are symmetric with respect to reflections about this plane, the dynamics of their positions and orientations should remain 2D as the suspension evolves. The small out-of-plane rod motion incurred through numerical error is ignored when the rod positions and orientations are advanced in time.
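The structure of the nonlinear solve can be illustrated with a minimal sketch of Broyden's method with a diagonal initial Jacobian. In the paper, each residual evaluation involves applying M_VF with fast FCM; here a simple algebraic residual stands in for that step, so the example is illustrative rather than the paper's solver.

```python
import numpy as np

# Broyden's method with a diagonal initial Jacobian approximation: each
# iteration takes a quasi-Newton step and then applies a rank-one update
# to the Jacobian estimate using the observed change in the residual.
def broyden(residual, x0, J0_diag, tol=1e-4, max_iter=100):
    x = np.array(x0, dtype=float)
    J = np.diag(np.asarray(J0_diag, dtype=float))  # initial Jacobian guess
    f = residual(x)
    for _ in range(max_iter):
        if np.linalg.norm(f) < tol:
            break
        dx = np.linalg.solve(J, -f)                # quasi-Newton step
        x_new = x + dx
        f_new = residual(x_new)
        # rank-one ("good Broyden") update: J += (df - J dx) dx^T / (dx^T dx)
        J += np.outer(f_new - f - J @ dx, dx) / np.dot(dx, dx)
        x, f = x_new, f_new
    return x

# Example: the scalar residual x^2 - 2 = 0, starting from x = 1.
root = broyden(lambda x: np.array([x[0] ** 2 - 2.0]), [1.0], [2.0])
```

The appeal in this setting is that each iteration costs only one mobility application plus a cheap rank-one update, with no explicit Jacobian of the constrained system ever formed.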
Fig. 5 shows the evolution of a suspension with NR = 7744 and an area fraction of 10.5%. The simulation is run to time t = 4992.8T, and images of the suspension and the rod velocities at times t = 339.6T and 4992.8T are provided. Initially, when the particles have random positions and orientations, the average settling speed is below that of a single rod. After a short time, the rods form groups that settle rapidly, while regions of the suspension where the rod density is lower move upwards. Similar dynamics were observed in sedimentation simulations of rod suspensions in 3D periodic domains [37]. As the rapidly falling groups encounter regions moving more slowly, they deform and eventually break up, only to reform again. This periodic process of group formation and break-up continues over the course of the simulation.
To understand the dynamics more quantitatively, we perform additional simulations with NR = 2304 and a final time of t = 46676T. We consider L = 960a, 1920a, and 3840a, corresponding to area fractions 10.5%, 2.6%, and 0.66%, respectively. Fig. 4(b) shows the average sedimentation velocity ⟨Vx⟩ = (∑n Vn ⋅ x)/NR over time, relative to the settling speed W of a single rod moving along its axis. In all three cases, the velocity initially increases from values below the single-rod settling speed, corresponding to the initial formation of rod clusters, as seen in Fig. 5. After this initial growth, however, the speed plateaus and fluctuates around an average value as the clusters continually break and reform. The plateau speeds for area fractions 0.66% and 2.6% are approximately equal, with ⟨Vx⟩/W ≈ 1.15, while the speed is significantly higher for area fraction 10.5%, where ⟨Vx⟩/W ≈ 1.45. Additionally, the time required to reach the plateau decreases as the area fraction increases.

Arrays of filaments
Along with simulating interacting rigid bodies, we perform simulations of arrays of interacting flexible filaments with fast FCM as the hydrodynamic solver. Specifically, we investigate the coordinated motion in periodic domains of an array of clamped and tethered follower-force-driven filaments [38,39,40,41]. To do so, we employ the filament model presented in [35] and used previously in [40,41] to simulate follower-force-driven filaments, which, for clarity, we describe here for a single filament.
The position along the filament centerline is denoted by Y(s, t), a function of arclength, s ∈ [0, l), and time, t. The filament has length l and cross-sectional radius a. Additionally, at each point along the centerline there is an orthonormal basis {t(s, t), μ(s, t), ν(s, t)}, where t(s, t) is the unit vector tangent to the centerline. The force and moment balances along the filament involve Λ(s, t) and M(s, t), the internal forces and moments, respectively, on the filament cross-section, and fH(s, t) and τH(s, t), the hydrodynamic forces and torques per unit length experienced by the filament. While the internal moments are given by a constitutive law, with KB and KT the bending and twist moduli, respectively, the internal forces exist to ensure (79) is satisfied. At s = 0, the filament is tethered and clamped, such that Y(0, t) = Y0 and t(0, t) = ẑ. At s = l, M(l, t) = 0, and the filament end is subject to the follower force, whose magnitude is controlled by the non-dimensional parameter f.
The filament is discretised into NS segments of length ∆l such that segment i has position Yi and frame {ti, μi, νi}, which is determined from the segment quaternion, qi. Central differencing is applied to (79), (80) and (81), and after multiplying by ∆l we obtain the discrete force and moment balances, together with the discrete kinematic constraint. In these expressions, the constraint forces and torques appear, and FHi = fHi∆l and THi = τHi∆l are the hydrodynamic force and torque on segment i. Segment velocities and angular velocities are determined by applying the FCM mobility matrix, M, using fast FCM to the vector of hydrodynamic forces and torques, where V and W are, respectively, 3NS × 1 vectors containing the components of the translational and angular velocities of all segments. The hydrodynamic radius of each segment is a.
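The recovery of a segment frame from its quaternion can be sketched as below. The convention that ti is the image of ẑ under the rotation is an assumption made here to match the clamped condition t(0, t) = ẑ; the paper may order the frame differently.

```python
import numpy as np

# Build the rotation matrix associated with a unit quaternion
# q = (q0, q1, q2, q3) and read off the segment frame {t, mu, nu}
# as its columns, taking t = R @ z-hat (an illustrative convention).
def frame_from_quaternion(q):
    q0, q1, q2, q3 = q
    R = np.array([
        [1 - 2 * (q2**2 + q3**2), 2 * (q1*q2 - q0*q3),     2 * (q1*q3 + q0*q2)],
        [2 * (q1*q2 + q0*q3),     1 - 2 * (q1**2 + q3**2), 2 * (q2*q3 - q0*q1)],
        [2 * (q1*q3 - q0*q2),     2 * (q2*q3 + q0*q1),     1 - 2 * (q1**2 + q2**2)],
    ])
    t, mu, nu = R[:, 2], R[:, 0], R[:, 1]   # columns of R are the frame
    return t, mu, nu
```

For the identity quaternion (1, 0, 0, 0) this returns t = ẑ, μ = x̂, ν = ŷ, i.e. a straight, vertically oriented segment.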
Using the translational and angular velocities, the segment positions and quaternions are obtained by integrating the differential-algebraic system in time for each i. As in the rigid rod simulations, we discretise these equations in time using a second-order, geometric BDF scheme [35]. The resulting equations, along with the kinematic constraints, form a nonlinear system that we solve iteratively using Broyden's method [36].
As filaments interact exclusively through the surrounding fluid, the model is easily extended to many filaments by using fast FCM to couple the motion of all segments of all filaments when applying the mobility matrix.
In our simulations, each filament has NS = 20 segments with segment length ∆l = 2.2a, so the total filament length is l = 44a. The strength of the follower force is f = 220. In previous work [40,41] using the RPY tensor with a no-slip surface [42] for the segment mobility, this value of f results in dynamics where the filament beats in a plane; when many such filaments form an array, their beating becomes aligned and synchronised. Using fast FCM, we can explore how these dynamics change under periodic boundary conditions and in the absence of a no-slip surface, where the fluid is able to flow through the array. We examine how the filament spacing and the number of filaments affect the dynamics.
To begin, we consider a single filament in the periodic domain, corresponding to an infinite array of filaments with synchronised motion. The base of the filament is tethered at the origin and the orientation of the corresponding segment is t1 = ẑ. The dimensions of the periodic domain are L × L × H, with L = 2.27l fixed, while H varies between 2.27l and 45.45l. Initially, the filament is straight and oriented vertically, i.e. ti = ẑ for all i, and a random perturbation is introduced to initiate filament motion. To ensure that we observe the asymptotically stable state, we simulate each system for at least 10000T0, where T0 is the period of filament motion. The resulting filament behaviour and its variation with H are shown in Fig. 6. For H < 29.54l, we recover the planar beating observed in [40,41], with beating along one of the lattice directions. For H > 29.54l, however, instead of planar beating we observe filament whirling, which is typically associated with lower values of f.
We probe the coordination further by placing NF = 100 filaments within the periodic domain. As these filaments can now move independently, the final state that emerges can differ from the NF = 1 case, where synchronisation across the lattice is imposed. To keep the filament spacing the same as in the single-filament case, we set L = 22.7l and space the filaments equally in the x− and y−directions on a square lattice. We vary H between 2.84L and 45.45L and allow the simulations to run for more than 10000T0 to be confident that the final asymptotic state is reached. The different states that emerge and the ranges of H for which they are realised are shown in Fig. 7. We also provide a visual representation of the filament tip trajectories over one period, viewed from above, for each state.
We find the synchronised whirling state for H > 29.54L and the synchronised beating state for 20.0L < H < 29.54L; these are identical to the solutions obtained with only one filament in the domain. At low values of H, however, two different states emerge. For 12.00L < H < 20.00L, planar beating persists at the individual level, with all filaments beating in the same plane, but their motion is no longer synchronised across the array. Instead, synchronised motion occurs for filaments along rows in the beat direction, with phase differences developing between the rows. Finally, at the lowest values, 2.84L < H < 12.00L, planar beating is lost and the individual filaments move more erratically, often changing the direction of tip motion. At the collective level, however, we do find synchronised movement in subsections, or patches, of the array.

Performance
Before concluding, we examine the computational performance of fast FCM for the rod and filament simulations presented above. Fig. 8 shows the average wall time required per Broyden iteration as the number of rods or filaments increases at fixed domain size. In both cases, as expected, the time required for fast FCM to apply the mobility matrix scales linearly with the number of rods or filaments. We also see, in both cases, that with fast FCM the hydrodynamic solver accounts for only about one third of the total cost per iteration. This is a marked change from simulations using standard FCM, where the hydrodynamic solver dominates the computational cost. In our rod simulations, for example, the rank-one updates to the Jacobian and the arithmetic of updating the rod positions and orientations are comparable in cost to applying the mobility matrix with fast FCM. It is important to note that in our implementation, the Jacobian updates and the position and orientation updates are both performed on the CPU rather than the GPU, and memory transfer between the CPU and the device can also affect overall wall times. The speed and efficiency of fast FCM therefore forces us to re-evaluate the algorithms and implementations for the non-hydrodynamic aspects of the computation, and in particular to consider whether more of the computation can be performed on the GPU.
Along with computational time, we also note the reduction in memory cost afforded by fast FCM. For instance, the NF = 100 filament simulations with H = 22.72L presented in the previous section would require a grid of approximately 1800 × 1800 × 1800 points if standard FCM were used. Storing the flow field and its Fourier transform on such a grid would require on the order of 280GB of RAM. Even if this memory requirement could be met, fetching data from memory for gridding and interpolation would compromise computational efficiency. With fast FCM, however, the typical grid used in the H = 22.72L simulation contains 128 × 128 × 128 points, requiring less than 1GB of RAM, which is easily accommodated on a single modern GPU.
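The memory estimates above follow from simple arithmetic. The sketch below assumes double precision and treats the Fourier-space field as having the same footprint as the real-space one, which is the right order of magnitude for a three-component transform.

```python
# Back-of-envelope check of the grid memory estimates: a 3-component
# flow field plus its Fourier transform, in double precision, on an
# M x M x M grid.
def grid_memory_gb(M, components=3, bytes_per_real=8):
    real_field = components * M ** 3 * bytes_per_real   # u(x) on the grid
    fourier_field = real_field                          # u-hat(k), same order
    return (real_field + fourier_field) / 1e9

print(round(grid_memory_gb(1800)))   # standard FCM grid: 280 (GB)
print(grid_memory_gb(128) < 1.0)     # fast FCM grid fits in under 1 GB: True
```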

Conclusions and outlook
In this paper, we have introduced fast FCM to accelerate the application of the FCM mobility matrix in periodic domains. The method relies on decoupling the grid spacing from the FCM particle hydrodynamic radius by replacing the FCM Gaussian kernel with one of larger width. The resulting particle velocities from the grid-based computation are then corrected pairwise using analytical expressions to obtain values accurate to within a user-specified tolerance. We show that our specific choice of kernel ensures positive splitting, an important feature for using fast FCM in Brownian simulations, where we expect to compute random particle displacements by combining fluctuating FCM [31,43] for the grid-based computation with the Lanczos algorithm for the pairwise corrections, as in PSE [44]. We also show that fast FCM can be extended to include torques on the FCM particles. We perform a numerical calibration of the method to obtain the optimal values of the modified envelope width and correction cut-off radius for a given tolerance. Using our GPU-based implementation on an Nvidia RTX 2080Ti, we can achieve approximately 7 × 10^6 PTPS. By comparing with the standard FCM computation, we find that fast FCM can, in many cases, accelerate the application of the FCM hydrodynamic mobility matrix by an order of magnitude. The largest gains are achieved for large-scale simulations at low volume fractions, where there are many particles that are on average far apart.
In these situations, fast FCM also avoids the large memory costs associated with performing a large grid-based computation. Finally, we showed that fast FCM can readily be applied as part of a larger, more complex computation for many interesting problems, including rigid body suspension dynamics and the dynamics of flexible filament arrays.
There are several interesting directions in which fast FCM can be further extended. While our general approach to accelerating FCM is similar in spirit to spectral Ewald summation [21] and Positively Split Ewald [22], a key difference is that for fast FCM, the splitting is performed through a modified kernel in real space, rather than through a splitting function in Fourier space. Thus, we are in a position to move beyond periodic domains and consider other boundary conditions, such as no-slip channel walls. Indeed, this kind of approach has been pursued [12,30] for Stokeslets (point forces) in channels using a combination of Fourier and finite-difference methods for the grid-based solver. Enabling fast FCM for channels would require determining mobility corrections that include the hydrodynamic effects of the boundaries. This would likely need to be done numerically to build a look-up table that could then be interpolated, or fit to a convenient functional form that could then be evaluated. We have extended fast FCM to accommodate torques applied to the FCM particles, but it remains to include the stresslets, which are important in the context of suspension rheology [29] and active particle dynamics [45]. Additionally, it will be important to show that the torque extension yields positive splitting, though we note that the full FCM mobility including torques is itself positive definite, owing to the positive definiteness of the Stokes operator and the adjoint relationship between the FCM spreading and interpolation operators. Finally, a nice feature of FCM and other immersed boundary methods is the inclusion of the fluid degrees of freedom, allowing the seamless consideration of other physics, such as interaction with chemical fields [46] or polymer stresses [47]. The use of the modified kernel for the grid-based computation means that we no longer compute the correct flow field. Developing correction schemes for the flow to enable simulation in viscoelastic fluids, for example, along with the other advancements mentioned here, will form the foundation of future work, alongside the application of these results to interesting problems involving microscale fluid-structure interactions.
2. For applying J and J†, we adopt a block-per-particle (BPP) approach. This is one of two main approaches to particle-based gridding with CUDA, the other being thread-per-particle (TPP). In TPP, each thread is assigned one particle and is responsible for spreading the force from, and interpolating the velocity to, that particle over its MG^3 associated grid points. In BPP, each block is assigned one particle and each thread within the block handles the spreading and interpolation for a single grid point. The advantage of BPP is that there is sufficient shared memory to store the 3MG values needed to reconstruct the separable kernel function at the grid points; these values can be accessed by all threads. We use atomic operations when spreading to avoid race conditions (different particles attempting to write to the same grid point). For gathering, we first store the kernel-weighted grid velocities in the register memory of the individual threads and, once all velocities are accounted for, perform a reduction to sum the values from all threads and obtain the particle velocity. This eliminates the need for expensive atomic operations and, as a result, the time for gathering is one third of that for spreading. A more detailed discussion of BPP and TPP gridding algorithms can be found in [48]. In our implementation of fast FCM, BPP always outperforms TPP.
3. To apply L−1, we utilise the parallelised CUDA cuFFT toolkit to perform the FFT and IFFT, assigning one thread per grid point.
4. Finally, for the pairwise correction, each thread is assigned one particle, checks the distances to other particles in the relevant cells, and applies the correction where required.
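As a serial illustration of the gridding step that BPP parallelises, the following sketch spreads a single particle's force onto its MG^3 nearest grid points using a tensor-product (separable) kernel. The Gaussian shape, discrete normalisation, and periodic index wrapping are illustrative simplifications rather than the exact FCM kernel.

```python
import numpy as np

def kernel_1d(x, dx, Sigma, M_G, M):
    """The M_G wrapped grid indices nearest x along one axis, with
    normalised Gaussian weights at those points. These 3*M_G
    one-dimensional values are what BPP keeps in shared memory."""
    first = int(np.floor(x / dx)) - M_G // 2 + 1
    offsets = first + np.arange(M_G)
    w = np.exp(-((offsets * dx - x) ** 2) / (2.0 * Sigma ** 2))
    return offsets % M, w / w.sum()

def spread_force(pos, force, grid, dx, Sigma, M_G):
    """Add one particle's force to grid (shape (M, M, M, 3)), weighted
    by the separable kernel. On the GPU, each thread of the particle's
    block would handle one of the M_G^3 target points, and the write
    would be an atomic add."""
    M = grid.shape[0]
    ix, wx = kernel_1d(pos[0], dx, Sigma, M_G, M)
    iy, wy = kernel_1d(pos[1], dx, Sigma, M_G, M)
    iz, wz = kernel_1d(pos[2], dx, Sigma, M_G, M)
    # reconstruct the 3D kernel from the three 1D factors
    W = wx[:, None, None] * wy[None, :, None] * wz[None, None, :]
    grid[np.ix_(ix, iy, iz)] += W[..., None] * force
```

Interpolation (gathering) is the adjoint operation: the same weights multiply the grid velocities and are summed, which on the GPU becomes the per-block register-and-reduction step described above.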

Figure 1 :
Figure 1: Contour plots showing the fast FCM error as a function of (a) Σ/∆x and MG and (b) Σ/σ and Rc/σ for a random suspension with N = 64457, ϕ = 8% and a/L = 1/150.

Figure 2 :
Figure 2: The computational cost measured in particle time-steps per second (PTPS) as a function of Σ/σ for random suspensions with different ϕ. The computations are performed with ϵ = 10^−4 and a/L = 1/400. Data obtained using our single-precision CUDA implementation of fast FCM run on a single RTX 2080Ti.

Figure 3 :
Figure 3: (a) The ratio of PTPS for fast FCM to that for standard FCM as a function of ϕ for different a/L and ϵ = 10 −4 .Timings are obtained by averaging 50 computations after an initial GPU warm-up period.(b) ϕc as a function of a/L.The solid line shows ϕc = −9.24(a/L)+ 0.22 which is determined from a linear fit to the data.(c) The same as (a), but with ϵ = 10 −6 .(d) The same as (b), but with ϵ = 10 −6 .
(a) Illustration of the sedimentation of 2304 rods with L = 960a.

Figure 5 :
Figure 5: (a) Snapshot of the simulation with 7744 rods at t = 339.64T. (b) Image showing the rod speeds at t = 339.64T. (c) Snapshot of the simulation with 7744 rods at t = 4992.81T. (d) Image showing the rod speeds at t = 4992.81T.

Figure 6 :
Figure 6: Dynamics of a single filament in the periodic domain for different H. Below H = 29.54L, the filament exhibits planar beating (∎), while above H = 29.54L whirling ( ) is observed.

Figure 7 :
Figure 7: Coordinated motion of 100 filaments.The black curves indicate the projection in the xy-plane of the filament tip trajectories over one period.The red arrows depict the tip velocities in the final snapshot.Along with the synchronised beating (∎) and whirling ( ) exhibited in the single filament case we also observe phase-shifted beating in rows (▼), and patches of coordinated, non-planar motion (☀).

Figure 8 :
Figure 8: Wall time per Broyden iteration and iterations per time step for (a) rod simulations with L = 3840a using a grid of size 512 × 512 × 8 and Rc/Σ = 6.0, and (b) filament simulations in a domain of dimension 5760a × 5760a × 360a using a grid of size 256 × 256 × 16 and constant Rc/Σ = 6.0. In both plots, the solid black lines show the wall time per iteration, the dotted black lines show the wall time for fast FCM per iteration, and the dashed blue lines show the average number of Broyden iterations required at each timestep.

Table 1 :
The minimum values of Σ/∆x and M G needed to achieve the error tolerance, ϵ, for a random suspension with N = 64457, ϕ = 8% and a/L = 1/150.