A Parallel and Broadband Helmholtz FMBEM Model for Large-Scale Target Strength Modeling

The Fast Multipole Boundary Element Method (FMBEM) reduces the O(N^2) computational and memory complexity of the conventional BEM discretized with N boundary unknowns to O(N log N) and O(N), respectively. A number of massively parallel FMBEM models have been developed in the last decade or so for CPU, GPU and heterogeneous architectures, which are capable of utilizing hundreds of thousands of CPU cores to treat problems with billions of degrees of freedom (dof). On the opposite end of this spectrum, small-scale parallelization of the FMBEM to run on the typical workstation computers available to many researchers allows for a number of simplifications in the parallelization strategy. In this paper, a novel parallel broadband Helmholtz FMBEM model is presented, which utilizes a simple columnwise distribution scheme, element reordering and rowwise compression of data to parallelize all stages of the fast multipole method (FMM) algorithm with minimal communication overhead. The sparse BEM near-field and sparse approximate inverse preconditioner are also constructed and executed in parallel, while the flexible generalized minimum residual (fGMRES) solver has been modified to apply the FMBEM matrix-vector products and corresponding minimum residual convergence within the parallel environment. The algorithmic and memory complexities of the resulting parallel FMBEM model are shown to reaffirm the above estimates for both the serial and parallel configurations. The parallel efficiency (PE) of the FMBEM matrix-vector products and fGMRES solution for the present model is shown to be satisfactory, achieving PEs of up to 92.3% and 74.1% in the fGMRES solution using 3 and 6 CPU cores, respectively, when applied to models having > 10^4 dof per CPU core.
The PE of the precalculation stages of the FMBEM (in particular the FMM precomputation stage, which is largely unparallelized) reduces the overall PE of the FMBEM model, resulting in average efficiencies of 68.3% and 47% for the 3-core and 6-core models when treating problems with > 10^4 dof per CPU core.


The Helmholtz BEM
The Helmholtz BIE for acoustic scattering exterior to a closed boundary surface S is

(1/2) p(y) = p^inc(y) + ∫_S G(x, y) q(x) dS(x) − ∫_S [∂G(x, y)/∂n(x)] p(x) dS(x), (1)

where p is the total surface pressure, q = ∂p/∂n is its normal derivative, n is the surface normal vector (pointing outward from S), p^inc is the incident acoustic field, and x and y are points on S (which is assumed to be locally smooth at y). The 3D Helmholtz Green's function is

G(x, y) = e^{ik|x − y|} / (4π|x − y|), (2)

where k = 2πf/c is the wavenumber, f is the frequency, c is the sound speed and i = √−1. Equation (1) will break down at frequencies corresponding to the eigenfrequencies of the adjoint interior problem. 40 The Burton-Miller formulation avoids this breakdown by linearly combining Eq. (1) with its normal derivative at y, yielding Eq. (3), where q^inc is the normal derivative of the incident acoustic field, α is the Burton-Miller coupling coefficient, and the L, M, L′ and M′ integral operators are defined as follows:

L q(y) = ∫_S G(x, y) q(x) dS(x),
M p(y) = ∫_S [∂G(x, y)/∂n(x)] p(x) dS(x),
L′ q(y) = ∂/∂n(y) ∫_S G(x, y) q(x) dS(x),
M′ p(y) = ∂/∂n(y) ∫_S [∂G(x, y)/∂n(x)] p(x) dS(x).
Equation (3) will provide a unique solution at all frequencies provided Im(α) ≠ 0; α is chosen here to be proportional to i/k, as recommended by Marburg. 43 The conventional BEM discretizes Eq. (3) using (for example) the collocation or Galerkin methods, 44 to yield a set of N equations involving N boundary unknowns. Both the computational and memory requirements for the iterative solution of the resulting matrix equation are proportional to O(N^2), as the BEM coefficient matrix is fully populated. This is a result of the BEM explicitly calculating every pairwise interaction between the unknowns on the discretized boundary mesh. The FMBEM instead treats interactions between well-separated groups of elements, which alleviates the need to explicitly construct the full coefficient matrix: the matrix-vector product is calculated directly. As a result, the FMBEM has a computational complexity of between O(N^1.5) and O(N log N) (depending on the FMM algorithm and frequency range of interest) and a memory complexity of O(N). 39

The Broadband FMBEM Algorithm
The present broadband FMBEM model employs a switch between the functional representations of the multipole expansions in the low- and high-frequency regimes to provide a stable solution at all frequencies. In the low-frequency regime, the spherical basis functions 20,39 and the rotation-coaxial translation-rotation (RCR) method 8 are used for the multipole expansions and translations, respectively, while in the high-frequency regime the far-field signature functions and diagonal translation methods 7 are employed. The forward and inverse spherical transforms are used to convert between the low- and high-frequency representations. Figure 1 shows a schematic of the present broadband FMBEM algorithm, compared to a reference implementation. 20 The following sections present an overview of the theory for the broadband FMBEM model in both the low- and high-frequency regimes, as well as the transformation between the functional representations of the multipole expansions via the spherical transform. The choice of the truncation number for the multilevel FMBEM and the corresponding algorithmic cost of the various components of the broadband model are also discussed.

Low-frequency FMBEM
The Helmholtz Green's function, Eq. (2), can be expressed as the sum of two independent series expansions constructed around the 'expansion center' c, provided that the source and receiver points meet the 'well-separated' criterion |y − c| < |x − c|. 39 Here R and S denote the regular and singular spherical basis functions, which are defined in the spherical coordinate system r = r(sin θ cos φ, sin θ sin φ, cos θ) as

R_n^m(r) = j_n(kr) Y_n^m(θ, φ), S_n^m(r) = h_n(kr) Y_n^m(θ, φ), (9)

where j_n(kr) and h_n(kr) are, respectively, the spherical Bessel and Hankel functions of the first kind, Y_n^m(θ, φ) is the spherical harmonic function

Y_n^m(θ, φ) = (−1)^m sqrt[ (2n + 1)/(4π) · (n − |m|)!/(n + |m|)! ] P_n^{|m|}(cos θ) e^{imφ}, (10)

and P_n^{|m|} is the associated Legendre function. The degree n and order m have the integer ranges n ≥ 0 and −n ≤ m ≤ n, respectively. 39 Truncation of the infinite series expansions in Eq. (8) determines the accuracy of the FMM algorithm. Assuming a truncation number p, the L, M, L′ and M′ integral terms in Eq. (3) between two well-separated plane triangular elements containing the collocation points x and y at their centers can be approximated, via Eqs. (11)-(14), using piecewise-constant unknowns, where A is the area of the boundary element E to be integrated, and [ ]_{n(x)} and [ ]_{n(y)} denote the surface normal derivatives of the (S, R) spherical basis functions 39 at x and y, respectively. The FMBEM treats element integrals between well-separated 'far-field' elements by constructing multipole expansions for local element groups about common expansion centers. Interactions between element groups via Eqs. (11)-(14) are applied by shifting or 'translating' the expansion centers to coincide. In the multilevel FMBEM, a hierarchy of element groups with different sizes on each level is constructed using an octree structure. Elements which are spatially close to one another (i.e. not considered well-separated on the lowest level of the octree) cannot be treated with the FMM.
These 'near-field' integrals are directly calculated and stored in a sparse BEM matrix using appropriate numerical quadrature rules for the regular, 45 near-singular 46 and singular 47 integrals.
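The basis functions in Eqs. (9) and (10) can be evaluated directly from their definitions. The following Python sketch (the paper's implementation is in MATLAB; every function name here is illustrative) computes j_n, h_n and Y_n^m via standard upward recurrences, assuming the orthonormal spherical harmonic convention with the Condon-Shortley phase:

```python
import math, cmath

def sph_jn(n, x):
    # Spherical Bessel function j_n via upward recurrence (adequate for n << x).
    j0 = math.sin(x) / x
    if n == 0:
        return j0
    j1 = math.sin(x) / x**2 - math.cos(x) / x
    for k in range(1, n):
        j0, j1 = j1, (2 * k + 1) / x * j1 - j0
    return j1

def sph_yn(n, x):
    # Spherical Bessel function of the second kind y_n (stable upward recurrence).
    y0 = -math.cos(x) / x
    if n == 0:
        return y0
    y1 = -math.cos(x) / x**2 - math.sin(x) / x
    for k in range(1, n):
        y0, y1 = y1, (2 * k + 1) / x * y1 - y0
    return y1

def sph_hn(n, x):
    # Spherical Hankel function of the first kind: h_n = j_n + i*y_n.
    return complex(sph_jn(n, x), sph_yn(n, x))

def assoc_legendre(n, m, x):
    # Associated Legendre function P_n^m(x) for m >= 0, standard recurrences.
    pmm = 1.0
    if m > 0:
        pmm = (-1) ** m * math.prod(range(1, 2 * m, 2)) * (1 - x * x) ** (m / 2)
    if n == m:
        return pmm
    pmmp1 = x * (2 * m + 1) * pmm
    if n == m + 1:
        return pmmp1
    for k in range(m + 2, n + 1):
        pmm, pmmp1 = pmmp1, ((2 * k - 1) * x * pmmp1 - (k + m - 1) * pmm) / (k - m)
    return pmmp1

def Ynm(n, m, theta, phi):
    # Orthonormal spherical harmonic Y_n^m, Eq. (10)-style normalization.
    norm = math.sqrt((2 * n + 1) / (4 * math.pi)
                     * math.factorial(n - abs(m)) / math.factorial(n + abs(m)))
    y = ((-1) ** abs(m) * norm * assoc_legendre(n, abs(m), math.cos(theta))
         * cmath.exp(1j * abs(m) * phi))
    return y if m >= 0 else (-1) ** m * y.conjugate()

def R_basis(n, m, k, r, theta, phi):
    # Regular basis function R_n^m = j_n(kr) * Y_n^m, Eq. (9).
    return sph_jn(n, k * r) * Ynm(n, m, theta, phi)

def S_basis(n, m, k, r, theta, phi):
    # Singular basis function S_n^m = h_n(kr) * Y_n^m, Eq. (9).
    return sph_hn(n, k * r) * Ynm(n, m, theta, phi)
```

The upward recurrences are only well-conditioned for n below roughly kr; a production code would switch to downward (Miller-type) recursion for high degrees.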

Truncation of the multipole expansions
In the multilevel FMBEM, the truncation number p is selected on each octree level based on the size of the corresponding octree boxes. Several methods for determining p based on either empirical formulas 6,48 or theoretical error bounds 39,49,50 have been proposed. The present broadband model selects p using the formula proposed in Ref. 20, Eq. (15), where σ = 2, ε is the specified accuracy of the multipole expansion and a_l is half the diagonal box length on octree level l. Truncating the expansions in Eq. (8) at degree n = p − 1 yields p^2 (n, m) coefficients.
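Eq. (15) is not reproduced above, so the sketch below uses a widely cited semi-empirical "excess bandwidth" heuristic of the same family (truncation growing like ka_l plus a fractional-power correction in ka_l and the tolerance), not the paper's exact formula; the function names and the level-by-level halving of a_l are illustrative assumptions:

```python
import math

def truncation_number(k, a_l, eps):
    # Semi-empirical excess-bandwidth heuristic common in the FMM literature
    # (NOT the paper's Eq. (15)):
    #   p = k*a_l + 1.8 * log10(1/eps)^(2/3) * (k*a_l)^(1/3)
    ka = k * a_l
    d0 = math.log10(1.0 / eps)
    return math.ceil(ka + 1.8 * d0 ** (2.0 / 3.0) * ka ** (1.0 / 3.0))

def levelwise_truncation(k, a_2, eps, max_level):
    # The box half-diagonal a_l halves with each finer octree level,
    # so p decreases monotonically from level 2 down to the leaf level.
    return {l: truncation_number(k, a_2 / 2 ** (l - 2), eps)
            for l in range(2, max_level + 1)}
```

The key qualitative behaviour matches the text: larger boxes (lower level numbers) and tighter tolerances both demand larger p.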

Translation of the spherical basis functions
In the low-frequency regime, the translation of the expansion center for an established set of S or R coefficients is conducted using the R | R, S | S and S | R translations, where r and t are the respective expansion and translation vectors, and |r| > |t| and |r| < |t| hold for Eqs. (17) and (18), respectively. 39 Finally, it should be noted that the RCR translation method automatically incorporates the interpolation and anterpolation operations required to vary the truncation number p between octree levels in the multilevel FMBEM.

High-frequency FMBEM
The Helmholtz Green's function, Eq. (2), can be equivalently expressed in the high-frequency regime using the far-field signature function defined on the unit sphere S^2, Eq. (19), where the vector y − x is separated as c_1 − x and y − c_2 for the well-separated expansion centers c_1 and c_2. (F | F)(c_2 − c_1) is the multipole-to-local translation of the far-field signature function (equivalent to the S | R translation in Sec. 3.1.3), and s is an arbitrary location on the sphere. The far-field signature functions thus take the form of Eq. (20), and analogous expressions to Eqs. (11)-(14) can be constructed with Eq. (20). The broadband FMBEM instead converts the spherical basis functions to the far-field signature functions by applying the forward spherical filter on the switching octree level between the low- and high-frequency regimes (see Fig. 1). Conversely, the inverse spherical filter transforms the signature functions to spherical basis functions in the downward pass of the FMM. Numerical integration of the signature function in Eq. (19) is applied on S^2 using grid points defined in the angular spherical coordinates (θ_j, φ_k) for j = 1, . . . , p and k = 1, . . . , 2p. The grid points have the angular positions

θ_j = cos^{-1}(t_j), φ_k = (k − 1)π/p,

and Gauss weights

w_j^θ = w_j, w_k^φ = π/p,

where (t_n, w_n) are the points and weights of the n-point Gaussian quadrature rule. The quadrature weight at each spherical grid point is thus w_j^θ w_k^φ.
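The Gauss-uniform product grid described above is straightforward to construct. A minimal Python sketch (illustrative, not the paper's MATLAB code) builds the p-point Gauss-Legendre rule by Newton iteration on the Legendre polynomial and assembles the (θ_j, φ_k) grid with weights w_j·π/p; summing all weights reproduces the sphere surface area 4π:

```python
import math

def gauss_legendre(p):
    # Nodes/weights of the p-point Gauss-Legendre rule on [-1, 1],
    # found by Newton iteration on the Legendre polynomial P_p.
    nodes, weights = [], []
    for i in range(1, p + 1):
        t = math.cos(math.pi * (i - 0.25) / (p + 0.5))  # classical initial guess
        for _ in range(100):
            # Evaluate P_p(t) and its derivative by upward recurrence.
            p0, p1 = 1.0, t
            for n in range(2, p + 1):
                p0, p1 = p1, ((2 * n - 1) * t * p1 - (n - 1) * p0) / n
            dp = p * (t * p1 - p0) / (t * t - 1.0)
            dt = p1 / dp
            t -= dt
            if abs(dt) < 1e-15:
                break
        nodes.append(t)
        weights.append(2.0 / ((1.0 - t * t) * dp * dp))
    return nodes, weights

def sphere_grid(p):
    # (theta_j, phi_k) quadrature grid on S^2 with weight w_j * (pi/p):
    # j = 1..p Gauss points in cos(theta), k = 1..2p uniform points in phi.
    t, w = gauss_legendre(p)
    return [((math.acos(tj), (k - 1) * math.pi / p), wj * math.pi / p)
            for tj, wj in zip(t, w) for k in range(1, 2 * p + 1)]
```

Since the Gauss weights sum to 2 and the 2p azimuthal points each carry weight π/p, the total weight is 2 · 2p · π/p = 4π, as required for integrating the constant function over the sphere.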

The spherical transform and signature function filtering
The forward spherical transform 18,51 is used to convert the spherical basis function coefficients (F = S, R) to the far-field signature function in two steps: a Legendre transform over θ, Eq. (21), which yields the intermediate coefficients β_j^m, followed by a Fourier sum over φ,

f(θ_j, φ_k) = Σ_{m=−n}^{n} β_j^m e^{imφ_k}, j = 1, . . . , p, k = 1, . . . , 2p, (22)

where Eq. (21) has a computational cost of O(p^3) and Eq. (22) is applied using the FFT in O(p^2 log p) operations. The inverse spherical transform is applied analogously. 52 Filtering based on the nonuniform FFT (NFFT) 52 also reduces the memory storage requirements of the spherical filters from O(p^3) to O(p): a significant reduction for large-p filters. Furthermore, these filters must be replicated across workers in the parallel model. Thus, in the present formulation the NFFT filter is utilized for p ≥ 250.

Translation of the far-field signature functions
Translations of the far-field signature functions between octree levels (analogous to the S | S and R | R translations) can be applied for an arbitrary translation vector c_2 − c_1 using a diagonal operation with an O(p^2) algorithmic complexity. The F | F i and F | F f diagonal translations between octree levels are thus applied in the upward pass by first interpolating the signature functions on the current level l and then translating to the 'parent' level l − 1, and in the downward pass by translating to the 'child' level l + 1 and then anterpolating. The F | F multipole-to-local diagonal expansions, (F | F)(c_2 − c_1) in Eq. (19), can be expressed 7 as a truncated series in the Legendre polynomials P_n of degree n, where the truncation number should be selected as p* = 2p to translate a signature function with degree p. 18,20 The F | F translations thus require a (2p) × (4p) grid of sample points, which can again be constructed via spherical filtering. In the present model the truncation numbers above the switching octree level are set to 2p, to minimize the number of O(p^3) filtering operations in the high-frequency regime. The F | F translations can then be applied without having to resample to/from the finer translation grid. The disadvantage of this approach is that the F | F i and F | F f translations between octree levels are also applied on an over-sampled grid, i.e. using a truncation number of 2p when p would suffice. However, the cost of applying these translations is O(p^2) compared to the O(p^3) filtering operations, and so the reduction in the number of filtering operations is prioritized to minimize the overall cost of the broadband FMBEM. The F | F diagonal expansion is applied to the far-field signature functions by pointwise multiplication on the quadrature grid.

Parallelization Scheme for the FMBEM
The main computational work of the standard BEM is due to the repeated matrix-vector products applied during the iterative solution. As the BEM explicitly builds and stores the full coefficient matrix, an obvious distribution scheme to parallelize the matrix-vector product using a number of CPU cores, or 'workers', is to distribute the BEM coefficient matrix as contiguous blocks of rows across the workers, and replicate the vector across all workers, as shown in Fig. 2(a). The output vector from this parallel operation is distributed across the workers in blocks and does not require any further inter-worker communication (e.g. 'gathering operations' across workers) to recover the distributed solution vector.
The FMBEM only explicitly constructs and stores the vector of unknowns, not the full coefficient matrix, and so the matrix distribution scheme cannot be directly applied. An alternative scheme for the FMBEM is to instead distribute the vector of unknowns. Each worker may then apply the FMM algorithm in parallel to the section of the vector stored locally on that worker. This scheme necessitates a columnwise distribution of the coefficient matrix (see Fig. 2(b)), which results in a full vector being output on each worker. Components from each of these vectors must then be gathered onto the correct worker in a subsequent inter-worker communication step to reconstruct the distributed matrix-vector product. The rowwise distribution requires the full vector of unknowns to be replicated across all n workers, while the columnwise distribution only requires a 1/n fraction of the vector to be transferred to each worker. However, in the subsequent inter-worker communication stage, each worker must both send and receive an (n − 1)/n fraction of the vector. Thus, the communication cost of the columnwise distribution is proportional to (2n − 1)/n, compared to the n-proportionality of the rowwise distribution scheme.
In this work, a novel parallelization strategy for the broadband FMBEM algorithm, based on the columnwise distribution of the equivalent FMBEM matrix, is presented. This strategy utilizes element reordering and rowwise compression of sparse vectors to minimize the inter-worker communication. The FMBEM constructs local element groups to apply the multipole interactions, but the order of the elements within the full element list is arbitrary. Ordering the elements according to their octree box numbers (equivalent to the three-dimensional Morton order 53 ) yields a diagonally banded structure in the equivalent BEM matrix. An example of the element reordering for a spherical mesh, and the corresponding structure in the FMBEM sparse near-field matrix, is shown in Fig. 3.
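The octree box numbering used for the reordering amounts to interleaving the bits of the integer box coordinates. The sketch below (Python rather than the paper's MATLAB; `reorder_elements` and the uniform `box_size` are illustrative assumptions) shows the Morton encoding and the resulting element sort:

```python
def morton3d(i, j, k, bits=21):
    # Interleave the bits of the box grid coordinates (i, j, k); sorting
    # boxes by this key gives the three-dimensional Morton (Z-curve) order.
    code = 0
    for b in range(bits):
        code |= (((i >> b) & 1) << (3 * b + 2)
                 | ((j >> b) & 1) << (3 * b + 1)
                 | ((k >> b) & 1) << (3 * b))
    return code

def reorder_elements(elements, box_size):
    # Sort element centroids (x, y, z) by the Morton code of the
    # lowest-level octree box containing them.
    def key(el):
        x, y, z = el
        return morton3d(int(x / box_size), int(y / box_size), int(z / box_size))
    return sorted(elements, key=key)
```

Because spatially adjacent boxes receive nearby Morton codes, elements interacting through the near field end up close together in the reordered list, which is what produces the diagonal banding of Fig. 3.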
The diagonal banding provided by the element reordering reduces the amount of inter-worker communication required in the columnwise distribution scheme. In this case the parallel matrix-vector product yields sparsely populated vectors (Fig. 4(a)) which can be row-compressed to minimize the output vectors on each worker (Fig. 4(b)). As a result, the main component of each vector is located on the correct worker and only a small component must be transferred between the workers.
The bandwidth of the equivalent FMBEM matrix varies with the octree level, where the lower octree levels involving small element groups have the narrowest bandwidth while larger element groups at higher octree levels are separated by larger distances and so will be located further from the main diagonal. The reduction in the inter-worker data transfer afforded by the element reordering thus becomes less effective as the octree level increases.
The following subsections describe the main stages of the parallel FMM algorithm (upward pass, downward pass and final summation), iterative solution and precalculation procedures for the parallel implementation. Finally, an overview of the parallelization strategy which considers the load balancing of the workers for the different components of the broadband FMM algorithm is discussed. Full details on the implementation of the FMM for the Helmholtz equation can be found in, for example, the textbooks on the method. 39,55

Parallel FMBEM : Upward pass
The S-type spherical basis functions are calculated for every element in parallel using an approximately equal partition. The exact distribution is selected to ensure that (a) all elements contained within the same box on the lowest octree level l reside on the same worker, and (b) each set of boxes with the same parent box resides on the same worker (Fig. 5(a)). The expansions are first multiplied by the vector of unknown coefficients and combined into a single expansion set per octree box. Each set of expansions is then S | S translated to the center of the larger parent boxes on level l − 1 and again combined into a single set per parent octree box. At this stage the distribution of expansion sets across the workers may be somewhat nonuniform. An inter-worker data transfer is thus applied to rebalance the number of S expansion sets across the workers. Again, the repartition of the data is chosen to ensure that the sets of level l − 1 box expansions with a common parent box on level l − 2 are retained on the same worker. This process is repeated up to octree level 2, yielding an approximately equal distribution of S expansion sets on every octree level. The parallel implementation of the upward pass follows the same procedure in both the low- and high-frequency regimes using the spherical basis functions and far-field signature functions, respectively. At the switching octree level the spherical transform is applied before the S | S translations to the parent octree level, as indicated in Fig. 1(b).
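The sibling-preserving partition described in (a) and (b) can be sketched as follows: starting from equal cuts of the Morton-ordered box list, each cut is advanced until no group of boxes sharing a parent (same code shifted right by 3 bits) is split across two workers. This is an illustrative Python sketch, not the paper's implementation:

```python
def partition_boxes(codes, n_workers):
    """Split Morton-ordered leaf-box codes into contiguous chunks, one per
    worker, shifting each cut forward so that all siblings (codes with the
    same parent, i.e. the same value of code >> 3) land on one worker."""
    cuts = [0]
    for w in range(1, n_workers):
        cut = w * len(codes) // n_workers          # equal-share starting cut
        while 0 < cut < len(codes) and codes[cut] >> 3 == codes[cut - 1] >> 3:
            cut += 1                               # advance past split siblings
        cuts.append(max(cut, cuts[-1]))
    cuts.append(len(codes))
    return [codes[cuts[w]:cuts[w + 1]] for w in range(n_workers)]
```

Keeping siblings together means each S | S translation to a parent box can be completed locally, so the only upward-pass communication is the rebalancing transfer described in the text.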

Parallel FMBEM : Downward pass
The downward pass of the FMM algorithm applies the interactions between all well-separated boxes (element groups) on every octree level, beginning at l = 2. The S expansion sets for all well-separated 'source' boxes are S | R translated to each 'receiver' box; the translations also transform the S-type expansions to R expansions. The R expansions from all well-separated boxes are each summed into a single set per receiver box and then R | R translated from the parent box centers on level l to each of the child boxes on level l + 1. This process is then repeated on subsequent octree levels, but with the additional step that the sets of R expansions from the parent octree level l are first combined with the R expansions on the current level l + 1 before applying the R | R translations to level l + 2.
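For a uniform octree, the well-separated source boxes of a given receiver box form the standard interaction list: children of the parent's near neighbours that are not near neighbours of the box itself, giving up to 6^3 − 3^3 = 189 boxes per receiver away from the domain boundary. A small illustrative Python sketch:

```python
from itertools import product

def interaction_list(box):
    """Well-separated source boxes for `box` on one octree level.

    A box is indexed by its integer grid coordinates (i, j, k) on that
    level. The interaction list contains the children of the parent's
    near neighbours that are not near neighbours of the box itself.
    """
    def parent(b):
        return tuple(c // 2 for c in b)

    def children(b):
        return [tuple(2 * c + d for c, d in zip(b, off))
                for off in product((0, 1), repeat=3)]

    def near(b):
        return {tuple(c + d for c, d in zip(b, off))
                for off in product((-1, 0, 1), repeat=3)}

    candidates = {c for nb in near(parent(box)) for c in children(nb)}
    return sorted(candidates - near(box))
```

This count of 189 S | R translations per receiver box is the figure that the translation stencils of Sec. 4 reduce.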
In the parallel algorithm each worker only stores part of the full set of box S expansions on each octree level, whereas the well-separated interactions may involve source boxes which are stored across several workers. However, the element reordering procedure ensures that boxes which are spatially close to one another will generally reside on the same worker, as shown in Fig. 6(a). Therefore, in the parallel model each worker applies all S | R translations and summations for the set of source boxes residing on that worker to all receiver boxes on the octree level, including those residing on other workers. An inter-worker communication step is then applied to transfer the partial sets of summed R expansions calculated on each worker to the receiver boxes residing on other workers (see Fig. 6(b)). After the communication step each worker contains the complete sets of R expansions for all of the boxes residing on that worker. The R | R translations are then applied on each worker and another data transfer is applied to reverse that conducted in the upward pass (see Fig. 5(b)). This reversal ensures that the data partition of the level l R expansions will match that on the octree level below (l + 1), determined by the partition of the S expansions, and so the expansions can be combined in the subsequent stage of the downward pass.
The parallel implementation of the downward pass in the broadband model follows the same procedure for both the low-and high-frequency regimes. At the switching octree level the far-field signature functions are converted to the spherical basis functions before applying the downward R | R translations to the child octree level (see Fig. 1).
The present model employs translation stencils 26 in both the low- and high-frequency regimes for sets of eight, four and two receiver boxes, to reduce the number of applied translations from 189 S | R to 71 S | R and 4 R | R translations per receiver box. 54 The required R | R translations for the stencils are applied on the same octree level (from the center point of the local receiver group to each of the individual boxes), and so do not require any resampling operations. To maximize the computational savings from the eight-box stencils the S | R translations are applied on the parent octree level and then combined with the parent-level R expansions before applying the R | R translations to the child octree level as part of the standard procedure for the downward pass. In the low-frequency regime the spherical basis functions with different truncation numbers (i.e. on the parent/child octree levels) can be directly combined without requiring any additional operations. In the high-frequency regime the far-field signature functions must be filtered to combine the signature functions with differing truncation numbers. Therefore, in the present model the eight-box stencils in the high-frequency regime are applied on the current octree level and then R | R translated from the center of the eight-box receiver group to each box. This increases the cost of the translation stencils in the high-frequency regime to 71 S | R and 5 R | R translations per receiver box but avoids additional O(p^3) filtering operations.

Parallel FMBEM : Final summation
At the conclusion of the downward pass the R expansions for each box on the lowest octree level represent the integrals for all boundary elements except for those residing in the nearest neighbor boxes. Elements contained within these boxes are directly applied via the conventional BEM (i.e. as explicit pairwise element interactions). In the present model, these element integrals are precalculated and stored in a distributed sparse matrix using the columnwise distribution and row compression scheme indicated in Fig. 4(b). Thus, the final summation proceeds by applying Eqs. (11)-(14) using the final sets of R expansions for each receiver box and the corresponding S expansions for each element contained within the box, to recover the far-field part of the discretized boundary integrals. The near-field element integrals are calculated on each worker via a parallel sparse matrix-vector product and added to the far-field result. The final result of the FMBEM procedure is the equivalent matrix-vector product between the BEM coefficient matrix and the vector of unknowns.
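The columnwise distribution with row compression can be illustrated with a toy sparse matrix-vector product in Python (the names, dict-based sparse storage and the in-process "communication" counter are all illustrative stand-ins for the paper's MATLAB parallel implementation):

```python
def parallel_matvec(A, x, n_workers, N):
    """Columnwise-distributed sparse matrix-vector product.

    A: dict {(row, col): value}; x: full input vector of length N.
    Each worker owns a contiguous block of columns (and the matching
    block of rows of the output). Partial products are stored
    row-compressed (nonzero rows only), so only the entries destined
    for other workers need to cross a worker boundary.
    """
    chunk = (N + n_workers - 1) // n_workers
    owner = lambda idx: idx // chunk
    # 1. Each worker forms a row-compressed partial product from its columns.
    partials = [dict() for _ in range(n_workers)]
    for (r, c), v in A.items():
        p = partials[owner(c)]
        p[r] = p.get(r, 0.0) + v * x[c]
    # 2. "Communication": route off-worker rows to the owning worker.
    y = [0.0] * N
    traffic = 0
    for w, p in enumerate(partials):
        for r, v in p.items():
            y[r] += v
            if owner(r) != w:
                traffic += 1  # entry crossing a worker boundary
    return y, traffic
```

For a diagonally banded matrix (such as the reordered near-field matrix of Fig. 3), almost every partial-product row already lives on its owning worker, so `traffic` stays small relative to N.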

Parallel FMBEM : Iterative solution
The FMBEM can be combined with an iterative solution method to solve large-scale acoustic problems. The GMRES method has been shown to be an effective iterative solver for acoustic problems modeled with both the BEM 56 and FMBEM. 57 The present model uses a flexible GMRES 58 solver, where a variable preconditioner is applied in each iteration of the solution.
The fGMRES is implemented in the present model as follows: for each iteration of the main 'outer' GMRES, an inner GMRES first solves the problem using the current iteration of the solution vector from the outer GMRES. The resulting 'preconditioned' solution vector is then used in the next outer GMRES iteration. The inner GMRES uses a fast low-accuracy FMBEM 20 which is solved to a coarse convergence tolerance, and so the inner GMRES acts as a fast approximate preconditioner. Conversely, the outer GMRES is solved to a fine convergence tolerance using the full accuracy fast multipole expansions. Furthermore, the inner GMRES loop is also preconditioned using a sparse approximate inverse (SAI) preconditioner 59 constructed from the sparse matrix of near-field integrals. The SAI preconditioner approximates the inverse of the BEM sparse near-field matrix by calculating a local inverse for each boundary element using the BEM coefficients for that element and its geometrical neighboring elements. This results in a small system of coefficients for each boundary element which can be directly inverted using standard inverse techniques for a small computational cost. The vectors of matrix inverse coefficients for each element are then assembled into a global sparse preconditioner (SAI) matrix.
The present parallel fGMRES solver replicates the main GMRES operations across the workers to locally update the distributed solution vector on each worker with minimal inter-worker communication. The only part of the GMRES algorithm which requires inter-worker communication is the calculation of the Euclidean norm of the solution vector in each iteration. This is calculated in parts on each worker by taking the inner product of the complex solution (stored in parts across the workers), gathering the results (a single complex number per worker) into a vector replicated on every worker, and then calculating the magnitude of this vector. The other stages of the GMRES solution (modified Gram-Schmidt orthonormalization, QR decomposition, etc.) are then calculated in parallel across the workers for each section of the distributed solution vector. Similarly, the SAI preconditioner matrix is precalculated in parallel and stored as a distributed sparse matrix (using the columnwise distribution and row compression scheme indicated in Fig. 4(b)) to apply the SAI matrix-vector product in parallel. This approach minimizes the inter-worker communication required during the fGMRES solution, as the distributed components of the solution vector do not need to be copied to and from the host worker in each outer/inner GMRES iteration. Furthermore, the initialization of the parallel calculation environment must only be done once for the entire fGMRES solution. Once the outer GMRES has converged to the specified convergence tolerance, the final distributed solution vector can be gathered on the host worker for post-processing. It has been shown 60 that redundantly replicating the serial operations of the fGMRES can reduce the PE. This is in contrast to the present model, where replicating these operations yielded an improved PE compared to the additional communication overhead (and corresponding reduction in PE) required to parallelize all stages of the fGMRES.
These differences are likely due to the relative communication overheads associated with the different programming languages.
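The norm calculation described above, the only communication step in each GMRES iteration, amounts to gathering one partial inner product per worker. A minimal Python sketch, with the list of chunks standing in for genuinely distributed vector slices:

```python
import math

def parallel_norm(chunks):
    # Each "worker" holds a slice of the distributed complex solution
    # vector and computes a local sum of squared magnitudes; the n
    # scalars are then gathered and combined into the global 2-norm.
    partial = [sum(abs(z) ** 2 for z in chunk) for chunk in chunks]  # local work
    return math.sqrt(sum(partial))                                   # gather step
```

Only one scalar per worker crosses worker boundaries, which is why this step contributes negligibly to the communication cost of the solver.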
Finally, it should be noted that the standard GMRES solver algorithm (as implemented here) will break down for certain problem configurations which yield successive Krylov vectors with rapidly increasing magnitudes 61 and thus a poorly conditioned Krylov subspace. Improved GMRES methods have been developed to alleviate this issue, 61 and this is an obvious avenue for future development of the present model.

Parallel FMBEM : Precalculation stage
The precalculation stage of the parallel FMBEM model constructs the octree structure, multipole expansions and translation coefficients for each level of the octree. The sparse distributed near-field BEM and SAI matrices are also precalculated and stored.
In the present model the S and R expansions and the normal derivative of the S-type expansions are precalculated and stored in parallel across the workers. The octree structure is then constructed on the host worker (i.e. as a serial process), and both the level-dependent truncation numbers and the switching octree level between the low- and high-frequency regimes are determined by the problem frequency and the specified tolerances of the full and preconditioner multipole expansions. 20 The sets of unique S | S and R | R multipole translation coefficients (totalling 8 unique sets per octree level, from the children to the parent box center and vice versa) are calculated on the host and transferred to each worker. Similarly, the full sets of unique S | R translations are calculated for each octree level on the host and replicated to each worker. There are 316 unique S | R translation vectors per octree level in the complete set of well-separated boxes. The translation stencils increase the total number of uniquely defined translation vectors to several hundred, as the centers of the eight-, four- and two-box groups can be located at the corners, edges or side faces of the different groups of receiver boxes. In the low-frequency regime the sets of RCR translation coefficients are only calculated and stored for the full accuracy FMBEM, while a subset of these coefficients can be directly utilized for the smaller truncation numbers in the preconditioner FMBEM. In the high-frequency regime separate sets of translation coefficients must be calculated and stored for the full accuracy and preconditioner FMBEM. Finally, both the sparse matrix of near-field integrals and the SAI preconditioner matrix are calculated in parallel across the workers and stored in the column-distributed and row-compressed format.
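The count of 316 unique S | R translation vectors follows from the geometry of the well-separated set: all integer box offsets in {−3, . . . , 3}^3 except the 3^3 near-field offsets, i.e. 7^3 − 3^3 = 343 − 27 = 316. This can be checked directly (illustrative sketch):

```python
from itertools import product

def unique_SR_vectors():
    # Integer offsets between same-level box centers that are
    # well-separated: all of {-3..3}^3 except the near field {-1..1}^3.
    return [v for v in product(range(-3, 4), repeat=3)
            if max(abs(c) for c in v) > 1]
```

Precomputing one set of translation coefficients per unique offset, rather than per box pair, is what keeps the storage for the S | R translations independent of the mesh size.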

Parallel FMBEM : Choice of parallelization strategy
The present parallelization strategy prioritizes the equal distribution of the octree boxes across the workers, in lieu of balancing the numbers of S | R translations applied by each worker. This design choice balances the more costly filtering operations for the far-field signature functions, which have an O(p^3) or O(p^2 log p) computational cost, compared to the O(p^2) translation operations in the high-frequency regime. In the low-frequency regime both the filtering and translation operations have the same O(p^3) computational cost. In practice the number of S | R translations per worker varies by ∼10% or less for boundary element meshes with relatively uniform element distributions whose discretization is dictated by the acoustic wavelength. 62 This distribution scheme may become suboptimal for highly nonuniform meshes treated via adaptive octree structures, 39 or for highly parallel (many-core) implementations, where the smaller numbers of S | R translations per worker may become significantly unbalanced. A more efficient strategy could be to prioritize the distribution of the S | R translations in the low-frequency regime, and the number of octree boxes (filtering operations) in the high-frequency regime. However, this approach would necessitate increased worker communication, and so the overall PE of the algorithm would depend on the relative cost of the increased data transfer compared to the optimized load balancing between workers. It will be shown in the next section that the data communication for the present model is a relatively small component of the total calculation time for meshes which have > 10^4 dof per CPU core.

Numerical Results
The parallel broadband FMBEM model has been implemented in MATLAB. An analysis of the algorithmic/memory complexities and PE of the model is presented in Sec. 5.1. Target-strength modeling of the BeTTSi II generic submarine hull model is presented in Sec. 5.2. The outer and inner loops of the fGMRES were set to convergence tolerances of 10^-4 and 0.15, respectively, and the truncation numbers were selected via Eq. (15) to provide multipole expansion errors of 10^-4 in the 'full accuracy' (outer) and 0.2 in the preconditioner (inner) FMBEM calculations. All calculations were performed on a desktop workstation equipped with an Intel i7 hexa-core CPU running at 3.30 GHz and 128 GB of RAM. It should be noted that Intel's automatic 'turbo boost' feature for this particular processor allows a single CPU core to operate at up to 3.60 GHz, allowing relatively faster serial calculations and thus lower efficiencies for the parallel models (a theoretical maximum of 91.67% PE). However, it is shown in Sec. 5.1.2 that some parts of the model achieve close to linear scaling, and so the effect of the turbo boost is assumed to be relatively small.

Algorithmic and memory complexity of FMBEM model
This section presents the algorithmic complexity and PE of the present parallel broadband FMBEM model. The problem configuration of plane-wave scattering from a spherical surface under a rigid (Neumann) boundary condition, which has an analytic solution, is used to characterize the model's performance. Solutions are calculated over a range of wavenumbers using mesh discretizations that provide a minimum of 10 constant elements per acoustic wavelength (that is, the solution frequency increases with the number of dof to maintain a constant elements-per-wavelength discretization). The boundary element meshes range from 10³ to 10⁷ elements/dof, with a maximum kD value of 1299 (where k is the wavenumber and D is the size of the computational domain), and are solved using 1, 3 and 6 CPU cores. Model results from MATLAB's implicit parallelization of its own inbuilt functions are also presented for reference (denoted as 'Implicit'). Relative error norms for the model solutions varied between 1.18% and 6.75%, with a mean value of 2.16%. The largest solution error occurred for the 10⁷ element model, where a maximum of only 2 elements per lowest-level octree box was used in the FMBEM calculation due to the memory limitations for running this model.

Figure 7 presents the total calculation time of the parallel broadband FMBEM as a function of N, for a constant number of elements per wavelength. The total calculation time includes all stages of the FMBEM model, including the set-up of the octree structure, the near-field and SAI precalculations, and the fGMRES solution time. Lines of best fit are also presented in the figure for each serial/parallel model configuration. The broadband model exhibits the expected N log₂ N algorithmic complexity [20]. Increasing the number of CPU cores in the parallel solution reduces the scaling factor of the N log₂ N slope (i.e. a linear reduction in the calculation time), while the constant y-intercept increases due to the increased overhead in setting up the parallel algorithm across multiple cores. For the present problem configuration the parallel model becomes faster than the single-threaded model for N ∼ 10⁴. Figure 8 shows the calculation time required for a single matrix-vector product (FMBEM far-field and sparse BEM near-field) using the full accuracy multipole expansions (expansion error = 10⁻⁴). The lines of best fit again show an algorithmic complexity proportional to N log₂ N. Furthermore, the parallel matrix-vector calculations are faster than the serial model results for all problem sizes (and similarly for the 'preconditioner accuracy' matrix-vector calculation times; not shown), indicating that the slower total calculation time of the parallel model for N < 10⁴ is due to the set-up of the model.
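The lines of best fit in Figs. 7 and 8 amount to an ordinary least-squares fit of T ≈ a·N log₂ N + b. A minimal sketch with synthetic timing data follows (`fit_nlog2n` and the coefficients are illustrative, not the paper's measured values):

```python
import math

def fit_nlog2n(Ns, times):
    """Least-squares fit of T = a * N*log2(N) + b to measured times."""
    xs = [n * math.log2(n) for n in Ns]
    m = len(xs)
    sx, sy = sum(xs), sum(times)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * t for x, t in zip(xs, times))
    a = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    b = (sy - a * sx) / m
    return a, b
```

In this framing, adding CPU cores should reduce the fitted slope a (faster asymptotic calculation) while increasing the intercept b (parallel set-up overhead), which is the behavior observed in Fig. 7.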
The memory storage requirements of the model are presented in Fig. 9 as a function of N. In this case the results for the MATLAB implicit parallelization are identical to the 1-core results and so are not shown. The fitted lines indicate a memory scaling proportional to N, which agrees with the theoretical prediction for the broadband FMBEM model [39]. The parallel model requires more memory as the number of CPU cores is increased, but this additional memory becomes proportionally smaller as N increases (as indicated by the different slopes of the fitted lines). This is due to the memory footprint being dominated at lower frequencies by the translation and filtering matrices, which are replicated across the workers, and at higher frequencies by the multipole expansions and sparse near-field matrix, which are distributed across the workers. Finally, it should be noted that the instantaneous peak memory usage of the FMBEM is somewhat larger than that indicated in Fig. 9 (typically by 10-20%), as various temporary arrays are built during execution of the algorithm.
The results presented in this section thus reaffirm the expected algorithmic and memory complexities of the broadband Helmholtz FMBEM. Comparative results for the parallel FMBEM model indicate that the present parallelization strategy does not affect the overall O(N log₂ N) complexity, only the scaling and offset parameters of the line fit (corresponding to the total calculation and initialization times for the parallel models).

Parallel efficiency of the FMBEM model
This section presents results for the PE of the FMBEM, for both the total solution time and the time required to perform the far-field part of a single matrix-vector product using the FMM algorithm. The FMM matrix-vector results were averaged over a number of calculations (100 iterations for problems with N ≤ 10⁶ and 25 iterations for problems with N > 10⁶) inside a single call to MATLAB's spmd parallel execution environment, to minimize the time delay for initializing MATLAB's parallel environment. The calculation time for the sparse near-field matrix-vector product was negligibly small, ranging from 10⁻² to 10⁻⁴% of the total matrix-vector calculation time, and so was not included in the timing results. This contrasts with some other Helmholtz FMBEM models in the literature [26], where the near-field calculation contributed a significant cost to the total calculation time.
The PE for the total calculation time of the FMBEM model as a function of N is presented in Fig. 10. Each stack of bars indicates the constituent calculation times for the main components of the FMBEM algorithm (the FMM precomputation, BEM sparse near-field, SAI preconditioner and fGMRES solution times) versus N, using different numbers of CPU cores for the parallel algorithm (indicated by the groups of bars). The computational time for each set of solutions has been normalized relative to the corresponding 1-core solution for each problem size. Table 1 presents the corresponding parallel efficiencies for the components of the FMBEM algorithm versus N. The PE ratio is defined as Time_1core/(C × Time_Ccore), where C is the number of CPU cores used in the parallel calculation and Time_1core and Time_Ccore are the respective serial and parallel calculation times being compared. Figure 10 shows that the total calculation time of the parallel models is dominated by the FMM precomputation time at smaller values of N (lower frequencies). This is due to the fact that the majority of operations in this stage are performed in serial and the results then transferred to the workers (thus making this stage slower than in the single-threaded model). In the case of the smaller models the data distribution cost is dominated by the constant overhead in initiating the parallel operations. At higher frequencies, the parallel initialization cost is relatively small compared to the larger data transfer, and so the precomputation scaling improves as N increases. The parallel scaling of the near-field and SAI calculations improves as N increases, as these stages are performed in parallel and require minimal data transfer between workers, becoming faster than the serial model for relatively small problem sizes.
However, these gains appear to scale poorly with the number of cores, exhibiting average parallel efficiencies of 50.7% and 21.4% for the 6-core near-field and SAI calculations, respectively (see Table 1). The numerical results for the present model configuration indicate that the fGMRES solution becomes the dominant computational cost for problems involving more than approximately 3.9 × 10⁴ dof. The fGMRES solution also exhibits good PE (up to 92.3% in the best case) due to the large computational cost and minimized data transfer for this part of the model. The average PE of the parallel FMBEM model is 68.3% and 47% for the 3-core and 6-core models respectively, for problems with > 10⁴ dof per core. The poorer PE of the set-up stages of the model clearly limits the overall PE achieved. Future work will focus on improving the PE of these parts of the model.
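The PE figures quoted throughout this section follow directly from the ratio defined above; a one-line sketch (`parallel_efficiency` is an illustrative helper, not part of the model):

```python
def parallel_efficiency(t_serial, t_parallel, n_cores):
    """PE = Time_1core / (C * Time_Ccore), expressed as a percentage."""
    return 100.0 * t_serial / (n_cores * t_parallel)
```

For example, the 3 kHz BeTSSi II runs in Sec. 5.2 (roughly 300 min serial versus 90 min on 6 cores) give 100 × 300/(6 × 90) ≈ 56%, matching the PE reported there.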
Finally, it should be noted that the largest few models exhausted the available RAM during the precomputation, as evidenced by the relatively slower set-up calculation times (due to file paging) exhibited by those models. The other stages of the analysis were successfully performed within memory but also suffered some latency as data was transferred back into RAM. Thus the parallel efficiencies exhibited by the largest models are suboptimal.

The calculation time for the FMM matrix-vector product versus N using the full accuracy multipole expansions is shown in Fig. 11. The main components of the FMM algorithm are (1) the operations for the initial multipole summation (element grouping) and final far-field evaluation at each element (denoted as 'Multipole Summ.'), (2) the upward and downward pass operations ('Up/Down Pass') for translations/filtering between octree levels, and (3) the S|R translations across octree levels ('S|R Trans.') for interacting well-separated element groups. The parallel models also show the worker communication times in the upward/downward pass ('Up/Down Comm.') and S|R translation ('S|R Comm.') stages of the algorithm. The computational times for the parallel solutions have been normalized relative to the 1-core solution for each problem size. The percentage of the total calculation time spent on inter-worker communication is indicated in Fig. 11 above the 3- and 6-core results. The corresponding PEs for the components of the FMM matrix-vector product versus N (including worker communication times, where appropriate) are presented in Table 2. Figure 11 shows that the S|R translations are the dominant computational cost of the FMM matrix-vector product for all problem sizes considered, as expected. The combined cost of the initial and final multipole summations constitutes only a relatively small component of the total computational cost of the matrix-vector product, a fraction that decreases further as N increases.
Conversely, the calculation time required to apply the S|S and R|R translations in the upward and downward passes increases with N. The number of octree levels and associated translations between octree levels increases with N, and these have a higher computational cost than the corresponding S|R translations. Thus, for the large models, where k increases with N, the S|S and R|R translations constitute a more significant component of the total FMM matrix-vector calculation time. Figure 11 also shows that the inter-worker communication is dominated by the S|R translations, while that for the S|S and R|R translations is negligible for N > 10⁴ or so. The average total communication time of the FMM matrix-vector product amounts to 8.4% and 18.1% of the calculation time for the 3-core and 6-core solutions respectively, for models with > 10⁴ dof per CPU core. Analogous results for the preconditioner accuracy FMM matrix-vector products are 13.9% and 24.2%, respectively. The parallel efficiencies for the main components of the FMM matrix-vector product are shown in Table 2. The initial and final multipole summations do not require any worker communication and so should present close to ideal (100%) scaling. This is indeed the case for all but the smallest 3-core solutions, whereas the 6-core results exhibit a PE of around 80% for models with N ≥ 10⁴ dof. The initialization cost for the parallel execution environment has been minimized in these tests, while any effect from the Intel CPU's turbo boost feature would equally affect the PE of the 3-core results. Thus, it is not obvious what additional effect causes the poorer PE in the 6-core results. One possible explanation is that the operating system must share some utilization of the CPU cores when all cores are used in the parallel FMBEM, causing additional latency in the parallel calculations. Some multipole summation results in Table 2 also display apparent super-scaling (> 100% efficiency).
These are likely due to small errors in the averaged calculation times for the smaller models, while the 112.9% PE for the 10⁷ dof 3-core model is due to the fact that this model exceeded the available RAM in the precalculation stage. The multipole expansions are automatically transferred to hard disk storage when the memory is exceeded, and so transferring this data back into RAM during the first iteration of the multipole summation timing calculations impeded the serial calculation speed. Conversely, the parallel models transfer the multipole expansion data back into RAM when distributing the data to the workers during the FMM precomputation stage, and so no similar delay is present in the parallel model results. A similar effect is also apparent in the 6-core 10⁷ dof model.
The combined upward and downward pass of the FMM matrix-vector product exhibits good PE for models with N > 10⁴ dof (see Table 2), which can be seen in Fig. 11 to correspond to those models which have small ratios of worker communication to calculation time in the upward/downward passes. The 6-core results again exhibit poorer PE than the corresponding 3-core models. Finally, the larger component of inter-worker communication for the S|R translations yields a reduced PE for this component, averaging 90.6% and 69.7% for the 3-core and 6-core models with N > 10⁴ dof. Excluding the S|R communication time yields corresponding PEs of 101.4% and 89.1% respectively, which again suggests that the parallel calculations are applied with suboptimal scaling in the 6-core model. This may again be due to latency related to utilizing all available CPU cores, or to a more significant nonuniformity in the number of S|R translations applied by each worker (or both).
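The relative growth of the three matvec stages discussed above can be illustrated with a schematic operation tally over the octree levels. This is a sketch only: the interaction-list size of 27 and the helper `fmm_stage_counts` are illustrative assumptions, not the model's actual translation counts.

```python
def fmm_stage_counts(boxes_per_level, interaction_list_size=27):
    """Schematic operation counts for one FMM matrix-vector product:
    multipole summation at the leaves, S|S/R|R up/down passes, and
    per-level S|R translations. Levels run from root (0) to leaves."""
    counts = {
        "multipole_summ": 2 * boxes_per_level[-1],  # initial grouping + final evaluation
        "up_down_pass": 0,
        "sr_trans": 0,
    }
    for lvl, n_boxes in enumerate(boxes_per_level):
        if lvl >= 2:  # no well-separated interactions above level 2
            counts["up_down_pass"] += 2 * n_boxes  # one S|S (up) + one R|R (down) per box
            counts["sr_trans"] += n_boxes * interaction_list_size
    return counts
```

Under this toy model, each additional octree level contributes to both the up/down pass and the S|R translations, consistent with the growing share of the S|S and R|R translations in the large models noted above.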
The PE of the present model is poor compared to other models in the published literature, mainly due to the inefficient precalculation steps, while that of the fGMRES solution (the most computationally expensive component for large-scale models) is relatively good. This is due to the minimized inter-worker communication resulting from the columnwise distribution, element reordering and rowwise compression implemented in the present algorithm. As a result, all stages of the FMBEM matrix-vector products exhibit good PE for the model configurations considered. Future work will focus on improving the PE of the precalculation steps of the model, while testing on larger multicore systems will resolve the systematic latencies that are suspected in the present 6-core results.

TS modeling of the BeTSSi II submarine hull model
In this section, the parallel broadband Helmholtz FMBEM is applied to TS modeling of the BeTSSi II generic submarine hull model, which was developed in 2014 [63] to provide a generic but realistic submarine hull shape for benchmarking various TS modeling approaches. A summary of the results from the workshop can be found in Ref. [64]. TS predictions for the model have been presented using a number of modeling methods, including BEM, FMBEM, H-matrix, SCSD and ray-tracing methods [65-70]. The FMBEM is suitable for modeling the submarine hull at low-to-mid frequencies (up to a few kHz), while at higher frequencies the fast approximate methods (such as ray-tracing) are better suited, as they do not require an explicit frequency-dependent discretization of the scattering surface. Monostatic TS modeling results are presented here for the outer submarine hull at frequencies of 1 kHz and 3 kHz. It should be noted that the performance of the FMBEM can be improved for predominantly long and thin '1D-shaped' objects by optimally placing the object within the octree structure [71]; however, these optimizations have not been pursued in the present analysis. The overall dimensions of the BeTSSi II main hull are 62 m in length by 7 m in diameter (extending to 7.5 m for the flat deck); see Ref. [63] for full details.

Monostatic TS modeling at 1 kHz
The monostatic TS of the outer BeTSSi II submarine hull as a function of incident angle in the horizontal plane was calculated at 1 kHz using the FMBEM model, assuming a rigid (Neumann) BC. The mesh consisted of 159 780 plane triangular boundary elements with constant unknowns (corresponding to a mesh resolution of 10 elements per wavelength at 1 kHz) and the FMBEM model was solved using 6 CPU cores. The TS of the model was solved over the 0° to 180° azimuth angular range (from bow to stern) with a 0.25° resolution. The average calculation time (including initialization of the MATLAB parallel environment) was approximately 725 s per model. Figure 12 presents the FMBEM TS results versus backscatter angle, in comparison to analogous digitized numerical results published by Hoffmann [66]. Additionally, TS results in 5° increments over the angular range have been calculated on a higher resolution mesh consisting of 1 538 580 plane triangular boundary elements (10 elements per wavelength at 3 kHz).
It can be seen from Fig. 12 that the present model results and those from Hoffmann [66] are generally in good agreement, except near the bow (0°), where the present model predicts lower TS levels and a null at approximately 3°. This may be due to variations in the discretized curvature of the outer pressure hull (the nose and top casing section which melds into the flat deck). The 1 kHz TS results calculated on the higher resolution mesh also show good agreement with those calculated on the 159 780 element mesh, indicating that the TS results for the BeTSSi II model are sufficiently converged when using 10 elements per wavelength.
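The 10-elements-per-wavelength convergence criterion above can be checked with a short helper. This sketch assumes a sound speed of c = 1500 m/s in water; the sound speed, the example edge lengths, and `elements_per_wavelength` are illustrative assumptions, not values stated in the text.

```python
def elements_per_wavelength(freq_hz, edge_length_m, c=1500.0):
    """Approximate number of elements spanning one acoustic wavelength
    for a mesh with the given characteristic element edge length."""
    wavelength = c / freq_hz
    return wavelength / edge_length_m
```

Under these assumptions, the wavelength at 1 kHz is 1.5 m, so elements with ∼0.15 m edges give the 10-elements-per-wavelength resolution; the same mesh at 3 kHz would provide only ∼3.3 elements per wavelength, hence the finer 1 538 580-element mesh used for the 3 kHz runs.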

Monostatic TS modeling at 3 kHz
Analogous monostatic TS results were calculated for the outer BeTSSi II submarine hull at a frequency of 3 kHz using the higher resolution mesh (1 538 580 dof) introduced in the previous section. The TS of the model was solved over the 0° to 180° azimuth angular range with a 0.25° resolution. The average calculation time (including initialization of the MATLAB parallel environment) was approximately 90 min per model run using 6 CPU cores. Comparatively, the calculation time for the single-threaded model was approximately 300 min per model run, corresponding to a PE of approximately 56%. Figure 13 presents the FMBEM TS results versus backscatter angle at 3 kHz, while Fig. 14 presents the absolute value of the surface pressure field at 3 kHz and broadside incidence.

Conclusions
This work presents a parallel broadband Helmholtz FMBEM for large-scale acoustic scattering problems, which employs a simple, novel parallelization strategy based on the columnwise distribution of the equivalent BEM matrix, together with element reordering and rowwise compression of data. This approach allows for the parallel calculation of the FMBEM matrix-vector products using a small number of CPU cores (≤ 64) with a minimized communication overhead. The sparse BEM near-field and sparse approximate inverse preconditioner are also constructed and executed in parallel, while the fGMRES solver has been modified to execute within the parallel environment, thus minimizing the initialization cost and data transfer for the parallel calculations. Numerical tests were conducted on a workstation computer (Intel i7 hexacore, 128 GB RAM), which demonstrated the broadband FMBEM model's expected O(N log N) computational cost and O(N) memory complexity. Acoustic scattering problems with up to 10⁷ dof and kD = 1299 were considered in the analysis. The parallel fGMRES iterative solver exhibited good parallel efficiencies for large-scale models with more than 10 000 dof per CPU core: up to 92.9% and 75.9% in the best case for the 3-core and 6-core models, respectively. Furthermore, some test results indicated that the models utilizing all available CPU cores may suffer from additional latency and a reduced PE due to the processors sharing resources with background operations (operating system, etc.). The precalculation stages of the FMBEM model (in particular, the FMM precomputation stage) exhibited poor PE, and so the complete FMBEM model solution only achieved average parallel efficiencies of 68.3% and 47% for the 3-core and 6-core models when treating problems with > 10⁴ dof per CPU core. Future work will focus on improving the parallel performance of the precalculation stages and on extending this analysis to the multi-core Xeon processors found in typical high-end workstations.