Introduction

The Poisson–Boltzmann theory has become well established for modeling electrostatic solvation interactions in biomolecules [1–14]. In this approach, a molecular solute is approximated as a continuous cavity with a low dielectric constant, for example with values in the range of 1–4. The solvent is approximated as a continuous medium with a different, higher dielectric constant, for example a value of ~80 for water. The cavity contains point charges at atomic centers obtained from a molecular mechanics representation of the solute. When mobile ions are present, the ion distribution is also modeled in a mean-field fashion (assumed to obey the Boltzmann distribution). With such a continuum approximation of the electrostatic interactions, the electrostatic potential ϕ is characterized by the Poisson–Boltzmann equation (PBE):

$$ \nabla \cdot \varepsilon \nabla \phi = - 4\pi {\rho_0} - 4\pi \lambda \sum\limits_i {e{z_i}{c_i}\exp \left( { - e{z_i}\phi /{k_{\rm{B}}}T} \right)}, $$
(1)

where ε is the dielectric constant, ρ0 is the solute charge density, λ is the Stern layer masking function, ezi is the charge of ion type i, ci is the bulk number density of ion type i, kB is the Boltzmann constant, and T is the absolute temperature. If the electrostatic field is weak and the ionic strength is low, the PBE can be linearized as [15]

$$ \nabla \cdot \varepsilon \nabla \phi = - 4\pi {\rho_0} + \lambda {\kappa^2}\phi . $$
(2)

Here \( {\kappa^2} = 4\pi {e^2}\sum\limits_i {z_i^2{c_i}} /{k_{\rm{B}}}T \). Since the early 1980s, considerable efforts have been devoted to solving the PBE numerically for biomolecular applications. Many software packages that can solve either the linear or the nonlinear PBE have been released, such as Delphi [16–18], UHBD [19–22], MEAD [23, 24], the CHARMM PBEQ module [25, 26], the Jaguar PB module [27], APBS [28, 29], and the Amber PBSA module [30–35].
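
As an illustration of the screening term, the short script below (our own example, not part of the original work; the function name and constants are chosen only for this sketch) evaluates κ² from the expression above in Gaussian units for a 150 mM 1:1 salt and reports the corresponding Debye screening length in water. Note that the solvent dielectric constant is not part of κ² in this convention; it enters Eq. 2 on the left-hand side, so it is divided out when the screening length is formed.

```python
import math

# Gaussian (CGS) units, consistent with Eqs. 1 and 2
E_CHARGE = 4.8032e-10   # elementary charge, esu
K_B      = 1.3807e-16   # Boltzmann constant, erg/K
AVOGADRO = 6.0221e23

def kappa_squared(ionic_strength_molar, temperature=298.15):
    """kappa^2 = 4*pi*e^2*sum_i(z_i^2 c_i)/(k_B T) for a 1:1 salt, with the
    bulk number densities c_i converted from mol/L to ions per cm^3.  The
    solvent dielectric constant is not included here; it enters Eq. 2 on the
    left-hand side."""
    c = ionic_strength_molar * AVOGADRO / 1.0e3   # ions/cm^3 for each species
    sum_z2c = 2.0 * c                             # two monovalent ion species
    return 4.0 * math.pi * E_CHARGE**2 * sum_z2c / (K_B * temperature)

k2 = kappa_squared(0.150)                         # 150 mM 1:1 salt
debye_length = 1.0 / math.sqrt(k2 / 80.0)         # screening length in water (eps = 80)
print(f"Debye length at 150 mM: {debye_length * 1.0e8:.1f} Angstrom")   # ~7.9 A
```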

Typical numerical methods involve the discretization of the partial differential equation (PDE) into a system of linear or nonlinear equations and then the solution of the linear or nonlinear system with an iterative approach. The most commonly used discretization approach is the finite-difference method [16–19, 21, 24, 26, 30, 36–38]. In this method, the physical properties of the solution, such as atomic charges and dielectric constants, are mapped onto rectangular grids, and a discrete approximation to the governing PDE is produced. The second approach is the finite-element method [27–29, 39–41], which approximates the potential with the superposition of a set of basis functions. A linear or nonlinear system for the coefficients produced by the weak formulation has to be solved. The third approach is the boundary-element method [42–55]. One advantage of applying the boundary element approach lies in the fact that the dimension of the linear or nonlinear system is much smaller than that from the finite-difference scheme due to the different dimensionality (3-D vs. 2-D). However, the corresponding system is much denser.

In this study, we focus our effort on the sparse linear systems that arise from the finite-difference discretization of the linear PBE. These methods are often termed finite-difference Poisson–Boltzmann (FDPB) methods. The FDPB method converts the PBE into a system of linear equations Ax = b, where the electrostatic potential (x) at every grid point is the unknown, the charge distribution (b) on the grid points is the source, and the dielectric constant and salt-related terms are wrapped into the coefficient matrix A, which is a seven-banded symmetric matrix [36]. Numerical solvers for such systems can be categorized into direct methods and iterative methods. Due to the large number of unknowns in biomolecular applications, direct methods are rarely practical, so most solvers used for biomolecular applications are iterative in nature.

The iterative methods can be loosely grouped into two types: stationary methods and Krylov subspace methods. The partition is not very strict because many stationary methods can be used during preconditioning for the Krylov subspace methods [56]. To implement stationary methods such as Jacobi, successive over-relaxation, and Gauss–Seidel for distributive computing environments, the unknowns are usually reordered with multi-coloring [57] or multi-splitting approaches [58].

Possible strategies for implementing Krylov subspace methods in distributive computing environments are domain decompositions and distributive preconditioners. The principle of domain decompositions is to divide the problem domain into smaller subdomains, solve the subdomains independently, and merge the subdomain solutions in a self-consistent manner. We can decompose a problem domain into either overlapping or nonoverlapping subdomains with the domain decomposition methods for the linear systems from FDPB. An overlapping domain decomposition method normally takes the form of the multiplicative Schwarz method; for nonoverlapping domain decomposition, the Schur method is usually used [59].

Most prior efforts to adapt Krylov subspace methods to distributive computing environments focus on the preconditioning step [60]. These efforts aim either at extracting data/instruction independence from a serial method or at modifying or approximating a serial method to increase that independence [61]. Take the widely used incomplete factorization methods [62], for example. The algorithm cannot simply be converted into a distributive method because of the data dependence in the forward and backward substitutions [61]. Modifying or approximating the methods normally leads to slower convergence because there is a trade-off between the data locality needed for parallelism and the global data dependence needed for fast convergence. Therefore, a reduction in CPU time is not always guaranteed in distributive computing environments [63]. Three basic strategies for incomplete factorization methods such as the incomplete Cholesky conjugate gradient (ICCG) are reordering, series expansion, and domain decomposition [61]. Common practices for reordering include hyperplane ordering (to change the order of computation) and multi-color ordering (to reorder the unknowns). The hyperplane ordering method can be implemented efficiently for good parallel performance [61, 64–67]. Multi-coloring methods like red–black ordering were shown to result in slower convergence [68]. However, the use of more colors was reported to lead to a good balance between convergence and data/instruction independence [69, 70]. More recent studies of ordering strategies show that a “blocked” red–black ordering method can reduce the cost of synchronization and communication [71]. Series expansions have also been used to approximate the preconditioners with truncated expansions that can be evaluated more efficiently without data dependence. The idea is to split and truncate the expansions while keeping enough terms to ensure that the convergence rate is not degraded too much. The truncated Neumann expansion [72] and other polynomial expansions [73] are often used to approximate the preconditioners. Interestingly, the domain-decomposition idea can also be used to implement preconditioning without data dependence. Aside from decomposing the grid and then applying independent incomplete factorizations for distributive processing, preconditioners can also be domain-decomposed algebraically [74–76].

Other efforts to solve linear problems include the use of a two- (or more) level grid-based solution such as multigrid methods [77, 78]. Multigrid methods usually consist of the following steps: (1) performing some stages of a basic method (i.e., one of the iterative solvers) to smooth out the error; (2) restricting the current state of the problem to a subset of coarse grids and solving the resulting smaller problem; (3) interpolating the coarse-grid solution back to the original grid and performing a number of stages of the basic method. These steps are applied recursively to achieve convergence [79]. Many aspects of multigrid algorithms are readily implemented in multiple threads for distributive environments. Indeed, multigrid methods have become popular for solving linear systems arising from PDEs [28, 29, 80, 81].
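
To make the two-grid logic above concrete, the following self-contained sketch (our own illustration on a 1-D Poisson model problem, not the PBE and not any published multigrid code) carries out one correction cycle: pre-smoothing with weighted Jacobi, restriction of the residual to a coarse grid, a direct coarse-grid solve, prolongation of the correction, and post-smoothing. A full multigrid solver applies this cycle recursively, replacing the direct coarse solve with another two-grid cycle.

```python
import numpy as np

def weighted_jacobi(u, f, h, sweeps=3, omega=2.0 / 3.0):
    """Smooth the 1-D model problem -u'' = f with zero Dirichlet boundaries."""
    for _ in range(sweeps):
        u[1:-1] += omega * 0.5 * (u[:-2] + u[2:] - 2.0 * u[1:-1] + h * h * f[1:-1])
    return u

def two_grid_cycle(u, f, h):
    """One two-grid correction cycle: smooth, restrict, solve, prolong, smooth."""
    u = weighted_jacobi(u, f, h)                                  # (1) pre-smoothing
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] + (u[:-2] - 2.0 * u[1:-1] + u[2:]) / h**2   # residual r = f + u''
    rc = r[::2].copy()                                            # (2) restrict to the coarse grid
    nc, hc = rc.size, 2.0 * h
    # direct solve of the coarse residual equation -e'' = r (interior points only)
    Ac = (np.diag(2.0 * np.ones(nc - 2)) - np.diag(np.ones(nc - 3), 1)
          - np.diag(np.ones(nc - 3), -1)) / hc**2
    ec = np.zeros(nc)
    ec[1:-1] = np.linalg.solve(Ac, rc[1:-1])
    e = np.interp(np.arange(u.size) * h, np.arange(nc) * hc, ec)  # (3) prolong the correction
    return weighted_jacobi(u + e, f, h)                           # correct and post-smooth

n, h = 65, 1.0 / 64.0
x = np.linspace(0.0, 1.0, n)
f = np.pi**2 * np.sin(np.pi * x)     # exact solution is sin(pi*x)
u = np.zeros(n)
for _ in range(10):
    u = two_grid_cycle(u, f, h)
print("max error:", np.abs(u - np.sin(np.pi * x)).max())
```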

A seemingly related method is electrostatic focusing in FDPB. The focusing technique is a series of finite-difference runs that are performed with successively finer grids. Each run has boundary conditions that are calculated from the potential map of its predecessor [36]. The focusing technique ensures reasonable accuracy on the finest grid without a huge computational overhead, because only a certain part of the problem domain is solved on the finest grid. Long-range interactions with the rest of the domain outside the focusing region can be accurately represented by the coarser-grid calculations. However, there has been no theoretical justification of focusing when it is applied in the finite-difference method [28], though the error due to the approximation is often extremely small under certain conditions.

A limitation of the electrostatic focusing calculations is that the memory usage of the finest grid for the region of interest may still be unmanageably large in FDPB calculations of large biomolecular systems. It has been reported that the fine grid in the focusing technique can be divided into multiple smaller blocks for distributive computation [82]. This idea, termed “multi-block” focusing below, is similar to the domain decomposition concept, but the boundaries of the subdomains are neither conjugated nor communicated during the computation. It is thus an approximate method. Nevertheless, the method can be used to lower the memory usage of each focusing run and is potentially friendly to distributive computing environments.

In this study, we explored the distributive multi-block focusing technique in order to analyze what it takes to achieve acceptable accuracy and good performance in numerical FDPB solutions of large biomolecular systems. A highly scalable parallel version of the method was also implemented and incorporated into the Amber/PBSA program [30–35]. In the following, we summarize our algorithm design and implementation and evaluate its performance using several challenging test cases. We conclude with a brief discussion of potentially more accurate numerical algorithms for future development in the field of distributive FDPB calculations of large biomolecular systems.

Methods

Finite-difference Poisson–Boltzmann method

In our implementation of the FDPB method, a uniform cubic grid is used to discretize the linear PBE (Eq. 2); i.e., the spacing between neighboring grid points is uniformly set to h. The grid points are labeled (i, j, k), with i = 1, …, xm, j = 1, …, ym, and k = 1, …, zm, where xm, ym, and zm are the numbers of grid points along the x, y, and z axes, respectively. The electrostatic potential (ϕ) and atomic charges (q) are mapped onto the grid points. The dielectric constant is defined at the midpoint between any two neighboring grid points: εi(i, j, k) denotes the dielectric constant between grid points (i, j, k) and (i + 1, j, k), and εj(i, j, k) and εk(i, j, k) are used similarly. Finally, κ² is used to absorb all related coefficients in the Boltzmann term. The discretized linear PBE can then be written as

$$ \begin{array}{*{20}{c}} { - {h^{{ - 2}}}{\varepsilon_i}(i - 1,j,k) [\phi (i - 1,j,k) - \phi (i,j,k)]} \hfill \\{ - {h^{{ - 2}}}{\varepsilon_i}(i,j,k)\quad \; [\phi (i + 1,j,k) - \phi (i,j,k)]} \hfill \\{ - {h^{{ - 2}}}{\varepsilon_j}(i,j - 1,k)[\phi (i,j - 1,k) - \phi (i,j,k)]} \hfill \\{ - {h^{{ - 2}}}{\varepsilon_j}(i,j,k)\quad \; [\phi (i,j + 1,k) - \phi (i,j,k)]} \hfill \\{ - {h^{{ - 2}}}{\varepsilon_k}(i,j,k - 1)[\phi (i,j,k - 1) - \phi (i,j,k)]} \hfill \\{ - {h^{{ - 2}}}{\varepsilon_k}(i,j,k)\quad \; [\phi (i,j,k + 1) - \phi (i,j,k)]} \hfill \\{ + \quad \;{\kappa^2}\quad \quad \quad \quad \phi (i,j,k)\quad \quad \quad \quad \quad \quad = {h^{{ - 3}}}q(i,j,k)} \hfill \\\end{array}, $$
(3)

which is a linear system of xmymzm equations, usually denoted A x = b (with the electrostatic potential as the unknown x, the charge distribution as the source b, and the dielectric constant and salt-related terms wrapped into a seven-banded symmetric coefficient matrix A).
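
As an illustration of this structure, the sketch below (our own, not the PBSA code) assembles the seven-banded matrix of Eq. 3 with SciPy for the simplified case of a spatially uniform dielectric constant; in the actual FDPB setup the band entries carry the midpoint dielectric values εi, εj, and εk instead of a single ε. With the unknowns ordered so that i runs fastest, the off-diagonal bands sit at offsets ±1, ±xm, and ±xm·ym, which is the banded structure exploited by the iterative solvers discussed below.

```python
import numpy as np
from scipy.sparse import diags, identity, kron

def assemble_A(xm, ym, zm, h, eps, kappa2):
    """Seven-banded coefficient matrix of Eq. 3 for a spatially uniform
    dielectric constant eps; in the real FDPB setup the band entries carry
    the midpoint dielectric values instead of a single eps."""
    def lap1d(n):
        return diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    Ix, Iy, Iz = identity(xm), identity(ym), identity(zm)
    # unknowns ordered with i (x) running fastest, then j, then k
    lap3d = (kron(Iz, kron(Iy, lap1d(xm)))
             + kron(Iz, kron(lap1d(ym), Ix))
             + kron(kron(lap1d(zm), Iy), Ix))
    return (eps / h**2) * lap3d + kappa2 * identity(xm * ym * zm)

A = assemble_A(xm=8, ym=8, zm=8, h=0.5, eps=80.0, kappa2=0.0)
# the right-hand side is b = q / h^3, flattened in the same grid ordering
print(A.shape, "nonzero bands at offsets 0, +/-1, +/-xm, +/-xm*ym")
```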

One-block versus multi-block electrostatic focusing

As mentioned in the “Introduction,” electrostatic focusing can be used to obtain an accurate solution in a specific region of a problem domain. For simplicity of presentation, we utilized only a two-level focusing construct in this study; i.e., only two grids (coarse and fine) were used in all the test calculations discussed below. The first step in the electrostatic focusing method is a coarse-grid FDPB calculation spanning the entire problem domain. The problem domain is often set to be large enough to secure a good free-space boundary condition. The coarse-grid solution of the potential is then used to define the boundary condition (i.e., the boundary potential) for the fine grid, which covers only the region of interest. Here the boundary potential is computed with the trilinear weighted interpolation method for a good balance of accuracy and efficiency [31].
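
For reference, a plain trilinear interpolation of the coarse-grid potential at a fine-grid boundary point can be sketched as below. This is our own minimal illustration (the function name and argument layout are ours), and the weighted interpolation actually used in the code may differ in detail.

```python
import numpy as np

def trilinear(phi_coarse, origin, h_coarse, point):
    """Interpolate the coarse-grid potential at an arbitrary Cartesian point
    (e.g., a fine-grid boundary node).  phi_coarse is indexed [i, j, k] and
    origin is the position of coarse-grid point (0, 0, 0); the point is
    assumed to lie strictly inside the coarse grid."""
    s = (np.asarray(point, dtype=float) - np.asarray(origin, dtype=float)) / h_coarse
    i0 = np.floor(s).astype(int)      # lower corner of the enclosing coarse cell
    t = s - i0                        # fractional offsets within the cell, in [0, 1)
    val = 0.0
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((1.0 - t[0]) if di == 0 else t[0]) * \
                    ((1.0 - t[1]) if dj == 0 else t[1]) * \
                    ((1.0 - t[2]) if dk == 0 else t[2])
                val += w * phi_coarse[i0[0] + di, i0[1] + dj, i0[2] + dk]
    return val
```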

Again, as mentioned in the “Introduction,” there is often a memory limitation in electrostatic focusing calculations. For example, an electrostatic focusing calculation for the complete 70S ribosome requires ~16.6 GB of memory to achieve a resolution of 0.5 Å at the fine-grid level. To overcome this limitation, the fine-grid region can be divided into multiple subdomains or blocks to be solved independently, via the so-called divide-and-conquer strategy. The resulting blocks can be made small enough to be processed sequentially on a personal workstation, or in a distributive manner on networked workstations or computing nodes.

In the following we term the extended electrostatic focusing method the “multi-block” electrostatic focusing method, and the classical electrostatic focusing method the “one-block” electrostatic focusing method. One noticeable feature of the multi-block electrostatic focusing method is that the adjacent blocks overlap (see Fig. 1) in order to maintain good overall calculation accuracy. The overlapping region is termed the buffering space in this study. Its influence upon the calculation accuracy is analyzed in detail below.

Fig. 1

Four multi-block focusing subdomains (blocks) on a two-dimensional grid. The coarse grid is colored black. The fine grid is covered by four overlapping blocks in different colors. The dimension of each fine-grid block is 7 × 7. The buffering space is 2 grid points thick on all sides of each block. Therefore the actual computed grid number for each block in this example is 11 × 11
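
The buffering geometry of Fig. 1 can be written down directly. The sketch below (our own helper, with hypothetical names) lays out the blocks along one axis, following the convention implied by Fig. 1 that the buffer is added on all sides of each block; a three-dimensional block is simply the Cartesian product of three such one-dimensional layouts.

```python
def block_extents(n_fine, core, buf):
    """Lay out fine-grid blocks along one axis: each block owns 'core' points
    and is extended by 'buf' buffering points on both sides, following the
    convention of Fig. 1 (core 7, buffer 2 -> 11 computed points)."""
    blocks, start = [], 0
    while start < n_fine:
        end = min(start + core, n_fine)                        # points whose potentials are kept
        blocks.append({"core": (start, end),
                       "computed": (start - buf, end + buf)})  # boundary rows of the computed
        start = end                                            # extent take coarse-grid potentials
    return blocks

# Example matching Fig. 1: a 14-point axis split into two 7-point blocks,
# each computed over 11 points.
for blk in block_extents(14, 7, 2):
    lo, hi = blk["computed"]
    print(blk["core"], "-> computed width", hi - lo)
```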

In the current design (Fig. 1), every focusing FDPB calculation depends on the solution of the coarse-grid FDPB calculation. There are two choices for how to proceed with the coarse-grid FDPB calculation if the focusing blocks are distributed to different computing threads: (1) only the master thread solves the coarse-grid FDPB, and the solution is communicated to the slave threads before the fine-grid FDPB starts; or (2) the coarse-grid FDPB is solved on all threads before the assigned fine-grid FDPB calculations start, so the coarse-grid solution is kept local on each thread. Apparently the second strategy is slightly better because it reduces the need for communication among the threads. Of course, the molecular data are still needed on the slave threads before the coarse-grid FDPB calculation, and the final electrostatic energies are communicated back to the master thread after the assigned fine-grid FDPB calculations finish. The distributive multi-block focusing method can then be summarized as the following pseudo procedure:

  1. (a) Discretize the problem domain with the coarse grid. Discretize the region of interest with the fine grid.

     (b) Decompose the fine grid into blocks.

     (c) Distribute blocks to computing threads if multiple threads are deployed.

  2. Solve the linear system for the coarse grid.

  3. Solve the linear systems for the assigned fine-grid blocks in turn.

  4. Collect electrostatic energies from blocks.
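
The procedure can be summarized in code as the skeleton below. This is a schematic of our own, with all of the heavy lifting hidden behind placeholder callables (solve_coarse, solve_block, interp_boundary, block_energy) that are not part of the published program interface; on a multi-thread run, each thread calls this routine with its own list of assigned blocks and the per-thread energies are summed on the master thread.

```python
def multi_block_focusing(blocks, solve_coarse, solve_block, interp_boundary,
                         block_energy):
    """Skeleton of steps 2-4 above.  All arguments are placeholders standing
    in for routines of the actual FDPB code: 'blocks' is the list of buffered
    fine-grid blocks assigned to this thread (step 1), and the callables
    represent the coarse-grid solve, the per-block fine-grid solve, the
    boundary interpolation, and the per-block energy sum."""
    phi_coarse = solve_coarse()                  # step 2: coarse-grid FDPB (done on every thread)
    energy = 0.0
    for blk in blocks:                           # step 3: independent fine-grid solves
        bc = interp_boundary(phi_coarse, blk)    # boundary potentials from the coarse solution
        phi = solve_block(blk, bc)
        energy += block_energy(phi, blk)         # buffering region excluded from the sum
    return energy                                # step 4: summed over threads afterwards
```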

Since the multi-block focusing method is only an approximation to the classical one-block focusing method, it is important to analyze the influence of the approximation on the calculation accuracy while optimizing the algorithm for efficiency. We checked three key data structures in the FDPB algorithm to gain a better understanding of how the approximation may influence accuracy. (1) The coefficient matrix (i.e., the dielectric constant and salt terms): it is fairly straightforward to enforce exact numerical consistency between the matrices of the multi-block focusing method and the one-block focusing method if the same grids are used. (2) The constant term (i.e., the atomic charge term): it is also not difficult to enforce numerical consistency between the atomic charge terms of the two focusing methods if the same grids are used. (3) The boundary condition for the fine-grid iteration: this is where the inconsistency lies; i.e., the boundary grid points are deeply buried inside the solute in the multi-block focusing method, while they are exposed to solvent in the one-block focusing method. Since the boundary potentials are assigned from the coarse-grid solution, the quality of the boundary potentials on the fine grid is clearly different from that in the one-block focusing method, where the boundary grid points are well away from the heterogeneous dielectrics and atomic charges near the solute. In the one-block focusing method, it is a very good approximation to interpolate fine-grid boundary potentials from the coarse-grid potentials. However, in the multi-block focusing method, the distortion might be noticeable when the boundary grid points are buried well inside the solute. Further analysis is therefore necessary before the method can be recommended for wide application.

This is why the buffering space is required in each block to maintain a reasonable accuracy level in the multi-block focusing method. Note that the fine-grid potentials in the buffering space are neither passed to other computing threads nor used for later energy calculations. Thus, any large error from the distortion of boundary potentials when atomic charges and heterogeneous dielectrics are too close to the edge of the block can be alleviated to a certain degree. Of course, the use of buffering space leads to an extra computing cost in the multi-block focusing method. Thus, on the one hand, the use of smaller fine grid blocks accelerates each FDPB calculation; on the other hand, the use of buffering spaces slows down the overall calculation. In general, there is an optimal block partition scheme that achieves the best efficiency, as shown below in the “Results and discussion.”

Distributive implementation of the multi-block electrostatic focusing method

When implementing the multi-block focusing method as a distributive program suitable for multi-thread computing environments, we also paid attention to computing resource usage, particularly memory management and network communication management. As mentioned in the “Introduction,” one key benefit of the multi-block focusing method is that it reduces the memory needed. A distributive algorithm like the multi-block focusing method naturally divides a large system into smaller, manageable pieces. However, strict memory management is still necessary to realize the desired benefit (i.e., allocating only the memory needed for the current calculation). Similarly, communication management is strictly enforced so that only the smallest amount of data required for the multi-thread implementation is broadcast. In the current implementation, the communication needs of the distributive multi-block focusing method are reduced to several 1-to-N broadcasts of working arrays for initialization and one N-to-1 reduction for collecting the electrostatic energies.
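
The communication pattern described above can be illustrated with a small mpi4py script. This is our own sketch; the actual implementation uses LAM/MPI from within the Fortran PBSA code, and the dummy arrays below only stand in for the real working arrays and the per-thread block computations.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nthreads = comm.Get_rank(), comm.Get_size()

# 1-to-N broadcast of the working arrays needed for initialization; a dummy
# charge array stands in for the real molecular and grid data.
charges = np.random.rand(1000) if rank == 0 else None
charges = comm.bcast(charges, root=0)

# Each thread would now solve the coarse grid locally and then its assigned
# fine-grid blocks; a dummy partial sum stands in for that work here.
local_energy = float(charges[rank::nthreads].sum())

# One N-to-1 reduction collects the electrostatic energies on the master thread.
total_energy = comm.reduce(local_energy, op=MPI.SUM, root=0)
if rank == 0:
    print("collected energy:", total_energy)
```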

MICCG solver

The basic iterative solver used in the multi-block focusing method is the modified incomplete Cholesky conjugate gradient (MICCG) method, with the following pseudo program:

[figure a: pseudocode of the preconditioned conjugate gradient iteration]

where M is the preconditioner matrix, r and p are auxiliary vectors, and <x, y> represents the inner product of x and y. The preconditioner matrix M in MICCG can be symbolically written in a factorized form as [83]:

$$ {{\hbox{M}}^{{ - 1}}} = ({{\hbox{E}}^{\rm{T}}} + {{\hbox{C}}^{\rm{T}}} + {{\hbox{B}}^{\rm{T}}} + {\hbox{D}})\,{{\hbox{D}}^{{ - 1}}}\,({\hbox{D}} + {\hbox{B}} + {\hbox{C}} + {\hbox{E}}), $$
(4)

assuming that the seven-banded matrix A is decomposed as

$$ \mathbf{A} = \mathbf{E}^{\rm{T}} + \mathbf{C}^{\rm{T}} + \mathbf{B}^{\rm{T}} + \mathrm{diag}(\mathbf{A}) + \mathbf{B} + \mathbf{C} + \mathbf{E}. $$
(5)

Here B and BT are the bands adjacent to diag(A), C and CT are the bands offset by xm from diag(A), and E and ET are the bands offset by xm·ym from diag(A). In MICCG, the elements of D are computed from the recursive relation

$$ \begin{aligned} \mathbf{D}_{i,j,k} = \mathrm{diag}(\mathbf{A})_{i,j,k} &- b_{i-1,j,k}\left( b_{i-1,j,k} + \alpha c_{i-1,j,k} + \alpha e_{i-1,j,k} \right)/\mathbf{D}_{i-1,j,k} \\ &- c_{i,j-1,k}\left( \alpha b_{i,j-1,k} + c_{i,j-1,k} + \alpha e_{i,j-1,k} \right)/\mathbf{D}_{i,j-1,k} \\ &- e_{i,j,k-1}\left( \alpha b_{i,j,k-1} + \alpha c_{i,j,k-1} + e_{i,j,k-1} \right)/\mathbf{D}_{i,j,k-1}, \end{aligned} $$
(6)

where α was suggested to be 0.95 in the original MICCG implementation [64], while we adopted a value of −0.3 for biomolecules in this study [30].
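
For concreteness, the recursion of Eq. 6 and the surrounding conjugate gradient iteration can be sketched as follows. This is our own Python illustration with natural ordering, not the Fortran PBSA solver; the preconditioner solve apply_Minv (the forward and backward sweeps with the factors of Eq. 4) is left as a user-supplied callable. With α = 0 the recursion reduces to the standard incomplete Cholesky diagonal; the value α = −0.3 quoted above simply enters as the default here.

```python
import numpy as np

def miccg_diagonal(diagA, b, c, e, xm, ym, zm, alpha=-0.30):
    """Diagonal D of Eq. 6 in natural ordering.  diagA, b, c, and e are
    flattened arrays of length xm*ym*zm; b, c, and e hold the bands at
    offsets 1, xm, and xm*ym, with zeros wherever a band entry would couple
    across a grid face."""
    D = diagA.astype(float).copy()
    for k in range(zm):
        for j in range(ym):
            for i in range(xm):
                m = i + xm * (j + ym * k)
                if i > 0:
                    p = m - 1
                    D[m] -= b[p] * (b[p] + alpha * c[p] + alpha * e[p]) / D[p]
                if j > 0:
                    p = m - xm
                    D[m] -= c[p] * (alpha * b[p] + c[p] + alpha * e[p]) / D[p]
                if k > 0:
                    p = m - xm * ym
                    D[m] -= e[p] * (alpha * b[p] + alpha * c[p] + e[p]) / D[p]
    return D

def pcg(apply_A, apply_Minv, rhs, tol=1.0e-9, max_iter=1000):
    """Generic preconditioned conjugate gradient; apply_Minv would carry out
    the forward and backward sweeps with the factors of Eq. 4 (not shown)."""
    x = np.zeros_like(rhs)
    r = rhs - apply_A(x)
    z = apply_Minv(r)
    p = z.copy()
    rz = float(np.dot(r, z))
    for _ in range(max_iter):
        Ap = apply_A(p)
        step = rz / float(np.dot(p, Ap))
        x += step * p
        r -= step * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(rhs):
            break
        z = apply_Minv(r)
        rz_new = float(np.dot(r, z))
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```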

Other computation details

Four biomolecular systems with nontrivial sizes were used as test cases in this study. These are: the 5S ribosomal RNA of the 70S ribosome (chain B of PDB id: 2J01, denoted rRNA), the p53 DNA binding domain tetramer in complex with DNA (PDB id: 2AC0, denoted p53 DBD), the GROEL heptamer (PDB id: 1SX3, denoted GROEL), and the complete 70S ribosome (PDB id: 2J01 + 2J02, denoted the 70S ribosome). For the 70S ribosome, the missing residues in the terminal regions were ignored and the gaps within protein subunits were rebuilt with MODELLER [84]. The all-atom models of these systems were built according to the Amber force field, with tRNA and metal ion parameters obtained from the Amber contributed parameters database (http://www.pharmacy.manchester.ac.uk/bryce/amber). All models were subsequently energy-minimized with the Amber molecular dynamics program SANDER. More detailed information on the tested biomolecular systems is listed in Table 1.

Table 1 Details for the tested systems. The solute dimensions were estimated by determining the smallest rectangular bounding box that the solute could possibly fit in

In this study, the dielectric constants of the internal and external regions were set to 1 and 80, respectively. The ionic strength was set to 200 mM for rRNA and the 70S ribosome, and to 150 mM for p53 DBD and GROEL. Unless otherwise specified, the coarse-grid spacing was 2 Å and the fine-grid spacing was 0.5 Å. The convergence criterion for the finite-difference iteration was set to 10⁻⁹, and a two-level focusing technique was used for all calculations, as described above. The boundary potentials for the coarse-grid FDPB calculation were set to zero. This is a reasonable approximation of the free-space boundary condition only when the finite-difference grid boundary is far away from the solute. Thus, the dimension of the coarse grid was set to be at least twice that of the solute to secure a good boundary condition. The fine-grid boundary potentials were interpolated from the coarse-grid potentials. All other computational details can be found in our previous publications on the FDPB method [30–35].

The electrostatic energy was computed as

$$ \Delta {G_{{\rm{elec}}}} = \frac{1}{2}\sum\limits_j {{q_j}{\phi_j}}, $$
(7)

where j runs over all charged atoms. The electrostatic potential ϕj at atomic charge qj was computed by solving the linear PBE, as in Eq. 2. The reaction field energy was computed as the difference between the electrostatic energy in the solvent dielectric and the electrostatic energy in the uniform solute dielectric. All reported energy values were averaged over 64 individual computations with random grid origins. The relative accuracies were measured as the relative deviations between the mean energy from the multi-block focusing FDPB calculations and the mean energy from the nonfocusing FDPB calculations. The PBSA module in the Amber 9 package [30–35] was used to implement the multi-block focusing method. LAM/MPI (http://www.lam-mpi.org) was used for the parallel implementation. All test runs were conducted on an Intel Pentium 4 cluster or an Intel Xeon cluster. All data communications were routed over a private gigabit Ethernet network, which is separate from the administrative network.
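
The energy bookkeeping in Eq. 7 and the reaction field energy amount to the following few lines (our own sketch; q and the potentials are arrays over the charged atoms, and the function names are ours):

```python
import numpy as np

def electrostatic_energy(q, phi):
    """Eq. 7: half the sum of atomic charge times potential at the atom."""
    return 0.5 * float(np.dot(q, phi))

def reaction_field_energy(q, phi_solvated, phi_uniform):
    """Difference between the electrostatic energy in the solvent dielectric
    and that in the uniform solute dielectric."""
    return electrostatic_energy(q, phi_solvated) - electrostatic_energy(q, phi_uniform)
```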

Results and discussion

To assess the accuracy and performance of the distributive multi-block focusing method for FDPB, we first analyzed several factors that are likely to impact the agreement between the multi-block focusing and nonfocusing FDPB calculations. This was followed by our analysis of the efficiency of the distributive method on both single-thread and multi-thread platforms.

Accuracy considerations

To achieve excellent overall numerical quality in the total electrostatic energy with the focusing method, the grid boundary should be set far from the surface of the solute. Otherwise, the potentials on atoms near the grid boundary are often distorted, leading to an inaccurate total electrostatic energy. However, the focusing method is also used where only a portion of the solute (e.g., an enzyme active site) is of interest. In such applications, the focusing grid (i.e., the fine grid used here) is defined well within the solute interior. Apparently, for this strategy to be reliable, the focusing grid should be set large enough to cover the region of interest with enough buffering space, so that all atoms of interest are located well within the boundary of the focusing grid.

The same consideration also applies to the multi-block focusing calculations, since the block boundaries lie well inside the solute in order to partition the domain of interest into same-sized blocks. Given this constraint, a thicker buffering space should reduce the numerical inconsistency between the multi-block focusing and nonfocusing FDPB calculations. In addition, a multi-block focusing calculation with larger blocks generally agrees better with the nonfocusing calculation, since there are fewer block boundaries when larger blocks are used. In the following, two selected systems (rRNA and GROEL) were used to assess the influence of both the buffering space and the block dimension upon the accuracy of the new method.

Figure 2 shows that smaller overall errors can be achieved with a thicker buffering space and/or larger blocks, consistent with the expectation outlined above. Specifically, a buffering space of 16 grid points consistently yields good accuracy for all tested block dimensions. Overall, a reasonable relative accuracy of 10⁻³ can be achieved when the buffering space is set to be 16 grid points and the block dimension is set to be at least (1/6)³ of the fine grid in the classical one-block focusing method. This accuracy level is quite acceptable considering that the FDPB convergence error is usually around 10⁻³ at the grid spacing of 0.5 Å often used in biomolecular applications [Wang J, Luo R (2010) Reducing grid-dependence in finite-difference Poisson–Boltzmann calculations, submitted].

Fig. 2

Relative deviation in reaction field energy with respect to buffering spaces and block sizes. Relative deviation is computed as the difference between the multi-block focusing FDPB calculation and the nonfocusing FDPB computation, and is averaged over 64 runs with random grid offsets. The block size (per dimension) is shown as a fraction of the fine grid, as in the classical one-block focusing calculation

Algorithm efficiency on serial platforms

Apparently there is an extra cost incurred just to maintain sufficient accuracy through the use of buffering space. Thus, one issue is how to minimize this extra cost while maintaining sufficient accuracy. In general, multi-block focusing calculations with larger blocks waste less time because there are fewer block boundaries. Of course, the larger blocks have to fit in physical memory without resorting to virtual memory for the calculation to be efficient at all. Even when larger blocks do fit into physical memory, it generally takes more time to solve the resulting linear systems because of the larger memory footprint of the FDPB solvers. This is why an “optimal” multi-block size may exist for a given computer hardware setup, and the size is more or less independent of the actual solute or grid geometry, as illustrated in Fig. 3. The figure shows that the optimal block dimension is ~73³ for the tested hardware, so this size is used in the further testing of the method below. More interestingly, multi-block focusing calculations can even be faster than one-block focusing calculations on a single CPU when the “optimal” multi-block size is used. For example, the multi-block focusing calculations with the optimal block size in Fig. 3 are about 20–30% faster than the one-block focusing calculations for both geometry settings.

Fig. 3

Fine-grid performance defined as number of atoms processed per second versus block size in the 70S ribosome for a distributive multi-block focusing computation. Two different geometry settings were used to divide the fine grid into multiple blocks in order to analyze the existence of an optimal block size for each grid geometry. It is possible to observe an optimal block size that yields maximum efficiency in both tested geometry settings

Algorithm efficiency on multi-thread platforms

Due to the distributive nature of the multi-block focusing method, a multi-thread version of the algorithm can be implemented in a straightforward fashion by distributing the fine-grid blocks over multiple computing threads. In this implementation, each thread has all of the necessary data to process its assigned blocks once the coarse-grid FDPB starts (i.e., each thread processes the coarse-grid FDPB redundantly). Thus, there is no cross-thread communication once the coarse-grid FDPB computation starts. Note that this is more beneficial than processing the coarse-grid FDPB computation on the master thread and communicating all data to the slave threads. Given such a setting, there still are other issues that may impact the performance of the multi-block focusing method, as discussed below.

Nonparallel overhead

The multi-block focusing method is meant to divide an unmanageably large system into manageable smaller pieces that can be handled by the computer hardware at hand. However, there is nondistributive overhead, namely the coarse-grid FDPB calculation that is used to assign boundary potentials for the fine-grid FDPB calculations. Apparently, the computational cost of the coarse-grid FDPB calculation should be less than, or at least comparable to, that of the individual fine-grid FDPB calculations. Otherwise the coarse-grid FDPB calculation would become the performance bottleneck. This scenario can easily be avoided by using a very coarse FDPB grid. For example, the default coarse grid is eight times as coarse as the fine FDPB grid in the PBSA program. In addition, it is also possible to use more than two levels of FDPB grids if a finer coarse-level grid is desired, a common practice in focusing calculations.

Cache size issue

Another issue of importance to parallel performance is the cache size, because FDPB runs are memory-intensive jobs. It is rare that a single FDPB run can fit in the cache memory of a typical machine. This may reduce efficiency on SMP parallel platforms due to cache memory competition among different threads. We compared the performance of a four-thread calculation on two platforms: one node with two dual-core Intel Xeon processors, and four nodes with identical processors but only one thread per node. The p53 DBD and the 70S ribosome were used to measure the fine-grid speedups on the two different communication platforms. The physical memory available was larger than the memory footprint of each thread, so that no virtual memory was used. Table 2 shows that the parallel performance over a network of multiple nodes is much better than that over an SMP node with multiple processors, indicating the existence of cache competition.

Table 2 Fine-grid parallel speedups for the p53 DBD and 70S ribosome with the SMP and TCP/IP methods, respectively

To confirm the above finding, we analyzed the computational times of several multi-block focusing runs with much smaller grid dimensions on the same hardware. Indeed, when the solver memory usage of each thread is set to less than 2 MB (i.e., half of the cache size per processor), a noticeable speedup is observed for the SMP parallel run (Table 3). Of course, deterioration of the parallel efficiency already starts to occur at ~1.8 MB, because other program components may not be entirely swapped out of the cache memory (Table 3).

Table 3 Dependence of parallel efficiency upon the solver memory size and number of SMP threads

Of course, a larger cache memory does alleviate the limitation of an SMP parallel setup, but it is unlikely to eliminate it, given the large memory usage of typical biomolecular applications. For example, the memory usage is ~146 MB for the block dimension (73³) used in our current implementation. Thus, our recommendation is to avoid running more threads than the number of physical processors on a node when cache memory is shared between the cores of the same physical processor.

Load balance

The fine-grid FDPB performance also depends on how the computational load is distributed over the threads. One reason for this is that the number of blocks is not always divisible by the number of threads, so some threads are assigned more blocks than others. The second reason is the heterogeneity of the fine-grid blocks: Fig. 4 shows that the distribution of per-block computation times is highly scattered for the two tested systems. Indeed, the variation in computation times can be up to 200%.

Fig. 4

Distribution of computation times for the fine-grid blocks for the p53 DBD and 70S ribosome, respectively

Thus, a simple sequential partitioning of blocks does not guarantee a balanced load for good performance. To improve load balancing, a randomized partition was used to assign blocks to computing threads. Comparisons of the randomized and sequential partition on the test cases of the p53 DBD and 70S ribosome are shown in Fig. 5. A noticeable benefit can be observed from the randomized partition: the scalability of the algorithm clearly improves when the randomized partition is used. Finally, the overall performance for both tested systems can be found in Fig. 6, which shows a more or less linear speedup over the number of threads used.
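
One simple realization of such a randomized partition is sketched below; the shuffling scheme shown is our own illustration (function name and round-robin assignment included), since the paper does not specify the exact randomization used.

```python
import random

def assign_blocks(n_blocks, n_threads, seed=0):
    """Randomized round-robin assignment of fine-grid blocks to threads."""
    order = list(range(n_blocks))
    random.Random(seed).shuffle(order)          # spread expensive blocks across threads
    assignment = [[] for _ in range(n_threads)]
    for pos, block in enumerate(order):
        assignment[pos % n_threads].append(block)
    return assignment

print([len(t) for t in assign_blocks(288, 8)])  # e.g., 288 blocks over 8 threads
```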

Fig. 5

Comparison of randomized and sequential partitioning in terms of fine-grid parallel speedups for p53 DBD (288 blocks) and 70S ribosome (512 blocks)

Fig. 6

Overall parallel scaling in the multi-block focusing calculations for p53 DBD (288 blocks) and 70S ribosome (512 blocks), respectively. The scaling of parallel computing is represented by the speedups versus the number of threads used in the computation

Conclusions and future directions

We have implemented and evaluated an approximate distributive method for FDPB calculations of large biomolecular systems. This method is based on the electrostatic focusing principle of decomposing a large fine-grid FDPB calculation into multiple independent FDPB calculations, each of which focuses on only a small and specific portion or block of the large fine grid.

We analyzed the impact of the approximation upon the accuracy of the numerical reaction field energies. It was found that a reasonable relative accuracy of 10⁻³ can be achieved when the buffering space is set to be 16 grid points and the block dimension is set to be at least (1/6)³ of the fine-grid dimension, as in the classical one-block focusing method. The impact upon efficiency of using buffering space to maintain sufficient accuracy was also studied. It was found that “optimal” multi-block dimensions exist for a given computer hardware setup, and that the dimensions are more or less independent of the actual FDPB grid geometry. A multi-thread version of the distributive multi-block focusing method was also implemented, and its efficiency was analyzed. Interestingly, a noticeable degradation of parallel efficiency was observed on SMP parallel platforms. This can be attributed to competition for cache memory. We also evaluated the effect of load balancing on the parallel efficiency. It was found that a random shuffling of blocks over computing threads can be used to improve the scalability of the method on parallel platforms.

Finally, it is interesting to discuss future directions for the distributive computation of FDPB, especially for MICCG-related FDPB solvers. As reviewed, there are several competing strategies for the MICCG solvers: hyperplane ordering [61, 64–67], multi-color ordering [69, 70], and blocked red–black ordering [71]. The beauty of grouping grid points into hyperplanes is that all grid points in a hyperplane can be computed independently of one another. More interestingly, the computation order in the hyperplane ordering is equivalent to that of the natural ordering, which is crucial for the fast convergence of MICCG methods. However, this method is probably attractive only on shared-memory parallel platforms, due to the high level of data dependency. The multi-color ordering was proposed as an improvement over the red–black ordering, in order to strike a balance between data independence and convergence [85]. In this strategy, unknowns of the same color have no data relationship to each other, so that the substitutions in MICCG can be computed independently. In the blocked red–black ordering, grid points are grouped into blocks that are colored red or black. The natural ordering is used within each block. Since blocks of the same color have no data dependency, these blocks can be processed independently. Doing so reduces the frequency of data transfers between blocks of different colors, especially when the block size is large [71]. Due to the low data dependency between blocks, this method should be suitable for distributed parallel platforms. We are implementing all of these strategies and analyzing their scaling for the distributive computing of realistic biomolecular applications.