A Benchmark Test of High-Throughput Atomistic Modeling for Octa-Acid Host–Guest Complexes

: Years of massive applications of high-throughput atomistic modeling tools such as molecular docking and end-point free energy calculations in the drug industry and academic exploration have made them indispensable parts of hierarchical screening. While the similarities between host– guest and protein–ligand complexes lead to the direct extension of techniques for protein–ligand screening to host–guest systems, the practical performance of these hit identification tools remains unclear in host—-guest binding. Recent reports on specific host–guest complexes suggest that the experience on the accuracy ladder accumulated from protein–ligand cases could be invalid in host– guest complexes, which makes it an urgent need to perform a systematic benchmark to secure solid numerical supports and guidance of practical setups. Concerning molecular docking, there still lacks a comprehensive benchmark considering popular docking programs. As for end-point reranking, quantitative and rigorous free energy estimation via end-point formulism requires establishing statistically meaningful measurements of uncertainties due to finite sampling, which is neglected or underestimated by a significant portion in almost all main-stream applications. Further, a face-to-face comparison between different screening tools is required for the design of a hierarchical workflow. To fill the above-mentioned critical gaps, in this work, using a dataset containing tens of host–guest complexes involving basket-like macromolecular hosts from the octa acid family, we extensively benchmark seven academic docking protocols and perform post-docking end-point rescoring with twenty protocols. The resulting comprehensive benchmark provides conclusive pictures of the practical value of docking and end-point screening in OA host–guest binding.


Introduction
Due to efficiency and accuracy considerations, modern virtual screening often follows a hierarchical procedure.Low-cost techniques are applied first to eliminate most of the non-promising molecules in an original dataset, and then it comes to the stage employing screening tools of higher costs and accuracy to distill hits from the pool of promising candidates.In the stages where both 3D chemical structures of individual compounds and atomic-detailed interaction patterns between them are incorporated, molecular docking and end-point free energy calculations are the most relevant and popular high-throughput tools for screening of protein-ligand complexes [1][2][3][4][5].
Similar to biomacromolecular assemblies, host-guest complexes are formed by two interacting components of different sizes packed in a shape-complementary manner.Macromolecular hosts are relatively small in size and simpler in composition compared with biomolecules.Commonly encountered species include cyclodextrins, cucurbiturils, and pillararenes, which are often applied as drug carriers and reservoirs [6][7][8][9][10][11]. Their central cavities are often hydrophobic to encapsulate external agents, and their rims are often hydrophilic to serve as hydrogen bond acceptors and/or donors that form directional polar interactions with solvents or guest molecules staying inside the cavity.As host-guest complexes are often considered prototypical analogues of protein-ligand complexes, computational techniques developed for virtual screening of protein-ligand complexes are directly extended to that in host-guest binding [12][13][14][15][16].However, the validity of such direct extensions has not been systematically validated.Further, several recent application reports on specific host-guest datasets report the violation of existing experiences accumulated in protein-ligand situations [17,18], which makes the act of directly extending the computational ladder designed for protein-ligand complexes to host-guest cases questionable.
To fill this critical gap and provide direct numerical evidence and guidance of the design of the hierarchical screening workflow, we devote the current paper to a practical docking and end-point screening on a series of octa-acid (OA) host-guest complexes.A dataset containing 31 OA-guest pairs and 17 host-guest pairs involving the methylated form TEMOA are constructed from SAMPL challenges [19][20][21][22], and fourteen molecular docking protocols and twenty end-point protocols are employed to rank the binding strength.Compared with previous computational works (e.g., those focusing on docking protocols) [18], our work considers larger datasets and many mainstream docking protocols unexplored in the current literature.Further, the extensiveness of the exploration of end-point protocols is also unprecedented.Our benchmark test covers not only many mainstream implicit-solvent models but, more importantly, both the single-and threetrajectory sampling realizations.
The 2D chemical structures of the prototype OA basket, its methylated form, TEMOA, and the guest molecules are presented in Figure 1.Although, in some situations, multiple guest molecules could be encapsulated simultaneously in a single host cavity [23,24], we limit the current investigation to the 1:1 binding in both experimental reference and computational modeling.
Similar to biomacromolecular assemblies, host-guest complexes are formed by two interacting components of different sizes packed in a shape-complementary manner.Macromolecular hosts are relatively small in size and simpler in composition compared with biomolecules.Commonly encountered species include cyclodextrins, cucurbiturils, and pillararenes, which are often applied as drug carriers and reservoirs [6][7][8][9][10][11]. Their central cavities are often hydrophobic to encapsulate external agents, and their rims are often hydrophilic to serve as hydrogen bond acceptors and/or donors that form directional polar interactions with solvents or guest molecules staying inside the cavity.As host-guest complexes are often considered prototypical analogues of protein-ligand complexes, computational techniques developed for virtual screening of protein-ligand complexes are directly extended to that in host-guest binding [12][13][14][15][16].However, the validity of such direct extensions has not been systematically validated.Further, several recent application reports on specific host-guest datasets report the violation of existing experiences accumulated in protein-ligand situations [17,18], which makes the act of directly extending the computational ladder designed for protein-ligand complexes to host-guest cases questionable.
To fill this critical gap and provide direct numerical evidence and guidance of the design of the hierarchical screening workflow, we devote the current paper to a practical docking and end-point screening on a series of octa-acid (OA) host-guest complexes.A dataset containing 31 OA-guest pairs and 17 host-guest pairs involving the methylated form TEMOA are constructed from SAMPL challenges [19][20][21][22], and fourteen molecular docking protocols and twenty end-point protocols are employed to rank the binding strength.Compared with previous computational works (e.g., those focusing on docking protocols) [18], our work considers larger datasets and many mainstream docking protocols unexplored in the current literature.Further, the extensiveness of the exploration of end-point protocols is also unprecedented.Our benchmark test covers not only many mainstream implicit-solvent models but, more importantly, both the single-and threetrajectory sampling realizations.
The 2D chemical structures of the prototype OA basket, its methylated form, TE-MOA, and the guest molecules are presented in Figure 1.Although, in some situations, multiple guest molecules could be encapsulated simultaneously in a single host cavity [23,24], we limit the current investigation to the 1:1 binding in both experimental reference and computational modeling.

Docking Screening of Host-Guest Pairs
Molecular docking is a technique that packs two components (e.g., protein and ligand and, similarly, host and guest) into a single interacting complex with a certain level of conformational compatibility.The main components involved in this technique are conformational search and scoring functions.Conformational search techniques and scoring functions of varying features have been constructed up to now.For instance, scoring functions could be primarily derived from all-atom force fields (e.g., AMBER-like AutoDock) [25,26] and include empirical terms (e.g., plp and chemplp) [27][28][29][30][31]. Below, we briefly introduce the docking protocols considered in this work.
The well-known open-source AutoDock tool is presumably the most popular program in protein-ligand docking.The method achieves good performance in various benchmark calculations on protein-ligand docking [32,33].The up-to-date

Docking Screening of Host-Guest Pairs
Molecular docking is a technique that packs two components (e.g., protein and ligand and, similarly, host and guest) into a single interacting complex with a certain level of conformational compatibility.The main components involved in this technique are conformational search and scoring functions.Conformational search techniques and scoring functions of varying features have been constructed up to now.For instance, scoring functions could be primarily derived from all-atom force fields (e.g., AMBER-like AutoDock) [25,26] and include empirical terms (e.g., plp and chemplp) [27][28][29][30][31]. Below, we briefly introduce the docking protocols considered in this work.
The well-known open-source AutoDock tool is presumably the most popular program in protein-ligand docking.The method achieves good performance in various benchmark calculations on protein-ligand docking [32,33].The up-to-date implementation is Autodock Vina [25].Our benchmark test with the Autodock Vina software involves two scoring functions (Vina and Vinardo) [25,26,34,35], the results of which would thus be abbreviated as AutoDock-Vina and AutoDock-Vinardo in lateral presentations.
Protein-ligand ANT System (PLANTS) docking calculations are performed with two scoring functions (plp and chemplp) [31,36].The conformational search method features a hybrid ant colony optimization that imitates the foraging behavior of ants via artificial pheromones [37][38][39].The scoring functions with PLANTS are constructed as a weighted combination of scoring terms of various origins, such as force-field energetics (e.g., torsional potential from Tripos) [40] and hydrogen bond terms from Chemscore [41,42].While, from the names of scoring functions, it is straightforward to know that the plp and chemplp docking regimes with PLANTS differ in the scoring functions for ranking different structures, there are further differences in conformational search parameters.According to existing experiences published in the literature on protein-ligand complexes, the chemplp scoring function is generally slower in speed but higher in accuracy compared with the plp selection.The PLANTS results will be abbreviated as PLANTS-plp and PLANTS-chemplp in the following paragraphs.
The latest implementation of the DOCK program, DOCK6.10, is also considered.The recommended workflow in the newest implementation incorporates a sphere representation of the molecular surface, an anchor growth algorithm for flexible conformational search, and many selections for pose scoring [43][44][45].Two scoring functions benchmarked in this work include the popular (grid-based) energy scoring and contact scoring with a cutoff of 4.5 Å, which will be abbreviated as DOCK6-energy and DOCK6-contact in this manuscript.
The rDock [46] protocol originally developed as RiboDock is designed for RNA-ligand docking [47].Later, the method is refined and extended to protein-ligand binding [46].The sampling procedure in rDock involves a genetic algorithm search, Monte Carlo and minimization.The rDock scoring function is estimated as the weighted sum of a series of physical and empirical terms, e.g., inter-and intra-molecular energetics, empirical solvation, torsional potential from Tripos, and pharmacophore restraints [48].
We summarize in Table 1 the seven docking protocols considered in this work, and the docking-packed bound structures of all host-guest pairs are shared at https://github.com/proszxppp/virtual_screening_OA (accessed on 7 January 2024).The center of the host pocket is selected as the center of the docking box, and a radius of 7 Å or a side length of 13 Å is used for the dimension of the search space.The unmentioned search algorithm parameters are set to their default values.For each docking protocol, we compute the binding strength of the host-guest complexes with two regimes.The first one follows the most straightforward and popular top-1 treatment that considers the docking score of the top-1 binding pose as the estimate, while the second one considers the multi-modal binding behavior by calculating the exponential sum of the top-10 binding modes [18], i.e., Consequently, there are fourteen estimates of the binding strengths from molecular docking.

Fixed-Charge Parametrization and End-Point Reranking
In end-point screening, the most popular workflow employs fixed-charge force fields to describe intra-and inter-molecular interactions, samples the end-point ensembles in the binding/unbinding reaction, and computes the binding strengths with force-field energetics.
We then proceed to the sampling procedure.The popular single-trajectory protocol in end-point calculations only requires the sampling in the host-guest bound state, while the three-trajectory realization also requires the simulations in the unbound state (i.e., solvated host and solvated guest).Thus, for each host-guest pair, three simulation boxes are needed.The unbound host and guest cells could be constructed via solvation and neutralization of individual molecules, while some initial guess of the bound conformation is needed for the bound state.As the computational cost of our sampling and free energy estimation protocol is high, we only pick the docked poses with one docking protocol as the starting configurations for the host-guest complexes.Normally, the docking protocol for generating the initial condition for atomistic simulations should be selected according to the performance of the docking screening.Therefore, the details of the selected docking protocol would be discussed later in the results section.The docking-produced initial conditions are also available in the online repository.The docked host-guest complex is then solvated in TIP3P water [65], with the 15 Å solute-edge distance and spherical monovalent counter ions of Na + or Cl − [66,67] added for neutralization.Periodic boundary conditions are always employed in our simulations.
Molecular simulations are then initiated using the hybrid-precision GPU engine in AMBER [68].We apply SHAKE constraints [69,70] on bonds involving hydrogen, a 2 fs time step [71], Langevin dynamics for temperature regulation at 300 K, isotropic scaling and Monte Carlo barostat for pressure regulation at 1 atm, the AMBER default 8 Å real-space cutoff for non-bonded interactions, and PME for long-range electrostatics.Starting from the packed initial conditions, we relax the system by performing energy minimization for 5000 cycles, constant-volume heating in 120 ps, and NPT equilibration for 1 ns, after which, 1000 ns unbiased sampling in the host-guest and unbound guest states and 10,000 ns for the unbound host state are spawned.The uneven distribution of sampling lengths in different ensembles arises from the difficult-to-sample behavior (higher energetic fluctuation) observed in a recent work [72].The longer sampling time in the solvated host ensemble helps the minimization of the statistical uncertainty of the three-trajectory estimates.Such a time scale is extremely long in end-point sampling, reaching a practical limit in end-point screening.The sampling interval of successive configurations is set to 10 ps.
For free energy extraction, we utilize MM/PBSA and MM/GBSA with four popular GB models (GB HCT [73,74], GB OBC-I , GB OBC-II [75,76], and GB neck2 [77]).The two GB OBC models correspond to the two sets of parameters published in the original article [75].The salt concentration of counter ions for Debye-Hückel screening is set to 0.1 M. The nonpolar solvation contribution (the SA part) is computed with the solvent-accessible surface area method [78].The popular normal mode approximation [2] is applied on a subsampled configuration set (1/4 of all configurations) to compute the entropic contribution.
The above-mentioned energetics, accumulated and extracted from trajectories, are arranged to form the end-point estimate Each free-energy term includes enthalpic and entropic components, i.e., The difference between the single-and three-trajectory sampling protocols lies in the ensemble that the above terms are estimated.For the popular single-trajectory realization, all energetics are extracted from the host-guest bound state, while host-and guest-only terms are computed with configurations from solvated host and solvated guest ensembles in the three-trajectory realization.
Another consideration concerning the above end-point equation is the exclusion/inclusion of the entropic term (TS gas ).There is a tendency to neglect this term in many applications of the end-point tool, the reasons for which are mostly the similarity of receptor-ligand pairs (and thus, the cancellation of the entropic contribution when computing the relative affinity) and the small improvements of including the entropic term.While the former cancellation statement could be valid under certain circumstances, in most virtual screening scenarios dealing with ligands/guests with diverse chemical features, such an approximation is obviously unsolid, introducing systematic biases of unknown magnitude.Further, such a treatment does not follow the rigorous derivation of the end-point protocol and, thus, could be harmful when the reproduction of absolute affinities is pursued.As for the small improvements observed in certain application reports, the end-point protocol has long been recognized as exhibiting a severe system-dependent behavior in its screening power.Thus, the magnitude of improvements by including the entropic contribution as an open question for newly encountered species requires extensive validation.Whether such a treatment is preferred remains to be revealed in host-guest complexes.We thus compute two estimates for each end-point protocol, i.e., with and without the entropic contribution.In Table 2, a summary of all end-point protocols considered in this work is presented.In virtual screening, the predictive power of a method is often evaluated by the scoring power and the ranking power.The former could be quantified with the root-mean-squared error (RMSE) and the Pearson correlation coefficient, while for the latter, we could compute ranking coefficients including Kendall τ [79] and the Pearlman's predictive index (PI) [80].RMSE quantifies the deviations of computed values from the reference, Pearson r measures the linear correlation between the calculated and experimental reference, Kendall τ assesses the correspondence of the computed affinity rank with the reference rank, and PI, as a metric similar to Kendall rank coefficient, also takes the difference between experimental values into consideration.While RMSE is larger than zero, the other three metrics range from −1 to 1.A close-to-unity Pearson correlation coefficient suggests perfect linear calculation-experiment correlation, and close-to-unity Kendall τ and PI indicate perfect predictions of the affinity rank.A fact worth noting here concerns the physical meaning of docking scores.As the reproduction of absolute values of binding affinities is often considered a critical goal, many scoring functions as a weighted combination of physicsbased and empirical terms take on an energy-like form.In most situations, the unit of scoring functions is set as kcal/mol [27][28][29][30].Thus, computing the absolute deviations of docking scores from experimental affinities is a reasonable analysis.In cases where the dimensionless behavior of the scoring function is explicitly claimed, e.g., the contactbased scoring (DOCK6-contact), we should instead focus only on the ranking power of the docking method.
All docking estimates and the associated quality metrics are summarized in Tables S1 and S2.We visualize the prediction performance in Figure 2 to facilitate enabling a straightforward comparison.When it comes to the scoring power, for all docking techniques, a certain level of host-dependent behavior is observed for the quality metrics.For Pearson r, the OA performance is much better than TEMOA, while the opposite is true for RMSE.The AutoDock family (AutoDock-Vina and AutoDock-Vinardo) are superior to all other docking tools, regardless of the use of the top-1 or exponential sum treatment.AutoDock-Vina is slightly better than AutoDock-Vinardo in both Pearson correlation and RMSE.The RMSE from AutoDock is as small as 1~2 kcal/mol, which deduces that the AutoDock family of docking are powerful tools in reproducing the absolute affinities of host-guest complexes.The next top-performing method varies with the quality metrics used as the criterion.When considering Pearson r, the next top-performing family is the PLANTS docking (PLANTS-chemplp and PLANTS-plp).By contrast, when using RMSE as the criterion, the rDock tool achieves better performance.Compared with the rDOCK RMSE ~5-8 kcal/mol, the PLANTS-plp and PLANTS-chemplp scores exhibit huge deviations from the experimental binding affinities, although the units of their scoring functions are claimed to be kcal/mol [27][28][29][30].However, a major performance difference in Pearson r is observed between OA-and TEMOA-guest systems for rDOCK, indicating the poor robustness of the docking tool.Specifically, the Pearson correlation coefficients of PLANTS and rDock scores are similar (~0.5) for OA-guest complexes, but huge differences could be observed in the TEMOA-guest cases (PLANTS~0.2but rDOCK~−0.2).Therefore, the rDOCK method lacks robustness in host-guest screening and is excluded in the current performance comparison.The DOCK6 results with both energy and contact scoring perform worst in Pearson correlation coefficient (~0-0.2),although the DOCK6-energy estimates are closer to experimental affinities when compared with PLANTS.Overall, using the Pearson correlation coefficient as the criterion, the performance rank is AutoDock-Vina > AutoDock-Vinardo > PLANTS-plp/PLANTS-chemplp > DOCK6-contact > DOCK6-energy, whereas under the RMSE criterion, the performance follows AutoDock-Vina > AutoDock-Vinardo > DOCK6-energy > PLANTS-plp/PLANTS-chemplp > DOCK6-contact.
In terms of the ranking power, interestingly, the disparities in performance among various docking methods do not appear to be as significant as those observed in the previous scoring scenario.While AutoDock-Vina remains the top performer for hostguest complexes involving both OA and TEMOA, AutoDock-Vinardo exhibits a major failure when applied to TEMOA-guest systems.Aside from AutoDock-Vinardo, rDOCK scores also demonstrate an obvious disparity in ranking performance for OA and TEMOA.As the unstableness of their ranking performance is an undesired behavior of robust screening tools, we omit the two docking methods from the performance comparison.The performance rank of the remaining docking methods is AutoDock-Vina > PLANTSplp/PLANTS-chemplp > DOCK6-contact > DOCK6-energy.S1 and S2.

Computational Cost
In addition to the predictive power, another property of great practical interest is the computational cost of a docking tool.We thus summarize the CPU times of finishing the docking calculations for all host-guest pairs using each docking protocol, as shown in Figure S1.Note that we always employ the embarrassingly parallel approach in distributed computing, in order to minimize the communication overhead and maximize the timing performance.A single trail is performed and the consumed CPU time could be influenced by various factors in the computing node, but the current timing data still enable a reasonable comparison.From the CPU time comparison, we know that the AutoDock family, rDOCK and DOCK6-energy docking protocols require similar computational resources, while the PLANTS family and DOCK6-contact protocols are much faster (almost 1/10 CPU time).

Structural Consideration
Apart from the screening power and timing information, another crucial aspect of a docking protocol is the quality of bound structures produced.Unfortunately, the absence of experimentally deposited bound structures makes it impossible to perform a face-to-face comparison of this docking power.However, valuable insights could still be obtained if we take a deeper analysis of the bound structures.To compare the binding modes produced by different docking protocols, we consider two calculation regimes.The quantitative estimator of the structural variation is the root-mean-squared deviation (RMSD) of nonhydrogen heavy atoms of the guest molecule.With the bound structures produced by different docking protocols, the first RMSD estimation regime aligns the guest molecules and computes the RMSD of the guest molecule, i.e., setting the weights of all host atoms to zero in both structural alignment and RMSD calculation.This is the most traditional method in RMSD analysis, and measures the intra-molecular conformational variation of the guest molecule in the bound state.The second RMSD estimation follows a slightly modified regime, which aligns the host molecule (i.e., setting weights of guest atoms to zero but those of host atoms to one) and computes the RMSD of the guest molecule (weights of host atoms to zero and guest weights to one).In contrast to the conventional approach used previously, the RMSD from this altered regime accounts for both translational and rotational contributions of the guest in the host cavity (e.g., change of the binding site) and the intra-molecular conformational variation of the guest molecule.As a result, the second strategy reflects the net structural difference in the binding modes.
As the structural comparison is performed between pairs of the docking protocols, 2D RMSD matrices are formulated.The RMSD matrices of the two estimation regimes are merged for a clearer presentation, with the upper-right triangle holding the first traditional estimates and the bottom-left triangle containing the values of the second altered regime.The heatmaps of several examples are presented in Figure 3, while those of all host-guest pairs are summarized in Figures S2 and S3.For the first two cases (i.e., Figure 3a,b), both the RMSD of intra-molecular conformational variations (upper-right triangle) and the net structural difference (bottom-left triangle) are small, which indicates that the bound structures produced by all docking protocols are similar.In the other cases (Figure 3c,d), various docking protocols yield bound structures with substantial disparities, which implies that the bound-structure similarity does not always hold and there are situations in which different docking protocols predict distinct bound structures.
From the 2D RMSD matrices for all systems shown in Figures S2 and S3, an interesting observation is the small RMSD of the PLANTS-chemplp and PLANTS-plp structures in both of the two RMSD estimation regimes, which indicates a certain level of similarity of bound structure predictions with the two protocols of similar origins (i.e., PLANTS).Thus, it seems indifferent, practically, to pick a scoring protocol with PLANTS docking.Such a behavior is generally absent for the AutoDock family (AutoDock-Vina and AutoDock-Vinardo) and DOCK6 (energy and contact scoring), which insinuates that these scoring protocols are truly independent by construction and serve as independent tools in structure ranking.
bound structures produced by all docking protocols are similar.In the other cases (Figure 3c,d), various docking protocols yield bound structures with substantial disparities, which implies that the bound-structure similarity does not always hold and there are situations in which different docking protocols predict distinct bound structures.(c,d) are examples where some of the docking protocols produce bound structures significantly different from the other.The heatmaps of all host-guest pairs are given in Figures S2 and S3.
From the 2D RMSD matrices for all systems shown in Figures S2 and S3, an interesting observation is the small RMSD of the PLANTS-chemplp and PLANTS-plp structures in both of the two RMSD estimation regimes, which indicates a certain level of similarity Another structural similarity is observed between the top-performing AutoDock-Vina and PLANTS protocols, where the net structural RMSD is below 2 Å for more than 60% host-guest pairs.This suggests that the PLANTS family of intermediate ranking power could reproduce the top-performing AutoDock-Vina structures with a high probability.

Take-Home Message
Overall, considering all the above analyses of scoring and ranking powers, we recommend the AutoDock-Vina protocol as the docking tool for screening host-guest complexes.If the practitioners are dealing with a huge dataset with relatively limited computational resources, the PLANTS family (PLANTS-plp or PLANTS-chemplp) serve as a much cheaper solution with intermediate accuracy in docking screening.The other techniques (DOCK6contact, DOCK6-energy, rDOCK, and AutoDock-Vinardo) are not recommended due to unstable and/or low performance.
Regarding the bound structures produced by different algorithms, the closeness to reference cannot be evaluated due to the absence of experimentally deposited structures.However, the top-performing AutoDock-Vina, PLANTS-plp, and PLANTS-chemplp often produce similar structures (c.f., low RMSDs from both of the two computing regimes shown in Figures 3, S2 and S3).Therefore, structurally, the bound structures and the guest conformations produced by the three recommended docking protocols would not be significantly different, and either of them could produce bound structures of a reasonable quality.

How Long Do We Need to Converge the Statistics
With all the docking results summarized above, we then turn to the post-docking end-point screening in this section.Since AutoDock-Vina consistently outperforms other methods in both quality metrics and host-specific datasets, the most straightforward and reasonable treatment for post-docking end-point screening is extracting the AutoDock-Vina-generated docked poses as the starting configurations of the end-point simulation of the bound complex.Note that the docking-produced bound structures are also available in the online GitHub repository.
Before comparing the performance statistics of end-point screening, we conduct preliminary sanity checks to consolidate the simulation outcome.The convergence check is indispensable in the free-energy calculation.It supports the validity of approximating the ensemble averages with the time-series data.Therefore, instead of directly comparing the end-point results with experimental values, we first look into the time-series data.The time variations of the single-trajectory MM/GB neck2 SA estimates for 31 OA and 17 TEMOA host-guest complexes are plotted in Figures S4 and S5, respectively, with the longest 1000 ns bound-state statistics as the reference value for the long-time limit.The plots clearly show that the end-point estimates of most host-guest pairs could be well-converged with deviations below k B T (~0.59 kcal/mol at room temperature) within tens of ns, but, in some situations (e.g., OA-L30 and OA-L31 pairs), the end-point results still exhibit noticeable fluctuations and/or systematic drifts with 1000 ns unbiased sampling.Based on these observed convergence behaviors of the absolute affinities, we believe that ~100 ns serves as a practically usable setup for converged end-point estimates for most host-guest complexes involving OA derivatives.
Despite the variations in end-point results in some systems, even with 1000 ns sampling, they may not significantly perturb the prediction quality or screening power.To check whether this not-so-well convergence behavior of specific systems would impact the quality metrics in a statistically significant manner, we computed the time series of the four quality metrics (RMSE, τ, PI, and Pearson r), as depicted in Figure 4.For both OA and TEMOA host-guest binding, all four of the quality metrics keep fluctuating in the first 40 ns, and only after 100 ns do the statistics seem stabilized, although minor fluctuations could still be observed in the last several blocks.This phenomenon illustrates that end-point free-energy calculations generally converge after ~100 ns sampling in host-guest complexes involving OA derivatives.Therefore, a practical recommendation for practitioners using end-point free-energy techniques is to extend the sampling time to at least ~100 ns in order to avoid systematic biases introduced by sampling insufficiency.Notably, end-point sampling of such length is generally much longer than commonly applied selections, e.g., several to 10 ns in existing reports [13,81,82], highlighting potential pitfalls in current practices.
complexes involving OA derivatives.Therefore, a practical recommendation for practitioners using end-point free-energy techniques is to extend the sampling time to at least ~100 ns in order to avoid systematic biases introduced by sampling insufficiency.Notably, end-point sampling of such length is generally much longer than commonly applied selections, e.g., several to 10 ns in existing reports [13,81,82], highlighting potential pitfalls in current practices.

Screening Power of End-Point Estimators
The long-time 1000 ns estimates of all end-point protocols (with or without normal mode entropy, implicit solvent model, and the single-or three-trajectory sampling protocol) are summarized in Tables S3-S10.The four quality metrics (RMSE, τ, PI, and Pearson r) and an additional error metric, named mean signed error (MSE), for each dataset are also provided.For easier comparison between performances of different end-point protocols, we plot the quality metrics in Figure 5.

Screening Power of End-Point Estimators
The long-time 1000 ns estimates of all end-point protocols (with or without normal mode entropy, implicit solvent model, and the single-or three-trajectory sampling protocol) are summarized in Tables S3-S10.The four quality metrics (RMSE, τ, PI, and Pearson r) and an additional error metric, named mean signed error (MSE), for each dataset are also provided.For easier comparison between performances of different end-point protocols, we plot the quality metrics in Figure 5.
We first focus on the most popular end-point regime based on the single-trajectory sampling protocol and investigate whether the inclusion/exclusion of the entropic contribution would incur noticeable variations.Comparing the statistical metrics of the ∆H and ∆G, intriguingly, the end-point performance, in most cases, is improved with the inclusion of the entropic term.For example, the RMSE measuring the reproduction of absolute affinities is too huge ~10 kcal/mol for the ∆H estimates, but is minimized to ~4 kcal/mol by the addition of the entropic contribution.Therefore, including the entropic contribution is not a treatment that merely increases the computational cost, but serves as a treatment that completes the theoretical rigor of the end-point workflow and deserves to be considered.Although this observation could be dependent on the system under investigation, for safety, the most reliable end-point estimation scheme for practitioners to follow is still the most rigorous workflow, i.e., including the entropic contribution.We first focus on the most popular end-point regime based on the single-trajectory sampling protocol and investigate whether the inclusion/exclusion of the entropic contribution would incur noticeable variations.Comparing the statistical metrics of the ΔH and ΔG, intriguingly, the end-point performance, in most cases, is improved with the inclusion of the entropic term.For example, the RMSE measuring the reproduction of absolute affinities is too huge ~10 kcal/mol for the ΔH estimates, but is minimized to ~4 kcal/mol by the addition of the entropic contribution.Therefore, including the entropic contribution is not a treatment that merely increases the computational cost, but serves as a treatment that completes the theoretical rigor of the end-point workflow and deserves to be considered.Although this observation could be dependent on the system under investigation, for safety, the most reliable end-point estimation scheme for practitioners to follow is still the most rigorous workflow, i.e., including the entropic contribution.
We then include the three-trajectory results into the analysis set and compare the performance statistics of all end-point protocols.We start by picking the top-performing methods for the OA and TEMOA host-guest datasets.The top-2 end-point protocols based on various quality metrics are listed in Table 3, where different methods achieve top performance under different criteria.To extract some statistical insights into the top-performing selections, we perform two types of analyses.The first one aims at achieving the highest robustness across different hosts (receptors) in virtual screening, and thus picks the end-point protocol that exhibits the top performance for both OA and TEMOA hosts.The robust end-point selections based on individual quality metrics are then given in Table 4. Interestingly, for four quality metrics reflecting different aspects of the screening problem (i.e., ranking and scoring), only two unique end-point protocols appear in the table, i.e., single-trajectory MM/GB neck2 SA ΔG and three-trajectory MM/GB neck2 SA ΔH.These two options are robust screening protocols that perform consistently top for OA derivatives.The second analysis is based on the number of occurrences in the top-2 table (i.e., Table 3).Interestingly, this analysis identifies two end-point regimes that are exactly the same as the previous robustness analysis, as given in Table 5.According to the statistical insights observed from the two analyses, we recommend the two end-point protocol, single-trajectory MM/GB neck2 SA ΔG and three-trajectory MM/GB neck2 SA ΔH, in practical virtual screening of host-guest complexes.We then include the three-trajectory results into the analysis set and compare the performance statistics of all end-point protocols.We start by picking the top-performing methods for the OA and TEMOA host-guest datasets.The top-2 end-point protocols based on various quality metrics are listed in Table 3, where different methods achieve top performance under different criteria.To extract some statistical insights into the topperforming selections, we perform two types of analyses.The first one aims at achieving the highest robustness across different hosts (receptors) in virtual screening, and thus picks the end-point protocol that exhibits the top performance for both OA and TEMOA hosts.The robust end-point selections based on individual quality metrics are then given in Table 4. Interestingly, for four quality metrics reflecting different aspects of the screening problem (i.e., ranking and scoring), only two unique end-point protocols appear in the table, i.e., single-trajectory MM/GB neck2 SA ∆G and three-trajectory MM/GB neck2 SA ∆H.These two options are robust screening protocols that perform consistently top for OA derivatives.The second analysis is based on the number of occurrences in the top-2 table (i.e., Table 3).Interestingly, this analysis identifies two end-point regimes that are exactly the same as the previous robustness analysis, as given in Table 5.According to the statistical insights observed from the two analyses, we recommend the two end-point protocol, singletrajectory MM/GB neck2 SA ∆G and three-trajectory MM/GB neck2 SA ∆H, in practical virtual screening of host-guest complexes.With all the reliable docking and end-point statistics accumulated so far, it is finally possible to explore the most significative aspect of virtual screening and inspect the practical value of applying post-docking end-point rescreening.The recommended protocols of both molecular docking and end-point free energy calculations are compared in a face-to-face manner in Figure 6.While the recommended docking protocol AutoDock-Vina performs best in reproducing the absolute affinities (RMSE) for both hosts and achieves very high correlation coefficients for the OA host-guest pairs, its ranking power for the methylated form (TEMOA host-guest pairs) exhibits a severe performance drop.By contrast, both of the two recommended end-point protocols, single-trajectory MM/GB neck2 SA ∆G and three-trajectory MM/GB neck2 SA ∆H, perform reasonably well for both hosts.Such a stable performance confirms the robustness of the recommended end-point protocols, validating the value of post-docking end-point screening.

Concluding Remarks
Virtual screening often follows a hierarchical workflow design based on efficiency and accuracy considerations.In protein-ligand binding, high-throughput techniques based on atomistic modeling include molecular docking and end-point free-energy calculations, which are directly extended to host-guest complexes due to the similarity of these receptor-ligand assemblies.However, recent reports on specific systems suggest that the computational ladder based on protein-ligand experience could be no longer valid in the

Concluding Remarks
Virtual screening often follows a hierarchical workflow design based on efficiency and accuracy considerations.In protein-ligand binding, high-throughput techniques based on atomistic modeling include molecular docking and end-point free-energy calculations, which are directly extended to host-guest complexes due to the similarity of these receptor-ligand assemblies.However, recent reports on specific systems suggest that the computational ladder based on protein-ligand experience could be no longer valid in the altered receptor-ligand/host-guest complexes.Therefore, there is an urgent need for a face-to-face comparison between the screening performance of molecular docking and end-point calculations in order to provide solid numerical data guiding the design of hierarchical virtual screening.To this end, based on a comprehensive dataset containing ~50 host-guest pairs formed by basket-like OA derivatives and drug-like guests, we perform a detailed and systematic benchmark on docking and end-point screening of OA host-guest systems.
Seven docking protocols with varying features are considered.While rDock, DOCK6energy and DOCK6-contact and AutoDock-Vinardo are rejected due to low accuracy predictions or lack of robustness across different hosts/receptors, the AutoDock-Vina and PLANTS family (chemplp and plp) are considered usable options in host-guest docking.These three selections perform reasonably well in ranking different host-guest pairs with a given receptor/host, and the binding poses predicted by the three recommended docking protocols agree well, with more than 60% host-guest pairs having structural deviations < 2 Å.As for efficiency considerations, the AutoDock-Vina protocol achieves the highest performance statistics but is costlier compared with the PLANTS family.
Twenty protocols (implicit solvent models, single-and three-trajectory sampling, and the inclusion/exclusion of the entropic contribution) are considered for end-point screening.Under the popular single-trajectory realization, we observe that the inclusion of the entropic contribution could be pivotal in reproducing the absolute values of binding affinities, and it also improves the correlation statistics under most circumstances.As the theoretical rigor of the end-point workflow is achieved with the inclusion of this entropic term, we believe that it should not be casually neglected.Unless there is solid evidence supporting the elimination of the entropic term, it is undesirable to perform this approximation.As for the overall performance of all end-point protocols, while the prevailing protocol exhibits certain dependence on the binding target (receptors/hosts) and criterion (quality metrics), statistically, we identify two end-point protocols, single-trajectory MM/GB neck2 SA ∆G and three-trajectory MM/GB neck2 SA ∆H, that consistently deliver top-tier performance across different hosts/receptors.Face-to-face comparison between the recommended end-point and docking protocols reveals the higher robustness of the end-point tool, validating the value of post-docking end-point screening.

Figure 1 .
Figure 1.OA host-guest systems studied in this work.

Figure 1 .
Figure 1.OA host-guest systems studied in this work.

Figure 2 .
Figure 2. Performance comparison between docking protocols for four quality metrics: Pearson correlation coefficient, RMSE, PI, and Kendall τ.The exact values of docking scores are summarized in TablesS1 and S2.

Figure 2 .
Figure 2. Performance comparison between docking protocols for four quality metrics: Pearson correlation coefficient, RMSE, PI, and Kendall τ.The exact values of docking scores are summarized in TablesS1 and S2.

Figure 3 .
Figure 3. 2D RMSD of guest molecules for host-guest bound structures produced by various docking protocols.The upper-right triangle presents the intramolecular conformational variation estimated by the align-guest-compute-guest strategy, while the bottom-left triangle depicts the overall displacement (including both translational and rotational motions of the molecule and intra-molecular conformational change) evaluated with the align-host-compute-guest strategy.In (a,b) we present two examples that all docking protocols give similar models for the bound state, while subplots (c,d) are examples where some of the docking protocols produce bound structures significantly different from the other.The heatmaps of all host-guest pairs are given in Figures S2 and S3.

Figure 3 .
Figure 3. 2D RMSD of guest molecules for host-guest bound structures produced by various docking protocols.The upper-right triangle presents the intramolecular conformational variation estimated by the align-guest-compute-guest strategy, while the bottom-left triangle depicts the overall displacement (including both translational and rotational motions of the molecule and intra-molecular conformational change) evaluated with the align-host-compute-guest strategy.In (a,b) we present two examples that all docking protocols give similar models for the bound state, while subplots (c,d) are examples where some of the docking protocols produce bound structures significantly different from the other.The heatmaps of all host-guest pairs are given in Figures S2 and S3.

Figure 4 .
Figure 4.The quality metrics of the single-trajectory MM/GB neck2 SA protocol as a function of the sampling time for both OA and TEMOA host-guest complexes: RMSE, Kendall τ, PI, and Pearson r.The time-dependent behaviors of free energy estimates of individual systems are provided in Figures S4 and S5.

Figure 4 .
Figure 4.The quality metrics of the MM/GB neck2 SA protocol as a function of the sampling time for both OA and TEMOA host-guest complexes: RMSE, Kendall τ, PI, and Pearson r.The time-dependent behaviors of free energy estimates of individual systems are provided in Figures S4 and S5.

Figure 6 .
Figure 6.Performance comparison between recommended end-point and docking protocols for OA (bottom) and TEMOA (top).

Table 1 .
Docking protocols benchmarked in this work.

Table 2 .
End-point protocols benchmarked in this work.

Table 3 .
Best-performing screening regimes for the two host-guest sets with different quality metrics.

Table 4 .
Robust end-point protocols that achieve top performances for both OA and TEMOA hostguest pairs under individual criteria.

Table 5 .
Number of observations for which end-point protocols perform at the top in the top-2 analysis.