On the estimation of the molecular inaccessible volume and the molecular accessible surface of a ligand in protein – ligand systems †

In this paper, a novel approach is proposed based on the accurate computation of the inaccessible volume and the corresponding surface area which is defined by the locus of points where a ligand molecule can be placed so that it “ touches ” a protein molecule at a preset minimum interatomic distance without resulting in overlaps between the atoms of the protein and the atoms of the ligand. The proposed approach can be considered an extension of the widely used concept of the solvent accessible surface area (SASA). The SASA is defined as the surface where a solvent molecule can be in contact with the initial one without any overlaps. This excluded volume interaction is evaluated by treating atoms as hard core spheres, with the limitation of the solvent molecule being represented as a single sphere. In the proposed concepts of the molecular accessible surface (MASA) and the molecular inaccessible volume (MIV) we have practically removed this limitation and all atoms, both in the initial and the “ inserted ” molecules, are represented as hard spheres. In this paper we focus our examples on biological systems, especially on studying protein – ligand systems, since we expect that this will be one of the most promising fields of applications where the MASA and MIV extensions of the SASA will be of practical and immediate use. Therefore, the MASA and MIV are evaluated based on the surface generated by the ligand while it is being rolled over on all the atoms of the protein without penetrating them. Identification of the inaccessible volume of each candidate protein – ligand pair is also provided in the context of this study, along with the boundary surface where the ligand can be placed so as to be in “ contact ” with the protein. The proposed concepts of the MASA and MIV are expected to significantly enhance the ability to investigate specific protein – drug interactions by explicitly taking into account the polyatomic nature of a ligand. Several trials have been conducted using the analytical method of Dodd and Theodorou leading to accurate volume and surface area measurements of an arbitrary set of fused spheres in systems of various scales.


Introduction
, or between small molecules (ligands) and proteins, highlighting computational methods besides experimental techniques (e.g. X-ray crystallography, NMR). In the growing field of molecular modeling, molecular (Connolly) and accessible surfaces have spurred the interest of the scientific community over the past few years in order to elucidate molecular interactions and putative drug binding phenomena. Notably, using the keywords "accessible surface area" or "molecular surface area" in Google scholar search yields more than 5.5 million results overall, a number which indicates the large amount of studies and research conducted upon the accessible and molecular surface areas of molecules to date. Similarly, "accessible surface area calculation" and "molecular surface area calculation" are stated in a plethora of publications (over 3 million results in total), which include proposed methods and developed algorithms dealing with several biological problems.
In this work, our aim is to extend the notion of the solvent accessible surface area (SASA) by explicitly taking into account the confirmation of potential ligands and thus provide an alternative tool for methodologies, that up to now had to rely on a spherical approximation of the contact molecule in order to estimate the accessible surface around a protein. The concept of the accessible surface area was firstly described by Lee & Richards 1 in 1971 as the solvent accessible surface area (SASA). The SASA traces the geometrical locus derived from the centre of a hypothetical probe-sphere rolling on the van der Waals surface of the molecule without penetrating its atoms. It is also equivalent to the van der Waals surface, with the difference that the atomic radii r i have been substituted with the sum of r i + r p (r p is equal to the atomic radius of the hypothetical probesphere, typically 1.4 Å). Various approaches have been developed for calculating accessible surface areas, with the "rolling ball" algorithm by Shrake-Rupley 2 being one of the earliest and most popular methods among others. Additional improvements to these methods delimited the solventexcluded surface (SES) or widely known molecular/Connolly surface 3,4 which consists of two segments. The first one is the contact surface (part of the van der Waals surface of the atoms), tangent to the hypothetical "rolled-over" probe sphere. The second one is the reentrant surface, which comprises the inward-facing surface of the probe sphere when it is simultaneously tangent to two or more atoms. Analytical calculation of Connolly surfaces is founded upon numerical algorithms retrieving data from the atomic coordinates, van der Waals radii and probe radius, thus generating finite sets of points constructing a network of convex, saddle-shaped and concave faces defined in terms of vertices, circular arcs, spheres and tori so as to compute the solvent-excluded surface.
Apart from the classical Lee-Richards 1 and Shrake-Rupley 2 approaches, the solvent accessible surface area (SASA) can also be calculated analytically 5 using analytical equations plus their first and second derivatives [6][7][8] or by various other approximations. [9][10][11][12] Additionally, computational tools are able to predict the SASA in the unfolded state of the examined molecules incorporating methods such as the artificial neural network (ANN), [13][14][15] Markov chain model, 16 PredAcc 17 and PSAIA (protein structure and interaction analyzer). 18 In Table 1, a list summarizing some of the available computational tools and online servers that provide SASA calculations based on Lee and Richards' fundamental definition 1 is presented. Estimations of the SASA, using tools such as the ones described in Table 1, serve as the basis for several computational tools and methods that have been designed to assist in a variety of more complex problems, with the most prominent being that of estimating free energy differences and protein structure-folding prediction, 34 or even aid in simulations and design of novel molecular structures by predicting physical and chemical properties of polymers prior to synthesis. 35 To a large extent, applications of the SASA contribute to characterizing relationships between the structural and biological properties of chemical compounds via quantitative structure-activity relationship (QSAR) models as well as quantifying molecular lipophilicity (log P), a highly significant pharmacokinetic factor in medicinal chemistry, essential for drug discovery. [36][37][38] Another important application of the SASA has been visualization, with molecular visualization tools like Jmol, 22 VMD 23 and PyMOL 39 capable of providing visual representations of "cavities" and "pockets", as potential candidates for binding sites in proteins. Estimation of the SASA can also be used as part of a "docking" strategy, with docking computation being considered a significant approach for studying proteinprotein or protein-ligand interactions, guided by several theories behind binding phenomena, such as the "lock-key" model, 40 the "induced-fit" theory, 41 the "conformational selection" mechanism 42 and similar established approaches. Development of structure-based virtual screening and construction of novel therapeutic agents via computer-aided drug design (CADD) have all been achieved by molecular docking software applications. 43 Sampling algorithms implemented in docking software programs like DOCK, 44 FLOG, 45 FlexX, 46 Hammerhead, 47 SLIDE, 48 DIVALI 49 or DARWIN 50 are intended to predict the structure via conformational ensemble. Scoring algorithms can also predict the binding affinity of the tested biomolecules during their interactions by scoring functions under certain docking methodologies such as GOLD, 51 AutoDock, 52 LUDI, 53 PLP, 54-56 DrugScore 57 or CScore 58 programs. These algorithms rely on a variety of theoretical, chemical and geometrical approaches to visualize molecular structures and processes. Interactions are handled based on the properties of the amino acid residues found on the surface of molecules. Examination of amino acid charge, polarity, shape, potential for intercalation with other molecules, high evolutionary conservation of surface amino acids and estimated energy of molecular interactions constitute the primary elements for the functional interpretation and calculation of molecular surfaces. Molecular surfaces may have dual use; their graphical representations can provide a prediction of the possible functions and interactions which may take place by visualizing the shape, electron distribution or evolutionary conservation of molecular surface sequences. Moreover, quantification of surfaces is mainly used as a descriptor in an attempt to quantify the binding Gibbs free energy.

Theoretical basis
Extending the typical approaches for calculating the accessible surface area and volume confined to the use of probe-spheres, this paper proposes a novel approach based on the analytical calculation of the accessible volume area of a hard-sphere polyatomic molecule. The core idea was originally developed as part of the staged particle deletion (SPD) method, 59 in an effort to accurately estimate the free energy of cavity formation and its contribution to the chemical potential of small molecules in molecular simulations. For the case of a monatomic molecule that is modeled as a single hard-sphere into a system composed of atoms (represented also as hard spheres), the excluded volume limns the geometric locus of points where a hypothetical insertion hard-sphere center would cause an overlap with any of the existing hard spheres in the system. More specifically, this geometric locus is a set of fused spheres, whose centers coincide with those of the spheres of the atoms in the system but with radii augmented by the radius of the inserted hard-sphere. Provided that a single sphere is inserted and the system consists of molecules made of atoms modeled as spheres, the accessible volume calculation can formally be mapped to the evaluation of the volume of fused spheres, even when periodic boundary conditions are implemented. This approach works irrespective of the presence of intermolecular connectivity, whereas the computational task is expected to depend mainly on the number of atoms in the system and the actual size of the spheres. Furthermore, the estimation of the accessible volume can become quite demanding computationally as the size of the system increases and the actual accessible volume starts to diminish, including the case of inserting a monatomic molecule. On the other hand, what may not be so straightforward is the ability to estimate the inaccessible and in extension the accessible volume after insertion of an arbitrary polyatomic molecule in a similar way. 59 Upon rationalizing the process, a possible solution is to consider the interaction of each sphere in the system with the inserted polyatomic molecule under fixed internal degrees of freedom. It turns out that 59 the volume of the loci of points where a trial insertion of the chain molecule will result in an overlap can also be estimated as the volume of a set of fused spheres.
As it turns out, the problem of estimating the locus of points where a molecule can be inserted without overlapping is very similar to the estimation of the SASA with the main difference being that in the SASA one has to consider where to insert a single sphere by estimating the surface and the volume of a set of "inflated" fused spheres (one for each atom in the system), whereas when a polyatomic molecule is considered, one has to estimate the surface and not the volume of a set of auxiliary fused spheres like we describe in the next paragraph. Once the set of auxiliary spheres has been created the problem of estimating the surfaces and volumes of fused spheres can be handled by any program that has been developed for the SASA. On the other hand, what is probably one of the most accurate ways of estimating the surface and the volumes of any set of fused spheres is the method of Dodd and Theodorou 8 that we have chosen to implement in our calculations. In this work we propose the extension of the notion of the SASA around a protein molecule to the proposed notions of the molecular accessible surface area (MASA) and the corresponding molecular inaccessible volume (MIV) by explicitly taking into account the polyatomic nature of a ligand molecule that is to be placed in contact with the protein without overlapping. To achieve this, one has to create an "auxiliary" sphere for each intermolecular pair of atoms and define the range of the overlap by setting the minimum intermolecular distance that this pair of atoms can reach without overlapping. This minimum intermolecular distance is then set at the radius of the auxiliary sphere creating a set of fused spheres. The difference with the SASA is that the number of fused spheres is no longer equal to the number of atoms in the protein, but is equal to the product of the number of atoms in the protein and the number of atoms in the ligand, and that the radii of the spheres now depend on a pair of atoms, with one belonging to the protein molecule and one belonging to the ligand molecule. As is described in the following paragraph, the process of creating the necessary set of auxiliary spheres whose surface and volume correspond to the proposed notions of the molecular accessible surface area (MASA) and the corresponding molecular inaccessible volume (MIV) is relatively simple and can be summarized in the following steps as have been used in all calculations reported in this work: Step 1. Define the minimum intermolecular distance d ij between a possible pair of atoms: For a given pair of protein-ligand configurations (i.e. the set of positions r and types of all atoms) both the MASA and MIV are a function of the minimum interatomic distance d ij between the possible pairs, consisting of protein atom i and ligand atom j. In our paper we have chosen to express this minimum interatomic distance d ij as a function of the type of atom i, j. By assigning a value for the hard core radius R α of each atom α based on the types of both protein and ligand molecules, we express d ij as the sum of the hard core radii scaled by a common factor f R (i.e. d ij = f R × (R i + R j )). In all calculations reported in this paper, the hard core radius R α for each type has been based on the van der Waals radius used in Jmol in order to have a common reference. Future application may require to either extend the set of types or even to define the minimum distance between pairs of atoms d ij differently. This can be achieved by changing the values of the radius for the auxiliary spheres and should be reported along with the estimation of the MASA and MIV.
Step 2. Generate the set of auxiliary spheres whose surface and volume are the MASA and MIV: Given a relative orientation of the two molecules, the positions r i and the type of atom in the system (protein atoms) and the position r j and the type of inserted atom (atoms in the ligand) create an auxiliary sphere at  Fig. 1) and as a consequence, the surface of auxiliary spheres corresponds to the locus of points for which the placement of the first atom of the ligand will guarantee that all possible distances between any pair of protein (i) and ligand (j) atoms will be bigger or equal to d ij , with at least one distance being exactly equal. Finally, Fig. 1 (a) Graphical representation of the molecular accessible surface volume methodology in Jmol. An illustration of the test system consisting of two molecules, an ethane 60 representing the protein molecule of the system and a methane 61 displaying the inserted ligand molecule distinguished by its cyan haloed atoms. (b) The generated pink auxiliary spheres surrounded by their created 3D isosurface. (c) An illustration of the excluded volume around the ethane molecule of the test system, where the generated gray surface points coincide with the center of the first atom (yellow highlighted sphere) of the inserted methane ligand in opaque color and cyan halos. Two additional ligand molecules colored semitransparently while maintaining all the same orientations are positioned according to their first atom also highlighted in yellow at the minimum interatomic distance. different relative orientations can be examined via rigid rotation of either the protein or the ligand model before the creation of the auxiliary spheres.
Step 3. Evaluate the surface and the volume of the set of fused auxiliary spheres: The molecular inaccessible volume i.e. the volume of the locus of points where the ligand cannot be placed due to the presence of the protein can be evaluated from the set of auxiliary spheres defined in steps 1 and 2 using any algorithm capable of estimating the volume of fused spheres. Similarly, the molecular accessible surface can then be estimated using any algorithm capable of estimating the surface of fused spheres. Although we strongly recommend the use of the Dodd and Theodorou approach in estimating both the surfaces and volumes, we also provide as part of the ESI † scripts that can utilize the available "approximate" tools in Jmol and VMD for the cases that accuracy is not essential. Following the proposed steps, the problem of estimating the molecular inaccessible volume (MIV) and accessible surface area (MASA) is now being expressed as a problem of evaluating the volume and surface of a set of fused spheres similar to that of the SASA with the main difference being in the number of fused spheres that one has to consider, which is now equal to the number of atoms in the system, times the number of atoms of the inserted molecule.
In Fig. 1, a graphical representation of the basic concept is depicted, referring to a simple molecular system of ethane 60 and methane, 61 with the former acting as the protein molecule of the system and the latter as the inserted ligand molecule. The developed method is founded upon creating multiple images of the inserted atoms by maintaining the internal degrees of freedom and relative orientation. The algorithm generates 40 (40 = 8 "protein" atoms × 5 "ligand" atoms) auxiliary spheres (32 of them depicted in pink plus the remaining 8 which are placed at the same position as the protein molecule in the system). The gray 3D surface created by the 40 auxiliary spheres delineates the geometrical locus where the center of the first atom, as ordered in the inserted molecule (here the hydrogen atom as a yellow highlighted sphere), can be placed so that the two molecules of the system can be in "touch". Additionally, placing the center of the first atom of the inserted methane molecule (translucent methane molecules with cyan halos) in different positions on the generated gray 3D surface surrounding the auxiliary atoms brings the hypothetical ligand and protein molecules in contact without overlap. According to step 1, the annotated distances are equal to the sum of the corresponding atomic radii multiplied by the algorithm's scaling factor f R (here adjusted at 1.0). This scaling factor is used to describe the excluded volume interactions of the closest atoms between the inserted and native molecule of the system. In this example, hydrogen-carbon and carbon-carbon atoms are found in the minimum interatomic distance and those distances are equal to the sum of R hydrogen (= 1.2 Å) + R carbon (= 1.95 Å) and R carbon (= 1.95 Å) + R carbon (= 1.95 Å) atomic radii times the scaling factor f R (= 1.0), respectively. Notably, the connectivity between the inserted atoms does not add significant complication at this computational stage. This allows the insertion of two or more molecules simultaneously, as long as the relative position between atoms is maintained during the geometrical calculation and the relative intermolecular degrees of freedom are sampled in an outer loop. Furthermore regarding the SASA, the proposed method is expected to be used in ensembles, where the system configurations are created based on desirable statistical ensembles. Similarly, the internal degrees of freedom of the ligand could be sampled by simulating the inserted molecule under ideal gas conditions. The geometrical calculation would then be performed over a double nested loop over the configurations of the ligand and the system ensembles.
In most calculations reported in this work, the relative protein-ligand orientation has been kept to its original value based on the pdb configuration file downloaded from the web. In the cases that we examine the effect of the relative orientation in our calculations, we have performed random rigid body relations using quaternions. More precisely, the generation of random molecular orientations has been based on the Marsaglia G. method, 62 implemented as follows: -First, two numbers x 1 and y 1 are selected from a random uniform distribution between (−1, 1), until s 1 = x 1 2 + y 1 2 < 1 is satisfied.
-The generated values of s 1 and s 2 are used for the production of a random unit quaternion q as With the previous steps, a set of unitary quaternions is generated. By applying rigid rotations to the ligand molecules using these unitary quaternions, a set of protein-ligand relative orientations is created.
From a technical perspective, the greatest challenge and major concern in developing a computational tool capable of estimating the molecular inaccessible volume and molecular accessible area in protein-ligand systems has been the memory usage due to the large size of the resulting system. In order to be able to use Dodd and Theodorou's analytical approach 8 as a black-box library, the distribution of the computational load using message passage interface (MPI) 63 over a number of processors was compulsory. This was to ascertain the efficient handling of the memory load. As a result, the user can perform analytical calculations of the molecular inaccessible volume and molecular accessible area in realistic protein-ligand systems using reasonable computational resources.

Results and discussion
In order to assess the proposed method, several tests were performed on different systems of varying size. Ligand and protein molecules constituting the main focus systems were mainly downloaded from the Protein Data Bank (PDB) 64 in . pdb format except for simpler molecules like monatomic hydrogen, 65 diatomic hydrogen, 66 water, 67 methane, 61 ethane 60 and caffeine 68 molecules which were retrieved from Github 69 in .xyz format. PDB files with bound molecules underwent conversion so as to separate the ligand and protein components into different .xyz files for more efficient and convenient file manipulation. All molecules and their generated volume area surfaces were visualized by Jmol. 22 Initial tests of the algorithm were performed between simple molecules like monoatomic hydrogen and diatomic hydrogen following progression to more complex molecular systems downloaded from PDB and analyzed. More specifically, the 1zp8, 70 2bpw 71 and 4wtg 72 PDB files were selected as representatives of small-, medium-and large-size scale molecular systems, respectively. 1zp8 and 2bpw PDB entries refer to HIV-1 protease-inhibitor complexes. Former 1zp8 demonstrates an effective replacement of a peptide group in HIV-1 protease inhibitors with 1,2,3-triazole. 73 2bpw demonstrates the ability to replace a putative inhibitor bound to the HIV-1 protease in single crystals. 74 The third PDB entry (4wtg) includes a modified version of the hepatitis C virus (HCV) RNA-dependent RNA polymerase (RdRp) in complex with the clinically active metabolite formed by sofosbuvir, Mn 2+ and a primer-template RNA. 75 Most of the calculations reported in this work have been based on PDB files that contain both ligand and protein molecules and are available via Protein Data Bank. Given such a PDB file, we proceed by first parsing the PDB file in order to acquire the protein and ligand molecule configurations from PDB along with the type of each atom. Based on whether we want to examine the MASA and MIV at a relative orientation different from the original PDB file (as we do in Fig. 2), we choose whether we are going to perform a random rigid rotation of the ligand molecule. Having assigned an atomic radius to each type of atom of both the ligand and protein molecules, and the atomic position at the desired relative orientation, we perform calculations for the MASA and MIV. In all calculations used in this work for the validation of our code, we used as default for the hard core radius in each atom R i the values of van der Waals radii defined for each atom type in Jmol (version 14.6.1), which can be retrieved by executing the command "show vdw" and defined by typing "set defaultVDW Jmol" on a Jmol console. In order to investigate the effect of the hard core radius length in sum calculations, we performed uniform scaling in all hard core radii using a common scaling factor f R (see Fig. 6). Depending on the practical application, the potential user of our computational tool may choose to alter the assignment of each atomic radius, taking into consideration the difference between ions and uncharged atoms for instance. Nevertheless, in this study, given that the main concern is to provide validation of our approach, the simplest reproducible cases were selected while the option of changing the values of atomic radii was deferred for future version purposes. Due to this reason, no further processes were performed on protein molecules extracted from the downloaded PDB files, like restoring missing atoms or imposing the suitable protonated state under a given pH.
Despite the development of this computational tool taking advantage of the analytical calculation of Dodd and Theodorou 8 to a large extent, the proposed calculations of the molecular accessible surface and molecular inaccessible volume can also be carried out by making use of any other computational tool capable of calculating the SASA. To achieve this, one has to generate the set of auxiliary spheres in the same way as described in the previous paragraph ( Fig. 1) and then perform the calculations with the tool of choice. In the context of this research, visual representation was accomplished by using Jmol and its ability to draw 3D isosurfaces. It should be noted that visual rendering of the aforementioned isosurfaces is an arduous computational task, with memory requirements increasing significantly as the size of the molecular systems grows. Nevertheless, most of the available visualization tools output significantly less accurate results when compared to the analytical estimation of Dodd and Theodorou. 8 However, due to graphical representation necessities in many studies, the best strategy is to combine both approaches. Subsequently throughout this work, we report our estimations using the analytical method of Dodd and Theodorou 8 whereas Jmol is used for visualization purposes. Finally, in order to provide better insight into molecular accessible surfaces, ligand placement on a point of the protein surface is also presented, highlighting the contact between the test molecules given the selected relative ligand-protein orientation.
In Fig. 2, the estimations of the molecular accessible surface area (Fig. 2a) and molecular inaccessible volume (Fig. 2b) are presented for various ligand-protein orientations of the 1zp8 test system (HIV-1 protease with its AB-2 inhibitor 73 ). Relative orientations were randomly produced via quaternion formulation of Marsaglia 62 on the ligandprotein pair found in the 1zp8 PDB file downloaded from the Protein Data Bank. The estimated values of the molecular accessible surface area (Fig. 2a) and molecular inaccessible volume (Fig. 2b) are plotted versus the quaternion distance. The baseline against which the quaternion distance was calculated is the ligand-protein orientation of the original input 1zp8 PDB file configuration. Since plotting against the quaternion distance constitutes projection onto onedimensional space, the reader should bear in mind that only distance relevant to the original orientation retains the properties of distance, meaning that any of the expressed orientations depicted as triangles or star points in close proximity in Fig. 2 may actually be far apart. Nonetheless, the above representation style was selected since the deviation of 50 sampled ligand orientations relative to the original one found in the 1zp8 PDB file is better illustrated. Similarly for the 1zp8 test system, an additional 1000 sampled ligand orientations relative to the original one found in the 1zp8 PDB file were generated and the analytical area and volume calculated values of each ligand-protein sampled orientation were plotted as in Fig. 3.
The effect of sampling random relative orientations can be seen in more detail in Fig. 3, where the values of the inaccessible volume and accessible surface area appear to be normally distributed around an average value. At this point the reader should note that the actual shape of such distributions is probably driven towards a "normal" like distribution via the center limit theorem, as is the case for many physical quantities. On the other hand, both the volume and area are bounded continuities and therefore the distributions can never become truly normal. It is therefore advised that the type of the distribution is not taken for granted but considered verified for the particular case of Fig. 2 Analytical calculation of the ligand-protein molecular accessible surface area (a) and molecular inaccessible volume (b) of the HIV-1 protease and AB-2 inhibitor complex retrieved from the 1zp8 PDB entry, 70 at different orientations relative to the ligand-protein orientation of the original PDB file configuration (marked as original orientation). In both charts, the original input molecular configuration is shown at x = 0, followed by 50 random ligand-protein orientations sampled by Marsaglia's method. 62 The estimations are plotted as a function of the quaternion distance (in radians) between each orientation and the relative ligand-protein orientation of the original PDB molecular configuration. The labeled data points A, B, C and D help the reader compare the plotted information against the corresponding modelled structures shown in Fig. 4. interest. In Fig. 4a, samples of 3D representations of the protein-ligand molecular accessible surface are shown, mainly for configurations retrieved from the minima or maxima of Fig. 2a and b using Jmol. According to the displayed molecular states, the inhibitor can "fit" in the original binding site of the protein with significant changes in the relative orientation. Notably, several of the sampled ligand orientations could potentially bind in the opposite direction, reverse to the ligand configuration of the original 1zp8 PDB file (Fig. 4b). We should point out that calculations in Fig. 2 pertain solely to excluded volume interactions. Therefore such observations may serve exclusively for initial screening. Moreover, in the calculations of Fig. 2, we do not distinguish between placing the ligand into pocket cavities or onto the outer surface of the protein. Nevertheless, once the total molecular accessible surface is evaluated, it is also possible to partition the area based on concavity, charge, polarity, or hydrophobicity of the protein contact atom utilizing the tools which have been developed for the SASA and are available in visualization software like Jmol. It should be noted that in the molecular accessible surface, each point corresponds to a specific atom-atom interaction between the ligand and the protein molecule, with more than 3 body contacts, mapped onto lines and points which form from the Fig. 3 Histograms showing the distribution of the estimated values of the molecular accessible surface area (a) and molecular inaccessible volume (b) for a total of 1000 sampled orientations of the 1zp8 test system. 70 intersections of spheres at the surface. It should also be noted that although identification of the hydrophobic part of the accessible surface is straightforward in the proposed methods, by simply identifying auxiliary spheres based on the characterization of the corresponding protein-ligand atom pair, studying "hydrophobicity" as a phenomenon requires much more than simply measuring the amount of the hydrophobic molecular accessible surface since there are many aspects behind the term "hydrophobicity", with some of them being non-local 76 and a strong function of the unique properties of water as a solvent. On the other hand, measuring the amount of the hydrophobic molecular accessible surface appears as a promising potential candidate for developing descriptors in QSAR studies similar to the ones performed using the SASA. 77 In an effort to verify and validate the accuracy of the proposed approach, the estimate of the molecular accessible surface area and a numerical finite difference estimate of the inaccessible volume are shown in Fig. 5. The numerical derivative has been estimated by performing volume calculations over slight increments of the radius parameter δR incorporated in the algorithm. Confirming the consistency between our estimations of the molecular accessible surface and inaccessible molecular volume, the analytical calculation of the molecular accessible surface can be estimated using finite differences provided that the alteration in the radius parameter is neither too small nor too big as it is the case with most numerical estimations based on finite differences.
Having established the consistency between the molecular inaccessible volume and accessible molecular surface, in   Fig. 2 (labeled as A, B, C and D). The protein molecule is displayed according to its secondary structure (yellow b-sheets and pink a-helix) while the ligand molecule holds the typical ball & stick representation style. Ligand and protein molecules are presented at the corresponding relative orientation by placing the sampled ligand configuration onto the molecular accessible surface close to the original binding cavity. Unlike the SASA, our molecular accessible surface is a function of both the actual ligand and the ligand-protein relative orientation. (b) A more detailed view of the sampled ligand configuration C versus the original ligand configuration of the 1zp8 PDB file. Interestingly in this sampled orientation, the sampled ligand configuration C (bright green color) can "fit" in the binding site in the opposite direction contrary to the original ligand configuration of the 1zp8 PDB file depicted transparently in red.
we demonstrate the validity of the molecular inaccessible volume calculation by comparing the proposed analytical calculation with the estimation based on random "Widom"-like test insertions 57 under the original input relative orientation regarding a simple test system, where methane 61 and caffeine 68 act as ligand and protein molecules, respectively. To perform the stochastic estimation, we initially enclosed the molecule of caffeine in a box and then measured the ratio of attempts which   6 Comparison of analytical inaccessible volume calculation (blue circular points) versus the stochastic evaluation based on test "Widom"-like insertions in a simple molecular system, consisting of methane 61 and caffeine 68 as ligand and protein molecules, respectively (blue triangular points). The stochastic estimation results were acquired after 5 repetitive runs on the aforementioned test system at the original relative orientation with different seed numbers for each given number of insertions. All calculations coincided with the analytical estimation of the inaccessible volume, within the 95% confidence interval (depicted as error bars in the above graph). failed to place the methane molecule without overlap in the box, given the original relative orientation. An estimate of the inaccessible volume was produced after multiplying the volume of the box by the ensemble average of the ratio of failed "test" insertions. In Fig. 6, the stochastic estimation is reported as a function of the number of insertions, alongside the analytical volume estimation at the same original relative ligand-protein orientation, where the results from the stochastic method coincide with our analytical calculation output.
In Fig. 7, we observe the effect of scaling all interatomic contact distances by a common factor f R , regarding 3 molecular test systems of different sizes based on the 1zp8, 70 2bpw, 71 4wtg 72 files downloaded from PDB. More specifically, the smallest 1zp8 system consists of 812 atoms Fig. 7 Estimations of the accessible surface area (a) and inaccessible volume (b), both expressed as functions of the algorithm's parameter scaling factor f R . The radii of the auxiliary spheres which determine the range of hard core inter-atomic interactions have been estimated by scaling a common factor, the sum of the van der Waals radii for each atom pair that is used in the formation of the auxiliary sphere. Tests were performed upon 3 molecular systems of varying size (1zp8, 70 2bpw, 71 and 4wtg 72 ).
in total, 765 of which form the HIV-1 protease while the remaining 47 atoms form its ligand inhibitor AB-2. 73 The mid-sized scale 2bpw system contains 1559 atoms, 1514 constituting the HIV-1 protease and 45 its potent ligand inhibitor. 74 Lastly, the largest 4wtg system consists of 4357 atoms, 4327 of which belong to the modified version of HCV RdRp and the remaining 30 atoms are found within its ligand, the clinically active metabolite formed by sofosbuvir, Mn 2+ and a primer-template RNA. 75 Examination of the accessible surface dependency on the algorithm's parameter f R promotes an interesting perspective. There is a certain range where increasing the scaling factor f R leads to reduction of the accessible area, strongly indicating the presence of concave parts on the protein surface which shrink as the radius expands (Fig. 8). However, one may conceive of an approach that uses such observations to identify the presence of cavities but to our knowledge, there is no such method. This is probably due to the usual alternative methods being quite sufficient in identifying cavities or due to the fact that similar calculations would require significant accuracy in the estimation of accessible surfaces. This would not be a practical choice since most of the available methods are of stochastic nature. On the other hand, implementing the analytical calculation of Dodd and Theodorou 8 leads to accurate estimations which can be used to estimate partial differences from finite differences. Finally, for users that would like to use our approach in combination with existing (or newer methods) ones for partitioning the surface area based on concavity, we should note that the correlation between accessible areas of a concave cavity formed out of spheres can be affected by the actual definition of the criteria used to separate concave from convex regions.
Finally, as mentioned previously, a considerable amount of effort has been put into decomposing analytical calculations into independent sub-calculations which can be performed in parallel, since dealing with all of the auxiliary spheres using a single processor may not be feasible for most of the protein-ligand complexes of interest. Aiming to distribute the memory load into multiple processors even at the expense of performing more arithmetic calculations, in Fig. 9 we present the algorithm's execution time as a function of the number of processors used in our parallel decomposition (Fig. 9a) as well as a function of f R alterations utilizing all processors of our computational nodes through MPI 63 (Fig. 9b). The system examined in Fig. 9a consists of the protein-ligand complex retrieved from the 1zp8 PDB entry where all radii have been scaled at half of their van der Waals value by setting the f R parameter to 0.5, while in Fig. 9b, the largest in size 4wtg system (4357 atoms) is tested with increasing f R values exploiting plenty of computational resources. As the size of the molecular system increases relevant to the available resources per processor, the necessity and parallel efficiency of actual calculations may differ, but our approach is expected to be applicable provided that sufficient computational resources are allocated. Subsequently, parallelization through MPI makes our approach suitable both for supercomputers and homemade clusters alike.
For all of the above tests, we implemented our developed algorithm and proposed a methodology which mainly relies on the generation of a set of auxiliary spheres according to the examined molecular system for the analytical calculation of the molecular accessible surface area (MASA) and molecular inaccessible volume (MIV). The user can also perform MASA and MIV calculations by simply inputting the generated set of auxiliary spheres to the computational tool of preference. However, this comes with a considerable trade off in the accuracy of the estimated area and volume values if not calculated by our developed algorithm. In order to assess the validity and robustness of our proposed method, analytical volume and area calculations were carried out on the simplest possible molecular test systems. For this reason, two test systems were formed consisting of Fig. 8 Investigation of concave surfaces of the 1zp8 test system 70 in relevance to f R changes and visualization by Jmol. 22 The translucent orange isosurfaces surrounding (a) and (b) models are created using the "isosurface pocket cavity" Jmol command and are generated according to the set of auxiliary spheres at the specified f R value respectively. In (a), the f R is adjusted to 0.15 and 37 isosurfaces are created with a total accessible surface area of 5008.8 Å 2 . Increasing f R to 0.20 (b) seems to decrease the amount of isosurfaces created to 34 as well as the overall accessible surface area which drops to 3807.8 Å 2 . Increments of the scaling factor f R lead to reduction of the overall accessible area as the concave parts on the protein surface shrink and eventually vanish as the radius expands. either monatomic hydrogen 65 (test 1) or diatomic hydrogen 66 (test 2) as hypothetical protein molecules, while the ligand molecule was always a monatomic hydrogen 65 in both cases. The estimated volume and area after the execution of the developed algorithm were compared with the respective calculation results of other available computational tools applied on the same test systems. Due to this fact, the van der Waals (vdW) radii of each hydrogen atom in ligand and protein molecules were adjusted equally to 1.2 Å among the examined programs for a fair comparison. Several tests were performed utilizing the incorporated volume and area features where applicable of Jmol 22 and VMD 23 molecular visualization software. Apart from that, analytical volume and area calculations of our algorithm were verified by the online partial sphere volume and area calculator 78 as well as checked against Poreblazer 19,20 and Molecularvolume 21 software. The yielded results are summarized in Table 2. Potential applications based on the molecular accessible surface area The molecular accessible surface area (MASA) can be considered as a natural extension of the widely used concept of the solvent accessible surface area. Therefore, like the SASA, the MASA can be used in a variety of applications, spanning from visualization, estimation of descriptors in QSAR models, to "docking". Furthermore, the MASA can "replace" the SASA in most applications with minimal effort. For most applications, one simply needs to replace the set of atoms in the original system with a set of auxiliary atoms. It is then straightforward to use any available tool that has been developed for the SASA either for visualization or for actual estimations. Since the accuracy provided by the user of an analytical estimation is not expected to be essential in many of these applications, and in order to facilitate the use of the MASA in applications that correctly used the SASA, we provide, as part of the ESI † of this paper, the ability to create the appropriate set of auxiliary atoms, given the protein and ligand configurations, without further estimation of the MASA. Furthermore, in the ESI † section we show how to combine this software with Jmol or VMD in order either to visualize the MASA or even to estimate the MASA, using the tools that have been developed for the SASA and are available via these programs. Given that most available tools used for the estimation of the SASA, to our knowledge, are approximate, we believe that the analytical method of Dodd and Theodorou is expected to be used in cases where the accuracy of the estimation is important, whereas the approximate estimates can be used in order to reduce the necessary computational cost where accuracy is not essential. A potential application of the MASA where the accuracy of the analytical method of Dodd and Theodorou is expected to be essential is "docking". Having in mind that the term "docking" is used to describe a variety of approaches that aim at sampling configurations where protein and ligand molecules are "most likely" to bond to each other as summarized in the Introduction section of this paper, the MASA can be used in order to facilitate the creation of such ensembles of configurations by giving the ability to sample all nonzero overlap placements of the ligand, given the internal degrees of freedom and the relative orientation. Although the ability to create samples with non-overlapping configurations is important on its own, we expect that our ability to provide an estimate with high accuracy will eventually turn the molecular accessible surface into a key ingredient in future "docking" applications since it can be used to "measure" the probability of creating nonoverlapping ensembles. Whereas in this paper we limit ourselves in extending the notion of the solvent accessible volume, by proposing two new concepts that of the molecular accessible surface, and the inaccessible molecular volume as the natural extension of the solvent accessible volume, in a continuation of this work we aim to demonstrate how the proposed concepts in this paper can be used for the creation of ensembles with well-defined weights and how this can be used in the further development of docking applications. Finally, the proposed algorithm for the estimation of the molecular inaccessible volume of ligands is expected to have practical uses in the estimation of protein-ligand binding affinities via staged insertion or particle deletion methods. It is therefore interesting to assess the effect that the range of hard core interactions may have on the proposed estimation. As it has already been shown in the development of particle deletion 59,79-82 and staged insertion methods, 83 it is also possible to use estimations of the accessible volume. These estimations are based on hard core interactions as part of the evaluation of the chemical potential in the case of molecules interacting via "soft" potential. In this case, the free energy difference related to the transformation of hard cavities formed through the hard core interactions is added to the final "soft" molecule. The overall calculation becomes independent of the actual choice by considering the range of hard core interactions smaller or larger than the minimum distance of two atoms expected to be in contact with each other under the given conditions. If the free energy difference is estimated via the staged insertion method, hard core interactions should be smaller than the minimum distance. If the particle deletion method is used, hard core interactions should be larger than the minimum distance. In any case, the range of hard core interactions is expected to be smaller than the distance at the first pick of interatomic radius of gyration.

Conclusion
In this work, the estimation of the molecular inaccessible volume and accessible surface area is proposed as a generalization of the SASA. We implemented the proposed approach for estimation of the protein-ligand inaccessible volume and accessible surface area upon a set of molecular systems of various sizes. We demonstrated how it is possible to estimate the proposed molecular volumes and surfaces using any available tool that can be used for the estimation of the SASA by adding the suitable set of auxiliary spheres as in the included example of Jmol. Furthermore, we have shown that by utilizing the power of analytical calculation of the volume of fused spheres by Dodd and Theodorou plus distributing the computational load via MPI, it is possible to make very accurate estimations in a variety of protein-ligand systems. The validity of our approach was assessed firstly by estimating the inaccessible volume via a stochastic Widomlike test insertion method and secondly, by comparing the molecular accessible surface with a numerical finite difference calculation. Finally, by drawing the connection between the proposed molecular inaccessible volume and free energy difference estimations via the staged insertion and deletion schemes, the molecular inaccessible volume ought to be used in the future for estimation of proteinligand binding affinities. Alternatively, it is expected to constitute an additional visualization tool, providing more specificity to the examination of protein-ligand interactions.

Data and software availability
We provide a FORTRAN based program that is able to perform the inaccessible volume and accessible surface calculations reported in this work by performing calls to a library that deploys Dodd and Theodorou estimation of the volume of fused spheres, kindly provided to us by Professor Theodorou. The program requires a minimal input of two . xyz files, the first one constituting the protein molecule and the second one possessing the ligand coordinates at the desired relative orientation. Additionally, example files are provided as ESI † which can be used for verification purposes. For more details, please check the ESI. †

Conflicts of interest
There are no conflicts to declare.