Probabilistic Prediction of Contacts in Protein-Ligand Complexes

We introduce a statistical method for evaluating atomic level 3D interaction patterns of protein-ligand contacts. Such patterns can be used for fast separation of likely ligand and ligand binding site combinations out of all those that are geometrically possible. The practical purpose of this probabilistic method is for molecular docking and scoring, as an essential part of a scoring function. Probabilities of interaction patterns are calculated conditional on structural x-ray data and predefined chemical classification of molecular fragment types. Spatial coordinates of atoms are modeled using a Bayesian statistical framework with parametric 3D probability densities. The parameters are given distributions a priori, which provides the possibility to update the densities of model parameters with new structural data and use the parameter estimates to create a contact hierarchy. The contact preferences can be defined for any spatial area around a specified type of fragment. We compared calculated contact point hierarchies with the number of contact atoms found near the contact point in a reference set of x-ray data, and found that these were in general in a close agreement. Additionally, using substrate binding site in cathechol-O-methyltransferase and 27 small potential binder molecules, it was demonstrated that these probabilities together with auxiliary parameters separate well ligands from decoys (true positive rate 0.75, false positive rate 0). A particularly useful feature of the proposed Bayesian framework is that it also characterizes predictive uncertainty in terms of probabilities, which have an intuitive interpretation from the applied perspective.


Introduction
Atomic level structures are an important source of information for inferring functional aspects about macromolecules and ligands binding to them. For instance, this is illustrated by the substantial amount of existing algorithms and structural data modeling software created for molecular docking and scoring purposes [1], [2], [3], [5], [6]. The Protein Data Bank (PDB) [7] offers the central public access to macromolecular structure files.
Although there is already a large amount of structural data available, it is by no means straightforward to model it reliably. There are several reasons for this, such as the inevitable errors present in experimental results and the ''averaging'' nature of the measurement process used in the construction of x-ray diffraction data. Moreover, along the conversion from a measurement to a structural coordinate file, several computational approximations and the subjective choices of experimentalists will influence the final outcome. Among the latter sources of variability, two major issues are flexibility of the molecules and computational constraints implemented in the refinement process. The first one is related to thermal motion and static disorder, and the second to biochemical a priori information that is always used in the refinement of a structure to create a coordinate file [6], [8]. These are accompanied by crystal packing effects, which also originate from the flexibility of the molecules, uncertainty in orientation and location of small molecules, including water.
It can be argued that for addressing the above-mentioned issues, statistical modeling provides the most promising approach, given its ability to capture uncertainties and errors in data. To meet these goals we introduce a Bayesian statistical method for evaluating atomic level 3D interaction patterns of protein-ligand contacts. Our work is motivated by the previous findings in Rantanen et al., [1], [2], [3] which showcased the usefulness of this kind of a multidisciplinary approach. However, given computational speed related constraints, it has not been possible to pursue these previous Bayesian methods further in contact preference exploration. Therefore, the method discussed here focuses on providing rapid means of computing, together with adjustability and robustness of the statistical model. The latter aspect refers in this context firstly to the constraint that two points in close proximity to each other (with respect to the system size) should not get very different contact preference hierarchies without an easily tractable reason. Secondly, in terms of robustness, the preference prediction model has to be balanced between adhering too closely to the possibly biased overall number of different types of contact atoms in the training data set and using a sole comparison of the probability densities defined for each contact atom type of a molecular fragment. Finally, the adjustability is concerned with both the model structure and the chemistry-based classification of molecular fragments. In our illustrations we consider 24 molecular fragment and 13 contact atom types, exhibiting interactions like hydrogen bonding, dispersion (e.g. aromatic-aromatic) and interactions between charged groups.
The main purpose of this paper is to show how a Bayesian statistical modeling approach can be utilized to make naturally ranked predictions about contact preferences, such that the model itself can be flexibly updated in the presence of novel data and other auxiliary information. Basically, this method is developed to retrieve information to be used in a knowledge-based scoring function. There exist several well performing scoring functions [9], [10] that utilize the experimental knowledge through inverse Boltzmann relation from statistical thermodynamics [11]. These functions depend only on distance between atoms, e.g., a ligand atom and a binding site atom. Our method differs from them in that also directional information is incorporated in the model, which has been shown in case of hydrogen bonding to still significantly improve evaluation of binding energetics from experimental data [12].
Three basic scoring function tasks have been defined [9], of which enrichment of ligands was tested with our method. The test was done through separating catechol-O-methyltransferase (COMT) ligands from decoys using logistic regression on a set of 27 small molecules having similar properties. The receiver operating characteristics (ROC) [13] for the results show that the probabilistic contact preferences give reliable information about the relative affinities in intermolecular contacts. These probabilities can be applied to intramolecular contacts as well. In practice, they are used as part of a molecular docking and scoring routine. The method described in this paper will be integrated as a functionality in the molecular modeling environment BODIL [4].
The structure of the article is as follows. First, data collection and the modeling approach are described. Thereafter, results from several case-studies are presented. Last, implications of the results and some future prospects are discussed.

Data collection and processing
We used PDB as the main source of data in this work. Training data for the model was collected from a set of approximately 28000 structure files published before January 1 st 2009. The files were selected using the criteria presented in Table 1. A reference dataset for model validation was selected under the same criteria as the training set and contained 10361 structure files published between February 2 nd of 2009 and 31 st August 2011. X-ray structures form the biggest group of data present in PDB. The bulk of a structure file is the coordinate section, but there is also chemical and biological information, interpreted as metadata, which is necessary for constructing a sensible predictive model. The type of metadata that is most directly deducible from the experimental observations is the atom type. The atom type classification can be considered sufficiently reliable for the higher resolution (,2.5 Å ) structures, however, with at least one exception, which corresponds to the nitrogen (N) and oxygen (O) atoms in a carbamoyl group (-CO-NH 2 ). In this group, O and N cannot be distinguished solely on the basis of x-ray diffraction data, because of the symmetric structure of the group and very similar electronic densities around both O and N. This is a prime example of a regularly encountered error in the metadata, which, however, can be corrected by reversing the coordinates of O and N. The ligand metadata would impose this error when for instance hydrogen bonding with the ligand would require an acceptor (O) contact, but a donor (-NH 2 ) contact is given a closer coordinate location in the structure file.
A considerably more difficult problem to handle is the influence of the constraints used in the refinement of the protein structure from experimental data. These constraints generate some unreliability in the coordinates, because only conformations with restricted geometries are allowed for the amino acid chain, which together with limited resolution can lead to artificially distorted conformations of ligand structures. In practice this means that the refinement involves fitting an alleged structure to the experimentally determined electron density map, which does not define all structural features uniquely, especially when the resolution is low [8].
Molecular fragments of pre-defined types (see Tables 2 and 3) were searched from coordinate ligand structure files in PDB. The search was based on atom types, chemical connectivity and geometry, and the identified fragments were then labelled for use in the extraction of coordinate data from protein structure files. To obtain unique fragment orientations, atoms from within a functional group were, when possible, chosen for the fragment definitions. In order to build a predictive model, the set of 24 fragment classes in Table 2 was used while collecting a dataset of approximately 70,000 contacts, representing the 13 contact atom types, i.e. target classes in Table 3.
Regarding the contact classes in Table 3, for example, the class C3 represents a pure van der Waals contact [14] and class C4, a hydrogen donor in a possible weak hydrogen bond in addition to a van der Waals contact [15], [16]. Aromatic carbons (C5, f11) can participate in both of the typical C3 and C4 interactions [17]. Halogen bonds have a role in biological processes [18] and therefore Fluorine [19], Chlorine, Bromine and Iodine are considered as so called fragment Main-atoms, as shown in Table 2. The target atoms in proteins, identified with three distance criteria (ƒ3:3 Å for alleged H-bonds and charged groups, ƒ3:7 Å for probable dispersion and ƒ3:9 Å for halogen bonds), were classified during the search using three criteria: 1) element, 2) amino acid residue and 3) side or main chain atom. The interaction was defined as between a fragment type and target type, or between nuclei, mediated by protons and/or electrons.
A fragment was defined by an atom triple: Main-atom, Atom1 and Atom2, and at least the Main-atom was given the following characteristics: element, covalent bond count, aromaticity and possibly functional group, see Tables 2 and 3. These characteristics were used in collecting data from PDB, resulting in coordinates with metadata. The aromaticity of an atom was decided using PDB Ligand Dictionary through PDBeChem [20].
In addition to classification, target atoms have to be put in one coordinate system, i.e. fragments are superimposed. This was done using an elementary translation-rotation: first the database coordinates of the Main-atoms were translated to origin, which creates a new dataset (F database below in eq. 1), and then a rotation operation was defined to connect the fragments reference frame to a common coordinate system. This requires solving the following matrix equation: where R refers to a 3|3 rotation;F~½ r r 1, r r 2, r r 3 , r r i~½ x i ,y i ,z i T and r r 3~ r r 1 | r r 2 , i.e., r r 3 is the cross product of r r 1 with r r 2 . Thus, we have the equation, which was solved for each fragment. The resulting R was then used in the translation-rotations of the respective target atom position vectors to the common coordinate system. When the data is collected as mentioned above, as well as classified and coordinate systemized in this manner, the process results in a collection of three dimensional distributions of points that present measured relative positions of specified atoms with respect to specified fragment types. These distributions were then modeled with 3D probability densities described below.

Statistical modeling
To obtain predictive distributions for contact preferences we utilize a Bayesian framework where the observed 3D coordinates in the training data are modeled with interconnected parametric 1D densities, such that the parameters are provided a priori uncertainty characterizations in terms of probability distributions. The prior distributions enable regulation of parameter estimates in order to prevent them from depending solely on the observed data, which is desirable especially under the circumstances where the data generation process is known to harbor intrinsic biases. Also, regularization of model parameter estimates with the prior information is most crucial when certain class pairs have only very sparse training data, in which case unsmoothed estimates can be strongly biased.
The core distribution we utilize for characterizing coordinate variability is the von Mises-Fisher distribution (vMF) which is widely applied for modeling directional data. Separate probability densities for all three coordinates were necessary in order to capture the properties of the target atom clouds in a uniform setting (details provided below). Spatially the most complex (multimodal) observed differences in target atom distributions are found around the main direction of the fragment, and to a somewhat lesser extent with respect to distributions of polar angle, i.e. angular deviation from the main direction, see Figures 1 and 2. The distance distributions are given a priori as many modes as the corresponding polar angle distributions have, though in most cases they practically form a unimodal density, but not always. This is explained more thoroughly later in this section. The variables and parameters of the densities used in our work are specified in Tables 4 and 5.
Let c denote a generic set of parameters specifying a density in a 3D space. Then, the probability density in spherical polar coordinates f fC (r,w,hDc) for fragment class f and target class C is assumed to be of the piece-wise defined form     Figure 1. It is depicted in the same reference frame with the density. They can be interpreted as overlayed such that elevations in the density correspond to dense areas in the cloud of data points. The main direction of the fragment, as described in the caption of Figure 1, is defined by h~0 in the reference frame of the model, but corresponds in this figure to ½h~p 2 ,w~0, where h is the polar angle and w is the azimuthal angle. f fC ( r rDfm i,r g,fs 2 i,r g,fl i g,fk i g,fm j g,fs j g)! X where r r~½r sin(h) cos(w),r sin(h) sin(w),r cos(h) T , the b i divide (0,p) and c i divide (0,r (cutoff ) fC ) to non-overlapping intervals. The limit r (cutoff ) fC is the maximum distance used in collecting the target atom locations for a fragment class and target class pair (fC). The intervals b i and c i are associated with weight z i , P N vM i~1 z i~1 . The a ij then, are non-overlapping intervals of (0,2p) associated with the weights w ij , PN N,i j~1 w ij~1 (see below) and the different density components and parameters are defined using conventional nomenclature as: Equations 4 show the forms of the densities in the particular coordinate system, or reference frame, that is used for modeling and in which the main direction of the fragment coincides with the positive z-axis. A likelihood function is obtained as a product from the values of the components of the density (eq. 3) for n i (or n ij ) points representing each of the included regions in the sample space, where Variance of Normal density for the distance (i as below) for k observed distances.
The structure of the density (eq. 3) was chosen based on investigations of the forms of the target atom clouds. For example, use of normal densities for the distance data was supported by large p-values in Kolmogorov-Smirnov normality test and the numbers of components needed in the angle dependent part of the density (i.e. N vM and N N,i ) were automatically chosen based on frequency distribution of binned data. This was done for both angles (h and w) by connecting the heights of the adjacent bars of the histogram, creating a sequence of values, in which every change of sign corresponds to a local minimum or a maximum. The number of maxima was restricted to the interval [1,5], was used to represent the number of modes in the density. The number of maxima was restricted by either reducing or increasing the number of bins in case the result would be outside the given interval.
In the separation of variables, azimuthal angle and distance are conditioned on a polar angle interval, see the equations in (4). The angular part reflects arc-like structures around the main direction of the fragment (for examples of this see last paragraph of section Examples 3 and 4). The angular deviation from the main direction is the leading variable in the sense that a multimodal azimuthal angle distribution (i.e. around the main direction) and a unimodal distance distribution are defined separately inside each polar angle segment. The idea behind this is that the peak of a polar angle density is an indicator of the strength of the interaction between a fragment and a target. The smaller the angle, the stronger the interaction, and if there is any effective multimodality in the distance density, the modes should coincide with the modes of the polar angle density. The azimuthal angle distribution thus completes the directional structure within each polar angle mode.
The uncertainties of the distance and the azimuthal angle variances are difficult to model due to the data generation process, the limitations of which were discussed above, and therefore, we use the standard maximum likelihood estimates calculated marginally from observed coordinates. On the other hand, the means m i,r , l i and m ij are central parameters representing a measure of the strength of the interaction between the fragment and the target. 0.2.1 Parameter prior densities. A prior distribution in Bayesian statistics can either be used to model uncertainty about a parameter or to include a priori knowledge, or beliefs, in the model [21]. Both of these uses are necessary for our modeling purposes.
Prior densities can be chosen in various ways such that they are either conjugate distributions for a particular likelihood function, or some other probability densities possessing required statistical properties, such as an asymmetry. Priors utilized here for individual parameters in the model densities are summarized in Table 6. In Table 6, I 0 (:) is the zeroth order modified Bessel function of the first kind. For the von Mises and Normal distributions, conjugate priors are used (for the mathematical derivations related to these distributions see [22], [23], [24]).
Our prior density for the parameters is a slightly modified version of the distributions considered in [25] and [26]. The density has the form: In equation (8) hyperparameters n i and c i represent measures of modeler's belief in the expected values of the concentrations. The larger the value of the hyperparameter, the more the prior is concentrated around k 0 or k l,0 . A default choice for any of these hyperparameters is the number of observations available for calculating the estimates for k 0 and k l,0 .
0.2.2 The posterior distribution. In Bayesian statistics, learning from observations takes place through the posterior distribution which is accessible from the joint probability density defined for the data and the parameters. For our piece-wise defined likelihood, the joint density is formally defined as Table 6. Prior densities for model parameters. p(fm i,r g,fl i g,fk i g,fm ij gDfm i,p g,fl i,p g,fk i,p g,fm ij,p g)L (fm i,r g,fl i g,fk i g,fm ij g) : : The posterior density (eq. 9) has the same functional form as the prior density (eq. 8), but with updated parameters. As shown in the equation (9), the prior parameters m 0 , l 0 , k 0 and m w,0 are updated to m i,p , l i,p , k i,p and m ij,p , respectively, in the usual manner in Bayesian inference. In contrary, the remaining parameters are determined either directly from the data or given a suitable value based on chemical knowledge.
In order to define contact preferences, every fragment class and target class pair has to be specified with some characteristics. Maximum a posteriori (MAP) estimates are in this respect suitable when the associated densities are (piecewise) unimodal. The MAP estimates of the parameters are defined and updated with new data according to the formulae in Table 7. The prior parameter s 0 was given a constant value 0:01(Å 2 ) and R 0 was defined separately for each fragment type.

Updated parameters and the probability mass in a
reference volume. In our method, to evaluate the plausibility of a contact atom type in a given spatial area, the probability mass within in this volume is evaluated. The mass is calculated using the model densities (4) with updated parameters, see Table 7. The spatial area, or volume, is defined by a distance interval and a solid angle (i.e. intervals polar and azimuthal angles), and can be arbitrarily located. The volume that contains all target locations is defined through the intervals ½0,r (cutoff ) fC , ½0,p and ½0,2p for the distance, polar angle and azimuthal angle, respectively. The cutoff r (cutoff ) fC is the maximum distance used when collecting data for a fragment class and target class pair (fC). The size of the volume can be chosen to be large, when for example contact preferences on either side of the fragments plane are investigated. Alterna- Table 7. MAP estimates of model parameters.

Posterior variable MAP estimate Definitions and estimates
Mean of distancem m i,p~ y y~1 m i X mi l~1 y l , ),: Concentration of polar anglek k i,p~ki,p k i,p numerically by equalizing model and data variances.
Mean of azimuthal anglem m ij,p~ y y~1 n ij X nij l~1 y l , doi:10.1371/journal.pone.0049216.t007 tively, the size of the volume can also be small, depending on the situation under investigation. The spatial information content of the model is coded in the particular functional form of the probability density, but if one would solely rely on probability densities, it could easily happen that a scarce contact atom type could get a hierarchically higher preference position than a relatively often encountered type. This could happen in a spatial area where all the few contacts of the former type are observed. These kinds of problems are avoided and results for different contact types made more directly comparable by supervising the model such that chemically more likely contacts are paralleled, as well as the less likely. The model supervision was here achieved by multiplying the probability masses with target type specific weights that are calculated from three parameters electronegativity, softness and mean distance. Softness of an element e, one from the group G = {C, N, O, S, F, CL, BR, I}, was defined as twice the mean value of hardness among the elements in G, minus hardness of the element e. Numerical values for absolute hardness were taken from Parr et al. [27]. The electronegativity and softness were used to represent the tendency of an element to obtain partial charge in a compound. The third parameter, mean distance, is a measure of the strength of the interaction between a fragment and a target, and the numerical value given to it was the arithmetic mean of the    where e f and e C are the obtained partial charges for a fragment Main-atom and a target atom, respectively, and Dr fC D is the mean distance between the Main-atom and the target atom. The motivation for this prior is that the intermolecular interactions are mainly electrostatic, despite of the fact that they occur in many different forms, e.g., between a permanent dipole and an induced  dipole, or between two induced dipoles, known as a London dispersion.
In the above formulation, it is assumed that a generic a priori information can be accurately utilized when modeling an interaction between f and C. It would also be possible to use calculated energies of some simplified fragment-target model as the a priori information, but the described approach is chosen because of its simplicity and independence of molecular details, which follows from utilizing element specific, measurable parameters, i.e. ionization energy and electron affinity [27]. Calculated prior probabilities relevant for this study are given in Tables 8 and  9.

Hierarchy calculations
In order to calculate the spatially dependent hierarchies around an arbitrary reference point  we defined intervals in spherical polar coordinates: which define a volume that includes r r ref ,z , e.g. r ref~1 2 (r 1 zr 2 ),  [28] or directly as Riemann sums.
The fC-specific probability mass is the factor that gives a contact atom type C its rank, in the fragment class f related, and around a reference point r r ref defined hierarchy. Namely, the bigger the mass, the more probable the contact atom type. The volume can cover a larger portion of the neighborhood of the fragment, for example, the hemisphere on either side of the plane of the fragment.
A hierarchy can also be defined for example among the fragment types (f ), with respect to a representative of a target class C around r r ref . The motivation for choosing the probability density and the estimation procedures of the model parameters as described in Methods, is that they provide a rapid and flexible way to capture the relevant features of the target atom distributions, without relying on fine details of the target atom clouds, which are potentially misleading due to the intrinsic uncertainties in the data generation process.

Results
The functionality of the introduced Bayesian method of finding hierarchies is here illustrated by a number of case-studies. The calculated hierarchies are compared with contact atom type counts found in the reference data, in the volume surrounding the reference point r r ref , representing a target atom location. The hierarchies and the reference frequencies are not expected to be Figure 7. Reference point r r (2) ref .
Comparison of contact atom frequencies with model based probabilities for fragment classes f17, f20, f34 and f36. Hierarchy is given by the calculated probabilities, which are represented as circles. The circles are joined with a line to illustrate tendencies among target classes. The bars represent the fractions of target atoms belonging to a particular class. Both sum to one over target classes. Also shown are standard errors for both the frequencies and the model based probabilities (joined with lines and centered around the mean values). The contact atom counts in reference data for f17, f20, f34 and f36 are 4, 7, 0 and 1, respectively. doi:10.1371/journal.pone.0049216.g007 perfectly congruent, due to the fact that the reference data set is not exhaustively large. Therefore, they are expected to simply show some of the specific features around the reference point r r ref .
In order to illustrate the reliability of the results, standard errors for all results were determined with a bootstrap type procedure [29], which is described together with the corresponding results.

Example 1: A typical location of a hydrogen bonding partner
First, we consider a reference point equal to r r (1) ref~2 :8 : cos( which is located just below the plane of the fragment, with a polar angle deviation of p 3 to the left from the main direction, see Figure 3. The intervals D (eq. 12) used were  Table 2) are given in Table 10 below. They are also represented graphically for f2, f5, f26 and f27 in Figure 4. The error bars represent separately standard errors for the model based probabilities and the frequencies in reference data. They were defined by calculating both quantities in a set of 1,000 r r (1) ref centered volumes, that were slightly different in shape and size. To investigate how the inclusion of additives as ligands in the training and reference data sets would affect the model-based rankings, we divided the data into two sets using a list of approximately 770 additives that are found in PDB [30]. Contact data for these additives was removed from the original set to create a new additive-free set. Then, based on the model trained with the new dataset, contact hierarchies were calculated and probability masses were compared with reference data that also lacked the additives.
We replicated the calculations in Example 1 using the additivefree data and the results are shown in Figure 5 and Table 11. In addition, a comparison of the original and new contact atom counts is presented graphically in Figure 6. Comparisons of the rankings reveal that they are nearly identical in the two situations. Nevertheless, some discrepancy does occur and this example illustrates the need of a careful consideration about what structures can be included as representative information to the training data. If predictions based on multiple different training sets are extensively monitored and discovered to be sufficiently similar, it is possible to consider the use of a universal training data collection where the different sets are merged. i.e. on the positive x-axis when defined in the reference frame of Figure 3. The intervals D (see eq. 12) used were r 1 ,r 2 ,h 1 ,h 2 ,w 1, w 2 ~½3:25 ,3:55 ,0, p 5 ,0,2p: Table 12 presents probabilities for classes f13, f17, f18, f20, f21, f34, f35 and f36 and the results are represented graphically for f17, f20, f34 and f36 in Figure 7. The error bars represent separately standard errors for the model based probabilities and the frequencies in reference data. As in the previous example, they were defined by calculating both in a set of 1,000 r r (2) ref centered volumes, that were slightly different in shape and size.

Examples 3 and 4: Two distances in a direction perpendicular to the plane of the fragment
The third and fourth reference points, in the reference frame of Figure 3  and the corresponding volumes are centered around the negative z-axis, as shown in Figure 3. The results are represented graphically for fragment classes f2, f5, f11 and f22 in Figure 8. The error bars represent separately standard errors for the model based probabilities and the frequencies in reference data, calculated in the same manner as in the previous examples.
Hierarchies calculated when the additives were excluded, as discussed in Example 1, are presented in Figure 9. It is seen that a larger discrepancy in the model-based rankings compared with the reference data occurs when the reference data are extremely sparse, which makes the unsmoothed relative frequencies highly volatile. To also visualize more generally how our method smooths scatter data about different types of contacts, 3D-plots of estimated densities are shown in Figures 10, 11, 12.

Example 5: Direct contacts of R-norepinephrine
Here we consider the hydrogen bonding and aromatic interaction preferences of norepinephrine (also known as noradrenaline; PDB ligand identifier: LT4). The molecular environment of this example is the norepinephrine binding site in chain B of human phenylethanolamine N-methyltransferase (PNMT) from PDB entry 3HCD. PNMT catalyses adrenaline synthesis with coenzyme S-adenosyl-L-methionine (AdoMet). In the structure of 3HCD AdoMet is replaced by its demethylated form S-adenosyl-L-homocysteine (AdoHcy) to study the binding mode of LT4 [31]. The x-ray resolution of the entry 3HCD is 2.39 Å , which is near the upper limit considered in our study (v2.5 Å , see Table 1). As discussed previously, this means that some precaution is necessary while deducing interactions from the structure. Consequently, the statistical nature of our method is helpful, since the probability densities can indicate a certain relative location that is associated with what is experimentally observed in other structures. This enable the quantification of the relation in question as a probability.
LT4 has three hydroxyl groups, a terminal amino group and a six-carbon aromatic ring as its functional groups. The most preferred target class for any of these functional groups is defined as a class for which the product of the probability density peak value (eq. 3) and the class conditional prior probability (see Tables 8 and 9) has the highest numerical value.
Two of the hydroxyl oxygens are bonded to the aromatic ring (cathecol hydroxyl groups), and based on the model, prefer a nitrogen from histidine or arginine side chain as a contact. The aromatic ring carbons then prefer aromatic carbons, i.e. phenylalanine, histidine, tyrosine or tryptophan is a likely contact residue. The aromatic carbons also have strong contacts from the hydrophilic targets, for example carboxyl oxygens. The hydroxyl Figure 10. Contacts for carboxyl oxygen (f6). Showing differences in spatial arrangement of contacts among three target classes -C4 (alpha carbon), C10 (amino nitrogen) and C12 (carboxyl oxygen). The scatter plots contain all the target atoms found in the fragment class f6 training data for these target classes. doi:10.1371/journal.pone.0049216.g010 group of the aliphatic tail prefers lysine and the terminal amino group prefers glutamic acid/glutamate or aspartic acid/aspartate. These a priori preferences are not related to LT4, instead only the 3D structure of the interactions is. Some directional aspects related to this example are illustrated in Figures 13,14,15,16,17. Figures 14 and 15 present hydrogen bonding contacts for the Rnorepinephrine tail. There are two carboxyl oxygen (C12) contacts for the LT4 hydroxyl group, namely GLU B219 and ASP B267. The former is at a distance less than the maximum length used in this study for a oxygen donor and oxygen acceptor hydrogen bond, i.e. 3.14 Å v3.30 Å . According to the model, its direction of approach is not typical for a hydrogen bond, but because the same GLU B219 residue is simultaneously a contact for the adjacent amino group, the actual preferred direction is such that it allows the carboxyl to bond with both of these functional groups in LT4. Therefore this is considered a direct contact. Regarding the amino group, the direction of approach of the GLU B219 carboxyl oxygen is typical for a hydrogen bond (almost optimal), only somewhat shifted to a direction that facilitates the double contact described above, see Figure 15. The latter carboxyl, ASP B267, is in a more typical direction, but even further apart, and it is confirmed from PDB entry 3HCD water locations that this contact is a bridged hydrogen bond, not a direct contact.
The TYR B222 contact for the hydroxyl of LT4 tail, see Figure 14, has an aromatic ring that can serve as a hydrogen bond acceptor. Therefore, in case the LT4 tail hydroxyl group would act as an acceptor in the above described hydrogen bonds (i.e. with water and GLU B219) it could in principle donate its hydrogen to a weak hydrogen bond with the aromatic ring of TYR B222, because the closest atom of the ring is at a distance of about 3.5 Å and the ring is facing toward the hydroxyl group. Consequently, as for the LT4 amino group this is not a strong contact, but perhaps has a guiding task in the binding process.

Example 6: Separating COMT ligands from decoys in a subset of DUD
Here we demonstrate the usefulness of the estimated contact probability masses in discriminating between appropriate and poor binders using logistic regression, see e.g. [32]. In the current application context, logistic regression model connects a binary response variable (here ligand/decoy), with explanatory variables describing the modeled system. The outcome is a probability indicating how likely it is for a system to belong to either of the two response groups. Our testing was done by retrieving a set containing 6 out of the 11 Catechol-O-methyltransferase (COMT) ligands and 19 out of the total 468 decoys from the Directory of useful decoys (DUD) [33]. Two extra ligands were added, namely Figure 11. Contacts for amide oxygen (f10). Showing differences in contact arrangements for two target classes -C8 (carbamoyl nitrogen) and C9 (imidazole, guanido or indole nitrogen), and also distance dependence for class C9. The scatter plots contain all contacts found in the training data for these fragment and target class pairs. Note that the target atom clouds in the middle and on the right are the same. doi:10.1371/journal.pone.0049216.g011 dopamine and BIA 3-335 (PDB Ligand identifiers LDP and BIA, respectively), by retrieving their structures from ZINC database, see [34]. These 27 small molecules (ZINC codes in Tables 13, 14 and 15) were chosen so that the DUD molecules have high mutual resemblance, especially so that the decoys have an aromatic ring with at least two primary oxygens bonded to neighboring carbons (in hydroxyl groups typically). This is because all except one COMT ligand in DUD have this type of a structure, and the exception is different only in that the ring is non-aromatic. The two extra ligands were included for reference, because they are known good binders, for BIA [35], and should have clearly higher preference to binding than an average decoy.
The search for the binding mode of the small molecule in the binding pocket (from PDB ID 1H1D) was started by orienting the molecule such that two of the primary oxygens would coordinate with the magnesium ion (Mg 2+ ), participating in the COMT function (see [35]), and here taken as part of the binding site. Then, predefined rotamers of the small molecule were rotated around the axis connecting the two coordinating Os, and to a lesser amount around a second axis. Direct contact probabilities between the small molecule and the binding site were calculated. Probabilities for the two coordinating Os were excluded to emphasize contacts for the rest of the molecule. Rotamers and orientations with close intra-or intermolecular contacts were removed using distance criteria, though the plausibility of a rotamer could be evaluated using probability masses for intramolecular contacts.
For each small molecule, the rotamer and orientation with highest probability were found and the probabilities were then used in a logistic regression model that mimics a docking screening task. Two explanatory variables were used in the logistic regression model: the total probability mass of direct contacts, divided by the mass of the molecule (pmPerMass) and the ratio of the number of hydrophobic and hydrophilic fragments (Phob/Phil) in the molecule.
It is well known that in an actual binding affinity calculation for a ligand-protein pair in solution, one needs to consider energetics of direct contacts, water and metal mediated contacts, desolvation and entropy. The variable pmPerMass is here considered to be a measure of the binding energy of direct contacts, whereas the variable Phob/Phil reflects desolvation properties and perhaps tendency for water mediated contacts. Configurational entropy doesn't have in this study any obvious representative, because for the numbers of rotatable bonds (RotBonds) no predictive role was identified. Results of the predictions based on logistic regression are shown graphically in Figure 18. Values for the putative explanatory variables are given in Tables 13, 14   The molecule that was most highly ranked, BIA 3-335, is a known tight binding inhibitor [35]. It is also heaviest of the 27 molecules included in the example; approximately 360 hydrogen masses compared to the more typical value that is between 150 and 250. The variable pmPerMass is intensive, i.e., the size of the molecule should not directly influence it's value, and it is assumed that success in predicting relative binding affinities for smaller candidate binders depends strongly on the accuracy of this variable.
Probabilities derived from the logistic regression are on average over 0.5 for the molecules in the alleged ligand group and below 0.5 for the alleged decoys, which represents a natural threshold between a ligand and a decoy in a screening process. Two exceptions in the ligand group are the low scoring molecules with index values 1 and 4. The ligand with index value 1 has the third Figure 14. Contacts in the active site of PNMT (a methyltransferase) for R-noradrenaline tail hydroxyl group. Three amino acid residues (ASP B267, GLU B219 and TYR B222) were considered as contacts. doi:10.1371/journal.pone.0049216.g014 Figure 13. The aromatic phenylalanine contact PHE B182 in PNMT (see text of section Example 4) for the LT4 aromatic ring. Also depicted in the figure is the proximity (closest 3.4 Å ) of the TYR B222 aromatic ring to the amino group of LT4. Though the distance and orientation of the ring fit well to the hydrogen bond donor -aromatic ring interaction scheme, the relative direction is such that TYR B222 corresponds to probability density values smaller than 20% of the peak value. Therefore, this might not be a strong contact for the amino group, but possibly has a guiding task in the binding process. The depicted distance between amino group and the methyl group donating sulfur atom is 5.66 Å . doi:10.1371/journal.pone.0049216.g013 highest pmPerMass, but quite low Phob/Phil value, while the ligand with index 4 has a low value for both (see Table 13). Hence, if both these molecules are considered as good binders, our approach does not contain enough information to reveal this. On the other hand, there are not necessarily experimental data available on the binding affinities for the decoys, which in this study were chosen to resemble the ligands as much as possible, each starting with the two, to Mg 2+ anchoring primary oxygens bonded to an aromatic ring. This means that some decoys might be reasonably good binders. Nevertheless, based on the logistic regression model, molecules in the ligand group have on average clearly higher probabilities (0:62) to be ligands than the molecules in the decoy group (0:16). In summary, in the set of 27 chemically similar small molecules, containing 8 experimentally defined ligands, 6 highest ranked molecules were from the ligand group. Consequently, the receiver operating characteristics (ROC) [13] in  the screening experiment are: true positive rate TPR = 0.75, false positive rate FPR = 0 and accuracy ACC = 0.93. A ROC curve was produced by using discriminating thresholds having either different TPR or FPR. The ROC curve is given in Figure 19.
One important aspect that has not been considered here, is whether the active form of COMT is a monomer or a multimer. If it is a multimer, it would be interesting to investigate how informative this characteristic is about the properties of suitable ligands. Additional potential molecular characteristics for further study are the flexibility of the binding site and the features of the binding modes having the highest probabilities (pmPerMass).

Discussion
Predictions about unresolved binding sites, or ligands, can be made by building the preferred contact patterns from the molecules included in a set of functionally classified fragments. In out method, these contact patterns are composed of probability masses calculated for the fragments to have a specific kind of contact in a spatial area. When, for example, the binding affinity of a molecule is studied and the probability masses are defined for an entire molecule, they can be used in a docking and scoring procedure. The absolute binding affinity would be given by the total energetics of the binding process in a thermodynamic setting, including direct and bridged contacts, desolvation and entropy. It is presumed, that using a fragmentation where the fragments have distinct and unique contact patterns, the probability densities described here contain information beyond the chemical complementarity, namely on energetics (for results in this direction, see [36]). This is reasonable by an analogy with quantum mechanics, because it can be argued that the probability masses are proportional to the amount of binding energy, which are needed in evaluating the binding affinities. In our setting this means relative binding affinities, i.e. rankings over a set of ligands and decoys.
The results obtained in Example 6, show a level of reliability that is typical for a successful scoring function, see e.g. reviews [9], [10]. Our experiment revealed that potentially very reliable information could be retrieved when our probabilistic method is combined with an effective search routine. An important aspect is that the ligand and decoy molecules were similar, i.e. the decoys used were 'drug-like' [29]. This is based on that they typically had masses between 150 to 250 hydrogen masses, contained both hydrophobic and hydrophilic fragments throughout the structure and were chosen so that each can be anchored to the magnesium ion in the COMT binding site. This should make separation of ligands from decoys challenging and be ultimately based on finer details of the binding affinity, because no decoy was readily rejectable. In a docking and scoring routine such a method can also be used to find the most favourable orientation for the most favourable rotamer, or conformer, of a small molecule in a binding site, i.e. the pose. When adjusting the method for calculations of Figure 17. R-Norepinephrine (PDB ligand identifier: LT4) with amino acid residues that contain target atoms having highest probability density values in the model. In the figures on the right, the double contacts are created by a degeneracy that follows from the way the fragments are defined. Namely, the third atom (Atom2, see Section Data collection and processing) of a fragment in a molecule can frequently be chosen from more than one possible atom and each choice creates its own probability density, including a maximum. These densities are connected through a rotation around the covalent bond between the Main-atom and Atom1. doi:10.1371/journal.pone.0049216.g017 the absolute binding affinities, the same difficulties will be faced as for any knowledge-based scoring function, see [9].
In addition to the quality of the fragmentation, the reliability of the data is a central issue in the prediction of contact preferences and some issues related to this were discussed in Sections 1 and 2. When choosing the structure for a prediction model, it is essential to understand the data generation process; in principle from the experimental measurement to the coordinate file. Regarding the special characteristics of the experimental method, x-ray diffraction is sensitive to thermal motion in the crystal. This weakens locally the electron density map, and since electron density maps are precisely the starting point in structure refinement, such an effect should preferably be assessed. In the further refinement, constraints are used in order to keep the protein structure within chemically acceptable boundaries. It thus follows that the ligand atom positions have uncertainty which is not straightforward to quantify. Possible approaches to quantification could be exploration of the effects of the constraints on a theoretical basis or using structures refined with different constraints. On the other hand, PDB files contain substantial amounts of metadata that could potentially also be used in modeling. An example of this are the bfactors, which can be used for incorporating thermal motion in the model. Another example is provided by the occupancies that are needed to take into account the more long lasting local displacements, i.e. alternative conformations in the crystallized protein-ligand complexes [8].
Though the results in the example sections are given with standard errors [29], performance of our model in predicting favourable intermolecular contacts could be more quantitatively verified in the future when more extensive reference sets of sufficiently high quality become available. The approximately 10,000 structure files from PDB used as reference data did only give a preliminary test for certain fragment types. This is mainly because of the 3D nature of the problem, since in order to obtain a good spatial resolution, the frequencies need to be defined in less extensive volumes.   Following a reviewer's suggestion, we added a ROC curve to Example 6, which makes it more informative.

Conclusions
The hierarchies illustrated in our examples show both spatial dependence and reliability. They can also be evaluated quite rapidly from coordinates within a classification, typically in seconds, or in tens of seconds. Thus, tentatively our approach can be used to study structural aspects of biochemical reactions or as a tool in predicting the most favourable binding modes and separating ligands from decoys, as described in the examples. A plausible future test would be to create a hierarchy among a group of ligands and compare their binding probabilities to experimentally measured binding affinities, e.g. those of KiBank database referred to in DUD. Test on each stage of the docking and scoring procedure has to be successfully conducted, before it is shown that the method is applicable for the purpose. Then it can be directly compared, e.g., with the knowledge-based potentials that are only distance dependent.
Reliable evaluation of binding affinities for potential ligands of a binding site, would be a desirable feature of a virtual drug design screen, see for example [9], [37]. As discussed, the distance and direction dependent probability masses obtained with the approach described here, are taken to provide direct information on relative binding affinity, which is supported by the results in Example 6. Regarding further development of our modeling approach, both statistics-and chemistry-based generalizations and improvements are possible, including the obvious expansion to all imaginable molecular fragment types.
Bayesian predictive modeling in the normative sense as defined in [38] provides a potential approach to representing contact preference distributions. Such a predictive model could exploit directly the 3D structures of the probability densities (eq. 3) that model the contact atom positions, instead of considering density parameters as the main characterization of the spatial information. The obvious disadvantage of such an approach is the considerably increased computational effort needed to derive approximations to the sought after predictive distributions.
The reliability of an inferred hierarchy depends, on one hand, on how successfully the error from experimental methods and structure refinement is quantified in terms of the used probability densities. On the other hand it depends on how realistically the chemical likelihood of a contact atom type and the bias in the data set are taken into account. The latter are here incorporated as prior information, see equation (10), which guides the model with chemistry-based knowledge. A third fundamental area for chemistry-based improvements are the classifications (Tables 2  and 3). For example, a classification can be envisioned where the covalent bond count of Atom1 (see Data collection and processing) would be used as one of the characteristics defining the fragment, which would then remove the degeneracy described in section Example 5. This kind of more structural way of defining the fragments would expand the classifications, but should also give fragment definitions that are closer to being unique.