Molecular Diversity Assessment: Calculation

A new pairwise similarity formula has been defined and used for molecular diversity assessment based on a novel method. It is demonstrated here that the logarithmic relations of entropy and indistinguishability give the expected diversity values which decrease with the increase in species similarities. The calculation by Agrafiotis would be successful if the present pairwise similarity is used.


Introduction
With the development of high throughput screening technology in recent years, the acquisition of molecular and biomolecular samples by collection and combinatorial synthesis has now become the bottleneck in the process of new drug discovery [1]. It suddenly occurred to us that a collection (or a library) of chemical substances (compound samples) was no less precious than a collection (or a library) of chemical information (chemistry books) [2] and the molecular diversity assessment became an urgent topic [1].
A recent article by Agrafiotis [3] on calculations using our method [4] demonstrated that a correct definition of the pairwise similarity is a necessary condition for obtaining the expected results in diversity assessment. That calculation [3] did not give the expected results, which is attributable to an improper definition of pairwise similarity. The similarity scale used there [3] implies that the maximum number of distinguishable species is always only 2, instead of N as required in the original method [4] for assessing the diversity of a collection of N chemical samples.

The formula
For convenience, our method [4] is briefly summarized here. The diversity index (D) is defined as the ratio of the information (I) to the maximum information ($I_{\max}$), as given by eq 1.
$D = I / I_{\max}$ (1)
This method is based on a new theory where entropy is clearly defined as information loss by the following relation:
$S = S_{\max} - I$ (2)
In this equation, entropy is given by the familiar expression
$S = -\sum_{i=1}^{w} p_i \ln p_i$ (3)
where $p_i$ is the probability of the ith microstate with the property that
$\sum_{i=1}^{w} p_i = 1$ (4)
while the maximum entropy is
$S_{\max} = \ln w$ (5)
where w is the indistinguishability number, which is the number of microstates of indistinguishable property.
The apparent indistinguishability number of microstates ($w_a$) is defined as
$w_a = \exp\left(-\sum_{i=1}^{w} p_i \ln p_i\right)$ (6)
and eq 3 becomes
$S = \ln w_a$ (7)
which is the logarithmic relation of entropy and indistinguishability.
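As a minimal illustration of this logarithmic relation, the following Python sketch computes the entropy of a set of microstate probabilities and the corresponding apparent indistinguishability number. It assumes that the apparent indistinguishability number is the exponential of the entropy (so that the entropy is its logarithm) and that natural logarithms are used; the function names are hypothetical and chosen only for this example.

```python
import math

def entropy(probabilities):
    """Entropy S = sum(p_i * ln(1/p_i)) over microstate probabilities."""
    if not math.isclose(sum(probabilities), 1.0, rel_tol=1e-9, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Microstates with p_i = 0 contribute nothing (limit of p ln p as p -> 0).
    return sum(p * math.log(1.0 / p) for p in probabilities if p > 0.0)

def apparent_indistinguishability(probabilities):
    """Assumed definition: w_a = exp(S), so that S = ln(w_a)."""
    return math.exp(entropy(probabilities))

# Four equally probable microstates behave as four indistinguishable ones.
uniform = [0.25, 0.25, 0.25, 0.25]
# One certain microstate: nothing is indistinguishable from anything else.
certain = [1.0, 0.0, 0.0, 0.0]

w_a_uniform = apparent_indistinguishability(uniform)
w_a_certain = apparent_indistinguishability(certain)
```

Under these assumptions the uniform case gives an apparent indistinguishability number equal to the number of microstates, and the certain case gives 1, which corresponds to zero entropy.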
Practically, in order to record information, a system composed of N "unit devices" is used. In computer science, or in our daily information recording as well, these "unit devices" are N individuals (such as symbols) assembled on a medium such as a piece of paper. These individuals appear as M attributes, based upon which it is said that the system has M species, such as the two species 0 and 1 in the binary system [4].
Because the assessment of the diversity of N chemical samples is our only concern here, the individual number (N) and the maximum species number (M) are designated as the same: $N = M$. This can be envisaged as N holes in microplates used for high throughput screening containing N compound samples [9].
If these N compounds are all distinguishable, they can be used to record the maximum information as given by eq 8. If red ink is used to represent 0 and blue ink 1, and two bottles of these different inks are used, 2 bits of information can be recorded if the number of individuals (N) is 2. There will be 4 ($2^2$) distinguishable microstates (see Figure 2). The maximum information is
$I_{\max} = \ln w$ (8)
It is called the maximum information because one can still intentionally use only a small part of the available species and so record a smaller amount of information. In eq 8, w is the number of distinguishable microstates [8]:
$w = N^N$ (9)
The corresponding entropy has its minimum value, which is zero:
$S = 0$ (10)
This extreme case is illustrated in Figure 2 (N=2, w=4) and Figure 3 (N=3, w=27) [8].
Let all the N samples in the N bottles be samples of extremely similar (or the same) property. Then there will still be $N^N$ microstates (or assemblages) constructed by the different combinatorial sequences of assembling to form solid structures. However, because they are all virtually indistinguishable microstates, there is always the minimum information and the maximum entropy (eqs 11 and 12):
$I = 0$ (11)
$S = S_{\max} = \ln w$ (12)
Suppose you have accidentally installed two bottles of red ink in a printer. Even though exactly the same amount of effort is taken to prepare the 4 microstates, i.e., the four microstates are prepared in the same way as those of Figure 2 by using inks from two individual bottles, there will be 4 ($2^2$) indistinguishable microstates (see Figure 4). Similarly, whether we take the same sample from one sample bottle or from different sample bottles, we always have w indistinguishable microstates if all the N bottles contain virtually the same compound; see Figure 4 (N=2, w=4, if all species are in fact 0) and Figure 5 (N=3, w=27, if all species are B).
The maximum microstate indistinguishability number is therefore (13) These two extreme sets of distinguishable and indistinguishable samples, which give the minimum (zero) and maximum entropy values respectively (eqs 10 and 12), already illustrate that our method of entropy calculation is different from that of classical statistical mechanics and classical information theory (see also Figure 1).
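The two ink examples can be enumerated directly. The sketch below is only an illustration: it lists every ordered way of filling N positions from N bottles, then counts how many of the resulting patterns can actually be told apart. The helper name `assemblages` is hypothetical, and expressing the information in bits (base-2 logarithm) is a choice made here to match the "2 bits" of the ink example.

```python
import itertools
import math

def assemblages(bottle_contents, n_positions):
    """All ordered ways of filling n_positions, one list entry per bottle."""
    return list(itertools.product(bottle_contents, repeat=n_positions))

# Distinguishable case: one bottle of red ink ("0") and one of blue ink ("1").
distinct = assemblages(["0", "1"], 2)
# Indistinguishable case: two separate bottles that both hold red ink.
same = assemblages(["0", "0"], 2)

w = len(distinct)                        # 2**2 = 4 assemblages in either case
bits = math.log2(w)                      # 2.0 bits if all four can be told apart
distinct_patterns = len(set(distinct))   # 4 different patterns
same_patterns = len(set(same))           # 1 pattern: all four look identical

# Three samples (N = 3) give 3**3 = 27 assemblages.
w_three = len(assemblages(["A", "B", "C"], 3))
```

In both ink cases four assemblages are prepared, but only in the first case do they form four distinguishable patterns.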
Generally, suppose the N individuals used to construct microstates are only mutually similar to a certain extent, being neither fully distinguishable (eqs 8 and 10) nor fully indistinguishable (eqs 11 and 12). Instead of using eq 3 directly, eq 14 is used to calculate entropy.
The pairwise similarities in the table (15) have values limited between 0 and 1 and are given by pairwise comparison among the N individuals according to one and only one systematically followed standard of comparison for all the values ( ). Then a normalization factor c is required.
It follows that Then eq 14 will give the same results as given by eqs 10 and 12 respectively under the two extreme conditions.
We agree with Agrafiotis [3,10] that, in principle, the general equation (eq 3) should be directly used, where w is simply replaced by . The obvious disadvantage of using eq 3 directly is that the sum runs over all the microstates (see Figures 2-5). The calculation of this enormous number of terms ($N^N$), which can be an astronomical figure, is impractical. Normally N is 100000, the size of a compound sample library or sublibrary. In eq 14, the number is substantially reduced to a total of only $N^2$ pairwise terms.
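Because eq 14 is not legible in this text, the following Python sketch uses an assumed form that is consistent with the description above: each row of the pairwise similarity table is normalized so that its probabilities sum to 1, and the entropy is the sum of p ln(1/p) over all N x N pairwise terms. Both the row-wise normalization and the function names are assumptions made for illustration. With this form the two extreme cases reproduce zero entropy and the maximum entropy ln(N^N).

```python
import math

def row_normalize(similarity):
    """Assumed normalization: scale each row so its probabilities sum to 1."""
    return [[value / sum(row) for value in row] for row in similarity]

def pairwise_entropy(similarity):
    """Assumed form of the pairwise entropy: sum of p * ln(1/p) over N*N terms."""
    probabilities = row_normalize(similarity)
    return sum(
        p * math.log(1.0 / p)
        for row in probabilities
        for p in row
        if p > 0.0  # zero similarity contributes nothing
    )

# All three samples distinguishable: each is similar only to itself.
distinguishable = [[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]]
# All three samples indistinguishable: every pairwise similarity is 1.
indistinguishable = [[1.0, 1.0, 1.0],
                     [1.0, 1.0, 1.0],
                     [1.0, 1.0, 1.0]]

s_min = pairwise_entropy(distinguishable)    # zero entropy
s_max = pairwise_entropy(indistinguishable)  # equals ln(27) = ln(3**3)
```

The double sum touches 9 terms for N = 3, compared with the 27 microstates that a sum over all assemblages would require.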
Secondly, we are not really interested in using the chemical samples to record information by taking the sample bottles as "unit devices". Therefore, we will not perform experiments or calculations to characterize the chemical structures and other physicochemical properties of all these microstates. Instead, we measure (or calculate from the known structures) the properties of the N molecules, from which the pairwise similarities are easily calculated. This means that, instead of considering similarities and probabilities among microstates, only probabilities , , , etc., calculated from the pairwise similarities among the N samples A, B, C, ..., etc., will be considered.
The first column of Table 1 shows several sets of three imaginary compound samples (A, B and C) plotted against a uniform property scale as used by Agrafiotis [3]. If the properties of the samples are the same, the points coincide and the samples are regarded as the same sample. If the distances between them are very short, the samples are regarded as very similar.
The probability calculated from eq 17 is the probability of finding the jth individual as the ith species. The diversity of these species is unknown and yet to be assessed; they are presumably similar to each other to a certain extent [4]. Therefore, the comparisons are not performed between the N samples and an a priori known set of distinguishable prototypes; the comparisons are performed among the N samples themselves. The normalization factor c is required because these values are subject to the constraint: (18) Using the logarithmic relation of entropy (eq 3) here, it is easily found that the apparent species indistinguishability number . For the examples shown in Figures 4 and 5, . Generally, . The apparent species number can then be easily calculated. (21)
Using this method properly, the molecular diversity as expressed by the diversity index D and several related parameters can be calculated. To compare the diversity of several selections of a sublibrary of compounds from all available sources, and to acquire the same number of samples of the highest diversity for many different screening purposes, the sublibrary of minimized entropy S is the choice.
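Once an entropy value is available, the information and the diversity index follow from the relations summarized earlier. The sketch below assumes that entropy is the information lost from its maximum (information = maximum entropy minus entropy), that the maximum entropy and the maximum information both equal N ln N, and that natural logarithms are used. The sublibrary names and entropy values are invented purely for illustration.

```python
import math

def max_entropy(n_samples):
    """Assumed maximum entropy ln(N**N) = N * ln(N) for N samples."""
    if n_samples < 2:
        raise ValueError("need at least two samples")
    return n_samples * math.log(n_samples)

def information(entropy, n_samples):
    """Information taken as what remains after the entropy (information loss)."""
    return max_entropy(n_samples) - entropy

def diversity_index(entropy, n_samples):
    """Diversity index D: ratio of information to maximum information."""
    return information(entropy, n_samples) / max_entropy(n_samples)

# Hypothetical entropies of three candidate sublibraries, each of N = 3 samples.
candidate_entropies = {"selection_a": 0.0, "selection_b": 1.4, "selection_c": 3.0}

# The selection with the lowest entropy has the highest diversity index.
best = min(candidate_entropies, key=candidate_entropies.get)
d_values = {name: diversity_index(s, 3) for name, s in candidate_entropies.items()}
```

Because all candidates have the same N, ranking by minimum entropy and ranking by maximum diversity index give the same choice.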

Similarity Definition
Before calculating the molecular diversity of a library of N compound samples, the similarities for all the mutual pairwise comparisons among all the N individuals should be clearly defined. Whether it is a proper definition of similarities can be quickly checked first by the following criteria of the two extremes and by using eq 14: (a) It should be able to give the definition of the N individuals of maximum apparent distinguishability. The entropy of this system is the minimum which is zero (eq 10). (b) It should be able to give the definition of the N individuals of minimum apparent distinguishability. This means there can be N indistinguishable species. The entropy of this system is the maximum (eq 12).
As the reader can easily estimate from the similarity definition given in the paper [3], the maximum number of distinguishable species is only 2, located at the two ends of the property scale, 0.00 and 1.00 respectively. The distance between these two least similar species is 1.00. As shown in the original Figure 2 of Agrafiotis's paper [3], which gives the entropies of various sets of representative imaginary samples, the minimum entropy is not zero, which does not conform with the first simple criterion (eq 10).
Note, eq (23) does not conform with the second criterion either: The minimum similarity value is 0.5, instead of zero, which is normally the minimum value of a properly defined similarity scale [11,12]. The minimum value corresponds to the largest distance which is 1 in this definition [3].
We propose that, instead of eq 22, the following formula of pairwise similarity, which clearly conforms with the simple criteria, is adopted: In this formula, the shortest distance, which defines that certain two species are distinguishable, is provided that the property scale range is [0, 1]. Again, remember that "distinguishability" means the least similarity. If the distance is shorter than this (eq 25), the two samples considered are similar. If they coincide, they are indistinguishable samples or the same sample.
According to eq 24, it is easily verified that the samples most uniformly distributed on the property scale have the highest diversity ( , and ), where all species are distinguishable, in contrast to a collection of samples as shown in the last row of Table 1, which has the lowest diversity and the highest indistinguishability ( , and ). For the latter case, if the properties of these three samples are represented by a symbol B, the 27 indistinguishable microstates are those listed in Figure 5.
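The two limiting cases can be checked numerically. Since eq 24 is not legible in this text, the Python sketch below substitutes a hypothetical linear similarity: it falls from 1 at zero distance to 0 once the distance reaches 1/(N - 1), the spacing of N uniformly distributed samples on a [0, 1] scale. This threshold, the linear form, the row-normalized entropy, and all function names are assumptions chosen only because they satisfy the two criteria; they should not be read as the formula proposed here.

```python
import math

def similarity_table(properties):
    """Hypothetical linear similarity on a [0, 1] property scale.

    Assumption: two samples count as fully distinguishable once their
    distance reaches 1 / (N - 1).
    """
    n = len(properties)
    if n < 2:
        raise ValueError("need at least two samples")
    threshold = 1.0 / (n - 1)
    return [[max(0.0, 1.0 - abs(a - b) / threshold) for b in properties]
            for a in properties]

def pairwise_entropy(similarity):
    """Assumed entropy: row-normalize, then sum p * ln(1/p) over all pairs."""
    total = 0.0
    for row in similarity:
        row_sum = sum(row)  # never zero: each sample has similarity 1 to itself
        for value in row:
            p = value / row_sum
            if p > 0.0:
                total += p * math.log(1.0 / p)
    return total

def diversity_index(properties):
    """D = (S_max - S) / S_max with the assumed S_max = N * ln(N)."""
    n = len(properties)
    s_max = n * math.log(n)
    return (s_max - pairwise_entropy(similarity_table(properties))) / s_max

uniform = diversity_index([0.0, 0.5, 1.0])     # evenly spread samples
coincident = diversity_index([0.5, 0.5, 0.5])  # three identical samples
clustered = diversity_index([0.0, 0.1, 1.0])   # two close samples, one distant
```

Under these assumptions the evenly spread set reaches the highest diversity index, the coincident set the lowest, and the clustered set falls in between.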
The calculation results for several representative sets of samples, obtained by using eq 24 for pairwise similarity calculation, are listed in Table 1.

Table 1. Calculation of diversity based on the similarity formula eq 24. Eqs 14, 19, 19, 21, 2 and 1 are used for calculating entropy (S), apparent indistinguishability number of microstates ( ), apparent indistinguishability number of species ( ), the apparent number of species ( ), information (I) and diversity index (D), respectively.
[Table 1 body not recoverable. Surviving column headings: Sample Properties; Pairwise Similarity Table; Probability Table; S]

Conclusion
This paper has described the application of a new theory based on the rejection of Gibbs paradox of entropy of mixing and assembling for calculating species diversity. A similarity scale used by Agrafiotis is modified here. As we have demonstrated, if the pairwise similarity is properly defined following very simple criteria, calculations will generate satisfactory and expected results.
Finally, the widespread conceptual confusion between information loss in dynamic mixing and information loss in static assembling, although already considered by us [5][6][7][13][14][15], will be discussed in more detail elsewhere.
9. Normally N can be either greater or smaller than M. For example, a hard disk of N bits has N individuals with M equal to 2. Normally . If one puts 200 Chinese characters in a typical letter written in Chinese, then N equals 200 and (here the species number M is the total number of different Chinese characters normally used, which is 10000).

Whether eq 3 and eq 14 are compatible or not under the restrictions specified in the context should be mathematically justified. For the two extremes they give the same results, as shown in eqs 10 and 12.