Retrieval Performance using Different Type of Similarity Coefficient for Virtual Screening

Developing a new drug requires chemical databases as references for finding lead compounds. This study aims to determine the best similarity coefficient for virtual screening of chemical databases. We calculated the structural resemblance between each pair of chemical structures within an activity class to obtain the Mean Pairwise Similarity (MPS), which indicates the degree of heterogeneity of the natural product and synthetic chemical databases. Each structure was represented by the 2D ECFC4 fingerprint, and the Tanimoto coefficient was used to calculate the similarity score between each pair of chemical structures in the same activity class. The MPS for an activity class was obtained by averaging all similarity scores within that class. Next, three similarity coefficients were used to calculate the similarity score between a query structure and each database structure. The results indicate that the Tanimoto coefficient performs better than the Russell-Rao and Forbes coefficients in retrieval from a chemical database. The Tanimoto coefficient is therefore recommended for virtual screening in drug development. More work should be carried out to determine the combination of similarity coefficient and fingerprint type that gives optimal retrieval performance.


INTRODUCTION
Chemoinformatics can be described as the application of computing and information retrieval techniques to solve problems in the field of chemistry (Prakash and Gareja, 2010). Virtual screening is a technique used in drug discovery to search libraries of compounds using computer programs. Virtual screening methods include similarity searching, 3D pharmacophore matching and ligand docking. The focus of this paper is similarity searching, which computes the degree of similarity between an active reference structure and the chemical structures in a database of 2D structures, and which is an effective way of searching large chemical databases (Willett, 2011). Database structures that rank highly against the reference structure can be considered likely to have biological activity similar to that of the reference structure (Johnson and Maggiora, 1990). The aim of a virtual screening task is to filter out compounds that have low similarity values, which saves time, energy and cost for chemists investigating compounds in the drug discovery process.
A similarity coefficient is used to calculate the degree of resemblance between an active reference chemical structure and the chemical structures in the database (Willett, 2003). Similarity searching has three important components: a molecular descriptor to represent a chemical compound, a similarity coefficient to measure the resemblance between a pair of chemical structures, and a weighting scheme to differentiate the importance of each fragment occurrence in a compound. However, the No Free Lunch Theorem (Wolpert and Macready, 1997) suggests that no single algorithm performs best under all conditions of a problem. Thus, this study sets out to determine the best similarity coefficient to use with the ECFC fingerprint when carrying out a virtual screening task.

Similarity measures:
Similarity coefficients: There are many types of similarity coefficients, but only three are used for similarity searching here: Tanimoto, Russell-Rao and Forbes. Descriptors that represent a molecular structure can be in continuous or dichotomous (i.e., binary) form. Holliday et al. (2002) found that these three coefficients fall into different groups in a clustering study they carried out. Similar results were found when different databases and fingerprint types were used (Salim et al., 2003). The Tanimoto, Russell-Rao and Forbes coefficients are used here in their continuous form, which is applicable to non-binary data representations.
The Tanimoto, Russell-Rao and Forbes coefficients are given by S1, S2 and S3, respectively. For the similarity coefficients (1), (2) and (3), x refers to the representation of a chemical structure, where u denotes the query structure and v denotes the database structure, and n refers to the number of bits (elements) in the representation.
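As an illustration, the three coefficients can be sketched in Python for count vectors. The formulas in the docstrings are the usual count-vector generalisations of the binary definitions and are assumed here, not taken from this paper's equations; the function names and the toy vectors are hypothetical.

```python
from typing import Sequence


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of element-wise products, the shared numerator of all three coefficients."""
    if len(u) != len(v):
        raise ValueError("fingerprints must have the same length")
    return float(sum(a * b for a, b in zip(u, v)))


def tanimoto(u: Sequence[float], v: Sequence[float]) -> float:
    """S1 = sum(xu*xv) / (sum(xu^2) + sum(xv^2) - sum(xu*xv))."""
    uv = _dot(u, v)
    denom = _dot(u, u) + _dot(v, v) - uv
    return uv / denom if denom else 0.0  # two all-zero vectors: defined as 0 here


def russell_rao(u: Sequence[float], v: Sequence[float]) -> float:
    """S2 = sum(xu*xv) / n, where n is the fingerprint length."""
    return _dot(u, v) / len(u)


def forbes(u: Sequence[float], v: Sequence[float]) -> float:
    """S3 = n * sum(xu*xv) / (sum(xu) * sum(xv))."""
    denom = float(sum(u)) * float(sum(v))
    return len(u) * _dot(u, v) / denom if denom else 0.0


# Toy 4-element count vectors (hypothetical, for illustration only).
query = [2, 0, 1, 0]
candidate = [1, 0, 1, 3]
print(tanimoto(query, candidate), russell_rao(query, candidate), forbes(query, candidate))
```

With count data, only the Tanimoto form is guaranteed to stay within [0, 1]; under these assumed definitions the Russell-Rao and Forbes values can exceed 1, which is worth keeping in mind when comparing raw scores across coefficients.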

Representations:
A representation describes the structural features of a chemical structure. The representations used here are fragment bit strings, also known as "fingerprints". In this study we focus only on a continuous representation, the Extended Connectivity Count vector (ECFC) with a length of four bonds (ECFC4), which contains 1024 elements. This continuous fingerprint is the non-binary form of the fragment bit string and is a 2D fingerprint. ECFC fingerprints record how many times each fragment is present in the chemical structure, whereas binary strings encode only the presence or absence of a fragment (Todeschini and Consonni, 2009).
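The difference between a count fingerprint and a binary fingerprint can be shown with a small Python sketch. It does not reproduce the ECFC4 algorithm: the integer fragment identifiers are hypothetical, and mapping them to 1024 positions by a modulo operation is an assumption made only for illustration.

```python
from collections import Counter
from typing import Iterable, List

N_ELEMENTS = 1024  # fingerprint length reported in the study


def count_fingerprint(fragment_ids: Iterable[int], n: int = N_ELEMENTS) -> List[int]:
    """Count vector: element i holds how many fragments map to position i."""
    counts = Counter(fid % n for fid in fragment_ids)  # modulo mapping is an assumption
    return [counts.get(i, 0) for i in range(n)]


def binary_fingerprint(fragment_ids: Iterable[int], n: int = N_ELEMENTS) -> List[int]:
    """Binary vector: element i is 1 if any fragment maps to position i, else 0."""
    return [1 if c > 0 else 0 for c in count_fingerprint(fragment_ids, n)]


# Hypothetical fragment identifiers: fragment 7 occurs three times, fragment 130 once.
fragments = [7, 7, 7, 130]
counts = count_fingerprint(fragments)
bits = binary_fingerprint(fragments)
print(counts[7], bits[7], counts[130], bits[130])
```

The count vector keeps the information that fragment 7 occurs three times, while the binary vector records only that it is present.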

METHODOLOGY
The datasets used in this investigation were the Taiwan Traditional Chinese Medicine (TCM) database and the MDL Drug Data Report (MDDR) database. TCM is a natural products database, freely available at http://tcm.cmu.edu.tw (Chen, 2011), that contains 12,289 compounds and focuses on plant-based traditional remedies. MDDR, on the other hand, is a synthetic chemical database with 211,061 compounds. It is a commercial database available by subscription from Accelrys Inc (http://www.accelrys.com) (Sheridan and Joseph, 2004).
Mean pairwise similarity: This task involves 17 activity classes from the TCM database and 15 activity classes chosen from the MDDR database. First, we calculated the Mean Pairwise Similarity (MPS) for all the activity classes. The mean pairwise similarity measures how similar the chemical structures within an activity class are to one another (Saeed et al., 2012). The MPS was computed using the Tanimoto coefficient, as it is the most popular coefficient for computing chemical similarity, while ECFC4 was chosen to represent the chemical structures because recent work found that it gives the best retrieval performance among many fingerprints (Franco et al., 2014; Bender et al., 2009; Medina-Franco et al., 2009).
The TCM and MDDR datasets were filtered to remove duplicate chemical structures in each activity class. All the active molecules in each activity class were then converted to ECFC4 fingerprints using the Pipeline Pilot software (available from http://www.accelrys.com), which gives 1024-element fingerprints (Warr, 2012). The Tanimoto coefficient was used to compare each reference structure with all the other structures in its activity class, giving the similarity values between the structures in that class. The formula used for calculating the MPS is given below:

$$\mathrm{MPS} = \frac{\sum \text{similarity values}}{\text{number of actives in the activity class}}$$

Similarity search: The next task is the similarity search. In this task we used the ECFC4 fingerprint with the Tanimoto, Russell-Rao and Forbes coefficients to calculate the similarity between two chemical structures. First, the TCM database was filtered to remove duplicate chemical structures. The database was then converted into ECFC4 (1024-element) fingerprints using Pipeline Pilot to represent the chemical structures. Ten reference structures were randomly selected from each activity class. The similarity of each reference structure to every structure in the whole dataset was calculated, and only the top 1% of the ranked results was retained for further investigation.
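The ranking step described above can be sketched in Python as follows. This is a minimal illustration, not the Pipeline Pilot workflow used in the study: the function `top_fraction`, the molecule identifiers and the toy fingerprints are hypothetical, and rounding the 1% cutoff up to at least one structure is an assumption.

```python
import math
from typing import Callable, Dict, List, Sequence

Fingerprint = Sequence[float]


def tanimoto(u: Fingerprint, v: Fingerprint) -> float:
    """Continuous Tanimoto coefficient for two count vectors of equal length."""
    uv = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - uv
    return uv / denom if denom else 0.0


def top_fraction(
    query: Fingerprint,
    database: Dict[str, Fingerprint],
    coefficient: Callable[[Fingerprint, Fingerprint], float],
    fraction: float = 0.01,
) -> List[str]:
    """Rank database structures by similarity to the query and keep the top fraction."""
    ranked = sorted(
        database.items(),
        key=lambda item: coefficient(query, item[1]),
        reverse=True,  # most similar first
    )
    cutoff = max(1, math.ceil(fraction * len(database)))  # rounding up is an assumption
    return [mol_id for mol_id, _ in ranked[:cutoff]]


# Hypothetical four-structure database with toy 4-element fingerprints.
database = {
    "m1": [2, 0, 1, 0],
    "m2": [0, 3, 0, 1],
    "m3": [2, 0, 1, 1],
    "m4": [0, 0, 0, 4],
}
print(top_fraction([2, 0, 1, 0], database, tanimoto, fraction=0.5))
```

Passing the coefficient as a parameter allows the same ranking code to be run with the Russell-Rao or Forbes coefficient, so that only the scoring function differs between experiments.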
Next, the retrieved results were examined to see how many of them belong to the same activity class as the reference structure; these are known as true positives. The number of true positives is the number of successfully retrieved chemical structures (Wolpert and Macready, 1997). The Mean of Recall (MR) was then calculated from the frequency of true positives obtained.

Table 1 indicates the number of active molecules in each activity class in TCM. It is clear that the AM and PE activity classes have the highest and lowest MPS values, respectively. The activity class with the highest MPS is also a class with a low number of active molecules, consisting of 30 actives. This shows that the chemical structures in the AM activity class are more similar to each other than those in any other activity class, while the chemical structures in the PE activity class are the most dissimilar to each other.

Table 2 shows the MPS values for the MDDR activity classes. The activity class with the highest MPS is CCK (0.549), while IB has the lowest (0.345). In both TCM and MDDR, the activity class with the highest MPS has a low number of active molecules, 30 and 208 actives, respectively. Further analysis shows that the MDDR activity classes have higher MPS values than the TCM activity classes.
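The MPS values discussed above can be illustrated with a short Python sketch. It assumes that the MPS is the plain average of the Tanimoto scores over all distinct pairs of structures in a class, as stated in the abstract; the function name and the toy fingerprints are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import List, Sequence

Fingerprint = Sequence[float]


def tanimoto(u: Fingerprint, v: Fingerprint) -> float:
    """Continuous Tanimoto coefficient for two count vectors of equal length."""
    uv = sum(a * b for a, b in zip(u, v))
    denom = sum(a * a for a in u) + sum(b * b for b in v) - uv
    return uv / denom if denom else 0.0


def mean_pairwise_similarity(fingerprints: List[Fingerprint]) -> float:
    """Average Tanimoto score over all distinct pairs in one activity class."""
    pairs = list(combinations(fingerprints, 2))  # each unordered pair exactly once
    if not pairs:
        raise ValueError("an activity class needs at least two structures")
    return mean(tanimoto(u, v) for u, v in pairs)


# Toy activity classes (hypothetical): one homogeneous, one heterogeneous.
homogeneous = [[1, 2, 0], [1, 2, 0], [1, 2, 1]]
heterogeneous = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(mean_pairwise_similarity(homogeneous), mean_pairwise_similarity(heterogeneous))
```

A class whose members share most fragments yields an MPS close to 1, while a class with no shared fragments yields 0, which is the sense in which a low MPS indicates a heterogeneous class.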

RESULTS AND DISCUSSION
As TCM shows more heterogeneity and therefore represents a more challenging dataset, we continued the work of determining the best similarity coefficient using the natural product database. Table 3 shows the mean of recall for 14 activity classes in TCM. The PE activity class has the highest mean of recall, 0.043, when Tanimoto is used as the similarity coefficient. The TR activity class has the lowest mean of recall with this coefficient, 0.008.
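A mean of recall of this kind can be sketched in Python. The sketch assumes that the recall for one reference structure is the number of true positives in its top 1% list divided by the number of actives in the class, and that the mean is taken over the reference structures of that class; these definitions, the function names and the identifiers are assumptions made for illustration.

```python
from statistics import mean
from typing import Iterable, List


def recall(retrieved_ids: Iterable[str], active_ids: Iterable[str]) -> float:
    """True positives among the retrieved structures divided by the number of actives."""
    actives = set(active_ids)
    if not actives:
        raise ValueError("the activity class has no active structures")
    true_positives = len(set(retrieved_ids) & actives)
    return true_positives / len(actives)


def mean_of_recall(retrieved_per_reference: List[List[str]], active_ids: Iterable[str]) -> float:
    """Average recall over the top-ranked lists obtained for each reference structure."""
    actives = list(active_ids)
    return mean(recall(retrieved, actives) for retrieved in retrieved_per_reference)


# Hypothetical class with four actives and the retrieved lists of two references.
actives = ["a1", "a2", "a3", "a4"]
retrieved_lists = [["a1", "x1"], ["a2", "a3"]]
print(mean_of_recall(retrieved_lists, actives))
```

In this toy case the two references retrieve one and two of the four actives, so their recalls are 0.25 and 0.5.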
There is a relationship between the MPS value and the mean of recall in the results outlined in Table 3. The PE activity class has the lowest MPS but the highest mean of recall with the Tanimoto coefficient. This is also true with the Forbes coefficient, where the PE activity class gives a high mean of recall of 0.035, while the HC activity class, which has a lower MPS, gives the poorest retrieval performance with a mean of recall of 0.005. With the Russell-Rao coefficient, however, the LP and DR2 activity classes, both of which show a high level of homogeneity (i.e., high MPS), have the highest (0.030) and lowest (0.00) mean of recall, respectively. This indicates that Russell-Rao alone should not be considered for chemical similarity tasks, as it is unsuitable for both homogeneous and heterogeneous datasets.
Recently, there has been growing interest in producing molecular descriptors based on physicochemical properties and the Structure-Activity Relationship (SAR) of a molecule, using statistical techniques (Hancock et al., 2005; Andersson et al., 2000; Mridha et al., 2014) and machine learning approaches (Kovačević et al., 2014; Nantasenamat et al., 2014). These works found that such molecular descriptors are able to give comprehensive coverage in solving chemical problems. Thus, future work can compare the performance of these QSAR descriptors with that of fingerprint-based descriptors to determine the best descriptors to use with the Tanimoto coefficient.

CONCLUSION
Previous work in this field has investigated the effect of similarity coefficients on synthetic chemical databases and found that Tanimoto is the best coefficient to use in virtual screening. This study extends that work and shows that Tanimoto also performs better than Russell-Rao and Forbes when used with a natural product database. In future work we will extend the research by using new molecular descriptors based on physicochemical properties, to see the effect of different types of representation on retrieval from the TCM database.