Study of intra-inter species protein-protein interactions for potential drug targets identification and subsequent drug design for Escherichia coli O104:H4 C277-11.

Protein-protein interaction (PPI) and host-pathogen interactions (HPI) proteomic analysis has been successfully practiced for potential drug target identification in pathogenic infections. In this research, we attempted to identify new drug target based on PPI and HPI computation approaches and subsequently design new drug against devastating enterohemorrhagic Escherichia coli O104:H4 C277-11 (Broad), which causes life-threatening food borne disease outbreak in Germany and other countries in Europe in 2011. Our systematic in silico analysis on PPI and HPI of E. coli O104:H4 was able to identify bacterial D-galactose-binding periplasmic and UDP-N-acetylglucosamine 1-carboxyvinyltransferase as attractive candidates for new drug targets. Furthermore, computational three-dimensional structure modeling and subsequent molecular docking finally proposed [3-(5-Amino-7-Hydroxy-[1,2,3]Triazolo[4,5-D]Pyrimidin-2-Yl)-N-(3,5-Dichlorobenzyl)-Benzamide)] and (6-amino-2-[(1-naphthylmethyl)amino]-3,7-dihydro-8H-imidazo[4,5-g]quinazolin-8-one) as promising candidate drugs for further evaluation and development for E. coli O104:H4 mediated diseases. Identification of new drug target would be of great utility for humanity as the demand for designing new drugs to fight infections is increasing due to the developing resistance and side effects of current treatments. This research provided the basis for computer aided drug design which might be useful for new drug target identification and subsequent drug design for other infectious organisms.


Background
Enteric bacterial infections remain a major cause of morbidity and mortality in both developing and developed countries (Petri et al. 2008). Among various pathogenic E. coli strains that cause intestinal or extra-intestinal diseases in humans, the most devastating are enterohemorrhagic E. coli (EHEC) strains, which produce highly potent cytotoxins called Shiga toxins (Stxs) (Bradley et al. 2012;Brooks et al. 2005). The EHEC E. coli caused diarrhea, hemorrhagic colitis, life-threatening hemolytic uremic syndrome and encephalopathy (Evans and Evans 1996). Several deadly EHEC outbreaks were reported all over the world since last two hundred years predominated by the O157:H7 strain of E. coli (Frank et al. 2011;King et al. 2012;Michino et al. 1999;Waters et al. 1994). However, in 2011, a new EHEC strain O104:H4 was identified and this strain was related to an outbreak in Germany and other countries in Europe. This massive outbreak was of great concern to the drug designers as a number of deaths from the infection were reported due to the ineffective medication (Bielaszewska et al. 2011;Grad et al. 2012). This pathogenic strain acquired a shiga toxin gene and an antibiotic resistant virulence plasmid pAA, which allow it to exhibit resistance against a significant number of antibiotics including cephalosporins, co-trimoxazole and all penicillins while susceptible to imipenem, meropenem, amikacin, kanamycin and carbapenems (Mora et al. 2011;Muniesa et al. 2012). In order to get the most effective treatment options, it is crucial to identify new drug and vaccine candidates to combat with this deadly pathogen.
The crucial step in drug discovery is the target identification (Chan et al. 2010). However, traditional drug discovery methods are capital-intensive, time-consuming, and often yield few drug targets. In contrast, advantages of the bioinformatics, genomics and proteomics approach represent an attractive alternative to identify drug targets worthy of experimental follow-up. The pathogen and host-genome sequence offer a better understanding of pathomechanism of disease and hence identification of drug targets. In recent years, computational methods have been used widely for the identification of potential drug and vaccine targets in different pathogenic microorganisms (Amineni et al. 2010;Damte et al. 2013;Mondal et al. 2014Mondal et al. , 2015Sliwoski et al. 2014). Protein-protein interactions (PPI) and host-pathogen interactions (HPI) approaches offer an area of unexplored potential for next generation drug targets (Taylor et al. 2011). It is important for bacterial cellular processes and pathogenesis analysis and thus efficient to identify the protein-set essential for the pathogen's survival but absent in the host (Archakov et al. 2003;Eisenberg et al. 2000). Subtraction of the host genome from essential genes of pathogens helps in searching for non-human homologous targets which ensures no interaction of drugs with human targets. The integration of these approaches with different advanced bioinformatics tools may ensure the discovery of potential drug targets for most of the infectious diseases. After the drug target(s) optimization, the in silico virtual screening of different chemical databases could provide unprecedented opportunity to select and design the best possible inhibitor(s) (Lavecchia and Di Giovanni 2013).
This study focused on a combination approach of the proteomics data analysis and homology modeling to find out a novel therapeutic target from E. coli O104:H4 C277-11 (Broad). We performed the protein-protein interactions of E. coli O104:H4 through the three different methods (1) protein interactions from PSIbase (2) protein interactions from Database of Interacting Proteins (DIP) and (3) domain-domain interactions from Domain interaction map (DIMA). Host-pathogen interactions (HPIs) between predator E. coli O104:H4 and its target Homo sapiens were predicted by host-pathogen interaction database (HPIDB). E. coli O104:H4 proteins contributed in HPIs were investigated for identifying potential drug targets and subsequent computer aided drug design process. The identified potential drug targets might expand our understanding of the molecular mechanisms of E. coli O104:H4 pathogenesis and also facilitate the design of effective antibiotics.

Materials and methods
Intra species protein-protein interaction prediction from PSIbase, DIP and DIMA Protein sequences of the E. coli O104:H4 str. C227-11 (Broad) was taken from the Patric Pathosystems Resource Integration Center (Gillespie et al. 2011). These sequences were assigned with Structural Classification of Proteins (SCOP) 1.75 database using SUPER FAMILY 1.75 with an e-value cutoff 0.0001 (de Lima Morais et al. 2011). SCOP domains were further submitted to the PSIbase database which is based on Protein Structural Interactome map (PSIMAP) information to obtain interaction partners (Gong et al. 2005b). A comparison of the protein sequences of E. coli O104:H4 str.  to the protein sequences (dip20120218.seq) from DIP (Salwinski et al. In Silico Pharmacol. 2004) database was carried out using BLASTp with an e value cutoff 10 -5 and a minimum sequence identity of 40% (Altschul et al. 1997). Interaction pairs were obtained from the DIP database by applying DIP node identity from the blast output. All the E. coli O104:H4 proteins were aligned for the corresponding Pfam domains using WebMGA server with hmmscan 3.0 and PFAM 2.4 by the e value cutoff 0.01 (Luo et al. 2011;Wu et al. 2011).

Host-pathogen interactions from HPIDB
Homologous protein interactions between E. coli O104:H4 and human were predicted from the HPIDB (Kumar and Nanduri 2010). ''Search Homologous HPIs'' option was selected to identify host-pathogen interaction with the BLASTp parameter set to \10 20 and the sequence identity cutoff of 30%.

Drug target identification
Pathogen proteins participating in host-pathogen interactions were further investigated for drug target identification.

Paralogous and non-homologous protein identification
The complete protein sequence of Homo sapiens was retrieved from NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/). To identify duplicate or paralogous proteins within the 1497 proteins of E. coli O104:H4, protein sequences were purged at 60% using CD-HIT (Huang et al. 2010). Paralogs and duplicate proteins were discarded. The non-paralogous proteins were applied to BLASTp search against complete proteome of Homo sapiens to identify non-homologous proteins. Expectation threshold value was set to 10 -4 . The resultant dataset that had significant similarity with the human proteins were excluded and the non-homologous proteins were compiled.

Essential protein identification and metabolic pathway analysis
Non-homologous proteins were subjected to identify pathogen specific essential proteins. From the Database of Essential Genes (DEG) (Zhang et al. 2004) prokaryotes essential genes were downloaded and all of the non-homologous proteins were subjected to BLASTp against DEG database (de Lima Morais et al. 2011). Parameters set for BLASTp execution was cutoff for e value \10 -10 , bit score [100 and percentage of identity [35%. Metabolic pathway analysis of identifying essential proteins was carried out using KEGG Automatic Annotation Server (KAAS) (Moriya et al. 2007). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database was used for the comparative pathway analysis between E. coli O104:H4 and Homo sapiens to map out essential proteins entailed in pathogen specific metabolic pathway (Kanehisa and Goto 2000).

Subcellular localization prediction
Subcellular Localization prediction server PSORTb version 3.0.2 was used for the identification of cytoplasmic, periplasmic and transmembrane protein within the essential proteins involved in pathogen specific metabolic pathway (Yu et al. 2010).
Homology modeling, structure refinement and active site prediction Homology modeling of D-galactose-binding periplasmic protein MglB and UDP-N-acetylglucosamine 1-carboxyvinyltransferase was performed by automated homology modeling server SWISS-MODEL using the templates retrieved from Protein Data Bank (PDB) by comparative modeling (Rajender et al. 2011). Three dimensional structure of D-galactose-binding periplasmic protein MglB and UDP-N-acetylglucosamine 1-carboxyvinyltransferase from the homology modeling were passed through the structure refinement process using KoBaMIN (Rodrigues et al. 2012). Structural Analysis and Verification Server (SAVES) was implemented for evaluating the quality and validation of the refined 3-D structure model (Luthy et al. 1992).
Active site in the validated refined model was predicted by the Computed Atlas of Surface Topography of proteins (CASTp) database (Dundas et al. 2006) with the default parameter. Through extensive literature search best active site was identified and selected for both of the models.

Virtual screening and docking ligand into homology model
Virtual screening was performed for identifying active lead compounds with total 6460 molecules, 1447 approved and 5040 experimental posited in DrugBank (Knox et al. 2011 Cluster computer with Linux based with 32 core system was executed for virtual screening using mpiVina. This cluster computer capable to accomplish 500 docking per hour. Based on high binding affinity a simple python script In Silico Pharmacol. selected top 100 molecules from the processed result. Redocking of these 100 molecules was carried out several times to check the stable high binding affinity. Finally, Pymol, Discovery Studio was used to analyze and visualization of the ligand-receptor interaction.

ADMET pharmacokinetic and toxicological analysis
Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) analyses constitute the pharmacokinetics of a drug molecule. The PreADMET server (https://pre admet.bmdrc.org) was used to predict the ADME profiles and carcinogenicity of a compound and helps to avoid toxic compound.

Results and discussion
Analysis of E. coli O104:H4 C227-11 proteinprotein interactions (PPI) and host-pathogen interaction PPI A total of 5325 proteins of E. coli O104:H4 C227-11 (Broad) was used for protein-protein interaction analysis. The algorithm PSIMAP was reported to extrapolate the protein-protein interactions by applying the sequence similarities (Kim et al. 2008). It provides a global interactive map of protein-protein and domain-domain interaction information by calculating euclidean distance between structural domains in interacting protein pairs (Gong et al. 2005b). PSIMAP utilizes Protein Data Bank (PDB) and Structural Classification of Proteins (SCOP) domains to retrieve molecular interaction data and test the fundamental working mechanisms of cells (Gong et al. 2005a). We carried out E. coli O104:H4 protein interactions using SUPER FAMILY 1.75 and PSIbase. Our analysis identified 7769 predictive interactions for 1278 proteins which was around 24% of the total E. coli O104:H4 C227-11 (Broad) proteins (Supplementary file 1). On contrary, Database of Interacting Proteins (DIP) analysis revealed 17,411 protein-protein interactions (PPI) from 2512 proteins which was 47% of the total E. coli proteins (Supplementary file 2). All these PPI has experimental evidence because DIP deposit and unionize information on protein-protein interaction was extracted from single research articles (Salwinski et al. 2004).
Domain interaction map (DIMA) imported domain-domain interaction pairs from iPfam and 3DID database which have been known to build physical contacts to the Protein Data Bank (PDB) data (Luo et al. 2011). We predicted 136885 Pfam domain interaction pairs for 3451 proteins comprising 64% of the total proteins using DIMA. We predicted 11107 protein-protein interactions from iPfam composed of 2527 E. Coli proteins representing approximately 47% of the total proteins (Supplementary file 3).
To explore the infection strategies, host-pathogen protein interactions information is crucial. For this purpose, we analyzed the host-pathogen interactions of pathogenic E. coli O104:H4 C227-11 (Broad) with human. In this analysis, we detected 1493 E. coli proteins targeting 1910 human proteins and the total number of protein-protein interactions was 4657. The HPIDB server was used to predict the host-pathogen interactions which infer homologous protein interactions from its integrated databases (Kumar and Nanduri 2010). HPIDB integrates Biomolecular Interaction Network Database (BIND) (Bader et al. 2003), Molecular INTeraction database (MINT) (Zanzoni et al. 2002), pathogen interaction gateway (PIG) (Driscoll et al. 2009), Gene Reference Into Function (GeneRIF), REACTOME (Matthews et al. 2009), INACT (Aranda et al. 2010) databases into one station and serve as a unified resource to investigation of host-pathogen interaction. From the HPIDB result we excluded the protein sequences less than 30% sequence identity for pruning the resultant dataset. According to the E. Coli-Homo sapiens interaction results we observed that the protein sequences of E. coli O104:H4 C227-11 (Broad) showed significant homology with the Bacillus anthracis, Yersinia pestis, Francisella tularensis subsp. tularensis SCHU S4, E. coli K-12, E. coli O157:H7. Detailed data for this full set of interactions is given in Supplementary file 4.
Drug target identification from the E. coli O104:H4 C227-11 (Broad) For identifying putative drug targets against E. coli O104:H4 C227-11 (Broad), we analyzed 1493 E. coli proteins that can interact or target their human host proteins. For rationalizing protein dataset, paralogs or duplicate proteins were discarded using CD-HIT. We identified 56 duplicate protein sequences and omitted from the total list. Finally, 1437 protein sequences were subjected to a BLASTP program for deciphering non-human homologous sequences. Non-human homologous protein should be selected for drug target to preclude the cross binding of drug to the human body and for avoid the consequences of adverse drug side effects (Sakharkar et al. 2004). Total 797 non-human homologous proteins were further tested for identification essential gene using DEG database (Zhang and Lin 2009). For development of new promising and potential antimicrobial drugs it is the prerequisite first to identify essential genes that own fundamental role in the bacterial cellular process (Zhang and Lin 2009). From the BLASTP result, 339 essential proteins were identified which are crucial for the survival of E. coli O104:H4 C227-In Silico Pharmacol. 11 (Broad) bacterium. Metabolic pathway analysis of essential proteins was carried out by the KEGG Automatic Annotation server (KAAS) (Moriya et al. 2007). From 339 essential proteins 71 metabolic pathways were constructed and compared to the Homo sapiens metabolic pathways for drawing pathogen specific unique metabolic pathway. We select 17 unique pathways comprising 79 essential proteins to map out best candidate for drug targets by predicting their subcellular localization in a cell.
Computational anticipation of protein subcellular localization acts an important agent for detecting potential drug targets (Yu et al. 2010). Subcellular localization of essential proteins in unique pathways was accomplished by PSORTb (Yu et al. 2010). Out of 79 proteins, 49 proteins were found to be cytoplasmic, 25 to be localized in the cytoplasmic membrane, 1 outer membrane protein, 2 proteins found to be periplasmic localized and the remaining 2 proteins location were unknown. Although the membrane proteins, particularly receptor proteins and ion channels, have great importance in a wide variety of therapeutics, there are practical problems of working with membrane proteins like as protein purification, expression and crystallization (Duffield et al. 2010;Arinaminpathy et al. 2009). We selected best candidate by extensive literature review from cytoplasmic and periplasmic proteins. Finally, two potential drug targets are selected in Galactose/methyl galactoside ABC transport system, D-galactose-binding periplasmic protein MglB and UDP-N-acetylglucosamine 1-carboxyvinyltransferase (Fig. 1a, b). Drug target identification steps and their results are summarized (Table 1). D-Galactose-binding periplasmic protein involved in the active transport of glucose and galactose. It mediates chemotaxis towards the two sugar residue by interacting with the Trg chemoreceptor in a number of bacterial species (Borrok et al. 2007). By chemotaxis bacteria encounter the environmental surrounding and find out more favorable conditions for its survival (Rao et al. 2004). D-Galactosebinding periplasmic protein plays a crucial role in the  bacterial chemotaxis which emphasizes this protein as a potential drug target. UDP-N-acetylglucosamine 1-carboxyvinyltransferase, alternative name is UDP-N-acetylglucosamine enolpyruvyl transferase, participates for the biosynthesis of peptidoglycan polymer (Gautam et al. 2011;Kumar et al. 2009). Peptidoglycan serves as a crucial component of bacterial cell wall and important for bacterial survival. Furthermore, this essential enzyme is omnipresent in all the prokaryotes but not required by mammals. Therefore UDP-N-acetylglucosamine 1-carboxyvinyltransferase is an attractive target for the drug design to pursue against the drug resistant bacteria (Eschenburg et al. 2005).

Homology modeling, structure refinement and validation
Homology modeling was performed to determine the 3Dstructure of D-galactose-binding periplasmic and UDP-Nacetylglucosamine 1-carboxyvinyltransferase. Template identification of the target proteins was carried out based on the sequence identities, lower expect value, X-ray crystallography resolution and R value. We selected 2FVY for D-galactose-binding periplasmic protein and 3KQJ for UDP-N-acetylglucosamine 1-carboxyvinyltransferase as the most suitable templates using the aforementioned criteria. The complete results of template identification for these two drug target proteins are presented (Table 2). Homology modeling was performed by using SWISS-MODEL server based on template structures 2FVY and 3KQJ, respectively (Fig. 1a, b).
Structure refinement for 3D homology model of Dgalactose-binding periplasmic protein and UDP-N-acetylglucosamine 1-carboxyvinyltransferase was carried out by KoBaMIN web server (Rodrigues et al. 2012). To refine the protein structure after homology modeling KoBaMIN provides a fast and effective protein structure refinement based on minimization of a knowledge-based potential of mean force (Rodrigues et al. 2012). The initial energy of Dgalactose-binding periplasmic homology structure was -3867.28 kcal/mol and after refinement final energy appeared to be -6706.5036 kcal/mol. Total energy minimizations was -2839.22 kcal/mol. In addition, UDP-Nacetylglucosamine 1-c-rboxyvinyltransferase homology structure energy minimization was -3918.85 kcal/mol (Fig. 1a, b; Supplementary file 5).
Further we determined the structure quality by validating the modeled PDB. We used SAVES metaserver (which integrates 6 programs PROCHECK, WHAT_CHECK, ERRAT, VERIFY_3D, PROVE, CRYST1 record matches) to check structural quality (http://services.mbi.ucla.edu/ SAVES/) (Pontius et al. 1996). PROCHECK program executes the Ramachandran plot to evaluate model quality (Hariprasad et al. 2010). Ramachandran plot analysis of the 3D structures of D-galactose-binding periplasmic protein showed that 94.5% residues in the core region, 4.4% in allowed region. ERRAT and PROVE results for the same protein structure showed quality factor 96.633 and average Z score -0.036, respectively. On contrary, UDP-Nacetylglucosamine 1-carboxyvinyltransferase Ramachandran plot expressed that 95.0% of the residues are in the core region. (Figure 1c, d and Supplementary file 6).

Active site prediction and virtual screening for docking
Drug targets proteins active sites were predicted by the Computed Atlas of Surface Topography of proteins  (CASTp). Using CASTp, we selected pocket volume 1006.7 and 1737.2 for D-galactose-binding periplasmic protein and UDP-N-acetylglucosamine 1-carboxyvinyltransferase proteins, respectively (Fig. 2). Comparative active site analysis of 2FVY (template) and D-galactosebinding periplasmic protein (homology model) reveal 100% conserved residues between template and homology model. Follow the prediction of the active site residues we carried out molecular docking using mpiVina based on lowest binding energy (Azam and Abbasi 2013). Top 100 ligands were selected from the result dataset which were showing lower binding energy to the receptor's active site filtered for Lipinski's rule of 5 (Lipinski et al. 2001). 81 ligands follow the Lipinski's rule of 5 and out of 81 inhibitors top 10 ligands having lower binding energies were selected and most importantly bacterial proteins as their targets. Inhibitors that target human proteins were filtered out for avoiding off target binding with human proteins. Finally, top 10 hits for both of the target proteins Dgalactose-binding periplasmic protein and UDP-N-acetylglucosamine 1-carboxyvinyltransferase protein were listed according to their binding energy. Top 10 docking results of these two target proteins are represented (Tables 3, 4).

ADMET pharmacokinetic and toxicological analysis
Prediction and significant description of drug-likeness such as absorption, distribution, metabolism, excretion (ADME) and toxic (Tox) properties were calculated with the help of online server PreADMET. Through this server, we can calculate different parameters such as human intestinal absorption (HIA), cellular permeability (P caco-2 ), cell permeability Maden Darby Canine Kidney (P MDCK ), plasma protein binding (PPB), carcinogenicity, and penetration of the blood-brain barrier (C brain /C blood ). The ADME properties should be perfect for a drug candidate. Both of our proposed drug (DB03571 and DB08512) shows proper ADME properties (Table 5).
In general, HIA indicates 'poor' absorption in the range of 0-20%, 'moderate' absorption from 20 to 70% and 'well' absorption between 70 and 100%. We found HIA was 94.07% and 88.61% for DB03571 and DB08512, respectively, indicating well-absorbed compounds. The MDCK computational component would predict the renal clearance of the molecule. The MDCK results showed 0.08 nm/s and 0.39 nm/s for DB03571 and DB08512, respectively. Both compounds show very low permeability in in vitro MDCK cells (less than 25 nm/s according to PreADMET). The value of in vitro cell permeability (P caco-2 ) on intestinal epithelium cells are considered as 'low' permeability when the value is less than 4 nm/ s, 'middle' permeability when the range is from 4 to 70 nm/s and high permeability when it is above 70 nm/s. The observed cell permeability of DB03571 and DB08512 are 19.85 and 20.84 nm/s, respectively, in intestinal epithelium and it indicates middle permeability in in vitro Caco-2 cells. Blood brain barrier (BBB) restricts the passage of most of the compounds from the blood to brain, thus having a brain protecting property. The in vivo blood brain barrier (BBB) is the direct measure of penetration of drug in central nervous system (CNS). The DB03571 and DB08512 are found to be low and middle absorption category according to Pre-ADMET analysis. Plasma protein binding prediction results showed 92.26 and 86.92% plasma protein binding for DB03571 and DB08512, respectively, indicating strongly bound chemicals which are not desirable.
Generally, the carcinogenicity test requires much study time ([2 years) in case of experimental procedures, but PreADMET helps to analyse the results quickly from NTP (National toxicology program) and EUA/FDA 2 year's in vivo data. Both drugs exhibit no evidence of carcinogenic activity. So, the overall ADME properties of DB03571 and DB08512 are excellently satisfactory.

Conclusion
Through the integrated protein-protein interaction and host-pathogen interaction analysis, we identified two potential drug targets from the cytoplasmic and periplasmic region of E. coli 01O4:H4 C227-11 (Broad). The first target D-galactose-binding periplasmic protein play important role in bacterial chemotaxis. In chemotaxis, bacteria sense chemical gradients in the environment and move toward favorable conditions for survival. The pathway is arguably best characterized in the case of E. coli. Whereas UDP-N-acetylglucosamine 1-carboxyvinyltransferase target protein function in the peptidoglycan biosynthesis which is crucial for bacterial cell wall formation and thus important for bacterial survival. Most importantly, this enzyme is essential for bacteria but absent Lowest docking energies, important residues of the binding site observed to be interactive with the ligands from DrugBank in human host. Extensive literature review supports both of this protein as prominent drug target candidate. Bacteria become resistant against single or multiple antibiotics and ensure their survival in the environment. The identification of potential targets for the development of new antibiotics is becoming a topical and widely recognized need. The present study is an important basis for screening novel and alternative targets in a way to design and develop new drugs against other emerging human pathogens.