Multi-schema computational prediction of the comprehensive SARS-CoV-2 vs. human interactome

Background Understanding the disease pathogenesis of the novel coronavirus, denoted SARS-CoV-2, is critical to the development of anti-SARS-CoV-2 therapeutics. The global propagation of the viral disease, denoted COVID-19 (“coronavirus disease 2019”), has unified the scientific community in searching for possible inhibitory small molecules or polypeptides. A holistic understanding of the SARS-CoV-2 vs. human inter-species interactome promises to identify putative protein-protein interactions (PPIs) that may be considered targets for the development of inhibitory therapeutics.

Methods We leverage two state-of-the-art, sequence-based PPI predictors (PIPE4 & SPRINT) capable of generating the comprehensive SARS-CoV-2 vs. human interactome, comprising approximately 285,000 pairwise predictions. Three prediction schemas (all, proximal, RP-PPI) are leveraged to obtain our highest-confidence subset of PPIs and human proteins predicted to interact with each of the 14 SARS-CoV-2 proteins considered in this study. Notably, the use of the Reciprocal Perspective (RP) framework demonstrates improved predictive performance in multiple cross-validation experiments.

Results The all schema identified 279 high-confidence putative interactions involving 225 human proteins, the proximal schema identified 129 high-confidence putative interactions involving 126 human proteins, and the RP-PPI schema identified 539 high-confidence putative interactions involving 494 human proteins. The intersection of the three sets of predictions comprises the seven highest-confidence PPIs. Notably, the Spike-ACE2 interaction was the highest ranked for both the PIPE4 and SPRINT predictors with the all and proximal schemas, corroborating existing evidence for this PPI. Several other predicted PPIs are biologically relevant within the context of the original SARS-CoV virus. Furthermore, the PIPE-Sites algorithm was used to identify the putative subsequence that might mediate each interaction and thereby inform the design of inhibitory polypeptides intended to disrupt the corresponding host-pathogen interactions.

Conclusion We publicly released the comprehensive sets of PPI predictions and their corresponding PIPE-Sites landscapes in the following DataVerse repository: https://www.doi.org/10.5683/SP2/JZ77XA. The information provided represents theoretical modeling only and caution should be exercised in its use. It is intended as a resource for the scientific community at large in furthering our understanding of SARS-CoV-2.


INTRODUCTION
The novel coronavirus (CoV) pandemic has galvanized the research community into the investigation of the SARS-CoV-2 virus and the COVID-19 disease it manifests in humans (Guarner, 2020). Research has progressed with unprecedented speed in large part due to the rapid determination of the SARS-CoV-2 therapeutic targets, facilitate complementary research, and inform public discussions for the present and any future outbreaks of human coronaviruses (HCoVs).

Coronaviruses share many similarities with the influenza viruses in that both are enveloped, single-stranded RNA viruses with helical nucleocapsids; coronaviruses fall within Group IV of the Baltimore classification (Baltimore, 1971). The four coronaviruses known to commonly infect humans are believed to have evolved such that they maximize proliferation within a population. This evolved strategy involves sickening, but not ultimately killing, their hosts. By contrast, the two prior novel coronavirus outbreaks (SARS and MERS) arose in humans after cross-species jumps from animals, as did H5N1 (the avian influenza). These latter diseases were highly fatal to humans, with relatively few mild or asymptomatic cases. A greater proportion of mild or asymptomatic cases would have resulted in widespread disease; however, SARS and MERS each ultimately killed fewer than 1,000 people (World Health Organization, 2020; Regional Office for the Eastern Mediterranean, 2011).

All known HCoVs arise from zoonotic origins (i.e. from other animal species). The wide diversity of CoVs within the animal kingdom stems from genetic alterations to CoV genomes through the acquisition of mutations and a high frequency of recombination between different CoV genomes (Makino et al., 1986; Van Der Most et al., 1992). Such genetic modifications occurring in animal CoVs may facilitate a "host jump" and are the primary reason for inter-species and animal-to-human transmission (Cui et al., 2019). The HCoVs that are endemic to the human population are causative agents of milder disease (e.g. the common cold) and there is less urgency to identify the animal reservoirs of these viruses.

It is of critical importance that the cellular entry mechanism and viral replication pathways of SARS-CoV-2 and the role of accessory proteins be rapidly elucidated to develop anti-viral therapies to mitigate the spread and infectivity of the virus in the present pandemic.

Promisingly, many computational approaches have been rapidly deployed to increase our understanding of SARS-CoV-2, including protein function, three-dimensional (3D) protein structures, and possible target regions for small inhibitory molecules (Senior et al., 2020; Smith and Smith, 2020). Given that the Spike protein from the original SARS coronavirus, SARS-CoV, is known to interact with the human Angiotensin-Converting Enzyme 2 (ACE2), current efforts are focused on better characterizing the SARS-CoV-2 Spike protein and its putative interaction with the ACE2 protein.

We hope to contribute to the scientific effort using the latest version of our sequence-based protein-protein interaction (PPI) predictor, PIPE4 (Dick et al., 2020b) (Dick and Green, 2018). We leverage a multi-schema methodology in order to identify a high-confidence subset of putative interactors. The three predicted interactomes were leveraged in combination to produce candidate targets for experimental validation and to subsequently guide the development of inhibitory polypeptides. Finally, the PIPE-Sites algorithm was used to predict the sub-sequence regions with a high likelihood of mediating the physical interaction between the two proteins of a given pair (Amos-Binks et al., 2011).

The proteome of SARS-CoV-2 was obtained from the UniProt SARS-CoV-2 Pre-Release (Swiss Institute of Bioinformatics, 2020), with the disclaimer that these data would become part of a future UniProt release and may be subject to further changes. While other SARS-CoV-2 proteins are reported among other sequence repositories, we restricted our study to these highest-confidence proteins available at the time. The 14 SARS-CoV-2 proteins and their functions are tabulated in Table 2. Notably, the Spike glycoprotein (Accession: P0DTC2) is of special interest to this and related work, since its SARS-CoV homolog is known to interact with the human ACE2 protein and is presently the target of a recent mRNA-based vaccine candidate.

The computational prediction of PPIs is a diverse field which encompasses multiple paradigms (e.g. sequence-, structure-, evolution-, and network-based methods). Sequence-based predictors rely solely upon primary sequence data, making them amenable to the investigation of proteome-wide networks. Furthermore, these methods tend to be highly efficient, where individual PPIs can be predicted in a fraction of a second.

PIPE is a sequence-based method of PPI prediction that operates by examining sequence windows on each of the query proteins. If the pair of sequence windows shares significant similarity with a pair of proteins previously known to interact, then evidence for the putative PPI is increased. A similarity-weighted (SW) scoring function uses normalization to account for frequently occurring sequences that are not related to PPIs. Given sufficient evidence, a PPI is predicted. PIPE has previously been validated on numerous species for both intra-species and inter-species PPI prediction tasks (Schoenrock et al., 2011; Pitre et al., 2006, 2012). Furthermore, the distribution of evidence along the length of each query protein forms a 2D landscape that can indicate the site of interaction (discussed later) (Amos-Binks et al., 2011).
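The window-matching idea described above can be illustrated with a minimal Python sketch. It is not the PIPE implementation: the function names, the toy window length, and the use of exact residue-match counting as the similarity test are all assumptions made for illustration (the published method uses its own window length and similarity scoring, and applies the SW normalization, which is omitted here).

```python
def windows(seq, w):
    """All contiguous windows of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def similar(u, v, min_matches):
    """Toy similarity test (assumption): count identical positions."""
    return sum(a == b for a, b in zip(u, v)) >= min_matches


def matching_proteins(window, proteome, w, min_matches):
    """Names of training proteins containing a window similar to `window`."""
    return {
        name
        for name, seq in proteome.items()
        if any(similar(window, x, min_matches) for x in windows(seq, w))
    }


def hit_landscape(query_a, query_b, proteome, known_ppis, w=4, min_matches=3):
    """2D landscape: entry [i][j] counts known interacting pairs (p, q) such
    that window i of query_a resembles p and window j of query_b resembles q
    (in either orientation)."""
    sets_a = [matching_proteins(x, proteome, w, min_matches) for x in windows(query_a, w)]
    sets_b = [matching_proteins(x, proteome, w, min_matches) for x in windows(query_b, w)]
    landscape = [[0] * len(sets_b) for _ in sets_a]
    for i, sa in enumerate(sets_a):
        for j, sb in enumerate(sets_b):
            landscape[i][j] = sum(
                1
                for p, q in known_ppis
                if (p in sa and q in sb) or (q in sa and p in sb)
            )
    return landscape


# Hypothetical toy data: three training proteins and one known interaction.
toy_proteome = {"P1": "ACDEFG", "P2": "KLMNPQ", "P3": "WWWWWW"}
toy_known_ppis = [("P1", "P2")]
print(hit_landscape("ACDE", "KLMN", toy_proteome, toy_known_ppis, w=4, min_matches=4))
```

In this toy setting, a query pair whose windows resemble a known interacting pair accumulates hits, while a pair resembling unrelated proteins does not. Regions of the landscape with many hits are the kind of evidence that a site-prediction step could later examine.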

The fourth version of the Protein-protein Interaction Prediction Engine (PIPE4) was recently adapted to improve predictive performance for understudied organisms (Dick et al., 2020b), that is, species for which the proteome is known, but the number of experimentally validated intra-specific PPIs is insufficient to train a model to generate the comprehensive interactome. To circumvent this, the PIPE4 algorithm leverages the known PPIs of evolutionarily similar and well-studied organisms, serving as a proxy training set. Using an approach denoted as cross-species PPI prediction, the experimentally validated PPIs from the proxy species are used to train the PPI predictor, which is then applied to the proteome of the understudied target organism. Due to the limited availability of known SARS-CoV-2 PPIs, we here use the PPIs from a collection of well-studied and evolutionarily similar proxy viruses to generate these cross-species predictions as depicted in Figure 1.

The PIPE4 algorithm is particularly well-suited to cross- and inter-species PPI prediction schemas, given that the SW-scoring function appropriately normalizes the prevalence of sequence windows within each training and target species proteome (Dick et al., 2020b).

Thus, for each one-to-all (O2A) score curve, a score threshold delineating the "high-scoring" pairs from the baseline was identified and used to determine the high-confidence predicted interactions. In the absence of known PPIs between SARS-CoV-2 and human, it is difficult to determine a suitable global decision threshold. By instead examining the morphology of the O2A score curves for both perspectives, we can qualitatively identify high-scoring pairs. This process can be further automated through the identification of the baseline/knee for each view under the assumption that true PPIs are rare and high-scoring, while non-interacting pairs tend to generate scores residing below the knee in the baseline.

We identified the common set of predicted pairs above each locally defined knee from both the PIPE4 and SPRINT methods (their intersection) for each schema. For example, the all schema resulted in a set of 225 putative human protein targets among 279 intersection pairs. The predicted pairs from each schema were considered to be the predicted interactome and were subsequently analyzed by PIPE-Sites; GO-term enrichment analysis was performed using the identified human proteins. The results of each schema's interactome were also combined into higher-confidence sets by taking their set intersections and were visualized as a network.
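The local thresholding and intersection steps can be sketched as follows. The text does not specify which knee-detection algorithm was used, so this Python sketch assumes a simple geometric rule (the point on the rank-ordered score curve that lies furthest below the straight line joining its first and last points). The function names and the toy scores are hypothetical.

```python
def knee_index(sorted_scores):
    """Index of the knee in a descending score curve.

    Assumed rule: normalize rank and score to [0, 1] and pick the point
    furthest below the chord joining the first and last points.
    """
    n = len(sorted_scores)
    top, bottom = sorted_scores[0], sorted_scores[-1]
    if n < 3 or top == bottom:
        return 0
    best_i, best_gap = 0, 0.0
    for i, score in enumerate(sorted_scores):
        x = i / (n - 1)
        y = (score - bottom) / (top - bottom)
        gap = 1.0 - x - y  # vertical distance below the chord y = 1 - x
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def above_knee(scores):
    """Proteins ranked strictly above the knee of one one-to-all curve.

    `scores` maps each candidate human protein to its predicted score
    against a single viral protein.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    k = knee_index([score for _, score in ranked])
    return {name for name, _ in ranked[:k]}


# Hypothetical one-to-all scores from two predictors for one viral protein.
pipe_scores = {"A": 0.95, "B": 0.90, "C": 0.10, "D": 0.09, "E": 0.08, "F": 0.07}
sprint_scores = {"A": 50, "B": 5, "C": 40, "D": 4, "E": 3, "F": 2}

high_confidence = above_knee(pipe_scores) & above_knee(sprint_scores)
print(sorted(high_confidence))
```

Because each curve is thresholded independently, the two predictors do not need to share a score scale; only membership above each local knee is compared. The same set-intersection step could then be repeated across schemas to form the higher-confidence sets.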

The PIPE4 algorithm generates its prediction for a given pair of proteins based on a two-dimensional landscape of scores, where the score at location (x, y) is the number of sequence window similarity "hits".

The list of PPIs generated from both methods can be used to inform the design of anti-SARS-CoV-2 therapeutics by using peptide sequences from the predicted PPI site, which we refer to as the PPI-Site. We define the PPI-Site as the peptide sequence that is responsible for mediating a given PPI. For clarity, we syntactically distinguish the interface residues from a predicted PPI-Site. Since we are looking at sequence similarity across many proteins, the PPI-Site is a proxy for measuring sequence conservation. Therefore, we are identifying the subsequence that has been conserved to support the interaction site, which may include scaffolding residues distal to the actual interface.

To evaluate the performance increase of the combined classifier, we perform Leave-One-Family-Out

To determine which human cellular pathways may be targeted by SARS-CoV-2, PANTHER Gene

With the RP-PPI model, the comprehensive set of human-SARS-CoV-2 pairs was scored to produce 14 one-to-all curves. As above, knee-detection was used to identify the highest-confidence subset comprising n = 539 pairs, as depicted in Figure 3C.

We provide the hit and SW landscapes and predicted PIPE-Sites for each of the predicted interactions

We later discuss the biological relevance of our set of highest-confidence predictions and how these may be leveraged to develop anti-SARS-CoV-2 therapeutics. We further consider these interactions in the context of corroborating evidence from scientific literature and illustrate two particular phases of the viral life cycle that might be targeted.

The genomes of SARS-CoV-2 and other coronaviruses encode numerous proteins of diverse functions. The proteolytic cleavage products of the two polyproteins (i.e. non-structural proteins) play essential roles

CoV-2 and the significance of this remains to be explored; however, based on the previous findings, this interaction should be further investigated.

We encourage the scientific community to delve into the findings of this study. For example, of the GO-terms observed from the all schema alone, the highly over-represented biological processes in Supplementary Materials Table 4

This small network is depicted in Figure 10.

To better visualize the predicted interactome, the complete network-based representation is depicted in Figure 11. Much like the HLA proteins highlighted above, we note a number of highly represented GO-terms around several of the proteins of interest, including those related to the immune response, various types of signalling, and the viral life cycle. We hope that this work will guide the broader research community in their search for putative inhibitory molecules.

The purpose of this work is to help guide the broader research community in the collective pursuit to understand the SARS-CoV-2 viral pathogenesis. To that end, we assessed 285,124 protein pairs using two state-of-the-art sequence-based PPI predictors within three prediction schemas, thereby creating the comprehensive SARS-CoV-2 vs. human interactome. For each of the 14 SARS-CoV-2 proteins considered in this study, a highly conservative locally defined decision threshold was determined to obtain a predicted interactome comprising putative PPIs within the predicted intersection of the PIPE4 and SPRINT methods. Furthermore, the PIPE-Sites algorithm was used to predict the putative interaction interfaces to identify the subsequence regions of interest that might mediate these interactions.

These predictions have been deposited in a public DataVerse repository for use by the broader scientific community in the collective effort to combat the COVID-19 pandemic (Dick et al., 2020a). We re-emphasize that the information provided is theoretical modelling only and caution should be exercised in its use. It is intended only as a resource for the scientific community at large in furthering our understanding of SARS-CoV-2.