Bioinformatic prediction of immunodominant regions in spike protein for early diagnosis of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)

Background To contain the pandemics caused by SARS-CoV-2, early detection approaches with high accuracy and accessibility are critical. Generating an antigen-capture based detection system would be an ideal strategy complementing the current methods based on nucleic acids and antibody detection. The spike protein is found on the outside of virus particles and appropriate for antigen detection. Methods In this study, we utilized bioinformatics approaches to explore the immunodominant fragments on spike protein of SARS-CoV-2. Results The S1 subunit of spike protein was identified with higher sequence specificity. Three immunodominant fragments, Spike56-94, Spike199-264, and Spike577-612, located at the S1 subunit were finally selected via bioinformatics analysis. The glycosylation sites and high-frequency mutation sites on spike protein were circumvented in the antigen design. All the identified fragments present qualified antigenicity, hydrophilicity, and surface accessibility. A recombinant antigen with a length of 194 amino acids (aa) consisting of the selected immunodominant fragments as well as a universal Th epitope was finally constructed. Conclusion The recombinant peptide encoded by the construct contains multiple immunodominant epitopes, which is expected to stimulate a strong immune response in mice and generate qualified antibodies for SARS-CoV-2 detection.


INTRODUCTION
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is highly contagious and has caused more than one hundred million infection cases and over 2.4 million deaths (https://www.who.int/, as of February 15, 2021), posing a huge economic and social burden internationally (Lan et al., 2020;Shang et al., 2020). The reports of SARS-CoV-2 reinfection cases suggest that stronger international efforts are required to prevent COVID-19 re-emergence in the future (Zhan, Deverman & Chan, 2020). Nevertheless, the possibility of SARS-CoV-2 becoming a seasonal epidemic cannot be excluded (Shaman & Galanti, 2020). Even worse, the large number of asymptomatic infections greatly increase the difficulties of epidemic control (Rothe et al., 2020). At present, no specific drugs have been developed for SARS-CoV-2, and the effectiveness of the vaccines on the market still needs time to be evaluated. Therefore, early detection and isolation of infected people are still indispensable means to control the spread of the epidemic, which requires accurate, early, economical, and easy-to-operate diagnostic methods (Yan, Chang & Wang, 2020).
The real-time reverse transcriptase-polymerase chain reaction (RT-PCR) and antibody-capture serological tests are currently the main diagnostic methods for SARS-CoV-2 (Ishige et al., 2020). As the golden standard, RT-PCR is highly reliable (Bustin & Nolan, 2020;Padoan et al., 2020). However, the implementation costs and relatively cumbersome operation problems make it a big challenge for large population screening (Thabet et al., 2020). The antibody-capture serological test is convenient, but seroconversion generally occurs in the second or third week of illness. Therefore, it is not ideal for the early diagnosis of infection (Hachim et al., 2020;Liu et al., 2020;Tang et al., 2020). The antigen-capture test is an alternative diagnostic method that relies on the immunodetection of viral antigens in clinical samples. Accordingly, this method could be applied for the detection of early infection no matter if the patient was asymptomatic or not (Ohnishi, 2008). Compared with RT-PCR based detection method, it is relatively inexpensive and can be used at the point-of-care.
Rapid viral antigen detection has been successfully used for diagnosing respiratory viruses such as influenza and respiratory syncytial viruses (Cazares et al., 2020;Ji et al., 2011;Ohnishi et al., 2005Ohnishi et al., , 2012Qiu et al., 2005). The sensitivity and specificity of the antigen-capture detection system depend highly on the antigen employed to generate antibodies (Ohnishi et al., 2012). The spike protein is one of the structural proteins of SARS-CoV-2, with the majority located on the outside surface of the viral particles (Fehr & Perlman, 2015;Kumar et al., 2020;Woo et al., 2005). It has a 76.4% homology with the spike protein of SARS-CoV. Sunwoo et al. (2013) showed that the bi-specific spike protein derived monoclonal antibody system exhibited excellent sensitivity in SARS-CoV detection. The virus infection is initiated by the interaction of spike protein receptorbinding domain (RBD) and angiotensin-converting enzyme 2 (ACE2) on host cells. It is widely accepted that the spike protein is one of the earliest antigenic proteins recognized by the host immune system (Callebaut, Enjuanes & Pensaert, 1996;Chen et al., 2020c;Gomez et al., 1998;Lu et al., 2004;Sanchez et al., 1999). Nevertheless, the difficulties of using spike protein as an antigen are also obvious. Firstly, it is not easy to express and purify the full-length spike protein (Tan et al., 2004). Besides, the spike protein is highly glycosylated (Kumar et al., 2020) and prone to mutation (Wang et al., 2020a), which may counteract the sensitivity of antigen-capture based detection method. Hence, it is critical to truncating the glycosylation and mutation sites on spike protein as much as possible in antigen design (Meyer, Drosten & Muller, 2014;Tan et al., 2004). A study using the truncated spike protein to detect SARS-CoV achieved a diagnostic sensitivity of >99% and a specificity of 100% (Mu et al., 2008), which suggests that the truncated spike protein of SARS-CoV-2 could also be an appropriate candidate for the early diagnostic testing and screening of SARS-CoV-2. In this study, we analyzed the spike protein via bioinformatics tools to obtain immunodominant fragments. The predicted sequences were joined together as a novel antigen for the immunization of mice and antibody production. Epitopes information presented by this work may aid in developing a promising antigen-capture based detection system in pandemic surveillance and containment.

METHOD Data retrieval and sequence alignment
Multiple bioinformatics analysis tools were used in this study, and the flowchart is depicted in Fig. 1. Coronaviruses had four genera composed of alpha-, beta-, gamma-and deltacoronaviruses. Among them, alpha-and beta-genera could infect humans. Seven beta-coronaviruses are known to infect humans (HCoV-229E, HCoV-OC43, HCoV-NL63, HCoV-HKU1, SARS-CoV, MERS-CoV, and SARS-CoV-2) (Kin et al., 2015;Su et al., 2016). We utilized the NCBI database to obtain the sequences of these human-related coronaviruses spike proteins, of which accession numbers were presented in Fig. 2A. The Clustal Omega Server-Multiple Sequence Alignment was used to analyze the sequence similarity. The analysis of the phylogenetic tree was calculated by the same server. In this study, we set parameters of Clustal Omega as default (Sievers et al., 2011). Additionally, we conducted the EMBOSS Needle Server-Pairwise Sequence Alignment (Needleman & Wunsch, 1970) to compare the whole sequence and several major domains between SARS-CoV-2 and SARS-CoV to find out the specific genomic regions on SARS-CoV-2.

Linear B-cell epitope prediction
Linear B-cell epitopes of the SARS-CoV-2 spike protein were calculated by ABCpred and Bepipred v2.0 servers. For ABCpred, we set a threshold of 0.8 to achieve a specificity of 95.50% and an accuracy of 65.37% for prediction. The window length was set to 16 (the default window length) in this study (Saha & Raghava, 2006). The BepiPred v2.0 combines a hidden Markov model and a propensity scale method. The score threshold for the BepiPred v2.0 was set to 0.5 (the default value) to obtain a specificity of 57.16% and a sensitivity of 58.56% (Jespersen et al., 2017). The residues with scores above 0.5 were predicted to be part of an epitope.

T-cell epitope prediction
The free online service TepiTool server, integrated into the Immune Epitope Database (IEDB), was used to forecast epitopes binding to mice MHC molecules (Paul et al., 2016). Alleles including H-2-Db, H-2-Dd, H-2-Kb, H-2-Kd, H-2-Kk, and H-2-Ld were selected for MHC-I binding epitopes analysis. We checked the "IEDB recommended" option during computation and retained sequences with predicted consensus percentile rank ≤1 as predicted epitope (Trolle et al., 2015). For MHC-II binding epitopes, alleles including H2-IAb, H2-IAd, and H2-IEd were selected for analysis. As the same as MHC-I binding computation, we chose the "IEDB recommended" option, and peptides with predicted consensus percentile rank ≤10 were identified as potential epitopes (Wang et al., 2010;Zhang et al., 2012).

Profiling and evaluation of selected fragments
The secondary structure of the SARS-CoV-2 spike protein (PDB ID: 6VSB chain B) was calculated by the PyMOL molecular graphics system using the SSP algorithm. PyMOL (http://www.pymol.org) is a python-based tool, which is widely used for visualization of macromolecules, such as SARS-CoV-2 spike protein in the current study (Yuan et al., 2016). Vaxijen2.0 server was utilized to analyze the antigenicity of epitopes and selected fragments. A default threshold of 0.4 was set and the prediction accuracy is between 70% and 89% (Doytchinova & Flower, 2007). The hydrophilicity of the selected fragment was analyzed by the online server ProtScale (Wilkins et al., 1999). Surface accessibility of predicted fragments was evaluated by NetsurfP, an online server calculating the surface accessibility and secondary structure of amino acid sequence (Petersen et al., 2009). Critical features such as allergenicity and toxicity were evaluated by online server AllerTOP v2.0 (Dimitrov et al., 2014) and ToxinPred (Gupta et al., 2013). In addition, we utilized IEDB (www.iedb.org) to search the selected fragments and epitopes to clarify whether these peptides have been experimentally verified (Vita et al., 2019). Protein sequence BLAST was performed to evaluate the possibility of cross-reactivity with other mouse protein sequences (Altschul et al., 1997).

Sequence alignment of spike protein in different coronaviruses
We performed sequence alignment to determine the evolutionary relationships between SARS-CoV-2 and other beta-coronaviruses that could infect humans. According to the results of sequence alignment (Figs. 2A and 2B), SARS-CoV is the closest virus to SARS-CoV-2 among the seven HCoVs, exhibiting a 77.46% sequence identity. To better understand the divergence of spike protein sequences between SARS-CoV-2 and SARS-CoV, we further analyzed the sequences of main domains. Results showed that the S2 subunit was the most conserved domain with a 90.0% identity. RBM and NTD domains, which were located in the S1 subunit, exhibited 49.3% and 50.0% identity respectively (Figs. 2C and 2D). Hence, we chose the S1 subunit (amino acid 1-685) for the subsequent bioinformatics analysis given their high specificity.
Linear B-cell epitope prediction of S1 subunit in SARS-CoV-2 spike protein The B-cell epitope is a surface accessible cluster of amino acids, which could be recognized by secreted antibodies or B-cell receptors and elicit humoral immune response (Getzoff et al., 1988). The immunodominant fragments should contain high-quality linear B-cell epitopes to stimulate antibody production effectively. The sequence of the SARS-CoV-2 S1 subunit was evaluated via ABCpred and BepiPred v2.0. A total of 31 peptides were identified by the ABCpred algorithm (Table S1). For the Bepipred v2.0 server, 14 epitopes were forecasted (Table S2). After antigenicity evaluation, 19 and 9 potential linear B-cell epitopes predicted by the ABCpred server and BepiPred v2.0 server were obtained respectively ( Table 1). The peptides predicted by both bioinformatics programs are more likely to be an epitope recognized in vivo. After mapping the positions of peptides identified by these servers, 3 regions containing predicted epitopes were obtained. These regions could be preliminarily considered as candidates for immunodominant fragments ( Fig. 3; Table 2).
Murine T-cell epitope prediction of S1 subunit in SARS-CoV-2 spike protein Though B cells are responsible for producing antibodies, humoral immunity is heavily dependent on the activation of T cells (Cho et al., 2019a). Helper T cells (Th) recognize antigen peptides presented by MHC-II molecules and facilitate the humoral immune response (Cho et al., 2019b;Mahon et al., 1995). During humoral immune responses, antigen-activated T cells could provide help in many aspects including directing antibody class switching and guiding the differentiation of antibody-secreting plasma cells as well as the properties of the B-cell antigen receptor (Cho et al., 2019a;Paus et al., 2006;Shulman et al., 2014). Therefore, the immunodominant fragments containing T-cell epitopes could offer essential help to powerful antibody production. The S1 subunit was selected for the prediction of T-cell epitopes. We utilized the TepiTool server to forecast MHC-I and MHC-II binding epitopes. A total of 35 MHC-I binding epitopes was predicted (Table S3), and 27 peptides were identified as MHC-II binding epitopes (Table S4). The antigenicity of these peptides was calculated via Vaxijen 2.0 server (Table 3). Combined with the MHC-II epitopes prediction results, the candidate immunodominant fragments were adjusted (Fig. 4). Compared with the preliminary candidate immunodominant fragments screened according to the linear B-cell epitope prediction, we added the Spike 14-34 fragment into consideration because it contains a linear Immunodominant fragments refinement according to the glycosylation site distribution, mutation site distribution, and secondary structure A profile of 24 glycosylation sites of SARS-CoV-2 spike protein has been reported (Shajahan et al., 2020). Since glycans could hinder the recognition of antigens by shielding the residues (Walls et al., 2019), protein glycosylation would affect the performance of antigen detection. Thus, glycosylation sites should be circumvented when selecting the immunodominant fragments. According to the study of Shajahan et al. (2020), 15 glycosylation sites were located in the S1 subunit of the spike protein. Hence, the fragments in this study were adjusted to Spike 14-34 , Spike 49-101 , Spike 199-261 , and Spike 583-620 . To retain antigenicity of the epitopes, the final identified fragments only contained 3 glycosylation sites which should have a minimum effect on antigen recognition. Rapid transmission of COVID-19 provides the SARS-CoV-2 with substantial opportunities for natural selection and mutations. To ensure the stability of the detection method, the immunodominant fragments were modified to avoid high-frequency mutation sites (Wang et al., 2020b). Spike 14-34 were excluded for containing four high-frequency mutation sites. Fragment Spike 49-101 was adjusted to Spike 56-92 , and fragment Spike 583-620 was adjusted to Spike 583-609 . By adjusting the fragments, we avoided  the above high-frequency mutation sites, which might avoid the impact of mutations on detection performance and improve the detection efficiency in the future (Li et al., 2020; Tegally et al., 2020).
The PyMOL was used to present the secondary structure of the spike protein (PDB ID: 6VSB) (Fig. S1). To keep the integrity of the secondary structure of the selected fragments, we extended the N-and C-ends with 2~5 residues, and the immunodominant fragments were finally adjusted to Spike 56-94 , Spike 199-264 , and Spike 577-612. The epitopes  and potential glycosylation sites contained in the selected immunodominant fragments were displayed in Fig. 5.

Profiling, evaluation, and visualization of selected immunodominant fragments
To further evaluate the antibody binding potentiality of these antigenic regions, the key features of the selected fragments such as antigenicity, hydrophilicity, surface accessibility, toxicity, and allergenicity were analyzed and presented (Table 5). The hydrophilicity and surface accessibility of the spike protein subunit 1 were calculated. The selected fragments of interest were submitted for computation of antigenicity, toxicity, and allergenicity. Three fragments presented relatively moderate hydrophilicity and surface Figure 5 The epitopes and glycosylation sites on the selected immunodominant fragments. (A-C) Present the predicted epitopes and glycosylation sites on fragment Spike56-94, Spike199-264 and Spike577-612 respectively. The black squares represent epitopes predicted by ABCpred server, and the black frames represent epitopes predicted by Bepipred v2.0 server. The red squares represent MHC-I binding epitopes, and the red frames represent MHC-II binding epitopes. The gray squares mean occupied glycosylation sites contained in the selected fragments. accessibility. The proportion of hydrophilic amino acids in the selected fragments Spike 56-94 , Spike 199-264 , Spike 577-612 are 48.72%, 45.45%, 33.33% respectively. The surface accessibility of these fragments calculated by the online server was shown in Table 5. The toxicity of the selected fragments was examined and no fragment was predicted to be toxic. The allergenicity was assessed and only fragment Spike 577-612 was predicted to be a probable allergen. Attention should be paid to monitor potential allergic reactions when injecting the recombinant protein into mice. And the selected fragments were presented as the sphere in the trimer structure (Fig. 6). Next, we scanned the selected fragments utilizing the IEDB database to determine whether they were experimentally tested. The results showed that Spike 200-215 (IEDB ID: 1330367) and Spike 238-252 (IEDB ID: 1329417) were identified experimentally as HLA class II epitope in SARS-CoV-2. Spike84-92 (IEDB ID: 1321049) and Spike202-210 (IEDB ID: 1319559) have been experimentally proved as HLA-B epitopes (Table S5). These findings enhanced the credibility of the current in silico analysis. The fragments identified would have a strong capacity in stimulating powerful antibody production.

Immunodominant fragments based recombinant antigen design
Three immunodominant fragments embody several linear B-cell epitopes, MHC-I binding, and MHC-II binding T-cell epitopes were selected. As a universal Th epitope, the PAN DR epitope [PADRE(AKFVAAWTLKAAA)] was added into the construction aiming to boost helper T cell activity (Alexander et al., 2000;Ghaffari-Nazari et al., 2015). (GGGGS) n is a wildly used flexible linker with the function of segmenting protein fragments, maintaining protein conformation, preserving biological activity, and promoting protein expression (Chen, Zaro & Shen, 2013). Finally, we combined the fragments and a PADRE epitope by linker peptide (GGGGS) 2 and (GGGGS) 3 (Chen, Zaro & Shen, 2013) (Fig. 7). The predicted antigenicity of the final construct (194 aa) was 0.5690 (Table 6). A protein BLAST for the final construct was conducted to evaluate the possibility of cross-reactivity. The BLAST result suggested that, except for the SARS-CoV-2 spike protein, no protein would cross-react with the construct (Raw data in the Supplemental Files), which indicated that our fragments possess good specificity.

DISCUSSION
In this study, the immunodominant fragments within the S1 subunit of the SARS-CoV-2 spike protein were explored. The final construct consists of three immunodominant fragments Spike 56-94 , Spike 199-264 , Spike 577-612 , and a PADRE epitope. The recombinant antigen will be used to immunize mice to generate qualified antibody which could be applied for developing an antigen-capture based detection system. The antibody-based antigen capturing method is user-friendly, time-saving, and economical. Thus, it is an ideal complementary detection strategy especially for early diagnosis and large population screening. The monoclonal antibodies against SARS-CoV have been successfully applied in the immunological antigen-detection of SARS-CoV (Ohnishi, 2008). Accordingly, we explored the immunodominant fragments on the spike protein of SARS-CoV-2, which would provide aid in developing an accurate and fast antigen-capture based early detection system for SARS-CoV-2.
We selected the S1 subunit for immunodominant fragments screening after divergence analysis. It had been reported that an S1 antigen-based assay of SARS-CoV could capture the virus as soon as the infection occurs (Sunwoo et al., 2013). Jong-Hwan Lee et al. designed a method that could seize and detect spike protein S1 subunit of SARS-CoV-2 using ACE2 receptor and S1-mAb (Lee et al., 2021). This finding suggests that it is appropriate to use the S1 subunit for specific and early diagnosis of SARS-CoV-2. Three immunodominant fragments (Spike 56-94 , Spike 199-264 , and Spike 577-612 ) were identified in the present study. These sequences will be joined to construct recombinant peptides in the next step. Instead of using inactivated full-length spike protein, we designed a novel recombinant protein construct that increased sequence specificity as well as circumvented mutation sites and glycosylation sites. As the antigen design is based on bioinformatics study, the exact ability of the selected fragments to produce qualified antibodies for virus detection has yet to be determined by experiments.
Noticeably, the spike protein of SARS-CoV-2 is heavily glycosylated. Glycans could shield epitopes during antibody recognition, which may interfere with the detection of viral proteins (Shajahan et al., 2020). About 17 N-glycosylation sites along with two O-glycosylation sites were found occupied in the spike protein of SARS-CoV-2 (Shajahan et al., 2020). We circumnavigated most glycosylation sites when selecting immunodominant fragments. The three selected fragments in this study only contain 3 glycosylation sites. In case these glycosylation sites impede the diagnostic performance, an additional deglycosylation step with N-glycanase should be applied for the test specimens (Dermani et al., 2019), which is a simple and efficient method for deglycosylation (Hirani, Bernasconi & Rasmussen, 1987;Huang et al., 2015;Lattová et al., 2016;Zheng, Bantog & Bayer, 2011). Alternatively, an eukaryotic expressing system could be employed to mimic the antigen presented in human cells.
Though coronaviruses can find and repair errors during the replication process (Wang et al., 2020b), the SARS-CoV-2 genome still presents a large number of mutations. Mutations could not only help virus slip past our immune defense, but also spoil the efficiency of diagnostic tests (Chen et al., 2020b). In this study, we circumvented high-frequency mutation sites when selecting antigen fragments. In addition, our fragments also avoided RBD regions which are prone to mutation (Chen et al., 2020b). The construct finally built contained no high-frequency mutation.
To date, several studies using predictive algorithms to analyze SARS-CoV-2 have been reported (Alam et al., 2020;Behmard et al., 2020;Can et al., 2020;Chen et al., 2020a;Dong et al., 2020;Poran et al., 2020;Saha, Ghosh & Burra, 2021;Sohail et al., 2021). However, most of these bioinformatics analyses against SARS-CoV-2 intended to develop effective vaccines to prevent infection and the identified sequences possess high homology with other viruses, especially SARS-CoV (Bhattacharya et al., 2020;Chen et al., 2020a;Robson, 2020). On the contrary, the fragments suitable for diagnosis should be unique when compared with other species to ensure the specificity of detection. Therefore, the results obtained from vaccine studies are not ideal for virus detection. In this study, attention was paid to the sequences with high variability, hence the immunodominant fragments identified are more specific. Distinct from vaccine studies, murine MHC alleles were selected in epitopes prediction in this study, so that the designed antigen could trigger a strong humoral immune response in mice. Furthermore, glycosylated sites and recently identified high-frequency mutation sites were deliberately avoided during the screening process to eliminate their potential adverse impact.
In silico analysis has been widely used to mine and identify various pathogens as well as epitopes prediction (Kiyotani et al., 2020;Liò & Goldman, 2004;Qin et al., 2003;Robson, 2020;Shen et al., 2003). In this study, identified fragments were further scanned in the IEDB database, and found four peptides contained in the sequences were experimentally validated epitopes (Table S5), which reinforced the conclusion of the present study. In the following studies, we will immunize Balb/c mice with the designed antigen to generate mAbs which could be utilized for SARS-CoV-2 diagnosis after evaluating their sensitivity, specificity, and other related properties.

CONCLUSION
Through bioinformatics analysis, three immunodominant fragments were identified in the present study. After connected by flexible linkers, we acquired a final recombinant peptide with 194 residues. It was predicted to possess high antigenicity and specificity for SARS-CoV-2. Our next move is to express and purify the recombinant protein in a suitable expression system, followed by immunizing the mice with purified immunogen to obtain specific antibodies. The present study would provide aid in developing an antigen-capture based detection system.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the emergency and special scientific program of COVID-19 epidemic of Changsha, China (grant number: kq2001038) and the National Science