SPIKES: Identification of physicochemical properties of spike proteins across diverse host species of SARS-CoV-2

Summary We describe a protocol to identify physicochemical properties using amino acid sequences of spike (S) proteins of SARS-CoV-2. We present an S protein prediction technique named SPIKES, incorporating an inheritable bi-objective combinatorial genetic algorithm to determine the host species specificity. This protocol addresses the S protein amino acid sequence data collection, preprocessing, methodology, and analysis. For complete details on the use and execution of this protocol, please refer to Yerukala Sathipati et al. (2022).

CRITICAL: To download the initial dataset, computer must have 4 GB of disk space. The minimum requirements to run the protocol are 8 GB RAM and 1.5 GHz quad-core processor. The SPIKES program can be performed in Mac OS/Linux platforms.

Overview of the method Dataset preparation
Timing: 1-2 h 5. The S protein sequences of the human and animal host's CoVs can be downloaded from the Global Initiative on Sharing Avian Influenza Data (GISAID) link and National Center for Biotechnology Information (NCBI) link databases. a. Human and animal host coronavirus (CoV) protein sequences in FASTA format can be downloaded from GISAID and NCBI databases. Examples of sequence IDs of human and animal host CoVs and their accession links are shown in Table 1. The dataset consisting of human and animal host CoV sequences are available at Github link. The description of the FASTA format can be found at link. b. Reduce the sequence redundancy and ambiguity of the sequences using CD-HIT web server (Huang et al., 2010) link.
CRITICAL: The time needed depends on the data size, computer processing time and disk storage. Please check the recommended computer specifications before downloading and extracting the data from the GISAID website.

Feature extraction and modeling
Timing: 1 h 6. Convert the FASTA sequences to a numerical dataset. a. Extract the PCPs from the FASTA sequences using the Amino acid index database (AAindex). b. Extract amino acid compositions, dipeptide compositions, and pseudo amino acid compositions from the S protein sequences for the purpose of analysis. 7. Scaling the features is suggested before you start the data modeling. 8. Divide the dataset into training and test sets for model training and evaluation, respectively. 9. Use SPIKES for feature selection and model tuning.
a. SPIKES is designed to learn the best combination of feature set and model parameters from the applied data set. SPIKES selects a subset of features from the original data and predicts the species specificity simultaneously. b. There are two parts in this section, first, the setting of parameters for feature selection and evaluation, and second, the setting of the search range of SVM parameters. i. The GA-chromosome consists of binary GA-genes for selecting informative features of PCPs and two 4-bit GA-genes for encoding the parameters C and g of SVM. The inheritable bi-objective combinatorial genetic algorithm (IBCGA) can simultaneously obtain a set of solutions, Xr, where r=r end , r end + 1,., r start in a single run. In SPIKES, the radial basis function (RBF) kernel was used for the implementation of SVM. ii. The prediction model with tuned parameters is available at link. CRITICAL: Machine learning knowledge and programming skills are necessary to tune the parameters of SVM.
Trouble shooting: A stepwise procedure and a set of recommended parameters for the SPIKES model can be accessed from link.

Physicochemical property analysis
Timing: 3 h 10. Use Microsoft Excel or GraphPad Prism 9 to calculate the amino acid differences between human and animal host coronaviruses. a. Measure the amino acid compositions for all S protein sequences of human and animal host CoVs using the code provided in the following pages. b. Compare the amino acid composition differences between S proteins of human and animal host CoVs using the subtraction function in Excel (example [Human(A)-Animal(A)]). c. The expected outcome would be the differences between amino acid compositions between S proteins of human and animal host CoVs. 11. Measure the distinguishing features, identified by SPIKES, in the S proteins of human and animal CoVs to determine how amino acid preferences differ between the S proteins of human and animal CoVs. 12. Use the spike glycoprotein mutation surveillance dashboard from CoVsurver in GISAID to explore mutations in the CoV variant of interest. 13. Download and install UCSF Chimera software for protein structure visualization from link. This section describes download of the S protein sequence data from GISAID and NCBI databases.

KEY RESOURCES
1. User registration is required to download the data from GISAID link. a. First, log in to the GISAID database, search for the SARS-CoV virus in the search box. Select the host (human/others), and download the S protein sequences in FASTA format. b. Human and animal host CoV S protein sequences can be obtained from the NCBI database using the search terms ''S protein'' and ''SARS-CoV''. c. A description of acquired data from the GISAID and NCBI databases, and an example of the FASTA file format is shown in Figures 1A and 1B. 2. Preprocessing the FASTA sequence data.

OPEN ACCESS
a. The human and animal FASTA sequences must be formatted independently. b. The input must be all the S protein sequences of human/animal in bulk. c. The Initial dataset consisted of several thousand S protein sequences of spike human (spike-H) and spike animals (spike-A). Because amino acid changes are a crucial factor for disease transmission, we considered 99% sequence identity in the redundancy reduction process. Use the CD-HIT web server to reduce the sequence redundancy. i. Go to the CD-HIT we bserver at link. ii. Load query FASTA sequence file under cd-hit tab.
iii. Choose sequence identity cut-off value of 0.99. iv. Algorithm parameters: select 'yes' to the 'use global sequence identity' and 'sequence is clustered to the best cluster that meet the threshold'. Set bandwidth of alignment to 20 and length of sequence to skip to 10. v. Alignment coverage parameters: set a default value '0' for minimal alignment coverage (fraction) for the longer sequence, minimal alignment coverage (fraction) for the shorter sequence, and minimal length similarity (fraction). Set 'unlimited' to the maximum unaligned part (amino acids/bases) for the longer sequence, maximum unaligned part (amino acids/bases) for the shorter sequence, and maximum length difference in amino acids/bases(-S). vi. User may change or use default options for algorithm parameters and alignment coverage parameters. vii. Click on submit button. d. Users can also use the protein-to-protein BLAST option from the BLAST server link to determine the sequence similarity between two protein sequences. e. Exclude ambiguous FASTA sequences by removing any protein sequences containing special characters, such as 'B, J, O, U, X and Z', that do not represent a specific amino acid. f. After sequence redundancy reduction and accounting for uncertainties, the dataset consisted of 211 and 611 S protein sequences of spike-H and spike-A, respectively. For analysis, select the 211 spike-H proteins and a random sample of the 611 spike-A proteins. This will produce a final balanced dataset consisting of 211 and 211 S protein sequences of spike-H and spike-A, respectively. g. It could take 2 h for downloading the desired/updated SARS-CoV-2 sequences and preprocessing, which depend on the speed of the Internet connection and the computer processor.
CRITICAL: Ensure you have sufficient storage on your computer before downloading data from GISAID. Downloading and extracting the initial dataset from GISAID requires 4 GB of disk space.

Feature extraction
Timing: 1 h The AAindex database contains numerical indices representing the PCPs of amino acids (Kawashima et al., 2008). Each index represents a different PCP and has 20 values, one for each of the 20 naturally occurring amino acids.
3. SPIKES adopted 531 PCPs retrieved from the AAindex database as candidate features to distinguish S proteins of diverse CoV strains. The original CoVs' amino acid sequences were converted into AAindex numerical indices according to the 531 PCP values. That is, each amino acid in the sequence was represented by the 531 index values for that amino acid in the AAindex. The feature representation of the 531 PCPs is described as follows: a. Collect the spike protein sequences from the dataset. b. Calculate the amino acid composition l(aa i ) of a sequence for the i th amino acid aa i of 20 amino acids and encode the protein sequence of variable length into the feature vector with a length of 531 properties. The AAindex matrix is provided in link. c. Calculate the feature value of the p th physicochemical property, PCP(p), of a spike protein, where p=1, 2, ., 531.
4. Before converting the FASTA sequence into AAindex features, users are required to compile the AAindex to get the executable program.

Feature selection and model tuning
Timing: 1 h Feature selection and model tuning are the core of this workflow.
5. Feature scaling helps to normalize the features within a specific range. Scaling maps the feature values into the {-1, 1} or {0, 1} interval. The feature scaling to the dataset can be applied using ''svm-scale'', a regular program provided by LIBSVM. 6. The feature values were scaled in to the range of -1 and +1. The feature scaling of the dataset is provided in the link.

Note:
The training and test data should apply the same scaling parameter. For more information, please refer LIBSVM website and download its SVM package: link.  (Vapnik, 1999) incorporating an optimal feature selection algorithm, IBCGA (Ho et al., 2004) to select informative features from a large number of candidate features. SPIKES selects important PCPs of S proteins and distinguishes the human host from the animal host coronaviruses simultaneously. Further details of the feature selection algorithm could be found in these studies (Tung and Ho, 2007) and (Tung and Ho, 2008). The parameter tuning of SVM can also be found in the study (Yerukala Sathipati and Ho, 2021). The steps involved in the feature selection process is depicted in Figure 2. 8. The steps involved in IBCGA are as follows: a.

Feature selection
Step 1: (Data Preparation) Compile the training sets from spike-H and spike-A for developing and evaluating the SPIKES method. b.
Step 2: (Initialization) Randomly generate an initial population of N pop individuals. N pop = 50, r start = 50, r end = 10, and r = r start . c.
Step 3: (Evaluation) Evaluate the fitness value of all individuals using the fitness function that is the prediction accuracy in terms of 10-fold cross-validation (10-CV). d.
Step 4: (Selection) Use a conventional tournament selection method that selects the winner from two randomly selected individuals to generate a mating pool. e.
Step 5: (Crossover) Select two parents from the mating pool to perform an orthogonal array crossover operation. f.
Step 6: (Mutation) Apply a conventional bit mutation operator to GA-genes of SVM parameters and a swap mutation to the binary GA-genes for keeping r selected features. The best individual was not mutated for the elite strategy. g.
Step 7: (Termination test) If the stopping condition for obtaining the solution X r is satisfied, output the best individual as the solution X r . Otherwise, go to step 2. h.
Step 8: (Inheritance) If r > r end , randomly change one bit from 1 to 10 in the binary genes for each individual. Decrease the number r by one and go to step 2. Otherwise, stop the algorithm. i.
Step 9: (Output) Obtain a set of m features in total for PCPs, from the best solution X m among the solutions X r , where r = r end , r end + 1,. , r start . 9. The SPIKES model file to select the PCPs has been provided at link. 10. First train the SPIKES-Training data using the SPIKES model file and validate on SPIKE-Test data.
The usage of datasets and SPIKES is provided at link.

OPEN ACCESS
Below is the list of PCPs obtained from SPIKES.
Note: Users may get a different set of features by using a larger S protein dataset and modifying the tuning of search parameters.
14. To determine the significance of compositional changes in amino acids in S protein from different strains of CoVs, the user may use the hCoV-19 spike glycoprotein mutation surveillance dashboard from GISAID. a. Login into the GISAID webpage. Select EpiCoV > spike glycoprotein mutation surveillance. 15. The specific amino acid changes in S proteins have been implicated in increasing the infectivity and virulence of new variants. User may use GISAID data statistics to examine the amino acid changes in S protein that increased the infectivity in emerging new variants. 16. Users can use Chimera protein visualization software to visualize the S protein structure in .pdb format and observe the amino acid preferences in alpha helices and beta sheets. a. Visit the Chimera URL link and download and install the software. b. Visit the Protein Data Bank URL link. Search for the S protein (e.g., PDB: 6VXX). c. Download the protein structure in .pdb format. d. Open the 6VXX.pdb file in the Chimera protein visualization software. File > open > 6VXX.pdb.

EXPECTED OUTCOMES
The expected outcome from SPIKES includes the prediction probability of distinguishing human and animal host CoVs. The normalized probabilities less than 0.5 are predicted to be human host spike proteins, those greater than 0.5 are animal host spike proteins. The results format is as shown on the Github page link.
The comparison analysis of PCPs reveals the amino acid preferences for each property between human and animal host CoVs. A PCP is compared between human and animal host CoVs as shown in Figure 3. The PCP values can be calculated for the human and animal host CoVs using Microsoft Excel or Graphpad Prism.
PCPs could be calculated using the following code or either using an Excel sheet. In Excel enter the PCP values from the AAindex link in one column and multiply with the amino acid composition values of either S protein sequences of human or animal. Next, the amino acid composition of S proteins between human and animal host CoVs can be compared as shown in Figure 4. An important property identified by SPIKES is JOND920102, which describes the degree of sequence differences among species, termed as 'Relative mutability' by Jones et al. (Jones et al., 1992). The relative mutability of amino acids demonstrates the amino acid changes that occur over time. Recurrent mutations are one of the crucial factors for the ongoing adaption of SARS-CoV-2 to the human host   (van Dorp et al., 2020). To determine the significance of compositional changes in amino acids in S proteins from different strains of CoVs, we compared changes between Rousettus bat coronavirus (Gen-Bank: AOG30822.1) and hCoV/wuhan/WIV05/2019 strain using the hCoV-19 spike glycoprotein mutation surveillance dashboard, GISAID. The results of the comparison are shown in Figure 5. Next, we used GISAID data statistics to examine the amino acid changes in the S protein that increased the infectivity in emerging new variants. The amino acid changes that increased the infectivity in different variants are shown in Table S1. Users may get more details on emerging variants from link.
From our previous analysis, we observed that the average hydrophobicity index for the hydrophobic amino acids of the S proteins in human host CoVs (0.18 G 0.18) was slightly larger than that in animal CoVs (0.17 G 0.14) (Yerukala Sathipati et al., 2022). To measure the surface hydrophobicity of an S protein, the Chimera protein visualization software can be used to view the protein structure. The surface hydrophobicity of an S protein is shown in Figure 6.

QUANTIFICATION AND STATISTICAL ANALYSIS
The PCP comparison analysis was used to calculate the mean G standard deviation of each PCP. The Student's t test was used to compare group means between human and animal host CoVs; p < 0.05 indicates statistical significance. The LiBSVM package (Chang and Lin, 2011) was used to classify S proteins as being derived from human or animal host CoVs. Additional details can be found at link.

LIMITATIONS
The SPIKES method described has some limitations. Using different feature selection algorithms and classifiers can produce contradictory predictions for the same dataset.
However, the robust SPIKES model was selected after 50 independent runs of IBCGA. We provided the stepwise procedure at link. Here, we only extracted the amino acid sequence-based features. However, the function of an S protein also could be influenced by the stoichiometry of its complex with the ACE2 receptor. Including a greater amount of S protein sequence data in the analysis could influence the selection of PCPs. It is a possibility that features selected using modified or alternative approaches might overlap, at least partially, with properties selected by SPIKES. Second, this study used data from GISAID until a certain period (May 2021); therefore, the inclusion of updated data may add to the significance of PCPs and amino acid compositions. Third, in vivo-based experimental validation and molecular docking studies could further strengthen the findings.

TROUBLESHOOTING
Problem 1 Protein sequences are not converting into AAindex properties.

Potential solution
Ensure all sequences are in FASTA format. None of the amino acid sequences should contain uncertainties like 'B, J, O, U, X, Z' or other non-amino acid codons. FASTA files and aaindex.exe files should be in the same working directory.
Problem 2 AA index conversion and scaling the FASTA sequences.

Potential solution
Commands are case sensitive. Sequence files and executable files should be in the same working directory.

Problem 3
Getting errors in modeling.

Potential solution
Please follow the instructions on the Github page link for the input FASTA format and generating prediction results.

Problem 4
How to extract the physicochemical property values from AAindex?

Potential solution
Please use the AAindex matrix values provided in the link.

Problem 5
How to perform mutational analysis?

Potential solution
Please use the CoVsurver mutation analysis application link from GISAID to perform the analysis.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the technical contact and lead contact, Ming-Ju Tsai (mingjutsai@hsl.harvard.edu) and Srinivasulu Yerukala Sathipati (sathipathi.srinivasulu@marshfieldclinic.org).

Materials availability
This study did not generate new unique reagents.

Data and code availability
The protein sequence data used in this analysis is available at https://www.gisaid.org and https:// www.ncbi.nlm.nih.gov.

ACKNOWLEDGMENTS
This study is funded by the Marshfield Clinic Research Institute.

AUTHOR CONTRIBUTIONS
S.Y.S. designed the system and carried out the detailed study. S.Y.S. and M.T. participated in the experimental design and implemented the experiments. T.C., S.K.S., and S.Y.H. participated in data analysis and interpretation. S.Y.S. and S.Y.H. supervised the study. All authors participated in the manuscript preparation and approved the final manuscript.

DECLARATION OF INTERESTS
The authors declare competing interests.