Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning

Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but they typically lack sequence-level information for the virus. Studies which do obtain viral genome sequences are generally limited in time, location, and population scope. Retrospective meta-analyses require time-consuming data extraction from heterogeneous formats and are limited to publicly available reports. Fortuitously, a subset of GISAID, the global SARS-CoV-2 sequence repository, includes “patient status” metadata that can indicate whether a sequence record is associated with mild or severe disease. While GISAID lacks data on comorbidities relevant to severity, such as obesity and chronic disease, it does include metadata for age and sex to use as additional attributes in modeling. With these caveats, previous efforts have demonstrated that genotype-patient status models can be fit to GISAID data, particularly when country-of-origin is used as an additional feature. But are these models robust and biologically meaningful? This paper shows that, in fact, temporal and geographic biases in sequences submitted to GISAID, as well as the evolving pandemic response, particularly reduction in severe disease due to vaccination, create complex issues for model development and interpretation. This paper poses a potential solution: efficient mixed effects machine learning using GPBoost, treating country as a random effect group. Training and validation using temporally split GISAID data and emerging Omicron variants demonstrates that GPBoost models are more predictive of the impact of spike protein mutations on patient outcomes than fixed effect XGBoost, LightGBM, random forests, and elastic net logistic regression models.


Introduction
Throughout the COVID-19 pandemic, SARS-CoV-2 has mutated in ways that have significantly impacted pathogenesis. Epidemiological studies can show different risks of severe disease due to different COVID-19 variants, such as Delta and Omicron, but typically lack resolution at the level of specific combinations of changes in viral genome sequences. The emergence of the COVID-19 pandemic, however, has coincided with the widespread availability of lower cost, rapid whole genome sequencing. As of writing, over 10 million SARS-CoV-2 sequences were available to researchers from the GISAID website (http://www.gisaid.org) [1,2]. GISAID includes a metadata field for ''patient status'' for a subset of sequences, which represents a potentially unparalleled resource for genetic analysis. If its potential can be unlocked, GISAID could provide the data necessary to develop a model of disease severity based on viral genotype and the limited patient characteristics available on GISAID, age and gender.
Clinical data which differentiate SARS-CoV-2 genotype generally do so at the level of lineages using the Pango nomenclature [3,4], or most has appeared to result in less severe disease, even when controlling for vaccination [17]. Lower risks of hospitalization, and in particular shorter hospital stays, reduced ICU admissions, and less use of ventilation rates, particularly as compared to Delta, have been shown in studies from Denmark [18], the United States, [19,20], and the United Kingdom [21,22]. Epidemiological data for Omicron are consistent with in vitro and animal studies, which have shown a reduction in lower lung infectivity, deficient cell entry, and a reduction in syncytium formation due to reduced ability of the spike protein to mediate plasma membrane fusion [23][24][25][26].
The aforementioned differences in case outcome are between apparently sequential emergent SARS-CoV-2 genetic lineages. In fact, however, changes to SARS-CoV-2 properties often implicate combinations of multiple mutations that emerge simultaneously-and then sometimes revert in whole or in part as the virus continues to evolve [27,28]. For example, Delta, originally Pango-designated B.1.617.2, has spawned complex sublineages (generally identified by an AY prefix) with distinct immune evasion and virulence properties-and can genetically share more in common with other lineages than the one from which they apparently branched [29,30]. During a long-term infection, a spike protein may emerge with multiple variations, i.e., a ''long branch'' divergence from the phylogenetic tree, a process hypothesized as the origin of the Omicron variant of concern (initially identified as Pango lineage B.1.592, and soon redesignated as BA.1 and BA.2) [31]. As Omicron has become dominant, subsequent Omicron subvariants have emerged, including subvariants of BA.1 and BA.2, as well as BA.3, BA.4, and BA.5 [32]. These Omicron subvariants are characterized by apparent recombination of mutations, as well as the appearance of mutations that look similar to those in previous variants, such as Alpha [33]. Recombinant variants of Delta and Omicron have also been identified in larger numbers; these are designated in the Pango scheme with an initial ''X'', e.g., XD and XE [34].
Conventional techniques for studying the effect of SARS-CoV-2 genotype on COVID-19 mortality and symptom severity have included analysis of single nucleotide polymorphisms (SNP) and genome wide association studies [35][36][37], literature meta-analysis [38]. Individual studies have been limited in the number of patients, and meta-analyses are time-consuming, complicated by having to access underlying data, and generally exclude unpublished data. The viral sequences from many such published and unpublished studies have been deposited in the GISAID global SARS-CoV-2 sequence repository, along with data for patient status. Various groups have investigated the incidence and prevalence of mutations in GISAID entries with clinical metadata [39,40]. While these studies have yielded some potential candidate mutations, the data are often conflicting or not necessarily consistent with epidemiological or laboratory observations. For example, some of the latter studies showed a link between the D614G mutation, which emerged early on as cases spread from Asia to Europe and North America, with increased disease burden. But another study of case fatality rates by region did not find a correlation with the dominant clade in that region [41].
In general, efforts to build logistic regression and other statistical models to predict mild versus severe disease on GISAID data have shown that much of the explanatory power is provided by patient age, gender, and region of origin, rather than clade or lineage [42]. However, it has been shown that adding sequence data to a logistic regression method can produce a more accurate prediction of severe versus mild disease than one with only age, gender, and region, although the difference was not particularly large [43]. Moreover, an updated of the latter model trained and tested on more recent sequence data resulted in deteriorated performance, even when employing a more accurate random forests classifier method [44]. Another group employed a powerful gradient-boosted decision tree ensemble classifier method, XGBoost, and found that models evaluated using temporally split data, i.e. trained on earlier sequences and tested on later-emerging sequences, substantially outperformed models evaluated using cross-validation, in which the training and test samples are randomly selected [45]. The authors analyzed the trained models to identify mutations associated with increased severe disease risk. However, key findings such as the V1176F mutation, while present in VOC, have not been specifically linked to disease severity in laboratory or epidemiological studies. Other methods have been employed, including deep neural networks [46,47] and Bayesian multinomial logistic regression to infer growth rate, and thus viral fitness, from individual sequence mutations [48]. However, there is no consensus modeling method for analyzing GISAID data, and the complexity of the data has not been fully analyzed.
The modeling approach in this paper begins by first evaluating the trends and structure of GISAID data-a critical step for developing robust genotype-phenotype models. There are two issues that impact the use of GISAID as a data source: (1) The nature of the pandemic has changed over time, with more screening of asymptomatic or mild cases, as well as improvements in therapeutics and widespread vaccination. (2) While unprecedented numbers of SARS-CoV-2 sequences have been deposited, that still represents only a sampling of viral infections worldwide. For example, as of April 2022, over half of all sequences in GISAID are from the United States and United Kingdom [49]. The set of sequences with GISAID patient metadata which we were able to curate (excluding illegible or unknown metadata fields) are an even smaller subsample. In practice, other work has shown that models trained on earlier records do not perform well on later sequence records [44,45]. In part due to the evolution of novel mutations, but it may also have to do with changes in the nature of what kind of sequences are submitted to GISAID over time. For example, in our previous work, we showed that, through September 2021, there had been a consistent increase in ''mild'' cases observed in the database [47]. In the analysis shown here, we identify that the latter temporal trends may have now stabilized, but that heterogeneity in the geographic origin of samples may be an important confounder.
Notably, other potential risk factors, such as obesity or chronic disease, are not provided in GISAID metadata. Moreover, while vaccination status is a metadata field, only very few samples include an entry for it. As a result, modeling efforts based on GISAID data will always lack information on known co-founders. Even so, GISAID does have many more samples than targeted studies that do include information about comorbidities, which at least mitigates potential data bias. Also, training om data after vaccination is more widespread, as shown here, can mitigate biases due to vaccination status. That said, the foregoing caveats are important for the work presented here as well as all efforts to model the effect of viral genotype on disease severity.
Taking the aforementioned observations and caveats into account, in this paper we examine the overall data set. Then, we identify a timeframe for model training that can result in more robust models for evaluation, and hypothesize that including geographic origin through the ''country'' metadata field in a mixed effects model will result in more robust models as well. We propose to use a recently-developed mixed effects machine learning method, GPBoost, which incorporates decision tree-boosting to efficiently train on large data sets with many features [50]. GPBoost is compared to conventional methods, including logistic regression, as well as ensemble decision-tree based methods, Random Forests [51], XGBoost [52], and LightGBM [53]. The bestperforming methods, GPBoost, XGBoost, and LightGBM, are interpreted using SHAP (SHapley Additive exPlanations) [54], which has previously been used to interpret XGBoost models [45]. The modeling methodologies are evaluated on the spike protein sequence, as it binds to host cell receptors, mediates cell entry, is a key target for the immune response, and has a high rate of mutation [55][56][57][58]. Analyzing only the spike protein sequence further reduces the risk of overfitting and make models more computationally tractable.

Spike protein sequence collection and pre-processing
Spike protein sequences are obtained from a FASTA file available from the GISAID database (http://www.gisaid.org). The data for this study were downloaded on sequences that were submitted to and processed by GISAID as of April 15, 2022. Based on the metadata for collection date, the latest-collected sample in this data set was from April 10, 2022. GISAID performs various data curation tasks; of relevance here, Spike protein sequences are preprocessed by GISAID by multiple sequencing alignment, identifying ORFs, and translating nucleotide sequences to obtain protein FASTA files [2]. The FASTA file is parsed to obtain only those sequences for which patient metadata are available. (The section below details how patient metadata are obtained). The acknowledgment table for the sequences used in this study may be found at https://doi.org/10.55876/gis8.220606hk.
Many of the spike sequences are truncated due to sequencing gaps and errors. Therefore, the Spike protein sequences from the FASTA file are aligned with respect to the consensus Spike reference sequence (Wuhan-Hu-1 isolate) obtained by multiple sequence alignment of early genome sequences [59]. The alignments are generated using the local pairwise Striped Smith-Waterman (SSW) method [60,61], with BLO-SUM62, implemented with the scikit-bio package in Python 3.8 [62]. Aligned sequences shorter than the reference (1273 residues) are front and/or end padded with a ''*'', and otherwise all indels are at positions corresponding to the reference. To preserve as many samples as possible, there is no filtering sequences with ''*'' (mask) or ''X'' (ambiguous amino acid).

Patient status metadata collection and pre-processing
The GISAID database provides an option to identify sequence records that include ''patient metadata'' and to download the metadata file with that information. This study includes the data from the records available for sequences collected by April 15, 2020: 414,297 records in total. (By comparison, at that time, the aforementioned Spike protein sequence FASTA file, that are from studies with and without metadata, included over 5 million sequences.) After metadata exclusions are applied (described in the following paragraphs), 163,496 samples remain available for machine learning. These records include an entry for ''patient status'' as well as metadata fields generally available for all SARS-CoV-2 sequences on GISAID, which include inter alia host, the continent/country/region of collection, Pango nomenclature lineage, NextStrain clade, sample collection and submission date, patient age, and patient gender. As an initial matter, all samples for which the host is not identified as ''Human'' are removed from the dataset.
The patient metadata consists of a single field with text provided by the submitter of the sequence. There are many different kinds of entries, including misspellings. As a preliminary step, these metadata entries are translated to a ''Status''. The ''Status'' translates different entries which may consist of different spellings or synonyms for the same activity, such as cases obtained by screening asymptomatic carriers. The table includes examples of entries assigned to these categories. Table 1 shows all of the unique metadata entries in the full patient metadata set (414,297 records) along with the corresponding ''Status'' designation. The resulting status is then categorized, generally following the commonly used case classification such as those defined by the United States National Institutes for Health (NIH) COVID-19 guidelines [63]. 1 shows the categories and the status designation. For example, sequences with metadata indicating ICU admission or mechanical ventilation are categorized as ''Severe''. Some metadata entries are categorized as ''Unknown'' even if they are not explicitly entered as such, as they do not contain information about the patient's status, for example some appear to refer to the age of the patient or to the location where the sample was taken. Metadata entries of ''recovered'' were also placed in the unknown category, as there was no indication of the severity of prior illness. Notably, the ''Asymptomatic'' category is defined to also include paucisymptomatic cases which are not expressly defined as ''Mild''. As such, there will be some overlap between those two categories.
The categories are then assigned to ''Mild'', ''Severe'', and ''Unknown'' classes, according to the NIH categories where there is sufficient information. 1 shows the class assignments for each category. For example, it is not clear whether ''Alive'' indices alive, but in an ICU, or alive and with mild symptoms. Accordingly, the ''Alive'' category is assigned to ''Unknown''. Similarly, a ''Symptomatic'' or ''Alive'' patient may have severe symptoms or have been hospitalized; therefore, the ''Symptomatic'' category is thus assigned to ''Unknown''. The ''Released'' category indicates release from prior hospitalization, and, therefore, ''Released'' is classified the same as ''Hospitalized''. Cases in the ''Screening'' category are classified as ''Mild''. These are cases with metadata entries such as ''random screening'', ''community screening'', and ''airport screening;'' as such, they are assumed to be from asymptomatic, or, at minimum, ambulatory individuals who were not hospitalized for COVID-19 symptoms at the time of sequencing. Cases in the Unknown class are dropped from the analysis.
The models described in this study also utilize metadata fields for ''age'' and ''gender''. Notably, the ''gender'' field includes entries that suggest it is being used interchangeably for gender and sex. For the purpose of simplicity and to align with the GISAID field names, the term ''gender'' is used in this paper. With respect to the gender field, any entry that is cognizable as Male or Female (e.g., misspellings, foreign language words such as ''Homme'', which is French for ''man'', etc.) are classified accordingly. Any other entry is classified as ''Unknown'' and excluded from analysis. The ''age'' metadata entries are assigned to an integer age where possible. Where the ''age'' entry is provided as a range, e.g., '''21-30'', it is assigned to the mid-point, e.g., 25. Where the ''age'' field entry is ''unknown'' or a value that cannot be translated to an integer, the sample is excluded.

Machine learning
Five machine learning methods are used: (1) logistic regression with elastic net regularization [64] (referred to as ''elastic net'' or ''logistic regression'' in this paper), which has previously been utilized for genetic association studies [65,66]; (2) the random forests (RF) ensemble tree-based method [51], which is used to classify SARS-CoV-2 sequences to Pango lineages [3]; (3) eXtreme Gradient Boosting (XGBoost) [52], a decision tree-based ensemble learning method which has been used for SARS-CoV-2 nucleotide sequence classification [45] and which our group and others have previously used to classify protein sequences [67,68]; (4) LightGBM, which is a gradient-boosting method developed by Microsoft that grows trees leaf-wise, unlike XGBoost, which grows trees depth-wise, thus running much more efficiently while achieving comparable results [53,69]; and (5) GPBoost, which trains a mixed effects model including both features, implemented using decision trees (trained using LightGBM), and random effects at the group level [50].

Mixed effects models
Linear mixed effects models have been used for analyzing genetic studies where there are group-level random effects, such as in longitudinal studies and other sampling studies where there may be batch effects due to different laboratory methods being used for samples taken at different locations or times [70][71][72][73]. Eq. (1) shows the general matrix formulation for a mixed effects model.
( ) is the row-size evaluation of function F, and . and are fixed effects and random effects predictor variable matrices respectively, i.e., the rows of are predictor variables for observations (columns of ). In this study, the random effects vector is assumed to contain grouped (i.e. clustered) random effects. In this case, the columns of will be one-hot encoded (i.e., will be an incidence matrix with 1s and 0s) with the categorical variables that define the structure of groups.
Assuming that the fixed effects model is linear, then Eq. (1) may be written in terms of groups as shown in Eq. (2).
In Eq. (2), the linear model for ( ) is given as is the × model matrix for fixed effects for observations in the th group, where there are features, is a ×1 vector of model coefficients, and is the ×1 vector of errors for the th group.
is now a × model matrix of random effects for the th group, where is the × 1 vector of random effect coefficients for the th group.
In this paper, groups of random effects are identified by country metadata. The means of and , an unknown vector of random errors, are 0; accordingly, we take the mean of the model response in order to evaluate its predictions. We implement mixed effects machine learning with GPBoost, which has been made publicly available at https://github.com/fabsig/GPBoost. GPBoost is a highly efficient package for fitting mixed effects models to data, as it utilizes LightGBM tree-boosting to model fixed effects [50]. Further elaboration of the mathematical foundation of mixed effects models relevant to this paper can be found in [50]. GPBoost is thus able to handle the large feature set required to include the full spike protein sequence.

Feature representation
The input for machine learning are features vectors of integers for each sample, and training labels set at 0 (for Mild) and 1 (for Severe). Features are obtained as follows. After the alignment procedure described above, all of the resulting sequences have 1273 characters (amino acids, deletions, or masks). The sequences are tokenized, converting each character, including the deletion symbol ''-'', to a distinct nonzero integer. A position with padding mask ''*'' or ambiguous amino acids represented as X, B, J, or Z are considered to be missing data. Accordingly, they are a value of NAN for XGBoost, LightGBM, and GPBoost, which can then treat them as missing values; or, they are assigned a value of 0 for logistic regression and Random Forests, which cannot handle missing data. The age is represented as an integer, as describe above, and gender is treated as 0 and 1. In total, then, there are 1275 features: 1273 amino acid positions, age, and gender. As described in the paper, we also tested using the metadata for ''Country'' of origin of a case as a feature (increasing the number of features to 1276), or in the case of GPBoost, as a grouping of random effects in a mixed effects model. In that case, the ''Country'' was tokenized and represented as an integer using scikit-learn.

Model interpretation to obtain feature importance
The feature significance shown in the Results section for XGBoost, LightGBM, and GPBoost were obtained from SHAP (Shapley Additive eXplanations) values of terms for the test data set using the TreeExplainer method within the SHAP module (https://shap.readthedocs.io/) in Python 3.7 [54]. Among the principal reasons for selecting GPBoost to implement mixed effects machine learning was its compatibility with SHAP for interpretation [50,74]. Feature importance can also be derived for the aforementioned ensemble decision tree methods by computing, for example, the number of times a feature is used to split trees, or the gain in score towards the objective function obtained by splitting trees based on a feature [75,76]. However, we found no substantive differences between the features identified as significant using SHAP and those computed based on decision tree characteristics; moreover, SHAP not only estimates feature significance, but can also estimate whether a feature value tends to result in one classification or another.

Hyperparameter tuning and model implementation
For the results of this paper, training and testing data splits were determined by sample collection date as described in the Results section. Hyperparameter tuning was performed using a data set consisting of 60,196 samples collected between May 6, 2021 through November 2, 2021. Five-fold cross-validation was used to define training and testing splits, and the mean class prediction accuracy on the testing sets across three runs of the algorithm was computed for each hyperparameter combination. The hyperparameter combination with the highest accuracy was selected for the data presented in the Results. Other hyperparameter combinations were tested on that data and were not found to perform better than those that were used. The hyperparameters for the respective methods are as follows. Where not provided here, hyperparameters were set at their default values.

Resource availability 2.4.1. Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Dr. Bahrad A. Sokhansanj (bahrad@molhealtheng.com).

Data and code availability
• The datasets analyzed for this study were downloaded from GI-SAID EpiCoV database pursuant to the GISAID terms of use. They are availabile for download to users who register with GISAID at the website http://wwww.gisaid.org. The list of GISAID accession numbers used for this paper and data acknowledgments are available at https://epicov.org/epi3/epi_set/EPI_SET_20220606hk or https://doi.org/10.55876/gis8.220606hk. • The code used for pre-processing and analysis in this paper has been deposited to and made publicly available from the authors' GitHub repository, https://github.com/EESI/covid_severity. • Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Descriptive analysis of GISAID patient data
To develop an understanding of the GISAID patient data set (i.e. data with metadata subject to the exclusions described in the Methods section), we analyze the trends for severity for the metadata fields available in the GISAID data set: age, gender, sample collection date, and geographic origin of the case. To quantitatively measure severity trends, the classifications for Mild disease is assigned a severity level of 0, and Severe, a severity of 1. (The classifications are derived from patient status metadata based on Supplementary Tables S1 and S2 as described in the Methods.) Thereby, the mean of the severity values can be computed, which equates to the proportion of samples which are classified as Severe cases. Fig. 1A shows the mean severity for each age from 0 to 100, as well as the number of samples in the GISAID patient data set for each age. In general, the fraction of severe cases increases with age, which has been a consistent feature of the pandemic [77,78]. Trends are different at the extremes, very young and old ages. Notably, there are far fewer cases at these ages, thus small biases in sample collection can have significant effects. For example, very young patients may be likelier to be observed in a hospital setting than observed in a screening study. Male sex has also been identified as a potential risk factor for more severe outcomes, such as ICU admission and death [79,80]. GISAID provides sex information in a ''gender'' metadata field. As shown in Fig. 2A, the relationship between increased age and increased proportion of severe cases are consistent throughout the pandemic. Fig. 1B shows that, in addition to known risk factors, over time the mean case severity significantly decreases. The declining severity trend through 2020 in GISAID data is consistent with a period of improved COVID-19 therapeutics. For example, a Canadian study measured a decrease in case fatality rate (CFR) between the first and second waves prior to any vaccination, even when controlling for age and increased testing [81]. Later reduction is consistent with increased levels of COVID-19 vaccination reducing severe outcomes [82][83][84][85], as well as continued improvements in therapeutics such as monoclonal antibodies [86]. Notably, while the trend shows an overall decrease, which we had previously observed through October 2021, it is not monotonic, which an increase shown in late 2021 and early 2022. Moreover, the absolute level of severe cases suggests that while the trend of decreasing severity is consistent with the global trend of decreasing overall severity, the nature of reported cases also affects the trend. For example, in the initial first binned time periods, there  are over 70% severe cases. As illustrated in Fig. 3, the large majority of these are hospitalizations, although approximately 10% of samples are from dead patients according to the metadata entries. Studies of the initial pandemic waves in March-April 2020 did show that a CFR that approached or exceeded 10% [87][88][89]. Subsequent analysis estimated that the infection fatality rate (IFR) was likely much closer to 1%-2%, with the elevated CFR being due to underreporting of cases [88,89]. Among the passengers of the Diamond Princess cruise ship in February 2020, who were all tested, the IFR was 1% (in a population which skewed older). Accordingly, while the proportion of deaths in the GISAID patient data set did reflect community observations, at least in this timeframe, they were elevated due to so many cases being missed. However, the fraction of sequences submitted from dead patients continued to run at 4%-5% even through October 2021. We do observe that many cases have a metadata entry of ''alive'' or ''live'', as a contrast with ''dead''. But ''alive'' or ''live'' metadata cannot be identified as either Mild and Severe, and are thus excluded from the data set analyzed in this paper and shown in Fig. 1. Sequences from dead patients are thus overrepresented. Fig. 2B and C show that GISAID gender metadata similarly indicate elevated severe disease among male patients. Samples with Male gender metadata are classified as 49.8% Mild and 50.2% Severe; by comparison, samples with Female gender metadata are 55.1% Mild and 44.9% Severe. An increased proportion of severe disease for older and male patients is consistently observed in GISAID samples collected at different dates over time. Notably, the difference in severity between male and female patients (defined according to gender metadata) was much greater in samples collected up to mid-2021, and has decreased since then. It is unclear whether this reflects a broader trend or is an artifact of where GISAID samples are collected.
Moreover, the number of hospitalizations is much more elevated even than the inflated hospitalization rates observed during the period of significant underreporting in early 2020. As Fig. 3 shows, even by March 2021, over 50% of GISAID samples being collected were from individuals who were either hospitalized or released from hospital, per their patient status metadata. This makes intuitive sense, since sequence samples with clinical information may well come from clinical settings, particularly hospitals. As a result, even though there is a steady increase in cases classified as Mild (see Fig. 3), this is likely at least in part because of a change in settings where sequences with metadata are collected, with more of a mix of outpatient settings. We observed a pronounced spike in the proportion of sequences annotated as Asymptomatic or collected from population screening studies in April-May 2021, from around 3% to 12%, which is consistent with changes in sampling sources. Fig. 1C shows how the aforementioned distortions in the data can practically impact the development of genotype-severity models. There is a lower observed mean severity during the Delta wave as compared to the timeframe of Delta's emergence (when other lineages have a significant fraction of samples being collected) and before then, when Alpha was a plurality lineage. While this potentially could be due to continued vaccination resulting in less severe cases overall, and in turn fewer sequences from severe cases in the GISAID patient data set, Fig. 1C shows the trend of reduced severity reverses in the initial Omicron (BA.1) wave. However, vaccination and improved therapeutics, while certainly having resulted in a reduction in severe disease outcomes in general, do not explain the observed reduction of mean severity within the GISAID data set. We can show that as follows: Time-dependent changes in external conditions, such as increased vaccination rates, can be controlled for by looking only at the short timeframe where Alpha and Delta were collected in similar numbers, circa May-June 2021. As Fig. 1C shows, the reduction in severity over time at the Alpha-Delta transition point is abrupt. Significantly, during May-June 2021, Delta samples were 60.7% Mild (5199 total samples) and Alpha samples were 52.2% Mild (8658 total samples). As a result, a model based on the GISAID data set will show that Delta is milder than Alpha, contradicting the epidemiological and laboratory evidence discussed above [13][14][15][16]. It is also not the case that Delta samples during the May-June 2021 timeframe were collected in countries with higher vaccination rates than Alpha samples. The main source of samples during this timeframe for both lineages was France (44% and 42% of Delta and Alpha respectively), and the second largest sources was Mexico (19% and 11%). Notably, in samples from France, Delta was 92.6% Mild and Alpha was 62.6% Mild, while in samples from Mexico, Delta was 27.1% Mild and Alpha was 38.4% Mild. Therefore, the observed decrease in severity from Alpha to Delta, due to this apparent artifact found in data from France, will confound a genotype-patient status model. Mutations associated with Delta will appear to result in reduced severity, in contradiction to epidemiological and other evidence.
Previous efforts in modeling GISAID patient data have found that including the location metadata for the country of origin of the sequence results in a more accurately predictive model [43]. As an initial matter, country and sequence will likely have some correlation, since mutations, including both the lineage an sublineage level, cluster between countries [90,91]. However, there is likely some variance in patient outcomes between countries due to differences in the enforcement of non-pharmaceutical interventions (NPI) which may be more protective of vulnerable populations, differences in circulating virus and thus hospital burdens, and differences in hospital capacity and standards of care [92][93][94]. Fig. 1D shows, however, that inter-country variation is more complex and does not appear to be directly related to the  Fig. 1 are not due to changes in how metadata are described and characterized. aforementioned factors. Some of the data show consistent trends or levels across all time points. For example, essential all records submitted from Hong Kong, and the overwhelming majority of records from Brazil, are classified as Severe (hospitalizations or deaths). Cases from France, which as discussed above is the largest source of sequences, show a decline over time, although with a lot of fluctuation. Cases from neighboring Belgium, by contrast, have been consistently increasing in severity since Summer 2021. Samples from the United States have increased and decreased in severity with no discernible pattern. In sum, samples originating from different countries follow distinct patterns, which means that including country as a feature will likely improve classification. But the country feature will reflect sampling differences over time between countries-whether the sequences are being submitted mostly from hospital settings, or if that is the case at different points in time. While there is variance between samples from a country that is independent from that of other countries, it is at least in large part due to factors that are not relevant to disease outcome.
Accordingly, we hypothesize that sampling variations can be modeled as a random effect in a linear mixed effects model [72,73,95], where country of origin is a random effect group rather than a feature. To test this hypothesis, in the following section we compare mixed effect machine learning to other classification methods for predicting disease severity. The comparison is based on GISAID data from a timeframe when the overall decrease in severity will not confound a model. Otherwise, as discussed above, any model will inevitably make predictions that are not clinically relevant going forward. For example, mutations that emerged in Delta will be found as leading to less severe disease, and thus when they are found in Omicron sublineages, they will be predicted as reducing severity-due to the potential artifact in samples collected during the period when Delta and Alpha coincided discussed above. We therefore limit our analysis of modeling methods to samples collected beginning in July 2021, when the declining trend in cases has stabilized (see Fig. 1B and C). The result is a training data set made up of mostly Delta subvariants, along with a substantial number of Omicron sequences, and smaller numbers of other lineages such as the Gamma (P.1) variant of concern, which has shown some ability to evade neutralizing antibodies [96,97] and may result in increased disease severity [98]. Doing so helps avoid confounders in analyzing the impact of different mutations and combinations of mutations in Delta and Omicron subvariants, allowing for predictions about whether mutations observed in Delta will result in more severe disease if they occur in future Omicron variants.

Comparison of machine learning methods
To evaluate our hypothesis that a mixed effect modeling approach can be useful for GISAID patient data, where country should be treated as a random effect group, we evaluate the GPBoost mixed effect machine learning method [50] alongside two popular highly efficient ensemble decision tree methods which employ gradient-boosting, XG-Boost and LightGBM [53], random forests [51], a well-established ensemble decision tree method, and conventional logistic regression with elastic net regularization [65]. To compare with GPBoost, two feature sets of models are evaluated: (i) using sequence, age, and gender as features, and (ii) using country metadata as an additional feature.
As discussed above, the following analysis focuses on GISAID data starting from July 2021. Models are trained on the samples collected between July 17, 2021 through December 25, 2021 (68,815 in total).
Testing is performed on samples collected entirely after the training period, from December 26, 2021 through the latest-collected sequences from April 10, 2022 (42,420 samples). Although cross-validation is a typical way of evaluating classifiers to avoid overfitting [99,100], in this data set, as shown in Fig. 1 and discussed above, there will be potential clusters of confounding variables at different times. For example, a narrow range of sequence collection dates may correspond to a study of patients with common characteristics, e.g., patients who are all hospitalized, or mildly symptomatic patients from a screening study. Cross-validation will sample time points evenly, and, as such, a classifier may overfit to patterns within the confounders and then appear to perform better than it otherwise would be on a realistic prediction task. As an alternative, we evaluate the classifiers on temporally split data: i.e., we seek to predict disease severity for sequences from a model trained on samples collected earlier. Previous work has confirmed that temporally split test and training sets provide a more realistic evaluation of classification methods, in that methods perform less well when evaluated on a temporally split validation data set than they do using cross-validation [45]. Notably, there is some class imbalance, which is similar between the training and test data sets: 39.2% of the test samples were severe and 37.4% of training were Severe. Class or sample balancing did not substantially affect the results for the methods which allow it, i.e., not GPBoost (data not shown). Fig. 4 compares machine learning methods by showing aggregate and class-specific test data classification metrics for models trained using the different methods under evaluation. The aggregate metrics shown in Fig. 4 are the accuracy, which is measured as the number of correct class predictions divided by the number of total predictions, and the balanced (weighted average) F1-score, which reflects the sensitivity and specificity of the predictions, accounting for the aforementioned class imbalance. The balanced F1-score is the harmonic mean of precision (true positives divided by all positive predictions) and recall (true positive rate, i.e., sensitivity). Fig. 4 also shows the class-specific precision and recall, which is a useful measure as to whether methods might underperform on predicting a particular class, such as the minority (Severe) or majority (Mild) class.
Although the performance of the methods varies depending on metric, two trends are clear. First, the best-performing methods are consistently (1) the high-performance gradient boosting decision tree methods, XGBoost, LightGBM, and GPBoost, with class prediction accuracies above 75%. Second, the best-performing methods account for country-either as an independent feature, in the case of XGBoost and LightGBM, or as group level random effects for the mixed effects model trained by GPBoost. Notably, these three methods, unlike classical regression methods, can handle missing information for sequence positions, which are allowed to increase the number of data for training. Fig. 5 further shows the receiver operator characteristic (ROC) curves for the best-performing methods, and reports the area under the curve (AUC), which provides a metric for comparing model performance. Using the AUC metric, XGBoost with country as an independent feature has the highest AUC. It is important to keep in mind, however, that as shown in Fig. 1D, a model that includes country as a feature may be overfitting to consistent sample collection biases. Notably, GPBoost outperforms LightGBM and XGBoost when country is not an independent feature of the latter two models. GPBoost also has a higher AUC than LightGBM and only marginally lower than that of XGBoost.
To further compare the performance of the best-performing models, they can be tested on their ability to predict whether specific sequence mutations affect the relative risk of severe disease. This analysis focuses on two specific spike protein site mutations for which there is substantial evidence from both epidemiological and laboratory studies for increased disease severity: a leucine-to-arginine mutation at position 452 (L452R) and a proline to arginine mutation at position 681 (P681R). These mutations are characteristic of the Delta variant [101]. SARS-CoV-2 with P681R has been found to have higher spike protein The top row shows, at left, the accuracy of the Mild/Severe classification and, at right, the balanced F1-score, which is the harmonic mean of precision (true positives divided by all positive predictions) and recall (true positive rate, i.e., sensitivity). The middle row shows the precision for the Mild and Severe class predictions separately, and the bottom row shows the recall. Metrics are shown for models trained with country metadata used as a feature and without, as indicated in the labeled axes below, except for GPBoost, which takes into account the country metadata by using it as the groups of random effects. All models otherwise use age, gender, and each sequence position as a feature. Error bars show the standard deviations across three runs with different random number seeds, and in some cases are not visible. Statistics for GPBoost are computed based on the mean of the response. GPBoost and LightGBM/XGBoost including country as a feature consistently outperform other methods.
cleavage and viral fusogenicity in vitro, and result in higher pathogenicity in a Syrian hamster animal model [102]. Another study introducing P681R on an Omicron background showed an increase in fusogenicity and synctitia formation, which have been correlated to pathogenicity [103]. The L452R mutation has been found to also increase viral fusogenicity in vitro, and to result in increased infectivity in a mouse lung cell model [104]. Another in vitro study has also shown that L452R resulted in increased spike protein stability, viral fusogenicity and infectivity, and, in turn, increased viral replication [105]. And, the Delta variant, which is characterized by the P681R and L452R mutations, was to result in an increased risk of hospitalizations in epidemiological studies in Denmark [13], England [16], and Canada [14]. Accordingly, a model would be expected to show that a L452R or P681R mutation will result in greater severity.
As an additional validation study, therefore, machine learning models are evaluated on whether they are more likely to a predict a Severe  [62] for test samples and trained models as described for Fig. 4 for XGBoost and LightGBM (with and without country metadata) and GPBoost (using country as a random effects group). The data are shown for one run; run-to-run variation was found to be insignificant. GPBoost performs better than either LightGBM or XGBoost, unless country metadata are used for the latter methods. classification in the presence of L452R or P681R sequence changes. However, the methods being compared here are decision tree-based methods, which unlike classical logistic regression do not generate coefficients that can be used to analyze individual features. The impact of specific feature changes may be estimated instead. In particular, SHAP values can be utilized in conjunction to provide an estimate of the log-odds for a Severe case given a particular feature value [54]. SHAP values are typically generated for a subset of samples, as it is a computationally intensive process. Fig. 6 shows SHAP dependency plots for samples collected from March 8 through April 10, 2022 (5918 samples). The points in the plots represents the estimated SHAP value (log-odds for a Severe case) for each sample; the color indicates the age of the patient for the sample. This means that SHAP dependency plots show how a specific feature interacts with another feature: the age of the patient in Fig. 6. (As indicated in Fig. 1, age has a significant correlation with disease severity in GISAID patient data, as well as in real-world epidemiological studies.) The SHAP dependency plots represent the potential sequence features at spike protein positions 452 and 681: L (ancestral), leucine, M, methionine, and R, arginine for position 452, and P (ancestral), proline, H, histidine, R, and Y, tyrosine for position 681. (P681H is a common mutation founds in Omicron sequences [106]). The '*' character indicates that there was a missing amino acid at that position in the sampled sequence, likely due to sequencer error, which is treated as missing data by the respective methods. As Fig. 6 shows, GPBoost is the only method which shows an increased SHAP value, or estimated log-odds of a Severe outcome, for the L452R and P681R mutations.

Predicting the potential severity of emerging omicron variants
A key objective for training a sequence-phenotype model is to be able to predict how novel combinations of mutations -such as the reemergence of a mutation found in a separate lineage -could affect pathogenicity and clinical outcomes. Here, the potential utility of a spike protein sequence-clinical severity prediction model trained on GISAID data is demonstrated for Omicron lineages emerging as significant threats as of May 2022: BA.4 and BA.5, which had become the predominant variants in South Africa and found to be rapidly growing in Portugal [107], and BA.2.12.1, which had accounted for substantial case growth in the United States [108].
The predicted relative severity resulting from different spike sequences may be compared by looking at the relative raw (unrounded) prediction of the model. In the context of this paper, the probability on the logistic curve fit by the model that the binary classification will be 0 or 1. In practice, the class prediction is provided by rounding the model output to 0 (Mild) or 1 (Severe), i.e., to generate the classification metrics shown in Fig. 4. However, as explained above, GISAID data do not provide a realistic measurement of the actual observed probability of severe outcomes, as there are far more hospitalized and deceased patients than real-world hospitalization and CFR data indicate. The quantitative model predictions should be interpreted, therefore, in a relative manner. Accordingly, the raw model output can help in providing relative predictions, but should not be interpreted as an absolute probability of severe disease. In sum, predictions for the aforementioned emerging sublineages may be compared against the predictions for the original Omicron sublineages, BA.1 and BA.2. Fig. 7 shows the output of the trained GPBoost, LightGBM, and XGBoost models, where the latter two include country as a feature, as shown above in Fig. 6. The sequences used to generate the predictions in Fig. 7 are the most common of those variants found in the GISAID patient data set used in this paper (collected before April 15, 2022), with GISAID accession numbers as provided in the figure caption. An additional BA.2.12.1 sequence collected after the data set used in this paper was separately retrieved from GISAID (accession number EPI_ISL_12048110). As Fig. 7 illustrates, Country has a substantial impact on the predictions made using LightGBM and XGBoost. This sharply limits the utility of LightGBM and XGBoost models as predictive tools.
While it is possible to standardize the country and view the prediction relatively, as shown in 7, the relative difference between variants differs greatly between countries. For a simulated patient sample from the United States, the variants have nearly identical (and very high) predictions, while the predictions for simulated samples from France vary differently, with much more dynamic range. Samples from Mexico are in between. GPBoost models, by contrast, do not vary between countries. The mixed effects model trained by GPBoost does not account for country in grouping only random effects. By considering only the mean model response, random effects cancel each other out, and there is only one prediction for any country. Given that, as shown in Fig. 1D, the differences between countries are apparently unrelated to actual local conditions, such as access to treatment, a countryneutral prediction provides a more realistic, and likely more relevant, of the relative increase in severe disease risk associated with a new SARS-CoV-2 variant.
Accounting for the variation between countries, Fig. 7 generally shows that BA.2, BA.2.12.1, BA.4, and BA.5 all have higher predicted severity than BA.1. Notably, a study of infectivity in mouse and hamster models suggested that there is no difference in infectivity, replication, and pathogenicity between BA.1 and BA.2 virus [109]. Another study, however, found greater fusogenicity and replication in nasal epithelial cells studied in vitro, as well as more pathogenicity in a hamster model for BA.2 as compared to BA.1 [110]. Moreover, a recently published population study in England reports that individuals infected with BA.2 reported more symptoms than those with BA.1 [111]. Another study of patients in Italy also reported more symptomatic disease when infected with BA.2 rather than BA.1 [112].
The SHAP method used for feature analysis above can be used to examine in detail how specific features influence the prediction for emerging variants as well [54]. Fig. 8 shows exemplary SHAP   ' is a visualization which shows, based on SHAP values estimating the log-odds contribution of features to the model prediction, how much a specific feature tends to weigh the decision between binary classes. This plot is based on a simulated 30 year old male patient, and thus the Age feature tends to weigh the model towards a Mild prediction for this sample. Other features tend to weigh towards a more Severe prediction, such as mutations at sites characteristic of BA.2, including positions 371 and 408. visualization for the GPBoost prediction of the representative BA.2.12.1 sequences shown in Fig. 7 (GISAID accession EPI_ISL_12048110), simulated for a 30 year old male patient. The plot shows how key features tend to make a prediction of greater severity (indicated by an increasing value) or lower severity (decreasing value). In the case of this younger patient, for example, the Age feature tends to reduce the predicted severity. Notably, Fig. 8 suggests that three mutations characteristic of BA.2 influence an increase in predicted severity for BA.2.12.1: a deletion at positions 24 through 26, S371F (serine to phenylalanine), and R408S (arginine to serine) [113]. S371F is a mutation in the receptor binding domain (RBD) of the spike protein which has been shown to be evasive to antibodies [114,115]. While an antibody evasive mutant might not necessarily confer greater severity on an immunonaive patient, given the high rates of vaccination and/or prior infection now, a model based on contemporary GISAID data can be expected to shown greater severity for immune escape variants. While the impact is smaller than for those features shown in Fig. 8, analysis of SHAP values shows that another immune escape change found in BA2.12.1, E484 A, also tends to elevate the severity prediction [116]. Similarly, BA.4 and BA.5 have been found to be more immunoevasive in BA.1, which may also result in increased severe disease among populations with acquired immunity [117,118].

Discussion
Global genome repositories like GISAID have the potential to be an unparalleled resource for understanding and quantitatively modeling genotype-phenotype relationships. As the foremost repository for SARS-CoV-2 genome sequences, GISAID offers the largest possible potential data set with the greatest global reach. As a result, GISAID can solve one of the key challenges with biomedical modeling problems: small data set sizes which make them particularly vulnerable to overfitting, because it is often difficult and costly to obtain experimental data [119][120][121]. The best (and perhaps only real) solution to overfitting is to have more data to develop models. Conventional meta-analyses require searching for relevant studies and parsing through papers with often inconsistent formats and data reporting methods, and they are also limited to published or otherwise documented studies. However, because repositories are generally incorporating multiple studies collected from different sites and under different conditions, heterogeneity is still the key challenge [122]. As Fig. 1 and the accompanying text explain, heterogeneity is a critical problem with GISAID data. The challenges of GISAID source data heterogeneity are particularly exacerbated by the very limited metadata associated with patient samples, even for the small subset for which patient status metadata are available at all. Sequence repository data will be more useful as efforts continue to grow to collect and curate important information about the sample and establish minimum information standards [123]. The results in this paper demonstrate that, accounting for the aforementioned caveats, useful information can be obtained by analyzing the GISAID patient data set. There are three key problems with the data set that analytical and modeling methods need to address.
First, it is hard to robustly define mild and severe cases based on patient status metadata. As an initial matter, metadata entries are often inconsistent between different entries or noisy and hard to interpret reliable (see, e.g., Supplementary Table S1). This paper takes a hierarchical approach to defining mild and severe cases, based on established clinical definitions [63], as described in Supplementary  Table S2. However, because of confounding variables like vaccination, therapeutic availability, and prior infection, it has become difficult to estimate the ''intrinsic'' severity of variants [124]. A particular challenging issue concerns whether hospitalizations should be considered as mild or severe cases, especially given how prevalent they are in the GISAID data set (see, e.g., Fig. 3 and accompanying text). While this paper treats them as severe cases, that definition has become increasingly unreliable as the vaccination has become more prevalent. Studies from multiple sites suggests that as vaccination has increased, more hospitalization patients classified have only tested positive on admission but have mild or no symptoms [125][126][127][128]. Moreover, the kinds of sequence variation that lead to more severe clinical outcomes may change due to vaccination. As suggested in Fig. 8 and accompanying text, as the overwhelming majority of individuals in many regions have at least some immunity due to vaccination and/or prior infection, immune escape variants may result inn severe disease because they can evade immune responses that would otherwise rapidly clear the virus and prevent infection. However, such variants may not result in more pathogenicity in immunonaive hosts, and thus would not show more severe outcomes earlier in the pandemic.
Second, conditions have changed over time, as shown in the reduction of case severity over time shown in Fig. 1, which consists of reductions in the proportions of both hospitalizations and deaths (see Fig. 3). While trends of decreasing severity are consistent with improved patient outcomes due to vaccination and improved therapeutics [81][82][83][84][85][86], they may also reflect changes in sequence collection practices, such as obtaining more sequences and performing more studies based on screening the general public outside of hospital settings. These artifacts can have significant impacts on modeling studies. For example, the models derived in this study similarly indicate that the E1258D spike protein mutation has a significant impact on increasing severity. In addition to our group's previous work, another independent investigation of GISAID terminating in Fall 2021 showed E1258D as the strongest sequence feature in determining the severity prediction [44,47]. While E1258D was observed in one publication as an observed result of a missense mutation, that study did not show any effects for that mutation on increased pathogenicity [129]. In fact, E1258D is only found in 1898 of the over 160,000 samples analyzed in this paper. Of those samples, 1849, or 97.4% originated from Mexico, of which 1772 were hospitalized or released from hospital (i.e. considered severe in most studies), and 76 were deceased. Significantly, the metadata were all consistent, including at the level of capitalization, whereas metadata entries generally showed a high degree of heterogeneity. (Supplementary Table S1 shows all unique entries.) Therefore, it is highly likely that E1258D is either sequence artifact, particularly as it is in the cytoplasmic tail of the spike protein and the result of a missense mutation, and thus potentially an unreliable site for interpreting short-read next generation sequencing technologies [130,131]. In sum, caution must be employed in interpreting any features identified as important.
Third, the region from which sequences are collected can have a significant impact on data due to systematic bias. As the E1258D feature demonstrates, large-scale studies in particular regions may interpret sequence data in such a way that can identify a spurious variant if it is inconsistent with other studies. As Fig. 1D shows, even though it seems logical to ascribe regional differences in clinical outcomes to factors like vaccination, fluctuations at the country level are either virtually constant or otherwise have no consistent pattern. As such, country-level seem more reflective of how samples are collected and metadata are annotated within countries, which motivates the use of mixed effect models as has been previously used for genotype-phenotype modeling where sample batches affect the data [70][71][72][73]. The results in this paper demonstrate that a mixed effect machine learning approach in which countries are groups for random effects can be successful in developing a predictive model. The GPBoost method [50] proves to be fast, effective, and robust to missing data, which suggests that it should be more widely utilized in modeling genetic variation. Notably, as Fig. 4 illustrates, using country as a feature does result in much more accurate models. However, these models are overfitting to country-level trends, as evinced by the predictions graphed in Fig. 7, which show dramatic differences in predictions between different countries. As such, while previous studies of GISAID data have shown that including country metadata as a feature in models provides greater explanatory power [42,43,46], any resulting models are likely overfitting as they are here and will have difficulty being generalized to real-world predictions. Accordingly, while region-level features may appear to result in superior models, they risk creating artifacts. For example, as shown in this paper, including country as a feature results in predictions for the impact of L452R and P681R mutations at odds with epidemiological and in vitro evidence (see Fig. 6). The challenges of country-level variation are heightened by substantial regional imbalances in the GISAID patient data set. The entire GISAID database is fundamentally biased towards Europe, North America, and select countries in Asia and elsewhere, with over half the sample originating in either the United Kingdom or United States as of January 2022 [49]. Within the subset of data with patient status metadata, the biases are similarly idiosyncratic; for example, over 40% of the training and test samples shown in this paper were obtained from France.
In addition to the foregoing issues, the work in this paper has further limitations in scope. GISAID patient metadata omit information about comorbidities known to increase the risk of severe clinical outcomes and mortality, such as chronic disease and obesity [132,133]. Studies have shown that host (patient) genetics may also be significant determinants of infection outcomes. a [134][135][136] Indeed, a recent study showed many genetic correlates of severe COVID-19 that were also correlates for other chronic conditions associated with heightened severity, with a particular focus on immune-mediated conditions [137]. Epigenetic factors may also be significant [138], as well as the host transcriptome [139]. In addition, to make the work shown here more tractable, we focus on the spike protein. However, there is some evidence that a mutation in the nucleocapsid gene may account for some of Delta's increased severity [140]. Finally, the methods described herein rely on training or fitting to existing databases. Entirely novel mutations will not be accounted for and may result in unpredictable outcomes. However, it may be possible to train models on the predicted or in vitro studies of novel mutations that could emerge in the future, such as those identified in deep mutational scanning and other exploration of the mutational landscape [141][142][143][144]. In sum, while there are important caveats to utilizing the GISAID data set as a resource for modeling clinical outcomes based on viral genotypes, it provides the most diverse and largest data set possibility. Any other meta-analyses will inherently suffer from the same kinds of data heterogeneity, and will necessarily be more limited as there is data in GISAID beyond that contained in published reports. The relative success of a mixed effect modeling approach suggests that refining the modeling of group level random effects or otherwise incorporate hidden variables are necessary to account for structural issues in the data. Moreover, having established a proof of concept in this study using logistic regression and boosted decision trees, future work can explore the potential application of deep learning methods, which have proven to be highly useful to genetic sequence to function modeling in other contexts [49,[145][146][147].

Conclusion
Despite increasingly widespread vaccination and development of new antiviral therapies, COVID-19 continues to represent a significant threat to human health. The virus also continues to be highly unpredictable. Significant genetic variants of SARS-CoV-2 continue to proliferate, and the risk of severe disease in an emerging variant is a particular concern. A critical tool in staying ahead of the virus can be a predictive model for the risks of severe disease based on viral genotype. Potentially predictive genotype-disease severity models depend on a substantial amount of patient data, which exceeds the capability of conventional epidemiological studies and meta-analyses. Patient data within GISAID, the primary global SARS-CoV-2 sequence repository, therefore, represents a key resource for building predictive models. Unfortunately, GISAID patient metadata are limited, both in number and quality; for example, there is no data on comorbidities or vaccination status. Despite these caveats, it has been previously shown that GISAID patient metadata can be used to develop predictive models. However, until this paper, there has not been a rigorous analysis of potential confounders within the data which may prevent such models from being clinically useful.
As shown in this paper, there are temporal trends in sample collection biases which must be accounted for in model training. Moreover, there are significant differences in sample collection biases between countries. Models are more predictive if they take country-of-origin of sequences into account, but such models are likely overfitting to artifacts in how samples are collected in different countries. This study demonstrates that a superior approach to accounting for variation between the country-of-origin of viral sequence and patient data is to employ mixed effects modeling, where country is treated as a random effect group. Mixed effects modeling can be efficiently implemented for the large number of sequence features analyzed in this paper by using the recently developed GPBoost package, which uses gradient boosted decision trees for fixed effects with performance comparable to XGBoost and conventional LightGBM. This study also presents a novel way to validate genotype-disease severity models for COVID-19: interpreting models to determine whether they are able to show that they can predict the effect of known mutations which affect disease severity. This kind of validation further reinforces the potential superiority of mixed effects methods over conventional logistic regression and boosted decision tree methods. Finally, trained GPBoost genotype-severity models are shown to be able to predict severity of emerging SARS-CoV-2 Omicron variants. For example, the GPBoost model presented in this paper predicts that BA.2 and subsequent Omicron variants may pose a greater risk of severe disease than Omicron BA.1, in line with preliminary epidemiological evidence.

CRediT authorship contribution statement
Bahrad A. Sokhansanj: Conceptualization of this study, Data curation, Methodology, Data analysis, Software, Writing -original draft. Gail L. Rosen: Conceptualization of this study, Data analysis, Writing -review & editing.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments
We gratefully acknowledge the following Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based. A table of acknowledgments is located at https://epicov.org/epi3/epi_ set/EPI_SET_20220606hk (DOI link:https://doi.org/10.55876/gis8.220 606hk). GLR received U.S. National Science Foundation (NSF) grants #1919691, #1936791, and #2107108. The funders had no role in study design, deciding to publish, collecting or analyzing data, or preparing the manuscript. Work reported here was run on hardware supported by Drexel's University Research Computing Facility.

Appendix A. Supplementary data
Supplementary material related to this article can be found online at https://doi.org/10.1016/j.compbiomed.2022.105969.