Molecular Architecture of Early Dissemination and Massive Second Wave of the SARS-CoV-2 Virus in a Major Metropolitan Area

There is concern about second and subsequent waves of COVID-19 caused by the SARS-CoV-2 coronavirus occurring in communities globally that had an initial disease wave. Metropolitan Houston, TX, with a population of 7 million, is experiencing a massive second disease wave that began in late May 2020. To understand SARS-CoV-2 molecular population genomic architecture and evolution and the relationship between virus genotypes and patient features, we sequenced the genomes of 5,085 SARS-CoV-2 strains from these two waves. Our report provides the first molecular characterization of SARS-CoV-2 strains causing two distinct COVID-19 disease waves.

experiencing a massive second disease wave that began in late May 2020. To understand SARS-CoV-2 molecular population genomic architecture and evolution and the relationship between virus genotypes and patient features, we sequenced the genomes of 5,085 SARS-CoV-2 strains from these two waves. Our report provides the first molecular characterization of SARS-CoV-2 strains causing two distinct COVID-19 disease waves.
The Houston metropolitan area is the fourth largest and most ethnically diverse city in the United States, with a population of approximately 7 million (14). The 2,400-bed Houston Methodist Hospital health system has seven hospitals and serves a large, multiethnic, and socioeconomically diverse patient population throughout greater Houston (13,14). The first COVID-19 case in metropolitan Houston was reported on 5 March 2020, with community spread occurring 1 week later (15). Many of the first cases in our region were associated with national or international travel in areas known to have SARS-CoV-2 virus outbreaks (15). A central molecular diagnostic laboratory serving all Houston Methodist hospitals and our very early adoption of a molecular test for the SARS-CoV-2 virus permitted us to rapidly identify SARS-CoV-2-positive patients and interrogate genomic variation among strains causing early infections in the greater Houston area. Our analysis of SARS-CoV-2 genomes causing disease in Houston has continued unabated since early March and is ongoing. Genome sequencing and related efforts were expanded extensively in late May as we recognized that a prominent second wave was under way (Fig. 1).
Here, we report that SARS-CoV-2 was introduced to the Houston area many times, independently, from diverse geographic regions, with virus genotypes representing genetic clades causing disease in Europe, Asia, and South America and elsewhere in the United States. There was widespread community dissemination soon after COVID-19 cases were reported in Houston. Detection of strains with a Gly614 amino acid replacement in the spike protein, a polymorphism that has been linked to increased transmission and in vitro cell infectivity, increased significantly over time and caused virtually all COVID-19 cases in the massive second disease wave. Patients infected with strains with the Gly614 variant had significantly higher virus loads in the nasopharynx on initial diagnosis. Some naturally occurring single amino acid replacements in the receptor binding domain (RBD) of spike protein resulted in decreased reactivity with a neutralizing monoclonal antibody, consistent with the idea that some virus variants arise due to host immune pressure. the metropolitan area includes 9,444 square miles. The estimated population size of metropolitan Houston is 7 million (https://www.houston.org/houston-data).
Epidemic curve characteristics over two disease waves. The first confirmed case of COVID-19 in the Houston metropolitan region was reported on 5 March 2020 (15), and the first confirmed case diagnosed in Houston Methodist hospitals was reported on 6 March 2020. The epidemic curve indicated a first wave of COVID-19 cases that peaked around 11 to 15 April, followed by a decline in cases until 11 May. Soon thereafter, the slope of the case curve increased, with a very sharp uptick in confirmed cases beginning on 12 June (Fig. 1B). We consider 11 May to represent the transition between waves, as this date represents the inflection point of the curve of cumulative new cases and had the absolute lowest number of new cases in the mid-May time period. Thus, for the data presented here, wave 1 is defined as 5 March through 11 May 2020, and wave 2 is defined as 12 May through 7 July 2020. Epidemiologic trends within the Houston Methodist Hospital population were mirrored by data from Harris County and the greater metropolitan Houston region (Fig. 1A). Through 7 July, 25,366 COVID-19 cases were reported in Houston, 37,776 cases in Harris County, and 53,330 in metropolitan Houston, including 9,823 cases in Houston Methodist Hospital facilities (inpatients and outpatients) (https://www.tmc.edu/ coronavirus-updates/infection-rate-in-the-greater-houston-area/ and https://harriscounty .maps.arcgis.com/apps/opsdashboard/index.html#/c0de71f8ea484b85bb5e fcb7c07c6914).
During the first wave (early March through May 11), 11,476 COVID-19 cases were reported in Houston, including 1,729 cases in the Houston Methodist Hospital system. Early in the first wave (from March 5 through 30 March 2020), we tested 3,080 patient specimens. Of these, 406 (13.2%) samples were positive for SARS-CoV-2, representing 40% (358/898) of all confirmed cases in metropolitan Houston during that time period. As our laboratory was the first hospital-based facility to have the capacity for molecular testing for SARS-CoV-2 on site, our strain samples are likely representative of COVID-19 infections during the first wave.
For the entire study period (5 March through 7 July 2020), we tested 68,418 specimens from 55,800 patients. Of these, 9,121 patients (16.4%) had a positive test result, representing 17.1% (9,121/53,300) of all confirmed cases in metropolitan Houston. Thus, our strain samples are also representative of those responsible for COVID-19 infections in the massive second wave.
To test the hypothesis that, on average, the two waves affected different groups of patients, we analyzed individual patient characteristics (hospitalized and nonhospitalized) in each wave. Consistent with this hypothesis, we found significant differences in the COVID-19 patients in each wave (see Table S1 in the supplemental material). For example, patients in the second wave were significantly younger, had fewer comorbidities, were more likely to be Hispanic/Latino (by self-report), and lived in Zip codes with lower median incomes (Table S1). A detailed analysis of the characteristics of patients hospitalized in Houston Methodist Hospital facilities in the two waves has recently been published (16).
SARS-CoV-2 genome sequencing and phylogenetic analysis. To investigate the genomic architecture of the virus across the two waves, we sequenced the genomes of 5,085 SARS-CoV-2 strains dating to the earliest time of confirmed COVID-19 cases in Houston. Analysis of SARS-CoV-2 strains causing disease in the first wave (5 March through May 11) revealed the presence of many diverse virus genomes that, in the aggregate, represent the major clades identified globally to date (Fig. 1B). Clades G, GH, GR, and S were the four most abundantly represented phylogenetic groups (Fig. 1B). Strains with the Gly614 amino acid variant in spike protein represented 82% of the SARS-CoV-2 strains in wave 1 and 99.9% in wave 2 (P Ͻ 0.0001; Fisher's exact test) (Fig. 1B). This spike protein variant is characteristic of clades G, GH, and GR. Importantly, strains with the Gly614 variant represented only 71% of the specimens sequenced in March, the early part of wave 1 (Fig. 1B). We attribute the decrease in the number of strains with this variant observed in the first 2 weeks of March (Fig. 1B) to fluctuation caused by the relatively fewer COVID-19 cases occurring during that period.
Relating spatiotemporal genome analysis with virus genotypes over two disease waves. We examined the spatial and temporal mapping of genomic data to investigate community spread during wave 1 (Fig. 2). Rapid and widespread community dissemination occurred soon after the initial COVID-19 cases were reported in Houston. The heterogenous virus genotypes present very early in wave 1 indicate that multiple strains independently entered metropolitan Houston, rather than a single strain having been introduced and then spread. An important observation was that strains of most of the individual subclades were distributed over broad geographic areas (see Fig. S1 in the supplemental material). These findings are consistent with the known ability of SARS-CoV-2 to spread very rapidly from person to person.
Relationship between virus clades, clinical characteristics of infected patients, and additional metadata. It is possible that SARS-CoV-2 genome subtypes have different clinical characteristics, analogous to what is believed to have occurred with Ebola virus (17)(18)(19) and is known to occur for other pathogenic microbes (20). As an initial examination of this issue in SARS-CoV-2, we tested the hypothesis that patients with disease severe enough to warrant hospitalization were infected with a nonrandom subset of virus genotypes. We also examined the association between virus clades and disease severity based on overall mortality, highest level of required care (intensive care unit [ICU], intermediate care unit [IMU], inpatient or outpatient), need for mechanical ventilation, and length of stay. There was no simple relationship between virus clades and disease severity using these four indicators. Similarly, there was no simple relationship between virus clades and other metadata, such as sex, age, or ethnicity (Fig. S2).
Machine learning analysis. Machine learning models can be used to identify complex relationships not revealed by statistical analyses. We built machine learning models to test the hypothesis that virus genome sequence can predict patient outcomes, including mortality, length of stay, level of care, ICU admission, supplemental oxygen use, and mechanical ventilation. Models designed to predict outcomes based on virus genome sequence alone resulted in low F1 scores of less than 50% (0.41 to 0.49), and regression models showed similarly low R 2 values (Ϫ0.01 to (Ϫ0.20) ( Table S2). F1 scores near 50% are indicative of classifiers that are performing similarly to random chance. The use of patient metadata alone to predict patient outcome improved the model's F1 scores by 5% to 10% (0.51 to 0.56) overall. The inclusion of patient metadata with virus genome sequence data improved most predictions of outcomes, compared to genome sequence alone, to 50% to 55% F1 overall (0.42 to 0.55) in the models (Table S2). The findings are indicative of two possibilities that are not mutually exclusive. First, patient metadata, such as age and sex, may provide more signal for the model to use and thus result in better accuracies. Second, the model's use of single nucleotide polymorphisms (SNPs) may have resulted in overfitting. Most importantly, no SNP predicted a significant difference in outcome. A table of classifier accuracy scores and performance information is provided in Table S2.
Patient outcome and metadata correlations. Overall, very few metadata categories correlated with patient outcomes (Table S3). Mortality was independently correlated with increasing age, with a Pearson correlation coefficient (PCC) equal to 0.27. This means that 27% of the variation in mortality can be predicted from patient age. Length of stay correlated independently with increasing age (PCC ϭ 0.20). All other patient metadata correlations to outcomes had PCC values of less than 0.20 (Table S3).
We further analyzed outcomes correlated to isolates from wave 1 and 2 and to the presence of the Gly614 variant in spike protein. Presence in wave 1 was independently correlated with mechanical ventilation days, overall length of stay, and ICU length of stay, with PCC values equal to 0.20, 0.18, and 0.14, respectively. Importantly, the presence of the Gly614 variant did not correlate with patient outcomes (Table S3).
Analysis of the nsp12 polymerase gene. The SARS-CoV-2 genome encodes an RNA-dependent RNA polymerase (RdRp; also referred to as Nsp12) used in virus replication (21)(22)(23)(24). Each of two amino acid substitutions (Phe479Leu and Val556Leu) in RdRp confers significant resistance in vitro to remdesivir, an adenosine analog (25). Remdesivir is inserted into RNA chains by RdRp during replication, resulting in premature termination of RNA synthesis and inhibition of virus replication. This compound has shown prophylactic and therapeutic benefits against Middle East respiratory syndrome coronavirus (MERS-CoV) and SARS-CoV-2 experimental infection in rhesus macaques (26,27). Recent reports indicate that remdesivir has a therapeutic benefit in some patients with hospitalized COVID-19 (28)(29)(30)(31)(32), leading to its current widespread use in patients worldwide. Thus, it may be important to understand variation in RdRp in large collections of strain samples.
To acquire data about allelic variation in the nsp12 gene, we analyzed our 5,085 virus genomes. The analysis identified 265 SNPs, including 140 nonsynonymous (amino acid-altering) SNPs, resulting in amino acid replacements throughout the protein (Table 1) ( Fig. 3 and 4; see also Fig. S3 and S4). The most common amino acid change was Pro322Leu, identified in 4,893 (96%) of the 5,085 patient isolates. This amino acid replacement is common in genomes from clades G, GH, and GR, which are distinguished from other SARS-CoV-2 clades by the presence of the Gly614 amino acid change in the spike protein. Most of the other amino acid changes in RdRp were present in relatively small numbers of strains, and some have been identified in other isolates in a publicly available database (33). Five prominent exceptions included the following amino acid replacements: Ala15Val in 138 strains, Met462Ile in 59 strains, Met600Ile in 75 strains, Thr907Ile in 45 strains, and Pro917Ser in 80 strains. All 75 Met600Ile strains were phylogenetically closely related members of clade G and also had the Pro322Leu amino acid replacement characteristic of this clade (Fig. S3). These data indicate that the Met600Ile change is likely the evolved state, derived from a precursor strain with the Pro322Leu replacement. Similarly, we investigated phylogenetic relationships among strains with the other four amino acid changes noted above. In all cases, the vast majority of strains with each amino acid replacement were found among individual subclades of strains (Fig. S3).  Importantly, none of the observed amino acid polymorphisms in RdRp were located precisely at two sites known to cause in vitro resistance to remdesivir (25). Most of the amino acid changes were found to be located distantly from the RNA binding and catalytic sites ( Fig. S4; see also Table 1). However, replacements at six amino acid residues (Ala442Val, Ala448Val, Ala553Pro/Val, Gly682Arg, Ser758Pro, and Cys812Phe) may potentially interfere with either remdesivir binding or RNA synthesis. Four (Ala442Val, Ala448Val, Ala553Pro/Val, and Gly682Arg) of the six substitution sites are located immediately above the nucleotide-binding site, which is comprised of Lys544, Arg552, and Arg554 residues as shown by structural studies (Fig. 4). The positions of these four variant amino acid sites are comparable to that of Val556 (Fig. 4), and a Val556Leu mutation in SARS-CoV was identified to confer resistance to remdesivir in vitro (25). The other two substitutions (Ser758Pro and Cys812Phe) are inferred to be located either at, or in the immediate proximity of, the catalytic active site, which is comprised of three contiguous residues (Ser758, Asp759, and Asp760). A proline substitution that we identified at Ser758 (Ser758Pro) is likely to negatively impact RNA synthesis. Although Cys812 is not directly involved in the catalysis of RNA synthesis, it is only 3.5 Å away from Asp760. The introduction of the bulkier phenylalanine substitution at Cys812 (Cys812Phe) may impair RNA synthesis. Consequently, these two substitutions are expected to detrimentally affect virus replication or fitness.
Analysis of the gene encoding the spike protein. The densely glycosylated spike protein of SARS-CoV-2 and of its close coronavirus relatives binds directly to host cell angiotensin-converting enzyme 2 (ACE2) receptors to enter host cells (34)(35)(36). Thus, the spike protein is a major translational research target, including intensive vaccine and therapeutic antibody research . Analysis of the gene encoding the spike protein identified 470 SNPs, including 285 that produce amino acid changes ( Table 2; see also Fig. 5). Forty-nine of these replacements (V11A, T51A, W64C, I119T, E156Q, S205A,  D228G, L229W, P230T, N234D, I235T, T274A, A288V, E324Q, E324V, S325P, S349F,  S371P, S373P, T385I, A419V, C480F, Y495S, L517F, K528R, Q628E, T632I, S708P, T719I,  P728L, S746P, E748K, G757V, V772A, K814R, D843N, S884A, M902I, I909V, E918Q, S982L,  M1029I, Q1142K   We mapped the location of amino acid replacements onto a model of the full-length spike protein (34,64) and observed that the substitutions are found in each subunit and domain of the spike (Fig. 6). However, the distribution of amino acid changes is not uniform throughout the protein regions. For example, compared to some other regions of the spike protein, the RBD has relatively few amino acid changes, and the frequency of strains with these substitutions is low, each occurring in fewer than 10 isolates. This finding is consistent with the functional constraints on RBD to mediate interaction with ACE2. In contrast, the periphery of the S1 subunit amino-terminal domain (NTD) contains a dense cluster of substituted residues, with some single amino acid replacements found in 10 to 20 isolates (Table 2; see also Fig. 5 and 6). Clustering of amino acid changes in a distinct region of the spike protein may be a signal of positive selection. Inasmuch as infected patients make antibodies against the NTD, we favor the idea that host immune selection is among the forces contributing to some of the amino acid variation in this region. One NTD substitution, H49Y, was found in 142 isolates. This position is not well exposed on the surface of the NTD and likely does not represent a result of immune pressure. The same is true for another highly represented substitution, F1052L. This substitution was observed in 167 isolates, and F1052 is buried within the core of the S2 subunit. The substitution observed most frequently in the spike protein in our sample is D614G, a change observed in 4,895 of the isolates. As noted above, strains with the Gly614 variant significantly increased in frequency in wave 2 compared to wave 1.

Characterization of recombinant proteins with single amino acid replacements in the receptor binding domain region of spike protein.
The RBD of spike protein binds the ACE2 surface receptor and is also targeted by neutralizing antibodies (35, 36, 40, 42-45, 47-61, 70). Thus, single amino acid replacements in this domain may have functional consequences that enhance virus fitness. To begin to test this idea, we expressed spike variants with the Asp614Gly replacement and 13 clinical RBD variants  Table S4A and B). All RBD variants were cloned into an engineered spike protein construct that stabilizes the perfusion state and increases overall expression yield (spike-6P, here referred to as spike) (63). The C␣ atom of residues found to be substituted in one or more virus isolates identified in this study is shown as a sphere on the ribbon representation. Residues found to be substituted in 1 to 9 isolates are colored tan, those substituted in 10 to 99 isolates yellow, those substituted in 100 to 999 isolates red (H49Y and F1052L), and those substituted in Ͼ1,000 isolates purple (D614G). The surface of the amino-terminal domain (NTD) that is distal to the trimeric axis has a high density of substituted residues. RBD, receptor binding domain. Two Waves of COVID-19 Disease in Houston, Texas ® We first assessed the biophysical properties of spike-Asp614Gly, an amino acid polymorphism that is common globally and that was present at significantly increased levels in our wave 2 strain isolates. Pseudotyped viruses expressing spike-Gly614 have higher infectivity for host cells in vitro than spike-Asp614 (65,66,68,71,72). The higher infectivity of spike-Gly614 is correlated with increased stability and incorporation of the spike protein into the pseudovirion (72). We observed a higher expression level (Fig. 8A and B) and increased thermostability for the spike protein construct containing this variant ( Fig. 8C and D). The size exclusion chromatography (SEC) elution profile of spike-Asp614 was indistinguishable from that of spike-Gly614, consistent with a trimeric conformation (Fig. 8A). These results are broadly consistent with higher-resolution structural analyses of both spike variants.
Next, we purified and biophysically characterized 13 RBD mutants that each contain Gly614 and one additional single amino acid replacement that we identified by genome sequencing of our clinical samples (Table S4C). All variants eluted as trimers, indicating that the global structure remained intact ( Fig. 8; see also Fig. S6). However, several variants had reduced expression levels and virtually all had decreased thermostability relative to the variant that had only a single D614G amino acid replacement (Fig. 8D). The A419V and A522V mutations were especially deleterious, reducing yield and precluding further downstream analysis (Fig. 8B). We next assayed the affinity of the 11 highest-expressing spike variants for ACE2 receptor and the neutralizing monoclonal antibody CR3022 via enzyme-linked immunosorbent assays (ELISAs) (Fig. 8E to G; see also Table S4C). Most variants retained high affinity for the ACE2 surface receptor. However, importantly, three RBD variants (F338L, S373P, and R408T) had substantially reduced affinity for CR3022, a monoclonal antibody that disrupts the spike protein homotrimerization interface (62,73). Notably, the S373P mutation is one amino acid away from the epitope recognized by CR3022 (62). These results are consistent with the interpretation that some RBD mutants arising in COVID-19 patients may have an increased ability to escape humoral immune pressure but otherwise retain strong ACE2 binding affinity.

DISCUSSION
In this work, we analyzed the molecular population genomics, sociodemographic, and medical features of two waves of COVID-19 disease occurring in metropolitan Houston, TX, between early March and early July 2020. We also studied the biophysical and immunologic properties of some naturally occurring single amino acid changes in the spike protein RBD identified by sequencing the 5,085 genomes. We discovered that the first COVID-19 wave was caused by a heterogenous array of virus genotypes assigned to several different clades. The majority of cases in the first wave were related to strains that caused widespread disease in European and Asian countries, as well as other localities. We conclude that the SARS-CoV-2 virus was introduced into Houston many times independently, likely by individuals who had traveled to or from different parts of the world, including other communities in the United States. In support of this conclusion, the first cases in metropolitan Houston were associated with a travel history to a region with a known high incidence of COVID-19 (15). The data are consistent with the fact that Houston is a large international city characterized by a multiethnic population and is a prominent transport hub with direct flights to major cities globally.
The second wave of COVID-19 cases is also characterized by SARS-CoV-2 strains with diverse genotypes. Virtually all cases in the second and ongoing disease wave had been caused by strains with the Gly614 variant of spike protein (Fig. 1B). Our data unambiguously demonstrate that strains with the Gly614 variant increased significantly in frequency in wave 2 relative to wave 1 in the Houston metropolitan region. This shift occurred very rapidly, in a matter of just a few months. Amino acid residue Asp614 is located in subdomain 2 (SD-2) of the spike protein and forms a hydrogen bond and electrostatic interaction with two residues in the S2 subunit of a neighboring protomer. Replacement of aspartate with glycine would eliminate both interactions, thereby substantively weakening the contact between the S1 and S2 subunits. We previously Long et al. ® speculated (74) that this weakening produces a more highly fusogenic spike protein, as S1 must first dissociate from S2 before S2 can refold and mediate fusion of virus and cell membranes. Stated another way, virus strains with the Gly614 variant may be better able to enter host cells, potentially resulting in enhanced spread. Consistent with this idea, Korber et al. (65) showed that the Gly614 variant grows to a higher titer as pseudotyped virions. On initial diagnosis, infected individuals had lower real-time PCR (RT-PCR) cycle threshold values, suggesting higher upper respiratory tract viral loads. Our data (Fig. 7) are fully consistent with the finding, previously reported by Zhang et al. (72), that pseudovirus with the 614Gly variant infected ACE2 receptor-expressing cells more efficiently than the 614Asp variant. Similar results have been described by Hu et al. (66) and Lorenzo-Redondo et al. (67). Plante et al. (75) recently studied isogenic mutant SARS-CoV-2 strains with either the 614Asp or 614Gly variant and found that the 614Gly variant virus showed significantly increased replication in human lung epithelial cells in vitro and increased infectious titers in nasal and tracheal washes obtained from experimentally infected hamsters. These results are consistent with the idea that the 614Gly variant bestows increased virus fitness in the upper respiratory tract (75).
Additional work is needed to investigate the potential biomedical relevance and public health importance of the Asp614Gly polymorphism, including but not limited to virus dissemination, overall fitness, impact on clinical course and virulence, and development of vaccines and therapeutics. Although it is possible that stochastic processes alone may account for the rapid increase in COVID-19 disease frequency caused by viruses containing the Gly614 variant, we do not favor that interpretation, in part because of the cumulative weight of the epidemiologic, human RT-PCR diagnostics data, in vitro experimental findings, and animal infection studies using isogenic mutant virus strains (65-69, 72, 75). In addition, if stochastic processes are solely responsible, we believe it is difficult to explain essentially simultaneous increases in frequency of the Gly614 variant in genetically diverse viruses in three distinct clades (G, GH, and GR) in a geographically large metropolitan area with 7 million ethnically diverse people. Regardless, more research on this important topic is warranted.
The diversity present in our 1,026 virus genomes from the first disease wave contrasts somewhat with data reported by Gonzalez-Reiche et al., who studied 84 SARS-CoV-2 isolates causing disease in patients in the New York City region (11). Those investigators concluded that the vast majority of disease was caused by progeny of strains imported from Europe. Similarly, Bedford et al. (10) reported that much of the COVID-19 disease in the Seattle, WA, area was caused by strains that are progeny of a virus strain recently introduced from China. Some aspects of our findings are similar to those reported recently by Lemieux et al. on the basis of analysis of strains causing disease in the Boston area (76). Our findings, like theirs, highlight the importance of multiple importation events of genetically diverse strains in the epidemiology of COVID-19 disease in this pandemic. Similarly, Icelandic and Brazilian investigators documented that SARS-CoV-2 was imported by individuals traveling to or from many European and other countries (77,78).
The virus genome diversity and large sample size in our study permitted us to test the hypothesis that distinct virus clades were nonrandomly associated with hospitalized COVID-19 patients or disease severity. We did not find evidence to support this hypothesis, but our continuing study of COVID-19 cases accruing in the second wave will further improve statistical stratification.
We used machine learning classifiers to identify if any SNPs contribute to increased infection severity or otherwise affect virus-host outcomes. The models could not be trained to accurately predict these outcomes from the available virus genome sequence data. This may have been due to sample size or class imbalance. However, we do not favor this interpretation. Rather, we think that the inability to identify particular virus SNPs predictive of disease severity or infection outcome likely reflects the substantial heterogeneity in underlying medical conditions and treatment regimens of the COVID-19 patients studied here. An alternative but not mutually exclusive hypothesis is that patient genotypes play an important role in determining virus-human interac-tions and in the resulting pathology. Although some evidence has been presented in support of this idea (79,80), available data suggest that in the aggregate, host genetics does not play an overwhelming role in determining outcome in the great majority of adult patients, once virus infection is established.
Remdesivir is a nucleoside analog reported to have activity against MERS-CoV, a coronavirus related to SARS-CoV-2. Recently, several studies have reported that remdesivir shows promise in treating COVID-19 patients (28)(29)(30)(31)(32), leading the FDA to issue an emergency use authorization. Because in vitro resistance of SARS-CoV to remdesivir has been reported to be caused by either of two amino acid replacements in RdRp (Phe479Leu or Val556Leu), we interrogated our data for polymorphisms in the nsp12 gene. Although we identified 140 different inferred amino acid replacements in RdRp in the 5,085 genomes analyzed, none of these were located precisely at the two positions associated with in vitro resistance to remdesivir. Inasmuch as remdesivir is now being deployed widely to treat COVID-19 patients in Houston and elsewhere, our findings suggest that the majority of SARS-CoV-2 strains currently circulating in our region should be susceptible to this drug.
The amino acid replacements Ala442Val, Ala448Val, Ala553Pro/Val, and Gly682Arg that we identified occur at sites that, intriguingly, are located directly above the nucleotide substrate entry channel and nucleotide binding residues Lys544, Arg552, and Arg554 (21,22) (Fig. 4). One possibility is that substitution of the smaller alanine or glycine residues with the bulkier side chains of Val/Pro/Arg may impose structural constraints for the modified nucleotide analog to bind and may thereby disfavor remdesivir binding. This, in turn, may lead to reduced incorporation of remdesivir into the nascent RNA, increased fidelity of RNA synthesis, and, ultimately, drug resistance. A similar mechanism has been proposed for a Val556Leu change (22).
We also identified one strain with a Lys477Asn replacement in RdRp. This substitution is located close to a Phe479Leu replacement reported to have produced partial resistance to remdesivir in vitro in SARS-CoV patients from 2004, although the amino acid positions are numbered differently in SARS-CoV and SARS-CoV-2. Structural studies have suggested that this amino acid is surface exposed and is distant from known key functional elements. Our observed Lys477Asn change is also located in a conserved motif described as a finger domain of RdRp ( Fig. 3 and 4). One speculative possibility is that Lys477 is involved in binding an as-yet-unidentified cofactor such as Nsp7 or Nsp8, an interaction that could modify nucleotide binding and/or fidelity at a distance. These data warrant additional study in larger patient cohorts, especially in individuals treated with remdesivir.
Analysis of the gene encoding the spike protein identified 285 polymorphic amino acid sites relative to the reference genome, including 49 inferred amino acid replacements not present in available databases as of 19 August 2020. Importantly, 30 amino acid sites in the spike protein had two or three distinct replacements relative to the reference strain. The occurrence of multiple variants at the same amino acid site is one characteristic that may suggest functional consequences. These data, coupled with structural information available for spike protein, raise the possibility that some of the amino acid variants have functional consequences, including, for example, altered serologic reactivity as shown here. These data permit generation of many biomedically relevant hypotheses now under study.
A recent study reported that RBD amino acid changes could be selected in vitro using a pseudovirus neutralization assay and sera obtained from convalescent plasma or monoclonal antibodies (81). The amino acid sites included positions V445 and E484 in the RBD. Note that variants G446V and E484Q were present in our patient samples. However, these mutations retain high affinity to CR3022 (Fig. 8F and G). The highresolution structure of the RBD/CR3022 complex shows that CR3022 makes contacts to residues 369 to 386, 380 to 392, and 427 to 430 of RBD (73). Although there is no overlap of CR3022 and ACE2 receptor epitopes, CR3022 is able to neutralize the virus through an allosteric effect. We found that the Ser373Pro change, which is located within the CR3022 epitope, resulted in reduced affinity to CR3022 (Fig. 8F and G). The F338L and R408T mutations, although not found directly within the interacting epitope, also display reduced binding to CR3022. Other investigators (81) using in vitro antibody selection identified a change at amino acid site S151 in the N-terminal domain, and we found mutations S151N and S151I in our patient samples. We also note that two variant amino acids (Gly446Val and Phe456Leu) that we identified were located in a linear epitope found to be critical for a neutralizing monoclonal antibody described recently by Li et al. (82).
In the aggregate, these findings suggest that mutations emerging within the spike protein at positions within and proximal to known neutralization epitopes may result in escape from antibodies and other therapeutics currently under development. Importantly, our study did not reveal that these mutant strains had disproportionately increased in number over time. The findings may also bear on the occurrence of multiple amino acid substitutions at the same amino acid site that we identified in this study, commonly a signal of selection. In the aggregate, the data support a multifaceted approach to serological monitoring and biologics development, including the use of monoclonal antibody cocktails (45,46,83).
Concluding statement. Our work represents analysis of the largest sample to date of SARS-CoV-2 genome sequences from patients in one metropolitan region in the United States. The investigation was facilitated by the fact that we had rapidly assessed a SARS-CoV-2 molecular diagnostic test in January 2020, more than a month before the first COVID-19 patient was diagnosed in Houston. In addition, our large health care system has seven hospitals and many facilities (e.g., outpatient care centers, emergency departments) located in geographically diverse areas of the city. We also provide reference laboratory services for other health care entities in the Houston area. Together, our facilities serve patients of diverse ethnicities and socioeconomic statuses. Thus, the data presented here likely reflect a broad overview of virus diversity causing COVID-19 infections throughout metropolitan Houston. We previously exploited these features to study influenza virus and Klebsiella pneumoniae dissemination in metropolitan Houston (84,85). We acknowledge that not every "twig" of the SARS-CoV-2 evolutionary tree in Houston is represented in these data. The samples studied are not comprehensive with respect to the entire metropolitan region. For example, it is possible that our strain samples are not fully representative of individuals who are indigent, homeless, or of very low socioeconomic status. In addition, although the strain sample size was relatively large compared to other studies, the samples represented only about 10% of all COVID-19 cases in metropolitan Houston documented in the study period. In addition, some patient samples contained relatively small amounts of virus nucleic acid and did not yield adequate sequence data for high-quality genome analysis. Thus, our data likely underestimate the extent of genome diversity present among SARS-CoV-2 strains causing COVID-19 and will not identify all amino acid replacements in the virus in this geographic region. It will be important to sequence and analyze the genomes of additional SARS-CoV-2 strains causing COVID-19 cases in the ongoing second massive disease wave in metropolitan Houston, and such studies are under way. Data of this type will be especially important to have if a third wave and subsequent waves were to occur in metropolitan Houston, as it could provide insight into molecular and epidemiologic events contributing to them.
The genomes reported here are an important data resource that will underpin our ongoing study of SARS-CoV-2 molecular evolution and dissemination and medical features of COVID-19 in Houston. As of 19 August 2020, there were 135,866 reported cases of COVID-19 in metropolitan Houston, and the number of cases is increasing daily. Although the full array of factors contributing to the massive second wave in Houston is not known, it is possible that the potential for increased transmissibility of SARS-CoV-2 with the Gly614 amino acid replacement may have played a role, as well as changes in behavior associated with the Memorial Day and July 4th holidays and relaxation of some of the social constraints imposed during the first wave. The availability of extensive virus genome data dating from the earliest reported cases of COVID-19 in metropolitan Houston, coupled with the database we have now constructed, may provide critical insights into the origin of the new infection spikes and waves that are occurring as public health constraints are further relaxed, schools and colleges reopen, holidays occur, commercial air travel increases, and individuals change their behavior because of COVID-19 "fatigue." The genome data will also be useful in assessing ongoing molecular evolution in spike and other proteins as baseline herd immunity is generated, either by natural exposure to SARS-CoV-2 or by vaccination. The signal of potential selection contributing to some spike protein diversity and identification of naturally occurring mutant RBD variants with altered serologic recognition warrant close attention and expanded study.

MATERIALS AND METHODS
Patient specimens. All specimens were obtained from individuals who were registered patients at Houston Methodist hospitals, associated facilities (e.g., urgent care centers), or institutions in the greater Houston metropolitan region that use our laboratory services. Virtually all individuals met the criteria specified by the Centers for Disease Control and Prevention to be classified as a person under investigation.
SARS-CoV-2 molecular diagnostic testing. Specimens obtained from symptomatic patients with a high degree of suspicion for COVID-19 disease were tested in the Molecular Diagnostics Laboratory at Houston Methodist Hospital using an assay granted Emergency Use Authorization (EUA) from the FDA (https://www.fda.gov/medical-devices/emergency-situations-medical-devices/faqs-diagnostic-testing-sa rs-cov-2#offeringtests). Multiple testing platforms were used, including an assay that follows the protocol published by the WHO (https://www.who.int/docs/default-source/coronaviruse/protocol-v2-1.pdf) using an EZ1 virus extraction kit and an EZ1 Advanced XL instrument or a QIASymphony DSP virus kit and a QIASymphony instrument for nucleic acid extraction and an ABI 7500 Fast Dx instrument with 7500 SDS software for reverse transcription RT-PCR, the COVID-19 test using BioFire Film Array 2.0 instruments, the Xpert Xpress SARS-CoV-2 test using Cepheid GeneXpert Infinity or Cepheid GeneXpert Xpress IV instruments, the SARS-CoV-2 assay using a Hologic Panther instrument, and the Aptima SARS-CoV-2 assay using a Hologic Panther Fusion system. All assays were performed according to the manufacturer's instructions. Testing was performed on material obtained from nasopharyngeal or oropharyngeal swabs immersed in universal transport media (UTM), bronchoalveolar lavage fluid, or sputum treated with dithiothreitol (DTT). To standardize specimen collection, an instructional video was created for Houston Methodist Hospital health care workers (https://vimeo.com/396996468/2228335d56).
Epidemiologic curve. The number of confirmed COVID-19-positive cases was obtained from USA-Facts.org (https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/) for Austin, Brazoria, Chambers, Fort Bend, Galveston, Harris, Liberty, Montgomery, and Waller counties. COVID-19positive cases for Houston Methodist Hospital patients were obtained from our Laboratory Information System and plotted using the documented collection time.
SARS-CoV-2 genome sequence analysis. Consensus virus genome sequences from the Houston area isolates were generated using the ARTIC nCoV-2019 bioinformatics pipeline. Publicly available genomes and metadata were acquired through GISAID on 19 August 2020. GISAID sequences containing greater than 1% N characters and Houston sequences with greater than 5% N characters were removed from consideration. Identical GISAID sequences originating from the same geographic location with the same collection date were also removed from consideration to reduce redundancy. Nucleotide sequence alignments for the combined Houston and GISAID strains were generated using MAFFT version 7.130b with default parameters (86). Sequences were manually curated in JalView (87) to trim the ends and to remove sequences containing spurious inserts. Phylogenetic trees were generated using FastTree with the generalized time-reversible model for nucleotide sequences (88). CLC Genomics Workbench (Qiagen) was used to generate the phylogenetic tree figures.
Geospatial mapping. The home address Zip code for all SARS-CoV-2-positive patients was used to generate the geospatial maps. To examine geographic relatedness among genetically similar isolates, geospatial maps were filtered for isolates containing specific amino acid changes.
Time series. Geospatial data were filtered into wave 1 (5 March 2020 to 11 May 2020) and wave 2 (12 May 2020 to 7 July 2020) time intervals to illustrate the spread of confirmed SARS-CoV-2-positive patients identified over time.
Machine learning. Virus genome alignments and patient metadata were used to build models to predict patient metadata and outcomes using both classification models and regression. Metadata considered for prediction in the classification models included age, ABO and Rh blood type, ethnic group, ethnicity, sex, ICU admission, IMU admission, supplemental oxygen use, and ventilator use. Metadata considered for prediction in regression analysis included ICU length of stay, IMU length of stay, total length of stay, supplemental oxygen use, and ventilator use. Because sex, blood type, Rh factor, age, age decade, ethnicity, and ethnic group are features in the patient features and combined feature sets, models were not trained for these labels using patient and combined feature sets. Additionally, age, Long et al.
length of stay, IMU length of stay, ICU length of stay, mechanical ventilation days, and supplemental oxygen days were treated as regression problems and XGBoost regressors were built while the rest were treated as classification problems and XGBoost classifiers were built.
Three types of features were considered for training the XGBoost classifiers: alignment features, patient features, and the combination of alignment and patient features. Alignment features were generated from the consensus genome alignment such that columns containing ambiguous nucleotide bases were removed to ensure that the models did not learn patterns from areas of low coverage. These alignments were then one-hot encoded to form the alignment features. Patient metadata values were one-hot encoded with the exception of age, which remained as a raw integer value, to create the patient features. These metadata values consisted of age, ABO, Rh blood type, ethnic group, ethnicity, and sex. All three types of feature sets were used to train models that predict ICU length of stay, IMU length of stay, overall length of stay, days of supplemental oxygen therapy, and days of ventilator usage, while only alignment features were used to train models that predict age, ABO, Rh blood type, ethnic group, ethnicity, and sex.
A 10-fold cross validation was used to train XGBoost models (89) as described previously (90,91). Depths of 4, 8, 16, 32, and 64 were used to tune the models, but the accuracies plateaued after a depth of 16. SciKit-Learn's (92) classification report and R 2 score were then used to access the overall accuracy of the classification and regression models, respectively.
Patient metadata correlations. We encoded values into multiple columns for each metadata field for patients if metadata was available. For example, the ABO column was divided into four columns for A, B, AB, and O blood type. Those columns were encoded with a 1 for the patients' ABO type, with all other columns encoded with 0. This was repeated for all nonoutcome metadata fields. Age, however, was not reencoded, as the raw integer values were used. Each column was then correlated to the various outcome values for each patient (deceased, ICU length, IMU length, length of stay, supplemental oxygen length, and ventilator length) to obtain a Pearson coefficient correlation value for each metadata label and outcome.
Analysis of the nsp12 polymerase and S protein genes. The nsp12 virus polymerase and S protein genes were analyzed by plotting SNP density in the consensus alignment using Python (Python v3.4.3, Biopython Package v1.72). The frequency of SNPs in the Houston isolates was assessed, along with amino acid changes for nonsynonymous SNPs.
Cycle threshold (C T ) comparison of SARS-CoV-2 strains with either Asp614 or Gly614 amino acid replacements in the spike protein. The cycle threshold (C T ) value for every sequenced strain that was detected from a patient specimen using the SARS-CoV-2 assay on a Hologic Panther instrument was retrieved from the Houston Methodist Hospital Laboratory Information System. The statistical significance of results of comparisons between the mean C T values for strains with an aspartate (n ϭ 102) or glycine (n ϭ 812) amino acid at position 614 of the spike protein was determined with the Mann-Whitney test (GraphPad Prism 8).
Creation and characterization of spike protein RBD variants. Spike RBD variants were cloned into the spike-6P (HexaPro; F817P, A892P, A899P, A942P, K986P, V987P) base construct that also includes the D614G substitution (pIF638). Briefly, a segment of the gene encoding the RBD was excised with EcoRI and NheI, mutagenized by PCR, and assembled with a HiFi DNA assembly cloning kit (NEB).
FreeStyle 293-F cells (Thermo Fisher Scientific) were cultured and maintained in a humidified atmosphere of 37°C and 8% CO 2 with shaking at 110 to 125 rpm. Cells were transfected with plasmids encoding spike protein variants using polyethylenimine. Three hours posttransfection, 5 M kifunensine was added to each culture. Cells were harvested 4 days after transfection, and the protein-containing supernatant was separated from the cells by two centrifugation steps: 10 min at 500 relative centrifugal force (rcf) and 20 min at 10,000 rcf. Supernatants were kept at 4°C throughout. Clarified supernatant was loaded on a Poly-Prep chromatography column (Bio-Rad) containing Strep-Tactin Superflow resin (IBA), washed with five column volumes (CV) of wash buffer (100 mM Tris-HCl [pH 8.0], 150 mM NaCl, 1 mM EDTA), and eluted with four CV of elution buffer (100 mM Tris-HCl [pH 8.0], 150 mM NaCl, 1 mM EDTA, 2.5 mM D-desthiobiotin). The eluate was spin concentrated (Amicon Ultra-15) to 600 l and further purified via size exclusion chromatography (SEC) using a Superose 6 Increase 10/300 column (GE) and SEC buffer (2 mM Tris [pH 8.0], 200 mM NaCl, 0.02% NaN 3 ). Proteins were concentrated to 300 l and stored in SEC buffer.
The RBD spike mutants chosen for analysis were all RBD amino acid mutants identified by our genome sequencing study as of 15 June 2020. We note that the exact boundaries of the RBD domain vary depending on the paper used as the reference. We used the boundaries demarcated in Fig. 1A  Enzyme-linked immunosorbent assays. ELISAs were performed to characterize binding of S6P, S6P D614G, and S6P D614G-RBD variants to human ACE2 and the RBD-binding monoclonal antibody CR3022. The ACE2-hFc chimera was obtained from GenScript (Z03484), and the CR3022 antibody was purchased from Abcam (Ab273073). Corning 96-well high-binding plates (CLS9018BC) were coated with spike variants at 2 g/ml overnight at 4°C. After four washes with phosphate-buffered saline-0.1% Tween 20 (PBST; 300 l/well), plates were blocked with PBS-2% milk (PBSM) for 2 h at room temperature and again washed four times with PBST. These were serially diluted in PBSM 1:3 seven times in triplicate. After 1 h Two Waves of COVID-19 Disease in Houston, Texas ® of incubation at room temperature, plates were washed four times in PBST, labeled with 50 l mouse anti-human IgG1 Fc-HRP (SouthernBiotech, 9054-05) for 45 min in PBSM, and washed again in PBST before addition of 50 l 1-step Ultra TMB-ELISA substrate (Thermo Scientific, 34028). Reactions were developed for 15 min and stopped by addition of 50 l 4 M H 2 SO 4 . Absorbance intensity (450 nm) was normalized within a plate, and 50% effective concentration (EC 50 ) values were calculated through 4-parameter logistic curve (4PL) analysis using GraphPad Prism 8.4.3.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.