Phylogenetic signatures reveal multilevel selection and fitness costs in SARS-CoV-2

Background Large-scale sequencing of SARS-CoV-2 has enabled the study of viral evolution during the COVID-19 pandemic. Some viral mutations may be advantageous to viral replication within hosts but detrimental to transmission, thus carrying a transient fitness advantage. By affecting the number of descendants, persistence times and growth rates of associated clades, these mutations generate localised imbalance in phylogenies. Quantifying these features in closely-related clades with and without recurring mutations can elucidate the tradeoffs between within-host replication and between-host transmission. Methods We implemented a novel phylogenetic clustering algorithm ( mlscluster, https://github.com/mrc-ide/mlscluster) to systematically explore time-scaled phylogenies for mutations under transient/multilevel selection. We applied this method to a SARS-CoV-2 time-calibrated phylogeny with >1.2 million sequences from England, and characterised these recurrent mutations that may influence transmission fitness across PANGO-lineages and genomic regions using Poisson regressions and summary statistics. Results We found no major differences across two epidemic stages (before and after Omicron), PANGO-lineages, and genomic regions. However, spike, nucleocapsid, and ORF3a were proportionally more enriched for transmission fitness polymorphisms (TFP)-homoplasies than other proteins. We provide a catalog of SARS-CoV-2 sites under multilevel selection, which can guide experimental investigations within and beyond the spike protein. Conclusions This study provides empirical evidence for the existence of important tradeoffs between within-host replication and between-host transmission shaping the fitness landscape of SARS-CoV-2. This method may be used as a fast and scalable means to shortlist large sequence databases for sites under putative multilevel selection which may warrant subsequent confirmatory analyses and experimental confirmation.


Introduction
Although the majority of polymorphic sites within a viral genome are selectively neutral or under weak selection 1 , frequent variant replacements driven by adaptive evolution are observed in SARS-CoV-2, Influenza A, and other endemic viruses 2 .Despite less understood, the phenomena where some mutations confer a large transient increase in fitness, being advantageous to viral replication within hosts but detrimental to transmission, could present large epidemic-level effects.Well-documented examples of multilevel selection are provided in the context of HIV-1, which evolves considerably faster within individuals than at the epidemic level 3,4 , and virus which is more highly diverged from the population consensus is less likely to be transmitted 5 .A variety of mechanisms may underlie multilevel selection effects, including preferential transmission of ancestral variants 4 , switch in coreceptor usage 6 or heterogeneous immune selection and frequent reversions of host-specific adaptations 7 .Human Influenza A is an endemic virus that evolves more similarly to SARS-CoV-2.Background selection outside, and not only positive selection within, antigenic epitopes was shown to control the mode and rate of influenza evolution, as immunological adaptation and the preservation of other viral functions can conflict with one another 8 .
During the COVID-19 pandemic, analyses of the extensive genomic data collected provided valuable insights about the competing forces influencing SARS-CoV-2 evolutionary dynamics 9-14 .However, it also highlighted the need for more advanced methods able to simultaneously handle large-scale datasets and perform fine-scale inferences (e. g. investigating fitness cost of individual mutations) 9, [15][16][17] .By developing such scalable analytic pipelines, we are able to identify recurrent mutations in different branches of a phylogenetic tree from a densely sampled epidemic, which potentially arise convergently as a consequence of virus response to adaptive selective pressures within hosts.
The inference of fitness from the shape of phylogenies was introduced through the local branching index (LBI) 18 and treeImbalance methods 19 .Remarkably, even when clades are distantly related, they can present very similar distributions of coalescent times and branch lengths 20 , as well as the proportion of descendants, persistence time (or longevity), and growth rates when compared with closely-related clades.Most importantly, mutations influencing virus transmission fitness are expected to affect the distribution of offspring 9 , consequently generating localised and quantifiable imbalance in time-scaled phylogenies.Therefore, the comparison of these parameters can indicate if similar evolutionary, demographic, or epidemiological processes are shaping viral evolution across different clades of a phylogenetic tree.
In molecular epidemiological studies, a set of particularly scalable approaches have been developed based on the detection of phylogenetic clusters comprising two or more closely related samples 21,22 .The frequency of phylogenetic clustering in a sample is sometimes considered a proxy for high transmission rate, especially in HIV datasets 21,23,24 , and can potentially indicate spread efficiency of a particular genotype (e.g.HIV drug resistance-associated mutations [DRAMs]).Intuitively and by extension, transmissibility and within-host evolution between variants can be considered a proxy for overall fitness 25 .Recently, a genetic clustering analysis of HIV-1 identified variants containing specific DRAMs in antiretroviral therapy (ART)-naive transmission networks that reduce transmission fitness and suggested a negative correlation between lower frequencies of rare polymorphisms and fitness advantage 24 .
Currently, similar clustering approaches have not yielded major insights into negatively-selected variants in SARS-CoV-2 despite the collection of unprecedented numbers of whole-genome sequences 9 .Furthermore, there is considerable scope to improve on distance-based genetic clustering methods because such approaches were demonstrated to be systematically biased to detect variation in sampling rates instead of transmission rates 22 .Consequently, these methods will potentially present poor specificity for variants that negatively influence fitness.During the past few years, positive and negative selection in SARS-CoV-2 have mainly been investigated using methods that rely on synonymous rate variation across sites/branches 26,27 , and results from these approaches on SARS-CoV-2 comprehensive datasets are available for comparison 28 .However, methodology to identify mutations that potentially have a transient fitness advantage is still lacking.
We developed a tree-based clustering algorithm, available as open-source R package mlscluster () 29 , to identify potential transmission fitness polymorphisms (TFPs), which can be defined as mutations carrying a transient/multilevel fitness advantage, i.e. being simultaneously beneficial for replication within hosts and deleterious to transmission.This is achieved by computing and comparing simple summary statistics from the offspring of recurring clade-defining mutations

Amendments from Version 1
In this revised version of the paper, we addressed minor and major comments from three different expert reviewers.We incorporated new analyses comparing our approach against experimental (deep mutational scanning) and computational estimates of fitness effects across SARS-CoV-2 proteins, and discussed how these approaches differ from mlscluster.Most importantly, we tried to make the wording less definitive about the sites actually being TFPs and explain that this method should be used as a fast and scalable way to shortlist large sequence databases for candidate sites under multilevel selection.Such candidates could then be used as input for more rigorous coalescent-based modelling or experimental studies to confirm their biological significance.Additionally, we tried to increase the readability, contextualisation with relevant literature, and discuss some limitations of the method in more detail.
Any further responses from the reviewers can be found at the end of the article in a time-scaled phylogeny, therefore specifically investigating homoplasies and aggregating information across many occurrences of the same mutation.This approach complements standard procedures based on synonymous rate variation across sites/branches by highlighting variants which likely have different and/or competing selective pressures within and between hosts.We demonstrate its applicability through the analysis of a representative SARS-CoV-2 data set comprising >1.2 million genome sequences from England.The analysis indicated slightly higher TFP-homoplasy enrichment on B.1.1.7 and AY.4.* lineages, particularly in genomic regions of known functional significance such as spike (S), nucleocapsid (N), and ORF3a, as well as regions of poorly understood significance including ORF7 and ORF8.By providing a comprehensive catalog of the main sites potentially resulting from multilevel selective pressures throughout the SARS-CoV-2 genome, we also expand the understanding of the SARS-CoV-2 fitness landscape outside the well-studied spike protein.Therefore, these results are meant to guide the design of experimental studies and population genetic models to characterise the functional impact of specific mutations, especially the subset that is advantageous to within-host replication but detrimental to transmission.

Terminology and tree-based clustering statistics for detecting localised imbalance
We propose a phylogenetic tree-based clustering algorithm to systematically explore all clades in a time-scaled phylogenetic tree reconstructed from viral genomes.Assume two clades u (target clade) and v (comparator/sister) (Figure 1A) organised in a time-scaled tree t and sharing full ancestry (i.e., the same defining mutations) up to the point which they diverge and present their own exclusive defining mutations.The size of each clade (n in Figure 1) is defined as the number of descendant sequences arising from it until the leaves of the tree are reached.The persistence time (given by a in Figure 1), is defined as the difference between the maximum sample time of samples descended from that clade and the estimated time of its most recent common ancestor [tMRCA].After computing these simple parameters, the target clades u are then contrasted against their comparators v, which can be their sister clade (the onesharing an immediate ancestor assuming bifurcating phylogenetic relationship) or all other clades sharing the same immediate ancestor (in case of polytomies/ multifurcations).
If considering a clade u with sister clade v, we compute three statistics based on these local phylogenetic patterns: (i) the ratio S uv of 'clade sizes' (i.e., the number of samples descended from each clade) between u and v denoted n u and n v (Equation 1), (ii) the ratio T uv of persistence times (i.e., longevities) denoted by a u and a v (Equation 2), and (iii) the logistic growth rate.The latter is defined as the coefficient of a logistic regression having a response variable defining sampling a descendent of u versus v and the sample time as a predictor (Equation 3).The coefficient of such a logistic regression quantifies the relative growth of the clade and is related to the selection coefficient 30 .

Tree-based clustering algorithm implementation
The mlscluster method is implemented as an R package (https://github.com/mrc-ide/mlscluster) 29,31that incorporates these multiple statistical methods for identifying especially convergently acquired mutations (homoplasies) that are potentially detrimental for transmission (i.e., homoplasies occurring in a subtree for which at least one of the three aforementioned statistics fall below a low quantile [e.g., 2%] of the empirical distribution).These statistics applied to each clade are designed to be simple and computationally fast, making it possible to scan phylogenies with more than a million tips in hours using multiple CPU cores.
The clustering algorithm (Figure 1B) starts by receiving a rooted bi-or multifurcating time-scaled tree (e. g., estimated using treedater 32 , treetime 33 or chronumental 34 ) and associated metadata in a tabular format including sequence name, sample date, lineage, major lineage, and annotated mutations.The package then uses standard tree manipulation strategies implemented in the ape R package 35 , particularly postorder traversal to visit clades and tips based on the two-column edge matrix from the "phylo" class.Given this efficient way to visit clades of the tree and edge lengths, we can easily extract the parameters of interest (e. g., time of the most recent common ancestor of each clade, descendant identifiers and quantities, clade ages, etc).Therefore, target clades are selected if they meet all of the following conditions: (i) have at least a specified minimum number of descendants (ii) have up to a maximum number of descendants, (iii) last for a minimum period (cluster age, in years), (iv) are sampled after a minimum sampling date, and (v) are sampled up to a maximum sampling date.The default values of (i) 10, (ii) 20×10 -3 , and (iii) 1/12 (1 month) were chosen to avoid ratios being taken from small or unreliable clade sizes/persistence times, an unrealistically high number of viral generations, and very short timeframes.These values were selected based on previous experience with UK SARS-COV-2 analyses using a similar software developed by our research group 15,36 .
A systematic evaluation of how these parameters would influence the results was not performed, but is planned for future applications of this method.The minimum date was set to be the one of the Wuhan reference sequence and the maximum dates were modified according to the investigated pre-Omicron (up to mid-November 2021) and including Omicron (up to end of April 2022) periods.Subsequently, target and comparator (sister) clade(s) are gathered together and ratio of sizes, ratio of persistence time and logistic growth rates are calculated as previously stated.
Since every sequence should include a metadata column (e.g.precomputed by COG-UK consortium, see ) listing mutations from its genome, the clustering algorithm tool incorporates a function to identify defining polymorphisms in target clades.The mutation must be present in >75% of the sequences in that clade (while absent or in a smaller fraction than this percentage in its comparator) to be considered as defining, although this cutoff value can be changed.After computing clade-defining mutations, these are all tabulated and those which happen more than once in different clades are retrieved as homoplasies.To enhance inspection of results, homoplasies are annotated into (i) regions of interest (e.g.SARS-CoV-2 spike and nucleocapsid proteins), (ii) different mutations at the same site, and (iii) mutations within a 3 amino acid sliding window.There is also an additional sanity check for known sequencing artifacts 37 and for positively selected sites found by other analyses 28 .Then, based on cut points dividing the range of the empirical distribution of each statistic into equally-spaced continuous intervals (quantiles), cluster thresholds can be specified.These thresholds retain only clades, along withtheir respective defining homoplasies, that are potentially detrimental for transmission (default threshold of <1%) or carrying a positive fitness advantage (e.g., >99%), therefore aggregating summary statistics from the phylogeny across many occurrences of the same mutation when compared with sister clades without the mutation.We intended to make the method flexible by creating a parameter that specifies in how many percentiles the statistic should be splitted (default = 1/100) and another to keep values below or above the cutoff point.
Different comma-separated detailed outputs are generated for each of the three statistics showing clades (and defining homoplasies) contained in the chosen cluster threshold, as well as the intersection (the ones identified by the threshold of the three statistics) and union (clades associated with at least one statistic threshold).Additionally, for each of the three homoplasy-annotated categories, three files are generated with their frequencies, and clades of occurrence considering (a) all target clades that passed minimal filtering conditions, (b) only clades detected by one or more statistic threshold, (c) the ones not detected by any cutoff.Finally, these outputs are joined into one data.frameto facilitate further statistical analyses.
Identifying potential TFPs using the mlscluster algorithm and a representative SARS-CoV-2 genomic dataset Tree and metadata.A global SARS-CoV-2 maximum likelihood (ML) phylogenetic tree was estimated using FastTreeMP v2.1.11 38and UShER 39 as part of the phylopipe workflow 40 .
In addition, associated metadata including adm2 regions following the Database of Global Administrative Areas (GADM) subdivisions, PANGO-lineages, and annotated synonymous and non-synonymous mutations were obtained from the COVID-19 Genomics UK Consortium (COG-UK).From the ML tree, we estimated a time-scaled phylogeny using chronumental v0.0.53 34 .Only sequences from England were retained alongside the Wuhan/WH04/2020 (EPI_ISL_406801) reference sequence.In total, we included  2).

Statistical analysis for identifying genomic regions enriched for TFPs
We tested our approach using two COVID- We also performed rigorous quality control to ensure our estimates were not biased by sequencing and base-calling artifacts.Firstly, we removed outlier sites highly enriched for homoplasies (above the 99% quantile of homoplasy frequencies for all cluster thresholds), which we manually confirmed to be sequencing artifacts due to the high number of undetermined bases at respective sites in the alignment generating the phylogenetic tree.However, even after performing this approach, BA.1-defining mutations in the Receptor-binding Domain (RBD) were particularly identified as TFP-homoplasies for threshold=2% (Extended Data Figure S1A) 41 , which was an unexpected result.To further inspect this inconsistency, we selected eight BA.1-spike defining mutations that were in our top100 of most frequent homoplasies (S:S371L, S373P, S375F, G496S, Q498R, N501Y, Y505H, N764K) and excluded all sequences (n=71,414, 5.6% of all sequences) that had undetermined bases (e.g."NNN") in their respective codons from the nucleotide alignment.As a result, these sites were not detectable anymore (Extended Data Figure S1B) 41 and we could confirm they were the result of sequencing artifacts, which generally occur due to systematic differences in sequencing protocols and primer selection over time 42 and in different laboratories.
Consequently, we decided to perform a more aggressive quality control (henceforth called alignment-aware artifact removal) that has the advantage of not relying on excluding sequences without perfect coverage.First, we ran the mlscluster algorithm (https://github.com/mrc-ide/mlscluster)without any artifact removal.We then extracted every homoplasic site detected and used seqtk v1.3-r106 43 to create alignment files for each codon matching these sites in case of non-synonymous mutations and for each nucleotide in case of synonymous mutations.Afterwards, for each sequence, we added "X" (undetermined amino acid) for every site which had one or more non-ACTG character in respective codon positions, and "N" (undetermined nucleotide) for every non-ACTG synonymous site.Subsequently, we appended the "X" and "N" site annotations for all sequences within the existing metadata mutations column.We then incorporated a function named .fix_sites(https://github.com/mrc-ide/mlscluster/blob/main/R/mlsclust.R#L314) to deal with those highly uncertain sites within the mlscluster package.In summary, since the first step to extract a defining mutation for each target clade is to compute the frequency of each mutation within the target and the comparator clades, we used the proportion of the most frequent mutation at a given site across all sequences within the clade and added up the frequency of the "X" or "N" at that site, because it is most likely that the artifacts follow the majority within that specific clade given the complete shared ancestry.For example, if the target clade has the site S:G446 changing to S with frequency = 0.7 (i.e., 70% of the sequences in the target have this mutation) and to X with frequency = 0.3, we consider the S frequency = 1, and this mutation now has enough frequency (>75%) to be considered defining, which only would occur if the comparator has S:G446S at frequency <0.75%.
In cases where "X" and "N" are the most frequently mutated characters within a clade, the second-ranked amino acid or nucleotide at that site is added to these undetermined characters.For example, the target clade has the site S:N501 changing to X with frequency = 0.6 and to Y with frequency = 0.4, then the X frequency = 1.In such cases where either the target or comparator has a higher frequency of "X" or "N", the sites are not considered defining.A mutation is only considered defining when (i) it has one of the 20 valid amino acids (or stop codon) or the four valid nucleotides, (ii) it has a >75% frequency on the target clade and simultaneously <75% on the its comparator.This approach removed not only the eight previously investigated BA-1-spike defining mutations but also other six S sites for threshold = 2%, and affected mostly the BA.1 lineage without major changes for other lineages.Therefore, we consider that results arising from this alignment-aware artifact removal method are more reliable than previously and report those throughout the paper.
We performed Poisson regressions using the glm function from the stats package 44 and having the frequency of homoplasies as response variable (Equation 4) to identify if any genomic regions and/or major PANGO-lineages were associated with increased TFP-homoplasy emergence for non-synonymous polymorphisms across different time periods and thresholds.A p-value ≤ 0.05 was considered statistically significant.).These were main drivers of epidemic waves in the UK and around the globe.Genomic regions included all 15 non-structural proteins (NSPs) from ORF1ab (NSP1-10, 12-16), S, ORF3a, Envelope (E), Membrane (M), ORF6, ORF7a, ORF7b, ORF8, N, and ORF10.Moreover, regions of characterised functional significance including the N-terminal Domain (NTD), the RBD and the Furin-cleavage site (FCS) of S, as well as the Linker Domain of N 45 were considered as additional genomic regions.The other covariate was whether the sites were independently found under selection based on a HyPhy-based synonymous rate variation across sites/branches analysis 28 .

( ) ( ) [ ]
To further investigate the genomic regions enriched for TFP-homoplasies resulting from the Poisson regressions and to compare the sites identified as potentially under selection against results from the literature 28 , we generated different exploratory visualisations using ggplot2 46 .These were stratified by the method of detection (mlscluster, HyPhy, or both), major PANGO-lineages, genomic regions (including frequency normalisation by size), and cluster thresholds.
Although not strictly comparable, we also examined how our estimates of multilevel selection were consistent with deep-mutational scanning (DMS) experiments 47-51 and comprehensive computational estimates of the magnitude of one level of selection based on fixed fitness effects 52 , Firstly, we extracted DMS measurements against spike 47 , RBD 48 , and NSP5 (MPro) [49][50][51] as compiled previously (https://github.com/jbloomlab/SARS2-mut-fitness/tree/main/results/dms).Subsequently, we selected fitness effects estimates 52 that would be suitable for comparison against our dataset, leading to the choice of the England subset from early 2022 (https://github.com/jbloomlab/SARS2-mut-fitness/blob/main/results_public_2022-01-31/aa_fitness/aamut_fitness_by_subset.csv).We then matched mutations detected as TFPs and non-TFPs (homoplasies that are not identified by any of the mlscluster statistics) against mutations having positive and negative effects given by the respective DMS (experimental) and computational estimates.We also compared the association between TFP status and fitness magnitude using a Fisher's Exact test.Finally, we explored how the distribution of computational selection coefficients was different for TFPs and non-TFPs using sina plots and a Wilcoxon signed-rank test.
Codon-aware false discovery rate (FDR) We used synonymous homoplasies for characterising the FDR of our approach under the assumption that synonymous sites would not provide a fitness advantage (i.e., are nearly neutral, as demonstrated 52 ).Since these sites represent one-third of the genome and mutations tend to occur in the third codon position to preserve the encoded amino acid, this weighting needs to be taken into account when computing FDR.Firstly, we defined the percentage of erroneous TFP calls for each threshold t as: where Y is the number of TFP calls specifically among the i polymorphic third codon position sites with > 100 mutated sequences at the given site (considering the analysed > 1.2 million genomes).This threshold was selected after retrieving the number of polymorphic synonymous sites at each codon position for thresholds ranging from 10 to 1000 mutated sequences, balancing the number of legitimate and possibly artifactual polymorphic sites and representing approximately 0.1% of the analysed sequences.The FDR calculation is also performed for each SARS-CoV-2 protein.Multiplication by 100 transforms the proportion of erroneous TFP calls into a percentage for easier interpretation.
Similarly, a separate error rate (ε or codon-aware FDR) is also computed relative to the sites at first and second codon positions as follows: where Z is the number of TFP calls at codon positions one and two, and j is the total number of polymorphic sites at first and second positions with > 100 mutated sequences at the given site.
Both calculations were performed for the two analysed time periods, with slightly smaller error rates for the timeframe before Omicron emergence.

Results
We utilised our new approach (Figure 1) 41 2).For clarity and due to the main patterns observed, the analytic period presented includes: (i) June 2020 to mid-November 2021 (pre-Omicron interval) and (ii) mid-November 2021 to April 2022 (including Omicron).Only data collected through community sampling (Pillar 2) were included to reduce bias towards more severe infections and avoid the inclusion of data that was collected for special purposes.For these reasons, we were unable to investigate e.g.second-generation BA.2, BA.5, XBB, and other relevant variants that emerged after April 2022.The geographical representation of the data is similar across different regions of England, with a mean proportion of community cases having a sequence of 6.7% and median of 7.1% (Figure 2, inset plot).

Lineages and genomic regions enriched with SARS-CoV-2 TFP-homoplasies
The presence of recurring synonymous polymorphisms classified as TFP-homoplasies allowed us to investigate the FDR for each genomic region as a function of the applied cluster thresholds.The frequency of synonymous mutations along the genome (Extended Data Figure S2) 41 and the FDR across genomic regions (Extended Data Figure S3) 41 support that sites only detected at thresholds ≥ 5% must be investigated with caution since they generally have associated FDRs >20% and ε ≈ 5%.In contrast, cutoffs ≤ 2% retain acceptable FDRs ≈ 10% and ε ≈ 2%.Additionally, these more erroneous thresholds (≥ 5%) represent > 50% of the identified TFP sites (Figure 3A).
Therefore, among the ten cluster thresholds ranging from the more strict (0.25%) to the more lenient (25%) values (Extended Data Text S1, Extended Data Figure S4) 41 , we report results with the 2% threshold and after performing rigorous quality control using an alignment-aware artifact removal method to represent sites under putative multilevel selection (see Methods: Statistical analysis for identifying genomic regions enriched for TFPs).With this threshold, the false discovery rate (FDR) and codon-aware FDR (ε) (see Methods: Codon-aware false discovery rate (FDR)) are respectively around 10% and 2.5% (Extended Data Figure S3) 41 .Results for sites identified with other thresholds are presented in the Extended Data 41 .
For the period predating Omicron BA.1.*emergence, we found that B.1.1.7 was consistently the lineage with the highest coefficient for enrichment of TFP-homoplasies, reaching statistical significance for ≥ 2% thresholds.However, TFP enrichment was similar across lineages and not found for specific genomic regions (Extended Data File S1) 41 .Although not significantly different, TFP-homoplasies were slightly more abundant in the small linker domain 45 of the N protein, ORF3a, and ORF8 for this time period (Figure 3B and C).
All considered major lineages were associated with increased TFP-homoplasy emergence for ≥ 1% thresholds during the timeframe which includes Omicron BA.Once again, there were no statistically significant results at the 2% threshold regarding genomic regions (Extended Data File S2) 41 .However, N:linker domain, ORF3a, S:FCS, ORF7a, and ORF10 presented a higher number of TPFs per site (Figure 3D and E), which is relatively consistent with the preceding period.
Normalising homoplasy counts by the size of each genomic region (giving less weight to larger genomic regions) has a large influence in interpreting the relative rates of TFP acquisition.This is especially demonstrated by NSP3, which accrues more TFPs due to its size of 5835 nucleotides (Figure 3B and D), but when normalised has a similar distribution of TFPs compared to other genomic regions (Figure 3C and E).
TFPs along the SARS-CoV-2 genome and low concordance with positively selected sites When comparing the 30 most frequent mlscluster TFP-homoplasies against the 30 mutations under positive selection detected by the HyPhy-based approach 28 for the cluster threshold = 2% and period before Omicron emergence, we detected three concordant sites (S:67, S:95, and S:484), 27 positively selected sites only detected by HyPhy, and 63 sites only detected by mlscluster statistics (Extended Data Figure S5) 41 .For the timeframe including Omicron, we identified two sites under selection concordantly between methods and with the previous period (S:67 and S:484), 28 discordant results, and 51 new potential TFPs across seven proteins/ORFs (Figure 4A).Notably, the minimal overlap between sites identified as under multilevel (mlscluster) and positive (HyPhy) selection suggests these methods are indeed capturing different selective pressures, as expected.
The top 30 most frequent TFP-homoplasies across lineages shows that the B.1.1.7 (n=22) and the AY.4.* (n=21) were similarly enriched for those highly-frequent TFP-homoplasies up to mid-November 2021 (Extended Data Figure S5) 41 , while AY.4.* (n=20) was notably the major lineage harbouring more TFPs (Figure 4B, Table 1) when considering Omicron BA.1.*.Although this would technically mean that these lineages are morely likely to give rise to TFPs, it is possible that this observation is driven by surveillance effects, since there was relatively deep sampling of B.1.1.7 and AY.4 that may have enabled the detection of more candidate TFPs.Morever, the analysis of lineage-specific top 30 TFP-homoplasies regardless of threshold shows that less than half of those are firstly detected on smaller cluster thresholds (up to 2%) (Extended Data Figures S6-S10) 41 .
When expanding to the top 100 TFP-homoplasies (Extended Data Table S1) 41 , 21 sites are located within the larger NSP3 (0.011 TFPs per site), 14 are from ORF3a (0.051 per site), 12 from spike excluding NTD, RBD, and FCS (0.017), nine A manual inspection of TFP-homoplasies with frequency ≥ 5 (Extended Data Table S1) 41 confirmed that they emerged independently in multiple lineages during the pandemic and are predominantly found in extremely low (<1%) frequencies.This independent analysis shows that the method works as expected, capturing potential signals of multilevel selection.This is indicated by the recurrence of such mutations and their appearance in clades where the size, longevity or growth rate of the target clade is much smaller when compared to its  1. sister(s).Additionally, it suggests the existence of potential functionally important mutations outside the spike protein that warrant further investigation.(Extended Data Table S1) 41 .
By focusing on individual proteins that harbour a higher normalised count of TFPs and both established (S, N, ORF3a) and yet to be characterised functional significance (ORF7a and ORF8), we highlight relevant sites potentially under multilevel selection for further experimental investigations.These sites include S:A67V, S:S98F, S:L216F, S:E484K, S:A688V, N:P151L, ORF3a:L15F, ORF8:Y73C, etc.Additionally, these transient selective processes are more likely to be acting uniformly across each protein and not in specific hotspots (Figure 5, Extended Data Figure S11, Extended Data Table S1) 41 .

Comparison against DMS experiments and other approaches to detect sites under selection
When exploring the DMS fitness effects for the spike protein 47 , we identified a total of 49 TFP sites matching DMS negative effects and 15 sites corresponding to positive estimates (Extended Data Figure S12A).When tabulating these values with the non-TFPs (332 and 169, respectively), the resulting odds ratio of a TFP having negative fitness effect is 1.66 (95% CI: 0.88 to 3.29).The positive effect of such 15 TFPs is very close to zero (maximum: 0.35).Similarly, of the five TFPs within the RBD 48 , four are associated with DMS small negative effects and one with a slightly positive effect (Figure S12B).On the other hand, our comparison against NSP5 (MPro) DMS experiments 49,50 , which greatly presented overall positive fitness effects (87.63% and 90.62% of all investigated mutations across these two studies), demonstrated that all 12 TFPs had small positive effects between zero and one.In another DMS study 51 that presented a much smaller percentage of positive effects in MPro (31.79%), our identified TFPs displayed eight positive effects and two negative effects, in both cases very close to zero.
We also compared our approach against a comprehensive analysis of SARS-CoV-2 fitness effects that relies on contrasting the expected number of independent occurrences of a mutation in four-fold degenerate sites (absence of selection) against the actually observed appearances 52 .Of the total of 539 TFPs, 80 had positive and 459 had negative effects.If considering the non-TFPs (351 and 3633, respectively), the odds ratio of a TFP having positive effect is 1.80 (95% CI: 1.37 to 2.35), which opposes to the DMS results.The distribution of positive (Extended Data Figure S13A) and negative (Extended Data Figure S13B) fitness effects over the entire SARS-CoV-2 genome is significantly different when comparing TFPs and non-TFPs (Wilcoxon signed-rank test p-values = 0.009 for positive and p<0.0001 for negative), with TFPs having less positive and less negative overall fitness (i.e.skewing towards zero) values (Extended Data Figure S13A-E).A possible explanation for this observation is that TFPs have moderate multilevel positive and negative selection effects.However, the compared method, which is based on fixed effects, would average out such effects making them approach zero because it only measures the magnitude of one level of selection.

Discussion
We have presented a tree-based clustering method to investigatetransient selective forces acting on SARS-CoV-2 lineages and mutations through the calculation of three statistics (ratio of sizes, ratio of persistence time and logistic growth rates between sister clades) and extraction of clades containing values of those statistics below small cluster threshold cutoffs.To mitigate the inclusion of spurious sites, we included only recurring clade-defining mutations (homoplasies) across cluster thresholds with low associated FDRs and excluded probable sequencing artifacts.Our approach provides a scalable way to analyse huge genomic datasets with >1 million sequences for multilevel selection while also accounting for shared ancestry.
Although the COVID-19 pandemic offered the opportunity to collate genomic datasets of unprecedented sizes, estimating the transmission fitness of individual polymorphisms in this context is challenging.In the early epidemic, inference of sites under positive selection was hampered by low sensitivity given the small genetic diversity of the virus, probably resulting in underestimated estimates of fitness costs.For example, a phylogenetic approach was developed to quantify imbalance in clades containing recurrent mutations 9 , and this approach found a lack of evidence for increased transmissibility from recurrent SARS-CoV-2 mutations.However, this approach only used ≈50,000 sequences up to July 2020 and did not considered persistence times and growth rates as measures of differential fitness across clades of the phylogeny.Following the emergence of VOCs exhibiting elevated substitution rates, other attributes such as convergent evolution, sparse sampling, and vaccine-elicited immunity appeared as relevant confounding factors.Most importantly, the detection of positive selection does not necessarily imply enhanced transmissibility, and the effects of individual mutations on this trait will typically be modest 10 .
Although we carefully considered convergent evolution in our approach by aggregating summary statistics across multiple occurrences of the same mutation, the statistical power is reduced when there are only a few recurrences, as observed here for SARS-CoV-2.For example, this limits our ability to disentangle context-dependent selection effects of TFPs, i.e., when particular mutations are only detrimental at the between-host level if they appear in one major lineage and not others.However, we believe realistic population genetic simulations and data from other pathogens in which within-host evolution is the major source of molecular diversity (e.g.HIV-1) may provide a better understanding of our capacity to detect multilevel selection more rigorously.Sampling biases and changes in the immunological landscape of the human population were not directly addressed here and deserve future developments.Phylogenetic tree estimation and molecular dating on such large-scale datasets also represented enormous challenges in the early pandemic, due to exceedingly high turnaround times and inference uncertainty.However, novel elegant developments allowed the increasingly fast and accurate merging of lineage-specific phylogenies into bigger trees 40 , placing new sequences into existing phylogenies using parsimony methods and efficient data structures 39 , and estimating time-scaled trees using a differentiable graphs for efficient calculation 34 .These advancements in the field of phylogenetics were essential for this work to be possible.
Genetic diversity in infected individuals is governed by repeated cycles of within-host (e.g.replication and immune escape pressures) and between-host processes (e.g.transmission bottlenecks), with the outcome of selection at each level having an effect on the other 53 .The rapid accumulation of mutations in individuals 54,55 with long-lasting chronic SARS-CoV-2 infection is hypothesised to contribute to the emergence of variants such as Alpha and Omicron 56 .Thus the interaction of within-host and between-host selective processes can occasionally have very large epidemic-level effects.
The inspection of the global and lineage distributions of highly frequent TFP-homoplasies confirmed that these mutations generally emerge independently in multiple lineages but remain quite rare, which is consistent with a simultaneous within-host advantage and between-host disadvantage.This systematic investigation also emphasises the scarcity of experimental studies to characterise the functional impact of mutations outside the spike protein.
Our approach identified modest differences in multilevel selection signals across two different epidemic phases, lineages and genomic regions in the UK.We hypothesised that transient selective forces would become stronger after high-levels of convalescent and vaccine-induced immunity have been reached, but our results do not support this hypothesis.Our observation of approximately constant levels of transient selection across waves driven by extremely distinct variants may in part be driven by long-duration chronic infections which occur at low frequency and provide greater opportunities for accelerated within-host evolution favouring immune evasion.
On the other hand, a systematic evaluation of longitudinal samples from chronically infected patients using a range of molecular dating methods found a lack of evidence for such higher within-host as opposed to between-host rates.These findings suggest that intrahost evolutionary rates have been overestimated and other considerations such as prolonged viral shedding and relapsing viral load dynamic should also be considered for a more comprehensive understanding of SARS-CoV-2 within-host dynamics 57 .Our data did not include clinical covariates that would allow us to investigate the association of chronic infection or duration of infection and the presence of TFPs.
Sequencing of chronically-infected patients throughout multiple time points of their long-lasting infection provided external validation of our observed patterns.Nucleotide substitution rates were around twice as fast during chronic infections when compared with the global SARS-CoV-2 evolutionary rate 58 .
Additionally, mutations identified in the top 100 most frequent TFP-homoplasies by our approach such as S:E484K [58][59][60][61] , S:T95I 60,61 , ORF8:Y73C 59 , ORF8:L118V 58 , ORF1ab:S944L 58 , and ORF1ab:T1543I 58 also emergedafter days of chronic infection.Although usually associated with immune escape and increased ACE2 affinity, these recurrent mutations lack the capacity to enhance transmission 60,61 as demonstrated by their low epidemic-level frequency after multiple independent occurrences.Additionally, distinct viral populations appear to be residing in different niches (e.g.organs) of a patient's body 61 and an impaired immune system selects for mutations that confer intra-host replication and persistence (e.g.immune evasion) as opposed to general acute infections, in which mutations favouring inter-host transmission are a major target of selective pressures 60 .Remarkably, another study which investigated 27 chronic infections showed that certain recurrent mutations arising in such persistently infected patients do not appear at the between-host level, also indicating a tradeoff between antibody evasion and transmissibility as a plausible explanation 61 .More recently, analysis of a large UK community surveillance study also demonstrated that particular mutations, including ORF1ab:T1638I (NSP3) and T4311I (NSP10), are recurrent in persistent infections but mildly deleterious at the between-host level.Conversely, many other recurrent mutations appearing in persistent infections were shown to be advantageous at the between-host level 62 .Importantly, since our approach is especially designed to detect beneficial mutations at the within-host level, we do not expect that it directly captures mutations that are, at the same time, favourable for between-host transmission.
However, the comparison of our results against DMS experiments [47][48][49][50][51] showed that although most TFPs matched negative (detrimental) fitness effects, some level of short-lived positive selection seems plausible for a few sites.We also contrasted our results against comprehensive SARS-CoV-2 fitness estimates from a method that compares the anticipated number of independent mutations at four-fold degenerate sites (where no selection occurs) with the observed mutation frequencies 52 .It is important to emphasise that such approach is quantifying the magnitude of one level of selection only, while our approach attempts to identify multilevel selection.In other words, a single parameter is not sufficient to describe a site under multiple levels of selection, and future development is needed on statistical models for inferring multilevel selection effects.Although the fitness effects of these two approaches are not directly comparable, we found that TFPs tend to have fitness effects approaching zero, which can be explained by the simplistic assumption of fixed (unchanged) effects that would average the actual slightly positive and negative multilevel fitness estimates.Despite distance-based clustering in HIV networks having been extensively used as a proxy for transmissibility, this approach is generally based on a cutoff from pairwise distances separating sequences 21 .Consequently, poor specificity for variants negatively influencing fitness is evident (i) when a variant is isolated, occurring along a long branch not captured by distance threshold, (ii) when a variant is imported, and large genetic distances can reflect unsampled diversity in the country of origin or a rare recombination or hypermutation event, not necessarily reflecting a fitness cost.Our method addresses these limitations by incorporating the number of descendants, persistence times, and growth rates across sister clades with and without the mutations under investigation, and using independently-acquired substitutions to remove spurious relationships.Introducing these multiple sources of information can provide more accurate estimates, but also introduce biases.Primarily, our analysis is sensitive to sequencing artifacts.Although we used data from a highly standarised sequencing consortium (COG-UK), changes in primer sets after Omicron emergence 42 , as well as sequencing coverage and base-calling errors can potentially influence our conclusions, as demonstrated by our several quality control and artifact removal methods employed.A second limitation is our focus on consensus sequences.By using our approach, some beneficial within-host mutations (TFPs) might be missed because minority variant transmission is rare in scenarios where the de novo mutation rate is low (i.e., mutation takes time to reach >50% frequency within a host) and the transmission bottleneck is tight, as observed in SARS-CoV-2 63 .A third caveat arises from the assumption of representative sampling.Although we utilised data from England during a period of proportional (to cases) community sampling to minimise this effect, the rate of sampling varied substantially over time and further analyses are needed to investigate the impact of non-representative sequencing in our approach.Moreover, sample density will certainly impact the sensitivity of the method to detect circulating TFPs.By definition, these variants will have low persistence and growth, and therefore will only be detectable when a sufficiently high sample density is achieved.A detailed quantitative understanding of detection thresholds will need to be carried out in future work.Lastly, the small absolute frequency of reoccurrence of some identified TFPs raises concerns about the false discovery rate of the proposed method.We believe that, given that comprehensive analyses of SARS-CoV-2 fitness effects 52 point to a nearly neutral between-host effect of synonymous mutations along the SARS-CoV-2 genome, our FDR estimates of ~10% (that uses TFP detection in synonymous sites as a proxy for false detection) is reasonable for these sites to be considered under putative multilevel selection.

Conclusions
We developed a method designed to identify candidate sites under multilevel selection from >1.2 million SARS-CoV-2 sequences using rigorous quality control, statistical tests, and control for false detection.The comprehensive catalog of TFPs identified here and especially abundant in S, N, ORF3a, ORF7, and ORF8 suggest the existence of important tradeoffs between within-host replication and between-host transmission of SARS-CoV-2 that may warrant further experimental investigation and more realistic coalescent-based model developments.License: MIT occurred multiple times in the phylogeny in clades that are characterised by relatively small values of one or all of the test statistics (size, persistence, growth).By arguing that the recurrence is evidence for a within-host transmission advantage and the clade characteristics for a between-host disadvantage, these homoplasies are interpreted as transmission fitness polymorphisms (TFP) and evidence for multilevel selection in SARS-CoV-2.However, especially given the small absolute frequency of reoccurrence of most identified TFPs, this interpretive step is not straightforward and a bridge between result and interpretation is missing.I would therefore recommend the addition of either a validation study on either synthetic or empirical data (providing evidence that the identified mutations are really the result of a transient transmission advantage) or of a very well-constructed argument supported by appropriate literature.(This is the reason why I chose the response "Partly" to "Are the conclusions drawn adequately supported by the results?")

Data availability
Which target node conditions were used for the SARS-CoV-2 tree?How are the choices justified and how sensitive are the results to changes in these parameters?I do see in the code that the default values are min_descendants = 10, max_descendants = 20*10^3, min_cluster_age_yrs = 1/12, min_date = max_date = NULL, but I cannot find information regarding it in the manuscript.

2.
Why is caution only recommended starting from thresholds associated with an FDR of around 40%? 3.
If sister clades both qualify as target nodes, is the same (but reversed) comparison done twice and counted in the empirical distribution?

4.
Are some parts of the tree counted multiple times, e.g. if a parent and child node both qualify as target nodes with the same defining mutation?

5.
As demographic factors will still influence the clade characteristics, even in the presence of multilevel selection, how many independent samples are necessary, i.e. how often would the mutations have to be observed in the tree, to draw statistically reliable conclusions (see also Supplementary Figure S17 in van Drop et al. 2020)?What is empirically observed for the reported homoplasies?

6.
Do the identified homoplasies also appear in parts of the tree that are not characterised as imbalanced? 7.
How are multiple defining mutations in one clade handled? 8.

I would recommend expanding the Discussion in the following two points:
More thorough discussion of the limitations that are already listed (how does the method deal with the challenges described in the second paragraph of the Discussion) and additional ones that apply (e.g., phylogenetic uncertainty and reliable tree inference as potentially time-consuming preprocessing step, statistical power from only few recurrences)

○
Explanation for and discussion of the pronounced differences observed between the mlscluster and HyPhy results.I would recommend a more careful phrasing of the possible explanation of the observed TFP-homplasy pattern by accelerated within-host rates, as I see no ground justifying a connection presented in the study and too few references showing evidence of the connection and rate acceleration (see also preprint [1]. 10.

○
To my understanding, the conclusion statement is not supported by the study results (see also major comment 1), as they do not allow to draw direct conclusions about within-host vs. between-host evolution.I would recommend rephrasing this.Plain Language Summary: I would recommend using "clade" instead of "lineage" for consistency.
○ "…highlighting the existence of important tradeoffs in selection between intrahost replication and inter-host transmission": I would recommend rephrasing this or providing evidence that the identified homplasies are necessarily the product of multilevel selection (see major comment 1) ○ 2.

Introduction:
A literature reference is missing for the statement "In molecular epidemiological studies, a set of particularly scalable approaches have been developed based on the calculation of phylogenetic clusters comprising two or more closely related samples."Additionally, I would suggest replacing "calculation" with, for example, "identification", "detection" or similar.

○
Either a literature reference or a clearer argument is missing for the statement "Furthermore, there is considerable scope to improve on distance-based genetic clustering methods because such approaches will potentially have poor specificity for variants that negatively influence fitness."

○
In the sentence following the one cited in the previous point the reference to SARS-CoV-2 is duplicated.

○
A formal definition of "transmission fitness polymorphism" is missing.
○ "We demonstrated its applicability through the analysis of a representative >1.2 million SARS-CoV-2 genomic data set from England…" should, e.g., rather be phrased as "We demonstrated its applicability through the analysis of a representative SARS-CoV-2 data set comprising >1.2 million genome sequences from England…" ○ "By providing a comprehensive catalog of the main sites driving multilevel selective pressures throughout the SARS-CoV-2 genome, we also expand the understanding of [the] SARS-CoV-2 fitness landscape outside the well-studied spike protein".I would recommend rephrasing of this sentence, as (i) the identified sites are not the drivers of the selective pressures, rather the result of them and (ii) see major comment 1.

Methods
Usage of "node" and "clade" is inconsistent.I would recommend primarily using only one of the two terms and, if necessary, write "node" only to refer to the actual internal node and "clade" when referring to the whole subtree arising from said "node".
○ "Assume two clades u (target clade) and v (comparator/sister) organised in a timescaled tree t and sharing ancestry (i.e. the same defining mutations)."--> "Assume two clades u (target clade) and v (comparator/sister) organised in a time-scaled tree t and ○ 4.
sharing full ancestry (i.e. the same defining mutations).""The persistence time (given by a in Figure 2, is defined as.." --> "The persistence time (given by a in Figure 2), is defined as.." (closing bracket at wrong position) ○ "…which can be their sister clade (the clade sharing an immediate ancestor assuming bifurcating phylogenetic relationship) or against all other clades…" --> "…which can be their sister clade (the clade sharing an immediate ancestor assuming bifurcating phylogenetic relationship) or all other clades…" ○ Unclear if 'clade size' refers to S_uv or n_u ○ I would recommend explicitly stating that in this study it is assumed that a homplasy that arises in a subtree for which the statistics fall below a low quantile of the empirical distribution (see sentence "… for identifying especially convergently acquired mutations (homplasies) that are detrimental for transmission (within a low quantile of the probability distribution of at least one of the three statistics").

○
The observed distribution of each of the three statistics in the tree is formally not a probability distribution.I would recommend replacing this terminology.
○ "Given this efficient way to visit nodes of the tree and edge lengths, we can easily extract the parameters of interest (e.g. the time of the most recent common ancestor of each node, …" --> "Given this efficient way to visit nodes of the tree and edge, we can easily extract the parameters of interest (e.g. the time of the most recent common ancestor of each clade, …" ○ "Target nodes are extracted based on the following conditions …" The following needs rephrasing, as it currently reads as if only the node with the smallest number of descendants etc. is kept.
○ Definition of "sharing ancestry" as same defining mutations between sister clades and "defining mutation" in one sister clade is contradicting.

○
Which homoplasy annotations are used later in the text?
○ "We intended to make the method flexible by creating a parameter that specifies in how many percentiles the statistic should be splitted...": Isn't it rather a parameter that specifies which percentile to use (e.g.2%)?
○ I find the terminology "cluster threshold" confusing, as the output of the method is not primarily a clustering of parts of the tree into groups, but rather the identification of recurring mutations based on relative differences between any sister clades (passing the filtering conditions).
○ A literature reference is missing for the ML tree inference.
○ "We tested our approach using two COVID-

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility?Yes

Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Pathogen phylodynamics I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
(within and between-host).Therefore, mlscluster narrows down our search by making us start from the sites that are more likely to be TFPs and not considering every single site in the genome, which would probably be prohibitive.We are currently working on such methods and simulations to compare the accuracy of mlscluster vs these coalescent-based models, and will present such benchmarks in future work.We agree that the small absolute frequency of reoccurrence of most identified TFPs makes the interpretation weaker.But given that comprehensive analyses of SARS-CoV-2 fitness effects (doi: 10.1093/ve/veae026) point to a nearly neutral effect of synonymous mutations along the SARS-CoV-2 genome, we believe that our FDR of ~10% (that uses TFP detection in synonymous sites as a proxy for false detection) for threshold = 2% is reasonable enough for these sites to be considered under putative multilevel selection.These, in turn, will need further coalescent-based modelling or experimental studies to confirm the existence of multilevel selection effects.We included such explanation at the end of the 'Discussion'.Location of change in manuscript: R3#1 (Abstract, Plain Language Summary, Introduction, Discussion, Conclusions) Response: The default values of (i) 10, (ii) 20×10^-3 , and (iii) 1/12 (1 month) were chosen to avoid ratios being taken from small or unreliable clade sizes/persistence times, an unrealistically high number of viral generations, and very short timeframes.These values were selected based on previous experience with UK SARS-COV-2 analyses using a similar software developed by our research group (doi: 10.1016/j.ebiom.2023.104939).

Which target node conditions
Unfortunately, a systematic evaluation of how these parameters would influence the results was not performed, but is planned for future applications of this method.We have included such clarifications in the revised manuscript.Location of change in manuscript: R3#2

Why is caution only recommended starting from thresholds associated with an FDR of around 40%?
Response: We agree that this statement was not rigorous enough and changed it accordingly to recommend caution mainly starting at threshold = 5% that has an FDR ~ 20%.Location of change in manuscript: R3#3

If sister clades both qualify as target nodes, is the same (but reversed) comparison done twice and counted in the empirical distribution?
Response: Yes, reversed comparisons between sister clades are included in our empirical distribution without any controls to avoid counting them twice.While it is true that the ratios are reciprocals and a log-transformation could be used to make them symmetric, our objective is to aggregate transmission fitness for all target nodes (and their resulting defining mutations) against their sisters.This means we are interested in capturing all comparisons, regardless of whether they are reversed.Excluding one of the two ratios or transforming them would not align with our goal of aggregating of transmission fitness.Therefore, we include both ratios to ensure that all potential relationships are considered in our analysis.

Are some parts of the tree counted multiple times, e.g. if a parent and child node both qualify as target nodes with the same defining mutation?
Response: A child clade will have the mutations of the parent and their own.We only consider a defining mutation if it does happen in the target clade in a percentage > 75% across its tips while occurring in < 75% of the tips in the sister clade.Therefore, since the target and sister will carry the mutations of their parent, extracting the difference will remove such mutations and only keep the ones that are really defining, therefore not counting the parent mutations over and over.One of the reasons we do the alignmentaware artifact removal is because some tips in a clade could not have some mutations found in their parent called due to lack of sequencing coverage at that position, and that could make us call mutations that are not actually defining multiple times.

6.
As demographic factors will still influence the clade characteristics, even in the presence of multilevel selection, how many independent samples are necessary, i.e. how often would the mutations have to be observed in the tree, to draw statistically reliable conclusions (see also Supplementary Figure S14 in van Drop et al. 2020)?What is empirically observed for the reported homoplasies?Response: Sample density will certainly impact the sensitivity of the method to detect circulating TFPs.By definition, these variants will have low persistence and growth, and therefore will only be detectable when a sufficiently high sample density is achieved.A detailed (quantitative) understanding of detection thresholds will need to be carried out in future work, but in the current paper we have added some discussion of these issues.Location of change in manuscript: R3#6 7. Do the identified homoplasies also appear in parts of the tree that are not characterised as imbalanced?
Response: Thank you for such a great question.Yes, I compared the frequencies of all nonsynonymous homoplasies found in the tree (n=4692) against the detected TFP-homoplasies (cluster threshold=2%, n=543).This means that only 11.57% of all non-synonymous homoplasies are also detected as TFP-homoplasies.The median difference in frequency for homoplasies detected at both (as general homoplasy [i.e.non-TFP] and as TFP) is 4, mean=8.52,and IQR=13.Non-synonymous TFP-homoplasy median frequency is 2, mean=2.45,and IQR=0.5, while non-synonymous general homoplasy median frequency is 3, mean=6.73,and IQR=4.So this preliminary investigation suggests that the identified homoplasies also appear in parts of the tree not characterised as imbalanced (in this case above the 2% cluster threshold) roughly a median of 4 times, but the variation can be quite large.We will not include such results in the manuscript, but will think of ways of formally quantifying such differences in future work.

How are multiple defining mutations in one clade handled?
Response: The defining mutations of each clade are computed and the associated statistics (ratio of sizes, ratio of persistence time, and logistic growth rate) of the clade are attached to them, regardless of the clade having none (polytomy), one or multiple defining mutations.In the final data.frame, the mutations are 'individualised' to find the ones that happen independently in different parts of the tree that fall below the specified cluster threshold.
9. I would recommend expanding the Discussion in the following two points: 9.1.More thorough discussion of the limitations that are already listed (how does the method deal with the challenges described in the second paragraph of the Discussion) and additional ones that apply (e.g., phylogenetic uncertainty and reliable tree inference as potentially time-consuming preprocessing step, statistical power from only few recurrences) Response: Thank you very much for your recommendation.We expanded the existing discussion to address the limitations already listed, although sampling biases is more thoroughly discussed by the end of the discussion and changes in immunological landscape were not considered.The additional points were also incorporated into the discussion.Location of change in manuscript: R3#9.1

Explanation for and discussion of the pronounced differences observed between the mlscluster and HyPhy results.
Response: Thank you.This was also pointed out by Reviewer#2.The minimal overlap suggests that these methods are capturing different features / selective pressures, as expected.We included this clarification in the 'results' section.Location of change in manuscript: R3#9.1

To my understanding, the conclusion statement is not supported by the study results (see also major comment 1), as they do not allow to draw direct conclusions about within-host vs. between-host evolution. I would recommend rephrasing this.
Response: Thank you very much, we agree with the recommendation and rephrased it to make it clear that these sites are not mechanistically (using e.g.realistic simulations) proven to be under multilevel selection.Location of change in manuscript: R3#11.2Plain Language Summary: 12.1.I would recommend using "clade" instead of "lineage" for consistency.
Response: Thank you for the suggestion.We agree using "clade" is better for consistency and adjusted accordingly.We also rephrased the following to avoid repetition and provide Response: Thank you for noticing this.We included a reference to the most widely used distance-based method (HIV-TRACE) and to a paper that evaluates the limitations of this and other similar methods.We also replaced "calculation" with "detection".Location of change in manuscript: R3#13.1 13.2.Either a literature reference or a clearer argument is missing for the statement "Furthermore, there is considerable scope to improve on distance-based genetic clustering methods because such approaches will potentially have poor specificity for variants that negatively influence fitness." Response: We included a clarification on why these approaches are expected to present poor specificity for variants that negatively influence fitness.They were demonstrated to be systematically biased to detect variation in sampling rates instead of transmission rates.Location of change in manuscript: R3#13.2 13.3.In the sentence following the one cited in the previous point the reference to SARS-CoV-2 is duplicated.Response: We are not sure we understand this point.Actually, in this sentence: "During the past few years, positive and negative selection in SARS-CoV-2 have mainly been investigated using methods that rely on synonymous rate variation across sites/branches (26, 27) , and results from these approaches on SARS-CoV-2 comprehensive datasets are available for comparison (28)" reference 27 refers to the HyPhy method description paper and reference 28 to one of its applications to SARS-CoV-2 by the same author.

A formal definition of "transmission fitness polymorphism" is missing.
Response: We agree it was not defined clearly, so we included a definition in the 'introduction' section.Location of change in manuscript: R3#13.4Response: We have rephrased it as "potentially resulting from multilevel selective pressures".Location of change in manuscript: R3#13.6 Methods: 14.1.Usage of "node" and "clade" is inconsistent.I would recommend primarily using only one of the two terms and, if necessary, write "node" only to refer to the actual internal node and "clade" when referring to the whole subtree arising from said "node".
Response: We agree using only one of the terms avoids confusion, and therefore chose "clade".Then, all instances of "node" were replaced by "clade" and some minor edits were made to avoid repetition of "clade" too many times.Location of change in manuscript: R3#14.1 (in the first replacement) and throughout the manuscript Response: Thank you very much for noticing this.We have rephrased accordingly and believe the sentence implies the intended meaning now.Location of change in manuscript: R3#14.9 14.10.Definition of "sharing ancestry" as same defining mutations between sister clades and "defining mutation" in one sister clade is contradicting.
Response: We are not sure we completely understand this statement, but we believe this was accordingly addressed together with 14.2 above.The main idea is to compare clades carrying a specific defining mutation against their immediate sisters without the mutation.Location of change in manuscript: R3#14.10 14.11.Which homoplasy annotations are used later in the text?Response: Homoplasies are annotated in the text and figures using the standard protein:{ancestral_aminoacid}{aminoacid_coordinate}{mutated_aminoacid} notation to make them consistent and comparable with other studies.The specific annotations (e.g.regions of interest including RBD and NTD of spike) are mentioned when appropriate to illustrate important results, in table 1 ('Genomic region' column), and in the detailed CSV outputs of the package.
14.12."We intended to make the method flexible by creating a parameter that specifies in how many percentiles the statistic should be splitted...": Isn't it rather a parameter that specifies which percentile to use (e.g.2%)?
Response: 'quantile_choice' parameter specifies in how many percentiles the statistics should be splitted.For example, 'quantile_choice=1/100' (default) splits from 1% to 100% in intervals of 1%, 1/400 splits from 0.25% to 100% in intervals of 0.25%.The later allows more strict "cluster thresholds" to be used (e.g.0.25%).Each statistic has its own parameter to set the percentile to use ('quantile_threshold_ratio_sizes', 'quantile_threshold_ratio_persist_time', and 'quantile_threshold_logit_growth').Again, this option just makes the method more flexible, but it is recommended to consider the same threshold for all three statistics (which was the only scenario we tested).These options are properly documented in the R package, available using: help("run_diff_thresholds").
14.13.I find the terminology "cluster threshold" confusing, as the output of the method is not primarily a clustering of parts of the tree into groups, but rather the identification of recurring mutations based on relative differences between any sister clades (passing the filtering conditions).
Response: Thank you very much for raising this terminology opinion.We respect it and agree it is a bit confusing, but preferred to use this instead of more technical terms such as "quantile threshold", "empirical distribution threshold", etc.It was a topic of discussion during the paper preparation and both of us agreed "cluster threshold" would not be a perfect name but better than other terms we could think of.14.17.Alignment-aware artifact removal: Is it correct that this is only necessary if occurrences of "X" and "N" at each genome position are not randomly distributed over the tree?
Response: We think it should be performed regardless of whether their occurrence is randomly distributed over the tree or not.Since the method relies on extracting homoplasies along the tree, it will be highly prone to artifacts as demonstrated by our Omicron case scenario, in which we found a massive enrichment for TFPs (due to the known Omicron sequencing dropout issues) when not performing such sanity checks.

Poisson regression: definition of i and j is missing
Response: These are already defined."(...) i polymorphic third codon position sites (...)" (count of polymorphic sites at third codon position) and "(...) j is the total number of polymorphic sites at first and second positions (...)".

Definition of Y as TFP calls among the I polymorphic third codon position sites with >100 mutated sequences: Why 100?
Response: This was selected after retrieving the number of polymorphic synonymous sites at each codon position for a couple of thresholds of mutated sequences (ranging from 10 to 1000).N=100 represents ~0.1% of the ~1.7 million sequences analysed, so a threshold as high as 1000 or 500 was excluding too many legitimate polymorphic sites and a threshold as low as 10 or 50 could include artifacts or spurious polymorphic sites.We included a summarised version of this explanation in the revised manuscript.

Mahan Ghafari
University of Oxford, Oxford, UK Franceschi and Volz have developed a method for identifying mutations that have a transient selective advantage at the within-host level but are disadvantageous at the between-host level.Their method utilises three statistics based on the size, persistence time, and logistic growth rates of clades harbouring the target mutations.Their analysis of SARS-CoV-2 is an interesting one and their methodology appears sound.One external validation of their method is the identification of recurrent mutations found during chronic infections, including mutations outside Spike.I have a few comments:

Precedence:
The authors' claim of being the first to identify mutations with complex fitness trade-offs may benefit from further clarification and context regarding previous research in this area.However, I think their approach to identifying such mutations is particularly useful for large-scale datasets.
For instance, the work by Harari et al., which the authors also cited, found trade-offs between immune evasion (and viral replication) and transmissibility.They showed that certain recurrent within-host mutations in chronic infections do not appear at the between-host level, pointing to the trade-off between immune escape and transmissibility as the possible explanation.
More recently, Ghafari et al. [1] demonstrated that certain mutations, such as T1638I in ORF1ab, are recurrent in persistent infections but are mildly deleterious at the between-host level, while many other recurrent mutations in persistent infections are also highly advantageous at the between-host level.

Methods:
The assumption that artifacts ("X") would necessarily follow the majority allele at a given position is not immediately clear to me.Could the authors clarify what this assumption is based on or verify from base frequency files whether "X" is indeed the majority nucleotide in at least some cases?
A potential limitation of this approach is its focus on consensus sequences.A de novo mutation reaching over 50% frequency within a host takes time, and given the tight transmission bottleneck, minority variant transmission is unlikely.Thus, at least some beneficial within-host mutations might be missed.
It would be interesting to know if the authors have explored context-dependent selective advantage/disadvantage of TFPs, for example, whether certain mutations are only detrimental at the between-host level if they appear in one major lineage and not others.

Results:
It is still unclear to me whether the set cluster thresholds chosen for the analyses correspond to all three considered statistics, some of them, or at least one.
The significance of observing B.1.1.7 and AY.4.* enriched for TFP-homoplasies is not well explained.Should this be interpreted as these lineages being more likely to give rise to beneficial within-host mutations that are deleterious at the between-host level?If so, the biological explanation for such lineage-dependent effects needs some clarification.
Given that the study period extends only until mid-2022, it would be worth clarifying whether any second-generation BA.2 lineages, as well as BA.5 or XBB, were included in the analysis or not for interested readers.
The discordance between sites identified as under positive selection using mlscluster and the HyPhy-based approach is noteworthy.The minimal overlap suggests these methods are capturing different features.Some comments/clarification helps here.

Discussion:
Bloom and Neher's [2] recent investigation into the between-host fitness effect of SARS-CoV-2 mutations across all major lineages pre-and post-Omicron uses the number of independent appearances of a mutation on a global phylogeny as a measure of mutation fitness effects (not cluster size or their persistence times).Their approach also seems to work particularly well in capturing deleterious or nearly neutral mutations at the between-host level.I would be curious to see some discussion on how the methods compare with the author's approach in this paper.
The discussion on how this approach could be used to investigate the association between chronic infection, duration of infection, and the presence of TFPs is an interesting one.Given Ghafari et al.'s [1] findings that many mutations during persistent infections are also beneficial at the between-host level, would the authors expect that their approach cannot capture mutations that Reviewer Expertise: Virus evolution, phylogenetics, and epidemiology.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
majority nucleotide in at least some cases?A potential limitation of this approach is its focus on consensus sequences.A de novo mutation reaching over 50% frequency within a host takes time, and given the tight transmission bottleneck, minority variant transmission is unlikely.Thus, at least some beneficial within-host mutations might be missed.
Response: Thank you for pointing this out.We understand it was probably not accurately described before.What we actually did was to compute the proportion of the most frequent mutation at a given site across all sequences within the clade and add up the frequency of the "X" or "N" (undetermined bases) at that site (across all sequences in that same clade).
This assumption is based on the complete shared ancestry across sequences in that clade.In cases where "X" and "N" are the most frequently mutated characters at the given clade, we discard them to avoid the inclusion of artifacts.We added such clarification in 'Methods: Statistical analysis for identifying genomic regions enriched for TFPs ' and hope this is clear now.We agree the focus on consensus sequences brings limitations such as the one cited, so we included it in the 'discussion' with an appropriate reference.Location of change in manuscript: R2#3 (methods and discussion)

It would be interesting to know if the authors have explored context-dependent
selective advantage/disadvantage of TFPs, for example, whether certain mutations are only detrimental at the between-host level if they appear in one major lineage and not others.
Response: Thank you very much for such an interesting question.As pointed out by Reviewer 3, the statistical power to detect such relationships would be very limited given the lower frequency of TFP-homoplasies.We have not investigated these contextdependent relationships in depth because we think any conclusions from that would be highly speculative given these limitations.We included a statement in the 'discussion' (paragraph 3) mentioning this.We found some TFPs to occur in more than one major lineage and others to happen only in one, so probably the answer is yes, there are lineagespecific differences.Hopefully, analysis of genomic data from other viruses where multilevel selection is more common and some realistic coalescent-based simulations we are currently working on will help us to better understand these relationships.Location of change in manuscript: R2#4

Results:
5. It is still unclear to me whether the set cluster thresholds chosen for the analyses correspond to all three considered statistics, some of them, or at least one.
Response: We agree this was not very clear.We incorporated this sentence: "homoplasies occurring in a subtree for which at least one of the three statistics fall below a low quantile [e.g., 2%] of the empirical distribution" into the 'Methods: Tree-based clustering algorithm implementation' section to clarify that the analysis is performed for homoplasies and associated clades detected by at least one of the three statistics.Location of change in manuscript: R2#5 6.The significance of observing B.1.1.7 and AY.4.* enriched for TFP-homoplasies is not well explained.Should this be interpreted as these lineages being more likely to give rise to beneficial within-host mutations that are deleterious at the between-host expect that their approach cannot capture mutations that are both beneficial at the within-and between-host level?Response: Since our approach is especially designed to detect beneficial mutations at the within-host level, we do not expect that it directly captures mutations that are, at the same time, favourable for between-host replication.We included such statement in the discussion.Location of change in manuscript: R2#10 15.There seems to be an accidental text break in the following paragraph "… ORF1ab:T1543I also emerged after days of chronic infection" which seems unnecessary.
Response: Thank you very much for pointing this out.We have tried to fix it but there still seems to be a formatting issue there, so left a comment for the editorial team.Location of change in manuscript: R2#15 16. On figure 4, it would help to indicate which mutations belong to the intersection category (yellow).Are these S484 and S98? Response: These are S:A67V and S:E484K.Unfortunately, the available HyPhy analysis only contained the identified sites and not the actual replacements under positive selection.Consequently, the actual mutations and further annotations are presented in Table 1.We added this clarification in the legend of Figure 4. We believe the yellow color is sufficient to distinguish these sites in the 'intersection' category and these sites are mentioned in the 'TFPs along the SARS-CoV-2 genome and low concordance'.Location of change in manuscript: R2#16 17.The sentence "…it shows that the impact of very few mutations outside the S protein have been characterised experimentally" could be rephrased for clarity.The results highlight potential functionally important mutations outside the Spike protein that warrant further investigation rather than directly showing lack of experimental characterisation.
Response: Thank you for your suggestion.We have rephrased the sentence for clarity.We believe this revision more accurately conveys the intended meaning.Location of change in manuscript: R2#17 Competing Interests: No competing interests were disclosed.

Richard Neher
University of Basel, Basel, Switzerland Franchesci and Volz present a method to investigate the effects of specific mutations on viral fitness by comparing clades that carry a specific mutation to their immediate siblings without the mutation.The method relies on comparing the total size, the 'longevity', and the relative growth of the clades.
The paper is rather hard to read.It is, for example, not said explicitly that the method is specifically looking at homoplasies and aggregating information across many occurrences of the same mutation.Instead the discussion is kept general (target/comparator) when it would help a lot if the canonical scenario was explained.There are several other ways in which the readability of this paper could be improved.
The specific claim that the method allows to identify multi-level selection as opposed to simply aggregate transmission fitness is not well supported.

Specific points:
Introduction: -the first paragraph is oddly neutralist.For SARS-CoV-2 or the HA segment of influenza A, adaptation is common and we witness a repeated pattern of variant replacement.While it might still be technically true that the 'majority' of polymorphisms is neutral or weakly selected, this introduction strikes me as strange.Furthermore, discussing transient fitness advantage as an exception to the neutral majority is particularly weird given the rest of the article is about SC2 with frequent replacements driven by adaptive evolution.A useful reference in this context could be Kistler and Bedford [1]   -The motivation with multi-level selection in HIV is a bit strange and does not have much to do with what is discussed in the manuscript.It is unclear that different within/between host evolutionary rates in HIV are directly related to multi-level selection through preferential transmission of ancestral variants, or whether they are the result of frequent reversions of hostspecific adaptations, for example T-cell escape.Influenza virus would be a much better model for the SC2 analysis done here than HIV.Strelkowa and Laessig [2] might be useful context.
-the 2nd and 3rd paragraph of the introduction are rather cryptic.
-the idea of fitness inference from the shape of phylogenies was introduced much earlier, for example [3]   -the most comprehensive estimates of (mostly deleterious) fitness effects of mutations in SC2 to date were probably published in [4]   Methods: -throughout the manuscript, it remains unclear whether identification of sites with fitness effects is the primary result, or enrichment of such sites in major variants or regions.This makes the manuscript hard to follow.

Results
-the paragraphs in the second column of page 9 summarize counts and observations without putting them into context and it is unclear what the reader is supposed to take away from this discussion.
-I believe the functional significance of ORF7 and 8 is still not very well understood. Discussion: -it remains unclear to me why the approach presented here specifically reveals 'multi-level selection'.The approach picks up signals of clade growth and thus aggregate transmission rate in the population.The fact that most mutations are rare isn't evidence for a within-host/betweenhost trade-off.This is a natural consequence of a very large densely sampled pathogen with rapid exponentially growing outbreaks.
-the results would be more convincing if they would be quantitatively compared to inferences from other analyses or deep mutational scanning data.

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate?Yes Are all the source data underlying the results available to ensure full reproducibility?Yes

Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Figure 1 .
Figure 1.Schematic view of the tree-based clustering algorithm implementation and analytic pipeline.(A) Main notations, parameters, and respective statistic formulas that are computed by mlscluster (https://github.com/mrc-ide/mlscluster)for sister clades of the time-scaled phylogeny.(B) Analysis workflow with main steps from input data to TFP inference.

Figure 2 .
Figure 2. Spatiotemporal distribution of the SARS-CoV-2 sequences from England included in this study during the investigated period (June 2020 to April 2022).Main plot: Monthly-stratified frequency of the sequences stacked by major PANGO-lineage.Inset plot: Proportion of included sequenced cases across adm2 regions in England during the investigated period for 77% of the samples with unambiguous adm2-level assignments.

Figure 3 .
Figure 3. Frequency of SARS-CoV-2 TFP-homoplasies per genomic region considering all cluster thresholds and the more reliable threshold of 2%.(A) Count of TFP-homoplasic sites for all SARS-CoV-2 proteins across the 10 different cluster thresholds ranging from the more (0.25%) to the less stringent (25%).(B-E) Count of TFP-homoplasies per genomic region for two different time periods and considering threshold = 2%.(B) Non-normalised counts per lineage for timeframe pre-Omicron (June 2020 to mid-November 2021).(C) Normalised counts per lineage (divided by genomic size) for the same period as (B).(D) Non-normalised counts for the timeframe including Omicron (June 2020 to end of April 2022).(E) Normalised counts for the same period as (D).

Figure 4 .
Figure 4. TFP-homoplasy identification compared to sites identified as under positive selection.Sites are compared across different major lineages.(A) Comparison of top 30 identified sites under multilevel selection by our tree-based clustering approach for the whole-period (including Omicron) for cluster threshold = 2% against the HyPhy analysis 28 , also presenting concordant results (intersection) between both methods.(B) Bubble plot of TFP-homoplasy frequencies attributed to different major PANGO-lineages.The HyPhy analysis only contained the identified sites and not the underlying amino acid replacements.The actual mutations and further annotations are presented in Table1.

Figure 5 .
Figure 5. Frequency of identified TFP-homoplasies alongside genomic regions with major functional significance and normalised counts for cluster threshold = 2% and period including Omicron.(A) Spike.(B) Nucleocapsid.(C) ORF3a.TFPs are coloured by major PANGO-lineage and annotated if frequency > 2.
were used for the SARS-CoV-2 tree?How are the choices justified and how sensitive are the results to changes in these parameters?I do see in the code that the default values are min_descendants = 10, max_descendants = 20*10^3, min_cluster_age_yrs = 1/12, min_date = max_date = NULL, but I cannot find information regarding it in the manuscript.

10 .
I would recommend a more careful phrasing of the possible explanation of the observed TFP-homplasy pattern by accelerated within-host rates, as I see no ground justifying a connection presented in the study and too few references showing evidence of the connection and rate acceleration (see also preprint [1].Response: Thank you very much for the recommendation.We agree this point deserves further discussion and a more careful phrasing, and included a statement in the discussion citing the suggested paper, which indeed does a great systematic investigation of the within-host evolutionary rates in chronically infected SARS-CoV-2 patients.Location of change in manuscript: R3#10 Minor comments Abstract: 11.1."We applied this method for a SARS-CoV-2 time-calibrated phylogeny.." --> "We applied this method to a SARS-CoV-2 time-calibrated phylogeny.." Response: Fixed.Location of change in manuscript: R3#11.1

1 12. 2 . 2 Introduction: 13 . 1 .
additional clarification: "growth rates of clades carrying a specific mutation in comparison with their immediate sisters without the mutation".Location of change in manuscript: R3#12."…highlighting the existence of important tradeoffs in selection between intrahost replication and inter-host transmission": I would recommend rephrasing this or providing evidence that the identified homplasies are necessarily the product of multilevel selection (see major comment 1) Response: Similar to 11.2.Location of change in manuscript: R3#12.A literature reference is missing for the statement "In molecular epidemiological studies, a set of particularly scalable approaches have been developed based on the calculation of phylogenetic clusters comprising two or more closely related samples."Additionally, I would suggest replacing "calculation" with, for example, "identification", "detection" or similar.

13. 5 . 5 13. 6 .
"We demonstrated its applicability through the analysis of a representative >1.2 million SARS-CoV-2 genomic data set from England…" should, e.g., rather be phrased as "We demonstrated its applicability through the analysis of a representative SARS-CoV-2 data set comprising >1.2 million genome sequences from England…" Response: This sentence has been changed accordingly.Location of change in manuscript: R3#13."Byproviding a comprehensive catalog of the main sites driving multilevel selective pressures throughout the SARS-CoV-2 genome, we also expand the understanding of [the] SARS-CoV-2 fitness landscape outside the well-studied spike protein".I would recommend rephrasing of this sentence, as (i) the identified sites are not the drivers of the selective pressures, rather the result of them and (ii) see major comment 1.

14. 2 . 2 14. 3 . 3 14. 4 . 4 14. 5 . 5 14. 6 . 6 14. 7 . 7 14. 8 .
"Assume two clades u (target clade) and v (comparator/sister) organised in a time-scaled tree t and sharing ancestry (i.e. the same defining mutations)."--> "Assume two clades u (target clade) and v (comparator/sister) organised in a timescaled tree t and sharing full ancestry (i.e. the same defining mutations)."Response: I have added "full" as suggested, and also added the clarification that the ancestry is up to the point they diverge and therefore present their own exclusive defining mutations.Location of change in manuscript: R3#14."Thepersistence time (given by a in Figure2, is defined as.." --> "The persistence time (given by a in Figure2), is defined as.." (closing bracket at wrong position) Response: Fixed.Location of change in manuscript: R3#14."…whichcan be their sister clade (the clade sharing an immediate ancestor assuming bifurcating phylogenetic relationship) or against all other clades…" --> "…which can be their sister clade (the clade sharing an immediate ancestor assuming bifurcating phylogenetic relationship) or all other clades…" Response: Fixed.Location of change in manuscript: R3#14.Unclear if 'clade size' refers to S_uv or n_u Response: Thanks for noticing this.It referred to n_u and n_v and not to the ratio S_uv.We tried to rephrase this to ensure the meaning is not dubious.Location of change in manuscript: R3#14.I would recommend explicitly stating that in this study it is assumed that a homplasy that arises in a subtree for which the statistics fall below a low quantile of the empirical distribution (see sentence "… for identifying especially convergently acquired mutations (homplasies) that are detrimental for transmission (within a low quantile of the probability distribution of at least one of the three statistics").Response: We agree that the suggested change clarify the analytic methods used, making it easier to understand and technically more accurate.It is now incorporated into the manuscript.Location of change in manuscript: R3#14.The observed distribution of each of the three statistics in the tree is formally not a probability distribution.I would recommend replacing this terminology.Response: We agree and have fixed this throughout the manuscript.Location of change in manuscript: R3#14."Given this efficient way to visit nodes of the tree and edge lengths, we can easily extract the parameters of interest (e.g. the time of the most recent common ancestor of each node, …" --> "Given this efficient way to visit nodes of the tree and edge, we can easily extract the parameters of interest (e.g. the time of the most recent common ancestor of each clade, …" Response: Fixed.Location of change in manuscript: R3#14.8 14.9."Target nodes are extracted based on the following conditions …" The following needs rephrasing, as it currently reads as if only the node with the smallest number of descendants etc. is kept.

11 .
The authors have explained other caveats and advantages of their approach well in the context of existing literature.Response: Thank you for your positive feedback!Minor: 12. TFP-homoplasies should be defined before being mentioned in the abstract.Response: Thank you very much for noticing this.It is fixed.Location of change in manuscript: R2#12 13.The terms Upper and Lower Tier Local Areas need definitions before their abbreviations are used.Response: Thank you very much.It is addressed.Location of change in manuscript: R2#13 14.Consider changing "… up to July 2020 and has not considered persistence times" to "did not consider persistence times".Response: Fixed.Location of change in manuscript: R2#14

Reviewer Report 28
March 2024 https://doi.org/10.21956/wellcomeopenres.22912.r76372© 2024 Neher R.This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This project contains the following underlying data: • ExtendedData.pdfsupplementary text, figures, and table cited in the paper • ExtendedData_FileS1.txt -output of Poisson regressions across the ten employed cluster thresholds for the time period before Omicron BA.1.*emergence (early June 2020 to mid-November 2021) that tested whether TFPs were enriched in particular genomic regions or major PANGO lineages.
Definition of Y as TFP calls among the I polymorphic third codon position sites with >100 mutated sequences: Why 100?This independent analysis provides additional evidence that their evolution is consistent with a transient selective pressure.":Why is this the case?To my understanding, the manual inspection of the relatively frequent TFP homoplasies mainly shows that the method worked as expected here (i.e. the identified homplasies are recurrent and in 'small' clades) ○Alignment-aware artifact removal: Is it correct that this is only necessary if occurrences of "X" and "N" at each genome position are not randomly distributed over the tree?○ Poisson regression: definition of i and j is missing ○ ○ "○ Discussion "We have quantified transient selective forces acting on SARS-CoV-2 lineages and mutations…": I would recommend rephrasing this, as the study results do not represent a quantification of the selective forces.○ "After the emergence of VOCs with elevated substitution rates, other attributes…": I would recommend rephrasing this, since it reads as if VOCs have elevated evolutionary rates.○ "Genetic diversity in an infected individual is goverened by repeated cycles…" --> "Genetic diversity in infected individuals is goverened by repeated cycles…" ○ "Additionally, mutations identified in the top 100 most frequent TFP-homoplasies […] also emerged after days of chronic infection.":Since they emerged within days, were they shown to only emerge in chronically infected individuals and not others?○ 5. Conclusions: "We developed a method capable of identifying sites under multilevel selection...": I would recommend being more careful with the wording here, see major comment 1. ○ 6. References 1. Översti S, Gaul E, Jensen B, Kühnert D: Phylogenetic meta-analysis of chronic SARS-CoV-2 infections in immunocompromised patients shows no evidence of elevated evolutionary rates.bioRxiv.2023.Publisher Full Text

"Multiplication by 100 transforms the probability of erroneously calling a TFP into a percentage for easier interpretation": The FDR does not represent the probability of erroneously calling a TFP, I would recommend rephrasing this.
Response: Thanks for noticing this.You are correct.We rephrased it as "proportion of erroneous TFP calls".Location of change in manuscript: R3#14.20

This independent analysis provides additional evidence that their evolution is consistent with a transient selective pressure.": Why is this the case? To my understanding, the manual inspection of the relatively frequent TFP homoplasies mainly shows that the method worked as expected here (i.e. the identified homplasies are recurrent and in 'small' clades)
Response: Thank you for the suggestion.We agree and rephrased it to highlight only that it shows that the method works as expected and provided additional explanation ("indicated by the recurrence of such mutations and their appearance in clades where the size, longevity or growth rate of the target clade is much smaller when compared to its sister(s).")Location of change in manuscript: R3#14.

"Genetic diversity in an infected individual is goverened by repeated cycles…" --> "Genetic diversity in infected individuals is goverened by repeated cycles…
" Response: Fixed.Location of change in manuscript: R3#15.3 15.4.

"Additionally, mutations identified in the top 100 most frequent TFP-homoplasies […] also emerged after days of chronic infection.": Since they emerged within days, were they shown to only emerge in chronically infected individuals and not others?
Response: The studies cited investigated their occurrence specifically in chronically infected patients, and to the best of our knowledge these mutations were not demonstrated to be particularly prevalent at the population level (acute infections) in several lineages throughout the pandemic.
Reviewer Report 11 April 2024 https://doi.org/10.21956/wellcomeopenres.22912.r76365© 2024 Ghafari M. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.