Mining genomic repositories to further our knowledge of the extent of SARS-CoV-2 co-infections

Recombination events between Delta and Omicron SARS-CoV-2 lineages highlight the need for co-infection research. Existing studies focus on late-phase co-infections, with few examining earlier pandemic stages. This new study aims to identify and characterize co-infections globally, using a bioinformatic pipeline to analyse genomic data from diverse locations and pandemic phases. Among 26988 high-quality SARS-CoV-2 isolates from 11 diverse project databases, we identified 141 potential co-infection cases (0.52%), surpassing previous prevalence estimates. These co-infections were observed throughout the pandemic timeline, with an increase noted after the emergence of the Omicron variant. Co-infections involving the Omicron variant were the most prevalent, potentially influenced by the high level of diversity within this lineage and its impact on the viral landscape. Additionally, we found co-infections involving the pre-Alpha/Alpha lineages, which have rarely been described, raising the possibility that they contributed to the emergence of new lineages through recombination events. The analysis revealed co-infection cases involving both different and the same lineages/sublineages. Our study showcases the potential of our pipeline to leverage valuable information stored in global sequence repositories, advancing our understanding of SARS-CoV-2 co-infections. The prevalence of co-infections highlights the importance of monitoring viral diversity and its potential implications for disease dynamics. Integrating clinical data with genomic findings can shed further light on the clinical implications and outcomes of co-infections.


INTRODUCTION
The identification of recombination events between the Delta and Omicron (BA.1 and BA.2) lineages justifies the investigation of SARS-CoV-2 co-infections. However, most existing studies are restricted to co-infections involving these two divergent lineages and to late phases of the pandemic when they co-circulated [1][2][3][4]. Only a few studies have analysed co-infections at earlier stages of the pandemic [5][6][7].
Previous research has estimated the prevalence of SARS-CoV-2 co-infections involving the Delta and Omicron lineages at approximately 0.2 % [8]. In the population covered by our hospital in Madrid, Spain, our group found comparable co-infection frequencies throughout the entire pandemic. For that study [9], we applied a pipeline that we had recently designed, which enabled us not only to identify co-infections regardless of the strains/lineages involved (>7 SNP differences between them), but also to segregate the sequences involved for individual analysis.
The objective of this new study, applying the same pipeline as a tool, was to exploit sequences deposited in genomic repositories in order to make a global identification and characterization of co-infections in different geographical settings and at different phases of the COVID-19 pandemic.
The co-infection detection pipeline (https://github.com/MG-IiSGM/SARS_COVinfections) focuses on the single nucleotide polymorphisms (SNPs) identified using iVar from sequences with ≥90 % genome coverage and a depth of >30X. For the SNPs with heterozygous (HZ) calls (two alleles co-existing, with one of the alternative alleles at a frequency between 15 and 85 %), it calculates the mean proportion of major alleles (MHP), the standard deviation of the MHP (SHP), and the percentage of SNPs with a heterozygous proportion of the major allele found within the MHP ± (SHP + 1.5 %), to ensure a homogeneous distribution of frequencies in the SNPs with HZ calls. Co-infection candidates are chosen based on the number of SNPs with heterozygous calls (>7 positions), the MHP (<75 %), the mean SHP (≤0.08 %) and the percentage (≥70 %) of those SNPs with the major allele found within that MHP ± (SHP + 1.5 %). Our approach requires a minimum average proportion of the minor strain in a co-infection and checks for a consistent relative allele frequency across the positions with HZ calls.

Impact Statement
This study sheds light on the prevalence and characteristics of SARS-CoV-2 co-infections during different phases of the COVID-19 pandemic. By applying our specialized bioinformatics pipeline, this research exploits the potential of global genomic repositories to extend our knowledge of the complex landscape of SARS-CoV-2 co-infections. In contrast to previous studies that focused on specific lineages in certain populations during limited periods, our analysis covers a wide range of lineages, locations and time periods. The identification of 141 potential co-infection cases from a dataset of 26988 isolates surpassed previous estimates of co-infection prevalence. Our analytical pipeline also segregates the two co-infecting strains, something unprecedented in the literature on this topic, allowing for the precise description of the lineages involved. From our data, it can be deduced that co-infection occurred especially after the emergence of Omicron; however, it could also be found in preceding stages, involving not only strains from different lineages but also strains from the same one. Altogether, our findings suggest that recombination due to co-infection might have been a more common driver of SARS-CoV-2 diversification throughout the pandemic than previously expected.
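The candidate-selection criteria of the co-infection detection pipeline (number of HZ calls, MHP, SHP and the share of SNPs falling within MHP ± (SHP + 1.5 %)) can be illustrated with a minimal Python sketch. This is not the code of the published pipeline: the function names, the use of a population standard deviation, and the reading of the SHP threshold as 0.08 on a 0-1 proportion scale (the text states "≤ 0.08 %") are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Heterozygous (HZ) call window stated in the text: allele frequency 15-85 %.
HZ_MIN, HZ_MAX = 0.15, 0.85


def major_allele_proportions(alt_freqs):
    """Keep HZ calls only and return the major-allele proportion of each one."""
    return [max(f, 1.0 - f) for f in alt_freqs if HZ_MIN <= f <= HZ_MAX]


def is_coinfection_candidate(alt_freqs,
                             min_hz_exclusive=7,   # more than 7 HZ positions
                             max_mhp=0.75,         # MHP below 75 %
                             max_shp=0.08,         # ASSUMPTION: proportion scale
                             margin=0.015,         # the "+ 1.5 %" term
                             min_within=0.70):     # at least 70 % of HZ SNPs
    """Hypothetical re-implementation of the thresholds described in the text."""
    majors = major_allele_proportions(alt_freqs)
    if len(majors) <= min_hz_exclusive:
        return False
    mhp = mean(majors)        # mean proportion of major alleles
    shp = pstdev(majors)      # ASSUMPTION: population standard deviation
    window = shp + margin
    within = sum(abs(p - mhp) <= window for p in majors) / len(majors)
    return mhp < max_mhp and shp <= max_shp and within >= min_within


# Nine HZ positions with a minor strain at roughly 30 %: qualifies.
mixed = [0.30, 0.32, 0.28, 0.31, 0.29, 0.33, 0.30, 0.31, 0.70]
print(is_coinfection_candidate(mixed))
```

The homogeneity check is the part that separates a genuine mixture of two strains, where all differential positions share a similar allele ratio, from scattered low-frequency variants whose frequencies differ from site to site.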

RESULTS AND DISCUSSION
We selected 11 different SARS-CoV-2 projects from the ENA database (Table 1). Taking into account the frequencies of SARS-CoV-2 co-infection in the literature, we ruled out projects with fewer than 5000 sequences obtained with Illumina. In our selection, we considered different time windows for the collection dates of the sequenced specimens to ensure comprehensive coverage over the entire timeline of the pandemic and different geographical settings. Our selection included five projects from Europe, four from the USA, one from Africa and one from Australia. To keep the analysis computationally feasible and to ensure an equitable representation of sequences from each month throughout the pandemic timeline (from March 2020 to March 2023), we randomly selected 1100 sequences per month from the pooled sequences of all projects for that month (Table 1). In total, 40700 sequences were downloaded, with each project contributing approximately 9 % of the total; 26988 sequences (66 %) met the quality threshold, with genome coverage ≥90 % and >30X depth. The proportion of high-quality sequences was not consistent across projects. Some projects contained a higher number of low-quality deposited sequences, which limited the sample size available for our analysis. Notably, in the project from the UK, a mere 9.34 % of the sequences met the predefined quality criteria. This inconsistency can be attributed primarily to the lack of stringent quality standards governing the submission of sequences to the ENA database. The high-quality sequences were processed using our pipeline [9].
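The month-stratified random subsampling and the quality filter described above can be sketched as follows. This is an illustrative sketch only: the record layout (accession plus a `YYYY-MM` collection month), the function names and the fixed random seed are assumptions, and the source does not describe how the sampling was actually implemented.

```python
import random
from collections import defaultdict


def months_inclusive(start_year, start_month, end_year, end_month):
    """Number of calendar months in a window, counting both end points."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


def subsample_per_month(records, per_month=1100, seed=0):
    """Randomly pick up to `per_month` accessions for each collection month.

    `records` is an iterable of (accession, "YYYY-MM") pairs pooled from all
    projects (hypothetical layout).
    """
    by_month = defaultdict(list)
    for accession, month in records:
        by_month[month].append(accession)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return {
        month: rng.sample(pool, min(per_month, len(pool)))
        for month, pool in sorted(by_month.items())
    }


def passes_quality(coverage_fraction, mean_depth):
    """Quality threshold from the text: coverage >= 90 % and depth > 30X."""
    return coverage_fraction >= 0.90 and mean_depth > 30


# March 2020 to March 2023 spans 37 months; 37 x 1100 matches the 40700
# downloaded sequences reported in the text.
n_months = months_inclusive(2020, 3, 2023, 3)
print(n_months, n_months * 1100)
```

Sampling per month rather than per project keeps the temporal coverage even, at the cost of leaving some project and period combinations thinly represented.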
Overall, 141 potential co-infection cases (0.52 %) were identified (Table 2) across various countries (Table 1) and for most of the months during the pandemic (Fig. 1). Notably, no co-infections were detected in the Australian project. The sequences from this project mainly corresponded to the early stages of the pandemic (March 2020 to October 2020), when low diversity is expected among the circulating variants, which makes the identification of co-infections difficult. In addition, because the sequences were randomly subsampled, we cannot fully rule out co-infections in the periods or projects in which we did not identify them.
The co-infection frequency was higher than reported elsewhere [8,9], although the frequencies of individual projects ranged widely (0.15-2.03 %) (Table 1). The highest co-infection rates were found in the Spanish and South African projects (1.04 and 2.03 %, respectively). These higher rates could initially be attributed to the longer time covered by these projects, which encompass sequences collected throughout almost the entire pandemic timeline: longer sampling periods offer more opportunities to detect co-infections. Nevertheless, in other projects with equivalent time coverage, co-infection percentages were lower (ranging from 0.15 to 0.67 %), suggesting that factors other than time coverage contribute to the differences in co-infection proportions.
As our pipeline allows for segregation of the sequences involved in each co-infection, we first determined the lineages involved in each case. Segregation of the two strains involved in a co-infection was confidently performed in 124 (88 %) of the co-infection cases, those in which mean heterozygosity was between 60 and 75 % (Table 2). Fifty-five percent of the co-infections involved sequences from the same lineage/sublineage, while the remaining cases involved different lineages/sublineages (Table 2). In the first group, we identified 20, 11 and 46 co-infections involving the same pre-Alpha/Alpha, Delta and Omicron lineages, respectively. In the second group, we observed 17 co-infections involving two distinct pre-Alpha or Alpha sublineages, three involving pre-Alpha/Alpha and Delta, ten involving two distinct Delta lineages/sublineages and 34 involving two different Omicron lineages/sublineages (Table 2). Most of the co-infections involved Omicron variants (57 %, 80/141 cases). Co-infections involving Delta and Omicron lineages were not detected in our analysis. However, their absence from this study does not imply that they did not occur or that they cannot be detected, as we have previously identified such cases with this pipeline [9].
The total number of HZ calls observed in the co-infections identified ranged from 8 to 71. As expected, cases with a higher number of HZ calls (>17) corresponded to co-infections involving different lineages/sublineages (41 of 48). Conversely, those with a lower number of HZ calls (8-17) mainly corresponded to co-infections within the same lineage or sublineage.
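As a consistency check on the figures quoted above, the lineage tallies can be added up and converted to percentages. The short Python sketch below uses only counts stated in the text; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Counts of co-infections as reported in the text.
same_lineage = {"pre-Alpha/Alpha": 20, "Delta": 11, "Omicron": 46}
different_lineage = {
    "pre-Alpha/Alpha sublineages": 17,
    "pre-Alpha/Alpha with Delta": 3,
    "Delta lineages/sublineages": 10,
    "Omicron lineages/sublineages": 34,
}

n_same = sum(same_lineage.values())
n_diff = sum(different_lineage.values())
total = n_same + n_diff
n_omicron = same_lineage["Omicron"] + different_lineage["Omicron lineages/sublineages"]


def pct(part, whole, digits=0):
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


print(total)                      # all co-infection cases
print(pct(n_same, total))         # same lineage/sublineage
print(pct(n_omicron, total))      # cases involving Omicron
print(pct(124, total))            # cases segregated with confidence
print(pct(total, 26988, 2))       # prevalence among high-quality sequences
```

The two groups sum to the 141 reported cases, and the derived percentages agree with the rounded values given in the text (55 %, 57 %, 88 % and 0.52 %).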
In our previous study performed in Madrid [9], we identified a proportion of co-infection candidates that were likely due to the co-existence of variants emerging in immunosuppressed patients or to laboratory contamination. However, in the current study, the unavailability of comprehensive epidemiological, laboratory and clinical data and the impossibility of resequencing samples hinder the replication of some of the validation procedures applied in our prior work. Nonetheless, we evaluated the consistency between the lineages involved in each of the co-infection candidates and the lineages circulating at the time of each diagnosis. In all cases, the lineages involved in the co-infections corresponded to the expected circulating variants (Fig. 1).
Furthermore, the detection of low-frequency recombined reads (alleles from each of the co-infecting variants lying in the same single short read) would demonstrate the simultaneous presence of both strains and their concurrent viral replication, ruling out a mere co-existence of sequences due to laboratory cross-contamination. To perform this analysis, we first identified the cases in which some of the differential SNPs for the co-infecting strains were located no more than 150 bp apart (the length of Illumina short reads). Among the 141 co-infection cases, we identified such closely spaced SNPs in 59, which could allow us to detect potential recombination within a single short read. Recombined low-frequency (average frequency 3.6 %) single reads were observed in 43 of these (73 %). These findings indicate that recombination was found in a high proportion of the co-infection cases in which the distribution of SNPs for the co-infecting strains allows this analysis to be performed. Whenever recombination is detected, cross-contamination can be fully ruled out.
The robust validation procedures applied in our previous study to distinguish co-infections from potential cross-contamination could not be replicated in the current study. However, the epidemiological consistency and the detection of low-frequency recombined reads argue against a significant inflation of the estimated rates, reinforcing the validity of the co-infection rates obtained in this study.
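The recombined-read analysis described above rests on two steps: finding differential SNPs close enough to be covered by one short read, and checking whether a read carries alleles from both strains. The Python sketch below illustrates both under stated assumptions; the function names, the dictionary-based representation of reads and strains, and the choice to compare only consecutive positions are hypothetical and are not taken from the published pipeline.

```python
def close_snp_pairs(positions, read_length=150):
    """Pairs of consecutive differential SNP positions that fit in one read."""
    ordered = sorted(set(positions))
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a < read_length]


def classify_read(read_alleles, strain_a, strain_b):
    """Label a read by the strain(s) its alleles at differential SNPs match.

    All three arguments map genome position -> base (hypothetical layout).
    """
    origins = []
    for pos, base in read_alleles.items():
        if strain_a[pos] == strain_b[pos]:
            continue                      # not a differential position
        if base == strain_a[pos]:
            origins.append("A")
        elif base == strain_b[pos]:
            origins.append("B")
        else:
            return "uninformative"        # matches neither strain
    if len(origins) < 2:
        return "uninformative"            # needs two differential SNPs
    if len(set(origins)) == 1:
        return "strain " + origins[0]
    return "recombinant"                  # alleles from both strains


strain_a = {100: "A", 220: "C"}
strain_b = {100: "G", 220: "T"}
print(close_snp_pairs([100, 220, 900, 1000, 5000]))
print(classify_read({100: "A", 220: "T"}, strain_a, strain_b))
```

In this framing, a case is eligible for the analysis only if `close_snp_pairs` returns at least one pair, which corresponds to the 59 of 141 cases reported in the text.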
While our study primarily focused on the identification and segregation of SARS-CoV-2 co-infections for analysis, it would also be useful to consider their clinical implications. Analysing clinical data in conjunction with genomic data could offer deeper insights into the impact of co-infections on disease severity, treatment response and patient outcomes [10,11]. Future studies should integrate clinical metadata for a comprehensive understanding of co-infections.
Co-infections involving the Omicron variant were the most prevalent, possibly associated with the high level of diversity within this lineage and its potential impact on the viral landscape. In addition, we identified co-infections involving the pre-Alpha/Alpha lineages, which have been only sporadically described and may also have contributed to the emergence of new lineages through recombination events. Our analysis revealed co-infection cases involving not only different lineages/sublineages but also strains from the same lineage. This study demonstrates the potential of our pipeline to readily exploit the valuable information stored in global sequence repositories and advances our understanding of SARS-CoV-2 co-infections. Integrating clinical data with genomic findings could provide valuable information on the clinical implications and outcomes of co-infections.
The selected projects and the collection windows of their sequences were as follows: five from Europe, namely Portugal (March 2020 to March 2023), Spain (April 2020 to November 2022), Switzerland (March 2020 to May 2021), Estonia (January 2021 to April 2022) and the UK (February 2020 to March 2023); four from the USA, namely Baltimore (December 2020 to March 2023), New York (September 2020 to April 2023), San Francisco (March 2020 to May 2021) and Utah (March 2020 to May 2021); one from Africa, namely South Africa (January 2020 to March 2023); and one from Australia, namely Victoria (March 2020 to October 2020).

Fig. 1. Distribution of the co-infections identified throughout the pandemic period, specifying the lineages involved. The dominant circulating lineage for each period is indicated in the upper section.

Table 1. SARS-CoV-2 co-infection cases identified from ENA repositories and general features of the projects selected for analysis