SARS-CoV-2 Recombination and Coinfection Events Identified in Clinical Samples in Russia

Recombination is one of the mechanisms of SARS-CoV-2 evolution along with the occurrence of point mutations, insertions, and deletions. Recently, recombinant variants of SARS-CoV-2 have been registered in different countries, and some of them have become circulating forms. In this work, we performed screening of SARS-CoV-2 genomic sequences to identify recombination events and co-infections with various strains of the SARS-CoV-2 virus detected in Russia from February 2020 to March 2022. The study included 9336 genomes of the COVID-19 pathogen obtained as a result of high-throughput sequencing on the Illumina platform. For data analysis, we used an algorithm developed by our group that can identify viral recombination variants and cases of co-infections by estimating the frequencies of characteristic substitutions in raw read alignment files and VCF files. The detected cases of recombination were confirmed by alternative sequencing methods, principal component analysis, and phylogenetic analysis. The suggested approach allowed for the identification of recombinant variants of strains BA.1 and BA.2, among which a new recombinant variant was identified, as well as a previously discovered one. The results obtained are the first evidence of the spread of recombinant variants of SARS-CoV-2 in Russia. In addition to cases of recombination we identified cases of coinfection: eight of them contained the genome of the Omicron line as one of the variants, six of them the genome of the Delta line, and two with the genome of the Alpha line.


Introduction
The SARS-CoV-2 virus was first detected in China in December 2019 and has since spread around the world. During the pandemic, a system for rapid, high-throughput sequencing of the virus genomes was established, which provided a significant amount of data for analysis by the scientific community. In particular, as of 23 June 2022, the GISAID database contained more than 11.5 million virus genomes with accompanying metadata. Analysis of the genomic variability of pathogens that pose a threat to health makes it possible to trace the ways of infection transmission, allows us to study the molecular mechanisms of the manifestation of pathologies, and also contributes to the development of more effective vaccines and specific diagnostic tests.
It is known that the main mechanism of SARS-CoV-2 evolution is the occurrence of point mutations and insertions/deletions; however, recombination mechanisms contribute to a significant extent to the evolution of coronaviruses [1]. Recombination events result in chimeric genotypes of two or more viral variants infecting the same cell. Thus, new unique variants of genomes appear, containing combinations of mutations from various strains. As a result of this process, viruses with altered indicators of both transmissibility and virulence may appear.

Database of SARS-CoV-2 Nucleotide Variants
A database of nucleotide variants (single nucleotide variants and insertions/deletions) used for viruses' classification and identification of recombinant variants was created based on various SARS-CoV-2 virus strains sequencing data according to the algorithm described below. From the entire variety of SARS-CoV-2 genomes downloaded from the GISAID database (11,692, 140 genomes available on 5 July 2022) using an average of 1000 genomes per SARS-CoV-2 genetic lineage randomly selected, the overall genomes of 1735 lineages were analyzed. If the number of genomes for a genetic line was less than 1000, then all genomes were included in further analysis. For lines with less than 20 genomes, the analysis was not performed. The genome sequences for each of the lines were saved in separate fasta files. The resulting files contained 679,833 genome sequences for 1451 (84% of the original value) SARS-CoV-2 genetic lineage. For the sequences of each of the fasta files, multiple alignments were constructed with the SARS-CoV-2 reference genome (MN908947.3) using the MAFFT program (version v7.490) and the --keeplength and --addfragments options. The resulting multiple alignment files were processed using standard Python language tools and the pandas library. For each of the lines under consideration, positions different from the reference genome were identified, and the proportions of alternative alleles among the sequences belonging to the same line were calculated. The collected data were saved as a table in a TSV file (Table S1).

Samples Collection
Nasopharyngeal swabs from patients with confirmed COVID-19 infection were collected from 3 February 2020 to 26 March 2022 by Moscow clinical diagnostic laboratories, with subsequent transfer to the PCR laboratory of the Centre for Strategic Planning of FMBA of Russia for further whole genome sequencing. The total number of analyzed samples was 9336. All samples were collected in a transport medium (Disposable Virus Collection Tubes, CDVCT-1, and CDRICH) suitable for storage of the SARS-CoV-2 virus, transferred to the laboratory within 48 h after collection, and stored at −80°C before RNA isolation. Transportation and storage were in accordance with local regulations for the handling of biologically hazardous specimens. For each sample, the following information was stored: a unique identifier, the date the sample was collected, and the geographic region of origin. The samples included in this study were collected in Moscow and the Moscow region.

SARS-CoV-2 Whole Genome Sequencing
The presence of SARS-CoV-2 RNA in the collected biological sample was confirmed using the AmpliTest SARS-CoV-2 test kit (Centre for Strategic Planning of FMBA of Russia, Moscow, Russia), with further identification of SARS-CoV-2 variants using the AmpliTest SARS-CoV-2 VOC PCR assay (Centre for Strategic Planning of FMBA of Russia, Moscow, Russia) according to the instructions. RNA from positive samples containing Omicron strains was isolated using a Ribo-prep purification kit (Centre for Strategic Planning of FMBA of Russia, Moscow, Russia) and reverse transcribed using an Ampliseq cDNA synthesis kit for Illumina (Illumina; San Diego, CA, USA). The resulting cDNA was amplified using the AmpliSeq for Illumina SARS-CoV-2 research panel (Illumina), which contains 247 amplicons in 2 pools targeting the entire SARS-CoV-2 genome. Library preparation was performed using the AmpliSeq Library PLUS kit (Illumina). Library quality was assessed by capillary electrophoresis using an Agilent 2100 Bioanalyzer system (Agilent; Santa Clara, CA, USA). Library concentration was measured with a Qubit 4 fluorometer (Thermo Fisher Scientific; Waltham, MA, USA) using a Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA). Sequencing was performed on an Illumina NextSeq 550 system with a NextSeq 500/550 Mid Output Kit v2.5 (300 cycles) (Illumina). Manufacturers' recommendations were followed in all cases. The median genome coverage of sequencing reads was 99.68%, with a median coverage per nucleotide of 3851.
The reads were mapped to the reference Wuhan-Hu-1 genome (MN908947.3) using the bwa mem program [10]. BAM files were generated using the samtools program [11]. The BAM files were analyzed using the Tablet program [12].

Consensus Sequences and Genomic Variant Identification
Sequencing data were generated for a total of 9336 samples. These sequences were then processed according to the algorithm described below. Raw reads were filtered (minimum mean quality >30 and length >50), and adaptors were cut using tool cutadapt (version 2.10) [13] to remove adapter and primer sequences. Low-quality ends were trimmed with prinseq-lite (version 0.20.4) [14]. The truncated reads were mapped to the reference Wuhan-Hu-1 genome (MN908947.3) using the bwa mem program [10]. BAM files were generated using the samtools program [11]. Single nucleotide variations (SNVs) and short insertions and deletions (InDels) were detected using the GATK (version 4.1.9.0) [15] program, which generates .vcf files. SNVs, insertions, and deletions were considered consistent if the MQ value was greater than 50 and the DP value was greater than 1000. Regions that were covered by less than 10 reads were masked as N. Consensus genomes were generated by GATK FastaAlternateReferenceMaker using the resulting vcf files. Sequences with more than 4000 nucleotides marked N were removed from further analysis, resulting in a total of 8961 sequences and their .vcf and .bam files.

Data Preparation and Phylogenetic Analysis
For phylogenetic analysis, a dataset of representative genomes containing 2908 complete genomes of various SARS-CoV-2 lineages was downloaded from the GISAID database (2 August 2022). From the downloaded file, a subset of genomes of the BA.1* and BA.2* lineages were formed, containing 1499 sequences. Three sequences of the complete genomes of the detected recombinants were added to the created subset. The complete genome of the bat coronavirus RaTG13 (genbank: MN996532.2) was used as an outgroup for phylogenetic analysis; the entire genome of Wuhan SARS-CoV-2 (hCoV-19/Wuhan/Hu-1/2019, genbank: MN908947) was used as a control genome in the analysis. In all obtained sequences, some parts of the genome were pre-masked with N due to their insufficient coverage in sequenced samples (

Principal Component Analysis
Principal component analysis (PCA) was carried out as follows:

•
Using the VCF files, a table was assembled containing information on the substitutions found in each of the samples; • Filtering of substitutions in low-coverage regions (<10 reads) of the genome was carried out; • We removed all substitutions that were not in the reference database, as well as substitutions whose occurrence count was less than 10; • For each sample, a binary replacement vector was created (each element of the vector corresponds to a certain replacement. The value of the element is 1 if the replacement is present in the sample and 0 if it is absent); • Using the obtained vectors, the principal component analysis was carried out and implemented in the scikit-learn library of the Python language; • Visualization of the result was performed using the seaborn library.

Algorithm for the Detection of Recombinant Genomes and Co-infections
To search for potential recombinant viral genomes and co-infection cases, we performed an analytical method based on the analysis of sequencing reads aligned on a reference genome and used a database of SARS-CoV-2 genome-wide sequence data and genetic lineage classification. A detailed description of the method is described in the Supplementary Materials.

SARS-CoV-2 Nucleotide Variations Database
A database of nucleotide variants degerming various SARS-CoV-2 phylogenetic lineages was created using representative genomes from the GISAID database. The database contains 679,833 genomes representing 1451 SARS-CoV-2 genetic lineages, downloaded from the GISAID database on 5 July 2022. The database of SARS-CoV-2 nucleotide variations contains information on 4526 unique substitutions with a frequency higher than 50% in 1451 SARS-CoV-2 genomes. The median number of mutations in a lineage was 16, and the average was 22.6. The frequency of substitutions in the SARS-CoV-2 genome, calculated using the created database, is shown below ( Figure 1). The variability of various viral genome regions was studied. According to Figure 1, the major nucleotide substitutions are focused around coordinates 10,000-12,000 (ORF1ab gene), 21,000-24,000 (Spike gene), 26,000-27,000 (genes E and M), and 28,000-29,000 (genes ORF9 and N). Analysis of nucleotide substitution frequencies in various parts of the genome allows for the suggestion that SARS-CoV-2 recombinant variants might be  The variability of various viral genome regions was studied. According to Figure 1, the major nucleotide substitutions are focused around coordinates 10,000-12,000 (ORF1ab gene), 21,000-24,000 (Spike gene), 26,000-27,000 (genes E and M), and 28,000-29,000 (genes ORF9 and N). Analysis of nucleotide substitution frequencies in various parts of the genome allows for the suggestion that SARS-CoV-2 recombinant variants might be more likely to be detected if recombination happens closer to the 3 -end of the genome, because in this case a larger number of SNVs that distinguish viral strains can be analyzed.

Analysis of Consensus Genomic Sequences
The analysis of consensus genomic sequences of potential recombinant variants of SARS-CoV-2 was performed using the Nextclade program [18]. The sequences downloaded from the GISAID database (BA.1: EPI_ISL_14153781, EPI_ISL_14157328, EPI_ISL_14157805; BA.2: EPI_ISL_14189751, EPI_ISL_14189762, EPI_ISL_14195) were used as reference genomes of the BA.1 and BA.2 lineages. Genome sequences with infor-  As a result of the analysis, samples LQ-24660 and LQ-24661 were identified as XAD recombinants: the part of the genome up to position 24,503 corresponds to the BA.2 lineage sequence, while the sequence after position 26,060 has substitutions characteristic of BA.1. It is interesting that LQ-24283 was not annotated as recombinant by the Nextclade program, but the structure of its genome also allows us to distinguish two regions belonging to different lineages. The part of the genome from position 1 to position 24,503 is similar to that of the BA.1 lineage, while from position 26,060, the SNV profile is similar to that of the BA.2 lineage.
A comparative analysis of the found recombinant sequences was carried out with the previously described recombinant genomes described in the work of Focosi et al. [9] (Figure 4). It is interesting that LQ-24283 was not annotated as recombinant by the Nextclade program, but the structure of its genome also allows us to distinguish two regions belonging to different lineages. The part of the genome from position 1 to position 24,503 is similar to that of the BA.1 lineage, while from position 26,060, the SNV profile is similar to that of the BA.2 lineage.
A comparative analysis of the found recombinant sequences was carried out with the previously described recombinant genomes described in the work of Focosi et al. [9] ( Figure 4).
ing to different lineages. The part of the genome from position 1 to position 24,503 ilar to that of the BA.1 lineage, while from position 26,060, the SNV profile is sim that of the BA.2 lineage.
A comparative analysis of the found recombinant sequences was carried out w previously described recombinant genomes described in the work of Focosi et al.  As can be seen from the figure above, the LQ-24660 and LQ-24661 genomes are more similar to the previously discovered recombinant sequences of the XAE lines, which does not coincide with the annotation of the Nextclade program, using which these samples were assigned to the XAD lineage. At the same time, no identical recombinant genomes were found for sample LQ-24283.

Analysis of Short Read Alignment Files
To confirm recombination sites and the absence of a mixture of strains in the samples mentioned above, analysis of aligned reads at positions distinguishing variants BA.1 and BA.2 was performed according to the data obtained from the analysis of the collected database ( Figure 5).
The results of the analysis of the proportion of the sequence reads in each of the positions that allows us to distinguish BA.1 and BA.2 lineages demonstrate the absence of a mixture of several genetic variants in the sample and also allow us to determine that the recombination points in the genomes are located between positions 24,503 and 26,060 (ORF3a, M genes). It is interesting to note that sample LQ-24283 was assigned by the Pangolin and Nextclade programs to variant BA.1, as well as LQ-24660 and LQ-24661, which, according to our analysis, are products of recombination between BA.1 and BA.2 genomes. The list of substitutions in the identified recombinant genomes is provided in the Supplementary Materials (Table S3). The results of the analysis of the proportion of the sequence reads in each of the positions that allows us to distinguish BA.1 and BA.2 lineages demonstrate the absence of a mixture of several genetic variants in the sample and also allow us to determine that the recombination points in the genomes are located between positions 24,503 and 26,060 (ORF3a, M genes). It is interesting to note that sample LQ-24283 was assigned by the Pangolin and Nextclade programs to variant BA.1, as well as LQ-24660 and LQ-24661, which, according to our analysis, are products of recombination between BA.1 and BA.2 genomes. The list of substitutions in the identified recombinant genomes is provided in the Supplementary Materials (Table S3).  (Table S2). Also, three alternative primer pairs shifted relative to the main one by several nucleotides were used for cross-validation of the results obtained ( Figure 6). quencing method. A pair of primers was selected to enrich the extended genome region from position 23791 to position 26712 (length 2921 bp), covering positions 24130, 24503, 26060, and 26530, located on both sides of the putative recombination point and enabling the differentiation of options BA.1 and BA.2 (Table S2). Also, three alternative primer pairs shifted relative to the main one by several nucleotides were used for cross-validation of the results obtained( Figure 6).  Figure 6B,C), which correspond to the BA.2/BA.1 recombination. Thus, using the selected primer pairs, an extended 2921-bp genome region was amplified. for putative recombinant samples, which was subsequently sequenced using the Oxford Nanopore system and confirmed the presence of recombinant genotypes BA.1/BA.2 and BA.2/BA.1 in the studied isolates.

Phylogenetic Analysis
Using the detected recombinant samples and a representative subset of the genomes of the BA.1* and BA.2* SARS-CoV-2 virus lines from the GISAID database (1499 genomes, uploaded on 8 February 2022), a phylogenetic analysis was carried out, the purpose of which was to establish the nearest related genomes for the studied recombinant isolates (LQ-24283, LQ-24660, and LQ-24661). To construct a phylogenetic tree, regions of the genome with low-coverage were preliminarily masked. The resulting tree is shown below (Figure 7).  (Table S4).
Phylogenetic analysis allowed not only to identify the position of studied recombinant isolates on the phylogenetic tree but also to identify a group of candidates missannotated recombinant genomes in the GISAID database and prove it using Nextclade classification.

Detection of Co-infections in Sequencing Data
Recombination of SARS-CoV-2 genomes can occur when the host is infected with multiple viral strains. Thus, tracking co-infections is an important task for predicting recombination events and their frequency. In the presented work, we screened 9336 sequenced genomes to detect cases of coinfection among patients in Moscow and the Moscow region.  (Table S4).
Phylogenetic analysis allowed not only to identify the position of studied recombinant isolates on the phylogenetic tree but also to identify a group of candidates missannotated recombinant genomes in the GISAID database and prove it using Nextclade classification.

Detection of Co-infections in Sequencing Data
Recombination of SARS-CoV-2 genomes can occur when the host is infected with multiple viral strains. Thus, tracking co-infections is an important task for predicting recombination events and their frequency. In the presented work, we screened 9336 sequenced genomes to detect cases of coinfection among patients in Moscow and the Moscow region.
To search for cases of coinfection, we developed an algorithm that allowed us to compare nucleotide variants with reference genomes and determine whether a sample belongs to a group of possible co-infections based on the analysis of vcf files. As a result of the screening of sequenced samples, 23 cases of coinfection were identified. Subsequent analysis of histograms of allele frequencies, as well as information on the sample preparation process (excluding the possibility of contamination) made it possible to confirm 14 of them (Table 1).
Based on our analysis, over 50% of coinfection cases contain Omicron lineages as one of the viral variants (eight cases), six cases contain the Delta variant, and two cases contain the Alpha variant. The most frequent pair of variants is BA.1/BA.2 of the Omicron group of strains (four cases). Also, it is interesting to note that according to Pangolin annotation, the total number of sequenced genomes belonging to the Delta variant is 3981, Alpha is 222, and Omicron is 927. The representation of viral variants in the database might also influence the accuracy of the detection of coinfection cases.

Discussion
The study of SARS-CoV-2 genomic variability is directly related to the classification of viral genome variants, phylogenetic analysis, and the determination of the phylogenetic lineage, which includes non-recombinant circulating strains. The classification of viral strains is intended to systematize knowledge about common strains, is based on detectable genomic variability, and allows virologists and epidemiologists to track transmission routes and accurately communicate information about circulating variants in different regions of the world. In the study of the COVID-19 pandemic, at least four naming systems for genetically different types and lines of SARS-CoV-2 have been developed and are being implemented: three classifications used in scientific research (Pango, Nextstrain, and GISAID) [19] and the classification proposed by the WHO to identify common variants related to VOI and VOC [20]. Each of the systems has its own scientific approach, all classifications except the WHO classification were proposed before the definition of VOI and VOC. In our work, we relied on the Pango classification, which is based on the analysis of nucleotide substitutions across the SARS-CoV-2 genome. The phylogenetic lineage algorithm underlying the Pangolin Most Likely Phylogeny Lineage Determination Tool [16] is based on machine learning methods and comparative analysis of the studied genomic sequence with a database of tagged genomes. This method has certain limitations and is able to identify recombinant genome variants if they are already present in the Pangolin training dataset.
To identify new variants of recombinant forms of SARS-CoV-2, as well as to search for cases of simultaneous infection of a patient with different strains, we created a database of genomic variants characterizing 1451 genetic lineages of the virus. Using this database, we carried out our classification. As a part of the study, we analyzed the genomic sequences of strains collected from Russian patients diagnosed with COVID-19, which were compared with a database of genomic variants characterizing SARS-CoV-2 genetic lines annotated using the Pangolin algorithm. Any inaccuracy in the classification of non-recombinant lines may be due to the analytical tools available today.
The analysis made it possible to detect several cases of infection of Russian patients with recombinant variants of SARS-CoV-2: two cases of infection with a recombinant XAD variant according to the Pangolin or XAE classification based on the publication of Focosi et al. [9], which is a case of recombination of BA.2 and BA.1 lines (isolates LQ-24660 and LQ-24661), as well as one case of infection with a new recombinant variant BA.1/BA.2 (isolate LQ-24283) with localization of the recombination point between genome coordinates 24,503 and 26,060 (ORF3a, M genes). All three variants of the recombination were discovered in the spring of 2022, during the period of high prevalence of SARS-CoV-2 in Russia as well as around the world. A detailed phylogenetic analysis of the studied population of SARS-CoV-2 viruses and a representative sample of GISAID revealed new recombinant variants of the genomes identified by GISAID as BA.2 variants (Table.1). It is important to understand that the use of phylogenetic analysis for the determination of potential recombinant forms of SARS-CoV-2 could not be used as the only method due to the small number of recombinant genome variants in the GISAID reference database, which can lead to unambiguous results and some mistakes in classification. It is necessary to use alternative approaches to detect previously known SARS-CoV-2 and search for new recombinant forms.
Recombination of viruses in the coronavirus group is a common phenomenon and one of the mechanisms of evolution [21]. Recombinant variants of SARS-CoV-2 are increasingly being discovered by scientific teams due to the accumulation of genomic data [9]. A possible effect of the SARS-CoV-2 recombination process might be the emergence of virus variants with new properties, which is due to the biological effect of individual nucleotide substitutions and their combinations.
Also, as a result of the analysis, 14 cases of coinfection were found, most of which (eight samples) contained the Omicron strain as one of the variants. The total number of genomes of the Omicron strain sequenced in the laboratory was four times less compared to the Delta variant, which was present in six samples. This result may be due to the higher transmissibility of this variant compared with the lines found earlier and the greater likelihood of infection with two or more strains. Apparently, this fact also explains the higher detection of recombinants of the BA.1/BA.2 species in comparison with the previously studied VOC/VOI.
The spread of recombinant variants of SARS-CoV-2 has become widely discussed recently due to their significant role in the epidemiological process. For example, starting in October 2022, sublines of the XBB* variant (a recombinant variant of Omicron BA.2.10.1 and BA.2.75) will be widely spread across the world [22]. Our data also suggest that the ability of Omicron variants for rapid distribution and mutations to result in resistance to immune defenses acquired upon infection with earlier SARS-CoV-2 variants is likely to contribute to the emergence of more and more frequent cases of coinfection as well as the emergence of new recombinant forms. However, not all recombinant forms play an important role from an epidemiological point of view and could be classified using standard methods for determining the genetic lineage of SARS-CoV-2.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/v15081660/s1. Supplementary methods: Algorithm for the detection of recombinant genomes and co-infections. Table S1: List of strain-specific mutations used for SARS-CoV-2 lineage identification.