High-Resolution Composition Analysis of an Inactivated Polyvalent Foot-and-Mouth Disease Vaccine

Appropriate vaccine selection is crucial in the control of foot-and-mouth disease (FMD). Vaccination can prevent clinical disease and reduces viral shedding, but there is a lack of cross-protection between the seven serotypes and their sublineages, making the selection of an adequately protective vaccine difficult. Since the exact composition of their vaccines is not consistently disclosed by all manufacturers, incompatibility of the strains used for vaccination with regionally circulating strains can cause vaccination campaigns to fail. Here, we present a deep sequencing approach for polyvalent inactivated FMD vaccines that can identify all component strains by their genome sequences. The genomes of all strains of a commercial pentavalent FMD vaccine were de novo assembled and the vaccine composition determined semi-quantitatively. The genome assembly required high stringency parameters to prevent misassemblies caused by conserved regions of the genome shared by related strains. In contrast, reference-guided assembly is only recommended in cases where the number of strains is previously known and appropriate reference sequences are available. The presented approach can be applied not only to any inactivated whole-virus FMD vaccine but also to vaccine quality testing in general and allows for better decision-making for vaccines with an unknown composition.


Introduction
Foot-and-mouth disease (FMD) is a highly contagious disease that causes severe economic damage, crippling local and global agriculture and commerce. The etiologic agent, foot-and-mouth disease virus (FMDV), infects cloven-hoofed animals, including domesticated and wild ruminants and pigs, leading to acute febrile disease with vesicles in and around the mouth and on the feet [1,2]. Although the mortality rate is low, the morbidity rate can be up to 100% combined with a drastic reduction in the productivity in the infected herds [3,4]. Suitable vaccines can prevent clinical disease and reduce viral shedding, thus limiting the economic consequences, even though subclinical infection and the development of a carrier state can still occur [5][6][7]. In many endemic areas, i.e., parts of Africa, the Middle East, and Asia, FMD control is attempted by regular mass vaccination [8].
FMDV is a positive-sense, single-stranded RNA virus in the genus Aphthovirus, family Picornaviridae. The FMDV genome is about 8.3 kb in length and encodes one long open reading frame of about 7 kb that is flanked by a long 5'-untranslated region and a shorter 3'-untranslated region, ending with a poly-(A) tail [9]. Like other RNA viruses, FMDV has an error-prone polymerase, leading to a high mutation rate during genome replication and the ability of rapid adaptation [10,11]. The seven

FMD Vaccines Contain Highly Fragmented Nucleic Acids due to BEI Inactivation
A commercial pentavalent FMD vaccine made from purified virus preparations that had been inactivated by treatment with BEI and formulated with a double oil emulsion (water in oil in water) adjuvant was used for the composition analysis. The extracted RNA (600 ng from 250 µL of vaccine) was highly fragmented, as shown in Figure 1a,b in comparison with RNA extracted from native vesicular fluid containing a large amount of infectious FMDV particles. While the fragmentation of the viral genome during the inactivation process is essential to ensure the safety of the vaccine, it creates a challenging condition for the reconstruction of whole genome sequences by deep sequencing. An adequate library preparation method was chosen to convert the majority of fragments into functional library molecules, aiming in a first step for an insert size of 400 bp. Additionally, in a second step, shorter fragments were collected and retained as backup (lib03330_small). The final libraries covering different insert sizes were quality controlled on a Bioanalyzer HS chip (Figure 1c,d). The library containing the longer molecules (lib03330_large) was sequenced on an Ion Torrent S5 XL, delivering a dataset of 4.2 million reads with a median read length of 277 bp (interquartile range: 106 bp) after trimming.

Vaccine Purity
Using deep sequencing, the genetic material within the sample can be screened for residues from cell culture propagation or other contaminations. A megablast analysis of a data subset of 5000 randomly selected reads against known FMDV genomes identified >99% reads of FMDV origin, demonstrating the high purity of the vaccine. The remaining 23 reads were either too short to obtain a significant match, only distantly related to FMDV, or identified as hamster-derived sequences, very likely residuals of the virus propagation on baby hamster kidney cells . The purity of the vaccine was also supported by a more thorough metagenomic analysis on the complete dataset using the RIEMS software pipeline [21] with 99.2% high-quality reads assigned to FMDV and only a small amount of hamster-derived sequences.

Attempted Identification of Strain Composition Using Megablast
One approach to identify the vaccine strains is by BLAST searches, but this approach requires the genomes of all component strains to be available in the database. Using megablast on a data subset of 5000 reads against known FMDV genomes with every read allowed to match only once, reads were assigned to 87 FMDV strains overall (Figure 2a and Supplementary Table S1). However, 75% of these hits had only one to four matched reads. These matches stem from conserved regions of the FMDV genome leading to reads matching several strains with equal E-value, and, to a lesser extent, from sequencing errors in the reads, but do not indicate the presence of all 87 strains in the vaccine. Of the 5000 reads, the majority (91.4%) were assigned to four virus isolates: FMDV type O isolate "o1manisa iso87" (accession no. AY593823; 2,722 reads), FMDV type A strain "Malaysia 97" (KJ933864; 673 reads), FMDV type Asia 1 isolate "asia1leb-89 iso89" (AY593798; 615 reads), and FMDV type A isolate "a22iraq-95 iso95" (AY593762; 560 reads). Only these four isolates had more than 500 matched reads per strain, and together accounted for 92.2% of the FMDV reads ( Figure 2b). The remaining 5.8% were assigned to 17 strains with read counts between 5 and 60, 1.2% to 23 strains with 2 to 4 reads, and 0.9% to strains with only a single read (Figure 2b). Without prior knowledge of the number of strains in the vaccine, strain identification using megablast can be misleading, particularly if appropriate reference sequences are missing from the database. In our case, the high numbers of matched reads could lead to the

Vaccine Purity
Using deep sequencing, the genetic material within the sample can be screened for residues from cell culture propagation or other contaminations. A megablast analysis of a data subset of 5000 randomly selected reads against known FMDV genomes identified >99% reads of FMDV origin, demonstrating the high purity of the vaccine. The remaining 23 reads were either too short to obtain a significant match, only distantly related to FMDV, or identified as hamster-derived sequences, very likely residuals of the virus propagation on baby hamster kidney cells (BHK-21). The purity of the vaccine was also supported by a more thorough metagenomic analysis on the complete dataset using the RIEMS software pipeline [21] with 99.2% high-quality reads assigned to FMDV and only a small amount of hamster-derived sequences.

Attempted Identification of Strain Composition Using Megablast
One approach to identify the vaccine strains is by BLAST searches, but this approach requires the genomes of all component strains to be available in the database. Using megablast on a data subset of 5000 reads against known FMDV genomes with every read allowed to match only once, reads were assigned to 87 FMDV strains overall (Figure 2a and Supplementary Table S1). However, 75% of these hits had only one to four matched reads. These matches stem from conserved regions of the FMDV genome leading to reads matching several strains with equal E-value, and, to a lesser extent, from sequencing errors in the reads, but do not indicate the presence of all 87 strains in the vaccine. Of the 5000 reads, the majority (91.4%) were assigned to four virus isolates: FMDV type O isolate "o1manisa iso87" (accession no. AY593823; 2,722 reads), FMDV type A strain "Malaysia 97" (KJ933864; 673 reads), FMDV type Asia 1 isolate "asia1leb-89 iso89" (AY593798; 615 reads), and FMDV type A isolate "a22iraq-95 iso95" (AY593762; 560 reads). Only these four isolates had more than 500 matched reads per strain, and together accounted for 92.2% of the FMDV reads ( Figure 2b). The remaining 5.8% were assigned to 17 strains with read counts between 5 and 60, 1.2% to 23 strains with 2 to 4 reads, and 0.9% to strains with only a single read (Figure 2b). Without prior knowledge of the number of strains in the vaccine, strain identification using megablast can be misleading, particularly if appropriate reference sequences are missing from the database. In our case, the high numbers of matched reads could lead to the incorrect conclusion that only the four isolates listed above (and/or close relatives thereof) were present in the vaccine.
Pathogens 2020, 9, x FOR PEER REVIEW 4 of 10 incorrect conclusion that only the four isolates listed above (and/or close relatives thereof) were present in the vaccine.

Reference-Guided Assembly Approach on Full Genomes
A reference-guided assembly can be performed if all strains in the vaccine have been reliably identified in advance and corresponding reference sequences are available. However, in our case, an initial megablast search of the raw reads had only identified four strains, because for the fifth (Component 2/O-3039), no whole-genome sequence was available in the database. Component 2 was only later identified in a de novo assembly (see below). A stringent mapping of the raw reads against the four reference sequences identified by the megablast search produced three almost complete FMDV genomes (Component 1, 3 and 5) while a fourth (Component 4/A Malaysia 97) was split in three contigs. This demonstrates that reference-guided assembly can be used to reconstruct full genomes but only when the total number of strains present is known in advance and the corresponding reference sequences are available. If these prerequisites are not met, reference-guided assembly will give incomplete and misleading results.

De Novo Assembly of Full-Genomes
The difficulty in the de novo assembly of the components of a polyvalent vaccine lies in the conserved genomic regions shared between different strains. To obtain optimal results, two assembly algorithms, GS De Novo Assembler (Newbler) and SPAdes [22], were compared. Using default settings in the de novo assembly with both methods, only fragmented genomes were obtained, indicated by many contigs being assigned to different strains. To obtain full genome sequences in the assembly, an optimal range for the assembly stringency needed to be determined as well as the data set size adjusted. For example, a highly stringent assembly combined with a high sequencing depth produced reliable contigs, but at the same time failed to provide full genome sequences. In case of the GS De Novo Assembler, very strict assembly parameters of a minimum identity of 100%, minimum overlap length of 99%, and a data set size of 20,000 reads delivered 185 contigs, with the largest contig being 1801 bp. The best assembly results were achieved with the minimum read identity set to 97% and minimum overlap length set to 99%, with a data set of 20,000 reads. With these settings, the GS De Novo Assembler returned exactly 10 contigs, where each viral genome was divided in two contigs that were separated by the poly-C region in the 5'-untranslated region. These two contigs were combined manually based on their individual BLAST results to obtain five complete FMDV genomes, including the genome of Component 2, which could not be reconstructed by reference-guided assembly due to the lack of an appropriate reference sequence. The assembly was then verified by mapping the dataset against each

Reference-Guided Assembly Approach on Full Genomes
A reference-guided assembly can be performed if all strains in the vaccine have been reliably identified in advance and corresponding reference sequences are available. However, in our case, an initial megablast search of the raw reads had only identified four strains, because for the fifth (Component 2/O-3039), no whole-genome sequence was available in the database. Component 2 was only later identified in a de novo assembly (see below). A stringent mapping of the raw reads against the four reference sequences identified by the megablast search produced three almost complete FMDV genomes (Component 1, 3 and 5) while a fourth (Component 4/A Malaysia 97) was split in three contigs. This demonstrates that reference-guided assembly can be used to reconstruct full genomes but only when the total number of strains present is known in advance and the corresponding reference sequences are available. If these prerequisites are not met, reference-guided assembly will give incomplete and misleading results.

De Novo Assembly of Full-Genomes
The difficulty in the de novo assembly of the components of a polyvalent vaccine lies in the conserved genomic regions shared between different strains. To obtain optimal results, two assembly algorithms, GS De Novo Assembler (Newbler) and SPAdes [22], were compared. Using default settings in the de novo assembly with both methods, only fragmented genomes were obtained, indicated by many contigs being assigned to different strains. To obtain full genome sequences in the assembly, an optimal range for the assembly stringency needed to be determined as well as the data set size adjusted. For example, a highly stringent assembly combined with a high sequencing depth produced reliable contigs, but at the same time failed to provide full genome sequences. In case of the GS De Novo Assembler, very strict assembly parameters of a minimum identity of 100%, minimum overlap length of 99%, and a data set size of 20,000 reads delivered 185 contigs, with the largest contig being 1801 bp. The best assembly results were achieved with the minimum read identity set to 97% and minimum overlap length set to 99%, with a data set of 20,000 reads. With these settings, the GS De Novo Assembler returned exactly 10 contigs, where each viral genome was divided in two contigs that were separated by the poly-C region in the 5'-untranslated region. These two contigs were combined manually based on their individual BLAST results to obtain five complete FMDV genomes, including the genome of Component 2, which could not be reconstructed by reference-guided assembly due to the lack of an appropriate reference sequence. The assembly was then verified by mapping the dataset against each of the five genomes with even higher stringency and manually inspecting the read alignments, thereby also confirming the correct assembly of the two contigs with reads covering the poly-C region. The five obtained FMDV genomes have sequence identities of 91% to 100% to FMDV genomes available from the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) ( Table 1) and match the strain identification provided by the manufacturer on the label of the vaccine. The VP1 sequence of Component 2 was 100% identical to FMDV O/TAI/Ban/60 (accession no. KM243030.1), which corresponds to the vaccine strain commonly known as O-3039 [23]. A whole-genome sequence of this strain is not available from INSDC. The genomes themselves have pairwise sequence identities of less than 90% (Table 2). De novo assembly with SPAdes was also attempted, but the fine-tuning of stringency that can be done by the user is limited. The "-careful" mode was applied during assembly, and different data set sizes were provided. Data sets of 5000 and 10,000 reads proved to be too small, resulting in 12 and 7 contigs, with the longest contigs amounting to 3697 and 3743 bp in length, respectively. A data set of 20,000 reads gave better results, with 3 contigs of almost 8 kb (corresponding to nearly complete genomes of Component 2, 3, and 4) and 10 contigs of 256 to 4298 bp in length. A data set of 100,000 reads returned 38 contigs, with the largest contig being 10,105 bp long, a peculiar chimera of Components 5 and 3, as well as an almost complete genome of Component 4. An input of 1 million reads produced 41 contigs, with Component 5 almost fully assembled but all other genomes again fragmented, and some additional hamster sequences. Hence, it was not possible to reliably reconstruct the genomes of the vaccine strains using SPAdes in this case.

Distribution of FMDV RNA
Competitive mapping of a partial dataset of 60,000 randomly selected reads against the de novo assembled genomes of the five identified strains allowed a rough estimation of the quantitative composition of the pentavalent vaccine. With highly stringent parameters, half of the reads were allocated to Component 1 (49.9%), and roughly 11% to Component 4 and 5, 8.9% to Component 3, and 5.4% to Component 2 (Table 3), respectively, resulting in only 86.1% of the reads allocated overall.
When the stringency was slightly reduced, 98.2% of the dataset were allocated to the five FMDV strains (Table 3).  1 : Highly stringent mapping parameters included a minimum overlap identity of 99% and a minimum sequence identity of 98%; 2 : Less stringent mapping parameters included a minimum overlap identity and a minimum sequence identity of each 95%; *: Number of reads that mapped uniquely to this reference out of a total of 60,000 randomly selected reads.

Discussion
The application of vaccines to prevent clinical disease and reduce viral shedding is an important component of FMD control, particularly in endemic regions. Ideally, a universal vaccine would provide full protection against all FMD strains with long-lasting immunity and without persistent infection [24,25]. However, the lack of cross-protection between different serotypes and strains is a hurdle to the successful application of current vaccines. Accordingly, a jointly published checklist for vaccine selection by the Food and Agriculture Organization of the United Nations and the World Organisation for Animal Health highlights the impact of antigenic differences between vaccine and field strains and the necessity to obtain advice on strain selection from reference laboratories [16]. This is often made difficult by a lack of disclosure of the strains used for vaccine production. Additional confusion can be caused by different manufacturers referring to different virus strains by the same name [17].
The universal approach presented here gives the opportunity to determine the full genomes of all vaccine strains, including the regions coding for the capsid proteins, in one sequencing run. The protocol does not require any modification for different vaccine components and inherits no bias from a specific amplification with possibly suboptimal primer binding. A somewhat similar approach has been described for poliovirus vaccines, where both attenuated strains and wild-type strains propagated only for inactivated vaccines, had been handled at the same production site [26], and for the quantification of low-frequency revertants in live oral poliovirus vaccines [27]. Several whole-genome sequencing approaches for FMDV have been published [28][29][30], but to our knowledge, none have been applied to inactivated vaccines.
The BEI inactivation leads to a strong fragmentation of the RNA, but with appropriate fine-tuning of the assembly this does not hamper the analysis. In general, longer reads are favorable for the assembly process and the disambiguation of related strains. Therefore, the library containing the longer fragments (lib03330_large) was selected for sequencing, although overall, less library material was available compared to the smaller fragments. Nevertheless, several million reads were obtained with only a fraction of the library, produced with an amplification-free library preparation workflow, of which several 10,000 reads were enough to reconstruct the whole genome sequences. This was favored by the high purity of the vaccine, with a low amount of background reads as residuals from the cell line used for virus propagation. Lower quality vaccines will have an increased background of non-FMDV reads, and an inadequate sequencing depth can then lead to problems with the FMDV assembly. However, by increasing the sequencing depth, it should generally be possible to assemble at least one large contig for each virus strain.
Local alignment searches and mapping approaches for vaccine strain identification require the corresponding reference sequences to be known. In the case presented here, only four of five genomes could be reconstructed, because for the fifth no reference sequence was available in public databases. In contrast, the open approach of de novo assembly can reconstruct sequences without prerequisites, resulting in the reconstruction of all five genomes in this case. If the strains are genetically highly similar or contain conserved regions, the assembly of short sequences is challenging and requires strict assembly parameters to distinguish between different strains and avoid the assembly of spurious chimeric constructs. In our opinion, the de novo assembly with strict parameters is to be favored over a mapping approach, even if it might result only in partial but more reliable genome sequences. With this approach, the investigator can also be certain to have identified all strains present in the vaccine product, even if the manufacturer does not provide the total number of strains. It is important to note, however, that even virus strains that are only present in trace amounts-intentionally or from contamination-will be detected. If reads are specific and unique to one assembled contig, it is highly likely that the corresponding strain was actually present in the vaccine, intentionally or otherwise. There is no unequivocal cut-off to differentiate contaminants from minor vaccine components, but we suppose that if less than 1% of all FMDV reads match a particular virus strain, it was probably not intended to be a component of the vaccine.
After successful strain identification and genome assembly, the proportional distribution of the total viral RNA in the vaccine across the component strains can be determined. In this case, over half of the viral RNA originated from one component (O 1 Manisa) while the other serotype O strain accounted for less than 7% of RNA. The other three strains each represented between 11% and 13% of the total RNA. This provides additional information about the vaccine composition that is not usually disclosed by the manufacturer. Due to the stringent parameters, the read assignment itself is unbiased. It is important to note, however, that the quantitative distribution of viral RNA is not necessarily the same as the distribution of viral antigen, because the RNA-to-protein ratio may vary between strains.
We compared two commonly used de novo assemblers, SPAdes and GS De Novo Assembler (Newbler), with Newbler delivering much better results. This is mostly due to its stringency parameters that can be adjusted in more detail than for SPAdes. However, other assemblers will also work, if the user can set the assembly stringency to be high enough.
The fragmented RNA limits the maximum read length obtainable in sequencing, and we showed that the genome reconstruction is possible with Ion Torrent single-end reads, with a median read length of 277 bp after trimming. Nonetheless, sequencing on an Illumina device in a paired-end mode might provide additional benefit to the assembly.
We trust that the detailed description of our approach allows others to explore the composition of multivalent vaccines when supporting information is not available, and hope that the information gained thereby simplifies the selection of appropriate vaccines for FMD control.

Nucleic Acid Extraction, Library Preparation, and Sequencing
A commercial pentavalent FMD vaccine, made from purified antigens that had been inactivated by treatment with BEI and were formulated with a double oil emulsion (water in oil in water) adjuvant, was selected for the study. Total RNA was extracted from the inactivated vaccine with TRIzol LS (Thermo Fisher Scientific, Waltham, MA, USA) and RNeasy Mini spin columns (Qiagen, Hilden, Germany) with on-column DNase I digestion, as described previously [31]. In detail, 250 µL of vaccine emulsion was combined with 750 µL of TRIzol LS, mixed and incubated for 5 min at room temperature (RT), then 200 µL of chloroform was added, mixed, and incubated for 10 min, also at RT. After centrifugation for 10 min at 18,000× g at 4 • C, the upper aqueous phase was removed, mixed with 600 µL 75% (v/v) ethanol, and subsequently transferred in two steps onto the spin column. Binding, washing, and DNase digestion were carried out as per the manufacturer's instructions for the RNeasy Mini Kit. Purified RNA was eluted from the column three times with 50 µL of RNase-free water for a total elution volume of 150 µL. The amount of RNA was determined by absorption at 260 nm using a NanoDrop ND1000 spectrophotometer (Thermo Fisher Scientific). Quality of extracted RNA was assessed using an RNA 6000 Pico chip on an Agilent 2100 Bioanalyzer.
For cDNA synthesis, 125 µL of RNA (=500 ng) were concentrated by adding 1.8 volumes of Agencourt RNAClean XP beads (Beckman Coulter, Brea, CA, USA) and incubating for 7 min on a shaker at 550 rpm at RT. After collecting the beads on a magnetic rack and washing twice with 700 µL 80% (v/v) ethanol, the bead pellet was air-dried and the RNA eluted in 13 µL of RNase-free water. For reverse transcription, the SuperScript IV First-Strand cDNA Synthesis System (Thermo Fisher Scientific) in combination with the NEBNext Ultra II Non-Directional RNA Second Strand Synthesis Module (New England Biolabs, Ipswich, MA, USA) was used according to the manufacturers' instructions. The cDNA was sheared on a Covaris M220 Focused-ultrasonicator with a target size of 400 bp. Sheared DNA was concentrated by adding 1.8 volumes Agencourt AMPure XP beads (Beckman Coulter), washed twice with 80% (v/v) ethanol, eluted in 25 µL of nuclease-free water, and used for library preparation with the GeneRead DNA Library L Core kit (Qiagen). After end repair and adapter ligation, the library was purified using 1.8 volumes AMPure XP beads, followed by a size selection. In the first step, 104 µL of AMPure XP beads were diluted with 80 µL of nuclease water and added to 100 µL of purified library. After 7 min of incubation on a shaker (550 rpm at room temperature), the beads were pelleted on a magnetic rack and 250 µL of the cleared solution were combined with 30 µL of AMPure XP beads in a new tube. After a second round of incubation and pelleting, 276 µL of the cleared solution were recovered, containing only the small library molecules (size distribution~150-350 bp, labelled "lib03330_small") while longer molecules in the library (size distribution~350-550 bp) remained bound to the beads. The beads were washed twice with 80% ethanol, air-dried, and the final library for sequencing ("lib03330_large") eluted in two steps of 17 µL each. The supernatant containing the smaller fragments was purified twice with 1.2 volumes of AMPure XP beads. Quality control was performed with an Agilent High Sensitivity DNA kit and the molarity determined with a KAPA Library Quantification kit. Sequencing of lib03330_large was performed in a pooled run on an Ion Torrent S5 XL instrument with an Ion 530 Chip Kit.

Data Analysis
For a first impression of the proportion of FMDV reads in the vaccine library, 5000 randomly selected reads were compared against known FMDV genomes deposited in public databases (date of access: 13/09/2019) using megablast (National Center for Biotechnology Information, U.S. National Library of Medicine), with an E-value cut-off of 0.00001 and only the best hit returned for every read. For a deeper investigation into vaccine purity, the metagenomic software pipeline RIEMS [21] was used on the full dataset in its basic analysis mode. Reads identified as FMDV by megablast were sorted and counted according to their best hit.
For reference guided-assembly, the GS Reference Mapper (version 3.0; 454 Life Sciences/Roche) was used on a data set of 200,000 reads with stringency parameters of a 99% minimum overlap identity (-mi; default setting 90%) and minimum overlap length (-mL; default setting 90%). Based on the megablast results, four reference sequences (accession no. AY593823, KJ933864, AY593798, and AY593762) were used for the mapping.
For de novo assembly, the GS De Novo Assembler (Newbler version 3.0; 454 Life Sciences/Roche) was used as follows: The minimum overlap length was set to 99% while the minimum overlap identity was varied from 95% to 100% in the search for optimal results. Data set sizes ranged from 20,000 to 100,000 reads. The best-performing assembly parameters resulted in two contigs for each genome, which were manually combined according to their megablast results against the nucleotide database. For verification of the correct assembly, 500,000 reads were mapped to the assembled sequence using GS Reference Mapper with a minimum overlap length of 99% and minimum overlap identity of 98%, resulting in the final genome, and the alignments were visually inspected for correctness. For identification of the most closely related virus strain, a megablast search with default parameters was performed against the INSDC database and the hit with the highest score was selected. Pairwise nucleotide sequence identities were determined using the EMBOSS Needle Pairwise Alignment tool (version 6.3.1) with default settings.
The 454 Genome Sequencer software package (version 3.0), including GS Reference Mapper and GS De Novo Assembler is available for download as part of the RIEMS software pipeline [21] at https://github.com/EBI-COMMUNITY/fli-RIEMS.
In parallel, de novo assembly using SPAdes version 3.11.1 was performed with Ion Torrent single reads defined as input (-iontorrent), mismatch careful mode turned on (-careful), k-mer sizes of 21, 33, 55, 77, 99, and 127 in the mode of read error correction and subsequent assembly. Data set sizes ranged from 5000, 10,000, 20,000, 100,000 to 1 million reads.
For proportional composition analysis, a partial dataset of 60,000 reads was competitively mapped against the five de novo assembled genomes using GS Reference Mapper with the information of uniquely mapped reads provided by the software in the RefStatus file. The highly stringent approach included a minimum overlap identity of 99% and a minimum sequence identity of 98%; the second slightly less stringent mapping approach included a minimum overlap length and a minimum sequence identity of 95% each.