A high quality reference genome for the fish pathogen Streptococcus iniae

Fish mortality caused by Streptococcus iniae is a major economic problem in fish aquaculture in warm and temperate regions globally. There is also risk of zoonotic infection by S. iniae through handling of contaminated fish. In this study, we present the complete genome sequence of S. iniae strain QMA0248, isolated from farmed barramundi in South Australia. The 2.12 Mb genome of S. iniae QMA0248 carries a 32 Kb prophage, a 12 Kb genomic island, and 92 discrete insertion sequence (IS) elements. These include 9 novel IS types that belong mostly to the IS3 family. Comparative and phylogenetic analysis between S. iniae QMA0248 and publicly available complete S. iniae genomes revealed discrepancies that are likely due to misassembly in the genomes of isolates ISET0901 and ISNO. We also determined by long-range PCR that a tandem duplication of an rRNA region in the PacBio assembly of QMA0248 was an assembly error. A similar rRNA duplication in the PacBio genome of S. iniae 89353 may also be a misassembly. Our study not only highlights assembly problems in existing genomes, but provides a high quality reference genome for S. iniae QMA0248, including manually curated mobile genetic elements, that will assist future S. iniae comparative genomic and evolutionary studies.


Introduction
Streptococcus iniae is a fish pathogen that causes mortality in a wide range of fish species in wild and farmed, marine and freshwater environments, resulting in large economic losses to aquaculture (1, 2). S. iniae is also considered an opportunistic human pathogen, causing sporadic infections mostly in the elderly who have more than one underlying health condition such as diabetes mellitus or chronic rheumatic heart disease (3, 4). S. iniae pathogenesis is imparted through a repertoire of virulence factors (VF) including surface proteins, secreted toxins, and capsular polysaccharide (CPS) (4). VFs can be acquired through lateral gene transfer (LGT) of mobile genetic elements (MGE) such as composite transposons, genomic islands (GI) or prophages.

MGEs are a means by which bacterial pathogens acquire traits that help them adapt to changing conditions including vaccination, antibiotics, a new host or environment (5, 6). Indeed, they are considered the main drivers of gene flux in bacteria, contributing to diversity within species (7). Insertion sequence (IS) elements, for instance, are small MGEs (0.7-3.5 Kb) that have an important role in evolution and genome plasticity. IS insertion within bacterial chromosomes or plasmids can result in genetic modifications through insertional inactivation of genes or up-regulation of adjacent intact genes through outward-facing promoter sequences carried by some IS (8, 9). In some cases pairs of IS can mobilize intervening sequence as a composite transposon (8). The mobility of IS elements leads to their expansion or loss within bacterial lineages. Expansion is associated with [...]

PacBio reads for QMA0248 were initially assembled into a large ~2 Mb contig representing most of the chromosome of S. iniae QMA0248 and three contigs less than 10 Kb in length. The short contigs appeared to be single-read chimeras and were discarded from the final assembly. However, the identification of the tandem rRNA region in S. iniae 89353 prompted us to review the assembled short contigs. One of these discarded contigs (~7 Kb) encoded an rRNA region (5S, 16S, and 23S genes in tandem with an intervening cluster of tRNA genes). Subsequent [...] whereas it is entirely absent in YSFST01-82, ISET0901, and ISNO (Figure 3).

Characterization of S. iniae QMA0248 insertion sequences
Insertion sequences (IS) were analyzed in the S. iniae QMA0248 genome using the ISFinder database coupled with manual curation. The analysis revealed 92 IS (Table 3), which is higher than the average number per bacterial genome (n=38) but consistent with the lifestyle of S. iniae as a facultative pathogen (24, 25). Furthermore, the number of IS found in S. iniae QMA0248 is substantially higher than in other streptococci such as S. mitis strain B6 (n=63) but comparable to that of the Gram-positive fish pathogen Lactococcus garvieae (26, 27). The 92 IS elements belong to 7 different IS families and 20 IS types. These include 9 novel types belonging to the IS3, IS30, IS1182, and IS200/IS605 families, which we have submitted to the ISFinder database (ISStin2-ISStin10) (Table 3 and Supplementary Tables S2-4). Around half of all IS copies in QMA0248 belong to these 9 novel types, consistent with expansion of S. iniae-specific IS since speciation ([...]

[...] Figure 3). This includes a ~28 Kb region that is only found in YSFST01-82 (ROD1), and a ~20 Kb region that is present in four genomes but almost entirely absent from YSFST01-82 (ROD2) (Table 2, Figure 3 [...] Figure S4). To investigate the discrepancies between MGEs and phylogeny we compared multiple phylogenetic trees that were constructed using different methods, including core genome and core SNP approaches, and different software (Supplementary Figure S4). All phylogenies consistently revealed that S. iniae isolates QMA0248 and SF1 cluster together in one clade, whereas ISET0901 and ISNO cluster in another, and all four isolates cluster separately from YSFST01-82, the latter diverging earliest from the root (Supplementary Figure S4).

We have no reason to suspect that the YSFST01-82 genome assembly is inferior to that of QMA0248, but adopting the latter as an alternative representative S. iniae genome is justified for investigators wishing to take advantage of a manually curated set of MGEs. In contrast, we strongly recommend not using ISET0901 or ISNO in future comparative studies of S. iniae genomes. Reference-guided assembly was introduced to enable comparisons between two very closely related isolates. However, this practice can result in the erroneous inclusion of MGEs that exist in the template genome but are absent from the comparison strain. Even with careful curation it is impossible to avoid misplacing repetitive sequences such as IS, as observed here in the case of cas9 insertion and the other 8 IS copies that are absent in SF1, ISET0901, and ISNO only. Moreover, reference-guided assemblies may result in the loss of novel regions that are only present in the newly sequenced strain, in which case a de novo approach is always required (33). Although reference-guided assembly is no longer generally accepted for prokaryote genomes, a number of examples remain available in public repositories such as GenBank. For both ISET0901 and ISNO the assembly strategy is clearly outlined in the comment field of the GenBank file and in the primary publications (12, 13). Nevertheless, the consequences of using such genomes in downstream analyses may not be apparent to all ([...]

[...] BLASTn comparison was produced using EasyFig (50) using 2000 bp as minimum length, 50% as minimum identity value, and 1 × 10⁻¹⁷ as maximum e-value.
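The three BLASTn thresholds quoted above (minimum length 2000 bp, minimum identity 50%, maximum e-value 1 × 10⁻¹⁷) can be illustrated with a minimal sketch that applies the same cut-offs to BLAST tabular output. This is not how EasyFig applies them internally; it assumes the default 12-column `-outfmt 6` layout, and the function name `filter_hits` and the example hits are hypothetical.

```python
import csv
import io

# Thresholds taken from the text; everything else here is illustrative.
MIN_LENGTH = 2000     # minimum alignment length (bp)
MIN_IDENTITY = 50.0   # minimum percent identity
MAX_EVALUE = 1e-17    # maximum e-value


def filter_hits(handle, min_length=MIN_LENGTH,
                min_identity=MIN_IDENTITY, max_evalue=MAX_EVALUE):
    """Yield rows of BLAST tabular output that pass all three thresholds.

    Assumes the default -outfmt 6 column order:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        identity = float(row[2])
        length = int(row[3])
        evalue = float(row[10])
        if (length >= min_length and identity >= min_identity
                and evalue <= max_evalue):
            yield row


# Made-up hits, not data from the study.
EXAMPLE = "\n".join([
    # passes all three thresholds
    "\t".join(["qA", "sB", "98.50", "5200", "60", "2",
               "1", "5200", "101", "5300", "0.0", "9500"]),
    # alignment too short
    "\t".join(["qA", "sB", "99.00", "1500", "10", "0",
               "6000", "7499", "6100", "7599", "0.0", "2700"]),
    # identity too low
    "\t".join(["qA", "sB", "45.00", "3000", "900", "20",
               "8000", "10999", "8100", "11099", "1e-40", "800"]),
    # e-value too high
    "\t".join(["qA", "sB", "80.00", "2500", "400", "10",
               "12000", "14499", "12100", "14599", "1e-10", "300"]),
])

kept = list(filter_hits(io.StringIO(EXAMPLE)))
for hit in kept:
    print("\t".join(hit))
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive a comparison figure is to each cut-off.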

Supplementary Information
Uncharacterised homologs with 99% amino acid identity are found in 8 other available S. iniae complete or draft genomes including SF1, YSFST01-82, ISET0901, ISNO and 89353 (1). In most cases restriction enzymes predicted to recognise GCNGC are encoded nearby, or immediately adjacent to, the respective MTase gene. Notably, in QMA0248 the adjacent restriction enzyme (QMA0248_0515) is a pseudogene that has been truncated by an IS981. In S. iniae KCTC 11634BP, the orthologous gene in its draft-quality 454 genome was truncated at the same point by a contig break (2), suggesting that in both QMA0248 and KCTC 11634BP the MTase does not function as part of a restriction-modification system. The GCNGC motif is found 8074 times in the QMA0248 genome, suggesting that methylation activity could have wide-ranging regulatory consequences.
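Counting occurrences of a degenerate motif such as GCNGC can be sketched as below. This is a minimal illustration, not the procedure used to obtain the figures reported here: the helper names and the toy sequence are hypothetical, and it assumes overlapping matches are counted on both strands, with a palindromic motif counted once per site.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Return the reverse complement of an IUPAC motif."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def count_one_strand(sequence, motif):
    """Count overlapping matches of an IUPAC motif on the given strand."""
    # A zero-width lookahead lets overlapping sites be counted separately.
    pattern = "(?=" + "".join(IUPAC[base] for base in motif.upper()) + ")"
    return sum(1 for _ in re.finditer(pattern, sequence.upper()))


def count_motif(sequence, motif):
    """Count motif sites on both strands.

    A palindromic motif (identical to its reverse complement, e.g. GCNGC)
    is counted once per site; otherwise the reverse-complement matches on
    the given strand are added to the forward matches.
    """
    forward = count_one_strand(sequence, motif)
    rc = reverse_complement(motif)
    if rc == motif.upper():
        return forward
    return forward + count_one_strand(sequence, rc)


# Toy sequence for illustration only (not from the QMA0248 genome).
toy = "AAGCAGCTTGCCAGTT"
print(count_motif(toy, "GCNGC"))
print(count_motif(toy, "GCCHR"))
```

For a real genome the sequence would be read from a FASTA file first; whether a published count refers to one strand or both should be checked before comparing numbers.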
The closest homologs to QMA0248_0514 for which MTase activity has been determined are the M.CmaLM2II enzyme from Clostridium mangenotii LM2 and the M.LmoJ3I enzyme from Listeria monocytogenes J3115. Despite sharing modest overall amino acid similarity to QMA0248_0514 (59% and 45%, respectively), regions of high amino acid identity within their predicted target recognition domains (34/34 for M.CmaLM2II and 32/24 for M.LmoJ3I) support the prediction that QMA0248_0514 would also methylate the second cytosine of the GCNGC motif.
Detection of m5C using PacBio data is normally unreliable (3). As expected, methylation was detected at only a small fraction of GCNGC sites in the QMA0248 genome, and the consensus motif determined by the PacBio SMRT Portal software includes additional bases that are probably artefactual (e.g. GCNGCAGC) (Supplementary Table S5). Further experimental work (such as using Tet1 pretreatment to enhance detection of m5C with PacBio sequencing, or Oxford Nanopore sequencing) is needed to determine the true extent of cytosine methylation in the S. iniae genome and its role in gene regulation.
M.NgoDCXV homologs are remarkably rare and confined to a few streptococcal species (including S. iniae SF1). No specific methylation of GCCHR was detected in the QMA0248 genome, but 126 motifs that partially overlapped with GCCHR showed evidence of methylation (Supplementary Table S5). The GCCHR motif is present at 13,347 locations in the QMA0248 genome, so this represents only a fraction of available sites. The m4C modification is normally detectable from PacBio sequence data; therefore, further work is required to determine whether QMA0248_1949 is expressed and functional.

methyltransferase QMA0248_0505
The third MTase in S. iniae QMA0248 (QMA0248_0505, known as M.Sin248ORF0505P in REBASE) is encoded ~5 kb upstream of QMA0248_0514 and shares a similar strain distribution. There are no close homologs of QMA0248_0505 for which a recognition site has been determined. Accordingly, it has been annotated by REBASE as a putative Type II N4-cytosine or N6-adenine DNA methyltransferase of unknown recognition sequence (1).

[...] where Artemis colours duplicate reads that span the same region in green black.

[...] Table 4). Primers are listed in Table 4. "L" denotes 1 Kb Extend DNA Ladder (NEB).