Functional annotation and distribution overview of RNA families in 27 Streptococcus agalactiae genomes

Streptococcus agalactiae, also known as Group B Streptococcus (GBS), is a Gram-positive bacterium that colonizes the gastrointestinal and genitourinary tract of humans. This bacterium has also been isolated from various animals, such as fish and cattle. Non-coding RNAs (ncRNAs) can act as regulators of gene expression in bacteria, such as Streptococcus pneumoniae and Streptococcus pyogenes. However, little is known about the genomic distribution of ncRNAs and RNA families in S. agalactiae. Comparative genome analysis of 27 S. agalactiae strains showed more than 5 thousand genomic regions identified and classified as Core, Exclusive, and Shared genome sequences. We identified 27 to 89 RNA families per genome distributed over these regions, from these, 25 were in Core regions while Shared and Exclusive regions showed variations amongst strains. We propose that the amount and type of ncRNA present in each genome can provide a pattern to contribute in the identification of the clonal types. The identification of RNA families provides an insight over ncRNAs, sRNAs and ribozymes function, that can be further explored as targets for antibiotic development or studied in gene regulation of cellular processes. RNA families could be considered as markers to determine infection capabilities of different strains. Lastly, pan-genome analysis of GBS including the full range of functional transcripts provides a broader approach in the understanding of this pathogen.


Background
Streptococcus agalactiae (GBS) is a Gram-positive bacterium that can colonize the gastrointestinal and urogenital tracts of humans. Despite this natural association, S. agalactiae infections are common in neonates [1], immunosuppressed adults and the elderly [2].
GBS infections are not restricted to humans. Historically, this microorganism is a causative agent of bovine mastitis and may infect several fish species, causing production losses in aquaculture systems [3][4][5], besides affecting other animals such as mice, dogs, cats and horses (as seen in [6,7]).
This diversity of environments in which S. agalactiae can be found depends on its ability to control regulatory networks responsible for avoiding the host's immune system and to acquire nutrients [8].
Transcriptome microarray analysis revealed that S. agalactiae strains were able to regulate their transcriptional levels according to different factors, such as growth medium temperature [9], the presence of human blood [10] and the presence of human amniotic fluid [11]. On the other hand, in cultures with high glucose concentration and absence of host's stimuli, genes related to virulence and stress response were down regulated [8].
The development of new techniques, such as tiling array and ribosomal RNA depletion, has made it possible to verify a great diversity of non-translated RNAs in bacteria [12]. The number and diversity of these elements led to the organization of these RNAs in families (Rfams) that increased from 25 to over 2200 in 10 years [13].
Research studies have shown that non-coding RNAs (ncRNAs) can modulate gene expression in bacteria, as well as coordinate adaptive processes that responds to environmental changes and control target gene expression [14,15]. Consequently, ncRNAs play key roles in gene regulatory networks that responds to environmental stimuli, and this occurs in several pathogenic bacteria, such as Vibrio cholerae, Salmonella Typhimurium, Staphylococcus aureus, and various species of Streptococcus [16][17][18][19].
At the time of this writing, 18 global analyses aiming to investigate whether non-coding RNAs are available in Streptococcus have been accomplished [19,20]. The number of ncRNAs detected in these studies range from 10 to 900 [19]. However, more than half of these studies focus on Streptococcus pneumoniae and S. pyogenes (as seen in [19]). On the other hand, only two reports are currently available for a single strain of S. agalactiae. In the first one a model developed by the authors found 197 candidates [21], and in the second one differential RNA-Seq data (dRNA-Seq) revealed more than 100 ncRNAs [22].
Here we report the first comparative genome analysis of ncRNA of S. agalactiae. The data showed a compact Core genome region interleaved with Shared and Exclusive regions. Our predictions showed 27 to 89 Rfams per genome where 25 are always present in Core genome regions. Some of the RNA families identified were related to mobile genetic elements and essential functions like iron homeostasis, sugar metabolism and virulence genes regulation. Moreover 4 RNA families are reported for the first time in the Core region of strain NEM316.

ncRNAs selection
In this study RNA families (Rfam) refer to families of untranslated RNAs from Rfam database [23]; among the Rfam those that have 500pb or less were considered small RNAs (sRNAs).

Genomic sequences
This study comprised 27 genomes falling either into the "Complete Genome" or "Complete Chromosome" level available from the National Center for Biotechnology Information (NCBI) database [24] on January 24, 2016. Information on the isolation source of each genome was obtained from the same database and simplified in Table 1.

Determination of Core, exclusive and shared genomic regions
Mauve [26] software was used to align the 27 S. agalactiae genomes and the strain NEM316 was used as reference genome for alignments since its genome has been completely sequenced [27], there is transcriptomic data available [22,28] and was taken as a model for the search of non-coding RNAs in this group [21].
Genomic coordinates were extracted from Mauve output (backbone file) and classified into three kinds of sequences (I) Core: regions present in all genomes; (II) Exclusive: regions occurring only in one genome; (III) Shared: regions present in two or more, but not in all genomes. To avoid misclassification of small regions that happen in the normal evolution of strains, such as single nucleotide polymorphisms (SNPs), only regions larger than 10 base pairs (bp) were extracted. To visually inspect the distribution of both genomic and ncRNA regions, annotations were plotted with the software Circos [29].

Horizontal gene transfer analysis
To investigate the existence of probable horizontally transferred regions into the Exclusive regions of each genome, we used the Alien Hunter [30,31], PHASTER [32] and IslandViewer [33]. Alien Hunter regions with the output tag "probably overlapping rRNA operon" were disregarded.

Annotation of Rfams
RNA families identification and annotation was performed with Infernal v1.1 [34] with default settings, and the Rfam database v12.0 [23]. Rfams were restricted to 147 families already identified in the order Lactobacillales (Additional file 2), as recommended by Nawrocki in [35]. The family with the lowest e-value was selected in case of overlap [35]. Also, the nocoRNAc [36] was used with default settings to predict possible promoter and terminator regions around the reported Rfams. It should be noted that group II intron prediction uses several RNA families from Rfam database (group II D1-D4 1 to 7 and intron_gpII), thus estimations were then curated (Additional file 3).
Intergenic regions of all genomes were obtained with Artemis genome browser [37] to evaluate identified RNA families overlapping these regions.
The overlap between RNA families found for strain NEM316 were compared to the data provided by Pichon et al. [21] and Rosinski-Chupin et al. [22]. The resulting RNA families for NEM316 were tagged in the annotation files (Additional file 4) indicating in which study they were observed.
Annotated RNA family's coordinates were compared to Core, Exclusive and Shared regions; when the RNA family coordinates overlap the genome region, the count for that region was increased by one.
To investigate the relationship between the number of RNA families and genome sizes, correlation analysis was performed after Shapiro-Wilk test of normality over the variables of total number of RNA families and genome size (Table 1) in R statistical environment [38] with package Hmisc.

Transcription evidence
We obtained the RNA-Seq data for strain NEM316 from NCBI Sequence Read Archive (SRA) (Accession Numbers: SRX315261, SRX315262, SRX315263, SRX315264 and SRX315265) to verify if the RNA families regions were capable of being transcribed to exert their functional role. Sequences were cleaned with Trimmoatic [39] and mapped on NEM316 strain genome with Bowtie2 [40] using the "--very-sensitive" flag. The coverage per base was obtained with SAMtools [41]. The raw number of reads over RNA family's annotation extension was plotted for visual inspection of transcriptional activity. The coverage was also plotted for two ranges, of 800 base pairs (bp) and 1600 bp, surrounding the RNA family annotation using the "stat = "coverage"" argument for the autoplot function of ggbio package of Bioconductor [42]. If there was read coverage, i.e. reads mapped over the region with a RNA family annotated, the RNA family was considered real due to transcription activity on site.

Cluster analysis
The expectation-maximization (EM) algorithm was applied taking into account the genome size, number of Shared

Results
In order to study the pangenome of GBS strains we did a comparative genome analysis. We included in this study 27 genomes. We did a clustering analysis based on genomic features (for details see Methods), which led to the identification of three groups: cluster2, which were related to fish and mammalian strains; cluster1, related exclusively to a mammalian source; and cluster0 related to fish samples only ( Table 1, the graphics in Additional file 6 show the mix of groups according to each attribute analyzed). The alignments of 27 genomes allowed the classification of 5224 regions (amongst Core, Exclusive and Shared regions) distributed in all the analyzed genomes. Of those, 458 were classified as Core, 997 as Exclusive, and 3769 as Shared (Fig. 1, tracks 2, 3 and 7; for all genomes pictures refer to Additional file 7). The size of Core regions ranged from 10 to 39,361 bp, while Shared and Exclusive regions ranges from 10 to 21,935 bp and 19 to 47,076 bp respectively. To avoid information overload, only one genome of each cluster (Table 1) was exhibited, the image with all genomes together and individual genome plots can be found in Additional file 7. Shared regions were found scattered throughout the genome. Moreover, it was possible to observe that amongst the regions identified in each cluster, several of them formed mosaic structures that intercalates with the Core genome ( Fig. 1 and Additional file 7). This suggests that horizontal transfer events might have occurred that contributed with genome evolution. Thus, we used softwares Alien Hunter, PHASTER and IslandViewer to identify regions supposedly acquired by horizontal transfer events; these regions overlapped with one another and with Shared and Exclusive Regions (Fig. 1 tracks 4, 5 and 6).
Then, we looked for the different non-coding RNA families present in GBS pangenome and evaluated their distribution in each genome. The search for these elements resulted in the identification of 27 to 89 RNA families per genome (Table 1). When considered by region, the Core, Shared, and Exclusive groups presented a minimum/maximum of 25/25, 2/54 and 0/10, respectively. The Shapiro-Wilk test of normality show that genome size and the number of RNA families did not follow a normal distribution (W = 0.89404, p-value = 0.009798 < 0.05 and W = 0.73345, p-value = 1.174e-05 < 0.05 respectively) then Spearman's rank was selected for correlation analysis. A strong correlation between the genome size and the number of RNA families was found (Rho = 0.61), and this relationship was significant (p-value = 7e-04 < 0.05).
We then conducted a more detailed analysis of the identified Rfams in order to obtain additional information on GBS genome evolution. We detected 36 Rfams that overlapped at least with one of the three regions, which includes large ncRNAs, such as GOLLD, group II introns and RNAseP (Table 2). We focused in the distribution of the sRNAs in GBS genomes. This revealed that 24 out of 27 families were highly conserved (> 96%) (Additional file 3 and Additional file 5). Four sRNAs were found less frequently in the genomes included in this study: rli38 was present in 70% of genomes, SSRC38 in 48% of them, and rli28 and RatA in 55% of them. Other Rfams were found solely in one or two genomes, i.e. RNAout in strain A909, SpF01 in strain GBS85147, SpF39 in strain NGBS061, and Spf41 in strains CNCTC_10/84 and GBS85147. All sRNAs were fully or partially overlapping intergenic regions, except one family, namely sau-50.
Amongst the sRNA families overlapping with Core regions, FMN, L10_leader, Lacto-rpoB, PreQ1, Spd-sr37, tmRNA, and tracrRNA presented transcriptional signals, i.e., the probable occurrence of promoter and terminator signals detectable in at least one strain. RNA families RatA, rli28 and rli38 were found in Exclusive and Shared regions. Many of these sRNAs are known for their role in the regulation of metabolic processes, the biosynthesis of compounds or other key cellular processes (Additional file 5). Furthermore, RNA-OUT, SpF01_sRNA, SpF39_ sRNA and SpF41_sRNA were found only in Exclusive regions, whereas sRNAs PyrR, SSRC38, TPP and yjdF were found in Shared regions.
Noteworthy, while 28 Rfams reported by previous studies (from [21,22]) were found in strain NEM316, the present analysis also led us to the identification of 4 new RNA families (SSRC38, rli38, sau-50 and SSRC34) in NEM316 Core region (Fig. 2). An annotation file with new and previous information/data is provided as supplemental material (Additional file 4). Coverage plots based on mapped RNA-Seq data showed that families SSRC38, rli38 and SSRC34 seem to be transcribed (Additional file 8). Further studies are necessary to validate the functional role of these RNAs. The sau-50 family overlaps with a DNA binding protein and its activity was previously validated in another bacterial species [44], but unfortunately no additional information is available on its role and there is no noticeable difference in coverage by observing the transcriptome (Additional file 8).
Regarding the sRNA families found in Exclusive or Shared regions, many of them have not been characterized yet. Several of them were proposed as elements involved in the bacterial virulence (rli28, rli38 and SpF41), while others seemed to be related to mobile genetic elements (RatA and RNAout). The coverage plots based on mapped RNA-Seq data showed transcription activity for the remaining Rfams annotated (Fig. 3, coverage plots and read counts for all Rfams annotated are available on Additional file 9, additional coverage plots for a range of 800 and 1600 bp surrounding area of Rfam annotations is provided as Additional file 10). Although the number of raw reads (Y axis of Fig. 3 and Additional file 8) does not necessarily reflect the exact number of transcribed molecules, these plots provide supporting information on the transcription and potential role of the ncRNAs described in this study.
We last evaluated the clonal relationship of the strains included in this study by MLST analysis. This analysis showed large variability among isolates, which have little correlation with Rfams occurrence (Table 1).

Discussion
The genomic regions found in GBS genomes are in accordance with the original description, which indicates that this genus has a Core genome and some regions without corresponding alignments in other strains [45,46]. The observed overlap of several Shared and Exclusive regions with those probably acquired by horizontal gene transfer (HGT) (Fig. 1,  tracks 4, 5 and 6) suggests that GBS genomes are continuously evolving, adapting and acquiring genetic information. Evidences of genome dynamics has been already reported in S. agalactiae by Brochet et al. [47].
RNA families annotation in the strain NEM316 showed that 88% were already reported by previous works ( [21,22]). The visualization of RNA-Seq data from Richards et al. [28] in the strain NEM316 also identified the transcriptional activity of the RNA families found in intergenic regions (e.g. Asd family in Fig. 3). However, our study resulted in the identification of 4 additional RNA families (SSRC38, rli38, sau-50 and SSRC34) (Fig. 2). Little is known in the function of these RNAs, notwithstanding transcription of SSRC38 and SSRC34 have been detected during exponential growth phase in S. pyogenes [48]. On the other hand, studies on rli38 function suggest that this RNA plays a key role in the pathogenesis of Listeria monocytogenes [49]. Last, although sau-50 have been reported as transcribed in other bacteria [44,50] it still unclear the function of this RNA.
Although the annotated Rfams were not experimentally validated, transcriptomic data had already been used to successfully identify non coding RNAs (see [51,52]). Furthermore, the annotation method utilized have already proved to be a powerful tool for large-scale analyses even in the presence of deletions, alterations in G + C content, and in distantly-related genomes [53]. Although not all annotated families show a significant coverage pattern, transcription is occurring in their loci. A possible explanation is that these RNAs are either co-transcribed or transcribed in different growth conditions. In either case they have the potential to act as regulatory elements [54]. As the annotation criteria for all strains were the same, the data obtained here represent a useful set of ncRNAs of S. agalactiae.
The comparative analysis in S. agalactiae genomes allowed us to detect 23 RNA families in Core regions and 17 (considering each Rfam database entry of group II introns) in the Shared and Exclusive regions (Table 2). However, the number of sRNA families was 25 in Core regions, while the sum of the Shared and Exclusive Rfams was greater than 28 for all cluster2 strains (Table 1) which give us a 25/28+ ratio of core/variable regions. This shows that HGT probably play a key role in the acquisition of ncRNAs as observed in Salmonella (see [54,55]). Interestingly this was opposite to what was reported for Escherichia coli and Shigella spp. These bacteria have 60/23 ratio of core/variable regions [56]. It  [21,22] seems that Core and Shared regions in GBS genome evolved divergently for ncRNAs, where core regions are more conserved for the presence of these elements while shared regions provide the context for acquiring novel ncRNAs that have an adaptive benefit to the host. We observed a strong correlation between the genome size and RNA families number, this direct relationship already has been observed in prokaryotes [53], and may explain the general variation of sRNAs when different strains of S. agalactiae are compared.
Cluster analysis determined three distinct clusters of S. agalactiae: cluster 2, fish and mammalian strains; cluster1, mammalian strains; and cluster0, fish strains ( Table 1). The clusters resemble phylogenetic relationships revealed by single nucleotide polymorphism (SNP) analysis found in Rosinski-Chupin et al. [46] where CC7 is related to fish/human, CC19, CC23 and CC67 related to human/ bovine and ST261, ST552 and ST260 to fish/frog. In addition the number of RNA families seems to be the factor that better separated the observed groups (Additional file 6). This finding provides evidence that there is a pattern concerning to strains identification based on Rfams number, therefore contributing to future research directions in the attempt to understand non-coding RNAs mechanisms and development of molecular markers.
All sRNAs with possible promoter and terminator transcriptional signals were located in the Core region overlapping the intergenic regions. Although the presence of these transcriptional signals increases the chance that Rfams are functional transcripts, they were not detected for some RNA families. This could be either due to the difficulty in finding such signals in intergenic regions, or because ncRNA promoter signals may be weak compared to the coding gene signals [57]. All RNaseP bact b is a ribozyme, asd is a intergenic non-coding RNA on NEM316, preQ1-II is a riboswitch and SSRC38 is a small RNA signals found are tagged on annotation files provided as Additional file 4.

RNA families in shared and exclusive regions
This in silico analysis led to the identification of two main ribozymes commonly found in bacteria, type B Ribonuclease P (RNase P) and group II (GII) introns. RNase P was found in all GBS genomes as expected since it is considered ubiquitous in bacteria. Its role is to catalyze the cleavage of the 5′-end of precursor transfer RNAs in order to generate the mature molecule [58,59].
On the other hand, GII introns are ribozymes that catalyze their own splicing and in the presence of its cofactor became mobile by way of a retrotransposition mechanism [60]. All GII introns belonged to the subclass C and shared the same lineage with GBSi1 from S. agalactiae [61]. In their work the authors proposed that GBSi1 shares the same insertion site with the insertion sequence IS1548 thus limiting its dissemination, and that this ribozyme may act as a marker for some clonal lineages of GBS. Our results are in line with the previous findings, since these elements were found in many but not all genomes (n = 10). However, we did not observe a correlation between the presence of this ribozyme and the clonal relationship (Additional file 5). Almost all GII introns had their cofactor, the intron encoded protein, which indicates that these ribozymes are active retroelements. Seven genomes (strains GD201008-001, HN016, YM001, GX064, GBS1-NY, GBS2-NM, GBS6) have a range from 9 to 19 non-redundant copies whereas the remaining 3 (strains 2603 V/R, NGBS572, GBS85147) only showed less than 3 copies. This indicates that when a genome contains the target site the ribozyme will invade it. Furthermore we observed that along GBS genomes there is a significant variation on ribozyme presence, amount of copies and target site availability. Taken together the data suggests that GII introns are key elements in the evolution and plasticity of GBS genomes.
Our results also showed that there is an association between the rli28 and RatA elements (Additional file 11). Genomes carrying both elements showed that there were 60 bp apart. It has been proposed that rli28 has a role in virulence [49], whereas RatA has been proposed as the antitoxin of TxpA toxin. Together RatA and TxpA act as type I toxin/antitoxin (TA) system [62]. Since this TA system has been described as located in phage-like elements, we evaluated whether this was also the case. Predictions with Phaster and Alienhunter confirmed that both rli28 and RatA are located in phages and therefore has the potentiality to be transferred to other bacterial species.
It is worth to mention that SSRC38 has been found in almost all genomes from clinical isolates, except for GBS85147 (Additional file 5). Although its function has not been yet described but its transcription reported in S. pyogenes [48], this sRNA seems to be an interesting candidate to be used as a molecular marker in the identification of new isolates or clones.
The presence of sRNAs in both Shared and Exclusive regions seems to be related to genomic areas with hotspots of recombination and insertion of mobile genetic elements. Frequently, phages can contribute to the creation of new interaction systems between hosts and pathogens [63] and regulatory RNAs can also be transferred between bacterial genomes [64]. Also, recombination between genomic sequences or sequences originating from phages may alter bacterial genomic structures so as to generate or eliminate RNA families [55].

Families in Core regions
Almost all Rfams detected in Core regions were overlapping the intergenic regions. This predominance had previously been observed (see [65]). The only exception was sau-50 which is antisense in relation to a DNA-binding protein. It is worth to mention that sau-50 has been validated in Staphylococcus aureus [44] and also overlaps with a DNA binding protein.
RNA families related to essential activities and housekeeping functions were found in Core regions. Families like L10_leader control ribosomal protein transcription [66], FMN riboswitch is related to riboflavin production [67,68], and tmRNAs release ribosomes caught in messenger RNAs without stop codons [69]. Clustered regularly interspaced short palindromic repeats (CRISPR) provide immunity against mobile genetic elements and had been analyzed in various strains of S. agalactiae (for details see [70]). The tracrRNA family, which is fundamental to the maturation of the CRISPR's crRNAs [71] was found in Core regions.
Other RNA families found in the Core regions seems to be related with their upstream or downstream genes like 23S-Methyl, Lacto-rpoB and Spd-sr37 (as can be seen in [72][73][74] respectively).
Gene ontology (GO) terms associated to antibiotic resistance and cell membrane were observed in GBS host pathogen interactions [28]. sRNAs like Lacto-rpoB [73]) and Small Bacterial SRPs [75] also seems to be related to the previously cited GOs, thus these pieces of evidence suggest a putative relationship between sRNAs with GBS virulence, and thus can contribute with its pathogenicity.

Conclusions
The original concept of pan-genome proposed by Tettelin et al. [45] for GBS was predicted on protein sequences. Since non-coding RNAs do not have a coding region they have been left aside.
S. agalactiae shows a compact core genome with few sRNAs at Core regions and various Rfams at its Exclusive