Phylogenomics and comparative genomics of Lactobacillus salivarius, a mammalian gut commensal

The genus Lactobacillus is a diverse group with a combined species count of over 200. They are the largest group within the lactic acid bacteria and one of the most important bacterial groups involved in food microbiology and human nutrition because of their fermentative and probiotic properties. Lactobacillus salivarius, a species commonly isolated from the gastrointestinal tract of humans and animals, has been described as having potential probiotic properties and results of previous studies have revealed considerable functional diversity existing on both the chromosomes and plasmids. Our study consists of comparative genomic analyses of the functional and phylogenomic diversity of 42 genomes of strains of L. salivarius using bioinformatic techniques. The main aim of the study was to describe intra-species diversity and to determine how this diversity is spread across the replicons. We found that multiple phylogenomic and non-phylogenomic methods used for reconstructing trees all converge on similar tree topologies, showing that different metrics largely agree on the evolutionary history of the species. The greatest genomic variation lies on the small plasmids, followed by the repA-type circular megaplasmid, with the chromosome varying least of all. Additionally, the presence of extra linear and circular megaplasmids is noted in several strains, while small plasmids are not always present. Glycosyl hydrolases, bacteriocins and proteases vary considerably on all replicons while two exopolysaccharide clusters and several clustered regularly interspaced short palindromic repeats-associated systems show a lot of variation on the chromosome. Overall, despite its reputation as a mammalian gastrointestinal tract specialist, the intra-specific variation of L. salivarius reveals potential strain-dependant effects on human health.


3
Quality assessment of genomes 4 Three additional quality control steps were performed on all genomes. First, the genomes 5 were BLASTed (blastn v2.2.26+) [1] against a filtered version of the RDP database (v11.1) [2] that 6 included only complete or near-complete 16S rRNA genes (>1400 bp) annotated to species level. This 7 was carried out to confirm that the top hit for each assembly was an L. salivarius sequence in the RDP 8 database and also to make sure that no other good hit (>1000 bp; identity >= 97%) was found, which 9 would indicate possible contamination. Second, 39 universal marker genes [3] were BLASTed 10 (tblastn) against the contigs of each genome. The reasoning here was that all 39 marker genes should 11 be fully assembled in a high-quality genome (which they were). Third, the contigs of each genome 12 were assessed using Kraken (v0.10.6) [4] as a further test for possible contamination. The results of 13 all 43 Kraken runs are shown in Table S4. 14 15 Assigning contigs to replicons 16 A common problem when analysing genes of interest in draft genomes is being able to tell 17 whether a particular gene is present on the chromosome or on a plasmid. Since all of a contig must 18 either be part of the chromosome or part of a plasmid, once a contig has been assigned to a replicon 19 the genes present on the contig can also be assigned to the same replicon. We describe here the 20 method used in this study to assign each contig to its most likely replicon.

21
A database of 92 plasmids (10 megaplasmids and 82 plasmids) from 16 Lactobacillus species 22 was generated from complete genomes available on NCBI. Each draft genome was BLASTed (blastn 23 v2.2.26+) against this database and the results were filtered in order to assign each contig to a plasmid 24 or, failing plasmid assignment, to the chromosome. Megaplasmids (>10 kb; both circular and linear) 25 and smaller plasmids (<10 kb) were included in the database so four categories of replicon were 26 possible (including the chromosome). 27 To justify BLAST thresholds for assigning contigs to replicon categories, 4 complete 28 genomes (UCC118, CECT5713, Ren and NIAS840) were broken up into contigs using a randomly 29 generated chi-squared distribution with Degrees of Freedom equal to 1. This distribution was chosen 30 because its median Spearman correlation with the distribution of lengths of contigs for each draft 31 genome in our study was 0.96 (Q1 = 0.9; Q3 = 0.98; n = 38), ensuring that the artificial draft genomes 32 would resemble the draft genomes in our dataset in terms of contig length distributions. The values of 33 this distribution were then converted to proportions and randomly permuted in order to avoid a bias 34 between contig length and genome region. Each genome was divided up based on the order of the 35 randomly permuted proportions where each proportion is a fraction of the total number of base pairs 36 in the genome. For each of the 4 complete genomes, each replicon was broken up into 50 contigs and 37 contigs less than 200 bp were excluded. The FASTA header for each contig was labelled with the 38 replicon from which it was taken so that the specificity and sensitivity of the contig assignment 39 method could be tested. The R (v3.2.3) code uploaded to figshare (Data Bibliography of main text; 40 data file 5) shows the steps for generating 50 draft contigs from the complete chromosome sequence 41 of UCC118. 42 The 4 artificial draft genomes were then BLASTed (blastn) against the database of 92 43 plasmids. This was done for each genome separately so that the complete plasmids from each genome 44 being BLASTed could be removed from the database beforehand. This ensured that draft plasmid 45 contigs were not just aligning to the complete version of their own plasmids. An unfiltered evaluation 46 of the BLAST results showed that the highest % alignment length of a chromosomal contig against 47 the plasmid database was 23.7% (490/2,070). An alignment length of 25% against the plasmid 48 database was therefore chosen as the cut-off for assigning contigs to plasmids. BLAST hits between 49 two sequences can have multiple high-scoring pairs (HSPs) so the sum of the non-overlapping length 50 of all HSPs between each contig and reference plasmid was calculated. The reference plasmid 51 sequence with the highest % alignment to the contig was chosen and all alignments of less than 25% 52 were excluded. Depending on their top hit, these remaining contigs were assigned to one of three 53 categories: plasmid, circular megaplasmid or linear megaplasmid. It should be noted that small 54 contigs representing transposases or other small repetitive regions may be present on both the 55 chromosome and the plasmid(s) so the assignment of these contigs is less reliable. The sensitivity and 56 specificity of the BLAST results for the 4 artificial draft genomes against the plasmid database are 57 shown in Table S5. Code for calculating the sum of the non-overlapping length of HSPs between each 58 contig and each reference plasmid has been uploaded to figshare (Data Bibliography of main text; 59 data file 6). 60 As an additional quality check, three genes identified as being specific to the L. salivarius 61 circular megaplasmid(5) -repA (LSL_1739), repE (LSL_1740) and parA (LSL_1741)were 62 BLASTed (tblastn) against the contigs for each genome assembly to see if all top hits were to 63 predicted megaplasmid contigs. Results are in Table S2.

65
Specific functional groups 66 COG categories for genes were predicted by BLASTing (blastp) amino acid sequences 67 against a COG database (ftp://ftp.ncbi.nih.gov/pub/COG/COG2014/data) with thresholds of 40% 68 identity, 50% alignment length of the query gene and a BLAST bit score of 60. Any gene match that 69 fell below these thresholds was added to the COG category 'unknown function'. For each genome, 70 genes were assigned to their respective replicons. 71 Peptidases were predicted by BLASTing (blastp) amino acid sequences against full sequences 72 from the MEROPS database (https://merops.sanger.ac.uk). BLAST thresholds used were 40% 73 identity, 50% alignment length of the query gene and a BLAST bit score of 60. 74 Sortase genes were predicted using hmmscan from the HMMER3 (v3.1b1) [ webserver was used to locate putative pilus operons using default parameters. All three methods used 81 amino acid sequences as input. 82 Glycosyl hydrolases and glycosyl transferases were predicted using hmmscan with DBcan [9] 83 HMM profiles (http://csbl.bmb.uga.edu/dbCAN/). For each genome, GH and GT genes were assigned 84 The core-gene phylogenetic tree of L. salivarius has similar sub-clade 145 topology to POCP clusters, but overall tree topology is dissimilar 146 147 Percentage of Conserved Proteins (POCP) [20] calculates a similarity score based on 148 percentage of genes in common between all the amino acid sequences in two genomes. POCP was 149 designed as a method to identify whether a particular species belongs within a genus. We were not 150 interested in applying this threshold since all strains obviously fall within a single genus; instead, the 151 goal was to assess the congruency of a core-gene phylogeny with a method that clustered the strains 152 based on the presence and absence of genes. Fig. S3 shows a heatmap of POCP values, where 153 clustering of strains is in reasonable agreement with the core-gene phylogeny of Fig. 2 in terms of 154 sub-clades. Several strains cluster apart from their core-gene sub-clades including CECT5713 and 155 CCUG38008. A greater difference between POCP and the core-gene tree versus ANI and the core-156 gene tree is expected because POCP value calculations ignore homologous regions, using similarity 157 based on gene presence and absence distributions to cluster strains. This is a rough approximation of 158 the combined effect of gene decay and HGT since a gene that is present in one strain and absent in 159 another has either acquired a deleterious mutation or else has been horizontally transferred by one of 160 several mechanisms. The reason why many of the sub-clades in Fig. S3 agree with the core-gene 161 phylogeny is that the probability of gene decay or HGT events having occurred after two strains start 162 to diverge from a common ancestor increases with time. Adaptation to different niches and differing 163 selection pressures then start to disrupt the correlation between core-gene phylogeny and clustering of 164 shared/unshared genes [21]. We found no general association of clusters from any tree generated in 165 this study with the isolation sources of the strains (Table S1), but members of several small clusters 166 were all isolated from the same source. This overall lack of niche-strain association may be due to the 167 transient appearance of L. salivarius in niches associated with the gastro-intestinal tract (food, 168 opportunistic infection of body sites, etc.) and it would be a mistake to assume that every strain has 169 acquired niche-specific adaptations to its source of primary isolation. Proteases are a large group of proteins, divided into many families that are involved in the 175 hydrolysis of peptides. Fig. S4 shows 53 protease families that display variation across the 43 genomes in this dataset. Genes for eighteen additional protease families were predicted in L. 177 salivarius , but these families showed no variation across the strains, with 17 represented by a single  178  gene per strain (A01A, C108, C14B, C19, C46, I04, I87, M02, M10A, M13, M15D, M20B, M20D,  179  M24A, S09A, S09B, T05, T06) and one, a cysteine protease (C19) described as 'ubiquitin-specific' 180 by the MEROPS database (Table S6), represented by two genes per strain. It can be speculated that 181 these 18 families are subjected to purifying selection since the remaining 53 protease families vary 182 both in gene count and in presence and absence across the strains. 183 Out of the 53 protease families that vary in their distributions, 33 are present in all 43 184 genomes, but have variable gene counts; genes for thirty of these are found on the chromosome only 185 while the remaining three are present on multiple replicons. The gene count per protease family 186 ranges from 0 to 24 where some families are present in all but a single genome and other families are 187 present in one only (usually DSM18933the strain of L. hayakitensis used in this study). The 188 protease family with the most genes in L. salivarius (4-24) is M23B, which is annotated as a 189 lysostaphin in the MEROPS database (Table S6), an antibacterial enzyme that degrades peptidoglycan 190 in the cell walls of certain bacteria, staphylococci in particular.

191
There are a number of protease families and protease inhibitors that are rare in the dataset of 192 L. salivarius annotations, with representatives belonging to one or several genomes only. JCM1046 193 has gene products in two families that the other strains do not have -I75 and S26Bboth relevant 194 genes predicted to reside on the chromosome. The gene encoding I75 is on a small contig of 964 bp 195 that has a 99% match over its full length to a phage from E. coli, suggesting recent acquisition of this 196 sequence as a prophage. The only other predicted protease inhibitor, I63, is an inhibitor of pappalysin-197 1 and it is present in all L. salivarius genomes but absent from L. hayakitensis DSM18933. S26B is a 198 signal peptidase that cleaves signal peptides from a secreted protein as it is being translated.

199
DSM18933 has 2 protease families that are not present in L. salivarius, which suggests that they were 200 either horizontally acquired by L. hayakitensis after the split from its common ancestor with L. 201 salivarius or else that L. salivarius subsequently lost these families through gene decay, whether 202 through genetic drift or active selection pressure. These two families are M42 and M60, a glutamyl 203 aminopeptidase and an enhancin, respectively.

204
Sun et al conducted a genus-wide, comparative genomic study of lactobacilli and found 205 considerable variation in cell-envelope proteases [22]. Our study shows that a more general overview 206 of protease families reflects the high levels of variation seen in Lactobacillus, at the species level, in 207 L. salivarius. 208 209 Prophages, CRISPRs and insertion sequences are widely distributed across 210

212
Two agents of HGT that affect both the bacterial chromosome and extrachromosomal 213 replicons are bacteriophages and insertion sequences (consisting primarily of a transposase gene).

214
Bacteriophages are ubiquitous among bacterial communities and phage-host dynamics has been 215 shown to stabilise diversity within a community [23] as well as to drive the arms race between the 216 evolution of bacterial defences (often in the form of CRISPR-cas systems) and the counter-evolution 217 of phage structures that neutralise those defences [24]. Insertion sequences (IS) have been implicated 218 in the horizontal transfer of a wide range of functions and are noted for their role in conferring niche-219 specific advantages to bacteria, allowing the persistence of strains or species in new environments that 220 were previously uninhabitable [25]. 221 Fig. S5 shows a heatmap of predicted prophage genes (COGs) and Fig. S6 shows a barplot 222 of prophage counts. Nine strains (2 sub-clades of 4 strains each and JCM1230) lack predicted 223 prophages; it is unlikely that these 9 strains have no history of interacting with bacteriophages -224 instead, VirSorter has failed to predict relatively intact prophages in the genomes of these strains.

225
Canchaya et al summarise the relationship between bacteria and prophages by writing that prophages 226 are lost from bacterial genomes as easily as they are acquired [26]. There is no clear association of the 227 two sub-clades with a single niche, although the human oral cavity is the isolation source of 5 of these 228 strains and the other 4 were isolated from the mammalian intestine. It is tempting to suggest that the 229 oral environment selects against the persistence of prophages; however, Edlund et al describe the oral 230 cavity as the perfect portal for viruses to access the oral microbial community [27] and previous 231 studies have shown that it is host to a diverse community of phages [28,29]. 232 The COG category in Fig. S5 with by far the most genes is 'Function unknown' (S) with a 233 mean average gene count of 61.3 compared with the second highest -'General function prediction 234 only' -of 3. The size of these categories emphasises the limits of current knowledge regarding 235 bacteriophage gene function. There is a correlation between number of predicted prophages and 236 number of prophage genes (Spearman; rho = 0.78; p < 0.001), which is largely expected and 237 highlights the size constraints on phages that infect L. salivarius since number of prophages, not 238 phage type, approximately accounts for number of prophage genes. Some of the COG categories that 239 are least abundant in predicted prophages are those involved in cell-specific functions such as cell 240 motility (N) and secretion (U). The distribution of the remaining COG categories across the strains is 241 indicative of the dynamic nature between bacteria and their phages, with considerable intra-species 242 variation suggesting that the prophage complement of the ancestor of L. salivarius does not resemble 243 any of the currently extant strains since their repertoire of prophages is so distinct. 244 Table S7 describes the distribution of CRISPRs across the 43 genomes as well as their 245 associated cas genes. All CRISPRs are located on the chromosome, highlighting their role in 246 protecting against extrachromosomal sequences. Almost all strains in this dataset have either the type-247 II or type-III CRISPR-cas system (or both), identified by the cas 9 or cas 10 gene, respectively, and 6 248 strains have no identified CRISPRs. The presence of either type-II or type-III CRISPR-cas systems 249 show some clustering on the core-gene tree in Fig. 2: the DSM20555 T sub-clade consisting of 4 strains 250 all have the type-III system only while the CECT5713 (6 strains) and UCC118 (4 strains) sub-clades 251 have the type-II system only; the AH43348 sub-clade (6 strains), in contrast, has both type-II and 252 type-III systems. The partial clustering of CRISPR-cas systems according to the core-gene tree is 253 supported by Fig. S7, which shows a maximum-likelihood tree of the cas 1 gene for type-II CRISPR-254 cas, providing evidence of CRISPR-cas systems being acquired and maintained in the common 255 ancestors of these sub-clades. The 6 strains with no CRISPRs show some clustering on the core-gene 256 tree in Fig. 2, but JCM1045 and DSM18933 are singletons. The absence of CRISPR-cas systems does 257 not have an obvious association with niche or the presence of prophages, suggesting that the 258 interaction between CRISPR-cas systems, bacteriophages and the environment is far from 259 straightforward. There are also 6 undefined CRISPRs from 4 strains that could not be described due to 260 the absence of cas genes in close proximity. These CRISPRs are probably degraded systems that are 261 no longer functional since all functioning CRISPR-cas systems have the cas 1 gene, which is involved 262 in recognition and cleavage of invading DNA. 263 Fig. S8 shows a heatmap of gene counts for insertion sequences across the 43 strains, divided 264 up into their respective replicons. The most striking thing about this figure is the inter-strain diversity 265 of transposases, both within and between replicons. The gene counts for each transposase family in a 266 specific strain on a particular replicon range from 0 to 52, highlighting the considerable variation in 267 copy number displayed by these horizontally transferred sequences. The majority of transposases have 268 copies on the chromosome and the plasmids, suggesting that they utilise the conjugative ability of 269 plasmids to increase their abundance within and between species. The distributions of the IS families 270 follow different patterns, from being widely spread over all three replicon groups (IS3) to being 271 limited to the chromosome and megaplasmid (IS21) to being confined to the smaller plasmid(s) 272 (IS256). The only IS family confined to the chromosome is IS1 in a single strain -NIAS840.

273
The multi-replicon distribution of IS families implies that there is strong selection pressure on 274 insertion sequences to transpose regularly from chromosomes to plasmids and vice versa, perhaps 275 being partly responsible for the fact that transposases are currently considered to account for the most 276 abundant gene families in both prokaryotes and eukaryotes [30]. The widespread distribution of IS3 in 277 L. salivarius replicons is mirrored by its abundance (539 genes across the 43 strains); it is also the 278 only family to consist of two sub-families -IS3 and IS150. Similarly, the other IS families with the 279 widest distributions -ISL3, IS21 and IS200 -also have the greatest abundances after IS3, although 280 IS21 is absent from the smaller plasmids even if it is ubiquitous on the L. salivarius chromosome. 281 Overall, IS families with a higher copy number in this dataset show a strong correlation with 282 how many strains ( important role in the interaction of a bacterium with its environment since they are either secreted 295 from the cell or function as membrane-bound structures. The number of predicted genes with signal 296 peptides and with trans-membrane domains range from 56 to 84 and from 33 to 54, respectively. 297 There is no association between niche and the number of either of these functional groups, which is 298 not entirely surprising. These results highlight once more the point made earlier that certain isolation 299 sources of L. salivarius strains shouldn't be interpreted as the niches that each strain has adapted to 300 over time -some strains might be acting as opportunists that don't persist in a given environment for 301 long such as the Lactobacillus species from a 2007 study (mainly L. rhamnosus) that were isolated genes in the 42 L. salivarius strains. The ability to hydrolyse bile salts is a necessary trait for any 309 bacterium that is adapted to traversing the initial sections of the gastro-intestinal tract in order to 310 colonise the intestine. It is also a required function for probiotics since a potential probiotic without 311 the ability to reach its target area (usually the colon) will be ineffective. All 42 L. salivarius strains 312 have at least one Bsh gene while L. hayakitensis DSM18933 has none, suggesting that the common 313 ancestor of L. salivarius and L. hayakitensis did not possess a Bsh gene, although it is possible that 314 another strain of L. hayakitensis does harbour one or more; if this is the case then gene decay of the 315 Bsh gene in DSM18933 is a likely explanation. Two Bsh genes -one on the chromosome and one on 316 the megaplasmid -seems to be the typical organisation in L. salivarius as described by Claesson et al 317 [32] since 36 out of 42 strains fit this description. Four strains -CECT5713, JCM1230, LMG14476 318 and LMG14477 -have a single Bsh gene located on the chromosome while 2 strains -cp400 and 319 JCM1046 -have three BSH genes, both having two on the megaplasmid and one on the chromosome. 320 The presence of at least one Bsh in all 42 L. salivarius strains reinforces the point that this 321 species is commonly isolated from the GIT of humans and animals. The variable number of Bsh genes 322 and their presence on both the chromosome and the megaplasmid suggests that there is variability in 323 bile resistance across the strains. This was shown in a study by Fang et al, but they cautioned that bile 324 resistance is independent of the bsh1 allele type (the Bsh on the megaplasmid of most strains) and 325 they go on to show that, upon exposure to bile and cholate, a transcriptome analysis reveals the up-326 regulation of numerous stress response and efflux proteins, which might mask the variable influence 327 of Bsh allele types [33]. 328 It should be noted that for the BLAST analysis of this category, a stricter cut-off value of 50% 329 for coverage of both the query and reference genes was used. This was done because the number of 330 BLAST hits to Bsh genes in the database contradicted previous literature so a closer agreement in 331 protein length between query and reference sequences was enforced. It is possible that large 332 discrepancies between the lengths of sequences in the database and sequences in the predicted L. 333 salivarius gene repertoire led to false negative Bsh predictions. When the criteria are relaxed to 334 include only 50% coverage of the query gene (and not the reference) an extra Bsh is predicted in some 335 strains and these might actually be genuine Bsh genes that this study has excluded. 336 337 and virulence factors (VF) across the 43 strains. VFs range from 2 to 3 genes and ARs range from 7 to 340 16. Virulence and antibiotic resistance are two traits that are screened for when assessing the 341 suitability of a strain to act as a probiotic [34] and these traits are particularly dangerous in clinical 342

Summary survey of virulence factors and antibiotic resistance genes
settings. Table S8 and Table S9 give a more detailed summary of these results for ARs and VFs, 343 respectively, while data file 7 and data file 8 give the corresponding amino acid sequences in FASTA 344 format (figshare; Data Bibliography of main text).

345
The most commonly predicted function for AR genes in this dataset is transport, specifically a 346 subset of efflux pumps for such antibiotics as tetracycline, elfamycin, bacitracin, clindamycin, 347 fosfomycin, dalfopristin and others. Efflux pumps evolved long before the advent of antibiotic usage 348 in modern medicine and probably originated as a defence against toxic substances entering the cell 349 [35]a strategy that has more recently been used to confer antibiotic resistance to microbes from 350 multiple drugs, leading to a health crisis in the effective treatment of infection with antibiotics.