Phylogenetic and comparative genomics of the family Leptotrichiaceae and introduction of a novel fingerprinting MLVA for Streptobacillus moniliformis

The Leptotrichiaceae are a family of fairly unnoticed bacteria containing both microbiota on mucous membranes as well as significant pathogens such as Streptobacillus moniliformis, the causative organism of streptobacillary rat bite fever. Comprehensive genomic studies in members of this family have so far not been carried out. We aimed to analyze 47 genomes from 20 different member species to illuminate phylogenetic aspects, as well as genomic and discriminatory properties. Our data provide a novel and reliable basis of support for previously established phylogeny from this group and give a deeper insight into characteristics of genome structure and gene functions. Full genome analyses revealed that most S. moniliformis strains under study form a heterogeneous population without any significant clustering. Analysis of infra-species variability for this highly pathogenic rat bite fever organism led to the detection of three specific variable number tandem analysis loci with high discriminatory power. This highly useful and economical tool can be directly employed in clinical samples without laborious prior cultivation. Our and prospective case-specific data can now easily be compared by using a newly established MLVA database in order to gain a better insight into the epidemiology of this presumably under-reported zoonosis.


Background
The Leptotrichiaceae are a family of underexplored and rarely isolated microorganisms within the phylum Fusobacteria containing both species known from certain pathologies as well as colonising members of the resident microbiota. Many if not all species of the Leptotrichiaceae inhabit the oral cavities, gastrointestinal or urogenital tracts of humans and animals [1][2][3]. One of the reasons they are rarely encountered is the obligate anaerobic or capnophilic growth dependence of these fastidious bacteria and the usual presence of a high number of concomitant microorganisms. Some members of this family are well known pathogens, such as Streptobacillus (S.) moniliformis, one of the two causative organisms of the bacterial zoonosis rat bite fever [4]. Recently, a number of novel species have been described, most of which could be attributed to clinical disease [5][6][7][8]. It can also be concluded from numerous phylotypes, Leptotrichiaceae normally colonize mucous membranes [9][10][11][12][13][14][15], but when introduced into new tissue or host sites they are also able to shift their pathogenic potential and cause severe and even lifethreatening disease. With increasing availability of next generation sequencing a number of single genomes have been published [6,[16][17][18][19][20]. However, almost no comprehensive genomic studies including these microorganisms have been completed, nor have virulence properties been identified in these species. Phylogenetic studies and identifications within the phylum Fusobacteria have been carried out and based on single or multiple gene sequences such as 16S rRNA, 16S-23S rRNA internal transcribed spacer, gyrB, groEL, recA, rpoB, conserved indels and genes for group-specific proteins, 43-kDa outer membrane protein and zinc protease [18,[21][22][23][24][25][26][27][28][29][30]. In an attempt to characterize different members of this phylum Gupta & Seti proposed various conserved signature indels (CSIs) in amino acid sequences for the Leptotrichiaceae from which three CSIs were found to be specific for this family [31]. On the other hand, no detailed phylogenetic and comparative genome studies dedicated to Leptotrichiaceae have been published up to now. Furthermore, and due to a general paucity of strains and attempts to differentiate members from the same species there is currently no tool available to type isolates in order to prove transmission chains. Our data, presented here, were derived from 46 complete genomes from 20 different taxa of the family Leptotrichiaceae aiming to provide the first such comparative analysis. Our study results confirm the picture of earlier phylogenies from this group that are now based on a larger scale of orthologous genes. We give a surveying insight into the investigated genomes, thereby also including recently described species from this family. With a novel approach it was, furthermore, possible to accurately and unequivocally type isolates of S. moniliformis based on three variable number tandem repeat (VNTR) sequences. With this, we are presenting a culture-independent, species-specific fingerprinting tool in order to type the most important causative organism of rat bite fever for the first time.

Accession numbers
The GenBank/EMBL/DDBJ accession numbers for the genome sequences used in this study are summarized in Table 1.

Phylogenetic analysis based on orthologous genes
To determine the phylogeny within the genus Streptobacillus we aligned the allelic variations of 281 orthologous genes from 29 strains of S. moniliformis, S. ratti, S. notomytis, S. felis and S. hongkongensis which resulted in 57,841 single nucleotide polymorphisms (SNPs). From these SNPs we inferred a maximum likelihood phylogeny showing the distance between the different species within this genus (Fig. 1). To zoom deeper into the phylogeny of the S. moniliformis group we repeated this analyses with 775 orthologous genes present in 23 S. moniliformis strains which resulted in 5,211 SNPs. These SNPs were also used to construct a maximum likelihood phylogeny (Fig. 2).
As shown in the tree, most S. moniliformis strains used for this study are unrelated and form a heterogeneous population without any significant clustering. Solely strains A378/1 and B5/1 that both originate from the same source but without a common epidemiological background were phylogenetically indistinguishable.

Analysis of genomes and protein functions
The genome size in members of the Leptotrichiaceae varies between 1. 22  ). An organization of the genomes under study into clusters of orthologous groups (COGs) is depicted in Additional files 1 and 2 and shows, however, high intra-species as well as inter-species variations. On a generic level, gene contents of COG classes J, L, D and F are inversely correlated with increasing genome size, whereas COG classes K, N, T and Q are positively correlated (see Additional files 1 and 2).

Multiple-Locus Variable number tandem repeat Analysis (MLVA) In silico VNTR analysis
Under default conditions, 127 repeats were identified by the tandem repeat finder. For further analysis, the three most variable VNTRs were identified according to the degree of variability of allele types identified by   (Table 3). These three allelic loci were only present in S. moniliformis and thus proved to be specific for this microorganism (all other members of the Leptotrichiaceae were negative). The combination of the three loci yielded a high discriminatory index (0.94296 DI; Table 4).

PCR-based validation of in silico results
The absence of the calculated VNTR loci could also be proven by polymerase chain reaction (PCR) in all Leptotrichiaceae members other than S. moniliformis (data not shown). Contrarily, each of the ten S. moniliformis strains exhibited a specific band corresponding to their predicted tandem repeats pattern. Analysis of the sequenced PCR products confirmed the allele type allocation determined in silico (Table 4). VNTR_Sm1 alleles of two isolates, which were not found in silico, were successfully assigned (Table 4). Re-calculation revealed a DI of 0.9529 after including these two isolates, as well as one isolate for which no genome data was available. In order to facilitate comparisons of results in future studies, every genotype (from the allele types of the three loci) was expressed as a specific allele combination resulting in a specific allele code (Table 4). An online database dedicated to MLVA results of S. moniliformis has been established on the webserver of University Paris-Saclay, Orsay, France (http://microbesgenotyping.i2bc.paris-saclay.fr/databases/public) which is open to future entries and strain comparisons.

Discussion
Members of the Leptotrichiaceae are rarely encountered microorganisms, a phenomenon that seems to be highly dependent on difficulties with cultivation. With the availability of molecular methods in this field the number of findings and frequencies has significantly increased [10][11][12][13][14][15][32][33][34][35][36]. On the other hand, we still need deeper insight into the genomes of this group. In particular, the mechanisms involved in pathogenesis and virulence of pathogenic species are completely unexplored. We have undertaken a first step into this direction by analysing a broad spatio-temporal collection of strains, thereby including especially species with regular evidence for pathologies. Firstly, the large dataset from this study has been utilized for the confirmation of our phylogenetic picture from earlier studies [18,30,37,38]. An intra-genus phylogeny that was based on 775 orthologous genes revealed a very similar picture to previous studies involving only four selected functional genes ( Figs. 1 and 2). Conversely and in contrast to almost identical average nucleotide identity (ANI) values [30], full genome analyses revealed a high level of heterogeneity for all but two strains (no. 15 and 18) of S. moniliformis without any significant clustering. This is, albeit, not surprising, because the present study included a large spatio-temporal collection of 23 S. moniliformis strains that have been isolated over a period of 90 years from at least five different host species and from almost all subcontinents. We were also able to display the three predicted Leptotrichiaceae specific CSIs of MreB/MrI (2 aa deletion), AlaS and RecA (5 and 2 aa insertions, respectively) in all of our genomes as well as in the recently described members of the family (data not shown) [31]. Genome size dependent gene content has been described and could also be confirmed for the genomes from this study [19]. With increasing genome size gene contents of COG classes J, L, D and F involved in DNA replication, cell cycle regulation and protein translation  Table 1). The tree is based on 281 orthologous genes including 57,841 SNPs are inversely correlated, whereas COG classes K, N, T and Q involved in transcription, signal transduction, cell motility and the biochemistry of secondary metabolites are positively correlated (see Additional files 1 and 2). This makes sense when essential gene functions are preserved in smaller genomes and less important gene functions which are dispensable or can be 'outsourced' to the host, are lost [19]. On first impression the group of S. moniliformis strains is highly similar as can be concluded from related morphological and phenotypical properties and also from their high intra-species ANI of 98.5-99.3 % (cf. Table S2 in [30]). Based on data from this study very similar COG classes were also observed within this group (see Additional files 1 and 2), but differences in coding densities suggested, on the other hand, remarkable discrepancies. Fuelled by the idea that these discrepancies could, furthermore, be utilized with respect to epidemiology, we have developed a specific MLVA typing scheme for the major pathogen from this group, S. moniliformis, and the causative organism of rat bite fever. This scheme proved to be sufficient in unequivocally typing all 23 S. moniliformis strains under study plus one additional isolate with high discriminatory power (0.9529 DI). Interestingly, only four allele codes (genotypes; LHL2, LHL5, LHL10 and LHL11) were found more than once among isolates (Table 4). At least for LHL2 isolates, a connection could be pursued in that both isolates have been stored in the same strain   15 and 18) belonging to the allele code LHL2 indeed shared an identical CRISPR region, thereby pointing towards a clonal relation of these two isolates (data not shown) as could also be concluded from the phylogenetic tree (Fig. 2). Due to its length of up to approximately 3,000 nucleotides and its high level of heterogeneity the CRISPR region seems, on the other hand, presently not very well suited as a direct typing tool, but could be useful in certain situations to confirm or negate clonality of strains. A second advantage of the MLVA method described in this study is that it can effectively be pursued directly from the original matrix (e.g., a mouth microbiota swab and a clinical sample) without prior cultivation of the organism, which offers the possibility to better understand transmission chains in the future. This seems to be especially relevant since established PCR assays are not species specific, but limited to genus level specificity [37,39,40]. The majority of diagnoses of rat bite fever cases in the recently published literature relies only on partial 16S rRNA gene sequence analysis that may in the light of very similar novel Streptobacillus spp. that also colonize ratsbe quite uncertain for proper pathogen identification [41]. Hopefully, the newly established MLVA database will help to clarify regional infectious clusters and confirm transmission of certain lineages.

Conclusion
We have undertaken a first analysis of Leptotrichiaceae genomes using a large spatiotemporal collection of strains also including novel members of this group. Our dataset unveiled a first insight into characteristics founding a stable phylogeny, genome structure and COG classes. Beside apparent intra-species similarities we have detected also genetic heterogeneities that provided a basis for fingerprinting the most relevant pathogen from this group, the rat bite fever organism, S. moniliformis. This highly useful and economical tool can be directly used from clinical samples without ambitious prior cultivation and with high discriminatory power. Our data form the basis for a newly established MLVA database that provides the opportunity to store and compare isolate-specific information in future cases with this neglected zoonosis.

Generation of genomic data
Twenty-two strains of S. moniliformis were sequenced in this study, ten strains were taken from previous publications of our group and 15 strains were descended from other projects (Table 1). Genomic DNA was extracted   [43]. Table 1 depicts the set of strains and reference genomes used for this study.

Phylogenetic analysis based on orthologous genes
The determination of the maximum common genome (MCG) alignment was done comprising those genes present in all genomes considered for comparison [44]. Based on the parameters sequence similarity (minimum 70 %) and coverage (minimum 90 %) the genes were clustered and those genes that were present in each genome, fulfilling the threshold parameters were defined as MCG. This resulted in 281 orthologous genes for the comparison of 29 strains of S. moniliformis, S. ratti, S. notomytis, S. felis and S. hongkongensis and in 775 orthologous genes for the comparison within 23 strains of S. moniliformis only.
The following extraction of the allelic variants of these genes from all genomes was performed by a blast based approach after which they were aligned individually for each gene and concatenated which resulted in an alignment of 219,961 bp for the 29 strains and of 546,508 bp for the 23 S. moniliformis strains [45].
This alignment was used to generate a phylogenetic tree with randomized axelerated maximum likelihood (RAxML) 8.1 [46] using a General Time Reversible model and gamma correction for among site rate variation.

Analysis of genomes and protein functions
Genes were predicted with Prodigal [47] and assigned to COGs with the NCBI's Conserved Domain Database [48].

Multiple-Locus Variable number tandem repeat Analysis (MLVA) In silico VNTR analysis
The complete genome sequence of the S. moniliformis type strain DSM12112 T (accession number CP001779.1) was used to search for potential VNTRs using a tandem repeat finder web tool (http://tandem.bu.edu/trf/trf.basic.submit.html). We focused our search on repeats that were characterized by high purity, large size, and/or large number of repeat copies [49]. Repeats of interest were aligned against a set of available genomes depicted in Table 1 using Geneious and allele types were determined as shown in repeat copy numbers. The DI was calculated for a combination of three most variable VNTRs using an online discriminatory power calculator (http://insilico.ehu.es/ mini_tools/discriminatory_power/).  Bold rows represent strains used for a PCR-based validation of in silico identified VNTR allele types (underlined alleles were not found in silico and only identified after PCR amplification); a in order to fit requirements of the database, the repeat copy numbers at locus VNTR_Sm1 have been rounded up to receive integer values (e.g., 9 instead of 8.7); b while the repeat copy numbers at locus VNTR_Sm3 have been rounded up to the next half-value and doubled to receive integer values (e.g., 15 instead of 7.2); T : type strain; c strain was only used for validation (no complete genome available)

Ten
for which complete genomic data were not available) as well as all accessible members of the Leptotrichiaceae other than S. moniliformis were used for validation. DNA was extracted from respective isolates (2-3 colonies) by boiling in 100 μL distilled water for 20 min (min.) followed by centrifugation at 20,817 × g for 5 min. The 20 μL final PCR reaction contained 10 μL of Hotstar Taq MasterMix (Qiagen, Hilden, Germany), 1 μL of each forward and reverse primer (10 pmol/μL) (TIB MOLBIOL, Berlin, Germany) (

Additional files
Additional file 1: Table S1. Analysis of clusters of orthologous groups (COGs) of the Leptotrichiaceae members used in this study. COGs were assessed as described in the Materials and Methods. (XLSX 16 kb) Additional file 2: Figure S1.