Genomic determination of minimum multi-locus sequence typing schemas to represent the genomic phylogeny of Mycoplasma hominis

Mycoplasma hominis is an opportunistic human pathogen, associated with clinically diverse disease. Currently, there is no standardised method for typing M. hominis, which would aid in understanding pathogen epidemiology and transmission. Due to availability and costs of whole genome sequencing and the challenges in obtaining adequate M. hominis DNA, the use of whole genome sequence analysis to provide clinical guidance is unpractical for this bacterial species as well as other fastidious organisms. This study identified pan-genome set of 700 genes found to be present in four published reference genomes. A subset of 417 genes was identified to be core genome for 18 isolates and 1 reference. Leave-one-out analysis of the core genes highlighted set of 48 genes that are required to recapture the original phylogenetic relationships observed using whole genome SNP analysis. Three 7-locus MLST schemas with high diversity index (97%) and low dN/dS ratios (0.1, 0.13, and 0.11) were derived that could be used to confer good discrimination between strains and could be of practical use in future studies direct on clinical specimens. The genes proposed in this study could be utilised to design a cost-effective and rapid PCR-based MLST assay that could be applied directly to clinical isolates, without prior isolation. This study includes additional genomic analysis revealing high levels of genetic heterogeneity among this species. This provides a novel and evidence based approach for the development of MLST schema that accurately represent genomic phylogeny for use in epidemiology and transmission studies.


Background
Mycoplasma hominis is an opportunistic human pathogen and resides on the mucosal surfaces of the cervix or vagina in 21 to 53% of sexually mature women; its presence is somewhat lower in the urethra of males. The presence of M. hominis is associated with clinically diverse diseases including; urogenital diseases, postpartum fever [28], pneumonia [15], meningitis, post-operative wound infection, post-organ transplant infection [29] and septic arthritis. The capacity of M. hominis to cause disease has been proven by induction of preterm labour and development of foetal chronic lung disease following experimental in utero administration of M. hominis to pregnant macaque monkeys [19].
Understanding pathogen epidemiology and transmission is important in preventing future infections and comprehending the transmission chains. There is currently no standardised method of molecular typing for M. hominis and due to the fastidious growth requirements of Mollicutes, genomic typing is unlikely to be available for routine practice for the foreseeable future. Current discriminatory methods for typing of M. hominis to improve understanding of epidemiology of infection and genetic diversity are not in clinical use. Several molecular typing mechanisms have been developed for M. hominis including: restriction fragment length polymorphism (RFLP) analysis, amplified length polymorphism (AFLP) [13] and random amplified polymorphic DNA (RAPD) [24]. These techniques have displayed poor reproducibility (RAPD), require specialist equipment, time, and large quantities of biological material to perform the test. Differing typing schemes based on sequence analysis of the p75, p120' and vaa genes do not give concordant results [4,6,13,24]. Multiple locus variable-number tandem-repeat (VNTR) analysis (MLVA) has successfully been used to subtype other mycoplasma species. However, the high genetic heterogeneity of M. hominis restricted the test's use to individual studies, and was too discriminatory for large epidemic studies [9].
Multi-locus sequence typing (MLST) analysis of the diversity of housekeeping genes that are considered to be under less selective pressure than other genes have been successfully employed for several bacterial species including Mycoplasmas such as M. bovis [26], M. agalactiae, M. hyorhinis [27] and M. hyopneumoniae [16]. Sogaard et. al examined six house-keeping gene sequences to investigate evidence of genomic recombination in M. hominis and revealed a high degree of variability between these genes [23]. However, the authors did not utilise the data to create a genotyping scheme. The aim of this study was to develop an MLST scheme based on analysis of M. hominis genomic sequences to derive the minimum number of genes required to accurately reflect genomic phylogeny.

Pan-genome
Raw genomic sequence reads from 18 M. hominis clinical isolates (Table 1) were assembled and scanned against a database of Hidden Markov Models (HMMs) representing gene coding families constructed using four complete genomes published on NCBI: ATCC 27545 (NZ_CP009652;533 genes), PG21 (NC_013511; 497 genes), Sprott (NZ_CP011538; 524 genes), and AF1 (NZ_CP009677; 531 genes) (see Methods). On average, M. hominis pan genome clustering was able to detect 550 (median: 553) genes per sample, which is comparable to the mean number of genes found in the four reference genomes (521). The M. hominis pan-genome contained total of 700 genes ( Fig. 1) with 417 genes (54.9%; 95% CI: 51.44-58.42) present across all samples and the reference strain ATCC27545 at least once. The shoulders in the pan-genome frequency distribution are likely to correspond to the genes found in the specific phylogenetic clades, Fig. 1(b). The number of alleles per gene family is normally distributed with mean of 15 alleles per gene, Fig. 1(a). This is as expected given the relatively long phylogenetic distance from the reference genome observed in the whole genome SNP tree (Fig. 2). No hyper-variant, large copy number genes, such as transposases and integrases, were detected in the pangenome data analysis.

Core-genome MLST
A subset of genes found across all samples are likely to be highly conserved and carry little phylogenetic signal. Conversely, a small subset of genes could be carrying phylogenetic signal that is similar to the species as a whole, and therefore can be used as a proxy for whole genome evolution. In order to identify genes that carry increased discriminatory potential, leave-one-out analysis was performed. This involved removing one gene at a time from the set of 417 core genes and constructing a phylogenetic tree using the remaining alleles from 416 genes. The resulting phylogenetic tree was compared with the phylogeny derived using whole genome variants (considered the gold standard). Of the 417 genes, 379 genes (88.76%) conferred the same phylogenetic topology as the whole genome tree whilst 48 genes (11.24%) disrupted the phylogenetic relationship of the samples to varying extent. These results suggest that these 48 genes are necessary in reconstructing the correct relationships between the isolates included in this study, Fig. 3(a). To assess the sufficiency of these genes to replicate the topology, a phylogenetic tree was constructed using only these 48 genes (Fig. 3), confirming that these 48 genes were necessary and sufficient for tree reconstruction. These genes are considered the minimum gene set to construct a core-genome MLST scheme for M. hominis as they are present in all reference genomes and the 18 sequenced M. hominis strains, and are each required to reconstruct the whole genome topology.

7-gene MLST schema
Analysis indicated that 48 C 7 = 73,629,072 total combinations of allelic profile could be derived from the genomic data of 22 isolates; exhaustive search of a large search space like this is not computationally tractable. Subsets of target genes were selected in order to reduce the size of the search space for further analysis and contained three sets of genes; genes that caused lowest overlap with the whole genome SNP tree in leave-one-out analysis (n = 10), manually selected genes for their biological function (n = 9), and combination of the above two sets (n = 12). Two sets of genomic data from M. hominis isolates originating from two single patient samples were included, patient 1: MH9 and MH10, patient 2: MH11 and MH12. These isolate pairs were used to identify which gene sets conferred identical MLST profiles for Fig. 1 Mycoplasma hominis pan-genome statistics: a: Allele frequency of the genes in the pan genome is distributed with median 3 alleles (red) and mean 12 alleles (yellow) ; lower peak corresponds to conserved genes (low allele count), higher peak corresponds to genes with many different alleles. b: Gene frequency across all samples in the pan-genome. Grey line indicates the core number of genes appearing in all clinical strains and reference strain ATCC 27545 at least once (417) these lineages. The reduction in the selected target genes resulted in 948 possible combinations of the three gene category subsets. All combinations were analysed for phylogenetic topology closest to that obtained with genomic SNP analysis. Fifteen combinations were identified to produce the highest and identical topological similarity of 0.77 among all tested combinations and three schemas ( Table 2) were found to confer the highest similarity to the whole genome SNP tree. All 15 phylogenetic trees classified MH23 and MH28 into a different subtree from the original whole genome analysis. In addition, trees from schemas B and C classified MH17 into a different topological position. Three selected schemas had overall very similar topology to the original whole genome SNP tree representing two general clades and correct bifurcations. Schema A correctly placed  MH17 as outer most group to that clade, whereas schemas B and C incorrectly placed the ancestry of MH17. The branch length of the whole genome SNP tree could not be reproduced as shorter sequences with different number of SNPs were used. Overall the three schemas almost completely reproduced the phylogenetic relationship found using whole genome SNP data.

7-gene in-silico PCR
Although full gene schemas can be used for in-silico MLST predictions using existing software, e.g. SRST2 [12] it is desirable to have a typing technique based on short (400-600 bp) PCR fragments. A set of primers (Table 3) was designed to minimise polymorphisms in primer binding sites, while maximizing information content required to approximate to whole genome SNP based phylogeny. Tree topologies from sequence products from all schemas were compared to the topologies of trees using full length genes and whole genome SNPs. Schema Ashowed the highest similarity to the topology observed in whole genome SNP tree, similarity score 0.5. Schemas B and C showed lower similarity scores 0.375 and 0.4375 respectively.

Diversity of MLST genes and determination of synonymous sequence changes
The Hunter-Gaston DI was calculated for each schema and showed that all three schemas have discriminatory power of 0.97 ST per M. hominis strain ( Table 2). All three MLST schemas are close together in terms of their diversity index indicating that there was no optimal combination of seven genes. Indeed, the schemas only differ by two genes ( Table 2). However only scheme A correctly positioned MH17. Genes under negative selective pressure are more suitable targets for an MLST schema as they are actively conserved in the host, especially where the organism exhibits a high level of genetic diversity. For each of the three proposed MLST schemas the ratio of nonsynonymous to synonymous (dN/dS) changes were calculated using the Nei and Gojobori method [18] ( Table 2). All genes in all three MLST schemas have a dN/dS ratio of < 1. Schemas A, B and C all contain genes under moderate negative selective pressure: Schema A GENE-58 and GENE-519; Schema B GENE-58, GENE-138 and GENE-519; and Schema C: GENE-58 and GENE-138. The remaining genes have dN/dS ratio below 0.1 indicating strong negative selection. Overall, schema A appears to contain, on average, a more strongly selected gene set (average dN/dS ratio: 0.01), than the other two schemas (average dN/dS ratio: 0.13 and 0.11 for schemas B and C, respectively).

Seven gene MLST stability
Stability of the genes for all three schemas was assessed in two M. hominis strains. Whole genome sequencing was performed following short-term passage (10 passages) in liquid culture and compared to the original sequence. None of the genes showed allelic variation in the two strains examined (Fig. 4).

Recombination analysis
Genomic sequences of the M. hominis strains were assessed for predicted regions of variation arising from homologous recombination. Recombination analysis was undertaken using Gubbins. The tree derived using whole genome SNPs and used in leave-one-out analysis was used as a starting tree for Gubbins. Multiple potential areas of recombination were identified across all genomic M. hominis data included in the study (Fig. 5). In particular, high levels of recombination were predicted in the phylogenetic clade containing the reference strain ATCC 27454, with multiple recombination events predicted at the same loci.

Discussion
Assessment of the pan-genome assembled for M. hominis revealed 417 core genes. Previous examination of Mycoplasma species pan-genome identified a coregenome of only 196 genes (Liu et al.); however, this was an inter-species analysis and lower levels of conserved gene than observed in this study is to be expected. Of the entire pan-genome, the core genome represents 55% of all genes. This level of similarity between M. hominis strains is in stark contrast to M. pneumoniae where over 99% identity has been observed between the genomes [30].     [3,13].
To determine stability of the proposed genes for the seven gene MLST schemas, sequences were compared before and after short-term passage in two M.hominis strains (MH2 and MH20). Genes found to be genetically variable after short-term passage would be unsuitable candidates for an MLST schema. From the set of 48 genes, required to replicate whole genome sequence SNP phylogeny, one gene (GENE-220: pcrA) was found to have acquired mutations and was therefore not suitable for use in the MLST schemas. Genomic stability was only assessed in two M. hominis strains, it is likely that further instability would be identified upon assessment of a larger number of strains. Sequence heterogeneity has been well documented for M. hominis [23] and therefore it is essential that the stability of the genes chosen for the three MLST schemes is further assessed.
Recombination has previously been examined for M. hominis; however, only a limited number of genes were examined [23]. Nevertheless, analysis revealed inter-and intra-genic recombination in M. hominis and recombination was proposed as a method for the high intraspecies variability of M. hominis. In addition, examination of M. pneumoniae, M. genitalium, M. pulmonis and Ureaplasma urealyticum indicated a large number of repeats within the genomes of these organisms suggesting the existence of a large potential for recombination. In concordance with this, a large number of predicted recombination sites were identified in this study. This could be due to the diversity of the isolates from the reference used for mapping. The reference is on average 8500 SNPs from the isolates and each isolate is on average 9100 SNPs from each other isolate, indicating a large degree of diversity. 18 isolates used in this study, although adequate for inferring phylogenetic relationship, do not represent the full temporal and geographic diversity of the M. hominis species and future analysis of a larger data set may reveal the true extent of recombination in this species. Currently, with the potential for a large number of recombination events, discriminatory methods that rely on phylogenetic trees have limitations. In this case, the phylogenetic tree is influenced to a greater extent by horizontal transfer of genetic material, rather than vertical inheritance from the parental strains.
The development of MLST schemes has been important both individually and epidemiologically for pathogenic bacteria. At the level of an individual patient, this approach allows discrimination between relapse or persistence and new infection. In this study, M. hominis strains originating from the same patient specimen were used to determine gene sets that conferred identical MLST profiles for these strains. Furthermore, identical MLST profiles can be used to identify sexual transmission of M. hominis infections or horizontal transmission between mother and baby and may also be used in cases of post-transplant infection. Other methods including MLVA typing of M. hominis isolates from two motherneonate pairs resulted in the identification of identical MLVA types in each case studied, confirming mother-tochild transmission [9]. This study describes MLST schemes with a DI of 0.97 ST per M. hominis strain, revealing a genetic heterogeneity among this species.
Genomic sequencing may eventually displace traditional MLST based on multiple gene target PCR and sequencing. However, lack of local/international capability of genomic sequencing technology, specialised culture requirements of mycoplasmas and volume of biomass required to obtain sufficient high quality DNA for analysis is unlikely to lend whole genome sequencing analysis to provide practical clinical guidance for this bacterial species and other fastidious organisms. The major advantage of utilising PCR-based MLST, relative to whole genome sequencing, is the ability to conduct PCR amplification directly on clinical samples. The advent and improvement of metagenomics processes and analysis may eventually supersede this; however, practical use on infections with mycoplasmas and mixed strains is not yet tested. A phylogenetic analysis of a mixed population of M. hominis strains would not be possible directly on a clinical sample; however, this may be feasible on infections originating from a clonal M. hominis infection

Conclusions
This study has utilised bioinformatics analysis of M. hominis genomic sequence to identify a minimum set of 48 genes required to recapitulate the relationships observed in the whole genome phylogeny of M. hominis constructed using SNP data. Following this, three sets of seven genes were identified that could be used to construct an MLST typing schema. Due to availability and costs of whole genome sequencing and the challenges in obtaining adequate M. hominis DNA, the use of whole genome sequence analysis to provide clinical guidance is impractical for this bacterial species as well as other fastidious organisms. Furthermore, this study has identified PCR primers and found Schema A to have the highest topological similarity to the phylogenetic tree constructed using whole genome SNPs. The results presented here, provide a novel approach for the development of MLST schemes that accurately represent genomic phylogeny for use in epidemiology and transmission studies.

M. hominis strains, Culture and DNA preparation
Eighteen isolates of M. hominis from the UK that were submitted to or isolated by Public Health England from 1986 to 2013 from various anatomical locations were included in the study. Isolates were triple cloned on Mycoplasma Agar (Mycoplasma Experience; Surrey, UK) and confirmed as M. hominis by amplification and sequencing of yidC gene [8]. All isolates were subsequently cultured in Mycoplasma Liquid Medium (MLM; Mycoplasma Experience). All isolates were grown in 100 mL broth culture and the genomic DNA was extracted using the GenElute TM Bacterial Genomic DNA Kit (Sigma; Dorset, UK).

Genome assembly and pan-genome analysis
Genomic assembly of M. hominis was performed using SPAdes v3.6.1 [1] without error correction (−−only-assembler) and 21,33,45,53,65,77,83,93 kmers. For pan genome analysis, four reference sequences (NZ_CP009652.1, NC_013511.1, NZ_CP011538.1 and NZ_CP009677.1) were first clustered together using ggPRO (unpublished). Briefly, coding gene sequences were extracted from the annotated GenBank files for the strains listed above and each gene was checked against the database of hidden Markov models (HMMs) using HMMER v3.2 [11]. A hit was considered significant if the ratio of score over HMM length was greater than or equal to 0.85. Alleles were clustered into the same gene family; new alleles were added to a gene family if they introduced a gap of less than 10% into the gene family alignment. The HMM database was only updated when a new gene or allele was identified. Regions that were not covered by any HMM hits were scanned to check for presence of mycoplasma specific start and stop codons. If they were found, then a putative new gene was assigned in the database. After clustering the four reference sequences, the pan-genome contained 777 genes occurring in at least one reference genome. Each clinical M. hominis isolate genome was individually scanned against the pan-genome HMM database using HMMER v3.2 and significant hits (score/hmm length ratio > 0.85) were recorded in the database. FASTA sequence of pan-genome alleles are provided in Additional file 1.
MLST gene selection procedure M. hominis genomic sequences were used to develop a minimal MLST scheme. Following pan-genome assembly, alleles of 417 genes were extracted into individual multi FastA files and aligned using Muscle software [7]. Leave-one-out analysis was performed to identify genes required to maintain whole genome sequence SNP phylogeny, whereby one gene was removed at a time and the remaining genes were concatenated into a single multifasta file, with one entry per sample. To construct the tree, FastTree [22] with -gtr -gamma -nt parameters was used to reflect similar parameters used for tree construction using RAxML software (see above). The topology of the leave-one-out tree was compared to the whole genome tree using the following formula: D = S t /S ref , where D is the similarity score between reference and target trees, S ref is total number of nodes in a tree, and S t is the total number nodes such that children of these nodes are the same in reference and target tree. Distance D, has strict range (0-1) and represents overall similarity of a target tree to the reference tree, where 0: no similarity and 1: identical trees. This value can be calculated for each internal node for more refined similarity measure. The results were analysed and plotted using custom scripts written for R statistical package. Final MLST allele sequences and sequence types (STs) are found in Additioanl file 2.