Comparative genomics of Mycoplasma pneumoniae isolated from children with pneumonia: South Korea, 2010–2016

Background Mycoplasma pneumoniae is a common cause of respiratory tract infections in children and adults. This study applied high-throughput whole genome sequencing (WGS) technologies to analyze the genomes of 30 M. pneumoniae strains isolated from children with pneumonia in South Korea during the two epidemics from 2010 to 2016 in comparison with a global collection of 48 M. pneumoniae strains which includes seven countries ranging from 1944 to 2017. Results The 30 Korean strains had approximately 40% GC content and ranged from 815,686 to 818,669 base pairs, coding for a total of 809 to 828 genes. Overall, BRIG revealed 99% to > 99% similarity among strains. The genomic similarity dropped to approximately 95% in the P1 type 2 strains when aligned to the reference M129 genome, which corresponded to the region of the p1 gene. MAUVE detected four subtype-specific insertions (three in P1 type 1 and one in P1 type 2), of which were all hypothetical proteins except one tRNA insertion in all P1 type 1 strains. The phylogenetic associations of 30 strains were generally consistent with the multilocus sequence typing results. The phylogenetic tree constructed with 78 genomes including 30 genomes from Korea formed two clusters and further divided into two sub-clusters. eBURST analysis revealed two clonal complexes according to P1 typing results showing higher diversity among P1 type 2 strains. Conclusions The comparative whole genome approach was able to define high genetic identity, unique structural diversity, and phylogenetic associations among the 78 M. pneumoniae strains isolated worldwide.

. The resulting contigs were mapped to the M129 reference genome and joined via PCR. The thirty genomes had all contigs joined to form a single, continuous (circular) contig. Following assembly and editing, the genomes underwent automated gene annotation. With approximately 40% GC content and ranging from 815,686 to 818,669 bp, the genomes coded for a total of 809 to 828 genes.

Overall comparison
The 30 sequenced genomes were aligned to the reference M129 genome using BLAST Ring Image Generator (BRIG). Overall, the genomes were 99% to > 99% identical. The similarity dropped to approximately 95% in the type 2 strains, which corresponded to the area of the p1 gene (Fig. 1).

SNP and indel analysis
SNPs and indels were compared for the identification of sequence level differences against the reference genome. The results are shown in Table 2. As expected, P1 type 1 strains showed fewer variant numbers (140-455) than P1 type 2 strains (1778-1796), showing a clear distinction.

Proteins and functional analysis
The Protein Family Sorter tool at Pathosystems Resource Integration Center (PATRIC) allows selection of a set of genomes of interest and examination of the distribution of protein families across genomes. An interactive heatmap viewer provides a comprehensive view of the distribution of the protein families across multiple genomes, with clustering and anchoring functions to show relative conservation of synteny and to identify lateral transfers. Based on gene annotation from PATRIC, a heatmap of all proteins was produced along with the reference genome M. pneumoniae M129 (Fig. 3). Unsurprisingly, when genomes were classified into P1 types 1 and 2, distinction between the genomes was apparent. Nevertheless, most of the genomes that showed different expressions were hypothetical proteins with uncertain significance.

Phylogenetic associations
Thirty genomes were aligned with MAFFT, and a phylogenetic tree was generated (Additional file 2). The phylogenetic tree was divided into two clades in accordance with the P1 typing. In general, the STs of the 30 strains were consistent with the phylogenetic relationship.
All 78 strains, including strains from this study and NCBI, were aligned and phylogenetic tree was constructed and visualized (Fig. 4). In general, the strains in this study were scattered throughout the entire phylogenic tree, along with the expansion of certain clades. Trees were divided into two major clades in accordance with the P1 typing. Each P1 type was divided into another two clades. Clade 1 formed the largest clade. It included strains of ST3 from the current study and global collections. Strains with ST20, ST17 and ST19 were included in Clade 1. Clade 2 was consisted of ST1 strains, exclusively. This clade harbored a subclade which consisted of strains from China in year 2015 and 2016. Clade 2 also included the M129 reference strain. Major ST of Clade 3 was ST14 with one each of ST2, ST15 and ST33 strain. Clade 4 showed high proportion of ST2 strains with a subclade which included four ST2 strains from USA and a ST2 strain from Japan. Overall, clade 1 showed the most heterogenicity in terms of both the origin and the time of strain collected.

Comparative genomics with global strains-MLST
For the comparative genome analysis of global strains, 48 genomes of M. pneumoniae were accessed from NCBI. Typing of P1 types and MLST types was performed (Additional file 1). An eBURST diagram was constructed based on the 30 strains from this study, 48 global strains from NCBI, and previously reported STs from PubMLST (http://pubmlst.org/mpneumoniae/).
The eBURST diagram showed two clonal complexes with two singletons of ST12 and ST22 (Fig. 5). The founder ST of CC1 was identified as ST3 with no double locus variants (DLVs). The founder ST of CC2 was recognized as ST2 with multiple subgroup founders (ST7, ST14 and ST24), multiple single locus variants (SLVs) and DLVs. In the eBURST diagram of global strains, ST3 and ST1, and ST2 and ST14 were the main STs from CC1 and CC2, respectively. Strains from this study (colored in red) constituted a considerable proportion of ST3 from CC1 and ST14 from CC2.
There were several other STs that were previously reported, but not included in the investigation of this study.

Discussion
M. pneumoniae is known as an organism 'difficult-to-culture' (1). Thus, unlike ordinary bacterial pathogens, the aid of molecular biology in the diagnosis of M.
pneumoniae is critical (13). As the burden of disease caused by this organism is considerable and patients may experience diverse extrapulmonary clinical manifestations, M. pneumoniae has drawn the attention of many researchers.
Nevertheless, in addition to the molecular diagnosis of M. pneumoniae by the P1 adhesin, P1 typing has been the sole method for classification for decades (14).
However, because the size of the M. pneumoniae genome is short compared to that of other bacteria and because the P1 adhesin is the only diverse part of the whole genome, researchers continued to focus on the P1 adhesin. Despite these efforts, P1 was not sufficient for the explanation of epidemics or for the explanation of clinical severity (15,16). was also frequently identified in Japan during the similar period, but ST19 was prevalent among macrolide-resistant strains in Japan, while ST19 has never been identified in Korea (18). A study from China, which applied MLVA on 835 samples from different regions, has also found regional differences in genotype distribution  Comparative genome analysis was performed using BRIG, MAUVE, and MAFFT. The genomes were classified mainly by the legendary P1. BRIG clearly distinguished P1 types 1 and 2, but no further information could be found, as separate genes could not be visualized (22). MAUVE utilizes LCBs, which are conserved segments that appear to be internally free from genome rearrangements (23). The result from MAUVE showed that large rearrangements (e.g., plasmids, phage or resistance genes) were not observed among M. pneumoniae. Specific insertions were noted in both P1 types. Nevertheless, the translated proteins of the inserted genes were generally hypothetical proteins with the exception of a tRNA. This is consistent with a previous report by Xiao et al., but the two insertions at 169-170 kb and 178-179 kb have not been described previously (20). The heatmap generated by PATRIC confirmed the P1 classification by differences in protein production. This is consistent with additional studies that applied NGS technology (24, 25).
The SNP approach is widely used in the study of antimicrobial resistance and genetic diversity and is not limited to M. pneumoniae (26)(27)(28) Even though not apparent in the phylogenetic analysis probably due to the genetical proximity of the strains, eBURST analysis shows ST3 as the founder strain of the CC1. We assume that despite the fact that M129, one of the ST1 strains, is used as reference strain, it is more convincing that ST1 strains may have evolved from the ST3 strains. The strain 'USA/1960/P1_1428/P1_1/ST3' which is the earliest known strain of the P1 type 1 strains also supports this idea.
In general, the result of the current study is consistent with that of the previous studies (20, 21). High stability was observed by the small number of SNPs across the genome and lack of rearrangements. The fact that P1 types shown as a major factor for the genetic classification is also consistent with the findings of the current study. Diaz et al. grouped 107 strains from four other studies and their study into three P1 type 1 and two P1 type 2 subgroups based on core protein sequences (21). Even though there are differences in the methods of tree alignment, construction, and visualization, the subgroups are consistent with the current study, in general. A distinct subgroup designated as 1N (New) which included four isolates from their study was the only subgroup which did not exist on the current study. When comparisons are made between the different phylogenetic trees, we find that the abundance and heterogenicity of the Clade 1 in the current study and the group 1U (Ubiquitous) in the study of Diaz et al. as the common finding. We assume that this certain subgroup harbors the most actively evolving strains in global and demands attention in terms of pathogenicity or in accordance with macrolide resistance. This study has some limitations. First, the number of strains included in the study was small, thus we were not able to interpret the clinical significance of the findings. Second, isolates were chosen from two consecutive outbreaks. Further analysis from sporadic cases and new outbreaks is needed. Nevertheless, this study expanded our understanding of the genome structure of M. pneumoniae through whole genome analysis. Whole genome approach provided more detailed information than traditional molecular typing methods for exploring genomic diversity among M. pneumoniae strains.

Conclusions
The comparative whole genome approach was able to define high genetic identity, unique structural diversity, and phylogenetic associations among the 78 M.

MLST analysis and P1 typing
MLST was performed on the M. pneumoniae DNA samples as previously described (10). P1 subtypes and each subtype variants were determined by sequencing the RepMP2/3 and RepMP4 genes and in comparison with previously published data (29, 30).

Selection of strains for Whole-genome analysis
A total of 30 strains were selected for the whole-genome sequencing (WGS) investigation. Thirty-seven M. pneumoniae strains were isolated during the 2010-15 62.2%. The remaining 37.8% consisted of ST1, ST14, ST17, and ST33. In contrast, among the 45 isolates detected during the 2014-2016 epidemic, P1 subtype 1 accounted for 50.0% and the ST distribution was 88.9% for ST3 and 11.1% for ST14.
In order to include as many different STs as possible, all strains that showed STs other than ST3 (ST1, ST14, ST17, and ST33) were included for WGS analysis. We have randomly selected 20 ST3 strains from each epidemic.

Next-generation sequencing (NGS)
The library for whole genome sequencing was prepared using Truseq Nano DNA Lib

Genome assembly and annotation
NGS reads were assembled de novo using SPAdes (31). The number of contigs generated ranged from 3 to 8 per strain. These contigs were mapped to the M129 reference genome using the BLAST-like alignment tool (BLAT) and visualized using Integrative Genomics Viewer (IGV) (32-34). This mapping was used to develop PCR primers to join the contigs. High fidelity PCRs and Sanger sequencing were performed using standard methods. Overlapping and joining of the contigs were performed manually with Sequencher version 5.4.6 (Gene Codes Corporation, Ann Arbor, MI, USA). The initial NGS reads were aligned to the de novo assembled genome for the correction of errors. The corrected and completed circular genomes were annotated using Rapid Annotation using Subsystem Technology (RAST) (35).

Comparative genomics
Completed genomes were aligned using BRIG for the overall sequence similarity between the strains (22). MAUVE was used to detect large chromosomal rearrangements, deletions, and duplications (23). In the phylogenetic analysis with the 48 global strains downloaded from the National Center for Biotechnology Information (NCBI) were included. MAFFT was applied using the 'FFT-NS-2' method for multiple sequence alignment of the strains from the current study and with the global strains. Phylogenetic tree was constructed using the maximum composite likelihood approach based on neighbor-joining algorithms and visualized using Phylo.io (strains from the current study) and MEGA X (with the global strains) (36, 37). For the phylogenetic tree with the global strains, 500 iterations of bootstrapping analysis were used to generate confidence values. eBURST version 3 software (http://eburst.mlst.net/) was used to estimate the relationships among the strains and to assign strains to a clonal complex (CC) (38).

Single nucleotide polymorphism (SNP) and insertion/deletion (indel) analysis
To call SNPs and indels, completed genomes were first broken into 10-kb "reads" at 1-kb intervals and then aligned to the M129 reference strain (NCBI Accession Number NC_000912) using BWA v0.7.7 (39). Variant calling was performed using Samtools (40). The effects of the SNPs and indels in the resulting VCF files were evaluated and annotated using SnpEff v3.3 (41).  L50, smallest number of contigs whose length sum makes up half of genome size;

Proteins and functional analysis
N50, sequence length of the shortest contig at 50% of the total genome length; CDS, coding sequence.   Figure 1 Overall sequence identity of the 30 sequenced strains with the reference M129 genome. Soli