Core-genome multilocus sequence typing and core-SNP analysis of Clostridium neonatale strains isolated in different spatio-temporal settings

ABSTRACT Clostridium neonatale was described as a new species within the Clostridium genus cluster I sensu stricto. It was recovered from the gut microbiota of infants and has been associated with necrotizing enterocolitis (NEC), a severe gastrointestinal disease affecting preterm neonates. In the absence of molecular typing methods, we developed a core-genome (cg) multilocus sequence typing (MLST) scheme based on 48 newly sequenced and 12 publicly available genomes of C. neonatale. Using the chewBBACA algorithm, a stable cgMLST scheme consisting of 2,350 target genes (with genomes having ≥95% cgMLST targets) is proposed. The strains are distributed among five clades, and while some are clonally related within clades, the C. neonatale strain distribution is independent of geographic or temporal clustering. When considering strains isolated from patients with and without NEC, there were no observable differences in clustering. When compared to a core-single nucleotide polymorphism (cgSNP) analysis approach (31,248 positions), we showed that cgMLST and cgSNP have comparable genetic discrimination of strains. In this study, we validated the use of cgMLST- and cgSNP-based typing methods for the genetic comparison of clinical isolates of C. neonatale, which will allow for future surveillance and epidemiological clinical investigations of this potential opportunistic pathogen.
IMPORTANCE Clostridium neonatale has been isolated from the fecal samples of asymptomatic neonates and cases of necrotizing enterocolitis (NEC). Taking advantage of a large collection of independent strains isolated from different spatio-temporal settings, we developed and established a cgMLST scheme for the molecular typing of C. neonatale. Both the cgMLST and cgSNP methods demonstrate comparable discrimination power. Results indicate geographically and temporally independent clustering of C. neonatale NEC-associated strains. No specific cgMLST clade of C. neonatale was genetically associated with NEC.

Although briefly described in 2002, the classification of C. neonatale as a new species was not formally published. In 2014, based on a polyphasic approach, C. neonatale was proposed to represent a new species within the Clostridium genus (7). The name and species of C. neonatale were validated within the Clostridium genus cluster I sensu stricto in 2018 (11). The ambiguous status of C. neonatale between 2002 and 2018 could have led to misidentification and/or inadequate representation of C. neonatale populations in prior studies. Consequently, very little is known about the genetics, population structure, or evolution of C. neonatale. Recently, we published the complete genome of the C. neonatale reference strain 250.09 (= ATCC BAA-265T) and conducted a comparative genome analysis with the eight draft genomes then available. We found that C. neonatale possesses an open pan-genome with genetic diversity and a flexible gene repertoire (12). In terms of molecular typing tools, a multilocus sequence analysis (MLSA) approach was used to show that C. neonatale belonged to the Clostridium genus sensu stricto (7). Furthermore, a quantitative real-time PCR targeting the rpoB gene was developed to detect C. neonatale in the fecal samples of patients (4).
The development of next-generation sequencing as a cost-effective technology has enabled genomic epidemiology monitoring and source tracking (13). This noticeably increases the amount of information available to compare bacterial strains by improving the discriminatory power of bacterial typing. In particular, the core-genome MLST (cgMLST) and core-single nucleotide polymorphism (cgSNP) methods are widely used for bacterial typing and epidemiological analysis (14-17). Relying upon whole genome sequencing, cgMLST extends the classical MLST concept to include the genes that make up the bacterial core genome, resulting in a systematic allele numbering system (14). cgMLST is considered to be highly discriminatory and less susceptible to deletions and other mutations in the genome (13). The cgSNP variant calling approach is another suitable typing method, which can provide greater discrimination than cgMLST. It uses a representative reference genome and allows for the filtering of recombinant regions (18).
To the best of our knowledge, neither a cgMLST scheme nor a cgSNP approach has been developed or applied to C. neonatale. In this study, taking advantage of a unique well-characterized strain collection, we first created an ad hoc cgMLST scheme using the open source chewBBACA algorithm (19). The scheme was developed using 48 newly sequenced C. neonatale genomes and the 12 genomes that were publicly available at the time of this publication. Second, both cgMLST and cgSNP methods were used to investigate the epidemiological phylogeny and genetic relationships of C. neonatale at the strain level. Additionally, the cgMLST and cgSNP methods were employed to distinguish strains obtained from NEC cases and controls in different spatio-temporal settings.

Strain origin
All strains were previously isolated from the fecal samples of neonates enrolled in three clinical studies (Fig. 1). The PREMAFLORA cohort (ANR-07-PNRA-007) included infants less than 37 weeks of gestational age who were hospitalized at a French NICU between 2008 and 2009 (20). The EPIFLORE study took place in 2011, and the ClosNEC study was conducted between 2015 and 2016. These were NEC case-control multicenter studies that involved preterm neonates less than 32 weeks of gestational age in 12 French NICUs (3, 21). NEC cases and control preterm neonates were obtained from the same NICU and were matched for gestational age, birth weight, type of feeding, and mode of delivery. Each NEC case was matched with two controls. The presence of clinical evidence fulfilling the modified criteria for NEC Bell's stage II (associated with radiologic pneumatosis intestinalis) or III (definitive intestinal necrosis seen at surgery or autopsy) confirmed the diagnosis of NEC as defined by the neonatal clinical team (1). No outbreak was declared during the inclusion periods. All studies were conducted in accordance with the relevant French guidelines and regulations, and informed consent was obtained from the parents of all enrolled children. Of note, in 2008-2009, parental consent was sufficient to ensure that fecal samples were collected under ethical conditions. The EPIFLORE and ClosNEC cohorts (clinical trial nos. NCT01127698 and NCT02444624, respectively)

Strain isolation, media, and growth conditions
For the EPIFLORE and ClosNEC cohorts, stool samples were collected at the time of NEC diagnosis (i.e., the first stool collected after the diagnosis). For neonates in the control group, fecal samples were collected within 1 week after those of their matched NEC cases. Strains were previously isolated from fresh stool samples collected from diapers, placed into sterile tubes containing 0.5 mL of brain-heart infusion broth supplemented with 20% glycerol (a cryoprotective agent), and immediately frozen at −80°C. For isolation of C. neonatale, the fecal samples were crushed in the brain-heart infusion broth with an Ultra-Turrax T25 (Fisher Bioblock, Illkirch, France). They were then diluted in peptone water, and the 10⁻², 10⁻⁴, and 10⁻⁶ dilutions were spread on a sulfite-polymyxin-milk selective agar medium (22) with the use of a WASP apparatus (Don Whitley Scientific, UK). After incubation under anaerobic conditions (CO2:H2:N2, 10:10:80) in an anaerobic chamber (Don Whitley Scientific), colonies were identified using matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS, Bruker Daltonics S.A.). When MALDI-TOF MS identification was inconclusive, amplification and sequencing of the 16S rDNA gene were performed as described previously (7). Bacterial counts were reported as log10 colony forming units (CFU)/g of feces, with a count threshold of log10 CFU/g of feces.
The reference strain 250.09 (= ATCC BAA-265T = CCUG 46077T) was also included in the study (12). All C. neonatale strains were subcultured in liquid TGYH broth (tryptone 30 g/L, glucose 5 g/L, yeast extract 20 g/L, and hemin 5 mg/L) or on TGYH agar and incubated under anaerobic conditions for 48 h at 37°C.

Whole genome sequencing
Genomic DNA was extracted from 24 h bacterial liquid cultures using the DNeasy UltraClean microbial kit (Qiagen, Courtaboeuf, France). Sequencing of the 48 new C. neonatale genomes included in the present study was carried out by the Biomics sequencing platform (Institut Pasteur, Paris, France) using the Illumina Nextera XT DNA Library Prep kit and HiSeq or NextSeq 500 sequencing systems. Paired-end reads were preprocessed using fqCleanER v.3.0 (https://gitlab.pasteur.fr/GIPhy/fqCleanER) with default parameters. The de novo assembly of the 48 genomes was performed using SPAdes v3.15.4 (23) with k-mer lengths of 21, 33, 55, and 77. The assembled contigs were evaluated for quality using QUAST. The genome annotation for the reference strain C. neonatale 250.09 (GenBank accession no. SAMEA9534266) was transferred to the 48 genomes using similarity and identity thresholds of 80% on the MicroScope pipeline platform v3.16.0, as previously reported (12). The CheckM (v1.2.2) marker gene set was used to evaluate completeness and contamination in each genome (24).

Genome similarities
Genome similarity was measured using the FastANI tool (25) (last accessed 3 March 2022), which computes pairwise average nucleotide identity (ANI) values for the sample genomes based on BLAST (ANIb). The resulting matrix was visualized as a heatmap using an R package (26).

Development and creation of the C. neonatale cgMLST scheme
The C. neonatale cgMLST scheme was established using the chewBBACA (BLAST Score Ratio-Based Allele Calling Algorithm) workflow v.2.7.0 (19). First, a wgMLST scheme was constructed containing each coding DNA sequence (CDS) of the 60 strain genomes included in the present study. This step was performed by the CreateSchema operation, during which Prodigal v2.6.3 (27) identified each CDS. Pairwise CDS comparison and an all-against-all BLASTP search group genes that code for equivalent or very similar proteins (default BLAST score ratio above 0.6) as alleles of the same locus and catalog them in a unique file. This procedure defines the scheme as a set of CDSs, each representing a single allele at a distinct locus. Second, based on the obtained CDS file, the AlleleCall operation identified paralogous loci, which were then excluded using the RemoveGenes module. The resulting list of loci corresponds to the wgMLST scheme from which the cgMLST scheme is extracted. Third, the quality of the wgMLST scheme was evaluated using the TestGenomeQuality algorithm with thresholds of 100 and 200 prior to extracting the cgMLST scheme (Fig. S1). This step investigates how threshold values affect the number of loci in the cgMLST scheme in relation to the number of genomes included. To extract the cgMLST scheme, we used the ExtractCgMLST module and conducted a core loci polymorphism analysis with the SchemaEvaluator module using default parameters. Allelic profiles of the core loci, shared by at least 95% of the analyzed isolates, were used to generate a cgMLST similarity tree. The neighbor-joining algorithm (StandardNJ) was employed in the GrapeTree software v.1.5.0 (28) using the FastME implementation. Trees were edited using the iTOL v.6.5.7 tool (29). Furthermore, a minimum spanning tree based on the cgMLST allelic profiles of the 60 C. neonatale strains was created using GrapeTree.
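The logic behind the core-locus extraction and the minimum spanning tree can be illustrated with a minimal Python sketch. It is not the chewBBACA ExtractCgMLST module and not one of the tree-building methods implemented in GrapeTree: the function names, the codes used for missing loci, and the toy profiles are all hypothetical, and the tree is built with a plain Prim's algorithm on pairwise allelic distances.

```python
from typing import Dict, List, Tuple

# Placeholder codes for "locus not called" (an assumption of this sketch;
# real allele callers emit their own codes for missing or ambiguous loci).
MISSING = {"", "-"}

Profiles = Dict[str, Dict[str, str]]  # genome -> locus -> allele identifier


def core_loci(profiles: Profiles, threshold: float = 0.95) -> List[str]:
    """Return the loci called in at least `threshold` of the genomes."""
    n = len(profiles)
    loci = sorted({locus for calls in profiles.values() for locus in calls})
    return [
        locus
        for locus in loci
        if sum(c.get(locus, "") not in MISSING for c in profiles.values()) / n
        >= threshold
    ]


def allelic_distance(a: Dict[str, str], b: Dict[str, str], loci: List[str]) -> int:
    """Count loci called in both genomes that carry different alleles."""
    return sum(
        1
        for locus in loci
        if a.get(locus, "") not in MISSING
        and b.get(locus, "") not in MISSING
        and a[locus] != b[locus]
    )


def minimum_spanning_tree(profiles: Profiles, loci: List[str]) -> List[Tuple[str, str, int]]:
    """Prim's algorithm on allelic distances; returns (u, v, distance) edges."""
    names = sorted(profiles)
    in_tree = set(names[:1])
    edges = []
    while len(in_tree) < len(names):
        # Pick the cheapest edge linking the current tree to a new genome.
        d, u, v = min(
            (allelic_distance(profiles[u], profiles[v], loci), u, v)
            for u in sorted(in_tree)
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, d))
        in_tree.add(v)
    return edges


# Toy presence/absence example: 20 hypothetical genomes, 3 hypothetical loci.
toy_presence = {f"g{i}": {"locA": "1", "locB": "1", "locC": "1"} for i in range(20)}
toy_presence["g0"]["locB"] = "-"  # locB called in 19/20 genomes (95%): kept
toy_presence["g0"]["locC"] = "-"
toy_presence["g1"]["locC"] = "-"  # locC called in 18/20 genomes (90%): dropped

# Toy allelic profiles for the tree: 4 hypothetical strains, 4 hypothetical loci.
toy_loci = ["l1", "l2", "l3", "l4"]
toy_profiles = {
    "A": dict(zip(toy_loci, ["1", "1", "1", "1"])),
    "B": dict(zip(toy_loci, ["1", "1", "1", "2"])),
    "C": dict(zip(toy_loci, ["1", "2", "2", "2"])),
    "D": dict(zip(toy_loci, ["3", "2", "2", "2"])),
}

if __name__ == "__main__":
    print(core_loci(toy_presence))
    print(minimum_spanning_tree(toy_profiles, toy_loci))
```

In this sketch, a locus is retained when it is called in at least 95% of the genomes, and the edge weights of the tree are counts of differing alleles, which is the quantity reported when two cgMLST types are said to differ by a given number of alleles.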

Core-SNP-based phylogeny
The cgSNP analysis was conducted on the raw data of 52 genomes. Since we did not have access to the initial raw sequencing data before reassembly, we excluded the C25-UICQ01, NEC25, NEC26, NEC32, NEC86, LCDC no. 99-A-005, LCDC no. 99-A-006, and Q4564 genomes from the cgSNP analysis. The complete genome of the reference strain C. neonatale 250.09 served as the reference for read mapping and cgSNP variant calling. The cgSNP analyses were performed using Snippy v4.6 (30) with default settings. The resulting alignment was filtered for regions with high densities of base substitutions, which were identified as possible recombination events using Gubbins version 3.0.0 (30) with default parameters. This process generated a recombination-free core-genome alignment. RAxML-NG was employed to build a maximum likelihood tree under the GTR-GAMMA model with 1,000 bootstrap replicates. The phylogenetic tree was subsequently edited and visualized with iTOL v.6.5.7 (29). Pairwise distances between isolates were calculated using Snp-Dists (30) following Gubbins correction. The resulting data were used to generate a heatmap depicting SNP distances across genomes.
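The pairwise distance step can be sketched as follows. This is an illustrative Python re-implementation of the idea, not the Snp-Dists program itself: the function names and the toy alignment are hypothetical, and the sketch assumes that only positions where both sequences carry A, C, G, or T are counted.

```python
from itertools import combinations
from typing import Dict, Tuple

VALID = set("ACGT")  # assumption: gaps and ambiguous bases are ignored


def snp_distance(a: str, b: str) -> int:
    """Count aligned positions where two sequences carry different bases."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in VALID and y in VALID and x != y
    )


def pairwise_distances(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return the SNP distance for every unordered pair of sequences."""
    return {
        (p, q): snp_distance(alignment[p], alignment[q])
        for p, q in combinations(sorted(alignment), 2)
    }


# Toy alignment (hypothetical sequences, not data from this study).
toy_alignment = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "ACGNTCGA",
}

if __name__ == "__main__":
    for pair, dist in pairwise_distances(toy_alignment).items():
        print(pair, dist)
```

The resulting pair-to-distance mapping is the kind of matrix that can then be drawn as a heatmap.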

Statistical analysis
XLSTAT Version 2014.5.03 was used for statistical analysis. To determine non-random associations between two categorical variables, Fisher's exact test was applied with significance set at P < 0.05.
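For readers unfamiliar with the test, the 2 × 2 case of a two-sided Fisher's exact test can be sketched in Python using the hypergeometric distribution. This is an illustration only and not the XLSTAT procedure used in the study; the function name and the counts are hypothetical, and tables larger than 2 × 2 require a generalization that is not shown here.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Probability of observing x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not data from this study):
    # rows = two groups of strains, columns = two categories.
    p = fisher_exact_two_sided(3, 1, 1, 3)
    print(f"P = {p:.4f}", "significant" if p < 0.05 else "not significant")
```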

C. neonatale population characteristics
The features of the 48 newly sequenced and 12 publicly accessible genomes used in this study are listed in Table 1. The newly sequenced genomes were from strains isolated from independent neonates included in three distinct cohorts over three time periods (2008, 2011, and 2015-2016) (Fig. 1). A total of 13 strains were isolated from neonates with NEC and 38 from neonates without NEC. The C. neonatale strains were obtained from neonates as follows: 17 strains were obtained from the monocentric PREMAFLORA study (2008-2009); 17 strains (7 NEC cases and 10 controls) were obtained from the nationwide multicenter EPIFLORE study (2011); and 17 strains (6 NEC cases and 11 controls) were obtained from the nationwide multicenter ClosNEC study (2015-2016). In addition, we included 12 publicly available genomes from databases (4,5,11,12 S2).

Whole genome data
Regarding the sequencing data of the 60 C. neonatale genomes included in this study (Table 1), the estimated genome size ranged from 3.9 Mb to 5.6 Mb, with an average length of 5 ± 0.36 Mb. The average genome G + C content was 28.57 ± 0.13%. The number of predicted protein-coding genes ranged from 3,808 to 5,692, with an average of 4,980 ± 533. Whole genome pairwise sequence comparisons revealed an ANI of 98.20% and 99.98% for the most distant and closest strains, respectively (Fig. S3). These data complement the previous findings of a comparative genomics study using nine C. neonatale genomes (12). That study reported a genome length ranging from 4.6 to 5.6 Mb (an average of 4.9 ± 0.40), a predicted number of protein-coding genes ranging from 4,259 to 5,505 (an average of 4,399 ± 684), an average G + C content of 28.64 ± 0.13%, and an ANI ranging from 98.41% to 99.83% for the most distant and the closest strains, respectively (12). In this study, the G + C content of C. neonatale genomes is comparable to that of other Clostridium sensu stricto members. Nevertheless, some variations exist in genome sizes, which range from 2.55 Mb for C. novyi to 6.53 Mb for C. saccharoperbutylacetonicum, and in CDS numbers, which range from 2,601 for C. tetani to 5,533 for C. saccharoperbutylacetonicum (31).
In the current study, 58 out of the 60 analyzed genomes were classified as "high-quality" draft genomes. This definition is based on the minimum reporting standards for genomes, which require at least 90% completeness, a maximum of 5% contamination, the presence of 23S, 16S, and 5S rRNA genes, and a minimum of 18 tRNAs (32).
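The criteria listed above can be expressed as a simple check. The following Python sketch is illustrative: the class, field names, and example values are hypothetical, and the thresholds are those stated in the text.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    """Per-genome quality metrics (field names are hypothetical)."""
    completeness: float    # percent
    contamination: float   # percent
    has_23s: bool
    has_16s: bool
    has_5s: bool
    trna_count: int


def is_high_quality_draft(g: GenomeStats) -> bool:
    """Apply the criteria as stated in the text: at least 90% completeness,
    at most 5% contamination, all three rRNA genes, and at least 18 tRNAs."""
    return (
        g.completeness >= 90.0
        and g.contamination <= 5.0
        and g.has_23s
        and g.has_16s
        and g.has_5s
        and g.trna_count >= 18
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = GenomeStats(99.2, 0.8, True, True, True, 60)
    print(is_high_quality_draft(example))
```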

C. neonatale cgMLST scheme and analysis
One of the most commonly used methods for bacterial typing in whole genome sequencing is the gene-by-gene cgMLST approach. The cgMLST method focuses on a wide range of CDSs present in most strains, which permits high discrimination and renders the system less prone to mutations such as insertions and deletions (33). Additionally, cgMLST does not require a specific reference genome, making it a suitable method for identifying potential clusters from samples of an entire species (34). In the current study, the wgMLST scheme was initially represented by a data set of 7,127 possible target loci identified from the 60 genomes that were analyzed (Fig. S4). After the filtering steps, 56 loci were found to be paralogous and were excluded. A genome quality test eliminated an additional 4,721 loci. Ultimately, the cgMLST scheme included a total of 2,350 gene targets that were present in at least 95% of the genomes, meeting a well-defined cutoff (35-37). It was assumed that up to 5% of the loci in each strain may not be identified due to issues such as sequencing coverage, assembly, or other factors associated with using draft genomes (14) (Table S1).
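The locus counts reported above are internally consistent, as the following short Python check shows. The variable names are illustrative; the numbers are taken from the text, and the last line simply applies the ≥95% criterion to the size of the scheme.

```python
from math import ceil

wgmlst_loci = 7127          # candidate loci in the initial wgMLST scheme
paralogous_loci = 56        # removed as paralogous
quality_filtered = 4721     # removed by the genome quality test

cgmlst_loci = wgmlst_loci - paralogous_loci - quality_filtered
assert cgmlst_loci == 2350  # matches the size of the final cgMLST scheme

# Share of the initial wgMLST loci retained in the core scheme (about one third).
retained_fraction = cgmlst_loci / wgmlst_loci

# Smallest number of called targets that satisfies the >=95% criterion.
min_targets_per_genome = ceil(95 * cgmlst_loci / 100)

if __name__ == "__main__":
    print(cgmlst_loci, round(retained_fraction, 3), min_targets_per_genome)
```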
The relatedness of the C. neonatale strains was investigated using the cgMLST scheme. Among the 60 strains, five clades were identified as follows: clade I (n = 1, 2%), II (n = 9, 15%), III (n = 20, 33%), IV (n = 8, 13%), and V (n = 22, 37%) (Fig. 1). Clades III and V had the highest strain diversity when considering the period of isolation. The strain LF22 was the only representative of clade I, suggesting higher genetic diversity in this strain compared to that in others. The reference strain 250.09 as well as two strains isolated during the same outbreak (LCDC99A005 and LCDC99A006) (2-5, 11) belonged to clade II. Additionally, six strains isolated from different regions in 2008 or 2011 also belonged to clade II (Fig. 1). Some clades included mostly strains isolated from the same period (or cohort), particularly clade III in 2008, corresponding to the monocentric cohort PREMAFLORA. This was also observed for other clades and time periods, but to a lesser extent. Within each clade, certain genomes are organized in tight (near-clonal) groups, suggesting that the same clone spread within the same NICU. This is in agreement with the draft genome-based phylogeny and core-genome analysis of a small number of C. neonatale isolates (n = 5), which reported clonality among strains within the same NICU (4).
However, some strains were also distributed independently of the isolation period (Fig. 1). Geographically, similar findings were obtained. Taken together, our results support a distribution of C. neonatale strains that is independent of spatial and temporal clustering.
Compared to the multilocus sequence analysis scheme previously used to study the phylogenetic relationship of four C. neonatale strains (7), our results placed three of these strains (LF22, PM53, and 250.09) into distinct clusters, providing better resolution of the genetic relatedness of the strains.

cgSNP analysis and comparison with cgMLST
Alternatively, the genetic relatedness of strains can be determined using the SNP typing approach, which identifies single nucleotide polymorphisms (SNPs) that differ between strains. SNPs are detected by mapping sequence reads against a closely related reference genome and recording nucleotide differences (13). In particular, the handling of recombination events differs between cgMLST and SNP alignments. The cgMLST method collapses recombinant regions containing a high density of SNPs into fewer allelic changes (17), whereas such regions can be filtered out of the SNP alignment.
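The difference between the two ways of counting can be made concrete with a small Python sketch. The strains, loci, and sequences below are hypothetical and chosen only to show that a locus carrying several substitutions, as might follow a recombination event, contributes a single allelic change but several SNPs.

```python
from typing import Dict

Loci = Dict[str, str]  # locus name -> nucleotide sequence of the allele


def allele_distance(a: Loci, b: Loci) -> int:
    """Number of shared loci whose sequences differ (one change per locus)."""
    return sum(1 for locus in a.keys() & b.keys() if a[locus] != b[locus])


def nucleotide_distance(a: Loci, b: Loci) -> int:
    """Number of mismatched positions summed over shared loci.

    Assumes the two alleles of a locus have equal length (no indels).
    """
    return sum(
        sum(1 for x, y in zip(a[locus], b[locus]) if x != y)
        for locus in a.keys() & b.keys()
    )


# Hypothetical strains: loc2 differs at five positions, loc3 at one.
strain_x = {"loc1": "ATGAAACCC", "loc2": "ATGGGGTTT", "loc3": "ATGCCCAAA"}
strain_y = {"loc1": "ATGAAACCC", "loc2": "ATGTTGAAA", "loc3": "ATGCCCAAG"}

if __name__ == "__main__":
    print("allelic changes:", allele_distance(strain_x, strain_y))
    print("SNPs:", nucleotide_distance(strain_x, strain_y))
```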
Using C. neonatale 250.09 as the reference genome, the cgSNP calling step resulted in an alignment of 31,248 SNPs after removal of the predicted recombinant regions, based on the 52 genomes for which raw data were available (Table S2). The maximum likelihood phylogenetic tree of cgSNPs clustered the strains into the same five clades as cgMLST (Fig. 2). This was confirmed by the heatmap of pairwise SNP distances, which showed that the strain genomes differed by at least one and at most 12,826 SNPs and were distributed into the five clades (Fig. S5). The cgMLST results indicated that some strains were clonal and associated with isolation from the same period or cohort. However, clonal strains were also observed independently of clustering and periods or cohorts, suggesting possible clone dissemination.
The comparison of the cgMLST and cgSNP-based phylogenetic analyses of the 60 genomes is presented in Fig. 3. The phylogenetic tree for cgSNPs confirmed our cgMLST results. Although differences in topology and strain distance were noticed in clades III, IV, and V, the two methods produced similar results indicating equal discriminatory power. This is consistent with other studies that showed a high level of agreement between the cgMLST and cgSNP approaches, resulting in comparable levels of differentiation and relatedness (13, 33, 35).

cgMLST- and cgSNP-based analyses of C. neonatale strains isolated from patients with or without NEC
C. neonatale is one of the species found in the stool samples of preterm neonates with NEC (3-5, 8). However, there are limited data available on the epidemiological surveillance and clinical monitoring of this potential opportunistic pathogen. As we have demonstrated that both cgMLST- and SNP-based methods yield similar results, a minimum spanning tree based on cgMLST was generated to compare the distribution of the strains isolated from NEC cases and control patients (Fig. 4). Our results showed a distribution of strains among 15 cgMLST types. We observed a dominance of two cgMLST types that differed by 1,638 alleles and encompassed 62% (n = 37) of the strains (Fig. 4). The distribution of the strains was independent of the group from which they were isolated (i.e., NEC cases vs controls). Regarding the period of isolation, there were no significant differences observed between C. neonatale strains and NEC (P = 0.74). Previously, a possible existence of NICU clones has been proposed without a relationship to the occurrence of NEC in preterm neonates (4). In this study, no significant differences were observed based on the geographical location (P = 0.29). With regard to NEC, the clade distribution of C. neonatale strains did not correlate with the occurrence of NEC (P = 1). Altogether, our findings did not establish a connection between a particular genetic type of C. neonatale and the patient's NEC status. This suggests that additional factors associated with host receptivity or gut microbiota could be implicated. Besides,

Strengths and limitations
One limitation of this study is the lack of strain diversity as the analysis only included strains from neonates. However, it is notable that the number of C. neonatale strains and available genomes were limited. In addition, to our knowledge, the natural reservoir for C. neonatale remains unknown as strains from adult humans, animals, and the environment are not currently available. Although a total of 12 NICUs participated in the study, the low number of patients included in some NICUs hindered our regional analysis due to low statistical power. The strength of this study is that it utilizes the well-characterized 250.09 reference strain genome as a seed genome as well as the largest, well-characterized, and diverse collection of C. neonatale strains obtained from different spatio-temporal settings with ≥95% cgMLST targets.

Conclusions
Little information is available about C. neonatale, one of the potential pathogens associated with NEC. Whole genome sequencing typing methods were applied to evaluate the phylogenetic and epidemiological links among a unique collection of clinical C. neonatale strains isolated in different spatio-temporal settings. We developed and established a potentially stable cgMLST scheme for ad hoc usage, providing an applicable and discriminatory typing method at the strain level. We have demonstrated that both cgMLST and core-SNP exhibit similar discriminatory abilities. This study sheds light on the epidemiology and population dynamics of C. neonatale. It offers new insights into the distribution of C. neonatale strain clonal complexes among NEC patients. The proposed cgMLST scheme will contribute to the study of the molecular dynamics of C. neonatale strains, providing information for local and global clinical surveillance of this opportunistic pathogen.

FIG 2 Maximum likelihood cgSNP phylogenetic tree (bootstrapping: 1,000 replicates) based on the 31,248 cgSNPs after removal of the predicted recombinant regions. Strain ID, patient status (NEC cases or controls), cohort, and year of isolation are given for each strain. The C. neonatale type strain 250.09 is the reference strain (4, 5).

FIG 3 Comparison of the cgMLST neighbor-joining tree (left) and the cgSNP maximum likelihood tree (right). The C. neonatale type strain 250.09 is the reference strain (bold). Connecting lines indicate changed positions between the two trees. (*) Strains from other studies.
were approved (nos. 911009 and 915094, respectively) by Commission Nationale de l'Informatique et des Libertés and the Consultative Committee on the Treatment of Information on Personal Health Data for Research Purposes (approval nos. 10.626 and 15.055, respectively).

TABLE 1 Clostridium neonatale genome data
a Corresponds to the reference strain LCDC99A005. b Contigs containing "N" stretches.

2011, 3 in 2012, 18 in 2015-2016, and 1 in 2021. The strains originated from 12 different NICUs across five French regions (Fig.