Genome sequence and description of Pantoea septica strain FF5

Strain FF5 was isolated from the skin flora of a healthy Senegalese 35-year-old woman. This strain was identified as belonging to the species Pantoea septica based on rpoB sequence identity of 99.7 % with Pantoea septica strain LMG 5345T and a highest MALDI-TOF-MS score of 2.3 with Pantoea septica. Like P. septica, this FF5 strain is a Gram-negative, aerobic, motile, and rod-shaped bacterium. Currently, 17 genomes have been sequenced within the genus Pantoea but none for Pantoea septica. Herein, we compared the genomic properties of strain FF5 to those of other species within the genus Pantoea. The genome of this strain is 4,548,444 bp in length (1 chromosome, no plasmid) with a G + C content of 59.1 % containing 4125 protein-coding and 68 RNA genes (including 2 rRNA operons). We also performed an extensive phenotypic analysis showing new phenotypic characteristics such as the production of alkaline phosphatase, acid phosphatase and naphthol-AS-BI-phosphohydrolase.
Electronic supplementary material: The online version of this article (doi:10.1186/s40793-015-0083-0) contains supplementary material, which is available to authorized users.


Introduction
Pantoea septica Brady et al. 2010 was first isolated from a human stool sample in New Jersey, USA [1]. Pantoea septica strain FF5 (= CSUR P3024 = DSM 27843) was cultivated from the skin of a healthy Senegalese woman [2]. To date, the genus Pantoea consists of 22 species and 2 subspecies [3,4], and no genome had been described for Pantoea septica when this paper was written. Pantoea species have been isolated mostly from the environment, particularly from plants, seeds and vegetables, and several are phytopathogenic [5]. Some species, such as P. agglomerans, P. septica and P. eucrina, are also frequently isolated from humans, in whom they can cause opportunistic infections [1-6].
We provide here a summary classification and a set of features for Pantoea septica strain FF5, together with the description of the complete genomic sequence and annotation.

Classification and features
A skin sample was collected with a swab from a healthy Senegalese volunteer living in Dielmo (a rural village in the Guinean-Sudanian area in Senegal) in December 2012 (Table 1). This 35-year-old woman was included in a research project that was approved by the Ministry of Health of Senegal, the assembled village population and the National Ethics Committee of Senegal (CNERS, agreement number 09-022), as published elsewhere [7]. Strain FF5 (Table 1) was isolated by aerobic cultivation on 5 % sheep blood-enriched Columbia agar (bioMérieux, Marcy l'Etoile, France). As the 16S rRNA gene sequence cannot be used as a means of identifying Pantoea species, a comparative analysis of rpoB nucleotide sequences between strain FF5 and other Pantoea species was performed. Strain FF5 exhibited a 99.7 % sequence identity with P. septica, its phylogenetically closest validly published Pantoea species (Fig. 1) [8]. This strain is motile, and its cells grown on agar are Gram-negative rods with a mean diameter of 0.79-1.06 μm and a mean length of 1.25-2.04 μm.
Evidence codes: IDA, Inferred from Direct Assay; TAS, Traceable Author Statement (i.e., a direct report exists in the literature); NAS, Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [30].
Matrix-assisted laser-desorption/ionization time-of-flight mass spectrometry protein analysis was performed using a Microflex spectrometer (Bruker Daltonics, Leipzig, Germany), as previously reported [9]. The scores previously established by Bruker Daltonics, used to validate or invalidate identification compared to the instrument database, were applied. Briefly, a score ≥ 2 for a species with a validly published name allows identification at the species level; a score ≥ 1.7 and < 2 allows identification at the genus level; and a score < 1.7 does not allow any identification. Twelve distinct deposits of strain FF5 were made from 12 isolated colonies. Each smear was overlaid with 2 μL of matrix solution (saturated solution of alpha-cyano-4-hydroxycinnamic acid) and dried for 5 min, as previously reported [9,10]. The spectra from the 12 different colonies were imported into the MALDI BioTyper software (version 2.0, Bruker) and analyzed by standard pattern matching (with default parameter settings) against 6252 bacterial spectra. Spectra were compared with the Bruker database, which contained spectra from the ten validly named Pantoea species. The spectra obtained were similar to those of P. septica. A score of 2.3 was obtained for strain FF5, supporting its identification as P. septica. Its reference mass spectrum was added to our database (Fig. 2).
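The score thresholds described above for MALDI-TOF identification can be illustrated with a minimal Python sketch. The function name and the returned labels are hypothetical and chosen only for illustration; this is not part of the BioTyper software.

```python
def interpret_maldi_score(score: float) -> str:
    """Map a MALDI-TOF score to an identification level.

    Thresholds follow the text: >= 2 species level,
    >= 1.7 and < 2 genus level, < 1.7 no identification.
    The function name and labels are illustrative assumptions.
    """
    if score >= 2.0:
        return "species"
    if score >= 1.7:
        return "genus"
    return "none"


# The score reported for strain FF5 falls in the species-level band.
print(interpret_maldi_score(2.3))
```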

Genome sequencing information
Genome project history
Pantoea septica strain FF5 was selected for sequencing because no genome of P. septica has previously been described. Besides, this strain is part of a study aiming to characterize the skin flora of healthy Senegalese people. It is the 17th genome of Pantoea species to be sequenced and the first genome within P. septica. The GenBank accession number is CCAQ000000000 and it consists of 4 scaffolds and 37 contigs. Table 2 shows the project information and its association with MIGS version 2.0 compliance [11]. Associated MIGS records are detailed in Additional file 1: Table S1.

Growth conditions and genomic DNA preparation
Pantoea septica strain FF5 (= CSUR P3024 = DSM 27843) was grown aerobically on 5 % sheep blood-enriched Columbia agar (bioMérieux) at 37°C. Bacteria grown on four Petri dishes were resuspended in 5 × 100 μL of TE buffer; 150 μL of this suspension was diluted in 350 μL 10X TE buffer, 25 μL proteinase K and 50 μL sodium dodecyl sulfate for lysis treatment. This preparation was incubated overnight at 56°C. DNA was purified using 3 successive phenol-chloroform extractions and ethanol precipitation at −20°C of at least two hours each. Following centrifugation, the DNA was suspended in 65 μL EB buffer. Genomic DNA concentration was measured at 46.06 ng/μL using the Qubit assay with the high-sensitivity kit (Life Technologies, Carlsbad, CA, USA).

Genome sequencing and assembly
Figure caption (rpoB phylogenetic tree): The rpoB sequences were aligned using MUSCLE [31], and the phylogenetic tree was inferred using the Maximum Likelihood method with Kimura 2-parameter model from MEGA software. Numbers at the nodes are percentages of bootstrap values obtained by repeating the analysis 1,000 times to generate a majority consensus tree. The scale bar represents a rate of substitution per site of 0.02.
The genomic DNA of Pantoea septica was sequenced using MiSeq Technology (Illumina Inc, San Diego, CA, USA) with 2 applications: paired-end and mate-pair. The paired-end and mate-pair strategies were barcoded in order to be mixed respectively with 10 other genomic projects prepared with the Nextera XT DNA sample prep kit (Illumina) and 11 other projects prepared with the Nextera Mate-Pair sample prep kit (Illumina).
Genomic DNA was diluted to 1 ng/μL to prepare the paired-end library. The "tagmentation" step fragmented and tagged the DNA with an optimal size distribution of 2.25 kb. Limited-cycle PCR amplification (12 cycles) completed the tag adapters and introduced dual-index barcodes. After purification on AMPure XP beads (Beckman Coulter Inc, Fullerton, CA, USA), the libraries were normalized on specific beads according to the Nextera XT protocol (Illumina). Normalized libraries were pooled into a single library for sequencing on the MiSeq. The pooled single-strand library was loaded onto the reagent cartridge, then onto the instrument along with the flow cell. Automated cluster generation and paired-end sequencing with dual index reads were performed in a single 39-h run in 2 × 250-bp mode. Total information of 5.91 GB was obtained from a 654 K/mm2 cluster density, with 93.7 % of the clusters passing quality control filters (12,204,000 clusters). Within this run, the index representation for P. septica was determined to be 2.25 %. After filtering according to read quality, 257,400 reads were retained for P. septica.
The mate-pair library was prepared with 1 μg of genomic DNA using the Nextera mate-pair Illumina guide. The genomic DNA sample was simultaneously fragmented and tagged with a mate-pair junction adapter. The fragmentation profile was validated on an Agilent 2100 BioAnalyzer (Agilent Technologies Inc, Santa Clara, CA, USA) with a DNA 7500 labchip. The DNA fragments ranged in size from 1.5 kb up to 14 kb with an optimal size of 9 kb. No size selection was performed and 600 ng of tagmented fragments were circularized. The circularized DNA was mechanically sheared into small fragments on the Covaris device S2 in microtubes (Covaris, Woburn, MA, USA). The library profile was visualized on a High-Sensitivity Bioanalyzer LabChip (Agilent Technologies Inc, Santa Clara, CA, USA). The libraries were normalized at 2 nM and pooled. After a denaturation step and dilution to 10 pM, the pool of libraries was loaded onto the reagent cartridge, then onto the instrument along with the flow cell. Automated cluster generation and sequencing were performed in a single 39-h run in 2 × 250-bp mode. An overall quantity of 3.2 GB was obtained from a 690 K/mm2 cluster density, with 95.4 % of the clusters passing quality control filters (13,264,000 clusters). The index representation for P. septica was determined to be 7.26 % within this run. After filtering according to read quality, a total of 918,753 reads were retained for P. septica.
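As a rough consistency check on the run statistics reported for the paired-end and mate-pair libraries, the following Python sketch multiplies the cluster count by the passing-filter fraction and by the index representation. It assumes that the quoted cluster counts are totals before filtering; the text does not say how the read counts were actually derived, so this is only a back-of-the-envelope estimate, and the function name is hypothetical.

```python
def estimate_reads(clusters: int, passing_pct: float, index_pct: float) -> float:
    """Estimate reads assigned to one index in a pooled run.

    Assumption (not stated in the text): `clusters` is the count before
    quality filtering, so both percentages are applied to it.
    """
    return clusters * (passing_pct / 100.0) * (index_pct / 100.0)


# (clusters, % passing filter, % index representation, reported reads)
runs = {
    "paired-end": (12_204_000, 93.7, 2.25, 257_400),
    "mate-pair": (13_264_000, 95.4, 7.26, 918_753),
}

for name, (clusters, passing, index, reported) in runs.items():
    estimate = estimate_reads(clusters, passing, index)
    deviation = 100.0 * abs(estimate - reported) / reported
    print(f"{name}: estimated {estimate:,.0f}, reported {reported:,} "
          f"({deviation:.2f} % apart)")
```

Under this assumption the estimates fall within a fraction of a percent of the reported read counts for both runs.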

Genome annotation
Open Reading Frame (ORF) prediction was performed using Prodigal [12] with default parameters. We removed the predicted ORFs if they spanned a sequencing gap region. Functional assessment of protein sequences was performed by comparing them with sequences in the GenBank [13] and Clusters of Orthologous Groups (COG) databases using BLASTP. tRNAs, rRNAs, signal peptides and transmembrane helices were identified using tRNAscan-SE 1.21 [14], RNAmmer [15], SignalP [16] and TMHMM [17], respectively. Artemis [18] was used for data management, whereas DNAPlotter [19] was used for visualization of genomic features. In-house Perl and bash scripts were used to automate these routine tasks. ORFans were sequences with no homology in a given database (such as the non-redundant, nr, database). They were identified if their BLASTP E-value was lower than 1e-03 for alignment lengths greater than 80 amino acids. If alignment lengths were smaller than 80 amino acids, we used an E-value of 1e-05. PHAST was used to identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids [20].
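The length-dependent E-value cutoffs described above can be sketched in Python as follows. This is an interpretation rather than the authors' script: it assumes that a BLASTP hit counts as evidence of homology when its E-value is below the cutoff, and that a protein is an ORFan when none of its hits passes. The handling of alignments of exactly 80 amino acids is not specified in the text and is grouped with the shorter alignments here. All function names are hypothetical.

```python
from typing import Iterable, Tuple

LENGTH_CUTOFF_AA = 80  # alignment length boundary given in the text


def evalue_threshold(alignment_length: int) -> float:
    """Return the E-value cutoff for a given alignment length.

    Assumption: alignments of exactly 80 aa use the stricter cutoff.
    """
    return 1e-3 if alignment_length > LENGTH_CUTOFF_AA else 1e-5


def is_significant_hit(evalue: float, alignment_length: int) -> bool:
    """A hit is treated as significant if its E-value is below the cutoff."""
    return evalue < evalue_threshold(alignment_length)


def is_orfan(hits: Iterable[Tuple[float, int]]) -> bool:
    """Assumed definition: no BLASTP hit (evalue, length) is significant."""
    return not any(is_significant_hit(e, length) for e, length in hits)


# A protein whose only hit is short and weak is called an ORFan here.
print(is_orfan([(1e-4, 60)]))
```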
To estimate the nucleotide sequence similarity at the genome level between P. septica and 7 other members of the genus Pantoea and 4 members of the genus Enterobacter, we determined the AGIOS parameter as follows: orthologous proteins were detected using the Proteinortho software (with the following parameters: E-value 1e-5, 30 % identity, 50 % coverage and algebraic connectivity of 50 %) [21], and genomes were compared two by two. After fetching the corresponding nucleotide sequences of orthologous proteins for each pair of genomes, we determined the mean percentage of nucleotide sequence identity using the Needleman-Wunsch global alignment algorithm. The script created to calculate AGIOS values was named MAGi (Marseille Average genomic identity) and is written in Perl using BioPerl modules. GGDC analysis was also performed using the GGDC web server as previously reported [22].
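The AGIOS calculation described above (mean nucleotide identity over globally aligned ortholog pairs) can be illustrated with a small Python sketch. It is not the MAGi script: the scoring scheme (match +1, mismatch −1, linear gap −1) and the use of the full alignment length, gaps included, as the identity denominator are assumptions made for illustration, and the toy sequences are invented.

```python
from typing import List, Tuple


def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> Tuple[str, str]:
    """Globally align two sequences; scoring values are assumptions."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a: str, b: str) -> float:
    """Identical columns divided by alignment length (gaps included)."""
    aln_a, aln_b = needleman_wunsch(a, b)
    matches = sum(x == y and x != "-" for x, y in zip(aln_a, aln_b))
    return 100.0 * matches / len(aln_a)


def agios(ortholog_pairs: List[Tuple[str, str]]) -> float:
    """Mean percent identity over all ortholog pairs of two genomes."""
    return sum(percent_identity(a, b) for a, b in ortholog_pairs) / len(ortholog_pairs)


# Toy ortholog pairs (invented sequences, not from any genome).
print(agios([("ACGT", "ACGT"), ("ACGT", "ACGA")]))
```

For real genes, a production implementation would normally rely on an established alignment library rather than this quadratic-memory pure-Python version.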

Genome properties
The genome of P. septica strain FF5 is 4,548,444 bp long (1 chromosome, no plasmid) with a 59.1 % G + C content (Fig. 3). Of the 4193 predicted genes, 4125 were protein-coding genes and 68 were RNAs. A total of 3040 genes (72.50 %) were assigned a putative function. A total of 522 genes were annotated as hypothetical proteins. The properties and statistics of the genome are presented in Table 3. The distribution of genes into COG functional categories is presented in Table 4. A total of 214 genes were identified as ORFans (5.18 %).
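The summary statistics above can be reproduced with simple arithmetic. The Python sketch below shows how a G + C percentage is typically computed from a sequence and checks that the reported gene counts are consistent with the 72.50 % figure. The function name and the toy sequence are illustrative assumptions; the genome sequence itself is not included here.

```python
def gc_content(sequence: str) -> float:
    """Return the G + C percentage of a nucleotide sequence.

    Only A, C, G and T are counted; ambiguous bases such as N are ignored
    (an assumption, since the text does not say how they were handled).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


# Gene counts reported in the text.
protein_coding = 4125
rna_genes = 68
total_genes = protein_coding + rna_genes  # 4193 predicted genes
with_function = 3040

print(f"Total genes: {total_genes}")
print(f"Genes with putative function: {100.0 * with_function / total_genes:.2f} %")
print(f"G + C of a toy sequence: {gc_content('ATGC'):.1f} %")
```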
According to the previous demonstration that the G + C content deviation is at most 1 % within species, these values confirm the classification of strain FF5 in a distinct species [23].
Orthologous gene comparisons of P. septica strain FF5 with other closely related species are summarized in Table 6. Intraspecies AGIOS values ranged from 99.06 to 99.33 % for P. ananatis (Table 7). Interspecies AGIOS values ranged from 77.46 to 84.94 % within the Pantoea genus, and from 71.27 to 72.57 % between Pantoea and Enterobacter species (Table 7). When compared to other species, P. septica exhibited AGIOS values ranging from 77.7 to 80.5 % with Pantoea species and from 72.38 to 73.26 % with Enterobacter species (Table 7).

Conclusions
We describe the genome of Pantoea septica strain FF5. This is the first reported genome of P. septica.