The complete genome sequence of the yogurt isolate Streptococcus thermophilus ACA-DC 2

Streptococcus thermophilus ACA-DC 2 is a newly sequenced strain isolated from traditional Greek yogurt. Among the 14 fully sequenced strains of S. thermophilus currently deposited in the NCBI database, the ACA-DC 2 strain has the smallest chromosome, containing 1,731,838 bp. The annotation of its genome revealed the presence of 1,850 genes, including 1,556 protein-coding genes, 70 RNA genes and 224 potential pseudogenes. A large number of pseudogenes were identified. This was also accompanied by the absence of pathogenic features suggesting evolution of strain ACA-DC 2 through genome decay processes, most probably due to adaptation to the milk ecosystem. Analysis revealed the existence of one complete lactose-galactose operon, several proteolytic enzymes, one exopolysaccharide cluster, stress response genes and four putative antimicrobial peptides. Interestingly, one CRISPR-cas system and one orphan CRISPR, both carrying only one spacer, were predicted indicating low activity or inactivation of the cas proteins. Nevertheless, four putative restriction-modification systems were determined that may compensate any deficiencies of the CRISPR-cas system. Furthermore, whole genome phylogeny indicated three distinct clades within S. thermophilus. Comparative analysis among selected strains representative for each clade, including strain ACA-DC 2, revealed a high degree of conservation at the genomic scale, but also strain specific regions. Unique genes and genomic islands of strain ACA-DC 2 contained a number of genes potentially acquired through horizontal gene transfer events, that could be related to important technological properties for dairy starters. Our study suggests genomic traits in strain ACA-DC 2 compatible to the production of dairy fermented foods.


Introduction
The use of microorganisms in food fermentations is the means for converting perishable and frequently inedible raw materials into safe, shelf-stable and nutritionally upgraded foods [1]. The economic importance of starter cultures for the food industry has led to the continuous search for the discovery of new microorganisms with important technological characteristics. In many cases it has been proven that traditionally fermented foods represent a natural reservoir of undiscovered microbial strains for possible diverse food applications [2,3].
Streptococcus thermophilus is among the species commonly used in the dairy industry, mainly in the fermentation of yogurt and several cheese varieties, contributing to the desirable organoleptic characteristics of the final product [4,5]. It is the sole species considered GRAS within the Streptococcus genus, which includes mostly pathogens and opportunistic pathogens [6]. Due to the industrial significance of the species, a plethora of studies has been conducted for a number of strains, revealing information about their diverse technological features [7,8]. Furthermore, during the last 15 years, the advance of high-throughput sequencing techniques along with the development of novel bioinformatics tools facilitated the analysis of complete genome sequences, providing information for the overall genetic content of S. thermophilus [9][10][11][12]. These studies have demonstrated that S. thermophilus strains have been adapted to the milk environment through extensive reductive evolution as indicated by the large number of pseudogenes found in all strains. Adaptation to the milk environment is also supported by the loss of genes related to carbohydrate metabolism and virulence.
In this study, we present the analysis of the complete genome sequence of S. thermophilus ACA-DC 2. The genomic insights acquired could be proven useful for the exploitation of the specific strain in the production of fermented dairy products.

Classification and features
Streptococcus thermophilus ACA-DC 2 is classified within the order Lactobacillales of the class Bacilli. It is a non-sporulating, Gram-positive bacterium with coccus-shaped cells (Fig. 1). The strain was isolated from traditional Greek yogurt manufactured through backslopping [13,14]. Its optimum growth takes place in M17 medium at 42 o C under microaerophilic conditions within 24 h. Information about the classification and the features of S. thermophilus ACA-DC 2 is summarized in Table 1. The phylogenetic analysis was based on 16S rRNA gene sequences and places S. thermophilus ACA-DC 2 in the distinct cluster formed by the S. thermophilus strains and within the salivarius group, as shown in Fig. 2.

Genome sequencing information
Genome project history S. thermophilus ACA-DC 2 is deposited in the ACA-DC culture collection of the Laboratory of Dairy Research, Agricultural University of Athens, Athens, Greece. The strain was selected for sequencing in order to obtain information about its technological and probiotic potential, having as basic aim its application as a starter culture in the production of dairy fermented foods. The project was carried out in 2015 and the genome was sequenced, fully assembled and annotated. The genome sequencing project was registered in the European Nucleotide Archive database under accession number LT604076. The summary of the project information is shown in Table 2.
Growth conditions and genomic DNA preparation S. thermophilus ACA-DC 2 was grown in M17 medium (Biokar Diagnostics, Beauvais, France). For the isolation of the genomic DNA, 2 ml from an overnight culture incubated at 42°C were used and the extraction procedure was performed according to the protocol of Pitcher et al. [15]. The purity and the concentration of the extracted DNA were measured with a UV-Vis spectrophotometer (Q5000, Quawell, San Jose, USA) while its integrity was evaluated electrophoretically in a 0.8% agarose gel.

Genome sequencing and assembly
Whole-genome sequencing was performed using the Illumina HiSeq2500 and the PacBio RSII platforms at BaseClear service laboratory for DNA-research (Leiden, The Netherlands). Paired-end sequence reads were generated using the Illumina HiSeq2500 system. FASTQ sequence files were obtained using the Illumina Casava pipeline v1.8.3. Initial quality assessment was based on data passing the Illumina Chastity filtering. Subsequently, reads containing adapters and/or PhiX control signal were removed using an in-house filtering protocol. The second quality assessment was based on the remaining reads using the FASTQC quality control tool v0.10.0 resulting in 4,403,680 reads.
The data collected from the PacBio RSII instrument were processed and filtered using the SMRT Analysis software suite. The Continuous Long Read data were The quality of the Illumina FASTQ sequences was enhanced by trimming off low-quality bases using the program bbduk, which is part of the BBMap suite v34. 46. The quality-filtered sequence reads were puzzled into a number of contig sequences. The analysis was performed using ABySS v1.5.1. The contigs were linked and placed into super-scaffolds based on the alignment of the PacBio CLR reads with BLASR [16]. The alignment was further used to estimate the orientation, order and distance between the contigs by the SSPACE-LongRead scaffolder v1.0 [17]. The gapped regions within the super-scaffolds were closed in an automated manner using GapFiller v1.10 [18]. The method takes advantage of the insert size between the Illumina paired-end reads. The assembly resulted in one circular chromosome of 1,731,838 bp.

Genome annotation
Prediction of genes was carried out with the online programs Prodigal [19], MetaGeneAnnotator [20] and FGENESB [21], for comparison and verification of the obtained results. Genome annotation was performed using RAST v2.0 [22]. Annotation anomalies, including pseudogenes, were identified using Gene-PRIMP [23]. All data acquired were combined and subjected to manual curation. The WebMGA server [24] and the EggNog v4.5 [25] were used for COG annotation, the Phobius web server was used for the identification of genes with transmembrane helices and genes with signal peptides [26] and the Pfam database was used for the identification of genes with Pfam domains [27]. Potential pathogenic features were identified using the MP3 tool [28]. The CRISPRs, the restriction-modification systems and the putative antimicrobial peptides were predicted using the CRISPRFinder web tool [29], the REBASE database [30] and BAGEL3 [31], respectively. The KODON software (Applied Maths NV, Sint-Martens-Latem, Belgium) was utilized for the visualization of synteny among the CRISPR regions of ACA-DC 2 and LMD-9 strains. The EDGAR server [32] was used for whole genome phylogeny and Venn diagram analysis. Circoletto [33] was employed for whole genome alignment among S. thermophilus strains. Finally, the genomic islands were identified through the IslandViewer 3 web-based resource [34].

Genome properties
The complete genome of S. thermophilus ACA-DC 2 consists of one circular chromosome containing 1,731,838 bp. The average GC content of the chromosome is 39.2%. A total of 1,850 genes were predicted after manual curation, including 1,556 protein-coding genes, 70 RNAs (56 tRNAs and 14 rRNAs) and 224 potential pseudogenes (Table 3). A circular map of the genome was generated using the CGView comparison tool [35] as shown in Fig. 3. Function was assigned to 1,182 genes (63.89%), while 1,318 genes (71.24%) had one or more conserved Pfam domains. The distribution of protein-coding genes into COG Evidence codes -IDA inferred from direct assay, TAS traceable author statement (i.e., a direct report exists in the literature), NAS non-traceable author statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [54] functional category is shown in Table 4. The analysis revealed that approximately 28.5% of the proteincoding genes do not have any described function.

Insights from the genome sequence
Main genome sequence characteristics The genome of S. thermophilus ACA-DC 2 is the smallest one described to date among the fully sequenced strains of the species deposited in NCBI and it is approximately 200 kbp smaller than the larger described genome. The majority of potential pseudogenes encode hypothetical proteins, transposases and proteins involved in carbohydrate transport and metabolism. Analysis of the genome for virulence factors with the MP3 tool revealed a number of hits (data not shown). Detailed inspection of these hits with EDGAR demonstrated that several such genes are conserved among S. thermophilus strains indicating that it is rather unlikely to be related to pathogenicity, given the safe history of the species. The high percentage of pseudogenes along with the absence of typical virulence factors for streptococci suggest that the ACA-DC 2 strain evolved through genome decay during the adaptation to the rich in nutrients dairy environment [9,11]. S. thermophilus ACA-DC 2 carries a complete lactosegalactose operon containing the galR, galK, galT, galE, galM, lacS and lacZ genes (STACADC2_1195-1189) and it is able to ferment lactose and galactose, the latter in a fairly slow rate (data not shown). It has been reported that fermentation of galactose is limited among the strains of S. thermophilus [11]. As mentioned above, Fig. 2 Phylogenetic tree highlighting the position of S. thermophilus ACA-DC 2 relative to other Streptococcus species. The tree was constructed based on 16S rRNA gene sequences. GenBank accession numbers are presented in parentheses and type strains are indicated with a superscript T (type strains = T ). Strains with complete genome sequence are marked with an asterisk. 16S rRNA gene sequences were aligned using MUSCLE [55]. The phylogenetic tree was built by the Maximum Likelihood method within MEGA7 software [56] using the Tamura-Nei substitution model [57]. Lactococcus lactis subsp. lactis NCDO 604 T served as the outgroup. Bootstrap values derived after 1,000 replicates. The scale bar indicates an estimated 0.01 nucleotide change per nucleotide position several genes responsible for the transport and degradation of sugars, such as fructose, maltose and trehalose, have been identified as pseudogenes in the genome of ACA-DC 2, further supporting the specialization of the bacterium to catabolize lactose.
Similar to other dairy bacteria, S. thermophilus ACA-DC 2 is able to synthesize exopolysaccharides (EPS) that may confer improved viscosity and texture to yogurt [4]. The EPS cluster is flanked by a deoD gene encoding a purine nucleoside phosphorylase (STACADC2_0949) and a pseudogene originally encoding a beta-glucosidase. Four of these genes, namely epsA (STACADC2_0948), epsB (STACADC2_0947), epsC (STACADC2_0946) and epsD (STACADC2_0945) are implicated in the regulation, polymerization, chain length and export of the EPS and are conserved among several S. thermophilus strains [36].
The genome analysis of strain ACA-DC 2 revealed a number of genes known to be responsive to unfavorable conditions prevailing during industrial applications. Among them we identified conserved heat shock genes like grpE, dnaK, dnaJ (STACADC2_0105-0107) and groES, groEL (STACADC2_0179-0180), genes encoding Clp proteases (STACADC2_0071, STACADC2_0315, S TACADC2_0526, STACADC2_0544, STACADC2_13 91), a proton translocating F 0 F 1 -ATPase system (STAC ADC2_0430-0437) and a P-type calcium pump ATPase (STACADC2_0983). The strain also harbors genes related to oxidative stress, namely a Mn-superoxide dismutase (STACADC2_0657), a glutathione reductase (STACADC2_0362), two thioredoxins (STACADC2_1 043, STACADC2_1624), two thioredoxin reductases (STACADC2_1208, STACADC2_1429), a NADH oxidase (STACADC2_1113) and two sulfoxide reductases (STACADC2_1408, STACADC2_1159). Furthermore, the genome carries four putative antimicrobial peptides that need further investigation (STACADC2_0091, STACADC2_1453, STACADC2_1458 and STA-CADC2_1709).  Two candidate CRISPRs were detected in the chromosome of strain ACA-DC 2. Intriguingly, both CRISPRs contained only one spacer. One CRISPR was found surrounded by cas proteins (STACADC2_0849-0856) while the other was orphan. The CRISPR-cas system of strain ACA-DC 2 exhibited the same organization and high degree of identity to that described previously for strain LMD-9 (Fig. 4) [37]. The two CRISPR-cas systems differed mainly in the csm6 gene, which in the case of strain ACA-DC 2 is a potential pseudogene as well as in csm2 gene that seems to be distinct in the two strains. S. thermophilus LMD-9 carries three CRISPR-cas systems and the system that is similar to that of ACA-DC 2 carries the lowest number (three) of spacers. Combined these findings could indicate low activity or even inactivation of the entire CRISPR-cas system in strain ACA-DC 2. Another possibility that cannot be excluded concerns low exposure of strain ACA-DC 2 to foreign DNA. Of course, any deficiency in the activity of the CRISPR-cas system may be compensated by restriction-modification (RM) systems. Strain ACA-DC 2 carries four putative RM systems according to the REBASE database (data not shown) belonging to RM types I (STACADC2_0642, STACADC2_0645, STACADC2_0648), II (STACADC2_0597-0598), III (STACADC2_0788-0789) and IV (STACADC2_0626).

Comparative genomic analysis, strain specific genomic features and genomic islands
Resolution of phylogenetic trees based on 16S rRNA gene sequences is limited due to high sequence identity, especially for strains of the same species. For this reason, we also performed whole genome phylogeny as implemented in EDGAR, using all available complete genomes of S. thermophilus. The phylogenetic tree produced revealed that S. thermophilus strains could be clustered in two distinct branches, the second of which could be also split in two sub-branches (Fig. 5). Strain ACA-DC 2 formed one of the branches along with strains CNRZ1066, LMG 18311, S9 and CS8. We chose strains ACA-DC 2, JIM 8232 and KLDS 3.1003 as representatives of each branch for comparative genomic analysis (Fig. 5). Whole genome alignments revealed extensive regions of high identity (>98%) among the genomes. However, regions of lower identity (between 80 and 98%) as well as strain specific regions were also identified. Using Venn diagram analysis as implemented in EDGAR, we determined a core genome of 1,303 genes among the three genomes as well as 137, 185 and 236 unique genes for strains ACA-DC 2, KLDS 3.1003 and JIM 8232, respectively.
The 137 unique genes of strain ACA-DC 2 were found to be involved in diverse functions (Fig. 6). At least some of those genes may be the result of horizontal gene Fig. 3 Circular map of S. thermophilus ACA-DC 2 genome features generated with the CGview tool. From periphery to center: Protein coding genes on forward strand colored by COG category assignment; Genes on forward strand; Protein coding genes on reverse strand colored by COG category assignment; Genes on reverse strand; GC content; GC skew; Genome region in kbp transfer (HGT). HGT acquired genes may play a role in the technological properties of S. thermophilus strains [11]. Another analysis that may also reveal regions of HGT in the bacterial chromosome is the identification of GIs [38]. Twelve integrated GIs were predicted in the genome of S. thermophilus ACA-DC 2 (Fig. 6), containing a total of 213 genes also involved in diverse functions (Fig. 6). Detailed analysis of genes either unique or in the GIs could relate some of them to important technological traits. For example, we determined genes coding for cold shock proteins CspA and CspG (STACADC2_0749-0750), acid resistance locus arl7 (STACADC2_0743), putative bacteriocin peptides (STA-CADC2_1453 and STACADC2_1458) and a type I RM system (STACADC2_0642, STACADC2_0645, STA-CADC2_0648). A putative agmatinase gene (STA-CADC2_0818) that may play a role to protocooperation of S. thermophilus and L. bulgaricus during polyamine metabolism, was also detected in ACA-DC 2 strain [10]. Furthermore, genes implicated in zinc and heavy metals transport (STACADC2_0165-0166, STACADC2_0752), in DNA repair and metabolism (STACADC2_1696, STACADC2_1716, STACADC2_1719, STACADC2_1 754) as well as several ribosome binding proteins, were also identified (STACADC2_0137, STACADC2_1568-1 569, STACADC2_1667, STACADC2_1669-1671, STAC ADC2_1675-1695, STACADC2_1717, STACADC2_1 732-1733, STACADC2_1752, STACADC2_1755).

Conclusions
The genome of S. thermophilus ACA-DC 2 presents characteristics in accordance with its adaptation to the milk environment including a high percentage of pseudogenes and absence of pathogenic features. Our analysis revealed that the strain carries genes involved in lactose and galactose catabolism and protein degradation that may accommodate its growth during milk fermentation. Stress response related genes that may contribute to survival under technological hurdles were also detected. Whole genome phylogeny suggested that S. thermophilus strains may diversify in three phylogenetic clades. Comparative analysis of genomes representative of each clade, including strain ACA-DC 2, revealed a number of unique genes for the latter. Furthermore, certain unique genes or genes belonging to GIs could be related to technological The total is based on the total number of protein coding genes in the genome Fig. 4 Synteny plot of the CRISPR loci between S. thermophilus strains ACA-DC 2 and LMD-9. The synteny of the two regions was calculated by the KODON software. In both strains the cas genes are denoted in blue. Gene csm6 in strain ACA-DC 2 is a potential pseudogene and it is denoted in yellow. The pyrD and pyrF genes colored in beige define the upstream and downstream limits of the CRISPR loci. Percentages displayed in the ribbon areas correspond to the % identity among the nucleotide sequences