Streptococcal taxonomy based on genome sequence analyses

The identification of the clinically relevant viridans group streptococci at the species level is still problematic. The aim of this study was to extract taxonomic information from the complete genome sequences of 67 streptococci, comprising 19 species, by means of several genomic analyses: multilocus sequence analysis (MLSA), average amino acid identity (AAI), genomic signatures, genome-to-genome distances (GGD) and codon usage bias. We then assessed the usefulness of these genomic tools for species identification in streptococci. Our results showed that MLSA, AAI and GGD analyses are robust markers for identifying streptococci at the species level, for instance, S. pneumoniae, S. mitis, and S. oralis. A Streptococcus species can be defined as a group of strains that share ≥ 95% identity in MLSA and AAI, and > 70% identity in GGD. This approach allows an advanced understanding of bacterial diversity.


Introduction
Bacteria are subjected to numerous forces driving their diversification. As a consequence, different strains of a single bacterial species can sometimes explore distinct niches, be pathogenic or non-pathogenic, and present different metabolic pathways 1,2 . In such a scenario, the identification of bacterial isolates to the species level is a difficult task 1,2 .
Currently, bacterial species are considered to be a group of strains (including the type strain) that are characterized by a certain degree of phenotypic consistency, showing > 70% DNA-DNA hybridization values and over 97% 16S rRNA sequence similarity 4,5 . Identification of streptococci is based on the current taxonomic standards using a combination of 16S rRNA gene sequence analyses, DNA-DNA hybridization, serologic and phenotypic data; however, they have been strikingly resistant to satisfactory classification, reflected in frequently changing nomenclature 6,7 . For instance, the 16S rRNA gene sequences of S. mitis and S. oralis are almost identical (> 99%) to S. pneumoniae, making the use of this information alone insufficient to distinguish these species 8 .
Recent studies have used whole genome analysis to determine the taxonomic relationships among bacterial species 9-14 . In order to determine the robustness of genomic markers in streptococci species delineation, we analyzed a collection of 67 complete genomes. The availability of whole genome sequences of several closely related species, for instance, S. mitis-S. oralis-S. pneumoniae and S. salivarius-S. thermophilus-S. vestibularis, formed an ideal test case for the establishment of the genomic taxonomy of streptococci.

Material and methods
Genome sequence data
The genomic sequences of 67 streptococci that were publicly available for download by June 2nd, 2011 at the National Center for Biotechnology Information (NCBI) under the project accession number indicated in Table 1 were used in this study. The following analyses were performed according to Thompson et al. (2009) 13 and are briefly described below.
16S rRNA gene sequence analysis and multilocus sequence analysis (MLSA)
The 16S rRNA gene sequences and the gene sequences used for MLSA were obtained from GenBank (http://www.ncbi.nlm.nih.gov). The MLSA approach was based on the concatenated sequences of five house-keeping genes (aroE, ddl, gki, pheS and recA) 15,16 . The concatenated sequences were aligned with the ClustalX program 17 . The phylogenetic inference was based on the neighbour-joining genetic distance method (NJ) 18 using MEGA5 19 . Distance estimations were obtained according to the Kimura-2-parameter model 20 for the 16S rRNA gene and MLSA. The reliability of each tree topology was checked by 2000 bootstrap replications 21 .
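As an illustration of the distance model named above, the following is a minimal sketch of the Kimura-2-parameter distance between two already aligned sequences. It is not the MEGA5 implementation; the function name `k2p_distance` and the choice to skip gapped or ambiguous columns are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura-2-parameter distance between two aligned DNA sequences.

    Assumption for this sketch: columns containing anything other than
    A, C, G or T (gaps, ambiguity codes) are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x == y:
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable alignment columns")

    p = transitions / compared    # proportion of transitions
    q = transversions / compared  # proportion of transversions
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A pairwise matrix of such distances over the concatenated MLSA alignment is the kind of input a neighbour-joining method takes.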

Average amino acid identity (AAI)
The AAI of all conserved protein-coding genes was calculated as described previously 22 . Conserved protein-coding genes between a pair of genomes were determined by whole-genome pairwise sequence comparisons using the BLASTp algorithm 23 . For these comparisons, all protein-coding sequences (CDSs) from one genome were searched against the genomic sequence of the other genome. The genetic relatedness between a pair of genomes was measured by the AAI of all conserved genes between the two genomes as computed by the BLAST algorithm. By this approach, a value of < 95% AAI of protein-coding genes indicates separate species.
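The AAI calculation can be sketched as an average over the percent identities of conserved protein pairs. The sketch below assumes the BLASTp results have already been parsed into tuples; the function names, the tuple layout and the default cutoffs used to call a gene "conserved" (`min_identity`, `min_coverage`) are hypothetical choices for illustration and are not taken from this study.

```python
from statistics import mean


def average_amino_acid_identity(hits, min_identity=30.0, min_coverage=0.7):
    """Mean percent identity over conserved protein pairs.

    hits: iterable of (percent_identity, aligned_length, query_length),
          one tuple per best BLASTp hit of a query protein.
    min_identity, min_coverage: illustrative cutoffs (assumptions) used
          to decide which hits count as conserved genes.
    """
    conserved = [
        pid
        for pid, aligned_len, query_len in hits
        if pid >= min_identity and aligned_len / query_len >= min_coverage
    ]
    if not conserved:
        raise ValueError("no conserved genes passed the cutoffs")
    return mean(conserved)


def separate_species_by_aai(aai: float, threshold: float = 95.0) -> bool:
    """Apply the rule stated in the text: < 95% AAI indicates separate species."""
    return aai < threshold
```

For example, `average_amino_acid_identity([(98.0, 300, 310), (96.0, 200, 205)])` averages the two identities, and `separate_species_by_aai` then applies the 95% cutoff to the result.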

Codon usage
Codon usage bias was calculated for each genome. The effective number of codons used in a sequence (Nc) 24 was calculated using CHIPS (http://emboss.bioinformatics.nl/cgi-bin/emboss/chips) with the default parameters.
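To make the Nc statistic concrete, the following is a minimal sketch of Wright's effective number of codons for a single in-frame coding sequence, assuming the standard genetic code. It is not the CHIPS implementation and is not guaranteed to reproduce its output; in particular, the handling of rare or missing amino acids (they are simply skipped) and the cap at 61 are simplifying assumptions, and all names are chosen for this example.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODON_TABLE = {
    a + b + c: aa for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}

# Synonymous codon families, stop codons excluded.
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        FAMILIES[aa].append(codon)


def effective_number_of_codons(cds: str) -> float:
    """Wright's Nc for an in-frame DNA coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    homozygosity = defaultdict(list)  # degeneracy class -> list of F values
    for aa, codons in FAMILIES.items():
        k = len(codons)
        if k == 1:
            continue  # Met and Trp contribute the constant 2
        n = sum(counts[c] for c in codons)
        if n <= 1:
            continue  # too few observations to estimate F
        f = (n * sum((counts[c] / n) ** 2 for c in codons) - 1) / (n - 1)
        if f > 0:
            homozygosity[k].append(f)

    nc = 2.0
    # (degeneracy class, number of amino acids in that class)
    for k, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = homozygosity[k]
        if not values:
            raise ValueError(f"no usable amino acids with {k} codons")
        nc += n_amino_acids / (sum(values) / len(values))
    return min(nc, 61.0)
```

A sequence that uses a single codon for every amino acid gives the lower bound of 20, while equal use of all synonymous codons approaches the upper bound of 61.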
Determination of dinucleotide relative abundance values and genomic dissimilarity
Mononucleotide and dinucleotide frequencies were calculated using COMPSEQ (http://emboss.bioinformatics.nl/cgi-bin/emboss/compseq) with default parameters. Dinucleotide relative abundances (ρ*XY) were calculated using the equation ρ*XY = fXY / (fX fY), where fXY denotes the frequency of dinucleotide XY, and fX and fY denote the frequencies of X and Y, respectively. The difference in genome signature between two sequences is expressed by the genomic dissimilarity (δ*), which is the average absolute difference in dinucleotide relative abundance between two sequences f and g, and was calculated using the equation δ*(f,g) = (1/16) Σ |ρ*XY(f) − ρ*XY(g)| (multiplied by 1000 for convenience), where the sum extends over all 16 dinucleotides 25 .
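The two equations above can be sketched directly in code. This is an illustrative implementation working on raw sequence strings rather than on COMPSEQ output; the function names and the `both_strands` option (which pools counts from the sequence and its reverse complement) are assumptions made for this example.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def relative_abundances(seq: str, both_strands: bool = True) -> dict:
    """Dinucleotide relative abundances rho*_XY = f_XY / (f_X * f_Y)."""
    seq = seq.upper()
    strands = [seq]
    if both_strands:
        # Assumption: pool counts with the reverse complement strand.
        strands.append(seq.translate(_COMPLEMENT)[::-1])

    mono, di = Counter(), Counter()
    for s in strands:
        mono.update(c for c in s if c in BASES)
        di.update(
            s[i:i + 2]
            for i in range(len(s) - 1)
            if s[i] in BASES and s[i + 1] in BASES
        )

    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_mono == 0 or n_di == 0:
        raise ValueError("sequence too short or contains no A/C/G/T")

    rho = {}
    for x, y in product(BASES, repeat=2):
        f_x, f_y = mono[x] / n_mono, mono[y] / n_mono
        f_xy = di[x + y] / n_di
        rho[x + y] = f_xy / (f_x * f_y) if f_x and f_y else 0.0
    return rho


def genomic_dissimilarity(seq_f: str, seq_g: str, both_strands: bool = True) -> float:
    """delta*(f, g): mean absolute rho* difference over 16 dinucleotides, x1000."""
    rho_f = relative_abundances(seq_f, both_strands)
    rho_g = relative_abundances(seq_g, both_strands)
    return 1000 * sum(abs(rho_f[k] - rho_g[k]) for k in rho_f) / 16
```

Identical sequences give a dissimilarity of 0, and the measure is symmetric in its two arguments.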

Genome-to-genome distances (GGD)
The genome distance was calculated using genome-to-genome distance calculator (GGDC) 26 . Distances between a pair of genomes were determined by whole-genome pairwise sequence comparisons using BLAST 23 . For these comparisons, algorithms were used to determine high-scoring segment pairs (HSPs) for inferring intergenomic distances for species delimitation. The corresponding distance threshold can be used for species delimitation 26 .

Results and discussion
In this work we compared complete genomes of 67 streptococci comprising 19 species to address their taxonomic position. A previous study with a small set of streptococci genomes (eight) and species (four), using a combination of several genomic analyses, showed the applicability of this approach in streptococci taxonomy 9 . Overall, our analysis of a larger data set showed that genomic taxonomy is an accurate approach for clearly defining streptococci species. The taxonomic resolution of the 16S rRNA, AAI, MLSA, GGD and codon usage analyses for streptococci species definition is summarized in Table 2.

General genomic features
The complete genome of each of the streptococci comprised a single chromosome. The estimated size of the genomes ranged from 1.7 Mb (S. infantis) to 2.3 Mb (S. sanguinis). The number of CDS varied from 1,700 (S. pyogenes) to 2,352 (S. pneumoniae) (Table 1). The average G+C content of streptococci genomes ranged from 35% to 43%. These species presented variable interspecies genome size and G+C content, indicating heterogeneity within the genus Streptococcus. One of the reasons for this variability could be the frequent occurrence of horizontal gene transfer events 27-29 .
Phylogenetic reconstructions by 16S rRNA and MLSA
MLSA and 16S rRNA phylogenetic trees showed similar topologies (Figure 1). The MLSA was performed using five instead of the seven genes applied in the pneumococcus multilocus sequence typing

S. thermophilus and S. vestibularis species showed dissimilarity values between 5 and 12, and S. pneumoniae, S. mitis and S. oralis species had dissimilarity values between 3 and 14. Thus, there was no clear differentiation of these closely related species within the VGS group on the basis of the genomic dissimilarity values. This could be due to the extensive recombination and horizontal gene transfer events that occur between closely related streptococci species that share ecological niches 12,30 .
On the other hand, species within the Pyogenic group had a distinct genomic signature, with values ranging from 13 to 85. However, genome signatures alone have significant limitations when used as phylogenetic markers for differentiating members of the VGS. The exact mechanisms that generate and maintain the genome signatures are complex, but possibly involve differences in species-specific compositional bias, i.e., G+C content, G+C and A+T skews, codon bias, and mutation bias 32,33 .

Codon usage bias (Nc)
Nc values provide a meaningful measure of the extent of codon preference in a genome. Values range between 20 (an extremely biased genome where one codon is used per amino acid) and 61 (all synonymous codons are used equally). Within the set of 67 complete streptococci genomes examined in this study, the Nc ranged from 44.0 to 54.5 (Table 1).

Genome distance analysis
The GGD was calculated only for closely related species that were not differentiated by 16S rRNA gene sequence analysis (Figure 1). Based on GGD analysis the species within the Mitis and Salivarius groups were identified as separate species, showing GGD values analogous to the < 70% discriminatory value used for DNA-DNA hybridization. Conversely, S. bovis ATCC 700338 and S. gallolyticus were identified as belonging to the same species by GGD.
S. bovis ATCC 700338 (biotype II) and S. gallolyticus, as well as S. sanguinis ATCC 49296 and S. oralis ATCC 35037T, were not separated; according to this analysis, each pair would therefore be classified as a single species. It was shown that S. bovis biotype I and II/2 isolates were, in fact, S. gallolyticus 34 , and S. sanguinis ATCC 49296 was placed into the S. oralis species by GGD analysis. A misidentification of S. sanguinis ATCC 49296 has already been shown by means of biochemical and serological properties by Narikawa and colleagues 35 .
Another interesting result is that the S. parasanguinis ATCC 15912 and F0405 strains were found to be at the limits for definition as members of the same species based on different genomic analyses. For instance, they shared 95% AAI, 94% identity by MLSA, a value of 17 on the basis of genomic signature and < 70% similarity in GGD. Therefore, based on these genomic markers, these S. parasanguinis strains could, in fact, be separate species. These data reflect the complexity of bacterial species delineation, since these organisms are all under a constant evolutionary process.

Conclusion
The delineation of closely related streptococci species was evident in this genomic study. Different methods produced different levels of taxonomic resolution. The methods with the higher resolution for species identification were MLSA and AAI, while closely related species had similar Nc values and genomic signatures. Based on the genomic analyses, a Streptococcus species can be defined as a group of strains that shares ≥ 95% identity in MLSA and AAI, and > 70% identity in GGD. This definition may be useful to advance the taxonomy of Streptococcus. This approach allows an advanced understanding of bacterial diversity and identification.
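The species definition proposed above can be expressed as a simple decision rule. The sketch below is illustrative only: the function name and argument names are chosen for this example, and all three values are assumed to be percentages for one pair of strains.

```python
def same_streptococcus_species(
    mlsa_identity: float, aai: float, ggd_identity: float
) -> bool:
    """Apply the proposed definition to one pair of strains.

    Two strains are assigned to the same species when they share
    >= 95% identity in MLSA, >= 95% AAI and > 70% identity in GGD.
    """
    return mlsa_identity >= 95.0 and aai >= 95.0 and ggd_identity > 70.0


# Values reported in the text for S. parasanguinis ATCC 15912 vs F0405:
# 94% MLSA identity, 95% AAI and < 70% GGD. The exact GGD value is not
# given, so 69.0 is a placeholder (assumption) standing in for "< 70%".
parasanguinis_pair = same_streptococcus_species(94.0, 95.0, 69.0)
```

Under this rule the S. parasanguinis pair is not assigned to a single species, consistent with the discussion of these strains.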
Author contributions
CCT and VEE carried out the computational and genomic analyses and analyzed the results. All authors (ACPV, CCT, ELF, MAM and VEE) participated in discussing and writing the manuscript. All authors have agreed to the final contents of the article.

Competing interests
No relevant competing interests were disclosed.

Grant information
VEE had a PRODOC-CAPES fellowship. CCT has a PNPD-CAPES fellowship, ELF has a PNPD-FAPERJ fellowship and MAM has a CAPES fellowship.