Recently published Streptomyces genome sequences

Many readers of this journal will need no introduction to the bacterial genus Streptomyces, which includes several hundred species, many of which produce biotechnologically useful secondary metabolites. The last 2 years have seen numerous publications describing Streptomyces genome sequences (Table 1), mostly as short genome announcements restricted to just 500 words and therefore allowing little description and analysis. Our aim in this current manuscript is to survey these recent publications and to dig a little deeper where appropriate. The genus Streptomyces is now one of the most highly sequenced, with 19 finished genomic sequences (Table 2) and a further 125 draft assemblies available in the GenBank database as of 3 May 2014; by the time this is published, no doubt there will be more. The reasons given for sequencing this latest crop of Streptomyces include production of industrially important enzymes, degradation of lignin, antibiotic production, rapid growth and halo-tolerance, and an endophytic lifestyle (Table 1).
 
 
 
Table 1. Recent genome publications (2013 and 2014) for Streptomyces species.

Table 2. Completely sequenced Streptomyces species genome sequences available in GenBank as of 29 April 2014.
 
 
 
 
Mining genomes for secondary metabolism gene clusters 
Given the strong emphasis on secondary metabolism in Streptomyces genomics research, it is timely that version 2.0 of antiSMASH has been released and published (Blin et al., 2013). This computational tool has become a de facto standard for mining secondary metabolism gene clusters in genome sequences. Version 2.0 is completely revamped and, significantly, can now be used with highly fragmented draft-quality genome sequences whereas the previous version only worked well with finished genomes. Clearly, this is of immense importance to the discovery of novel metabolites in the ever-expanding database of streptomycete draft-quality genome sequences. For example, antiSMASH 2.0 analysis of the Streptomyces roseochromogenes subsp. oscitans DS 12.976 genome sequence revealed 43 new gene clusters in addition to recovering the already known clorobiocin gene cluster (Ruckert et al., 2014). 
 
The genome sequence of Streptomyces gancidicus strain BKS 13–15 was published before antiSMASH 2.0 became available. The authors state (Kumar et al., 2013) that seven genes mapped on to the streptomycin biosynthesis pathway, based on gene-by-gene sequence similarity to homologues of genes in KEGG pathways (Kanehisa et al., 2012). However, we found no bioinformatic evidence for a streptomycin biosynthesis pathway encoded in this genome, although our antiSMASH 2.0 search did find 38 putative gene clusters. In common with many other pathways for secondary metabolism, genes for production of the aminoglycoside streptomycin are organized into a cluster of contiguous genes. The nucleotide sequences of at least two such clusters are available (GenBank accessions GU384160 and AJ862840 from Streptomyces platensis and Streptomyces griseus respectively). Our BLASTN searches (using these two cluster sequences as queries) failed to detect a complete streptomycin gene cluster in the S. gancidicus genome, but there were some regions of sequence similarity on a 111 kb contig (GenBank: AOHP01000057). An antiSMASH 2.0 search failed to find any aminoglycoside biosynthetic cluster in this genome. We are not aware of any experimental evidence that this strain produces the aminoglycoside streptomycin and conclude that these seven genes highlighted by the authors (Kumar et al., 2013) most probably encode components of another, perhaps novel, pathway. This illustrates the value of the antiSMASH 2.0 tool, which has the potential to discover new pathways, rather than relying on similarity to pathways that are already represented in the KEGG database and are therefore, by definition, not novel.
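A BLASTN check of this kind can be summarised as the fraction of each query cluster that is covered by hits in the genome assembly. The following is a minimal sketch, assuming the hits are available in the standard 12-column tabular BLAST layout (query start and end in columns 7 and 8). The function name query_coverage, the identifiers and all coordinates are hypothetical and are not data from the study.

```python
from collections import defaultdict


def query_coverage(blast_lines, query_lengths):
    """Return {query_id: fraction of query bases covered by at least one hit}.

    blast_lines: tab-separated lines in the standard 12-column BLAST tabular
    layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend,
    sstart, send, evalue, bitscore).
    query_lengths: {query_id: query length in bp}.
    """
    intervals = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qstart, qend = int(fields[6]), int(fields[7])
        intervals[fields[0]].append((min(qstart, qend), max(qstart, qend)))

    coverage = {}
    for qseqid, length in query_lengths.items():
        covered = 0
        current_end = 0  # last 1-based query position already counted
        for start, end in sorted(intervals.get(qseqid, [])):
            if end <= current_end:
                continue  # hit lies entirely inside a region already counted
            covered += end - max(start, current_end + 1) + 1
            current_end = end
        coverage[qseqid] = covered / length
    return coverage


# Hypothetical hits, for illustration only: two overlapping hits on one
# contig and a third hit on a different contig.
example_hits = [
    "clusterA\tcontig57\t85.0\t1000\t150\t0\t1\t1000\t5000\t5999\t0.0\t900",
    "clusterA\tcontig57\t82.0\t1000\t180\t0\t501\t1500\t7000\t7999\t0.0\t800",
    "clusterA\tcontig12\t80.0\t500\t100\t0\t4001\t4500\t100\t599\t1e-50\t400",
]
example_coverage = query_coverage(example_hits, {"clusterA": 10000})
print(example_coverage)
```

Overlapping hits are merged before counting, so a region matched twice is not counted twice. Low coverage of a known cluster, as in this toy example, would be consistent with partial sequence similarity rather than an intact cluster.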
 
The case of Streptomyces species strain Mg1 (Hoefler et al., 2013) illustrates another consideration when mining bacterial genome sequences for secondary metabolism gene clusters. Many of the recently published Streptomyces genome sequences are assembled from data generated on massively parallel sequencing platforms such as 454 GS-FLX and Illumina HiSeq. The short sequence reads (typically less than 450 bp) and relatively high error rates associated with these platforms can lead to rather fragmented and/or incomplete genome assemblies. The situation is not helped by the biased sequence composition (approximately 70% G + C) of Streptomyces DNA. Furthermore, non-ribosomal peptide synthetases (NRPS) and polyketide synthases (PKS) are long, modular proteins made up of many repeated domain units. This means that the genes encoding these key enzymes can be particularly difficult to assemble accurately from short sequence reads. To overcome this issue, the authors of the Mg1 genome project (Hoefler et al., 2013) exploited the PacBio SMRT sequencing technology, which provides sequence reads of several kb in length, meaning that an entire PKS or NRPS gene could be represented on a single sequence read, thus avoiding the difficulties of assembling repetitive sequence from short fragments. They also generated an assembly of the same genome based on 454 GS-FLX and Illumina HiSeq data. The results were striking: more than 90% of the genome was represented in a single contig of 7.8 Mb in the PacBio-based assembly, and the PacBio-based assembly was 19.9% longer than the 454/Illumina-based one (8 705 754 versus 7 260 368 bp). As the authors point out, this implies that more than 1 Mb of sequence in the PacBio-based assembly is missing from the 454/Illumina-based one, as can be seen in Fig. 1A. However, the 454/Illumina-based assembly is not simply a subset of the PacBio-based one; as illustrated in Fig. 1B, a substantial portion of the 454/Illumina-based assembly is missing from the PacBio assembly. Although it is by no means certain which assembly is more ‘correct’, it might be possible to generate a more complete genome assembly by reconciling the two different assemblies.
 
 
 
 
 
 
Fragmentation and incompleteness of a genome assembly have implications for the discovery of secondary metabolism gene clusters. In Fig. 1C, we show a putative NRPS gene cluster, identified by antiSMASH 2.0, that is apparently intact in a single contig of the PacBio-based sequence assembly. Searching the 454/Illumina-based assembly reveals two incomplete fragments of the gene cluster, lying on two different contigs, and with part of the cluster apparently absent. Although we should be cautious about extrapolating too much from this single anecdotal example, the evidence suggests that longer read lengths can be very valuable in genome mining for secondary metabolism clusters.
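The assembly lengths quoted above for strain Mg1 can be checked with a few lines of arithmetic. This is a minimal sketch; the function name compare_assembly_lengths is our own, and only the two lengths in bp are taken from the text.

```python
def compare_assembly_lengths(longer_bp, shorter_bp):
    """Return (difference in bp, percentage by which the longer assembly exceeds the shorter)."""
    difference_bp = longer_bp - shorter_bp
    percent_longer = 100.0 * difference_bp / shorter_bp
    return difference_bp, percent_longer


pacbio_bp = 8_705_754        # PacBio-based assembly length
short_read_bp = 7_260_368    # 454/Illumina-based assembly length

difference_bp, percent_longer = compare_assembly_lengths(pacbio_bp, short_read_bp)
print(f"{difference_bp} bp difference; {percent_longer:.1f}% longer")
```

The difference in length is a net figure. Because each assembly contains sequence that is missing from the other, the amount of PacBio-only sequence is at least this large.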



Digesting wood: Streptomyces viridosporus T7A
Streptomycetes may have important applications other than production of secondary metabolites, for example lignin degradation (Thomas and Crawford, 1998; Bugg et al., 2011; Brown and Chang, 2014). The aromatic polymer lignin is a major component of plant material and there is significant interest in organisms that can break down lignocellulose waste materials to generate useful products such as bioethanol (Bugg et al., 2011). Digestion of lignin is important not only because it can comprise up to 30% of plant biomass but also because its removal is necessary to facilitate degradation of hemicellulose and cellulose. The enzymology of lignin degradation is best understood in fungi, but it has become apparent that a number of bacterial species also have this capability (Brown and Chang, 2014). For example, S. viridosporus T7A is able to solubilize lignin, probably via the action of at least one extracellular peroxidase (Thomas and Crawford, 1998). A complete genome sequence is now available for this strain (Davis et al., 2013), revealing a number of genes encoding candidate lignin-degrading enzymes (see Table 3). This species is closely related to Streptomyces ghanaensis for which a genome sequence is also available (GenBank: ABYA00000000) and which is notable for its production of the antibiotic moenomycin A (Subramaniam-Niehaus et al., 1997; Ostash et al., 2007; 2009). Most of the candidate lignin metabolism genes in Table 3 are also conserved in S. ghanaensis. We are not aware of any published reports of S. ghanaensis being able to degrade lignin, but it would be interesting to experimentally test whether it has this capability; if it does not, then comparative genomics between these closely related strains might reveal novel genetic determinants of lignin degradation.

Fig. 1. Comparison of two different genome assemblies for Streptomyces strain Mg1, one based on PacBio sequence data and the other based on 454 and Illumina sequence data. A illustrates alignment of both the assemblies against the PacBio-based assembly. B illustrates both the assemblies aligned against the 454/Illumina-based assembly. C illustrates a novel secondary-metabolism gene cluster identified by antiSMASH 2.0 (Blin et al., 2013) in both assemblies. The entire cluster is recovered intact in the PacBio-based assembly but it is split across two different contigs in the 454/Illumina-based assembly and part of the middle of the cluster is missing. Alignments in A and B were generated using the nucleotide Basic Local Alignment Search Tool, BLASTN (Altschul et al., 1990), and visualized using the BLAST Ring Image Generator (BRIG) (Alikhan et al., 2011). The innermost ring indicates the genomic position. The next ring is a plot of G + C content. The remaining five concentric rings indicate the presence or absence of BLASTN hits at that position, with one ring corresponding to each of the five indicated genome assemblies. To aid clarity, each ring is represented in a different colour. Positions covered by BLASTN alignments are indicated with a solid colour; whitespace gaps represent genomic regions not covered by the BLASTN alignments. The graphics in C were cut and pasted directly from the antiSMASH output.

Genome size: Streptomyces violaceusniger
Among bacteria, streptomycetes have some of the largest genomes, typically within the range of 8.7 Mbp to 11.9 Mbp. However, the recently reported genome sequence of S. violaceusniger strain SPC6 weighs in at just 6.4 Mb (Chen et al., 2013) and that of Streptomyces albus J1074 at 6.8 Mb (Zaburannyi et al., 2014). Although both sets of authors (Chen et al., 2013; Zaburannyi et al., 2014) claim theirs as the smallest reported genome of any streptomycete, in fact that record is held by the previously sequenced Streptomyces somaliensis strain DSM 40738, a pathogenic strain isolated from a human infection (Kirby et al., 2012). The assembly of this genome was just 5.18 Mbp in length; the authors of that study claim that this is consistent with results from pulsed-field gel electrophoresis. Our multilocus sequence analysis (data not shown) reveals that strain SPC6, also known as Streptomyces thermolilacinus SPC6, is not closely related to S. violaceusniger strain TU 4113 (GenBank: CP002994), which has an 11.14 Mbp genome. Rather, strains SPC6 and DSM 40738 are closely related and fall within a clade with several other strains for which draft genomes are available and with Streptomyces venezuelae, for which a complete finished genome sequence is available (Pullan et al., 2011). Figure 2 shows the sizes of these genomes. It appears that genome reduction may have occurred at least twice in this clade: once in a common ancestor of SPC6 and DSM 40738, and also independently in an ancestor of strain CNT372 (GenBank: ARHT00000000). It is even possible that genome reduction has occurred independently in SPC6 and DSM 40738, as Fig. 2C reveals differences as well as similarities in gene conservation with respect to the S. venezuelae reference sequence. Evidently, genome reduction has also occurred in S. albus strain J1074 (Zaburannyi et al., 2014), which is not closely related to this clade. In this strain, the reduction seems to have been achieved by deletion of duplicated genes. The evolutionary driver for genome reduction in streptomycetes is unclear, although it might not be mere coincidence that the smallest genome reported so far is from a pathogen, namely S. somaliensis (Kirby et al., 2012), and evolution of pathogenesis is often associated with genome reduction (Toft and Andersson, 2010).

Fig. 2. Variation in genome size among Streptomyces somaliensis and its close relatives. A shows a section of a maximum-likelihood phylogenetic tree based on aligned sequences of five housekeeping genes (atpD, gyrB, recA, rpoB, trpB) extracted from draft genome sequence assemblies or, in the case of S. venezuelae, finished genome sequence, which is indicated by the black triangle. The tree was generated using MEGA6 (Tamura et al., 2013). B indicates the length of each genome assembly. C illustrates alignments of each genome assembly against the S. venezuelae reference genome, which consists of a single linear chromosome. Alignments were generated using the nucleotide Basic Local Alignment Search Tool, BLASTN (Altschul et al., 1990), and visualized using the BLAST Ring Image Generator (BRIG) (Alikhan et al., 2011). The innermost ring indicates the genomic position. The next ring is a plot of G + C content. The remaining five concentric rings indicate the presence or absence of BLASTN hits at that position, with one ring corresponding to each of the five indicated genome assemblies. To aid clarity, each ring is represented in a different colour. Positions covered by BLASTN alignments are indicated with a solid colour; whitespace gaps represent genomic regions not covered by the BLASTN alignments.
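A multilocus analysis based on five housekeeping genes, as in Fig. 2A, usually starts by concatenating the per-gene alignments into one sequence per strain before a tree is built. The following is a minimal sketch of that concatenation step only, assuming each gene has already been aligned. The function name concatenate_alignments, the strain labels and the toy sequences are hypothetical, and this is not presented as the procedure actually used for the figure.

```python
GENES = ("atpD", "gyrB", "recA", "rpoB", "trpB")


def concatenate_alignments(alignments, genes=GENES):
    """Join per-gene alignments into one concatenated sequence per strain.

    alignments: {gene: {strain: aligned sequence}}. Only strains present
    for every gene are kept, and genes are joined in the order given.
    """
    strains = sorted(set.intersection(*(set(alignments[g]) for g in genes)))
    concatenated = {strain: "" for strain in strains}
    for gene in genes:
        lengths = {len(seq) for seq in alignments[gene].values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        for strain in strains:
            concatenated[strain] += alignments[gene][strain]
    return concatenated


# Toy alignments, for illustration only ('-' marks an alignment gap).
toy_alignments = {
    "atpD": {"strainA": "ATG-C", "strainB": "ATGGC"},
    "gyrB": {"strainA": "GGC", "strainB": "GGT"},
    "recA": {"strainA": "TT", "strainB": "TA"},
    "rpoB": {"strainA": "CCGA", "strainB": "CC-A"},
    "trpB": {"strainA": "A", "strainB": "G"},
}
toy_concatenated = concatenate_alignments(toy_alignments)
print(toy_concatenated)
```

The concatenated sequences all have the same length, so they can be passed together to a tree-building program as a single alignment.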

The future of Streptomyces genomics
The availability of cheap sequencing has led to the generation of numerous genome sequences for Streptomyces and related species with the objective of discovering novel metabolic products. However, sequencing the genome and discovering novel gene clusters is just the beginning; many of the metabolic products of these gene clusters are 'cryptic', not being expressed under normal laboratory conditions. Productive 'genome mining' requires either genetic modification of the cluster to force expression or cloning and expression of the cluster in a heterologous host (Gomez-Escribano and Bibb, 2014). The value of this approach, even starting from rather poor-quality draft genome sequences, has been demonstrated by the discovery of the gene cluster encoding cypemycin in Streptomyces sp. strain OH-4156, revealing an unusual class of post-translationally modified ribosomally synthesized peptides (Claesen and Bibb, 2010). There will inevitably be a lag between the initial frenzy of genome sequencing and the characterization of novel useful products, as the biochemical investigations are more laborious than the sequencing. Another interesting emerging theme is the role of endophytic streptomycetes and the growing evidence that their secondary metabolites contribute to the medicinal properties of their host plants (e.g. Akshatha et al., 2014). The most recently published Streptomyces genome comes from strain PRh5, an endophyte of wild rice that produces nigericin, an antibiotic effective against mycobacteria (Yang et al., 2014).