The biotechnological relevance of Rhodococcus rhodochrous. The complete genomic sequence of nitrile biocatalyst strain ATCC BAA-870.

Background Rhodococci are industrially important soil-dwelling Gram-positive bacteria that are well known for both nitrile hydrolysis and oxidative metabolism of aromatics. Rhodococcus rhodochrous ATCC BAA-870 is capable of metabolising a wide range of aliphatic and aromatic nitriles and amides. The expressed nitrilase, nitrile hydratase and amidase activities have shown stereoselective preferences for beta-substituted nitrile compounds. The genome of the organism was sequenced and analysed in order to better understand this whole cell biocatalyst. Results The genome of R. rhodochrous ATCC BAA-870 is the first Rhodococcus genome fully sequenced using Nanopore sequencing. The 5.9 megabase pair (Mbp) genome consists of a circular chromosome and a 0.53 Mbp linear plasmid, which together encode 7548 predicted protein sequences according to BASys annotation and 5535 predicted protein sequences according to RAST annotation. The genome contains numerous oxidoreductases, 15 identified antibiotic and secondary metabolite gene clusters, and several terpene and nonribosomal peptide synthetase clusters, as well as 6 putative clusters of unknown type. The 0.53 Mbp plasmid encodes 677 predicted genes and contains the nitrile converting gene cluster. Based on COG functional categories of proteins using RAST annotation, the largest groups of predicted annotated genes belong to known subsystems encoding amino acids and derivatives (19.7%), carbohydrates (13.4%), fatty acids, lipids and isoprenoids (12.2%), and cofactors, vitamins, prosthetic groups and pigments (9.4%). However, 74% of RAST annotated genes are not assigned clear functional roles within known metabolic pathways, and 38% of genes are annotated as hypothetical. BASys annotation predicts that 55% of annotated genes have an unknown function. The R. rhodochrous ATCC BAA-870 genome contains one possible CRISPR, identified by CRISPRCasFinder. R.
inorganic ion, 411 lipid and 67 nucleotide transport and metabolism gene functions, and 174 secondary metabolite biosynthesis, transport and catabolism genes. These multiple transport systems highlight the metabolic versatility of this Rhodococcus species, which facilitates the use of whole cells in biotechnological applications. The ectD gene, which is involved in the synthesis of the compatible solute hydroxyectoine, is essential for thermoprotection of the halophilic bacterium Chromohalobacter salexigens

transaminases. The capacity of this strain to hydrolyse nitriles resides upon a plasmid, containing a nitrilase, a low molecular weight nitrile hydratase, and an enantioselective amidase.

Background
Rhodococcus is arguably the most industrially important actinomycete genus [1] owing to its wide-ranging applications as a biocatalyst used in the synthesis of pharmaceuticals [2], in bioactive steroid production [3], fossil fuel desulphurization [4], and the production of kilotons of commodity chemicals [5]. Rhodococci have been shown to have a variety of important enzyme activities in the field of biodegradation (for reviews see [6] and [7]). These activities could also be harnessed for synthesis of various industrially relevant compounds [8]. One of the most interesting qualities of rhodococci that make them suitable for use in industrial biotechnology is their outer cell wall [9]. It is highly hydrophobic owing to a high percentage of mycolic acid, which promotes uptake of hydrophobic compounds. Furthermore, upon contact with organic solvents the cell wall composition changes, becoming more resistant to many solvents and more stable under industrially relevant conditions such as high substrate concentration and relatively high concentrations of both water-miscible and water-immiscible solvents. This results in a longer lifetime of the whole cell biocatalyst and consequently higher productivity.
Rhodococcal species isolated from soil are known to have diverse catabolic activities, and their genomes hold the key to survival in complex chemical environments [10]. The first full Rhodococcus genome sequenced was that of Rhodococcus jostii RHA1 (NCBI database: NC_008268.1) in 2006 [10].
R. jostii RHA1 was isolated in Japan from soil contaminated with the toxic insecticide lindane (hexachlorocyclohexane) [11] and was found to degrade a range of polychlorinated biphenyls (PCBs) [12]. Its full genome is 9.7 Mbp, inclusive of the 7.8 Mbp chromosome and 3 plasmids (pRHL1, 2 and 3). Since then, many additional rhodococci have been sequenced and deposited in databases or announced in journals (Supp. Info. Table S1). Most of these genomes have been sequenced by large sequencing initiatives such as the DOE Joint Genome Institute (USA), national institutes like the National Institute of Technology and Evaluation (Japan) or the Korea Research Institute of Bioscience and Biotechnology (KRIBB, Korea), and even single universities such as the University of Tokyo (Japan). In all cases, the reason for sequencing was to improve the public availability of a larger variety of genomes in order to obtain more "reference genomes" or "type strains". These can be used to identify useful genes for taxonomic classification, risk/safety assessment and industrial applications, or for understanding the structure and evolution of microbial genomes. A sequencing effort to improve prokaryotic systematics has been implemented by the University of Northumbria, which showed that full genome sequencing provides a robust basis for the classification and identification of rhodococci that have agricultural, industrial and medical/veterinary significance [13].
A few rhodococcal genomes have been more elaborately described, including R. erythropolis PR4 (NC_012490.1) [14], which degrades long alkanes [15]. Multiple monooxygenases and fatty acid β-oxidation pathway genes were found on the R. erythropolis PR4 genome and several plasmids, making this bacterium a strong candidate for bioremediation of hydrocarbon-contaminated sites and biodegradation of animal fats and vegetable oils. The related R. rhodochrous ATCC 17895 (NZ_ASJJ01000002) [16] also has many mono- and dioxygenases, as well as interesting hydration activities which could be of value for the organic chemist. The oleaginous bacterium R. opacus PD630 is a very appealing organism for the production of biofuels and was sequenced by two separate groups. Holder et al. used enrichment culturing of R. opacus PD630 to analyse the lipid biosynthesis of the organism, and the ~300 genes involved in oleaginous metabolism [17]. This sequence is being used in comparative studies for biofuel development. The draft sequence of the R. opacus PD630 genome was only recently released (NZ_AGVD01000000) and appears to be 9.15 Mbp, just slightly smaller than that of R. jostii RHA1. The full sequence of the same strain was also deposited in 2012 by Chen et al. (NZ_CP003949) [18], who focussed their research on the lipid droplets of this strain. Twenty strains of R. fascians were sequenced to understand the pathogenicity of this species for plants [19], which also resulted in the realisation that sequencing provides additional means to traditional ways of determining speciation in the very diverse genus of Rhodococcus [20]. The clinically important pathogenic strain R. hoagii 103S (formerly known as R. equi 103S) was also fully sequenced in order to understand its biology and virulence evolution (NC_014659.1) [21]. In this and other pathogenic R. hoagii strains, virulence genes are usually located on plasmids, which has been well described for several strains including ATCC 33701 and 103 [22], strain PAM1593 [23] and 96 strains isolated from Normandy (France) [24]. As many important traits are often located on (easily transferable) plasmids, numerous rhodococcal plasmid sequences have been submitted to the NCBI (Supp. Info. Table S2). More elaborate research has been published on the virulence plasmid pFiD188 from R. fascians D188 [25], pB264, a cryptic plasmid from Rhodococcus sp. B264-1 [26], pNC500 from R. rhodochrous B-276 [27], and several plasmids from R. opacus B4 [28] and PD630 [18]. R. erythropolis harbours many plasmids besides the three from strain PR4, including pRE8424 from strain DSM8424 [29], pFAJ2600 from NI86/21 [30] and pBD2 from strain BD2 [31]. All these sequences have highlighted the adaptability of rhodococci and explain the broad habitat of this genus.
The versatile nitrile-degrading bacterium, R. rhodochrous ATCC BAA-870 [32], was isolated through enrichment culturing of soil samples from South Africa on nitrile nitrogen sources. R. rhodochrous ATCC BAA-870 possesses nitrile-hydrolysing activity capable of metabolising a wide range of aliphatic and aromatic nitriles and amides through the activity of nitrilase, nitrile hydratase and amidase [32][33][34][35]. These enzymes catalyse the hydrolysis of a broad range of nitriles, including 3-phenylpropionitrile and 3-hydroxy-3-phenylpropionitrile. The enzymes also performed enantioselective hydrolysis of compounds selected from classes of chemicals used in pharmaceutical intermediates, such as β-adrenergic blocking agents, antitumor agents, antifungal antibiotics and antidiabetic drugs. Compounds such as 3-hydroxy-4-aryloxybutanenitriles, 3-benzoyloxypentanedinitriles and 3-amino-3-p-tolylpropanenitrile were enantioselectively hydrolysed by the nitrile hydratase-amidase system, while the nitrilase hydrolysed the opposite enantiomer of 3-amino-3-phenylpropanenitrile [36]. Biocatalytic nitrile hydrolysis affords valuable applications in industry, including production of solvents, extractants, pharmaceuticals, drug intermediates, and pesticides [37][38][39][40]. Herein, we describe the sequencing and annotation of R. rhodochrous ATCC BAA-870, identifying the genes associated with nitrile hydrolysis as well as other genes for potential biocatalytic applications. The extensive description of this genome and the comparison to other sequenced rhodococci will add to the knowledge of the Rhodococcus phylogeny and its industrial capacity.

Results
Genome preparation, assembly and annotation

Initial sequencing and assembly, and final nanopore sequencing
The genome of R. rhodochrous ATCC BAA-870 was originally sequenced in 2009 by Solexa Illumina (sequence reads with average length 36 bp), resulting in a coverage of 74%, with an apparent raw coverage depth of 36x. An initial assembly of this 36-cycle, single-ended Illumina library, together with a mate-pair library, yielded a 6 Mbp genome of 257 scaffolds. A more recent paired-end Illumina library, combined with the mate-pair library, reduced this to only 6 scaffolds (5.88 Mbp). Even after several rounds of linking the mate-pair reads, we were still left with 3 separate contiguous sequences (contigs). The obstacle was the presence of repeats in the genome, one of which was a 5.2 kb contig containing 16S-like genes that, based on sequence coverage, must exist in four copies. Applying third generation sequencing (Oxford Nanopore Technology) enabled the full assembly of the genome, while the second generation (Illumina) reads provided the necessary proofreading. This resulted in a total genome size of 5.9 Mbp, consisting of a 5.37 Mbp circular chromosome and a 0.53 Mbp linear plasmid. The presence of the plasmid was confirmed by performing Pulsed-Field Gel Electrophoresis using non-digested DNA [41].
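The coverage-based inference that the 5.2 kb repeat exists in four copies can be illustrated with a short calculation. The following is a minimal sketch and not the procedure used in this study: the function name and all depth values are hypothetical placeholders, chosen only to show how a roughly fourfold excess over the median single-copy read depth implies four copies.

```python
from statistics import median


def estimate_copy_number(contig_depth: float, single_copy_depths: list[float]) -> int:
    """Estimate repeat copy number as contig depth / median single-copy depth."""
    baseline = median(single_copy_depths)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    # Round to the nearest whole copy; any assembled contig has at least one copy.
    return max(1, round(contig_depth / baseline))


# Hypothetical read depths (illustrative only, not measured values from the study).
single_copy = [34.0, 36.0, 35.0, 37.0, 36.0]
repeat_contig_depth = 143.0

copies = estimate_copy_number(repeat_contig_depth, single_copy)
print(f"Estimated copies of the repeat contig: {copies}")
```

A collapsed repeat attracts reads from every genomic copy, so its depth scales with copy number; this is why short reads alone could not place the four copies and long reads were needed to span them.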

Annotation
The assembled genome sequence of R. rhodochrous ATCC BAA-870 was submitted to the Bacterial Annotation System web server, BASys, for automated, in-depth annotation [42]. The BASys annotation was performed using raw sequence data for both the chromosome and plasmid of R. rhodochrous ATCC BAA-870 with a total genome length of 5.9 Mbp, in which 7548 genes were identified and annotated (Figure 1). The plasmid and chromosome encode a predicted 677 and 6871 genes, respectively. The same sequence run through RAST (Rapid Annotation using Subsystem Technology) predicted 5535 protein coding sequences (Figure 2), showing that gene counts depend strongly on the bioinformatics tool used, which complicates comparison to other genomes. Confirmation of annotation was performed manually for selected sequences. In BASys annotation, COGs (Clusters of Orthologous Groups) were automatically delineated by comparing protein sequences encoded in complete genomes representing major phylogenetic lineages [43]. As each COG consists of individual proteins or groups of paralogs from at least 3 lineages, it corresponds to an ancient conserved domain [44,45]. A total of 3387 genes annotated in BASys were assigned a COG function (44.9% of annotated genes), while 55 and 59% of annotated genes on the chromosome and plasmid, respectively, have unknown function. Based on counts of total genes annotated in RAST (5535), only 26% are classified as belonging to subsystems with known functional roles, while 74% of genes do not belong to known functional roles. Overall, 38% of annotated genes were annotated as hypothetical irrespective of whether they were included in subsystems or not. The complete genome sequence of R. rhodochrous Rhodococcus genome records deposited in the NCBI database, 16S rRNA gene counts range from 3-5 copies, with an average of 4 [46]. Of the four 16S rRNA genes found in R. rhodochrous ATCC BAA-870, two pairs are identical (i.e. there are two copies of two different 16S rRNA genes). 
One of each identical 16S rRNA gene was used in nucleotide-nucleotide BLAST for highly similar sequences [47].
BLAST results were used for comparison of R. rhodochrous ATCC BAA-870 to other similar species using 16S rRNA multiple sequence alignment and phylogeny in ClustalO and ClustalW respectively [48][49][50] (Figure 3). Nucleotide BLAST results of the two different R. rhodochrous ATCC BAA-870 16S rRNA genes show closest sequence identities to Rhodococcus sp. 2G and R. pyridinovorans SB3094, with either 100% or 99.74% identities to both strains depending on the 16S rRNA copy.
We used the in silico DNA-DNA hybridisation tool, the Genome-to-Genome Distance Calculator version 2.1 [51][52][53], to assess the genome similarity of R. rhodochrous ATCC BAA-870 to its closest matched strains based on 16S rRNA alignment (R. pyridinovorans SB3094 and Rhodococcus sp. 2G). The results of genome based species and subspecies delineation, and difference in GC content, are summarised (Supp. Info  Figure 4). Out of 7548 BASys annotated genes, 1481 are annotated enzymes that could be assigned an EC number (20%). BASys annotation may overpredict gene numbers, because the sensitive GLIMMER ab initio gene prediction method can give false positives for higher GC content sequences [54]. The RAST subsystem annotations are assigned from the manually curated SEED database, in which hypothetical proteins are annotated based only on related genomes. RAST annotations are grouped into two sets (genes that are either in a subsystem, or not in a subsystem) based on predicted roles of protein families with common functions. Genes belonging to recognised subsystems can be considered reliable and conservative gene predictions. Genes that do not belong to curated protein functional families (i.e. those not in a subsystem), however, may be underpredicted by RAST, since annotations belonging to subsystems are based only on related neighbours.
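The annotation figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the counts reported in the text (6871 chromosomal and 677 plasmid genes, 3387 COG assignments, 1481 EC assignments, 5535 RAST coding sequences); the variable and function names are illustrative assumptions.

```python
def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Counts reported in the text for the BASys annotation.
basys_chromosome_genes = 6871
basys_plasmid_genes = 677
basys_total = basys_chromosome_genes + basys_plasmid_genes

cog_assigned = 3387   # BASys genes with a COG function
ec_assigned = 1481    # BASys genes with an EC number
rast_total = 5535     # protein coding sequences predicted by RAST

cog_fraction = percent(cog_assigned, basys_total)
ec_fraction = percent(ec_assigned, basys_total)
basys_to_rast_ratio = basys_total / rast_total

print(f"BASys total genes: {basys_total}")
print(f"COG-assigned: {cog_fraction:.1f}%")
print(f"EC-assigned: {ec_fraction:.1f}%")
print(f"BASys/RAST gene count ratio: {basys_to_rast_ratio:.2f}")
```

The ratio makes the tool dependence concrete: for the same sequence, BASys reports roughly a third more genes than RAST.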
Sequences of other Rhodococcus genomes were obtained from the Genome database at NCBI [55] and show a large variation in genome size between 4 and 10 Mbp (Supp. Info. Table S1), with an average of 6.1 ± 1.6 Mbp. The apparent total genome size of R. rhodochrous ATCC BAA-870, 5.9 Mbp (consisting of a 5.37 Mbp chromosome and a 0.53 Mbp plasmid), is close to the average. Among the well-described rhodococci (Table 1), the genome of R. jostii RHA1 is the largest rhodococcal genome sequenced to date (9.7 Mbp), but only 7.8 Mbp is chromosomal, while the pathogenic R. hoagii genomes are the smallest at ~5 Mbp. All rhodococcal genomes have a high GC content, ranging from 62 to 71%. The average GC content of the R. rhodochrous ATCC BAA-870 chromosome and plasmid is 68.2% and 63.8%, respectively. R. jostii RHA1 has the lowest percentage coding DNA (87%), which is predictable given its large overall genome size, while R. rhodochrous ATCC BAA-870 has a 90.6% coding ratio and, on average, large genes of ~782 bp per gene. Interestingly, the distribution of protein lengths on the chromosome is bell-shaped with a peak at 350 bp per gene, while the genes on the plasmid show two size peaks, one at 100 bp and one at 350 bp. Together with the lower GC content, this suggests that the plasmid content was probably acquired on different occasions [56]. Additional analysis of the genome sequence using the tRNA finding tool tRNAScan-SE v. 2.0 [57,58] confirms the presence of 56 tRNA genes in the R. rhodochrous ATCC BAA-870 genome, made up of 52 tRNA genes encoding natural amino acids, 2 pseudogenes, one tRNA with mismatched isotype and one +9 Selenocysteine (TCA) tRNA.
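The genome statistics in this paragraph are straightforward to reproduce. The sketch below shows how GC content is computed for a sequence and how the reported replicon values combine into a genome-wide figure. The helper names and the toy sequence are illustrative assumptions; the replicon sizes, GC percentages and gene count are those quoted in the text. The per-gene length is computed here as rounded genome size divided by the BASys gene count, which happens to match the reported ~782 bp, although the text does not state how that figure was derived.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def length_weighted_gc(replicons: list[tuple[float, float]]) -> float:
    """Combine (length_bp, gc_percent) pairs into one genome-wide GC percentage."""
    total_length = sum(length for length, _ in replicons)
    return sum(length * gc for length, gc in replicons) / total_length


# Toy sequence to demonstrate the per-sequence calculation (not real data).
toy_gc = gc_content("GCGCGGATCC")

# Values quoted in the text: chromosome and plasmid sizes with their GC content.
genome_gc = length_weighted_gc([(5.37e6, 68.2), (0.53e6, 63.8)])

# Rounded genome size divided by the BASys gene count.
mean_bp_per_gene = 5.9e6 / 7548

print(f"Toy GC content: {toy_gc:.1f}%")
print(f"Genome-wide GC content: {genome_gc:.1f}%")
print(f"Mean bp per gene: {mean_bp_per_gene:.0f}")
```

Because the plasmid is less than a tenth of the genome, its lower GC content shifts the genome-wide value only slightly below that of the chromosome.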

Protein location in the cell
It is often critical to know where proteins are located in the cell in order to understand their function [59], and prediction of protein localization is important for both drug targeting and protein annotation.
In this study, prediction was done using the BASys SignalP signal prediction service [42]. The majority of annotated proteins are soluble and located in the cytoplasm (83%), while proteins located at the cellular membrane make up 16% of the total. Cell membrane proteins include proteins that form part of lipid anchors, peripheral and integral cell membrane components, as well as proteins with single or multiple pass functions. Of the membrane proteins in R. rhodochrous ATCC BAA-870, 47% constitute single-pass, inner or peripheral membrane proteins, while 41% are multi-pass membrane proteins.
Most of the remaining proteins will be transported across the membrane. The periplasm contains proteins distinct from those in the cytoplasm which have various functions in cellular processes, including transport, degradation, and motility. Periplasmic proteins would mostly include hydrolytic enzymes such as proteases and nucleases, proteins involved in binding of ions, vitamins and sugar molecules, and those involved in chemotactic responses. Detoxifying proteins, such as penicillin binding proteins, are also presumed to be located mostly in the periplasm. The Entner-Doudoroff pathway is, however, rare in Gram-positive organisms, which preferentially use glycolysis for a richer ATP yield. There is no evidence of this pathway existing in R. rhodochrous ATCC BAA-870, indicating that the RHA1 strain must have acquired it rather recently. Enzymes found in other rhodococci such as lipases and esterases [62,63] are also present in strain BAA-870.

Aromatic Catabolism and oxidoreductases
As deduced from the better characterized pseudomonads [64], a large number of 'peripheral aromatic' pathways funnel a broad range of natural and xenobiotic compounds into a restricted number of 'central aromatic' pathways. Analysis of the R. rhodochrous ATCC BAA-870 genome suggests that at least four major pathways exist for the catabolism of central aromatic intermediates, comparable to the well-defined aromatic metabolism of Pseudomonas putida KT2440 strain [65].
Catabolism typically involves oxidative enzymes. The presence of multiple homologs of catabolic genes in Rhodococcus species suggests that they may provide a comprehensive biocatalytic profile [1]. In R. rhodochrous ATCC BAA-870 the dominant portion of annotated enzymes are involved in oxidation and reduction. There are about 500 oxidoreductase related genes including oxidases, hydrogenases, reductases, oxygenases, dioxygenases, cytochrome P450s, catalases and peroxiredoxins. These numbers are quite high compared to other bacteria of the same genome size, but in line with most other (sequenced) rhodococci [66]. In R. rhodochrous ATCC BAA-870 there are 71 monooxygenase genes, 11 of which are on the plasmid. Rhodococcus genomes usually encode large numbers of oxygenases [1]. Some of these are flavoproteins with diverse useful activities [67], which include monooxygenases capable of catalysing Baeyer-Villiger oxidations wherein a ketone is converted to an ester [68,69].
In R. rhodochrous ATCC BAA-870 there are 14 cytochrome P450 genes and their prevalence reflects a fundamental aspect of rhodococcal physiology. Similarly, the number of cytochrome P450 genes in R. jostii RHA1 is 25 (proportionate to the larger genome) and is typical of actinomycetes. It is unclear which oxygenases in R. rhodochrous ATCC BAA-870 are catabolic and which are involved in secondary metabolism, but their abundance is consistent with a potential ability to degrade an exceptional range of aromatic compounds (oxygenases catalyse the hydroxylation and cleavage of these compounds). Rhodococci are well known to have the capacity to catabolise hydrophobic compounds, including hydrocarbons and polychlorinated biphenyls (PCBs), mediated by a cytochrome P450 system [70][71][72][73]. Cytochrome P450 oxygenase is often found fused with a reductase, as in Rhodococcus sp. NCIMB 9784 [74]. Genes associated with biphenyl and PCB degradation are found in multiple sites on the R. jostii RHA1 genome, both on the chromosome as well as on linear plasmids [1]. R. jostii RHA1 was also found to show lignin-degrading activity, possibly based on the same oxidative capacity as that used to degrade biphenyl compounds [75].

Nitrile biocatalysis
Rhodococci are well known for their application in the commercial manufacture of amides and acids through hydrolysis of the corresponding nitriles. R. rhodochrous J1 can convert acrylonitrile to the commodity chemical acrylamide [86], and both Mitsubishi Rayon Co., Ltd (Japan) and Senmin (South Africa) are applying this biocatalytic reaction at the multi-kiloton scale. Lonza Guangzhou Fine Chemicals use the same biocatalyst for large-scale commercial synthesis of nicotinamide from 3-cyanopyridine [87]. Both processes rely on rhodococcal nitrile hydratase activity [81]. The locations and numbers of nitrile converting enzymes in the available genomes of Rhodococcus were identified (Table 2). As expected from previous studies, strain BAA-870 contains several nitrile converting enzymes [32]. A low molecular weight cobalt-containing nitrile hydratase and a nitrilase are present, along with two amidases. The low molecular weight nitrile hydratase gene and amidase form a cluster, along with their associated regulatory elements (Table 2), including cobalt transport genes necessary for uptake of cobalt for inclusion in the nitrile hydratase active site. This is all in line with previous activity assays using this Rhodococcus strain [33,34]. However, in most R. rhodochrous strains these enzymes are on the chromosome, while in R. rhodochrous ATCC BAA-870, they are found on a plasmid. In R. rhodochrous ATCC BAA-870 the nitrile hydratase is expressed constitutively, explaining why this strain is an exceptional nitrile biocatalyst [36]. Environmental pressure through chemical challenge by nitriles may have caused this deregulation of the nitrile biocatalyst by transferring it to a plasmid.
The nitrilase from R. ruber can hydrolyse acetonitrile, acrylonitrile, succinonitrile, fumaronitrile, adiponitrile, 2-cyanopyridine, 3-cyanopyridine, indole-3-acetonitrile and mandelonitrile [108]. The nitrilases from multiple R. erythropolis strains were active towards phenylacetonitrile [109]. R. rhodochrous nitrilase substrates include (among many others) benzonitrile for R. rhodochrous J1 [110] and crotononitrile and acrylonitrile for R. rhodochrous K22 [111]. R. rhodochrous ATCC BAA-870 expresses an enantioselective aliphatic nitrilase encoded on the plasmid, which is induced by dimethylformamide [36]. Another nitrilase/cyanide hydratase family protein is also annotated on the plasmid (this study) but has not been characterised. The diverse, yet sometimes very specific and enantioselective substrate specificities of all these rhodococci give rise to an almost plug-and-play system for many different synthetic applications. Combined with their high solvent tolerance, rhodococci are very well suited as biocatalysts to produce amides for both bulk chemicals and pharmaceutical ingredients.

Secondary metabolism and metabolite biosynthesis clusters
The ongoing search for new siderophores, antibiotics and antifungals has led to a recent explosion of interest in mining bacterial genomes [112], and the secondary metabolism of diverse soil-dwelling microbes remains relatively underexplored despite their huge biosynthetic potential [113]. Evidence of an extensive secondary metabolism in R. rhodochrous ATCC BAA-870 is supported by the presence of at least 227 genes linked to secondary metabolite biosynthesis, transport and catabolism. The  [116,117], and hydroxyectoine has been shown to confer heat stress protection in vivo [118]. Ectoines provide a variety of useful biotechnological and biomedical applications [119], and strains engineered for improved ectoine synthesis have been used for the industrial production of hydroxyectoine as a solute and enzyme stabiliser [120,121]. AntiSMASH analysis reveals 3 terpene biosynthetic clusters in the genome of R. rhodochrous ATCC BAA-870. Terpenes and isoprenoids are implicated in diverse structural and functional roles in nature, providing a rich pool of natural compounds with applications in synthetic chemistry, pharmaceutical, flavour, and even biofuel industries. The structures, functions and chemistries employed by the enzymes involved in terpene biosynthesis are well known, especially for plants and fungi [122,123].
However, it is only recently that bacterial terpenoids have been considered as a possible source of new natural product wealth [124,125], largely facilitated by the explosion of available bacterial genome sequences. Interestingly, bacterial terpene synthases have low sequence similarities, and show no significant overall amino acid identities compared to their plant and fungal counterparts.
Yamada et al. used a genome mining strategy to identify 262 bacterial synthases, and subsequent isolation and expression of genes in a Streptomyces host confirmed the activities of these predicted genes and led to the identification of 13 previously unknown terpene structures [124].
Soil-dwelling rhodococci are potentially rich sources for terpene and isoprenoid discovery. Examples of annotated R. rhodochrous ATCC BAA-870 genes related to terpene and isoprenoid biosynthesis include phytoene saturase and several phytoene synthases, dehydrogenases and related proteins, as well as numerous diphosphate synthases, isomerases and epimerases. The genome also contains, for example, lycopene cyclase, a novel non-redox flavoprotein [126], and farnesyl diphosphate synthase, farnesyl transferase, geranylgeranyl pyrophosphate synthetases and digeranylgeranylglycerophospholipid reductase. Farnesyl diphosphate synthase and geranylgeranyl pyrophosphate synthases are potential anticancer and anti-infective drug targets [122]. In addition, the R. rhodochrous ATCC BAA-870 plasmid encodes a lactone ring-opening enzyme, monoterpene epsilon-lactone hydrolase.
The abundance of PKS and NRPS clusters suggests that R. rhodochrous ATCC BAA-870 may host a significant potential source of molecules with immunosuppressing, antifungal, antibiotic and siderophore activities [127]. The R. rhodochrous ATCC BAA-870 genome has two PKS genes, one regulator of PKS expression, one exporter of polyketide antibiotics, as well as three genes for polyketide cyclase/dehydrase involved in polyketide biosynthesis. In addition, there are two actinorhodin polyketide dimerases. A total of five NRPS genes for secondary metabolite synthesis can be found on the chromosome, while in comparison R. jostii RHA1 contains 24 NRPS and seven PKS genes [10]. R. jostii RHA1 was also found to possess a pathway for the synthesis of a siderophore [128]. R. rhodochrous ATCC BAA-870 contains 4 probable siderophore-binding lipoproteins, 3 probable siderophore transport system permeases, and two probable siderophore transport system ATP-binding proteins. Other secondary metabolite genes found in R. rhodochrous ATCC BAA-870 include a dihydroxybenzoic acid-activating enzyme (2,3-dihydroxybenzoate-AMP ligase bacillibactin siderophore), phthiocerol/phenolphthiocerol synthesis polyketide synthase type I, two copies of linear gramicidin synthase subunits C and D genes, and tyrocidine synthase 2 and 3. strongly support our theory that R. rhodochrous ATCC BAA-870 has adapted its genome recently in response to the selective pressure of routine culturing in nitrile media in the laboratory. Even though isolated from contaminated soil, the much larger chromosome of R. jostii RHA1 has undergone relatively little recent genetic flux as supported by the presence of only two intact insertion sequences, relatively few transposase genes, and only one identified pseudogene [10]. The smaller R. rhodochrous ATCC BAA-870 genome still has the genetic space and tools to adapt relatively easily in response to environmental selection.

Discussion
New sequencing technology has revolutionized the cost and pace of obtaining genome information, and there has been a drive to sequence the genomes of organisms which have economic applications, as well as those with environmental interest [137,138]. This holds true for Rhodococcus genomes, of which only two were sequenced in 2006, while 13 years later 353 genomes are now available, mainly due to Whole Genome Shotgun sequencing efforts (Supp. Info. Table S1). The impact of better and faster sequencing, using improved sequencing techniques, is evident in this case of sequencing the R. reason for the assembly to break into 6 scaffolds. Using third generation sequencing (Nanopore), this problem was overcome, and the genome could be fully assembled. Hence, we see second generation sequencing evolving to produce higher quality assemblies, but the combination with 3rd generation sequencing was necessary to obtain the full-length closed bacterial genome.
It has been assumed that the annotation of prokaryotic genomes is simpler than that of the intron-containing genomes of eukaryotes. However, annotation has been shown to be problematic, especially with over- or under-prediction of small genes where the criterion used to decide the size of an ORF can systematically exclude annotation of small proteins [139]. Warren et al. (2010) used high-performance computational methods to show that current annotated prokaryotic genomes are missing 1153 candidate genes that have been excluded from annotations based on their size [139].
These missing genes do not show strong similarities to gene sequences in public databases, indicating that they may belong to gene families which are not currently annotated in genomes. Furthermore, they uncovered ~38,895 intergenic ORFs, currently labelled as 'putative' genes only by similarity to annotated genes, meaning that the annotations are absent. Therefore, prokaryotic gene finding and annotation programs do not accurately predict small genes, and are limited to the accuracy of existing database annotations. Hypothetical genes (genes without any functional assignment), genes that are assigned too generally to be of use, misannotated genes and undetected real genes remain the biggest challenges in assigning annotations to new genome data [140][141][142][143]. As such, there is the possibility that we are under-estimating the number of genes present on this genome.
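The effect of an ORF size criterion on small-gene annotation can be shown with a toy example. The sketch below is a deliberately simplified forward-strand ORF scanner and is not how GLIMMER, RAST or the method of Warren et al. work; the function name, the toy sequence and both thresholds are illustrative assumptions. It only demonstrates that raising the minimum length silently removes a short but otherwise valid ORF.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int) -> list[tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs with at least min_codons codons.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon.
    The codon count includes the start codon but not the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Toy sequence with one 4-codon ORF and one 10-codon ORF (not real data).
short_orf = "ATGAAACCCGGGTAA"
long_orf = "ATG" + "GCT" * 9 + "TGA"
toy_sequence = "CC" + short_orf + "T" + long_orf

permissive = find_orfs(toy_sequence, min_codons=3)
strict = find_orfs(toy_sequence, min_codons=10)

print("min 3 codons: ", permissive)
print("min 10 codons:", strict)
```

With the permissive threshold both ORFs are reported; with the strict threshold only the longer one survives, which mirrors how a fixed size cut-off can exclude genuine small proteins from an annotation.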

Identification of target genes for future biotechnology applications
An estimated 150 biocatalytic processes are currently being applied in industry [144][145][146]. The generally large and complex genomes of Rhodococcus species afford a wide range of genes attributed to extensive secondary metabolic pathways that are presumably responsible for an array of biotransformations and bioremediations. These secondary metabolic pathways have yet to be characterised and offer numerous targets for drug design as well as synthetic chemistry applications, especially since enzymes in secondary pathways are usually more promiscuous than enzymes in the primary pathways.
A number of potential genes which could be used for further biocatalyses have been identified in the genome of R. rhodochrous ATCC BAA-870, including nitrilase, nitrile hydratase, epoxide hydrolase, and monooxygenases. A substantial fraction of genes in the sequenced R. rhodochrous ATCC BAA-870 genome have unknown functions, and these could be important reservoirs for novel gene and protein discovery. Most of the biocatalytically useful classes of enzyme suggested by Pollard and Woodley [147] are present on the genome: proteases, lipases, esterases, reductases, nitrilase/cyanohydrolase/nitrile hydratases and amidases, transaminase, epoxide hydrolase, monooxygenases and cytochrome P450s. Only oxynitrilases (hydroxynitrile lyases) and halohydrin dehalogenase were not detected, although a haloacid dehalogenase is present. Rhodococci are robust industrial biocatalysts, and the metabolic abilities of the Rhodococcus genus will continue to attract attention for industrial uses as further bio-degradative [6] and biopharmaceutical [148] applications of the organism are identified. Preventative and remediative biotechnologies will become increasingly popular as the demand for alternative means of curbing pollution increases and the need for new antimicrobial compounds and pharmaceuticals becomes a priority.

Conclusions
The genome sequence of R. rhodochrous ATCC BAA-870 is one of 353 Rhodococcus genomes that have been sequenced to date, but it is only the 4th sequence that has been fully characterised on a biotechnological level. Therefore, the sequence of the R. rhodochrous ATCC BAA-870 genome will facilitate the further exploitation of rhodococci for biotechnology applications, as well as enable paired-end Illumina library was aligned, using BWA [150], to the assembly and the resulting Binary Alignment Map (BAM) file was processed by Pilon [151] for polishing the assembly (correcting assembly errors), using correction of only SNPs and short indels (--fix bases parameter).
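The polishing step described above (aligning Illumina reads with BWA and correcting the assembly with Pilon) can be sketched as a sequence of command lines. The helper below only builds and prints the commands; it does not run them. All file names, the Pilon jar path and the use of samtools for sorting and indexing are assumptions for illustration, since the text specifies only BWA, a BAM file and Pilon's `--fix bases` option; the exact commands and versions used in the study are not given here.

```python
import shlex


def polishing_commands(assembly, reads_1, reads_2,
                       bam="illumina_vs_assembly.bam",
                       pilon_jar="pilon.jar",
                       out_prefix="polished"):
    """Build shell commands for short-read polishing of a long-read assembly.

    All paths are hypothetical placeholders supplied by the caller.
    """
    q = shlex.quote
    return [
        # Index the draft assembly so that BWA can align against it.
        f"bwa index {q(assembly)}",
        # Align the paired-end reads and write a coordinate-sorted BAM file.
        f"bwa mem {q(assembly)} {q(reads_1)} {q(reads_2)} | samtools sort -o {q(bam)} -",
        # Index the BAM file, which Pilon requires.
        f"samtools index {q(bam)}",
        # Correct only SNPs and short indels with '--fix bases'.
        f"java -jar {q(pilon_jar)} --genome {q(assembly)} --frags {q(bam)} "
        f"--fix bases --output {q(out_prefix)}",
    ]


commands = polishing_commands("assembly.fasta", "reads_1.fastq", "reads_2.fastq")
for command in commands:
    print(command)
```

Restricting Pilon to base-level fixes leaves the long-read contig structure untouched, so the short reads act purely as proofreading rather than altering the assembly layout.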

Annotation
The assembled genome sequence of R. rhodochrous ATCC BAA-870 was submitted to the Bacterial Annotation System web server, BASys, for automated, in-depth annotation of the chromosomal and plasmid sequences [42]. BASys annotates based on microbial ab initio gene prediction using GLIMMER [54]. The genome sequence was also run on the RAST (Rapid Annotation using Subsystem Technology) server using the default RASTtk annotation pipeline for comparison [152,153]. RAST annotation uses the manually curated SEED database to infer gene annotations based on protein functional roles within families [154]. The two annotation pipelines offered different but useful and complementary input formats and results, and gene annotations of interest could be manually compared and confirmed.

Phylogenetic tree created using rhodococcal 16S rRNA ClustalW sequence alignments. Neighbour joining, distance corrected, phylogenetic cladogram created using Phylogeny in