Computational strategies for genome-based natural product discovery and engineering in fungi

Fungal natural products possess biological activities that are of great value to medicine, agriculture and manufacturing. Recent metagenomic studies accentuate the vastness of fungal taxonomic diversity, and the accompanying specialized metabolic diversity offers a great and still largely untapped resource for natural product discovery. Although fungal natural products show an impressive variation in chemical structures and biological activities, their biosynthetic pathways share a number of key characteristics. First, genes encoding successive steps of a biosynthetic pathway tend to be located adjacently on the chromosome in biosynthetic gene clusters (BGCs). Second, these BGCs are often are located on specific regions of the genome and show a discontinuous distribution among evolutionarily related species and isolates. Third, the same enzyme (super)families are often involved in the production of widely different compounds. Fourth, genes that function in the same pathway are often co-regulated, and therefore co-expressed across various growth conditions. In this mini-review, we describe how these partly interlinked characteristics can be exploited to computationally identify BGCs in fungal genomes and to connect them to their products. Particular attention will be given to novel algorithms to identify unusual classes of BGCs, as well as integrative pan-genomic approaches that use a combination of genomic and metabolomic data for parallelized natural product discovery across multiple strains. Such novel technologies will not only expedite the natural product discovery process, but will also allow the assembly of a high-quality toolbox for the re-design or even de novo design of biosynthetic pathways using synthetic biology approaches.


Abstract
Fungal natural products possess biological activities that are of great value to medicine, agriculture and manufacturing. Recent metagenomic studies accentuate the vastness of fungal taxonomic diversity, and the accompanying specialized metabolic diversity offers a great and still largely untapped resource for natural product discovery. Although fungal natural products show an impressive variation in chemical structures and biological activities, their biosynthetic pathways share a number of key characteristics. First, genes encoding successive steps of a biosynthetic pathway tend to be located adjacently on the chromosome in biosynthetic gene clusters (BGCs). Second, these BGCs are often are located on specific regions of the genome and show a discontinuous distribution among evolutionarily related species and isolates. Third, the same enzyme (super)families are often involved in the production of widely different compounds. Fourth, genes that function in the same pathway are often co-regulated, and therefore co-expressed across various growth conditions. In this mini-review, we describe how these partly interlinked characteristics can be exploited to computationally identify BGCs in fungal genomes and to connect them to their products. Particular attention will be given to novel algorithms to identify unusual classes of BGCs, as well as integrative pan-genomic approaches that use a combination of genomic and metabolomic data for parallelized natural product discovery across multiple strains. Such novel technologies will not only expedite the natural product discovery process, but will also allow the assembly of a high-quality toolbox for the re-design or even de novo design of biosynthetic pathways using synthetic biology approaches.

Introduction
During the past decade, genome mining has added a whole new dimension to fungal natural products research (Keller et al., 2005;Scharf and Brakhage, 2013;Wiemann and Keller, 2014). With the exponential increase of genomic data, computational tools have become indispensable to the discovery process, as they allow effective exploitation of these data to identify novel molecules and characterize the biosynthetic routes for the production of these compounds . Clearly, the full potential of this technology has not yet been fulfilled in fungi. Indeed, we argue that the development of new bioinformatic technologies offers new opportunities to increase the effectiveness of discovering and engineering fungal natural product biosynthetic pathways.
Firstly, the study of exotic classes of biosynthetic pathways, sometimes referred to as 'biosynthetic dark matter', has been limited in fungi. A key step in natural product bioinformatics is the identification of biosynthetic gene clusters (BGCs) that encode the enzymatic pathways for compound production, as well as the prediction of their products. Existing computational tools for fungal BGC identification, such as antiSMASH (Blin et al., 2013;Medema et al., 2011a;Weber et al., 2015), SMIPS/CASSIS (Wolf et al., 2015) and SMURF (Khaldi et al., 2010), have largely focused on known types of BGCs, encoding the enzymatic pathways to produce compounds such as polyketides, nonribosomal peptides (NRPs) and terpenes. While these algorithms are very accurate for the detection of BGCs for these specific compound classes, the intrinsic disadvantage is that they cannot detect types of BGCs that they were not trained to predict. Recently, algorithms have emerged that also allow the identification of additional, more obscure, BGCs (Cimermancic et al., 2014;Takeda et al., 2014;Umemura et al., 2013).
Secondly, the nature of fungal genome projects is emphatically shifting from the sequencing of single genomes to the generation of larger sets of genome sequences from one ore more closely related species or genera. This development still has to be factored into bioinformatic tools, which currently mostly provide single-genome outputs. In the meantime, however, there are a number of ways in which these single-genome outputs can already be combined in ways to generate pan-genomic overviews of specialized metabolism that greatly aid in connecting genes to metabolites.
Thirdly, both refactoring (i.e., redesigning) of BGCs from exotic organisms and de novo engineering of BGCs by design are becoming increasingly feasible. The former will greatly expand our abilities to study the specialized metabolites of uncharted branches of the fungal tree of life: thus far, the vast majority of the ~160 fungal BGCs characterized (i.e., with known chemical products) originate from only three genera: Aspergillus, Fusarium and Penicillium (Li et al., this issue). The latter will greatly benefit from the use of systematized information on BGCs that have been experimentally characterized in the past. The functions of many genes, enzymes and protein domains have been determined in the process, offering a wide array of building blocks for engineering novel biosynthetic pathways. Initiatives such as the Minimum Information about a Biosynthetic Gene cluster (MIBiG) aim to systematically catalog these parts as well as the experimental evidence available for their documented functions . This will allow combining them in new ways in synthetic biology efforts to (e.g., retrosynthetically) engineer biosynthetic pathways (Bachmann, 2010;Birmingham et al., 2014;Medema et al., 2012;Planson et al., 2012).
To more effectively exploit bioinformatic technology in the study of fungal natural products, it may be vital to recognize the distinct characteristics of fungal BGCs as opposed to bacterial ones (at which many bioinformatic tools have been focused thus far). In this mini-review, we will therefore start with providing an overview of the distinct architectural features and evolutionary dynamics of fungal biosynthetic gene clusters. Subsequently, we will use this as a foundation to discuss key ways in which computational tools can open up new avenues for fungal natural product discovery and engineering.

Key architectural features and evolutionary dynamics of fungal biosynthetic gene clusters
Fungal gene clusters have evolved in specific taxonomic lineages in response to various ecological needs. They not only comprise BGCs involved in the production of antimicrobials or virulence factors, but also gene clusters encoding the potential to utilize uncommon nutrients, or to degrade or modify harmful compounds. The organization of these genes in clusters has been suggested to facilitate horizontal gene transfer (HGT) of the genes (Walton, 2000), protect offspring from intoxication in case of partial inheritance of a biosynthetic pathway with toxic intermediates, and impart concerted gene regulation due to local chromosome features such as methylation (Bok et al., 2009). Each of these options have been supported to various degrees by recent studies, as reviewed by Wisecaver and Rokas (2015). The predicted evolutionary similarity of genes within BGCs often shows a discord with the phylogeny of the species (Ward et al., 2002). While this has been noted previously several times, the increasing availability of fungal genomes and the robust phylogeny of fungal species now available (Lutzoni et al., 2004) makes it easier to detect genes showing such a discord. As a result, also complex evolutionary histories of BGCs can be resolved, through the use of state-of-the-art phylogenetic methods. Several pan-genomic studies have started to demonstrate the evolutionary dynamics of these gene clusters: often, these regions are highly variable even within species (Proctor et al., 2013(Proctor et al., , 2009Schardl et al., 2013;Susca et al., 2014). These dynamics typically include birth and death of clusters and chromosomal relocation of the entire cluster, recruitment of additional genes as well as loss of genes from the cluster and reordering of the genes within the cluster (Proctor et al., 2013(Proctor et al., , 2009Schardl et al., 2013;Susca et al., 2014). A related feature that has caught considerable attention is the positioning of these clusters in the genome. The (sub)telomeric regions, non-conserved regions and supernumerary or B chromosomes are often enriched for BGCs. These regions are more dynamic and may show higher rates of gene duplication and gene re-location. These regions also show co-regulated gene expression (Galazka and Freitag, 2014;Zhao et al., 2014), often by global regulators such as the Velvet complex (Bayram et al., 2008;Palmer et al., 2013;Wiemann et al., 2010) and EBR1 (Jonkers et al., 2014;Zhao et al., 2011). The complex evolutionary dynamics of fungal genomes also has a large effect on BGC architectures: besides the growth of BGCs by gene additions, BGC breakage can occur in some cases, causing BGCs to be split over multiple loci. However, concerted expression of genes generally remains present in multi-locus pathways. From an ecological perspective, it is important to appreciate that BGCs result in the production of natural products that may be crucial for survival and/or proliferation in a specific niche or host. Therefore, it is worthwhile to consider the environmental context of the organism: the same niche may require similar needs, leading to "capturing" of BGCs via HGT from fungi or bacteria that share the same habitat (Chooi and Solomon, 2014).
Because many computational tools for BGC identification and analysis thus far have been developed based on datasets from bacterial genomes, it is important to note the differences between fungal and bacterial gene clusters to improve fungal BGC detection methods. Although fungal and bacterial specialized metabolism share many features, such as the presence of certain functional classes of genes and the phenomenon of gene clustering itself, fungal BGCs are sometimes more difficult to uncover as they are often composed of fewer genes and are more often split over multiple loci. In addition, bioinformatic technologies for the identification of BGCs frequently rely on correct structural annotation of genes. The presence of introns in fungal genes makes this more difficult. Moreover, the annotation of the relatively exotic genes found in BGCs is in fact the most difficult. Not only do they often lack homology to genes from related species, they are also frequently transcriptionally silent; hence, acquiring transcriptomic evidence to verify intron/exon boundaries is challenging. Whereas bacterial genes are organized in operons to achieve coregulated gene expression, fungal genes often have a bidirectional orientation in which the cis-regulatory region is shared between a pair of genes (Brakhage, 2013). Finally, many polyketide and nonribosomal peptide biosynthetic pathways in fungi act in an iterative fashion (in contrast to the multimodular assembly lines that are more frequently found in bacteria); this also makes it more difficult to predict their chemical structures. Each of these differences have to be taken into account when using or adapting programs originally designed for bacteria.

Biosynthetic dark matter: beyond nonribosomal peptides and polyketides
While dozens of distinct BGC types have been characterized in bacteria, virtually all experimentally characterized BGCs from fungi center around a handful of enzymatic classes: polyketide synthases (PKSs), NRP synthetases (NRPSs), terpene cyclases (TCs) and dimethylallyl-tryptophan synthases (DMATs). While one explanation for this would be that few biosynthetic classes exist in fungi, there is also the intriguing possibility that a large part of fungal biosynthetic diversity has thus far escaped detection.
Over the past few years, evidence has been accumulating that supports the latter: through the use of motif-independent BGC identification techniques , the BGCs for kojic acid and ustiloxin B (a ribosomally synthesized and post-translationally modified peptide, abbreviated RiPP [Arnison et al., 2013]) were discovered, neither of which involves PKSs, NRPSs, TCs or DMATs (Figure 1a). The discovery of the ustiloxin B gene cluster even lead to the computationally driven discovery of a widespread and diverse class of fungal RiPPs (Nagano et al., 2016). In the two cases above, different computational techniques were applied to identify BGCs that do not contain any known types of core scaffold-synthesizing enzymes.
The discovery of the ustiloxin B BGC (Tsukui et al., 2015;Umemura et al., 2014) was made possible by the MIDDAS-M algorithm (Umemura et al., 2013), which uses co-expression correlation to identify genes that together may encode a biosynthetic pathway. More precisely, it uses a sliding window with a flexible size to identify chromosomally adjacent clusters of co-expressed genes in a fungal genome, based on two or more transcriptomic datasets. Even though the identification through such an algorithm is contingent on genes being expressed in at least one of the transcriptomic samples and will also identify nonbiosynthetic gene clusters, the power of this approach is that no assumptions are made regarding the nature of the genes that will identified. Hence, this approach potentially allows the discovery of novel classes of BGCs. In research on plant natural products, the coexpression approach has also been used successfully to identify biosynthetic pathways that are not or only partially clustered, or for which no clustering is known because of limited genomic closure (Giddings et al., 2011;Lau and Sattely, 2015;Rajniak et al., 2015); hence, provided that sufficient transcriptomic datasets are available that can be standardized to be comparable, this technique can also connect multiple biosynthetic loci that are involved in the same pathway.
The discovery of the kojic acid BGC (Terabayashi et al., 2010) was powered by the MIPS-CG algorithm (Takeda et al., 2014), which is based on a different principle: co-localization of genes over prolonged spans of evolutionary time. In 2005, Machida et al. found that BGCs are enriched in specific regions of the genome called non-syntenic blocks, specialized or 'accessory' genomic regions that are void of contiguous sets of genes with orthologs in related genomes (Machida et al., 2005). The MIPS-CG algorithm exploits this characteristic by looking for clustered genes in the non-syntenic blocks of one genome that show sequence similarity to (but are not orthologs of) clustered genes in a second genome. The assumption here is that at least some other genomes will contain (distant) relatives of the BGCs that are to be identified, which would have similar member genes that are not orthologous (i.e., have a discordant phylogeny compared to that of the species, e.g. due to HGT). Of course, while this will not be the case for all BGCs, the method again presents another way of looking for biosynthetic novelty, which could be extended upon in multiple ways as recently discussed elsewhere .
Yet another algorithm that does not assume specific genes that should be in a BGC, ClusterFinder (Cimermancic et al., 2014), does utilize homology information of protein sequence, but in a radically more broadly defined way than in strict motif-based search algorithms like antiSMASH or SMURF. ClusterFinder uses the frequencies of broadly defined protein families inside and outside BGCs to calculate the probability of each genomic region to be involved in the biosynthesis of specialized metabolites. While the training set for ClusterFinder mainly included bacterial BGCs, the algorithm still predicts interesting and potentially novel BGCs when applied to fungal genome sequences. For example, we identified several clustered genomic regions that do not contain PKSs, NRPSs, DMATs or TCs, but do contain a range of genes encoding enzymes known to be involved in decorating natural product backbones, such as cytochrome P450s, glycosyltransferases, dehydrogenases, prenyltransferases, etc. (see Figure 1b for three examples). As several of these BGCs are also species-specific, they have a high probability to encode novel specialized biosynthetic (or catabolic) pathways. As a word of caution, it should be noted that predicting the chemistry produced by such BGCs is currently not yet feasible; hence, the identification of their products remains a challenge.
In principle, the outputs of multiple motif-independent algorithms could be merged with those of motif-based tools such as antiSMASH and SMURF. Algorithms that are not specifically designed for the detection of BGCs, but do identify co-evolving modules of genes, may also add valuable information. For example, Li et al. (2014) recently published a phylogenetic profiling method to identify biological pathways by exploiting patterns of coevolution; because the genes involved in the same biosynthetic pathways are often taxonspecific to the same taxa, this method could work well to identify novel BGCs, without specifying a priori which genes should be present in them.

Pan-genomic studies of biosynthesis
A very useful characteristic of fungal BGC repertoires is that their diversity within a genus is considerable but still tractable; this allows the direct correlation of absence/presence patterns of metabolites with those of the BGCs that encode their production. Such a pattern-based genome mining strategy was recently pioneered for bacteria in a study of 35 Salinispora strains (Duncan et al., 2015): overlaying absence/presence matrices of compounds and BGCs lead to the identification of the gene cluster for retimycin, by using the absence/presence patterns to exclude all other candidate BGCs. This can be regarded as an extension and upscaling of the comparative genomic discovery strategy that has been used in fungi before for connecting individual BGCs to compounds, as recently reviewed by Cacho & Chooi (2015). Notably, considerable sequence variation may exist among BGCs that encode pathways towards the same chemical scaffolds, and differences frequently occur in the tailoring of these scaffolds. Hence, an essential step in this analysis is the clustering of compounds and gene clusters into families by means of adequate distance metrics based on similarities in mass fragmentation patterns and BGC sequences, respectively, using procedures recently reviewed elsewhere (Bouslimani et al., 2014;. As these procedures depend on an accurate genome annotation, BGCs that are missed due to false negative gene calls or other misannotations can lead to incongruencies between the two data types: if a molecule is present, but the corresponding BGC is incorrectly deemed absent, logic would lead to the conclusion that BGCs from the corresponding BGC family cannot encode the pathway for that molecule. One way to solve such a problem would be to use gene cluster homology searches  at the nucleotide level during the stage of BGC family delineation, to make sure that misannotated BGC-encoding genomic regions are still detected. The addition of transcriptome data and the gathering of metabolome/transcriptome data across multiple conditions or time points would enhance such a pattern-based strategy, as the compounds could be correlated to BGCs for which the key genes are actually expressed: by overlaying BGC family absence/presence and expression patterns and subsequently excluding possible connections for each BGC family with molecules until only one possibility remains (aided by chemical structure predictions where possible), multiple BGC families can be linked to molecules in parallel (Figure 2). The 'OSMAC (One Strain-Many Compounds) approach', as defined more than a decade ago, already took advantage of the fact that different conditions trigger the production of different metabolites (Bode et al., 2002). But while such strategies have frequently been attempted at the single-strain level (including through proteomic analysis [Gubbens et al., 2014]), the combination of a pan-genomic approach with the correlation of gene expression to metabolite identification would add a lot of statistical power. The challenge for designing such a large-scale experiment would be to find the right combination of strains and/or conditions so that as many BGCs as possible are expressed in at least a few condition/strain combinations: after all, if the majority remains cryptic, metabolites may still remain undetected. But if this succeeds, if needed through the use of mutant strains or ecologically relevant biological or chemical cues (Abdelmohsen et al., 2015;Nützmann et al., 2011;Zuck et al., 2011;Zutz et al., 2014), it has the potential to accelerate fungal natural product discovery. Indeed, such a strategy could even be made targeted by integrating phenotypic screens that allow for the identification of compounds with desired biological activities. Moreover, a key advantage of such a 'pangenomic' and highly parallelized natural product discovery strategy would be that it is not biased towards variants of known compounds and pathways; relatively large numbers of putative BGCs predicted from abovementioned computational approaches could easily be analyzed along with the more 'classical' ones without a further increase in cost.

Prospects for engineering biosynthesis
As new approaches in both computation and experimentation accelerate the discovery of new natural products and their corresponding BGCs, synthetic biology will also play an increasing role in both the discovery of novel compounds and the engineering of their pathways to diversify them and achieve production titers suitable for industry (Mattern et al., 2015;Winter and Tang, 2012).
Importantly, synthetic biology-based refactoring of BGCs (Medema et al., 2011b;Shao et al., 2013;Smanski et al., 2014) allows access to fungal strains that are difficult to cultivate, and potentially the characterization of compounds produced from clusters that are only known from metagenomic contigs or single-cell genomes from unculturable organisms. Several strategies for heterologous expression have already been devised (reviewed recently by Lazarus et al., 2014), which have lead to the successful production of several compounds outside their native host Heneghan et al., 2010;Tagami et al., 2013). Such approaches make it feasible to effectively isolate BGCs encoding cryptic pathways from metabolically complex organisms to study them in a clean background Chiang et al., 2011;Rutledge and Challis, 2015). In principle, they provide access to the full taxonomic and metabolic diversity of the fungal kingdom: around 70% of fungi in soils worldwide belong to subphyla or classes for which no BGCs have been studied yet (Tedersoo et al., 2014). Moreover, the vast repertoires of enzyme sequences retrieved from fungal (meta)genomes can be used as building blocks to optimize and diversify known biosynthetic pathways by identifying combinations of enzyme homologs that maximize catalytic efficiency across the entire metabolic route (Bayer et al., 2009).
As more and more enzymes and enzymatic domains are also experimentally characterized, a large catalogue of parts will develop that can be used not only to design modifications to existing pathways Pickens et al., 2011;Yeh et al., 2013), but even to attempt the de novo design of novel biosynthetic pathways in the future, e.g. through retrosynthetic approaches (Planson et al., 2012). A wide range of computational tools is being developed that can assist in such efforts by predicting optimally efficient pathways through metabolic maps (Medema et al., 2012). Importantly, community initiatives such as MIBiG  already allow the systematic storage of detailed gene and enzyme annotations from BGCs and the experimental evidence that these annotations are based on; the development of more intuitive, effective and user-friendly search tools to mine these data will make this information even more accessible for engineering purposes. As synthetic biology methodologies are further developed that cover the entire design/build/test cycle, the immense potential of fungal enzymatic and metabolic diversity will be exploited even more effectively to produce natural products of high industrial value.

Conclusions
Research on fungal natural products is in an exciting phase, as several ongoing technological developments are about to accelerate progress in the field in several ways. First, improvements in genome and transcriptome sequencing technologies are ongoing, leading to a continued exponential growth in the volumes of sequence data generated, of ever higher quality. Second, the recent development of several heterologous expression systems pave the way to explore more exotic fungal strains and species, and to study biosynthetic pathways in a defined background. Third, novel technologies such as CRISPR/Cas9 mutagenesis (Fuller et al., 2015;Nødvig et al., 2015;Zhang et al., 2016) will allow much more rapid generation of targeted mutants, which will make it easier to connect molecules to specific BGCs. To optimally profit from all of these exciting developments, clever use of computational tools will be of great importance: these will allow effective and even parallelized natural product discovery through pattern-based analysis of natural and engineered variation in strain and mutant libraries. However, these will not immediately lead to large numbers of completely elucidated structures, but rather to lists of molecular masses that are linked to BGC families. Subsequent estimation of variation in chemical structures and enzymatic mechanisms will be needed to make smart diversity-based choices on BGCs to heterologously express for detailed biochemical characterization. And for continued progress in the field, the hard work of biochemical analysis will still need to be done, as all computational strategies, however smart, depend on solid experimental data and require validation.

Figure 1: Recently characterized and computationally predicted exotic fungal biosynthetic gene clusters that escape detection by traditional motif-based tools
A) Two biosynthetic gene clusters from Aspergillus that have been recently characterized, encoding the enzymatic pathways towards ustiloxin B and kojic acid. Neither of these pathways contains PKSs, NRPSs or DMATs. B) Three examples of putative biosynthetic (or catabolic) gene clusters identified by the ClusterFinder algorithm, which evade detection by tools such as SMURF or antiSMASH, due to the absence of known scaffold-synthesizing enzymes. Cluster borders are by manual approximation.

Figure 2: Integrative pan-genomic approaches to genome-based discovery of fungal natural products
Pan-genomic pattern-based genome mining can be used to connect BGCs to molecules by logically excluding all other possible gene-molecule connections. After clustering of BGCs and mass-spectrometry-detected molecules into gene cluster families (GCFs) and molecular families (MFs), matrices can be generated that represent absence/presence patterns of GCFs and MFs across multiple strains (Duncan et al., 2015). This can be extended by also generating a similar matrix based on transcriptomic analysis of the BGCs, which adds valuable extra information, as BGCs that are not expressed can be excluded to be involved in the production of a molecule that is observed. Besides the scanning of multiple strains, multiple conditions can be used as well. A simple pattern-matching algorithm can then integrate the various data sources to directly connect MFs and GCFs. The more strains and conditions are used, the more connections can be made.