Multi-Omics Data Mining: A Novel Tool for BioBrick Design

Currently, billions of nucleotide and amino acid sequences accumulate in free-access databases as a result of the omics revolution, the improvement in sequencing technologies, and the systematic storage of shotgun sequencing data from a large and diverse number of organisms. In this chapter, multi-omics data mining approaches will be discussed as a novel tool for the identification and characterization of novel DNA sequences encoding elementary parts of complex biological systems (BioBricks) using omics libraries. Multi-omics data mining opens up the possibility to identify novel unknown sequences from free-access databases. It also provides an excellent platform for the identification and design of novel BioBricks by using previously well-characterized biological bricks as scaffolds for homology searching and BioBrick design. In this chapter, the most recent mining approaches will be discussed, and several examples will be presented to highlight their relevance as a novel tool for synthetic biology.


The omics revolution
Over the last decades, biology underwent a profound transformation driven by major advances in sequencing, bioinformatics, and bioanalytics. Several technologies were created to decrypt the metabolism of cells or interactions within tissues, organisms, and even entire ecosystems based on the identification of genes (genomics), mRNA (transcriptomics), proteins (proteomics), and metabolites (metabolomics) [1]. Since the discovery of the DNA structure by Watson and Crick in 1953 [2], an ever-increasing number of technologies for gene identification and characterization have been established. One of the most relevant breakthroughs in DNA characterization was the invention of Sanger sequencing in 1977 [3]. This sequencing technique uses chemical analogs of the deoxyribonucleotides (dNTPs, monomers of DNA strands) called dideoxynucleotides (ddNTPs), which lack the 3′ hydroxyl group that is required for extension of DNA chains and therefore cannot form a bond with the 5′ phosphate of the next dNTP [4]. Its accuracy, robustness, and ease of use compared with other established methods made Sanger sequencing one of the most common technologies used to sequence DNA. Several improvements were subsequently applied to this technique, such as the use of fluorometric detection and capillary-based electrophoresis, thus contributing to the development of automated DNA sequencing machines [5][6][7][8][9][10][11]. These machines allowed researchers to obtain sequence reads slightly less than one kilobase (kb) in length and boosted the development of other crucial technologies such as the Polymerase Chain Reaction (PCR) in 1985 and the recombinant DNA technology in the following years [12,13].
In parallel to the development of large-scale dideoxy sequencing methods, a new technique set the stage for next-generation DNA sequencers. This approach differs markedly from the abovementioned methods as it does not involve the use of radio- or fluorescently labeled dNTPs. Instead, it is based on a luminescent method for measuring pyrophosphate synthesis in a process called pyrosequencing [14]. This sequencing technology is a two-enzyme process starting with the conversion of pyrophosphate into ATP (by an ATP sulfurylase) and the subsequent use of ATP as a substrate for luciferase, thus emitting light proportional to the amount of pyrophosphate available. Pyrosequencing became a popular technique for two major reasons: (i) it uses natural nucleotides instead of modified ones, and (ii) sequencing results can be obtained in real time without requiring time-consuming electrophoresis. In addition to pyrosequencing, other sequencing technologies were also developed, the most important probably being the Solexa method, later acquired by the company Illumina [15]. In this method, adapter-flanked DNA molecules are passed over a lawn of complementary oligonucleotides bound to a flow cell. The method involves solid-phase PCR that produces neighboring clusters of clonal DNA strands in a process called "bridge amplification" [15][16][17]. Apart from Illumina, which is probably the most important technique currently in use, other sequencing companies established their own novel methodologies [18,19], which are collectively known as second-generation sequencing techniques. Another notable second-generation sequencing platform is Ion Torrent. It is the first "post-light sequencing" technology, using neither fluorescence nor luminescence. Its methodology is based on beads bearing clonal populations of DNA fragments washed over a picowell plate; nucleotide incorporation releases protons, which are measured via the generated pH difference [20].
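The proportionality between light emission and released pyrophosphate can be illustrated with a toy simulation. The following is a minimal sketch under strong assumptions: the signal is ideal (one light unit per incorporated nucleotide, no noise), nucleotides are dispensed in a fixed cyclic order, and the input is the strand being synthesized rather than its template. The function names `pyrogram` and `decode` are hypothetical and do not correspond to any instrument software.

```python
def pyrogram(synthesized, dispensation="TACG"):
    """Return (nucleotide, signal) pairs for an idealized pyrosequencing run.

    `synthesized` is the strand being built (5'->3'). In each flow a single
    nucleotide is dispensed; the signal equals the number of consecutive
    incorporations, since each incorporation releases one pyrophosphate.
    """
    unknown = set(synthesized) - set(dispensation)
    if unknown:
        raise ValueError(f"bases never dispensed: {sorted(unknown)}")
    flows, pos = [], 0
    while pos < len(synthesized):
        for nt in dispensation:
            if pos == len(synthesized):
                break
            run = 0
            # A homopolymer stretch is incorporated within a single flow.
            while pos < len(synthesized) and synthesized[pos] == nt:
                run += 1
                pos += 1
            flows.append((nt, run))
    return flows


def decode(flows):
    """Rebuild the sequence from the idealized flow signals."""
    return "".join(nt * signal for nt, signal in flows)


flows = pyrogram("TTAGC")
print(flows)          # [('T', 2), ('A', 1), ('C', 0), ('G', 1), ('T', 0), ('A', 0), ('C', 1)]
print(decode(flows))  # TTAGC
```

In this idealized picture a homopolymer such as "TT" yields a signal of 2 in a single flow, whereas flows of non-matching nucleotides yield zero signal.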
More recently, a third generation of sequencing began with Single Molecule Sequencing (SMS), introduced by S. Quake in 2003 [21,22]. Its principle is similar to that of Illumina sequencing but skips bridge amplification. In SMS, DNA templates are attached to a planar surface, and proprietary fluorescent reversible terminator dNTPs (dubbed "virtual terminators") are washed over one base at a time and imaged, before cleavage and cycling of the next base. SMS has since been improved in the Single-Molecule Real-Time (SMRT) platform from Pacific Biosciences, available for the PacBio machines [23]. During SMRT runs, DNA polymerization happens in arrays of microfabricated nanostructures called zero-mode waveguides (ZMWs), which are essentially tiny holes in a metallic film covering a chip. This setup allows visualization of single fluorophore molecules because the zone of laser excitation is so small that they can be distinguished from the background of neighboring molecules in the solution [24]. Nonetheless, probably the most anticipated third-generation DNA sequencing method is nanopore sequencing, which enables researchers to detect and quantify all types of biological molecules [25]. Its principle was established even before second-generation sequencing emerged, by demonstrating that single-stranded RNA or DNA could be driven across a lipid bilayer through a large α-hemolysin ion channel by electrophoresis. Furthermore, passage through the channel blocks ion flow, decreasing the current for a length of time proportional to the length of the nucleic acid [26]. Oxford Nanopore Technologies (ONT) was the first provider of nanopore sequencers with its platforms GridION and MinION [27,28], the latter of which is a small, mobile phone-sized USB device (released in 2014) [29].
Despite the admittedly poor quality profiles currently observed, it is hoped that such sequencers will represent a genuinely disruptive technology in the DNA sequencing field, producing very long-read (non-amplified) sequence data far more cheaply and quickly than was previously possible [28,30]. The average read length, error rate, total number of reads, and run prices vary significantly among the different sequencing methodologies. Thus, the selection of the appropriate sequencing technology is a crucial step that depends on the purpose of the study. For instance, Illumina and Ion Torrent produce accurate short reads ideal for the analysis of fragmented DNA, while PacBio and MinION produce long reads with lower accuracy that are nevertheless very useful, for example, for the assembly of scaffolds during genome sequencing.
Similar to the development of advanced techniques for sequencing nucleic acids, other methods have been extensively developed for dissecting the proteome [31] and metabolome [32] of a multitude of organisms. Among these omics approaches, however, metabolomics is distinct from the others. What has to be determined in metabolomics is not a set of linear (1D) molecules with a sequence of defined monomers (4 bases or 21 amino acids), but a highly heterogeneous collection of different 3D compounds. Over time, a large number of databases have been developed to collect all this information, and they provide excellent platforms for data mining, as will be discussed in the following sections.

Genome and transcriptome data mining
The exponential accumulation of data in genomic databases during the last decades has motivated the creation of bioinformatics tools to explore, relate and understand the genetic information from a vast number of organisms [33,34]. These bioinformatics tools have been validated by experimental data, thus strengthening the design and assembly of novel biological entities (i.e., genes, RNA molecules, proteins, and metabolites). Those biological entities that can be used as building blocks for the assembly of artificial biosynthetic pathways are known as BioBricks. Consequently, the selection and design of BioBricks is important to further create and understand complex biological systems and biofactories of relevance in industrial biotechnology [35]. The general idea of comparing genomic sequences to identify such novel components of different metabolic pathways is not new. In fact, early in the 1970s, several efforts were performed to elucidate physiological and metabolic information through the comparative analysis of genetic sequences [36][37][38]. Classical genetics and reverse genetics approaches were then used to identify, annotate, compare, and connect genetic clusters associated with biosynthesis, using previously reported genetic data sets [39,40].
It was not until 1999 that Genome Mining (GM) formally emerged as a strategy for the computational analysis of genetic sequences that sought to recognize patterns between them within the framework of the human genome project. Later, alongside bioinformatics advances in the area of microbiology, GM acquired new attributes, building the concept known today: a bioinformatics approach that aims to predict DNA sequences associated with physiological and/or metabolic events, allowing the elucidation/prediction of metabolic pathways that lead to secondary metabolites of scientific and industrial interest [35,38,41,42]. Today, GM is not limited to genomic predictions but seeks a holistic approach that includes the entire spectrum of molecular biology. It articulates the prediction of the products of gene expression, the control of that expression, and the identity and structure of the potential metabolites, thereby strengthening the creation of biological models that allow the comparison, understanding, and manipulation of cellular molecular systems [41,43].
GM was initially developed in bacterial models and demonstrated a high relevance for synthetic biologists and metabolic engineers, thus becoming one of the biggest breakthroughs in molecular biology and biotechnology [38,44]. Between the 1990s and 2000s, the genus Streptomyces (which is well known for its production of valuable antibiotics) was extensively studied at the experimental level, which allowed the identification of a large number of gene sequences involved in secondary metabolite production, regulation, and antibiotic resistance. Comparison of gene sequences between different species of this genus revealed a total of about 30 Biosynthetic Gene Clusters (BGCs) associated with the biosynthesis of such secondary metabolites [45,46]. Following these advances, GM was extended to study novel bacterial genera with abundant genomic information and was initially used to fight against bacterial resistance [47,48]. In recent years, GM has been successfully used as a tool for the identification of alternative pathways for the biosynthesis of different natural products in diverse microorganisms, an approach which usually proved to be more efficient than other screening methods used for the identification of novel enzymes of relevance for the biosynthesis of secondary metabolites [33,49].
Recently, GM was also scaled up to eukaryotic models, thus revealing that multiple BGCs contain not only relevant information regarding the biosynthesis of secondary metabolites but also valuable information to study evolutionary events and ecological adaptation of different gene clusters [38,50,51]. A good example of the vast collection of BGCs predicted up to now can be found on the "Atlas of Biosynthetic Gene Clusters", a database of the Joint Genome Institute founded in 2015. This Atlas contains data on predicted and experimental gene clusters related to many secondary metabolites. As of June 2021, there are a total of 411,006 biosynthetic gene clusters reported, of which only 1285 have been experimentally validated [52]. GM is completely dependent on bioinformatics and computational technology available for the analysis of a large dataset. Thus, to boost the potential of this information, the development of novel computational tools and algorithms as well as the interest of researchers to join this effort is still required [42,51]. There are currently a variety of methods for performing GM using the available genomic information that will be further discussed hereafter.

Classical genome mining
The "classical" form of GM consists of the search for enzymes linked to the synthesis of secondary metabolites by mining highly conserved sequences [35]. Before the current databases (composed of hundreds of genomic datasets and several bioinformatics tools) were established, novel sequences were evaluated by using reverse genetics, where genomic libraries were scanned for basic biosynthetic genes associated with a metabolic pathway of interest [38,53]. Those annotations had to be performed manually and supported by experimentally corroborated results. This formed the basis of classical GM, which provided the first consensus sequences to be compared with the vast amount of novel sequences obtained from different next-generation sequencing platforms [54]. Both reverse genetics and GM follow the same mining pattern: one or several reference sequences, whose enzymatic products have already been experimentally validated, are compared with the genomes of interest to identify homologous sequences in the organism under study. The sequences of interest are generally associated with catalytic domains and highly conserved motifs [35,38].
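The mining pattern described above, in which conserved motifs from validated references are searched in new sequences, can be sketched with a simple pattern scan. This is only an illustration: the motif `D[TS]ACSS` and the three protein sequences are made up for the example and should not be read as a curated signature, and real GM pipelines rely on profile-based homology searches rather than a single regular expression. The function name `find_motif_hits` is hypothetical.

```python
import re


def find_motif_hits(proteins, motif):
    """Return {protein name: [1-based start positions]} for a motif pattern.

    `proteins` maps names to amino acid sequences; `motif` is a regular
    expression standing in for a conserved catalytic motif.
    """
    pattern = re.compile(motif)
    hits = {}
    for name, seq in proteins.items():
        positions = [m.start() + 1 for m in pattern.finditer(seq)]
        if positions:
            hits[name] = positions
    return hits


# Toy translated ORFs (invented sequences, for illustration only).
orfs = {
    "orf1": "MKLVDTACSSSLVAG",
    "orf2": "MSTNPKPQRK",
    "orf3": "AADSACSSGGDTACSSA",
}

print(find_motif_hits(orfs, r"D[TS]ACSS"))
# {'orf1': [5], 'orf3': [3, 11]}
```

ORFs that carry the motif become candidates for closer inspection, while ORFs without a hit (here `orf2`) are set aside.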
Classical GM was initially focused on the identification of genomic clusters associated with enzymes for the production of secondary metabolites, involving the following bacterial groups of enzymes and bioactive peptides: (i) polyketide synthases (PKSs); (ii) non-ribosomal peptide synthetases (NRPSs); and (iii) ribosomally synthesized and post-translationally modified peptides (RiPPs) [55][56][57]. Sequence comparison of these groups of proteins allowed the subsequent identification of conserved motifs that are currently helping to identify novel BGCs in pre-existing genomes, starting with an in silico bioinformatic approach instead of resorting directly to strenuous experimentation [58]. Numerous examples have demonstrated the advantage of GM as a successful screening tool for evaluating the ability of an organism to produce a particular metabolite based on the available BGC information [59][60][61]. One example is presented by Su et al., who performed GM on a strain of Bacillus subtilis (i.e., NCD-2), initially predicting its potential for the production of fengycin, surfactin, bacillaene, subtilosin, bacillibactin, bacillosin, and other not previously reported molecules, which were later detected by UHPLC-QTOF-MS/MS in its fermentation extracts [62]. The increasing popularity of classical GM promoted the development of GM-specialized databases and novel bioinformatics tools with improved homology searching, specialized sequence analyses, and advanced prediction algorithms. Lists of some currently available GM-specialized databases and related bioinformatics tools are presented in Tables 1 and 2, respectively.
Currently, the most popular platform for GM of bacterial and fungal genomes is antiSMASH. It is so far the most comprehensive platform, as it integrates its own database and incorporates different prediction tools [63]. Its popularity results from the integration of different complex secondary metabolite-specific gene analysis methods behind a much more researcher-friendly interface [69]. Unfortunately, as shown in the tables, most advances have been made in bacteria, and there is still a need to improve or create new bioinformatics tools to enable GM in other organisms such as fungi and especially plants. Plants commonly do not have biosynthetic gene clusters but instead rely on a separated, often compartmentalized (cell type-specific) synthesis of secondary metabolites, including transport of intermediates between cell types and even organs [70,71].

Table 1. GM-specialized databases.

Database | Description | Ref.
antiSMASH database | Comprehensive resource on BGCs for secondary metabolites identified in bacterial genomes. | [63]
BACTIBASE | Open-access database used for the characterization of bacterial antimicrobial peptides. | [64]
ClusterMine360 | Contains over 200 curated entries of BGCs, including classification of the potential compounds produced, taxonomic information of the producing organisms, and links to original data. | [65]
CSDB/r-CSDB | Manually curated database containing more than 160 PKS, NRPS, and PKS/NRPS BGCs. | [66]
DoBISCUIT | Contains a literature-based collection of BGCs for PKS and NRPS. | [67]
IMG-ABC | Contains automatically identified gene clusters, clusters with known biosynthesis products, and secondary metabolites. | [68]

Comparative genome mining
Classical GM alone fails to identify BGCs in genomic regions that do not follow the classical modular gene topology first described by Donadio et al. in 1991. In this modular pattern, the open reading frames (ORFs) of secondary metabolite-producing genes generally follow a defined order of catalytic and structural domains, as seen, for example, in modular PKSs or NRPSs [39]. These extensively described and annotated modules serve as a template for comparison with new sequences from available genomes [42].
Leblond and coworkers found more than 3300 BGCs for about 16,500 possible NRPS-associated enzymes in Streptomyces ambofaciens. However, when evaluating the potential enzymes in silico, they realized that many did not follow the modular pattern used as a template [85]. This, indeed, reduced the possibilities of modeling the possible secondary metabolites that could be produced by this bacterium. This is certainly an example of the current limitations of classical GM, which must contemplate new technologies (e.g., artificial intelligence (AI) and machine learning (ML)) in response to unconventional sequences that do not completely follow the expected organization.
One way to address these limitations is by integrating already existing tools that are focused more on the identification of patterns related to phylogeny and evolution instead of molecular function. For example, descriptions of lineage relationships can be made and some non-modular combinations of putative BGCs can be described between organisms that may not belong to the same taxonomic level.

Table 2. Bioinformatics tools for GM.

Tool | Description | Ref.
antiSMASH | Fully automated tool for extracting genome data from bacteria and fungi to search for BGCs. | [72]
BiG-SCAPE | Uses the distance between BGCs (identified with antiSMASH) to create sequence similarity networks. | [73]
CLUSEAN | Allows homology searches and identification of conserved domains in BGCs of genes encoding PKS and NRPS. Also classifies enzymes and predicts domain specificity. | [74]
CLUSTER FINDER | Uses a probability approach to recognize BGCs in genomic and metagenomic data. | [75]
EvoMining | Uses phylogenetics to recognize, compare, and identify BGCs associated with primary metabolism but that present a divergent phylogeny. | [76]
FunGeneClusterS | Allows the prediction of BGCs based on genomic and transcriptomic data for fungi. | [77]
MIPS-CG | Allows the identification of totally new BGCs using only genomic data. | [78]
NaPDoS | Detects and analyzes genes associated with secondary metabolites. | [79]
PhytoClust | Detects BGCs of secondary metabolites in plant genomes. | [80]
PKMiner | Predicts novel BGCs of type II PKS and aromatic polyketide chemotypes using their conserved aromatase and cyclase domains. | [81]
plantiSMASH | A version of antiSMASH for plant genomes. | [82]
SBSPKS | Allows chemical analysis of experimentally characterized BGCs for PKS/NRPS proteins. | [83]
SMURF | Used for mining BGCs in fungi to identify conserved domains in PKS, NRPS, PKS/NRPS hybrids, and terpenoid genes. | [84]

These results are not only valuable for the search for pathways to new natural products, but they also allow an evolutionary reconstruction of how metabolic pathways arose in response to defense, competition, and attack of organisms in their ecosystem [86]. In plant metabolomics, such phylogenetic relationships based on an untargeted fingerprint approach of natural products of different species were described for the first time in 2013 for Urtica species [87], but still await a full correlation with genomic data. Two different ways of using phylogenetic approaches for comparative GM can be defined. In the first one, phylogenetic trees are constructed using both the whole sequences of the organisms under study and a pool of conserved, well-characterized gene clusters associated with the production of a defined compound. In this way, BGC lineages can be traced and evolutionary relationships between apparently unrelated organisms can be established. Abdelmohsen et al. used this strategy to investigate biosynthetic pathways in actinomycetes isolated from marine sponges from the Red Sea. After a combination of taxonomic evaluation using the 16S ribosomal gene, PCR amplification of genes associated with modular PKS and NRPS, and phylogenetic analysis, the authors found that 20 of the actinomycete isolates (spread over 10 genera) possessed at least one of the biosynthetic genes analyzed [88]. This method has been extensively applied to identify novel potential BGCs [73,89] and to create new gene clusters that can be further related to already annotated genomes of organisms previously studied at the experimental level.
The use of comparative GM has also allowed the identification of genes involved in the production of secondary metabolites in bacteria by considering horizontal gene transfer events and phylogenetic analysis. In this second approach, relationship trees are constructed using genes that are directly associated with the creation of specific compounds/secondary metabolites [90]. Gene relationships are inferred primarily from the biosynthetic gene sequences alone, and those relationships are later contrasted or strengthened by evaluating the rest of the organism's genome [91]. An example of this method is provided by studies conducted on the genus Streptomyces, where the production of secondary metabolites was re-evaluated considering events of lateral gene transfer. It was found that, although horizontal gene transfer of the studied BGCs is not very frequent, the transfer of exogenous regulatory, resistance, and secondary metabolite production genes can significantly contribute to recombination events in those BGCs. Thus, comparative GM brings relevant new concepts such as the variable nature of those BGCs and their diversification even within very specific levels of phylogenetic discrimination. This undoubtedly paves the way not only to understanding the evolution of BGCs in microorganisms but also to understanding the ecological landscape that they influence [91].
Currently, one of the methods to specifically evaluate putative catalytic domains in enzymes using phylogenetic algorithms is the Natural Product Domain Seeker (NaPDoS), which organizes sequences into clades and allows the recognition of lineages of organisms capable of producing selected metabolites [79,92]. This represents a new approach to evaluate possible non-homologous and undescribed enzymes (shown for modular PKS and NRPS) and to elucidate new chemical structures not yet identified. NaPDoS initially contained only data from PCR fragments but is now a comprehensive tool that also includes genomics and metagenomics data [93]. This is particularly important because it allows the evaluation of genomic data obtained from complex samples such as soils, sediments, water sources, wastes, etc. (metagenomics). With NaPDoS it is even possible to estimate the diversity of microorganisms from the sampled source, as well as to evaluate the genetic potential for the biosynthesis of different metabolites [93].
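Comparative approaches of this kind ultimately rest on pairwise distances or similarities between gene clusters. The sketch below is a deliberately simplified illustration and not the algorithm of any tool named in this section: it assumes that each BGC has already been reduced to a set of domain labels, scores cluster pairs with the Jaccard index, and keeps pairs above an arbitrary threshold as edges of a similarity network. The cluster names, domain sets, threshold, and function names are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two domain sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_network(clusters, threshold=0.5):
    """Return edges (name1, name2, score) for cluster pairs at or above threshold.

    `clusters` maps a BGC name to the set of domain labels found in it.
    """
    edges = []
    for n1, n2 in combinations(sorted(clusters), 2):
        score = jaccard(clusters[n1], clusters[n2])
        if score >= threshold:
            edges.append((n1, n2, round(score, 3)))
    return edges


# Invented domain compositions, for illustration only.
bgcs = {
    "bgcA": {"KS", "AT", "ACP"},
    "bgcB": {"KS", "AT", "ACP", "KR"},
    "bgcC": {"C", "A", "PCP"},
}

print(similarity_network(bgcs))
# [('bgcA', 'bgcB', 0.75)]
```

Connected groups in such a network can then be inspected as candidate families of related clusters, with unconnected clusters flagged as potentially unusual.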

Genome mining in synthetic biology
The identification of novel BGCs resulting from genomic mining studies represents a great opportunity for synthetic biologists and metabolic engineers, as it allows the identification, construction, synthesis, and expression of BioBricks in heterologous models or the discovery of natural compounds with outstanding properties. One of the most significant commercial examples of this application is the engineering of yeast for the biosynthesis of valuable products such as artemisinin (an antimalarial drug) by using BioBricks identified through GM [35,94]. Recently, GM has also been used to identify more than 70 syntheses involved in the production of hypermodified peptide cytotoxins (i.e., unique and valuable chemotherapeutics) by mining prokaryotic diversity [95]. With the help of GM, the identification of several cryptic metabolic pathways has been possible, giving way to combinatorial biosynthesis, which can be used in the construction of biosynthetic units following the pattern of BGCs. These approaches also present challenges, mainly related to our current understanding of the interdependent metabolic circuits and the complexity of tracking them. Addressing them will certainly require many more efforts from bioinformatics to enrich genomic mining with additional omics data such as transcriptomics, metabolomics, and proteomics, not only for microorganisms but also for eukaryotes with their more complex, usually unclustered biosynthetic production networks [96].

Transcriptome mining
A transcriptome represents a "snapshot" of an RNA population in a certain tissue or at a specific developmental stage. Compared to the genomic information of the same organism, a transcriptomic dataset is less complex as it does not contain any information, for example, on the non-transcribed regions of a genome (e.g., promoters). Transcriptomes also do not provide information on the physical organization of the individual genetic elements, a fact which in turn represents an obstacle for the application of classical GM methods (see previous sections) used, for instance, for pathway elucidation in plants. However, several advantages have made transcriptome mining (TM) a valuable alternative in recent years. First, unlike in a "static" genome, differential analysis is possible for transcriptomic data. Thus, the identification of tissue-specific transcripts (pathways restricted to special organs) and the discrimination of non-functional RNAs (pseudogenes) are much easier than in GM approaches. Second, the less complex datasets facilitate mining in organisms with large and complex genomes such as plants [97], which in general developed multi-member gene families with redundant functions during evolution. In conjunction with the fact that the organization of biosynthetic pathways into gene clusters is exceptional in plants [98], TM is increasingly used in this class of organisms to mine for natural product (NP) pathways as well as to study different aspects of plant physiology. Recent examples for the latter purpose include the dissection of the response to changing temperatures [99], drought stress [100], or defense against pathogens in model and non-model plants [101,102].
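The differential analysis mentioned above can be illustrated with a minimal sketch that flags transcripts enriched in one tissue. It assumes a table of normalized expression values per gene and tissue (for example TPM), uses a log2 fold change against the mean of all other tissues, and adds a pseudocount to avoid division by zero. The gene names, values, cutoff, and the function name `tissue_enriched` are hypothetical; actual studies use replicates and dedicated statistical testing rather than a bare fold-change cutoff.

```python
import math


def tissue_enriched(expr, target, min_log2fc=2.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes enriched in `target` tissue.

    `expr` maps gene -> {tissue: normalized expression}. The fold change
    compares the target tissue with the mean of all other tissues.
    """
    enriched = {}
    for gene, by_tissue in expr.items():
        others = [v for t, v in by_tissue.items() if t != target]
        if target not in by_tissue or not others:
            continue  # cannot compare without both target and background
        background = sum(others) / len(others)
        log2fc = math.log2(
            (by_tissue[target] + pseudocount) / (background + pseudocount)
        )
        if log2fc >= min_log2fc:
            enriched[gene] = round(log2fc, 2)
    return enriched


# Invented expression values, for illustration only.
expression = {
    "geneA": {"root": 63.0, "leaf": 1.0, "flower": 1.0},
    "geneB": {"root": 10.0, "leaf": 12.0, "flower": 9.0},
}

print(tissue_enriched(expression, "root"))
# {'geneA': 5.0}
```

Here `geneA` is retained as a root-enriched candidate, whereas the evenly expressed `geneB` is filtered out.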
The first reports on TM used for the discovery of NP biosynthetic genes date back to the first decade of the 21st century. The reports were based on so-called expressed sequence tag (EST) databases [103], which were developed as an alternative to earlier microarray-driven methods for expression analysis. Milestones for the application in the plant field were the establishment of specific EST databases [104] and the access to programs that used both microarray data and transcriptome datasets in the frame of transcriptome profiling (e.g., eVOC [105]). Continued software development led to more advanced approaches which integrated data modeling in targeted plant engineering [106]. Alongside the use of co-expression analysis as a standard tool in multifaceted mining strategies [107] and the current decrease in prices for transcriptome sequencing, these developments led to a continuous increase in the annual output of TM-based publications (3 in 2003, 84 in 2020).
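Co-expression analysis, as mentioned above, typically ranks candidate genes by how closely their expression profiles follow that of a known pathway gene (a "bait"). The following is a minimal sketch assuming expression profiles measured over the same ordered set of samples and using the Pearson correlation coefficient; the profiles, the threshold, and the function names are hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def coexpressed(profiles, bait, min_r=0.9):
    """Return [(gene, r)] for genes correlated with `bait`, best first."""
    ref = profiles[bait]
    scored = [
        (gene, pearson(ref, prof))
        for gene, prof in profiles.items()
        if gene != bait
    ]
    kept = [(gene, round(r, 3)) for gene, r in scored if r >= min_r]
    return sorted(kept, key=lambda item: -item[1])


# Invented expression profiles across four samples.
profiles = {
    "bait": [1, 2, 3, 4],
    "cand1": [2, 4, 6, 8],   # rises with the bait
    "cand2": [4, 3, 2, 1],   # anti-correlated
    "cand3": [5, 5, 5, 5],   # constant
}

print(coexpressed(profiles, "bait"))
```

Only `cand1` passes the threshold in this toy dataset; in practice, the retained candidates would be prioritized for functional characterization.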
Indeed, all classes of NPs found in plants have been targeted using TM in recent years. Most reports focused on terpenoids, including papers on the identification of single enzymes such as terpene cyclases/synthases [108], associated biocatalysts [109], or comparative evolutionary studies of genes in whole plant families such as Pinaceae [110] or Lamiaceae [111]. An outstanding example is the mining for biocatalysts involved in the biosynthesis of the insecticidal limonoid azadirachtin in neem (Azadirachta indica) [112]. By using a comparative analysis of three limonoid-containing species from the order Sapindales, the authors could identify key enzymes involved in the early steps of the pathway, namely the initial terpene cyclase forming the basal triterpene scaffold and subsequent cytochromes involved in tailoring modifications. In the field of alkaloids, TM was similarly applied, yielding the enzyme norbelladine synthase from Narcissus pseudonarcissus [113]. This enzyme, which catalyzes a coupling step during the synthesis of the anti-Alzheimer's drug galantamine in Narcissus species, was identified by a TM-based screening for functional homologs of an enzyme catalyzing a similar reaction in opium poppy. Hagel and co-workers [114] used a similar but broader approach to compare plants with a pronounced production of benzylisoquinoline alkaloids. Differential analysis of the transcriptomes and metabolomes of 20 species from the order Ranunculales revealed 850 genes that are potentially involved in alkaloid biosynthesis and are interesting candidates for use in alkaloid synthetic biology. A noteworthy example concerning the biosynthesis of plant phenolics is the study of Lau and Sattely [115], which describes mining for enzymes required for the production of podophyllotoxin. This lignan is an antiviral polyphenol isolated from mayapple (Podophyllum peltatum), and six of the enzymes involved in its biosynthesis could be identified by TM followed by co-expression in tobacco. Another example is the insight gained from TM and metabolomics into the synthesis of hypericin in the medicinal plant St. John's wort (Hypericum perforatum) [116].
Future studies will certainly use extensive TM to further explore the biosynthetic machineries leading to high-value metabolites other than terpenes, alkaloids, and phenolics. In agreement with this assumption, the latest reports on TM already target pathways to antimicrobial cyclopeptides [117], polysaccharides [118], or compounds derived from fatty acids [119]. In general, TM studies will benefit from the integration of multi-level omics data in the future. Such comprehensive methods have already been applied in proof-of-concept studies, including the combination of TM with proteomics to mine for cyclopeptides [120] or in plant "regulomics", i.e., in software tools comparing transcriptomes with (epi)genomic data to identify regulatory networks [121].

Metabolic data mining
Metabolism is typically defined as the sum of pathways and cycles representing all the sets of biochemical reactions occurring in a cell, in which the product of a particular chemical reaction becomes the substrate of the subsequent reaction [122]. The understanding of this concept is key in the realm of biological sciences, especially in the post-genomic era, where we have embraced a paradigm shift from a gene-centered view to an increasing interest in omics-driven high-throughput data types, sources, and approaches [123]. In line with the current move towards systems biology, the mining of metabolism data (metabolic data mining) includes not only the systematic study of component metabolites (i.e., metabolomics) [124], but also of all the controlled biochemical reactions in an organism responsible for their production, which is more recently understood under the name of reactomics [125], and of related processes such as in fluxomics [126,127]. In metabolomics, numerous subclasses have emerged because, in contrast to genomics in particular, a truly holistic determination of the metabolome is impossible: no method exists to extract and analyze all metabolites of an organism completely in one experiment. Unlike genomics, transcriptomics, or proteomics, metabolome analytics cannot rely on a one-dimensional sequential biopolymer built from a limited number of monomer units and a handful of derivatizations (methylation, posttranslational modifications, etc.). Instead, most compounds are unique and are rarely produced by linear monomer assembly processes that can be deconvoluted by standardized procedures. A metabolome is a mixture of compounds with highly complex 2D and mostly 3D molecular structures of maximum variability and divergent physicochemical properties (e.g., sugars vs. triglycerides). Subclasses such as lipidomics or glycomics have thus emerged.
Along with the great advances in computing technologies, all these types of studies, especially when applied in combination, have led to an unprecedented revolution in biotechnology by finding patterns or trends that explain the behavior of large data sets in a specific context, in as automated a manner as possible. Thus, during the last decade, a large number of metabolic pathways have been mined to identify the key elements and modules for the production of drugs, foods, fuels, and a plethora of bioactive compounds [128][129][130], including through the combination of transcriptome and metabolome studies [116].
The threefold correlation of metabolomic, transcriptomic/genomic, and phenotypic data ideally allows the identification of the biosynthetic components responsible for a property (phenotype), the biosynthetic pathways for their production, and the gene loci and genetic control elements associated with them (GWAS, genome-wide association study). This allows, e.g., improved molecular breeding in plants without the necessity of producing GMOs. An example is a study on downy mildew resistance in hops (Humulus lupulus), i.e., tackling its most devastating pathogen by identifying the intrinsic strengths of its chemical defense. The identification of key metabolites responsible for resistance, their associated pathways, and the associated genetic breeding markers now allows the targeted (non-GMO) molecular breeding of resistant phenotypes [131]. The same tools can, of course, also be used to increase production through genetic modification (GMOs) [70]. The different strategies for the identification of these metabolic pathways via data collection and coupling, reactome reconstruction, and rational exploration of the chemical space are discussed in the following sections.

Metabolic data collection and coupling
A typical workflow in metabolic data mining aimed at elucidating interaction networks and reactomes is shown in Figure 1. Initially, metabolic data are collected, including information on enzymes and metabolites. Then, network patterns are recognized and coupled by association analysis and data modeling to reduce data dimensionality. Finally, reactomes are reconstructed to elucidate the corresponding network dynamics and topology [132]. This knowledge forms the basis for future metabolic engineering experiments aimed at enhancing the production of the desired compound or at assembling novel biosynthetic pathways, both native and synthetic/unnatural. Interestingly, the current advances in the development of novel BioBricks and the design of novel artificial metabolic networks promote the rapid and efficient coupling of a series of biological parts into a highly reusable large-scale framework [133].
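The association step of such a workflow can be illustrated with a minimal sketch. It assumes that metabolite abundances have already been collected across a set of samples, uses Pearson correlation as the association measure, and groups correlated metabolites into modules as one simple way of reducing dimensionality. The metabolite names, the abundance values, and the 0.9 threshold are hypothetical, and the procedure is not necessarily the one used in [132].

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no association signal
    return cov / (sd_x * sd_y)


def association_network(profiles, threshold=0.9):
    """Link two metabolites when |r| reaches the (assumed) threshold."""
    edges = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


def modules(nodes, edges):
    """Connected components of the network, i.e., groups of co-varying metabolites."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Hypothetical abundances of four metabolites measured in five samples.
profiles = {
    "metabolite_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "metabolite_B": [2.1, 3.9, 6.2, 8.0, 9.9],  # rises with A
    "metabolite_C": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as A rises
    "metabolite_D": [3.0, 1.0, 4.0, 1.0, 5.0],  # no clear relation to the others
}

edges = association_network(profiles)
print(modules(profiles, edges))
```

In this toy data set, metabolites A, B, and C end up in one module and D remains on its own, so four measured variables collapse into two groups. Real studies would typically add normalization, multiple-testing control, and enzyme or gene annotations before any reactome reconstruction.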

Proficient exploration of chemical space: natural products and fragments
Metabolic data mining may also involve the use of small compounds derived from the primary and, especially, secondary metabolism of living organisms. These metabolites, typically referred to as natural products (NPs), have long been used as a source of chemical entities with promising physicochemical, medicinal, or other features, being used directly (unmodified), as a substructure, or as inspiration for a structurally similar chemical scaffold [134,135]. NPs have been used as medicines for far longer than synthetic bioactives, and also serve as scaffolds for the rational design of novel synthetic drugs [136,137]. Interestingly, they occupy a much larger fraction (i.e., have a larger structural diversity) of the ensemble of all chemical compounds, which is classically known among theoretical and computational chemists as chemical space (~10^60 molecules) [138,139]. In the field of medicinal chemistry, and considering that only a small portion of the estimated chemical space is known (~10^8 molecules) [140], the use of NP-based libraries represents a priceless opportunity for scientists to make bigger and faster leaps within it [141,142]. This is an additional advantage, given that conventional combinatorial chemistry (usually termed combichem) without input from natural products initially had very limited success in novel drug discovery [141,143], its strength lying rather in optimization in most cases [141]. An alternative approach, which is intended to explore chemical space more thoroughly and may thus be used to harness metabolic data, involves the principles of molecular fragmentation. According to this technique, a chemical compound of interest is not identified and evaluated as a whole; instead, it is developed starting from structural molecular components, usually within the range of 120-300 Da (i.e., fragments) [144,145].
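The fragment size window mentioned above can be expressed as a simple filter. The following sketch computes molecular weights from elemental compositions using rounded standard atomic masses and keeps only the compounds within 120-300 Da. The small library and the helper names are illustrative assumptions; a real pipeline would usually rely on a cheminformatics toolkit and on additional criteria beyond molecular weight.

```python
# Rounded standard atomic masses in g/mol (only the elements needed here).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(composition):
    """Molecular weight in Da from a mapping of element symbol to atom count."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


def is_fragment_sized(composition, low=120.0, high=300.0):
    """True if the compound falls inside the 120-300 Da window cited in the text."""
    return low <= molecular_weight(composition) <= high


# Hypothetical mini-library, given as molecular formulas.
library = {
    "ethanol": {"C": 2, "H": 6, "O": 1},               # about 46 Da, too small
    "indole": {"C": 8, "H": 7, "N": 1},                # about 117 Da, just below the window
    "benzoic acid": {"C": 7, "H": 6, "O": 2},          # about 122 Da
    "caffeine": {"C": 8, "H": 10, "N": 4, "O": 2},     # about 194 Da
    "paclitaxel": {"C": 47, "H": 51, "N": 1, "O": 14}, # about 854 Da, drug-sized NP
}

fragment_like = sorted(name for name, comp in library.items() if is_fragment_sized(comp))
print(fragment_like)  # ['benzoic acid', 'caffeine']
```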
Although many current chemical libraries are available as fragments per se, various cleavage methods such as RECAP (Retrosynthetic Combinatorial Analysis Procedure) have been widely used to deconstruct chemical libraries of both NPs and other classes of chemical entities [146,147]. The advantages of using fragments include not only their potential to navigate chemical space more cost-effectively than, for example, drug-sized molecules, but also their potential to favor protein-ligand complementarity and to facilitate selectivity adjustments during optimization (a more detailed description is given in Figure 2) [148,149]. Within the field of BioBricks, treating every fragment as an independent brick could facilitate not only the recovery of specific substructures during a virtual screening (VS) protocol but also the coupling of the best combinations of substructures to obtain a final candidate for further development. It is worth mentioning that fragments could be "recycled" for the development of a bigger compound if other partner fragments can supply, and balance, particular physicochemical properties of interest. This is illustrated in terms of ligand efficiency (LE) metrics as a phenomenon called the fragment "rescue" effect [150]. By applying these concepts and approaches, the scientific community may benefit from metabolomic data mining of compounds able to mediate diverse functions in biological systems.
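To make the notion of ligand efficiency more concrete, the sketch below uses one commonly cited definition, LE = -ΔG / N, where ΔG = RT ln(Kd) is the binding free energy and N is the number of non-hydrogen (heavy) atoms. The text does not specify which LE metric is used in [150], so this definition, the temperature of 298.15 K, and the two example ligands are assumptions made purely for illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in mol/L."""
    return R_KCAL * temperature * math.log(kd_molar)


def ligand_efficiency(kd_molar, heavy_atoms, temperature=298.15):
    """LE in kcal/mol per heavy atom, using the assumed definition -dG / N."""
    return -binding_free_energy(kd_molar, temperature) / heavy_atoms


# Hypothetical ligands: a weakly binding fragment and a potent drug-sized compound.
fragment_le = ligand_efficiency(kd_molar=1e-3, heavy_atoms=12)  # Kd = 1 mM
lead_le = ligand_efficiency(kd_molar=1e-8, heavy_atoms=38)      # Kd = 10 nM

print(f"fragment LE = {fragment_le:.2f} kcal/mol per heavy atom")
print(f"lead LE     = {lead_le:.2f} kcal/mol per heavy atom")
```

With these assumed numbers, the fragment binds about five orders of magnitude more weakly than the lead compound yet shows the higher efficiency per atom (about 0.34 versus 0.29 kcal/mol per heavy atom), which is why small fragments can be attractive starting points despite their modest absolute affinities.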

Conclusions
Multi-omics data mining has revolutionized science by enabling overlaps among different fields of study such as biochemistry, molecular biology, synthetic biology, organic and medicinal chemistry, computational chemistry, chemical engineering, and high-performance computing. This represents a crucial breakthrough that is expected to accelerate our comprehension of complex biological systems and, most interestingly, the identification, selection, and recovery of novel pieces of biological information in the form of BioBricks for the design of biofactories. Currently, we have unprecedented access to large multi-omics data repositories, which make possible the discovery, identification, and coupling of these BioBricks. This is an important step to unleash different biological functions, or to rationally design metabolic pathways for the biosynthesis of valuable products. However, there is still a need for integrating additional cutting-edge technologies in computing and data science such as machine learning, artificial intelligence, and big and smart data analytics that can further boost the discovery and de novo design of BioBricks with high impact in pharma, cosmetics, fine chemical and nutraceutical industries.

Conflict of interest
The authors declare no conflict of interest.

Figure 2. Comparison between typical high-throughput screening and fragment-based screening. In the left panel, although one specific part of the drug compound exhibits a good fit within most of the pocket of a hypothetical target protein (red curved line), the other two parts of the same compound either do not occupy any specific binding site (blue curved line) or occupy subsites of the active center only partially (green curved line). In contrast, the right panel shows that the consideration of fragments for screening allowed the identification of chemical entities with high inherent affinity to the corresponding pockets. Although only shape and size are included in the illustration for clarity, many other physicochemical characteristics such as lipophilicity and charge may affect the complementarity between a chemical moiety and its target receptor.