Deep self-supervised learning for biosynthetic gene cluster detection and product classification

Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.


Introduction
Natural products are chemical compounds that form the basis of many pharmaceuticals and clinical therapeutics [1]. Their  antimicrobial drugs, anticancer therapies, and other therapeutic areas [2]. To initiate the discovery of natural products, the pharmaceutical industry has traditionally relied on laboratory research, yet this approach cannot feasibly capture the entire chemical diversity of natural products. Thus, new methods are needed to advance natural product discovery [3]. Diverse natural products can be produced in living organisms via groups of genes called biosynthetic gene clusters (BGCs). Genome mining has become a powerful tool for exploring the complex and diverse chemical space of natural products [3]. Fast, inexpensive genome sequencing technology has contributed to the advancement of BGC identification and, by extension, natural product discovery. This approach has been particularly successful in microbes, where BGCs are often a group of physically colocalized genes whose sequence and function dictates the synthesis of natural products. This discovery of BGCs supports the assembly-line enzymology model, where biosynthetic systems are multimodular and each module contains a set of domains that collectively catalyze one round of elongation and chemical modification of the growing natural product peptide chain [4]. This type of natural product synthesis is particularly characteristic of multimodular polyketide synthases (PKS) and nonribosomal peptide synthases (NRPS), which are two major biosynthetic systems that synthesize polyketides and non-ribosomal natural product peptides, respectively [5]. However, evidence suggests that much of the biosynthetic capacity of the microbial world remains unexplored [6]. Improved identification and characterization of BGCs directly from genomic data could accelerate the discovery of novel natural products with therapeutic relevance.
Identification of BGCs directly from genomic sequences is critical to navigating natural product space and nominating novel natural products. While complementary data modalities involving joint genome sequencing and mass-spectrometry data can be used to link products with gene clusters [7], the majority of known BGCs were characterized directly from DNA sequencing performed without any associated analysis of chemical structures in the sample. As such, computational methods which focus exclusively on identifying BGCs from genomes are essential components of BGC discovery pipelines.
antiSMASH (ANTIbiotics & Secondary Metabolite Analysis SHell) is an early tool for BGC discovery that uses a set of curated profile-Hidden Markov Models (pHMMs) to call biosynthetic gene families and a set of heuristics to tag a portion of a genome as a BGC [8,9]. anti-SMASH then annotates these called BGCs by using carefully curated rules based on expert knowledge. Similarly, ClusterFinder uses a Hidden Markov Model (HMM) to identify gene clusters of known and unknown classes [10]. Despite their effectiveness, HMM-based algorithms do not capture higher-order dependencies between genes, limiting their accuracy and generalizability [11]. Likewise, rule-based methods are limited by the need for human expertise and do not generalize well to new BGC classes.
A recent approach, DeepBGC, introduced a deep learning genome-mining strategy for biosynthetic gene cluster annotation that addresses these limitations [12]. Similar to antiSMASH, DeepBGC uses sets of curated pHMMs to call biosynthetic gene families; however, it uses a supervised neural network to predict BGC boundaries and annotate BGC function. Specifically, they employ a bidirectional long short-term memory (Bi-LSTM) recurrent neural network (RNN), which offers the advantage of capturing short-and long-term dependencies between adjacent and distant genes [13]. DeepBGC reported promising improvements in the identification of BGCs in microbial genomes. However, DeepBGC is trained on a small number of high-quality annotations, and the supervised approach requires mining examples of genes that are not part of BGCs. The quality of the predictions is therefore likely to depend on the quality of the negative examples, which must be similar to BGC sequences while ideally containing no false negatives.
Rather than relying on expert-curated annotations and negative examples, self-supervised masked language models promise the ability to learn biologically-relevant patterns directly from a large set of BGC examples. Recently, self-supervised masked language models of biological sequences have been used to study proteins [14][15][16][17][18][19][20][21], DNA [22], RNA [23,24], and glycans [25,26]. In these models, a neural network is trained either to reconstruct the original sequence from a corrupted version of the sequence, or to predict the next element in the sequence given the preceding elements. After training on a large dataset, such as all protein sequences in UniProt [27], the model can be used for zero-shot predictions of fitness [28] or structure [29], and can additionally be fine-tuned on downstream supervised tasks [30,31].
To accelerate identification and classification of BGCs, we developed a self-supervised neural network masked language model of BGCs from bacterial genomes (Fig 1). Our model represents BGCs as chains of functional protein domains, and uses ESM-1b [14], a protein masked language model, to obtain pretrained embeddings of functional protein domains with amino acid-level context. We then train a convolutional masked language model on these domains to develop meaningful learned representations of BGCs and their constituent domains. The architecture for our model is based off of convolutional autoencoding representations of proteins (CARP) [32], a masked language model of proteins, and we will therefore refer to it as Biosynthetic Gene CARP (BiGCARP). We leverage these representations to detect BGCs from microbial genomes and then classify them based on their natural product class. We further investigate the potential advantages of our model by comparing our approach with DeepBGC, and demonstrate that BiGCARP achieves improvements in BGC prediction and natural product classification. BiGCARP highlights self-supervised neural networks as a promising framework for improving BGC characterization.

Self-supervised training
We first developed a self-supervised training scheme to train BiGCARP to learn representations of BGCs. As BGCs have a hierarchical structure, they can be represented at four main levels. From the least-to-most granular, these are: genes, Pfam domains (families of evolutionaryrelated proteins), amino acids, and nucleotides. We note that more granular units of representation lead to longer sequences. BGCs typically contain several dozen genes, each of which contains one or more Pfam domains. Each Pfam domain contains tens to hundreds of amino acids, and each amino acid is encoded by three nucleotides. This introduces a trade-off between modeling short sequences where each unit is complex or modeling long sequences where each unit is simple. In order to balance input sequence length and information content of individual units, we chose to represent BGCs as sequences of Pfams. This is the same level chosen by DeepBGC [12]. As shown in Fig 1, during training, we append a BGC product class token to the start of each BGC Pfam sequence in order to learn BGC product classes from their Pfam domain sequences. We then corrupt the sequence according to the BERT [33] corruption scheme and train Biosynthetic Gene Convolutional Autoencoding Representations of Proteins (BiGCARP) to reconstruct the original class token and Pfam sequence. BiGCARP combines the ByteNet encoder dilated CNN architecture from [34] with linear input embedding and output decoding layers, as shown in Fig 2a. Pfam embeddings map protein families from our vocabulary to vectors into a 1280-dimensional space, and thus serve as the inputs to BiGCARP. We train three versions of BiGCARP with different initial Pfam embeddings. The BiGCARP-ESM-1b-finetuned and BiGCAR-P-ESM-1b-frozen models are both initialized with Pfam embeddings obtained by averaging the per-residue output from ESM-1b for each domain. BiGCARP-ESM-1b-finetuned has its Schematic of the workflow for characterizing BGCs with BiGCARP, a self-supervised deep neural network. We curate a dataset of annotated BGCs from antiSMASH for training BiGCARP. We then use ESM-1b [14], a protein masked language model, to obtain pretrained embeddings of protein family (Pfam) domains in our dataset and to explore whether pretrained Pfam domain embeddings show improvement on the quality of their representations. By representing BGCs as chains of Pfams, we train a self-supervised masked language model on these domains to characterize BGC properties in microbial genomes. We leverage these learned representations to detect BGCs from microbial genomes and to predict their natural product class.
https://doi.org/10.1371/journal.pcbi.1011162.g001 (a) We use the masked language model objective described in [33] to train BiGCARP to reconstruct the BGC product class and Pfam sequence on our self-supervised dataset, which contains around 127,000 BGC Pfam sequences. BiGCARP is a dilated 1D-convolutional neural network masked language model based on CARP [32] and ByteNet [34]. (b) Validation loss (cross-entropy) and accuracy for BiGCARP with different initial Pfam embeddings.

PLOS COMPUTATIONAL BIOLOGY
https://doi.org/10.1371/journal.pcbi.1011162.g002 PLOS COMPUTATIONAL BIOLOGY embeddings finetuned during self-supervised BGC training, while BiGCARP-ESM-1b-frozen has the initial embeddings frozen at the onset of self-supervised BGC training. Finally, BiG-CARP-random is initialized with a random Pfam embedding, which is finetuned during selfsupervised BGC training. All three versions of BiGCARP are trained on BGC sequences extracted from the antiSMASH dataset [8,9]. We used approximately 127,000 BGC sequences and split the dataset 80/10/10 between training, validation, and testing, respectively. The training set is deduplicated against all datasets used in downstream evaluation. We refer the reader to Materials and Methods for details about the model training and architecture and the selfsupervised training dataset. Fig 2b plots the learning curves of the validation performance on the self-supervised dataset for all three versions of BiGCARP. We discover BiGCARP-ESM-1b-frozen is outperformed by BiGCARP-ESM-1b-finetuned and BiGCARP-random, which both show similar performance and attain an accuracy of around 75%.

Learned embeddings encode relevant representations of Pfam domains
We used uniform manifold approximation and projection (UMAP) to visualize the input Pfam embeddings after self-supervised training on the antiSMASH training set (Fig 3). Each protein family is represented as a single point, and protein families of similar sequence and function should have similar representations and thus be mapped to nearby points. In order to determine if our embeddings capture these properties of related Pfam domains, we plot every Pfam domain that falls under the ten most common Pfam superfamilies (clans) in our selfsupervised dataset: NADP Rossman (CL0063), P-loop NTPase (CL0023), Zn Beta Ribbon (CL0167), E-set (CL0159), HTH (CL0123), TPR (CL0020), PDDEXK (CL0236), MBB (CL0193), Beta propeller (CL0186), and OB (CL0021) [35].
We find that initializing Pfam domain embeddings using ESM-1b improves the quality of the learned representations, as these embeddings take into account protein family amino-acid sequence and protein structural information.

BiGCARP captures meaningful patterns in BGCs
We next evaluated BiGCARP's pretraining performance after self-supervised training ( Table 1). We use the exponentiated cross entropy (ECE) metric for evaluating BiGCARP. This metric provides a measure for a model's ability to narrow its prediction of a token from the set of options. An ideal model would have an ECE of 1, whereas a model choosing at random would have an ECE of the vocabulary size, which in our case is 19,550 for Pfam domains and 55 for BGC product classes. On our antiSMASH dataset test set, BiGCARP-ESM-1b-finetuned achieves the lowest ECE on the Pfam domains, while BiGCARP-ESM-1b-frozen achieves the lowest ECE on the product classes despite performing worse on domain ECE.
In addition, using the 9-genomes validation set from DeepBGC [12], we evaluate whether BiGCARP can identify the start locations of BGCs and whether each domain is in a BGC without further supervised training. We append a mask token to the beginning of every window of 64 domains in the dataset and pass them through BiGCARP. Intuitively, if the window is the start of a BGC, the model's BGC class prediction should have low entropy, and its reconstructions of the domains should be both low-entropy and have low cross-entropy with the original input domain. This scheme is shown in Fig 1. We refer the reader to Materials and Methods for details about scoring start positions and BGC Pfam domains. As shown in Table 1, all three versions of BiGCARP can detect BGC start locations and whether domains are part of a BGC, with BiGCARP-ESM-1b-frozen performing worse on both tasks than the other two versions.
We then finetuned BiGCARP on the training dataset reported in DeepBGC v0.1.0 [12], which uses all BGC domain sequences from MiBIG (version 1.4) as positive BGC samples and 10128 negative examples for 100 epochs and choose the epoch with the highest area under the receiver operating characteristic curve (AUROC) on the 9-genomes validation set for further testing (Methods). Table 2 shows domain-level classification performance using AUROC on PLOS COMPUTATIONAL BIOLOGY the 9-genomes validation set and AUROC and average precision (AvgPrec) on the 6-genomes test set from DeepBGC. Note that the DeepBGC results on 9-genomes are for cross-validation directly on 9-genomes. All three versions of BiGCARP outperform DeepBGC on the 6-genomes test set and 9-genomes validation set. However, self-supervised training did not improve performance on the 6-genomes test set for BiGCARP.

BiGCARP predicts BGC product classes
In addition to detecting BGCs in microbial genomes, predicting their product classes would provide further aid in discovering new natural products. BiGCARP learns to predict a BGC's product class from its Pfam sequence by reconstructing masked class tokens during self-supervised training (Fig 1). During self-supervised training, we use the antiSMASH product classes. In order to compare BiGCARP's performance to DeepBGC, we map antiSMASH product classes to those in the Minimum Information about a Biosynthetic Gene cluster (MIBiG) dataset used in DeepBGC [12,36]. DeepBGC trains a random forest classifier on its embeddings to predict BGC product classes. In contrast, we simply append a mask token to the beginning of each BGC sequence and evaluate the model's predictions for the identity of the mask, removing the need to train an additional model. All three versions of BiGCARP out-perform DeepBGC on average AUROC across the product classes, and ensembling their predictions via their arithmetic mean further improves accuracy, as shown in Table 3 and S1 Table. BiGCARP-ensemble outperforms DeepBGC on AUROC for four out of seven product classes. This is likely because the antiSMASH training set is approximately 100-times larger than MIBiG. Performance is generally similar for product classes that are well-represented in both datasets, with the largest gains coming in the "other" and alkaloid classes, which are under-represented in MIBiG. This underscores the importance and utility of training on a large and diverse BGC dataset. However, BiGCARP does not do as well as DeepBGC on precision and recall. This is likely because directly training on MIBiG labels enables better calibrated predictions. We note that DeepBGC is further advantaged here by reporting 5-fold cross-validation results on MIBiG, while BiGCARP is not trained on any sequences from MIBiG.

BiGCARP identifies BCGs from unannotated microbial genomes
We compared the ability of BiGCARP and antiSMASH 6.1.1 [37] to identify BGCs in 773 randomly-chosen bacterial genomes released after the antiSMASH 3.0 database. antiSMASH 6.1.1 identified 4287 clusters (renamed 'regions' in antiSMASH 5.0 and later). Of these, we were able to match the Pfam domain sequences for 3174 clusters to those produced by our domain annotation pipeline (see Methods). We treat these 3174 clusters identified by antiSMASH as ground

PLOS COMPUTATIONAL BIOLOGY
truth labels against which we evaluate matched BiGCARP predictions. Table 4 shows unsupervised BiGCARP performance in predicting the locations of clusters in these genomes, using the same methods as on the 9 genomes test set. We also identify 199 possible start locations with better scores than any of the clusters found by antiSMASH 6.1.1, which may be a fruitful starting point for further investigation. All datasets and annotations used can be found on Zenodo.

Discussion
Biosynthetic gene clusters (BGCs) are a promising source of natural products, but are difficult to discover, express, and characterize. Recent work in self-supervised deep learning has shown promise for modeling DNA, RNA, proteins, and glycans. We develop Biosynthetic Gene Convolutional Autoencoding Representations of Proteins (BiGCARP), a masked language model that learns representations of BGCs based on their Pfam domains, detects BGCs, and predicts their product classes. To our knowledge, this is the first work to use Pfam domains as tokens in a masked language model. We demonstrate that our model learns biologically-reasonable representations of Pfam domains. Representing BGCs as Pfam domains was a compromise between limiting the sequence length while having fine-grained sequence information. Models on the level of amino acid residues or the nucleotide sequence may be able to resolve more details at the cost of more computation. BiGCARP is a strong BGC detector even without seeing negative examples, and achieves state-of-the-art accuracy in product class prediction. In presenting the first use of Pfam domains as tokens for a self-supervised language model, this work opens opportunities for future method development and model refinement. For example, our results indicated minimal benefit to using ESM-1b pre-trained Pfam embeddings. Future work could evaluate whether complementary methods for protein sequence pretraining, such as the convolutional-based CARP model [32], confer greater benefit relative to language models like ESM-1b. Additionally, sensitivity analyses could assess how the predictive performance of BiGCARP and other competing methods change as a function of different BGC representations, such as amino acid residues or nucleotide sequences. The BGC masked language model introduced here additionally demonstrates promise for the expansion of BGC science and engineering. In natural language processing and protein engineering, masked language models are often fine-tuned on downstream tasks of interest.
For BGCs, these downstream tasks could include predicting their expression conditions or the chemical structures of their products. Future work to assess the performance benefit of fine-tuning BiGCARP will help determine the potential utility of BiGCARP for a variety of downstream tasks. Without fine-tuning, our models are useful for detecting previously unknown BGCs in microbial genomes and predicting BGC product classes.
In summary, we present BiGCARP, a self-supervised masked language model for the detection and characterization of biosynthetic gene clusters. The BiGCARP model described here could be deployed for downstream tasks of interest, including chemical product structure characterization and BGC mining. This study highlights the potential of self-supervised deep learning as a framework for BGC discovery and characterization.

Materials and methods
In this section, we elaborate on details of our self-supervised deep learning framework for detecting BGCs from bacterial genomes and classifying them into their natural product classes. The workflow is summarized in Fig 1, which consists of curating data, pretraining Pfam domain embeddings, training BiGCARP, and using BiGCARP to characterize BGCs.

Data
Pretraining dataset curation. To curate our pretraining dataset, we ran antiSMASH (ANTIbiotics & Secondary Metabolite Analysis SHell) 2.0, a microbial genome mining tool for BGC identification and analysis [8], on a database of 6,200 full bacterial genomes and 18,576 bacterial draft genomes [9]. This led to 142,821 total BGCs spanning 55 classes identified for model development and evaluation. Our choice of representing BGCs as Pfam domains led to a vocabulary size of 19,500 unique Pfam domains collected from Pfam database versions 31 and 32 [35]. We also remove sequences from the self-supervised training and validation sets that contain substrings from or are substrings of sequences from the MIBiG, 9-genomes, and 6-genomes datasets from DeepBGC described below. This results in 127,294 BGCs in our pretraining dataset prior to data splitting. All datasets used can be found on Zenodo. Future work may include using newer antiSMASH databases for pretraining for improvement of BiGCARP.
Pretraining data split for training and evaluation. Our training, validation, and test sets were produced from an 80/10/10 split of the total set. Note that random splitting of data is widely avoided in biological sequence modeling, since it leads to evaluation of overly simple generalization. For example, in protein modeling, one instead uses sequence-identity based splits as a proxy for evolutionary signal [38]. Proper splitting of BGCs is more complex, as evolution of BGCs is poorly understood. To reduce redundancy between the data splits, we ensured that no example in one set was a strict substring of an example in another set.
DeepBGC datasets for evaluating BGC detection and product classification. We evaluated the performance of our models and compared it to the DeepBGC model by testing its ability to detect BGCs within bacterial genomes and to predict their corresponding product classes. To do this, we utilized DeepBGC's training set with 617 positive and 10128 negative BGC samples to finetune our models [12]. We also used their 6-genomes and 9-genomes datasets to perform supervised domain classification tasks. For BGC product classification, we used DeepBGC's MIBiG dataset, which contains 1406 BGCs. Our mapping from antiSMASH product types to common MIBiG compound classes can be found on Zenodo.
Unannotated bacterial genomes for evaluating BiGCARP performance. To demonstrate the applicability of BiGCARP in advancing natural product discovery, we curated 773 unannotated bacterial genomes for BGC identification. We obtained all bacterial genomes with an assembly level of 'complete', 'chromosome', or 'scaffold' in GenBank and FASTA format released after 4 September 2020 available to download from the NCBI Datasets Genome Data Package using the ncbi-genome-download tool, yielding 108,007 assemblies that are unannotated in antiSMASH database version 3.0. We randomly chose 773 to analyze.
We then used Prodigal [39] [35]. Hmmsearch tabular output was filtered using cath-resolve-hits [41] to obtain a final set of non-overlapping domain assignments. The resulting list of Pfam domains was sorted by the gene and the domain start location.

Embeddings of Pfam domains with ESM-1b
We represent each Pfam as a vector. To do this, we take the first sequence in the alignment for a Pfam, then use ESM-1b [14], a protein masked language model, to embed all amino acids of this sequence. We averaged the embeddings over the full sequence, yielding a representation vector of size 1280. By obtaining pretrained embeddings of Pfam domains with ESM-1b, our model takes into account sequence details. To explore whether pretrained Pfam domain embeddings show improvement on the quality of Pfam domain representations, we use three different initial Pfam embeddings for BiGCARP: ESM-1b embeddings finetuned, ESM-1b embeddings frozen, and randomly initialized embeddings updated throughout training. ESM-1b-finetuned and ESM-1b-frozen have the same initialization at the start of self-supervised training. All other model weights were randomly initialized.

BiGCARP architecture and training
We train BiGCARP using the masked language model objective described in [33]. We prepend a token representing the antiSMASH BGC class to each BGC sequence. Each sequence is then corrupted by changing some tokens to a special mask token or another Pfam domain token, and the model is tasked with reconstructing the original sequence. Specifically, 15% of tokens from each sequence are randomly selected for supervision during each training step. For those 15% of tokens, 80% are replaced by the mask token, 10% are replaced by a randomly-chosen Pfam domain token, and 10% remain unchanged. The model is trained to minimize the batch average cross entropy loss between its predictions for the selected tokens and the true tokens at those locations.
BiGCARP is a dilated 1D-convolutional neural network masked language model based on ByteNet [34] and CARP [32]. The input is a sequence of Pfam domains represented by 1280-dimensional vectors. Model hyperparameters include the following: kernel width of 3, maximum dilation of 128, 32 layers, and a hidden dimension of 256 for a total of 34 million parameters. Training parameters include the following: batch size of 64, Adam optimizer with a learning rate of 10 −4 , and mixed precision training using PyTorch [42] and NVIDIA Apex. Each version of BiGCARP was trained on one 32GB NVIDIA V100 GPU for 300 epochs. The epoch with the lowest validation loss was selected for downstream experiments. Model weights and datasets are available on Zenodo; training code is available on our BiGCARP repository and code to run pretrained BiGCARP models is available on our protein sequence models repository on Github, with a command line manual for how to use the models. We do not report replicates for results as that would require training each model from scratch multiple times.

Evaluation on 9-genomes and 6-genomes
We use the intuition that the model should make more confident predictions when given BGC sequences than non-BGC sequences to predict BGC start locations and whether each domain is part of a BGC. For each bacterial genome, we prepend a mask token to each possible subsequence of 64 domains and pass the resulting sequences to BiGCARP. With the exception of domains at the beginning and end of the genome, each domain is thus scored 64 times. For each window, we calculate the entropy of the predictions for the prepended mask token (start entropy), the entropy for each of the 64 domains in the window (domain entropy), and the negative log-likelihood of each domain in the window (negative log-likelihood). We predict whether a domain is the start of a BGC using the start entropy of the window for which it is the first domain; positions with a lower start entropy are more likely to be BGC start locations.
We predict whether each domain is part of a BGC using the average of the start entropies for every window in which it appears and its domain entropy and negative log-likelihood within each window in which it appears (a total of 64 × 3 values). Domains with lower scores are more likely to be within a BGC.

Supervised training on DeepBGC training set
We follow the supervised training procedure described in DeepBGC. Using the positive BGC domain sequences from MiBIG (version 1.4) and 10128 negative BGC domain sequences from DeepBGC, at each epoch, we shuffle the sequences into a "genome" and then predict whether each domain is part of a BGC. We fine-tune the self-supervised versions of BiGCARP as well as a randomly-initialized version using the Adam optimizer and a learning rate of 10 −4 with early stopping using supervised results on 9-genomes.