Transformer model generated bacteriophage genomes are compositionally distinct from natural sequences

Abstract Novel applications of language models in genomics promise to have a large impact on the field. The megaDNA model is the first publicly available generative model for creating synthetic viral genomes. To evaluate megaDNA’s ability to recapitulate the nonrandom genome composition of viruses and assess whether synthetic genomes can be algorithmically detected, compositional metrics for 4969 natural bacteriophage genomes and 1002 de novo synthetic bacteriophage genomes were compared. Transformer-generated sequences had varied but realistic genome lengths, and 58% were classified as viral by geNomad. However, the sequences demonstrated consistent differences in various compositional metrics when compared to natural bacteriophage genomes by rank-sum tests and principal component analyses. A simple neural network trained to detect transformer-generated sequences on global compositional metrics alone displayed a median sensitivity of 93.0% and specificity of 97.9% (n = 12 independent models). Overall, these results demonstrate that megaDNA does not yet generate bacteriophage genomes with realistic compositional biases and that genome composition is a reliable method for detecting sequences generated by this model. While the results are specific to the megaDNA model, the evaluated framework described here could be applied to any generative model for genomic sequences.

The number of shu✏ing solutions grows exponentially depending on input sequence length (Figure S6A), and can therefore e↵ectively be treated as infinite for the sequence lengths relevant to this study.Based on the Central Limit Theorem, increased sample sizes will reduce the variability of an MFED estimate compared to other estimates drawn from the same population.However, larger sample sizes also incur higher computational costs and thus a balanced sample size must be selected.To measure the relationship between sample size and MFED estimate variability and identify an optimal sample size, random sequences of length l ranging from 100bp to 400bp were generated and MFED values were processed with varying sample sizes.Values of l included 100bp, 150bp, 200bp, 275bp, and 400bp and 15 replicates were generated for each length.
For each sequence, 15 MFED values were derived based on a shu✏ing sample size of i.Values of i ranged from 10 to 100 with step sizes of 10.
From this population of 15 distinct MFED estimates, the standard deviation s was measured.This process was repeated for 15 total s estimates.
The median s value for each value of i was carried forward for further analysis, creating a two column dataframe of i versus median s.The s values were normalized based on the maximum median s value and a linear regression of s vs i was performed.While likely not optimal, a linear equation was used to ensure a monotonically decreasing relationship.The resultant equation describes the number of samples i necessary to achieve a reduction of Y% compared to the maximum s value for that sequence.For every integer value of Y from 1 to 99, the median estimate of i for all input sequences of length l was collected.

GenerRNA Evaluation
This evaluation incorporated the 2000 natural and 2000 generated sequences listed within MFE distribution Fig4a.csv in the GenerRNA GitHub repository (https://github.com/pfnet-research/GenerRNA)as of commit ab7f470 on January 25, 2024.These were subsetted to only those sequences without ambiguous nucleotide codes, yielding 1,939 natural and 1,942 generated sequences for the analysis.GC content, dinucleotide odds ratios, MFE, and MFED values were calculated for whole sequences (rather than 120bp subsequences for MFE and MFED as done for megaDNA).This approach was selected to be similar to that done in the GenerRNA preprint.MFED was calculated with only 20 iterations to save computational time given size of input sequences.PCA and cluster analysis were performed as in methods.

Gene Functional Evaluation
A subset of .faafiles produced by PHANOTATE for 100 transformer-generated sequences were queried against the nr database on NCBI using blastp v2.16.0 by leveraging the Bio.Blast.NCBIWWW module in Biopython.

MFED sample size optimization
As expected, the relationship between sample size and MFED estimate variability was logarthmic, with progressively diminishing impacts on variability for increasing sample sizes (Figure S6B).Results were consistent across all sequence lengths tested.From these results, a shu✏ing sample size of 105 was chosen to be used in this study as this was shown to reduce the value of s by approximately 89% for all sequence lengths.Beyond this level, the diminishing returns of increased sample size were deemed to be an ine cient use of computational resources and time.There is no 'correct' sampling size for dinucleotide shu✏ing and future uses of this method need to balance accuracy with available resources.

GenerRNA Evaluation
For the 19 compositional metrics analyzed, the distributions of natural versus generated sequences were significantly di↵erent for only two after a Bonferroni correction (GpC ratio and MFED, Figure S7).Consistent with the findings of Zhao et al. ( 2024), there were no di↵erences in distribution of raw MFE values.However, the algorithms used by Zhao et al. (2024) to conclude there were no di↵erences in structure are inadequate compared to dinucleotide shu✏ing (see Clote et al. (2005)).In contrast to the authors' claims of parity between natural and generated sequences, MFED values of generated sequences were significantly lower than those of natural sequences (0.070 versus 0.078, respectively, p = 0.000062, two-tailed Mann-Whitney U Test).Lastly, generated sequences did not cluster with themselves greater than expected by chance (p = 0.14, binomial distribution; Figure S8).Compared to megaDNA, the composition of GenerRNA model outputs are substantially more compositionally similar to their training data.

Gene Functional Evaluation
The 100 tested .faafiles comprised 6,664 transformer-generated genes predicted by PHANOTATE.Of these, only 11.7% returned at least one hit (n = 780).46.2% of successful queries had > one hit (n = 360) for a total of 2,248 hits.The median Expect (E)-value for all hits was a disappointingly high 4.68 (3.81 if queries were limited to only their lowest scoring hit).153 queries had at least one hit with an E-value of < one, of which only 15 were < 0.05.Exactly three queries had E-values < 0.01.These three hits from three di↵erent queries were to a serine/threonine-protein kinase from Pseudomonadota bacterium (Accession = HRI53766; E-value of 0.0027), a rhamnulokinase family protein from ribacterium parvum (Accession = WP 009535588; E-value of 0.0072), and uncharacterized protein LOC131167269 from Malania oleifera (Accession = XP 057982017; E-value of 0.0050).Notably, none of the queries returned hits derived from bacteriophage genes."Nearest Neighbor Matches" defined as the number of family members for which the nearest data point in the 19-dimensional space (by weighted Euclidean distance) was a member of the same family.P value is derived from a binomial distribution where the expectation of matches is equal to the frequency of the family within the broader sample population ("Frequency Expected").Note, this method is vulnerable to within-family heterogeneity, as a family with many small local clusters but large global variance would appear highly clustered with this method.Results of predictive feature evaluation for (A) maximal and (B) minimal feature orders.X-axis labels specify the order in which features were added, with each point on the x-axis specifying the newest feature.Each model was inclusive of all features preceding that point on the x-asix.Only MFED (p = 0.00062) and GpC ratio (p = 0.0017) had significantly di↵erent distributions by two-tailed Mann-Whitney U Test after a Bonferroni correction.For ease of viewing, these panels are colored white.

Fig. S1 .
Fig. S1.Comparisons of Compositional Metrics between Natural and Transformer-Generated Sequences.Inclusive only of metrics not displayed in Figure 1.All distributions significantly di↵erent by two-tailed Mann-Whitney U Test (p < 0.0026) except for TpG dinucleotide ratio (colored grey).

Fig. S2 .
Fig. S2.Exploration of MFED values.(A, B) MFED plots for representative sequences.Sequences chosen due to having MFED values that were the median value for their respective provenance.Note consistent y-axis for both plots.(C, D) Impact on median Z score when changing window for MFE/MFED calculation from 120bp to 85 bp.Negative Z scores are indicative of a higher MFED value than expected by chance.

Fig. S3 .
Fig. S3.Correlation between Scaled Compositional Metrics.Correlations (Pearson) are inclusive of all data points regardless of completeness of taxonomy metadata.Correlation values calculated using function cor() from base R.

Fig. S4 .
Fig. S4.Principal Component Analysis (PCA) of Compositional Metrics without MFE or MFED.(A) Two-dimensional projections of PCs 1 and 2 limited to sequences from families with 25 members, colored by taxonomy.Percentages in x-axis and y-axis titles indicate percentage of variance explaned by given PC.Kindly note that the two-dimensional projection can be misleading as to the true location of a data point in the 19-dimensional space created by the PCA.Transformer-generated sequences clustered at a rate of 92.7% (929/1002).(B) Unitless weighted euclidean distance measures from centroid of family to centroid of transformer-generated sequences.

Fig
Fig. S5.Neural Network Predictive Feature Evaluation.Results of predictive feature evaluation for (A) maximal and (B) minimal feature orders.X-axis labels specify the order in which features were added, with each point on the x-axis specifying the newest feature.Each model was inclusive of all features preceding that point on the x-asix.

Fig
Fig. S6.Dinucleotide Shu✏ing Sample Size Optimization.(A) Results of dinucleotide shu✏ing observed for each random input sequence.Solutions observed are number of unique dinucleotide shu✏ed sequence outputs after performing n = 1,000,000 independent shu✏es.(B) Relationship between shu✏ing sample size and variability of MFED estimate.Values obtained as described in Supplementary Methods.Vertex displays sample size implemented for this study.

Fig. S7 .
Fig. S7.Comparisons of Compositional Metrics between Natural and GenerRNA Sequences.Only MFED (p = 0.00062) and GpC ratio (p = 0.0017) had significantly di↵erent distributions by two-tailed Mann-Whitney U Test after a Bonferroni correction.For ease of viewing, these panels are colored white.

Fig
Fig. S8.Principal Component Analysis (PCA) of Compositional Metrics for GenerRNA sequences.Two-dimensional projection of PCA results for all natural and generated sequences in the GenerRNA dataset.Views of "generated only" and "natural only" provided due to substantial overlap in the main panel.Neither generated nor natural sequences clustered within this analysis by binomial distribution tests.

Table S1 .
Distributions of Composition Metrics.Table split into two for ease of viewing.(A) Comparison of median values for compositional metrics under analysis between natural and transformer-generated sequences.(B) Values indicate p value from two-tailed Mann-Whitney U Test. Green highlighted cells indicate comparisons that are statistically significant after Bonferroni correction (p < 0.0026 for Natural vs Transformer or p < 6.6e-4 for all others).

Table S2 .
Family-specific Clustering within PCA Results.