Validation of predicted anonymous proteins simply using Fisher’s exact test

Abstract Motivation Genomes sequencing has become the primary (and often the sole) experimental method to characterize newly discovered organisms, in particular from the microbial world (bacteria, archaea, viruses). This generates an ever increasing number of predicted proteins the existence of which is unwarranted, in particular among those without homolog in model organisms. As a last resort, the computation of the selection pressure from pairwise alignments of the corresponding ‘Open Reading Frames’ (ORFs) can be used to validate their existences. However, this approach is error-prone, as not usually associated with a significance test. Results We introduce the use of the straightforward Fisher’s exact test as a postprocessing of the results provided by the popular CODEML sequence comparison software. The respective rates of nucleotide changes at the nonsynonymous versus synonymous position (as determined by CODEML) are turned into entries into a 2 × 2 contingency table, the probability of which is computed under the Null hypothesis that they should not behave differently if the ORFs do not encode actual proteins. Using the genome sequences of two recently isolated giant viruses, we show that strong negative selection pressures do not always provide a solid argument in favor of the existence of proteins.


Introduction
Since the first two bacterial genomes were sequenced 25 years ago (Fleischmann et al., 1995;Fraser et al., 1995), partial and whole-genome sequencing has become the method of choice in identifying and characterizing new microorganisms (bacteria, archaea, unicellular eukaryotes, viruses), revealing the stupendous extent of their diversity. By their simplicity of use and low cost, the most recent third-generation sequencing platforms (Goodwin et al., 2016) have made microbial genomics accessible to nonspecialists (Gwinn et al., 2019;MacLean et al., 2009), while a few large centers are fully taking advantage of their huge throughput to run biodiversity exploration projects of ever increasing dimensions (Chen et al. 2021;Lewin et al. 2018;Sunagawa et al., 2020). The most recurring (and unexpected) lesson emerging from the analyses of these enormous datasets is that overall morphological and phylogenetic similarities, as well as similar metabolisms and lifestyles, could hide large differences in gene contents and encoded proteomes. Within microorganisms belonging to a given clade, such as eukaryotic classes, bacterial genera or virus families, genomes are found to encode a subset of « core » proteins (i.e. with homologs in all members) together with proteins unevenly distributed, some of which only present in a single species or strain. This dichotomy is best documented for bacteria and viruses for which core genes might only represent a small proportion of the pangenome, i.e. of all the genes occurring at least once among all clade members (Claverie and Abergel, 2018;Land et al., 2015). While there is little doubt that genes encoding proteins with homologs in multiple divergent members of a clade are real, the level of certainty is much lower when they only occur once, or within very close clade members where the corresponding open reading frames (ORFs) may occur by chance. Given the A þ T richness of STOP codons (TAA, TAG, TGA), random ORFs are also statistically expected to occur at higher frequency in high G þ C content genomes, increasing the risk of protein overprediction (Legendre et al., 2019). The uncertainty further increases when the predicted proteins are short (typically less than 100 residues), or do not exhibit neither a functional motif nor a significant sequence similarity in the reference databases (Sayers et al., 2021). Such cases referred to as 'ORFans' represent a large proportion of predicted microbial proteomes (Entwistle et al., 2019) in particular for large viruses (Abergel and Claverie, 2020;Boratto et al., 2020;Gallot-Lavallée et al., 2017;Legendre et al., 2018;Philippe et al., 2013). The validation of ORFans is important to document the intriguing evolutionary process of de novo gene creations from noncoding regions in prokaryotes, eukaryotes and their viruses (Legendre et al., 2018;McLysaght and Guerzoni, 2015; V C The Author(s) 2021. Published by Oxford University Press.
If the experimental validation of predicted proteins through mass-spectrometry has become easier, the technique remains inaccessible to many of the laboratories generating genomic data. It also requires the corresponding microorganisms to be isolated and cultivated, thus disregarding the increasing number of metagenomic assemblies (Benler et al. 2021). Furthermore, certain proteins might only be expressed (and experimentally detectable) at specific time in the life cycle of organisms, in certain environmental conditions, or in specific organs. Thus, our capacity of experimentally demonstrating the actual existence of predicted proteins has fallen much behind the overwhelming production of genomic data. To overcome this difficulty, it has become customary to compute the selection pressure, i.e. the ratio of the synonymous versus nonsynonymous mutation rates as a way to validate bioinformatic protein predictions (e.g., Christo-Foroux et al., 2020;Doutre et al., 2014: Gonzá lez et al., 2016Prabh and Rö delsperger, 2016).
The concept/calculation of the selection pressure is based on the fact that proteins are made for a purpose and that their functions, directly derived from their amino-acid sequences, tend to be conserved throughout evolution. Accordingly, we expect that the nonsynonymous positions of their coding regions will vary much less rapidly than the synonymous ones, the changes of which have lesser consequence on the organism's fitness.
The concept of selection pressure was most widely disseminated via the CODEML program of the PAML package for phylogenetic analysis (Jeffares, 2015;Yang, 2007Yang, , 2014. The computation requires the disposal of at least two homologous ORF sequences, and involves five straightforward steps: (i) from the comparison with the associated amino-acid sequence, each position in the ORF nucleotide sequence is classified as synonymous or nonsynonymous in reference to the degeneracy of the genetic code. Their respective total numbers are denoted N S and N NS ; (ii) the two homologous amino-acid sequences are optimally aligned, then codon-wise converted into a nucleotide sequence alignment; (iii) the observed nucleotide changes associated to the positions previously mapped as synonymous or nonsynonymous are separately counted and are denoted n S and n NS ; (iv) one then forms the ratios d N ¼ n NS /N NS and d S ¼ n S /N S , separately quantifying the substitution rates at the two different types of positions; (v) finally, one computes the 'selection pressure' as the ratio The values of x are intuitively interpreted as follows: x < 1 will correspond to proteins the (beneficial) functions of which resist amino-acid changes, also said to evolve under negative (purifying) selection. This is by far the most frequent situation. In contrast, x > 1 correspond to the less frequent cases where changes in protein are positively selected (i.e. adaptive evolution) either to modify or abolish its (detrimental) function. Although conceptually simple, the practical implementation of this analysis comes up against two contradictory constraints. The first is that it must be based on an alignment of impeccable quality, and therefore between two highly similar protein sequences. The second is that the number of substitutions must be sufficiently high, while keeping the probability of multiple substitutions at the same site negligible (which would distort the estimate of d s and d ns ). To our knowledge, the validity range of the method was never rigorously defined in terms of pair-wise sequence divergence (i.e. acceptable value ranges for N S , N NS , and the d S or d N ratios), although CODEML can compute a likelihood value for a large suite of adaptive evolution models (the grasp of which is beyond the reach of most of occasional users). Fortunately, the use of CODEML remains easily tractable (Jeffares, 2015) if we only wish to compute x from the pairwise alignment of two homologous ORFs in order to evaluate the quality of ab initio protein prediction, as presented in the next section.

Methods
For actual proteins, the nonsynonymous and synonymous sites within the coding regions are expected to diverge at different speeds, thus leading to x 6 ¼ 1 in most cases. In contrast, in the case of false protein (ORF) predictions, the bioinformatic distinction made between nonsynonymous and synonymous sites becomes irrelevant, and both types of sites should now display a similar mutational behavior. We then expect x to remain close to one, within the range of random fluctuations.
As the nonsynonymous and synonymous positions are two mutually exclusive categories, we can evaluate how much both positions behave differently using Fisher's exact test in the analysis of the 2 Â 2 contingency table computed from the pairwise alignment of two homologous protein predictions, as follows: Where n NS and n S are computed as the products d N .N NS and d S .N S , respectively. These values are directly read from the standard CODEML output, then rounded to the nearest integers to be compatible with Fisher's test. The probability (p-value) that both position types (synonymous and nonsynonymous) behave differently (hence that the ORFS prediction are dubious) can be calculated by any available implementation of Fisher's test (online or in R, for instance). The pairwise sequence alignments were analyzed using the PAML 4.9j package version for UNIX/Linux with the following relevant options: We apply the above procedure to the evaluation of the whole predicted proteomes of two virus sequenced in our laboratory, constituting the only two known members of the proposed Molliviridae giant virus family. The prototype of the family, Mollivirus sibericum was isolated from ancient Siberian permafrost (Legendre et al., 2015) while the second member, Mollivirus kamchatka, was isolated from surface soil in Kamtchatka (Christo-Foroux et al., 2020). Both are 'giant' DNA viruses infecting the protozoan Acanthamoeba.
A stringent gene annotation of M.sibericum was initially performed using transcriptomic data (stranded RNA-seq reads) in addition to the standard protein-coding prediction methods (Legendre et al., 2015). Mollivirus kamchatka proteome prediction (Christo-Foroux et al., 2020) was performed without RNA-seq data but taking into account protein similarity with M.sibericum. Gene predictions were further curated using the web-based genomic annotation editing platform Web Apollo (Dunn et al., 2019). The selection pressure analysis was performed using CODEML as previously described (Christo-Foroux et al., 2020). Finally, the codon adaptation index (CAI) of both mollivirus predicted proteomes was performed using the CAI tool from the Emboss package (Rice et al., 2000).

Results
A total of 495 and 480 genes were predicted for M.sibericum and M.kamchatka, with the encoded proteins ranging from 51 to 2171 residues and from 57 to 2176 residues, respectively (Christo-Foroux et al., 2020). While the two isolates are very close to each other, sharing 463 of their predicted proteins as best reciprocal matches (with 92% identical residues on average, using BlastP), they are also very different from all other known organisms with 60% of their predicted proteins lacking a detectable homolog among cellular organisms or previously sequenced viruses (outside of the proposed Molliviridae family). These ORFan-rich proteomes constitute an ideal test set for our proposed selection pressure-based validation procedure to distinguish proteins that are actually made from spurious ORFs solely conserved by chance between evolutionary close viruses. Figure 1 displays the selection pressure values computed for all pairs of ORFans (Fig. 1A) and non-ORFans (Fig. 1B). For comparison, the exact same pairs of genes were also displayed in association to more traditional parameters such as their length, CAI and (G þ C) %. In all graphs, ORF pairs associated to nonsignificant Fisher's test p-values (thus less likely to correspond to actual proteins) are indicated by red dots. Table 1 provides some of the numerical values distinguishing the ORFan versus non-ORFans gene populations as well as those associated to x values nonsignificantly or significantly different from 1 (i.e. 'dubious' versus 'confirmed' protein candidates), as displayed in Figure 1.
The most discriminant pattern in Figure 1 is the larger proportion of red dots in the left columns. ORFans are more frequently associated to x values nonsignificantly departing from one (68/190, 36%) than non-ORFans (15/125, 12%). This suggests that more than a third of  predicted ORFans might not correspond to actual proteins. Yet, this result also shows that our testing procedure provides confirmations for all the others (64%; Table 1). The much larger proportion of blue dots for the non-ORFans confirms that the detection of homologs (even in very close species) is a reliable way to assess the reality of predicted proteins. However, our results indicate that 12% of them might be undergoing pseudogenization, despite appearing to remain under negative selection (x values < 0.6, Fig. 1B). The utility of our x value testing procedure is best illustrated by the combined consultation of Figure 1 and Table 1. For instance, it is clear that dubious predictions (red ORFs) are in average associated to larger x value than the blue ones (Table 1, p-values correspond to a Kolmogorov-Smirnov test). However, Figure 1A and B shows that no unique minimal x value threshold can be used to cleanly separate the two populations, confirming the utility of Fisher's test for that purpose.
Similarly, the predicted protein length distribution is significantly different between the red and blue dots (Table 1). However, if smaller ORF predictions are clearly less reliable, again no clear length threshold could separate both distributions ( Fig. 1C and D).
Finally, the computations of the CAI (Fig. 1E and F) or G þ C content ( Fig. 1G and H) do not bring in usable information to discriminate between reliable or unreliable protein predictions, given the very similar value distributions of these parameters for the red and blue dot populations.
We started this work by noting that x is defined as the ratio of two small quantities (d N /d S ) themselves computed from a limited number of substitution events (n s , n NS ) imposed by the necessity of flawless pairwise sequence alignments. Values of x are thus highly sensitive to random fluctuations making them unreliable to assess the validity of protein prediction. We showed here that applying the Fisher's exact test to the standard CODEML output provides a simple way to improve the reliability and predictive power of pressure selection computations. This procedure might thus constitute a useful addition to the standard genome annotation pipeline and previously proposed software tools to help identify spurious ORFs (Eberhardt et al., 2012;Hö ps et al., 2018). As a side benefit, the use of Fisher's test automatically filters out pairwise alignments that do not exhibit enough substitutions because they are too similar, or their alignments too short. The only parameter remaining to be fixed is the % of identical residues between orthologous proteins that should be greater than 70% (usually by imposing d S <1) to ensure high-quality pairwise alignments and minimize the probability of multiple substitutions at one given site (Jeffares, 2015).