Computational analysis of genes with lethal knockout phenotype and prediction of essential genes in archaea

ABSTRACT The identification of microbial genes essential for survival as those with lethal knockout phenotype (LKP) is a common strategy for functional interrogation of genomes. However, interpretation of the LKP is complicated because a substantial fraction of the genes with this phenotype remains poorly functionally characterized. Furthermore, many genes can exhibit LKP not because their products perform essential cellular functions but because their knockout activates the toxicity of other genes (conditionally essential genes). We analyzed the sets of LKP genes for two archaea, Methanococcus maripaludis and Sulfolobus islandicus, using a variety of computational approaches aiming to differentiate between essential and conditionally essential genes and to predict at least a general function for as many of the proteins encoded by these genes as possible. This analysis allowed us to predict the functions of several LKP genes, including previously uncharacterized subunits of the GINS protein complex, which has an essential function in genome replication, and of the KEOPS complex, which is responsible for an essential tRNA modification, as well as a GPR protease implicated in protein quality control. Additionally, several novel antitoxins (conditionally essential genes) were predicted, and this prediction was experimentally validated by showing that the deletion of these genes together with the adjacent genes apparently encoding the cognate toxins caused no growth defect. We applied principal component analysis based on sequence and comparative genomic features, showing that this approach can separate essential genes from conditionally essential ones, and used it to predict essential genes in other archaeal genomes.
IMPORTANCE Only a relatively small fraction of the genes in any bacterium or archaeon is essential for survival, as demonstrated by the lethal effect of their disruption. The identification of essential genes and their functions is crucial for understanding fundamental cell biology. However, many of the genes with a lethal knockout phenotype remain poorly functionally characterized, and furthermore, many genes can exhibit this phenotype not because their products perform essential cellular functions but because their knockout activates the toxicity of other genes. We applied state-of-the-art computational methods to predict the functions of a number of uncharacterized genes with the lethal knockout phenotype in two archaeal species and developed a computational approach to predict genes involved in essential functions. These findings advance the current understanding of key functionalities of archaeal cells.

cell even in the most favorable conditions. It can be expected that, in most free-living organisms, the sets of essential genes are larger. Indeed, the number of experimentally identified essential genes, that is, those genes for which knockout is lethal, in free-living bacteria and archaea varies from 270 to 640, and this number depends on the experimental conditions and the gene content of the tested strains [(4-9) and references therein]. Furthermore, computational analysis showed that some essential genes belong to integrated mobile genetic elements (MGE), and some are antitoxins of known toxin-antitoxin (TA) systems, suggesting that the lethal effect of their disruption is due to induction of cellular toxicity rather than to the inactivation of an essential function (10-12). Conversely, the lack of a lethal knockout effect for some genes can be caused by the presence of a paralog or a functional analog such that neither of the two genes is essential individually, but one of the two has to be present for survival (13). Taken together, these findings challenge the binary classification of genes as essential or non-essential (14) and raise concerns about the interpretation of gene knockout phenotypes.
In the case of archaea, only two studies so far have reported genome-wide transposon insertion mutagenesis to identify essential genes (7,12). In one of these, 89,000 unique insertions of the Tn5 transposon into the genome of the methanogenic archaeon Methanococcus maripaludis S2 (a member of the Euryarchaeota superphylum) were tested in both rich and minimal media (7). An essentiality index was calculated based on the number of viable insertions in a given gene and comparison with known essential genes. About 30% of the M. maripaludis genes (534 protein-coding genes, a union of all genes with essentiality index ≥3) were considered to be essential on rich medium, and about 47% (816 protein-coding genes) were considered essential on minimal medium. The second study was performed on Sulfolobus islandicus M.16.4, a model organism for the phylum Thermoproteota (formerly Crenarchaeota) (12). Three independent libraries were generated, and more than 100,000 colonies grown on rich medium were analyzed, with 441 protein-coding genes (16%) found to be essential using two different essentiality indices. This study explored the distribution of essential genes in 168 strains across the tree of life. It found that 42% of the arCOGs were shared with M. maripaludis and 45% were unique to S. islandicus, and deletion of some of these genes was shown to be lethal (12).
Numerous bacterial genes with unknown function are known to severely affect fitness when inactivated by transposon insertion (15). The two studies on essential gene identification in archaea are no exception: dozens of the identified essential genes remain uncharacterized. We were interested in deciphering the functions of these genes and, in particular, differentiating bona fide essential genes from those that display the lethal knockout phenotype (LKP) due to induced toxicity. As the first step, we employed a variety of computational methods to collect the maximum information on these genes and, whenever possible, to predict their functions. We then experimentally validated several predicted new antitoxins by demonstrating that deletion of the respective genes together with the adjacent genes encoding the cognate toxins yielded a viable phenotype. We further explored the relationship between LKP and various genome features.

Comparative genomics of lethal knockout phenotype genes in two archaea
First, we compared the sets of essential genes (lethal knockout phenotype, for accuracy) in M. maripaludis and S. islandicus and also compared these gene sets to the 218 archaeal core genes that are represented in all archaeal genomes except for some members of the DPANN (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanohaloarchaeota, and Nanoarchaeota) superphylum (16). To this end, we used the arCOGs (archaeal clusters of orthologous genes), which contain orthologs from 524 archaeal genomes representing all major archaeal lineages, including M. maripaludis and S. islandicus (17). These comparisons identified 177 arCOGs that are shared between the two sets of LKP genes, comprising 34% and 42% of the LKP genes in the rich medium for M. maripaludis and S. islandicus, respectively. This relatively small overlap between the LKP gene sets implies substantial differences in key cellular processes in these distantly related archaea (Fig. 1A; Table S1).
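The overlap described above is a set intersection over arCOG assignments. The following is a minimal sketch of that calculation, assuming each LKP gene has already been mapped to an arCOG. The identifiers and the function name `shared_fraction` are hypothetical and used for illustration only. The percentages in the text are reported over LKP genes rather than over arCOGs, so a real analysis would also need the gene-to-arCOG mapping.

```python
def shared_fraction(set_a, set_b):
    """Return the shared items and the share of each input set they represent."""
    shared = set_a & set_b
    return shared, len(shared) / len(set_a), len(shared) / len(set_b)


# Hypothetical arCOG identifiers (toy data, not taken from Table S1).
lkp_mmar = {"arCOG00001", "arCOG00002", "arCOG00003", "arCOG00004", "arCOG00005"}
lkp_sisl = {"arCOG00003", "arCOG00004", "arCOG00005", "arCOG00006"}

shared, frac_mmar, frac_sisl = shared_fraction(lkp_mmar, lkp_sisl)
print(sorted(shared))
print(f"shared: {frac_mmar:.0%} of set A, {frac_sisl:.0%} of set B")
```

The same function can be reused to intersect either LKP set with the core gene set.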
As expected, the majority of the shared LKP genes (118) belong to the archaeal core gene set, but the fraction of the core genes (54%) overlapping with the set of shared LKP genes was notably low. Moreover, 37 core genes were not represented among the LKP genes of either organism, suggesting that experimental conditions poorly reflect the actual conditions in the native environments. Among these genes, 20 encode components of the translation system, including 5 ribosomal proteins, L15E, L39E, L40E, S24E, and S27E. All these proteins are conserved between archaea and eukaryotes, but only L39E was also found to be non-essential in yeast (12,19,20). Generally, non-essential ribosomal proteins in yeast affect growth rate and stress resistance (19,20). Also, three genes involved in the biosynthesis of diphthamide, dph2, dph5, and dph6, are not LKP in either archaeon. Diphthamide is a post-translationally modified histidine residue in translation elongation factor 2 (EF2), which enhances the fidelity of EF2 function in translocation during translation (21,22). The absence of DNA polymerase PolB3 from both sets of LKP genes is notable as well. PolB3 can also be deleted in Thermococcus kodakarensis, another representative of Euryarchaeota, without any observable loss of viability or growth defects, suggesting that the PolD family polymerase, which is not a core gene, is sufficient for DNA replication (23,24). In Sulfolobus, PolB1 (a PolB3 paralog) is the essential polymerase responsible for DNA replication, whereas PolB3 is believed to be involved in DNA repair (12,25,26). Thus, the causes of PolB3 conservation remain unclear. These examples suggest that many genes found to be non-essential in laboratory conditions substantially affect population fitness in the native environments and/or in the long term and, thus, are effectively indispensable and, as a consequence, evolutionarily conserved (27).
Next, we used evolutionary reconstruction to infer the origins of LKP genes. To this end, we constructed clade-specific clusters of orthologous genes (csCOGs) for the Methanococcales and Sulfolobales clades and, based on the respective phyletic patterns mapped onto the phylogenetic trees, reconstructed evolutionary events (gains and losses) and the origin of each gene in each genome (see Materials and Methods). It can be expected that bona fide essential genes are ancestral in a respective clade, whereas conditionally essential genes, such as those encoding antitoxins, were acquired later in evolution. Indeed, the overwhelming majority of the LKP genes, 97% in M. maripaludis and 95% in S. islandicus, are ancestral in the respective clades (Table 1). Apparently, some conditionally essential genes can be ancestral, too. For example, in S. islandicus, two of the three annotated antitoxins are ancestral. The cas5 gene, a component of the type I-A CRISPR-Cas system, is also ancestral, but the deletion of the entire I-A locus had no detectable effect on cell growth (12). Several ancestral LKP genes, including a DNA-binding transcriptional regulator of the HxlR family (MMP0752) and a transposon ISA1214-encoded protein of the DUF2080 family (MMP0751) in M. maripaludis and two transcriptional regulators of the MarR family (M164_1630 and M164_1605) in S. islandicus, are located in integrated MGE (Table S1). Similarly, the plasmid-encoded M164_1606 protein, a paralog of M164_1605, is ancestral but not LKP. It seems highly unlikely that these genes from integrated MGE are important for any housekeeping functions. Instead, transcriptional regulators might silence toxin genes encoded on the respective plasmids. The identity of such toxins remains to be determined. There were 1.5 times more LKP M. maripaludis genes on the minimal medium compared to the rich medium (Table 1; Table S1). The number of LKP genes on the minimal medium increased in all functional classes, but especially among genes involved in post-translational modification, protein turnover, chaperones, and cell wall/membrane/envelope biogenesis, as well as functionally uncharacterized genes. About 87% of the LKP genes on minimal medium are ancestral, which is less than in the rich medium, suggesting that stress conditions increase the number of LKP genes from the pool of relatively recently acquired genes. In particular, there are 21 LKP genes encoded on integrated plasmids on minimal medium compared to only 3 on rich medium, and 14 of these are not ancestral (Table S1).
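To illustrate how a phyletic pattern mapped onto a tree can be turned into an "ancestral" call, here is a deliberately simplified sketch. It is not the reconstruction procedure used in this work (that is described in Materials and Methods). It assumes a Dollo-like rule: a gene family is placed at the clade root when it occurs in at least two of the root's child subtrees. The tree, genome names, and presence sets are hypothetical.

```python
def leaves(tree):
    """Collect leaf names from a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_ancestral(tree, present_in):
    """Simplified Dollo-like rule (an assumption of this sketch): the family is
    inferred at the clade root if it occurs in two or more child subtrees."""
    if isinstance(tree, str):
        return tree in present_in
    hits = sum(1 for child in tree if leaves(child) & present_in)
    return hits >= 2


# Hypothetical clade of five genomes.
clade = (("A", "B"), ("C", ("D", "E")))

family_old = {"A", "D"}   # present on both sides of the root
family_new = {"D", "E"}   # confined to one subtree, so a later gain is inferred

print(is_ancestral(clade, family_old))
print(is_ancestral(clade, family_new))
```

A real reconstruction would weigh gains against losses along every branch rather than inspect only the root split.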

Analysis of uncharacterized LKP genes
In an attempt to predict the functions of uncharacterized LKP genes, we used a variety of computational approaches, including sensitive methods for sequence similarity search (PSI-BLAST and HHpred), comparative genomics (using the arCOG database framework and csCOGs for Methanococcales and Sulfolobales), gene neighborhood analysis, AlphaFold2 (AF2) modeling of protein structures, and structure comparison using DALI (see Materials and Methods for details). In addition to functional prediction as such, we were interested in a detailed analysis of the evolutionary conservation of the LKP genes, aiming at the identification of extremely diverged orthologs of conserved genes that are missed in current annotations, including arCOGs. Altogether, we analyzed 31 proteins encoded by LKP genes of M. maripaludis (rich medium only) and 49 proteins of S. islandicus (Fig. 1B; Table S2). For several of these genes, functionally characterized homologs were readily identifiable. For many of the remaining proteins, we were unable to obtain any functionally relevant information. However, genes of high importance are expected to be ancestral and, at least in Sulfolobus, tend to cluster in large chromosomal segments (28-30). Thus, examination of extended neighborhoods of uncharacterized LKP genes might help to identify candidates for key housekeeping functions for further experimental studies. For example, genes coding for ACR40673 (double-stranded beta-helix fold), ACR40803 (Rossmann fold), ACR42328, ACR42330, and several others are surrounded by ancestral genes, including many LKP genes (Fig. S1). This feature might be predictive for M. maripaludis as well. For example, genes encoding WP_011170892, WP_011170894, WP_011171056 (sugar-binding domain), and WP_011171261 (ferredoxin fold) are more likely to be involved in key cellular functions (Fig. S2).
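The neighborhood criterion described above can be expressed as a simple windowed score. The sketch below is an illustration under stated assumptions, not the procedure used for Fig. S1 and S2: genes are given as an ordered list, a set of "flagged" genes marks those that are ancestral and/or LKP, and the window size is arbitrary. The gene names and the function name `neighborhood_support` are hypothetical.

```python
def neighborhood_support(gene_order, flagged, target, window=5):
    """Fraction of genes within `window` positions of `target` that are flagged.

    The chromosome is treated as linear here for simplicity; a circular
    chromosome would need wrap-around indexing.
    """
    i = gene_order.index(target)
    lo = max(0, i - window)
    hi = min(len(gene_order), i + window + 1)
    neighbors = [g for g in gene_order[lo:hi] if g != target]
    if not neighbors:
        return 0.0
    return sum(g in flagged for g in neighbors) / len(neighbors)


# Hypothetical gene order and flags (toy data).
gene_order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9"]
flagged = {"g2", "g3", "g5", "g6"}

score_g4 = neighborhood_support(gene_order, flagged, "g4", window=2)
score_g8 = neighborhood_support(gene_order, flagged, "g8", window=2)
print(score_g4, score_g8)
```

Genes with a high score sit in segments dominated by ancestral or LKP genes and would be the stronger candidates under this heuristic.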
For 7 and 14 uncharacterized LKP proteins of M. maripaludis and S. islandicus, respectively, at least a general functional prediction was possible (Fig. 1B; Table 2; Table S2). For several such proteins, the combination of evidence allowed us to propose specific hypotheses on their functions, as described below.

Essential LKP genes
HHpred search and subsequent AF2 modeling and structural comparison for Methanobacteriales-specific arCOG05026 revealed similarity with GINS proteins (named from the Japanese go-ichi-ni-san, meaning 5-1-2-3), essential components of replication initiation complexes in archaea and eukaryotes (Fig. 2A and B; Table S2). In the TACK [Thaumarchaeota (now Nitrososphaeria), Aigarchaeota, Crenarchaeota (now both in Thermoproteota), and Korarchaeota (now Korarchaeia)] superphylum, there are two clusters of GINS that, based on domain architecture and similarity to the corresponding eukaryotic GINS, are classified into GINS23 and GINS51 families (31-34) (Fig. 2A). The AF2 model of an arCOG05026 representative revealed an N-terminal, mostly beta-stranded, divergent B-domain and a C-terminal alpha-helical A-domain, the typical domain arrangement for the GINS23 family (33,34) (Fig. 2B). Previously, it was thought that the GINS23 family is absent in most euryarchaeal lineages except Thermococcales, especially considering that, in Thermoplasmata, GINS51 is a homotetramer that forms a complex with Cdc6 ATPase and MCM helicase (31,33,35,36). The formation of this complex seems to be sufficient for replication initiation, although this has not been directly demonstrated. Another piece of evidence supporting the GINS23 family assignment for arCOG05026 is the fact that these proteins in most Methanococcales are encoded next to the MCM helicase, which is a conserved neighborhood for other known members of the GINS23 family (31,34) (Fig. 2C). Thus, we identified GINS23 in at least one more euryarchaeal lineage, suggesting that, in many archaea, GINS23 might have escaped identification due to very low sequence conservation. This is apparently the case for the GINS51 family as well. We searched all archaeal genomes in which either GINS51 or GINS23 was not previously identified and detected many highly diverged members of these families that were not included in the respective arCOGs; indeed, most of the crenarchaeal GINS51 proteins are still annotated as "hypothetical" in public databases (Table S3). Based on these findings, we amended the corresponding arCOGs and modified the respective phyletic patterns (Fig. 2A; Table S3). However, there are still gaps in several complete archaeal genomes. For instance, we could not identify GINS51 by sequence similarity in Fervidicoccus fontis, a Desulfurococcales archaeon that is expected to encode representatives of both GINS families. We checked the vicinity of priS and dnaN genes encoding small primase subunit, where GINS51 proteins are typically encoded, and identified an uncharacterized gene encoding the AFH43081 protein (Fig. S3A). The AF2 model for this protein revealed a typical domain organization of GINS51, that is, the N-terminal alpha-helical domain and the C-terminal beta-stranded domain, but structural comparison using DALI failed to detect similarity with any GINS proteins (Fig. S3B). Nevertheless, considering all the above evidence, it appears most likely that this protein is an extremely divergent member of the GINS51 family. The only two remaining complete archaeal genomes lacking any GINS proteins are Methanopyrus kandleri AV19 and Methanopyrus sp. KOL6. Although we failed to identify any strong candidates for GINS homologs in these two genomes, we detected a predicted alpha-helical protein specific to these organisms and encoded next to the dnaN gene that might be functionally analogous to GINS (Fig. S3C and 3D; Table S3).
We next explored another case of extreme sequence divergence of one of the key components of informational systems. Initially, for the uncharacterized LKP protein ACR42468 of S. islandicus, a representative of arCOG07188, we identified sequence similarity to the Cgi121 subunit of KEOPS (kinase, endopeptidase, and other proteins of small size) (Fig. S4). The universally conserved KEOPS complex is responsible for the universal N(6)-threonylcarbamoyladenosine (t(6)A) tRNA modification, which stabilizes the tRNA anticodon loop structure (37,38). Archaea and eukaryotes share the architecture of the KEOPS complex, which is distinct from the bacterial one and includes the following subunits: Kae1 (tRNA N6-adenosine threonylcarbamoyltransferase), Bud32 (atypical protein kinase/ATPase), Pcc1, and Cgi121 (37-40). Eukaryotes have an additional subunit, Gon7, which has been recently proposed to be a divergent Pcc1 homolog (41,42). Previously, all homologs of Cgi121 were assigned to arCOG02197 (Fig. 2D), and no homologs of this protein were ever reported in Sulfolobales and most other Crenarchaeota. Here, we identified Cgi121 in all crenarchaeal genomes and in many more other archaea, with some of the newly detected homologs belonging to other lineage-specific arCOGs (Fig. 2D; Fig. S4). We noticed that ACR42468 and several other Cgi121 homologs in Sulfolobales are shorter than the typical Cgi121 because the N-terminal portion appeared to be missing. Upon further examination of this genome region, we identified a potential ORF extension, present in most Sulfolobus genomes, that contained 48 extra codons and, possibly, an unconventional start codon. We produced an AF2 model for a protein with the additional N-terminal region and found that it was highly similar to Cgi121 from Pyrococcus furiosus (PDB: 1ZD0; Z-score = 6.7; RMSD = 3.6; identity = 15%) (Fig. 2E). With these findings, it seems that (almost) all completely sequenced archaeal genomes encode all four KEOPS subunits, except for most members of the DPANN superphylum (Fig. 2D).
FIG 2 (panels G and H) (G) Phyletic pattern of the GPR family protease compared with those of LonA and LonB protease, peptidyl-tRNA hydrolase Pth2, and the Zn finger protein (arCOG03770), which is likely functionally linked to the GPR protease. Designations are the same as in panel A. (H) GPR protease gene neighborhoods in selected diverse archaeal genomes (detailed complete information is available in Table S3). For each gene neighborhood, the species name, genome partition, and coordinates of the locus are indicated. Genes are depicted by block arrows, with the length roughly proportional to the gene size. Dashed line indicates that the distance between the respective genes was shortened to save space. GPR protease genes are colored yellow, and Zn finger genes are colored green. The genes in the neighborhoods are designated by their gene or family names as follows: Map, methionine aminopeptidase; Rli1, translation initiation factor RLI1; MetG, methionyl-tRNA synthetase; AsnS, aspartyl/asparaginyl-tRNA synthetase; Dcd, deoxycytidine deaminase; Fap7, broad-specificity NMP kinase;
At least one more functionally uncharacterized LKP protein is predicted to be involved in key translation-associated processes in crenarchaea, namely, arCOG04181, which is represented in most genomes of the TACK superphylum (Table S3). Sequence analysis of this family revealed similarity with the atypical aspartic acid protease of the GPR (germination protease) family, which is widespread in bacteria and, in Bacillus, is involved in spore germination (43,44) (Fig. 2F; Table S2_evidence). Both catalytic aspartates are conserved in the archaeal proteins, suggesting that it is an active protease (Fig. 2G; Fig. S5A). These proteins contain an N-terminal transmembrane segment and thus are predicted to be membrane associated (Fig. 3B). The DALI comparison of the AF2 model for the ACR42636.1 protein with the PDB database identified peptidyl-tRNA hydrolase Pth2 as the best match. Both GPR and Pth2 belong to the same phosphorylase/hydrolase-like fold, according to the SCOP database (45). Peptidyl-tRNA hydrolases are responsible for the removal of bound peptides from peptidyl-tRNA to rescue stalled ribosomes, which is essential for cell viability in both bacteria and archaea (46,47). However, Pth2 (arCOG04228) is a core archaeal protein that is not membrane associated (Fig. 2G). Other proteases that are involved in protein quality control belong to either the LonA or the LonB subfamily of serine (S16 family) ATP-dependent proteases (48). In archaea, both families are present, although LonB is more common (Fig. 2G) (49). Notably, archaeal LonB proteases are membrane associated (48) and essential for viability, at least in haloarchaea (50,51). The phyletic pattern of arCOG04181 almost perfectly complements the LonB pattern, suggesting that these proteins are functionally interchangeable (Fig. 2G). Analysis of the arCOG04181 neighborhoods revealed a conserved Zn-finger protein (arCOG03770) that is often predicted to be membrane associated and is likely to be functionally linked to arCOG04181 (Fig. 2G; Fig. S5B). These two genes are likely cotranscribed with other genes encoding proteins involved in translation, such as methionine aminopeptidase (Map), translation initiation factor RLI1, and aminoacyl-tRNA synthetases, which supports their involvement in translation (Fig. 2H). Thus, we hypothesize that the arCOG04181 protein performs a variety of functions, including protein quality control by cleaving proteins tagged for degradation, similarly to Lon proteases in bacteria (52), although other translation-associated roles of these proteins cannot be ruled out.
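The statement that one phyletic pattern "almost perfectly complements" another can be made quantitative. A minimal sketch follows, assuming each family is represented by the set of genomes in which it occurs. The genome labels are hypothetical, and the score (share of genomes carrying exactly one of the two families) is one reasonable choice for illustration, not a measure taken from this work.

```python
def complementarity(pattern_a, pattern_b, genomes):
    """Share of genomes that carry exactly one of the two families."""
    exactly_one = sum((g in pattern_a) != (g in pattern_b) for g in genomes)
    return exactly_one / len(genomes)


# Hypothetical presence sets over six genomes (toy data).
genomes = ["G1", "G2", "G3", "G4", "G5", "G6"]
family_a = {"G1", "G2", "G3"}
family_b = {"G3", "G4", "G5"}

score = complementarity(family_a, family_b, genomes)
both = sorted(family_a & family_b)                    # exceptions with both families
neither = sorted(set(genomes) - family_a - family_b)  # exceptions with neither
print(f"{score:.2f}", both, neither)
```

A score near 1 with few "both" or "neither" genomes is the pattern expected for functionally interchangeable families, although it does not by itself demonstrate interchangeability.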

Conditionally essential LKP genes: antitoxins
As pointed out above, disruption of antitoxin genes by transposon insertion can lead to cell lethality due to activation of toxins, and indeed, such cases have been described for S. islandicus previously (12). In this work, we identified several additional novel antitoxins among the products of LKP genes and predicted new TA systems in S. islandicus. In contrast, no potential antitoxins were identified among LKP genes in M. maripaludis.
Four proteins that belong to arCOGs 10132, 7934, 7229, and 8451 form conserved two-gene arrays with known toxins, making prediction of antitoxin function for these proteins straightforward. For instance, Sulfolobales-specific arCOG10132 proteins are always encoded next to a Doc (death on curing) family toxin, which is a kinase that inactivates elongation factor Tu (54) (Fig. 3A; Table S3; Fig. S1). All these gene pairs are located within integrated MGE or in the vicinity of other defense systems (Fig. S6A; Table S3). No sequence or structural similarity with any other proteins was detected for arCOG10132 proteins. Notably, however, the AF2 model resembled phd (prevent host death) family antitoxins typically associated with Doc toxins (53) despite being topologically different, suggesting convergent evolution (Fig. 3A). The next case is the association of arCOG07934 with a RelE family toxin (Fig. 3B; Fig. S6B). RelE toxins are interferases that cleave mRNAs directly in the peptidyl transferase center of the ribosome (55). The arCOG07934-RelE module was found only in Sulfolobales and can be located within or outside of integrated MGE (Table S3). Using HHpred, we detected limited sequence similarity between arCOG07934 and the AbrB/MazE family of transcriptional regulators and antitoxins (56). However, no structural similarity was detected for the arCOG07934 AF2 model, presumably because the GD box motif and the following beta strand typical of AbrB/MazE proteins (56) are missing in arCOG07934 proteins, and only the C-terminal region is conserved (Fig. 3B). Two other predicted antitoxins, from arCOG07229 and arCOG08451, are linked to predicted nucleotidyltransferases related to AbiEii (arCOG05472), a component of an abortive infection (antivirus defense) system and type IV or type II toxin-antitoxin systems (57,58) (Fig. 3C; Fig. S6C; Table S3). It has been recently shown that a nucleotidyltransferase of this family inhibits growth of Mycobacterium tuberculosis through modification of the tRNA acceptor stems (58,59). Both predicted antitoxin families are specific to Sulfolobales, and most of the gene pairs comprising the corresponding TA modules are encoded in the vicinity of other defense systems (Fig. S6C; Table S3). No similar proteins were identified for arCOG07229 or arCOG08451 by sequence and structure analysis, and the predicted structures of the proteins in the two families were clearly different (Fig. 3C). Furthermore, nucleotidyltransferases of arCOG05472 are widespread in archaea (present in 142 genomes) and are associated with several other putative antitoxins of different, unrelated arCOGs (Table S3). Thus, mechanisms of inhibition of nucleotidyltransferase toxins by antitoxins are apparently highly diverse in archaea and remain to be explored experimentally.
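The "conserved two-gene array" evidence used above amounts to counting how often two families are encoded next to each other across genomes. The sketch below shows one way to tabulate this, assuming each genome is available as an ordered list of family assignments. The family labels, genome names, and the function name `adjacent_family_pairs` are hypothetical, and strand and intergenic distance are ignored for brevity.

```python
from collections import Counter


def adjacent_family_pairs(genomes):
    """Count, for each unordered family pair, the genomes in which the two
    families are encoded by adjacent genes."""
    counts = Counter()
    for order in genomes.values():
        # A set ensures each pair is counted at most once per genome.
        pairs = {frozenset((a, b)) for a, b in zip(order, order[1:]) if a != b}
        counts.update(pairs)
    return counts


# Hypothetical gene orders given as family labels (toy data).
genomes = {
    "genome1": ["famX", "famA", "famB", "famY"],
    "genome2": ["famB", "famA", "famZ"],
    "genome3": ["famY", "famZ", "famA", "famB"],
}

counts = adjacent_family_pairs(genomes)
print(counts[frozenset(("famA", "famB"))])
print(counts[frozenset(("famA", "famZ"))])
```

A pair that is adjacent in every genome where both families occur, in either orientation, is the kind of signal that makes a toxin-antitoxin assignment straightforward.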
We also predicted antitoxin functions for two other protein families that are potential components of novel, unusual TA systems. One of these consists of derived AbrB/MazE family proteins of arCOG09897 that show a limited but significant sequence similarity with other AbrB/MazE proteins, including strong conservation of the GD motif followed by the characteristic beta strand (Fig. 3D; Table S2). No structural similarity for the AF2 model was detected, conceivably because AbrB/MazE family antitoxins only assume the native structure when complexed with the cognate toxin (60). This family is present only in four archaeal genomes, and in all cases, arCOG09897 proteins are encoded divergently to predicted wHTH domain-containing DNA-binding proteins of arCOG09898 in Sulfolobales and arCOG01055 in Pyrodictium delaneyi (Fig. 3D). In three genomes, this pair of genes is encoded within integrated MGE (Table S3). Proteins of arCOG09898 are encoded in four additional genomes, and in two of these, the respective genes are located next to a gene encoding another transcriptional regulator of the MarR family (Table S3). Considering the conservation of the two-gene arrangement and the fact that both AbrB/MazE and MarR family proteins could be antitoxins, we hypothesize that arCOG09898 proteins are toxins (61,62). Currently, only one DNA-binding toxin is known, the SymE protein in the SymR-SymE type I TA system, where SymE, an AbrB/MazE family DNA-binding protein, is the toxin causing severe nucleoid condensation, and SymR is the RNA antitoxin (63,64). In contrast to this system, we propose that arCOG09897 proteins are antitoxins in either type II TA modules, where a toxin is inhibited by direct binding to an antitoxin, or type IV TA modules, where a protein antitoxin inhibits the toxicity indirectly, for example, by repressing the toxin transcription.
The most enigmatic predicted TA systems are associated with another group of AbrB/MazE family proteins, arCOG07185 (Fig. 3E). S. islandicus has three arCOG07185 genes, and disruption of each of these results in LKP (12). Altogether, we found that arCOG07185 is represented in 14 genomes of both crenarchaea and euryarchaea, typically within integrated MGE or defense islands (Fig. 3E; Table S3). In S. islandicus, arCOG07185 proteins are encoded divergently with arCOG09859, an uncharacterized, moderately conserved protein with a strictly conserved aspartate suggesting enzymatic activity (Fig. S5). However, in S. tokodaii, an arCOG09859 protein is encoded farther upstream. In Halobacteria, arCOG07185 proteins are also divergently encoded next to proteins of a large family (arCOGs 8928 and 8980) provisionally annotated as Halobacterial output domain 1 because some of these proteins are fused to or encoded in the vicinity of signal transduction domains (Fig. 3E) (65). In the genomes of several methanogens, however, no conserved neighborhoods were identified for arCOG07185 (Fig. 3E). The structure of an arCOG07185 member encoded in the genetic element pSSVx from Sulfolobus islandicus REY15/4 has been solved, revealing the swapped-hairpin fold typical of AbrB/MazE proteins, and this protein has been shown to bind DNA within its own promoter region (66). Thus, we propose two hypotheses, both of which are compatible with these observations. In the first scenario, the toxin is a small RNA, possibly overlapping with the arCOG07185 coding region. However, we were unable to identify a conserved DNA region that could encode a putative RNA toxin. Under the second scenario, arCOG07185 is a component of type II or type IV TA systems, but the corresponding toxins are different, often non-homologous proteins, which can be encoded either in cis or in trans. The arCOG07185 antitoxins are likely to be highly specific so that they cannot inhibit a non-cognate toxin in trans, even when its sequence is highly similar to that of the cognate toxin, as seems to be the case for the three arCOG09859 proteins in S. islandicus (Fig. S5).
If the above predictions are valid, it should be expected that a strain in which the antitoxin gene is deleted together with the adjacent cognate toxin gene will be viable. To test this prediction, we constructed eight deletions of predicted TA loci in S. islandicus: Doc-arCOG10132, RelE-arCOG07934, nucleotidyltransferase-arCOG07229, nucleotidyltransferase-arCOG08451, arCOG09898-arCOG09897, and three arCOG07185-containing loci, which include arCOG09859 as a potential toxin (Fig. 3; Table 3). Deletions were verified by PCR using primers flanking the deleted loci (Fig. 4; Tables 4 and 5). In all eight cases, no growth defect was observed. These results validated our prediction that the eight LKP genes in the tested gene pairs are antitoxins that are not involved in any essential cellular functions, with the LKP resulting from toxin activation upon the deletion of the antitoxin gene.
For several more uncharacterized LKP proteins, only a general function could be predicted (Table 2). At least two protein families likely play a role in RNA metabolism. One (arCOG00908, also known as DUF2067) consists of an N-terminal domain distantly similar to Pcc1, a subunit of the KEOPS complex, a central domain similar to the tRNA-binding domain of Trm1, and an unknown C-terminal domain (Table S2). The gene coding for this protein is present in 108 genomes of both euryarchaea and crenarchaea and is located in a conserved neighborhood that includes other RNA metabolism genes, namely, DNA-directed RNA polymerase subunit RPB11L and an exosome complex subunit, RNA-binding protein Csl4 (Table S3). Thus, it appears most likely that arCOG00908 proteins bind RNA and, considering the LKP of this gene, are probably involved in key cellular functions. A second family (arCOG05929) adopts an OB-fold (Table S2), which is found in many proteins binding nucleic acids (69). This family is specific to Sulfolobales. These proteins are encoded divergently with RNA 3'-terminal phosphate cyclase RCL1 (arCOG04125), which is involved in ribosomal RNA maturation and RNA repair (70,71). Thus, arCOG05929 could be an RNA-binding protein functioning in the same pathway.
We also predicted two subunits of the energy-converting hydrogenase, a membrane subunit that is homologous to NADH-ubiquinone oxidoreductase chain 3 (arCOG05034) and an intracellular subunit (arCOG08277) with an unknown role.Both genes are always present together with other energy-converting hydrogenase subunit genes in other archaeal genomes (Tables S2 and S3).Other LKP genes for which general predictions were made are Sulfolobales-specific methylase of arCOG05922, methylase of arCOG04385, rubrerythrin family protein of arCOG04160, and FAD-binding protein of arCOG04376 (Table 2; Table S2).

In silico prediction of essential genes and gene status
We explored the extent to which the LKP in rich and minimal media for M. maripaludis S2 (7) and in rich medium for S. islandicus M.16.4 (12) could be predicted by comparative genomic analysis. To this end, we collected data on 11 evolutionary and phenotypic characteristics of orthologous gene families in Methanococcales and Sulfolobales (Table 6, also see Materials and Methods section) and used linear discriminant analysis (LDA) to find the optimal linear combination of these variables to separate the LKP genes from the rest.
Stepwise model reduction (see Materials and Methods for details) suggested that two to three variables reflecting the family-level and kingdom-level gene conservation were sufficient for the best achievable prediction where the quality of prediction, measured as the area under the ROC (receiver operating characteristic) curve, was 0.68 for M. maripaludis in minimal medium and 0.81 for both S. islandicus and M. maripaludis in rich medium (Table 6). Principal component analysis (PCA) of the same 11-dimensional space of comparative genomic variables showed that the first principal component (PC1) values strongly correlated with the posterior probabilities obtained by the optimal linear predictor for the LKP (Table 6), suggesting that PCA results can be used to predict the LKP genes. Indeed, using the PC1 value to rank gene families results in predictors with areas under the ROC curve very close to those obtained with LDA (Table 6).

FIG 4 PCR verification of predicted TA loci knockouts. The TA loci were deleted by replacement with the StoargD marker cassette via homologous recombination as described previously (67). Two alternative StoargD marker cassettes were used, one 740 bp in length and the other 631 bp in length. The TA deletion strains were verified by PCR using primers annealing to sequences flanking the target regions. ∆, genomic DNA from the representative toxin-antitoxin knockout used as the DNA template; wt, genomic DNA from host strain RJW004 used as the DNA template; neg, molecular biology grade water used as a control; L, GeneRuler Express DNA Ladder (Thermo Fisher Scientific, USA). The expected size of the amplicon is shown in Table 3.
The sequences in lowercase exhibit homology to regions located outside the targeted deletion sites.
We showed previously that PC1 of comparative genomics and phenotypic characteristics of eukaryotic gene families can be interpreted as "gene status," an integral measure of a gene's importance on the cellular and organismal levels (72). High-status genes tend to be more conserved with respect to both loss or disruption and sequence divergence and are, on average, more highly expressed and more central in interaction networks than low-status genes. Our current results show that PC1 of comparative genomics variables in archaea can, by itself, be used to predict the LKP genes on par with specifically optimized linear predictors. This result implies that the notion of gene status is applicable to prokaryotes and that gene status can be assessed in the same way, namely, as the axis of highest variance in the space of comparative genomics variables.
Thus, we calculated PC1 values for five archaeal clades including well-characterized model organisms (Sulfolobales, Methanococcales, Thermococcales, Haloferacales, and Methanosarcina) and found that, typically, the number of genomes in csCOGs, ancestrality in csCOGs, number of genomes in arCOGs, and ancestrality in the last archaeal common ancestor make the greatest positive contribution, whereas the number of gene gains makes the greatest negative contribution (Fig. S7). We plotted the distributions of the PC1 values for all csCOGs and compared them with the distributions for the subsets of csCOGs including LKP and non-LKP genes for M. maripaludis and S. islandicus as well as highly and lowly expressed genes in Pyrococcus furiosus (Fig. 5A, B, and C). In each of the plots, we consistently observed four peaks. Examination of the csCOGs in these peaks showed that they can be generally described as follows. The peak with the lowest PC1 values consists of rare, mostly recently acquired genes, and the next peak comprises the variable genomic "shell" (73, 74), which includes genes present in several but not all the genomes from the respective clade. The two peaks with the largest PC1 values consist mostly of ancestral csCOGs, and in particular, the rightmost peak corresponds to the csCOGs mapped to the most widely represented, conserved arCOGs (including core arCOGs). These ranges (Fig. 5A) were remarkably consistent across the clades. As expected, for both S. islandicus and M. maripaludis, most of the LKP genes belonged to the two peaks with the highest PC1 values, which is also the case for the most highly expressed genes in P. furiosus (Fig. 5A, B, and C). The csCOGs corresponding to the two peaks with high PC1 values are, by definition, high-status genes. Thus, as could be expected, the majority of the LKP genes (75% and 70% for rich medium in S. islandicus and M. maripaludis, respectively) belong to the csCOGs in the high-status gene category. This analysis allows bona fide essential genes involved in key cellular processes to be differentiated from conditionally essential genes such as those encoding antitoxins. All arCOGs that include antitoxins (see above) fell within the low-status range (PC1 values below 2). The same analysis was performed for the Haloferacales and Methanosarcina clades, yielding predictions of the high-status gene sets (Fig. 5D). The fraction of high-status genes ranged from 33% to 52% in the selected model genomes from the respective clades (Fig. 6A). Despite the fact that only 3 of the 11 variables were derived from arCOGs (the rest being based on csCOGs), the sets of high-status genes obtained for all five clades are remarkably similar (Fig. 6B). Within each of these sets, 43% to 47% of the arCOGs were also recognized as high status in all five clades, and an additional 19% to 25% were high status in four of the five clades, a considerably better agreement than that between the genome-wide gene knockout experiments (Fig. 1A).

Concluding remarks
Genome-wide gene knockout as a method for identification of essential genes is widely used and provides valuable results. However, interpretation of the LKP is far from straightforward, especially for poorly characterized genes, considering that a sizable fraction of genes can exhibit LKP not because their products perform essential cellular functions but because their knockout unleashes toxins. In this work, we applied a variety of computational approaches in an attempt to differentiate between essential and conditionally essential genes in archaea and to predict at least a general function for as many of the proteins encoded by the LKP genes as possible. This analysis allowed us to propose specific hypotheses on the function and importance of several uncharacterized LKP genes. In particular, several novel antitoxins (conditionally essential genes) were predicted, and this prediction was experimentally validated by demonstrating that deletion of these genes together with the adjacent genes apparently encoding the cognate toxins caused no growth defect. We applied PCA to identify high-status genes based on sequence and comparative genomic features and showed that this approach separates bona fide essential genes from conditionally essential ones and can be used to predict essential genes in organisms for which knockout experiments have not been performed. Because of the fuzzy definition of essentiality, neither genome-wide gene knockout nor the gene status analysis can reveal the complete picture of the roles of different genes in an organism, but the two approaches are complementary, and their combination can provide insights into the importance of different genes and might help focus experimental efforts. Evidently, the requirements on gene content and expression in the native environment of any organism differ substantially from those in the laboratory, and the set of essential genes can be expected to differ as well, with the laboratory LKP gene sets (at least for rich media) likely being underestimates. In fact, high-status genes might be a better approximation of the native essential gene sets than the LKP data obtained in the laboratory, given that conditionally essential genes are excluded by this approach. In particular, the gene status information presented here for five archaeal lineages could be a useful guide for the study of the biology of these organisms.

Essential, non-essential, and highly expressed gene sets
The list of essential genes for M. maripaludis is a combination of all genes with essentiality index ≤3 in two libraries for rich medium at the T2 point and the same for minimal medium, whereas the non-essential gene list is a combination of all genes with essentiality index ≥11 in two libraries at the T2 point for rich and minimal medium, respectively (7). For S. islandicus, the list of essential, non-essential, and unassigned genes was taken directly from the published data (12). For Pyrococcus furiosus DSM 3638, the one-third of genes with the highest expression values was classified as highly expressed, and the one-third of genes with the lowest expression values was classified as lowly expressed (75).
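
The gene-set definitions above amount to simple threshold and rank rules, which can be sketched in Python. This is a minimal illustration, not the procedure used to build the published lists: the function names are hypothetical, and requiring the threshold to hold in every library is only one possible reading of "in two libraries".

```python
def classify_by_index(indices, essential_max=3, nonessential_min=11):
    """Label one gene from its essentiality indices (one value per library).

    Assumption: the threshold must hold in every library supplied.
    """
    if not indices:
        return "unassigned"
    if all(i <= essential_max for i in indices):
        return "essential"
    if all(i >= nonessential_min for i in indices):
        return "non-essential"
    return "unassigned"


def expression_terciles(expression):
    """Split genes into high / middle / low thirds by expression value.

    `expression` maps a gene identifier to its expression value.
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    third = len(ranked) // 3
    labels = {gene: "middle" for gene in ranked}
    for gene in ranked[:third]:
        labels[gene] = "high"
    for gene in ranked[len(ranked) - third:]:
        labels[gene] = "low"
    return labels
```

For example, `classify_by_index([2, 3])` labels a gene as essential, whereas a gene with indices 2 and 12 in the two libraries remains unassigned under this reading.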

Sequence and structure analysis
The arCOGs database was used as the comparative genomics framework (16, 17). arCOGs contain annotated clusters of orthologous genes and respective sequence information for 524 archaeal genomes covering all major archaeal lineages. To identify remote homologs, PSI-BLAST (76) with an E-value threshold of 0.001 and composition-based statistics turned off was run for five iterations or until convergence against either the arCOG database or the NCBI NR database. Low-scoring proteins were also used as queries for HHpred searches (77) with default parameters to verify the similarity found by PSI-BLAST. Muscle5 (78) with default parameters was used for protein multiple alignment construction. The TMHMM program (79) was used to predict transmembrane segments. The Jpred 4 web server was used to predict secondary structure (80). For structure predictions, the ColabFold web server running AlphaFold2 was used (81). Structure comparison with available PDB structures was performed using the DALI web server (82). UCSF ChimeraX was used for all structural analysis and visualization (83). Association of genes with integrated elements, viruses, and defense islands was inferred from examination of extended antitoxin neighborhoods (30 genes upstream and downstream) for the presence of signature genes of MGE: Rep, ParA/Soj, and primase-polymerase for plasmids, VirB4 for conjugative plasmids, signature viral genes for the respective viruses, integrases or recombinases for integrated elements, and annotated defense genes (when other signature genes were absent) for defense islands (Table S3).

Genome sets and construction of csCOGs
csCOGs were constructed for five archaeal lineages: Sulfolobales, Haloferacales, Methanosarcina, Thermococcales, and Methanococcales. Detailed information on the methods for csCOGs construction and evolutionary reconstructions using phyletic patterns of csCOGs and the 16S rRNA tree of the genomes in the respective clade was recently published elsewhere (84). The csCOGs for Methanococcales (see Data Availability) were constructed in this work, and the csCOGs for all other clades were taken from previous work (84).

Analysis of comparative genomic variables
Eleven comparative genomic variables were calculated for each csCOG, using either the constituent sequences, or csCOG evolutionary reconstruction or associated arCOG data:
1. number of genomes, csCOGs
2. paralogy, csCOGs (calculated as the number of distinct genes, coding a csCOG protein, divided by number of genomes that encompass these genes)
3. variability, csCOGs [a measure of sequence variability in the csCOG alignment, scaled to 1 corresponding to the average variability across all csCOGs in the clade (84)]
4. number of gains, csCOGs (calculated from the csCOG evolutionary history reconstructions)
5. ancestrality, csCOGs (a ternary value, calculated from the csCOG evolutionary history reconstructions, with 0 denoting acquisition on the terminal branch of the clade tree, 1 denoting acquisition on an internal branch of the tree, and 2 denoting a gene ancestral to the clade)
6. signal peptide fraction, csCOGs [fraction of proteins in the csCOG in which a signal peptide is predicted using SignalP (85)]
7. number of TM segments, csCOGs [average number of transmembrane segments predicted in csCOG proteins using TMHMM (79)]
8. core, arCOGs [a binary value with 1 denoting one of the 218 nearly universal, core arCOGs (18) and 0 otherwise]
9. low complexity fraction, csCOGs [the total fraction of residues in csCOG proteins, masked by SEG (86)]
10. number of genomes, arCOGs (with the value of 0 assigned to csCOGs not mapped to any arCOG)
11. ancestrality, LACA [the posterior probability of the presence of the arCOG in the last archaeal common ancestor (87)]
All variables were normalized to the average of 0 and variance of 1 prior to subsequent analysis.
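
The paralogy ratio and the normalization step can be illustrated with a short Python sketch. The function names are hypothetical, and the use of the population standard deviation (rather than the sample estimate) is an assumption, because the text specifies only the target mean and variance.

```python
from statistics import mean, pstdev


def paralogy(n_genes, n_genomes):
    """Number of distinct genes in a csCOG divided by the number of genomes carrying them."""
    return n_genes / n_genomes


def standardize(values):
    """Rescale one variable to mean 0 and variance 1 across all csCOGs.

    Assumption: population standard deviation; a constant variable maps to zeros.
    """
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]
```

Each of the 11 variables would be passed through `standardize` separately, so that variables measured on very different scales (counts, fractions, probabilities) contribute comparably to LDA and PCA.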
Genes for which experimentally determined binary LKP data were available were mapped to the corresponding csCOG; if at least one gene in a csCOG had the LKP, the csCOG was assigned an LKP value of 1, and 0 was assigned otherwise. The lda() function of the MASS package in R was used to predict the LKP using the comparative genomic variables. The csCOGs were sorted in the increasing order of the posterior probability of the non-lethal phenotype; the fractions of all LKP and non-LKP csCOGs (true-positive and false-positive rates, respectively), calculated for each position of the list, defined the ROC curve. The area under the ROC curve (AUROC) was used as the prediction quality measure. For each prediction, 1,000 bootstrap samples of csCOGs were generated, and the same LDA was performed, producing 1,000 AUROC values; their variance was calculated and recorded.
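
The AUROC calculation and its bootstrap variance can be sketched in Python as follows. This sketch makes several simplifying assumptions: the function names are hypothetical, the scores stand for any value that ranks LKP families higher (for example, the posterior probability of the LKP), and the predictor is not refitted on each bootstrap sample, unlike the LDA-based procedure described above. The AUROC is computed through the equivalent pairwise formulation, that is, the fraction of (LKP, non-LKP) pairs in which the LKP family scores higher, with ties counted as one-half.

```python
import random
from statistics import variance


def auroc(scores, labels):
    """Area under the ROC curve for scores where higher means more likely LKP."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one LKP and one non-LKP family")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_variance(scores, labels, n_boot=1000, seed=0):
    """Variance of AUROC over bootstrap resamples of the csCOGs.

    Simplification: scores are resampled as they are; no model is refitted.
    """
    n = len(scores)
    if not 0 < sum(labels) < n:
        raise ValueError("need at least one LKP and one non-LKP family")
    rng = random.Random(seed)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:  # skip resamples containing a single class
            values.append(auroc([scores[i] for i in idx], lab))
    return variance(values)
```

The quadratic pairwise loop is chosen for readability; for genome-scale lists of csCOGs, a rank-based computation would give the same value more efficiently.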
The analysis started with the LKP prediction using k = 11 original comparative genomic variables. Then, at each stage, k reduced models, predicting LKP using k − 1 variables, were analyzed in the same manner. The reduced model with the highest Z-score (difference between AUROCs divided by the square root of the sum of variances) was accepted if the reduction did not lead to a significant drop of prediction quality (Z ≥ −2). The stepwise model reduction continued until each of the remaining variables was found to contribute significantly.
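
The stepwise reduction can be expressed as a short backward-elimination loop. The Python sketch below is illustrative only: the function names are hypothetical, and `evaluate` is an assumed user-supplied callable that returns the AUROC and its bootstrap variance for a given list of variables (in the analysis described here, this role is played by LDA with bootstrap resampling).

```python
import math


def z_score(reduced, full):
    """Z = (AUROC_reduced - AUROC_full) / sqrt(var_reduced + var_full)."""
    (auc_r, var_r), (auc_f, var_f) = reduced, full
    diff = auc_r - auc_f
    denom = math.sqrt(var_r + var_f)
    if denom == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / denom


def stepwise_reduction(variables, evaluate, z_min=-2.0):
    """Drop one variable at a time while the best reduced model has Z >= z_min."""
    current = list(variables)
    while len(current) > 1:
        full = evaluate(current)
        candidates = []
        for v in current:
            reduced_vars = [u for u in current if u != v]
            candidates.append((z_score(evaluate(reduced_vars), full), reduced_vars))
        best_z, best_vars = max(candidates, key=lambda c: c[0])
        if best_z < z_min:
            break  # every remaining variable contributes significantly
        current = best_vars
    return current
```

With a toy `evaluate` in which only two of four variables affect the AUROC, the loop removes the two uninformative variables and then stops, because dropping either of the remaining ones gives Z below −2.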
PCA of the comparative genomic variables was performed using the prcomp() function in R. The csCOGs were sorted in the increasing or decreasing order of the first principal component; the ROC curve and the AUROC values were obtained in the same manner as with the LDA posterior probabilities.
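
For readers without access to R, the first principal component can be approximated with a few lines of standard-library Python. This is a sketch under stated assumptions and not a reimplementation of prcomp(): the function name is hypothetical, power iteration on the covariance matrix is substituted for the decomposition used by prcomp(), and the sign of the component is arbitrary, which is why the ranking may need to be taken in either increasing or decreasing order.

```python
import math


def first_principal_component(rows, n_iter=500):
    """Return (loadings, scores) of PC1 for a list of equal-length numeric rows.

    Assumption: power iteration on the sample covariance matrix, started from a
    fixed non-uniform vector; it fails if that start is orthogonal to PC1.
    """
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(k)]
           for i in range(k)]
    vec = [1.0 / (j + 1) for j in range(k)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(n_iter):
        nxt = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # all variables are constant
        vec = [x / norm for x in nxt]
    scores = [sum(c[j] * vec[j] for j in range(k)) for c in centered]
    return vec, scores
```

The returned scores play the role of the PC1 values by which csCOGs are ranked; the loadings indicate how much each comparative genomic variable contributes to that axis.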

Construction of targeted disruption mutants in S. islandicus
To construct mutant strains in S. islandicus, a gene deletion cassette containing an agmatine selection marker flanked by 35- to 40-bp sequences homologous to the regions outside the targeted deletion sites was generated and electroporated into S. islandicus RJW004 competent cells, selecting for transformants on plates without the addition of agmatine, as described previously (67). Two variants of the agmatine selection marker were used, depending on the genomic context of the TA loci. The first variant, spanning 740 bp, encompassed the open reading frame of the arginine decarboxylase-encoding gene along with its putative promoter and terminator regions. In the second variant, consisting of 631 bp, the putative terminator region was removed to allow for transcriptional read-through. Mutant strains were confirmed by PCR analysis using primers designed to bind outside the targeted regions, as detailed in Tables 3 to 5.

ADDITIONAL FILES
The following material is available online.
Figure S4 (mBio03092-S0004.pdf). Identification of KEOPS complex subunit Cgi121 in all crenarchaeal genomes in the arCOG database. The phylogenetic tree of archaeal genomes was constructed based on a concatenated alignment of 56 ribosomal protein sequences using FastTree (88); only the subtree for crenarchaea is shown. The same program was used to compute bootstrap values shown for each branch. Phyletic pattern (Cgi121 presence) is shown by circles next to the respective leaf in the tree. Prior to this work, only arCOG02197 (orange circles) was known to include Cgi121 proteins. Other proteins identified in this work as Cgi121 were previously included in several uncharacterized arCOGs (gray circles) or were not assigned to arCOGs (magenta circles). As a result of this work, these proteins (except for arCOG02197) are combined in arCOG07188.
Figure S5 (mBio03092-S0005.pdf). Multiple alignments of protein families described in the text.
Figure S6 (mBio03092-S0006.pdf). Gene neighborhoods for predicted arCOG10132 (A), arCOG07934 (B), and arCOG08451 and arCOG07229 (C) antitoxins in all genomes where these protein families were identified. Genes are shown by arrows roughly proportional to their size. The genome, the nucleotide accession number, and the coordinates of the locus are indicated on the right. Assignment to defense islands, integrated elements, or elsewhere was based on examination of extended loci (Table S3). Genes are designated by a gene (or protein family) name and arCOG number (see detailed description in Table S3). Toxins are colored red, and antitoxins are colored blue with a red outline. A dashed outline and the "pseudo" label above the arrows indicate that this particular gene contains a frameshift.
Figure S7 (mBio03092-S0007.pdf). Contribution of comparative genomic variables to PC1.
Table S1 (mBio03092-S0008.xlsx). Essentiality, ancestry, and gene status data for complete genomes of M. maripaludis and S. islandicus.
Table S2 (mBio03092-S0009.xlsx). Evidence of sequence and structural similarity for uncharacterized LKP proteins analyzed in this work.
Table S3 (mBio03092-S00010.xlsx). Gene neighborhoods of uncharacterized LKP described in the text.

FIG 1
FIG 1 Comparison and analysis of lethal knockout phenotype genes in Methanococcus maripaludis and Sulfolobus islandicus. (A) Venn diagram comparing sets of LKP protein families (rich medium) with the set of 218 most conserved protein families (archaeal core) defined previously (18). All LKP genes were assigned to arCOGs, and in each section of the diagram, the number of arCOGs, but not individual proteins, is shown. (B) Breakdown of uncharacterized LKP genes in M. maripaludis and S. islandicus.

FIG 2
FIG 2 New components of GINS and KEOPS complexes and GPR family protease, a putative functional analog of LonB protease. (A) Corrected phyletic patterns of GINS51 and GINS23 subunits for all 524 genomes in the arCOG database. For each genome, gene presence is shown as a vertical bar color coded according to the major archaeal phyla indicated above. arCOG numbers are indicated on the left. (B) Structural overlay of the AF2 model for WP_011170969.1 with the best scoring structure (PDB: 3awn; GINS51 family protein). Due to a circular permutation and divergence, the methanococcal B-domain in the GINS23 subunit (arCOG05026) did not align with GINS51, whereas the A-domain aligned well. (C) Gene neighborhood for WP_011170969.1. This gene is encoded upstream of the replicative helicase MCM2, which is typical for most other GINS23 genes. (D) Corrected phyletic pattern for Cgi121 and other subunits of the KEOPS complex. The old pattern for arCOG02197 is shown for comparison. Designations are the same as in A. (E) Corrected sequence for the ACR42468 locus of S. islandicus (CP001402.1:1709185..1709610) and structural overlay of the corrected ACR42468 AF2 model with the Cgi121 subunit from Pyrococcus furiosus (PDB: 1ZD0) are shown. The structural element corresponding to the appended amino acid sequence is colored red, and the old part of ACR42468 is colored yellow. This comparison shows that the previously missed N-terminal region of the corrected protein fits the Cgi121 structure. (F) AF2 model of GRP protease. The transmembrane segment is colored cyan, alpha helices are colored yellow, beta strands are colored green, and the two predicted catalytic aspartates are rendered in red.

FIG 3
FIG 3 Predicted antitoxins and novel toxin-antitoxin systems in Sulfolobus islandicus. (A) Gene neighborhood of the predicted arCOG10132 antitoxin in S. islandicus and comparison of the AF2 model of the arCOG10132 protein with the PHD antitoxin in complex with Doc toxin (PDB: 3K33) structure (53). Genes are denoted by an arCOG number and a gene or a family name, if available. More details are provided in Table S3. Blue brackets approximately indicate boundaries of the locus that was deleted in this work. The toxin description is given above the respective arrow. Species name, genome partition, and coordinates of the locus are indicated on the right of the loci schematics. (B) Gene neighborhood of the predicted AbrB/MazE family antitoxin (arCOG07934) in S. islandicus and multiple alignment of arCOG07934 proteins. A typical AbrB/MazE protein is aligned for comparison. The GD motif conserved in most AbrB/MazE family proteins is shown in green. Secondary structure prediction is shown below the alignment in magenta as follows: E, beta sheet; H, alpha helix. Other designations are as in A. (C) Gene neighborhood and AF2 models of the predicted antitoxins of arCOG07229 and arCOG08451. The designations are the same as in A. (D) Gene neighborhoods and multiple alignment of all predicted AbrB/MazE family antitoxins of arCOG09897. Other designations are as in B except that neighboring HTH domain-containing genes are colored green and aquamarine. (E) Gene neighborhoods of the predicted AbrB/MazE family antitoxin of arCOG07185, alignment of three arCOG07185 proteins from S. islandicus, and AF2 models of potential toxins. Complete alignment of this family is available in Fig. S5. Potential toxins are colored magenta and dark blue. Arrows with dashed outline were shortened to save space. Other designations are as in B.

FIG 5
FIG 5 Distribution of PC1 values in five archaeal lineages. (A) Distributions of PC1 values for six subsets of Methanococcales csCOGs color coded as shown to the right of the plot. Arrows above the plot show approximate ranges of the evolutionary age of the csCOGs. The approximate range of the high-status csCOGs is shown by the red arrow below the plot. p.d.f., probability density function. (B) Distributions of PC1 for four subsets of Sulfolobales csCOGs (legend inside the plot). (C) Distributions of PC1 for four subsets of Thermococcales csCOGs (legend inside the plot). (D) Distributions of PC1 for all csCOGs in Haloferacales and Methanosarcina.

FIG 6
FIG 6 Predicted high-status gene families in five archaeal lineages. (A) Fraction of high-status genes in selected model genomes (PC1 ≥2 for M. maripaludis, S. islandicus, and P. furiosus; PC1 ≥3 for Haloferax volcanii and Methanosarcina mazei). (B) Mapping of predicted high-status genes to arCOGs, compared between the lineages.

Figure S1 (mBio03092-S0001.pdf). Extended genome neighborhoods for uncharacterized LKP genes in S. islandicus. Genes are shown by arrows roughly proportional to their size. Coordinates in the genome are indicated on the right. Proteins are annotated by gene names (for details see Table S3). Color code: pink, ancestral genes; yellow, genes acquired at intermediate branches; pale blue, genes acquired in the leaves (individual genomes); red outline, LKP gene in rich medium; black outline, non-LKP genes.
Figure S2 (mBio03092-S0002.pdf). Extended genome neighborhoods for uncharacterized LKP genes in M. maripaludis. Designations are the same as in Figure S1.
Figure S3 (mBio03092-S0003.pdf). GINS candidates in Fervidicoccus fontis and Methanopyrus kandleri. (A) Uncharacterized protein AFH43081.1 from F. fontis is encoded downstream of PriS, primase small subunit. This neighborhood is characteristic for many other GINS51 genes. GINS51 could not be identified in F. fontis using sequence comparison methods. (B) AF2 model for AFH43081.1 revealed two domains resembling the A and B domains characteristic of the GINS51 family, but structural comparison still fails to identify similarity with other GINS structures. Nevertheless, we tentatively assign this


TABLE 2
Functional predictions for uncharacterized LKP proteins

a A, ancestral; I, intermediate branch; T, acquired at a terminal branch (the given genome).
b Several family members are correctly annotated in public databases.

TABLE 3
The expected size of amplicon from TA deletion and parental strains
a Locus tag (M164_): the last four digits are shown.

Research Article mBio February 2024 Volume 15 Issue 2 10.1128/mbio.03092-23

TABLE 4
Primers used in this study a

TABLE 5
Strains and plasmids used in this study
Sulfolobus-E. coli shuttle vector containing SsopyrEF and StoargD selection markers (Zhang, 2018)