Comparative Genomics Provides Insights into the Genetic Diversity and Evolution of the DPANN Superphylum

ABSTRACT DPANN is known as highly diverse, globally widespread, and mostly ectosymbiotic archaeal superphylum. However, this group of archaea was overlooked for a long time, and there were limited in-depth studies reported. In this investigation, 41 metagenome-assembled genomes (MAGs) belonging to the DPANN superphylum were recovered (18 MAGs had average nucleotide identity [ANI] values of <95% and a percentage of conserved proteins [POCP] of >50%, while 14 MAGs showed a POCP of <50%), which were analyzed comparatively with 515 other published DPANN genomes. Mismatches to known 16S rRNA gene primers were identified among 16S rRNA genes of DPANN archaea. Numbers of gene families lost (mostly related to energy and amino acid metabolism) were over three times greater than those gained in the evolution of DPANN archaea. Lateral gene transfer (LGT; ∼45.5% was cross-domain) had facilitated niche adaption of the DPANN archaea, ensuring a delicate equilibrium of streamlined genomes with efficient niche-adaptive strategies. For instance, LGT-derived cytochrome bd ubiquinol oxidase and arginine deiminase in the genomes of “Candidatus Micrarchaeota” could help them better adapt to aerobic acidic mine drainage habitats. In addition, most DPANN archaea acquired enzymes for biosynthesis of extracellular polymeric substances (EPS) and transketolase/transaldolase for the pentose phosphate pathway from Bacteria. IMPORTANCE The domain Archaea is a key research model for gaining insights into the origin and evolution of life, as well as the relevant biogeochemical processes. The discovery of nanosized DPANN archaea has overthrown many aspects of microbiology. However, the DPANN superphylum still contains a vast genetic novelty and diversity that need to be explored. Comprehensively comparative genomic analysis on the DPANN superphylum was performed in this study, with an attempt to illuminate its metabolic potential, ecological distribution and evolutionary history. Many interphylum differences within the DPANN superphylum were found. For example, Altiarchaeota had the biggest genome among DPANN phyla, possessing many pathways missing in other phyla, such as formaldehyde assimilation and the Wood-Ljungdahl pathway. In addition, LGT acted as an important force to provide DPANN archaeal genetic flexibility that permitted the occupation of diverse niches. This study has advanced our understanding of the diversity and genome evolution of archaea.

ABSTRACT DPANN is known as highly diverse, globally widespread, and mostly ectosymbiotic archaeal superphylum. However, this group of archaea was overlooked for a long time, and there were limited in-depth studies reported. In this investigation, 41 metagenome-assembled genomes (MAGs) belonging to the DPANN superphylum were recovered (18 MAGs had average nucleotide identity [ANI] values of ,95% and a percentage of conserved proteins [POCP] of .50%, while 14 MAGs showed a POCP of ,50%), which were analyzed comparatively with 515 other published DPANN genomes. Mismatches to known 16S rRNA gene primers were identified among 16S rRNA genes of DPANN archaea. Numbers of gene families lost (mostly related to energy and amino acid metabolism) were over three times greater than those gained in the evolution of DPANN archaea. Lateral gene transfer (LGT; ;45.5% was cross-domain) had facilitated niche adaption of the DPANN archaea, ensuring a delicate equilibrium of streamlined genomes with efficient niche-adaptive strategies. For instance, LGT-derived cytochrome bd ubiquinol oxidase and arginine deiminase in the genomes of "Candidatus Micrarchaeota" could help them better adapt to aerobic acidic mine drainage habitats. In addition, most DPANN archaea acquired enzymes for biosynthesis of extracellular polymeric substances (EPS) and transketolase/transaldolase for the pentose phosphate pathway from Bacteria. IMPORTANCE The domain Archaea is a key research model for gaining insights into the origin and evolution of life, as well as the relevant biogeochemical processes. The discovery of nanosized DPANN archaea has overthrown many aspects of microbiology. However, the DPANN superphylum still contains a vast genetic novelty and diversity that need to be explored. Comprehensively comparative genomic analysis on the DPANN superphylum was performed in this study, with an attempt to illuminate its metabolic potential, ecological distribution and evolutionary history. Many interphylum differences within the DPANN superphylum were found. For example, Altiarchaeota had the biggest genome among DPANN phyla, possessing many pathways missing in other phyla, such as formaldehyde assimilation and the Wood-Ljungdahl pathway. In addition, LGT acted as an important force to provide DPANN archaeal genetic flexibility that permitted the occupation of diverse niches. This study has advanced our understanding of the diversity and genome evolution of archaea.
sustain autotrophic growth on carbon dioxide and, potentially, carbon monoxide, formate, or acetate using a modified Wood-Ljungdahl pathway (27). In addition, Diapherotrites exhibited genomic evidence for anabolic biosynthesis of multiple carbohydrates, amino acids, lipids, and nucleotides as well as several cofactors (13). These findings highlighted high metabolic diversity within the DPANN group.
Environmental genomics has significantly facilitated the identification and characterization of numerous novel archaeal lineages, such as those belonging to the DPANN group (13,24,34). However, unquestionably, many aspects of the DPANN group, including the overall genetic diversity and evolutionary history, have not yet been clearly revealed. In addition, there is still a vast novelty and diversity of species within the DPANN superphylum that await exploration. This "dark matter" might possess unexpected metabolic diversity and fascinating cellular physiology, but it was so new to us that such organisms were usually not recognized by regular probing technology based on rRNA gene sequences. To date, no report has explored the pan-genome and comprehensive phylogenomics of the DPANN superphylum. In this study, we extended previous studies on the gene repertoires and evolutionary history of the enigmatic DPANN superphylum by the investigation of about 600 DPANN genomes. The DPANN genomes studied consisted of 41 DPANN MAGs recovered from metagenomic data sets (four from our AMD metagenome data sets [unpublished] and the other 37 from public metagenomes in the GenBank [35] and JGI-IMG [36] databases) using a binning strategy and 515 publicly available DPANN genomes scavenged from the GenBank, ggKbase, and JGI-IMG databases.
Among these 41 MAGs, 38 had ,5% contamination and 10 had .90% calculated completion and ,5% contamination (see Materials and Methods). However, the proposed extensive genome reduction that DPANN archaea underwent might have biased the genome completion assessment (33). These 41 MAGs were analyzed together with 515 publicly available DPANN genomes scavenged from the GenBank, ggKbase, and JGI-IMG databases (Table S2) . A significantly positive Spearman correlation (P , 0.001) between genome size and G1C content, average protein length, or number of coding sequences (CDS) was observed. Meanwhile, a significantly negative Spearman correlation (P , 0.001) between genome size and coding density was found (Fig. 1).
The genomes recovered from a hot spring and a hypersaline lake were significantly smaller than those from other habitats (unpaired t test, P , 0.05) (see Fig. S2 at https://doi.org/10.6084/m9.figshare.14215802.v2). In addition, genomes of the phyla Nanoarchaeota and Huberarchaea were smaller than those of other DPANN phyla (unpaired t test, P , 0.05) (see Fig. S3 at https://doi.org/10.6084/m9.figshare .14215802.v2). Comparative analysis of the pan-genome of each DPANN phylum (the collection of gene families found among genomes of an individual phylum) showed that Huberarchaea contained the fewest gene families (729) in its pan-genome. Venn analysis further illustrated interphylum differences in respective pangenome content (see Fig. S4  . From the PCA plot, it was observed that the intergroup difference was significantly greater than the intragroup difference (analysis of similarity [ANOSIM], R . 0, P , 0.05), and genomes of DPANN archaea appeared to cluster based on their taxonomic assignment rather than their habitat. However, the clustering of Pacearchaeota, Aenigmarchaeota, and Nanoarchaeota was tangled, which was also observed in whole-genome phylogeny (Fig. 2, left). This might result from conserved genes or shared genomic features in these phyla.
Core and pan-genome analyses of the 556 genomes revealed that the pan-genome of the DPANN superphylum reached a size of about 211,966 gene families, fitted into a power-law regression function [Ps(n) = 423.919n 0.983 ] with a parameter of g (0.983) close to 1, suggesting that the pan-genome was still highly "open" (i.e., tended to be linear). The core genome of the DPANN superphylum was fitted into an exponential regression [Fc(n) = 26457.4e 23.342n ], which followed a sharply steep slope and was reduced to 0 gene families within about five genomes. These results indicated vast diversity within the DPANN superphylum and showed that the currently characterized features of this superphylum are still far from saturation.
Omission of DPANN archaea during 16S rRNA-based investigations might be another issue in addressing the diversity of DPANN archaea. Mismatches to 25 known archaeonspecific or universal 16S rRNA gene primers were identified among 16S rRNA genes from available high-quality DPANN genomes (see Fig. S6  Evolutionary analyses of DPANN phyla. To gain insight into the evolutionary histories of the DPANN superphylum, gene family gain and loss events were predicted by mapping the identified genes families onto the whole-genome tree ( Fig. 3; also, see Materials and Methods). As expected, gene families undergoing loss events dramatically outnumbered those undergoing gain events by a factor of more than 3 (4,457 versus 1,425). The DPANN group was predicted to first diversify around 2,644.56 million years ago (Mya), prior to the occurrence of the Great Oxidation Event (GOE) (;2,400 Mya) (42) (Fig. 3a). A large number of gene family loss events occurred at branches leading to Huberarchaea, Pacearchaeota (accounting for 56% and 47% of gene families, respectively), and the most recent common ancestor (MRCA) of Nanoarchaeota (accounting for 56% of gene families). The top three clusters of orthologous groups (COG) categories (excluding the poorly characterized proteins) that lost the most gene families annotated were COG category C (energy production and conversion), COG category E (amino acid transport and metabolism), and COG category F (nucleotide transport and metabolism) (Fig. 3b).
Coding potential of DPANN phyla. Metabolic reconstruction of DPANN phyla was also performed ( Fig. 2 and 5). Results showed that the metabolic traits of DPANN genomes were a patchwork with few coherent features in most phylum-level radiations, which might represent indirect evidence of the complex evolution of the mysterious DPANN lineages (33). The DPANN archaea were characterized by the absence of biosynthetic pathways for amino acids, lipids, cofactors, vitamins, and nucleotides, but to different extents. In addition, although DPANN microorganisms seemed to have lost genes in many important functional categories, they selectively retained genes related to the central informational processes of DNA replication, transcription, and translation.
Succinyl-CoA synthetase of the TCA cycle was missing in most Micrarchaeota. It was postulated that succinate in the Micrarchaeota was generated from methylisocitrate via the methylisocitrate lyase (34). Notably, ;61.8% of Altiarchaeota also harbored some TCA enzymes (e.g., IDH, FumA, and Mdh), which might be involved in the metabolism of biochemical intermediates. Bacteria-derived FumA (fumarate hydratase class I) and FumC (fumarate hydratase class II) were also found (see Fig. S17 and S18 at https://doi.org/10.6084/m9.figshare.14215802.v2). Micrarchaeota and Parvarchaeota seemed to have lost FumA while retaining FumC. FumA was reported to be reactive oxygen species (ROS) sensitive, and FumC, in contrast, is ROS resistant (54). This characteristic difference might account for the distinct distribution of these two forms of fumarate hydratase (Fig. 2), since Micrarchaeota and Parvarchaeota mostly dwell in aerobic mine area-related habitats, while Altiarchaeota mostly inhabit anaerobic groundwater environments. In addition, 53.1% of Micrarchaeota and 60.5% of Parvarchaeota contained the oxaloacetate-decarboxylating enzyme malate dehydrogenase (MaeA), with putative bacterial origins (see Fig. S19  phyla devoid of the majority of TCA cycle enzymes (e.g., Pacearchaeota, Woesearchaeota, and Nanoarchaeota) contained 2-oxoglutarate/2-oxoacid ferredoxin oxidoreductase (KOR) ( Fig. 2 and 5). This suggests that the KOR family has a wider application than TCA function, like amino acid degradation, as reported for sulfur-dependent hyperthermophilic archaea (56).
About 52.7% of Altiarchaeota possessed a complete archaeal version of the Wood-Ljungdahl (WL) pathway. The assimilation of CO 2 into acetate in Altiarchaeota through the WL pathway has been demonstrated by carbon isotopic and transcriptomic analysis (27), indicating an anaerobic autotrophic lifestyle. The heterodisulfide reductase complex (HdrABC) and the non-F420-reducing hydrogenase iron-sulfur subunit (MvhD) were also annotated in 39.1% of Altiarchaeota; these might cooperatively form a cytoplasmic MvhD-HdrABC complex to reduce the disulfide of coenzyme M and coenzyme B (CoMS-SCoB) in the final step in methanogenic pathways (57,58). These enzymes were also likely acquired via LGT (see Fig. S20 and S21 at https://doi.org/10.6084/m9 .figshare.14215802.v2).
A putative acid stress resistance strategy was also found. Archaeal membrane (composed of isoprenoid-based lipids) was primarily synthesized via the typical mevalonate pathway (62). This mevalonate pathway was detected in 50.5% of Altiarchaeota, 39.7% of Diapherotrites, 52.8% of Micrarchaeota, and 20.0% of Aenigmarchaeota. However, there was a difference between the biosynthetic routes that these phyla took. Namely, 16.3% of Micrarchaeota contained a unique LGT-acquired mevalonate 3,5-bisphosphate decarboxylase (MBD) (see Fig. S24 at https://doi.org/10.6084/m9.figshare.14215802.v2) to generate the precursor of isoprenoids. MBD is thought to be more efficient at low pH (62), putatively conferring on Micrarchaeota the ability to adapt to acid mine drainage habitats.
Inorganic pyrophosphatase, polyphosphate kinase, and a five-subunit V/A-type H 1 / Na 1 -transporting ATPase (more streamlined than the nine-subunit ATPase prototype from euryarchaea [66]) were annotated in 82.7% of Micrarchaeota, 80.0% of Altiarchaeota, 71.1% of Parvarchaeota, and 37.9% of Diapherotrites. In addition, an LGT-derived F-type H 1 -transporting ATPase was annotated in an Altiarchaeota genome (see Table S5 at https://doi.org/10.6084/m9.figshare.14806140.v1). However, we still failed to detect components of the electron transfer chain in ;60.6% DPANN genomes. They seemed to rely on substrate-level phosphorylation as a major mode of energy conservation, in a symbiosis lifestyle. Correspondingly, various enzymes for the metabolism of fermentation products (e.g., butyrate, lactate, formate, ethanol, and acetate) were annotated. The fer gene, encoding ferredoxin (an important electron transfer protein), was detected in most DPANN phyla inhabiting anoxic biotopes (e.g., 74.5% of Altiarchaeota and 72.4% of Diapherotrites) but seemed to be consistently missing in those from oxic biotopes (e.g., Micrarchaeota and Parvarchaeota that inhabited AMD-related habitats). The distinct distribution of DPANN ferredoxin genes in different environments might further impact the ferredoxin-dependent pathways, eventually driving the selective distribution of DPANN members, as observed previously in Woesearchaeota (67).
(iv) Genetic information processing. Functions associated with the core biological processes of replication, transcription, and translation as well as protein folding and stability were also highly conserved in DPANN archaea (see Fig. S35 at https://doi.org/ 10.6084/m9.figshare.14214500.v2). In addition, 93.2% of tested DPANN genomes contained FtsZ-based cell division systems, attesting that they represented cellular life forms instead of residual DNA (73). Regarding chaperone and repairing molecules, heat shock protein Hsp20 and thioredoxin (TrxA) were found in ;70.2% of DPANN genomes. Chaperonin GroEL and prefoldin PfdB were found in ;88.4% of DPANN genomes; chaperonin GrpE, and the heat shock protein 70 (Hsp70), and chaperones DnaJ and DnaK were present in 85.5% of Altiarchaeota, 65.5% of Diapherotrites, 61.2% of Micrarchaeota, 65.8% of Parvarchaeota, and 74.2% of Woesearchaeota. Previous studies revealed relatively high gene expression and protein abundance of these proteins in symbionts with reduced genomes (74)(75)(76), corroborating the viewpoint that these proteins played critical roles in the biology of symbionts (including the DPANN microorganisms). The chaperones widely present in reduced microbial genomes might ameliorate the destructive influences of deleterious substitutions accumulating in symbionts (77), as well as ensuring correct protein folding and stability (78).

DISCUSSION
Nineteen years have passed since the first discovery of Nanoarchaeota (9), and the omics era has witnessed the rapidly increased identification of DPANN members that were previously unknown (25). In this study, we tried to reveal more genomic details of the enigmatic DPANN superphylum by using a metagenome binning strategy and comparative genomic analysis of the 556 DPANN genomes. Such information would be valuable for illuminating the genomic features, functional repertoire, and evolutionary history of the DPANN superphylum, providing the theoretical foundation for further study on isolation, cultivation, and preservation.
The editorial in the journal Science "So Much More To Know" raised over 100 most frontier scientific questions awaiting answers (83), and we considered that three of them were closely related to this study and worthy of discussion here.
The first question was, "Why are some genomes really big and others quite small (compact)?" (83). Our results showed that the most predominant feature of most DPANN genomes was their ultrasmall genomes, which were predicted to have lost genes for a variety of metabolic pathways. Moreover, we found that there was a significant positive correlation between the genome size and number of CDS, G1C content, and average protein length, as well as a significant negative correlation between the genome size and coding density (Fig. 1). The tiniest genomes in the DPANN superphylum were limited to symbionts that relied on archaeal hosts. Only Nanoarchaeota (84) and Micrarchaeota (71,85) within the DPANN superphylum were experimentally confirmed as ectosymbiotic archaea. Woesearchaeota was inferred to interact with Methanomicrobia and Methanobacteria based on a co-occurrence network (67), while Huberarchaea potentially lived in symbiosis with Altiarchaeota based on covarying cell abundance profiles and microscopic imaging (27,86). In contrast, Diapherotrites and Altiarchaeota were suggested to be free-living organisms (30,87).
The questions of whether the remaining members of the DPANN superphylum live a parasitic or free-living lifestyle and, if they are symbionts, what their hosts are still waiting to be addressed. In addition, it remains to be confirmed if members of DPANN phyla that contain ectosymbiotic representatives are all symbionts without exception. Furthermore, members of the DPANN superphylum diversified from each other before the rise of the earth's oxygen level, and these archaea were predicted to have evolved independently for millions of years (Fig. 3), similar to the endosymbiotic bacteria (88). However, the reasons and routes for DPANN phyla (apart from Altiarchaeota) undergoing such massive genome reduction are not yet fully resolved. In fact, almost all microorganisms harbor an innate bias toward gene deletion in their evolution (89). Because of deletional bias, microorganisms with tiny genomes have minimized the amount of noncoding DNA in their genomes, resulting in a very high coding density (89).
Similar to bacterial endosymbionts, host-restricted archaea in the DPANN superphylum, such as Nanoarchaeota and Micrarchaeota, were suggested to acquire (or share) metabolites such as cofactors, amino acids, and even ATP from their hosts and/or surroundings (32,71,81,90). This situation would render genes of a variety of functional categories in their genomes superfluous, resulting in relaxed selection for these redundant genes (91,92). Harmful mutations would then accumulate in these genes, followed by pseudogenization and deletion via genetic drift, and the absence of DNA repair and recombination mechanisms would exacerbate these processes (93)(94)(95). Several genes might also be actively removed through selection (96). The net result would be the deletion of nonessential genes and probably obligate reliance on their hosts. Functional complementation between these symbiotic DPANN archaea and their confirmed hosts was observed at the genome level (see Table S7a and S7b at https://doi.org/10.6084/m9.figshare.14806215.v1). However, it remains to be determined if these genomic evolution theories of endosymbionts could be applied to archaeal ectosymbionts.
The wide absence of DNA repair machinery might induce more A or T mutations in the genome, as DNA damage (e.g., cytosine deamination and guanine oxidation) usually results in G/C-to-A/T changes (94,97). This was the probable cause of the increased A and T content in the reduced DPANN genomes (Fig. 1d).
Environmental stresses might also play a part in shaping the reduced genomes of the DPANN archaea. For example, the DPANN archaea inhabiting hot spring environments (with thermophilic lifestyle) had significantly smaller genomes than those from other habitats in this study (see Fig. S2 at https://doi.org/10.6084/m9.figshare.14215802.v2) (unpaired t test, P , 0.05). It was reported that there was a negative correlation between growth temperature and genome size in thermophilic microbes (98). In other words, tiny genomes were more adaptive at high temperatures, by losing genes with less benefit in order to achieve energetic stress minimization (99)(100)(101). In addition to reduced CDS numbers, proteins from DPANN archaea with smaller genome size usually had a shorter average protein length (Fig. 1b) (r = 0.449; P , 0.001). This is probably due to the adaptive evolution of these proteins through discarding regions that encoded destabilizing substructures (102,103), which also help cut metabolic cost (104,105). In addition, results showed that as the genome of DPANN archaea got smaller, the genome became more compact (with increasing coding density) (Fig. 1a) (r = 20.176; P , 0.001), consistent with previous research on archaeal genome evolution (106). This is quite different from Bacteria. In Bacteria, the proportion of the genome consisting of noncoding regions is comparatively constant across a broad range of genome sizes (106,107). The above-mentioned features were probably associated with adaptive genome streamlining, with members of the DPANN archaea that inhabited hot springs exhibiting the most reduced forms (108,109). In addition, the DPANN group was predicted to diversify before the Great Oxidation Event (GOE) (;2,400 Mya) (42), and both oxidized nitrogen and sulfur-based compounds might be at low concentrations prior to the rise of an oxidizing atmosphere. Except for a few phyla (e.g., Micrarchaeota and Parvarchaeota) that were later exposed to (micro)aerobic environments, most other DPANN phyla were probably consistently confined in anoxic habitats. In this case, most DPANN phyla failed to evolve the TCA cycle and electron transport chain (ETC) components necessary for an aerobic lifestyle, and they were almost unable to perform dissimilatory nitrate reduction and dissimilatory sulfate reduction.
The second open question proposed by the Science editorial was, "Why does lateral transfer occur in so many species and how?" (83). Lateral gene transfer (LGT) is an indispensable evolutionary force in prokaryotes that has a massive impact on their genomic diversity and adaptive evolution. Hot springs, hypersaline environments, and acid mine drainage that DPANN archaea inhabit are all potential hot spots for LGT (110)(111)(112)(113). Whether LGT events happen gradually and continuously or rapidly in a short period of time is still under debate (111,(114)(115)(116). Our results showed that putative LGT events might have contributed substantially to the genome contents of the DPANN superphylum (see Table S5 at https://doi.org/10.6084/m9.figshare.14806140.v1), in line with previous reports (87,117). The DPANN cells were ectosymbiotic or episymbiotic (symbionts attached to the surface of other cells), distinct from the endosymbionts (symbionts living within other cells). The DPANN cells with an open environment-exposed cellular membrane did not experience isolation (such as those in insect endosymbionts) that separated them from contact with foreign genetic material (81,118). In fact, there was evidence that the DPANN cells could take up foreign DNA: the genome of Nanobsidianus (a Nanoarchaeota from a Yellowstone National Park hot spring) was found to harbor genes originating from a virus detected in the same hot spring (84,119).
In addition, LGT events might have facilitated niche adaption of the DPANN archaea, allowing for the delicate equilibrium of a streamlined genome with efficient niche-adaptive strategies. For example, the acquisition of cytochrome bd ubiquinol oxidase and arginine deiminase might confer on Micrarchaeota adaptive advantages under acidic (micro)aerobic conditions. Putative LGT events between symbiotic DPANN archaea and their hosts were also found, in line with previous reports (71,117). The direct cell-cell contact between the DPANN archaea and their hosts might have provided a good opportunity for gene exchange, as seen between Nanoarchaeota and their crenarchaeotal hosts (120). Lateral gene transfer from symbionts into host genomes might also have contributed to gene loss in the DPANN archaea (121), similar to that of bacterial endosymbionts (122)(123)(124), and there were indeed genes in genomes of DPANN hosts that were putatively transferred from DPANN donors (see Table S7c at https://doi.org/10.6084/m9. figshare  .14806215.v1).
The third open scientific question was, "Who was LUCA (the last universal common ancestor)?" (83). Putative ancestral traits of LUCA include living an anaerobic lifestyle with a Wood-Ljungdahl pathway (125,126), inhabiting thermal environments rich in transition metals and FeS (e.g., hot springs, deep-sea hydrothermal vents, and probably acid mine drainage) (125), and the presence of FBPA/ase (48). Consistent with this, our results showed that members of the DPANN superphylum contain all these features. SSN analysis showed that the FBPA/ases of DPANN (clustering with the Euryarchaeota) were in a hub-like position from which homologs of bacteria and Crenarchaeota derived (see Fig. S14 at https://doi.org/10.6084/m9.figshare.14215802.v2). This indicated that DPANN archaea occupied a more ancient position in evolution.
Other ancient characters in DPANN archaea were also revealed by previous studies, such as the presence of split genes (127)(128)(129) and the possibility that the DPANN superphylum might be at or close to the phylogenetic root of life (130) or archaea (126). The archaeal ancestor was inferred to possess a relatively simple and small genome, which increased in complexity subsequently and gradually through lateral gene transfer (LGT) and gene duplication (41,126). Correspondingly, these features (i.e., small genomes and occurrences of LGT) were found in DPANN archaea, as discussed above. Taken together, these observations led us to believe that detailed characterization of the DPANN superphylum would provide more clues to help unravel the mystery of LUCA.
Cultivation-independent genomic approaches have brought dramatic improvements to our understanding of the genome characteristics of the DPANN archaea. However, most species and many features of this superphylum remain unexplored, considering the widely "open" pan-genome of the DPANN archaea and mismatches in 16S rRNA genes against widely used primers. The continuing exploration of the dark matter within this supergroup of archaea will be the focus of further studies. In addition, cultivation studies are also important for characterizing the physiology and morphology and examining the coding potential of the DPANN archaea. To date, only a few strains of DPANN phyla have been able to be cultivated. It is foreseeable that genomics-guided isolation of the uncultured DPANN archaea will be performed extensively in the future, leveraging available genomic information to infer suitable cultivation conditions for the isolation, as seen with the Nanoarchaeota (81) and "Candidatus Lokiarchaeota" (131).
Concluding remarks. In this study, we performed a comparative genomic analysis of about 600 DPANN genomes, including 41 DPANN MAGs recovered from metagenomic data sets (18 MAGs had ANI values of ,95% and a POCP of .50%, while 14 MAGs showed a POCP of ,50%). We found that there were significant differences in gene repertoire among DPANN phyla, and there was a significant positive correlation between the genome size and number of CDS, G1C content, and average protein length, as well as a significant negative correlation between genome size and coding density. Predicted lost gene families outnumbered those gained by a factor of more than three during the evolution of the DPANN superphylum, whereas the top three COG categories that lost the most gene families annotated were COG category C, COG category E, and COG category F.
LGT (;45.5% was cross-domain) has promoted adaptive evolution of the DPANN archaea, permitting a delicate equilibrium of streamlined genomes with excellent niche-adaptive strategies. We also found blurred taxonomic boundaries in DPANN phyla and mismatches to known 16S rRNA gene primers among 16S rRNA genes of DPANN archaea, suggesting there were yet largely undetected and uncultivated branches. The insights gained in this study would be helpful for uncovering the genomic diversity of the DPANN superphylum and the evolutionary adaptation of these miniature archaea to such a broad range of environmental conditions, providing hints for further study on their detection, isolation, and cultivation.

MATERIALS AND METHODS
Sample collection, sequencing, and assembly. Six metagenomic samples with the identifiers C1W, C3W, C4W, C5W, C6W, and C9W were collected from six individual stations in the acid mine drainage (AMD) of DaBaoShan, Guangdong Province, China (with a latitude and longitude range of 24.554 to 24.557 N and 113.721 to 113.723 E, an altitude range of 598.84 to 641.90 m, a temperature range of 32.8 to 38.2°C, a pH range of 2.38 to 2.59, and a dissolved oxygen range of 4.92 to 6.14 mg/liter). Total environmental genomic DNA was extracted from these AMD samples using the PowerPlant DNA isolation kit (Mo Bio Laboratories, CA, USA) following the manufacturer's instructions. First, DNA samples were sheared into smaller fragments by nebulization. Then, the overhangs resulting from fragmentation were converted to blunt ends by using T4 DNA polymerase, Klenow fragment, and T4 polynucleotide kinase. After addition of an A (adenine) base to the 39 end of the blunt phosphorylated DNA fragments, adapters were ligated to the ends of the DNA fragments. Then, short fragments were removed with Ampure beads.
An Agilent 2100 Bioanalyzer and ABI StepOnePlus real-time PCR system were used to qualify and quantify the sample libraries. The qualified libraries were then sequenced on an Illumina HiSeq platform at Shenzhen BGI Gene Co., Ltd. (Shenzhen, China). In order to obtain more accurate and reliable results, unqualified reads were removed to obtain clean data. The unqualified reads were defined as follows: (i) reads containing 10% or more ambiguous bases (N base); (ii) reads containing adapter sequences (default: 15 bases overlapped by reads and adapter); (iii) reads containing 50% or more low-quality (Q , 20) bases. Preprocessed reads were assembled with IDBA_UD v.1.1.1 (132) to obtain longer contigs, and reads were assembled with a series of different-size k-mers in parallel. Reads were mapped back to each assembly result with SOAPdenovo2 (133). The optimal k-mer size and assembly results were chosen depending on both contig N 50 and mapping rate. During the assembly process, only contigs of no less than 300 bp were kept for further analysis.
Metagenome-assembled genome reconstruction. Binning strategies provided by both Maxbin v.2.0 (134) and MetaBAT v.0.32.4 (135) were applied for metagenome-assembled genome (MAG) recruitment from our six AMD metagenome data sets (unpublished data) as well as 36 publicly available metagenomes obtained from the GenBank and JGI-IMG databases through database mining (see Table S1 for metagenome accession numbers and other information). After that, the 41 bins acquired were refined with Prinseq (136). The phylogenetic placement and quality of MAGs were assessed by MiGA (137) and CheckM (138). The DPANN archaea have undergone such an extensive genome reduction that even the closed complete genomes (e.g., "Candidatus Mancarchaeum acidiphilum" Mia14) would have a genome completeness of 82.4% assessed by the above-mentioned method. We thereafter also assessed the relative completeness of each MAG based on the presence of 974 single-copy marker genes of "Candidatus Mancarchaeum acidiphilum" Mia14 (80) identified with CheckM (138).
Pan-genome and comparative genomic analyses. Available DPANN genomes in the public databases (GenBank, ggKbase, and JGI-IMG) were collected (n = 515, excluding genomes with contamination over 5%) for comparative genomic analysis with the 41 novel MAGs. Coding sequences in each genome were predicted using Prodigal v. 2.6 (139). OrthoFinder v1.1.4 (140) was then used to cluster the protein sequences in each genome into orthogroups (with default parameter). A representative sequence from each orthogroup was then used for functional annotation by eggNOG-mapper v. 2.0 (141) (default parameters: E value , 10 23 , bit score . 60). Spearman rank correlation tests and principal-component analysis (PCA; applying Bray-Curtis distance) were performed in OriginPro 2020b (OriginLab, Northampton, MA, USA). ANOSIM (analysis of similarity) in the vegan R package (142) was used to determine whether there was a significant (P , 0.05) difference between the groups and within groups with Bray-Curtis distance. An unpaired (between-group) t test was performed with GraphPad Prism v 9.0 (GraphPad, San Diego, CA, USA).
Model extrapolation of the pan-genome and core genome was conducted with the BPGA pipeline v.1.3 (143) applying USEARCH v.11.0 (http://www.drive5.com/usearch/) for clustering gene families with a 30% sequence identity cutoff and 300 random permutations of genomes to prevent bias in the sequential addition of new genomes. The size of the pan-genome was fitted into a power-law regression function, Ps = k ng, with a built-in program of the BPGA pipeline (143) (Ps is the total number of gene families; n is the number of analyzed genomes; g is a free parameter). The pan-genome was defined as being "open" (which meant that each added genome would contribute some new genes and the pangenome would increase) in cases where the calculated exponent g had an outcome falling in the range between 0 and 1, which was often observed in prokaryotic pan-genomes (144), and the openness of the pan-genome increased as the exponent g was closer to 1 (tends to be linear). However, if the exponent g had an outcome smaller than 0, then the pan-genome was defined as being "closed" (which meant that the size of pan-genome is relatively constant as new genomes were added), as observed in Staphylococcus (145).
The size of the core genome was fitted into an exponential decay function, Fc = k c 2n/t c , with a builtin program in the BPGA pipeline (143) (Fc is the number of core gene families and k c and t c are free parameters). OrthoVenn v.2.0 (146) was applied for clustering analysis and creating Venn diagrams based on orthologous clusters. Since no core gene was found in the tested genomes and many draft genomes in this study lack complete small-subunit (SSU) rRNA genes, we applied CVTree3 (39,40) for alignment-free phylogeny reconstruction based on whole-genome sequences (k-mer = 4). The tree and the presence and absence pattern of genes were visualized using iTOL (147) with the genomes of Euryarchaeota as the outgroup.
Average nucleotide identity (ANI) (37) of each MAG relative to public DPANN genomes (n = 515) (Table S2) was calculated with the ANI calculator (37). The percentage of conserved proteins (POCP) (38) between our MAGs and their phylogenetically closest genomes in public database was calculated with DIAMOND (148). The phylogenetically closest genome was defined as the public available genome that shared the most recent common ancestor (MRCA) with corresponding MAG in CVTree3 phylogeny. If the MAG formed a monophyletic clade in the phylogenetic tree, then the available public genome sharing the MRCA with the corresponding MAG and in the most basal position was chosen as the reference genomes. Sequence similarity networks (SSN) of gene families of interest were calculated with EFI-EST Tools (149) (with an E value cutoff of 10 25 and an identity cutoff of 35%) following the official online tutorial (https://efi.igb.illinois.edu/efi-est/tutorial.php). Sequences sharing more than 90% identity in SSNs were consolidated into the same "metanode." The SSNs were finally visualized with the "Organic layout" tool in Cytoscape v. 3.7.1 (150).
Primer alignments and environmental distribution analysis. Alignments of full-length 16S rRNA genes from high-quality DPANN genomes with 25 archaeon-specific or universal 16S rRNA gene primers (87) were performed with ClustalW (https://www.genome.jp/tools-bin/clustalw) to examine putative mismatches. The environmental distribution and abundance of the DPANN members were assessed with full-length 16S rRNA genes from high-quality DPANN genomes as queries against the Sequence Read Archive (SRA) database (including 93,045 of 16S rRNA gene amplicon data sets from 96 different environments), applying the pipeline described by Lagkouvardos et al. (163).
Data availability. The 41 metagenome-assembled genomes (MAGs) generated in this study are available in Genome Warehouse (GWH) in National Genomics Data Center (164) under project accession number PRJCA002651. All genome sequences used in this study can be readily accessed in corresponding databases using the accession numbers provided in Table S1 and Table S2.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.