Ultra-resolution Metagenomics: When Enough Is Not Enough

ABSTRACT
Technological advances in community sequencing have steadily increased the taxonomic resolution at which microbes can be delineated. In high-resolution metagenomics, bacterial strains can now be resolved, enhancing medical microbiology and the description of microbial evolution in vivo. In the Hildebrand lab, we are researching novel approaches to further increase the phylogenetic resolution of metagenomics. I propose that ultra-resolution metagenomics will be the next qualitative level of community sequencing, classified by the accurate resolution of ultra-rare genetic events, such as subclonal mutations present in all populations of evolving cells. This will be used to quantify evolutionary processes at ecologically relevant scales, monitor the progress of infections within a patient, and accurately track pathogens in food and infection chains. However, to develop this next metagenomic generation, we first need to understand the currently imposed limits of sequencing technologies, metagenomic strain delineation, and genome reconstructions.

Microbial sequencing has drastically increased our understanding of bacterial communities. Amplicon sequencing (metabarcoding) of the 16S rRNA gene can cost-effectively identify and quantify prokaryotes in communities, but it is limited by sequencing errors that will inflate species diversity if not corrected (1). Further, my group previously described phylogenetic restrictions (2,3) and taxonomic misclassifications in low-biomass scenarios (4). To optimally account for such errors, we maintain and steadily improve the LotuS2 pipeline (5). However, despite active development in read error correction (6), metabarcoding approaches will remain limited to species-level resolution (7).
There is strong interest in further resolving microbiomes; currently, bacterial strains can be differentiated using high-resolution metagenomics (HRM) (8,9). But each population of (clonally derived) cells contains de novo mutations segregating in only a fraction of the population: subclonal mutations. These are vital to evolution, as they are either lost or fixed within a population, depending partly on luck and partly on relevance to survival. Detecting these and linking them to their genome ("phasing") using reference-free metagenomics will require new approaches, a qualitative leap I refer to as ultra-resolution metagenomics (URM) (Fig. 1). In this commentary, I discuss potential use cases of URM and the technical challenges to overcome.

APPLICATIONS OF ULTRA-RESOLUTION METAGENOMICS
Contrary to intuitive assumptions, microbial evolution can be rapid: some gut bacteria adapt to antibiotic treatments within days (10), and plant commensals adapt on an "agriculturally relevant evolutionary timescale" (11). Pseudomonas aeruginosa typically loses dozens of genes during cystic fibrosis infections lasting years (12), Bacteroides fragilis adapts to its specific human host (13), and detecting parallel evolution in pathogens can be used to identify pathogenicity genes (14), as demonstrated using variants of metagenomic and isolate sequencing. These evolutionary processes can be quantified and predicted using population genetics, which depends on accurate genetic information. In cross-sectional or longitudinal intervention studies, we can develop a new holistic eco-evolutionary understanding of microbes under different external circumstances (reviewed in reference 15).
In medical applications, detecting subclonal mutations is vital to predict and avoid treatment resistances, as used in cancer genetics (16). Awareness of these mutations is also relevant in infections; e.g., HIV treatments are combinatorial to prevent subclonal resistance mutations from becoming prevalent, and a higher frequency of subclonal mutations was linked to prolonged severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections (17). Subclonal mutations further helped disentangle SARS-CoV-2 transmission chains among multiple hosts (17). Using ancient metagenomes, the divergence times of modern gut bacteria were predicted (18). In URM, much more closely related genomes could be used to time transfer events. Thus, a combination of mutation rates and de novo and subclonal single nucleotide variants (SNVs) obtained from URM could be used to trace "transfer chains" in hospitals, care homes, or production plants, reconstructing the routes and timing of bacterial colonization. However, this resolution is currently not available in metagenomics.

PERFORMANCE OF HIGH-RESOLUTION METAGENOMICS
Short read (e.g., Illumina) sequencing enables HRM through a combination of large read numbers and relatively good quality. Illumina reads typically have 10^-3 to 10^-4 errors/nucleotide, meaning that 1 in 1,000 to 1 in 10,000 nucleotides (nt) are wrong. In large sequencing runs with typically billions of bases, this accumulates to millions of reported errors. Dedicated SNV calling tools can correct some of these but are limited by available computational resources, nonrandom errors in Illumina reads, and mapping biases (19). Thus, sequencing error rate does not directly translate to metagenomic resolution.
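The arithmetic behind "millions of reported errors" can be sketched in a few lines of Python. This is only an illustration: the run size of 10 billion bases is a hypothetical value, and errors are assumed to be independent, which the text notes is not strictly true for Illumina reads.

```python
def expected_errors(total_bases: int, error_rate: float) -> float:
    """Expected number of erroneous base calls, assuming independent errors."""
    return total_bases * error_rate


# Hypothetical run size (assumption, not from the text): 10 billion bases.
run_bases = 10_000_000_000

for rate in (1e-3, 1e-4):
    errors = expected_errors(run_bases, rate)
    print(f"error rate {rate:g}/nt: about {errors:,.0f} erroneous bases")
```

At these rates, a run of this size would be expected to contain on the order of one to ten million erroneous base calls before any correction.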
The earliest HRM approaches were reference based; i.e., sequencing reads were mapped to reference genomes to identify SNVs per species and metagenome (20). However, reference-based metagenomics has several disadvantages, primarily that the majority of unknown microorganisms are not represented in reference databases (21,22). Further, mapping to a distantly related reference genome will bias and increase the SNV calling error (23). Last, the bacterial genome is very dynamic: our experiments showed that on average, only 72% of genes are shared among intraspecific strains from 155 bacterial species (24). This might also apply to gut bacteria: when we de novo reconstructed the genome of Parabacteroides distasonis, this genome was better [...] For the same species, two extremely similar strains were detected, occurring in cohousing family members. These strains were only 185 nt distant (99.996% average nucleotide identity [ANI]) but could be reliably assigned to dozens of time series samples from their respective human hosts. Thus, using de novo-reconstructed genomes, we could effectively resolve strains at <4 × 10^-4 errors/nucleotide (25). This workflow is implemented in our MATAFILER pipeline and was recently used to track 440 species in 5,278 metagenomes (9). To estimate the metagenomic resolution on real data, we compared the nucleotide sequence of >26,000 strains persisting between longitudinal samples. The average ANI was 99.98% (estimated error rate, 2 × 10^-3), although 55% of these strains were reconstructed at 100% ANI (9). This resolution compares favorably to other strain-level pipelines when evaluated on real metagenomes (see Fig. 2d in reference 26), exemplifying the resolution of current HRM.
HRM approaches typically use a "consensus" SNV calling approach, considering only the major (>0.5 frequency) alleles in a population. Typically, samples are excluded if ≥2 conspecific strains of a species are present. In contrast, HRM approaches such as STRONG phase SNVs to reconstruct strain-specific genomes, even from mixtures of conspecific strains co-occurring in the same samples, delineating strains at 99.95% ANI (5 × 10^-3 errors/nucleotide) (27). In a different strategy, inStrain (26) reports minor and major alleles indiscriminately, reporting high strain similarities in simulations (99.9998% ANI, 2 × 10^-5 errors/nucleotide) but also reporting sequencing errors.
Distinguishing true biological variation from sequencing errors is one of the biggest challenges in enabling URM.
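The difference between consensus calling and reporting all alleles can be illustrated with a minimal Python sketch. The function names and the pileup representation (a string of base calls at one genomic position) are hypothetical and chosen for illustration; this is not the implementation used by MATAFILER, STRONG, or inStrain.

```python
from collections import Counter


def consensus_allele(pileup: str, min_frequency: float = 0.5):
    """Return the major allele if its frequency exceeds min_frequency, else None."""
    counts = Counter(pileup.upper())
    if not counts:
        return None
    allele, count = counts.most_common(1)[0]
    return allele if count / len(pileup) > min_frequency else None


def minor_alleles(pileup: str) -> dict:
    """Return every non-major allele with its frequency (errors included)."""
    counts = Counter(pileup.upper())
    if not counts:
        return {}
    major = counts.most_common(1)[0][0]
    return {a: c / len(pileup) for a, c in counts.items() if a != major}


# Ten reads cover this position; one read carries a G.
pileup = "AAAAAAAAAG"
print(consensus_allele(pileup))  # major allele only
print(minor_alleles(pileup))     # the single G is kept, whether real or an error
```

The consensus call discards the single G, whereas the minor-allele report keeps it. From the counts alone, the G cannot be classified as a subclonal mutation or a sequencing error.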

ENABLING ULTRA-RESOLUTION METAGENOMICS
The bacterial mutation rate is estimated at between 10^-8 and 10^-5 mutations/nucleotide/year (28), and population genetics theory predicts that neutral mutations get fixed at a similar rate. Identifying bacterial de novo mutations fixed after 1 year would thus require a metagenomic resolution of ≪10^-5 errors/nucleotide. Detecting subclonal mutations at a minor allele frequency (MAF) of ≤0.5 would require a similar error rate but a much higher read coverage, approximated recently at >5× coverage for Illumina reads (26). Therefore, URM requires a combination of (i) increased sequencing coverage, (ii) read accuracy, (iii) phasing of SNVs, and (iv) using orthogonal genetic markers, as I briefly discuss below.
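To make these requirements concrete, the following Python sketch computes the expected number of new mutations per genome and the chance of sampling a minor allele at a given coverage. It rests on simplifying assumptions that are not from the text: a hypothetical 5-Mb genome, error-free reads, and reads sampled independently (a binomial model). The threshold of two supporting reads is likewise an arbitrary illustration.

```python
from math import comb


def expected_mutations(rate_per_nt_year: float, genome_length: int,
                       years: float = 1.0) -> float:
    """Expected number of new mutations per genome over the given time span."""
    return rate_per_nt_year * genome_length * years


def prob_minor_allele_seen(coverage: int, maf: float, min_reads: int = 2) -> float:
    """Binomial probability that at least min_reads reads carry the minor allele."""
    p_fewer = sum(
        comb(coverage, i) * maf**i * (1 - maf) ** (coverage - i)
        for i in range(min_reads)
    )
    return 1.0 - p_fewer


genome_length = 5_000_000  # hypothetical genome size (assumption)
for rate in (1e-8, 1e-5):
    n = expected_mutations(rate, genome_length)
    print(f"rate {rate:g}: about {n:g} mutations/genome/year")

for coverage in (5, 20, 100):
    p = prob_minor_allele_seen(coverage, maf=0.05)
    print(f"coverage {coverage}x, MAF 0.05: P(>=2 supporting reads) = {p:.3f}")
```

Under this model, the detection probability for a rare allele rises with coverage, but every supporting read must still be distinguished from a sequencing error, which is why coverage and read accuracy have to improve together.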
Increased read coverage is enabled through inexpensive, high-throughput sequencing. However, to cover more species from an ecosystem, a linear increase in sequencing depth is unlikely to cover its rare members, as species abundance curves in ecosystems follow a power law. We are therefore exploring alternatives to systematically increase the coverage (and thereby resolution) of underrepresented microbes. One of our bioinformatic solutions is to fine-tune strain delineation at ≥2× coverage (9). Experimentally, we demonstrated the potential of microfluidics to enrich particular microbes (29), while single-cell sequencing can increase read coverage of a single cell, allowing phasing of subclonal mutations.
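The diminishing return of deeper sequencing under a power-law abundance distribution can be sketched as follows. All parameters are hypothetical assumptions for illustration: a community of 1,000 species, a power-law exponent of 1.5, a uniform genome size of 5 Mb, and relative abundance interpreted as the fraction of sequenced bases.

```python
def power_law_abundances(n_species: int, exponent: float = 1.5) -> list:
    """Relative abundances proportional to rank**(-exponent), summing to 1."""
    weights = [rank ** -exponent for rank in range(1, n_species + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def species_coverage(total_bases: float, rel_abundance: float,
                     genome_length: int) -> float:
    """Mean read coverage of one genome given its share of sequenced bases."""
    return total_bases * rel_abundance / genome_length


def species_above_coverage(total_bases: float, abundances: list,
                           genome_length: int, min_coverage: float) -> int:
    """Number of species whose mean coverage reaches min_coverage."""
    return sum(
        1 for a in abundances
        if species_coverage(total_bases, a, genome_length) >= min_coverage
    )


abundances = power_law_abundances(1000)
for depth in (1e10, 1e11):
    n = species_above_coverage(depth, abundances, 5_000_000, min_coverage=2)
    print(f"{depth:.0e} bases sequenced: {n} species at >=2x coverage")
```

In this toy community, a 10-fold increase in sequencing depth yields clearly less than a 10-fold increase in the number of species reaching the coverage threshold. The exact ratio depends on the assumed exponent.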
Simply increasing sequencing coverage does not scale linearly to SNV calling accuracy, since PCR and Illumina sequencing errors are not random (30). Single-molecule sequencing like PacBio has the advantage of having random read errors, allowing for better error correction. This is implemented in PacBio HiFi sequencing, reaching under optimal conditions an error rate below 10^-5. Illumina read accuracy can be further increased to potential error rates of ≤10^-7 using molecular barcoding techniques (31). Bioinformatically, read error profiles can be generated to further increase accuracy, as used in DADA2 (6). In metagenomes this is not applicable, but instead, spiking artificial DNA into a sample can be used to calculate sample-specific error profiles.
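As a rough illustration of the spike-in idea, the Python sketch below tallies substitution rates from reads aligned to a spike-in of known sequence. The input format (ungapped alignments given as a start position and a read sequence) and the function name are assumptions made for this example; a real workflow would parse alignments from a mapper and also handle indels and base qualities.

```python
from collections import Counter


def substitution_profile(reference: str, aligned_reads) -> dict:
    """Estimate per-substitution error rates from reads on a known spike-in.

    aligned_reads: iterable of (start, read_sequence) ungapped alignments that
    lie fully within the reference (an assumption of this sketch).
    Returns {(ref_base, read_base): rate}, where rate is the number of
    mismatches divided by the number of times ref_base was sequenced.
    """
    mismatches = Counter()
    ref_base_totals = Counter()
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            ref_base = reference[start + offset]
            ref_base_totals[ref_base] += 1
            if base != ref_base:
                mismatches[(ref_base, base)] += 1
    return {
        pair: count / ref_base_totals[pair[0]]
        for pair, count in mismatches.items()
    }


# Toy spike-in and three reads; the second read ends in a T>A miscall.
spike_in = "ACGTACGT"
reads = [(0, "ACGT"), (4, "ACGA"), (0, "ACGTACGT")]
print(substitution_profile(spike_in, reads))
```

Because the spike-in sequence is known, every mismatch can be attributed to library preparation or sequencing rather than to biological variation, which is what makes the resulting profile sample specific.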
Phasing SNVs is possible with short read sequencing but inefficient when very similar genomes are co-occurring. In the URM context, other techniques need to be employed, such as single-cell sequencing or barcoding DNA molecules (32). However, using long read sequencing (Oxford Nanopore or PacBio) at high read accuracy might be the most straightforward approach.
Instead of SNVs, structural variants (SVs) of microbial genomes can be used to track bacterial evolution. Even short read data can reliably track small SVs such as indels and microsatellites, and the confidence in observing a specific, e.g., 5-bp, deletion across samples is significantly higher than repeatedly observing a single SNV. SVs can inform on the evolution of bacteria; for example, we used microsatellites to reconstruct the demography and spread of Mycobacterium tuberculosis (33). With long read technologies, the detection of larger inversions, insertions, and deletions will become more commonplace, allowing increased resolution of evolutionary processes.

CONCLUSION
Ultra-resolution metagenomics (URM), here defined by the ability to detect ultra-rare mutations within a microbial population, is the next logical step in metagenomics. For evolutionary and epidemiological science, this resolution will revolutionize our understanding of microbial ecosystems. Given the importance of this technology, the work of my group is focused on providing both experimental and bioinformatic solutions to get the most out of current HRM and future URM data. Given the rapidly increasing accuracy and throughput of long read sequencing, I predict that standard metagenomic sequencing data will be URM compatible within 5 years.