Shotgun metagenomics of soil invertebrate communities reflects taxonomy, biomass, and reference genome properties

Abstract Metagenomics – shotgun sequencing of all DNA fragments from a community DNA extract – is routinely used to describe the composition, structure, and function of microorganism communities. Advances in DNA sequencing and the availability of genome databases increasingly allow the use of shotgun metagenomics on eukaryotic communities. Metagenomics offers major advances in the recovery of biomass relationships in a sample, in comparison to taxonomic marker gene‐based approaches (metabarcoding). However, little is known about the factors which influence metagenomics data from eukaryotic communities, such as differences among organism groups, the properties of reference genomes, and genome assemblies. We evaluated how shotgun metagenomics records composition and biomass in artificial soil invertebrate communities at different sequencing efforts. We generated mock communities of controlled biomass ratios from 28 species from all major soil mesofauna groups: mites, springtails, nematodes, tardigrades, and potworms. We shotgun sequenced these communities and taxonomically assigned them with a database of over 270 soil invertebrate genomes. We recovered over 95% of the species, and observed relatively high false‐positive detection rates. We found strong differences in reads assigned to different taxa, with some groups (e.g., springtails) consistently attracting more hits than others (e.g., enchytraeids). Original biomass could be predicted from read counts after considering these taxon‐specific differences. Species with larger genomes, and with more complete assemblies, consistently attracted more reads than species with smaller genomes. The GC content of the genome assemblies had no effect on the biomass–read relationships. Results were similar among different sequencing efforts. The results show considerable differences in taxon recovery and taxon specificity of biomass recovery from metagenomic sequence data. The properties of reference genomes and genome assemblies also influence biomass recovery, and they should be considered in metagenomic studies of eukaryotes. We show that low‐ and high‐sequencing efforts yield similar results, suggesting high cost‐efficiency of metagenomics for eukaryotic communities. We provide a brief roadmap for investigating factors which influence metagenomics‐based eukaryotic community reconstructions. Understanding these factors is timely as accessibility of DNA sequencing and momentum for reference genomes projects show a future where the taxonomic assignment of DNA from any community sample becomes a reality.

the factors which influence metagenomics data from eukaryotic communities, such as differences among organism groups, the properties of reference genomes, and genome assemblies.
• We evaluated how shotgun metagenomics records composition and biomass in artificial soil invertebrate communities at different sequencing efforts. We generated mock communities of controlled biomass ratios from 28 species from all major soil mesofauna groups: mites, springtails, nematodes, tardigrades, and potworms. We shotgun sequenced these communities and taxonomically assigned them with a database of over 270 soil invertebrate genomes.
• We recovered over 95% of the species, and observed relatively high falsepositive detection rates. We found strong differences in reads assigned to different taxa, with some groups (e.g., springtails) consistently attracting more hits than others (e.g., enchytraeids). Original biomass could be predicted from read counts after considering these taxon-specific differences. Species with larger genomes, and with more complete assemblies, consistently attracted more reads than species with smaller genomes. The GC content of the genome assemblies had no effect on the biomass-read relationships. Results were similar among different sequencing efforts.
• The results show considerable differences in taxon recovery and taxon specificity of biomass recovery from metagenomic sequence data. The properties of reference genomes and genome assemblies also influence biomass recovery,

| INTRODUC TI ON
Biodiversity research, and particularly the investigation of hardto-observe ecological communities, increasingly relies on DNAand RNA-based tools such as metabarcoding (Taberlet et al., 2018), metagenomics (Arribas et al., 2020), or metatranscriptomics (Cristescu, 2019). There are several preconditions to the use of these tools for generating datasets on ecological community composition: nucleotide sequence databases must exist (Hebert et al., 2003;Margaryan et al., 2021) with curated taxonomic links (Schenk et al., 2017) for taxonomic identification of DNA or RNA sequences. Laboratory experimental designs must also be robust, with excellent guidance already existing .
However, if preconditions are met, molecular tools can provide data on the composition and structure of ecological communities, even if they are made up of very small, diverse, and difficult to identify species. Molecular tools to monitor communities can be time and cost efficient when compared to conventional, observation-based studies, where species are morphologically identified and counted to document abundances (Serrana et al., 2019). This is especially the case when observed communities are species rich, and when many community samples need to be processed simultaneously (Bálint et al., 2018). This needs expertise on certain taxonomic groups, which makes it difficult for one researcher to acquire composition data. Molecular tools overcome this issue as whole communities, containing various taxonomic groups, can be identified at once, in many samples run in parallel (e.g., . There are two main approaches to the molecular biomonitoring of communities: metabarcoding and metagenomics. Metabarcoding uses high-throughput sequences of taxonomic marker genes ("barcodes") which are PCR amplified from a community DNA extract.
However, metabarcoding has an important long-known drawback: it relies on the amplification of a marker gene (Taberlet et al., 2012). This can result in biases in species recovery from the resulting sequence data: several species might be completely missed as false negatives if metabarcoding PCR primers poorly match binding sites in their genomes in a phenomenon known as PCR bias . Sometimes PCR bias is not sufficiently strong to completely miss species, but primer mismatch still causes a less efficient amplification compared to other species, resulting in distortions of the original biomass-sequencing read relationships for certain taxa (Piñol et al., 2019). However, the amplification step solves two important issues: one can effectively target the taxonomic groups of interest (e.g., insects) and avoid others (e.g., microorganisms), and small or rare organisms with low amounts of DNA can still be recorded. Metagenomics randomly sequences all DNA fragments from a community DNA extract, generally without enrichment of certain parts of the genome. It is more quantitative than metabarcoding, since it skips the potentially biased PCR amplification step of taxonomic marker genes (Bista et al., 2018), and consequently, may provide more detailed insights into the biomass ratios of different species (Peel et al., 2019). Biomass ratios are important for ecological studies as the importance of species in a community is often directly related to its abundance or biomass (Naeem et al., 2009). Biomonitoring schemes frequently rely on indices of environmental quality which are computed from species identities, and abundance or biomass ratios (Bennion & Battarbee, 2000/60/EC of the European Parliament and of the Council of 23 October 2000). However, metabarcoding can provide limited information on this given biases caused by the PCR step (Aird et al., 2011), and currently this limits its use in applied biomonitoring (Hering et al., 2018). In metagenomics, a random selection of DNA fragments is sequenced from the DNA extracts, resulting in a less biased representation of the community in the sequence data. The omission of the PCR step makes metagenomics lab work and they should be considered in metagenomic studies of eukaryotes. We show that low-and high-sequencing efforts yield similar results, suggesting high costefficiency of metagenomics for eukaryotic communities. We provide a brief roadmap for investigating factors which influence metagenomics-based eukaryotic community reconstructions. Understanding these factors is timely as accessibility of DNA sequencing and momentum for reference genomes projects show a future where the taxonomic assignment of DNA from any community sample becomes a reality.

K E Y W O R D S
biomonitoring, eukaryotes, genome completeness, genome size, invertebrates, shotgun metagenomics, species composition, taxonomic bias

T A X O N O M Y C L A S S I F I C A T I O N
Genomics conceptually and technically simpler. From metagenomic studies of microbial mock communities we know that several factors, such as taxonomic identity (Schiebelhut et al., 2017) or the genome properties of involved species (Beszteri et al., 2010), have an impact on biomass representation through metagenomic reads. However, these effects are so far not investigated in metagenomics studies of eukaryotes, at least to our best knowledge. The taxonomic assignment of metagenomic sequences needs genome databases, and consequently, metagenomics is more frequently applied on microbial communities, where more complete genomic resources are available (Parks et al., 2020). There are several approaches to circumvent this limitation, from mitogenomes (Arribas et al., 2020) to shallow genome sequencing (Bohmann et al., 2020). As genome sequencing technologies mature, the generation of reference genomes for all eukaryotes receives increasing attention (Lewin et al., 2018). However, the technical issues affecting metagenomics, such as species identification success, read-biomass relationships, the effects of different DNA extraction techniques, and the effects of reference genome properties used for taxonomic identification of metagenomic reads are much less investigated than issues affecting metabarcoding, at least for eukaryotes. Approaches to metagenomic read classification need also be evaluated for eukaryotes, since there are several algorithms available, and these algorithms can be adjusted to allow for more or less base dissimilarities among query and database sequences (Altschul et al., 1990;Wood et al., 2019).
Soil invertebrate communities are diverse, with high numbers and often high biomass of taxa (FAO, 2020). Most soil invertebrate species are very small, with body lengths below 1 mm (Orgiazzi et al., 2016). Despite their small size, invertebrates are important for soil health (Kibblewhite et al., 2008), and high biomass species strongly contribute to soil functioning (van den Hoogen et al., 2019). However, ecological work and biomonitoring of these communities are difficult: taxonomic identification needs very specialized expertise (Lehmitz & Decker, 2017). This makes morphology-based identification efforts unfeasible for the large sample numbers needed by most community ecology and biomonitoring efforts. The increasing availability of reference genomes (Lewin et al., 2022) makes metagenomics a promising approach to describe and monitor community composition and species biomass ratios of soil invertebrates. Here, we evaluate the performance of metagenomics in species identification in artificially composed (mock) communities of soil invertebrates. We also evaluate how well metagenomics reflects biomass ratios of species.
We use a large collection of soil invertebrate genomes to taxonomically assign metagenomic reads. We investigate the effects of metagenomic classification thresholds on correct and false identification. We evaluate the relationship between biomass and reads, and how this relationship is influenced by taxonomy and by the properties of the genome assemblies used for taxonomic assignments. Finally, we evaluate how different sequencing efforts influence metagenomics results, as this strongly influences the economics of eukaryotic metagenome sequencing.

| Mock community construction
We constructed mock communities from 28 soil invertebrate species from six major taxonomic groups at the Senckenberg Museum of Natural History Görlitz. Specimens were either freshly collected and stored in 96% undenatured ethanol (Collembola, Gamasida, and Oribatida), or they came from breeding cultures (Enchytraeidae, Nematoda, and Tardigrada). Mock communities were composed from individual animals and animal fragments (and not from DNA extracts). These individuals/fragments were placed into single test tubes per mock community to create bulk community samples. Four different mock types were designed ( Figure 1a, Table 1). We varied the total body volume (the sum of body volumes of all individuals of a species) across the four mock communities, meaning the total volume as well as species volumes differ in all four set-ups. The mocks contained species with very small body volumes (Nematoda, Tardigrada) and larger species (Collembola, Gamasida, and Oribatida).
Enchytraeidae represent the largest taxon in this study ( Table 1). We used body volume as a proxy of biomass, and refer to it as biomass throughout the text. In the first mock, all species were represented with equal biomass. In the second mock, each of the small species had two to five times more biomass compared to any large species.
In the third mock, a part of small species (7 of 11) had larger biomass (two to four times than any large species). In the fourth mock, most small species had more biomass than large species, but some large species also had high biomass. All four mock types were replicated three times: we attempted to reproduce the same biomass ratios among the species. This altogether gave us 12 mock communities.
We used different formulas for body volume approximation.
For Collembola, we estimated body volumes as ellipsoid volumes (V (µm³) = 1.33 × π × a × b × c × 10 −6 , where a, b, and c are axis lengths in µm). For Oribatida, Gamasida, and Enchytraeidae, we estimated body volumes as cylinder volumes (V (µm³) = π × L × r 2 × 10 −6 , where L is height and r is radius); for Tardigrada, V (µm³) = L × d 2 × 0.785 × 10 −6 (Hallas & Yeates, 1972); and for Nematoda, V (µm³) = L × d 2 × 0.577 × 10 −6 was used (Andrássy, 1956). We measured the sizes of all tardigrade, enchytraeid, springtail, and mite specimens and 20 nematode specimens used in the mock community construction to obtain average body size measures. We then combined sufficient numbers of specimens, considering variation in individual sizes to achieve the biomass ratio desired in the experimental design of the mock communities ( Figure 1a). to the other invertebrates, we used only body fragments, cutting off parts after measuring the lengths of the specimens. Tardigrades, collembolans, and mites were individually counted into the mock communities. In order to achieve the needed biomass of the respective mock type, differently sized individuals (adults and juveniles) were used. All mock community samples were stored in 2-ml Eppendorf tubes in 96% undenatured ethanol at −20°C until sequencing.

| Laboratory work and sequencing
We used the 12 bulk mock communities containing individuals/fragments of individuals of the 28 species for DNA extractions (four biomass ratios, each replicated three times, Table 1, Figure 1a).
Before performing the DNA extraction, ethanol was evaporated in a SpeedVac Concentrator Plus (Eppendorf) to avoid losing individuals/fragments. This is especially important for potentially floating Nematoda and Tardigrada specimens. DNA was extracted with DNeasy Blood and Tissue kit (Qiagen). DNA was extracted from bulk samples. Species and specimens were mixed into mock communities prior to lysis and extraction. Replicates were extracted separately.
We included a negative control into the extractions to investigate possible cross-sample contamination. This negative control was an extraction blank without tissue. We followed the Qiagen protocol except a few modifications. We crushed specimens with pistilles in 1.5-ml Eppendorf tubes. Before homogenizing (crushing) the tissue, we immediately added 30 µl ATL lysis buffer to inhibit the DNAse activity. Subsequently, 150 µl ATL lysis buffer and 20 µl protein kinase K were added. After vortexing and incubating (~3 h, 56°C), 20 µl RNase was added. The samples were then incubated overnight (37°C). We eluted with 50 µl AE buffer. Each resulting extract represents one replicate of the mock communities. DNA concentration was measured on NanoDrop (Thermo Fisher Scientific) and Qubit ™ with the dsDNA BR Assay kit (Thermo Fisher Scientific). We used both tools to double check concentration measurements. Fragment length was checked on TapeStation 2200 (Agilent Technologies). Libraries were prepared with the NEB Next ® Ultra ™ DNA Library Prep kit (New England Biolabs, Ipswich MA, USA) and sequenced on an Illumina NovoSeq 6000 PE150 platform at Novogene. Sequencing depth was 20 gigabase per mock community, and 1 gigabase for the negative control (2 × 150 bp, paired-end).
F I G U R E 1 (a) Ratios of species biomass and sequencing reads assigned to these species in the four mock communities types. (b) GLMpredicted effects of biomass, genome completeness, genome size, and repeat content on taxonomically assigned metagenomic reads. (c) Relative importance of GLM predictor variables

| Bioinformatics and data processing
Sequences were trimmed and quality checked with Autotrim v0.6.1 (Waldvogel et al., 2018). Autotrim relies on Trimmomatic TA B L E 1 Composition of mock communities. For species where different developmental stages were available, individuals of different sizes were used to achieve the necessary biomass [adults + juveniles, e.g., Paramacrobiotus richtersi in mock 1: 4 + 1]. Mock 1: all species have equal biomass; mock 2: small species have higher biomass; mock 3: some, but not all small species have higher biomass than large species; mock 4: some small and some large species both have higher biomass than other small and large species Taxon Mean body length (µm) Body volume (10 −6 µm 3 )

| Data analysis
Data analysis was conducted with R v3.6.1 in RStudio (v1.2.1335), with data formatted with tidyverse (Wickham et al., 2019). Graphs and plots were generated by using the package ggplot2 (Wickham, 2016 (Simão et al., 2015)), GC content, genome sizes, and repeat content. We estimated genome sizes with ModEst, a new method, which performs very well in comparison with flow cytometry measurements (Pfenninger et al., 2021). We estimated the repeat content of genomes with speciesspecific repeat libraries which were constructed using an automated RepeatModeler 2.0.1 pipeline with LTR Structural discovery pipeline activated (Flynn et al., 2020). For each genome, the resulting repeat libraries were merged with the RepBase 26.05 Arthropoda-specific section (Bao et al., 2015) and subsequently used for the annotation of repetitive elements with RepeatMasker 4.1.2-P1 (Smit et al., 2015). First, we performed a combinatorial model selection with MuMIn (Burnham & Anderson, 2003). The best performing model based on quasi-AIC scores can be written up as hits ~ biomass + taxon_group + missing_buscos + genome_size + repeat_content.
The final model was fitted with quasi-Poisson distribution to account for overdispersion. All predictors were scaled. Genome sizes were log-normalized before scaling. We evaluated the relative importance of the predictors by calculating model-specific variable importance scores in the R package vip (Greenwell & Boehmke, 2020).
We evaluated the correspondence between community composition captured by metagenomic reads and original biomass composition with redundancy analyses in vegan (Oksanen et al., 2019).
We tested metagenomic hit model statistical significance with an ANOVA-like permutation test for redundancy analysis (Legendre & Legendre, 2012).
We re-run read taxonomic identification after subsampling raw sequences to 100,000, 500,000, 1,000,000, 5,000,000, and 10,000,000 reads with seqtk (Li, 2012(Li, /2022. We compared results of data analysis done with the complete number of reads with results obtained after analyzing 100,000 reads. This allows to evaluate how strongly sequencing effort influences metagenomic results.

| RE SULTS
The sequencing resulted in ~69 million paired-end reads on average per mock community replicate, with a standard deviation of ~1.5 million reads (Figure 1a). Raw sequencing results are available on the European Nucleotide Archive (accession number: PRJEB45431).
About 10 million reads were recorded in the negative control. Only 4% of the reads from the negative control could be classified with the NCBI nt database: 1% is classified as Eukaryota, 1% as Bacteria, and the remaining 2% as viruses, Archaea, and others (Appendix S1, https://figsh are.com/artic les/figur e/Schmi dt_et_al_Suppl_File_ html/19711 684/1). Of the reads passing quality filtering in the mock communities, ~95 million were assigned to taxa at a 0.95 classification threshold ( Table 1) (Figure 1b,c). Genome size and repeat content were collinear (Pearson R 2 = 0.56, p < .001). Taxon groups were the most important predictors in the model (Figure 1c). Replicates of the four mock community types were statistically significantly grouped together in the redundancy analysis (df = 3, F = 3.863, p < .001, Figure 3).
Data analysis performed with 100,000 reads yielded very similar results ( Figures S1 and S2).

| DISCUSS ION
We performed a shotgun metagenomic experiment on soil invertebrate mock communities of known composition. We assigned metagenomic reads to a genome database of soil invertebrates. We investigated how metagenomic reads record the presence of taxa in the mocks, whether read numbers reflect biomass. We found that almost all species from the mocks could be identified with metagenomics. We also found that metagenomic reads reflect biomass ratios among the species, but taxonomy and reference genome and assembly properties must be considered during metagenomics read assignments.
Almost all species ( (Paula et al., 2016). However, most species used in these mocks are not predators. The predatory tardigrade P. richtersi was exclusively feeding on a nematode species which was also present in all the mock communities (A. nanus). The most likely explanation is related to aspects of the metagenomic read assignment. The first candidate is the assignment algorithm itself, although comparisons show that Kraken is conservative (Harbert, 2018). Assignment of reads to closely related taxa is an unlikely cause since 8 of the 12 false-positive species (at 0.95 classification threshold) had no genus-level relatives in the mocks. Assignment to genome regions highly conserved and thus similarity among species might also result in false positives. Unmasked repeats might also erroneously attract reads during the assignment. Eukaryotes are rich in low complexity regions, and cross-assignment of these regions might be a considerable source of false positives in all eukaryotic metagenomes (Clarke et al., 2018). The effects of repeat regions in eukaryotic metagenomic assignments should be evaluated, although repeat identification is not trivial, especially for understudied taxa (Clarke et al., 2019).
The relationship between sequencing reads and the initial biomass of organisms is a central topic in the DNA-based analysis of community composition. In theory, more shotgun metagenomic reads should be assigned to species which are represented with higher biomass in a sample. However, this relationship might still be influenced by other factors. Here, we investigated taxonomic effects, the impact of genome completeness, genome size, and GC content. We found that read counts were most strongly influenced by taxonomy, followed by genome size, genome completeness, biomass, and repeat content (Figure 1c). We found no statistically significant effects of GC content on read assignment, although this was expected based on previous results with bacterial metagenomes (Browne et al., 2020).
There were consistently more reads assigned to some taxonomic groups than to others (Figure 1b, Table 2). The impact of taxonomy on sequencing reads recovery seems to be systemic, with some species having many reads in all mocks, some species having only few reads (Figure 1a), and one species was even missed due to the stringent filtering ( Figure 2a). Species represented with low biomass in mocks were already found to result in false negatives in metagenomics (Bista et al., 2018), and A. striculus was indeed represented with a relatively low biomass in the mocks. However, low biomass alone does not explain the strong taxon effect on read assignment. We suspect that the most important cause for the strong taxon effects is likely caused by differences in DNA yields among different taxa (Sato et al., 2019;Schiebelhut et al., 2017;Tourlousse et al., 2021). Some taxa, e.g., oribatid mites, are very hardy, and their cuticles might present obstacles to tissue homogenization during DNA extractions. Indeed, the single falsenegative species was an oribatid mite. Cells of different taxa might react differently to extraction (Costea et al., 2017;Morgan et al., 2010), with some species consistently yielding lower-quality DNA in lower quantities (or no DNA at all) than others (Schiebelhut et al., 2017). However, differential DNA extraction efficiency does not explain why soft-bodied enchytraeids yield considerably less DNA than all other taxa ( Figure 1b). Differences in DNA content relative to body size (or biomass) might be responsible for this: some taxa may contain higher amounts of DNA per unit biomass than others. The association of DNA content with body size can be positive or negative depending on the organism group (Gregory, 2001).
Strong taxonomic effects on biomass-read relationships are interesting not only for metagenomic but also for metabarcoding studies. It is generally assumed that primer mismatch is the most important source of taxonomically biased biomass-read relationships in metabarcoding (Collins et al., 2019;Lamb et al., 2019;Piñol et al., 2019). Our results suggest that taxon-specific differences in DNA extraction efficiency and/or DNA content might also play a role in taxonomic bias. However, recognizing this bias is difficult in metabarcoding: both primer bias, and factors influencing extraction DNA yields are likely phylogenetically conserved. Parallel metabarcoding and metagenomic studies on the same mock communities are necessary to evaluate the relative importance of primer bias versus DNA yield in biomass-read relationships (e.g., Bista et al., 2018).
Despite considerable taxonomic effects, biomass was a statistically significant predictor of reads ( Figure 1b, Table 2). This is in line with other metagenomic mock community studies on multicellular eukaryotes, such as benthic invertebrates (Bista et al., 2018) and pollen samples (Peel et al., 2019). The biomass effect on reads, although considerably smaller than taxon effects (Figure 1c), was still sufficient to reflect compositional differences among the four mock types (Figure 3). This confirms the suitability of shotgun metagenomics for a semi-quantitative comparison of soil invertebrate communities.
We found that reference genome properties influence taxonomic assignments and read-biomass relationships, and that these need to be considered in metagenomic studies on eukaryotes. We showed that reference genomes size influences metagenomic assignments, with larger genomes attracting more reads than smaller genomes ( Figure 1b). This is known from microbial studies where it was shown that average genome size of a microbial community influences metagenomic results (Beszteri et al., 2010). Repeat content is considered to positively influence genome size of eukaryotes, at least in the range of genome sizes of species analyzed here (Novák et al., 2020). We found a weak negative, but statistically significant effect of repeat content on metagenomic assignments (Figure 1b,c, Table 2). Repeat content and genome sizes were collinear. This collinearity also suggests that repetitive element abundance and repeat family composition may act as "hidden variables" in metagenomic read assignment. The effects of repetitive elements should be evaluated with highly contiguous and complete genomes which allow for an unbiased identification of the repetitive / non-repetitive genome fractions. We found that genome completeness recorded as BUSCO scores may also influence metagenomic assignments, with more complete genomes attracting more reads. This suggests that reference genome assembly properties should also be considered in metagenomic assignments, even though previous findings show that even low-coverage reference genomes can perform well (Sarmashghi et al., 2019). GC content of genomes might also influence metagenomic assignments (Browne et al., 2020), although in our case this effect was limited. fraction of metagenomic reads might not be identifiable. We expect that this will rapidly change in the next years as biodiversity genomics initiatives cover more of eukaryotic diversity with reference quality genomes (Formenti et al., 2022;Lewin et al., 2022), and as phylogeny-based assignment approaches are rapidly developing (Asnicar et al., 2020). Real samples will also contain higher numbers of bacteria and fungi. This is specifically true for eDNA samples where most reads likely originate from bacteria and fungi (Fierer et al., 2012). This means that sequencing depths will need to be higher when dealing with real communities, especially for eDNA samples. Other factors also need to be evaluated, such as release of eDNA be different taxa, legacy effects due to longterm preservation of DNA in soils (Pedersen et al., 2015), or enzymatic inhibition and strategies for sample comparison (Hedman & Rådström, 2013). We are convinced that mock community experiments will remain important tools to understand these sources of variation.

| Roadmap for future metagenomics on metazoans
Our results outline a roadmap for future shotgun metagenomic work on metazoan mock communities. In the wet lab, DNA extraction needs to be optimized and likely adapted to taxa of inter- est. An important component of this is to ensure the best possible homogenization of bulk samples, probably under cryogenic conditions. This is particularly important in order to detect species which yield low amounts of DNA, since this may frequently happen in more species-rich natural communities. Differences in DNA content per unit biomass among and within major taxon groups should be evaluated and corrected for. In bioinformatics, assignment algorithms should be evaluated, adapted, and developed with eukaryotes in mind. The performance of distinct genomic regions (i.e., conventional marker genes, mitogenomes, coding regions, ultraconserved regions, and repeat elements) should be evaluated, especially with respect to false-positive detections.
Genome databases will likely remain incomplete for some time. An important direction is to evaluate how incomplete databases (i.e., databases not containing the target species, but congenerics or even less related species) perform in taxonomic assignments. It is also important to consider the completeness of the references genome assemblies, as more complete genomes will allow to assign more metagenomic reads. Genome completeness and genome size should be explicitly accounted for in analyses, for example, as predictors in generalized linear models.

| CON CLUS ION
Metagenomics is a promising alternative to metabarcoding also for eukaryotic communities, even at very low-sequencing efforts.
Although theory suggests that metagenomic reads should wellrepresent biomass relationships in communities, differences among organisms related to DNA extraction efficiency and genome properties have strong influences on the biomass-read relationships.
These effects need to be further investigated and quantified in parallel metabarcoding-metagenomic experiments. The effects of taxonomy, genome, and assembly properties should be considered in analyses. Generalized linear models provide an excellent opportunity for this. With affordable sequencing and increasingly accessible eukaryotic reference genomes, metagenomics is becoming a viable alternative to metabarcoding for describing community composition and structure.

ACK N OWLED G EM ENTS
The present study is a result of the LOEWE Centre for Translational Biodiversity Genomics (LOEWE-TBG) and it was funded through the program "LOEWE -Landes-Offensive zur Entwicklung