Optimizing insect metabarcoding using replicated mock communities

Metabarcoding (high‐throughput sequencing of marker gene amplicons) has emerged as a promising and cost‐effective method for characterizing insect community samples. Yet, the methodology varies greatly among studies and its performance has not been systematically evaluated to date. In particular, it is unclear how accurately metabarcoding can resolve species communities in terms of presence‐absence, abundance and biomass. Here we use mock community experiments and a simple probabilistic model to evaluate the effect of different DNA extraction protocols on metabarcoding performance. Specifically, we ask four questions: (Q1) How consistent are the recovered community profiles across replicate mock communities?; (Q2) How does the choice of lysis buffer affect the recovery of the original community?; (Q3) How are community estimates affected by differing lysis times and homogenization? and (Q4) Is it possible to obtain adequate species abundance estimates through the use of biological spike‐ins? We show that estimates are quite variable across community replicates. In general, a mild lysis protocol is better at reconstructing species lists and approximate counts, while homogenization is better at retrieving biomass composition. Small insects are more likely to be detected in lysates, while some tough species require homogenization to be detected. Results are less consistent across biological replicates for lysates than for homogenates. Some species are associated with strong PCR amplification bias, which complicates the reconstruction of species counts. Yet, with adequate spike‐in data, species abundance can be determined with roughly 40% standard error for homogenates, and with roughly 50% standard error for lysates, under ideal conditions. In the latter case, however, this often requires species‐specific reference data, while spike‐in data generalize better across species for homogenates. We conclude that a non‐destructive, mild lysis approach shows the highest promise for the presence/absence description of the community, while also allowing future morphological or molecular work on the material. However, homogenization protocols perform better for characterizing community composition, in particular in terms of biomass.


| INTRODUC TI ON
Insects are likely the most species-rich group of animals on Earth (Stork, 2018) and fulfil a myriad of different ecosystem functions (Jordan et al., 2021). Yet our knowledge about insect diversity in natural ecosystems is still surprisingly scarce -and the key reason for this deficit is their actual diversity. Recent reports on insect abundance and diversity declines (Hallmann et al., 2017;Seibold et al., 2019;van Klink et al., 2020) make monitoring more important than ever. Nonetheless, surveys based on traditional taxonomic work demand enormous time and labour. In one of the most ambitious projects aiming to characterize the insect fauna at the national level-the Swedish Malaise Trap Project-sorting 1919 samples into 300 taxonomic fractions took 15 years, despite the high manpower invested (estimated 98 K person-hours devoted to sorting; Karlsson et al., 2020). While taxonomic work remains crucial, there is a pressing need for the development of complementary high-throughput and cost-effective methods that enable adequate description of insect community diversity, composition and spatiotemporal dynamics.
In this context, the ever-decreasing costs of high-throughput sequencing (HTS) have made DNA-based methods particularly attractive for large-scale insect monitoring. A number of such methods have been proposed for and tested on bulk invertebrate samples (Kennedy et al., 2020). The method, perhaps, most widely used today is community-level metabarcoding. In this approach, DNA is first extracted from the whole multi-species sample, such as a Malaise trap catch. Afterwards, a short region of a marker gene, typically mitochondrial cytochrome oxidase I (COI), is amplified by PCR with broad-spectrum primers and subsequently sequenced. The resulting collection of marker gene sequences-barcodes-is compared against reference databases, providing information about the identities of species present in the original bulk sample. This rapid and affordable workflow has been used to describe arthropod communities (Cristescu, 2014;Elbrecht et al., 2017;Taberlet et al., 2012) and is increasingly proving useful in ecological studies (Beng et al., 2016;Liu et al., 2020;Porter et al., 2019). Nevertheless, despite its obvious advantages, metabarcoding is still a relatively young method and many challenges remain. Crucially, it is unclear how accurately metabarcoding can describe species diversity (presence/absence) and community composition (species' relative abundance)-and how its accuracy can be improved.
Research to date has pointed to multiple obstacles to obtaining accurate estimates of both species-specific incidence-adding up to species diversity-and abundance from metabarcoding. Differences in species' biomass and physical structure cause unequal contribution of DNA to the sequenced pool. Added to this, primer biases towards certain taxonomic groups can obscure estimates of species diversity and abundance, as demonstrated through metabarcoding of samples with known composition Elbrecht et al., 2019;Elbrecht & Leese, 2015;Martoni et al., 2022).
One of the solutions proposed to address these challenges is the introduction of known amounts of either tissue or DNA (so-called spike-ins; Ji et al., 2020;Luo et al., 2023;Thomas et al., 2016) to facilitate normalizing sequencing results between and within samples.
Many crucial considerations in metabarcoding, however, relate to the early stages of sample processing-before DNA extraction.
At present, the most common way of treating bulk samples is to homogenize specimens into an 'insect soup' (Liu et al., 2020;Morinière et al., 2016;Yu et al., 2012). This destructive method results in high DNA yields, but also in the loss of all morphological information-a discouraging outcome for taxonomists. Homogenization can also decrease the odds of detection of smaller and/or rare insects since large or abundant species will contribute much more DNA to the final pool and thereby swamp the DNA of smaller and/or rare species during sequencing. An alternative approach is then to use a non-destructive mild lysis, in which insects soak in the lysis buffer for a short period. The procedure leaves specimens almost intact-as demonstrated in a diverse arthropod sample by Kirse et al. (2023)-allowing future taxonomic work. The mild lysis approach has been developed and applied in freshwater and terrestrial arthropod studies (Batovska et al., 2021;Carew et al., 2018;Martins with roughly 40% standard error for homogenates, and with roughly 50% standard error for lysates, under ideal conditions. In the latter case, however, this often requires species-specific reference data, while spike-in data generalize better across species for homogenates. 4. We conclude that a non-destructive, mild lysis approach shows the highest promise for the presence/absence description of the community, while also allowing future morphological or molecular work on the material. However, homogenization protocols perform better for characterizing community composition, in particular in terms of biomass.

K E Y W O R D S
abundance estimation, homogenization, Malaise trap, metabarcoding, mock communities, nondestructive mild lysis, spike-ins et al., 2019;Nielsen et al., 2019), showing promising results in terms of evenness in species representation. However, critical methodological parameters such as the lysis buffer composition or lysis duration and conditions, have not been systematically investigateddespite their likely effects on the relative DNA yield from different species and, thus, on any estimates of community composition.
Arguably, the best way of evaluating methodological aspects of metabarcoding is their application to artificial communities of known composition, also called mock communities (Cristescu, 2014).
By controlling specimen numbers and approximate biomass, it is possible to obtain the reference that any metabarcoding results can be compared. Since metabarcoding is sensitive to multiple sources of variation, making sufficient replication is the key to obtaining robust results.
However, many studies to date have used wild-caught insects to construct mock communities (Batovska et al., 2021;Zizka et al., 2019), which precludes proper replication since each sample is unique. Others have based their approach on DNA rather than insect samples and constructed mock communities by blending species-specific DNA extracts or extracts from low-diversity samples into more complex blends Nielsen et al., 2019). While such approaches might be suitable for testing PCR and sequencing steps of the workflow, they depart significantly from realistic community sample processing procedures. We argue that well-replicated, relevant studies may best be obtained by constructing mock communities from size-standardized, homogeneous populations of different species. Such samples can be prepared in multiple copies, allowing us to evaluate the robustness of our inferences made from metabarcoding.
In this study, we use a set of individual-based mock communities as a powerful tool to systematically compare the effects of methodological choices on inferences from metabarcoding data. To scrutinize our ability to accurately describe insect communities in terms of species presence/absence, abundance and biomass, we use metabarcoding of replicate community samples to generate taxonomically assigned sequence yields, and a simple probabilistic model to analyse the results. We revisit the impact of different method-

| MATERIAL S AND ME THODS
Our study is built on three separate experiments, all using replicate mock communities constructed from different combinations of up to 25 insect species (Figure 1). In Experiment I, we compare different formulae for lysis buffer and assess if they affect the inferences of species presence. For Experiment II, we select the most promising buffer and then test a wider range of lysis times applied to larger and more diverse communities. In both Experiments I and II, we also assess how mild lysis and homogenization methods compare and how they affect our ability to accurately reconstruct the true composition of mock communities. Finally, in Experiment III, we use biological spike-ins to assess our ability to recover species abundance from bulk samples subjected to mild lysis, as well as homogenization.

| Constructing mock communities (Q1)
For constructing replicate insect assemblages of known composition, we used combinations of 31 standardized, reference insect species obtained from biocontrol and pet stores, hobbyists, laboratory cultures and social insect colonies (Table S1). All individuals within a species represented the same developmental stage and were sizestandardized. Insects were preserved in 90% ethanol and stored at −20°C until used. Species-specific dry weights were calculated as averages from 10 specimens. Communities were prepared by combining pre-set numbers of individuals from each species in a Falcon tube with ethanol. Every mock community was prepared in at least three identical copies. Each experiment used different community types, as explained below.

| Experiment I: Assessing the effects of buffer and lysis time/homogenization (Q2, Q3)
To establish the effect of different lysis approaches, we assembled three mock communities-S, M and L-consisting of 8-13 species represented by 11-25 individuals in total ( Figure 1, Table S2). Each community was prepared in six replicates: three were lysed with Buffer 1 (Vesterinen et al., 2016;modified from Aljanabi & Martinez, 1997) and the other three with Buffer 2 (arthropod-specific buffer from the Canadian Centre for DNA Barcoding; Ivanova et al., 2006). Communities were incubated in 5 mL of the buffer at 56°C, in a shaking water bath with one negative control per buffer. Fifty μL aliquots were taken after 2, 4 and 6 h of incubation, and then the samples were homogenized using bead beaters. DNA was purified from 20 μL aliquots of the lysate or homogenate. (For buffer recipes and details, see Text S1).

| Experiment II: Testing a wider range of conditions on more complex mock communities (Q3)
To further explore how mild lysis time influences community estimates and how homogenization and mild lysis compare, we expanded the range of lysis times and increased the size, number and complexity of mock communities. We used five distinct communities, comprising 20-25 species represented by 68-132 individuals. Each mock community was replicated three times ( Figure 1b, Table S3). The 15 mock community samples and one negative control were incubated with 20 mL of Buffer 1 at 56°C in a shaking water bath. Two-hundred-and-fifty μL aliquots of lysate were taken at six time points during the incubation (1, 2, 4, 6, 8 and 20 h). Afterwards, each sample was homogenized using a bead beater. DNA was purified using silica-coated magnetic beads from a 225 μL aliquot of each lysate and homogenate. (see details in Text S2).

| Experiment III: Estimating species abundance with the use of biological spike-ins (Q4)
To explore the utility of biological spike-ins in improving abundance estimations, we assembled 17 mock communities, using up to 12 species. Five species: Shelfordella lateralis (Blattodea: Blattidae), Drosophila hydei, Drosophila yakuba (both Diptera: Drosophilidae), Gryllus bimaculatus and Gryllodes sigillatus (both Orthoptera: Gryllidae) were used for calibration and we will henceforth refer to them as 'spike-ins'. Community type '0' comprised only spikeins and was prepared in nine replicates. The other 16 communities had a variable number of additional test species (1-18 individuals representing up to six species) and were prepared in three copies ( Figure 1c, Table S4). Communities and one negative control were subjected to mild lysis in Buffer 1 at 56°C for 4 h and then homogenized using bead beaters. DNA was purified as in Experiment I.

| Library preparation and bioinformatics
Amplicon libraries were prepared using a two-step PCR approach, as described by Marquina et al. (2021). We amplified a 458-bp region of the COI gene using BF3-BR2 primers  and sequenced the library pools on Illumina MiSeq using Reagent Kit v3 (2 × 300 bp reads). (For details see Text S3).

F I G U R E 1
Visual summary of three Experiments that form the basis of the current study. (a) In Experiment I, we tested two lysis buffers on three mock communities of different sizes (S, M and L: up to 25 individuals of 13 species). Communities were prepared in six replicates: three incubated with Buffer 1 and the other three with Buffer 2. We collected lysate aliquots at three time points before homogenizing them. (b) In Experiment II, we made five larger and more complex communities (up to 132 individuals of 25 species), preparing each in three replicates. All were incubated with Buffer 1, and six lysate aliquots were taken, followed by homogenization. (c) In Experiment III, we prepared nine identical communities made of five spike-in species and 16 different mock communities made of the spike-ins and a varying number of seven other species (6-23 individuals per community). Communities, prepared in triplicate, were lysed for 4 h and then homogenized.

| Data visualization and statistical analysis
To statistically analyse differences in inferred community composition depending on buffer and/or treatment (lysis time, homogenization) in Experiments I and II, we performed Permutational Multivariate Analysis of Variance (PERMANOVA) using Bray-Curtis dissimilarity distance for species relative read abundance and Jaccard distance for species presence/absence data with the use of adonis2() function of the vegAn package (Oksanen et al., 2020) and pairwise. adonis() of the pAirwiseADonis package (Martinez Arbizu, 2020). To visualize results, we performed nonmetric multidimensional scaling (nMDS) with the metaMDS() function of the vegAn package, based on Bray-Curtis distances. Relative read counts from Experiment I were plotted as a heatmap with the use of pheAtmAp package (Raivo, 2019) and those from Experiment II as stacked bar plots with the ggplot2 R package (Wickham, 2016).
For Experiment III, we developed a hierarchical Bayesian model that allowed us to estimate the variation across specimens and species in DNA yield and the variation across samples in PCR amplification factors from spike-in data. The model parameters learned were then used to estimate the abundance and biomass of the remaining species. As a basis for the model, let d tmij be the DNA yield, the amount of template DNA extracted from an insect specimen i of species t in sample j of community m and effectively available for PCR amplification. In the simplest case, we model d tmij using a universal gamma distribution where k is the shape parameter and θ the scale parameter.
We now introduce a PCR factor, c j , specific to each sample j. This factor represents how many times the DNA template is multiplied in sample j, taking into account amplification, subsampling, sequencing depth and bioinformatic processing steps after DNA extraction. The PCR factor only affects the scale parameter of the gamma distribution, while the number of specimens or biomass affects only the shape parameter. Specifically, given that there are n tm specimens, the read count r tmj (interpreted as a concentration) is distributed as Given suitable prior probability distributions on k, θ and c, these parameters can be inferred from observed read counts. Specifically, we used a gamma prior on k and lognormal priors on θ and c, with conjugate normal-inverse-gamma hyperpriors. For abundance estimates, we used a uniform prior to n. We explored variation across species in DNA yield and PCR amplification by introducing species-specific k and θ parameters, respectively. Models were implemented as probabilistic programs in Birch version 2.0.34 (Murray & Schön, 2018), and the inference was performed using sequential Monte Carlo.

| RE SULTS
After primer trimming and quality filtering, the sequencing yielded an average of 7573 (SD +/− 3869) COI reads per sample (Table S5).
We recovered the barcode sequences of all species making up mock communities but not in every sample where they were present.
Instead, the presence/absence and relative abundance of species varied substantially among mock community types, buffers and treatments (lysis time or homogenization). Furthermore, we observed differences among replicates of the same community. We explore these patterns by answering the methodological questions defined in the Introduction.

| [Q1] How consistent are the composition profiles across replicate mock communities?
In all Experiments, we observed considerable variation in the relative abundance of species and also in their inferred presence/absence. Certain species were consistently undetected in some of the treatments (patterns explored below, Q3). For others, the rate of false negatives seemed to vary haphazardly from replicate to replicate, especially after mild lysis (Figures 2 and 4). After homogenization estimates from replicates were becoming more similar to one another. The Bray-Curtis distances between replicates were considerably higher among samples treated by mild lysis (0.22 on average in Experiment II) than between homogenized ones (0.12). Overall, inferred community compositions were distinct from one another (Table S9; PERMANOVA, p.adjust < 0.001), but the estimates varied from replicate to replicate, despite them coming from the identical mock community.

| [Q2] How does the choice of lysis buffer affect community reconstruction?
In Experiment I, the choice of buffer significantly affected the reconstruction of species presence/absence patterns in Community S but not others (Table S6; PERMANOVA, p < 0.05 for S but p > 0.05 for M and L). However, when considering relative read counts d tmij ∼ Gamma(k, ), r tmj ∼ Gamma n tm k, c j .
we found a strong effect of the buffer used in all Communities (Table S6; PERMANOVA, p < 0.01). Overall, digestion with Buffer 1 resulted in substantially more reads from smaller species such as Aphidius colemani, Dacnusa sibirica, Encarsia formosa and Drosophila yakuba. With Buffer 2, these small species remained undetected in multiple replicates, whereas we scored higher relative read counts F I G U R E 2 Heatmap showing relative abundance of species in Experiment I. Mock communities (three types: S, M, L) were subjected to mild lysis in two buffers (B1 and B2), with aliquots taken after 2, 4 and 6 h, then homogenized (H). We used three biological replicates, identical copies, per community, per buffer (Rep. 1-3). DNA purified from the aliquots was then prepped for metabarcoding. Heatmap colours present relative read counts per experimental species recovered in metabarcoding. White tiles indicate 0 reads. Species absent from a given mock community by design have their name crossed out in the legend; note that there were no false positives.

F I G U R E 3
Visualization of non-metric multi-dimensional scaling (nMDS) based on metabarcoding results from Experiment I. Mock communities S, M and L are the same as in Figure 2. Different colours represent the two buffers. Homogenized samples are squares while mild lysis is circles with lysis times represented as opacity levels. The original community set-up in terms of the relative number of species and the relative biomass of the species are represented by an asterisk and crossed diamond, respectively. Stress factor for each ordination is given in the top right corner. Given the improved detection of small and rare species conveyed by Buffer 1, we chose this buffer for our subsequent experiments.

| [Q3] How are community estimates affected by different lysis times and homogenization?
We found no significant differences between samples lysed for different amounts of time (p.adjust>0.05; Tables S7 and S11).
However, close inspection of relative read counts revealed that the representation of several species was changing in a predictable manner, either consistently increasing or decreasing with the duration of lysis. For instance, in Experiment II, the relative abundance of A. colemani decreased over time, while that of D. yakuba was increasing ( Figure 4).
Treatments-mild lysis/homogenization-significantly altered the community composition profiles inferred. Species presence/absence differed in all communities from Experiment II (PERMANOVA, p < 0.01; Table S10) and in Communities S and L from Experiment I (PERMANOVA, p < 0.05; Table S6) but not M.
When considering relative read counts, treatment had a significant effect in all communities (PERMANOVA p < 0.01; Tables S7 and S10).
In more detail, we observed increasing or decreasing relative abundance for many species depending on the treatment (Figure 4).
The smallest insects tended to be more reliably detected after mild lysis, yielding few or no reads after homogenization. For instance, the smallest species in our set (the parasitic wasp E. formosa) was consistently represented by multiple reads after short lysis (1-4 h) in Communities I and II (Experiment II). However, its proportion of reads decreased rapidly with increasing lysis time and after homogenization, E. formosa was no longer detected in any replicate (Figure 4a,b). The same was true for other small insects such as A.
colemani, Aphidius ervi and Aphidoletes aphidimyza (Figure 4). For bigger insects (e.g. the lepidopteran Pieris napi) and harder-bodied insects (e.g. the mealworm beetle Tenebrio molitor), we observed the opposite tendency: species were detectable already after the shortest of lysis times (1 h), but absolute read counts increased

| [Q4] Is it possible to obtain adequate species abundance estimates through the use of biological spike-ins?
Of the five spike-in species used in Experiment III, D. hydei amplified poorly and was therefore excluded from further comparisons. Thus, the following results apply to the remaining spike-in species-S. lateralis, G. bimaculatus, G. sigillatus and D. yakuba. There was considerable variation among replicates in relative proportions of reads for different spike-ins, especially after the mild lysis ( Figure S2). Using the probabilistic model for lysate data, we found strong evidence in favour of models accommodating species-specific differences in DNA yield over models that did not, while there was no evidence for such a difference in homogenates ( Figure 6). Posterior estimates of model parameters illustrate the same results in more detail: there were distinct differences between species in average DNA yield per specimen in lysates but not in homogenates (Figure 7).
Estimates of coefficients of variation (CVs) can be used to pinpoint the level of consistency in read counts (the higher the CV, the lower consistency). Under the simple model that did not accommodate species-specific differences, the estimated CV for specimen numbers was around 0.80 for lysates and around 0.45 for homogenates (Figure 7c,d). In other words, the standard deviation (i.e. standard error, SE) was around 80% vs 45% of the mean, respectively.
If we did not use the spike-in data to learn the PCR factor of each F I G U R E 4 Relative species abundance in mock communities from Experiment II. Panels a-e present five mock communities (I-V) and each consists of three sections: (1) two bar plots showing the set-up composition of the given mock community in terms of numbers of individuals of each species (first barplot) and their relative biomass contributions (second barplot); (2) Barplots showing relative read counts obtained through metabarcoding for six lysis times (1 h-20 h) and homogenization for three biological replicates (Rep.1-3); (3) Panel above the bar plots is showing false negatives, where coloured fields indicate species that were not detected in metabarcoding even though they were present in the mock community. run, the CV increased to around 0.9 for lysates and 0.6 for homogenates ( Figure S3c,d; constant PCR factor). When both PCR factor and species-specific differences in DNA yield were accommodated, the CV decreased to 0.5 for three of the four spike-ins in the lysate data, while it remained high for D. yakuba, which was also the smallest of the four spike-ins (Figure 7). For the homogenate data, the CV estimates did not change when species-specific differences were accommodated.
When the spike-in data, fitted to a common model for all spike-in species, were used to estimate the abundance of the remaining species, we noticed clear species-specific effects: some species were consistently overestimated while others were consistently underes-

| Reproducibility of metabarcoding results
Previous studies have reported limited differences in the perception of community composition gained from technical replicates, for example, when using different DNA purification methods (Marquina et al., 2019;Nielsen et al., 2019) or when evaluating taxonomic bias F I G U R E 5 nMDS plot summing up COI reads for all samples in Experiment II. Five arthropod mock communities (I-V) in biological triplicates were subjected to mild lysis for 20 h, sampled six times throughout the experiment and subsequently homogenized. Lysis time of each sample is represented in the graph as opacity level. Different shapes inform about whether a sample was homogenized (square), lysed (circle) or if it represents the original community set-up in terms of the relative number of individuals contributing to the community (asterisk) or the relative biomass of the species (crossed diamond). due to PCR primers (Martoni et al., 2022). However, to our knowledge, the current study is the first to examine variation between biological replicates of multi-species communities. Our results reveal a surprising amount of variation among replicates of identical mock communities. It concerns both the relative read abundance and the presence/absence of species. It was particularly strong after mild lysis treatment but also persisted, to a lesser extent, after homogenization. Despite the variation, replicates of the same community were still distinguishable from those of other communities. But we should highlight that our mock communities were explicitly designed to be different.
It seems likely that this variation is largely due to natural variation in DNA yield among individuals of the same species. Recall that in our study, efforts were made to limit variation among individuals; for instance, we used only one life stage per species and excluded individuals that appeared poorly sclerotized, injured, particularly large or small, or otherwise unusual. Thus, one might expect even more variation among real wild-caught insects, where specimens differ in life stage, age, sex, health and physiological state. Adding to this, how samples are handled, transported and stored, will also contribute to variation in DNA yield of specimens, with the effects likely to vary among species (Marquina et al., 2021). Thus, this variation is inherent in metabarcoding procedures and will clearly limit the precision of any estimates of community composition.

| Trade-offs between protocols
The different outcomes of our metabarcoding experiments contribute to the understanding of the trade-offs involved. Generally speaking, a milder protocol (short lysis and mild buffer) results in more accurate reflection of the presence/absence of species and community profiles more closely resembling those based on the numbers of specimens per species. The more aggressive the protocol (i.e. buffer with stronger lytic properties, longer lysis, homogenization), the more reflective the results are of the biomass composition of the community, favouring bigger specimens.

F I G U R E 6
Estimated marginal likelihoods (log Z) for four models accommodating different sources of variation in the relative read numbers of the four spike-in species. Results for lysis data are in the top panels (a, b) and for homogenate data in the bottom (c, d). The left panels (a, c) show results for specimen composition, and the right panels (b, d) show results for biomass composition. A difference of five loglikelihood units is considered very strong evidence against the model with lower likelihood (Kass & Raftery, 1995). In lysates, there is strong evidence for species-specific differences in DNA yield (4 k over 1 k models), whereas this is not the case in homogenates. In neither case, there is strong evidence for species-specific PCR amplification biases in these four species (4θ over 1θ models).
Notably, the volume of lysate and homogenate aliquots used for DNA extraction and library preparation were the same across all treatments. Thus, the shifting representation of species in metabarcoding is reflecting the varying proportion of their DNA present in the starting aliquot. The likely explanation for this is that small insects have a high surface-to-volume ratio and tend to release relatively more DNA initially. With more aggressive treatments the DNA pool becomes more reflective of the insects' volumes. As a result, large species dominate in terms of available DNA and we fail to detect small species. This is demonstrated by the high rate of false negatives for some of the smallest species (A. aphidimyza, D. sibirica, E. formosa and Tuberculatus annulatus) in the homogenates of Experiment II. One solution to this issue, however costly, is adding PCR replicates, as they increase the probability of detecting small species (Alberdi et al., 2018).
On top of this, there are species-specific effects, the most obvious of which is the degree of sclerotization. For instance, Hermetia illucens larvae yielded little DNA in mild lysis treatments despite their large size. This may be due to their unusually hard integument (Ståhls et al., 2020). Similarly, the two ants (F. fusca and F. rufa) required homogenization to be detected at all. Still, their low representation even in homogenates suggests an additional possible PCR bias against them. Certainly, there are numerous other traits that influence how species respond to extraction protocols.
Overall, our findings are in line with the results of Marquina et al. (2023), who found that small and soft species were more difficult to detect after a chemically aggressive treatment (buffer with high SDS, DTT and proteinase K concentrations) than after lysis with the milder buffer, corresponding to Buffer 1 in the current study. They also agree with the results of Marquina et al. (2019), who showed that hard-bodied and large insects are more easily detected in homogenates, while analysis of preservative ethanol-an extremely mild lysis of sorts-is better for the detection of small and soft-bodied insects. However, it remains to be evaluated whether and under what conditions this pattern holds true for real terrestrial arthropod catches (but see Nielsen et al., 2019).

F I G U R E 7
Estimated posterior distributions of the mean DNA yield (a, b) and the coefficient of variance (CV) of DNA yield per specimen (c, d). Using two model variants: 'simple model' with no species-specific parameters (blue), and the model that allows the DNA yield of the four spike-in species to be different (remaining colours). The mean DNA yield varies among spike-ins in lysates (a); in particular, the yield of D. yakuba is lower than that of the other spike-in species. The DNA yield is more consistent across species in the homogenates (b). The CV was reduced to approximately 0.4 for three of the four species, but not for D. yakuba, in lysates (c). In homogenates (d), the CV was estimated to be around 0.45 in the simple model, and accommodating species-specific DNA yields did not improve estimation except for Gryllus bimaculatus.

| Abundance estimates from metabarcoding data
How well species abundance may be estimated through metabarcoding is currently a hotly debated topic. Early studies suggested that metabarcoding was unsuitable for quantification, but more recent efforts using correction factors or spike-ins of known concentration to calibrate metabarcoding-based estimates have reported encouraging results (Ershova et al., 2021;Luo et al., 2023). Nevertheless, our study offers some of the first quantitative analyses of the accuracy that might be achieved using such approaches.
The use of biological spike-ins is based on the expectation that parameters learned from spike-in data can be applied to the species occurring naturally in the sample, but several sources of variation may influence the results. First, the mean DNA yield per specimen is likely to differ dramatically among species. Second, specimens from a single species can vary substantially in DNA yield. Finally, PCR amplification may also vary considerably among species (Piñol et al., 2015(Piñol et al., , 2019. Indeed, in Experiment III, mean DNA yields varied considerably across our biological spike-in species in lysates, making it difficult to calibrate data for species other than those deliberately spiked-in. However, when applying a modelling approach and providing species-specific calibration data, it seems possible to bring down the error of abundance estimates for specimens to a CV of around 50% of the mean under ideal conditions (recall that we sizestandardized the spike-in specimens). The mean DNA yield in the homogenate data is more uniform across species, and the CV estimates are lower. Under ideal conditions, one might then be able to reach a CV of 45% when using homogenate data for spike-in species to calibrate biomass estimates for other species.
These results, however, do not apply to species with strong PCR amplification biases such as D. hydei and the stick insect (Phasmatodea)-with very low and high PCR factors, respectively.
Primer mismatches are usually thought to be the cause of such biases (Piñol et al., 2015(Piñol et al., , 2019, however, they are an unlikely culprit in this study. We used highly degenerate primers with broad F I G U R E 8 Predictions of species abundance in Experiment III for the lysis treatment (a, b) and homogenate treatment (c, d). Histograms (a, c) present specimen numbers (n) predictions for all species and samples combined. The x-axes show the difference between the true value of n and the maximum a posteriori (MAP) estimate. When the difference is zero, the model's MAP estimate is also the true value of n. In (b) and (d), examples of species-specific predictions are shown: three samples each for three species, one species that is well predicted, one species that is underestimated and one species that is overestimated. The black circles mark the mean of the estimates, the bars are standard deviation and the red circles mark the true value.
coverage-BF3-BR2 -providing a perfect match (of some of the primer variants) to all experimental species, including D. hydei. Additionally, the Qiagen Multiplex polymerase used has relatively little bias when simultaneously amplifying multiple templates in metabarcoding experiments, but factors such as GC contents within the priming site or template may still affect relative amplification efficiency (Nichols et al., 2018). It will be important to explore ways of reducing the bias, perhaps by adjusting the annealing temperature or other PCR reaction parameters. For instance, Yang et al. (2021) demonstrated an improvement in species diversity recovery from mock arthropod DNA mixtures when annealing temperature and number of PCR cycles were decreased. However, it remains to be tested on realistic insect bulk samples.
Overall, our data suggest that there is a natural limit to the precision of metabarcoding-based abundance estimates. Specifically, it appears difficult to bring the standard error of homogenatebased estimates much below 40%-45% of the mean when using biological spike-in references. These limits derive from natural variation between specimens, and stochastic processes in sample preparation. Nonetheless, the statistical method presented here can still be improved to increase the range of situations in which this natural limit can be reached. For instance, the error associated with variable DNA yield from biological spike-ins could possibly be reduced through synthetic spike-ins, such as linearized plasmids with artificial COI targets that are added in pre-defined quantities to homogenates (Palmer et al., 2018;Tourlousse et al., 2017).

| Towards the optimal protocol
Given the different advantages and disadvantages of lysis versus homogenization, it will be hard to strike the ideal compromise.
Detecting the smallest and rarest species in the sample will always be challenging-particularly for homogenate data, where they constitute a minuscule proportion of the total DNA. One approach, albeit destructive and costly, might be to metabarcode both the lysate and homogenate of the same sample. Such a combination would take advantage of the combined qualitative and quantitative strong suits of the two approaches.
With respect to the specifics of the protocol, our results suggest that intermediate lysis times, between 2 and 8 h, and using our Buffer 1 provide optimal, robust results. 1 h lysis appears to be suboptimal, with increased variation in species representation and lower overall DNA yield. Samples digested for 20 h start to resemble homogenates, with decreased representation of small species.
All in all, our current analysis provides key pointers for the design of an optimal protocol for processing bulk insect samples, with the specifics to be further attuned to local logistics and laboratory setup. We remain optimistic about the prospects of improved statistical modelling to unlock the resulting data's full potential. Ronquist led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.

CO N FLI C T O F I NTE R E S T S TATE M E NT
All authors declare that they have no conflicts of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
Raw sequence data for this project have been submitted to NCBI's SRA archive under accession number PRJNA885053.