Comparative analysis reveals within-population genome size variation in a rotifer is driven by large genomic elements with highly abundant satellite DNA repeat elements

Eukaryotic genomes are known to display an enormous variation in size, but the evolutionary causes of this phenomenon are still poorly understood. To obtain mechanistic insights into such variation, previous studies have often employed comparative genomics approaches involving closely related species or geographically isolated populations within a species. Genome comparisons among individuals of the same population remained so far understudied—despite their great potential in providing a microevolutionary perspective to genome size evolution. The rotifer Brachionus asplanchnoidis represents one of the most extreme cases of within-population genome size variation among eukaryotes, displaying almost twofold variation within a geographic population. Here, we used a whole-genome sequencing approach to identify the underlying DNA sequence differences by assembling a high-quality reference genome draft for one individual of the population and aligning short reads of 15 individuals from the same geographic population including the reference individual. We identified several large, contiguous copy number variable regions (CNVs), up to megabases in size, which exhibited striking coverage differences among individuals, and whose coverage overall scaled with genome size. CNVs were of remarkably low complexity, being mainly composed of tandemly repeated satellite DNA with only a few interspersed genes or other sequences, and were characterized by a significantly elevated GC-content. CNV patterns in offspring of two parents with divergent genome size and CNV patterns in several individuals from an inbred line differing in genome size demonstrated inheritance and accumulation of CNVs across generations. By identifying the exact genomic elements that cause within-population genome size variation, our study paves the way for studying genome size evolution in contemporary populations rather than inferring patterns and processes a posteriori from species comparisons.


Background
The genomes of eukaryotic organisms display remarkable diversity in size, overall spanning approximately five orders of magnitude [1]. In addition, genome size may vary substantially among closely related species [2,3], within a species (e.g., [4][5][6]), and sometimes even within a population [7]. Most of the variation in genome size stems from differences in the proportion of various kinds of non-coding DNA and/or transposable elements, which can reach excessive levels in species with giant genomes [8,9]. Studying genome size variation at the DNA sequence level allows identification of exactly those genomic elements that make up for the genome size difference, and it can suggest the relative strength of mutation, selection, and drift-the underlying evolutionary forces ultimately causing divergence in genome size.
Much of our understanding of eukaryotic genome size variation comes from comparisons between closely related species. Recent studies suggested that the proliferation of repetitive elements (REs), in particular transposable elements, plays an important role in genome expansion, while their silencing or deletion has been implicated in the streamlining of genomes. In a few studies, it was possible to pinpoint individual REs as the driver of genome expansion [10,11], whereas in other studies, differently sized genomes were found to differ in several classes of REs [12,13]. In the latter case, it is difficult to decide whether multiple RE classes have expanded more or less simultaneously in evolutionary time, or whether the expansions of some REs have occurred after an initial genome size divergence driven by a single element. Without accurate dating of the expansions of individual elements, interspecific comparisons suffer from such a "blind spot" on the early stages of genome divergence. Ultimately, all genome size differences must have gone through a stage of intrapopulation variation followed by fixation, or loss, of these size variants. Thus, identifying genome size variants within populations and studying them on microevolutionary time scales may allow additional insights into the evolutionary dynamics of early genome divergences.
Intraspecific genome size variation (IGV) has been described in several species of eukaryotes (some examples are summarized in [7,14]), although studies on the level of a single, natural population (i.e., unaltered by, e.g., inbreeding) are rare [15]. IGV may be associated with variation in the number of chromosomes (e.g., Bchromosomes [14]), but there are also examples where IGV is not reflected in the karyotype [2,6]. One of the most recent additions to IGV model organisms was the monogonont rotifer Brachionus asplanchnoidis, which displays a nearly 2-fold variation in genome size even among individuals within a geographic population [7]. This level of variation is at the upper end of what has been found in other animals or plants (see Supplementary Table S4 in [7]). Monogonont rotifers are shortlived (1-2 weeks), small aquatic metazoans, only a few hundred micrometers in size, common in fresh and brackish water habitats throughout the world. They have a life cycle involving cyclical parthenogenesis [16] reproducing by ameiotic parthenogenesis for prolonged periods and inducing sexual reproduction occasionally. A "rotifer clone" consists of the asexual descendants of a single female that has hatched from a single resting egg, which itself is the product of sexual reproduction. In many Monogonont species, sex is triggered by crowding due to the accumulation of sex-inducing peptides released by the animals [17]. In lab cultures, it is possible to suppress sexual reproduction by frequent dilution intervals or large culture volumes or to induce sex in small culture volumes or through the use of media drawn from dense cultures. Thus, it is possible to either deliberately cross two rotifer clones sexually or to keep them clonally for hundreds of generations.
In the present study, we focus on a population of B. asplanchnoidis from Obere Halbjochlacke (OHJ), a shallow alkaline lake in Eastern Austria [7,18]. Individuals of this population can be crossed with each other-even if they substantially differ in genome size-and they will produce offspring with intermediate genome sizes close to the parental mean. Genome size can be artificially selected up or down with a heritability of 1 by breeding only individuals with large or small genome sizes. Genome size variation in this system is mediated by relatively large genomic elements (several megabases in size), which segregate independently from each other during meiosis. The smallest observed genome size in B. asplanchnoidis was 404Mb (2C, nuclear DNA content). Individuals at or close to this basal genome size are completely lacking independently segregating elements, while in individuals with larger genome sizes, genome size scales with the amount of independently segregating elements [7].
Here, we used a whole-genome sequencing approach to identify the DNA sequence differences responsible for intrapopulation genome size variation in this population of B. asplanchnoidis. Our specific goal was to identify and characterize genomic regions that are present in one or multiple copies in some individuals of the OHJ population but are missing in others. To this end, we assembled a highly contiguous draft genome of a reference clone using long-read (PacBio) technology and then mapped short reads of 15 different clones with genome sizes from 404 to 644 Mbp to this assembly. To identify copy number variations (CNVs), we scanned for regions of increased per-bp read coverage. To independently confirm CNVs, we used PCR to detect the presence/absence of selected CNVs across different clones of the OHJ population, and droplet digital PCR (ddPCR) to determine the exact copy numbers of one specific locus. Finally, we annotated genes and repetitive elements in the reference genome and compared CNV regions to non-variable regions of the genome.

Results
De novo assembly and annotation of the reference genome The rotifer clone (OHJ7i3n10) chosen for our reference genome derives from the natural isolate OHJ7 after three rounds of selfing (i.e., fertilizing sexual females by males of the same clone). As measured by flow cytometry [7], OHJ7i3n10 has a 2C-genome size of 568 Mbp, and thus contains approximately 40% excess DNA, compared to the smallest genome size of the OHJ population (~410 Mbp). The total length of our reference assembly was 230.1 Mb, with 455 contigs and an N50 value of 3.065 Mb (Fig. 1). The average GC-content was 30.5%. However, the GC-distribution was not unimodal but showed two major peaks at~25% and~35% GC and a minor one at~50% GC (Additional file 1: Figure S1).
Taxonomic partitioning of the polished genome assembly confirmed its purity. Most hits could correctly be assigned to rotifers and remaining hits assigned to mollusks and arthropods can mostly be explained by imbalanced availability of rotifer entries in the nt database (Additional file 1: Figure S2, Additional file 2). We observed that 90.6% of the metazoan BUSCO gene set collection was complete with low levels of duplicated (2.7%), fragmented (2.0%), and missing (7.4%) genes (Additional file 1: Table S2). A visualization of assembly contiguity and completeness was generated via assembly-stats [19] and is presented in Fig. 1. Proteincoding genes make up approx. 26% of the genome assembly length. In total, we annotated 16,667 genes with a median gene length of 1999 bp and approx. five exons per gene (Additional file 1: Table S3).

Comparison of short reads in 15 OHJ clones
To examine within-population genome size variation, we sequenced 29 short-read libraries from 15 different rotifer clones from the OHJ population (1-4 libraries per clone using different methods, described below). Nine clones were asexual descendants of individuals collected from the field, four (including the source of the reference genome) were each asexual descendants of the same three rounds of selfing of one of these clones, one was an asexual descendant from three rounds of selfing of a different clone, and one was an asexual descendant of a cross between clones derived from crossing two different selfed lineages from two natural isolates (Table 1,  Table S4). Raw reads were passed through multiple preprocessing steps that included quality trimming, removal of PCR duplicates and mitochondrial DNA, and removal of contaminant DNA. Overall, preprocessing reduced the total sequence amount from 265.6 to 194.3 Gbp, resulting in per-base sequencing coverage of 9.4-to 79-  Same clone as the reference genome 3 Hatched from an individual resting egg after three generations of selfing . The red, cyan, and blue lines designate a mixture model fitted to these data, consisting of three normally distributed subpopulations (26 ± 6, 36 ± 2.5, and 48 ± 3 %GC; means and SDs). b Panels show the results of the same mixture model applied to each library individually, thus estimating the proportion of the total reads per library in each GC-fraction. 2C-genome size estimates are based on flow cytometry and were taken from [7]. Colors in b correspond to the six library preps (A-F) listed in Table S4: A = orange, B = gold, C = green, D = turquoise, E = blue, and F = pink fold for the different libraries, with the majority of libraries being above 20-fold (Additional file 1: Supplementary results, Figure S3, Additional file 3). Pooling short reads from all libraries revealed the same three-peak pattern of GC-content apparent in the reference assembly, with GC maxima at 26%, 36%, and 48% (Fig. 2a). Since these three GC peaks are indicative of three discrete fractions among the genomic reads, we applied a mixture model to the short-read data, which allowed estimating the relative proportion of each fraction per library. Overall, there was substantial variation in the relative proportions of the three fractions, both among rotifer clones and between libraries of the same clone. Overall, the 26% GC-fraction was negatively correlated with genome size based on flow cytometry (FCM), and the 36% and 48% GC-fractions were positively correlated (Fig. 2b, Additional file 1: Table S5).
We also used two kmer-based tools, GenomeScope 2.0 [20] and findGSE [21], to obtain reference-free estimates of genome size for each clone/library. Those estimates were generally lower than their FCM-based counterparts, approximately 0.8-fold in findGSE and 0.6-fold in GenomeScope (Additional file 4). GenomeScope appeared to underperform at sequencing coverages below ca. 25-fold, where it estimated extremely low genome sizes (compared to FCM) and unrealistically high heterozygosities (6-10%). By contrast, findGSE performed consistently along the gradient of sequencing coverages. Overall, there was a positive correlation between genome size (FCM estimate) and the ratio of repeats, a fitted parameter of findGSE (Additional file 1: Figure S4; Pearson's r 27 = 0.53, p = 0.004).
Our assembly-based analyses rely on the alignment of cleaned reads of each of the 29 libraries to the reference genome. Total alignment rates (TARs) of reads to the reference genome draft were generally above 94%. TARs were highest in those rotifer clones that are most closely related to the reference genome (Additional file 1: Figure  S5). Concordant alignment rates, i.e., properly aligned reads with the correct insert size, were > 90% in all libraries (Additional file 1: Table S6). However, only about 1/3 of the reads aligned uniquely to one site in the genome. Only a small proportion of reads, usually well below 5%, showed discordant alignment, either due to incorrect insert size, or when only one of the mates aligning to the reference genome. Discordant alignment rates were generally low (1.3-5.1%) and were not correlated to genome size (r 27 = −0.124, p = 0.5).
To identify CNV regions, we calculated the average per-base coverage for 5-kbp and 50-kbp windows. To normalize coverage across libraries, we divided the perbase coverage of each window by 1/2 of the (mean) exon coverage of the respective library. This yields a value of 2 for all diploid regions of the genome, corresponding to two copies for those genomic regions (provided that our assembly is unphased in these regions). This analysis of coverage variation revealed that large tracts in the genome of B. asplanchnoidis display consistent patterns of coverage variation, which we quantified as the standard deviation of coverage across libraries and clones (Fig. 3). For example, the first contig (000F) exhibits large coverage variation with a standard deviation of 1.5-1.7 while the next three contigs (001F to 003F) have much less coverage variation (< 0.5 s.d.) and a mean coverage value of close to 2. There are several other contigs showing consistently elevated coverage variation, like 004F, 031F, 032F,036F, 042F, and 044F. In addition, some of these variable contigs appear to be more variable than others (e.g., 031F is more variable than 032F). These overall patterns were very similar when a lower 5-kbp window resolution was applied (Additional file 1: Figure S6).
Combining the values of coverage variation of all windows (n = 4380 for 50kbp, n = 45800 for 5kbp) reveals a multimodal distribution with a prominent peak located at low coverage variations of~0.15 (lowSD in Fig. 4a). This peak corresponds to the genomic sections that are colored in orange/red with a mean coverage of 2 in Fig.  3. At intermediate coverage variations (interSD; 0.7 < s.d. < 2.0), there appear to be at least two peaks, which correspond to the green and blue regions in Fig. 3, respectively. There are also a few windows showing high coverage variation (highSD; s.d. > 2.0), which form the right tail in Fig. 4a.
To test for an effect of these coverage variations on genome size, we calculated the mean coverage for each clone/library for these three categories of coverage variation (lowSD, interSD, highSD) and calculated their correlations with genome size. Notably, there were substantial differences in coverage at the interSD and high SD regions among the different libraries, even in some that had been prepared from the same rotifer clone (Fig. 4c). Thus, to control for the effect of library preparation, we calculated partial correlations between the variables "mean coverage" and "genome size" (Fig.  4c). Those correlations at "intermediate" and "high" variability were highly significant. This result holds even if the libraries with elevated coverage and GC-content at interSD and highSD regions (green symbols in Figs. 2b and 4c) are excluded (Additional file 1: Figure S7).
By merging adjacent 5-kbp windows that show increased coverage variation and consistent coverage pattern (significant correlation of coverages), we identified 509 CNV regions in the genome of B. asplanchnoidis (Additional file 5). The total genome space classified hitherto as "copy number variable" was 72.43Mbp (i.e., 31% of the genome assembly). Most CNV regions tended to be rather long, as can be seen by an "N50" of 0.455 Mbp for the CNV fraction of the genome. Two of the largest contigs, 000F and 004F, consisted almost entirely of 3-4 large CNV regions, which were separated only by short "breakpoints" of lower coverage variability (Additional file 5). Our 5-kbp window-scanning algorithm also identified CNVs that resided in contigs with otherwise "normal" coverage. In many cases, these CNV regions were near the beginning or end of a contig (e.g., 003F, 006F, 010F). Coverage values of clone IK1, a cross between OHJ7i3n2 and OHJ22i3n14, were usually intermediate between those of the two parental clones (Figures S8, S11, S12). In addition, four clones that were derived by selfing from clone OHJ7 displayed coverages that were largely consistent with their differences in genome size ( Figure S9). For instance, the clone with the largest genome of the selfed line, OHJ7i3n5, had an additional coverage peak at about 2.5 times the base coverage (set by IK1), which indicates that some CNV regions have significantly higher coverage than any other clone of this selfed line.
To additionally classify CNV regions according to their length and contiguity, we considered contigs as "B-contigs" if they contained a large fraction of CNV windows (in analogy to B-chromosomes). Setting this threshold at 90% of contig length, 38 contigs are classified as "B-contigs," comprising 77% of all CNV windows, i.e., 55.8 Mbp of the assembly (Additional file 1: Figures S10, S11 , Table S7). Thus, approximately three-quarters of the observed CNVs affect more or less an entire contig, while the remaining quarter of CNV regions were found on contigs with otherwise low coverage variability.
To independently confirm CNVs, we chose four genomic loci for PCR amplification, two in CNV regions and two in non-CNV regions (Additional file 1: Tables S8, S9). All four primer pairs yielded amplicons with the correct size, with no signs of non-specific amplification (Fig. 5a). The two primer pairs targeted to non-CNV regions (TA_001F and TA_003F) yielded an amplicon in all rotifer clones. In contrast, the two primer pairs targeted to CNV loci (TA_000F and TA_032F) amplified only in some clones. In particular, clones with the smallest genome sizes (OHJ82, OHJ22, and its descendent OHJ22i3n14) seem to lack both CNV loci, and in others Contigs are ordered by size rather than reflecting biological contiguity. Variation across clones/libraries is indicated by color (based on standard deviations). Standard deviations larger than 2.5 (about 4% of the data) were capped to a value of 2.5, to limit outliers affecting the color scale (OHJ 98, 104, 105) the TA_000F-locus was present, but the TA_032F-locus was absent. Overall, these patterns were highly consistent with coverage of the amplified regions in sequencing libraries (Fig. 5c). Copy numbers for TA_032, as estimated by ddPCR, ranged from zero to six across the studied rotifer clones, including 3 copies in IK1, a cross between OHJ7i3n2 (six copies) and OHJ22i3n14 (zero copies) (Fig. 5b, Table S8).
After having identified the CNV regions that contribute to intrapopulation genome size variation in the OHJ population, we annotated repetitive elements of these regions and compared them to the rest of the genome. A custom repeat library was created using RepeatMode-ler2, and the top-contributing TEs were curated. In total, 123 Mbp of the assembly (53.6%) was masked by this library. The highest contributing element (rotiSat2) accounts for just over 50 Mbp of this (Fig. 6a). The 36 most abundant repeats represent 67% of masked repeats and 82.6 Mbp of the assembly. Of the 36 highly contributing repeats, 7 were enriched in the interSD region, 12 in the highSD region, 5 in allCNVs, 6 in the B90 region, and 5 in the B95 region (Additional file 6). Overall, repetitive elements, and especially satDNAs, were over-represented in CNV regions (Fig. 6c, Additional file 7).
Of the satDNAs we identified in B. asplanchnoidis, three (rotiSat2, 8,9) are not present in any other sequenced Brachionus genome; three others (rotiSat1, 5, 10, 11) are shared with the Brachionus plicatilis genome. All but two other repeats in the topRE library were found in at least two other Brachionus genomes in varying levels. In addition, we identified a DNA/MITE element and an uncharacterized element in the Brachionus plicatilis genome that are not found in other Brachionus genomes.
CNV regions differed strongly from non-CNV regions by having a much lower gene density (Fig. 6c, Additional file 7). Phylogenetic orthology inference based on Mean coverage in regions of elevated coverage variation (interSD, highSD) correlates positively with genome size (flow cytometry data on genome size was taken from [7]). Dots in c represent the mean coverage per library. Colors indicate co-prepared libraries (of which some were prepared from the same rotifer clone, but at different dates and with different library preparation methods; see Table S4). Colors in c correspond to the six library preps (A-F) listed in Table S4: A = orange, B = gold, C = green, D = turquoise, E = blue, and F = pink. d This panel shows the same data as in c, but rotifer clones are categorized into small 2C genome sizes (<430Mbp, which represent the basal genome size of B. asplanchnoidis according to [7]) vs. large genome sizes (>430Mbp, which contain additional genomic elements that independently segregate during meiosis, according to [7]) proteomes (OrthoFinder analysis) of four species in the B. plicatilis species complex, with the bdelloid rotifer Adineta vaga as an outgroup, resulted in the assignment of 93,063 genes (90.9% of all annotated proteins) to 17,965 orthogroups. Fifty percent of genes were in orthogroups with 6 or more genes and were contained in the largest 5228 orthogroups. There were 3953 orthogroups represented in all species and 299 of these consisted entirely of single-copy genes. Many duplication events appear to be species-specific, with an especially high number of genes (5634, or 32% of all proteincoding genes) derived from gene duplication in B. asplanchnoidis. While gene density is significantly reduced within CNV regions (417 genes located within CNV regions = 2.39% of all annotated protein-coding genes, p < 0.001, Fig. 6c), a significant number of these genes derive from gene duplication events (385 genes = 92.33% of all genes located within CNVs, p < 0.001, Fig.  6c). The overall pattern of gene distribution thus shows that CNV regions almost exclusively contain gene copies.
GO enrichment analysis of genes throughout the B. asplanchnoidis genome that derived from duplication events identified 29 significantly enriched GO terms (Additional file 1: Table S10). When restricting the gene set to only those genes derived from a duplication event that were found within CNV regions, we identified eleven significantly enriched GO terms (Additional file 1: Table S11).
Throughout this study, we observed several conspicuous patterns related to GC-content. Regions of elevated coverage variability (i.e., interSD and highSD regions), CNV regions, and B-contigs were characterized by an elevated GC-content showing the main peak at~37% GC and two additional peaks at around 50% GC (Additional file 1: Figures S13, S14). By contrast, regions of low coverage variability had their main peak at~25% GC (Additional file 1: Figure S14). Those three peaks were also present in the GC-distributions of unaligned sequencing reads from rotifer clones varying in genome size (Fig. 2b). The 37% GC peak, which was the most prominent peak in genomic regions of elevated SD, CNVs, and B-contigs (Additional file 1: Figure S14), could be attributed mainly to the satellites rotiSat1 and rotiSat2, while the higher peak at~55% GC could be attributed to rotiSat5 (Additional file 1: Figure S15-S19). Overall, these three satellites are the most abundant repeat elements in the B. asplanchnoidis genome, and their consensus sequences show the same characteristically elevated GC-contents compared to most other repeat elements (Additional file 6). We also observed one minor but distinct peak at~48% GC in highSD regions, CNV regions, and B-contigs, which consisted of sequences that were not classified as repeats by repeatMo-deler2 (Additional file 1: Figure S15-S19).

Discussion
In this study, we provide a high-quality reference genome draft of the rotifer Brachionus asplanchnoidis with a total length of 230.1 Mbp. The 2C DNA content of the same rotifer clone is 568 Mbp, according to flowcytometry-based estimates of an earlier study [7]. This earlier study provided evidence that the genome of this particular clone consists of a core haploid genome of 207Mbp and four copies of a segregating 34-Mb element. Assuming that our 230.1-Mbp assembly is completely unphased, our reference genome thus amounts to 95% of the flow-cytometry-based estimate of the haploid genome (241Mbp).
Aligning short reads of 15 rotifer clones from the same geographic population to the reference genome revealed multiple long tracts along the reference genome with increased coverage variation across clones. Additionally, we found that the average coverage at CNV regions strongly correlates with genome size. CVN regions also carried a distinct signature in terms of an increased GCcontent (36% and 48%, respectively, versus 26% for the rest of the genome), which was both apparent in the short-read data and the genome assembly. Strikingly, many CNVs had near-zero coverage in three of the studied clones (OHJ82, OHJ22, OHJ22i3n14), which was independently confirmed by our PCR-based assays of two selected CNV regions. These results are highly consistent with previous evidence from flow cytometry experiments, showing that these three clones are characterized by a "basal" genome size (i.e., they are close to the smallest observed genome size in this species) and that they entirely lack the independently segregating genomic elements (ISEs) that are present in many other members of the OHJ population. Our results indicate the presence of multiple different ISEs in the OHJ population. For example, both our PCR and alignment data suggested that the two contigs 000F and 032F belong to two different ISEs because only the former was detectable in the clones OHJ98, OHJ104, and OHJ105 (Fig. 5). This observation is consistent with an earlier study, which suggested such diversity based on the size of ISEs measured by flow cytometry. CNV regions account for megabase-long tracts in the genome of B. asplanchnoidis. Several contigs displayed highly similar coverage patterns across almost their entire length if one ignores the few and very short breakpoints. Such contigs might be fragments of even larger elements, perhaps B-chromosomes. Contig 032F might be a good candidate for this, ranging from zero to six copies across the OHJ population, as indicated by ddPCR. We also detected large CNVs, hundreds of kilobases in length, that were located on contigs with otherwise normal (diploid) coverage and low coverage variation (e.g., contig 006F in Fig. 6d). Such a genomic pattern is consistent with a stable diploid chromosome that contains a large homozygous insertion in some clones, hemizygous insertions in others, and a homozygous deletion in the remaining clones. Additionally, there might be length polymorphisms in such genomic regions, which are dominated by tandemly repeated satellite DNA. In the future, assembling multiple genomes from rotifer clones with different genome sizes, ideally using long-read technologies to approach chromosomelevel assemblies [22], might allow a more precise delineation of individual CNVs into these two categories. Overall, our data is consistent with a mixture of Bchromosomes and large insertions into normal chromosomes, and possibly a dynamic exchange between both genomic fractions (since they are made up mostly by the same set of tandemly repeated satellite DNA; see below). Interestingly, GO enrichment analysis of genes that derived from a duplication event identified one term, GO: 0015074 (DNA integration), as the most significant term (p < 1e−30), which might indicate elevated transposon activity during the early evolution of the B. asplanchnoidis genome.
There are a few technical caveats and limitations to be considered in our analysis. First, although we found strong correlations between coverage variation at CNV regions and genome size variation, it was not possible to quantitatively "predict" the genome size of individual clones based on coverage along the reference assembly. The library preparation method seemed to introduce additional variation, specifically at the CNV regions, that prevented us from determining exact copy numbers of these genomic regions. This is a well-known limitation of short-read libraries from genomes that contain large amounts of satellite DNA [23], in particular when combined with heterogeneous GC-contents [24]. We could alleviate this limitation and determine exact copy numbers by using ddPCR, finding that the targeted genomic region (contig 032F) is present in zero to six copies across the OHJ population. This approach is a promising strategy for future studies to accurately assess copy number variation across many loci and in a large number of genomes from the same population. In addition, the differences in copy numbers at CNV regions could be estimated by comparing clones within a set of library preparations. For instance, if coverage at CNV regions is scaled by a reference clone (like clone "IK1" in Figures  S8, S9), coverage at CNV regions of focal clones tends to fall into discrete clusters, indicating n-fold differences in coverage relative to the reference. A second limitation is that our reference genome might still not contain all the ISEs that contribute to genome size variation in the natural OHJ population, simply because it represents just a sample from that population. To obtain a more complete picture, more genomes of the OHJ population would be needed to be sequenced, preferably using longread sequencing technologies that allow better identification of structural variants [25]. Third, one might argue that our CNV detection pipeline could miss many of the smaller insertions and deletions, in particular those smaller than 5 kb. However, it is quite unlikely that such small-scale structural variation has much influence on genome size variation in B. asplanchnoidis, since the percentage of discordantly aligned reads was overall rather low (1.3-5.1%) and it did not scale with genome size. With our reference genome being at the upper end of the genome size distribution of the OHJ population, we would expect to find a higher percentage of discordant reads in smaller genomes (mainly due to deletions), which was not the case.
CNV regions differed strongly from the remaining genome regarding repeat composition and gene density. Strikingly, most CNV regions were composed of only three satellite repeat elements, some unique to B. asplanchnoidis and others shared only with its closest congener, B. plicatilis, indicating recent evolutionary origin. The two most prominent satellite repeats, rotiSat1 and rotiSat2, consisted of 154-and 143-bp monomers that were arranged as tandem repeats of a length of up to a few megabases. The low sequence diversity of CNV regions explains the characteristic trimodal GC-content signature mentioned earlier, especially since the most abundant satellite elements display the same elevated GC-content (Additional file 6). Non-CNV parts of the genome contained a much higher diversity of other repeat elements, which included DNA transposons, LINEs, LTR elements, and Helitrons. Gene density was significantly lower in CNV regions compared to the rest of the genome, and genic regions in CNVs were typically confined to short stretches scattered across the contig. Interestingly, Orthofinder suggested that genes in CNV regions were three times more likely to derive from a duplication event than genes found in other places of the genome. This indicates that duplications had a role in the origin of these CNV regions, which incidentally has some resemblance to the proposed early evolution of B-chromosomes [26,27].

Conclusions
In this study, we have identified, for the first time, genomic elements that can cause substantial withinpopulation genome size variation, ultimately allowing investigation of genome size evolution at microevolutionary time scales. In the OHJ population of B. asplanchnoidis, these genomic elements consist of up to megabases-long arrays of satellite DNA, with only a few interspersed genes or other sequences. Though satellite arrays can form essential chromosome structures such as centromeres and telomeres [28], the large intrapopulation variation of these elements in B. asplanchnoidis, and their virtual absence in some individuals, suggests that, overall, these DNA additions do not provide an immediate fitness benefit to their carriers. Nevertheless, variable amounts of "bulk DNA" might influence the phenotype through subtler mechanisms independently of the DNA information content, for example through potential causal relationships between genome size and nucleus size, or cell size [15,29,30]. Increased levels of structural variation in a genome, as implied by our findings, may also constrain adaptive evolution or genome stability over microevolutionary time scales. In this regard, the genome of B. asplanchnoidis should be a valuable addition to existing models of genome evolution, enabling whole-genome analysis combined with experimental evolution approaches [31,32] to disentangle the contributions of selection and drift to phenotypic change (and potentially concomitant changes in the genome size distribution) of evolving populations.

Origin of rotifer clones and DNA extraction
Resting eggs of rotifers were collected in the field from Obere Halbjochlacke (OHJ), a small alkaline playa lake in Eastern Austria (N 47°47′ 11″, E 16°50′ 31″). The hatchlings of these resting eggs were used to found clonal lines. Since resting eggs are produced sexually in monogonont rotifers, each clone has a unique genotype. Our clones from the OHJ population have been characterized previously concerning their genome size and other biological traits [7].
Rotifers were cultured in F/2 medium [33] at 16 ppt salinity and with Tetraselmis suecica algae as the food source (500-1000 cells μl −1 ). Continuous illumination was provided with daylight LED lamps (SunStrip, Econlux) at 30-40 μmol quanta m −2 s −1 for rotifers, and 200 μmol quanta m −2 s −1 for algae. Stock cultures were kept either at 18°C, re-inoculated once per week by transferring 20 asexual females to 20-mL fresh culture medium, or they were kept for long-term storage at 9°C, replacing approximately 80% of the medium with fresh food suspension every 4 weeks.
To produce biomass for DNA extraction, rotifers were cultured at 23°C in aerated borosilicate glass containers of variable size (250mL to 20L). Before DNA extraction, rotifers were starved overnight in sterile-filtered F/2 medium, with 2-3 additional washes of the sterile medium on the next day. In most preparations, we also added the antibiotics streptomycin (Sigma-Aldrich: S6501) and ampicillin (Sigma-Aldrich: A9518) to the washing medium, both with an end concentration of 50mg/mL. DNA was extracted using the Qiagen kits Dneasy (for short-read sequencing; from approximately 5000-7000 rotifers) and GenomicTips 100 (for longread Pacbio sequencing; from >20,000 rotifers), and RNA was extracted from freshly prepared biomass with Rneasy.

Sequencing of the reference clone
We selected one rotifer clone (called: OHJ7i3n10) as DNA donor for the reference genome This clone ultimately derives from the natural population since it is a descendant of clone OHJ7, which was hatched from sediments of Obere Halbjochlacke. However, its immediate ancestors were passed through three generations of selfing (i.e., mating one male and female of the same clone). More details on the genealogy of this lineage and its biological characteristics can be found in [7]. Using long-read sequencing technology (PacBio SMRT® on the Sequel-platform), we obtained 16.3 Gbp from two SMRTcells, which corresponds to a 57-fold coverage assuming a haploid genome size of 284 Mbp. Additionally, we obtained 35.5 Gbp of Iso-Seq transcriptome data and 12 Gbp of short-read Illumina data of OHJ7i3n10. All sequencing and library preps related to the reference genome were performed by the Next Generation Sequencing Facility at Vienna BioCenter Core Facilities (VBCF), a member of the Vienna BioCenter (VBC), Austria.

Sequencing of individuals of the OHJ population
To characterize genomic variation across the OHJ population, we generated short-read sequencing data (Illumina platforms HiSeq and NextSeq) from 15 different rotifer clones, both from the natural OHJ population and from various clones of a selfed lineage (Additional file 1: Table S4). Clones were selected such that they covered the full range of genome size of the OHJ population. Short-read libraries were constructed using either the KAPA HyperPrep kit (Roche) or the NGS DNA Library Prep Kit (Westburg). The KAPA library preparations were done at the Marine Biological Laboratory [for more details, see [12]] while the Westburg preparations were done at VBCF. The two methods mainly differ in the fragmentation method (ultrasonic fragmentation in KAPA vs. enzymatic fragmentation in Westburg). Both methods are claimed to deliver sequence-independent fragmentation and to yield consistent coverage across a wide range of GC-contents. While the peak fragment size was~400bp in both methods, we observed that the libraries constructed with the Westburg kit sometimes had a pronounced right tail with some fragments up tõ 2000bp. We accounted for these larger fragments by adjusting the relevant parameters during short-read alignment (see below). In many of our clones, we used both library construction methods, yielding a total of 29 libraries (Additional file 1: Table S4).

Reference genome assembly and annotation
Pacbio sequences of the reference genome were assembled using the HGAP4 pipeline [34] at VBCF, and contamination was initially checked using CLARK [35] against all available bacterial genomes from NCBI (Additional file 2). We polished this initial VBCF assembly with short-read Illumina data of the identical B. asplanchnoidis clone using Pilon [36] in three rounds. To investigate the assembly quality, we backmapped the Illumina data to the genome assembly using bwa mem [37] and calculated summary statistics using QUAST [38] and Qualimap [39]. To provide an additional check for contamination, we used Blobtools [40] based on the backmapping alignments and a blastn search against the nt database (NCBI).To assess the assembly's completeness, we performed a BUSCO v4.0.6 [41] using the metazoan gene set (n = 978) in the genome mode applying the -long option.
Genes were structurally annotated on the repeat masked genome assembly using the MAKER2 annotation pipeline (v2.31.10, [42]) using evidence from Pacbio Iso-Seq transcripts of the same B. asplanchnoidis clone, protein homology evidence from the UniProt database (download January 2020, The UniProt Consortium 2017) in combination with proteoms of different clones and closely related Brachionus species (Additional file 1: Table S1), and ab initio gene predictions from SNAP [43], GeneMark-ES (v4.48_3.60_lic, [44]), and Augustus (v3.3.3, [45]). The SNAP model was initially trained on the genome assembly with the additional support of BUSCO complete hits (see above). GeneMark was computed in the ES suite on the soft masked genome assembly. The ab initio training of the Augustus model was computed on the genome assembly supported by the unassembled Iso-Seq transcripts. To compensate for underrepresented rotifer protein representation in the UniProt database, we combined this dataset with proteomes of another B. asplanchnoidis clone and four closely related Brachionus species (Table S1). The completeness of these proteomes was checked with BUSCO in the protein mode and missing orthologs were compared to determine if completeness could be increased through the combination of proteomes. The combined proteome finally contained 97% of the BUSCO genes of the metazoan dataset.
MAKER was run over three rounds. For the first round of MAKER, we only used the Iso-Seq data as EST evidence and the combined UniProt and proteome sequences as protein homology evidence applying default parameters and est2genome=1, protein2genome=1 to infer gene predictions directly from the transcripts and protein sequences. We used the gene prediction of round 1 to retrain the Augustus and SNAP model since the GeneMark model was exclusively computed on the genome assembly and did not require retraining. Maker was run in the second round using the retrained models and again est2genome=1, protein2genome=1 to increase prediction fidelity. After a second round of retraining the Augustus and SNAP model, MAKER was run through round 3 with switched off est2genome and pro-tein2genome inference. After structural annotation via MAKER, we functionally annotated the genes using functional classification of genes from InterProScan [46] in combination with putative gene names derived from Swissprot [47].
To examine orthologous genes among rotifer species and identify gene duplication events, we used Orthofinder [48]. This analysis was based on proteome information of the Brachionus plicatilis species complex: B. rotundiformis, B. sp. "Tiscar," and B. plicatilis [respectively "Italy2," "TiscarSM28," and "Tokyo1" in [12]], and B. asplanchniodis (annotation of this study). Proteome information of Adineta vaga [49] was included as an outgroup. We extracted the longest transcript variant per gene to avoid duplicates in the input proteomes and followed the manual instructions of Orthofinder. To analyze the genomic distribution of genes that were identified to derive from gene duplication events, we extracted all genes of the B. asplanchniodis node (Orthofinder: SpeciesTree_Gene_Duplications) and removed duplicates to create a non-redundant list of genes. Comparing this list of genes to genes that are located inside or outside of genomic regions with high levels of CNVs allowed the estimation of non-random gene distribution patterns via Monte Carlo permutation tests (1000 permutations).
To characterize the putative functional properties of genes derived from a duplication event, we performed gene ontology (GO) enrichment analyses on the set of all multiple copy genes (n = 5634) and the more exclusive set of duplicated genes found within CNV regions (n = 385). The reference list consisted of 10,083 proteincoding genes (60% of all annotated genes) with GO annotation via INTERPROSCAN (see above). Enrichment analysis was done with the topGO R package (v2.24.0, [50]) in the category "biological process" using the weight01 algorithm and Fisher statistics (significance level p < 0.05).

Annotation of repetitive elements
The repeat library was produced using RepeatModeler2 (default settings, [51]) and the polished PacBio assembly. This produced 484 consensus sequences, 289 of which are "Unknown." These consensus sequences were then used to mask the genome assembly using RepeatMasker with default settings. From this, we identified the top contributing repeat elements and used their consensus sequences to blast queries against the genome assembly, and the top hits for each consensus were used to produce alignments for manual curation [52] and classification [53]. Satellite monomers were identified using Tandem Repeat Finder [54]. Consensus sequences of the satDNAs are dimers of these identified monomers.
Contributions per repeat were estimated by summing up the total length covered by each copy in the Repeat-Masker output file. Top contributions were calculated over the whole assembly, and over regions of the genome with distinct coverage variability (lowSD, interSD, and highSD in Fig. 4). For each region, the top 20 repeats were included, resulting in a total of 38 repeats to curate. Curation was done by manually inspecting each alignment, identifying the ends of the aligning regions, and producing a new consensus sequence. Additionally, classification was performed by searching for TSDs, TIRs, LTRs, satellite structure, and conserved domains. RepBase searches were rarely constructive since rotifer, and especially monogonont, TEs are not wellrepresented in any databases. The final curated library of topREs contains 37 consensus sequences from 36 elements (one sequence was removed because alignments were to only scattered AT-rich regions, one was an rRNA gene, and one LTR sequence was split into the LTR and internal portions). Redundant sequences between the topRE library and the RepeatModeler library were identified using RepeatMasker. Sequences that were at least 95% identical and covered at least 80% of the uncurated repeat were removed from the uncurated repeats before merging the libraries. Due to the short length and difficulties in automatically creating consensus sequences for satDNA elements, all uncurated elements that were identified as similar by RepeatMasker, regardless of similarity or length of hit, were aligned to the curated consensus sequence and visually inspected for alignment. Uncurated elements that aligned across most of the curated satDNA consensus sequences were removed. The combined library (Additional file 8) of curated and uncurated elements contained 472 elements (262 unknowns) and was used to mask the B. asplanchnoidis genome and to repeat-mask related, high-quality Brachionus genome assemblies to identify shared elements [55][56][57]. For each of the top contributing elements and each genomic region, an enrichment index was calculated as the proportion of each repeat contribution in that region vs. the whole genome (i.e., repeat contribution in region/total contribution of repeat) divided by the proportion of the genome represented by each region (i.e., region size/assembly size). If this index was over 1, it meant that the repeat in question was enriched in the region in question.

Preprocessing of short-read data
Trimming and adaptor removal was done using bbduk (v38.34, https://jgi.doe.gov/data-and-tools/bbtools/) with the settings: k = 23 ktrim = n mink = 11 hdist = 1 tpe tbo qtrim = rl trimq = 20 maq = 10 minlen = 40. Trimmed reads were initially aligned to the reference genome using bowtie2 [58]. Preliminary tests with various parameter settings (local vs. end-to-end alignment, different settings for the sensitivity parameter, map as single reads vs. map as paired reads) did not indicate strong differences. Thus, we used the default values for most parameters, except for fragment size (parameter: X), which was set to the maximum observed fragment size of each library instead of the default value of 500 bp (see Additional file 1: Table S4).
We included three steps to remove contaminants, since DNA extracted from the whole bodies of microscopic organisms might contain DNA from other organisms, such as bacteria. First, we extracted all unmapped reads from an initial alignment to the reference genome and screened only those for potential contaminant reads. Thus, we considered the mapped fraction as "rotifer DNA," since they mapped to the (contaminant-free) assembly. Second, the unmapped reads of all libraries were combined and assembled using metagenomic approaches. For this assembly, we used metaVelvet (v1.2.02, [59]) with a kmer length of 101bp, an insert size of 500 (± 200 standard deviation), and a minimum reported contig length of 300bp. The resulting assembly was then analyzed with metaQuast [60] using the option "automatic pulling of reference sequences," which restricts the search to bacteria and archaea. Subsequently, the complete genomes of putative contaminants were downloaded, and the unmapped reads were mapped against those genomes. In the third step, the remaining fraction (i.e., reads not mapping to the microbial metagenome) were subjected to a kmer-based identification using the kraken2 pipeline [61], with the databases bacteria, archaea, viral, UniVec-Core, and protozoa. We also performed checks on the false-discovery rate of kraken2, by running the same pipeline on reads that initially did map to the reference genome. Finally, all unmapped reads that could not be taxonomically assigned to contaminants, with either of the two approaches above, were considered to be of "rotifer origin" and were merged with the mapped fraction. Those "rotifer reads" were then further cleaned by removing mitochondrial DNA, which was done by mapping them to the published mitochondrial genome of B. plicatilis [62], and by removal of duplicates using FastUniq [63]. These "final reads" were again mapped to the reference genome, or they were analyzed using reference-free approaches that do not require alignment.

Analysis of unaligned reads
"Final reads" were subjected to kmer-based analyses using a kmer size of 21bp with jellyfish (v2.1.4, [64]). Then, GenomeScope 2.0 [20] and findGSE [21] were used to obtain kmer-based estimates of coverage, heterozygosity, and genome size. Per-base coverages C were computed from the kmer-coverages C k with the formula: where R is the average read size, obtained from dividing the total number of base pairs in each library by the total number of reads. The coverage estimates from these two programs were contrasted with the naïve coverage estimate, based on sequencing effort (total number of bp in a library) and 1C genome size estimated from flow cytometry, assuming a diploid genome.
To analyze GC-distributions among short-read libraries, the GC-contents of individual reads were extracted using the function fx2tab of SeqKit [65]. The resulting csv files were further analyzed with the R-package mixtools [66] using the function normalmixEM.

Analysis of copy number variation
Analysis of copy number variation was done separately for all 29 short-read libraries from 15 different clones of the rotifer B. asplanchnoidis. Average per-base depth-ofcoverage (DOC) values along 50-kbp and 5-kbp windows, respectively, were extracted from the BAM alignment files ("final reads" to reference genome) using the samtools function "bedcov" [67]. In total, there are 4835 windows at 50-kbp resolution and 46,255 windows at 5kbp resolution in the current genome assembly. To allow comparisons among clones and libraries, DOC was normalized by dividing the per-bp coverage of each window by 1/2 of the (mean) exon coverage of the respective library. In unphased sections of the genome assembly, we expected DOC values of around 2, provided that both alleles of a (diploid) genome map to the correct location. Coverage variation was quantified as the standard deviation of DOC per window (50kbp or 5kbp) across all 29 libraries.
To identify individual CNVs and to locate possible breakpoints within contigs, we used a custom-written Ralgorithm involving the following criteria for merging adjacent windows (5kbp) based on the similarity of coverage patterns. First, the coverage variation (measured as the standard deviation of per-base normalized coverage across all libraries) had to be above a defined threshold (i.e., 0.7, which was an a posteriori determined threshold). Second, the read depths of each library of both windows had to be significantly correlated with each other at the p < 0.05 level. This was done by calculating the partial correlation coefficients. If both conditions applied, those two windows were considered to belong to the same CNV. The very first and the last window of each contig, which often showed deviant coverage patterns, were merged with their neighboring window. Third, adjacent CNVs identified according to the above criteria were merged if they were separated by only one window (CNV stopping breakpoint) AND if the coverages in the two windows surrounding the breakpoint were significantly correlated with each other. Finally, we only considered CNVs with lengths of at least three adjacent windows. Thus, we obtained a table of all CNVs along the genome of B. asplanchnoidis, together with their size, and their location on individual contigs (i.e., in the middle of the contig, at the edges, or spanning the entire contig). Data analysis related to coverage variation and CNV detection was done using customwritten algorithms in the R environment [68] with the base package (v3.6.3) and the add-on packages stringr v1.4.0 [69] and reshape2 v1.4.4 [70]. For graphical visualization, we used ggplot2 v3.3.0 [71] and the add-on packages cowplot v1.0.0 [72] and treemapify v2.5.5 [73].

PCR confirmation of CNV regions
To independently confirm the presence or absence of CNV regions in different rotifer clones, and to estimate the copy numbers of these genomic regions, we used PCR-based methods. To identify unique PCR-primer binding sites in regions of high or low coverage, we used Thermoalign [74] searches in multiple regions of 5000bp length, spread across the genome and on different contigs. The exact search parameters for Thermoalign are given in Additional file 9. Candidate primers were tested and optimized using PCR on template DNA from different OHJ clones, including the reference clone OHJ7i3n10. PCR reactions consisted of 25μl HotStart Taq master mix (Qiagen), 0.1μM Primer, 3mM MgCl 2 , and 20 ng template DNA. PCR cycling conditions were 95°C for 15 min, 30 cycles of 94°C for 20 s, 56°C for 20 s, and 68°C for 10s, followed by 68°C for 5 min and hold at 4°C. Agarose gels were used to test for the presence or absence of the associated loci across different members of the OHJ population. In total, we screened four loci located on different contigs of the reference genome assembly. Two of them were located in copy numberinvariable, diploid regions of the genome (according to the coverage estimates from the short-read alignments), and two were located in CNV regions (Additional file 1: Table S7). In addition to the qualitative PCR test, we used ddPCR to estimate the copy number of the CNV loci for each rotifer clone. For each 22-μl reaction, we used EvaGreen Supermix (Biorad), 0.15μM Primer, 4 units EcoRI, and template DNA equivalent of 1500 genome copies. PCR cycling conditions were 95°C for 5 min, 40 cycles of 95°C for 30 s, and 61°C for 1 min with a ramp rate of 2°C/s; 4°C for 5 min, followed by 90°C for 5 min and hold at 4°C. For droplet generation and fluorescence readout, we used a QX200 Droplet Generator and Droplet Reader (Biorad), respectively. Copy numbers (CN) of CNV loci were estimated for each rotifer clone using the ratios of amplicon molecule concentrations, which were obtained with the QuantaSoft Software (Biorad): where T is the amplicon concentration of the target locus (the one showing copy number variation across the OHJ population), R 1 and R 2 are the amplicon concentrations of the two reference loci, and N R is the number of copies of the reference loci in the genome (in this case, N R =2, since the reference loci were both diploid). The 95% confidence intervals obtained from QuantaSoft were used as an indication of the measurement error.