Copy number variation and population-specific immune genes in the model vertebrate zebrafish

Copy number variation in large gene families is well characterized for plant resistance genes, but similar studies are rare in animals. The zebrafish (Danio rerio) has hundreds of NLR immune genes, making this species ideal for studying this phenomenon. By sequencing 93 zebrafish from multiple wild and laboratory populations, we identified a total of 1513 NLRs, many more than the previously known 400. Approximately half of those are present in all wild populations, but only 4% were found in 80% or more of the individual fish. Wild fish have up to two times as many NLRs per individual and up to four times as many NLRs per population than laboratory strains. In contrast to the massive variability of gene copies, nucleotide diversity in zebrafish NLR genes is very low: around half of the copies are monomorphic and the remaining ones have very few polymorphisms, likely a signature of purifying selection.


Introduction
The innate immune system of an organism provides the first defense line against pathogens.Immune genes tend to evolve quickly and are often associated with a high degree of genetic variability.Many genes and proteins of the immune system are lineage-specific (limited to specific groups of animals, plants, or other taxa), while others have defense roles in a wide range of species.In particular, proteins containing a large nucleotide-binding domain followed by smaller repeats have an immune function in animals, plants, fungi, and bacteria alike (Ting et al., 2008;Jones and Dangl, 2006;Uehling et al., 2017;Gao et al., 2022).In animals, these repeats are usually leucine-rich repeats (LRRs) and the proteins themselves are classified as NLRs (nucleotide binding domain leucine-rich repeat containing, also known as NOD-like receptors).They have a multitude of functions: some act as pathogen sensors or transcription factors (Almeida-da-Silva et al., 2023), others are components or modulators of inflammasomes, large protein complexes that are assembled within cells as part of the response to biological or chemical danger (Almeida-da-Silva et al., 2023).
Plants have their own NLRs that are structurally similar to the ones from animals and also carry out central functions in the immune response (Urbach and Ausubel, 2017;Yue et al., 2012).Their diversity has been extensively characterized in several species, including the thale cress (Arabidopsis thaliana), and vastly different repertoires have been found from different strains or individuals (Van de Weyer et al., 2019b).NLR repertoires can also be referred to as NLRomes, and a species-wide repertoire is called the 'pan-NLRome'.
The majority of fish NLRs represent a fish-specific subtype that was originally labeled NLR-C (Laing et al., 2008), although they can be further divided into at least six groups based on structural similarities and sequence of conserved exons (Howe et al., 2016;Adrian-Kalchhauser et al., 2020).A schematic structure of proteins encoded by zebrafish NLR-C genes is presented in Figure 1A.All of them possess a FISNA domain (fish-specific NACHT-associated domain), which precedes the nucleotidebinding domain NACHT and is encoded by the same large exon near the N-terminus of the protein (Howe et al., 2016).FISNA-NACHT is in some cases preceded by the effector domain PYD, but this is encoded by a separate exon (Howe et al., 2016).Additionally, many NLR-C proteins have a B30.2 eLife digest Humans and other animals have immune systems that protect them from bacteria, viruses and other potentially harmful microbes.Members of a family of genes known as the NLR family play various roles in helping to recognize and destroy these microbes.Different species have varying numbers of NLR genes, for example, humans have 22 NLRs, but fish can have hundreds.400 have been found in the small tropical zebrafish, also known as zebra danios.
Zebrafish are commonly used as model animals in research studies because they reproduce quickly and are easy to keep in fish tanks.Much of what we know about fish biology comes from studying strains of those laboratory zebrafish, including the 400 NLRs found in a specific laboratory strain.Many NLRs in zebrafish are extremely similar, suggesting that they have only evolved fairly recently through gene duplication.It remains unclear why laboratory zebrafish have so many almost identical NLRs, or if wild zebrafish also have lots of these genes.
To find out more, Schäfer et al. sequenced the DNA of NLRs from almost 100 zebrafish from multiple wild and laboratory populations.The approach identified over 1,500 different NLR genes, most of which, were previously unknown.Computational modelling suggested that each wild population of zebrafish may harbour up to around 2,000 NLR genes, but laboratory strains had much fewer NLRs.The numbers of NLR genes in individual zebrafish varied greatly -only 4% of the genes were present in 80% or more of the fish.Many genes were only found in specific populations or single individuals.
Together, these findings suggest that the NLR family has expanded in zebrafish as part of an ongoing evolutionary process that benefits the immune system of the fish.Similar trends have also been observed in the NLR genes of plants, indicating there may be an evolutionary strategy across all living things to continuously diversify large families of genes.Additionally, this work highlights the lack of diversity in the genes of laboratory animals compared with those of their wild relatives, which may impact how results from laboratory studies are used to inform conservation efforts or are interpreted in the context of human health.domain (also known as PRY/SPRY) at the C-terminal end, separated from FISNA-NACHT by multiple introns and exons containing the LRRs (Figure 1A; Howe et al., 2016).The B30.2 domain functions through protein-protein interaction (Woo et al., 2006) and is also found in a variety of other genes such as the large family of TRIM ubiquitin ligases (van der Aa et al., 2009;Howe et al., 2016;Suurväli et al., 2022) that are often also involved in immunity.
It is not known why fishes possess so many NLRs, how they evolve, and how much within-species genetic variability they have.The previously observed repeated expansions and contractions of this family suggest it to have a high rate of gene birth and death (Suurväli et al., 2022).Studies have shown that viral and bacterial infections can induce the expression of specific fish NLRs (reviewed in Chuphal et al., 2022).Some of these have PYD or CARD domains and can even form inflammasomes similar to mammalian NLRs (Kuri et al., 2017;Li et al., 2018b).A species-wide inventory of major NLR exons in a model species such as zebrafish would provide valuable insights into the evolution and diversity of this large immune gene family.

Results
By adapting and modifying a protocol that combines bait-based exon capture with PacBio SMRT technology (Witek et al., 2016), we successfully generated circular consensus sequence (CCS) data for targeted parts of the immune repertoire from 93 zebrafish (of initial 96), representing four wild populations (Figure 1B) and two laboratory strains.With this approach, we aimed to sequence all exons in zebrafish that encode the nucleotide-binding FISNA-NACHT domains and all exons that encode B30.2 domains.Samples of one wild population (DP) suffered from poor sequence coverage and had to be excluded from downstream analyses in order to avoid bias in interpretation.Results involving this population are only shown in figure supplements and not in the main figures.
Our protocol used PCR with primers targeting ligated adapters to amplify the below-nanogram amounts of genomic DNA obtained from exon capture.This limited our fragment sizes to the lengths of what the polymerase was able to amplify.Zebrafish NLRs can have their exons spread out across tens of kilobases, so that we cannot know which exons belong to the same gene.However, we were able to use captured sequence surrounding the targeted exons to distinguish among near-identical coding sequences and separate NLR-associated B30.2 domains from B30.2 elsewhere in the genome.

The zebrafish pan-NLRome
We used an orthology clustering approach on NLR sequences assembled from all populations to create a reference set of NLRs (a pan-NLRome).This resulted in the identification of 1513 unique FISNA-NACHT containing sequences and 567 for NLR-associated B30.2 (NLR-B30.2).Nearly 10% of the sequences (145 FISNA-NACHT and 64 NLR-B30.2) contained pre-mature stop codons that were at least 10 amino acids from the end and led to early truncation of the protein.In total, 101 of the 1513 FISNA-NACHT were preceded by an exon containing the N-terminal effector domain PYD.Nearly all of those (97 out of 101) were found in group 1 NLR-C genes identified by the presence of the characteristic sequence motif GIAGVGKT (Howe et al., 2016).Since the combination of FISNA and NACHT is only present in NLR-C, its count of 1513 can be considered equal to the total number of NLR-C genes in the data.We found each individual zebrafish to have 100-550 NLR genes from the pan-NLRome in at least one copy (Figures 2 and 3), and only 50-75% of these have a high-quality match in the GRCz11 reference genome (Figure 2-figure supplement 2).In general, laboratory zebrafish had less NLRs than wild samples (Figure 2).The number and length of CCS reads and assembled contigs (both prior to orthology clustering) are presented in     Whereas FISNA-NACHT is only found in NLRs, B30.2 domains are also found in other gene families.In addition to the 567 NLR-B30.2 domains, we also found 732 B30.2 domains not associated with NLRs.We were able to distinguish between them by utilizing the sequence of a short highly conserved 47 bp exon that appears to precede B30.2 in NLRs, but not in other genes (Figure 2  Copy number variation of NLR genes.(A) Sequence data from each individual zebrafish (vertical axis) was aligned to FISNA-NACHT exon sequences of the pan-NLRome (horizontal axis).Grayscale intensity shows, for each NLR, the proportion of NLR-aligning data in each given fish that matches this specific gene.Darker gray indicates a higher likelihood of this NLR being represented in multiple copies in the particular individual.Light gray indicates a single copy, white indicates absence.For clarity, only the 1235 FISNA-NACHT exons for which at least one fish had a minimum of 10 reads mapped to it are shown.(B) Numbers of pan-NLRome sequences (based on FISNACHT diagnosis) found in all three, two, or only one wild population.(C) Relative numbers of fish in which pan-NLRome sequences were found in wild populations.'Core' pan-NLRome: genes which are found in at least 80% of the sample (from a total of 57 wild fish); 'shell': genes in at least 20%; 'cloud': rare genes found in less than 20% of the sample.(D) Observed and estimated sizes of population-specific pan-NLRomes.Data points (filled circles and squares) show the average number of totally discovered NLR genes (as identified via their FISNA-NACHT domain) when investigating x fish.The dashed line is obtained by non-linear fit of the data to the function given in Equation 2. For all populations, the hypothetical pan-NLRome size -when extrapolating x → ∞ -is finite (see Table 1).
The online version of this article includes the following source data and figure supplement(s) for figure 3: Source data 1.Source tables for Figure 3 and its supplements.

Copy number variation in the pan-NLRome
Aligning CCS reads to the pan-NLRome revealed a considerable amount of variability in the proportion of reads mapping to them, both between and within populations (Figure 3A).This can be interpreted as the gene being present in different copy numbers.Furthermore, each NLR had its own distinct pattern of copy number variation, although generally the highest copy numbers were observed for the wild populations KG, SN, and CHT (Figure 3A).We also observed some sequencing batch-related differences, but the copy numbers differed even between individuals sequenced in the same batch.

-figure supplement 1).
There were also NLR sequences shared between just two wild populations, and some were restricted to a single population (Figure 3B).Moreover, we observed a lot of variability in the distributions of gene copies among fish within populations (Figure 3C).Only around 4% of the genes in the pan-NLRome were found in 80%, or more, of the wild fish.They constitute the core NLRome (Van de Weyer et al., 2019a).Most genes (51%) were found in the so-called shell of the pan-NLRome (20-80% of fish).Almost as many (45%) are found in a few fish (less than 20% of the sample) only.Although 60% of NLR genes occur in all wild populations, only 4% are omnipresent, that is, are in the core pan-NLRome.Thus, there is considerable variation in the NLR repertoires of individuals from the same population.
The total number of NLRs identified in a number x of individual fish can be fitted to a harmonic function (Medini et al., 2020).Using this function (see 'Materials and methods'), we estimated the sizes of the NLRomes of the populations (Figure 3D) and found a total of 520 and 570 NLRs in the laboratory strains TU and CGN, respectively (Table 1).For the wild populations, we estimated four times as many: 2283 in KG, ,896 in SN, and 2452 in CHT.

Differences from the reference genome
NLRs sequenced in this study were often different from those present in the reference genome GRCz11.Even NLRs sequenced from the strain that the reference genome itself is based on (TU) did not always align well to it.When the exon itself did align, the intronic sequences surrounding it could often be very different from the reference.In numbers, only around 75% of NLRs occurring in TU fish aligned to the reference genome GRCz11 with high mapping qualities (Figure 2-figure supplement 2A).This number dropped even lower elsewhere -from 60-65% of NLRs in CGN which aligned well to the reference, down to only around 50% for the wild populations.The majority of NLRs that did not map well had a very poor mapping quality of 1 (Figure 2-figure supplement 2B).Moreover, there were 9 FISNA-NACHT and 10 NLR-associated B30.2 in the pan-NLRome which did not map anywhere in the reference genome.

Purifying selection on single-nucleotide variants
We used the pan-NLRome as a reference for identifying single-nucleotide polymorphisms in the data.NLR sequence diversity was rare, with a large fraction of exons not having any variants in any of the populations.If variants were present, nucleotide diversity ( θπ ) was up to 0.016 and Watterson's estimator ( θw ) up to 0.021 (Figure 4A and B).In laboratory strains, genetic variability of FISNA-NACHT exceeded that of B30.2, but no such pattern was observed for wild populations.B30.2 exons of laboratory strains were also less variable than B30.2 from wild zebrafish (Figure 4B).The proportion of exons without any polymorphisms was much higher among FISNA-NACHT than among B30.2 (Figure 4C).The majority of variable NLR exons had θπ/θw ratios of less than 1 (Figure 4D), indicating an excess of rare alleles.

Discussion
We sequenced and assembled the FISNA-NACHT and B30.2 exons of hundreds of NLRs from 93 zebrafish.We were able to capture the diversity of this gene family in three wild populations and two laboratory strains, and produced lower coverage NLR data for an additional wild population (DP).
Analyzing the 73 zebrafish from populations other than DP, we found evidence that each genome from a wild individual contains only a fraction of more than 1500 identified NLR copies.The number of NLRs found per individual, each with one or more copies, ranged from around 100-550.Some of  the lower counts were likely underestimated due to low sequencing depths in specific samples.Since all samples from population DP suffered from low read depth, their analysis is only shown in figure supplements.As targeted sequencing based on bait-capture requires sufficient homology between bait and target, diverged NLR exons may have been missed in our approach.This affects B30.2 exons even more than FISNA-NACHT exons because they are much shorter.However, the observed slow increase in newfound NLR gene copies per sequenced individual after the first few individuals indicates that not many NLRs were missed.The sizes of NLR repertoires differ between zebrafish individuals in the three wild populations.
Nonlinear fitting of NLR counts to Equation 2 suggested that the investigated populations all possess closed pan-NLRomes with roughly 500-600 NLRs in the laboratory strains and around 2000 NLRs in the wild populations.The total numbers of NLRs with a B30.2 exon are about 170 in the laboratory strains and between 1100 and 1500 for the wild populations (Table 1).To explore the entire NLRome of wild populations, large samples are needed: based on the curve-fitting results, we estimate that capturing 90% of the NLRome may require up to several hundred thousand fish (Table 1).Orthogroup clustering with the data from DP resulted in 47 FISNA-NACHT exons which did not occur in any other population.Our results suggest that the pan-NLRome of the entire species must be vastly larger than what we have been able to detect with our limited sample sizes from a limited number of populations.Geographically distant populations -for example, in Nepal or the Western Ghats (Whiteley et al., 2011) -likely harbor many more NLRs which are not present in the populations we sequenced.
Although a few zebrafish assemblies are available in addition to the reference genome, for instance, the fDanRer4.1 assembly from the Tree of Life Initiative (GCA_944039275.1),none of those provide a suitable framework for mapping and analyzing NLRs on their own.One of the hindrances is the fact that the majority of NLR genes are located on the notoriously difficult to assemble long arm of chromosome 4, which harbors plenty of structural variation (McConnell et al., 2023).Furthermore, large multi-copy gene families are difficult to analyze.Read mapping and counting of copies in a particular genome is not trivial.Any downstream analysis which relies on clearly distinguishing paralogous and orthologous comparisons becomes fuzzy, if not impossible.Still, improving sequencing technology and the rising interest in pan-genomic studies Bayer et al., 2020;Sherman and Salzberg, 2020;Liao et al., 2023 have already started to transform the data structures in which genomes are stored, away from a single-reference genome-based view, toward graph-based genome networks.Whether the promise of a thereby improved inventory of structural variation of a species holds up remains still to be seen.Anyway, as shown for the zebrafish NLRs, the availability of a single high-quality reference genome is certainly not sufficient neither to identify nor to understand the diversity of large gene families.

Properties of the zebrafish NLRome
We have previously demonstrated a substantial reduction in single-nucleotide variation in zebrafish laboratory strains compared to wild populations (Suurväli et al., 2020).Here, we showed that the copy numbers of the NLRome and their variation are also heavily reduced.The most obvious explanation for this observation is the recent population bottleneck which marks the establishment of laboratory strains.The reduction in copy number variation in the major histocompatibility complex (MHC) locus in a population of greater prairie-chicken was attributed to a recent bottleneck as well (Eimes et al., 2011).Additionally, the reduced amount of pathogenic challenges in a laboratory environment could lead to a steady loss of expendable genes.For these reasons, one has to exercise caution when extending conclusions from immune-related studies on laboratory zebrafish to wild zebrafish.The same caution should also be exercised when extending results from laboratory organisms to other species, including human.
Studies have shown that even mammals have hundreds of genes with diverse molecular functions that are affected by copy number variation, even though it rarely involves full genes (Kooverjee et al., 2023;Zarrei et al., 2015).One example of the latter is the MHC locus, which harbors varying numbers of gene copies between closely related species of ruminants (He et al., 2024) and has haplotype-specific copies in mice (Lilue et al., 2018).However, the vast number of NLRs in zebrafish combined with presence/absence variation (McConnell et al., 2023) and high rates of duplication exceeds what has been found in other animals so far.A comparable situation can be found in the NLR genes of the thale cress (A.thaliana).Our predicted number of NLRs in a zebrafish population is on the same scale as the 2127 NLRs found in the thale cress NLRome (Van de Weyer et al., 2019b).Moreover, copy numbers also vary greatly between A. thaliana accessions (Lee and Chae, 2020).A total of 464 conserved, high-confidence orthogroups were identified in A. thaliana, 106-143 of which were defined as the core NLRome because they were found in a subset comprising at least 80% of the accessions (Van de Weyer et al., 2019b).In wild zebrafish, we found a set of 880 NLR genes which were detected in at least one individual from three wild populations, but only 58 NLRs were found in the vast majority (more than 80%) of wild individuals.Although structural similarities of NLRs in plants and animals are thought to be the result of convergent evolution (Yue et al., 2012), it appears that the similarities extend to their evolutionary trajectories.However, the overall number of gene copies as well as the variation in copy numbers within populations and in individual gene repertoires are more extreme in zebrafish than in A. thaliana.
We postulate that as immune genes, many NLR genes are likely shared between populations because they provide a fitness advantage in the defense against common pathogens.The additional NLRs shared among only some of the wild populations and the population-specific NLRs may represent local adaptations to ecological niches.Additionally, there could be functional redundancy within the NLRome, so that different individuals have different NLRs with the same functional role.In general, the fact that hundreds of NLR gene copies are maintained in zebrafish, together with a signature of purifying selection, suggests that the evolution of these genes is far from neutral.Although the expression of fish NLRs is often induced by pathogen exposure (reviewed in Chuphal et al., 2022), the exact function of most zebrafish NLR-C genes remains unclear.It is possible that some of them participate in the formation and activity of inflammasomes (Li, 2018a;Valera-Pérez et al., 2019;Lozano-Gil et al., 2022, Kuri et al., 2017), but we only found the N-terminal effector domains (CARD or PYD) that are typically involved in this function (Petrilli et al., 2005) in a small subset of NLR-C genes.
Although we mainly used the counts of FISNA-NACHT orthogroups to estimate total numbers of NLRs, we also analyzed the B30.2 exons of NLR-C genes.In general, NLR-associated B30.2 exons exhibit patterns of copy number variation that are similar to those seen for FISNA-NACHT.For example, about half of the B30.2 sequences are found in all wild populations, similar to the set of 880 FISNA-NACHT exon sequences conserved among populations.

What drives the copy number differences?
There are at least two mechanisms which could contribute to the extensive copy number variation seen among zebrafish populations: first, it could be attributed to a high degree of haplotypic variation.Large DNA fragments contain different sets of genes and gene copies, similar to the zebrafish MHC loci (McConnell et al., 2014).Extensive haplotypic variation occurs on the long arm of chromosome 4, the location containing over two-thirds of all NLRs in zebrafish (McConnell et al., 2023).Such segregating haplotype blocks would explain the existence of the core NLRome, but not the frequent presence of genes that occur only in a single individual.
Alternatively or additionally, the evolution of NLR-C genes could be driven by duplication events (Cannon et al., 2004) and gene conversion (Laing et al., 2008).Gene duplications can be caused by unequal recombination, transposon activity, or whole genome/chromosome duplications (Magadum et al., 2013;Kapitonov and Jurka, 2007).The arrangement of NLR-B30.2 genes in clusters on the long arm of chromosome 4 suggests that tandem duplication via unequal crossing-over (Otto et al., 2022) played the most important role in the expansion.Since there are many transposable elements on the long arm of chromosome 4 (Howe et al., 2013), it would be reasonable to assume that at least some of them have assisted in the local expansion and transfer of NLR exons and genes to chromosomes other than chromosome 4. Since our targeted sequencing approach does not elucidate the genomic arrangement of the NLR gene copies and many of them do not have recognizable orthologs in the reference genome, we cannot draw further conclusions about the role of tandem arrays in their evolutionary trajectory.
It is tempting to speculate that chromosome 4 could be a source of NLRs which continuously generates new copies.However, gene gains must be balanced by gene loss to maintain a stable genome size.NLR-C genes may be lost via accumulation of random mutations due to a lack of selective pressure and loss-of-function mutations, but they may also be lost through unequal recombination.This mechanism would allow only NLR genes contributing to the functionality of the immune system to be kept, while others would disappear.
In the similarly evolving plant NLRs, tandem duplication is thought to be the primary driver of NLR gene expansion (Cannon et al., 2004), but they are also often associated with transposable elements.If the diversity of unrelated NLR genes in such distantly related species is driven by common molecular mechanisms, then the same mechanisms might also act on NLRs of other phylogenetic clades and even on unrelated large gene families, such as odorant receptors (Mombaerts, 1999).

Conclusion
This study showcases an example of the evolutionary dynamics affecting very large gene families.The sheer amount of copy number variation that appears to be present in a single gene family of zebrafish is staggering, with different individuals each having numerous genes that are not present in all others.This can only be caused by diversity-generating mechanisms that are active even now.In this study, we have laid the groundwork for future studies investigating the molecular basis and evolutionary mechanisms contributing to the diversity of large, vertebrate gene families.

Samples
Wild zebrafish from four sites in India and Bangladesh (Figure 1B) had been collected in the frame of other projects (e.g., Whiteley et al., 2011;Shelton et al., 2020).Laboratory zebrafish from the Tübingen (TU) and Cologne (CGN) strains were provided by Dr. Cornelia Stein from the Hammerschmidt laboratory (Institute for Zoology, University of Cologne).All samples were stored in 95% ethanol until use.Tail fins from 20 fish per wild population and 8 fish per laboratory strain were used as starting material for the subsequent steps.
DNA extraction, exon capture, and sequencing Genomic DNA was extracted with kits from QIAGEN (MagAttract HMW kit) and MACHEREY-NAGEL (Nucleospin Tissue Kit), followed by shearing with red miniTUBEs on the Covaris M220 ultrasonicator.Nicks in the DNA were repaired with PreCR Repair Mix (New England Biolabs).Samples were barcoded with the NEBNext Ultra II DNA Library Prep Kit, then pooled together in batches of four or eight (details provided in Appendix 1).RNA baits for the exon capture (Daicel Arbor Biosciences) were custom-designed to target immune genes of interest (mainly NLRs, but also some others) based on version GRCz10 of the reference genome.Bait sequences and target locations are available in Figure 2-source data 2. Exon capture and PacBio library preparation were both done according to a protocol adapted from Witek et al., 2016.Libraries were sequenced at the Max Planck-Genome-Centre Cologne, with PacBio Sequel and Sequel II.Additional details are provided in Appendix 1.

Read processing, mapping, and clustering
Raw sequences were de-multiplexed with lima.Consensus sequences of DNA fragments with at least three passes (CCS reads) were inferred with ccs, followed by PCR duplicate removal with pbmarkdup.

Modeling
To estimate the full size of each population's NLR repertoire, we calculated the increment in the total number of identified NLR exon sequences when adding sequence data from one additional individual of a population to a set of already surveyed individuals.As noted earlier (Medini et al., 2020), these increments are well approximated by a power-law decay.
Briefly, given a sample of n individuals, there are ways to choose x − 1 individuals from the entire sample and add another -not yet chosen -one.For each x , we calculated the increment in the number of identified exon sequences and averaged over all possible choices of individuals.Summation of the average increments yields the total number of exons identified with x individuals, as plotted in Figure 3D.Then, we fitted the nonlinear function where H(x, β) is the generalized harmonic number with parameter β , that is, It represents the sum of increments, decaying according to a power-law, with parameters α (intercept) and β (decay rate).Importantly, if β > 1 , the series in Equation 3 converges and its limit may be interpreted as the size of a closed NLRome.The NLRome is open, if β ≤ 1 .Values of the fitted parameters and saturation limits are presented in Table 1.

Genetic diversity
Single-nucleotide genotypes in each fish were identified from the. bam output of pbmm2 by using deepvariant (r1.0) (Poplin et al., 2018) with the PacBio model.Joint genotyping of the individual samples was done with glnexus (v1.2.7-0-g0e74fc4) (Yun et al., 2021) with its deepvariant-specific setting.Per-site θπ of the NLR exons was calculated with vcftools (v0.1.16)(Danecek et al., 2011).Watterson's estimator of the scaled mutation rate is where S is the number of segregating sides seen in a sample of n aligned sequences, each of size l (here, 1761 bp for the FISNA-NACHT exons and 540 bp for the B30.2 exons).
Under neutrality (all alleles confer the same fitness to an individual) and constant population size over time, one expects equality θπ = θw .

Data availability
NLR reads are available in the NCBI Sequence Read Archive (BioProject PRJNA966920).Scripts are available on GitHub (https://github.com/YSchaefer/pacbio_zebrafish,copy archived at Schaefer, 2024).Sequences of the hybridization baits are provided as a source dataset.
The following dataset was generated: The following previously published dataset was used: Library construction and sequencing The final libraries were prepared with the SMRTbell Template Prep Kit 1.0-SPv3 (Pacific Biosciences).At the ligation step, the recommended amount of PacBio adapters was increased from 1 to 5 µl per 40 µl total reaction volume and the reaction was incubated overnight at room temperature.For the SN and CHT libraries in pools of eight (see Appendix 1-table 1), barcoded PacBio adapters were used instead of regular ones.The product codes for barcodes were BC1001 and BC1002 for CHT, BC 1003 and BC1004 for SN.The first libraries (TU, CGN, 4 DP and 4 KG samples) were size selected to 2-8 kb with the BluePippin pulsed field electrophoresis system (Sage Science).The following libraries were size selected to 1.5-8 kb.
All sequencing was done at the Max Planck-Genome-Centre Cologne.All TU, CGN, DP, and KG zebrafish, as well as four CHT and four SN samples were sequenced with 1M v2 SMRT Cells of the Sequel instrument (Pacific Biosciences).The rest of the samples (all with barcoded adapters) were multiplexed together and sequenced with an 8M SMRT Cell of the much higher throughput Sequel 2 instrument (Pacific Biosciences).One of the already sequenced SN samples (SN24) was also resequenced in this run as it yielded no data in the first one.Furthermore, Pacific Biosciences upgraded their kits with a superior polymerase after we had sequenced TU, CGN and the first four samples of each wild population; all samples other than those were sequenced with their LR (long run) polymerase.
An overview of the sequencing is presented in Appendix 1-table 1.

Read processing and assembly
Raw data were de-multiplexed and stripped of primer/adapter sequences with lima from Pacific Biosciences.For the samples sequenced with the PacBio Sequel I, the parameters -enforcefirst-barcode -split-bam-named -W 100 were used with lima v1.0.0 for the runs without the LR polymerase.For Sequel runs with the LR polymerase, lima v1.8.0 and v1.9.0 were used with the same parameters.To remove PacBio barcodes from the data produced on Sequel II, lima v1.11.0 was used with parameters -split-bam-named -peek-guess and for the subsequent removal of NEBNext barcodes, the parameters were changed to -enforce-first-barcode -split-bam-named -peek-guess.
Consensus sequences of all DNA fragments with a minimum of three passes (henceforth referred to as CCS reads) were calculated using ccs (v4.2.0,Pacific Biosciences) with default parameters.PCR duplicates were identified and flagged with pbmarkdup (v1.0.0,Pacific Biosciences) with default parameters, then excluded from downstream analyses.Any chimeric reads containing a primer sequence in the middle were identified with blastn (v2.11.0+) (Altschul et al., 1990) and removed.The filtered CCS reads were assembled into contigs for each fish separately using hifiasm (v0.15.4-r347) (Cheng et al., 2021) with default parameters.

NLR identification
To obtain a list of NLR gene positions in the reference genome, we first extracted known NLR locations from Ensembl.In addition, the reference genome was translated in all frames using transeq (from EMBOSS:6.6.0.0) (Rice et al., 2000) and searched for further NLRs using hmmsearch from hmmer (v3.2.1) (Wheeler and Eddy, 2013), without bias correction and with the hidden Markov model (HMM) profiles for zf_FISNA-NACHT and zf_B30.2from Adrian- Kalchhauser et al., 2020.Each position in which the zf_FISNA-NACHT model found a hit with a maximum i-Evalue of 1e − 200 and a minimum alignment length of 500 aa was considered a FISNA-NACHT exon.The filtering thresholds for B30.2 exons were an i-Evalue of 1e − 5 and a minimum alignment length of 150 aa.This approach was used both during bait design and as a preparatory step for the first round of read filtering.
To distinguish CCS reads of NLR genes from other CCS reads, the CCS reads of each fish were mapped against the reference genome GRCz11 using pbmm2 (v1.3.0) with preset ccs.CCS reads which mapped within a known NLR gene or one found with our HMM-based approach with any mapping quality were considered potential NLR reads and used as input for subsequent steps.
De novo assembled contigs containing NLR exon sequences were identified by translating all contigs of each fish in all frames with transeq (from EMBOSS:6.6.0.0) and subsequently searching for FISNA-NACHT and B30.2 domains using hmmsearch from hmmer (v3.2.1) without bias correction and the HMMs zf_FISNA-NACHT and zf_B30.2again.The HMM-based approach was chosen for the contigs in particular because we assumed that there would be NLR sequences in the data which

Figure 1 .
Figure 1.Structure of zebrafish NLRs and a map showing the origin of wild zebrafish samples.(A) Generalized, schematic representation of the domain architecture of an NLR-C protein.Each box represents a translated exon.The N-terminal repeats, the death-fold domain, as well as the B30.2 domain only occur in subsets of NLR-C genes.The number of N-terminal repeats and leucine-rich repeats can vary.Domains that can be either present or absent in different NLRs are surrounded by square brackets.(B) Sampling sites for wild zebrafish.All sites are located near the Bay of Bengal.Final sequenced sample sizes are indicated in parentheses.The map is based on geographic data collected and published by AQUASTAT from the Food and Agriculture Organization of the United Nations (FAO, 2021).The population DP is marked with an asterisk because its analysis and results are presented only in figure supplements.

Figure 2 .
Figure 2. Total counts of NLRs found per individual, shown for each population.Black diamonds on the box plots denote means, horizontal lines denote medians.Left side: two laboratory strains; right side: three wild populations.The online version of this article includes the following source data and figure supplement(s) for figure 2: Source data 1.Source tables for Figure 2 and its supplements.Source data 2. Sequences and target locations of RNA baits.

Figure supplement 1 .
Figure supplement 1.Sequencing and assembly statistics of circular consensus sequence (CCS) reads from NLR exons.

Figure 3 .
Figure3.Copy number variation of NLR genes.(A) Sequence data from each individual zebrafish (vertical axis) was aligned to FISNA-NACHT exon sequences of the pan-NLRome (horizontal axis).Grayscale intensity shows, for each NLR, the proportion of NLR-aligning data in each given fish that matches this specific gene.Darker gray indicates a higher likelihood of this NLR being represented in multiple copies in the particular individual.Light gray indicates a single copy, white indicates absence.For clarity, only the 1235 FISNA-NACHT exons for which at least one fish had a minimum of 10 reads mapped to it are shown.(B) Numbers of pan-NLRome sequences (based on FISNACHT diagnosis) found in all three, two, or only one wild population.(C) Relative numbers of fish in which pan-NLRome sequences were found in wild populations.'Core' pan-NLRome: genes which are found in at least 80% of the sample (from a total of 57 wild fish); 'shell': genes in at least 20%; 'cloud': rare genes found in less than 20% of the sample.(D) Observed and estimated sizes of population-specific pan-NLRomes.Data points (filled circles and squares) show the average number of totally

Figure supplement 2 .
Figure supplement 2. Copy number variation of NLR genes, including the DP population.

Figure 4 .
Figure 4. Single-nucleotide variation in NLR exons.Pairwise nucleotide diversity ( θπ ) and Watterson's estimator of the scaled mutation rate ( θw ) for FISNA-NACHT (A) and NLR-associated B30.2 (B) exons.(C) Proportion of exons without any single nucleotide polymorphisms.(D) Ratio of θπ/θw .Only exons with at least one single-nucleotide polymorphism are shown.The dotted, horizontal line marks a ratio of 1, the expected value under neutrality and constant population size.The black diamonds on box plots denote means, horizontal lines denote medians.The online version of this article includes the following source data and figure supplement(s) for figure 4: Source data 1.Source tables for Figure 4 and its supplement.

Figure supplement 1 .
Figure supplement 1. Single-nucleotide polymorphisms of different NLR exons shown by population, including DP.

Table 1 .
Values of fitted parameters and saturation limits for FISNA-NACHT and NLR-B30.2 exons, by population.