Towards the detection of copy number variation from single sperm sequencing in cattle

Copy number variation (CNV) has been routinely studied using bulk-cell sequencing. However, CNV is not well studied on the single-cell level except for humans and a few model organisms. We sequenced 143 single sperms of two Holstein bulls, from which we predicted CNV events using 14 single sperms with deep sequencing. We then compared the CNV results derived from single sperms with the bulk-cell sequencing of one bull’s family trio of diploid genomes. As a known CNV hotspot, segmental duplications were also predicted using the bovine ARS-UCD1.2 genome. Although the trio CNVs validated only some single sperm CNVs, they still showed a distal chromosomal distribution pattern and significant associations with segmental duplications and satellite repeats. Our preliminary results pointed out future research directions and highlighted the importance of uniform whole genome amplification, deep sequence coverage, and dedicated software pipelines for CNV detection using single cell sequencing data.

analysis tools are available for detecting CNVs in human scDNA-seq data, as reviewed recently [40].
However, no report has been published on the CNV identification on the single-cell level in livestock, including cattle. Here we sequenced and analyzed 143 single sperm genomes from two Holstein bulls, identifying thousands of candidate CNV events. We attempted to validate the single-sperm sequencing-based CNV results using the data derived from the diploid genome sequencing of one bull's family trio. Since one mechanism of CNV formation is non-allelic homologous recombination (NAHR), a recent paper reported that NAHR leads to over two-thirds of the structural variation detected within the human genome [41]. We also investigated CNVs and their associated segmental duplications [2]. To the best of our knowledge, this is the first reported trial of single sperm genome sequencing in livestock, highlighting future CNV detection directions using scDNAseq data and opening the door for studying individual sperm genome and male infertility.

Sequencing of haploid sperms and diploid trio Sequencing of sperms
We chose two bulls with different fertility capabilities (See Methods). Using the MALBAC method [42], we amplified and sequenced a total of 156 single sperm cells manually picked from two Holstein bulls' semen. After quality control filtering, 143 sperm data (71 for Sample1 and 72 for Sample2) were kept for downstream analyses. The sequenced sperms had an average of 1.79 × genome coverage, and 16 of them were at ~ 4 × genome coverage, achieving an overall coverage of ~ 11.40% to ~ 41.35% of the genome, respectively (Table S1). On average, 98.18% of sequencing reads from single sperms were mapped on the bovine ARS-UCD1.2 genome.

Sequencing of the trio
For Samples1's family trio diploid genomes, we sequenced bulk DNA samples extracted from ear punches of Sam-ple1, its sire Sample1-sire, and dam Sample1-dam to approximately 40, 10, and 20 × genome coverage, respectively, with over 99% genome mapping rate and covering 96% genome sequence (Table S2).

Segmental duplication analysis
Delineation of the recent duplication events at the genomic-sequence level, particularly sequences located at their junctions [43], may provide insight into their mechanism of origin. Because SDquest can detect recent and ancient segmental duplication (SegDup) [44], we applied it to the latest bovine ARS-UCD1.2 genome assembly. A total of 27,560 pairwise SegDup sequence fragments were reported by SDquest, with 49,126 unique nonredundant fragment regions (Table S3). Among them, 12,400 (44.99%) and 3,374 (12.24%) pairwise Seg-Dups have sequence identity larger than 80% and 90%, respectively. Also, 17,477 (63.41%) pairwise SegDup sequence fragments are reversed in their orientations on the chromosomes, while 16,621 (60.31%) are interchromosomally distributed (Fig. 1). After merging neighboring pairwise SegDup sequence fragments, we detected a total of 9,445 SegDup regions, covering 2.89% of the bovine genome (71,877,120 bp) ( Table S4). As shown in Table S5, chr3 has the highest count of SegDup regions (600), chr5 has the largest length of SegDup regions (5,004,378 bp), and chr29 has the largest percentage of SegDup coverage (7.42%).
Following our previous study [45], we analyzed repetitive sequence contents in and near SegDup regions (Table S6, Methods). We evaluated the repeat content of duplicated sequence, 20 kb flanking sequence, and the whole genome. As reported before [35], SINE Alu repeats were associated with human segmental duplications, but we did not find SINE enrichment was enriched for bovine segmental duplications. However, we detected two clear patterns regarding repeat content. While LINE content remains similar, DNA and SINE repeat content of most duplications are reduced. We observed a reverse trend for LTR and satellite repeat sequences, even though the fold change for LTR is only 1.25 (Table S6, Random simulation test, P-value < 0.001). Bovine segmental duplications show a 2.84-fold enrichment for satellite repeat content and a 2.03-fold elongation for satellite repeat average length over the genome average (Table S6), agreeing with our earlier observation [45].
We also performed gene annotation for those Seg-Dup regions and found 3,724 SegDups overlapping with 2,969 genes, which were significantly enriched (adjusted P-value < 0.05) in the GO term of GTPase activity and 12 KEGG pathways, such as metabolism of xenobiotics by cytochrome P450 and antigen processing and presentation (Table S7), again agreeing with our previous cattle results and the results from other species [8][9][10]45]. When compared with the cattle QTL database [46], we found a total of 837 QTLs intersected with 425 SegDups. We also found that eight QTLs were significantly enriched (adjusted P-value < 0.05 after the Benjamini-Hochberg correction for multiple testing) for animal reproduction and health traits, such as conception rate, inseminations per conception, stillbirth, bovine respiratory disease susceptibility, and others (Table S7).

Copy number variations in sperms and trio genomes
Using single sperms with deep sequencing from Sam-ple1 (n = 8) and Sample2 (n = 6), as well as Sample1 trio somatic samples, we detected a total of 5,646 CNVs (ranging from 50 bp to 5 Mb), including 1,307 break end (BND), 2,779 deletion (DEL), 877 duplication (DUP), and 683 inversion (INV) events (Table 1, Table S8, and Table  S9). Totally 0.27% of autosomes were covered by 6.73 Mb length of CNV (Table S10). We then focused on CNVs (i.e., DEL and DUP), which are shown in Fig. 2 and Fig.  S1. Similar to the recombination maps derived from the same sequence data (Yang et al., 2021 submitted), CNV distributions are significantly enriched in the two ends of chromosomes (Fig. 3). This result was also in line with 2) package circlize. From outside to inside: segmental duplication position, pairwise intrachromosomals, and interchromosomal SegDup events. Solid lines denote the pairwise SegDup with the same orientation, while dotted lines for reversed orientation. Only SegDups with lengths larger than 10 kb, sequence identity of more than 90% are shown the human results from two recent large-scale CNV discovery studies [47,48].
In addition, we analyzed the repeat content in and near 2,485 nonredundant CNV regions, similar to what we did for SegDup as described above (Table S12). We tested two flanking window sizes: 5 and 20 kb. For Sample1 sperm, Sample2 sperm, and Sample1 diploid CNVs, we consistently observed that SegDup (3.67-13.79 folds) and satellites (up to 5.64 folds) were enriched in CNVs (all P-values < 0.001). Within the 5 kb flanking regions, the enrichment folds of SegDups

Discussion
Single-cell sequencing and analysis are still facing multiple grand challenges [49]. To the best of our knowledge, this is the first trial of single sperm sequencing in the livestock, and we will discuss what was achieved and what needs further improvement. It is generally accepted that the de novo CNV event is infrequent. By mapping each sperm's sequencing data to the reference genome, our results focused on the germline/inherited CNVs, which are the common CNV events shared by single sperms. Using the CNV results derived from the trio bulk-cell sequencing as the ground truth, we estimated shared and unique percentages among the three groups (Sample1-sperms, Sam-ple2-sperms, and Sample1-trio). To our disappointment, only 10.39% of CNVs detected in Sample1-sperms were shared with its family trio, while 36.49% were shared between Sample1-sperms and Sample2-sperms. Thus, it is possible that CNVs only called in single cells were less reliable. We suspect the following systematic factors may contribute to these discrepancies: 1) uneven whole genome amplification, 2) low sequence coverage, and 3) suboptimal pipelines and their parameters.
As expected, scDNAseq is limited by its DNA amount: a single sperm contains 3 pg of DNA, not enough for whole-genome sequencing. Therefore, scDNAseq template amplification and library preparation are needed. As shown previously [50], these steps could severely impact the performance of CNV detection when whole genome amplification is uneven and/or sequence coverage is low. Additionally, the bioinformatics pipelines also influenced the performance of CNV detection. Ideally, read depth should be a better strategy given the low sequencing coverage, as compared to the pair-end and split-read approaches. As reviewed before [40], to correct for the first two factors, existing scDNAseq CNV read depth detection pipelines need to divide the genome into bins or windows first. They will then perform GC correction and mappability correction to obtain normalized reads depths (Figs. S3 and S4). Finally, they will need to remove outlier bins and outlier cells. The outlier bins often have an unusually high read count and occur near the centromere and telomere of each chromosome. The outlier cells often are low in signal-to-noise ratio or low in sequence coverage.
However, most of the existing pipelines are designed for the human genome [40], and it will take a great effort to fully customize and optimize them for livestock like cattle. In this study, CNVs were called using LUMPY [51], which was not designed for scDNA-seq data. It is also better to simultaneously apply a method to multiple samples to call germline/inherited CNVs to achieve better sensitivity and accuracy as the recently published method CHISEL did for human data [52]. Then CNV genotyping could then be performed on individual sperm cells. Our pipeline processed each sample separately using an integrated algorithm combining pair-end, split-read, and read depth. It did not specifically remove the outlier bins or the outlier cells, as no such data exists for cattle. However, our rationale for using LUMPY was that although we had a low average coverage and a low read depth for individual sperm genomes, we sampled the same genomes multiple times, through different sperms, with a total accumulating read depth of 56.99 × and 43.68 × . Therefore, merging reads across diffident sperms, i.e., pseudo bulk sequencing, should yield relatively confident results. In the future, we plan to adopt existing human pipelines to alleviate the impacts of these systematic factors on CNV calling in cattle.
During meiosis, chromosome missegregation can cause aneuploidy. Using Sperm-seq, Bell et al. sequenced 31,228 human sperm genomes from 20 men, identifying crossovers and other genomic anomalies [37]. They discovered that human sperm donors had aneuploidy rates ranging from 0.01 to 0.05 aneuploidies per gamete [37]. Due to the limited sample size and probably the signalto-noise ratio, no aneuploidy was detected in this study.
Finally, Ebert et al. recently reported that over twothirds of CNV detected within the human genome were associated with NAHR, mediated by repetitive sequences, such as segmental duplications and common repeat elements [41]. It was encouraging that our cattle segmental duplication and CNV flanking sequence analysis results also showed they are significantly enriched for each other and satellite repeats, despite the suboptimal data quality due to the abovementioned factors. In summary, we sequenced single sperms in cattle, performed an initial CNV detection, and found a distal chromosomal distribution pattern, which agreed with previous results derived from cattle bulk-cell sequencing or human studies. In the meantime, our results also highlighted the importance of the uniform whole genome amplification, deep sequence coverage, and dedicated software pipelines for CNV detection using scDNA-seq data.

Sample collection and whole genome amplification and sequencing
We chose two Holstein bulls with different fertility capabilities: Sample1 has a DPR (daughter pregnancy rate) PTA value of 0.0, reliability of 0.99, estimated from 6,528 daughters. In contrast, Sample2 has a DPR PTA value of -3.2, reliability of 0.99, estimated from 15,314 daughters. Somatic tissue (ear punch) samples of Holstein Sample1, together with its parent somatic tissues, were donated by Select Sires, Inc (Plain City, OH, USA). Semen samples were freshly collected by Select Sires, Inc. in its routine artificial insemination semen straw production. After receiving them under liquid nitrogen in USDA-ARS Animal Genomics and Improvement Laboratory (AGIL), we manually isolated a total of 156 sperm cells from two Holstein bulls (Sample1 with 73 sperm cells and Sample2 with 83 sperm cells). Briefly, isolated sperms were thawed in 37 ℃ water for 30-45 s and treated with 0.25% Trypsin-EDTA, followed by dilution with PBS + 1% BSA and washing twice. The sperms were further diluted to a proper resolution using PBS + 1% BSA on a petri-dish, and active single sperms were picked up manually by pipetting into a reaction tube under a micromanipulator as described previously [42]. Whole-genome amplification was performed on single cells according to the manufacturer's protocol, using the Single Cell Whole Genome Amplification Kit (Yikon Genomics, Shanghai, China) developed from the Multiple Annealing and Looping Based Amplification Cycles (MALBAC) method [35]. In brief, a single sperm was initially analyzed and pre-amplified by primers supplied in the kit with 8 cycles with multiple annealing steps. PCR generated fragments with variable length at random starting positions for Illumina short-read sequencing. To evaluate the agreement rate of individual recombination from sperms and parents, we also sequenced the somatic diploid genomes of the trio, including Sample1 (Sample1-diploid) and its parents (Sample1-sire and Sample1-dam). Using their somatic ear punch tissues, we isolated their diploid genomes using a QIAGEN DNA extraction kit. DNA samples extracted from the donor and his parents' ear skin samples were then used to prepare sequencing libraries using standard Illumina protocol and sequenced on an Illumina HiSeq 2000/ NextSeq 500 sequencing platform.

Identification of segmental duplications and enrichment test
We utilized software SDquest v0.1 [44] for detecting segmental duplications (SegDup, also known as low copy repeats) and constructing the breakpoint graph of these mosaic SegDups, based on the repeat masked ARS-UCD1.2 reference downloaded from ENSEMBL (ftp:// ftp. ensem bl. org/ pub/ relea se-102/ fasta/ bos_ taurus/ dna/). We compared the repeat content of Seg-Dups, CNVs (DEL or DUP), or 5 kb, 20 kb flanking regions (5kbF, 20kbF). For CNVs, we combined the Seg-Dups and repeats from UCSC Table Browser (https:// genome. ucsc. edu/ cgi-bin/ hgTab les). Length, Count, Average Length, Length%, and Count/Mb of repeat content for SegDups, CNVs, 5kbF, or 20kbF were based on these repeat overlapped with regions, Length% denotes the proportion of repeat length overlapped with SegDups/CNVs/5kbFRoCNVs in total SegDups/ CNVs/5kbF/20kbF length, Count/Mb denotes the count of repeats overlapped with SegDups/CNVs/5kbF/20kbF divided by total SegDups/CNVs/5kbF/20kbF Mb. For enrichment, ratios were defined as Average Length, Length%, and Count/Mb of repeats in SegDups/ CNVs/5kbF/20kbF divided by repeats in the genome. We determined the significance of the enrichment by 1,000 times simulating the SegDups/CNVs/5kbF/20kbF in random genome position with the same average and standard deviation length, which generated by function createRandomRegions from R v4.0.2 package regioneR. P-value refers to the frequency of simulated value larger than observed value divided by simulation times. The threshold was set as 0.05.

Structural variation detecting
We employed LUMPY v0.2.13 [51], which integrated read-depth, read-pair, and split-read strategies, to detect structural variations in high coverage sperms. As recommended, LUMPY was internally implemented in a pipeline smoove (https:// github. com/ brentp/ smoove) with shorter run-time and lower false-positive rate. Smoove was used to collect the best practices of LUMPY, such as generating empirical insert size statistics on each library in the BAM file, estimating the mean and standard deviation (SD) of the input parameters for LUMPY. From LUMPY, the four types of structural variations, including deletion (DEL), duplication (DUP), inversion (INV), and break end (BND), were reported for each sample. Due to the limitation for INV and BND detection, we focused on CNV (DEL plus DUP) in most of the analysis, after filtering away DEL and DUP with a length more than 5 Mb or short than 50 bp. For haploid sperms and diploid trio, we applied the following thresholds to filter out low-quality CNVs: the threshold of supporting read count for either paired-end event or split-read event must be more than 3/4 of the genome coverage, while the read count for the other type of split-read event must be more than 1 or paired-end event must be more than 3.

Gene annotation and enrichment analysis
We mapped regions of interest to the bovine reference gene annotation of the ARS-UCD1.2 genome from ENSEMBL using BEDtools v2.26.0 [53]. The gene features included transcripts, exons, CDS, 3'-UTR, 5'-UTR, start codon, and stop codon. The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway and Gene Ontology (GO) enrichment were performed using the R (version 4.0.2) packages org.Bt.eg.db and cluster-Profiler. We performed the quantitative trait loci (QTL) enrichment analysis using the Fisher exact test at animalgenome.org [46]. All enrichment P-values were also adjusted for multiple comparisons by Benjamini and Hochberg's (BH) algorithm.