A method for low-coverage single-gamete sequence analysis demonstrates adherence to Mendel’s first law across a large sample of human sperm

Recently published single-cell sequencing data from individual human sperm (n=41,189; 969–3377 cells from each of 25 donors) offer an opportunity to investigate questions of inheritance with improved statistical power, but require new methods tailored to these extremely low-coverage data (∼0.01× per cell). To this end, we developed a method, named rhapsodi, that leverages sparse gamete genotype data to phase the diploid genomes of the donor individuals, impute missing gamete genotypes, and discover meiotic recombination breakpoints, benchmarking its performance across a wide range of study designs. We then applied rhapsodi to the sperm sequencing data to investigate adherence to Mendel’s Law of Segregation, which states that the offspring of a diploid, heterozygous parent will inherit either allele with equal probability. While the vast majority of loci adhere to this rule, research in model and non-model organisms has uncovered numerous exceptions whereby ‘selfish’ alleles are disproportionately transmitted to the next generation. Evidence of such ‘transmission distortion’ (TD) in humans remains equivocal in part because scans of human pedigrees have been under-powered to detect small effects. After applying rhapsodi to the sperm data and scanning for evidence of TD, our results exhibited close concordance with binomial expectations under balanced transmission. Together, our work demonstrates that rhapsodi can facilitate novel uses of inferred genotype data and meiotic recombination events, while offering a powerful quantitative framework for testing for TD in other cohorts and study systems.


Sample-size estimation
• You should state whether an appropriate sample size was computed when the study was being designed
• You should state the statistical method of sample size computation and any required assumptions
• If no explicit power analysis was used, you should describe how you decided what sample (replicate) size (number) to use
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:
To evaluate the statistical power of our transmission distortion (TD) scanning approach, we conducted simulations of various levels of TD (0-20% deviations from Mendelian expectations) across a range of sample sizes of human sperm (Fig. 4, Fig. 4-supplemental figure 1), including that of our available data (from Bell et al. 2020 and Leung et al. 2021). From this analysis, we estimated the minimum rate of TD that we would be able to detect in our dataset of human sperm (969-3,377 cells from each of 25 donors).
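The simulation-based power analysis described above can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the alternative is parametrized directly by the transmission rate of one allele (0.5 means balanced transmission), and the significance level of 0.05 is applied without multiple-testing correction. Each simulated replicate draws the number of sperm carrying a given allele and applies an exact two-sided binomial test against a rate of 0.5.

```python
import math
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def two_sided_binom_p(k, n):
    """Exact two-sided binomial p-value for k successes in n trials, null rate 0.5."""
    tail = min(k, n - k)
    # P(X <= tail) under Binomial(n, 0.5), computed with exact integer arithmetic.
    lower = sum(math.comb(n, i) for i in range(tail + 1))
    # The null distribution is symmetric, so the two-sided p-value doubles one tail.
    return min(1.0, 2 * lower / 2 ** n)


def simulate_power(n_cells, rate, alpha=0.05, n_sims=1000, seed=0):
    """Fraction of simulated replicates in which the test rejects balanced transmission.

    n_cells: number of sperm cells sampled from one donor (hypothetical parameter name)
    rate:    assumed true transmission rate of the focal allele
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Number of cells that inherited the focal allele in this replicate.
        k = sum(rng.random() < rate for _ in range(n_cells))
        if two_sided_binom_p(k, n_cells) < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for rate in (0.50, 0.55, 0.60):
        print(rate, simulate_power(1000, rate, n_sims=200))
```

Under these assumptions, the estimated power at a rate of 0.5 approximates the false-positive rate, and power rises as the transmission rate departs from 0.5 or as the number of cells grows.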
To benchmark rhapsodi's performance with simulated data ( are each used to provide an example of a possible signature of TD due to linkage disequilibrium, a single simulation is displayed and no formal power analysis was performed. This is described in the figure caption and the sections "Application to data from human sperm" (Results) and "Assessing performance with simulation" (Methods).
To build the null distribution for a potential global signal of TD (Fig. 6), 500 independent simulations were run. This number of replicates was chosen to balance the practical limits of computational run time against the need to report a p-value with two-decimal precision.
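The link between the number of null simulations and the attainable p-value precision can be sketched as follows. This is an illustration only: the function name is hypothetical, the test is assumed to be upper-tailed, and the add-one correction is a common convention rather than something stated in the submission.

```python
def empirical_p_value(observed, null_values):
    """One-tailed empirical p-value from simulated null statistics.

    Counts null draws at least as large as the observed statistic and applies
    an add-one correction (an assumption here) so the estimate is never zero.
    """
    exceed = sum(1 for value in null_values if value >= observed)
    return (exceed + 1) / (len(null_values) + 1)


if __name__ == "__main__":
    # Stand-in null distribution of 500 values; real values would come from simulation.
    null_values = list(range(500))
    print(empirical_p_value(495, null_values))   # 5 null draws are >= 495
    print(empirical_p_value(1000, null_values))  # no null draw is >= 1000
```

With 500 null draws the smallest attainable value is 1/501, so the estimate resolves differences at the second decimal place.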

Replicates
• You should report how often each experiment was performed
• You should include a definition of biological versus technical replication
• The data obtained should be provided and sufficient information should be provided to indicate the number of independent biological and/or technical replicates
• If you encountered any outliers, you should describe how these were handled
• Criteria for exclusion/inclusion of data should be clearly stated
• High-throughput sequence data should be uploaded before submission, with a private link for reviewers provided (these are available from both GEO and ArrayExpress)
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:
The investigation for transmission distortion (TD) was conducted once for each donor (results shown in Fig. 5). Because each sample originates from a unique donor, the samples are not biological replicates. In quantifying a global signal of TD (Fig. 6), we conducted 500 independent simulations to build a null distribution of allele sharing (stated in Fig. 5 legend). We used the sperm genome data from Bell et al. 2020 and Leung et al. 2021, applied filtering steps (as discussed in Methods section "Genotype filtering to mitigate spurious TD signatures"), and then used all remaining sperm cells (N = 41,189; 969-3,377 cells from each of 25 donors, as stated in the Abstract and Introduction) in our analysis.
No outliers were encountered. Data are accessible as described in "Availability of data and materials." To benchmark rhapsodi's performance with simulated data, 3 independent replicates were performed for each study design. This is described in the "Assessing performance with simulation" section within the Methods. For those experiments displayed in Fig. 2 For the other experiments which were used to benchmark rhapsodi's performance with simulated data (displayed in Fig. 3, Fig. 3-figure supplements 1-2, and Fig. 2-figure supplement 3), each of these replicate trials is considered and displayed as an independent data point within the overall dataset, as described in the legends of these figures. Simulations to determine window size (displayed in Fig. 2-figure supplement 4) used 1-3 independent replicate simulations for each combination of parameters. This is stated in the Methods section "Automatic phasing window size calculation". To benchmark rhapsodi's computational run time, each combination of parameters was simulated in 3 independent replicates (Fig. 2-supplemental figure 10). This is stated in the Methods section "Benchmarking run time".
To evaluate the statistical power of our TD scanning approach in Fig. 4 and Fig. 4-supplemental figure 1, 1000 independent simulations were performed for each study design. This is stated in the Methods section "Power analysis for detecting TD" and the figure caption. To demonstrate the statistical power of a TD scanning approach in human pedigrees (Fig. 4-supplemental figure 4), 1000 independent simulations were performed for each study design. This is stated in the figure caption.
Simulation replicates (i.e., repeated runs of random simulations) versus biological replicates (i.e., study subjects) are clearly described as such in the text and Methods.

Statistical reporting
• Statistical analysis methods should be described and justified
• Raw data should be presented in figures whenever informative to do so (typically when N per group is less than 10)
• For each experiment, you should identify the statistical tests used, exact values of N, definitions of center, methods of multiple test correction, and dispersion and precision measures (e.g., mean, median, SD, SEM, confidence intervals; and, for the major substantive results, a measure of effect size (e.g., Pearson's r, Cohen's d))
• Report exact p-values wherever possible alongside the summary statistics and 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05.
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission: (For large datasets, or papers with a very large number of statistical tests, you may upload a single table file with tests, Ns, etc., with reference to sections in the manuscript.)

Group allocation
• Indicate how samples were allocated into experimental groups (in the case of clinical studies, please specify allocation to treatment method); if randomization was used, please also state if restricted randomization was applied
• Indicate if masking was used during group allocation, data collection and/or data analysis
We used a two-tailed binomial test to investigate transmission distortion (TD) in each donor's set of sperm (as stated in Methods section "Genome-wide scan for TD"). Raw data are not presented. The total number of hypothesis tests as well as the methods used for multiple testing correction are described in the Methods section "Significance threshold for TD scan." Exact p-values for the TD analysis are shown in Fig. 5 and Fig. 6. The power analysis also applied two-sided binomial tests, as depicted with and without multiple testing correction in Fig. 4 and Fig. 4-supplemental figure 1 and described in Methods section "Power analysis for detecting TD." We used a one-tailed hypothesis test in quantifying global signatures of TD. This is described in the Methods section "Calculation of global signal of TD." The exact p-value and number of simulations are reported in the caption of Fig. 6B as well as the main text.
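A per-site scan built on the two-tailed binomial test can be sketched as below. The sketch rests on stated assumptions: the input is a list of per-site counts of sperm carrying each parental haplotype, the function names are hypothetical, and a Bonferroni threshold stands in for the correction actually described in the Methods section "Significance threshold for TD scan."

```python
import math


def binom_two_sided(k, n):
    """Exact two-sided binomial p-value for k of n cells, null transmission rate 0.5."""
    tail = min(k, n - k)
    lower = sum(math.comb(n, i) for i in range(tail + 1))
    return min(1.0, 2 * lower / 2 ** n)


def scan_for_td(haplotype_counts, alpha=0.05):
    """Test each site for departure from balanced transmission.

    haplotype_counts: list of (count_haplotype_1, count_haplotype_2) per site
    Returns a list of (p_value, is_significant) tuples, one per site.
    """
    # Bonferroni threshold: an assumption made for illustration only.
    threshold = alpha / len(haplotype_counts)
    results = []
    for h1, h2 in haplotype_counts:
        p_value = binom_two_sided(h1, h1 + h2)
        results.append((p_value, p_value < threshold))
    return results


if __name__ == "__main__":
    # Two hypothetical sites: one balanced, one strongly skewed.
    for p_value, significant in scan_for_td([(500, 500), (700, 300)]):
        print(p_value, significant)
```

In practice, neighboring sites within a donor are tightly linked, so the effective number of independent tests is smaller than the number of sites, and a Bonferroni correction over all sites would be conservative.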
For simulation experiments benchmarking rhapsodi's performance, the definitions of center are described in the figure captions (Fig. 2, Fig. 2-figure supplement 2, and Fig. 2-figure supplements 5-8) as well as throughout the Results section "Evaluating performance on simulated data." The definitions of dispersion/precision are also stated in the Results section "Evaluating performance on simulated data," specifically: "(Values are reported as the mean, plus or minus one standard deviation)." In describing the application of rhapsodi to the human sperm genomes, the definitions of center and dispersion/precision are stated in the Results section "Application to data from human sperm," specifically: "(Values are reported as the mean, plus or minus one standard deviation.)"