Clusters of polymorphic transmembrane genes control resistance to schistosomes in snail vectors

Schistosomiasis is a debilitating parasitic disease infecting hundreds of millions of people. Schistosomes use aquatic snails as intermediate hosts. A promising avenue for disease control involves leveraging innate host mechanisms to reduce snail vectorial capacity. In a genome-wide association study of Biomphalaria glabrata snails, we identify a genomic region, PTC2, that exhibits the largest known correlation with susceptibility to parasite infection (>15-fold effect). Using new genome assemblies with substantially higher contiguity than the Biomphalaria reference genome, we show that PTC2 haplotypes are exceptionally divergent in structure and sequence. This variation includes multi-kilobase indels containing entire genes, and orthologs for which most amino acid residues are polymorphic. RNA-Seq annotation reveals that most of these genes encode single-pass transmembrane proteins, as seen in another resistance region in the same species. Such groups of hyperdiverse snail proteins may mediate host-parasite interaction at the cell surface, offering promising targets for blocking the transmission of schistosomiasis.


Sample-size estimation
• You should state whether an appropriate sample size was computed when the study was being designed
• You should state the statistical method of sample size computation and any required assumptions
• If no explicit power analysis was used, you should describe how you decided what sample (replicate) size (number) to use
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:
For the samples used in pooled sequencing (1200 snails), the sample size was chosen based on simulated data, as explained in Materials and Methods, "Genome-wide scan of 13-16-R1" section. To generate these 600 infected and 600 uninfected snails, we challenged 1700 snails, and we subsequently genotyped the majority (1570) of these at our candidate locus (not all samples could be genotyped for logistical reasons, e.g. insufficient high-quality DNA remaining). These 1570 are expected to provide even greater statistical power than the 1200 samples in pooled sequencing, both because of the larger number and because full diploid genotypes provide more information. If these results had been ambiguous, we would have challenged and genotyped more snails as needed. The validation set of 392 snails was a sample of convenience in that the samples had previously been phenotyped, but as we demonstrate, this sample was enough to confirm the effect with statistical significance (see Results and Discussion).

Replicates
• You should report how often each experiment was performed
• You should include a definition of biological versus technical replication
• The data obtained should be provided and sufficient information should be provided to indicate the number of independent biological and/or technical replicates
• If you encountered any outliers, you should describe how these were handled
• Criteria for exclusion/inclusion of data should be clearly stated
• High-throughput sequence data should be uploaded before submission, with a private link for reviewers provided (these are available from both GEO and ArrayExpress)
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:
The pooled whole-genome sequencing and subsequent genotyping of the same samples constitute a single biological replicate. The second, confirmatory replicate was the independent set of previously phenotyped samples. Individual outlier variants showed high Fst between the pools and could represent false positives, so to minimize this kind of noise we analyzed the results by genomic window rather than by individual variant. This is all explained in Materials and Methods, "Genome-wide scan of 13-16-R1" section.
All high-throughput data are available in NCBI, with BioProject numbers indicated in Materials and Methods.
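
The window-based treatment of Fst mentioned in the Replicates answer can be illustrated with a minimal sketch. This is not the pipeline used in the study: the Fst estimator (a simple (Ht - Hs) / Ht form on pooled allele frequencies), the non-overlapping window scheme, and the names `site_fst_components`, `windowed_fst`, and `window_size` are all assumptions made for illustration. The method actually used is described in Materials and Methods.

```python
from collections import defaultdict


def site_fst_components(p1, p2):
    """Return (numerator, denominator) of Fst = (Ht - Hs) / Ht for one
    biallelic site, given the allele frequency in each of two pools.
    Assumed estimator, chosen only for illustration."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pools combined
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # mean within pools
    return h_total - h_within, h_total


def windowed_fst(sites, window_size):
    """Summarize Fst by non-overlapping genomic window.

    sites: iterable of (position, freq_infected_pool, freq_uninfected_pool).
    Returns {window_start: fst}, where each window's Fst is the ratio of
    summed numerators to summed denominators, so a single outlier variant
    is diluted by its neighbors rather than reported on its own.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for pos, p1, p2 in sites:
        start = (pos // window_size) * window_size
        n, d = site_fst_components(p1, p2)
        num[start] += n
        den[start] += d
    # Skip windows whose sites are all monomorphic (zero denominator).
    return {w: num[w] / den[w] for w in sorted(den) if den[w] > 0}


if __name__ == "__main__":
    # Hypothetical frequencies: one strongly differentiated site among undifferentiated ones.
    example = [(100, 0.5, 0.5), (200, 0.9, 0.1), (1500, 0.2, 0.2)]
    print(windowed_fst(example, window_size=1000))
```

Summing numerators and denominators within a window, rather than averaging per-site ratios, is one common way to keep low-diversity sites from dominating the window estimate; other weighting schemes would be equally reasonable here.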

Statistical reporting
• Statistical analysis methods should be described and justified
• Raw data should be presented in figures whenever informative to do so (typically when N per group is less than 10)
• For each experiment, you should identify the statistical tests used, exact values of N, definitions of center, methods of multiple test correction, and dispersion and precision measures (e.g., mean, median, SD, SEM, confidence intervals) and, for the major substantive results, a measure of effect size (e.g., Pearson's r, Cohen's d)
• Report exact p-values wherever possible alongside the summary statistics and 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05.
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission: (For large datasets, or papers with a very large number of statistical tests, you may upload a single table file with tests, Ns, etc., with reference to sections in the manuscript.)

Group allocation
• Indicate how samples were allocated into experimental groups (in the case of clinical studies, please specify allocation to treatment method); if randomization was used, please also state if restricted randomization was applied
• Indicate if masking was used during group allocation, data collection and/or data analysis
Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:
Group allocation is explained in Materials and Methods. Snails were selected haphazardly, and all were exposed equally to parasites. Snails were designated as infected or not depending on whether they subsequently shed parasites. Equal numbers of infected and uninfected snails were randomly chosen for pooled sequencing. Masking was not used, and it is not applicable here because snails were not deliberately placed in a particular group; rather, they fell naturally into one group or the other based on infection status. Pooled sequencing, the core method for our principal discovery, is a bulk analysis that does not depend on characterizing individuals.

Additional data files ("source data")
• We encourage you to upload relevant additional data files, such as numerical data that are represented as a graph in a figure, or as a summary table
• Where provided, these should be in the most useful format, and they can be uploaded as "Source data" files linked to a main figure or table
• Include model definition files including the full list of parameters used
• Include code used for data analysis (e.g., R, MATLAB)
• Avoid stating that data files are "available upon request"