Adaptive and maladaptive introgression in grapevine domestication

Significance Our study focused on the history of introgression between domesticated grapes and their European wild relative. We find evidence for a single domestication of grapevine with introgression from the wild relative to wine, but not to table, grapes. These regions are enriched for genes related to aromatic compound synthesis, suggesting that European wild grapes have been an important resource for improving the flavor of wine grapes. However, the beneficial effects of introgression carry a potential cost because introgressed regions have elevated numbers of putatively deleterious variants, which are usually maintained in a heterozygous state for this clonally propagated crop. The identification of introgressed beneficial and deleterious variants is important for genomic breeding of grapevines.

Phylogenetic trees were constructed using SNP data by iqtree with 1000 bootstrap replicates (11). Two kinds of models were used: VT+F+R5 and GTR+I+G. The split trees of chloroplast and mitochondria were generated by SplitsTree using SNP data filtered by vcftools with parameters: -minGQ20 -max-missing 0.8 (12). The population structure based on nuclear SNPs was determined using the program Admixture with K ranging from 2 to 10 (13). The SNP data used for these analyses were filtered based on LD by plink. The principal component analysis (PCA) was performed using plink with parameter: --pca 20 (14).

Population genetic analyses
Vcftools was used to count heterozygous sites with the parameter: --het. The heterozygosity of each sample was calculated by dividing the number of heterozygous sites in each sample by the number of all SNPs. The forward simulation software SLiM3 was used to evaluate the heterozygosity of crops under different propagation types: outcrossing or cloning (15). Outcrossing was evaluated under the Standard Neutral Model. 10,000 diploids were used as the ancestral population sizes for both models and 20,000 generations were used as the burn-in. After burn-in, we calculated the heterozygosity every 100 generations. An analysis of introgressed alleles was also conducted with SLiM3. In these introgression simulations model, two independent populations were initialized with the same population size (Ne = 1000) and length of chromosomes (10 Mb). The beneficial mutations had fixed selection coefficients of s = 0.01 and a dominance coefficient of 0.5; the deleterious mutations had a selection coefficient following a gamma distribution with an average value of -0.03 and dominance coefficient of 0 (i.e., fully recessive). We simulated with ten-fold more expected deleterious, as opposed to beneficial, mutations. To simulate introgression, the two populations first experienced a burn-in of 10*Ne generations with outcrossing, and then a directional migration event of 17 individuals (estimated from our demography analyses) occurred from the donor to receptor population. The clonal receptor population remained outcrossing for 10 generations before becoming clonal. The codes for the simulation were on GitHub (https://github.com/zhouyflab/Grapevine_Adaptive_Maladaptive_Introgression (16)).
Sequence similarity (Dxy), and fixation indices (FST) were calculated using the Python script: popgenWindows.py (https://github.com/simonhmartin/genomics_general) with 50 kb nonoverlapping windows. Nucleotide diversity (π) was calculated using vcftools with parameter: --window-pi 50000. These statistics were evaluated within or between all five groups (EU, n = 72; ME1, n = 29; ME2, n = 7; wine, n = 77; and table, n = 148). We also used SMC++ to infer the demography of each of the five groups using unphased SNPs data according to the manuscript with default parameters (17). The mutation rate was set to 7 x 10 -9 and a generation time of 2.5 years was used to convert the generations into years (18). Identity By Descent (IBD) was calculated using plink with parameter: --genome. The average values within or between groups were calculated for presentation.
The continuous-time sequential Markovian coalescent approximation implemented fastsimcoal (version: fsc27) was used to further estimate the population history under complex evolutionary scenarios (19,20).
Before performing the analysis, the derived site frequency spectrum (unfolded SFS) was generated using a Python script: superSFS (https://github.com/zhouyflab/Grapevine_Adaptive_Maladaptive_Introgression (16)). This script can infer ancestral alleles according to outgroup species and count the frequency of the derived allele in each group. We then used fastsimcoal to estimate the unfolded SFS under different evolutionary models. Monomorphic sites and SFS entries with less than 20 SNPs were ignored. Each model was based on 100,000 coalescent simulations with 50 optimizations to estimate parameters for each scenario.
The histories of a single population, population differentiation without gene flow, and population differentiation with gene flow have been estimated. The command line for running fastsimcoal is: fastsimcoal -t model.tpl -e model.est -d -0 -C 10 -n 100000 -L 50 -s 0 -M -c20 -B40.
The genomic databases with SIFT predictions were generated by a tool, make-SIFT-db-all.pl, using genomic sequence data in Fasta format and the gene annotation file in gtf format (21). The protein database required by the tool was downloaded from UniProt (https://www.uniprot.org/) (22). Then the SNP data in vcf format and the genomic databases were input into SIFT 4G for deleterious mutation assessment. For each individual, the homozygous sites predicted as deleterious were counted as recessive deleterious mutation, and the additive deleterious mutation was counted by the number of heterozygous deleterious mutation sites plus two times the number of homozygous deleterious mutation sites (6). The LD decay was evaluated and plotted using PopLDdecay (https://github.com/BGI-shenzhen/PopLDdecay) (23). We evaluated LD every 1Mb sliding window with 1Mb step sites.

Selective sweep detection and analysis
Population branch statistic (PBS) analysis was performed by the program PBScan We also used the machine learning-based selective sweep analysis program diploSHIC/SHIC (25,26). This program predicted the selection types using the classifier that was trained on simulated data. Before the prediction, we simulated training and testing data in different categoriesi.e., neutral, soft-linked, hardlinked, soft and hard -using the program discoal (https://github.com/kr-colab/discoal). The demographic parameters for simulation were based on the results of SMC++. The scripts for the simulation are available on GitHub (https://github.com/zhouyflab/Grapevine_Adaptive_Maladaptive_Introgression (16)). Using this simulated data, we then used diploSHIC to make predictions that were performed as follows: First, the feature vectors of simulated data were generated using fvecSim mode in diploSHIC. Second, training and testing sets were generated using makeTrainingSets mode in diploSHIC. Third, both training data and testing data were input into diploSHIC using the training mode to train the classifier. Fourth, the feature vectors of each grape population were calculated using fvecVcf mode by inputting the SNP information stored within the VCF format. Finally, the classifiers trained in step three were applied to predict the sweep selection types of different regions in each group by inputting the feature vector data generated in step four using the prediction mode of diploSHIC. The default parameters were used for running diploSHIC. When calculating the feature vectors, 5kb sliding windows with 5 kb step size were used.

Introgression analyses
We used the program Dsuite (https://github.com/millanek/Dsuite) to calculate Patterson's D (ABBA-BABA) and f4-ratio statistics across populations to evaluate introgression probabilities across the genome (27). When performing the analysis, the SNP data and a tree in the newick format showing the relationship between groups were used as input files. The tree was written based on the phylogenetic tree generated by iqtree, as mentioned previously. We also used the machine learning-based introgression analysis program Filet to predict introgressed regions (28). Prior to prediction, we first generated training data via simulation. Msmove was used to simulate the history between EU and wine groups under scenarios of no gene flow, gene flow from EU to wine, and gene flow from wine to EU, using the parameters estimated by fastsimcoal previously (29,30). The scripts for running Msmove were uploaded to GitHub (https://github.com/zhouyflab/Grapevine_Adaptive_Maladaptive_Introgression (16)). Then, the classifier was trained and used for prediction as follows: First, the summary statistics of simulated data were calculated by Msmove using the normalizeTwoPopnStats.py script in Filet. Next, the summary statistics were aggregated into one labeled data matrix to be used for training using the buildThreeClassTrainingSet.py script in Filet. Third, the classifier was trained using the trainFiletClassifier.py script. Fourth, phased sequence data for EU and wine groups and inferred ancestral sequences were generated in Fasta format. Masking was also performed using the TE annotation information. Then the Fasta files were used for calculating the summary statistics in non-overlapping 10kb sliding windows by pgStatsBedSubpop_forML script and normalized by normalizePgStats.py script. Finally, the data from the previous step were input into the trained classifier for classifications, using the classifyChromosome.py script. The default parameters were used for running the Filet pipeline. As for generating the inferred ancestral sequences, we modified the reference genome at the sites that the alleles in 6 outgroup individuals more than 2. As for the classifications, there are three kinds of possibilities for each 10kb region: no introgression, introgression from EU to wine, and introgression from wine to EU. We considered a window to be introgressed only when the inferred probability of introgression was > 0.95.
To calculate introgressed allele frequencies and heterozygosity in introgressed regions, we analyzed sites that have a specific genotype in EU grapes relative to ME grapes. Alleles not existed in the ME2 groups and more than 50% in EU groups were defined as specific genotypes. The frequency of these specific genotypes in wine grapes and the ratio of individuals containing the specific genotypes were calculated for presentation.
The heterozygosity of introgressed alleles was calculated by dividing the number of heterozygous introgressed sites by the total number of introgressed sites in each individual.

Recombination analysis
We estimate the recombination rates in introgression and selection regions according to the methods in the paper of Abraham et. al (2021) (31). Firstly, the mapping data which defined the recombination rate between two markers were obtained from a previous study (32). Secondly, we transformed the location of 1662 markers in that paper onto our reference genome by sequence blast. Finally, the recombination rates of specific regions were counted in cM/kb.

GO-term enrichment
All protein sequences in the reference genome were blasted to UniProtKB/Swiss-Prot (major release 51.0) by diamond (v0.9.14.115) (33). Then based on the Gene Ontology (GO, http://geneontology.org/) databases, the results were converted into GO-Terms as the background (34,35

KINSHIP analysis
Using genotype data, the kinship matrix by the VanRaden method for relationships among individuals in the specific region was generated in GAPIT (https://github.com/jiabowang/GAPIT3). Then these data were imported into the Cytoscape for visualization, the nodes represent samples and the edge represents the kinship value.
6 EU ME2 ME1 Figure S1. The distribution possibility of wild grapes in Europe and the Near East and wild samples used. The redder the color, the higher the possibility. The yellow circles indicate the EU wild grapes used in this study. Green triangles indicate ME1 wild grapes used in this study. The yellow triangles indicate the ME2 wild grapes used in this study. Most of the EU wild grapes are only labeled the approximate location due to lack of coordination. Figure S2. Phylogenetic tree generated using GTR+I+G model. The lines in different colors indicate different groups: the ME1 group is yellow, the ME2 group is purple; the EU group is reddish brown; the wine group is blue; the table group is green, and the outgroup is black. The red circles on the outside mark the domesticated grapes that their chloroplast originated from the EU grapes, and the red triangles mark the domesticated grapes that their mitochondria originated from the EU grapes.           24 Table Wine  ME2  ME1  EU   Table Wine  ME2  ME1  EU  ME1  Table Wine  ME2 EU Table Wine  ME2  EU  ME1  Table Wine  ME2  ME1  EU   Table Wine  ME2  ME1 EU Table Wine  ME2  ME1 EU Table Wine  ME2  EU  ME1  Table Wine  ME2  ME1 EU Table Wine  ME2  EU  ME1  Table Wine  ME2 EU ME1 Table Wine  ME2 EU ME1 Table Wine  ME2  ME1 EU Table Wine  ME2  ME1  EU   Table Wine  ME2  EU  ME1  Table Wine  ME2  EU  ME1  Table Wine  ME2  ME1  EU   T1  T2  T3  T4  T5   T6  T7  T8  T9  T10   T11  T12  T13  T14  T15 T16 T17  T1   T2  T3  T4  T5   T6  T7  T8  T9  T10   T11  T12  T13  T14  T15 T16 T17 Figure S20. The models for estimation of the process of gene flow among table, wine and EU populations when ME2 is assumed as the ancestral population. The red lines indicate introgression beween wine and EU populations, and the blue lines indicate the introgression between table and EU populations. Table Wine  ME2  ME1 EU Table Wine  ME2  ME1 EU Table Wine  ME2  EU  ME1  Table Wine  ME2  EU  ME1  Table Wine  ME2 EU ME1 Table Wine  ME2  ME1 EU Table Wine  ME2  EU  ME1  Table Wine  ME2  EU  ME1  Table Wine  ME2  EU  ME1  Table Wine  ME2  ME1  EU   Table Wine  ME2  ME1 EU Table Wine  ME2  ME1  EU   Table Wine  ME2  ME1  EU  Table Wine  ME2  ME1  EU   Table Wine  ME2  ME1 EU Table Wine  ME2  EU  ME1  Table Wine  ME2 EU ME1 fd fd Figure S21. fd test between wine and EU group, table and EU group by 50 kb windows. The dots above the red horizontal line are the highest 1% regions, and the blue horizontal line indicate the threshold of 5% highest regions.  4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18                Movie S1 (separate file). The SFS change of introgressed alleles after hybridization in outcrossing population. The bar located directly above the movie indicates the generations after hybridization.

Nucleotide diversity (π)
Movie S2 (separate file). The SFS change of introgressed alleles after hybridization in clonal population. The bar located directly above the movie indicates the generations after hybridization.
Movie S3 (separate file). The DFE change of introgressed alleles after hybridization. The bar located directly above the movie indicates the generations after hybridization.
Dataset S1 (separate file). Prediction of selective sweep regions on the genome of wine population by Shic.
Dataset S2 (separate file). Prediction of selective sweep regions on the genome of table population by Shic.
Dataset S3 (separate file). Prediction of introgressed regions between EU and wine population by Filet. Class 1 means introgression from EU to wine population, and class 2 indicates the opposite.
Dataset S4 (separate file). The genes located in introgressed regions and the selective types of these regions in wine population. Class 1 means introgression from EU to wine population, and class 2 indicates the opposite.