Protocol for detecting introgressed archaic variants with SPrime

Summary The SPrime program detects the variants in current-day populations that were introgressed from an archaic source in the past. It is optimized for detecting introgression from Neanderthals and Denisovans in modern humans. We provide a protocol for detecting Neanderthal and Denisovan introgression in 1000 Genomes Project data, specifically focusing on the CHB (Han Chinese in Beijing) population. For complete details on the use and execution of this protocol, please refer to Browning et al. (2018).


SUMMARY
The SPrime program detects the variants in current-day populations that were introgressed from an archaic source in the past. It is optimized for detecting introgression from Neanderthals and Denisovans in modern humans. We provide a protocol for detecting Neanderthal and Denisovan introgression in 1000 Genomes Project data, specifically focusing on the CHB (Han Chinese in Beijing) population. For complete details on the use and execution of this protocol, please refer to Browning et al. (2018).

BEFORE YOU BEGIN Download the script files
Timing: 1 min 1. The SPrime pipeline script files are available from github (see key resources table). In this protocol, we use different folders for the output from each step and for the source data. We use a folder named ''download'' to store all downloaded data, a folder named ''tools'' to store computation tools and scripts specific to this protocol, a folder named ''tmp'' to store temporary files, and folders named ''step [2][3][4][5]'' to store output from each step. The code for each step should be executed within the corresponding folder unless you modify the file paths in the code. Basic knowledge about bcftools, R scripting, and bash scripting is required to understand this protocol.

Download the sequence data
Timing: 5 h 2. Sequence data for current-day and archaic populations should be in gzip-compressed VCF format (Danecek et al., 2011). The example data in this protocol can be downloaded via the links listed in the Key Resources Table. The script for downloading the genotype data from the 1000 Genomes Project, genotype data and genome masks for the two archaic populations, and a recombination map for the example analysis is available in the folder ''download'' in our published pipeline online (key resources table). In this protocol, we analyze the phase 3 data from the 1000 Genomes Project, as these are the data that were analyzed in the original SPrime paper (Browning et al., 2018). Analysis of the high-coverage 1000 Genomes data from the New York Genome Center's high coverage resequencing (Byrska-Bishop et al., 2021) would require the use of a genetic map in GRCh38 coordinates, and liftover of the SPrime results to match the GRCh37 coordinates of the available Neanderthal and Denisovan genomes.

MATERIALS AND EQUIPMENT
Genotype data (VCF files for current-day and archaic populations, see download the sequence data in before you begin) SPrime, bcftools, R and other scripts used in this protocol (see Software section of key resources  table) A Linux computer with bash installed and at least 16 Gb of memory. The protocol should also work on Mac OS. In this protocol, all tests were run on a Linux 12-core 2.6 GHz computer with Intel Xeon ES-2630 processors and 128 GB of memory.

STEP-BY-STEP METHOD DETAILS
Step-1: Install SPrime Timing: 1 min 1. The SPrime software has been included in the SPrime pipeline download. The latest version of the software can be downloaded from the github page: https://github.com/browning-lab/sprime. Place the software file ''sprime.jar'' in the ''tools'' folder. Check that it is working and print out information on the parameters by running ''java -jar sprime.jar'' ( Figure 1).
Step-2: Prepare input data for the SPrime analysis 3. We concatenate all autosomes into one file, because SPrime needs whole genome data to estimate key parameters. Although we will analyze the chromosomes one by one in order to parallelize computation, SPrime will obtain information about relative mutation rates from the whole autosome. bcftools concat -file-list vcf.file.list -naive -output-type z -output all.auto.vcf.gz Step-3: Run SPrime to detect introgressed variants Timing: 1 h 4. We use the HapMap combined LD map as the input recombination map in this example, following the analysis in (Browning et al., 2018). The recombination map must be in plink format  It takes approximately 1 h to run SPrime on all autosomes. The maximum memory in use for one chromosome is 4.6GB.
Step-4: Calculate match rates to a known archaic genome Timing: 1.5 h SPrime is able to detect archaic introgression without knowing the archaic genome by utilizing a purported non-admixed population as an outgroup. For Neanderthal or Denisovan introgression, a West African population is typically used as the outgroup, for example the YRI from the 1000 Genomes Project. If a relevant archaic genome has been sequenced, one can map the detected variants to the archaic genome to confirm the source of introgression. We use the genome of the Altai Denisovan and the genome of the Vindija Neanderthal to represent two different sources of archaic introgression. The archaic genomes are in VCF format and the mask files are in BED (Browser Extensible Data) format.
6. For each variant detected by SPrime, map it to the archaic genome, resulting in ''match'', ''mismatch'', or ''notcomp'' to the archaic genome. The three states mean the detected variant is present in the archaic genome, is not present in the archaic genome, or is not comparable because genotype quality in the archaic genome is low for that locus. To complete this step, we have an C script named ''map_arch'', which adds the match status for each variant as an additional column to SPrime's output. The mismatch analysis takes 7 mins for chromosome 2 and 81 mins for all autosomes. The maximum memory in use for one chromosome is 10GB.
Step-5: Find multiple sources of archaic introgression Timing: 1 min 7. This is an optional step for those who are interested in population history. Once we know the match status of each detected variant to the archaic genome, we are able to calculate the match rate for each reported introgression segment. The match rate for a segment is the number of matching positions divided by the sum of matching and mis-matching positions (the match rate is undefined if all the SPrime variants in the segment are not comparable to the archaic genome). If a segment has high match rate to a particular archaic genome, this segment probably shares close ancestry with that archaic genome. By calculating the match rate to different archaic genomes, we may find evidence of different sources of introgression as in Figure 2 (Browning et al., 2018

EXPECTED OUTCOMES
SPrimes's output score file from ( Comparison of the putative archaic segments to archaic genomes yields one or more additional columns ( Figure 4). Although some positions are not represented in the archaic sequences and hence are shown as "notcomp", other positions will "match" or "mismatch" the archaic genome. Within a truly introgressed segment, "match" results will predominate, although some positions will be mismatches due to polymorphism within the archaic population. Visualization of these results is possible through contour plots ( Figure 2) for two archaic genomes, or histograms for a single archaic genome.

LIMITATIONS
The SPrime method is based on an out-of-Africa Neanderthal-admixture model, in which a modern human founder population of small size interbred around 60,000 years ago with an archaic population that had split from the main human lineage around 400,000 years ago (Browning et al., 2018). In alternative scenarios, re-tuning of the parameters of the method may be advantageous.
Particular care should be taken when interpreting results from populations in which one or more subpopulation split off from the rest of the population a large number of generations ago. If the time since the split has been sufficient for the build-up of a high density of sub-population-specific The peak in the upper left represents ancestry from a group that is closely related to the Altai Denisovan, the peak in the middle left represents ancestry from a group that is distantly related to the Altai Denisovan, the peak on the lower right represents ancestry from Neanderthals, and the peak in the lower left represents other segments which may be false positive introgression calls or introgression from another source. The SPrime method assumes the availability of sampled individuals from an outgroup population that has experienced negligible admixture from the archaic source. For Neanderthal and Denisovan admixture, West African populations are suitable. However, archaic admixture events that occurred within Africa prior to the out-of-Africa migration may have resulted in current-day admixture in most if not all human populations, so that it may not be possible to find a suitable outgroup population.
The SPrime method doesn't find all the introgressed material because some introgressed segments are too short to be detected. In the out-of-Africa Neanderthal admixture scenario, around one half of introgressed material is detected with the method (Browning et al., 2018). Thus estimates of admixture proportion based on the SPrime results will be underestimates.

Potential solution
Use ''bcftools index'' to rebuild the index file for the VCF input.

Potential solution
Check the chromosome identifier, which should be consistent between the genotype files and the recombination map file.

Problem 3
Genotype data is not in GRCh37 (e.g., is in GRCh38), but currently the archaic genomes are only available in GRCh37 (step 4).

Potential solution
Confirm that the ''zlib'' is installed in the system and then type ''make'' in the folder containing the code to recompile the program. Problem 5 The match rate between SPrime's output and the archaic genome is low (steps 4 and 5).

Potential solutions
The match rate is based on variants that are polymorphic in the target population (such as the CHB) and that are inferred by SPrime to be introgressed from the archaic population. Due to polymorphism in the archaic population, such variants won't always match the archaic genome. The level of divergence between the introgressing population and the population from which the archaic sequence is derived also can also have a significant impact. For example, the match rate between some Denisovan-derived introgressed sequence in the CHB data and the sequenced Altai Denisovan is only 50% due to high divergence, whereas the match rates between Neanderthal-derived introgressed sequence in the CHB and the Vindija Neanderthal is 90% (Figure 2). Genotype errors in the archaic genome, and false positive results in the SPrime analysis also affect results.
False positive rates may be inflated in some populations, such as admixed populations, or when sample sizes are small (<15) (Browning et al., 2018). In such cases, it may be necessary to run simulations to choose an appropriate score threshold.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and code should be directed to and will be fulfilled by the lead contact, Sharon Browning (sguy@uw.edu).

Materials availability
This study did not generate new unique reagents.

Data and code availability
Code generated during this study is available at https://github.com/YingZhou001/sprimepipeline