A high-quality human reference panel reveals the complexity and distribution of genomic structural variants

Structural variation (SV) represents a major source of differences between individual human genomes and has been linked to disease phenotypes. However, the majority of studies neither provide a global view of the full spectrum of these variants nor integrate them into reference panels of genetic variation. Here, we analyse whole genome sequencing data of 769 individuals from 250 Dutch families, and provide a haplotype-resolved map of 1.9 million genome variants across 9 different variant classes, including novel forms of complex indels, and retrotransposition-mediated insertions of mobile elements and processed RNAs. A large proportion are previously underreported variants between 21 and 100 bp in size. We detect 4 megabases of novel sequence, encoding 11 new transcripts. Finally, we show that 191 known, trait-associated SNPs are in strong linkage disequilibrium with SVs and demonstrate that our panel facilitates accurate imputation of SVs in unrelated individuals.

Supplementary Figure legend: Deletions were stratified into four bins based on AF quartiles (panels a-d). Each plot shows the length distribution in the given AF range. Thin bars indicate the approaches that contributed to the discovery of the deletions in the respective size and AF range; the same deletion can have been discovered by multiple approaches, so the five thin bars can add up to more than the enclosing grey bar.
Supplementary Figure 11: Distribution of imputed genotypes per 'ground truth' genotype and per SV type, considering only genotypes for which the gold-standard genotype contains the rare allele. Note that either the reference or the alternative allele can be the rare(r) allele. RR, RA, and AA are short for "homozygous in the reference allele", "heterozygous", and "homozygous in the alternative allele". Imputed genotypes were filtered at a GL threshold of c = 0.95.
Table 3: Overview of classes of novel genomic segments identified in the GoNL dataset.

Indel and structural variant calling
Variant identification was performed using a combination of 12 different algorithms, representing 5 major approaches for indel and structural variation (SV) discovery: (i) gapped reads, (ii) split reads, (iii) read pairs, (iv) read depth and (v) de novo assembly (Figure 1). In the first step of the analysis, the input data was filtered to only include reads of which one end could not be mapped to the reference genome, significantly reducing the file size for each region and resulting in less transfer overhead. In the second step of the analysis, one region was selected and analyzed across all samples. For each analysis, 8 compute cores were used, consuming 5,000 core hours in total. The confident variants are those observed in one child with at least 2 supporting reads and which also appeared in at least one of the parents.

GATK Unified Genotyper
We used the GATK UnifiedGenotyper (UG; https://software.broadinstitute.org/gatk/) v2.1.8 for indel discovery and calculation of genotype likelihoods, running it on 5 Mb regions with 100 bp overlap on each side. The overlap ensured that an indel at the fringe of the 5 Mb region would be correctly called. In order to separate biallelic sites from multi-allelic ones, we used the GATK UG multi-allelic model for indel calling.
Only bi-allelic ones were kept for further processing and downstream analyses. We only reported indels with a Phred quality score > Q10.
We adopted a filtering strategy by using GATK VQSR on our initial calls, based on annotations including: (iii) haplotype score, (iv) read position rank sum test, and (v) inbreeding coefficient.
We genotyped all autosomal calls passing VQSR with GATK PhaseByTransmission using a mutation prior of 10^-4 to obtain sensitive trio-aware genotypes.

1--2--3--SV
We used version 0.9 of the 1-2-3-SV software package (http://tools.genomes.nl/123sv.html) 5 for population-based calling of deletions, inversions, intra-and inter-chromosomal structural variants. The analysis was performed across the complete set of 2,238 libraries. We considered only uniquely mapped reads (BWA X0 flag = 1) that do not have a secondary hit (X1 flag = 0). A read pair was considered to be discordant if the two reads in a pair map to different chromosomes, display incorrect mapping orientation or exhibited too short (below 0.1 percentile) or too long (above 99.9 percentile of fragment size distribution characteristic for a given library) distance of the mapping positions. We only retained clusters supported by 4 or more discordant and independent (not PCR duplicates) read pairs.
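As a sketch, the discordance criterion described above can be expressed as follows; the function and its orientation check are illustrative assumptions, not part of the 1-2-3-SV code, and the percentile cutoffs are per-library values:

```python
def is_discordant(chrom1, chrom2, strand1, strand2, distance,
                  lo_cutoff, hi_cutoff):
    """Classify a read pair as discordant following the criteria above:
    different chromosomes, incorrect orientation (simplified here to the
    expected forward/reverse pair), or a mapping distance below the 0.1
    or above the 99.9 percentile of the library's fragment sizes."""
    if chrom1 != chrom2:                     # interchromosomal pair
        return True
    if (strand1, strand2) != ('+', '-'):     # unexpected orientation
        return True
    if distance < lo_cutoff or distance > hi_cutoff:
        return True                          # too short or too long
    return False
```
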

Breakdancer
We used BreakDancer (http://breakdancer.sourceforge.net) version 1.1 with parameters -r 5 and -c 7 6 ; the rest of the parameters were set to default. Calling was performed per family. Calls from different families that have overlapping confidence intervals at both breakpoints were merged in an iterative fashion, using a custom Perl script. Clusters supported by at least five read pairs were retained for further analysis.
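The per-family merging step can be sketched as follows; the representation of a call as the confidence intervals of its two breakpoints is an illustrative assumption (the original used a custom Perl script):

```python
def merge_calls(calls):
    """Merge calls whose confidence intervals overlap at BOTH
    breakpoints. Each call is ((s1, e1), (s2, e2)): the CIs of the
    left and right breakpoint. Merged calls take the union of CIs."""
    def ci_overlap(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    merged = []
    for call in sorted(calls):
        for i, m in enumerate(merged):
            if ci_overlap(call[0], m[0]) and ci_overlap(call[1], m[1]):
                # both breakpoint CIs overlap: merge into one call
                merged[i] = ((min(m[0][0], call[0][0]), max(m[0][1], call[0][1])),
                             (min(m[1][0], call[1][0]), max(m[1][1], call[1][1])))
                break
        else:
            merged.append(call)
    return merged
```
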

DWAC--Seq
Version 0.7 of DWAC-Seq (http://tools.genomes.nl/dwac-seq.html) was used to call segmental copy-number changes. A BAM file representing a common GoNL reference set was constructed by taking 0.13% (1/769) reads from each of the individual samples and merging them into a single file. Each GoNL sample was then compared against this common reference using dynamic windows that overlap 250 unambiguously mapped sequencing reads (BWA flag X0=1). We filtered out CNV candidates that showed a copy-number change of less than 30%. CNV calls from different families with 80% reciprocal overlap were merged together in an iterative manner using a custom Perl script.
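The 80% reciprocal overlap criterion used for merging CNV calls across families can be sketched as:

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, frac=0.8):
    """True if two CNV calls overlap reciprocally, i.e. the shared
    interval covers at least `frac` of EACH call's length."""
    ov = min(a_end, b_end) - max(a_start, b_start)
    if ov <= 0:
        return False                         # no overlap at all
    return (ov / (a_end - a_start) >= frac and
            ov / (b_end - b_start) >= frac)
```
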

CNVnator
The CNVnator (http://sv.gersteinlab.org/cnvnator/) v0.2.2 read-depth algorithm was used for calling CNVs 7 . The read-depth signal consists of read counts in non-overlapping genomic bins of equal size. This signal is then segmented into regions with different copy numbers, and CNV calls are predicted on the basis of statistical significance tests performed on the segmented regions.
To determine the optimal bin size for our dataset, we analyzed a quartet with MZ twins using CNVnator with nine different bin sizes (ranging from 100 to 500 bp, with an increment of 50 bp). We found that a bin size of 500 bp resulted in the highest concordance between the MZ twin samples and therefore used this setting for the full data set.
We applied a GC correction and included CNVs in the final call on the basis of the following criteria: (i) p < 0.05 (corrected for multiple testing), (ii) the proportion of reads in the CNV that map to > 1 location does not exceed 50%, and (iii) the CNV does not overlap with an assembly gap of the reference genome or centromere.

FACADE
Short reads were mapped using mrsFAST to a repeat-masked reference (hg19) 8 . A correction for GC bias was applied to the raw read depth. An initial quality control step involved calculation of the correlation (R2) between copy numbers and coverage at known loci with a fixed genotype. One sample with R2 < 0.85 was withdrawn from further analysis.
With read depth information, we could predict the copy number of a gene or a segment of the genome. As a second quality control step, we performed copy number calls with 3 kb windows across the whole genome. Samples that had > 120,000 windows with a copy number of 2 were discarded, leading to the exclusion of 30 samples. In total, 224 families including all 19 quartets met the criteria and were used for variant calling.
To discover CNV events, we first estimated copy numbers across the whole genome from read-depth analysis using windows of 1kb of unique sequence. Then, copy number log 2 ratios between offspring and parents as well as all samples versus the population median were calculated. All pairwise log 2 ratios were segmented using the previously published method FACADE to produce the initial call set 8 . To account for variability in call boundaries with events encompassing a small number of windows of variable genomic size, we applied a Multiscale French Tophat wavelet to refine the breakpoints for events between 2 and 7 windows in size. Each call was classified into four classes: (i) de novo, (ii) assortment, (iii) inherited, and (iv) common to the family. To filter out low-confident calls, only calls spanning ≥ 2 windows and with a log ratio greater than 0.31 or less than -0.25 for duplications and deletions, respectively, were retained.
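The retention thresholds for duplications and deletions can be illustrated with a small sketch; the function name, its copy-number inputs, and the direct log2-ratio computation are assumptions for illustration, not the FACADE implementation:

```python
import math

def classify_call(child_cn, parent_cn, n_windows,
                  dup_cutoff=0.31, del_cutoff=-0.25, min_windows=2):
    """Classify a segment from the log2 ratio of offspring vs. parental
    copy number, using the retention thresholds given in the text:
    > 0.31 for duplications, < -0.25 for deletions, >= 2 windows."""
    if n_windows < min_windows:
        return None                          # too few windows: dropped
    ratio = math.log2(child_cn / parent_cn)
    if ratio > dup_cutoff:
        return "duplication"
    if ratio < del_cutoff:
        return "deletion"
    return None                              # within noise band: not retained
```
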

Mate--Clever
For deletion discovery, we ran the discovery part of MATE-CLEVER 9 , with minor modifications that account for different insert size distributions in different libraries.
MATE-CLEVER is an integrated approach. Its major purpose in the frame of the project is to discover deletions of size 30--100 bp (sometimes termed the "twilight zone of NGS indels").
It incorporates CLEVER 10 , as an internal segment size based approach and LASER 9 , as a split-read aligner.
MATE-CLEVER uses CLEVER in a first step, in order to discover deletions of size 30--100 bp at extremely high sensitivity, and uses LASER in a second step, in order to refine the breakpoint annotations made by CLEVER. It also uses several auxiliary tools, as described below. In the following, we provide the full description of details and commands, by which to reproduce MATE-CLEVER's callset.
To run the CLEVER-based deletion discovery pipeline, revision 3097f2 from the git repository at https://bitbucket.org/tobiasmarschall/clever-toolkit has been used. It includes CLEVER, LASER, and several auxiliary tools described below. To run (and parallelize) the below pipeline, we used the Python-based workflow engine Snakemake 11 .
CLEVER was run on all individuals separately. In detail, the following steps were performed: 1. We used the tool bwabam-to-alignment-priors to extract prior alignment probabilities. The output was split by chromosome. Command line: "bwabam-to-alignment-priors -m mean-and-sd.txt ref.fasta input.bam | split-priors-by-chromosome -z -s input.bam priors". For each chromosome and each sample in the input, one gzipped file prefixed with "priors" is created. All alternative alignments provided by BWA through XA tags are used. Furthermore, this step generates robust estimates of insert size mean and standard deviation, which are stored in the file mean-and-sd.txt.
2. All files with alignment probabilities are sorted by genomic position by means of a standard unix sort (sort -g -k7).
3. The CLEVER core engine is run on each of these files. Command line: "zcat input.probabilities.gz | clever -c 150 -v > clever.output". The option "-c 150" instructs CLEVER to limit local coverage to 150 to avoid long runtimes for regions with excess coverage (due to repeats).
LASER was run on all regions with putative deletions as identified by CLEVER in Step 1:
1. To generate high-quality background distributions for insert size, insertion size, and deletion size from uniquely mapping reads, LASER was used to realign reads from randomly chosen regions. To this end, 5000 regions of length 10000 were sampled uniformly at random. Reads mapped to these regions by BWA were extracted and remapped using LASER as described previously 9 . The following command line was used for the LASER core step: "join-split-reads -XIS -A14 --anchor_distance 2000 --
5. The scores of all alignment pairs were recalibrated as follows. Phred-scaled insertion and deletion probabilities were set according to the empiric distribution obtained in Step 2.1. Deletions in the candidate set generated in Step 2.4 incur a constant (length-independent) alignment cost of 1. This upweights alignments that support deletions that have also been reported by CLEVER in Step 1 and are thus in line with read pair evidence. For each read pair, the alignment pair with the highest probability is reported as primary alignment. All secondary alignments were discarded at this stage.
6. The number of LASER (primary) alignments supporting each deletion were extracted from the BAM files for further processing.
All deletions of CLEVER and LASER were merged. In this step, the whole GoNL population is considered simultaneously. We iterated over all putative deletions as reported by LASER sorted (decreasingly) by total support of primary alignments in the whole population.
1. For each individual, the set of CLEVER predictions was searched for deletions with a center distance of at most 100 and a length difference of at most 20 from the currently processed deletion. All CLEVER calls matching the current deletion are marked and ignored in subsequent iterations. This ensures that each CLEVER deletion is assigned to at most one LASER prediction, giving precedence to predictions common in the population.
2. If a deletion is supported by CLEVER with support of at least 5 and by at least one LASER (split) alignment, we report it to be present in the individual in question.
3. All deletions present in at least one individual are included in the final VCF file.
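The merging procedure in steps 1-3 above can be sketched as follows; the tuple representation of calls as (start, end, support) is an illustrative assumption:

```python
def assign_clever_calls(laser_calls, clever_calls,
                        max_center_dist=100, max_len_diff=20):
    """Greedily assign CLEVER deletions to LASER predictions: LASER
    calls are visited in decreasing order of support, and each CLEVER
    call (matched by center distance <= 100 and length difference
    <= 20) is consumed by at most one LASER prediction."""
    unused = set(range(len(clever_calls)))
    assignment = {}
    order = sorted(range(len(laser_calls)), key=lambda i: -laser_calls[i][2])
    for li in order:
        ls, le, _ = laser_calls[li]
        matches = []
        for ci in sorted(unused):
            cs, ce, _ = clever_calls[ci]
            center_dist = abs((ls + le) / 2 - (cs + ce) / 2)
            len_diff = abs((le - ls) - (ce - cs))
            if center_dist <= max_center_dist and len_diff <= max_len_diff:
                matches.append(ci)
        unused -= set(matches)               # ignored in later iterations
        assignment[li] = matches
    return assignment
```
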

SOAPdenovo de novo assembly
We selected discordantly mapped and unmapped read pairs from each family and performed de novo assembly per family using SOAPdenovo2 13 , version r240, with k-mer size 63 (http://soap.genomics.org.cn/soapdenovo.html). We then obtained family-specific contigs and assembled all family-specific contigs for the entire GoNL sample set together in a second round of de novo assembly. The resultant non-redundant contigs were compared against the GRCh37/hg19 genome reference using BLAT software. Alignments that consisted of two distantly mapped HSP segments of at least 150 base pairs corresponding to the left and right parts of a contig (with no more than 10 bp overlap at contig level) were parsed by a custom script (available upon request) to define breakpoints of structural genome rearrangements.

Mobster
Non-reference mobile element insertions (MEIs) were identified using Mobster v0.1.6 with default parameters 14 . Analysis was run separately for each family. We only retained candidate MEIs supported by 5 or more reads.
The sensitivity for detection of MEIs by Mobster was separately tested using a set of 134 MEIs based on a combination of both single and double cluster predictions.

Identification of novel genomic segments
Two types of variants called in section 3.1.11 were analysed separately: i) full contigs that exceed 150bp in size and did not map anywhere with >99% homology when mapped by NCBI BLAST against the GRCh37/hg19 assembly and ii) parts of contigs larger than 150 bases that did not map to any genomic location. We realigned the individual-specific sets of discordant reads using these new segments as a reference sequence, in order to determine their presence/absence in the libraries of each individual.
We divided these new segments into two sets: i) potential contaminations (segments that appear in some of the libraries, but not in all libraries of the same individual) and ii) true new segments (Table 3). True new segments were subsequently compared against the GRCh38/hg38 genome reference or a decoy dataset hg38d1. We required 99% identity in the alignment between an assembled segment and the latest genome reference to discard that segment from the set unreported by GRCh38.

Construction of simple indel set
A data set with simple indels (1-20 base pairs) was constructed by merging four individual callsets obtained by running GATK HaplotypeCaller, Pindel, Mate-Clever and SOAPdenovo assembly (as described in previous sections).

Construction of complex indel set
Genomic regions showing a high density of polymorphisms (distance between adjacent polymorphisms below 30 basepairs) were tested for being complex events or alleles that potentially appeared as part of the single mutational event, but called as separate adjacent events. GATK HaplotypeCaller was rerun for these regions using the 'mergeVariantsViaLD' option and non-simple indels were retained. Simple events that originally represented these complex events in the simple indel dataset were then excluded from the simple indel dataset.

Variant validation
Here, we describe an extension of these validations, which is based on the GoNL indel and SV call set released with this paper. For all validation rounds described below, we performed PCR amplification of the variant region or breakpoint junction regions (Supplementary Data 1). PCR primers were designed using primer3 software based on breakpoint coordinates.
Subsequently, we performed sequencing of PCR products using Sanger sequencing or MiSeq (2x250bp). Sanger sequencing reads were mapped to the human reference genome.

Indels of 1--20bp
Besides the validation assays described previously for the GoNL release 1 dataset 2 , we performed additional verification assays for this variant class.

21--100bp deletions
We have previously described validation for 96 deletions of 21-100bp 2 . We extended and reanalyzed these data and performed verification PCR assays and MiSeq sequencing for an additional 48 midsize deletions. Altogether, a total of 142 out of 144 21-100bp indels were confirmed.

Validation of 100bp+ deletions
We have previously described validation for 48 deletions larger than 100bp and observed a confirmation rate of 46/48 2 . Here, we performed validation assays for an additional 51 deletions larger than 100bp by PCR and MiSeq on a single individual from the GoNL sample set. Two assays failed and two deletions could not be confirmed.

Validation of tandem duplications
A total of 48 tandem duplications were tested using PCR and MiSeq on a single individual.
We could confirm 41 out of 48 variants and failed to obtain a PCR product for 7 of them, resulting in a confirmation rate of 85.4%.

Validation of inversions
We identified a total of 76 inversions in the GoNL dataset. These were all subjected to experimental verification using PCR and MiSeq on a diverse set of GoNL samples. Out of 76 tested inversion candidates, we confirmed 49, while the PCR assay failed for 27 variants, resulting in a confirmation rate of 64.5%.

Validation of mobile element insertions
We tested 96 mobile element insertions (all Mobster calls) using PCR of one of the two breakpoint junctions. The assays were performed on a single individual of the GoNL dataset.
We successfully validated 92 mobile element insertions, while 4 assays failed, resulting in a confirmation rate of 95.8%.

Validation of interchromosomal rearrangements
All 42 interchromosomal rearrangements detected in the GoNL cohort were independently tested by PCR and MiSeq. Based on 42 breakpoint PCR assays, we could confirm 35 interchromosomal changes, resulting in a confirmation rate of 83.3%. Recent data showed that interchromosomal rearrangements may occur as a result of gene retrotransposition insertion polymorphisms (GRIPs) 16 . We confirmed at least 10 GRIP events in our set of interchromosomal rearrangements (Supplementary Data 2).

Validation of novel genomic segments
We tested 96 novel genomic segments, which were not present in the GRCh38 reference build. A total of 86 variants could be confirmed based on DNA from a single GoNL sample, while 10 assays did not result in a PCR product, resulting in a confirmation rate of 89.6%.

Validation of complex indels
We tested 10 complex indels and could confirm 8 of these, resulting in a confirmation rate of 80%.

Functional annotation of Variants
The consensus events were annotated using a custom annotation pipeline as well as

Effects of SVs on gene expression
Whole blood paired-end RNA-seq data is currently being generated for the Genome of the Netherlands.

Variant Hotspots
To identify genomic regions where deletions frequently occur, 1 Mb genomic bins across the primary sequence were created and the number of unique deletion events per bin counted based on the consensus set. Regions with significantly more events were identified based on a z-score greater than 2 and compared to those previously reported 4 . Likewise, to identify hotspots across variant types, the analysis was repeated to include deletions, duplications, mobile element insertions and inversions. A cross-variant hotspot was defined as a genomic region with a z-score greater than 2 across more than 3 variant classes.
Candidate hotspots were subsequently filtered based on accessibility for sequencing as previously reported 1 .
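The z-score-based hotspot definition can be sketched as follows, assuming the input is simply a list of per-bin event counts:

```python
from statistics import mean, stdev

def hotspot_bins(counts_per_bin, z_cutoff=2.0):
    """Return indices of 1 Mb bins whose event count exceeds the
    genome-wide mean by more than `z_cutoff` standard deviations."""
    mu, sd = mean(counts_per_bin), stdev(counts_per_bin)
    return [i for i, c in enumerate(counts_per_bin)
            if (c - mu) / sd > z_cutoff]
```
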

Variant Novelty
To determine which SVs were known previously, we compared our call set to dbSNP, DGV, 1000 Genomes data, CHM1 calls, and our previous public GoNL release 2,21-24 . Details are given below. In addition to the datasets named above, the novelty of the MEI call set was also compared to three additional studies 15,25,26 .

Comparison to dbSNP
We compared simple and complex indel calls to dbSNP release 146 (released on 2015-11-04) by searching for exact matches of reference and alternative alleles between our and dbSNP polymorphisms.

Comparison to DGV
We compared our call set to release 2015-07-23 of the database of genomic variants (DGV, see footnote 1 ) 24 . For deletions, we considered all DGV entries where "varianttype" was equal to "CNV" and "variantsubtype" was "deletion" or "loss". For duplications, we considered DGV entries with type "CNV" and subtype "duplication", "gain", or "tandem duplication". For MEIs, we used DGV entries with type "CNV" and subtype "mobile element insertion". For inversions, entries with type "OTHER" and subtype "inversion" were used. For inversions, duplications, and deletions longer than 100bp, we counted an event as a match if its center was at most 2000bp away from the center of the respective DGV event, and the length difference was at most 500bp. For the (more frequent) deletions 21-100bp, we used more stringent criteria to avoid false positives and counted matches with a center distance of at most 100bp and a length difference of at most 20bp. For MEIs (where a reliable length estimate is often not available), we counted DGV events as a match if the insertion point was at a distance of at most 500bp.
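The matching criteria above can be summarized in a small predicate; the type labels used here are illustrative shorthand:

```python
def matches_dgv(call, dgv_entry, sv_type):
    """Matching criteria used for the DGV comparison. Events are
    (start, end) pairs; for MEIs only the insertion point (start)
    is compared. Cutoffs mirror the values stated in the text."""
    cs, ce = call
    ds, de = dgv_entry
    if sv_type == "MEI":                     # no reliable length estimate
        return abs(cs - ds) <= 500
    center_dist = abs((cs + ce) / 2 - (ds + de) / 2)
    len_diff = abs((ce - cs) - (de - ds))
    if sv_type == "DEL_21_100":              # 21-100bp deletions: stricter
        return center_dist <= 100 and len_diff <= 20
    return center_dist <= 2000 and len_diff <= 500
```
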

Comparison to 1000 Genomes Data
We compared our call set to the integrated SV release of phase 3 of the 1000 Genomes Project, see footnote 2 . We compared all deletions >20bp to 1000G events where SVTYPE matched DEL, CNV, DEL_ALU, DEL_HERV, DEL_LINE1, or DEL_SVA. Duplications were compared to SVTYPES matching DUP or CNV. Inversions were compared to records with SVTYPE=INV and MEIs were compared to SVTYPES matching ALU, LINE1, or SVA. As described above for DGV, matching was done based on center distance and length difference, using the same cutoffs.

Comparison to previous GoNL release
Previously 2 , we have reported on a set of SNPs, simple indels, and deletions derived from the same NGS data. This call set is publicly available 3 as "release 5". We also compared the call set reported here to this previous one and only report variants as novel that are not included in the previous set. For indels we required an exact match of coordinates and called reference and alternative alleles. For matching deletions >20bp, we used the same criteria as used above for DGV and 1000 Genomes.

Comparison to CHM1
We obtained calls from CHM1 22 provided as three sets of events (insertions, deletions, and inversions) and compared them to our call set. For deletions and inversions, we used the same matching criteria as above. For insertions, the CHM1 data provides intervals (in a BED file). We counted a MEI as matching an insertion from CHM1 when our estimated insertion point was included in, or at most 100bp away from, one of the provided intervals.

Selection of GWAS SNPs
We excluded SNPs whose description specified a non-Caucasian cohort and SNPs where the risk allele was not specified. Removing duplicates yielded 7,457 SNPs, all of which were shown to be associated with disease in Caucasian individuals, which in the following we will refer to as GWAS SNPs. In the following, we only consider data of the parents to ensure that the sample is (approximately) independent. Both the SNPs and the deletions were phased using MVNCall 27 . SNP and deletion variants with too low minor allele frequency (< 4%) were discarded, following the workflow described in Maurano et al. 18 . This filtering step reduced the number of autosomal GWAS SNPs, non-GWAS SNPs and deletions to 3,198, 1,259,007 and 21,685, respectively.

Identification of significant (non)--GWAS SNP--deletion pairs
First, we tested for linkage disequilibrium (non-random association between alleles) between every GWAS SNP and deletion that are 1) on the same autosome, and 2) less than 1 million base pairs removed from each other (one centiMorgan on average [Lodish et al., "Molecular Cell Biology"]), so that associations are likely to be missed in only 1% of cases. The total number of GWAS SNP-deletion pairs that satisfied these constraints was 55,250. In order to assess the association within each pair, we computed Pearson's coefficient of determination (R2) and applied Fisher's exact test (two-sided). We accounted for multiple testing by employing the Benjamini-Hochberg false discovery rate (FDR) control procedure with a q-value (expected proportion of false discoveries). Tag SNPs were identified by computing Pearson's coefficient of determination (R2) and applying Fisher's (two-sided) exact test on the basis of the data of the parents. The SNP with the highest R2 (of at least 0.8) and a p-value lower than 5% is regarded to be the most appropriate tag SNP for that particular deletion.
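For illustration, R2 between two phased biallelic variants and the Benjamini-Hochberg selection can be sketched as follows; the haplotype-count encoding is an assumption of this sketch, and Fisher's exact test is omitted:

```python
def r_squared(hap_counts):
    """Pearson's R^2 between two biallelic variants from phased
    haplotype counts, keyed 'AB', 'Ab', 'aB', 'ab' (A/a = SNP
    alleles, B/b = deletion alleles; an illustrative encoding)."""
    n = sum(hap_counts.values())
    p_ab = hap_counts['AB'] / n
    p_a = (hap_counts['AB'] + hap_counts['Ab']) / n
    p_b = (hap_counts['AB'] + hap_counts['aB']) / n
    d = p_ab - p_a * p_b                     # linkage disequilibrium D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def benjamini_hochberg(pvalues, q=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg FDR
    procedure at level q."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    m, k_max = len(pvalues), 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k_max = rank
    return sorted(order[:k_max])
```
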

Tagging GoNL deletions
The results can be found in Supplementary Data 7.

Genotyping indels and SVs
SV genotyping was performed for all SVs in the consensus call set described in Section 3.3. In the following, we describe the steps performed in detail for each SV type.

A maximum likelihood model for genotyping insertions and deletions
The following model has been adopted from [Marschall & Schönhuth, to appear], where it was described for midsize deletions. Here, we make use of the model for deletions of arbitrary length, and also for insertions that can be detected via paired-end read and split-read data signals. The model described in the following has been implemented as part of the CLEVER Toolkit (which also contains MATE-CLEVER).
Let $g_0$, $g_1$, $g_2$ indicate the genotypes of an indel: $g_0$ denotes absence of the indel, $g_1$ denotes a heterozygous indel and $g_2$ a homozygous indel. In the following, we fix the region of interest, that is, the region that harbors the putative indel. Let $A$ be a read alignment. Let $R$ be a single read and $\mathcal{R}$ be the set of all reads that have an alignment with the region under consideration. For $R \in \mathcal{R}$, let $A(R)$ be the alignment of $R$ with the region we would like to genotype.
Let $S$ be a subset of reads from $\mathcal{R}$. In slight abuse of notation, we formally consider $S$ as the event that precisely the alignments of reads from $S$ in the region of interest are correct, while all others do not align in the region. Hence, $P(S)$ is the corresponding probability. In the following, we consider a maximum likelihood (ML) setting, which in particular reflects that our prior belief in a genotype to be correct is the same for all types: $P(g_0) = P(g_1) = P(g_2)$. This assumption is necessary for an efficient computation scheme, because a more flexible Bayesian approach would only yield a scheme that is exponential in $|\mathcal{R}|$, the number of reads that align to the region of interest. If $|\mathcal{R}|$ exceeds 20, this becomes infeasible.
We are interested in maximizing, for $j = 0, 1, 2$,
$$P(\mathcal{R} \mid g_j) = \sum_{S \subseteq \mathcal{R}} P(S)\,P(\mathcal{R} \mid S, g_j). \quad (1)$$
By considering probabilities $P(S)$, we take alignment uncertainty into account, due to the usual reasons like multiply mapped reads and alignment artifacts, which occur particularly often in indel-affected regions. Let $K := |\mathcal{R}|$ be the number of reads that align to the region to be genotyped. By Bayes' formula, equation (1) further implies that maximizing the likelihood $P(\mathcal{R} \mid g_j)$ is, under the uniform genotype prior, equivalent to maximizing the posterior $P(g_j \mid \mathcal{R})$. Note that we also assume that reads were generated independently of each other, which is a reasonable assumption. Writing $p_R$ for the probability that $A(R)$ is correct, we conclude that
$$P(\mathcal{R} \mid g_j) = \sum_{S \subseteq \mathcal{R}} \Big( \prod_{R \in S} p_R\,P(A(R) \mid g_j) \Big) \Big( \prod_{R \notin S} (1 - p_R) \Big) = \prod_{R \in \mathcal{R}} \big( p_R\,P(A(R) \mid g_j) + (1 - p_R) \big), \quad (2)$$
where the product form results from expanding the sum over subsets. The last term, finally, can be computed in time linear in the number of reads $K$, which had been our goal. We use this formula for genotyping insertions and deletions of arbitrary length, that is, including insertion and deletion SVs.
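The linear-time computation of the genotype likelihood described above can be sketched as follows: each read contributes a factor combining its alignment correctness probability with the alignment's probability under the genotype. The pair representation of per-read probabilities is an illustrative assumption:

```python
def genotype_likelihood(reads):
    """Evaluate the genotype likelihood as a product over reads.
    `reads` is a list of (p_align_correct, p_alignment_given_genotype)
    pairs for a FIXED candidate genotype: a read contributes
    p_A * P(A | g) if its alignment is correct and a neutral factor
    (1 - p_A) otherwise, so the sum over read subsets factorizes."""
    likelihood = 1.0
    for p_a, p_given_g in reads:
        likelihood *= p_a * p_given_g + (1.0 - p_a)
    return likelihood
```
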
It remains to compute reasonable probabilities $P(A \mid g_j)$ for read alignments $A$. If the read underlying $A$ does not stem from the region, it has no influence on the respective genotype, so $P(A \mid g_j)$ does not depend on $g_j$; for reads that do stem from the region, the computation requires further reasoning. One has to distinguish between two types of evidence for an indel given that $A$ is correct:
• Split-read evidence: $A$ aligns with the region to be genotyped such that one read end stretches across the breakpoint(s) of the putative indel.
• Internal-segment-based evidence: the internal segment of the read pair spans the breakpoint(s); see the next subsection.
Let $D$ be the putative indel and let $A$ be an alignment with a split/gap that agrees with $D$. By the above considerations, the read of $A$ stems from a chromosomal copy that is affected by $D$. By Bayes' formula and the formulas from above, such an alignment contributes evidence against $g_0$ and, since a heterozygous carrier generates affected reads from only one of its two chromosomal copies, half as much evidence for $g_1$ as for $g_2$. In general, the assumption of no alignment uncertainty does not apply; in fact, split-read alignments carry considerable error probabilities. While these values may seem large, they are well supported by considerations on the uncertainty of (split-)alignments 9,30 .

Internal Segment--Based Evidence
Internal-segment-based evidence is provided by evaluating the empirical distribution on fragment length, which depends on the library from which the read stems. Here, we consider this distribution approximately Gaussian, which is a reasonable assumption for the vast majority of GoNL individuals. In general, all of the following considerations can be easily generalized to arbitrary, also non-Gaussian, empirical distributions. Let $D$ be the deletion to be genotyped and let $C(D)$ be its centerpoint. Let $R$ be the read of $A$ with $A_L < C(D) < A_R$, that is, the alignment interval of $A$ (the gap between $A_L$, the alignment of the left end, and $A_R$, the alignment of the right end) contains the centerpoint of the deletion $D$. Note that choosing centerpoints as anchors for the following considerations has advantages, as was explained in 9,30 . Let $\mu$ and $\sigma$ be the mean and standard deviation of the internal segment size distribution of the library $R$ is from, which is assumed to be Gaussian. So, internal segment length, as a random variable $X$, is distributed as $X \sim N(\mu, \sigma^2)$. There are two cases: first, alignments $A$ whose reads stem from a chromosomal copy that is not affected by $D$, and second, alignments $B$ whose reads stem from a chromosomal copy that is affected by $D$. For an observed (reference-space) internal segment length $x$ we obtain
$$P(x \mid \text{unaffected}) \propto \varphi_{\mu,\sigma}(x), \qquad P(x \mid \text{affected}) \propto \varphi_{\mu,\sigma}(x - L),$$
where $\varphi_{\mu,\sigma}$ is the Gaussian density and the second case reflects that the alignment interval contains the deletion of length $L$, so the apparent internal segment is longer by $L$.
We compute these as appropriate densities for the cases of no variant ($g_0$) and a homozygous variant ($g_2$), each divided by a normalization factor. In analogy to the considerations for the split-read case, with the heterozygous case $g_1$ treated as an even mixture of the two, we arrive at an appropriate probability distribution for reads whose alignments span the breakpoints of deletions by their internal segments.
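Under the Gaussian model above, the per-genotype likelihood of an observed internal segment length can be sketched as follows; treating the heterozygous case as an even mixture and shifting the mean by the deletion length L are modeling assumptions of this sketch, and normalization is omitted:

```python
import math

def phi(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def insert_size_likelihoods(x, mu, sigma, L):
    """Likelihoods of observed reference-space internal segment length
    x under the three genotypes of a deletion of length L: unaffected
    copies follow N(mu, sigma^2); affected copies appear longer by L."""
    p0 = phi(x, mu, sigma)                   # homozygous reference
    p2 = phi(x, mu + L, sigma)               # homozygous deletion
    p1 = 0.5 * (p0 + p2)                     # heterozygous: even mixture
    return p0, p1, p2
```
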
The procedure described above is implemented as part of MATE-CLEVER 9 , which can also use prior information in the form of the Mendelian laws, if the input consists of multiple, ancestry-related genomes. In order to genotype, MATE-CLEVER collects all (split-read and regular) alignments in a region, and uses them for carrying out the computations described above.

Mid--size deletions (20--10000bp)
For the consensus deletion calls in this length range, the respective regions (deletions +/-750bp) were extracted from the BAM files for all samples, converted to FASTQ format, and aligned using LASER 9 , version v2.0-rc1-20-gfea4980. The resulting VCFs for each family are merged using vcf-merge, which is part of vcftools 31 , to create a VCF for the full panel.
Phasing requires well-calibrated genotype likelihoods. We therefore conducted a simulation study to recalibrate GLs. We created three kinds of simulated diploid genomes, one that does not contain any midsize deletions (i.e. homozygous for the reference allele at all midsize deletion loci), one that is heterozygous for all midsize deletions (each one independently put at random onto one of the two haplotypes), and one that is homozygous for all midsize deletion alleles. We additionally added SNPs and indels to these three simulated genomes by randomly drawing SNP and indel alleles from the GoNL panel according to their AFs. We repeated this procedure 10 times to create a total of 30 simulated diploid genomes. For all these genomes, we simulated Illumina reads using the simseq program 32 , matching the coverage found in the GoNL. The resulting reads were aligned to the reference genome using LASER with the same parameters as described above.
Subsequent genotyping using the CLEVER Toolkit was also performed as described above.
For each deletion, we created a confusion matrix, counting the number of times we observed called genotype g_called (predicted by our genotype caller) when the true genotype was g_true, for all nine possible combinations of genotypes. Based on these counts, we estimated the empirical conditional probabilities P(g_true | g_called), based on the assumption that P(g_true | g_called) is proportional to the observed count of (g_called, g_true) pairs. In the absence of observations (i.e. when the number of observed genotypes g_called is zero), we assume uniform conditional probabilities (corresponding to a flat prior). Using these deletion-specific tables of conditional probabilities, we recalibrated all genotype likelihoods by computing posterior likelihoods.
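The recalibration step can be sketched as follows. This is a minimal illustration, not the pipeline code: the function name, the genotype ordering (RR, RA, AA), and the example counts are ours.

```python
def recalibrate_gls(gls, confusion_counts):
    """Recalibrate genotype likelihoods (GLs) with a per-deletion confusion
    matrix estimated from simulated data.

    gls: [P(RR), P(RA), P(AA)] as reported by the genotype caller.
    confusion_counts[called][true]: how often genotype `called` was predicted
    when the simulated truth was `true` (a 3x3 table of counts).
    Returns posterior likelihoods over the true genotype, summing to one.
    """
    posterior = [0.0, 0.0, 0.0]
    for called in range(3):
        row = confusion_counts[called]
        total = sum(row)
        for true in range(3):
            # Empirical P(true | called); fall back to a flat prior (1/3)
            # when this called genotype was never observed in simulation.
            cond = row[true] / total if total > 0 else 1.0 / 3.0
            posterior[true] += gls[called] * cond
    norm = sum(posterior)
    return [p / norm for p in posterior]

# Hypothetical counts for a caller that sometimes miscalls AA as RA:
counts = [[98, 2, 0],   # called RR
          [1, 80, 19],  # called RA
          [0, 5, 95]]   # called AA
print(recalibrate_gls([0.0, 1.0, 0.0], counts))  # -> [0.01, 0.8, 0.19]
```

A confidently called heterozygote is thus down-weighted toward AA in proportion to how often RA calls turned out to be AA in the simulations.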

Long deletions (10kbp and longer)
In principle, the genotyping of long deletions proceeded like the genotyping of mid-size deletions described above, with the following slight adaptations:
• The analysis was done at a later point in time using the (then current) version v2.0-rc3-53-g7b13d29 of the CLEVER Toolkit. No major changes happened compared to the versions used for mid-size deletions, just slight fixes/usability improvements.
• Due to the size of the events, we did not extract the full-length deletions from the original BAM files, but used a +/-1000bp window around the start position, center position, and end position of each deletion, which is sufficient for genotyping.
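The windowing scheme for long deletions can be sketched as below. The function name is ours, and the merging of overlapping windows (relevant for deletions shorter than about twice the flank size) is an assumption about how adjacent windows would be handled, not something stated above.

```python
def genotyping_windows(start, end, flank=1000):
    """Return the windows (start, center, and end of the deletion, each
    +/- `flank` bp) used instead of extracting the full-length event.
    Coordinates are clipped at zero; overlapping windows are merged."""
    center = (start + end) // 2
    raw = sorted((max(0, p - flank), p + flank) for p in (start, center, end))
    merged = [raw[0]]
    for lo, hi in raw[1:]:
        if lo <= merged[-1][1]:           # windows overlap: merge them
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

# A 50 kbp deletion yields three disjoint 2 kbp windows:
print(genotyping_windows(100_000, 150_000))
```

For a 10 kbp+ event this touches at most 6 kbp of alignments instead of the full deletion span, which is what makes genotyping long deletions tractable.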

Inversions
To genotype the inversions in our consensus set, we used DELLY 33 , version 0.5.9. Since precise breakpoint coordinates were not available for all consensus inversions, we did not run DELLY in "genotyping mode", but used its "discovery mode" as follows:

Translocations
Genotyping of translocations was done analogously to genotyping inversions (see previous section).

Duplications
Genotyping of duplications proceeded mostly as for inversions described above, with the following differences:
2. We matched consensus duplications to DELLY predictions, again using a maximum breakpoint distance of 500, and resolved ambiguities as follows: we prefer calls with a FILTER value of "PASS" over those with "LowQual"; we prefer PRECISE over IMPRECISE; we prefer calls with a smaller distance to the original consensus duplication; and finally we prefer calls with a smaller length difference.
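The cascade of tie-breaking rules for matching consensus duplications to DELLY predictions can be expressed as a single lexicographic sort key. This is an illustrative sketch under assumed record layout: the field names (`start`, `end`, `filter`, `precise`) and the function name are ours, not from the pipeline.

```python
def best_delly_match(consensus, candidates, max_bp_dist=500):
    """Pick the DELLY call matching a consensus duplication, resolving
    ambiguities in the stated order: PASS before LowQual, PRECISE before
    IMPRECISE, smaller breakpoint distance, smaller length difference.

    `consensus` has keys 'start' and 'end'; each candidate additionally has
    'filter' (str) and 'precise' (bool). Returns None if nothing matches.
    """
    def bp_dist(c):
        return max(abs(c["start"] - consensus["start"]),
                   abs(c["end"] - consensus["end"]))

    def length_diff(c):
        return abs((c["end"] - c["start"]) -
                   (consensus["end"] - consensus["start"]))

    matches = [c for c in candidates if bp_dist(c) <= max_bp_dist]
    if not matches:
        return None
    # Tuples sort lexicographically, so the key encodes the priority order:
    # False (i.e. PASS / PRECISE) sorts before True.
    return min(matches, key=lambda c: (c["filter"] != "PASS",
                                       not c["precise"],
                                       bp_dist(c),
                                       length_diff(c)))
```

For example, a "PASS" call 200 bp away wins over a "LowQual" call 100 bp away, because the FILTER criterion outranks breakpoint distance.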

Mobile Element Insertions
Genotyping of mobile element insertions (MEIs) was done via a logistic regression model.

Phasing SVs using MVNCall
We used the same haplotype scaffolds as previously described 2 to phase SVs onto the already phased sets of SNPs and indels. The scaffold contains sites present on Omni2.5M chips; refer to the supplement of Francioli et al. 2 , Section 12, for details on how it was created. For phasing, we used MVNcall 27 version 1.1. Apart from PED files (containing the family/trio structure) and scaffolds, we used as input the genotype likelihoods (GLs) reported by the CLEVER Toolkit (for deletions) 10 , by DELLY (for inversions, duplications, and translocations) 33 , by Mobster 14 (for mobile element insertions), and by GATK's HaplotypeCaller (for complex indels). Before phasing, the GLs were regularized to avoid probabilities too close to zero, as follows: we added a value of 10^-5 to each of the three genotype probabilities (RR, RA, and AA) and renormalized the sum to one. The transformed GLs were passed to MVNcall 27 for phasing using the following command line: "mvncall --sample-file samples.list --ped-file samples.ped --k 100 --iterations 50 --add-gls --int 0 300000000 --scaffold-file scaffolds.haps --glfs input.vcf --o output.vcf". We ran MVNcall in this way for each chromosome and for each SV type separately.
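The GL regularization step above is a simple additive smoothing followed by renormalization; a minimal sketch (the function name is ours):

```python
def regularize_gls(gl_rr, gl_ra, gl_aa, eps=1e-5):
    """Regularize genotype likelihoods before phasing: add a small constant
    to each of the three genotype probabilities (RR, RA, AA) and renormalize
    so they sum to one, so that no genotype has (near-)zero probability."""
    regularized = [gl_rr + eps, gl_ra + eps, gl_aa + eps]
    total = sum(regularized)
    return [g / total for g in regularized]

# Even a GL vector with hard zeros becomes strictly positive:
print(regularize_gls(1.0, 0.0, 0.0))
```

This keeps downstream log-likelihood computations finite without materially changing confident genotype calls.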
3.6.2 Imputing SVs in independent Dutch individuals using IMPUTE2

In order to assess the quality of our reference panel with respect to its structural variants (see 37 for an encompassing study on the quality of the panel in terms of SNPs), we ran IMPUTE2 38 on two independent Dutch individuals and analyzed the resulting genotypes. We aimed at a scenario where one could compare the imputed genotypes with true genotypes.
This differs from the evaluation in Deelen et al. 2014 37 , who focused on classical imputation metrics 39 as a basis for evaluation; these require sufficiently many individuals but do not use true genotypes. Since the application of imputation metrics to SVs has not been discussed in the literature before, we opted for a scenario where one could juxtapose imputed genotypes with (highly likely) true genotypes.
For genotyping structural variants in the two individuals, we analyzed the available whole-genome sequencing data of the individuals, which were Illumina HiSeq paired-end reads at 17x and 19.5x coverage respectively, aligned against reference genome GRCh37 using Bowtie2 v2.2.4 40 , followed by marking duplicated reads and indel realignment. For genotyping SVs, we used the same tools with the same settings as for the GoNL panel, see Section 3.5; that is, we used MATE-CLEVER 9 for genotyping deletions, DELLY 33 for genotyping inversions, duplications and translocations, Mobster 14 for genotyping mobile element insertions, and GATK's HaplotypeCaller for genotyping complex indels. We ran all these tools in genotyping mode - and not in discovery mode - on the structural variants that belong to the panel. This yielded a set of read-based genotyped SVs that is in 1-to-1 correspondence with the SVs one can impute, in particular also in terms of breakpoints.
• lower and upper are integers to indicate the genomic region to be studied (we use the full chromosome).
This resulted in imputation-based genotype likelihoods (GLs) for each SV. We then filtered these GLs, keeping those SVs for which gold standard genotypes, as described above, were available.
Comparing genotypes / allele dosages. Let v be a placeholder for a single SV, and let V be the set of all SVs considered (stratified by type or panel-based allele frequency, for example). For each v, we compare the imputed (SNP- and panel-based) genotype likelihoods p_{v,0}, p_{v,1}, p_{v,2}, which correspond to the imputation-based probabilities that v is not present (0), heterozygous (1) or homozygous (2), with the directly read-based, carefully filtered "gold standard genotype" likelihoods q_{v,0}, q_{v,1}, q_{v,2}, which reflect the ground truth probabilities that v is absent (0), heterozygous (1) or homozygous (2); note that one of the q_{v,i} is at least 0.999 as per the filtering criteria (0.85 for MEIs). We then compute the probability a_v that the imputed genotypes agree with the ground truth as
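The agreement computation can be sketched as below. Since the formula itself is not reproduced in this excerpt, the definition used here - the dot product of the two genotype-probability vectors - is an assumption, as is the function name.

```python
def agreement_probability(imputed, truth):
    """Probability that an imputed genotype agrees with the gold standard,
    taken here as the dot product of the two genotype-probability vectors
    (an assumed definition; the exact formula is not shown in the excerpt).

    Both arguments are triples (p0, p1, p2): the probabilities that the SV
    is absent, heterozygous, or homozygous."""
    return sum(p * q for p, q in zip(imputed, truth))

# A confident, correct imputation scores close to 1:
print(agreement_probability((0.98, 0.01, 0.01), (0.999, 0.0005, 0.0005)))
```

Under this definition, a perfectly confident and correct imputation scores 1, while a confident but wrong one scores 0.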