The ICR1000 UK exome series: a resource of gene variation in an outbred population

To enhance knowledge of gene variation in outbred populations, and to provide a dataset with utility in research and clinical genomics, we performed exome sequencing of 1,000 UK individuals from the general population and applied a high-quality analysis pipeline that includes high sensitivity and specificity for indel detection. Each UK individual has, on average, 21,978 gene variants including 160 rare (0.1%) variants not present in any other individual in the series. These data provide a baseline expectation for gene variation in an outbred population. Summary data of all 295,391 variants we detected are included here and the individual exome sequences are available from the European Genome-phenome Archive as the ICR1000 UK exome series. Furthermore, samples and other phenotype and experimental data for these individuals are obtainable through application to the 1958 Birth Cohort committee.

The advent of exome sequencing has increased our ability to comprehensively capture gene variation, furthering both discovery of genes predisposing to disease and the expansion of clinical genomic sequencing. In both contexts the spectrum and frequency of gene variation in the general population is a necessary consideration when evaluating potential association with disease. Although publicly accessible summary exome datasets such as the Exome Sequencing Project and Exome Aggregation Cohort (ExAC) series are available, these are compilations of exome data from disease cohorts rather than the general population, and individual level data is not provided, hampering utility 1,2 . The 1000 Genomes project has made individual population-based exome data available. However, the sample sizes for any given population range from 61 to 113 individuals, limiting detection of rarer variation 3 . Some larger population datasets are available; studies of 2,636 Icelanders and of 250 Dutch individuals have been recently published 4,5 , although individual sequences are not accessible, impeding direct comparative analyses of data from different populations. Moreover the spectrum of variation from founder populations, such as the Icelanders, is not representative of outbred populations, particularly in relation to rare variation. This is of crucial importance in clinical genomics because the majority of disease-causing gene variants are very rare.
A further limitation of some of the available datasets is the suboptimal detection of insertions and deletions (collectively termed 'indels'). Many publications of these datasets have either excluded all/some indels, or have explicitly stated that the indel calling is not high-quality 1,3,6-8 . This has potential to severely compromise clinical genomic utility, as indels are a major class of pathogenic variant, and are routinely and robustly detected by pre-NGS mutation detection methods, such as Sanger sequencing.
Here we have sought to enhance knowledge of rare gene variation in outbred populations and to provide an exome dataset with utility in research and clinical genomics. To this end we have generated exome data in 1,000 UK individuals from the general population (the ICR1000 UK exome series) and applied an analytical pipeline with high sensitivity and specificity for indel detection. We analysed the data with the OpEx pipeline (version 1.0.0) which has high sensitivity (95%) and specificity (97%) for indels, with a low false discovery rate (3.4%). OpEx includes Stampy v1.0.14 for alignment, Platypus v0.1.5 for variant detection and CAVA v.1.1.1 for variant annotation 9-11 . Of particular note, the default OpEx indel calls are consistent with the clinical convention (i.e. called at the most 3' position in the coding strand) but if alternative representation(s) are possible this is noted and the call at the most 5' position is also provided 11 .
To confirm the series provides representation of individuals with similar ancestry, we performed principal component analysis with the HapMap data for the Central European, Han Chinese, Gujarati Indian, and Yoruban populations 12 . All individuals clustered tightly with the Central Europeans, with no evidence of ethnic outliers ( Figure 1).
We evaluated 19,922 genes from the Ensembl database, of which 17,588 passed our sequence performance criteria and were included in the analyses outlined below (Supplementary Table 1). We calculated the percentage of coding bases that achieved coverage of at least 15×, the same threshold at which variant calling was optimised. This allows one to place a gene's observed variation in the context of what was successfully sequenced, an essential requirement in clinical genomics. The genes included and the percentage successfully covered is given in Supplementary Table 1. We observed variation in 17,464 of the 17,588 genes ( Figure 2). The 124 genes with no variation had less than 1.1kb of coding sequence and/or less than 80% of coding sequence successfully sequenced at 15× (Supplementary Figure 1). The 538 genes with no protein-altering variation were similarly enriched for smaller genes (Supplementary Figure 1). 712 genes had no rare (0.1%) protein-altering variants. Conversely, 2,296 genes had only rare protein-altering variants. The remaining genes had variants across the frequency spectrum ( Figure 2, Supplementary Table 2).
We outputted all variants in exons or within 10 bp of the intronexon boundary and detected 295,391 distinct variants in total: 284,437 base substitutions, 7,378 deletions and 3,576 insertions  Protein-altering variation includes nonsynonymous variants, inframe indels and protein-truncating variants (i.e. frameshifting indels or variants that alter essential splice-site residues). The majority of genes include variants across the frequency spectrum.
We first considered the quality of the base substitution calls. The overall transition/transversion (Ti/Tv) ratio was 3.05, confirming our low false detection rate. The Ti/Tv ratio increased from 3.17 to 3.25 as frequency decreased from common to low as anticipated 13 . The Ti/Tv ratio of the rare variants was lower, as expected, at 2.91. This is because rare variants are enriched for nonsynonymous variants, a class more likely to contain transversions, thus resulting in a lower Ti/Tv ratio 4,8  were low frequency and 17% (n=47,197) were common ( Figure 3, Supplementary Table 2).
We next considered the insertions and deletions. We identified 10,954 indels in total, of which 52% (n=5,724) were rare, 33% (n=3,559) were low frequency and 15% (n=1,671) were common ( Figure 3, Supplementary Table 2). The type and length of indel detected influenced variant frequency and we detected twice as many deletions as insertions (P=8.68×10 -289 ), in line with previous data ( Figure 4) 4,14 . For coding indels the majority of common variants were inframe 3 bp variants or 1 bp deletions or insertions. However, we observed a different distribution for rare coding indels, with more frameshifting single base indels than inframe 3 bp indels (P=1.2×10 -10 ; Figure 4). This likely reflects selection against frameshifting indels becoming common, as they are more often biologically deleterious than inframe mutations.   Deletions are more common than insertions, particularly for rare variants. There is enrichment of indels of 3 bp, 6 bp and 9 bp in coding but not non-coding sequence, because these cause inframe variants. We believe these results and the underlying raw ICR1000 UK exome data have considerable utility in scientific and translational research and in clinical genomics.

Exome analysis with OpEx
We analysed sequencing reads with the OpEx pipeline. OpEx includes Stampy v1.0.14 for alignment, Platypus v0.1.5 for variant detection and CAVA v.1.1.1 for variant annotation 9-11 . The OpEx pipeline is available at http://www.icr.ac.uk/OpEx. Full details of the OpEx description and performance are being prepared for publication. In brief, we validated the OpEx pipeline using 142 independent samples for which we had previously generated extensive data from genotyping and sequencing studies. We evaluated 12,656 sites which included 11,926 common base substitutions with minor allele frequency (MAF) ≥5%, 409 variants validated by Sanger sequencing, and 321 sites known to be negative through variant caller comparison analyses. For base substitutions the sensitivity was 99.7% (12017/12049), specificity was 96% (44/46) and the FDR 0.02% (2/12019). For indels ≤10 bp, the sensitivity was 95% (256/269), specificity was 97% (266/275) and the FDR 3.4% (9/265). The detection of longer indels (>10 bp) was less robust, with higher false detection rates than for small indels, these were therefore excluded from the variant tables and analyses.
In the variant analyses we included base substitution calls for which at least one call across all samples had a QUAL score >100. For deletions we included calls for which at least one call across all samples had both PASS in the FILTER column and ≥20% of reads included the deletion. For insertions we only included calls that had both PASS in the FILTER column and ≥20% of reads included the insertion. Seven samples showed excess heterozygosity and were excluded from the variant analyses.
Confirmation of OpEx performance in the ICR1000 UK exome series by Sanger sequencing We selected 116 gene variants that were detected once in the series to further evaluate variant calling performance. This included 31 base substitutions, 36 deletions, and 49 insertions. The 31 base substitutions, 13 of the deletions, and 2 of the insertions occurred in genes for which we had Sanger sequencing primers available in-house. The remaining 23 deletions and 47 insertions were selected randomly from amongst 96 individuals whose DNA was included on the same plate to aid the laboratory work. For these 70 indels, we generated FASTA sequence for the 500 bp window surrounding the called variant using the BSgenome package in R 15 . We used BatchPrimer3 to design primers with minimum product size of 200 bp, optimal of 350 bp, maximum of 501 bp, max Tm difference of 5.0, and default for all other settings. We performed PCR reactions using the Qiagen Multiplex PCR kit, and bidirectional sequencing of resulting amplicons using the BigDye terminator cycle sequencing kit and an ABI3730 automated sequencer (ABI PerkinElmer). All sequencing traces were analysed with both automated software (Mutation Surveyor version 3.10, SoftGenetics) and visual inspection.
113 of the 116 calls were detected by Sanger sequencing. Only one of the validated variants had a discrepant annotation; an insertion of 1 bp in a run of T's in the OpEx call appeared to be a deletion of 1 bp in the Sanger sequence. There were three false positive OpEx calls (FDR = 0.03). Two were insertions of 6 bp and 8 bp present at the end of reads, a common site of false positive calls. The other was a 3 bp deletion in a region with poor mapping quality and poor coverage.

Transcript selection
We selected transcripts from the Ensembl database (version 65) and evaluated the 19,922 genes present on chromosomes 1-22, X, or Y, that had both a start and stop codon, and were not known pseudogenes as specified in the default exome transcript database supplied with OpEx. The specific transcripts selected are provided in Supplementary Table 1 (see Supplementary Table 4 for descriptions of column headers).

Coverage evaluation per gene
Gene coverage was evaluated using the coding bases of the selected transcript; intronic and UTR sequence was excluded. 2,334 genes were excluded from the variant analyses because either <50% of coding bases were targeted by the pulldown (n=1,380) or <50% of coding bases were covered at ≥15× in 500+ individuals (n=954). Of the 1,380 genes that were not targeted by the pulldown, 207 had ≥50% of the gene covered at ≥15× in at least one individual and thus represented off-target effects (Supplementary Table 1).

Principal component analysis (PCA)
We utilized the genotype data from the 397 unrelated individuals in the Central European, Han Chinese, Gujarati Indian, and Yoruban populations in Phase 3 of the HapMap project 12 to perform PCA. The genotypes were evaluated for 2,577 base substitution variants on chromosome 1 which were called in both the exome data and the HapMap data. PCA was performed using the prcomp function in R.
To allow confident imputation of reference homozygotes in the exome data, variants were required to have ≥13× coverage at the position for every individual. Ten individuals were excluded from PCA analysis as fewer than 90% of the variants met the coverage requirement.

Detection of gene variation
We assessed variation in the 17,588 genes with good coverage. Variants in exons or flanking sequence up to ten bases into the intron were outputted and included in Supplementary Table 2 (see Supplementary Table 4 for descriptions of column headers).
Variants detected in only one individual in fewer than 15 reads were excluded. Multi-allelic variants were separated into their constituent alleles for the individual-level and gene-level analyses. For the individual-level analyses, the most severe consequence was selected for variants that affected multiple genes based on the functional impact class supplied by CAVA 11 . For the gene-level analyses, variants that affected multiple genes were included as variants in each gene. We defined coding indels as those that affected any exonic or essential splice-site (+1, +2, -1, -2) residue, and non-coding indels as those that affected residues +3 to +10 or -3 to -10.

Statistical analyses
The Ti/Tv and nonsynonymous/synonymous ratios were calculated in R using the exonic base substitutions. The overall proportion of deletions amongst the 10,954 indels (0.67) was tested for significant difference from 0.5 using the prop.test function in R. The number of 1 bp and 3 bp coding deletions was compared between rare and common variants using a 2×2 contingency test with the chisq.test function in R.

ICR1000 UK exome series availability
The FASTQ files for all 1,000 individuals have been deposited in the European Genome-phenome archive (EGA). The accession number is EGAD00001001021. Click here to access the data.

Supplementary Table 2. Summary information for all variants detected in the ICR1000 UK exome series.
Variant information for all 295,391 exome variants detected in the ICR1000 series.
Click here to access the data.

Supplementary Table 3. Range and average number of variants in the ICR1000 UK exome series.
The range and average number of variants in the ICR1000 series by variant type and frequency.
Click here to access the data.

Supplementary Table 4. Column descriptions for supplementary tables.
Descriptions of column headers for Supplementary Table 1 and Supplementary Table 2.
Click here to access the data.

F1000Research
The authors describe ICR1000, a UK-based reference data set for variation. Using 1000 consented samples from the 1958 UK birth cohort, they performed deep exome sequencing to a median depth of 15X. The samples were chosen from those who self-reported as having white British ancestry though does not comprise samples from individuals of other ancestry in the general British population. All data have been deposited in the EGA, and so are available as a resource to other bona fide researchers.
This is a thorough and well performed study that includes detailed methods and results. In particular the authors have developed a pipeline with high sensitivity and specificity for detecting indels. This results in findings of nearly 11,000 indels, over half of which are novel and in total over 159,000 rare variants were found (with frequency of 0.1%). It would be helpful if the authors could include a comparison to the UK10K project. This was a major UK-based initiative that sequenced 4000 genomes of individuals with deep phenotype data and these data are also available in the EGA.
In conclusion, this is valuable work that found new rare variants that will improve exome analysis. In particular, identification of novel indels, which have not been well characterised in many other studies, is of importance to the field and will be of benefit to clinical genomics where correct interpretation of rare variants is key.
---Here is a citation for the database mentioned in the paper. Including this is helpful for continued development and funding.
European Genome-phenome Archive: Lappalainen, I., Almeida-King, J., Kumanduri, V., Senf, A., Spalding, J.D., ur-Rehman, S., Saunders, G., Kandasamy, J., Caccamo, M., Leinonen, R., Gail Jarvik Division of Medical Genetics, University of Washington, Seattle, WA, USA This is well executed and data are available. If anything the authors are modest about impact. My report is below: Dr Rahman and her colleagues have summarized both variant and in/del variation in 1000 UK Biobank participants and shared these data. Their analyses show the rate of common and rare variation in the participants. While these estimates are important and underscore that rare variation is very common, possibly the largest impact of this work is the ability to assess the frequency of in/dels in a population-based cohort. Allele frequency data are among the most important consideration in determining whether a variant is pathogenic or not. Clinical laboratories and researchers alike rely on public databases for minor allele frequency data. As the authors rightfully point out, the ExaC variant database is a compilation of individuals with various diseases, which can skew allele frequencies. The