I’am hiQ – A Novel Accuracy Index for Imputed Genotypes

Albert Rosenberger, Viola Tozzi, Rayjean J Hung, David C Christiani, Neil E Caporaso, Geoffrey Liu, Stig E Bojesen, Loic Le Marchand, Demetrios Albanes, Melinda C Aldrich, Adonina Tardon, Guillermo Fernández-Tardón, Gad Rennert, John K Field, Mike Davies, Triantafillos Liloglou, Lambertus A Kiemeney, Philip Lazarus, Aage Haugen, Shanbeh Zienolddiny, Stephen Lam, Matthew B Schabath, Angeline S Andrew, Eric J Duell, Susanne M Arnold, Hans Brunnsstöm, Olle Melander, Gary E Goodman, Chu Chen, Jennifer A Doherty, Marion Dawn Teare, Angela Cox, Penella J Woll, Angela Risch, Thomas R Muley, Mikael Johansson, Paul Brennan, Maria Teresa Landi, Sanjay S Shete, Christopher I Amos, Heike Bickeböller on behalf of the INTEGRAL-ILCCO consortium

1 Abstract Background: Imputation of untyped markers is a standard tool in genome-wide association studies to close the gap between directly genotyped and other known DNA variants. However, high accuracy with which genotypes are imputed is fundamental. Several accuracy measures have been proposed and some are implemented in imputation software, unfortunately diversely across platforms. In the present paper we introduce

Background
To date information of more than 660 million reference single nucleotide polymorphisms (refSNPs) and 5.9 million regions with structural variation (SV) on the human DNA are known and stored in the publicly available databases, like dbSNP [1]. To identify those genetic variants, that are associated with common human diseases, genomewide association studies (GWAS) can be conducted. Usually, commercial single nucleotide polymorphism (SNP) microarrays are used to carry out genotyping of DNA samples for these studies. There are two predominant companies for high throughput genotyping arrays, Thermo Fisher Scientific Inc., Santa Clara, CA (Affymetrix TM ) and Illumina Inc., San Diego, CA. The underlying chemistry differs but both array types can be used to ascertain genotypes in a similar fashion [2]. In contrast to the more expensive and error prone new generation sequencing technologies, the number of genotyped variants ranges from 300,000 to 4 million. Array-based markers are supposed to tag the genomic region in their vicinity, but represent only a small proportion of all known DNA variants. Furthermore, these variants are not a random selection but have been chosen according to criteria such as minor allele frequency (MAF), location in exons or blocks of linkage disequilibrium or putative associations with certain disease. Imputation methods and strategies have been developed and are now a standard tool in GWAS to close the gap between genotyped and existing DNA variants [3][4][5]. These methods transfer information of DNA structure from one or several reference panels with high marker density (e.g. 1000 Genomes Project phase 3 [6] or Haplotype Reference Consortium (HRC) [7]) to the genotyped study samples [4]. Most imputation methods estimate a-posteriori genotype probabilities (referred to as dosages, ranging from 0 to 1) for each untyped variant and each individual in the sample of interest. The resulting increase of variant density in the study sample improves the genomic coverage and can increase the power to identify genomic variants associated with a trait [8].
Imputation further has the potential that an identified associated marker is located closer to a true risk locus; it facilitates fine mapping of causal variants and is essential for meta-analyses of GWAS, particularly when different genotyping arrays have been used for multiple studies [9]. However, imputation requires advanced statistical methods for data analysis and may introduce extra uncertainty in interpreting findings. Further, only DNA variants that have previously been genotyped in the used reference panel can be imputed [4,10].

Known accuracy measures
It is important to evaluate the quality of imputation, e.g. to exclude poorly imputed variants from statistical analysis. Several quality indices have been developed and are routinely applied [4,5,18]. These comprise inter alia the squared correlation r² between the true and imputed dose of an allele across all imputed samples (MaCH r², Minimac or Beagle r²) or IMPUTE2's info.
All r² measures can be derived from a-posteriori allele probabilities without knowledge of the true allele dose, but only if the allele probabilities are well calibrated and MAF is not too low. The power of an allelic test with N samples and imputed alleles is approximately equal to the power of the same test with r²N samples and known alleles, in case of a binary trait. Differences among the known r² measures are discussed elsewhere [4]. The commonly used info is defined as the proportion of statistical information on the population allele frequency in the imputed genotypes, relative to "known" genotypes [5]. If the Hardy-Weinberg disequilibrium (HWE) holds, info equalizes to Minimacs r². Hence, r²-based measures and info are directly related to the power of statistical test of a marker x trait association.
In general, both metrics have preferable characteristics if the a-posteriori genotype probabilities (dosages) are accurately calculated [18]. However, multiple factors can affect imputation accuracy, e.g. sample size, sequencing coverage and haplotype accuracy of the references panel(s), density of the genotyping array, allele frequency and poor LD between genotyped and imputed variants [4]. One can calculate these accuracy measures from dosage files. However, the standard outputs of common imputation programs (e.g. Beagle or IMPUTE2) contain different metrics. Hence, choosing an imputation program binds the user to the metrics provided, although the SNPTEST program offers the option of calculating a measure similar to that of info [19].
We propose a new pair of metrics to depict additional aspects of imputation accuracy also calculable from dosage files. First, we aim to quantify the amount of individualspecific versus population-specific genotype information in the imputed genotypes.
Second, we aim to assess the heterogeneity between dosages of a marker across the sample at hand. Both measures can be used to identify markers or regions in which population-specific genetic information conceal individual-specific information and are therefore less informative for e.g. association testing. These new metrics are not intended as a competitor to established scores, but are intended to support the making of well-founded decisions in SNP filtering of imputed markers prior to an analysis or in interpretation of results after an analysis.
We calculated this pair of accuracy measures on a series of 27,065 cases and controls gathered by the International Lung Cancer Consortium (ILCCO) to find meaningful thresholds for marker exclusion and compared it with info, because all of the ILCCO samples had previously been imputed with IMPUTE2 applied to a standard 1000 Genomes referent panel. Further, we contrasted the usability of the new measures to info in some simulated data.

Comparison of I'am and hiQ
When applying the novel indices I'am hiQ (as defined in the section Novel accuracy measures) to 517,482 SNPs types with the OncoArray, only a small portion (n=40,678, 4‰) can be considered as imputed without doubt (I'am =1 and hiQ=1). For the majority of SNPs a value between 0.95 and <1 was assigned for hiQ (9,760,392, ~94%), while only 30% (n=3,243,272 markers) achieved such a large value with respect to I'am. It is worth to mention, that we assigned a reduced value for I'am (from 0.4 to 0.75) to about as many SNPs (n=3,491,596, 33%). More details are given in Supplementary Table I.
Both components of I'am hiQ, are contrasted in a bubble plot ( Figure 2). The oversized grey bubble in the top right corner represents the vast majority of almost fullyinformative markers with I'am≥0.99 and hiQ>=0.99. It can easily be seen that the remaining small minority of not fully accurately imputed markers take advantage of the whole theoretical range for I'am (even negative values). In contrast, hiQ always exceeds 0.4 in the sample at hand, but seems to be sensitive in markers with low values for I'am, whereas lower values of hiQ are only assigned to common markers.

Figure 2
I'am by hiQ

Defining thresholds for marker filtering
In order to exclude less accurately imputed variants from further analysis, one needs to define a meaningful and applicable threshold for any accuracy index. For the measure info threshold values like 0.8 or 0.3 have been proposed, but without sound justification [5,20,21].It was even proposed to lower the threshold for info in very large samples, as those of the UK Biobank, and still maintain a good ability to detect associations [22].
We applied robust regression (PROG ROBUSTREG of SAS 9.4; cut-off-α=10 -6 for leverage points, cut-off-multiplier=5 for outliers [23]) to estimate the expected value of I'am indicates markers with less than ~50% individual-specific genotype information (in average across all samples). Such an intuitive interpretation cannot be given for hiQ.
For the majority of 9,094,772 (87.2%) variants, sufficient imputation accuracy was achieved, according to our defined thresholds (see Error! Reference source not found.). Limiting the markers to those with 0.5<info≤0.8 and info≥0.8, this fraction increases to 95.1% or 99.9%, respectively. In very rare genetic variants (MAF<1%) the fraction drops to 76.5%. In contrast, only 0.6% of variants meet neither the I'am nor the hiQ criteria. Interestingly, 1,214,620 variants (11.7%) missed only the I'am criteria.
The fraction was larger in very rare variants (23.2%) and in variants with info<0.5 (58.5%), while it was moderate in variants with 0.5<info≤0.8 (2.5%).   To identify more genomic regions prone to host inaccurate markers we calculated the exponentially weighted moving averages (ewma) of I'am and hiQ (PROC EXPAND of SAS 9.4; smoothing factor 0.1) [23]. We consider variants with an ewma<threshold (0.47 for I'am and 0.97 for hiQ) as belonging to a "hot region" and variants with an ewma<threshold/2 (0.23 for I'am and 0.48 for hiQ) as belonging to a "very hot region".

Identifying markers and regions of low accuracy
Across the whole genome we were able to identify 4,603 "hot regions" and 171 "very hot regions" according to I'am HWE , as well as 2,899 "hot regions" according to hiQ.
These regions partially overlap or are interconnected. "Hot" and "very hot" I'amregions contain in total 85,790 variants, only about 8‰ of all variants. "Hot" hiQregions contain in total 53,590 variants, only about 5‰ of all variants. However, about 1 out of 3 "hot" or "very hot" regions is very small and contains only one variant. In contrast, 10% of the "hot" I'am-regions and about 20% of either the "very hot" I'amor the "hot" hiQ-regions contain more than 20 variants (see Supplementary Table II-IV). Some of these regions on chromosome 1 are indicated by flames in Figure 4.

Comparing I'am hiQ with info and certainty
The missing rate for info was about 18% and 0.6% for certainty, in the data set at hand.
I'am and hiQ could be determined for all markers (see Supplementary Table V Figure   2), whereas this was the case for less than 2‰ of variants with reduced I'am (0.4 to 0.75). This means that I'am and info nevertheless carry different information on imputation accuracy.  Figure 2 and 3). These clearly show that info is less suitable for mapping regions enriched with less accurately imputed genotypes, genome-wide and chromosome-wide.

Usability
We also investigated the usability of the proposed indices in contrast to info by simulation. Usability was considered in terms of discrimination between sufficient and insufficient imputation, rather than in terms of validity of imputation because validity is a characteristic of the imputation routine (e.g. IMPUTE2).
We defined four pairs of scenarios (scenes) by two common genotyped tagSNPs flanking one intermediate marker for imputation. Each pairs consist of one scenario that is sufficient and one that is insufficient for imputing. The scenarios differed with regard to number of haplotypes, MAFs and LD. We visualized the ability to discriminate sufficient from insufficient imputation by plotting receiver operation curves (ROC) comparing all index within each scene. Details on the simulation and the results are given in the supplement. Info and I'am appear to be comparable usable in terms of discrimination between sufficient and insufficient imputation for common SNPs. However, hiQ seems to be superior when MAF of the imputed marker is low.

Discussion
Imputation is a cost-effective tool for GWAS to fill gaps of non-genotyped variants instead of whole-genome sequencing for all recruited individuals, since global coverage in genomic information of available arrays with less than 1 million SNPs not exceeds 25% [10,24]. However, imputation accuracy matters. Several accuracy measures have been proposed and implemented in imputation software, unfortunately diverse across platforms. Das et al. [4] favour r², the squared correlation between true and imputed allele dose, because it is tightly related to the power of an allelic test. However, they also emphasized the importance of adequate imputed samples for the r 2 accuracy.
We introduce I'am hiQ, an independent and complementary pair of accuracy measures.
Other than e.g. r², I'am quantifies the amount of individual-specific versus populationspecific genotype information in a linear manner for each individual before averaging,  However, it has been discussed that any imputation accuracy measures assuming HWE to calculate "expected" genotype counts can be confounded. This was demonstrated for MaCH r² [25]. For the proposed method, HWE is solely chosen as anchor point to define pure population informative dosages. One should keep in mind that I'am hiQ is just a tool for quality assurance and not a data analysis module. Thus slight violations from HWE do not compromise their use, but in case of family data caution is advised.
In such cases one can either apply I'am HWE to founders only, or use I'am chance .
Finally, accuracy measures with non-justified thresholds, as e.g. info, should be applied can be useful, particular to explain whether missed replication of an observed markerphenotype association is due to inaccurate imputation. Since the imputation accuracy of particularly rare markers tend to be low, an improved imputation of the ILCCO samples is planned on newer panels that contain more SNPS with low MAF.

Conclusion
In summary, I'am hiQ is a newly proposed pair of accuracy measures for imputed genotypes. In contrast to others, it addresses directly the contents of individual-specific genotype information and the heterogeneity between dosages. It is independent of the imputation platform and can be computed for all imputed variants. We recommend using I'am hiQ additional to other accuracy scores for variation filtering before stepping into the analysis of imputed GWAS data.

Novel accuracy measure
In the following we will consider = 1 markers with two alleles ( and ) and a . We assume the whole uncertainty related to genotype imputation is contained in these triplets. The allele dose of an individual i for an imputed marker will then be , = ∑ • 2 =0 and can take any value between 0 and 2. Multi-allelic markers are assumed to be split into pseudo-two-allele variants.

Index of individual-specific versus population-specific genotype information: I'am
To quantify the amount of individual-specific versus population-specific genotype information in the dosages of the imputed single marker m for a single person i, we first consider the following three marginal situations:  is much closer to 0 than for common markers (see Error! Reference source not found.). In case all dosages correspond to HWE (as in [iii]) ′ ℎ for common markers is close to 0 (indicating inaccurate imputation), whereas for rare markers it is close to 1 (misleadingly indicating accurate imputation). This shows that it is fairly hard to determine the content of individual-specific information in the triplet of dosages of rare markers.

Index of heterogeneity in quantities: hiQ
Inter-individual heterogeneity of dosages for a marker m is a second concern with respect to the usability of imputed genotypes. consists of three different dosages, leading to different best guess genotypes for the individuals. There is some heterogeneity that serves power for statistical testing. Table 4 Inter To construct an index of heterogeneity in quantities of dosages (hiQ) we compare "average dosages"(ad) across all individuals with "average of best guess dosages" (ab) applying the Hellinger H-distance. The H-distance quantifies the distance between two (trinomial) probability distributions, taking value H=0 in case of coincident probability distributions and H=1 if the probability vectors are perpendicular [27,28]. Therefore we defined .
In Error! Reference source not found. the "average of best guess dosages" for mark-

Application to a sample of lung cancer patients and controls
We applied the novel indices to a dataset of the to common cancers as well as for de novo discovery, and hence is enriched with low frequent and rare variants [29]. About 50% of markers are considered as GWAS backbone. Details of the sample, the genotyping and the quality control are given else- where [30]. The OncoArray whole-genome data were imputed in a two-stage procedure, using SHAPEIT to derive phased genotypes and IMPUTEv2 to infer additional genotypes for genetic variants included in the 1000 Genomes Project (phase 3 panel) [6,15]. We restricted calculations and comparisons to markers of the autosomes. A total of n=10,427,599 SNPs were finally included in this quality assessment. Presuma-bly difficult to impute, due to their MAF, are 5,226,623 of these SNPs (50%) with a MAF lower than 1% and 1,500,835 SNPs with a MAF between 1% and 5%.

Consent for publication
Not applicable.
ration of the manuscript. We acknowledge support by the Open Access Publication

Additional files
Supplementary information is available at European Journal of Human Genetics's website.