Establishing analytical validity of BeadChip array genotype data by comparison to whole-genome sequence and standard benchmark datasets

Clinical use of genotype data requires high positive predictive value (PPV) and thorough understanding of the genotyping platform characteristics. BeadChip arrays, such as the Global Screening Array (GSA), potentially offer a high-throughput, low-cost clinical screen for known variants. We hypothesize that quality assessment and comparison to whole-genome sequence and benchmark data establish the analytical validity of GSA genotyping. To test this hypothesis, we selected 263 samples from Coriell, generated GSA genotypes in triplicate, generated whole genome sequence (rWGS) genotypes, assessed the quality of each set of genotypes, and compared each set of genotypes to each other and to the 1000 Genomes Phase 3 (1KG) genotypes, a performance benchmark. For 59 genes (MAP59), we also performed theoretical and empirical evaluation of variants deemed medically actionable predispositions. Quality analyses detected sample contamination and increased assay failure along the chip margins. Comparison to benchmark data demonstrated that > 82% of the GSA assays had a PPV of 1. GSA assays targeting transitions, genomic regions of high complexity, and common variants performed better than those targeting transversions, regions of low complexity, and rare variants. Comparison of GSA data to rWGS and 1KG data showed > 99% performance across all measured parameters. Consistent with predictions from prior studies, the GSA detection of variation within the MAP59 genes was 3/261. We establish the analytical validity of GSA assays using quality analytics and comparison to benchmark and rWGS data. GSA assays meet the standards of a clinical screen although assays interrogating rare variants, transversions, and variants within low-complexity regions require careful evaluation.


Aim and design of study
This study defines an analytical validation framework for detecting and limiting genotyping error in GSA data (Fig. 1). To minimize platform specific genotyping biases, internally generated genotype data from independent platforms were paired and compared with publicly available genotype datasets.

Samples and datasets
To generate a reference genotype cluster file for the GSA, 664 DNA samples were purchased from the Coriell Institute for Medical Research, Camden, NJ (https:// www. corie ll. org) and 460 samples were selected from the Sanford Biobank. Individuals with biobank samples were enrolled in protocol number 03-11-061 approved by the Sanford Research Institutional Review Board. These samples were selected to cover different ethnicities (14 Coriell diversity panels, Additional file 1: Table S2) and the technical variability of the DNA extraction methods (460 samples from the Sanford Biobank). To capture the technical variability of the Infinium ® HTS Assay protocol (Illumina Inc.), all samples were genotyped in triplicate (by different technicians, robot-instrument configurations, reagent lots, and days) using the Infinium Global Screening Array-24 v.1.0 BeadChip. The resulting data were loaded into GenomeStudio v2.0.2 and used to generate the genotype cluster files per manufacturer recommendations (https:// www. illum ina. com/ Docum ents/ produ cts/ techn otes/ techn ote_ infin ium_ genot yping_ data_ analy sis. pdf ). Of the 1104 samples used in cluster file generation 72 were also included among the 263 samples used to define analytical validity (Supplementary Sects. "Methods" and "Aim and design of study"). Two hundred sixty-three DNA samples from Coriell were selected as representative of individuals from the 1000 Genomes Project Consortium (n = 258) and from the Genome in a Bottle Consortium (GiAB) [33] (n = 5) (Additional file 1: Table S1). Additionally, they were selected to assess assays genotyping alleles with ≥ 1% minor allele frequency (MAF) in the general population (Additional file 1 Table S3). These 263 DNA samples were resequenced with whole genome sequencing (rWGS) and genotyped in triplicate (263 × 3) with the GSA. These data were compared to 1KG and to publicly available Whole Genome Sequence (pWGS) data (1KG phase 3; downloaded: June 2018). This defined 4 genotype datasets for the 263 samples: (i) triplicate GSA genotypes (ii) pWGS, (iii) rWGS, and (iv) 1KG (Additional file 1: Table S4, Fig. S2, Sect. S3). All analyses including mapping, alignment, and genotyping were performed using HumanG1Kv37 (Genome Reference Consortium Human build 37).

Data processing GSA quality control (QC)
Laboratory QC Genotype clusters for the variants used for clinical reporting were manually curated to ensure accurate variant calling. Other variants were automatically curated using Illumina-recommended filters (https:// www. illum ina. com/ Docum ents/ produ cts/ techn otes/ techn ote_ infin ium_ genot yping_ data_ analy sis. pdf ). Using the data of DNA samples from 1104 individuals run on the GSA in triplicate, the cluster file analyses of each GSA assay found that 610,771 (92%) assays passed and 50,355 (8%) assays failed clustering quality control. Those that failed were excluded and marked as no-calls (./.) in the VCF files.

Data comparisons Principal component analysis
Principal component analysis (PCA) was used to test for intact super-population structure as a corollary for absence of batch and technical artifacts in the genotyping datasets. PCA structure derived from GSA data was compared to the super-population structure derived from 1KG data. Aggregate quality control analysis of the GSA data. A Principal Component Analysis (PCA) plots of 1KG data and GSA genotype data. red: African (AFR), yellow-green: Admixed Americans (AMR), dark-green: East Asian (EAS), blue: European (EUR), purple: South Asian (SAS). B Heatmaps of BeadChip array quality control analysis of call-rate (left), p10GC (middle), and estimated DNA contamination (right). Color gradient scales for the three panels are as follows: call-rate (orange < 0.94-blue > 0.99), p10GC (yellow < 0.50-blue > 0.60) and estimated DNA contamination (rainbow gradient: purple ~ 1%, blue ~ 2%, green ~ 3%, orange/red ~ >4%). C Heatmaps of reproducibility quality control analysis using replicate data as measured by call rate, estimated DNA contamination, number of assays with no genotype calls, and heterozygote to homozygote ratio. Color gradient scales for these four heatmaps are as follows: No genotype calls (blue < 166,000-orange > 400,000), and rainbow gradient for call rate (purple > 0.99-red < 0.94), estimated DNA contamination (purple < 1%-red > 4%), and heterozygote/homozygote ratio (purple > 2.25-red < 1.25), respectively

Whole genome sequence data quality control (QC) Bioinformatics QC
For bioinformatics quality control of rWGS data (n = 263), central tendency and anomalous outlier data points were assessed for (i) total processed reads, (ii) discordant reads, (iii) mapq0 reads, (iv) unmapped reads, (v) mapped reads, and (vi) average depth of sequencing (Additional file 1: Tables S4 and S5). On average > 95% of processed reads per sample (731,227,993/767,540,183 reads) mapped to the reference sequence. Because the concordance of two rWGS datasets (HG00111 and HG00257) with the 1KG data were 0.870 and 0.622, they were dropped from our GSA analyses leaving a total of 261 samples in the rWGS dataset. Variation data for 260 samples (SNVs and short indels ≥ 20× coverage and a Phred score ≥ 30) were submitted to the dbSNP database (Additional file 1: Sect. S8).

Performance metrics Genotype concordance, sensitivity, specificity and positive predictive value (PPV)
GSA and rWGS genotypes were compared to each other and to 1KG genotypes using the following performance metrics: (i) genotype concordance (C), (ii) sensitivity (S), (iii) specificity (P), and (iv) positive predictive value (PPV). We used the following definitions of genotype classification to label genotypes as positive [true positive (tp), false positive (fp)], negative [true negative (tn), false negative (fn)], or discordant (x) ( Table 1): Given the above definitions of true/false positive and negative and discordant genotypes (Table 1), we computed the performance metrics as follows: Genotype concordance (C) Genomic complexity of variation locus (low-complexity regions) To categorize SNVs based on the genomic complexity of the GSA assay locus, we used the UCSC genome browser bed-file definitions to define simple-repeats, micro-satellite regions, and low-complexity regions. The SimpRep, Microsatellites, and RepeatMasker bedfiles were downloaded from the UCSC Genome Browser FTP site and intersected with the GSA manifest file. Across the HumanG1Kv37 reference sequence, there were 962,715 simple repeat, 41,573 microsatellite, and 5,298,131 RepeatMasker regions.

GSA panels Medically actionable predispositions (MAP) 59 gene panel
GSA assays targeting potentially disease-associated variants in MAP59 genes [35] were selected in a multistep process (Table 2). Firstly, GSA assays that interrogated positions within 1000-bases upstream and downstream of the transcript start and end in HumanG1Kv37 were selected for the RefSeq transcript chosen for each gene. Secondly, alleles were annotated with their respective ClinVar classifications, and those that had at least one classification of pathogenic or likely pathogenic were selected. Thirdly, these assays were curated by clinical and laboratory staff to define a managed variant list (MVL) of 1883 assays appropriate for clinical reporting.

Statistics and compute infrastructure
Statistical analyses and data visualization were performed using R (version 3.4.3). Data analysis was done on a Linux Operating System with the following configuration: x86_64, 32 CPUs, 2.8 GHz AMD Opteron Processor 6320. AWS EC2 instances were spun-up for large compute jobs. All NGS and GSA data were archived on AWS S3. In-house software and data processing code and scripts were written primarily in Perl, Ruby, awk, and bash.

Data summary
DNA samples from 263 individuals were purchased from Coriell and genotyped in triplicate (n = 789) with the GSA. Genotypes and data for each replicate were saved to a VCF file. The GSA data were grouped and summarized as replicate datasets 1, 2, and 3. Of the 263 samples, 258 were present in the 1KG. Of the other 5 samples, 3 were from the Personal Genomes Project (PGP) [36], and 2 were from the NIGMS Human Genetic Cell Repository. The 263 × 3 data were compared with the 1KG data and with two WGS datasets, the resequenced WGS data (n = 261; rWGS = 37×) and the downloaded public WGS data (n = 24; pWGS = 51×) (Additional file 1: Sect. S3).

Principal component analysis defines the same population structure in GSA data and 1KG data
Principal component analysis (PCA) on each replicate of autosomal GSA data identified 5 major super populations conserved across replicates. PCA of the 1KG autosomal genotype data from the same loci generated a similar population structure ( Fig. 2A). This suggested that the GSA data did not have confounding technical factors skewing the PCA plot. To determine if fewer GSA genotypes were sufficient for this test, we randomly sub-sampled close to 10,000 genotypes; these recapitulated the population structure (Additional file 1: Sect. S4).

GSA triplicate data analysis shows data reproducibility in the majority of samples and no detectable stochastic QC failure
Given that PCA did not detect major technical confounders within the GSA genotypes, we analyzed the 263 × 3 data for quality and reproducibility [10] (Table 3; Fig. 3). Data were stratified by BeadChip identifiers and sample location on the BeadChip (row, column). Additionally, samples were grouped by replicates, and each replicate sample was evaluated for (i) genotype callrate (n = 610,771 assays), (ii) p10GC, and (iii) estimated DNA sample contamination (Additional file 1: Sect. S5). Aggregate quality control analysis showed a lower p10GC in higher numbered rows on the BeadChip (Fig. 2B); excluding contaminated samples, p10GC ranged from 0.56-0.61 (mean = 0.60, SD = 0.0085) in row 1 and from 0.50-0.61 (mean = 0.55, SD = 0.03) in row 12. Over 99% (782/789) of samples had a call rate of > 0.98. 3 samples in the third replicate dataset were contaminated, and 2 of these 3 samples had a call rate < 0.98 (0.93 and 0.94, Fig. 2C).
To test if call-rates were reproducible across replicates, we measured deviations from expectation and dispersion. The first approach, a Z-score method, computes the number of standard deviations a replicate Table 2 Selection process for GSA assays targeting genotypes considered medically actionable predispositions  sample call-rate is from the expected as defined by the global dataset average and standard deviation. The second approach computes the average call-rate of all replicates for a given sample and then computes variation around the average. Using the Z-score method, 7 samples had a Z-score ≤ − 4. With a more conservative cut-off (Z-score < −3), 11 samples deviated from expectation (Additional file 1: Sect. S6; Fig. S12). When analyzed relative to the BeadChip row and column, outlier Z-scores occurred for wells on the edge of the Illumina BeadChip-R12C01 or R11C01; the only exceptions were two contaminated samples that were in wells R01C01 and R01C02. Dispersion metrics calculated for call rates across each set of three replicates (Table 4) identified higher relative dispersion for the same samples detected by the Z-score method. Replicate pairwise concordance was calculated to assess the stochastic nature of sample genotyping quality and these were plotted as a 3D scatter-plot: [R1 vs. R2 (x-axis), R2 vs. R3 (y-axis), and R1 vs. R3 (z-axis)] (Fig. 3). The data along the diagonal of the cube are correlated data values across triplicates for all measured GSA genotypes for a given DNA sample. 260 of 263 samples in the triplicate dataset (262/263 R1 vs. R2; 260/263 R2 vs. R3; 261/263 R1 vs. R3) had concordance greater than 0.999 between replicates suggesting high reproducibility. Offdiagonal points, i.e., those with poor call rates (< 0.98) (Fig. 2B), were along the edge of Illumina chip or contaminated; we did not observe random occurrence of poor call rates.

Grouping GSA assays by variation type shows that SNVs have > 0.99 performance relative to the benchmark dataset 1KG across all metrics
Of the 263 samples with GSA data, 258 had corresponding 1KG genotype data for computing performance metrics of concordance, sensitivity, specificity, and PPV. Each GSA assay was grouped according to the type of nucleotide change assessed: (a) single nucleotide variant (SNV), (b) multi-allelic variant (MAV), (c) insertion, and (d) deletion (Table 5). SNVs accounted for 99.3% (656,601/661,126); 610,771 of these passed cluster file quality control, and 594,361 detected genotypes present in the 1KG. Among the MAV assays, 526 of 616 passed cluster file QC; however, because only 3 of these had genotypes present in the 1KG, we excluded MAVs from further analysis. Among insertion assays, 1044 of 1110 passed cluster file QC, and 36 of these had genotypes present in the 1KG. Among deletion assays 2677 of 2799 assays passed cluster file QC, and 95 of these had genotypes present in the1KG. Using the three replicate . The data is plotted as correlation across triplicates for all measured GSA genotypes for a given DNA sample. Note that most samples had concordance greater than 0.999 between replicates suggesting high reproducibility. A few samples had off-diagonal points, i.e., those with poor call rates or reproducibility. The color rainbow gradient is from blue (< 0.996) to dark red (1.00) GSA genotype datasets, the performance metrics of SNV assays were > 0.99. In contrast, insertion assays had highly variable concordance with the 1KG, and deletion assays had poor performance metrics (Fig. 4A).

GSA assays for transitions perform better than do those for transversions
Classifying the GSA-detected SNVs as transitions (purine-to-purine OR pyrimidine-to-pyrimidine) or transversions (purine-to-pyrimidine or vice versa)  (Fig. 4B). The assays for transversions between complementary nucleotides (i.e., A>T, T>A, C>G, G>C; Additional file 1: Sect. S7) had lower sensitivity (< 0.99) and lower cluster file QC pass rate (66-73%; Table 6) than did those for other transversions.    . 2011), the average performance metrics for GSA assays passing cluster file QC in each bin showed that PPV and sensitivity suffered when the alternate allele frequency was < 5%, whereas specificity and concordance declined as the alternate allele frequency increased (Fig. 4C). Among the 594,346 GSA SNV assays with IKG genotypes, 476,707 had a PPV equal to 1 (zero false positives) based on concordance with the 1KG and rWGS genotypes (Table 8). We observed that 81% of GSA assays in the [0-0.1%] and 28% of GSA assays in the (0.1-1%] bins had PPV < 1 (Fig. 5; Table 9), whereas other allele frequency bins had an average of 13% (8-17%) with a PPV < 1 (Table 8). These results are consistent with prior observations showing that accurate calling of rare alleles (MAF < 0.01) by genotyping arrays is compromised by low genotype frequencies and an absence of the homozygous alternate alleles needed for construction of cluster files [22,37].

GSA assays interrogating low-complexity genomic regions perform poorer than other assays
To determine assay performance characteristics within repetitive regions of the genome, we intersected GSA assays with annotated low complexity regions (LCRs) including simple repeats, microsatellites, and repeat masked (RepeatMasker-defined) regions in the human genome. Of a total of 594,346 assays passing QC and present in the 1KG, 203,901 (~ 34%) assessed a variant within one of the three annotation classes. 201,579 GSA assays mapped within the RepeatMasker class. Overlapping partially with the other two classes, 431 GSA assays mapped within the simple repeat class. GSA assays targeting genotypes within each LCR class had poorer performance metrics than did assays interrogating genotypes outside of these regions (Fig. 4D).

rWGS performed better than GSA relative to the benchmark dataset 1KG
rWGS data corresponding to GSA assays passing QC were extracted from the rWGS gVCF files and compared to the 1KG. Restricting the analyses to GSA assays for which > 90% of rWGS samples had genotype data defined 602,582 assays and excluded 38,093 GSA assays. An additional 9642 assays on the chromosome X were excluded due to discrepancies in genotype representation in comparison datasets. For the remaining 592,940 autosomal assays, the rWGS genotypes with ≥ 20 × coverage and  a Phred score ≥ 30 were used for calculation of performance metrics. These analyses, i.e., GSA vs. 1KG and rWGS vs. 1KG, showed consistent average metrics and small standard deviations among datasets (Table 9). For the 256 Coriell samples with 1KG data, we observed that rWGS performed better than GSA across all 4 performance metrics ( Fig. 6A; Table 9). Overall average concordance, sensitivity, and specificity for rWGS vs. 1KG were 0.9981, 0.9981 and 0.9991, respectively, whereas for GSA vs. 1KG, they were 0.9932, 0.9927, and 0.9957, respectively. PPV was 0.9977 for rWGS vs. 1KG and was 0.9892 for GSA vs. 1KG (Table 9).

Over 82% of all GSA assays have a PPV = 1
We compared the GSA and rWGS genotypes to the 1KG and computed the PPV. As shown in Fig. 6B, over 82% (476,828) of assays had a PPV of 1 for both the GSA and rWGS. Approximately 1.5% (8710) of rWGS assays had a PPV of 1 when GSA was 0, whereas only 0.12% (699) of GSA assays did when rWGS was 0.

GSA MAP59 secondary findings validated using rWGS, pWGS, and 1KG
Given that > 80% of GSA assays have a PPV = 1, we assessed rare variation detection within the 59 medically actionable predisposition genes (MAP59) defined by the American College of Medical Genetics (ACMG) [35]. Given the expected secondary finding rate of 1-2% [38][39][40] and the limited genomic space profiled by the GSA, we hypothesized 2-3 or fewer samples with GSA-detectable variants in the 261 cohort. Additionally, we hypothesized that comparison of these data to the 1KG and the rWGS data identifies false negative and false positive variants as well as pathogenic variation undetected by the GSA. Focusing on nucleotides with ≥ 20 × rWGS coverage (Fig. 7), we found that an average of 6347 (± 88) sites were genotyped by both rWGS and GSA in any given DNA sample. The GSA vs. rWGS average concordance, sensitivity, specificity, and PPV were 0.99897, 0.99367, 0.99962, and 0.9946, respectively.
For clinically reportable rare variants curated into the managed variant list (MVL), the GSA and rWGS were concordant for a heterozygous variant (MUTYH p.(Gly368Asp); rs36053993) in three samples and across GSA replicates. Two of the 3 samples had 1KG data Fig. 6 Scatter-plot comparison of performance metrics of whole genome sequencing (rWGS) and GSA using 1KG as the benchmark dataset.
A Scatter plots show sample-level performance metrics of rWGS and GSA relative to 1KG reference data. Plots are concordance (top left; blue), sensitivity (top right; orange), specificity (bottom left; green) and positive predictive value (PPV) (bottom right; maroon) respectively. Each dot represents a single sample's performance metric value. B Density scatterplot of each GSA assay's positive predictive value computed for GSA (y-axis) vs. rWGS (x-axis) using 1KG as the benchmark dataset. Each square represents PPV measured for GSA and rWGS relative to 1KG benchmark dataset, and the color indicates the number of assays within each square. The color gradient of each square ranges from 1 assay (dark purple) to 476,828 assays (yellow); therefore, the color on the scatterplot indicates the density of data-points in 2 dimensions and were concordant; one of these two had pWGS data that was also concordant. Highlighting the potential for false positives, rWGS and 1KG data refuted a GSA call of PKP2 p.(Arg355Ter) (rs754912778) in one sample. Conversely, highlighting the potential for false negatives, rWGS and 1KG detected two variants that were not detected by GSA: RB1 p.(Arg661Trp) (rs137853294), which the GSA called homozygous reference in triplicate, and MUTYH p.(Pro391Leu) (rs529008617), which the GSA called "no-call" in triplicate. In summary, the GSA identified 1 pathogenic variant (true positive), 1 false positive, and 2 false negatives (2 assayed and missed) among the MAP59.
The rWGS rate of detection of rare pathogenic variants in the MAP59 genes was 0.034 (3.4%); 7 variants in Fig. 7 Plot of the average percentage of bases within each MAP59 gene covered by whole genome sequencing (rWGS) to a read depth of A ×10 or more (gte10x) B ×20 or more (gte20x) among the 263 samples. Each rWGS nucleotide was required to have a Phred-based quality score of greater than 30 to be considered for this analysis 9 samples from a population of 261. Removing the 3 variants that were not independently confirmed by the 1KG due to lack of 1KG data gives 4 pathogenic variants in 5 individuals from a population of 261 or a rate of 0.019 (1.9%). This range (0.019-0.034) of pathogenic variants in the MAP59 genes is consistent with the published discovery rate [38,39,41,42].

Discussion
We report an approach to analytical validation of the GSA through quality analyses and through assessment of performance by comparison to benchmark datasets and independent whole-genome sequencing data. To the best of our knowledge, this is the first comprehensive analytical validation of the GSA for clinical genotyping. Our findings support and extend recently reported research studies assessing the utility of the GSA for genetic screening in primary immunodeficiency [43], for population-based genomic screening for rare and medically relevant variation [44], and for detecting rare and clinically relevant markers in multiethnic Indian populations [45].
In our study we used call rate and sample contamination as preliminary parameters of quality control for genotype analysis. Call rate is a primary quality control parameter in all genotyping studies [12,13]. A high threshold for call rate not only ensures inclusion of samples with high quality genotype data but also allows, independent of sample DNA quality, for detection of assays that perform poorly. Additionally, sample contamination detection [14] is key in preventing return of false positive genotypes and is demonstrated by our results. While more advanced quality control methods such as Hardy-Weinberg Equilibrium (HWE) test [15], likelihood of error [19], departure from Mendelian inheritance, and pedigree information are used in various research studies [4,20], they are implemented in analyses that follow genotype generation and are dependent on what analyses are subsequently performed using the genotype data. HWE is used to detect genotypes that deviate from the expectation of HWE, and it is typically applied to variants with a MAF of greater than 0.05 [12]. Consequently, because of our interest in variants of lower MAF, we did not implement this QC metric; however, HWE might be useful within certain cut-offs for MAF as implemented by Suratannon et al. [43] and Narang et al. [45]. Similarly, Mendelian inheritance and pedigree information quality control are critical for linkage and segregation analyses and did not apply to our individual-focused assay.
This evaluation of GSA data is consistent with previous studies that demonstrated the utility of sample data quality metrics like genotype call-rate, p10GC, and DNA contamination detection [11,22]. By analysis of replicates, we show that the majority of the GSA data are highly reproducible. Outliers arose either from positioning along the edges of the Illumina BeadChip or from contamination. Characterization of each GSA assay by variation class, type, genomic DNA complexity, and alternate allele frequency showed that the GSA has the highest performance for SNVs and transition nucleotide changes in genomic regions of high complexity. In contrast, assays interrogating low-complexity regions, rare alleles, or transversions performed poorly. Transversions between complementary nucleotides likely performed poorly because of the characteristics of the assays for these particular transversions (Additional file 1: Sect. S7). Also, consistent with previous reports [46][47][48], assays for rare alleles (< 0.001) had lower performance and might be improved by using algorithms for rare variant detection [10,31,32] or joint-calling [22] rather than the default genotype caller (GenCall). These should be considered in the future to improve detection of rare variants by genotyping chips.
The analytical framework implemented in this study followed a three-way analysis (GSA-rWGS-1KG) to assess the strengths and limitations of individual GSA assays. Unlike many published analyses in which WGS is the test dataset and the BeadArray genotypes are the truth [25][26][27]30], our study had the BeadArray as the test dataset and WGS as the truth. The reversal of test and truth datasets is a major challenge for comparing our results to the published literature. To overcome this challenge, we ensured that the rWGS data had performance metrics (concordance = 0.9981) comparable to that previously published (concordance = 0.9984 [25]). The three-way analysis framework also allowed detection of false positive and false negative genotypes on the GSA platform. Though not evaluated in the current study, the three-way comparison framework in our analysis allows for modeling of genotyping-error specific to variation classes and categories triaged during characterization of the GSA.
Over 82% of assays on the GSA returned genotypes with a high positive predictive value (PPV). The GSA detected some pathogenic variation (MAP59) in the test dataset of 261 Coriell samples, and these variants were independently validated by either the 1KG data or the rWGS/pWGS data or both. Although we attempted to compare GSA results to other chip results, the comparison to previous work was impeded by differences in probe content and density as well as chip design (e.g., 610 k assays on GSA, vs. 247 k assays on HumanExome chip). Some of the pros and cons of using the GSA are summarized in Table 10 below.
The test characteristics of the GSA compared to WGS clearly show that the GSA is not a diagnostic genomic test for individuals with rare disorders. As shown by our MAP59 results and recent research studies [43,44], the GSA lacks robustness for genotyping rare variants as well as probes for detection of private familial disease variants. On the other hand, we show that the GSA has the analytical robustness to serve as a clinical screen for genotypes for which one can establish robust cluster files for the AA, AB, and BB genotypes. This is most easily accomplished for more common genotypes that contribute to polygenic predispositions to disease, particularly common diseases. Screening of an asymptomatic population to assess the likelihood of predisposition to a disease is well established within medicine, and examples include newborn screening for inborn errors of metabolism, mammography for breast cancer, and cholesterol levels for coronary artery disease [50,51]. A major objective of screening tests is to reduce morbidity and mortality in the subject population through risk stratification to target surveillance, early detection, and treatment. With the characterization of genomic risk for drug responsiveness and predisposition to various cancers and cardiovascular disease [52][53][54], we propose that the GSA offers a potential clinical tool for genomic screening.

Limitations of our study
Our comparison of BeadChip arrays to NGS and benchmark datasets has some limitations. Firstly, we evaluated our dataset using accepted algorithms. This did not take into account the benefits of consensus genotyping by multiple algorithms for GSA or NGS data; Hwang et al. found that consensus genotyping minimized false findings [47]. Secondly, cell-line derived variation or lowlevel somatic variation might also have contributed to differences between datasets [25]. Thirdly, we did not analyze variants close to or overlapping other variation in the same location, e.g., insertions/deletions and copy number variation, because these loci are eukaryotic mutation hotspots [55]. Fourthly, our analysis would benefit from comparison to variant benchmark datasets defined in more recent publications [47] and to NIST/ GiAB datasets.

Conclusions
We established the analytical validity of the GSA via a systematic approach utilizing benchmark and rWGS data to evaluate the performance of each assay. We highlight that the GSA assays interrogating rare variants, transversions, and variants within low-complexity regions need careful evaluation. GSA assays can be analytically validated to clinically screen for common genotypes predisposing to disease.