M3-S: a genotype calling method incorporating information from samples with known genotypes

Background A key challenge in analyzing high throughput Single Nucleotide Polymorphism (SNP) arrays is the accurate inference of genotypes for SNPs with low minor allele frequencies. A number of calling algorithms have been developed to infer genotypes for common SNPs, but they are limited in their performance in calling rare SNPs. The existing algorithms can be broadly classified into three categories, including: population-based methods, SNP-based methods, and a hybrid of the two approaches. Despite the relatively better performance of the hybrid approach, it is still challenging to analyze rare SNPs. Results We propose to utilize information from samples with known genotypes to develop a two stage genotyping procedure, namely M3-S, for rare SNP calling. This new approach can improve genotyping accuracy through clearly defining the boundaries of genotype clusters from samples with known genotypes, and enlarge the call rate by combining the simulated data based on the inferred genotype clusters information with the study population. Conclusions Applications to real data demonstrates that this new approach M3-S outperforms existing methods in calling rare SNPs. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0824-5) contains supplementary material, which is available to authorized users.

In this article, we focus on the analysis of Illumina arrays, which use two color single base extension (SBE) biochemistry technology [16] to infer the genotype of a SNP with two alleles A and B. The goal of genotype calling is to infer the genotype (AA, AB, or BB) carried by an individual across the SNPs in the genome. All the genotype calling algorithms share the common feature of first defining genotype clusters and then assigning individuals to these clusters. They differ in how the clusters are defined and how the samples are allocated to these clusters. One class of genotyping algorithms is populationbased where all individuals are genotyped SNP-by-SNP. The genotype clusters are defined through the joint analysis of all samples at a given SNP separately. Although commonly used, its performance highly depends on the size of the study population. It may not perform well for SNPs with low minor allele frequencies (MAF) where there may be very few individuals for certain clusters. For example, GenCall [7], a representative method in this class, needs a large reference population (e.g. 10,000) to accurately define genotype clusters for SNPs with MAF < 0.01 [6]. In contrast, another approach, called the SNPbased method, defines genotype clusters using all SNPs of an individual at a time, as represented by GenoSNP [6]. The good performance of this approach depends on two assumptions: (1) The probes of all the SNPs behave similarly; and (2) the variations within a genotype cluster are much smaller than that between clusters. Compared to the population-based method, GenoSNP does not need a large number of samples to ensure calling accuracy for SNPs with low MAF. However, empirical applications of this approach lead to many more SNPs failing the Hardy-Weinberg (HW) principle, indicating that the two implicit assumptions are likely violated in reality.
To address the limitation of these two approaches, we developed M 3 that combines the population-based strategy with the SNP-based approach to improve calling accuracy for rare SNPs [17]. Compared to GenCall, M 3 borrows genotype cluster information from reference SNPs to improve the calling performance for rare SNPs. It also improves upon GenoSNP by utilizing the populationbased calling scheme. However, the effectiveness of M 3 depends on the quality of the reference SNP. If the reference SNP behaves very different from the target rare SNP, the inferred genotype will likely be incorrect. In this article, we consider using additional information to further improve the quality of the reference SNP. In particular, we consider the use of samples with known genotypes, e.g. HapMap samples, that are often included for quality control purposes. Because the HapMap samples have been extensively studied and their true genotypes can be considered known with almost certainty, the genotype calling results from these samples can provide a good metric for the performance of calling algorithms. Hence, we incorporate known genotype information from these quality control samples into the reference SNP selection procedure under the general framework for M 3 , and this new method is named M 3 -S where S denotes samples with known genotypes for calling. More specifically, M 3 -S utilizes known genotype information to construct three genotype clusters in the first stage, and the entire samples together with simulated data based on the inferred cluster information are then genotyped in the second stage.
The key component in our improved method is to explicitly define the boundaries of the three genotype clusters for each SNP through samples with known genotypes. Although the idea of leveraging subjects with known genotype information is intuitive, there is a practical challenge in implementations, e.g., there is often only two or even one cluster for known genotype samples, making the inference of boundaries difficult. This can be solved by taking advantage of the reference SNP selection method [17] developed for M 3 to the samples with known genotypes. It can directly define boundaries of genotype clusters before genotyping other study subjects. In addition, our proposed method is computationally efficient and applicable to the large-scale intensity data (see Additional file 1B).
The rest of the paper is organized as follows. We first describe the two stages of our new method (M 3 -S), and then explain how this new method helps calling for rare SNPs. Finally we compare the performance of our method with existing methods to demonstrate its better performance.

Illumina chip data description
The Illumina microarrays probe millions of SNPs per sample, with newer arrays including more recently discovered rare SNPs. In their probe design, the Illumina arrays are made up of a number of beadpools containing millions of beadtypes. Every SNP is represented by one beadtype with 20 pairs of allele-specific intensities for one individual [16], thus each beadtype is able to assay both SNP alleles. In this study, we consider the pair of raw intensity values at each beadtype for every sample to infer geneotypes.

Stage I: Estimation of three genotype clusters from samples with known genotypes
In the first stage, samples with known genotypes are first analyzed separately to infer three genotype clusters for each SNP denoted by AA, AB, and BB. We use B to denote the less common allele, and all possible genotypes are denoted as: AA, AB, or BB. Let x is = (r is , g is ) denote the measured intensity values of the s th SNP for the i th individual where (i = 1, . . . , n; s = 1, . . . , S) with S being the total number of SNPs and n being the total number of subjects. For every SNP, we use a hs = (r hs , g hs ) to denote the raw intensity values of the h th subject with known genotype for the s th SNP, where h = 1, . . . , n a and n a is the total number of samples with known genotypes. In the study that motivated our work, HapMap samples were included and we obtained their "true" genotypes from the International HapMap Project [3,4]. From the notations introduced above, it is clear that n a is less than n, and each element in M = {1, . . . , n a } can be matched to one unique element in a set N = {1, . . . , n}. To distinguish SNPs with study population and SNPs only containing individuals with known genotypes, the latter is named as featured SNPs. We propose the following five-step procedure to estimate the parameters of the genotype clusters.
(1) Step I: randomly assign samples with known genotypes into two groups. One will be called the training set with l a samples, and the other is the testing set. Generally, the training samples are used to infer the parameters of genotype clusters, and the testing individuals are used to evaluate the performance of the method. In this study, we define different allocation ratios between the training samples and testing samples (ratio 1:1 or 2:1) to evaluate the impact of the allocation ratio on genotyping. For the data set that motivated this research, there are 141 samples with known genotypes. Under the allocation ratio 1:1, we consider 71 training samples, i.e. l a = 71, and 70 testing samples. For the allocation ratio 2:1, 94 subjects are assigned to the training group, i.e. l a = 94, and 47 individuals are in the testing set. (2) Step II: evaluate the quality of each featured SNP via analyzing the training samples. When the training samples are randomly selected, the number of distinct genotypes and the sample size of each genotype can be inferred from known genotypes, then all the featured SNPs with training subjects (a * s : the raw intensity vector of training samples) can be classified into two groups, where G 1 collects those featured SNPs having three distinct genotypes, whereas G 2 collects those featured SNPs with fewer than three distinct genotypes.
where c as denotes the number of genotype clusters for training samples at the s th featured SNP; l a 0 s , l a 1 s and l a 2 s denote the number of training subjects in the three genotype groups (AA, AB or BB) for the s th featured SNP.
Note that S R is the set of Ref Then the combined vector of raw intensities is partitioned into three clusters according to the training samples' known genotypes. If k denotes the cluster label with values 1, 2, or 3, aa * sk is the combined raw intensity matrix in the k th genotype group. The mean and variance of the three genotype clusters can be estimated by the following equations, where l ask and l ark denote the number of training samples at the k th cluster for the s th featured SNP and r th Ref featured SNP, respectively; 1 G 2 is an (l ask + l ark ) x 1 column vector with all elements equal to 1; 1 G 1 is a l ask x 1 column vector with all elements equal to 1; and a * sk is the raw intensity matrix of training subjects in the k th genotype group for the s th featured SNP.
Note that μ ask is an 1 × 2 vector measuring the average intensity of training samples in the k th cluster for the s th featured SNP; ask is a 2 × 2 covariance matrix of the k th cluster at the s th featured SNP. In summary, the first stage focuses on selecting reference featured SNPs to better estimate the parameters of the three genotype clusters.

Stage II: Gaussian mixture model for augmented intensity data
In general, enlarging sample will lead to improved genotyping results, especially for rare variants. But, this is not feasible due to constrains on available samples and budget. So we propose to "increase" the sample size through simulating from the inferred cluster parameters and combine the simulated data with the observed data to improve calling accuracy in our second stage analysis. (1) Step I: simulate intensity data according to the inferred parameters of the three genotype clusters from the training samples with known genotypes.
y js ∼ Gaussian k (μ ask , ask ) with probability 1 3 j = 1,. . . ,m, s = 1,. . . ,S, k = 1, 2, or 3 In this study, we simulate m additional individuals at each SNP from the above Gaussian mixture distribution. Parameters μ ask and ask (k = 1, 2, or 3) could adequately provide the center and variability of the three genotype clusters for each SNP, and each cluster contains equal number of simulated subjects ( m 3 ). Here, we vary the value of m to be 600, 1500, and 3000 to evaluate the impact of this simulated data on genotyping. Specifically, we simulate equal numbers of subjects in every genotype group to improve the representation of rare genotypes in the samples for better genotype calling. More importantly, adding this simulated data with equal numbers in each genotype cluster to the observed data will not influence the configure of major homozygote and minor homozygote for the observed data. (2) Step II: genotype calling using both observed and simulated data. The pair of original raw intensities are x is = (r is , g is ), and the pair of simulated raw intensities are y js = (r js , g js ), then the combined raw intensity values at the s th SNP t s = x s y s consist of the augmented data. Within one SNP, pairs of raw intensities primarily consist of three genotype clusters which correspond to three genotypes (AA, AB, and BB). We apply the Gaussian Mixture Model (GMM) with fixed components [18], to t s , where the number of components is fixed at three. Besides, we introduce a null component for those individuals whose genotypes are difficult to be assigned to one of the three clusters. In principle, this model assigns the w th pair of the combined raw intensities t ws to one component with probability π sk where k measures three components corresponding to the three genotypes. The latent genotype class is denoted by the indicator variable z ws generated from a multinomial distribution where z ws takes the value of 1, 2, or 3. Then the three-component GMM can be expressed as: (t ws |μ sk , sk ) I(z ws =k) w = 1,. . . ,n * , s = 1,. . . ,S, k = 1, 2 or 3

(5)
where n * is the total number of individuals collected at the s th SNP and simulated data where n * = n + m, and S is the total number of SNPs. The normal density has mean μ sk and variance-covariance matrix sk in the k th cluster for the s th augmented SNP data; all pairs of raw intensity at the s th augmented data are measured by t s = (t 1s , t 2s , . . . , t n * s ) T ; the unknown parameters of the GMM is denoted by s = (π s , μ s , s ) where π s = (π s1 , π s2 , π s3 ), μ s = (μ s1 , μ s2 , μ s3 ), and s = ( s1 , s2 , s3 ). Through solving the score equation, the maximum likelihood estimates (MLEs) of the above parameters can be easily estimated [18]. The (u + 1) th iteration of the indicator variable z ws = k (k = 1, 2, or 3) is inferred by The relevant iterative estimates for the mean μ sk and variance-covariance matrix sk are Two values, Posterior Rate (PR: q k ws ) and Average Posterior Rate (APR: q s ), for the s th augmented data are calculated to quantify the quality of SNP calling [17], where It can be seen that PR, measures the strength of each observation's cluster signal, and APR is the average value of all individuals' PRs at the s th augmented data [17]. Note that n * k is the sample size in the k th cluster for the s th augmented SNP data. Genotypes can be inferred from the augmented data, including both observed and simulated subjects.

Data set description and cutoffs setting
We analyzed Illumina Omni 1M array data collected from 3258 samples to compare the performances of M 3 -S with representative calling algorithms, including Gen-Call (a population-based method), GenoSNP (a SNPbased approach), and M 3 (a hybrid of the previous two approaches). In this data set, 38 HapMap samples were measured multiple times, using a total of 141 arrays. We focused on SNPs from chromosome 22 with a total of 15,020 SNPs. The performance of each genotyping method was evaluated by the comparison results between SNP calls inferred by each calling method and those from the International HapMap Project for these HapMap samples [3]. We chose the following cutoffs for the four calling algorithms: GC score ≥ 0.15 in GenCall used to filter good quality SNPs, samples with posterior probability > 85 % for GenoSNP, and average posterior probability > 0.85 for both M 3 and M 3 -S. The effects of different thresholds on genotyping are summarized in Additional file 1E.

Data analysis results
Because there were 141 HapMap subjects, we used their "true" genotypes to evaluate the accuracy of different calling algorithms. We varied the numbers of training and testing samples (allocation design: 2:1 and 1:1) to explore their impacts on genotyping. The effectiveness of two allocation designs is summarized in Table 1. It is clear that the allocation 2:1 design provides more accurate genotypes compared to those of the allocation design 1:1. It is partially due to the fact that more samples are assigned to the training set to infer the boundaries of three genotype clusters under 2:1 design. Therefore, we select 2:1 design to do further analysis. Table 1 provides the comparison results among different calling methods in terms of calling accuracy. It can be seen that M 3 -S (99.38 %) has the best call accuracy and high call rate, followed by M 3 , GenoSNP and GenCall. We can also see that M 3 gives the highest call rate (99.77 %), followed by M 3 -S, GenoSNP, and GenCall.
We simulated different sizes of samples (e.g. 600, 1500, or 3000) in the second stage of M 3 -S to see the impact of simulated data on the genotyping, especially for rare variants. 600, 1500, or 3000 simulated samples are roughly 1, 1/2 or 1/5 times the original study population. In brief, the performance of our proposed calling algorithm based on three different simulation designs is evaluated through the comparison between the genotypes of testing HapMap samples inferred from M 3 -S and the genotypes of these subjects inferred from the HapMap project. Table 2 shows that enlarging the number of samples for simulated data can improve the accuracy of genotypes of testing HapMap samples for extremely rare SNPs. Thus, the 2:1 allocation design with 3000 simulated samples gives the best genotype accuracy (99.23 %), and the highest call rate (99.81 %) for extremely rare SNPs (MAF < 0.01). For the real data, we think the 2:1 allocation design with 3000 simulated samples is preferred while using M 3 -S to infer genotypes.  The successful application of this proposed calling procedure (M 3 -S) depends on the accurate estimations of the three genotype clusters from the subjects with known genotypes in the first stage, and adequately simulated subjects from the Gaussian mixture distribution in the second stage. For parameter estimations of the three genotype clusters, the reference SNP selection method [17] may help infer the boundaries of all genotype clusters for rare SNPs. To evaluate the influence of the simulated data, we test the performances of different sizes of simulated data (e.g. 600, 1500 or 3000) on the same rare SNP (rs1003505). As shown in Fig. 1, a larger size of simulated data generates a bigger cluster easily covering the target rare SNP. Hence, adding 3000 simulated data in our real data is a good option for improving genotyping accuracies. Next, we also select three rare SNPs (rs1003505, rs1003676 and rs1008185) displaying three genotype clusters, two clusters, and one cluster, respectively. As shown in Fig. 2, our method with 3000 simulated observations can accurately infer genotypes for different rare SNPs by leveraging information from the simulated data.
Here, we extend our analysis to the entire study population, not just testing HapMap samples. For all study subjects, the call rate and the concordance rate of the four calling algorithms are compared, where the call rate is the ratio of genotypes that can be inferred from each calling method to those that need to be genotyped, and the concordance rate is the genotype agreement between two algorithms. The overall comparison results are summarized in Table 4. It can be seen that genotypes inferred from M 3 -S, M 3 , GenCall and GenoSNP are highly consistent, especially those from M 3 -S and M 3 . This is likely due to the fact that both M 3 -S and M 3 are population-based calling approaches in a broad sense, and the reference SNP selection step in M 3 -S and M 3 can improve their call rates (99.56 %, 99.71 %) (Table 4). Besides, M 3 has the highest call rate because it utilizes two SNPs' subjects (2 × 3258) to infer genotypes at rare SNPs, but M 3 -S uses one SNP's individuals plus 3000 simulated subjects (3258 + 3000) to infer genotypes at these rare SNPs. The change in sample size of two methods results in the difference of call rate between these two approaches. Moreover, Additional file 1: Table S4 and Figure S1 further explain the differences among M 3 -S, GenCall and GenoSNP.
Hardy-Weinberg Equilibrium (HWE) test is an important criterion to examine genotyping quality as failing HWE test may indicate calling errors. We performed HWE tests for the four population groups: Hispanic African-American, non-Hispanic African-American, Hispanic European-American, and non-Hispanic European-American, separately. Table 5 summarizes the number of SNPs that fail the HWE test. GenoSNP has the largest number of SNPs failing HWE test, whereas GenCall has the smallest number of SNPs failing HWE. M 3 and M 3 -S fall in between in terms of the number of SNPs not meeting HWE. Specifically, we find that M 3 -S may make more SNPs fail the HWE test when this approach enlarges the number of samples for simulated data. It seems that improving call accuracy for rare SNPs is in conflict with guaranteeing the quality of SNPs via HWE test. People have to select the appropriate samples size for simulated data to balance the above two criteria.

Discussion
M 3 -S was motivated from a data set with HapMap samples genotyped as part of the study, but many studies do not have these valuable samples collected. In this case, we may use pairs of raw intensity of HapMap subjects provided by the International HapMap Project [3,4], but raw data in the Affymetrix data format cannot be directly applied to our program. Although some researchers have successively transformed the Affymetrix raw data into the Illumina raw data with high consistency [19][20][21], we are not sure the impact of transformation on the final genotyping result. M 3 -S applies the reference SNP selection step to samples with known genotypes for improving the missing rate and call accuracy for rare SNPs, especially for SNPs with very low MAF. The successful application of M 3 -S is to select the appropriate Ref featured SNP from the whole genome. In practice, it is not practical to search the Ref featured SNPs from the entire genome due to plenty of tedious calculation involved, then the instrumental featured SNPs near the target featured SNP are picked out.
The assumption about identical probe response of all SNPs allows each SNP to borrow information from other good quality SNPs. When some probes break this assumption, searching for the most optimal Ref featured SNP is  still an open question. Besides, the reference SNP selection is based on maximizing the mathematical similarity between the target featured SNP and the Ref featured SNP. Because the probe intensity is highly correlated with 50 base probe sequence, incorporating the probe sequence information in the reference SNP selection procedure may greatly improve the quality of SNP calls.
The prerequisite for running M 3 -S is based on the collection of samples with known genotypes. However, some SNP-array data may not contain individuals with known genotypes, which results in the restricted use in M 3 -S. Fortunately, M 3 -S is a supplement to our previous method M 3 [17]. We strongly suggest scientists to use M 3 method if their SNP array data do not contain any HapMap samples. If the data have samples with known genotypes, they could apply M 3 -S. Scientists could try these methods according to their requirements. Recently, a large amount of rare variants are widely captured in many SNP array data, some new powerful calling algorithms have been proposed for accurately calling rare SNPs, such as: optiCall [22] and zCall [11]. To better understand the effectiveness of various calling algorithms, we consider to summarize and compare the performances of multiple popular calling methods in our future study for providing an application guide.