Next-Generation Sequencing Data-Based Association Testing of a Group of Genetic Markers for Complex Responses Using a Generalized Linear Model Framework

To study the relationship between genetic variants and phenotypes, association testing is adopted; however, most association studies are conducted by genotype-based testing. Testing methods based on next-generation sequencing (NGS) data without genotype calling demonstrate an advantage over genotype-based testing methods in scenarios where genotype estimation is not accurate. Our objective was to develop NGS data-based methods for association studies to fill this gap in the literature. Single-variant testing methods based on NGS data have been proposed, including our previously proposed single-variant NGS data-based testing method, i.e., the UNC combo method. We previously proposed an NGS data-based group testing method using a linear model framework, which can handle continuous responses. In this paper, we extend our linear model-based framework to a generalized linear model-based framework so that the methods can handle other types of responses, especially binary responses, which are common in association studies. To evaluate the performance of the various estimators and compare them, we performed simulation studies. We found that all methods control Type I errors, and that our NGS data-based methods outperform genotype-based methods for other types of responses, including binary responses (logistic regression) and count responses (Poisson regression), especially when the sequencing depth is low. We have extended our previous linear model (LM) framework to a generalized linear model (GLM) framework and derived NGS data-based methods for a group of genetic variables. Compared with our previously proposed LM-based methods, the new GLM-based methods can handle more complex responses (for example, binary and count responses) in addition to continuous responses. Our methods fill the literature gap and show an advantage over their corresponding genotype-based methods in the literature.


Introduction
Next-generation sequencing (NGS) is a massively parallel sequencing technology used to determine the order of nucleotides in entire genomes or targeted regions of deoxyribonucleic acid (DNA). That is, for a group of common variants, a joint significance (JS) test based on NGS data is proposed; for a group of rare variants, a variable collapse (VC) test based on NGS data is proposed. Compared with our previous work [17], the major contribution of this work is that it can handle a range of types of phenotypes, including continuous, binary, and count/integer phenotypes based on the GLM framework, whereas our previous work [17] can only handle continuous phenotypes based on the LM framework.
In this study, we propose novel NGS data-based testing methods for association studies based on a generalized linear model (GLM). The proposed methods fill the literature gap as the first NGS data-based group testing methods without genotype calling based on a GLM. We previously used the LM framework to develop association testing methods for a group of genetic variables based on NGS data [17]; however, those methods can only handle continuous responses. In this paper, we aimed to extend our linear model-based framework to a GLM-based framework so that our methods can handle other types of responses, especially binary responses, which are commonly faced in association studies. The objectives of this study were to (1) develop novel NGS data-based testing methods for association studies based on the theoretical framework of generalized linear models and (2) show that our proposed methods achieve better testing performance than their corresponding genotype-based methods.

Methodology
Denote the sample size of the study as N. For individual i (1 ≤ i ≤ N), the data are (y_i, g_i, x_i), where y_i, g_i, and x_i, respectively, represent the phenotype, the genotypes, and the additional covariates. Examples of additional covariates are environmental variables, gender, and age. For bi-allelic markers, a genotype can only take the values 0, 1, or 2.
Suppose our group testing includes d_g genetic variants. We use a row vector to represent the genotypes of individual i, i.e., g_i = (g_{i1}, g_{i2}, . . . , g_{i d_g}), where g_{ij} is the genotype at variant j for individual i. We use the row vector x_i = (1, x_{i1}, x_{i2}, . . . , x_{i d_x}) to represent the intercept and the d_x additional covariates, where x_{ij} denotes the value of additional covariate j for individual i.

Model Complex Phenotypes Using a GLM Framework
Both the linear model for a quantitative phenotype and the logistic regression model for a binary phenotype conform to the generalized linear model framework [33,34]. This motivates us to model complex phenotypes with a generalized linear model (GLM) framework. The GLM-based derivation is a direct extension of our previous linear model (LM)-based framework for testing a group of variants [17], which in turn extends work on NGS-based single-variant testing [29].
We model the complex phenotype by a GLM [33,34]. For the i-th individual, the probability of observing phenotype y_i is modelled as
$$p(y_i|x_i, g_i) = p_{\alpha,\beta,\phi}(y_i|x_i, g_i) = \exp\left\{\frac{y_i \eta_i - b(\eta_i)}{a(\phi)} + c(y_i, \phi)\right\},$$
where the row vector α ∈ R^{d_x+1} and β ∈ R^{d_g}. The linear predictor is η_i = η_{α,β}(x_i, g_i) = αx_i^T + βg_i^T. Depending on the type of phenotype under consideration, different specifications of the functions a(·), b(·), and c(·, ·) are used:

•	The continuous phenotype, corresponding to a linear regression;
•	The binary phenotype, corresponding to a logistic regression;
•	The count (integer) phenotype, corresponding to a Poisson regression.
To be more specific, we show how the modelling of the three types of phenotype belongs to our generalized linear model (GLM) framework. First, consider a continuous phenotype as specified in our previous linear model (LM) framework [17], i.e., y_i ∼ N(η_i, σ²), where η_i = αx_i^T + βg_i^T. Our previously proposed LM framework is a special case of our GLM framework because
$$p(y_i|x_i,g_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(y_i-\eta_i)^2}{2\sigma^2}\right\} = \exp\left\{\frac{y_i\eta_i - \eta_i^2/2}{\phi} - \frac{y_i^2}{2\phi} - \frac{\ln(2\pi\phi)}{2}\right\}.$$
Thus, a(φ) = φ, φ = σ², b(η_i) = η_i²/2, and c(y_i, φ) = −y_i²/(2φ) − ln(2πφ)/2 for a continuous phenotype.
Next, consider a binary phenotype modelled using a logistic regression model, i.e., π_i = P(y_i = 1|x_i, g_i) and ln(π_i/(1 − π_i)) = η_i = αx_i^T + βg_i^T. This logistic model is a special case of our GLM framework because
$$p(y_i|x_i,g_i) = \pi_i^{y_i}(1-\pi_i)^{1-y_i} = \exp\{y_i\eta_i - \ln(1+e^{\eta_i})\}.$$
Thus, a(φ) = 1, b(η_i) = ln(1 + e^{η_i}), and c(y_i, φ) = 0 for a binary response. Thirdly, consider a count (integer) phenotype modelled using a Poisson regression, i.e., y_i ∼ Poisson(λ_i) with ln(λ_i) = η_i = αx_i^T + βg_i^T. This Poisson regression is also a special case of our GLM framework because
$$p(y_i|x_i,g_i) = \frac{\lambda_i^{y_i}e^{-\lambda_i}}{y_i!} = \exp\{y_i\eta_i - e^{\eta_i} - \ln(y_i!)\}.$$
Thus, a(φ) = 1, b(η_i) = e^{η_i}, and c(y_i, φ) = −ln(y_i!) for a count (integer) response. Our proposed GLM framework is a general framework which can handle different types of responses. The probability of observing phenotype y_i is influenced by the predictors (x_i and g_i) and the parameters (α, β, φ). When a(φ) and c(y_i, φ) are constant with respect to φ, we can drop φ and denote the functions as a and c(y_i). For example, in logistic regression, a(φ) = 1 and c(y_i, φ) = 0, and in Poisson regression, a(φ) = 1 and c(y_i, φ) = −ln(y_i!); the parameters are then (α, β), not involving φ, and the probability of observing y_i is influenced by (x_i, g_i) and the parameters (α, β). In the following, we discuss two situations: (1) when the parameters are (α, β, φ); and (2) when φ is dropped, i.e., the parameters are (α, β).
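As a numerical sanity check, the three specifications can be verified against the familiar normal, Bernoulli, and Poisson log-densities. The following Python sketch (not part of our R implementation; the values of η_i, y_i, and φ are hypothetical) evaluates the exponential-family form (y_iη_i − b(η_i))/a(φ) + c(y_i, φ) under each specification:

```python
import math

def glm_logdensity(y, eta, a, b, c):
    """Exponential-family log-density: (y*eta - b(eta))/a + c(y)."""
    return (y * eta - b(eta)) / a + c(y)

eta, y_cont, phi = 0.7, 1.3, 2.0  # hypothetical linear predictor, response, dispersion
# Continuous: a(phi)=phi, b(eta)=eta^2/2, c(y)=-y^2/(2*phi)-ln(2*pi*phi)/2
log_norm = glm_logdensity(
    y_cont, eta, phi,
    lambda e: e ** 2 / 2,
    lambda y: -y ** 2 / (2 * phi) - math.log(2 * math.pi * phi) / 2,
)
direct_norm = -((y_cont - eta) ** 2) / (2 * phi) - math.log(2 * math.pi * phi) / 2
assert math.isclose(log_norm, direct_norm)        # matches N(eta, phi) log-density

# Binary: a=1, b(eta)=ln(1+e^eta), c(y)=0
y_bin = 1
log_bern = glm_logdensity(y_bin, eta, 1.0,
                          lambda e: math.log(1 + math.exp(e)), lambda y: 0.0)
pi = 1 / (1 + math.exp(-eta))                     # inverse logit
assert math.isclose(log_bern, math.log(pi))       # matches Bernoulli(pi) for y=1

# Count: a=1, b(eta)=e^eta, c(y)=-ln(y!)
y_cnt = 3
log_pois = glm_logdensity(y_cnt, eta, 1.0, math.exp,
                          lambda y: -math.lgamma(y + 1))
lam = math.exp(eta)                               # Poisson rate, log link
assert math.isclose(log_pois, y_cnt * math.log(lam) - lam - math.lgamma(y_cnt + 1))
```

All three assertions pass, confirming that the single exponential-family expression reproduces each standard log-density.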

Uncertain Genotypes
Because the sequencing data and the phenotype are conditionally independent given the true genotypes, we model their joint distribution as
$$p(y_i, D_i|x_i) = \sum_{g\in G} p_\theta(y_i|x_i, g)\, h(g, D_i), \qquad (2)$$
where θ = (α, β, φ) or θ = (α, β), depending on whether φ can be dropped, i.e., whether a(φ) and c(y_i, φ) are constant with respect to φ. For individual i, D_i denotes the sequencing reads, y_i denotes the phenotype, and x_i denotes the additional covariates. We denote the genotype state space as G, which contains all possible genotype values for g. The term ∑_{g∈G} in Equation (2) refers to the sum over all possible values of g. Because each of the d_g genetic variants can only take the values 0, 1, and 2, G = {0, 1, 2}^{d_g}. The term h(g, D_i) is short for the probability of the NGS data and the genotype, i.e., h(g, D_i) = p(D_i|g)p(g|f̂). Here, the estimated allele frequency f̂ is modelled as in Skotte et al. [29]. The log-likelihood function can thus be written as
$$l_{y,D}(\theta) = \sum_{i=1}^{N} \ln\left\{\sum_{g\in G} p_\theta(y_i|x_i, g)\, h(g, D_i)\right\},$$
where θ = (α, β, φ) or (α, β) depending on whether φ is dropped. We are interested in testing for genetic effects based on NGS data. The null hypothesis is H_0: β = 0. Under H_0, the density p_θ(y_i|x_i, g) does not depend on the genotypes, so in the likelihood function under H_0 we can pull it out of the summation.
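To make Equation (2) concrete, the following Python sketch evaluates the mixture log-likelihood for one individual in the binary-response (logistic) case. It assumes, for illustration, that the read likelihoods factor across variants given the genotypes, i.e., p(D_i|g) = ∏_j p(D_{ij}|g_j); the read likelihoods, allele frequencies, and coefficients are hypothetical:

```python
import math
from itertools import product

def hwe_prior(g, f):
    """P(genotype g | allele frequency f) under Hardy-Weinberg equilibrium."""
    return {0: (1 - f) ** 2, 1: 2 * f * (1 - f), 2: f ** 2}[g]

def log_lik(y, read_liks, freqs, alpha, beta, x):
    """ln sum_{g in G} p(y|x,g) h(g,D) for a logistic GLM, one individual."""
    d_g = len(freqs)
    total = 0.0
    for g in product((0, 1, 2), repeat=d_g):       # enumerate G = {0,1,2}^d_g
        h = 1.0
        for j in range(d_g):                        # h(g,D) = prod_j p(D_j|g_j) p(g_j|f_j)
            h *= read_liks[j][g[j]] * hwe_prior(g[j], freqs[j])
        eta = sum(a * xi for a, xi in zip(alpha, x)) \
            + sum(b * gj for b, gj in zip(beta, g))
        log_p_y = y * eta - math.log(1 + math.exp(eta))   # Bernoulli log-density
        total += math.exp(log_p_y) * h
    return math.log(total)

# Hypothetical individual: two variants, reads certain that g = (1, 0)
read_liks = [{0: 0.0, 1: 1.0, 2: 0.0}, {0: 1.0, 1: 0.0, 2: 0.0}]
ll = log_lik(y=1, read_liks=read_liks, freqs=[0.3, 0.1],
             alpha=[0.2], beta=[0.5, -0.4], x=[1.0])
```

When the reads determine the genotypes with certainty, the mixture collapses to the plug-in log-likelihood plus the log prior of the observed genotype vector, which is a useful check on an implementation.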
Then, under H_0: β = 0, the log-likelihood function simplifies to
$$l_{y,D}(\alpha, 0, \phi) = \sum_{i=1}^{N}\ln p_{\alpha,0,\phi}(y_i|x_i) + \sum_{i=1}^{N}\ln\left\{\sum_{g\in G} h(g, D_i)\right\}$$
when φ is not dropped and the parameters are (α, β, φ); when φ is dropped and the parameters are (α, β), the same formula holds with φ omitted. We adopt the score test [35,36]. When the parameters are (α, β, φ), the score is
$$s_{y,D}(\alpha, \beta, \phi) = \frac{\partial l_{y,D}(\alpha, \beta, \phi)}{\partial(\alpha, \beta, \phi)},$$
where the length of the row vector α is d_x + 1, the length of the row vector β is d_g, and φ is a scalar. The analytical formula of s_{y,D}(α, β, φ) is specified in Appendix A, with detailed derivations provided in Supplementary Material S2 in the Supplementary Information File. The observed information matrix is
$$o_{y,D}(\alpha, \beta, \phi) = -\frac{\partial^2 l_{y,D}(\alpha, \beta, \phi)}{\partial(\alpha, \beta, \phi)^T\,\partial(\alpha, \beta, \phi)}.$$
The analytical formula for o_{y,D}(α, β, φ) is specified in Appendix B, with detailed derivations provided in Supplementary Material S3 in the Supplementary Information File.

The Situation When Parameters Are (α, β)
When φ is dropped and the parameters are (α, β), the score function is
$$s_{y,D}(\alpha, \beta) = \frac{\partial l_{y,D}(\alpha, \beta)}{\partial(\alpha, \beta)},$$
where the length of the row vector α is d_x + 1 and the length of the row vector β is d_g. The analytical formula of s_{y,D}(α, β) is specified in Appendix E, with detailed derivations provided in Supplementary Material S6 in the Supplementary Information File. The observed information matrix is
$$o_{y,D}(\alpha, \beta) = -\frac{\partial^2 l_{y,D}(\alpha, \beta)}{\partial(\alpha, \beta)^T\,\partial(\alpha, \beta)}.$$
The analytical formula of o_{y,D}(α, β) is specified in Appendix F, with detailed derivations provided in Supplementary Material S7 in the Supplementary Information File.
Denote the constrained MLE of the parameters (α, β) under H_0: β = 0 as θ̃ = (α̃, 0). The test statistic is
$$R(y, D) = s_{y,D}(\tilde\theta)\, o_{y,D}(\tilde\theta)^{-1}\, s_{y,D}(\tilde\theta)^T.$$
To calculate this test statistic, we evaluate both s_{y,D}(α, β) and o_{y,D}(α, β) at the constrained MLE θ̃ = (α̃, 0). Under H_0, the test statistic R(y, D) ∼ χ²_{d_g}. We conduct the score test based on R(y, D) and calculate p-values. For rare variants, variable collapse (VC) methods collapse multiple genetic variables into one variable and use it in testing [23,37,38]. Rare genetic variants can be collapsed in different ways, depending on which method is used. The weighted burden test based on genotypes aggregates/collapses p rare variants by a weighted sum with weights w_j, i.e., AG_i = ∑_{j=1}^{p} w_j g_{ij}, where g_{ij} is the genotype (the number of rare alleles, with reference alleles contributing 0) at the j-th rare variant for individual i. The burden test adopts equal weights, i.e., w_1 = w_2 = · · · = w_p = 1, so that AG_i = ∑_{j=1}^{p} g_{ij} and the association study is performed between the total count of rare alleles over a group of genetic variants and the phenotype. This means the influence of the genotypes g_{ij} on the phenotype is through AG_i.
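The collapsing step itself is simple; a small Python illustration of AG_i = ∑_j w_j g_{ij} on a hypothetical genotype matrix (rows are individuals, entries are rare-allele counts):

```python
def collapse(genotypes, weights=None):
    """Burden aggregation AG_i = sum_j w_j * g_ij over a group of rare variants."""
    if weights is None:                        # plain burden test: equal weights w_j = 1
        weights = [1.0] * len(genotypes[0])
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

# Hypothetical 3 individuals x 2 rare variants (entries count rare alleles: 0, 1, or 2)
G = [[0, 1],
     [2, 0],
     [1, 1]]
AG_equal = collapse(G)                  # burden test: row sums of rare-allele counts
AG_weighted = collapse(G, [2.0, 0.5])   # weighted burden test with hypothetical weights
```

`AG_equal` is `[1, 2, 2]`: each individual's total rare-allele count, the single variable the burden test regresses the phenotype on.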
We model the phenotype using a generalized linear model [33,34]. For individual i, the same generalized linear model is used except for the change in the linear predictor η_i. The probability of observing phenotype y_i is modelled as
$$p(y_i|x_i, AG_i) = \exp\left\{\frac{y_i \eta_i - b(\eta_i)}{a(\phi)} + c(y_i, \phi)\right\}, \qquad (9)$$
where the row vector α ∈ R^{d_x+1} and β_0 ∈ R is a scalar. The linear predictor in this GLM is
$$\eta_i = \alpha x_i^T + \beta_0\, AG_i.$$
Depending on the type of response, we adopt different functions for a(·), b(·), and c(·, ·). Note that AG_i = ∑_{j=1}^{d_g} w_j g_{ij} aggregates d_g rare genetic variables into one aggregate variable. The probability of observing y_i is influenced by both the predictors (x_i and AG_i) and the parameters (α, β_0, φ).
Our model for common variants uses the same GLM framework except for the change in the linear predictor η_i, i.e.,
$$\eta_i = \alpha x_i^T + \beta g_i^T, \qquad (10)$$
where β ∈ R^{d_g}. We apply the chain rule in calculus to derive the formulae for the rare-variant model (Equation (9)) from the formulae for the common-variant model (Equation (10)). The connection between the two models is that the effects of the rare variants as modelled by β_0 satisfy the condition β = β_0 W, where β ∈ R^{d_g} and β_0 ∈ R. The burden test uses equal weights, i.e., w_1 = w_2 = · · · = w_{d_g} = 1, so that W = [1, 1, . . . , 1] is a unit row vector and AG_i = ∑_{j=1}^{d_g} g_{ij} [23,37]. Unequal weights can also be adopted, such as w_j = f_{Beta}(MAF_j, 1, 25), where f_{Beta} is the Beta density function and MAF_j is the minor allele frequency of the j-th rare variant [25,26,30].
The same assumption on the weights is used in our proposed NGS data-based variable collapse (VC) method; that is, we adopt the weighted burden test assumption in our test based on NGS data, an assumption widely used in genotype-based VC tests in the literature [23,37]. The formula is β = β_0 W, where the weight W = (w_1, w_2, . . . , w_{d_g}) is a row vector and β_0 is a scalar. For identification purposes, the constraint ∑_{j=1}^{d_g} w_j = d_g is adopted.
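For illustration, the Beta(1, 25) weights and the identification constraint ∑_j w_j = d_g can be computed as follows (Python sketch; the MAF values are hypothetical):

```python
import math

def beta_weight(maf, a=1.0, b=25.0):
    """Beta(a,b) density at maf (0 < maf < 1); Beta(1,25) up-weights rarer variants."""
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(maf) + (b - 1) * math.log(1 - maf) - log_beta_fn)

def normalized_weights(mafs):
    """Rescale the raw weights so that sum_j w_j = d_g (identification constraint)."""
    raw = [beta_weight(m) for m in mafs]
    scale = len(raw) / sum(raw)
    return [scale * w for w in raw]

mafs = [0.01, 0.02, 0.04]                  # hypothetical minor allele frequencies
w = normalized_weights(mafs)
assert abs(sum(w) - len(mafs)) < 1e-9      # constraint sum w_j = d_g holds
assert w[0] > w[1] > w[2]                  # rarer variants receive larger weights
```

For a = 1, b = 25 the density reduces to 25(1 − MAF)^24, so the weight decays smoothly as the MAF grows toward the rare/common threshold.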
In our joint significance (JS) method for a group of common variants, the linear predictor is modelled as η_i = αx_i^T + βg_i^T, where (α, β, φ) are the parameters and the length of the row vector β is d_g. In comparison, in our variable collapse (VC) method for a group of rare variants, the linear predictor is modelled as η_i = αx_i^T + β_0 AG_i, where (α, β_0, φ) are the parameters and β_0 is a scalar. First, under H_0: β = 0 or H_0: β_0 = 0, the same constrained MLE for α and φ is obtained, no matter which log-likelihood function (l_{y,D}(α, β, φ) or l_{y,D}(α, β_0, φ)) is used. Thus, the same notation θ̃ = (α̃, 0, φ̃) is used to represent both the constrained MLE in l_{y,D}(α, β, φ) (where 0 is a row vector of d_g zeros) and the constrained MLE in l_{y,D}(α, β_0, φ) (where 0 is the scalar zero).
The evaluation of the score function at the constrained MLE for the rare-variant model is obtained using the chain rule:
$$\frac{\partial l_{y,D}}{\partial \beta_0}\bigg|_{\tilde\theta} = \frac{\partial l_{y,D}}{\partial \beta}\bigg|_{\tilde\theta}\, W^T,$$
where the score with respect to β involves E(g_i|D_i), the posterior expectation of the genotype of individual i given the sequencing data D_i. Evaluation of the other two partial derivatives (with respect to α and φ) at the constrained MLE is 0 because the constrained MLE is obtained by constrained optimization of the likelihood function, so the first-order condition is satisfied, i.e., the first derivatives evaluate to 0. Working similarly for the rare-variant model, we obtain the observed information matrix and evaluate it at the constrained MLE; by the chain rule, the (β_0, β_0) entry equals W times the (β, β) block of the common-variant observed information times W^T. Under H_0: β_0 = 0, R(y, D) is approximately χ²_1. Based on the test statistic, we conduct the score test and calculate the p-value.
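The posterior expectation of a genotype can be written out directly for a single bi-allelic variant; a minimal Python sketch with hypothetical read likelihoods (a Hardy-Weinberg prior is assumed, as elsewhere in this paper):

```python
def posterior_mean_genotype(read_lik, f):
    """E(g | D) = sum_g g p(D|g) p(g|f) / sum_g p(D|g) p(g|f), one bi-allelic variant."""
    prior = {0: (1 - f) ** 2, 1: 2 * f * (1 - f), 2: f ** 2}   # Hardy-Weinberg p(g|f)
    post = {g: read_lik[g] * prior[g] for g in (0, 1, 2)}      # unnormalized posterior
    z = sum(post.values())
    return sum(g * p for g, p in post.items()) / z

# Deep coverage: reads pin the genotype down, so the posterior mean is the genotype
assert posterior_mean_genotype({0: 0.0, 1: 1.0, 2: 0.0}, f=0.2) == 1.0
# Uninformative reads: the posterior mean falls back to the prior mean 2f
assert abs(posterior_mean_genotype({0: 1.0, 1: 1.0, 2: 1.0}, f=0.2) - 0.4) < 1e-12
```

The two limiting cases illustrate why NGS data-based testing matters at low depth: with shallow reads, E(g_i|D_i) shrinks toward the prior mean 2f rather than committing to a possibly wrong hard genotype call.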

The Situation When Parameters Are (α, β)
Consider the situation when φ is dropped and the parameters are (α, β) (respectively (α, β_0) for the rare-variant model). All setups are the same except for the following changes:
1. The same notation θ̃ = (α̃, 0) is used to represent both the constrained MLE in l_{y,D}(α, β_0) (where 0 is the scalar zero) and the constrained MLE in l_{y,D}(α, β) (where 0 is a zero row vector of length d_g).
Evaluation of the score function at the constrained MLE is as follows.
Note that, in the above, the first formula is derived by the chain rule. The second formula is 0 because the constrained MLE maximizes the log-likelihood under H_0: β_0 = 0, so the first-order condition in the constrained optimization is satisfied, i.e., the first derivatives evaluated at the constrained MLE are 0.
Working similarly, we derive the observed information matrix, and evaluate it at the constrained MLE. The derivations are as follows.
The test statistic is
$$R(y, D) = s_{y,D}(\tilde\theta)\, o_{y,D}(\tilde\theta)^{-1}\, s_{y,D}(\tilde\theta)^T.$$
Under H_0: β_0 = 0, R(y, D) is approximately χ²_1. Based on the test statistic, we conduct a score test and calculate the p-value.
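In this one-dimensional case the quadratic form reduces to a scalar ratio, and the χ²_1 p-value has a closed form via P(χ²_1 > x) = erfc(√(x/2)). A Python sketch with hypothetical score and observed-information values:

```python
import math

def score_test_1df(score, info):
    """Score test for H0: beta_0 = 0 with 1 df: R = s^2 / o, p = P(chi2_1 > R)."""
    r = score ** 2 / info
    p_value = math.erfc(math.sqrt(r / 2))   # chi-square(1) survival function
    return r, p_value

# Hypothetical evaluated score and observed information at the constrained MLE
r, p = score_test_1df(score=2.5, info=4.0)
assert math.isclose(r, 1.5625)
assert 0 < p < 1
```

The identity used for the p-value follows from R = Z² with Z standard normal under H_0, so P(R > x) = 2P(Z > √x) = erfc(√(x/2)).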

Software to Implement the Methods
We implemented our proposed NGS data-based methods in R (version 4.2.0). The R script files implementing our methods are publicly available in a GitHub repository: https://github.com/zhengxu0459/NGS.Data.Based.Group.Testing.Based.On.GLM (accessed on 17 May 2023). The repository contains six script files implementing our NGS data-based (1) joint significance test and (2) variable collapse test for the (i) continuous, (ii) binary, and (iii) count phenotypes.

Specification of Simulation Studies
To evaluate the performance of our proposed methods (the NGS data-based JS test for common variants and VC test for rare variants) against literature methods (the corresponding genotype-based methods), we conducted simulation studies under various settings.
For common genetic variables and binary response, we evaluated the performance of our JS test based on NGS data versus literature methods (Chi-square test) based on genotype. For rare genetic variables and binary response, we evaluated the performance of our VC test based on NGS data versus literature methods (burden test and SKAT test based on genotypes).
We also conducted simulations for a count/integer response. Although continuous and binary responses are the two most commonly encountered phenotype types in association studies, other types of responses also arise, though less commonly. We show that the use of a generalized linear model framework allows us to handle these other types of responses in addition to the continuous phenotype. Because the genotype-based SKAT testing method is only available for continuous and binary phenotypes [25], we did not compare our methods with the genotype-based SKAT method for the count/integer response in rare-variant testing. For the count/integer phenotype and rare genetic variables, we evaluated the performance of our NGS data-based VC test versus the genotype-based burden test in the literature. For the count/integer phenotype and common genetic variables, we evaluated the performance of our NGS data-based JS test versus the genotype-based literature method (Chi-square test).
The software COSI was used to simulate 100 kb genomic regions based on a coalescent model. We adopted the best-fit model in COSI so that the generated regions mimic the local recombination rates, LD patterns, and population history of Europeans [39]. We simulated chromosomes in the simulated regions and used the software ShotGun [40] to generate sequencing data (per-base-pair error rate = 0.5%). ShotGun is publicly available at https://yunliweb.its.unc.edu/shotgun.html (accessed on 17 May 2023). We specified various average sequencing depths, such as d = 1X, 2X, 4X, and 10X. We classified genetic variables as common or rare depending on whether MAF ≥ 0.05. Two additional covariates were simulated: X_1 ∼ N(0, 1) (continuous) and X_2 ∼ Bernoulli(0.5) (binary).
To simulate the count/integer phenotype, we used the Poisson model
$$y_i \sim \text{Poisson}(\lambda_i), \qquad \ln(\lambda_i) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \sum_{j=1}^{d_g}\beta_{gj}\, g_{ij} + \epsilon_i,$$
where there are d_g genetic variables, β_0 = 0, β_1 = 1, β_2 = 1, and ε_i ∼ N(0, 1). In the simulations of both phenotype types, we set the genetic effects, i.e., the values of β_{gj}, differently in different scenarios, which allows us to evaluate Type I errors and perform power analyses (Type II errors).
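The count-phenotype generation step can be sketched in Python as follows (a hedged illustration, not our actual R simulation code; it assumes the N(0, 1) noise term enters the linear predictor as written above, uses the stated coefficients β_0 = 0, β_1 = β_2 = 1, and samples Poisson variates with Knuth's method, which is adequate for moderate rates):

```python
import math
import random

def rpois(lam, rng):
    """Knuth's Poisson sampler (adequate for the moderate rates used here)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_count_phenotype(genotypes, beta_g, rng):
    """y_i ~ Poisson(lambda_i), ln(lambda_i) = b0 + b1*x1 + b2*x2 + sum_j bgj*g_ij + eps."""
    b0, b1, b2 = 0.0, 1.0, 1.0                   # intercept and covariate effects (text)
    ys = []
    for g_row in genotypes:
        x1 = rng.gauss(0, 1)                      # X1 ~ N(0,1)
        x2 = float(rng.random() < 0.5)            # X2 ~ Bernoulli(0.5)
        eta = b0 + b1 * x1 + b2 * x2 + sum(b * g for b, g in zip(beta_g, g_row))
        eta += rng.gauss(0, 1)                    # eps ~ N(0,1)
        ys.append(rpois(math.exp(eta), rng))
    return ys

rng = random.Random(1)
geno = [[0, 1], [1, 1], [2, 0]] * 10              # hypothetical genotypes, 30 individuals
y = simulate_count_phenotype(geno, beta_g=[0.2, 0.0], rng=rng)
assert len(y) == 30 and all(isinstance(v, int) and v >= 0 for v in y)
```

Setting `beta_g` to all zeros yields data under H_0 for Type I error evaluation; non-zero entries yield data under H_1 for power analysis.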
Under H_0: β_{g1} = β_{g2} = · · · = β_{g d_g} = 0, we generated 9000 replicates for the evaluation of Type I errors. Type I errors were calculated for all combinations of 3 sample sizes (n = 300, 500, 1000) and 4 sequencing depths (d = 1X, 2X, 4X, 10X). Results are reported for (1) the JS test of a group of common genetic variables with the binary and count/integer phenotypes, and (2) the VC test of a group of rare genetic variables with the binary and count/integer phenotypes.
Then, we conducted simulation studies under the alternative hypothesis H_1, i.e., with non-zero effects among the d_g genetic variables. In our simulations, we randomly chose multiple genetic variables as causal markers: 2 to 5 for common variants and 2 to 10 for rare variants. We then simulated phenotypes based on these causal genetic markers and the additional covariates (X_1, X_2). The total genetic effect ranges between 0 and 1 (a scale parameter of 0.2 multiplied by a magnitude ranging from 0 to 5), with each individual genetic effect specified as the total effect divided by the number of causal variables.
Our simulations have used various (1) sequencing depths, (2) sample sizes, (3) number of causal variables, and (4) genetic effects. Based on simulated data, we evaluate the performance of different testing methods for common variants and rare variants, and compare our NGS data-based methods with the corresponding genotype-based methods in literature.
Our NGS data-based joint significance (JS) tests use the true allele frequencies (AFs) or one of two methods to estimate the allele frequencies, as in Skotte et al. [29]. NGS data-based JS Tests 1, 2, and 3, respectively, refer to the JS test based on NGS data using (1) the true AFs, (2) AFs estimated by the two-step genotype-based method (first estimate genotypes, then calculate AFs from the estimated genotypes), and (3) the one-step MLE of the AFs based on the likelihood function of the NGS data in Skotte et al. [29]. Both AF estimation methods were proposed in Skotte et al. [29]. In general, we expect our NGS data-based testing method to perform best when the true allele frequencies are used, i.e., JS Test 1; however, JS Test 1 is not feasible because the true AFs are unknown in practice, so estimated AFs must be used. Using estimated AFs instead of true AFs is expected to make the testing performance slightly worse, but still better than the corresponding genotype-based methods in the literature. The one-step MLE of the AFs is expected to be more accurate than the two-step genotype-based AF estimator, as reported in our previous work based on the same simulated sequencing data [17]; thus, we expect JS Test 3 to outperform JS Test 2.
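To illustrate the one-step estimator, the following Python sketch maximizes the NGS-data likelihood ∑_i ln ∑_g p(D_i|g)p(g|f) over f directly; a simple grid search stands in for the actual optimizer, the Hardy-Weinberg prior is assumed, and the genotype-likelihood inputs are hypothetical:

```python
import math

def af_mle(read_liks, grid_points=999):
    """One-step MLE of the allele frequency f from per-genotype read likelihoods."""
    def loglik(f):
        prior = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)   # Hardy-Weinberg p(g|f)
        return sum(math.log(sum(rl[g] * prior[g] for g in (0, 1, 2)))
                   for rl in read_liks)
    grid = [k / (grid_points + 1) for k in range(1, grid_points + 1)]
    return max(grid, key=loglik)                          # maximize over (0, 1)

def certain(g):
    """Degenerate read likelihoods for deep coverage: genotype g observed exactly."""
    return tuple(1.0 if k == g else 0.0 for k in (0, 1, 2))

# Hypothetical deep-coverage data: genotypes effectively observed as 0, 0, 1, 1, 2
reads = [certain(0), certain(0), certain(1), certain(1), certain(2)]
f_hat = af_mle(reads)     # sample allele frequency is (0+0+1+1+2)/10 = 0.4
assert abs(f_hat - 0.4) < 1e-12
```

With certain genotypes the MLE reduces to the sample allele count over 2N, which is a handy correctness check; the interesting behaviour is at low depth, where the likelihoods for the three genotypes overlap and the one-step MLE weighs them rather than committing to calls first.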
Similarly, Variable Collapse (VC) Tests 1, 2, and 3 based on NGS data refer to VC tests using (1) the true AFs, (2) the two-step genotype-based estimated AFs, and (3) the one-step MLE of the AFs. VC Test 1 is infeasible because, in practice, the true AFs are unknown. We expect VC Test 3 to perform better than VC Test 2.

Plan of a Real NGS Data Study
We describe our plan to systematically evaluate our methods on real NGS data. Specifically, in our real NGS data example, we use the recent expansion of the 1000 Genomes Project (1kGP), which includes 602 trios, as described in Byrska-Bishop et al. [41]. We selected the 1kGP for analysis because it is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The recent expansion of the 1kGP contains 3202 samples, including 602 complete trios, deep-sequenced to a depth of 30X, making it a suitable NGS dataset for analysis. The dataset is publicly available at https://www.internationalgenome.org/data-portal/data-collection/30x-grch38 (accessed on 5 May 2023).
To evaluate the performance of our methods under different sequencing depths (d = 1X, 2X, 4X, 10X, 30X), down-sampling was conducted to generate sequencing data with depths d = 1X, 2X, 4X, and 10X using the bioinformatics software samtools [42], available at http://www.htslib.org/ (accessed on 5 May 2023), which works on NGS data in the BAM file format and randomly down-samples NGS reads. For example, to generate NGS data at depth 10X from the 30X data, we set the down-sampling ratio to 1 out of 3, i.e., 1:3. To generate NGS data at sequencing depths d = 1X, 2X, 4X, and 10X, the down-sampling ratios are, respectively, 1:30, 1:15, 2:15, and 1:3. Because our methods are for unrelated individuals only, we randomly selected at most 1 individual per family to form a dataset of unrelated individuals. The 1kGP provides accurately estimated genotypes based on the deeply sequenced NGS data at d = 30X, and we use these accurately estimated genotypes as the true genotypes. The estimated genotypes were obtained by genotype calling on the down-sampled NGS data, i.e., the NGS data at depths d = 1X, 2X, 4X, and 10X. Because the 1kGP does not provide phenotype data, we simulate phenotypes based on a generalized linear model. Therefore, this real-data example uses real NGS and genotype data, with only the phenotypes simulated.
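The quoted ratios map directly onto the subsampling option of `samtools view` (`-s`, whose argument combines an integer seed with a kept fraction). The following Python sketch derives the fractions above and prints example commands; the seed 42 and the file names are placeholders, not part of our pipeline:

```python
def downsample_fraction(target_depth, source_depth=30):
    """Fraction of reads to keep so a source_depth run mimics target_depth coverage."""
    return target_depth / source_depth

# Ratios quoted in the text: 1:30, 1:15, 2:15, 1:3 for 1X, 2X, 4X, 10X from 30X data
for d, ratio in [(1, 1 / 30), (2, 1 / 15), (4, 2 / 15), (10, 1 / 3)]:
    frac = downsample_fraction(d)
    assert abs(frac - ratio) < 1e-12
    # samtools -s takes seed.fraction, e.g. 42.3333 keeps ~1/3 of reads with seed 42
    print(f"samtools view -b -s 42{('%.4f' % frac)[1:]} in.bam > out_{d}X.bam")
```

Because the fractional part of `-s` selects reads by a hash of the read name, mates of a pair are kept or dropped together, which is why a single fraction suffices per target depth.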
In our ongoing project, we will evaluate the performance of our methods and compare them with traditional methods in the literature, reporting (1) Type I errors in Tables 1-4 and (2) power analyses in Figures 1-4. We evaluate the performance of the different testing methods and compare our NGS data-based testing methods with the corresponding genotype-based methods in the literature.

Results of Type I Errors
Type I errors in different scenarios (sample size n = 300, 500, 1000; depth d = 1X, 2X, 4X, 10X) are reported. Depending on whether the genetic variables are common or rare, different NGS data-based methods (the JS test or the VC test) are used.
For the association between a continuous phenotype and common genetic variables, Type I errors of our NGS data-based joint significance (JS) tests using the true AFs and the two ways of estimating AFs are reported. In Table 1, Type I errors of different testing methods for the association between a binary phenotype and a group of common genetic variables are calculated. The genotype-based Chi-square test conducts a Chi-square test for the joint significance of a group of variables in the logistic regression of the phenotype on the estimated genotypes and the other predictors. Genotype-based methods refer to methods which first estimate genotypes and then conduct the association study based on the estimated genotypes and the phenotype. Our NGS data-based methods directly model the probability of observing the phenotype and the sequencing data, without the genotype estimation step. We repeated our simulation study for the count/integer phenotype and common genetic variants, and report Type I errors in Table 2. According to Tables 1 and 2, for both the binary response and the integer/count response, all methods control Type I errors in most scenarios, as expected.
For the association between a binary phenotype and rare genetic variables, Type I errors of our NGS data-based variable collapse (VC) tests using the true AFs and the two ways of estimating AFs (two-step genotype-based AF estimation and one-step NGS data-based AF estimation) are reported. NGS data-based VC Tests 1, 2, and 3 refer to our NGS data-based testing methods using the true AFs and the two allele frequency estimators, respectively. In Table 3, Type I errors for a group of rare genetic variables with a binary phenotype are reported. We evaluate the Type I errors of our NGS data-based VC Tests 1, 2, and 3, and of two genotype-based rare-variant methods (the burden test and the SKAT test). We repeated our simulation study for the count/integer phenotype and rare genetic variants, and report the Type I errors in Table 4. Because genotype-based SKAT testing is only available for continuous and binary responses, we compare our methods with the genotype-based burden test. According to Tables 3 and 4, for both the binary response and the integer/count response, all methods control Type I errors in most scenarios, as expected.

Table 1. Type I errors of testing methods for a group of common genetic variants and binary phenotype.
Genotype-based χ² test refers to the genotype-based Chi-square test. The terms "NGS JS Test 1, 2, 3" refer to the NGS data-based joint significance tests using (1) the true AFs, (2) the two-step genotype-based estimated AFs, and (3) the MLE of the AFs based on NGS data.

Results of Power Analyses
The performance of the different methods is evaluated in terms of statistical power, i.e., the probability of rejecting the null hypothesis under the alternative hypothesis.
In Figure 1, we show the power of the different tests for a binary phenotype and common genetic variables. From top to bottom, the four rows correspond to sequencing depths of 1X, 2X, 4X, and 10X. From left to right, the three columns correspond to sample sizes of 300, 500, and 1000. The powers of the different tests (genotype-based Chi-square test (red), NGS data-based joint significance Test 1 (black), Test 2 (purple), and Test 3 (green)) are represented as curves. We found that when the sequencing depth is low (1X, 2X), our proposed NGS data-based methods performed better than the genotype-based test in the literature. As the sequencing depth increases, the advantage of the NGS data-based methods over the genotype-based methods decreases; at 10X, the two perform similarly. As the sample size increases, the power of all tests increases. Comparing the three NGS data-based JS tests, we found that JS Test 1 (using the true AFs) and Test 3 (using the MLE of the AFs based on NGS data) have similar performance, whereas JS Test 2 (AFs estimated by the genotype-based method) performs slightly worse; however, NGS data-based JS Test 2 still outperforms the corresponding genotype-based test in the literature. As sequencing becomes deeper, the three NGS data-based tests show similar statistical power.
We repeated the power analysis of common genetic variants for the count/integer phenotype and obtained similar findings. The power of all tests increases as the sample size increases from 300 to 1000. When the sequencing depth is low (1X, 2X), our NGS data-based methods demonstrate advantages over the genotype-based methods in the literature; however, as the sequencing depth increases, the magnitude of the advantage decreases, and at 10X all methods show similar performance. JS Test 1 (using the true AFs) and Test 3 (using the MLE of the AFs based on NGS data [29]) have similar performance, whereas JS Test 2 (AFs estimated by the genotype-based method) performs slightly worse. We report the results for the count/integer phenotype in Figure 2.
In Figure 3, the powers of the different tests for a binary phenotype and a group of rare genetic variables are reported. The four rows from top to bottom correspond to sequencing depths of 1X, 2X, 4X, and 10X; the three columns from left to right correspond to sample sizes n = 300, 500, and 1000. The powers of the different tests are represented as curves in different colours: the genotype-based burden test (red), the SKAT test (blue), and NGS data-based VC Test 1 (black), Test 2 (purple), and Test 3 (green). Our proposed NGS data-based methods were found to outperform the genotype-based methods in the literature when the sequencing depth is low (1X and 2X). As sequencing becomes deeper, the advantage of the NGS data-based methods over the genotype-based methods decreases; at 10X, the two perform similarly. Comparing the three NGS data-based methods, we find that VC Test 1 (using the true AFs) and Test 3 (using the MLE of the AFs based on NGS data) have similar performance, whereas Test 2 (AFs estimated from genotypes) is slightly worse; this difference also decreases as sequencing becomes deeper. Comparing the two genotype-based methods (burden and SKAT), we find that the burden test outperforms SKAT because our scenarios assume that all genetic effects are in one direction (positive), which is not SKAT's assumption; SKAT allows effects in both directions (positive and negative) within the group of markers.
We repeated the power analysis of rare genetic variants for the count/integer phenotype. In Figure 4, the powers of the different tests for the count/integer phenotype and a group of rare genetic variables are shown. Because the SKAT test is not available for the count/integer phenotype, we compare our NGS data-based methods with the genotype-based burden test. Similar findings were obtained for the count/integer phenotype as for the binary phenotype.
For the binary phenotype and rare variants, we also considered a scenario with genetic effects in both directions (positive and negative), which is the setting the SKAT test is designed for. Under this scenario, all burden tests (genotype-based and NGS data-based) fail, as expected: burden tests assume genetic effects in only one direction, and that assumption is not satisfied here. The failure of burden tests under bidirectional genetic effects has been recognized by researchers.
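As a minimal numerical sketch of why one-directional collapsing fails in this setting (our own illustration, not the simulation design of this article; all parameter values are hypothetical), the following snippet collapses a group of rare variants into an unweighted burden score and compares its association with a phenotype generated under one-directional versus two-directional effects:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 10                        # individuals, rare variants
maf = 0.01
G = rng.binomial(2, maf, size=(n, m))  # genotype matrix (0/1/2)

# Effects all positive vs. half positive / half negative
beta_one_dir = np.full(m, 1.0)
beta_two_dir = np.concatenate([np.full(m // 2, 1.0), np.full(m // 2, -1.0)])

def burden_corr(beta):
    """Correlation between the collapsed burden score and the phenotype."""
    lin = G @ beta
    y = lin + rng.normal(size=n)       # continuous phenotype for simplicity
    burden = G.sum(axis=1)             # unweighted burden score
    return np.corrcoef(burden, y)[0, 1]

print(burden_corr(beta_one_dir))   # clear association
print(burden_corr(beta_two_dir))   # association largely cancels out
```

With effects in one direction, the burden score tracks the genetic signal; with effects in both directions, the positive and negative contributions cancel in the collapsed score, which is exactly the failure mode noted above.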
We leave the development of NGS data-based group testing methods for rare variants that allow genetic effects in two directions, such as an NGS data-based SKAT test, as future work, since it is beyond the scope of this article. This article focuses on developing NGS data-based tests corresponding to the genotype-based burden test and the joint significance test in the literature.

Discussion
The major finding of our article is that the proposed methods show an advantage over their corresponding genotype-based methods in the literature, both in testing a group of common markers and in testing a group of rare markers. The main objective of our study was to apply the GLM framework to derive innovative group testing methods based on NGS data, fulfilling the demand from researchers in biostatistics, bioinformatics, and biology for group testing methods based on NGS data.
Our methods adopt the GLM framework to handle a range of phenotype types, including binary and continuous responses, the two most commonly encountered types in association studies. Our method extends our previous linear model framework [17] with the analytical capacity to handle more complex phenotypes in addition to continuous phenotypes.
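To illustrate how a single linear predictor serves different response types in the GLM framework, the following sketch (our own illustration, with hypothetical coefficient values, not those of our model) generates binary, count, and continuous phenotypes from the same genetic linear predictor through the logit, log, and identity links:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
g = rng.binomial(2, 0.3, size=n)      # additively coded genotypes (0/1/2)
eta = -0.5 + 0.8 * g                  # shared linear predictor

# Binary phenotype: logit link, Bernoulli response (logistic regression)
p = 1.0 / (1.0 + np.exp(-eta))
y_binary = rng.binomial(1, p)

# Count phenotype: log link, Poisson response (Poisson regression)
mu = np.exp(eta)
y_count = rng.poisson(mu)

# Continuous phenotype: identity link, Gaussian response (reduces to the LM)
y_cont = eta + rng.normal(size=n)
```

Only the link function and response distribution change across the three cases; the linear predictor, and hence the genetic-effect parameters being tested, stay the same.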
In the Results section, we show the findings of comparing our methods with their corresponding methods in the literature for the binary phenotype and the count phenotype. For a continuous phenotype, the GLM-based model reduces to our previously published LM-based model [17]. Our findings for group testing of common variants with binary and count/integer phenotypes are similar to our previously published findings for continuous phenotypes [17], and the same holds for rare-variant testing [17].
Our proposed model deals with association studies with unrelated/independent individuals. Association studies can also be conducted with related individuals, for example when multiple individuals from the same family are involved in the study. In that case, a generalized linear mixed model (GLMM) is used instead of a generalized linear model (GLM). Future studies can develop testing methods based on NGS data for related individuals; in an ongoing project, we are extending our GLM framework to a GLMM framework so that it can handle related individuals.
Our proposed methods are based on a score test. We adopted the score test for its fast computation and straightforward derivation of calculation formulae: it requires only the constrained MLE, the maximizer of the likelihood function under the null hypothesis. The likelihood ratio test can have improved performance compared with the score test [43,44], and future studies can develop likelihood-ratio-based association tests using NGS data. Sequencing depth can affect the comparison between methods based on NGS data and methods based on genotypes. When sequencing depth is high (depth = 4X and 10X) and genotype estimation is accurate, the two classes of methods show similar performance. When sequencing depth is low (depth = 1X and 2X) and genotype estimation is imprecise, methods based on NGS data can outperform genotype-based methods, which rely on estimated genotypes [17,29,30]. In practice, given a limited financial budget, low-depth sequencing of more individuals is preferred to deep sequencing of fewer individuals [29,30]. Our proposed methods mainly show an advantage when sequencing depth is low (depth = 1X and 2X).
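The score-test idea can be sketched as follows. This is a simplified illustration of the general principle (an intercept-only null model with observed genotypes, not our NGS data-based derivation): only the constrained MLE under the null hypothesis is computed, and the quadratic-form statistic is referred to a chi-square distribution.

```python
import numpy as np
from scipy import stats

def score_test_logistic(y, G):
    """Joint score test of H0: no genetic effects, in a logistic model
    with intercept only under the null. Only the constrained MLE under
    H0 is needed; no alternative-model fit is required."""
    n, m = G.shape
    p0 = y.mean()                    # constrained MLE of P(y = 1) under H0
    resid = y - p0
    Gc = G - G.mean(axis=0)          # centre to account for the intercept
    U = Gc.T @ resid                 # score vector for the genetic effects
    V = p0 * (1 - p0) * (Gc.T @ Gc)  # its variance under H0
    Q = U @ np.linalg.solve(V, U)    # quadratic-form score statistic
    return stats.chi2.sf(Q, df=m)    # p-value on m degrees of freedom
```

Note that only `p0`, the fit under the null, appears in the statistic; this is the computational advantage over the likelihood ratio test mentioned above.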
Our proposed methods are developed within the theoretical statistical framework of generalized linear models. We use statistical inference methodology to derive the score tests from the likelihood function of next-generation sequencing (NGS) data and phenotypes, with the genotypes treated as latent/unobservable. We then conducted extensive simulation studies to evaluate the performance of our methods and compare them with the traditional methods in the literature. In an ongoing project, we are systematically evaluating and comparing our methods on real NGS data. We separate the development of our methods into two stages: in Stage 1, we conduct the theoretical development and simulation studies; in Stage 2, we further evaluate our methods on real NGS data.
We note that simulations performed under our hypothetical scenarios could be biased, so real data results can provide stronger verification. In our ongoing project, we aim to systematically evaluate our methods on real NGS data.
We note that our methods can be verified at four levels: (1) statistical theoretical verification, deriving our methods within the GLM framework to show that they are theoretically founded; (2) simulation studies, evaluating performance and showing that our methods can outperform traditional methods in the literature; (3) multiple real NGS data examples, evaluating and comparing different methods; and (4) biology lab verification and biology literature verification, showing that our methods indeed find biologically meaningful genes related to the phenotype.
When possible, we recommend using the higher levels of verification; for example, simulation studies under a specific scenario can be biased, so real data studies provide stronger verification. We also recommend using as many verification levels as possible to provide verification from different perspectives. The current manuscript provides statistical theoretical derivations and simulation studies; the next steps should be Level 3 (real data verification) and Level 4 (biology lab and biology literature verification). In our ongoing project, we are working on verification with real NGS data. In the future, biology lab and biology literature verification can be conducted in collaboration with biology experts.
Future directions of our study include (1) extending our GLM-based methods to generalized linear mixed model (GLMM)-based methods so that they can handle related individuals in association studies; (2) systematically evaluating the performance of our methods and comparing them with literature methods on real NGS data; (3) evaluating our methods under the other three genetic-effect models (the recessive, dominant, and heterogeneous effect models) [45]; and (4) biology lab verification and biology literature verification in collaboration with biology experts.
To describe the effect of the genotype (coded as 0, 1, 2) at a single marker on the phenotype, four models are widely used; they specify the effect on the linear predictor η in the GLM framework differently [45]: (1) the additive model (effect = β0 + βa g); (2) the recessive model (effect = β0 + βr I(g = 2)); (3) the dominant model (effect = β0 + βd I(g ≥ 1)); and (4) the heterogeneous effect model (effect = β0 + β1 I(g = 1) + β2 I(g = 2)). The indicator function I(condition) equals 1 when the condition is satisfied and 0 otherwise. The probability of observing the response is specified by the GLM likelihood, and with dg genetic markers the linear predictor ηi is modelled differently under the recessive, dominant, and heterogeneous effect models, where the four row vectors (βd, βa, β1, β2) each have length dg. Although our methods are proposed under the additive model described in Equation (10), they can be adapted to the other three genetic-effect models (recessive, dominant, and heterogeneous effect models) by changing ηi and its first and second derivatives with respect to the model parameters.
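The four genetic-effect codings above can be written compactly. The following helper (our own illustration, with a hypothetical function name) recodes raw genotypes into the design column(s) that enter the linear predictor η under each model:

```python
import numpy as np

def code_genotypes(g, model="additive"):
    """Recode raw genotypes (0, 1, 2 copies of the minor allele) under
    the four genetic-effect models; the returned column(s) enter the
    linear predictor eta with their own effect parameters."""
    g = np.asarray(g)
    if model == "additive":
        return g[:, None]                        # beta_a * g
    if model == "recessive":
        return (g == 2).astype(int)[:, None]     # beta_r * I(g == 2)
    if model == "dominant":
        return (g >= 1).astype(int)[:, None]     # beta_d * I(g >= 1)
    if model == "heterogeneous":
        return np.column_stack([(g == 1).astype(int),   # beta_1 * I(g == 1)
                                (g == 2).astype(int)])  # beta_2 * I(g == 2)
    raise ValueError(model)

g = np.array([0, 1, 2, 1, 0])
print(code_genotypes(g, "dominant").ravel())   # [0 1 1 1 0]
```

Adapting a test to another genetic-effect model then amounts to swapping the design column(s), exactly as the text describes for ηi and its derivatives.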

Conclusions
We extended our previously proposed NGS data-based testing methods (the joint significance test for a group of common variants and the variable collapse test for a group of rare variants) from a linear model (LM) framework to a generalized linear model (GLM) framework so that they can handle a range of response types (binary phenotypes; count/integer phenotypes) in addition to continuous phenotypes. Our proposed methods fill the literature gap. In addition, based on the simulation studies reported in the Results section, our methods can achieve better performance than their corresponding genotype-based methods in the literature. Future studies will evaluate our methods on real NGS data.
Funding: Sixia Chen was supported by the Oklahoma Shared Clinical and Translational Resources (U54GM104938) with an Institutional Development Award (IDeA) from NIGMS.

Data Availability Statement:
All data used in the study are publicly available.

Conflicts of Interest:
The authors declare no conflict of interest.
Detailed derivations have been provided in Supplementary Material S3 in the Supplementary Information File.
Appendix C. Evaluation of s_{y,D}(α, β, φ) at the Constrained MLE θ̂, i.e., s_{y,D}(α̂, 0, φ̂), where E(g^T | D_i) = {∑_{g∈G} h(g, D_i)}^{-1} ∑_{g∈G} g^T h(g, D_i) = (∑_{g∈G} g^T h(g, D_i)) / (∑_{g∈G} h(g, D_i)) is the posterior expectation of the genotype given the sequencing data. Detailed derivations have been provided in Supplementary Material S4 in the Supplementary Information File.
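The posterior expectation E(g^T | D_i) above is a ratio of sums over the genotype set G. For a single biallelic marker with G = {0, 1, 2}, it can be computed as in the following sketch (the likelihood values and the HWE prior with AF 0.3 are hypothetical, chosen only for illustration):

```python
import numpy as np

def posterior_expected_genotype(read_lik, prior):
    """E(g | D_i) = sum_g g * h(g, D_i) / sum_g h(g, D_i), where
    h(g, D_i) = P(D_i | g) * P(g) and g ranges over {0, 1, 2}."""
    h = read_lik * prior                 # unnormalised posterior over g
    g = np.array([0.0, 1.0, 2.0])
    return (g * h).sum() / h.sum()

# Hypothetical example: P(D_i | g) for g = 0, 1, 2, and an HWE prior
# with allele frequency 0.3
read_lik = np.array([0.02, 0.60, 0.10])
af = 0.3
prior = np.array([(1 - af) ** 2, 2 * af * (1 - af), af ** 2])
print(posterior_expected_genotype(read_lik, prior))
```

This posterior mean genotype, rather than a hard genotype call, is what enters the score computation, which is how the NGS data-based tests avoid genotype calling.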
x_i^T g h(g, D_i);