Above and beyond state-of-the-art approaches to investigate sequence data: summary of methods and results from the population-based association group at the Genetic Analysis Workshop 19

This paper summarizes the contributions from the Population-Based Association group at the Genetic Analysis Workshop 19. It provides an overview of the new statistical approaches explored by group members to take full advantage of population-based sequence data. Although contributions were highly heterogeneous regarding the applied quality control criteria and the number of investigated variants, several technical issues were identified, leading to practical recommendations. Preliminary analyses revealed that Hurdle-negative binomial regression is a promising approach to investigate the distribution of allele counts from sequence data instead of called genotypes. Convergence problems, however, limited the use of this approach, creating a technical challenge shared by environment-stratified models used to investigate rare variant-environment interactions, as well as by rare variant haplotype analyses using well-established public software. Estimates of relatedness and population structure strongly depended on the allele frequency of the variants selected for inference. Another practical recommendation was that conflicting probability values from standard and small-sample tests of a particular hypothesis may reflect a lack of validity of large-sample approximations. Novel statistical approaches that integrate evolutionary information showed some advantage in detecting weak genetic signals, and Bayesian adjustment for confounding was able to efficiently estimate causal genetic effects. Haplotype association methods may constitute a valuable complement to collapsing approaches for sequence data. This paper reports on the experience of members of the Population-Based Association group with several novel, promising approaches to preprocessing and analyzing sequence data, and to following up identified association signals.

On the first day of the workshop, group members briefly presented the contributions of the other pair member. Engaged discussions and intensive teamwork during 4 group meetings and a poster session led to a consensus summary of the group contributions, which was presented to all GAW participants in a plenary session.
After submission and peer review, 9 individual papers from the Population-Based Association group were accepted for publication in the GAW19 proceedings [1][2][3][4][5][6][7][8][9]. Figure 1 represents a mind map of the 9 accepted contributions. Members of our group explored novel approaches to circumvent the limitations of current methods and to take full advantage of next-generation sequence data. Individual contributions could be classified into 4 loose themes: development of new methods for new types of data; handling of rare variants; behavior of rare variants acting alone and interacting with environmental factors; and the follow up of association signals from sequence data. Table 1 summarizes the genotypes and phenotypes investigated by group members, and the applied filters for quality control. Although most group members analyzed whole exome sequence data in odd-numbered autosomes from unrelated individuals, some participants focused on genetic variability in the MAP4 gene. Regarding investigated phenotypes, the use of real and simulated data was well balanced. Two participants defined affected cases as individuals with a systolic blood pressure greater than 140 mm Hg, a diastolic blood pressure greater than 90 mm Hg, or use of antihypertension medication. One group member simulated their own phenotypes. The applied quality control filters were highly heterogeneous; for example, the threshold for variant exclusion owing to missing calls varied from 5 to 25 %. The number of investigated variants also varied widely: whereas one group member considered 88 variants in 2 genes, another participant examined more than 313,340 variants in odd-numbered autosomes.

New methods for new types of data
The relationship between genetic variability and a given phenotype is usually investigated based on called genotypes. Sequence data provides ancillary information on the distribution of the number of reads at a particular position. This includes the counts of reference and alternative alleles. González Silos et al. hypothesized that allele counts are genotype measurements that are more informative than called genotypes in the sense that the two counts, "no alternative allele out of 100 reads" and "one alternative allele out of 10 reads," both translate into the same called genotype (reference allele homozygote). In other words, after applying user-defined data quality filters, uncertainty in genotype calling is rarely taken into account in genetic association tests.
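The count-model idea can be sketched with simulated data: a negative binomial regression of alternative allele counts on a phenotype, with log(total reads) as an offset, fitted by direct likelihood maximization. This is a minimal illustration with made-up data and a hand-rolled likelihood, not the group's actual pipeline.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(1)
n = 2000
reads = rng.integers(10, 101, size=n)         # total reads per sample (made-up coverage)
bp = rng.normal(0.0, 1.0, size=n)             # adjusted diastolic blood pressure (standardized)
beta_true = np.array([-2.0, 0.4])             # intercept and phenotype effect (assumed values)
alpha_true = 0.3                              # negative binomial overdispersion

mu = reads * np.exp(beta_true[0] + beta_true[1] * bp)  # the offset enters multiplicatively
r = 1.0 / alpha_true
y = rng.negative_binomial(r, r / (r + mu))    # simulated alternative allele counts

X = np.column_stack([np.ones(n), bp])
offset = np.log(reads)

def nb_negloglik(params):
    beta, alpha = params[:2], np.exp(params[2])
    mu = np.exp(X @ beta + offset)
    r = 1.0 / alpha
    # negative binomial log-pmf with mean mu and dispersion alpha
    ll = (gammaln(y + r) - gammaln(r) - gammaln(y + 1)
          + r * np.log(r / (r + mu)) + y * np.log(mu / (r + mu)))
    return -ll.sum()

fit = minimize(nb_negloglik, x0=np.zeros(3), method="Nelder-Mead")
beta_hat = fit.x[:2]
print(beta_hat)  # should be near the simulated values (-2.0, 0.4)
```

Hurdle and zero-inflated variants add a second, Bernoulli component for the zero counts; the count component above then only has to describe the positive (or non-excess-zero) part of the distribution.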
To explore association test approaches that rely on allele counts from sequence data as an alternative to called genotypes, negative binomial regression was applied with the alternative allele count as response variable, the total number of reads at the particular position as an offset, and diastolic blood pressure adjusted for age, sex, and medication as explanatory variable. Zero-inflated and Hurdle-negative binomial regression were also examined because of their flexibility in the presence of zero inflation. The genotype-phenotype relationship was also investigated based on the ratio "alternative allele count/number of reads", which was alternatively considered as a response and an explanatory variable in standard and robust linear regression models. Type I error rates were roughly estimated, assuming that most of the investigated variants were under the null hypothesis of no genetic association, and quantile-quantile plots were used to explore possible disparities between small probability values from the investigated regression models. Table 2 lists key concepts addressed in accepted papers from the Population-Based Association group. In addition to allele counts, negative binomial regression models, and extensions thereof, González Silos et al. dealt with the concept of "downsampling." Table 3 presents and briefly describes related recent publications and the publicly available software used by group members.

Table 2 Key concepts addressed in accepted papers from the Population-Based Association group

New methods for new data types
[1] Alternative allele count: Number of reads that support a given alternative allele based on individual sequence data
[1] Negative binomial regression: Type of regression model used to investigate response variables that are counts. In contrast to Poisson regression, negative binomial regression allows for overdispersion, a variance larger than the mean
[1] Hurdle and zero-inflated models: Two statistical models used to investigate count response variables with a large proportion of zeros. Hurdle models assume that a Bernoulli process determines whether counts are zero or positive; if the response is positive, its conditional distribution is governed by a truncated-at-zero count data model. Zero-inflated models assume the response variable is a mixture of a Bernoulli and a count distribution, e.g., negative binomial
[1] Downsampling: Selecting a subset of the reads at a high-coverage position to improve computational efficiency

Handling rare variants
[2] Variant ascertainment bias: Variant selection criteria, such as minor allele frequency, can influence kinship and population structure estimates
[2] Kinship estimation: The estimation of relationships among samples based upon genotypes rather than known pedigrees; sensitive to the selected variants and the applied statistical methods
[2] Population structure: Admixture events leave a signature in the patterns of genetic variation within a population. This can bias genome-wide association studies, and be used as a tool to identify genetic variants influencing a trait
[3] Firth's penalized likelihood: A logistic regression likelihood penalized by Jeffreys' invariant prior. A first-order bias term is introduced into the score function to reduce the bias in the log odds ratio estimate that arises as a result of sparse data
[3] Small-sample-adjusted score test: A logistic regression score test in which the null distribution of the test statistic is adjusted using estimates of small-sample variance and kurtosis
[3, 9] Sequence kernel association test: Variant-collapsing test for a subset of variants constructed by aggregating individual variant score test statistics
[4] Quantitative trait mapping: The search for positions along the genome associated with quantitative traits
[4] Tree-based methods: Methods that account for uneven evolutionary relatedness among genetic variants
[4] Phylogenetic tree: A bifurcating tree used to represent the evolutionary relationships among variants (illustrated in Fig. 2)
[5] Within-chain permutation: Permutation of individual phenotypes is a widely used strategy to investigate the null distribution. Under the frequentist approach, statistics based on actual data are compared with the distribution of statistics from permuted data sets. In Bayesian analyses, computing time can be reduced by permuting phenotypes within the single Markov chains used to infer posterior distributions
[6] Minor allele count (MAC): The total count of minor alleles over all individuals evaluated at a particular position. For rare variants, the MAC reflects data sparsity better than the minor allele frequency

Rare variant behavior
[7] Gene-environment interaction term model: Statistical approach that tests for gene-environment interactions by including a gene-environment interaction term to measure the change in the outcome when both the genetic marker and the environmental factor are present, as compared to when one or both factors are not present
[7] Environment-stratified models: Alternative approach to identifying gene-environment interactions by comparing genetic effect sizes between strata defined by an environmental exposure

Follow up of association signals
[8] Bayesian adjustment for confounding: A Bayesian approach for estimating the average causal effect of an exposure on an outcome in observational studies while accounting for the uncertainty in confounder selection. It uses Bayesian model averaging to average inference across many models according to posterior weights determined by a joint model of the exposure and the outcome
[9] Logistic Bayesian LASSO (least absolute shrinkage and selection operator): Method based on a retrospective likelihood that models the probability of haplotypes given disease status. The odds of disease are expressed as a logistic regression model, whose coefficients are regularized through Bayesian LASSO

Handling rare variants
The analysis of association between binary phenotypes and single rare variants is challenging because conventional logistic regression approaches often violate the large-sample assumptions underlying test statistics, resulting in poor type I error control and low statistical power. In particular, standard score tests can be extremely anticonservative under the null. Shin et al. explored alternative tests for low-frequency and rare variants, including 2 sparse-data approaches to single-variant tests and collapsing tests for sets of variants. The first sparse-data approach was the Firth-type likelihood ratio test based on the penalized log-likelihood function

ℓ*(β) = ℓ(β) + (1/2) log |i(β)|,

where i(β) is the Fisher information matrix. The second sparse-data approach was a modified score test that incorporates small-sample variance and/or kurtosis to obtain the null distribution. The investigated variant-collapsing tests included a MAF-weighted burden test, a nonburden sequence kernel association test (SKAT), and a unified approach that optimally combines SKAT and the burden test (SKAT-O). The variant-collapsing tests were applied to sets of rare variants alone, rare and low-frequency variants, and all variants within subregions defined by physical proximity. Table 3 lists the software used (R packages pmlr and SKAT).
Thompson and Fardo compared mapping methods that account for the evolutionary relatedness among individuals (tree-based) with standard methods that ignore evolutionary relationships (non-tree-based association mapping methods). Each genetic variant has an evolutionary history that can be represented by a phylogenetic tree (see Table 2). In Fig. 2, each tip represents a copy of the variant. Time moves from past to present, from left to right across the tree, and the branch lengths represent time. If 2 variants share a large portion of their evolutionary history (the 2 blue diamonds in Fig. 2), associated phenotypic traits are expected to be correlated, whereas if 2 variants share little evolutionary history (the 2 black circles), trait values are expected to be uncorrelated. Tree-based association mapping methods that integrate evolutionary history into the statistical analysis may have an improved ability to detect weak associations, but existing tree-based methods are unable to consider external covariate information and are computationally expensive. Related literature and software in the field are sparse (see Table 3).
The identification of rare-variant associations by multimarker approaches, for example, collapsing, simple-sum, and weighted-sum methods, has recently drawn much attention. Table 3 presents a noninclusive list of available software. These methods first collapse rare variants and then apply a LASSO (least absolute shrinkage and selection operator), a partial least-squares regression model, or another supporting statistical method that relies on common and collapsed rare variants. However, variant pooling may dilute rare-variant effects, and the cutoff used to distinguish rare from common variants is arbitrary. To circumvent these limitations, Oh extended previous work and presented a Bayesian variable selection approach to detect associations with both rare and common genetic variants. Under this approach, rare-variant mapping is framed as a variable selection problem in a sparse space, where risk index scores are constructed for groups of rare variants over the genomic region. Technical details on the chosen priors and on inference relying on marginal posterior probabilities of latent variables can be found in the GAW19 proceedings. Table 2 provides a brief explanation of within-chain permutation of phenotype data to calculate empirical thresholds and adjust false-positive rates.
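Within-chain permutation is an MCMC-specific refinement of a more basic idea: calibrating an observed statistic against its distribution over permuted phenotypes. That basic idea can be sketched as follows (plain permutation with made-up data, not Oh's within-chain implementation):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
geno = rng.binomial(2, 0.05, size=n).astype(float)   # rare-variant genotypes (made-up)
pheno = 0.5 * geno + rng.normal(size=n)              # phenotype with a true genetic effect

def stat(y, g):
    # squared correlation as a simple association statistic
    return np.corrcoef(y, g)[0, 1] ** 2

observed = stat(pheno, geno)

# empirical null distribution: recompute the statistic on permuted phenotypes
perm_stats = np.array([stat(rng.permutation(pheno), geno) for _ in range(999)])
p_emp = (1 + np.sum(perm_stats >= observed)) / (999 + 1)
print(p_emp)
```

Permuting phenotypes breaks the genotype-phenotype link while preserving the marginal distributions, so the permuted statistics form an empirical null against which the observed statistic is judged.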

Rare-variant behavior
In next-generation sequence data, the proportion of rare variants is substantially larger than the proportion of common variants typically measured in array-based genome-wide association studies. Rare variants present a challenge because there are often too few copies of the rare allele for traditional statistical tests, and the increased variant density results in multicollinearity, making it difficult to identify independent associations. Schwantes-An et al. investigated the effect of the MAF, chromosomal position, significance threshold, and departure from normality of the investigated phenotype on the type I error rate. Five quantitative phenotypes were simulated under the null hypothesis. The first phenotype followed a standard normal distribution, and the second followed a gamma distribution. The third and fourth phenotypes were the log10 and rank-based inverse normal transformations of the second phenotype, respectively. The fifth phenotype was the null trait Q1 provided by the workshop organizers.
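A rank-based inverse normal transformation maps a skewed trait onto normal quantiles; a minimal sketch is below. The Blom offset c = 0.375 is one common convention and an assumption here, since the exact variant used by the authors is not specified.

```python
import numpy as np
from scipy.stats import norm, skew

rng = np.random.default_rng(3)
raw = rng.gamma(shape=1.0, scale=2.0, size=1000)   # skewed null phenotype (made-up)

def rank_inverse_normal(x, c=0.375):
    # Blom-type rank-based inverse normal transformation (c = 0.375 assumed)
    ranks = np.argsort(np.argsort(x)) + 1           # ranks 1..n (no ties for continuous data)
    return norm.ppf((ranks - c) / (len(x) - 2 * c + 1))

transformed = rank_inverse_normal(raw)
print(skew(raw), skew(transformed))  # strong positive skew vs. essentially 0
```

Because the transformed values are a fixed symmetric set of normal quantiles, any skewness in the original trait is removed by construction, which is why the transformation rescued the type I error rate for the gamma-distributed null trait.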
Fernández-Rhodes et al. compared type I error rates and statistical power for 2 gene-environment interaction methods. The first method ("interaction model") uses a gene-environment interaction term to measure the change in outcome when both the genetic marker and the environmental factor are present, compared to the presence of the genetic marker alone. The second method tests for effect-size differences between strata under distinct environmental exposures (medication vs nonmedication in the present context; "med-diff" approach). The 2 methods can be applied to test gene-environment interactions with 1 degree of freedom (df), as well as to test both main genetic and gene-environment interaction effects with 2-df tests. The methods were compared using genotype-medication interactions simulated by the GAW19 organizers. Gene-environment interaction analyses were adjusted for age, sex, population structure, and family relatedness.
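The contrast between the two strategies can be sketched on made-up data: the 1-df interaction test examines the product-term coefficient in a single model, whereas a "med-diff"-style test compares genetic effect sizes estimated separately in each environment stratum. This toy version uses ordinary least squares and ignores the covariate and relatedness adjustments used in the actual analyses.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
g = rng.binomial(2, 0.2, size=n).astype(float)   # genotype (made-up common variant)
e = rng.binomial(1, 0.5, size=n).astype(float)   # environment, e.g., medication use
y = 0.2 * g + 0.1 * e + 0.4 * g * e + rng.normal(size=n)  # simulated interaction effect

def ols(X, y):
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * XtX_inv))
    return beta, se

# "interaction model": 1-df test of the g*e product-term coefficient
X = np.column_stack([np.ones(n), g, e, g * e])
beta, se = ols(X, y)
z_inter = beta[3] / se[3]

# "med-diff" style: compare genetic effect estimates between environment strata
strata = []
for stratum in (0.0, 1.0):
    m = e == stratum
    b, s = ols(np.column_stack([np.ones(int(m.sum())), g[m]]), y[m])
    strata.append((b[1], s[1]))
(b0, s0), (b1, s1) = strata
z_meddiff = (b1 - b0) / np.sqrt(s0**2 + s1**2)
print(z_inter, z_meddiff)  # both detect the simulated interaction
```

With a fully linear model and a binary exposure the two statistics target the same quantity; their behavior diverges for rare variants, where a stratum may contain almost no minor-allele carriers and the stratified fit can fail to converge.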

Follow up of association signals
Estimation of causal effects of genetic variants on disease may help to bridge the gap between assessment of association and function, allowing at the same time an improved localization of disease variants. The causal effect of genetic variants on clinical phenotypes may be confounded by demographic and clinical characteristics, and also by other genetic variants as a result of linkage disequilibrium. Wang et al. explored the estimation of the average causal effect of genetic variants on clinical phenotypes using the Bayesian adjustment for confounding method. This method has been proposed to estimate average causal effects in the presence of many confounders, assuming all of them have been measured.
Bayesian adjustment for confounding utilizes a Bayesian model averaging approach and estimates causal effects by a posterior weighted average of average causal effect estimates from a battery of models with different sets of covariates. In contrast to standard Bayesian model averaging, Bayesian adjustment for confounding incorporates the strength of associations between covariates in the model and the exposure into the prior. It has been shown that the method tends to give large posterior weights to models that have been fully adjusted for confounding, thus resulting in unbiased causal effect estimates (see articles by Wang et al. listed in Table 3). The bias and variability of average causal effect estimates were examined by comparison with a "true model" (age, sex, their interaction, and 25 variants with true systolic blood pressure [SBP] effects), and a "full model" (age, sex, their interaction, smoking status, and 94 variants within, up-, and downstream of MAP4). Table 3 provides a link to software developed by Wang et al.
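The posterior-weighted averaging idea can be illustrated with a toy sketch: fit models with every candidate confounder subset, weight each model's exposure coefficient by an approximate posterior model weight, and average. This uses simple BIC weights over outcome models only, so it is an illustration of model averaging, not the Bayesian adjustment for confounding algorithm itself, which links the exposure and outcome models through an informative prior.

```python
import itertools
import numpy as np

rng = np.random.default_rng(11)
n = 1000
conf = rng.normal(size=(n, 3))          # candidate confounders (made-up)
x = conf[:, 0] + rng.normal(size=n)     # exposure, confounded by conf[:, 0]
y = 0.5 * x + 1.0 * conf[:, 0] + rng.normal(size=n)  # true causal effect 0.5

def fit(cols):
    # regress y on the exposure plus a chosen confounder subset
    X = np.column_stack([np.ones(n), x] + [conf[:, j] for j in cols])
    beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    bic = n * np.log(rss[0] / n) + X.shape[1] * np.log(n)
    return beta[1], bic

models = [fit(cols) for r in range(4) for cols in itertools.combinations(range(3), r)]
effects = np.array([m[0] for m in models])
bics = np.array([m[1] for m in models])
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()
ace = weights @ effects   # posterior-weighted average exposure-effect estimate
print(ace)                # close to the true effect 0.5
```

Models omitting the true confounder receive negligible weight, so the averaged estimate is driven by properly adjusted models, which is the behavior the Bayesian adjustment for confounding prior is designed to guarantee.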
Haplotype analysis is a typical follow-up step after the identification of single-variant association signals. Haplotype-based methods can be more powerful than single-SNP (single nucleotide polymorphism) methods when causal variants are not genotyped, and when multiple variants act in cis. In some situations, they also have increased power to detect rare-variant associations over recently developed "collapsing" methods. Datta et al. investigated possible associations with rare haplotypes in 2 hypertension-associated genes, ULK4 and MAP4. They analyzed sliding haplotype windows of 5 variants using 4 haplotype association methods (haplo.score, haplo.glm, hapassoc, and logistic Bayesian LASSO [LBL]) and 3 collapsing methods (SKAT, SKAT-O, and SKAT-RC; see Tables 2 and 3). LBL models the probability of haplotypes given disease status. Unobserved haplotypes are treated as missing data, and the frequencies of haplotype pairs are modeled, allowing for Hardy-Weinberg disequilibrium. The odds of disease are expressed as a logistic regression model, whose coefficients are regularized through a double-exponential prior centered at 0 with a variance parameter that is further assigned a hyperprior. By regularizing regression coefficients through their prior distributions, LBL weeds out unassociated (especially common) haplotypes, allowing the associated rare haplotypes to be detected more easily.

Results and discussion
The regression models with the best control of type I error rates were zero-inflated and Hurdle-negative binomial regression for the relationship between alternative allele counts and adjusted diastolic blood pressure, and standard and robust linear regression for the relationship between the ratio "alternative allele count/number of reads" as response variable and adjusted diastolic blood pressure as explanatory variable. The simultaneous consideration of the ability to discriminate variants with small associated probability values, the occurrence of convergence problems, and the robustness of probability values against outlying blood pressure observations indicated that Hurdle-negative binomial regression constitutes a promising approach.
Regarding the selection of variants for kinship estimation, markers that were not ancestry informative resulted in the most accurate estimates. The homogenizing selection strategy with the maximum likelihood and PC-Relate estimators assigned correct relationships for more than 90 % of pairs of first- and second-degree relatives and unrelated subjects. All methods assigned relationships incorrectly for 20 % of third- and fourth-degree relative pairs. European and Native American ancestries were best captured by principal component analysis based on common variants, with similar results for variants with a MAF >0.01 or >0.05. The ability to capture African ancestry was poorer and was maximized when principal components relied on variants with a MAF <0.05. Clearly, allele frequency plays an important role in estimates of relatedness and population structure, and should be carefully considered in such analyses.
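One way to see why allele frequency matters here: standard genotype-based relatedness estimates standardize each variant by sqrt(2p(1-p)), so rare variants receive large weights and variant selection directly shapes the estimate. A minimal sketch of a frequency-standardized genetic relationship matrix (GRM) for simulated unrelated individuals follows; this is the generic VanRaden-type GRM, not the specific maximum likelihood or PC-Relate estimators used by the group.

```python
import numpy as np

rng = np.random.default_rng(2)
n_ind, n_var = 100, 5000
freqs = rng.uniform(0.05, 0.5, size=n_var)              # made-up allele frequencies
geno = rng.binomial(2, freqs, size=(n_ind, n_var)).astype(float)

p_hat = geno.mean(axis=0) / 2
keep = (p_hat > 0) & (p_hat < 1)                        # drop monomorphic variants
# standardization by 1/sqrt(2p(1-p)) upweights rare variants,
# which is why the chosen variant set drives the estimates
z = (geno[:, keep] - 2 * p_hat[keep]) / np.sqrt(2 * p_hat[keep] * (1 - p_hat[keep]))
grm = z @ z.T / keep.sum()                              # genetic relationship matrix

print(np.diag(grm).mean())  # near 1 for unrelated individuals
```

Diagonal entries near 1 and off-diagonal entries near 0 correspond to an unrelated, unstructured sample; relatives and population structure show up as elevated off-diagonal blocks.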
Concerning the alternative tests for low-frequency and rare variants investigated by Shin et al., the penalized logistic likelihood ratio test and the small-sample modified score test were both better choices than standard single-variant tests. In tests of association between a simulated binary hypertension phenotype and variants in the MAP4 gene, the sparse-data methods generally improved the control of type I errors. Statistical power was sufficient to detect low-frequency and common variants, but remained low for the rare variants. The occurrence of conflicting p values from standard and small-sample tests of the same hypothesis can indicate that large-sample approximations are invalid. Although previous studies have found variant-collapsing tests to have higher power than single-variant tests for rare variants, simulation results for MAP4 showed low power for variant-collapsing tests when they only included rare variants. The statistical power increased when low-frequency variants were included. Because the power of variant-collapsing tests depends on the number, frequency, and effect sizes of associated and neutral variants, identification of better grouping and weighting strategies may translate into an improved power to detect regions that only contain rare variants.
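The value of the Firth-type penalty in sparse data can be demonstrated directly: under complete separation the ordinary logistic MLE diverges, whereas maximizing the penalized log-likelihood ℓ(β) + (1/2) log|i(β)| yields finite estimates. The sketch below is a minimal numpy/scipy illustration, not the pmlr implementation used by Shin et al.

```python
import numpy as np
from scipy.optimize import minimize

# completely separated data: the ordinary logistic MLE diverges (|beta| -> infinity)
x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

def neg_penalized_loglik(beta):
    eta = X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    ll = np.sum(y * eta - np.log1p(np.exp(eta)))    # ordinary log-likelihood
    W = p * (1.0 - p)
    info = X.T @ (W[:, None] * X)                   # Fisher information i(beta)
    sign, logdet = np.linalg.slogdet(info)
    return -(ll + 0.5 * logdet)                     # Firth penalty: + (1/2) log|i(beta)|

fit = minimize(neg_penalized_loglik, x0=np.zeros(2), method="Nelder-Mead")
print(fit.x)  # finite intercept and slope despite complete separation
```

The penalty drives the criterion to minus infinity as fitted probabilities approach 0 or 1, so the optimum stays in the interior; for this 2x2 layout the penalized fit coincides with adding 1/2 to each cell count.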
With reference to the consideration of evolutionary relationships, the type I error rate of tree-based mapping methods appeared to be well controlled, whereas non-tree-based association mapping showed an inflated type I error rate for all 5 investigated genes. Regarding detection, both methods performed well when analyzing genes with large-effect variants, and both showed low power in the analysis of genes with low-penetrance variants. Regarding localization of true causal loci, both methods showed similar mapping abilities for genes with large effects. In the case of small-effect variants, tree-based outperformed non-tree-based association mapping, which may point to an advantage of using evolutionary information to detect weak genetic signals.
Under the Bayesian variable selection approach to detect associations with both rare and common genetic variants, marginal posterior probabilities were quite robust against the MAFs used to define variant rarity (from 0.05 to 5 %), and they clearly surpassed empirical probabilities based on permuted phenotypes for several positions, pointing to associations between rare and common variants in MAP4 with SBP and diastolic blood pressure (DBP).
In agreement with Fig. 3 from González Silos et al., 77 % of the variants investigated by Schwantes-An et al. were extremely rare (MAF <0.0025), and more than half had the variant allele in only a single individual. Type I error rates were not inflated for the normal and rank-based inverse normal transformed gamma null phenotypes. However, type I error rates for the gamma and log-transformed gamma-distributed phenotypes increased with decreasing MAFs, with increasing departure from normality, and with decreasing significance thresholds. Although Q1 was nearly normally distributed, type I error rates for common variants were higher than expected. The inflation of type I error rates for rare and extremely rare variants for null traits that departed from normality was ameliorated by transforming nonnormally distributed traits to those with a more normal distribution.
Regarding the behavior of rare variants interacting with environmental factors, type I error rates did not surpass the expected nominal 5 % significance threshold for either of the 2 investigated gene-environment interaction methods ("interaction model" and "med-diff" approach to test effect size differences between strata) for variants with at least a MAF of 1 %. The statistical power was higher for the "med-diff" approach for variants with a MAF <1 %, but it was higher for the "interaction model" when a variant with a MAF >5 % was evaluated. Nonconvergence was a limitation of the "med-diff" approach.
Finally, in respect of the follow up of association signals, mean squared errors were smaller for estimates based on Bayesian adjustment for confounding than for full-model-based estimates of the average causal effect in the investigated scenarios. The reduced variation of estimated average causal effects was a result of the simultaneous consideration of an exposure model and an outcome model in the Bayesian adjustment for confounding, suggesting that this method is able to efficiently estimate the causal effect of genetic variants.
Rare-variant haplotype analyses revealed that "hapassoc" often showed convergence problems and, when it converged, association results were similar to those of "haplo.glm." The haplotypes found to be associated depended on the method. The ranking of methods by the total number of significant haplotypes found in the 2 genes was LBL < haplo.score < haplo.glm. However, after permutation of the case-control status to mimic the null scenario, the ratio of associated haplotypes to the total number of haplotypes was also lower for LBL than for "haplo.score" and "haplo.glm," indicating a better controlled false-positive rate for LBL compared to "haplo.score" and "haplo.glm." SKAT and its variants did not identify statistically significant association signals. Based on these results, haplotype association methods seem to be useful and complementary to collapsing approaches for sequence data.
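The shrinkage behavior attributed to LBL can be illustrated in miniature: a double-exponential (Laplace) prior on logistic coefficients corresponds, at the posterior mode, to L1-penalized logistic regression, which pulls unassociated coefficients toward 0 while a genuinely associated rare predictor survives. The sketch below uses made-up indicator "haplotype" columns and MAP optimization; the actual LBL software uses a retrospective likelihood and full MCMC, not this shortcut.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
n, k = 800, 6
# columns stand in for haplotype counts; frequencies are assumptions
H = rng.binomial(2, np.array([0.3, 0.25, 0.2, 0.1, 0.05, 0.02]), size=(n, k)).astype(float)
eta = -1.0 + 3.0 * H[:, 5]                      # only the rare column carries risk
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)
X = np.column_stack([np.ones(n), H])

def neg_map(beta, lam=2.0):
    eta = X @ beta
    ll = np.sum(y * eta - np.log1p(np.exp(eta)))
    # Laplace prior on non-intercept coefficients = L1 penalty at the mode
    return -ll + lam * np.abs(beta[1:]).sum()

fit = minimize(neg_map, x0=np.zeros(k + 1), method="Powell")
print(np.round(fit.x[1:], 2))  # unassociated coefficients shrink; the rare risk column stands out
```

The same mechanism explains the empirical finding above: shrinking common, unassociated haplotypes keeps LBL's false-positive rate low while leaving the rare associated haplotype detectable.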

Conclusions
With their results, members of the Population-Based Association group identified several current methodological gaps regarding both the preparation and the statistical analysis of sequence data. Sequence data is noisy, and investigating the distribution of allele counts instead of relying on called genotypes could offer some advantage. The selection of genetic variants was found to play a major role in the assessment of population structure and cryptic relatedness. Most statistical methods with good properties for common variants were found inappropriate for rare ones. Methodological gaps were also identified in the follow up of association signals. Novel methods are needed to investigate rare haplotypes and interactions between the environment and rare variants found in sequence data, as well as for causal effect estimation.