Identification of influential rare variants in aggregate testing using random forest importance measures

Aggregate tests of rare variants are often employed to identify associated regions compared to sequentially testing each individual variant. When an aggregate test is significant, it is of interest to identify which rare variants are “driving” the association. We recently developed the rare variant influential filtering tool (RIFT) to identify influential rare variants and showed RIFT had higher true positive rates compared to other published methods. Here we use importance measures from the standard random forest (RF) and variable importance weighted RF (vi‐RF) to identify influential variants. For very rare variants (minor allele frequency [MAF] < 0.001), the vi‐RF:Accuracy method had the highest median true positive rate (TPR = 0.24; interquartile range [IQR]: 0.13, 0.42) followed by the RF:Accuracy method (TPR = 0.16; IQR: 0.07, 0.33) and both were superior to RIFT (TPR = 0.05; IQR: 0.02, 0.15). Among uncommon variants (0.001 < MAF < 0.03), the RF methods had higher true positive rates than RIFT while observing comparable false positive rates. Finally, we applied the RF methods to a targeted resequencing study in idiopathic pulmonary fibrosis (IPF), in which the vi‐RF approach identified eight and seven variants in TERT and FAM13A, respectively. In summary, the vi‐RF provides an improved, objective approach to identifying influential variants following a significant aggregate test. We have expanded our previously developed R package RIFT to include the random forest methods.

. Even with larger sample sizes, there is often insufficient power to test individual rare variants due to the frequency of the rare allele and the multiple testing burden. The latter results from the fact that, both genomewide and within any region of the genome, there are far more rare variants than common variants.
To increase power for detecting rare variant associations, aggregate testing methods have been developed that combine or jointly test rare variants across a genomic region (Bocher & Génin, 2020;Cruchaga et al., 2014;Lee et al., 2014;Li et al., 2019). Aggregate tests perform a statistical test on a set of more than one rare variant, which is different from the classical single-variant association tests which test each variant individually. There are three broad groups of methods used for aggregate testing of rare variants: burden tests, variance component tests, and combined burden and variance component tests. The combined burden and variance component tests achieve the greatest power when the underlying model of disease association is unknown, which is most often the case (Seunggeun Lee, Wu et al., 2012). Once a region has been identified by these aggregate tests as containing trait-influencing variants, it is important to identify what subset of these variants are driving the association. First, for some assays, it is not feasible to molecularly test each of the variants for function in an associated region. Second, the variants that drive the association form a smaller set of variants that can more easily be appropriately powered in replication studies. We hold that it is more practical, efficient, and robust to test (experimentally or in replication studies) the subset of variants that appear to drive the association, thereby accelerating discovery.
We recently developed the rare variant influential filtering tool (RIFT) which prioritizes rare variants following a significant aggregate test of association (Blumhagen et al., 2021). The general approach of RIFT is to quantify the influence of a variant on the aggregate test by comparing the aggregate test of the complete set of variants to that of the aggregate test when a given variant is removed. In addition to the quantitative measure, RIFT uses outlier detection methods to classify variants as influential variants (IVs) that can be prioritized for future study. We previously showed over a range of conditions that RIFT achieved higher true positive rates than existing localization methods while maintaining low false positive rates. However, all methods investigated observed higher sensitivity to detect uncommon variants (0.001 < MAF < 0.03) compared to very rare variants (MAF < 0.001).
In this study, we expand RIFT by employing random forest (RF) methods to evaluate the influence of a variant on aggregate tests of association. A RF is a widely used ensemble learning method that combines multiple decision trees to classify/predict observations. RFs can be applied to data with a binary outcome (classification trees) or continuous outcome (regression trees). One of the many advantages of RF methods is that they provide estimates of the importance of each variable to the classification.
For data in which there are a large number of variables but only a small proportion are important, the standard RF algorithm does not perform as well for making accurate classifications (Amaratunga et al., 2008). This is relevant for large sets of genetic variant data where we do not expect a large proportion of rare variants to be associated with the phenotype of interest. To address this issue, Liu and Zhao (2017) presented a modified RF method, referred to as variable importance-weighted RFs (vi-RF), which instead of randomly selecting variables at each node of the decision tree, weights the selection of the variables based upon the variable importance measures of an initial RF (Liu & Zhao, 2017). This method was shown to improve prediction accuracy, most notably when the data are highly variable compared to the strength of the association signal.
We propose to use the resulting variable importance measures from both the standard RF and the variable importance-weighted RF as a quantitative measure to rank variants following a significant aggregate test. In addition to ranking variants, it is of interest to identify the subset of influential variants driving the disease association. To identify this subset of influential variants based on the variable importance measures, we take advantage of a well-established outlier detection method, the Tukey fence, based on the interquartile range. In our application, we are interested in classifying observations according to disease status or other discrete outcomes. This novel application of combining variable importance measures with outlier detection methods provides a method that is agnostic to putative function to allow detection of variants that may have unknown function.

Overview
We propose coupling a RF method and its variable importance measures with an outlier detection method to identify the rare variants most influencing significant aggregate tests of association ( Figure 1). Although these methods are applicable for any aggregate test of variants, we focus our simulation study on variants with a minor allele frequency (MAF) of <3% (i.e., rare and uncommon) as this is the most common application of aggregate testing. First, we describe the standard RF algorithm and then an extension to the method which weights variable selection based on the variable importance measures from an initial RF (Breiman, 2001;Liu & Zhao, 2017). We then describe the F I G U R E 1 Flow chart demonstrating application of random forest methods in combination with outlier detection to identify influential variants.
Tukey outlier detection method that we use to classify outlying variable importance measures (Tukey, 1977). We report a simulation study that evaluates the performance of the proposed methods and summarizes the ability of this approach to identify rare/uncommon variants associated with disease. We compare the RF methods to our previously published method, RIFT (Blumhagen et al., 2021). Finally, we apply the RF methods to rare variant regions previously found to be significantly associated with idiopathic pulmonary fibrosis (Moore et al., 2019).

Standard RF method
The RF algorithm described by Breiman (2001) is an ensemble learning method which combines the results from a large number of decision trees. Here a decision tree is constructed by first randomly selecting N observations (with replacement), referred to as in-bag observations, from the dataset; the observations not selected are referred to as the out-of-bag observations. At each node of the decision tree, a random subset of m variables is selected and the variable among this subset that results in the largest reduction in impurity among the in-bag observations is chosen to split the node. One of the benefits of an RF is that it provides insight into what variables play an important role in the classification of the outcome through the com-putation of variable importance measures. There are two common variable importance measures that are calculated as part of the RF algorithm: the accuracy-based importance and the Gini-based importance (Breiman, 2001;Breiman et al., 2017). The accuracy-based importance is calculated by comparing the prediction accuracy of the out-of-bag observations to the prediction accuracy of the out-of-bag observations in which the variable has been permuted. The accuracy-based importance measure is defined as the mean decrease in accuracy across all the trees. The Ginibased importance is calculated by averaging the decrease in the Gini impurity or the in-bag observations across all the decision trees. Given the Gini impurity is already calculated on the in-bag observations during the construction of the RF and does not require a permutation step, it is more computationally efficient. For both importance measures, larger values correspond to variables with larger predictive power and are considered more important in the classification of the outcome. It has been shown that rankings based on the Gini impurity for genetic data are more robust than the accuracy-based importance but can be sensitive to allele frequency; specifically, higher frequency variants may have inflated Gini impurity under the null (Boulesteix et al., 2012;Nicodemus, 2011). Given the advantages of both measures, we examined both the accuracy-based importance value and the Gini-based importance value as measures of a variant's influence.
We implemented the standard RF using the random-Forest function in R. We provided the genotype matrix (for a set of variants included in an aggregate test) as input and the phenotype as the response variable. We used default parameters, including the number of variables randomly sampled at each node as the square root of the number of variables and sampling observations with replacement. We ran each RF to include 500 trees with the minimum node size of 1 and no maximum node size. We applied the RF to each simulated dataset and obtained the importance metrics for each variant as a measure of a variant's influence on the outcome. We refer to this method as "RF:Gini" and "RF:Accuracy" corresponding to the Gini-based importance and the accuracy-based importance measure, respectively.

Variable importance-weighted RF method
We compared the standard RF to that of the variable importance-weighted RF method developed by Liu and Zhao (Liu & Zhao, 2017). Liu and Zhao's method weights the feature selection at each node by the accuracy-based importance value from an initial RF instead of performing a random feature selection. We applied the corresponding software package viRandomForest 1.0 using the same default settings, as those described for the standard RF package. As for the standard RF method, we saved the importance metrics generated from the weighted RF for each variant. We refer to this method as "vi-RF:Gini" and "vi-RF:Accuracy" corresponding to the Gini-based importance and the accuracy-based importance measure, respectively.
In addition to the previously published RF methods above, we explored whether we could improve the ability of the importance values to capture variants at lower allele frequencies by modifying the process by which variants are selected at a given node of the decision tree. To do this, we investigated weighting the variable selection as a function of the inverse of the sample MAF and compared to the vi-RF:Gini and vi-RF:Gini method.

Outlier detection methods to classify variants as influential variants
Within a set of variants identified as significant in aggregate testing, referred to as a "significant set," we can rank the rare variants by their respective importance measure, where larger values correspond to the variant having greater discrimination ability to classify the phenotype of interest. In addition to the quantitative measure of the importance value, it is also desirable to identify a subset of rare variants that show the greatest influence in the classification of the outcome. We refer to these variants as influential variants (IV). We apply Tukey fences, a nonparametric approach, to identify outlying importance values for classification of variants as IVs. The interquartile range (IQR) is used as a measure of spread (IQR = distance between the first [Q1] and third [Q3] quartiles), where observations lying outside the inner fence boundary, referred to as "fences", are considered outliers (Tukey, 1977). We selected the inner Tukey fence as it was found to have the highest true positive rate while maintaining low false positive rates for RIFT when compared to alternative outlier detection methods: outer Tukey fence, median absolute deviation, and standard deviation (Blumhagen et al., 2021). The inner fence is defined by Q1 − 1.5*IQR and Q3 + 1.5*IQR. Identification of outliers using the IQR is superior to methods that rely on parametric variance estimators such as the standard deviation, as the IQR is robust to extreme values (Seo & Gary M. Marsh, 2006).

Comparison with RIFT
We compared the RF and vi-RF to our previously developed localization method, RIFT (Blumhagen et al., 2021). The RIFT method measures the influence of a variant on the aggregate test of association by quantifying the change in the aggregate test statistic when that variant is removed. The quantitative change measure, referred to as the delta chi square score, is calculated for every variant in the significant set and represents the change in the aggregate test when a variant is removed. In addition to the ability to rank variants by the magnitude of the delta chi square score, RIFT classifies IVs through application of the Tukey fences outlier detection method to the delta chi square scores. RIFT has been shown to obtain a higher true positive rate while maintaining a low false positive rate compared to other existing localization methods (Blumhagen et al., 2021).

Simulated data
We follow a previously developed simulation approach to generate rare variant data with a binary outcome (Lee, Emond et al., 2012;Wu et al., 2011). We have included a schematic of the simulation strategy as well as a list of the parameters used for all simulations presented in the manuscript (Supporting Information Table S1, Figure  S1). Specifically, to generate case-control rare variant data under both null and alternative hypotheses, we simulated 10,000 haplotypes for a 1 MB genomic region under a coalescent model with parameters consistent with a European population using the software package COSI v1.2 (Schaffner et al., 2005). The run time for generating 10,000 haplotypes was <5 mins and required 444 MB of storage. We considered 50 different genomic regions of size 3 kb and included only rare variants with MAF < 0.03. Though a MAF cutoff for rare variant studies typically ranges between 1% and 5%, we chose the MAF cutoff of 3% to be consistent with the previously published targeted resequencing study in idiopathic pulmonary fibrosis, which we used for the application in this manuscript. To generate samples of cases and controls from the population, for each simulation replicate we selected two haplotypes at random and converted these haplotype data to genotypes at each variant location. We determined the probability of subject k being a case, defined as p k , based on a logistic regression equation where β k represents the coefficient for variant k (Equation 1). G jk corresponds to the number of minor alleles carried by subject k for variant j and β 0 the disease prevalence. For all simulations, the disease prevalence was fixed at 0.05. After determining case status for a large sample of individuals, we sub-sampled to obtain a specified number of cases and controls. Case status, Y k , is specified using the Bernoulli (p k ) distribution (Equation 2).
Consistent with previously published simulation of rare variant data, we defined the coefficient for variants under the alternative (Equation 3) to be a function of the population minor allele frequency (MAF j ) for variant j (Seunggeun Lee, Emond et al., 2012;Wu et al., 2011). This relationship between the coefficient and the population MAF results in larger odds ratios for more rare variants, consistent with previous studies of observed effect sizes of rare variants (Bomba et al., 2017;Povysil et al., 2019). The constant, c, defines the strength of association between the causal variants and the outcome and is independent of the number of causal variants.
With c set to 0.4, this corresponds to an odds ratio of 1.84 for a variant with a MAF of 0.03 and 4.95 for a variant with a MAF of 0.0001. As expected, the subsampling of the population results in some variants (both under the null and alternative) to be no longer observed in a given sample.
Note that our approach can be applied to any aggregate test (for a review of methods for rare variant aggregate tests, please see Lee, Abecasis et al., 2014). For exposition of the method, we use a combined burden and variance compo-nent test, the sequence kernel association optimal unified test (SKAT-O), due to its popularity in rare variant analyses (Lee, Emond et al., 2012). We used the recommended settings-including a linear weighted kernel, estimation of p-values using the "Davies" method and the variant weights as a function of the MAF using the Beta(1,25) distribution. If the SKAT-O p-value met the significance threshold (here, alpha = 0.05), we then applied the corresponding RF method to obtain importance values for each variant. To identify IVs, we applied the inner Tukey fence outlier detection method to the resulting importance values. For every variant, we calculated the proportion of times it was labelled as an IV across the samples it was observed; for variants under the alternative, this corresponds to a true positive rate and for variants under the null, this corresponds to a false positive rate.

Application to rare variant regions associated with IPF
We applied the RF methods to data from a recently published targeted resequencing study in idiopathic pulmonary fibrosis (IPF) (Moore et al., 2019) as part of the Global IPF Collaborative Network. The study of 3,017 idiopathic pulmonary fibrosis (IPF) cases and 4,093 controls found that rare variants contribute significantly to disease risk (Moore et al., 2019). Rare variants were grouped in sliding windows and tested for association using SKAT-O. We applied RF, vi-RF and RIFT with the goal of identifying rare variants within regions associated with IPF to focus follow-up experimental validation studies. We report results for two regions, each containing 50 rare variants. To provide insight into putative function, we include functional annotation information obtained from SNPDOC (https://wakegen.phs.wakehealth.edu/ public/snpdoc3/index.cfm). Additionally, we queried Reg-ulomeDB 2.0 for annotation of noncoding and intergenic regions of the genome to known and predicted regulatory elements (Boyle et al., 2012).

Importance measures from RF methods can be used to rank variants
We applied both RF methods to 500 simulated samples of 1000 cases and 1000 controls across 50 3 kb regions. For each genetic region, 10% of variants were simulated under the alternative with effect size parameter c = 0.4; all variants simulated were bi-allelic. The Gini importance measure (computed as the mean decrease in Gini index) and accuracy importance measure (computed as the mean decrease in accuracy) are able to capture F I G U R E 2 Importance measures are greater for variants simulated under the alternative compared to under the null for both random forest (RF) and variable importance weighted RF (vi-RF) methods. Gini-based importance and accuracy-based importance averaged across 500 simulations plotted by the population MAF. Data were simulated to have 10% variants under the alternative with effect size parameter c = 0.4 (see Equation 2). variants simulated to be associated with the outcome when applying the RF to a set of rare variants. For both the Gini and accuracy importance measures from the RF and vi-RF methods, variants simulated under the alternative have larger average importance measures across samples in which the variant was observed compared to variants simulated under the null (Figure 2). Among variants simulated under the alternative, both the average Gini importance measure and the average accuracy importance measure are larger for variants with higher allele frequencies (MAF > 0.001) compared to variants at the lower allele frequency spectrum (MAF < 0.001).

Performance of using variable importance measures with outlier detection
Though the importance measure itself provides a measure to rank the rare variants within an aggregate set, it is valuable to be able to identify the subset of rare variants considered to be influencing that significant aggregate test. By applying the inner Tukey fence outlier detection method to classify IVs, we were able to compare the proportion of times a variant is called an IV across the samples it was observed. These measures can be computed separately for variants simulated under the alternative (true positive rate) and variants simulated under the null (false positive rate). As expected, we find a similar relationship between true positive rate and MAF as was observed for the average importance measures; all methods are more sensitive to variants at higher MAF ( Figure 3).
The RF methods achieved a higher true positive rate compared to RIFT with the accuracy importance measures achieving the largest gains in sensitivity (Table 1). The most notable improvement in performance is for variants on the lower allele frequency spectrum with MAF < 0.001. The vi-RF:Accuracy had a median true positive rate of 0.24 (IQR: 0.13, 0.42) followed by the RF:Accuracy of 0.16 (IQR: 0.07, 0.33) compared to the RIFT method of 0.04 (IQR: 0.02, 0.14). This increase in true positive rate among very rare variants did not result in a correspondingly large increase in false positive rate; the vi-RF:Accuracy had the highest false positive rate of 0.04 (IQR: 0.01, 0.07).
Among uncommon variants (MAF ≥ 0.001), the vi-RF:Accuracy method had a median true positive rate of 0.87 (IQR: 0.74, 0.97) followed by the RF:Accuracy method of 0.81 (IQR: 0.66, 0.96), and the RIFT method of 0.61 (IQR: 0.50, 0.88) ( Table 1). This gain in sensitivity came at a cost of an increase to the false positive rate, where the vi-RF:Accuracy method had a median false positive rate of 0.24 (IQR: 0.17, 0.42) followed by the RF:Accuracy method of 0.21 (IQR: 0.12, 0.38), and the RIFT method of 0.10 (IQR: 0.05, 0.18) (Supporting Information Figure S2). The Gini importance measures achieved a higher true positive rate than that of RIFT while maintaining a low false TA B L E 1 Summary of true positive rate and false positive rate between random forest (RF), variable importance weighted RF (vi-RF), and rare variant influential filtering tool (RIFT) separated by MAF < 0.001 and MAF ≥ 0.  FigureS3 and S4). We found that weighting the feature selection based on the inverse MAF did not improve the performance characteristics above that of RF and vi-RF. The inverse MAF approach had a higher true positive rate for very rare variants; however, this improvement was at the expense of a very low true positive rate for uncommon variants (Supporting Information Figure S5; InvMAF). Further, the inverse MAF weighting approach observed a higher false positive rate for very rare variants, making the improvements in true positive rate less compelling (Supporting Information Figure S6).
The computational time required for RIFT and vi-RF are comparable (Supporting Information Figure S7); RIFT and vi-RF take on average 0.8 seconds and 13.2 s, respectively, across 30 regions of size 3 kb for a single workstation with 3.3 GHz CPU and 16GB memory. Even for a larger region of size 50 kb with an average of 902 variants, RIFT and vi-RF take only 13.7 min and 14.2 min, respectively. Increasing the sample size does not impact the computational time for RIFT (average time under 2 s for both 500 cases/controls and 2,000 cases/controls), but does increase the time required for vi-RF (average time 4 s for 500 cases/controls and 54 s for 2,000 cases/controls).

Clearer signal for IPF resequencing regions for vi-RF method compared to RIFT
The first IPF rare variant window we present corresponds to the most significant window identified in the study, and spans the 5′ UTR, exon 1 and an intronic region of the TERT gene on chromosome 5. This window had a Bonferroniadjusted SKAT-O p-value of 9.21 × 10 −16 after adjusting for sex and the most strongly associated common variant in the region, rs4449583. The RF:Gini and vi-RF:Gini method identified the same six variants within the aggregate set of 50 variants, three of which were previously identified with RIFT ( Figure 4, Supporting Information Table S1). In addition to the six variants identified by the Gini importance measure, the RF:Accuracy and vi-RF:Accuracy identified an additional three and two variants, respectively. The variant with the largest importance measure (Gini and Accuracy), chr5:1295228, had a probability score of 0.59 for being a regulatory variant from RegulomeDB; it was annotated to an active transcription start site in 28 tissues and as an enhancer in 16 tissues, both lists of tissues included lung. The variant with the second-largest importance measure, chr5:1294612, had a probability score of 0.86 and was annotated to an active transcription start site in 26 tissues and enhancer in 40 tissues, including lung. This information on the potential regulatory elements present at the top variants in the TERT region provides insight into the potential functional studies that could be considered to further elucidate the role of the non-coding variants identified by RIFT and the RF methods. Further, as reported in the RIFT paper, these two top variants have been identified in studies of familial and sporadic melanoma and found to promote tumorigenesis in cancer cell lines (Chiba et al., 2015;Harland et al., 2016;Horn et al., 2013;Huang et al., 2013).
The second window we present spans the third exon and an intronic region of FAM13A on chromosome 4 and had a Bonferroni-adjusted SKAT-O p-value of 0.0001 after adjusting for sex and the most strongly association common variant in the region, rs2609260. For this window, vi-RF:Gini identified nine variants and RF:Gini identified eight variants, whereas the RIFT method identified six IVs ( Figure 5, Supporting Information Table S2). The peak in Gini importance measures identifies the variant chr4:89711672 as having the highest importance value; this variant and the three nearest variants are correlated and F I G U R E 4 Overall MAF, variable importance weighted RF (vi-RF), and rare variant influential filtering tool (RIFT) by genomic position for the IPF-associated rare variant loci on chr5 (bp: 1294397-1295255). Color for the top plot corresponds to SNP-DOC functional annotation. Middle and bottom plot correspond to RIFT and the combined RF:Gini and vi-RF:Gini results, respectively, with red points called outliers by the Inner Tukey method. have pairwise r 2 values between 0.25 and 0.73. The probability score for being a regulatory variant ranged from 0.0 to 0.3 across these four variants from RegulomeDB and three of these variants have unknown function based on SNPDOC annotation.

DISCUSSION
We have demonstrated that the combination of using variable importance measures from RFs with outlier detection methods improved performance in the classification of influential variants over the RIFT delta chi-square score method. In considering very rare variants, the variableimportance weighted RF method using the accuracy importance measure provided a substantial boost in sensitivity without increasing the false positive rate. The standard RF method using the Gini importance measure provided favorable performance for uncommon variants, where it achieved higher sensitivity over that of RIFT while maintaining a similarly low false positive rate. The observation of larger values of the Gini importance measure for rare variants at the higher allele frequency range compared to those at the lower allele frequency range could be due to potential inflation of the importance score as a function of allele frequency in addition to having more power for F I G U R E 5 Overall MAF, variable importance weighted RF (vi-RF), and rare variant influential filtering tool (RIFT) by genomic position for the IPF-associated rare variant loci on chr4 (bp: 89711042-89712372). Color for the top plot corresponds to SNP-DOC functional annotation. Middle and bottom plot correspond to RIFT and the combined RF:Gini and vi-RF:Gini results, respectively with red points called outliers by the Inner Tukey method.
higher allele frequencies. However, the potential impact of that inflation is captured in the false positive rate, which as noted above is among the lowest of the methods we evaluated. Based on our results and the different potential applications of our methods, we do not believe a single method is optimal. Rather, we recognize that studies vary in their design and goals and thus vary in whether higher sensitivity or higher specificity is most important. For example, if the study is exploratory in nature or follow-up work applies high-throughput methods (Inoue & Ahituv, 2015), then applying the vi-RF:Accuracy method to obtain the highest sensitivity might be recommended to avoid excluding potentially important variants. Alternatively, if a study plans to perform expensive individual experimental follow-up of variants, then applying the RF:Gini method to maintain a low false positive rate would likely be recommended. In the absence of an individualized approach, we suggest a combined approach which calls very rare variants based on the variable-importance weighted RF using the accuracy importance and uncommon variants based on the standard RF using the Gini importance measure to achieve the most favorable combination of sensitivity and specificity across the allele frequency spectrum. It is worth noting that despite the simulated larger effect sizes for vary rare variants, the sensitivity to detect these variants is still lower than those of higher frequency. These variants are, by definition, only observed in a small number of individuals for typical sample sizes, resulting in low power to detect associations with disease status or another quantitative outcome. Future work that evaluates additional importance measures or standard variable selection methods that have been developed for genomic data is warranted (Degenhardt et al., 2019). The RF methods provide an objective approach to identifying influential variants that achieve an improved sensitivity over that of RIFT. There is convincing evidence that rare variants driving disease are regularly found outside of the coding region of the genome (Zhang & Lupski, 2015), where functional annotation is scarce. In our targeted resequencing study in idiopathic pulmonary fibrosis, one of the six IVs identified by RIFT was annotated as having unknown function (Blumhagen et al., 2021). In this study, the vi-RF method identified nine variants in a region spanning the intron of the FAM13A gene, six of which were annotated to have unknown function and the remaining three were annotated to be located within the intron. These findings further underscore the need to expand upon methods that are agnostic to functional annotation. Methods which rely on functional information limit the pool of variants to be investigated; agnostic methods which do not rely on any functional information can complement those rare variant localization methods that do rely on functional information. While the false positive rates for RIFT and the RF are markedly lower than that of the true positive rate, it is still possible that identified influential variants in these two regions contain false positives.
Though this work was demonstrated using data simulated with rare variants associated with a binary outcome such as disease status, the flexibility of these methods allow extension to other outcome types, such as a quantitative outcome, and can include a combination of common and rare variants. One of the limitations of the RF methods compared to RIFT is that the RF methods do not easily allow for adjustment of covariates. We recognize this could be a necessary feature in the analysis of rare variant data; for example, being able to identify rare variants driving an aggregate test of association while adjusting for a local admixture and common variants in the region is often desirable. One potential approach to understand how covariates impact the rare variant signal is to perform a stratification analysis in which the RF is applied to the rare variant data stratified by the covariate of interest. We applied the vi-RF:Accuracy method to the IPF resequencing data stratified by sex and stratified by the dosage of the common variant in the region, respectively. The top variant in the TERT region was identified as an IV for all stratified levels: both sexes and all three genotypes of the common variant (Supporting Information Figure   S8). However, most of the other variants identified in the overall analysis were identified in only some strata (for example, among males but not females). It is unclear if these differences are due to the rare variant signal not being present in a particular subgroup or if there is less power in the subset to identify the signal. For example, the top variant for the FAM13A region was identified as an IV among males, but not in females (Supporting Information Figure S9; middle panel). In addition, a variant not identified in the unstratified analysis was identified in a stratified analysis and had a large importance value in the group homozygous for the minor allele at the common variant (Supporting Information Figure S9; bottom panel). Though this analysis is not a definitive test of an interaction, it is suggestive of the presence of an interaction and these results could be hypothesis generating for follow-up studies. If identification of IVs adjusting for a covariate is desired, we recommend using RIFT with SKAT-O, which allows multiple covariate adjustment. Further, it may be necessary to adjust for genetic correlation between subjects if including related individuals or population structure, in which a mixed-effect RF such as MERF or MEIF, could be explored (Hajjem et al., 2014;Saha et al., 2022). We expanded the RIFT R package to enable implementation of the standard RF and the variable importance-weighted RF methods (https://github.com/rachelzoeb/RIFT). This allows the user to apply the method of choice or to follow the suggested combinatorial method based on both RF methods, dependent on MAF.
Aggregation of rare variants can be informed by, or agnostic to, existing functional information. For example, one might limit the analysis to only variants within genes that have been annotated to be putatively damaging (Tin et al., 2018). Alternatively, one might ignore function and simply use a sliding window approach (Bocher & Génin, 2020). We generally recommend performing a gene-based test (aggregating variants within gene or exon boundaries) in addition to window-based test (for example, windows consist of aggregating 50 variants overlapping by 25 variants). Regardless of the approach chosen, aggregate tests often contain a large number of rare variants, even after filtering to predicted pathogenic variants, where follow-up experimental studies of all variants is prohibitive based on cost and time. Identification of influential rare variants following a significant aggregate test allows more targeted validation. The methods we've developed here have the ability to identify variants with as-yet unknown function, providing investigators an additional tool in rare variant association studies.

A U T H O R C O N T R I B U T I O N S
CDL, TEF, and RZB designed the study and developed the conceptual approaches to data analysis; TEF and DAS provided resequencing data. RZB performed the simulations and data analysis; RZB wrote the manuscript; TEF, DAS and CDL reviewed the manuscript.

A C K N O W L E D G M E N T S
We acknowledge the Global IPF Collaborative Network, the members of which contributed many of the DNA samples used to generate the resequencing data that are included in the example.

C O N F L I C T O F I N T E R E S T S TAT E M E N T
Dr. Blumhagen reports salary support on a sponsored research agreement from Eleven P15, Inc., outside the submitted work. Dr. Schwartz is the founder and unpaid chief scientific officer of Eleven P15, Inc., and serves as a consultant for Vertex Pharmaceuticals. Dr. Fingerlin reports consulting fees from Eleven P15, Inc., outside the submitted work and a patent Methods and Compositions for Risk Prediction, Diagnosis, Prognosis, and Treatment of Pulmonary Disorders issued. Dr. Langefeld declares no competing interest.

D ATA AVA I L A B I L I T Y S TAT E M E N T
Summary level data are available on request from the corresponding author. The raw data are not publicly available due to privacy or ethical restrictions.