Gene-Based Methods to Detect Gene-Gene Interaction in R : The GeneGeneInteR Package

GeneGeneInteR is an R package dedicated to the detection of an association between a case-control phenotype and the interaction between two sets of biallelic markers (single nucleotide polymorphisms or SNPs) in case-control genome-wide associations studies. The development of statistical procedures for searching gene-gene interaction at the SNP-set level has indeed recently grown in popularity as these methods confer advantage in both statistical power and biological interpretation. However, all these methods have been implemented in home made softwares that are for most of them available only on request to the authors and at best have a web interface. Since the implementation of these methods is not straightforward, there is a need for a user-friendly tool to perform gene-based gene-gene interaction. The purpose of GeneGeneInteR is to propose a collection of tools for all the steps involved in gene-based gene-gene interaction testing in case-control association studies. Illustrated by an example of a dataset related to rheumatoid arthritis, this paper details the implementation of the functions available in GeneGeneInteR to perform an analysis of a collection of SNP sets. Such an analysis aims at addressing the complete statistical pipeline going from data importation to the visualization of the results through data manipulation and statistical analysis.

identification of susceptibility loci underlying common complex diseases. Single-locus approaches, whereby a large number of single nucleotide polymorphisms (SNPs) are tested independently for association, are usually performed in a first step analysis of GWAS. A large number of softwares have therefore been developed to handle GWAS data and to accomplish whole genome single association testing. When considering that individuals are randomly sampled among healthy and diseased populations, the most popular tools are the independent softwares PLINK (Purcell et al. 2007) and SNPStats (Solé, Guinó, Valls, Iniesta, and Moreno 2006), the R (R Core Team 2020) packages GenABEL (Aulchenko, Ripke, Isaacs, and van Duijn 2007), SNPassoc (Gonzalez et al. 2007), adegenet (Jombart and Ahmed 2011), GWASTools (Gogarten et al. 2012) and postgwas (Hiersche, Rühle, and Stoll 2013) as well as R Bioconductor (Gentleman et al. 2004) packages snpStats (Clayton 2020) and CGEN (Bhattacharjee, Chatterjee, Han, Song, and Wheeler 2019). Although single-locus approaches have been successfully applied in many studies, findings have explained relatively little of the heritability of most complex traits. The "missing heritability" can be partly explained by several features such as the presence of rare variants, the interaction with the environment, the observation of sex-specific effects, the location of SNPs in non-coding regions, the indirect measure of effects through tag-SNP observation, the presence of common variants with truly small effects. In this article we focus on two complementary sources of missing heritability: (1) SNP-set association and (2) SNPxSNP interaction testing for which efforts have been made in the literature.
In SNP-set association testing, all the SNPs within the region of a gene (or more generally a locus) are jointly modeled as a set. Since the gene is considered as the functional unit of the genome, SNP-set testing has gain in popularity the last few years and has been implemented in most of the above cited softwares (PLINK, SNPStats, GenABEL, SNPAssoc, postgwas). Additional softwares have been dedicated to the implementation of SNP-sets methods such as PUMA (Hoffman, Logsdon, and Mezey 2013) and the R packages hapassoc (Burkett, Graham, and McNeney 2006), cpvSNP (McHugh, Larson, and Hackney 2020), SKAT (Lee, Zhao, Miropolsky, and Wu 2020), MixMAP (Matthews and Foulkes 2015) and aSPU (Kwak et al. 2016).
SNPxSNP interaction consists in testing the association between the case-control status and the interaction between two SNPs. Such testing allow the detection of an epistatic effect between two loci which is very often considered as a main source of missing heritability. Therefore, popular softwares have proposed an implementation of SNPxSNP interaction testing in randomly-sampled design (PLINK, GenABEL, SNPStats SNPAssoc and CGEN). Many other tools have also been developed to tackle the issue of detecting SNPxSNP interaction, as, for example, the independent programs BOOST (Wan et al. 2010) and TEAM (Zhang, Huang, Zou, and Wang 2010) as well as the R packages MDR (Winham 2012), logicFS (Schwender 2020) and etma (Lin 2016).
However, none of these softwares and, to our knowledge, no R package has implemented functionality to simultaneously tackle the two objectives of SNP-set association and SNPxSNP interaction into a SNP-set × SNP-set interaction test. This lack of computational tools to test for interaction at the SNP-set level in randomly-sampled case/control designs is a clear limitation in the quest of missing heritability since gene-level testing can help characterizing functional, compositional and statistical interactions (Phillips 2008). By aggregating signals across variants in a SNP-set with respect to the linkage disequilibrium (LD) structure, statistical power is likely to be increased in situations when multiple causal interactions influence the phenotype of interest (Huang, Chanda, Alonso, Bader, and Arking 2011;Wu et al. 2010). Furthermore, if the interacting variants are only tagged, rather than directly observed, genebased tests can aggregate signals from different tag SNPs in partial LD. Finally, the use of the gene as the statistical unit can greatly facilitate the biological interpretation of findings (Jorgenson and Witte 2006;Neale and Sham 2004).
However, from a statistical point-of-view, detecting interaction at the SNP-set level is challenging, thus explaining the lack of computational tools devoted to SNP-set × SNP-set interaction. Tackling the two issues of SNP-set association and interaction indeed requires the simultaneous modeling of the correlation within and between the two SNP sets. Nevertheless, the very recent years have seen the development of statistical methods dedicated to the detection of interaction at the SNP-set level.
These methods can be grouped into two main classes. In a first class, proposed methods aim at modeling the joint distribution of SNPs within and between two genes through principal component analysis (PCA, Li, Tang, Biernacka, and de Andrade 2009), composite linkage disequilibrium (CLD, Rajapakse, Perlman, Martin, Hansen, and Kooperberg 2012), canonical correspondence analysis (CCA, Peng, Zhao, and Xue 2010), kernel canonical correspondence analysis (KCCA, Larson et al. 2014), partial least square path modeling (PLSPM, Zhang et al. 2013) and gene-based information gain method (GBIGM, Li et al. 2015). A second class of methods aims at aggregating interaction tests performed at the SNP level by the minimum p value (minP, Emily 2016), a gene-based association test using extended Simes (GATES, Li, Gui, Kwan, and Sham 2011), and two truncated tests, the truncated tail strength (tTS, Jiang, Zhang, Zuo, and Kang 2011) and the truncated product p values (tProd, Zaykin, Zhivotovsky, Westfall, and Weir 2002). However, all these methods have been implemented in home made softwares that are for most of them available only on request to the authors and at best have a web interface. Thus, searching for gene-gene interaction at the gene level is not straightforward. Furthermore, a comprehensive comparison of such methods, in terms of power and computational performances, remains hardly feasible.
In this paper, we present the R package GeneGeneInteR (Emily, Sounac, Kroell, and Houee-Bigot 2020) that provides an assembly of tools for all the steps involved in gene-based gene-gene interaction testing in case-control association studies when dealing with randomlysampled individuals. Our package, available from Bioconductor (https://bioconductor. org/packages/GeneGeneInteR/) and from GitHub (https://github.com/MathieuEmily/ GeneGeneInteR), proposes a set of functions that allow to (1) download data in various standardized formats (PED, PLINK, VCF), (2) impute missing genotypes, (3) perform a gene-based gene-gene interaction analysis and (4) visualize the results. Section 2 provides an overview of the implementation of the package and aims at summarizing the dependencies to other R packages. In Section 3, technical details and implementation are given regarding the 10 above mentioned statistical procedures (PCA, CLD, CCA, KCCA, PLSPM, GBIGM, minP, GATES, tTS, tProd) proposed in GeneGeneInteR. Section 4 is devoted to the description of the functions implemented in GeneGeneInteR in order to analyze a collection of SNP sets. Through the example of a dataset related to rheumatoid arthritis and composed of 17 genes, we introduce tools for the importation and the manipulation of the data as well as the visualization of the results.

Overview of the implementation of GeneGeneInteR
The implementation of GeneGeneInteR can be divided into three main functionality modules. The first module is a set of functions dedicated to the importation and the manipulation of raw data. This module exploits functionalities attached to the class 'SnpMatrix'. 'SnpMatrix' is a S4 class of objects developed under the snpStats package that is dedicated to large SNP association studies. A detailed example of the functionalities of this module is proposed in Sections 4.1 and 4.2. Our second module consists in functions devoted to the visualization of the results. One functionality of this module allows a network representation, for which functionalities from the igraph package (Csardi and Nepusz 2006) are imported with Gene-GeneInteR. The functionalities of this module are detailed in Section 4.4. Finally, our third module is a set of functions used to perform the statistical analysis. The main function of this module is the function GGI that allows the statistical testing of gene-gene interactions in a set of genes.
The implementation of GGI is displayed in the flowchart of Figure 1 and can be detailed as follows. First, function GGI takes as input three main parameters: the case-control status Y, a 'snpMatrix' object snpX obtained with the functions of our first module and a character string method that indicates which statistical pairwise method has to be performed. The object snpX contains a collection of n genes and the purpose of the GGI function is to iteratively test the n(n − 1)/2 possible pairs of genes with a for loop. For each pair of genes, the pairwise interaction is tested according to the statistical method specified in the method argument. The output of GGI is an object of class 'GGInetwork' defined as an S3 class consisting of a list of 4 elements: statistic, p.value, method and parameter. Using our own class allows us to define our own functionalities for printing, summarizing and plotting the result of the analysis of a set of genes. To this aim, we define specific S3 methods print, summary and plot that are adapted to the type of statistical analysis performed by the function GGI. A detailed example of the use of GGI is provided in Section 4.3.
In GeneGeneInteR, a total of 10 pairwise methods are available corresponding to statistical procedures introduced in the literature. Most of these methods are based on general statistical methods that have to be adapted to (1) the specific context of SNP-set interaction testing and (2) the type of data representation used in GeneGeneInteR. Therefore, the 10 pairwise methods have been implemented in separate functions that have their own dependency to other R packages. For example, if method = "PCA", the pairwise function PCA.test utilizes functionalities from packages FactoMineR (Lê, Josse, and Husson 2008) and snpStats. As usual, the specific numeric results obtained might differ for these multivariate methods depending on the linear algebra library used. For ease of use and convenience, the mandatory input arguments as well as the output argument are identical for each pairwise function: The input contains three mandatory arguments (Y, G1 and G2) and the output is an object of class 'htest'. A formal statistical description as well as examples of usage are provided for each pairwise method in Section 3.

Statistical methods for testing one pair of genes
In this paper, we propose an R package that implements statistical procedures to test for the interaction between two genes in susceptibility with a binary phenotype, typically a case/control disease status. Let Y ∈ {0, 1} be the phenotype, where Y = 0 stands for a control and Y = 1 a case, and X 1 and X 2 be the two genes for which the interaction is tested.
Let consider a sample of n individuals with n c controls and n d cases (n c + n d = n) and Y = [y 1 , . . . , y n ] the vector of the observed binary phenotypes. Individuals are assumed to be randomly sampled among the two populations of cases and controls. Each gene is a collection of respectively m 1 and m 2 SNPs. The observed genotypes for gene X 1 can be represented by a n × m 1 matrix: ij ] i∈{1,...,n};j∈{1,...,m 1 } where x 1 ij ∈ {0; 1; 2} is the number of copies of the minor allele for SNP j carried by individual i. A similar representation is used for gene X 2 where X 2 is a n × m 2 matrix. Let us further introduce X c 1 and X c 2 the matrices of observed genotypes among controls for gene X 1 and X 2 and X d 1 and X d 2 among cases for both genes. Thus X c 1 is a n c × m 1 matrix, X d 1 a n d × m 1 matrix, X c 2 a n c × m 2 matrix and X d 2 a n d × m 2 matrix. A general setup of the observed values can be presented as follows: In our package we propose 10 methods for testing interaction at the gene level. These 10 methods are all based on three main parameters: Y, a numeric or factor vector with exactly two distinct values, G1 and G2, two 'SnpMatrix' objects as proposed by the R Bioconductor package snpStats. 'SnpMatrix' objects have been chosen because of their capacity to handle large datasets. Furthermore, 'SnpMatrix' objects benefit from numerous methods provided by the snpStats package, allowing flexible data manipulation and efficient computation of summary statistics.
In the remainder of this section, we provide details on the methods implemented in our package and highlight the specific parameters used for each method. Our implementation is illustrated by the dataset gene.pair provided with the GeneGeneInteR package. The dataset gene.pair is a case-control dataset containing the genotypes of 8 SNPs within GC gene (object G1 of class 'SnpMatrix') and 4 SNPs within PADI2 gene (object G2 of class 'SnpMatrix'). The dataset gene.pair further includes the disease status in a factor variable Y. In more details, gene.pair contains 247 individuals affected by rheumatoid arthritis (RA) (Y = 1) and 206 individuals not affected by RA (Y = 0).

Length
In models M Inter and M No , P C i X 1 and P C j X 2 are the ith principal component of X 1 and the jth principal component of X 2 . The number of principal components, n 1 and n 2 , kept in the interaction test is determined by the percentage of inertia retrieved by the PCA. Such a percentage is defined by the user and corresponds to the threshold parameter.
In our package, two distinct principal component decompositions are provided by the functions PCA.test via the argument method. With method = "Std", the dataset is standardized using variables' standard deviation while with method = "GenFreq", the dataset is standardized using standard deviation under Hardy-Weinberg equilibrium, as proposed in the snpStats package.
When the percentage of inertia asked by the user is high, the number of PCs can be important and fitting logistic models M Inter and M No is likely to fail. In that case, the number of PCs retained in model (2) is reduced until convergence of the glm function for fitting models M Inter and M No . The following code line illustrates the PCA-based method when the dataset is standardized using standard deviation under Hardy-Weinberg equilibrium: R> PCA.test(Y = gene.pair$Y, G1 = gene.pair$G1, G2 = gene.pair$G2, + threshold = 0.7, method = "GenFreq") Gene-based interaction based on Principal Component Analysis -GenFreq data: gene.pair$Y and (gene.pair$G1 , gene.pair$G2) Deviance = 8.2157, df = 6.0, threshold = 0.7, p-value = 0.2227 alternative hypothesis: true deviance is greater than 0 sample estimates: Deviance without interaction Deviance with interaction 615.2977 607.0821 In the next code line, the dataset is standardized using variables' standard deviation: R> PCA.test(Y = gene.pair$Y, G1 = gene.pair$G1, G2 = gene.pair$G2, + threshold = 0.7, method = "Std) Gene-based interaction based on Principal Component Analysis -Std data: gene.pair$Y and (gene.pair$G1 , gene.pair$G2) Deviance = 8.5074, df = 6.0, threshold = 0.7, p-value = 0.2032 alternative hypothesis: true deviance is greater than 0 sample estimates: Deviance without interaction Deviance with interaction 615.0911 606.5837

Canonical correlation analysis (CCA)
The CCA test is based on a Wald-type statistic defined as in the following Equation 3 (see Peng et al. 2010 for details): where with r d the maximum canonical correlation coefficient between X d 1 and X d 2 and r c the maximum canonical correlation coefficient between X c 1 and X c 2 computed for controls (Y = 0). As suggested by Peng et al. (2010), the variances V(z d ) and V(z c ) are determined by applying a bootstrapping method. The number of bootstrap samples used to estimate V(z d ) and V(z c ) is determined by the n.boot argument (with default n.boot = 500). p value is then obtained by noting that under the null hypothesis U CCA ∼ N (0, 1).
The CCA based gene-gene interaction test is implemented in the CCA.test function and mainly depends on the cancor function from the stats package (R Core Team 2020).
The following code lines illustrate the use of the CCA.test function.

Kernel canonical correlation analysis (KCCA)
The KCCA based test provides a generalization of the CCA test to detect non-linear coassociation between X 1 and X 2 (Yuan et al. 2012;Larson and Schaid 2013) and is based on the following Wald-type statistic: where kz d = 1 2 (log(1 + kr d ) − log(1 − kr d )) and kz c = 1 2 (log(1 + kr c ) − log(1 − kr c )) with kr d the maximum kernel canonical correlation coefficient between X d 1 and X d 2 and kr c the maximum kernel canonical correlation coefficient between X c 1 and X c 2 . Similar to the CCA test, V(kz d ) and V(kz c ) are estimated using bootstrap techniques (Yuan et al. 2012;Larson and Schaid 2013) and the p value is obtained using the standard Gaussian distribution of U KCCA under the null hypothesis. Since the performance of kernel methods strongly relates to the choice of kernel functions, the default is the radial basis kernel function (RBF) owing to its flexibility in parameter specification. However, other kernel functions, such as linear, polynomial or spline kernels, can be used. Thus, in addition to the three arguments Y, G1 and G2, our implementation of the KCCA test proposes two optional arguments: n.boot that determines the number of bootstrap samples and kernel that provides the kernel function to be used. This kernel parameter is a character string matching one of the kernel name provided by the kernlab package (Karatzoglou, Smola, Hornik, and Zeileis 2004) such as "rbfdot", "polydot", "tanhdot", "vanilladot", "laplacedot", "besseldot", "anovadot", "splinedot". The arguments, sigma, degree, scale, offset and order, can also be passed to the kcca.test function in order to parameterize the kernel used in the analysis.
The KCCA based gene-gene interaction test is implemented in the KCCA.test function and mainly depends on the kcca function from the kernlab package. In a second example, KCCA.test is used with a Bessel kernel (with hyperparameters sigma = 0.05, degree = 1 and order = 1) and significance is tested according to the default value of 100 bootstrap replicates.

Partial least square path modeling (PLSPM)
The PLSPM testing has been introduced by Zhang et al. (2013) and is based on the Wald-like statistic defined in Equation 5: where β d (respectively, β c ) is the path coefficient between X d 1 and X d 2 (respectively, X c 1 and X c 2 ). As quoted by Zhang et al. (2013), the distribution of U PLSPM is unknown and significance can be tested using permutations.
The PLSPM based gene-gene interaction test is implemented in the PLSPM.test function and mainly depends on the plspm function from the plspm package (Sanchez, Trinchera, and Russolillo 2017). Besides the three mandatory arguments Y, G1 and G2, the number of permutations can be set by the n.perm argument (default is n.perm = 500).
The following code lines display an example of the use of the PLSPM.test function.

Composite linkage disequilibrium (CLD)
The CLD method, proposed in Rajapakse et al. (2012) is based on the normalized quadratic distance (NQD) and is defined in Equation 6: whereD,C and W are three (m 1 + m 2 ) × (m 1 + m 2 ) matrices of the covariance between the whole set of SNPs that combines SNPs from both genes. More precisely,D andC are defined as follows in Equation 7:D where W 11 (respectively, W 22 ) is the pooled estimate of the covariance matrix for X 1 (respectively, X 2 , D 12 (= D 21 ) and C 12 (= C 21 ) are the sample covariance matrix between the two genes estimated from X d 1 , X d 2 and (X c 1 , X c 2 ) respectively. In more details, the sample covariance matrices in cases, denoted by D, and in controls, denoted by C, can be partitioned in 4 blocks as follows in Equation 8: The pooled estimate of the covariance matrix, W , can thus be obtained by Equation 9: Since the distribution of δ 2 is not known under the null hypothesis, significance testing is performed using permutation tests, as proposed by Rajapakse et al. (2012). Such a test has been implemented in our package in the CLD.test function where the number of permutations is determined by the argument n.perm.
In the next few lines, function CLD.test is illustrated with the gene.pair example.

Gene-based information gain method (GBIGM)
Introduced by Li et al. (2015), the GBIGM method is based on the information gain rate ∆R 1,2 . ∆R 1,2 is defined as in Equation 10: where H 1 , H 2 , H 1,2 are the conditional entropies (given Y) of X 1 , X 2 and the pooled SNP set (X 1 , X 2 ) respectively. Assuming that H(·) is the classical entropy function, we have: Since the distribution of ∆R 1,2 is unknown, the significance testing is performed by permutations as suggested by Li et al. (2015). The GBIGM method has been implemented in the GBIGM.test function and the number of permutations is defined by the argument n.perm.
The following code lines give an example of the GBIGM.test function.

From SNPxSNP interaction to gene-gene interaction testing
This section provides details of the four statistical methods that propose a gene-based test from SNP-based tests (Emily 2016). Rather than considering multiple SNPs in both genes as part of a joint model, these methods aim at aggregating p values obtained at the SNP level into a single p value at a gene level.

Interaction testing at the SNP level
Let consider a pair of SNPs, (X 1,j , X 2,k ) where X 1,j is the jth SNP of gene X 1 and X 2,k the kth SNP of gene X 2 (1 ≤ j ≤ m 1 and 1 ≤ k ≤ m 2 ). To test for interaction at the SNP level, we used the following Wald statistic defined in Equation 14: where β j,k 3 is an estimate of the interaction coefficient β j,k 3 of the logistic model defined in Equation 15: β j,k 3 is obtained by maximizing the likelihood function on the observed data Y, X 1 and X 2 , while σ β j,k 3 is calculated by inverting the Hessian of the likelihood. Since the solution of the maximization of the likelihood function does not have a closed form, we compute W jk according to the iteratively reweighted least squares algorithm proposed in the glm function of the stats package.
To combine the statistics W jk into a single test, four methods have been proposed that all account for covariance matrix (Ma, Clark, and Keinan 2013). As proposed by Emily (2016), the covariance between W jk and W j ,k is estimated by: is the widely used correlation measure between SNP j and SNP j , given that p j and p j are the respective allelic frequencies and p jj is the joint allelic frequency (Hill and Robertson 1968).

minP
The minP test is based on the minimum p value that is often used to combine p values of association (Conneely and Boehnke 2007). Let W max = max (|W 11 |, . . . , |W m 1 ,m 2 |) be the maximum of the absolute observed statistics. The minP is then defined by: where Z = (Z 1 , Z 2 , . . . , Z m 1 m 2 ) is a random vector that follows a multivariate normal distribution Z ∼ N (0, Σ).
The computation of Equation 16 requires the calculation of the probability distribution of a multivariate normal random variable. For that purpose, we used the pmvnorm function from the R package mvtnorm (Genz and Bretz 2009).
To illustrate the minP function, the following code lines test the interaction between the two sets of SNPs in the gene.pair example. The function pmvnorm is applicable to dimensions up to 1000 (Genz and Bretz 2009). Thus, if we consider a first gene having m 1 SNPs and a second gene with m 2 SNPs, the minP test can be performed only if m 1 × m 2 ≤ 1000. Furthermore, it has been shown that the computation of the minP test lacks accuracy when the number of tests is larger than 900 (Conneely and Boehnke 2007). To overcome such a limitation, we propose a two-steps approach to perform the minP procedure. In a first step, each gene (or set of SNPs) is divided into subsets of SNPs so that all pairs of subsets do not contain more than 900 SNP pairs. Formally, genes X 1 and X 2 are split in respectively k 1 and k 2 subsets as follows: The number of subsets per gene (namely k 1 and k 2 ) are chosen to ensure that ∀i ∈ {1, . . . , k 1 } : |X i 1 | ≤ 30 and ∀j ∈ {1, . . . , k 2 } : |X j 2 | ≤ 30, where |X i 1 | and |X j 2 | are the number of SNPs in the subset i from gene X 1 and in the subset j from gene X 2 . By doing so, we guarantee that: To keep the 1-dimensional structure of the data, each gene is split according to a constraint hierarchical clustering based on the pairwise LD between SNPs. To perform such a clustering, we implemented in our GeneGeneInteR package, the chclust function.
In a second step, the minP procedure is applied to each pair of subsets to obtain a vector of k 1 × k 2 p values: [p 11 , . . . , p ij , . . . , p k 1 ,k 2 ], where p ij is the p value given by the minP test based on subsets X i 1 and X j 2 . To correct for multiple testing, a Benjamini-Hochberg correction is applied to the vector [p 11 , . . . , p ij , . . . , p k 1 ,k 2 ] and the minimum of the corrected p values is returned by the minP.test function.
To illustrate such a strategy, consider the interaction between two genes PCSK6 and TXNDC5, where PCSK6 (respectively, TXNDC5) contains 54 (respectively, 36) SNPs. The number of SNPxSNP interaction tests is therefore equal to 54 × 36 = 1944 > 1000, thus preventing the direct use of the function pmvnorm. Therefore, in a first step, constraint hierarchical trees are estimated for each gene and then used to split each gene into SNP subsets. As shown in Figure 2, gene PCSK6 is split into k 1 = 4 subsets and gene TXNDC5 is divided into k 2 = 2 subsets. It can be remarked that the sizes of the subsets for gene PCSK6 are 19, 2, 12 and 21 while equal to 10 and 26 for gene TXNDC5. Therefore, the maximum number of SNPxSNP interaction tests between the two subsets is 33 × 26 = 858 which is lower than 900. In a second step, each pair of subsets is tested for interaction leading to a vector of six p values [p 11 = 0.11, p 21 = 0.87, p 31 = 0.21, p 41 = 0.05, p 12 = 0.30, p 22 = 0.10, p 32 = 0.78, p 42 = 0.30]. These six p values are combined into a single p value by taking the minimum of Benjamini-Hochberg corrected p values. The overall p value is therefore given by 0.2925908.

R> ncol(TXNDC5)
[1] 36 Finally, the testing of the interaction between these two genes can be performed directly with the minP.test function as follows:

GATES
The GATES procedure, proposed by Li et al. (2011), is an extension of the Simes procedure used to assess the gene level association significance. Let p (1) , . . . , p (m 1 m 2 ) be the ascending SNPxSNP interaction m 1 × m 2 p values, GATES p value is then defined in Equation 17: where me is the number of effective tests among the m 1 × m 2 tests and me (i) the number of effective tests among the i most significant tests associated with the lowest order p values p (1) , . . . , p (i) . The number of effective tests ought to characterize the number of independent tests equivalent to the correlated tests that are really performed and is often used to account for dependence in a multiple testing correction.
Although no formal definition of the number of effective tests has been formulated in the literature, several procedures have been proposed to estimate such number. All methods are based on a transformation of the set of eigenvalues of the SNP covariance matrix assuming that (1)

tTS and tProd
The tTS and tProd procedures are two truncated tail strength methods that aim at combining signals from all single-SNP p values less than a predefined cutoff value (Jiang et al. 2011). Denoting by τ the cutoff value, the two truncated p values are defined as follows (Zaykin et al. 2002): where I is the indicator function. When p values are correlated, the null distributions of tTS and tProd are unknown. Following the approach proposed by Zaykin et al. (2002), a p value is obtained in the GeneGeneInteR package by computing an empirical null distribution using Monte Carlo (MC) simulations. For each MC iteration, an empirical value for tTS (or tProd) is obtained by simulating a vector of W jk with respect to a multivariate normal distribution with a vector of 0 means and Σ as covariance matrix. The empirical p value is calculated as the proportion of simulated statistics larger than the observed statistic on the "true" set of W jk . The tTS and tProd methods have been implemented in the functions tTS.test and tProd.test of the GeneGeneInteR package. Additional to the mandatory Y, G1 and G2 arguments, these two functions have two optional arguments: tau and n.sim that control the cutoff value and the number of simulations used to estimate the empirical value respectively. The following code lines give an example of the use of the tTS.test and tProd.test functions:

Analysis of a set of genes
In practice, it is very common to investigate the interaction between a collection of k > 2 genes. In this context, our package GeneGeneInteR proposes several methods to perform a complete pipeline of analysis, going from data importation to results visualization via data manipulation and statistical analysis. In the remainder of this section, we illustrate our pipeline through the analysis of a case-control dataset publicly available in the NCBI repository GSE39428 series (Chang, Xu, Wang, Wang, Wang, and Yan 2013). The dataset contains the genotypes of 312 SNPs from 17 genes in a total of 429 patients (266 individuals affected by theumatoid arthritis and 163 healthy controls) and is attached to our package GeneGeneInteR as an external file in PED format supported by the PLINK software (Purcell et al. 2007).
In the following, the description of the complete statistical pipeline is decomposed into four main parts. In Section 4.1, we first describe the importation of the genotype and the phenotype data as proposed in our GeneGeneInteR package. Then, in Section 4.2, we focus on the various functions dedicated to the manipulation of the genotype data. In Section 4.3, we detail the statistical analysis performed on a set of genes. Finally, in Section 4.4, we introduce two main functionalities for visualizing the obtained results.

Genotype and phenotype importation
At first, the path for the files containing genotype data and information regarding the SNP set are loaded.

Case-control status importation
Similar to functions introduced to analyze a single pair of genes (see Section 3), the casecontrol status is stored in a numeric or a factor vector with exactly two distinct values. If the phenotype is saved in a separate file in table form, it can thus be imported simply by using the read.table function such as: R> Y <-read.table(system.file(file.path("extdata", "response.txt"), + package = "GeneGeneInteR"), sep = ";") If the case-control status is provided in the PED file, it can be obtained as follows: R> Y <-data$status

Genotype data manipulation
Once genotype data have been imported, SNPs can first be selected in a filtering step in order to improve the quality of the data or to focus on particular genomic regions. Furthermore, we developed functions to allow the imputation of missing genotypes. These two parts are detailed in the following paragraphs.

Data filtering
Before performing the statistical analysis, it is very common to remove some SNPs in order to improve the quality of the data. Such a cleaning step can be performed in our GeneGeneIn-teR package by using the function snpMatrixScour. snpMatrixScour aims at modifying a 'SnpMatrix' object by removing SNPs that do not meet criteria regarding the minor allele frequency (MAF), deviation to the Hardy-Weinberg equilibrium (HWE) and the proportion of missing values. In the following example, SNPs with MAF lower than 0.05 or SNPs with p value for HWE lower than 0.001 or SNPs with a call rate lower than 0.9 are removed from the object data.

R> data$snpX
A SnpMatrix with 429 rows and 209 columns Row names: H97 ... RA345 Col names: rs10510123 ... rs4328262 Since the use of stringent filters could lead to the elimination of all SNPs within a gene, care has to be taken during the filtering step. However, in such a situation a gene without SNPs is removed from the dataset and a warning message is provided for the user.
In other situations, the user might be interested in performing the analysis on a predefined subset of SNPs. For that purpose, the selectSnps function provides three options to extract a collection of SNPs by specifying the argument select that should be one of the following: • a numeric vector with only the column number in the 'snpMatrix' (or row number for genes.info) of each selected SNP. The following code line allows the extraction of the 10 first SNPs: R> selec <-selectSnps(data$snpX, data$genes.info, select = 1:10) • a character vector with the names of each selected SNP or each selected gene. The following example is used to extract genes DNAH9 and TXNDC5: R> selec <-selectSnps(data$snpX, data$genes.info, c("DNAH9", "TXNDC5")) • a character vector which elements are position bounds of genes. Each element of the vector is either of the form "begin:end", or "chr:begin:end" if you have to specify the chromosome of the gene. The following code allows to select SNPs from position 101342000 to 101490000 on chromosome 15: R> selec <-selectSnps(data$snpX, data$genes.info, + c("15:101342000:101490000"))

Genotype imputation
Since our pipeline of analysis does not handle missing values, SNPs filtering as well as SNPs selection can help removing missing data. As described in the previous section, this can be done easily by applying the snpMatrixScour function with argument call.rate = 1. However, in this case, SNPs with an acceptable call rate are also removed and the lost information is likely to be critical. Genotype imputation is then commonly performed to keep most of the informative SNPs in the dataset. Since our genotype data are stored into a 'SnpMatrix' object, we implement the imputeSnpMatrix function that wraps functions snp.imputation and impute.snps from package snpStats. Our imputeSnpMatrix function mimics a leave-one-out process where missing SNPs are imputed for an individual based on a model trained on all other individuals.
In our example, the following code lines show that after the filtering step, 844 missing values still remain in the dataset.
R> sum(is.na(data$snpX)) [1] 844 To impute those missing values, we used our imputeSnpMatrix function as follows: R> data <-imputeSnpMatrix(data$snpX, data$genes.info) A simple check of the dataset shows that all missing values have been imputed: R> sum(is.na(data$snpX)) [1] 0 When the amount of missing values is so large that snp.imputation is not able to find a rule of imputation, some missing values may remain. In this case, the user can specify the action to be done thanks to the om.rem argument: • om.rem = "none": leave the dataset as it is, • om.rem = "SNP": remove all SNPs with remaining missing values, • om.rem = "ind": remove all individuals with remaining missing values.
It is noteworthy that removing all SNPs is often more parsimonious than removing individuals and allows to get a dataset without any missing values with minimum information loss.
Although, function snp.imputation can calculate accurate rules for imputation, we encourage the user to first impute missing genotypes with an external software (such as IMPUTE2; Howie, Donnelly, and Marchini 2009) prior to the importation step.

Statistical analysis
The statistical analysis of a set of genes, as implemented in the GGI function, consists in performing all possible pairwise tests between two genes. Pairwise tests are conducted by using the method argument with one of the 10 methods detailed in Section 3. The GGI function takes two further mandatory arguments: Y the vector of case-control status and snpX, a 'SnpMatrix' object that stores the genotypes for all SNPs. It is assumed that SNPs within the same gene are consecutive in the snpX argument. Furthermore, gene information, such as gene ordering and the number of SNPs within each gene, has to be provided either in the genes.length or in the gene.info argument.
The following code line allows the computation of all pairwise tests between the 17 genes of our example dataset with the PCA-based method.
R> class(GGI.res) [1] "GGInetwork" The class 'GGInetwork' is an S3 class based on a list of 4 elements statistic, p.value, method and parameter. When method = "PCA", a fifth element, called df, is added to the 'GGInetwork'.
R> names(GGI.res) [1] "statistic" "p.value" "df" "method" "parameter" Elements statistic, p.value and statistic, df are square matrices with M rows and M columns where M is the number of genes in the dataset. The elements of each matrix are the statistic, the p value and the degrees of freedom of the likelihood ratio test, respectively. The element method is the name of the method used to perform the pairwise interaction tests. Finally, the element parameter is a list of the parameters used to perform the pairwise interaction tests.
As example, the GGI.res object generated in the previous example is a list of 5 elements. Significant results can be summarized using the S3 method summary for class 'GGInetwork'. The method summary prints out the pairs of genes with an interaction p value lower than 0.05 after (1) no correction (2)

Visualization
The results are visualized with the S3 method plot for class 'GGInetwork'. Given an object of class 'GGInetwork' obtained from the analysis of M genes with our GGI function, results can be visualized through two types of representation: an heatmap-like visualization with the method = "heatmap" argument and a network-like representation with the method = "network" argument.

Heatmap-like visualization
The plot method can be used with the 'GGInetwork' object as the single input argument. Figures 3(a) and (b) show the graphical representation where all pairwise interactions are plotted (Figure 3(a) created with plot(GGI.res)) or only the interaction between the 3 genes CA1, Gc and PADI1 with the argument genes (Figure 3 (b) created with plot(GGI.res, genes = c("CA1", "Gc", "PADI1"))). When the number of genes is below 15, p values and names are drawn to make matrix reading easier (see Figure 3(b)). However, when the number of genes is larger than 15, p values are not drawn and if the number of genes is even larger than 25, none of the p values or the gene names are displayed (see Figure 3(a)). In this case, by setting the argument interact = TRUE, the user can start an interactive process that allows to click on a cell of interest to open a tooltip displaying which genes are involved in the selected interaction and the p value of the interaction test. Tooltips can be closed if the user clicks anywhere else than on a cell. This process stops when the user presses the escape button (or terminates the locator procedure in general) or when the user clicks on any place other than a cell when no tooltip window is open. Several arguments can further be specified to customize the output graphics such as colors (arguments col and NA.col), width of the bar for colors (argument colbar.width), and titles (argument title). Users can also decide whether p values (argument draw.pvals) and gene names (argument draw.names) should be drawn and they may disable the interactivity of the plot (argument interact). To further improve plot clarity and hence allowing a better interpretation of the results, (1) genes can be ordered according to a hierarchical clustering (argument hclust.order), (2) p values can be reported in −log10 scale (argument use.log) and (3) a threshold can be applied to the p values in order to distinguish between significant and non-significant interactions (argument threshold).  R> plot(GGI.res, col = c("black", "cyan", "white"), colbar.width = 0.25, + title = "Interaction between 17 genes", hclust.order = TRUE, + use.log = TRUE, threshold = NULL, NA.col = "#D3D3D3", + draw.pvals = FALSE, draw.names = TRUE, interact = FALSE) R> plot(GGI.res, col = c("black", "cyan", "white"), colbar.width = 0.05, + title = "Interaction between 17 genes", hclust.order = TRUE, + use.log = FALSE, threshold = 0.05, NA.col = "#D3D3D3", + draw.pvals = FALSE, draw.names = TRUE, interact = FALSE)

Network-like visualization
The plot function with method = "network" aims at drawing a graph between genes where two genes are adjacent if the p values between these two genes is below a given threshold (argument threshold with a default value equal to 0.05). The display of the network is performed by utilizing the graph_from_data_frame function from the igraph R package.
Two additional arguments can be used to customize the network. First, user can focus on a specific subset of genes with the argument genes and genes not linked to other genes can be removed from the graph with argument plot.nointer. and displays the default network obtained with all genes. In Figure 5(b), a subset of only 12 genes have been selected to be the vertices of the graph using R> set.seed(1234) R> plot(GGI.res, method = "network", genes = c("bub3", "CDSN", "Gc", "GLRX", + "PADI1", "PADI2", "PADI4", "PADI6", "PRKD3", "PSORS1C1", "SERPINA1", + "SORBS1"), threshold = 0.05, plot.nointer = FALSE) However, genes bub3 and PADI1 do not have a p value below the threshold of 0.05 with any of the other selected genes. Since the argument plot.nointer is set to TRUE, the two genes bub3 and PADI1 are not drawn in the resulting network.

Discussion
This article presents the R package GeneGeneInteR dedicated to the detection of an interaction between SNP sets in case-control genome-wide association studies. The package includes the GGI function performing all pairwise tests between two SNP sets and producing output of the class 'GGInetwork'. The GGI function takes two main arguments: method and snpX, a 'SnpMatrix' object. The method argument allows the user to specify the statistical procedure used for pairwise interaction testing as one of the ten methods implemented in GeneGeneInteR. Furthermore, several methods have been implemented in Gene-GeneInteR to import, manipulate and impute missing values for the snpX objects. Methods summary and plot, associated with objects of the class 'GGInetwork', are also proposed. The plot method allows the production of heatmap or network visualization of the statistical interaction between SNP sets. The R package GeneGeneInteR is available from Bioconductor (https://bioconductor.org/packages/GeneGeneInteR/) and from GitHub (https://github.com/MathieuEmily/GeneGeneInteR) .
Since our implementation relies on 'SnpMatrix' objects, GeneGeneInteR could in principle handle large datasets. However, in practice, testing for pairwise interaction remains a combinatorial burden and highly computationally efficient methods are needed to perform largescale analysis. In GeneGeneInteR, some methods are still computationally intensive (KCCA, PLSPM, GBIGM for example) so that their use is not recommended for large datasets. Possible extensions of GeneGeneInteR could consist in speeding up the execution time either by using parallel processing with the package parallel (R Core Team 2020) or by porting existing R code to C++ with Rcpp (Eddelbuettel and François 2011). Another limitation raised by the analysis of large scale datasets is the visualization of the results. However, we proposed several options to the user in order to focus on specific parts of a big network. Furthermore, the visualization of a big network provides information regarding a global pattern of relationships between SNP sets. Since the main purpose of the package GeneGeneInteR is to search for pairwise interactions, the detection of specific patterns in a network is beyond the scope of this work.
Finally, one of the main advantages of the GeneGeneInteR package lies in the use of a common pipeline of analysis for a collection of statistical procedures. Thus, GeneGeneInteR can easily be extended with the inclusion of novel statistical procedures to detect pairwise interaction between SNP sets. Further extensions could include the possibility to search for high-order of interactions (such as three-way interaction) between SNP sets.