Efficient algorithms to discover alterations with complementary functional association in cancer

Recent large cancer studies have measured somatic alterations in an unprecedented number of tumours. These large datasets allow the identification of cancer-related sets of genetic alterations by identifying relevant combinatorial patterns. Among such patterns, mutual exclusivity has been employed by several recent methods that have shown its effectiveness in characterizing gene sets associated to cancer. Mutual exclusivity arises because of the complementarity, at the functional level, of alterations in genes which are part of a group (e.g., a pathway) performing a given function. The availability of quantitative target profiles, from genetic perturbations or from clinical phenotypes, provides additional information that can be leveraged to improve the identification of cancer related gene sets by discovering groups with complementary functional associations with such targets. In this work we study the problem of finding groups of mutually exclusive alterations associated with a quantitative (functional) target. We propose a combinatorial formulation for the problem, and prove that the associated computational problem is computationally hard. We design two algorithms to solve the problem and implement them in our tool UNCOVER. We provide analytic evidence of the effectiveness of UNCOVER in finding high-quality solutions and show experimentally that UNCOVER finds sets of alterations significantly associated with functional targets in a variety of scenarios. In particular, we show that our algorithms find sets which are better than the ones obtained by the state-of-the-art method, even when sets are evaluated using the statistical score employed by the latter. In addition, our algorithms are much faster than the state-of-the-art, allowing the analysis of large datasets of thousands of target profiles from cancer cell lines. We show that on two such datasets, one from project Achilles and one from the Genomics of Drug Sensitivity in Cancer project, UNCOVER identifies several significant gene sets with complementary functional associations with targets. Software available at: https://github.com/VandinLab/UNCOVER.


Introduction
Alterations in the green set have high mutual exclusivity but no association with the target profile.Alterations in the orange set have lower mutual exclusivity but strong association with the target profile.Methods that find mutually exclusive sets of alterations without considering the target profile will identify the green set as the most important gene set.the null hypothesis that mutual exclusivity of a gene set is due to the interplay between waiting times to alterations and the time at which the tumor is sequenced.MEMo [17] and the method from [30] employ mutual exclusivity to find gene sets, but use an interaction network to limit the candidate gene sets.The method by [31] and PathTIMEx [32] introduce an additional dimension to the characterization of inter-tumor heterogeneity, by reconstructing the order in which mutually exclusive gene sets are mutated.None of these methods take quantitative targets into account in the discovery of significant gene sets and sets showing high mutual exclusivity may not be associated with target profiles (Fig. 1).[33] recently developed the repeated evaluation of variables conditional entropy and redundancy (REVEALER) method, to identify mutually exclusive sets of alterations associated with functional phenotypes.REVEALER uses as objective function (to score a set of alterations) a re-scaled mutual information metric called information coefficient (IC).REVEALER employs a greedy strategy, computing at each iteration the conditional mutual information (CIC) of the target profile and each feature, conditioned on the current solution.REVEALER can be used to find sets of mutually exclusive alterations starting either from a user-defined seed for the solution or from scratch, and [33] shows that REVEALER finds sets of meaningful cancer-related alterations.
fourfold.First, we provide a rigorous combinatorial formulation for the problem of finding groups of mutually exclusive alterations associated with a quantitative target and prove that the associated computational problem is NP-hard.Second, we develop two efficient algorithms, a greedy algorithm and an ILP-based algorithm to identify the set of k genes with the highest association with a target; our algorithms are implemented in our method fUNctional Complementarity of alteratiOns discoVERy (UNCOVER).Third, we show that our algorithms identify highly significant sets of genes in various scenarios; in particular, we compare UNCOVER with REVEALER on the same datasets used in [33], showing that UNCOVER identifies solutions of higher quality than REVEALER while being on average two order of magnitudes faster than REVEALER.Interestingly, the solutions obtained by UNCOVER are better than the ones obtained by REVEALER even when evaluated using the objective function (IC score) optimized by REVEALER.Fourth, we show that the efficieny of UNCOVER enables the analysis of a large dataset from Project Achilles with thousands of target measurements and tens of thousands of alterations.On such dataset UNCOVER identifies identifies several statistically significant associations between target values and mutually exclusive alterations in genes sets sets, with an enrichment in well-known cancer genes and in known cancer pathways.

Materials and methods
This section describes the problem we study and the algorithms we designed to solve it, that are implemented in our tool UNCOVER.We also describe the data and computational environment for our experimental evaluation.

UNCOVER: Functional complementarity of alterations discovery
The workflow of our algorithm UNCOVER is presented in Fig. 2. UNCOVERtakes in input information regarding 1. the alterations measured in a number of samples (e.g., patients or cell lines), and 2. the value of the target measure for each patient.UNCOVER then identifies the set of mutually exclusive alterations with the highest association to the target, and employs a permutation test to assess the significance of the association.Details regarding the computational problem and the algorithms used by UNCOVER are described in the following sections.The implementation of UNCOVER is available at https://github.com/VandinLab/UNCOVER.

Computational problem
Let J = {j 1 , . . ., j m } be the set of samples and let G = {g 1 , . . ., g n } be the set of genes for which we have measured alterations in J.We are also given a target profile, that is for each sample j ∈ J we have a target value w j ∈ R which quantitatively measures a functional phenotype (e.g., pathway activation, drug response, etc.).For each sample j ∈ J we also have information on whether each g ∈ G is altered or not in j.Let A g be the set of patients in which gene g ∈ G is mutated.We say that a patient j ∈ J is covered by gene g ∈ G if j ∈ A g i.e. if gene g is mutated in sample j.Given a set of genes S ⊂ G, we say that sample j ∈ J is covered by S if j ∈ ∪ g∈S A g .
The goal is to identify a set S of at most k genes, corresponding to k subsets S 1 , S 2 , . . .S k where for each subset S i we have that S i ⊆ J, such that the sum of the weights of the elements covered by S is maximized.We also penalize overlaps between sets when an element is covered more than once by S by assigning a penalty p j for each of the additional times j is covered by S. As penalty we use the positive average of the normalized target values if the original weight of the element was positive.If the original weight of the element was negative we assign a penalty equal to its weight.Let c S (j) be the number of sets in S 1 , . . ., S k that cover element j ∈ J. Therefore for a set S of genes, we define its weight W (S) as: The Target Associated k-Set problem: Given a set J of samples, sets A g1 , . . ., A gn describing alterations of genes G = {g 1 , . . ., g n } in the set J, weights w j and penalties p j > 0 for each sample j ∈ J find the S of ≤ k elements maximizing W (S).
The following results defines the computational hardness of the problem above.
Theorem 1.The Target Associated k-Set problem is NP-hard.
Proof.The proof is by reduction from the Maximum Weight Submatrix Problem (MWSP) defined and proved to be NP-hard in [34].The MSWP takes as input an m × n binary matrix A and an integer k > 0 and requires to find the m × k column sub-matrix M of A that maximizes the objective function |Γ(M )| − ω(M ), where Γ(M ) is the set of rows with at least one 1 in columns of M and ω(M Given an instance of Maximum Weight Submatrix Problem, we define an instance of the Target Associated k-Set as follow: the set of samples J corresponds to the rows of A, the set of genes G corresponds to the columns of A, and the set S g of samples covered by gene g ∈ G is the subset of the rows in which g has a 1.By setting w j = 1 and p j = 1 for all j ∈ J, we have that the objective function of MWSP corresponds to the weight W (S) for the Target Associated k-Set therefore the optimal solution of the Target Associated k-Set corresponds to the optimal solution of MWSP.

ILP formulation
In this subsection we provide an ILP formulation for the Target Associated k-Set problem.Let x i be a binary variable equal to 1 if set i ∈ G is selected and x i = 0 March 28, 2018 5/24 otherwise.Let z j be a binary variable equal to 1 if element j is covered and z j = 0 otherwise.Let y j denote the number of sets in the solution covering element j.Finally, let w j be the weight of element j and p j be the penalty for element j In our ILP formulation, the following constraints need to be satisfied by a valid solution: • the total number of sets in the solution is at most k: i x i ≤ k • for each element j ∈ J we have: y j = i:j∈Si x i • for each element j ∈ J, if j is covered by the current solution then the number of times j is covered in the solution is at least 1: y j ≥ z j • for each element j ∈ J, if j is covered by at least one element in the current solution then j is covered: With the variables defined above, the score for a given solution is z(q) constitutes the objective function of our ILP formulation.

Greedy algorithm
Since solving ILPs can be impractical for very large datasets, we also design a k-stage greedy algorithm to solve the Target Associated k-Set problem.During each stage the algorithm picks 1 set A i to be part of the solution; this is done by first computing the total weight of each subset which is defined as the sum of the weights of its elements W (A i ) = j∈Ai w j .Then the algorithm finds the subset A i of maximum positive weight and adds it to the solution.It may be that at some stage no additional set of positive weight can be selected, in this case, the solution obtained after stage − 1 will be our output.At the end of the iteration the weight of every element j that belonged to the chosen set A i is set to the negative of penalty p j , in order to penalize future selections of the same elements.The greedy algorithm is described in Algorithm 1.
Input: A set of elements J (samples), a class I of subsets of J (genetic alterations) and an integer k (number of alterations we want to find).Each element j ∈ J has an associated weight w j (target profile) and a penalty p j .Output: k subsets, S 1 , S 2 , . . .S k , where each subset selected is a member of I, such that the sum of the weights of the elements in the selected sets is maximized and the overlap between selected sets is minimized.
We note that our greedy algorithm is analogous to the greedy algorithm for the Maximum k-Coverage problem [35] with the difference that rather than eliminating the elements already selected we change their weight to a penalty.Also, assuming it is acceptable to return less than k sets, we only pick a set if it has a positive weight.The March 28, 2018 6/24 running time of the algorithm is O(kmn) where m = number of samples and n = number of alterations.While the greedy algorithm may not return the optimal solution, we prove that it provides guarantees on the weight of the solution it provides.Proposition 1.Let S * the optimal solution of the Target Associated k-Set and Ŝ be the solution returned by the greedy algorithm.Then W ( Ŝ) ≥ W (S * )/k.
Proof.Note that the weight of subsets in the optimal solution W (S * ) can only be lower compared to the original weight of the subsets, since the only weight update operation performed is to substitute positive weights of elements selected with a negative penalty.
The first subset Ŝ1 selected by our algorithm is the set of maximum weight out of all subsets and therefore W ( Ŝ1 ) ≥ W (S * ) for = 1..k.By the pigeonhole principle, one of these subsets in the optimal solution must cover at least W (S * )/k worth of elements.Thus W ( Ŝ1 ) ≥ W (S * )/k.Therefore the first subset selected by the algorithm already gives a 1/k approximation of the optimal solution.In subsequent iterations of the algorithm we only pick additional sets if they have a positive weight so our approximation ratio can only improve.
We also prove that the bound above is tight Proposition 2. There are instances of the Target Associated k-Set such that The proof is in the Supplementary Material.While the proposition above is based on an extreme example, our experimental analysis shows that in practice the greedy algorithm works well and often identifies the optimal solution.We therefore analyze the greedy algorithm under a generative model in which there is a set H of k genes with mutually exclusive alterations associated with the target, while each genes g ∈ G \ H is mutated in sample j with probability p g independently of all other events.We also assume that the weights w j are such that j∈J w j = 0 and for each j : |w j ≤ 1|.(In practice this is achieved by normalizing the target values before running the algorithm, by subtracting to each w j the average value j∈J w j /m and then dividing the result by the maximum of the absolute values of the transformed w j 's.)Note that this last condition implies that |p j | ≤ 1 for all j.We also assume that for genes in H the following assumptions hold: • the set H has an association with the target, i.e.: • each gene of H contributes to the weight of H, i.e. for each S ⊂ H and each The following shows that, if enough samples from the generative model are considered, the greedy algorithm finds the set H associated with the target with high probability.
Proposition 3. If m ∈ Ω k 2 ln(n/δ) samples from the generative model above are provided to the greedy algorithm, then the solution of the greedy algorithm is H with probability ≥ δ.
The proof is in the Supplementary Material.

Statistical significance
To assess the significance of the solution reported by our algorithms we use a permutation test in which the dependencies among alterations in various genes are maintained, while the association of alterations and the target is removed.In particular, a permuted dataset under the null distribution is obtained as follows: the sets A g , g ∈ G are the same as observed in the data; the values of the target are randomly permuted across the samples.
To estimate the p-value for the solutions obtained by our methods we used the following standard procedure: 1) we run an algorithm (ILP or greedy) on the real data D, obtaining a solution with objective function o D ; 2) we generate N permuted datasets as described above; 3) we run the same algorithm on each permuted dataset; 4) the p-value is the given by (e + 1)/(N + 1), where e is the number of permuted datasets in which our algorithm found a solution with objective function ≥ o D .

Data and computational environment
Alteration Data.We downloaded data for the Cancer Cell Line Encyclopedia on 25 th September, 2017 from http://www.broadinstitute.org/ccle.In particular we used the mutation (single nucleotide variants) and copy number aberrations (CNAs) which are derived from the original Cancer Cell Line Encyclopedia (CCLE) mutations and CNA datasets.The file we used is CCLE MUT CNA AMP DEL 0.70 2fold.MC.gct.It consists of a binary (0/1) matrix across 1,030 samples and 48,270 gene alterations in the form of mutations, amplifications and deletions, with a 1 meaning that the alteration is present in a sample, and a 0 otherwise.Target Data.In terms of target values we use the same datasets used by [33] to compare the performance of UNCOVER with REVEALER.In particular we used the following files available through the Supplementary Material of [33]: CTNBB1 transcriptional reporter.gct,which consists of measurements of a β-catenin reporter in 81 cell lines; NFE2L2 activation profile.gct,which includes NFE2L2 enrichment profiles for 182 lung cell lines; MEK inhibitor profile.gct,which contains MEK-inhibitor PD-0325901 sensitivity profile in 493 cancer cell lines from the Broad Novartis CCLE14l; and KRAS essentiality profile.gct,which corresponds to the feature KRAS from a subset of 100 cell lines from the Achilles project dataset.In all these cases we considered the same direction of association (positive or negative) between alterations and the target as in [33].Since our algorithm is very efficient we then decided to run it on a large dataset from Project Achilles (https://portals.broadinstitute.org/achilles),that uses genome-scale RNAi and CRISPR-Cas9 perturbations to silence or knockout individual genes.In particular, we use the whole 2.4.2Achilles dataset (Achilles QC v2.4.3.rnai.Gs.gct) available from the project website.This dataset provides phenotype values for 5711 targets, corresponding to genes silenced by shRNA.The phenotype values correspond to ATARiS [37] gene (target) level scores, quantifying the cell viability when the target gene is silenced by shRNA.These scores are provided for 216 cell lines [19], with 205 of them appearing in CCLE.
Data Preprocessing.To be consistent with REVEALER we discarded features with high or low frequency, in particular features present in less than 3 samples or more than 50 samples were excluded from our analyses.The only exception was the MEK-inhibitor example, where the high frequency threshold was changed to be 100 since the number of original samples was substantially higher (i.e., 493) for this case.From the Achilles dataset we excluded targets that have at least one missing value, in particular in this March 28, 2018 8/24 case we exclude 21 of the 5711 sets of target scores.In all our experiments we normalized the target values before running the algorithm, by subtracting to each weight w j the average value j∈J w j /m and dividing the result by the standard deviation of the (original) w j 's, in order to have both positive and negative target values.
Simulated Data.We investigated how effective UNCOVER is at finding selected alterations in a controlled setting, where the ground truth is known.We generated target values according to a normal distribution with mean 0 and standard deviation 1.
We tested dataset with 200, 600, 1000 and 10000 samples.For each dataset we considered the 38002 gene alterations present in CCLE and for each of them we placed alterations in the samples independently of all other events with the same frequencies as they appear in CCLE.To be consistent with the preprocessing done on rel data we filtered alterations to only have alterations with frequencies between 0.1 and 0.25.We also generated a set T of 5 features to have an association with the target values.This association was varied throughout the experiments to cover different percentages of positive and negative targets.In particular we generated the selected features to cover 100%, 80%, 60%, 40% of the positive target values and 5%, 10%, 15%, 20% of the negative target values respectively, chosing random subsets of samples with positive or negative target values.We will refer to the parameter indicating the percentage of samples with positive target values selected as P and to the parameter for the percentage of samples with negative target values selected as N .We divided the number of targets covered by each of the 5 mutations equally.
Computing Environment and Solver Configuration.To describe and solve an ILP we used AMPL 20150516 and CPLEX 12.6.3.All parameters in CPLEX were left at their default values.We implemented our greedy algorithm in Python 3.6.1.We run our experiments (with the exception of experiments conducted on simulated data) on a MacBook Air with 1.7 GHz Intel Core i7 processor, 8 GB RAM and 500 GB of local storage.In order to make a time comparison with REVEALER we also run the R code provided in [33] on the same machine, using R 3.3.3.All the parameters were left at their given values except for the number of permutations used to calculate their p-value, which we changed in order to compare the running time of the methods excluding the time needed to compute p-values.Experiments on simulated data were conducted on local nodes of a computing cluster.Each node had the following configuration: four 2.27 GHz CPUs, 5.71 GB RAM and 241 GB of storage.

Results and discussion
We tested UNCOVER on a number of cancer datasets in order to compare its results with state-of-the-art algorithms and to test whether UNCOVER allows the analysis of large datasets.In particular we used four datasets described in [33] to compare UNCOVER with REVEALER, and then performed a large scale analysis using targets from the Achilles project dataset and alterations from the Cancer Cell Line Encyclopedia (CCLE).

Comparison with REVEALER
We run the greedy algorithm and the ILP from UNCOVER on the same four datasets considered by the REVEALER publication [33].We used the same values of k used in [33], that is k = 3 for all the datasets, except from the KRAS dataset where k = 4 was used.For each dataset we recorded the solution reported by the greedy algorithm, the solution reported by the ILP, the value of the objective functions for such solutions March 28, 2018 9/24 and the running time to obtain such solutions.For ILP solutions, we also performed the permutation test (see Materials and methods) to compute a p-value using 1000 permutations.The results are reported in Table 1, in which we also show the results from REVEALER (without initial seeds).Fig. 3 shows alteration matrices and the association with the target for the solutions identified by UNCOVER.As expected, the value of the objective function we use is much lower for solutions from REVEALER than for solutions from our algorithm.
We then compared the solutions obtained by our algorihtms with the solutions from REVEALER in terms of the information coefficient (IC), that is the target association score used in [?] as a quality of the solution.Surprisingly, in two out of four datasets our methods, which do not consider the IC score, identify solutions with IC score higher (by at least 5%) than the solutions reported by REVEALER.For the other two cases, in one dataset the IC score is very similar (0.50 vs 0.49) while in the other case the IC score by REVEALER is higher (0.7 vs 0.67) but the solution reported by REVEALER differs from the solution reported by UNCOVER by 1 gene only.Interestingly, the latter is the only case where the solution from the ILP has a p-value > 0.1 (p < 0.03 in all other cases), and therefore the solutions (by our methods and by REVEALER) for such dataset may be, at least in part, due to random fluctuations of the data.
These results show that UNCOVER identifies solutions that better than REVEALER when evaluated using our objective function and also when evaluated according to the objective function of REVEALER with a running time that is on average two orders of magnitude smaller than required by REVEALER.

Analysis of Achilles project data
The efficiency of UNCOVER renders the analysis of a large number of targets, such as the ones available through the Achilles project, possible.We have run the ILP on the whole Achilles dataset for k = 3, looking for both positive and negative association with target values.After preprocessing the dataset included 5690 functional phenotypes as targets, and for each of these the CCLE provides alteration information for We can see that the greedy algorithm identifies the same solution of the ILP based algorithm in three out of four cases, and that the runtime of the ILP and the runtime of greedy algorithm are comparable and very low (< 40 seconds) in all cases.In contrast, the running time of REVEALER is much higher (> 1000 seconds in most cases).(We included all preprocessing in the reported runtimes in table 1 to ensure a fair comparison with REVEALER; not including preprocessing our running times are all under 10 seconds.)Comparing the alteration matrices of the solutions by UNCOVER and the ones of solutions by REVEALER (Supplementary Fig. 7) we note that alterations in solutions by UNCOVER tend to have higher mutual exclusivity and to be more concentrated in high weight samples than alterations in solutions by REVEALER.As expected, the value of the objective function we use is much lower for solutions from REVEALER than for solutions from our algorithm.
We then compared the solutions obtained by our algorihtms with the solutions from REVEALER in terms of the information coefficient (IC), that is the target association score used in [33] as a quality of the solution.Surprisingly, in two out of four datasets UNCOVER, which does not consider the IC score, identifies solutions with IC score higher (by at least 5%) than the solutions reported by REVEALER.For the other two cases, in one dataset the IC score is very similar (0.50 vs 0.49) while in the other case March 28, 2018 10/24 For each of the four targets (NFE2L2 activation, MEK-inhibitor, KRAS essentiality, β-catenin activation) considered in [33], the set of alterations of cardinality k reported by our ILP algorithm, by our greedy algorithm, and by REVEALER (without seeds) is reported.k is chosen as in [33].For each pair (algorithm, target) we also report the (objective) value of our objective function for the solution, the value of the IC score (that is, the objective function used in [33]), and the running time of the algorithm for the target.For solutions found by our ILP we also report the p-value computed by permutation test using 1000 permutations.
the IC score by REVEALER is higher (0.7 vs 0.67) but the solution reported by REVEALER differs from the solution reported by UNCOVER by 1 gene only.Interestingly, the latter is the only case where the solution from the ILP has a p-value > 0.1 (p < 0.03 in all other cases), and therefore the solutions (by our methods and by REVEALER) for such dataset may be, at least in part, due to random fluctuations of the data.
In most cases the solutions by UNCOVER and by REVEALER are very similar, with cancer relevant genes identified by both methods.For NFE2L2 activation, both methods identify KEAP1, a repressor of NFE2L2 activation [38].For MEK-inhibitor, both methods find BRAF, KRAS, and NRAS, three well knwon oncogenic activators of the MAPK signaling pathway, which contains MEK as well.For KRAS essentiality, both methods report mutations in KRAS in the solution.For β-catenin activation, both methods identify CTNNB1 mutations and APC mutations, that is known to be associated to β-catenin activation [39].These results show that UNCOVER identifies relevant biological solutions that are better than the ones identified by REVEALER when evaluated using our objective function and also when evaluated according to the objective function of REVEALER with a running time that is on average two orders of magnitude smaller than required by REVEALER.March 28, 2018 11/24 Results on simulated data For each combination we generated 10 simulated dateset as described in Materials and methods.Each dataset contains a planted set of 5 alterations associated with the target.We used both the greedy algorithm and the ILP from UNCOVER with k = 5 to attempt to find the 5 correct alteration, and evaluated our algorithms both in terms of fraction of the correct (i.e., planted) solution reported and running time.
As shown in Fig. 4, the greedy algorithm is faster than the ILP for all datasets, and the difference in running time increases as the number m of samples increases, with the runtime of the greedy algorithm being almost two order of magnitude smalle than the runtime of the ILP for m = 1000 samples.In addition, for a fixed number of samples and alterations, the running time of the greedy algorithm is constant, that is it does not depend on the properties of the planted solution, while the running time of the ILP varies greatly depending on these parameters.For m = 10, 000 samples the running time of the ILP becomes extremely high, so we restricted to consider only two sets of paramters (p − n = 0.95 and p − n = 0.2).In this case the ILP took between 44 minutes and 7 hours to complete, while the greedy algorithm terminates in 5 minutes.
In terms of the quality of the solutions found, as expected the ILP outperforms the greedy (Fig. 5) but the difference among the two tends to disappear when the number of samples is higher.In addition, since the ILP finds the optimal solution, we can see that for a limited number of samples we may not reliably identify the planted solution with 200 samples unless the planted solution appears almost only in positive targets and in almost all of them (p − n = 0.95), while for m=1000 we can reliably identify the planted solution using both the ILP and the greedy algorithm even when the association with the target is weaker (p − n = 0.6).When m = 10, 000, both the ILP and the greedy algorithm perform well in terms of the quality of the solution: the ILP finds the correct alterations on every experiment and the greedy identifies the whole planted solution in all experiments but one for p − n = 0.2, for which it still reports a solution containing 4 genes in the planted solution.
These results show that for a large number of samples the greedy algorithm reliably identifies sets of alterations associated with the target, as predicted by our theoretical analysis, and is much faster than the ILP.For smaller sample size the ILP identifies better solutions than the greedy and has a reasonable running time.

Analysis of Achilles project data
The efficiency of UNCOVER renders the analysis of a large number of targets, such as the ones available through the Achilles project, possible.After preprocessing the dataset included 5690 functional phenotypes as targets, and for each of these the CCLE provides alteration information for 205 samples and 31137 alterations.In total we have therefore run 10380 instances (i.e., 5690 targets screened for positive and for negative associations) looking for both positive and negative association with target values.Since the number of samples (205) is relatively small, we have run only the ILP from UNCOVER on the whole Achilles dataset and looked for solutions with k = 3 genes.The runtime of UNCOVER to find both positive and negative associations, including preprocessing, is 24 hours.Based on the runtime required on the instances reported in [33] (see the Comparison with REVEALER section), running REVEALER on this dataset would have required about 5 months of compute time.
To identify statistically significant associations with targets in the Achilles project dataset we used a nested permutation test.We first run the permutation test with 10 permutations on all instances (i.e., on all targets for both positive association and negative association).We then considered all the instances with the lowest p-value (1/11) and performed a permutation test with 100 permutations only for such instances.We the iterated such procedure once more, selecting all the instances with lowest p-value (1/101) and performing a permutation test with 1000 permutations only for such instances.For positive association we found 60 solutions with p-value < 0.001, and for negative association we found 102 solutions with p-value < 0.001.The solutions with p-value < 0.001 (with 1000 permutations) are reported in Supplementary Table 2 and 3. See Supplemental Fig. 8 for some corresponding alteration matrices.
The genes in the solutions by UNCOVER with p-value 1/1001 are enriched (p = 2 × 10 −12 by Fisher exact test) for well-known cancer genes, as reported in [11].We also tested whether genes in solutions by UNCOVER (with p-value 1/1001) are enriched for interactions, by comparing the number of interactions in iRefIndex [40] among genes in such solution with the number of interactions in random sets of genes of the same cardinality.Genes in solutions by UNCOVER are significantly enriched in interactions (p = 7 × 10 −3 by permutation test).In addition, the genes in solutions by UNCOVER are also enriched in genes in well-known pathways: 12 KEGG pathways [41] have a significant (corrected p ≤ 0.05) overlap with genes in solutions by UNCOVER and four of these (endometrial cancer, glioma, hepatocellular carcinoma, EGFR tyrosine kinase inhibitor resistance) are cancer related pathways.In addition, the targets (i.e., genes) with solutions of p-value 1/1001 are enriched (p = 10 −3 by permutation test) for interactions in iRefIndex and for well-known cancer genes as reported in [11].These results show that UNCOVER enables the identification of groups of well known cancer genes with significant associations to important targets in large datasets of functional target profiles.For example, for target (i.e., silenced gene) TSG101, related to cell March 28, 2018 13/24  growth, UNCOVER identifies the gene set shown in Figure 6 as associated to reduced cell viability.ERBB2 is a well known cancer gene and CDH4 is frequently mutated in several cancer types, and both are associated to cell growth.

Conclusion
In this work we study the problem of identifying sets of mutually exclusive alterations associated with a quantitative target profile.We provide a combinatorial formulation for the problem, proving that the corresponding computational problem is NP-hard.We design two efficient algorithms, a greedy algorithm and an ILP-based algorithm, for the identification of sets of mutually exclusive alterations associated with a target profile.We provide a formal analysis for our greedy algorithm, proving that it returns solutions with rigorous guarantees in the worst-case as well under a reasonable genarative model for the data.We implemented our algorithms in our method UNCOVER, and showed that it finds sets of alterations with a significant association with target profiles in a variety of scenarios.By comparing the results of UNCOVER with the results of REVEALER on four target profiles used in the REVEALER publication [33], we show that UNCOVER identifies better solutions than REVEALER, even when evaluated using REVEALER objective function.Moreover, UNCOVER is much faster than REVEALER, allowing the analysis of large datasets such as the dataset from project Achilles, in which UNCOVER identifies a number of associations between functional target profiles and gene set alterations.Our tool UNCOVER (as well as REVELEAR) relies on the assumption that the mutual exclusivity among alterations is due to functional complementarity.Another explanation for mutual exclusivity is the fact that each cancer may comprise different subtypes, with different subtypes being characterized by different alterations [27].UNCOVER can be used to identify sets of mutually exclusive alterations associated with a specific subtype whenever the subtype information is available, by assigning high weight to samples of the subtype of interest and low weight to samples of the other subtypes.

Fig 1 .
Fig 1. Identification of mutually exclusive alterations associated with a target profile.Alterations in the green set have high mutual exclusivity but no association with the target profile.Alterations in the orange set have lower mutual exclusivity but strong association with the target profile.Methods that find mutually exclusive sets of alterations without considering the target profile will identify the green set as the most important gene set.

Figure 2 :
Figure 2: Results of UNCOVER on four cancer datasets from [?].(a) Solution found by ILP and greedy for KRAS essentiality target.(b) Solution found by ILP and greedy for -catenin activation target.(c) Solution found by ILP for MEK inihibitor target.(d) Solution found by greedy for MEK inihibitor target.(e) Solution found by ILP and greedy for NFE2L2 activation target.Each panel shows the value of the target (top row) for various samples (columns), with red being positive and blue being negative values.For each gene in the solution, alterations in each sample are shown in dark blue, while samples not altered are in light blue.The last row shows the alteration profile of the entire solution.

Fig 3 .
Fig 3. Results of UNCOVER on four cancer datasets from [33].(a) Solution found by ILP and greedy for KRAS essentiality target.(b) Solution found by ILP and greedy for β-catenin activation target.(c) Solution found by ILP for MEK inihibitor target.(d) Solution found by greedy for MEK inihibitor target.(e) Solution found by ILP and greedy for NFE2L2 activation target.Each panel shows the value of the target (top row) for various samples (columns), with red being positive and blue being negative values.For each gene in the solution, alterations in each sample are shown in dark blue, while samples not altered are in light blue.The last row shows the alteration profile of the entire solution.

Fig 4 .
Fig 4. Running time of UNCOVERon simulated data.The running time (expectation and standard deviation) of the greedy algorithm and of the ILP approach are shown for different number of samples and the difference p − n between the percentage p of samples with positive target and the percentage n of samples with negative target covered by the the correct solution.

Fig 5 .
Fig 5. Quality of solutions of UNCOVERon simulated data.The fraction of genes in the planted (i.e., correct) solution found by the greedy algorithm and by the ILP approach are shown for different number of samples and the difference p − n between the percentage p of samples with positive target and the percentage n of samples with negative target covered by the the correct solution.

Fig 6 .
Fig 6.Solution by UNCOVER for silencing of TSG101 (data from Achilles Project).The alteration matrix of genes in the solution identified by UNCOVER as associated to reduced cell viability is reported.The figure shows the value of the target (top row) for various samples (columns), with blue being negative values (i.e., reduced cell viability) and red being positive values.For each gene in the solution, alterations in each sample are shown in dark blue, while samples not altered are in light blue.The last row shows the alteration profile of the entire solution.

Fig 7 .
Fig 7. Results of UNCOVER and REVEALER on four cancer datasets from [33].(a) Solution found by ILP and greedy for KRAS essentiality target.(b) Solution found by ILP and greedy for β-catenin activation target.(c) Solution found by ILP for MEK inihibitor target.(d) Solution found by greedy for MEK inihibitor target.(e) Solution found by ILP and greedy for NFE2L2 activation target.Each panel shows the value of the target (top row) for various samples (columns), with red being positive and blue being negative values.For each gene in the solution, alterations in each sample are shown in dark blue, while samples not altered are in light blue.The last row shows the alteration profile of the entire solution.
Functional complementarity of alterations discovery.UNCOVER takes in input the alterations information and a target profile for a set of samples, and identifies the set of complementary alterations with the highest association to the target by solving the Target Associated k-Set problem and performing a permutation test.

Table 1 .
Comparison of our algorithms with REVEALER.

Table 2 .
Solutions found when running our ILP algorithm for the Achilles dataset, looking for positive correlation with the target.For each target we report the objective function value for the optimal solution, the set of alterations of cardinality 3 and the p-value computed by permutation test using 1000 permutations

Table 3 .
Solutions found when running our ILP algorithm for the Achilles dataset, looking for negative correlation with the target.For each target we report the objective function value for the optimal solution, the set of alterations of cardinality 3 and the p-value computed by permutation test using 1000 permutations