Decoding cancer heterogeneity: studying patient-specific signaling signatures towards personalized cancer therapy

The past years have witnessed a rapid increase in the amount of large-scale tumor datasets. The challenge has now become to find a way to obtain useful information from these masses of data that will allow to determine which combination of FDA-approved drugs is best suited to treat the specific tumor. Various statistical analyses are being developed to extract significant signals from cancer datasets. However, tumors are still being assigned to pre-defined categories (breast luminal A, triple negative, etc.), conceptually contradicting the vast heterogeneity that is known to exist among tumors, and likely overlooking unique tumors that must be addressed and treated individually. We present herein an approach based on information theory that, rather than searches for what makes a tumor similar to other tumors, addresses tumors individually and unbiasedly, and impartially decodes the critical patient-specific molecular network reorganization in every tumor. Methods: Using a large dataset obtained from ~3500 tumors of 11 types we decipher the altered protein network structure in each tumor, namely the patient-specific signaling signature. Each signature can harbor several altered protein subnetworks. We suggest that simultaneous targeting of central proteins from every altered subnetwork is essential to efficiently disturb the altered signaling in each tumor. We experimentally validate our ability to dissect sample-specific signaling signatures and to rationally design personalized drug combinations. Results: We unraveled a surprisingly simple order that underlies the extreme apparent complexity of tumor tissues, demonstrating that only 17 altered protein subnetworks characterize ~3500 tumors of 11 types. Each tumor was described by a specific subset of 1-4 subnetworks out of 17, i.e. a tumor-specific altered signaling signature. We show that the majority of tumor-specific signaling signatures are extremely rare, and are shared by only 5 tumors or less, supporting a personalized, comprehensive study of tumors in order to design the optimal combination therapy for every patient. We validate the results by confirming that the processes identified in the 11 original cancer types characterize patients harboring a different cancer type as well. We show experimentally, using different cancer cell lines, that the individualized combination therapies predicted by us achieved higher rates of killing than the clinically prescribed treatments. Conclusions: We present a new strategy to deal with the inter-tumor heterogeneity and to break down the high complexity of cancer systems into simple, easy to crack, patient-specific signaling signatures that guide the rational design of personalized drug therapies.

Surprisal analysis is a thermodynamic-based information-theoretic approach [1][2][3]. The analysis is based on the premise that biological systems reach a balanced state when the system is free of constraints [4][5][6]. However, when under the influence of environmental and genomic constraints, the system is prevented from reaching the state of minimal free energy, and instead reaches a state which is higher in free energy (in biological systems, which are normally under constant temperature and constant pressure, minimal free energy equals maximal entropy).
For example, if the system under study is a living cell, an environmental constraint can be exposure to a drug, which inflicts a change in protein concentrations and activities in the cell. The system can be influenced by genomic constraints as well, such as genomic mutations that in turn affect protein function, often eliciting alteration of specific signaling pathways to oppose the functions of the damaged protein.
Surprisal analysis can take as input the expression levels of various macromolecules, e.g. genes, transcripts, or proteins. However, be it environmental or genomic alterations, it is the proteins that constitute the functional output in living systems, therefore we base our analysis on proteomic data. The varying forces, or constraints, that act upon living cells ultimately manifest as alterations in the cellular protein network. Each constraint induces a change in a specific part of the protein network in the cells. The subnetwork that is altered due to the specific constraint is termed an unbalanced process. System can be influenced by several constraints thus leading to the emergence of several unbalanced processes. When tumor systems are characterized, the specific set of unbalanced processes is what constitutes the tumor-specific signaling signature.
For every protein, i, surprisal analysis calculates its expected expression level when the system is at the steady state and free of constraints: Xi 0 . This term was shown to be constant, i.e. is independent of time and of the actual state of the system [7][8][9][10]. In terms of information theory, Xi 0 represents the state of maximal entropy, or minimal information.
Xi(k) is the actual, experimentally measured expression level of the protein i in the tumor k. In cases where Xi(k)≠ Xi 0 , we assume that the expression level of the protein i was altered due to constraints that operate on the system. Surprisal analysis discovers the complete set of constraints operating on the system in any given tumor, k, by utilizing the following equation [7]: ln Xi(k) = ln Xi 0 (k) -ΣGiαλα(k) (1) The term ΣGiαλα(k) represents the sum of deviations in expression level of the protein i due to the various constraints, or unbalanced processes, that exist in the tumor k. The processes are indexed α =1, 2, 3, … , such that the dominance of the process decreases with increasing index, i.e. unbalanced process 1 is active in more tumors than unbalanced processes 2, 3 etc.
The difference between the steady state expression level, Xi 0 , and the actual expression level, Xi(k), represents the amount of information we have about protein i. A protein that is influenced by constraints, i.e. is influenced by one or more unbalanced processes and is therefore functionally linked to other proteins, cannot take any possible expression level. Rather, its expression level is affected by the expression levels of other proteins.
The term Giα denotes the degree of participation of the protein i in the unbalanced process α, and its sign indicates the correlation or anti-correlation between proteins in the same process. For example, in a certain process α, proteins can be assigned the values: Gprotein1,α = -0.50, Gprotein2,α = 0.24, and Gprotein3,α = 0.00, indicating that this process altered proteins 1 and 2 in opposite directions (i.e. protein 1 is upregulated and protein 2 is downregulated, or vice versa due to the process α), while not affecting protein 3. Note that each protein can take part in a number of unbalanced processes at once. Importantly, not all processes are active in all tumors. The term λα(k) represents the importance of the unbalanced process α in the tumor k. Its sign indicates the correlation or anti-correlation between the same processes in different Surprisal analysis uses a covariance matrix of the surprisals, meaning the covariance of the natural logarithms, as an intermediate numerical step (11). This has a physical, rather than purely statistical, significance [7]. That matrix has a mathematical form dictated by the theory, which relates changes in protein concentrations to changes in chemical potentials (see mathematical explanation below). A change in chemical potential is directly related to the change in the free energy of the system. Therefore, changes in chemical potential of molecules allow us to predict the direction of biochemical reactions, and the behavior of the system [11]. Hence, our method allows interpretation of protein expression level changes from a different angle, i.e. according to thermodynamic-based laws. We have recently demonstrated that the approach allowed us to predict the direction of cell-cell movement [12,13], or the response to drugs [14]. Furthermore we have recently compared surprisal analysis to the pure statistical techniques widely used in biology, such as Principle component analysis and clustering methods, and have found that surprisal analysis provided a higher degree of resolution of biological cancer heterogeneity [9]. The detailed explanation about the main differences between surprisal analysis and other techniques can be found in [15].
A change in chemical potential for each protein is calculated using ln Xi(k) = ln Xi 0 (k) + (µi-µi 0 )/kT. This equation is developed further, to obtain the main equation in the text: (1) ln Xi(k) = ln Xi 0 (k) -ΣGiαλα(k). Thus, a change in the chemical potential of protein i, (µi-µi 0 )/kT, is represented by ΣGiαλα(k), denoting that the change in chemical potential occurred due to genomic and environmental constraints that operate in the tumor, indexed α = 1, 2, … [3,12]. The method of Lagrange undetermined multipliers is utilized to calculate the maximal entropy and to determine the λα(k) values.
To fit the sum of the terms, as shown on the right-hand side of the main equation ( [1]), to the logarithm of the measured expression level of protein i in a tumor k we use singular value decomposition (SVD). SVD serves as a method for diagonalizing the matrix that includes the logarithm of protein intensities in tumor k. In other words, SVD is used as a fitting method to bridge between the theory: ln Xi(k) = ln Xi 0 (k) -ΣGiαλα(k); and experimental protein expression levels [7].
We determine the minimal number of unbalanced processes needed to accurately reconstruct the experimental protein expression levels, as described in the next section. For more details regarding the mathematical analysis see references [7] and [12]. For comparison between surprisal analysis and common statistical methods used today, such as Principle component analysis and clustering methods see [15,16].
As explained above, the zeroth term, ln Xi 0 (k), is the expression level at the steady state of the system (the most significant process), invariant term, which was found not to change between patients or in time [7][8][9][10]. This term is utilized as a reference against which the deviation terms are identified. In the dataset analyzed in this manuscript, the expression levels of the proteins were normalized according to the median values. Since steady state distribution is composed of these reference values (ln Xi 0 (k)), mean/median centering of the data does not allow to determine this distribution explicitly. The zeroth term becomes a vector of zeros for all proteins i in all samples, and the dataset is fitted to the unbalanced processes. Equation (1) is reduced to the form: where M represents the median value.
Theoretically, the number of constraints that operate on the system is limited by the smaller dimension of the input matrix [7]. In our case it is limited by the number of macromolecules measured, i.e. in the current dataset, the levels of 181 proteins were measured, and therefore a maximum of 181 constraints, or unbalanced processes that operate on the system could have been found. However, most of these processes are insignificant, i.e. have a negligible weight, λα, in all samples. Our analysis revealed only 17 significant unbalanced processes, which reproduce the experimental expression levels of the proteins in the dataset (Fig. S2). Figure S4 demonstrates the robustness of surprisal analysis, and shows that the size of the dataset studied here is large enough. A smaller dataset, of 100 tumors of each type, did not change the results of our analysis.
Determination of the number of significant unbalanced processes. (see [15] for more details). The analysis of the 3467 patients provided a 3467x181 matrix of λα(k) values, such that every row in the matrices contained 3467 values of λα(k) for 3467 patients, and each row corresponded to an unbalanced process (Table S2). However, not all unbalanced processes are significant. Our goal is to determine how many unbalanced processes are needed in order to reconstruct the experimental data, i.e. for which value of n: ln (Xi(k)/M) ≈ -ΣGiαλα(k). To find n, we performed the following two steps: (1) Processes with significant amplitudes were selected: To calculate threshold limits for λα(k) values (presented in Table S2 and Fig. S5) the standard deviations of the levels of 10 most stable proteins in this dataset were calculated (e.g. those with the smallest Standard deviations values, such as Annexin VII). Those fluctuations were considered as baseline fluctuations in the population of the patients which are not influenced by the unbalanced processes. Using standard deviation values of these proteins the threshold limits were calculated as described previously [17]. The analysis revealed that from α = 18, the importance values, λα(k), become insignificant (i.e. do not exceed the noise threshold), suggesting that 17 unbalanced processes are enough to describe the system.
(2) Reproduction of the experimental data by the unbalanced processes was verified: To verify that the number of processes identified in step (1) is correct, we plotted ΣGiαλα(k) for α =1, 2, … , n against ln Xi(k) for different proteins, i, and for different values of n, and examined the correlation between them as n was increased. An unbalanced process, α = n, was considered significant if it improved the correlation significantly relative to α = n -1. Generation of functional subnetworks.
The functional subnetworks presented in Figure S1 were generated using a python script (written with the assistance of Mr. Jonathan Abramson). The goal was to generate a functional network according to STRING database, where proteins with negative G values are marked blue and proteins with positive G values are marked red, to easily identify the correlations and anti-correlations between the proteins in the network. The script takes as an input the names of the genes in the network and their G values, obtains the functional connections and their weights from STRING database (string-db.org), and then plots the functional network (using matplotlib library).
Note: Since the antibodies against pY(1248)ErbB2 and pY(1068)EGFR were noticed to cross react in the original RPPA assay [18], the following corrections were made to our analyses: In unbalanced processes in which both pY(1248)ErbB2 and pY(1068)EGFR were significant, EGFR was considered to be active only if pY(1173)EGFR was significant as well. pY(1248)ErbB2 was considered to be a false-positive result and was thus omitted from these processes. Therefore, pY(1248)ErbB2 was omitted from the unbalanced processes 1, 5,10,13,14   .

Calculation of barcodes.
The barcodes presented in Table S4, Figures 4-7 and Figure S6 were generated using a python script (written with the assistance of Mr. Jonathan Abramson). For each patient, λα(k) (α = 1, 2, 3, …, 17) values were normalized as follows: If λα(k) > 2 (and is therefore significant according to calculation of threshold values) then is was normalized to 1; if λα(k) < -2 (significant according to threshold values as well) then it was normalized to -1; and if -2 < λα(k) < 2 then it was normalized to 0.  Figure S6.
Cell lines and reagents.

Methylene blue assays.
Cells were seeded in 96-well plates and grown under complete medium. The next day, inhibitors were added for 72 hours. The cells were then fixed and the amount of surviving cells was quantified by staining with methylene blue (Sigma) as described in [19].

Western blot analysis.
Cells were seeded in 6-well plates and grown under complete medium. The next day, cells were treated as indicated for 48 hours, and then the cells were collected into clean tubes (including the fraction of floating dead cells) and lysed using hot sample buffer (10% glycerol, 50 mmol/L Tris-HCl pH 6.8, 2% SDS, and 5% 2-mercaptoethanol). Western blot analysis was carried out as described previously [20]. Figure S1.  Importantly, processes =16 and =17 had significant λα(k) values (λα(k) > 2 or λα(k) < -2) in some of the patients (see.

Supplementary Figures and legends
For example, Fig. S5) and thus had to be considered. In contrast, processes α = 18, 19, … , 181 did not have significant values in any of the patients. Moreover, the plot of the data reproduction did not change significantly when the processes α = 18, 19, 20, 21 were added (see panel C vs. panel D). Thus, we conclude that the current dataset is well characterized by 17 unbalanced processes. The proteins that take part in the different unbalanced processes were identified as follows: For every unbalanced process α, Giα values were sorted according to their size, and only proteins with significant Giα values were considered to participate in the unbalanced process α. This is exemplified for the process α = 1 in the figure. Shown are sorted values of Gi1, which represent the degree of participation of every protein i in the unbalanced process α = 1. The gray box represents threshold values. Proteins with Gi1 > 0.1 or Gi1 < -0.1 (which are not contained in the gray box and form the top and bottom "tails" of the distribution) were considered to participate the most in the unbalanced process α = 1. These proteins were used to build a functional subnetwork using STRING database, presented in Figure S1.     The amplitudes of the three most significant unbalanced processes, λ1(k), λ2(k), and λ3(k), were plotted against each other in pairs in order to examine the information that these plots can provide. Some cancer types are nicely separated, e.g. GBM, KIRC and BRCA by λ2 (panels a and b) and λ3 (panels b and c). However, though some cancer types can be separated to some extent according to these 3 processes, representation of tumors using these 2D graphs does not enable the accurate mapping of individual tumors. Many tumors with distinct protein network structures appear in the same place in these graphs, e.g. most UCEC and HNSC tumors, which we found to be highly heterogeneous. These results demonstrate that a partial identification of unbalanced molecular processes does not suffice to accurately map the cancer patients. Figure S8. The unbalanced subnetworks identified by surprisal analysis for 10 cell lines. For every process α, the proteins were assembled into networks using functional interactions according to STRING database.   (Table S7). These combinations are proposed by the literature as potential effective combinations for ER resistant breast cancer ( [21,22]. (C) Most frequent barcodes are shown for Glioblastoma Multiforme (~11% of GBM patients), Colon (~32% of colon cancer patients) and Lung adenocarcinoma (31% and 9% of lung cancer patients, Table S4). The suggested therapies for these barcodes (Table S8) are proposed by the literature as potential effective combinations for those cancer types [23][24][25][26]. Figure S11. Protein composition of unbalanced processes remained the same before and after addition of the prostate carcinoma dataset. The weights of the proteins in every unbalanced process from the first dataset were highly correlated with the weights of the proteins after addition of the prostate carcinoma dataset. The correlation coefficient R is above 0.7 for 15 processes, meaning that the correlation is strong, for only 2 processes R was above 0.5-0.6, pointing to moderately strong correlation. Ori -denotes analysis of the original dataset and new-denotes the values after adding the validation (prostate cancer) dataset.