Histone Signatures Predict Therapeutic Efficacy in Breast Cancer

Objective: Regulatory abnormalities caused by chromatin modifications are being increasingly recognized as contributors to cancer. While many molecularly targeted drugs have the potential to revert these modifications, their precise mechanism of action in cellular reprogramming is not known. Methods: To address this, we introduce an integrated phosphoprotein-histone-drug network (iPhDNet) approach to generate “global chromatin fingerprints of histone signatures.” The method integrates proteomic/phosphoproteomic, transcriptomic and regulatory genomic data to provide a causal mechanistic network and histone signatures of drug response. Results: We demonstrate the utility of iPhDNet in identifying H3K27me3K36me3 histone mark as a key fingerprint of response, mediated by chromatin remodelers BRD4, NSD3, EZH2, and a proto-oncogene MYC when treated with CDK inhibitors. Conclusions: We construct a regulatory network of breast cancer response to treatment and show that histone H3K27me3K36me3 status changes, driven by the BRD4/MYC pathway, upon treatment with drugs are hallmarks of response to treatment.

genes, CDK inhibitor gene CDKN2A, transcription factor MYC and genes representing the enriched phosphoproteins) to capture in vitro gene activity levels in (MCF7) cell line.

D. Quantification and Statistical Analysis Histone Signature Identification
An unsupervised clustering technique, non-negative matrix factorization (NMF), was used to stratify histone signatures. R Statistics package was used for the calculation and Cytoscape was used to generate network graphs. Similar to vector quantization methods such as principal components analysis (PCA) and singular value decomposition (SVD), the objective of NMF is to explain the observed data using a compact number of latent features, i.e., basis components, which when combined with loading/mixture components approximate the original data as accurately as possible. In our NMF formulation both the matrix representing the basis components (histone signatures) as well as the matrix of mixture coefficients (drug prototypes) are constrained to have non-negative values, and unlike PCA and SVD, no independence or orthogonally constraints are imposed on the basis components leading to a simple and intuitive interpretation of the factors that allow the basis components to overlap. This unique feature is particularly interesting in histone modules, where overlapping basis components identify combinatorial histone codes resulting from multiple signaling pathways and indicate a specific signature. Because NMF assumes an additive model, anti-log transformed values were used in our analysis.
Mathematically, NMF consists of finding an approximation A  WH, ` (1) , ≥ 0 where W, H are n  k and k  m non-negative matrices respectively where n are rowssamples and m are columnsthe measured features in A. Since the objective is to reduce the dimension of the original data A, the factorization rank k is often chosen such that k  (n, m). W contains basis vectors and H contains encoding vectors that estimate the extent to which each basis vector is used to reconstruct each input vector. We used a version of NMF to minimize the divergence function (KL divergence) given by Brunet et al. [35]. The function is related to the Poisson likelihood of generating A from W and H, more specifically, based on randomly initialized matrices W and H, NMF finds the solution of min D(A||WH) = ∑ =1 ∑ ( where, D is a loss function, via an iterative process [11]. At each step, W and H are updated by using the following coupled divergence equations: where Ai,j = [A]i,j indicates (i,j)-th element of the matrix A.
Because (1)  E. Cluster Validation To identify the optimal rank k, we used the cophenetic correlation coefficient [36] to determine the most robust clustering as: It measures how reliably the same histone codes are assigned to the same cluster across many iterations of the clustering algorithm with random initializations. The cophenetic correlation coefficient lies between 0 and 1 and reflects the probability that samples i and j cluster together. Higher values indicate more stable cluster assignments. We selected optimal k= 4 ( Supplementary Fig S1A, S1B) based on the largest observed cophenetic coefficient and where the magnitude of the cophenetic correlation begins to decrease by varying values of k from 2 to 10 (Supplementary Fig S1C). We used the NMF package in R to implement and compute these calculations. In eq 5, x(i, j) = | xi − xj |, is the ordinary Euclidean distance between the ith and jth observations. t (i, j) = the dendrogrammatic distance between the model points ti and tj (height of the node at which these two points are first joined), x bar is the average of the x(i, j), and t bar is the average of the t(i, j). After factorizing A into the basis matrix W and the encoding matrix H, we used the basis matrix W for histone stratification. Specifically, we grouped histone codes into four groups (k=4). We assigned histone code xi to cluster k* which has the highest value based on the basis vector, as: Similarly, we assigned targeted pathways for each drug dj to cluster k* which has the highest value based on the encoding vector, as:

) F. Histone Prediction Model
Histone-peptide interaction network was generated using partial least square regression (PLSR) method based on Kraemer et al. formulation [13]. PLSR is a multivariate regression method for constructing predictive model when the number of factors/predictor variables (in our case phosphopeptides) exceeds the number of responses / dependent variables (histone marks), and collinearity exists (phosphopeptides are correlated with one another). A past study [37] had shown the effectiveness of partial least square (PLS) application in understanding crosstalk between phosphoprotein signaling in macrophage cells, thus, prompting us to consider a PLS-based regression model. The general idea behind PLSR is to try to extract latent factors, accounting for as much of the observed variation as possible while modeling the responses well. For each sample n, the value ynj is defined as: Where ynj is a response, is the coefficient, is an explanatory variable and is an error term. This model is similar to linear regression; however, the way these i are found is different. To see this, a matrix format of equation (7) can be expressed as Y=XB+E where Y is an n cases by m variables response matrix (in our case it is drugs x histone data), X is an n cases by p variables predictor matrix (in our case it is drugs x phosphopeptides data), B is a p by m regression coefficient matrix, and E is a noise term for the model which has the same dimensions as Y. For our X predictor matrix, we first normalized all the phosphosignal values to their corresponding z-scores and centered Y response matrix (histone values). Intuitively, partial least squares regression produces a p by c weight matrix W for X such that T=XW, i.e., the columns of W are weight vectors for the X columns producing the corresponding n by c factor score matrix T. These weights are computed so that each of them maximizes the covariance between responses and the corresponding factor scores. Ordinary least squares procedures for the regression of Y on T are then performed to produce Q, the loadings for Y (or weights for Y) such that Y=TQ+E. Once Q is computed where B=WQ, we have Y=XB+E, and the prediction model is complete. To provide a complete description of PLSR, we also need a p by c factor loading matrix P which gives a factor model X=TP+F, where F is the unexplained part of the X scores. On the training data, we calculated the optimal model parameter using 10-fold cross-validation. We assessed the predictive performance by computing the residual sum of square (RSS) error of prediction on the test set (Supplementary Fig S2). We identified the optimal number of components (principal component, PC) that could be used to predict the model accurately using residual square sum (RSS) value < 0.05. Once the coefficients (i) are generated, we retained only the significant peptides (p_value < 0.0001) using a t-test where the degree of freedom DOF was computed as: DOF = min (column of X, row of X) -PC -1.
G. Integrated Phosphoprotein-Histone-Drug Network (iPhDNet) Using the coefficients from the histone signatures (c1, c2, c3, and c4) and the drug prototypes using NMF and model coefficients of phosphoproteins towards histone model prediction using PLSR, an integrated 3D network file is constructed connecting drugs to phosphoproteins and phosphoproteins to histones (iPhDNet). iPhDNet is visualized using Cytoscape highlighting hub nodes (most connected histones) linking histone to phosphoproteins to drugs. Influences of each drug or phosphoprotein towards a histone code then can be visualized by the properties of edges connecting them. For example, the thickness of the edges signifies the amount of contributions by each phosphoprotein or drug, colors of edges signify how they are correlated (i.e., positive or negative).

H. Mechanistic Causal Network (MCN) Reconstruction
A time-varying mechanistic causal network was constructed by back propagating iPhDNet, previously generated for 24 hour from P100 data. We first used a one-way ANOVA with a pvalue of 1.0e-4 to populate enriched (statistically significant) phosphoproteins at 6 and 3-hour time points. We then inferred protein-protein interactions for the phosphoproteins enriched in 24 hour by mapping them to the STRING database. An interaction score of 0.8 and above, experimentally validated PPIs, and gene fusions criteria were used to obtain these inferred proteins. Our final MCN was constructed by back propagating our mapping of the inferred proteins from 24 hour to enriched phosphoproteins in 6 and to 3 hour. Additional protein-coding genes were generated and added to the final MCN using the EnrichR tool (http://amp.pharm.mssm.edu/Enrichr/enrich).
We then validated our MCN by matching them against significant differentially expressed (DE) genes in L1000. Cytoscape was used to view the final reconstructed MCN.
I. Transcriptomic Analyses -Differential Expression of L1000 and TCGA Data Differential expression analyses for 978 landmark genes from L1000 assay treated with flavopiridol and dinaciclib were performed using the unpaired t-test implemented in CyberT. Cyber-T is based on a regularized Bayesian framework that addresses technology biases and low replication levels in high throughput data [38]. These analyses were performed on 3, 6 and 24-hour datasets. Multiple corrections were applied to p-values using Benjamini Hochberg. Similarly, differential expression analyses of TCGA matched normal vs cancer patients were performed using unpaired t-tests. Cyber-T web server [39] was used to generate these analyses (Supplementary Table 4).
J. Cluster Similarity Evaluation We used the Rand Index (RI) to evaluate the similarity of cluster assignments between every paired treatments in breast cancer and other cell lines. RI computes the percentage of pairs of objects for which both classification methods, the computed and the ideal one, agree. It is computed using False Positives (FP), False Negatives (FN), True Positives (TP) and True Negatives (TN) as follows: The RI value ranges from 0 (completely dissimilar group assignment) to 1 (exactly same group assignment).

NOLC1
Nucleolar and Coiled-Body Phosphoprotein 1 A nucleolar protein that regulates RNA polymerase I by connecting RNA polymerase I to ribosomal processing and remodeling enzymes, resulting in translational remodeling. It has a high binding affinity to c-MYC and Max transcription factors which play an important role in cancer. Although NOLC1 has not been studied extensively, a previous study found NOLC1 to have transcription factor-like activity in nasopharyngeal cancer progression suggesting its possible role in other cancers (Hwang et al. 2009).

SRRM2
Serine/Arginine Repetitive Matrix 2 Known to be involved in pre-mRNA splicing and has binding specificity for p53. SRRM2 has been detected as a 5'-3' Exoribonuclease 2 (Xrn2)interacting protein that is involved in premature termination of RNA polymerase II transcription (Sansó et al. 2016;Brannan et al. 2012) thus affecting cell cycle progression.

CASC3
Cancer Susceptibility Candidate 3 Also known as MLN51 is a component of the exon junction complex (EJC) whose expression as been shown to be elevated in some breast cancer cell lines (Tomasetto et al. 1995).
The EJC is known to be involved in a surveillance mechanism that degrades mRNAs with premature translation termination codons through a nonsense-mediated mRNA decay (NMD) function, thereby, promoting cell cycle arrest. Supplementary

Regulator
Affected pathway BRD4, NSD3, SRRM2, NOLC1, MYC, P-TEFb complex Inhibition of CDK by flavopiridol and dinaciclib shows molecular cascades of interactions among BRD4, NSD3, SRRM2, NOLC1, MYC with the P-TEFb complex and its recruitment to the proximal promoter region of MYC to block transcriptional elongation of RNA Pol II.
P-TEFb, AURKA IkB, an enzyme complex that is part of the NF-κB signaling pathway, interacts with P-TEFb via AURKA to activate the CDK pathway. P-TEFb targets the intrinsic kinase activity directed towards RNA Pol II essential for transcriptional initiation, elongation, and inhibition.
BRD4, P-TEFb, RELA BRD4 has been implicated in activating NF-κB pathway by recruiting P-TEFb to acetylated RELA. (Huang et al. 2009) BRD4, H3K56me2 As a consequence of the CDK and BRD4 inhibition, we observed a reduction of Map3K7 phosphorylation, which inhibits JNK expression resulting in an increase of H3K56me2 level.

PDPK1
Hyperactive RAS then acts as a signaling switch that converts JNK's role from pro-to anti-tumor signaling through the regulation of Hippo signaling activity by inhibiting the PDPK1 phosphoprotein.
BET proteins A recent study has shown that the combined effect of PI3K and BET inhibition in a wide range of cancer cell lines resulted in apoptosis, tumor regression, and clamped inhibition of PI3K signaling. (Stratikopoulos et al., 2015) EJC, P-TEFb, BRD4, RBM8A, ZC3H18 While EJC regulators indirectly bind to P-TEFb recruited by BRD4 via RBM8A and ZC3H18, they have a secondary binding effect with Wnt/Notch signaling pathway components.