Protein Markers Predict Survival in Glioma Patients*

Glioblastoma multiforme (GBM) is a genomically complex and aggressive primary adult brain tumor, with a median survival time of 12–14 months. The heterogeneous nature of this disease has made the identification and validation of prognostic biomarkers difficult. Using reverse phase protein array data from 203 primary untreated GBM patients, we have identified a set of 13 proteins with prognostic significance. Our protein signature predictive of glioblastoma (PROTGLIO) patient survival model was constructed and validated on independent data sets and was shown to significantly predict survival in GBM patients (log-rank test: p = 0.0009). Using a multivariate Cox proportional hazards, we have shown that our PROTGLIO model is distinct from other known GBM prognostic factors (age at diagnosis, extent of surgical resection, postoperative Karnofsky performance score (KPS), treatment with temozolomide (TMZ) chemoradiation, and methylation of the MGMT gene). Tenfold cross-validation repetition of our model generation procedure confirmed validation of PROTGLIO. The model was further validated on an independent set of isocitrate dehydrogenase wild-type (IDHwt) lower grade gliomas (LGG)—a portion of these tumors progress rapidly to GBM. The PROTGLIO model contains proteins, such as Cox-2 and Annexin 1, involved in inflammatory response, pointing to potential therapeutic interventions. The PROTGLIO model is a simple and effective predictor of overall survival in glioblastoma patients, making it potentially useful in clinical practice of glioblastoma multiforme.

Glioblastoma multiforme (GBM) is a genomically complex and aggressive primary adult brain tumor, with a median survival time of 12-14 months. The heterogeneous nature of this disease has made the identification and validation of prognostic biomarkers difficult. Using reverse phase protein array data from 203 primary untreated GBM patients, we have identified a set of 13 proteins with prognostic significance. Our protein signature predictive of glioblastoma (PROTGLIO) patient survival model was constructed and validated on independent data sets and was shown to significantly predict survival in GBM patients (log-rank test: p ‫؍‬ 0.0009). Using a multivariate Cox proportional hazards, we have shown that our PROTGLIO model is distinct from other known GBM prognostic factors ( Glioblastoma multiforme (GBM) 1 is the most common and aggressive malignant primary brain tumor in adults, with a median survival time of 12-14 months (1). GBM patients receiving standard therapy (surgery plus concurrent temozolomide (TMZ) chemoradiation) have variable clinical outcomes due to the complex nature of this disease (2). Age at diagnosis, extent of surgical resection, and postoperative Karnofsky performance score (KPS) are well-described prognostic factors of GBM (3).
Problems of molecular heterogeneity and genomic complexity have complicated the task of identifying and validating prognostically significant biomarkers in GBMs. Previous studies have identified mutational, gene expression, and epigenetic signatures as prognostic in the survival of GBM patients, but these biomarkers are not consistently exploited in clinical practice due to issues with their complexity, applicability, and cost (4 -6). One well-validated GBM prognostic marker is promoter methylation of the O 6 -methylguanine DNA methyltransferase gene (MGMT), which indicates sensitivity to the alkylating chemotherapy temozolomide and therefore improved survival (7,8). However, the utility of this biomarker is limited because only a proportion (ϳ40%) of patients have MGMT promoter methylation, there exists extensive controversy in how to best measure MGMT methylation, and there is an absence of a clear cutoff between methylated and unmethylated MGMT (8,9). Even more limiting than MGMT methylation is the combination of genomic aberrations, recently described by The Cancer Genome Atlas (TCGA) (10), which showed that patients with the glioma CpG island hypermethylator phenotype (GCIMP) and an isocitrate dehydrogenase 1 (IDH1) mutation, have improved survival (10). This combination of aberrations affects only ϳ6% of GBM patients (10). Mutations in genes encoding IDH1 have recently been shown to confer a survival benefit on glioma patients (11). However, a TCGA study of glioblastomas performed in 2013 estimated that IDH1 mutations occur in only ϳ10% of GBM patients (10). While these markers have been well described and validated, their limited applicability to the majority of GBM patients leaves an unmet need in identifying biomarkers predictive of patient survival.
Previous GBM prognostic models that have relied on gene expression measurements have been limited in uncovering biological significance by the lack of a 1:1 relationship between gene expression and protein abundance. Proteomic techniques, however, directly measure protein abundance, which correlates with the actual biological activity of enzymes and other proteins. Antibody-based proteomic techniques such as reverse phase protein array (RPPA) allow for highthroughput measurements of protein abundance (12). RPPA data have previously been used to identify protein markers of drug response and to predict prognosis in breast cancer and ovarian cancer (13,14). Protein arrays have also previously been used to study pathway alterations in glioma progression (15,16). TCGA GBM RPPA data were broadly analyzed by the TCGA network in previous studies but were not tied to patient survival (10,17). Additionally, a pan-genomic study published in Nature correlated omic data, including RPPA results, to patient survival across multiple cancer types (18). However, the prognostic power of GBM clinical variables could not be overcome by GBM omic data in this study, and no attempt was made to correlate RPPA results to patient survival (18). In the current study, we used GBM from TCGA to construct and validate a protein signature predictive of glioblastoma patient survival (PROTGLIO).
Glioblastomas are rare tumors with an incidence rate of 3.19 per 100,000 (age adjusted) (19). We are dependent on consortium efforts like TCGA in order to accumulate enough samples to obtain statistically significant analysis results. In LGGs follow the same routes of tumor initiation as GBMs, have remarkable molecular similarity and similar patient outcomes as compared with GBMs, and are now thought to be immediate precursors to GBMs (as many of these patients will progress to GBM) (20,21). Recent work has caused a shift in the classification of gliomas, with the primary distinction among groups becoming the presence or absence of an IDH mutation (20 -22). The IDHwt LGG protein data from TCGA are a relevant and useful resource for validating GBM findings. Here we use IDHwt lower grade glioma (LGG) protein data from TCGA to confirm our findings of a prognostic protein signature predictive of patient survival.

EXPERIMENTAL PROCEDURES
TCGA Patient Samples-Glioblastoma (GBM) and lower grade glioma (LGG) tumor samples were collected and snap-frozen at Tissue Source Sites on behalf of the TCGA and processed through the Biospecimens Core Resource at the International Genomics Consortium (Phoenix, AZ) for standardized pathology quality control as previously described (10,23). All specimens were collected using institutional review board approved protocols and were deidentified to ensure patient confidentiality. The samples included in our study were limited to the n ϭ 203 untreated primary GBMs and the n ϭ 42 untreated primary IDHwt LGGs for which RPPA and clinical data were available (10). Clinical data used in our study included: age at diagnosis, overall survival time and status, postoperative KPS, extent of surgical resection, and treatment type (chemotherapy, radiotherapy). We also considered omic data from TCGA, including the presence of MGMT methylation, IDH1 mutation, and glioma CpG island hypermethylator phenotype status (10). Detection of progression in brain tumors is dependent on the frequency of clinical and radiological follow-up, which is difficult to standardize in a multicenter study; therefore, we use overall survival as our end point. Overall survival time was defined as the time between surgery and cancer-related death or censored at date of last follow-up and was capped at 60 months. KPS was defined on a scale of 0 to 100 at intervals of ten. The extent of surgical resection was divided into three possible categories: gross total resection, subtotal resection, or biopsy. Treatment type was defined as: standard TMZ chemoradiation or other. The GBM samples were split into training (n ϭ 102) and validation (n ϭ 101) sets. The IDHwt LGGs samples were split into training (n ϭ 21) and validation sets (n ϭ 21). This random splitting (twofold crossvalidation) was preferred over a more standard splitting with three or more folds in order to achieve enough degrees of freedom in all levels of low-occurring prognostic variables such as MGMT methylation. Patient characteristics of the sample sets are illustrated in Table I and  Supplemental Table S1.
RPPA-The RPPA Core Facility at the MD Anderson Cancer Center performed the protein analysis on behalf of the TCGA. Standard RPPA experimental procedures with validated antibodies were used, and the protocol has been described elsewhere (10). GBMs were probed with 171 primary antibodies, including 45 phospho-antibodies.
LGGs were probed with 189 primary antibodies, including 51 phospho-antibodies. The two arrays have 141 overlapping protein antibodies. Antibody and probed protein names for both datasets can be found in Supplemental Table S2. Level three (quantified, normalized, and corrected) GBM and LGG RPPA data were downloaded at https://tcga-data.nci.nih.gov.
PROTGLIO Signature Generation-The Cox proportional hazards model is commonly used as a regression model in survival analyses (24). It assumes a semiparametric form for the hazard: where each patient is indexed by i for i ϭ 1, …, n, XiTϭ (Xi1, … Xij, … Xip) is the p-vector of all protein expression levels for the i th patient where X ij is the expression level of the j th protein for j ϭ 1, …, p for that i th patient, ␤ ϭ ͑␤ 1 , … ␤ j , … ␤ p ͒ T is the p-vector of regression coefficients where ␤ j is the coefficient of the j th protein for j ϭ 1, …, p, h i ͑t͒ is the hazard function for the ith patient at time t, and h o ͑t͒ is the baseline hazard.
To derive a protein model predictive of survival, we used an elastic net-penalized Cox regression model (25,26). The elastic net regularization technique is a commonly employed variable selection procedure in statistical models by combining L1 and L2 penalty terms to produce shrinkage estimates of regression parameters ␤ (27). The elastic net algorithm achieves a balance between the LASSO penalty (through the L1 term), allowing nonrelevant variables to be assigned a coefficient of zero, and the ridge penalty (through the L2 term), allowing for retention of correlated features and fitting in the highdimensional situation when the number of variables (p) exceeds the number of samples (n). In this study, variable selection was achieved via the elastic net-penalized partial likelihood, where both mixing and overall shrinkage tuning parameters (␣ and below) are simultaneously cross-validated, as described in Simon et al. (25,26). Specifically, from the partial likelihood: where t ͑1͒ Ͻ t ͑2͒ Ͻ … Ͻ t ͑m͒ is the list of ordered distinct failure times, k(i) denote the index of the observation failing at time t ͑i͒ , and R i is the set of indices k with y k Ն t ͑i͒ , that is, the set of individuals at risk of experiencing a death event at time t ͑i͒ ; one estimates ␤ by maximizing the partial likelihood L(␤)subject to the constraint: The following two submatrices (n Ͻ N) were used to train the algorithm (1) a survival matrix, Y ⑀ P n, 2 , where n is the number of TCGA GBM or LGG patients and column values are patients' overall survival time in months and event indicator (vital status: living or deceased).
(2) a protein matrix, X ⑀ P n,p , where n is the number of TCGA GBM or LGG patients and p is the number of protein features.
We used the "glmnet" package version 1.9 (25,26) in R version 3.1.3 (28) to solve the optimization problem for the Cox elastic net regression. In the glmnet package, ␣ controls the balance between the L1 and L2 penalty terms, and controls the overall shrinkage. The optimal was chosen by the glmnet package using tenfold cross validation to minimize the root mean squared error. The optimal ␣ (0.5) was chosen by testing ten possible values of ␣ ⑀ [0.1, 1.0] and selecting the one that maximized the significance of the resulting survival prediction (log-rank test). By averaging the estimated ␤ values over 100 repetitions of the cross-validated elastic net model, we attained mean ␤ coefficients for the proteins in the model (Table II).
A PROTGLIO score was calculated for each of the TCGA training samples in the training set. Patients with a score greater than the median were categorized as high risk, and those with a score equal to or less than the median as low risk. The PROTGLIO score is defined as the sum of the estimated ␤ coefficients multiplied by the patient's respective protein expression levels, as shown below: where patients are represented by i (i ϭ 1, …, n), proteins with nonzero elastic net coefficients ␤ j are represented by j (j ϭ 1, …, z) (adapted from (14)). Statistical Analysis-The independence of the PROTGLIO model was assessed with a multivariate Cox proportional hazards model using PROTGLIO, age at diagnosis, extent of resection, KPS, treatment type, and MGMT methylation as covariates. Due to the small number of patients with an IDH1 mutation and/or positive for glioma CpG island hypermethylator phenotype, we lacked the degrees of freedom necessary to include these variables as covariates in our multivariable model. The Wald test was used to assess the significance of the covariates in the multivariable models. A log-rank test was used to assess the significance of differences between Kaplan-Meier survival curves. p values Ͻ 0.05 were considered significant. We also assessed the predictive power of the PROTGLIO model using Harrell's concordance index (C-index) (29). The C-index for censored data is a rank correlation U-statistic that estimates the probability of concordance between predicted and observed survival times. It is used here as a nonparametric measure of survival model prediction performance: C-index ϭ 1 indicates perfect prediction accuracy and C-index ϭ 0.5 is equivalent to a random guess.
To validate the PROTGLIO model we tested the ability of the training set generated survival model to predict patient survival in the independent validation set. We used the ␤ coefficients generated from the training set to calculate PROTGLIO scores for the TCGA test samples in the validation set. The patients in the validation set were categorized as low or high risk using the median PROTGLIO score of the training set. The significance of distributional separation of the resulting Kaplan-Meier survival curve was calculated using the logrank test. The prediction performance (accuracy) of the model was assessed using the C-index.

Robustness of PROTGLIO-Reproducibility and Lower Grade
Glioma Assessment-We assessed the robustness of the PROTGLIO model in two ways. First, we assessed the reproducibility of the approach described above by generating a PROTGLIO model on ten random pairs of training and validation sets from the TCGA GBM data, i.e. 10-repeated twofold cross-validations (Supplemental Table  S3). The elastic net regression model was implemented on each training set to generate ten separate PROTGLIO patient survival models. The predictive performances of the resulting models were evaluated on their respective training and validation sets. The significance of distributional separation of the resulting Kaplan-Meier curves were calculated using the log-rank test. The prediction performance (accuracy) of the models was assessed using the C-index.
Next, we validated our findings using TCGA RPPA data from IDHwt LGGs. IDHwt lower grade gliomas have recently been shown to have the same routes of tumor initiation as GBMs, have remarkable molecular similarity and patient outcome as GBMs, and are now thought to be immediate precursors to GBMs (as many of these patients will progress to GBM) (20 -22). We utilize the IDHwt subset of TCGA LGG samples to validate our prognostic GBM signature in the absence of available proteomics GBM validation datasets. Table I, GBM patients included in the training set (n ϭ 102) had a median age of 60, and those in the validation set (n ϭ 101) had a median age of 59. Gross total surgical resection was achieved in 89% of patients in the training set and 86% of patients in the validation set. In the training and validation sets, 77% and 71% of patients, respectively, were postsurgically treated with radiation and concurrent TMZ. The median overall survival time/time to last follow up was 9.4 months for patients in both the training and validation sets. Additionally, both groups of patients had a median KPS of 80. Four patients in the training set and seven patients in the validation set carried an IDH1 mutation. Methylation of the MGMT gene was present in 35 patients in the training set and 31 patients in the validation set. Four patients in the training set and ten patients in the validation set were positive for the glioma CpG island hypermethylator phenotype.

Patient Characteristics-As described in
Patients included in the IDHwt training set (n ϭ 21) and validation set (n ϭ 21) had median ages of 54 and 52, respectively (Table I). The median overall survival time/time to last follow up was 12.4 months for patients in the training set and 12.2 months for those in the validation set. Both groups of patients had a median KPS of 90. Methylation of the MGMT gene was present in 33% of patients in the training dataset and 43% in the validation set.
Protein Signature Generation and Validation-Using our elastic net Cox regression model, we identified z ϭ 13 proteins, shown in Table II, significantly associated with overall survival in GBM patients. We termed this protein signature predictive of glioblastoma patient survival-PROTGLIO. Each patient was classified as high-risk (high PROTGLIO score) or low risk (low PROTGLIO score) using the median training set PROTGLIO score (-0.88) as the cutoff. Four proteins (GAB2, SMAD1, phosphorylated Src, and Src) were shown to be associated with longer overall survival (negative beta coefficient in Table II), while nine proteins (AMPK, ANXA1, phosphorylated c-Jun, phosphorylated c-Met, phosphorylated CHEK1, Cox-2, IGFBP2, phosphorylated NFkB1, and TGM2) were associated with shorter overall survival (positive beta coefficient in Table II). Functional analysis was investigated using standard Gene Ontology terms and functional associations were discovered using GeneMANIA (30). Our functional analysis showed that the PROTGLIO signature is en-  A Kaplan-Meier survival analysis showed very significant differences between the low-and high-risk PROTGLIO groups of the training set (n ϭ 102, p Ͻ 0.0001), as shown in Fig. 1. The C-Index of the training set was 0.63. The median overall survival times of the low-and high-risk groups of the training set were 22.2 and 12.0 months, respectively. A Kaplan-Meier survival analysis of the validation set PROTGLIO risk groups (Fig. 1) also showed very significant differences (n ϭ 101, p ϭ 0.0009). The C-Index of the testing set was 0.60. The median overall survival times of the low-and highrisk groups in the validation set were 16.9 and 10.6 months, respectively.
In order to test the significance of the PROTGLIO signature (meta-variable) predictive power from known prognostic markers of GBM patient survival, we fitted a multivariate Cox proportional hazards model with PROTGLIO score, age at diagnosis, extent of surgical resection, postoperative KPS, treatment with TMZ chemoradiation, and MGMT methylation as covariates (Table III). Patients with missing values were excluded from the hazards analysis. In the training set, after adjustment for known predictors of GBM survival, the PROTGLIO score remained very significant (p ϭ 0.008) among included covariates (Table III). Age also remained a significant predictor of survival after multivariate adjustment (p ϭ 0.005), which is consistent with clinical observations that age at diagnosis is a distinct prognostic factor for GBM survival (31). Our findings that PROTGLIO is an independent predictor of glioblastoma survival were confirmed using our validation set. The PROTGLIO score (p ϭ 0.008) alone remained significant after adjustment for all predictors using a multivariate analysis (Table III).
Robustness of the PROTGLIO Model-We assessed the robustness of the PROTGLIO model by repeating the model generation procedure using ten random pairs of training (n ϭ 102) and validation (n ϭ 101) sets, i.e. ten repeated twofold cross-validations (Supplemental Table S3). We were able to reproduce our results and generate ten PROTGLIO models that significantly predicted survival in their respective training and validation sets (Supplemental Table S4 and Supplemental Fig. 1). There were 37 total proteins identified as significant in GBM survival by the ten PROTGLIO models. There was significant overlap in the proteins contained in each PROTGLIO model (Table IV and Supplemental Table S4). Annexin 1 and Cox-2 were found to be significant in nine out of ten PROTGLIO models; Insulin-like growth factor binding protein 2, phosphorylated Chk1, and transglutimase-2 were significant contributors to seven out of ten models, and an additional five proteins (GAB2, phosphorylated Src, Smad1, Src, and phosphorylated c-Met) were identified as significant by more than half of the ten PROTGLIO models. Functional analysis that each of the ten PROTGLIO models are enriched in proteins encoded by genes with inflammatory and immune response function.
Using an elastic net Cox regression, we developed a prognostic protein signature and identified z ϭ 12 proteins, shown in Table V, significantly associated with overall survival in patients with IDHwt LGGs. Each patient was classified as high-risk (high protein score) or low risk (low protein score) using the median training set IDHwt protein score as the cutoff. Five proteins (BID, NF2, TP53, CHEK1 and phosphorylated SRC) were shown to be associated with longer overall survival (negative beta coefficient). While seven proteins (AMPK1, PECAM1, BIRC2, IGFBP2, FRAP1, RPS5, and ANXA1) were associated with shorter overall survival (positive beta coefficient). Like the PROTGLIO signature, functional analysis showed that the IDHwt LGG signature is enriched in proteins encoded by genes with inflammatory response function. A Kaplan-Meier survival analysis showed very significant differences between the low-and high-risk groups of the training set (n ϭ 21, p Ͻ 0.0001), as shown in Fig. 2. The median overall survival time of the high-risk group of the training set was 12 months and the median overall survival time of the low-risk group was 26 months. The C-Index of the training set was 0.82. A Kaplan-Meier survival analysis of the validation set risk groups (Fig. 2) also showed significant differences (n ϭ 21, p ϭ 0.01). The median overall survival time of the high-risk group in the validation set was 13 months and 20 months for the low-risk group. The C-Index of the testing set was 0.70. In the training set, after adjustment for age at diagnosis and MGMT methylation using a multivariate Cox proportional hazards model, the IDHwt protein score remained significant (p ϭ 0.03). Our findings that the protein score is a distinct predictor of IDHwt lower grade glioma survival were confirmed using our validation set. The validation set IDHwt protein score (p ϭ 0.01) remained significant after adjustment for all predictors using a multivariate analysis.
Currently, there are no therapeutic stratification schemas available to clinicians that treat glioblastoma patients. This is in contrast to the use of erlotinib targeting mutant EGFR in non-small-cell lung cancer patients and a host of other targeted approaches used in other cancer types (32). A targeted therapeutic approach to glioblastoma treatment based on the immunohistochemistry of a small number of proteins presents a practical and promising approach to treatment of GBM patients. In this study, we identified 13 proteins predictive of glioblastoma patient overall survival. Using data from 203 TCGA glioblastoma multiforme cases, we developed and validated a protein signature predictive of glioblastoma patient survival-PROTGLIO model.
We found a significant association between overall survival and PROTGLIO risk groups in both our training set (p Ͻ 0.0001) and independent validation set (p ϭ 0.0009). Both patient sets were well matched (median survival ϭ 9.4 months, training and validation) and were composed of a mix of long-and short-term survivors (training survival range: 0.1-60 months, validation survival range: 0.2-60 months), emphasizing the significance of this result. The predictive power of the PROTGLIO signature of risk groups remained highly significant for both the training (p ϭ 0.008) and validation sets (p ϭ 0.008) even after adjustment for known GBM prognostic factors (age, KPS, TMZ treatment, extent of surgical resection, and MGMT methylation).
The robustness of the PROTGLIO model was examined by repeating our model generation procedure on ten random pairs of training and validation sets and by observing similar trends in the LGG validation dataset. Functional analysis showed that the repeated model generation procedure on the GBM yielded prognostic models enriched in genes in the inflammatory and immune response pathways. Unfortunately, there does not exist a publically available dataset for GBMs or LGGs on which the PROTGLIO signature can be directly computationally validated. Due to changes in antibodies available the protein array that was used by the TCGA to analyze GBMs is slightly different from that used to analyze LGGs (171 versus 189 proteins probed, respectively, with only 142 proteins overlapping). Cox-2, a key protein in the PROTGLIO signature was not measured in the LGG TCGA protein array. Without this measurement, we cannot computationally validate the PROTGLIO signature using the LGG data. In the absence of such a dataset we used the LGG dataset to confirm the trends that we observed among GBMs. The GBM and IDHwt protein models have five proteins directly in common (AMPK1, ANXA1, CHEK1, IGFBP2, and SRC) (Table V). Moreover, both the GBM and IDHwt protein models are enriched in cell signaling and inflammatory response proteins. These two models highlight the involvement of the role of proteins involved in inflammatory response in glioma patient survival and point to possible avenues of therapeutic intervention for these tumor types.
Cox-2 and Annexin 1 were found to be significant contributors to the PROTGLIO model in nine out of ten repetitions of the model generation procedure. Cox-2 is involved in the conversion of arachidonic acid to prostaglandin H2, is an important contributor to the inflammatory process, and has been associated with recurrence and metastatic disease in multiple cancers (33,34). Cox-2 inhibition has been shown to modulate tumor growth in colon cancer (35). Annexin 1 is a lipocortin that is also involved in the inflammatory response (36). Annexin 1 is lost in clinical breast cancer, and data Catalyzes protein crosslinking suggest that Annexin 1 may act as a tumor suppressor gene that modulates the proliferative functions of estrogens in breast cancer (37). There has been interest in the use of Annexin 1 as an anticancer therapy as it can be induced to block the NF-kB signal transduction pathway, which is exploited by cancerous cells in order to avoid cell death (38). The PROTGLIO model has implicated these two proteins involved in inflammatory response as important predictors of survival in glioblastoma patients. Interestingly, we found phosphorylated SRC, a known oncogene, to be in low abundance in both the low-and high-risk GBM groups. Those patients in the low-risk PROTGLIO group had 1.4-fold less p SRC on average than those in the high-risk PROTGLIO group. The PROTGLIO model is a simple and effective predictor of overall survival in glioblastoma patients, making it potentially useful in clinical practice. The protein models not only have potential as a clinical stratification schema to distinguish be-tween high-and low-risk patients but also highlight potential therapeutic interventions. Specifically, the PROTGLIO model implicates proteins involved in the inflammatory process as being correlated to patient survival. Further retrospective and prospective proteomics studies are needed to assess the value of implementing these findings in the clinic and to further interrogate the identified proteins as therapeutic targets.