Plasma protein-based signature predicts distant metastasis and induction chemotherapy benefit in Nasopharyngeal Carcinoma

Rationale: Currently, for locoregionally advanced nasopharyngeal carcinoma (LA-NPC), there is no effective blood-based method to predict distant metastasis. We aimed to detect plasma protein profiles to identify biomarkers that could distinguish patients with NPC who are at high risk of posttreatment distant metastasis.
Methods: A high-throughput antibody array was initially applied to detect 1000 proteins in pretreatment plasma from 16 matched LA-NPC patients with or without distant metastasis after radical treatment. Differentially expressed proteins were further examined using a low-throughput array to construct a plasma protein-based signature for distant metastasis (PSDM) in a cohort of 226 patients.
Results: Fifty circulating proteins were differentially expressed between metastatic and non-metastatic patients and 18 were proven to be strongly correlated with distant metastasis-free survival (DMFS) in NPC. A PSDM signature consisting of five proteins (SLAMF5, ESM-1, MMP-8, INSR, and Serpin A5) was established to assign patients with NPC into a high-risk group and a low-risk group. Patients in the high-risk group had shorter DMFS (P < 0.001), disease-free survival (DFS) (P < 0.001) and overall survival (OS) (P < 0.001). Moreover, the PSDM performed better than N stage and Epstein-Barr virus (EBV) DNA load at effectively identifying patients with NPC at high risk of metastasis. For patients in the high-risk group, induction chemotherapy significantly improved DMFS, DFS, and OS.
Conclusions: The PSDM could be a useful liquid biopsy tool to effectively predict distant metastasis and the benefit of induction chemotherapy in patients with LA-NPC.


Supplementary Tables and Figures
Table S1. Clinicopathological characteristics of 16 matched patients with post-treatment metastatic nasopharyngeal carcinoma (MNPC) and post-treatment non-metastatic nasopharyngeal carcinoma (NMNPC).
Table S2. Univariate Cox regression analysis to explore the impact of time interval on clinical outcomes.
Table S3. Results of differential expression analysis in the high-throughput and low-throughput arrays.
Table S4. Univariate analysis of the 42 differentially expressed proteins associated with distant metastasis-free survival.
Table S5. Results of univariate Cox analysis and differential analysis of the 18 proteins significantly associated with distant metastasis-free survival.
Table S6. Univariate analysis of the 42 differentially expressed proteins associated with disease-free survival.
Table S7. Univariate analysis of the 42 differentially expressed proteins associated with overall survival.
Table S8. Concentrations of the five PSDM signature proteins in the high and low metastatic risk groups stratified by the PSDM risk score.

Constructing a protein-based signature for metastasis (PSDM) by LASSO Cox regression analysis with ten-fold cross-validation
The least absolute shrinkage and selection operator (LASSO) is a popular method for regression with high-dimensional predictors. It introduces a penalty, weighted by a parameter λ, that shrinks some regression coefficients to exactly zero. The penalty parameter λ, also called the tuning parameter, controls the amount of shrinkage: the larger the value of λ, the fewer predictors are selected [1]. LASSO has been broadly applied to the Cox proportional hazards regression model for survival analysis to prevent overfitting [2][3][4]. We selected the 17 DMFS-correlated plasma proteins that showed an upregulated tendency and fitted a LASSO Cox regression model to achieve shrinkage and variable selection simultaneously.
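For reference, the LASSO-penalised Cox estimate can be written in its general textbook form as below. The notation is an assumption introduced here for illustration and does not come from the original text: x_i is the vector of candidate protein concentrations for patient i, δ_i is the event indicator (1 if distant metastasis was observed), R(t_i) is the set of patients still at risk at event time t_i, and p is the number of candidate proteins (17 here). Software packages may scale the log partial likelihood differently, so λ values are not directly comparable across implementations.

```latex
\hat{\beta}(\lambda) = \arg\max_{\beta}
  \left\{ \ell(\beta) - \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \left[ x_i^{\top}\beta - \log \sum_{k \in R(t_i)} \exp\!\left( x_k^{\top}\beta \right) \right]
```

Because the penalty is a sum of absolute values, the coefficients of weak predictors are set exactly to zero as λ grows, which is why a single fit performs shrinkage and variable selection at the same time.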
Ten-fold cross-validation was used to determine the optimal value of λ. In short, the 226 LA-NPC patients were randomly partitioned into 10 subsamples of approximately equal size. A series of λ values for LASSO was generated by the "glmnet" package [2] in R. For each λ, 9 subsamples were used as training data to fit a model, and the remaining subsample was held out to validate it.
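The fold assignment and the way cross-validated deviances are summarised to choose λ in this section can be sketched as follows. This is a minimal illustration in plain Python, not the "glmnet" code used in the study: the function names `assign_folds` and `select_lambda_1se` are hypothetical, and the deviance table passed to the second function is assumed to have been produced by some penalised Cox fitting routine that is not shown here.

```python
import math
import random
import statistics


def assign_folds(n_samples, n_folds=10, seed=0):
    """Randomly split sample indices into n_folds near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Dealing the shuffled indices round-robin keeps fold sizes
    # within one sample of each other.
    return [indices[k::n_folds] for k in range(n_folds)]


def select_lambda_1se(lambdas, fold_deviances):
    """Return (lambda_min, lambda_1se) from cross-validated deviances.

    lambdas: candidate penalty values.
    fold_deviances: fold_deviances[k][j] is the cross-validated partial
        likelihood deviance attributed to fold k at lambdas[j]
        (assumed input; how it is computed depends on the software).
    """
    n_folds = len(fold_deviances)
    means, std_errors = [], []
    for j in range(len(lambdas)):
        column = [fold[j] for fold in fold_deviances]
        means.append(statistics.mean(column))
        # Standard error of the mean across folds.
        std_errors.append(statistics.stdev(column) / math.sqrt(n_folds))

    # Lambda with the smallest mean deviance.
    best = min(range(len(lambdas)), key=means.__getitem__)
    # 1-SE rule: the largest lambda whose mean deviance is within
    # one standard error of the minimum.
    threshold = means[best] + std_errors[best]
    eligible = [lam for lam, m in zip(lambdas, means) if m <= threshold]
    return lambdas[best], max(eligible)
```

With 226 patients and ten folds, `assign_folds(226)` gives six folds of 23 patients and four of 22. Preferring the larger λ within one standard error of the minimum trades a small, statistically indistinguishable loss in fit for a sparser model.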
The partial likelihood deviance was calculated to evaluate how model performance differed between the training and validation subsamples. The cross-validation process was repeated 10 times, with each of the 10 subsamples used exactly once as the validation data. In this way, for each λ, the mean and estimated standard error of the ten partial likelihood deviances were calculated. We chose λ via the 1-SE (standard error) criterion [3-4], i.e. the optimal λ is the largest value for which the mean partial likelihood deviance is within one SE of its smallest value (Figure 1B-C). Based on this λ value, we obtained the variables whose beta coefficients were not zero, namely SLAMF5, ESM-1, MMP-8, INSR, and Serpin A5.
Eight proteins whose concentrations were below LOD were excluded.
Abbreviations: DMFS: distant metastasis-free survival; ShhN: Sonic Hedgehog N-Terminal; LOD: lower limit of detection.
Eight proteins whose concentrations were below LOD were excluded.
Abbreviations: DFS: disease-free survival; ShhN: Sonic Hedgehog N-Terminal; LOD: lower limit of detection.
Eight proteins whose concentrations were below LOD were excluded.
Abbreviations: OS: overall survival; ShhN: Sonic Hedgehog N-Terminal; LOD: lower limit of detection.
Figure S1. Expression of plasma proteins related to metastasis in the high-throughput and low-throughput arrays.