Study cohort
This study was retrospective and approved by the local Institutional Review and need for written informed consent was waived. All included of consecutive patients who underwent prostate mpMRI at two tertiary care medical centers were reviewed. All procedures performed in studies involving human participants were in accordance with the 1964 Helsinki declaration and its later amendments.
The primary patients comprised an evaluation of the two institutions database for medical records and were histologically proven between January 2014 and December 2019. The inclusion criteria were followed: (1) patients with biopsy or prostatectomy proved PCa; (2) standard prostate 3.0 T MRI performed within 4 weeks before the biopsy or prostatectomy; (3) with standard histologic tissue slices of dissected prostatectomy specimens. Patients were excluded if (1) absence of biopsy, surgical intervention or medical records within 8 weeks after MRI examination (n = 9554); (2) noncompliance with imaging quality or imaging exam from outside institutions (n = 141); (3) previous surgery, radiotherapy or drug therapies for PCa (interventions for benign prostatic hyperplasia or bladder outflow obstruction were deemed acceptable) (n = 436). Finally, 903 patients from center 1 and 539 patients from center 2 were eligible for clinical evaluation.
As a standard part of patient management in our two medical centers, the lesion with a Prostate Imaging Reporting and Data System (PI-RADS) [17] scored ≥ 3 underwent fusion or cognitive targeted MR-guided biopsy in conjunction with systemic biopsy by five urologists who had a prior experience of at least 1000 TRUS-guided prostate needle biopsies. Patients with PI-RADS 1-2 underwent TRUS-guided systemic biopsy. two high-experience uropathologists reviewed the available histopathological slides according to the 2014 WHO/ISUP recommendation [18]. From histopathology, we primarily defined biopsy-benign, Gleason score 3+3, 3+4, 4+3 and ≥ 4+4 as G0, G1, G2, G3 and G4 group, respectively. We secondly grouped G0-1 into clinically insignificant (CIS) disease, G2-3 into intermediate-risk PCa and G4 into high-risk PCa.
The PI-Risk model was primarily designed for a multi-task classification of G0, G1, G2, G3 and G4 diseases. We randomly split the data of center 1 into training (n = 671) and test (n = 232) group, respectively, for model development and internal test. We also used the data from center 2 with 539 patients for external validation. A flow diagram of patient selection with inclusion and exclusion criteria is showed in supplementary Fig. S1.
Follow-up
The first postoperative visit was 6 weeks later after RP and then patients were consistently followed-up at intervals of 3 to 6 months based on PSA. The time of a BCR was recorded. Patients were censored in case of emigration, or on 30th Jul 2020, whichever came first. The definition of BCR was referred to criteria previously reported [19, 20].
Prostate mpMRI
All imaging exams were performed on two 3.0T MRI scanners with pelvic phased-array coils (MAGNETOM Skyra; Siemens Healthcare, Munich, Germany) at the two institutes. The mpMRI consisted of T2WI in three panes, DWI with high b value of 1500 s/mm2 and apparent diffusion coefficient (ADC) map in axial pane (supplementary data, Table S1).
Lesion Segmentation
Entire volume of interest (VOI) of lesion was segmented using an in-house software (Oncology Imaging Analysis v2; Shanghai Key Laboratory of MRI, ECNU, Shanghai, China) based on histopathologic-imaging matching by two dedicated radiologists (reader 1 and reader 2 with 3-yr and 5-yr experience of prostate imaging). The contours of VOIs were then rechecked in consensus with a board-certified radiologist (reader 3, with 15-yr experience of prostate imaging). In patients with RP (n = 1006), postsurgical ex vivo prostates were processed using a previously described protocol [21]. Key steps included sectioning, digitization, and annotation of cancer regions by highly experienced urological pathologists. The histopathologic specimens were then assembled into pseudo-whole-mount sections and coregistered to the MRI using a previously described registration method [21]. In this way, regions of annotated PCa were mapped onto the images to produce the ground truth maps. In total, histopathologic-imaging matched specimens were identified. In patients without RP (n = 436), all subjects underwent MRI/TRUS-fusion targeted biopsy followed by 11-gauge core systemic needle biopsy. A central challenge in image labeling is the presence of ambiguous regions, where the true tumor boundary cannot be deduced from the image, and thus multiple equally plausible interpretations exist. To fill this gap, the VOI of each lesion was drawn twice by each of two independent radiologists. Regional identification overlapping in two instances was identified as the authorized VOI of the targeted lesion. Because it is inaccessible to achieve an imaging correlation with whole-mount prostatectomy specimens in our retrospective data, the unit of assessment in this study was per-patient. When patients had multiple lesions, only the index lesion with the largest lesion size and/or Gleason score was assessed.
Development, performance, and validation of predictive models
Volumetric radiomics features were analyzed from the target lesions using an open-source Python package Pyradiomics [22]. Image normalization was performed using a method that remapped the histogram to fit within µ ± 3σ (µ: mean gray-level within the VOI and σ: gray-level standard deviation). A total of 2,553 radiomic features such as intensity, shape, texture, and wavelets, were computed from the target volume on T2WI, b-value of 1500 s/mm2 DWI and ADC images that provide rich descriptions on the heterogeneity of entire-volumetric intratumor regions (IntraT-Rad).
The IntraT-Rad features focus on the inner regions of PCa. We further investigated a tumor-related region around the target lesion using novel deep transferable learning feature representations (PeriT-DLR). PeriT-DLR features were directly measured on MRI data using an image embedding toolbox (https://github.com/biolab) through five pre-trained deep neural networks, i.e. DeepLoc, Inception v3, SqueenzeNet, VGG-16 and VGG-19 as embedders [23]. In order to obtain the representative imaging features of the target lesion, we used hand-cropped VOI as an attention to gate each embedder for analyzing PeriT-DLRs (i.e., regions around the PCa) in the center slice of an MRI scan. Each embedder calculates a feature vector for each image and returns an enhanced image descriptor. For image embedding, we used the penultimate layer of embedders to produce new image profiles, serving as another set of imaging feature representations (PeriT-DLRs) in parallel to IntraT-Rad for PCa. The detailed parameters of each embedder are summarized in a supplementary data (Table S2).
Reducing the feature space dimension aims to select informative characteristics, reduce the risk of bias and potential overfitting. To obtain the quantitative imaging hallmarks, we first assessed multi-scale imaging profiles including 2553 vectors from Pyradiomics, 6144 vectors from Inception v3, 3000 vectors from SqueezeNet, 12288 vectors from VGG16, 12288 vectors from VGG19 and 1536 vectors from DeepLoc, respectively, using the mean decrease Gini index (MDGI) calculated by a Random Forest algorithm. The MDGI represents the importance of individual features for correctly classifying a residue into linker and non-linker regions. The MDGI was calculated by classifying 200 randomly selected linker features and 200 non-linker features, and the mean MDGI was calculated as the averaged MDGI over 100 trials. The mean MDGI z-score of each feature was calculated as: , where is the individual MDGI of the feature dedicated; and and σ is the mean and standard deviation of all MDGIs, respectively. Vector elements with MDGI z-score larger than 2.0 were selected as optimum feature candidates. Next, MDGI-selected features from each embedder were analyzed using an auto stacked-ensemble ML based on an open-source auto ML platform (https://github.com/awslabs/autogluon). The first layer of our auto ML framework has 5 base learners such as a k-nearest neighbors (kNN), AdaBoost, Random Forests, Logistic Regression (LR) and a Support Vector Machine (SVM), whose outputs are concatenated and then fed into the next layer, which itself consists of multiple stacker models. These stackers then act as base models to an additional layer. It merely employs random search for hyperparameter tuning, model selection, ensembling, feature engineering, data preprocessing, data splitting, etc., thus offers us an enticing alternative to deploy high-performance stack-ensemble models. We performed a random search over the parameter configuration, and chose the optimal parameters with the best score based on the evaluation of log-loss of ML model on 5-fold cross-validation datasets. The outputs calculated from ML predictor indicated the relative risk that the patient had G0, G1, G2, G3 or G4 disease. By this way, six quantitative imaging hallmarks, i.e., PeriT-DLR-DeepLoc, PeriT-DLR-SqueezeNet, PeriT-DLR-Inception v3, PeriT-DLR-VGG16, PeriT-DLR-VGG19 and IntraT-Rad, were developed from mpMRI, respectively, for decoding the heterogeneity of PCa.
In order to evaluate synergistic effect of multimodal features for the prediction of Gleason grade, the obtained 6 new imaging signatures were integrated with 4 clinical variables such as patient age (≤ 60 yrs, > 60 yrs), PSA level (4-10 ng/ml, 10-20 ng/ml, 20-100 ng/ml and > 100 ng/ml), location of observation (peripheral zone [PZ], transition zone [TZ]) and a PI-RADS score from radiologists’ reports. An interpretable risk assessment model (PI-Risk) was finally developed using a multinomial LR with elastic net penalty. The PI-Risk model is based on proportionally converting each regression coefficient in multivariate logistic regression to a 0- to 100-point scale. The effect of the variable with the highest β coefficient (absolute value) is assigned 100 points. The points are added across independent variables to derive total points, which are converted to predicted probabilities (Pi). The performance of PI-Risk model was independently tested on 232 internal test datasets, and on 539 external validation datasets. The entire flowchart of auto ML analysis for the PI-Risk model development is showed in Fig. 1.
Predictors of clinical outcome
Additionally, we prospectively evaluated a Cox model in using 5 clinic-imaging risk factors including dedicated age, PSA, PI-RADS score and predicted PI-Risk to assess the incremental aspect of our imaging signatures for predicting biochemical recurrence (BCR) of PCa after RP in 462 PCa patients who underwent RP treatment.
Statistical Analysis
By using biopsy and/or prostatectomy specimens as reference standard, the extents of lesions were divided into G0, G1, G2, G3 and G4 group. Quantitative variables were expressed as mean ± standard deviation (mean ± SD) or median and range or median and range, as appropriate. Model performance was typically evaluated against a “ground truth” with imaging-histopathologic annotations using receiver operating characteristic (ROC). Detection rates such as true positive, true negative, false positive and false negative rate were reported using a confusion matrix analysis. The Cox model’s performance was evaluated based on Harrell’s concordance index (C-index), calibration curves and Kaplan–Meier survival analysis. All the statistics were two-sided, and a p-value less than 0.05 was considered statistically significant. All statistical analyses were performed using MedCalc software (V.15.2; 2011 MedCalc Software bvba, Mariakerke, Belgium) and R software package (V.4.0.2; https://www.r-project.org).