Automated Gleason Scoring and Tumor Quantification in Prostate Core Needle Biopsy Images Using Deep Neural Networks and Its Comparison with Pathologist-Based Assessment

The Gleason grading system, currently the most powerful prognostic predictor of prostate cancer, is based solely on the tumor’s histological architecture and has high inter-observer variability. We propose an automated Gleason scoring system based on deep neural networks for diagnosis of prostate core needle biopsy samples. To verify its efficacy, the system was trained using 1133 cases of prostate core needle biopsy samples and validated on 700 cases. Further, system-based diagnosis results were compared with reference standards derived from three certified pathologists. In addition, the system’s ability to quantify cancer in terms of tumor length was also evaluated via comparison with pathologist-based measurements. The results showed a substantial diagnostic concordance between the system-grade group classification and the reference standard (0.907 quadratic-weighted Cohen’s kappa coefficient). The system tumor length measurements were also notably closer to the reference standard (correlation coefficient, R = 0.97) than the original hospital diagnoses (R = 0.90). We expect this system to assist pathologists to reduce the probability of over- or under-diagnosis by providing pathologist-level second opinions on the Gleason score when diagnosing prostate biopsy, and to support research on prostate cancer treatment and prognosis by providing reproducible diagnosis based on the consistent standards.

By excluding cases with ambiguous or empty responses from pathologists, 674 from the 700 cases were available for difficulty stratification: 147 easy, 505 medium, and 22 hard. Among the easy cases, 140 were benign and the system's analysis showed the highest diagnostic concordance with By excluding cases with ambiguous or empty responses from pathologists, 674 from the 700 cases were available for difficulty stratification: 147 easy, 505 medium, and 22 hard. Among the easy cases, 140 were benign and the system's analysis showed the highest diagnostic concordance with the reference standards and original hospital diagnoses. In contrast, the hard cases showed the worst correlation, and the slides with medium difficulty showed moderate correlation. The corresponding kappa coefficient matrices are detailed in Table 2. Figure 3 shows the distribution within each difficulty level, and Figure 4 shows the representative images of each level. The concordance between pathologists was also measured at each difficulty level, as listed in Table 3. Similar to the DeepDx Prostate results, the correlation between the pathologist-based scores was lower for the hard level than for the easy level. the reference standards and original hospital diagnoses. In contrast, the hard cases showed the worst correlation, and the slides with medium difficulty showed moderate correlation. The corresponding kappa coefficient matrices are detailed in Table 2. Figure 3 shows the distribution within each difficulty level, and Figure 4 shows the representative images of each level. The concordance between pathologists was also measured at each difficulty level, as listed in Table 3. Similar to the DeepDx Prostate results, the correlation between the pathologist-based scores was lower for the hard level than for the easy level.     The DeepDx Prostate results of tumor length measurements were also notably closer to the reference standard than the original hospital diagnoses. Correlation coefficient R between the tumor The DeepDx Prostate results of tumor length measurements were also notably closer to the reference standard than the original hospital diagnoses. Correlation coefficient R between the tumor lengths of the pathologists and DeepDx Prostate was above 0.97, whereas the correlations for the original diagnoses were all below 0.90, as shown in the correlation matrix of Figure 5.  There were 76 cases within the validation set for which one or more pathologists reported poor image quality. Among them, 49 cases showed faults during slide preparation and 33 cases showed errors during scanning. However, none of these cases impeded the diagnoses through DeepDx Prostate or by the pathologists ( Figure S1 shows poor quality images).
Among the 11 atypical small acinar proliferation (ASAP) cases, one or more pathologists marked six cases as requiring immunohistochemical staining, such as high molecular weight cytokeratin (HMW-CK; Table S1 shows details of ASAP cases). Five of these cases were confirmed to exhibit acinar adenocarcinoma or atypical small acinar proliferation (ASAP) under HMW-CK staining ( Figure S2).

Discussion
In this study, we introduced an automated Gleason scoring system, trained it using 1,133 cases of prostate core needle biopsy samples, and validated it on 700 cases. The system diagnoses were compared with the reference standard derived from three certified pathologists. The results showed a substantial diagnostic concordance between the system-grade group classification and reference standard (non-weighted Cohen's kappa coefficient was 0.615); this value was higher than that of the original hospital diagnoses (0.524), or the previously reported agreements between pathologists, ranging between 0.4 and 0.6 [7,8]. As the Gleason score provides an important index of prognosis and is one of the key determinants for selecting treatment modalities, the variability between the diagnoses of different pathologists is of clinical concern. The automated Gleason scoring system can help pathologists by providing diagnoses based on consistent standards and clear percentages of each Gleason pattern; this could help urologists in selecting appropriate treatment modalities.
Several studies have addressed automated Gleason scoring by using deep learning. Arvaniti et al. [24] reported a deep learning system that achieved comparable results with pathologist evaluations on Gleason scoring with respect to resection specimen of prostate tissue microarray. Likewise, Nagpal et al. used deep learning based on a modified InceptionV3 classifier to obtain reliable performance in the grade-group determination on resection specimens from the TCGA dataset [25].
Both studies focused on patch-level neural networks, whereas our automated Gleason system implements a pixel-level segmentation neural network, possibly achieving more reliable Cohen's There were 76 cases within the validation set for which one or more pathologists reported poor image quality. Among them, 49 cases showed faults during slide preparation and 33 cases showed errors during scanning. However, none of these cases impeded the diagnoses through DeepDx Prostate or by the pathologists ( Figure S1 shows poor quality images).
Among the 11 atypical small acinar proliferation (ASAP) cases, one or more pathologists marked six cases as requiring immunohistochemical staining, such as high molecular weight cytokeratin (HMW-CK; Table S1 shows details of ASAP cases). Five of these cases were confirmed to exhibit acinar adenocarcinoma or atypical small acinar proliferation (ASAP) under HMW-CK staining ( Figure S2).

Discussion
In this study, we introduced an automated Gleason scoring system, trained it using 1133 cases of prostate core needle biopsy samples, and validated it on 700 cases. The system diagnoses were compared with the reference standard derived from three certified pathologists. The results showed a substantial diagnostic concordance between the system-grade group classification and reference standard (non-weighted Cohen's kappa coefficient was 0.615); this value was higher than that of the original hospital diagnoses (0.524), or the previously reported agreements between pathologists, ranging between 0.4 and 0.6 [7,8]. As the Gleason score provides an important index of prognosis and is one of the key determinants for selecting treatment modalities, the variability between the diagnoses of different pathologists is of clinical concern. The automated Gleason scoring system can help pathologists by providing diagnoses based on consistent standards and clear percentages of each Gleason pattern; this could help urologists in selecting appropriate treatment modalities.
Several studies have addressed automated Gleason scoring by using deep learning. Arvaniti et al. [24] reported a deep learning system that achieved comparable results with pathologist evaluations on Gleason scoring with respect to resection specimen of prostate tissue microarray. Likewise, Nagpal et al. used deep learning based on a modified InceptionV3 classifier to obtain reliable performance in the grade-group determination on resection specimens from the TCGA dataset [25].
Both studies focused on patch-level neural networks, whereas our automated Gleason system implements a pixel-level segmentation neural network, possibly achieving more reliable Cohen's kappa coefficient (0.615) and Cohen's quadratic kappa coefficient (0.907). Similarly, Campanella et al. [22] trained a patch-level neural network with a massive dataset containing over 10,000 WSIs to access slide-level prediction and determine whether slides contain cancerous regions through multiple instance learning. They obtained a high area under the receiver operating characteristic curve (0.98) without requiring region-level annotations from pathologists.
Moreover, Arvaniti et al. [24] and Nagpal et al. [25] showed the possibility of better risk stratification in biochemical recurrence or disease progression through deep learning. Nagpal et al. [25] suggested a fine-grained Gleason scoring system with interpolation methods to enhance biomarkers in prognostic factors. Our measurements, including grade groups and tumor length obtained from the automated Gleason scoring system will be studied in survival analysis to test whether these factors are significant and remain consistent.
We asked three pathologists to assess not only the Gleason score but also the difficulty level (easy, medium, or hard) of each case. Concordance analysis with respect to difficulty level revealed that diagnostic concordance was highest in easy cases and deteriorated as case difficulty increased, and this trend was consistent with the correlation analysis between pathologists. Benign cases composed the majority of easy cases, whereas notable peaks were observed for grade groups 2 and 5 at both the medium and hard levels. In grade group 2 (Gleason score 3 + 4), variable judgment about the composition ratios of Gleason patterns 3 and 4 could be the reason for the lower concordance rate. While reviewing the hard cases of grade group 5, we found several cases with lesions suspicious of intraductal carcinoma of the prostate (IDC-P), requiring differential diagnosis from cribriform Gleason pattern 4 through additional immunohistochemical staining.
We additionally considered cancer quantification in each case. Several methods have been used in previous studies to evaluate the extent of tumor. Evaluation metrics include the number/fraction of positive cores, tumor percentage or linear millimeters of carcinoma per core, linear percentage carcinoma or linear millimeters of carcinoma in core with the greatest tumor volume, and total linear millimeters of carcinoma across all cores [27,28]. There is no consensus regarding which method is the best in measuring the extent of tumor in prostate core needle biopsy; several studies have demonstrated relationships between their respective methods of tumor measurement and certain prognostic factors. For instance, Quintal et al. [29] found that the percentage of total carcinoma length (in millimeters) in all cores of a needle biopsy was the strongest predictor for stages beyond pT2 and the risk of biochemical recurrence following radical prostatectomy. Brimo et al. [30] found that the fraction of positive cores, the total percentage of carcinoma, and both the total and greatest cancer core length were closely related to the pathologic stage and biochemical failure.
In this study, we asked pathologists to measure carcinoma in terms of linear millimeters. Tumor lengths measured by pathologists from WSIs and those automatically determined by the system showed a high correlation, whereas a notable difference was observed in the measurements obtained from glass slides of the source hospitals. To determine the tumor area, pathologists usually measure the tumor length by using a ruler or by manually estimating tumor percentage. However, this is a time-consuming method, and the results may be inaccurate and vary depending on the pathologists. Despite conflicting findings about the superiority of measurement methods affecting clinical management, the use of DeepDx Prostate can provide various tumor quantitative data, such as tumor percentage and length, and these data can be useful for relating tumor volume to prognosis.
This study had several limitations. As there was no final diagnosis of IDC-P or high-grade prostatic intraepithelial neoplasia (HGPIN) solely without prostate adenocarcinoma in the validation set, accurate performance analysis of this lesion could not be performed, although the pathologist performed gland-level annotation for IDC-P, HGPIN, or ASAP lesions. In addition, we had insufficient data on precancerous lesions, such as IDC-P, HGPIN, and ASAP. We expected to improve the system performance to better identify these precancerous lesions by training more data including immunohistochemically stained slide images. We also planned to validate the system's performance on these lesions by building path-level reference standards and performing path-level analysis.
In addition, we were not able to evaluate the system's performance on data generated by different digital image scanners, which have different color tones in practice; in the future, we planned to evaluate the compatibility by training and evaluating our system using data scanned from different scanners.
Although our system clearly indicates the percentage of each Gleason pattern, validation was not possible because the pathologists were not asked to describe the percentage of each Gleason pattern, only tumor length. We also planned to investigate the relationship between grade groups and tumor volume, including the percentage of Gleason pattern, analyzed through our system with patients' outcome by using survival data, including the recurrence rate. We would also try to find morphological phenotypes reflecting poor or better prognosis in a single grade group.
In addition, we would evaluate our system's performance by comparing only pathologist-based diagnoses and diagnoses by a pathologist assisted by the proposed system. We confirmed that all protocols in this study were performed in accordance with relevant regulations. The original glass slides from all cores included in the study were randomized and de-identified, and the institutional review boards of the two hospitals waived the need for informed consent from each patient. In this paper, we included cases of patients who underwent prostate core needle biopsy between 2010 and 2018.

Data Acquisition and Digitization
After anonymizing patient data, the slides were digitized using Aperio AT2 scanners (Leica Biosystems Inc., Vista, CA, USA), at ×40 magnification, (i.e., resolution of 0.25 µm/pixel). After digitization, 700 images were selected for the validation set and balanced according to the grade groups reported in the original diagnosis. Specifically, we selected 100 acinar adenocarcinoma cases per grade group and 200 cases including 188 benign and 11 ASAP. However, owing to misdiagnosis later discovered in the original hospital diagnoses, one benign case was reclassified as grade group 3. The remaining 1133 cases were used for the discovery set, as detailed in Table 4.

Data Annotation
To develop the deep learning system, an experienced Korea board-certified pathologist provided gland-level accurate annotations for the 1133 WSIs in the discovery set by using annotation tools of DeepDx Prostate (Deep Bio Inc., Seoul, Korea). The pathologist blindly reviewed each core, marking tissue regions according to one of the following seven classes: benign tissue, acinar adenocarcinoma of Gleason patterns 3, 4, and 5, HGPIN, intraductal carcinoma of the prostate (IDC-P), and ASAP [4]. WSIs of high molecular weight cytokeratin (HMW-CK, Mouse monoclonal, Clone 34βE12, 1:200 dilution, Dako, Glostrup, Denmark) immunohistochemical stain of all cores except benign (according to original hospital diagnosis) were provided to the pathologist for accurate annotation.
To establish the reference standard diagnoses for the 700 WSIs in the validation set, three pathologists, who were not involved in the creation of the discovery set, reviewed each WSI independently for slide-level Gleason scores. The final Gleason scores were defined by majority vote, and in the case of a disagreement, the diagnoses of the genitourinary specialist were selected. Along with the Gleason scores, the pathologists were asked to identify each slide for the following characteristics: the necessity for immunohistochemistry, image quality in terms of slide preparation and digitization process, diagnostic difficulty of the case based on three levels (easy: cases whose diagnoses can be made simply by glancing at the WSIs; hard: cases that require consultation or immunohistochemical staining for diagnosis; medium: others), and tumor length if applicable. All the measurements, including Gleason scores, tumor lengths, and difficulty levels, were conducted using annotation tools of DeepDx Prostate.
To prevent any discrepancies that may arise from different color displays, the five pathologists were provided with identical hardware configurations (iMac Retina 5K display, 27 inches, model 2017; Apple Inc., Cupertino, CA, USA). Cases diagnosed as another carcinoma, such as urothelial carcinoma, were excluded from this study.

Automated Gleason Scoring System
The procedure of the proposed DeepDx Prostate system comprises two steps: patch-level segmentation and slide-level evaluation. In the first step, the target WSI is split into a grid of 704 × 704 pixel-sized image patches, which are then analyzed by the deep learning system in batches. The deep learning system retrieves a segmentation result per patch with respect to five categories: non-cancerous; Gleason patterns 3, 4, and 5; and IDC-P. In the second step, the patch results in a slide are aggregated into a single heatmap. The proportion of each Gleason pattern within the heatmap determines the final Gleason score ( Figure S3 shows the overall algorithm workflow of our system). The deep learning system consists of two artificial neural networks of identical structure based on DeepLab v3+ [31] with additional non-local blocks [32] placed along with the connections between encoder and decoder of DeepLab v3+, extended from the original to accept image patches of the aforementioned size for the lesion segmentation. The networks were trained solely on the set of 940,875 image patches and the same number of corresponding region-level annotations, which was constructed by the patch extraction on a regular grid of size 352 × 352 from 1133 WSIs in the discovery set, followed by filtering out the patches with less than 10% being tissues. The networks were trained on eight-GPU machines using the stochastic gradient descent optimizer with Nesterov momentum. The initial learning rate was set to 0.01, which was multiplied by 0.1 for each 10 epochs during a total of 120 epochs. During training, we applied various random image transformations to augment the data. Specifically, image patches and corresponding annotations were rotated and flipped randomly, and then brightness, contrast, saturation, and hue were randomly modified to alleviate the effect of color variance from staining and scanning. In addition, image patches were extracted at sizes in the range of ±20%, and then resized to 704 pixels × 704 pixels to provide robustness against variation of magnification. The set of patches was randomly split into five nonoverlapping, equal-sized portions: A, B, C, D, and E. One network was trained using A, B, C, and D and tuned using E, while the other was trained using B, C, D, E, and tuned using A. The final system output was defined as the average of the two network outputs.
To further train the deep learning system, given the discovery set of limited size, we adopted hard-example mining after the initial network training, as in [25]. In the hard-example mining step, the system-generated patch segmentation results were evaluated using their mean intersection over union (mIoU) with the pathologist annotations at the beginning of each 15-epoch round. Patches with mIoU of less than 0.95 were considered as hard examples and selected for training. The snapshot and performance of each network on the tuning set were saved at each epoch. After 15 rounds of hard-example mining, the network snapshots associated with the highest performance were chosen for each network for the ensemble.
All the data preparation, network training, and evaluation were done with codes implemented in Python [33], using OpenCV [34] for tissue area detection by Otsu thresholding [35], and PyTorch [36] for the neural network implementation, training, and data augmentation.

Evaluation
The results from the validation set were evaluated by considering the grade groups corresponding to the Gleason scores generated by our system. We compared our system results to the established reference standard, individual diagnoses of each pathologist, and original diagnoses from the respective hospitals. The cases reported as ASAP by the original hospital diagnoses were analyzed separately.
The system performance was categorized according to the difficulty levels annotated by the pathologists, where easy, medium, and hard levels were labeled as 1, 2, and 3, respectively. The average difficulty level by the pathologists was rounded to the nearest integer to establish the reference standard of each case.
To assess the system performance based on the grade-group classification, Cohen's kappa coefficient was used with and without quadratic weighting. For k number of classes, Cohen's kappa coefficient κ is defined as following based on the number of elements in weight w ij , in observed x ij , and in expected matrices m ij , where κ is not weighted if w ij = 0, 1 and is quadratically weighted if In addition to classification, we evaluated the system's ability for quantifying cancer in terms of tumor length. Tumor length measurements from the original hospital diagnoses, those independently measured by pathologists using the annotation feature in DeepDx Prostate, and measurements generated by the system were compared. Pearson correlation coefficient R was used as the evaluation measure, and it is defined as where X and Y are the compared random variables, σ represents the standard deviation of the corresponding variable, and COV() is the covariance. Only WSIs found to be Gleason 6 or above in all criteria (hospital diagnoses, evaluation by reference standard, and system-generated results) were considered in length comparisons.

Conclusions
This paper proposed an automated Gleason scoring system based on deep neural networks for diagnosis of prostate core needle biopsy samples. The proposed system can assist pathologists to reduce the probability of over-or under-diagnosis by providing pathologist-level second opinions on the Gleason score when diagnosing prostate biopsies. We also expect that the proposed system will support research on prostate cancer treatment and prognosis by providing consistent and reproducible diagnosis.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2072-6694/11/12/1860/s1, Figure S1: Analysis result of poor quality images by DeepDx Prostate, Figure S2: Representative image of atypical small acinar proliferation case and high molecular weight cytokeratin (HMW-CK) staining, Figure S3: The overall algorithm workflow of our system, Table S1: Diagnoses for cases with atypical small acinar proliferation (ASAP) and respective immunohistochemical (IHC) staining requests. Data and Code Availability: The datasets used in this study are not publicly available at this moment due to data usage agreement restrictions. The de-identification process and usage of slides have been approved by the respective Institutional Review Boards (IRBs) of Korea University Guro Hospital and Hanyang University Medical Center. The source codes used in this paper is copyrighted by Deep Bio Inc. and is available only under appropriate license agreement with the copyright holder.