Performance Evaluation of Deep Learning-Based Prostate Cancer Screening Methods in Histopathological Images: Measuring the Impact of the Model’s Complexity on Its Processing Speed

Prostate cancer (PCa) is the second most frequently diagnosed cancer among men worldwide, with almost 1.3 million new cases and 360,000 deaths in 2018. As it has been estimated, its mortality will double by 2040, mostly in countries with limited resources. These numbers suggest that recent trends in deep learning-based computer-aided diagnosis could play an important role, serving as screening methods for PCa detection. These algorithms have already been used with histopathological images in many works, in which authors tend to focus on achieving high accuracy results for classifying between malignant and normal cases. These results are commonly obtained by training very deep and complex convolutional neural networks, which require high computing power and resources not only in this process, but also in the inference step. As the number of cases rises in regions with limited resources, reducing prediction time becomes more important. In this work, we measured the performance of current state-of-the-art models for PCa detection with a novel benchmark and compared the results with PROMETEO, a custom architecture that we proposed. The results of the comprehensive comparison show that using dedicated models for specific applications could be of great importance in the future.


Introduction
Prostate cancer (PCa) is the second most common cancer and the fifth leading cause of cancer death in men (GLOBOCAN [1]). In 2018, almost 1.3 million cases and around 360,000 deaths worldwide were registered due to this malignancy. According to the World Health Organization (WHO), there will be an increase of prostate cancer (PCa) cases worldwide, with 1,017,712 new cases being estimated for 2040. Most of these cases will be registered in Africa, Latin America, the Caribbean and Asia, and appear to be related to an increased life expectancy [2].
To diagnose PCa, digital rectal examination (DRE) is the primary test for the initial clinical assessment of the prostate. Then, prostate-specific antigen (PSA) is used as a screening method for the investigation of an abnormal prostatic nodule found in the DRE. Finally, in the case of abnormal DRE and elevated PSA results, trans-rectal ultrasound-guided biopsy is performed to obtain samples of the prostate tissue [3]. These tissue samples are then scanned, resulting in gigapixel-resolution images called whole-slide images (WSIs), which are analyzed and diagnosed by pathologists.
Due to the high increment of new cases, and thanks to the impacts of artificial intelligence (AI) in recent years [4,5], several computer-aided diagnosis (CAD) systems have been developed to speed up the process of PCa diagnosis. A computer-aided diagnosis (CAD) system is an automatic or semi-automatic algorithm whose purpose is to assist doctors in the interpretation of medical images in order to provide a second opinion in the diagnosis. Among the different AI algorithms, deep learning (DL) has become very popular in recent years, and convolutional neural networks (CNNs) particularly [6]. They have been applied in several fields in medical image analysis, such as in disorder classification [7], lesion/tumor classification [8], disease recognition [9] and image construction/enhancement [10], among others.
Deep learning (DL) algorithms have also been applied to other medical image analysis fields such as histopathology, in which whole-slide images (WSIs) are used. Since it is not possible for a convolutional neural network (CNN) to work with a whole WSI as input due to its large size, a common approach is to divide this image into small subimages called patches. This procedure has been widely used in order to develop CAD systems in this field.
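As a minimal sketch of the patch-division idea described above, the following Python snippet enumerates the top-left corners of non-overlapping square patches over an image of a given size. The function name `patch_origins` and the slide dimensions are hypothetical and chosen only for illustration; the patch size of 100 pixels follows the value reported for the benchmark in Section 2.3. It is not the implementation used in this work.

```python
from typing import Iterator, Tuple


def patch_origins(width: int, height: int, patch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) corner of every full, non-overlapping patch.

    Partial patches at the right and bottom borders are skipped.
    """
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            yield x, y


# Hypothetical image size; real WSIs are several orders of magnitude larger.
origins = list(patch_origins(width=1000, height=500, patch_size=100))
print(len(origins))  # 10 columns x 5 rows of patches
print(origins[:3])
```

Each coordinate pair would then be used to read the corresponding region from the WSI, so the number of patches grows with the slide area divided by the patch area.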
Recently, many researchers have investigated the application of CAD systems to the diagnosis of PCa in WSIs. Ström et al. [11] developed a DL-based CAD system to perform a binary classification distinguishing between malignant and normal tissue. The classification was performed using an ensemble of 30 widely used InceptionV3 models [12] pretrained on ImageNet. They achieved areas under the curve (AUC) of 0.997 and 0.986 on the validation and test subsets, respectively. For areas detected as malignant, the authors trained another ensemble of 30 InceptionV3 CNNs in order to discriminate between different PCa Gleason grading system (GGS) scores, achieving a mean pairwise kappa of 0.62 at slide level. Campanella et al. [13] presented a CAD system to detect malignant areas in WSIs. The classification was performed with the well-known ResNet34 model [14] together with a recurrent neural network (RNN) for tumor/normal classification, achieving an AUC of 0.986 at slide level. In a previous study [15], we proposed a CAD system in which we focused on performing a patch-level classification of histopathological images between normal and malignant tissue. The proposed architecture, called PROMETEO, consisted of four convolution stages (convolution, batch normalization, activation and pooling layers) and three fully connected layers. The network achieved 99.98% accuracy, 99.98% F1 score and 0.999 AUC on a separate test set at patch level after training the network with a 3-fold cross-validation method.
These previous works achieved competitive results in terms of accuracy, precision and other commonly-used evaluation metrics. However, to the best of our knowledge, most state-of-the-art works do not focus on prioritizing the speed of the CAD system as an important factor. Many of them used very complex, well-known networks to train and test, without taking into account the computational cost and the time required to perform the whole process. Since these algorithms are not intended to replace pathologists but to assist them in their task, in some cases it is better to prioritize the speed of the analysis, sacrificing some precision so that the expert has a faster and more dynamic response from the system.
In this paper, a novel benchmark was designed in order to measure the processing and prediction time of a CNN architecture for a PCa screening task. First, the proposed benchmark was run for the PROMETEO architecture on different computing platforms in order to measure the impacts that their hardware components have on the WSI processing time. Then, using the personal computer (PC) configuration that achieved the best performance, the benchmark was run with different state-of-the-art CNN models, comparing them in terms of average prediction time both at patch level and at slide level, and also reporting the slowdown when compared to PROMETEO.
The rest of the paper is structured as follows: Section 2 introduces the materials and methods used in this work, including the dataset (Section 2.1), the CNN models (Section 2.2) and the benchmark proposed (Section 2.3). The results are then presented in Section 3, which is divided into two experiments: first, the performance of the proposed CNN model is evaluated on different platforms, and then it is compared to state-of-the-art, widely-known CNN architectures. Sections 4 and 5 present the discussion and the conclusions of this work, respectively.

Dataset
In this work, a dataset with WSIs obtained from three different hospitals was used. These cases consisted of different Hematoxylin and Eosin-stained slides globally diagnosed as either normal or malignant.
From Virgen de Valme Hospital (Sevilla, Spain), 27 normal and 70 malignant cases obtained by means of needle core biopsy were digitized into WSIs. Clínic Barcelona Hospital (Barcelona, Spain) provided 100 normal and 129 malignant WSIs, also obtained by means of needle core biopsy. Finally, from Puerta del Mar Hospital (Cádiz, Spain), 65 malignant (26 obtained from needle core biopsy and 39 from incisional biopsy) and 79 normal (33 obtained from needle core biopsy and 46 from incisional biopsy) WSIs were obtained. Table 1 summarizes the WSIs considered in the dataset.

CNN Models
Different CNN models were considered in this work in order to compare their performance by using the benchmark proposed in Section 2.3. Three different architectures from state-of-the-art DL-based PCa detection works were compared, along with other well-known CNN architectures. The first one is the custom CNN model, called PROMETEO, which we proposed in [15], where we also demonstrated that applying stain-normalization algorithms to the patches in order to reduce color variability could improve the generalization of the model when predicting new unseen images from different hospitals and scanners. The second CNN architecture that was considered in this work is the well-known ResNet34 model [14], which was used by Campanella et al. in [13]. The third one is InceptionV3, introduced in [12], which was used by Ström et al. [11].
Apart from these three CNN models, other widely-known architectures were evaluated with the same benchmark, comparing their performance in terms of execution time with the rest of the networks for the same task. These were VGG16 and VGG19 [16], MobileNet [17], DenseNet121 [18], Xception [19] and ResNet101 [14].

Benchmark
In this work, a novel benchmark was designed in order to measure and compare the performances of different CNN models and platforms on a PCa screening task. To make the benchmark easy to share with other researchers so that it could be run on different computers, a reduced set of WSIs was chosen from the dataset presented in Section 2.1. Since the WSIs of the dataset take up more than 300 gigabytes (GB) of hard drive space, only 40 of them were considered, building up a benchmark of around 50 GB, which is much easier to share. These 40 WSIs were randomly selected, considering all three hospitals and scanners, and thus representing well the diversity of the dataset in this benchmark.
The benchmark performs a set of processing steps which are detailed next (see Figure 1). First, as introduced in Section 1, since it is not possible for a CNN to use a whole WSI as input due to its large size, these images are divided into small subimages called patches (100 × 100 pixels at 10× magnification in this case), which are read from each WSI. This process is called "read," and apart from extracting the patches from the input WSI, those corresponding to background are discarded (identified as D in the figure). Then, in the scoring step, a score is given to each patch depending on three factors: the amount of tissue that it contains, the percentage of pixels that are within Hematoxylin and Eosin's hue range and the dispersion of the saturation and brightness channels. This score allows discarding patches corresponding to unwanted areas, such as pen marks, external agents and patches with small amounts of tissue, among others. In Figure 1, patches discarded in this step are highlighted in red, while those that pass the scoring filter are highlighted in green. The third step, called stain normalization, performs a color normalization of the patch based on Reinhard's stain-normalization algorithm [20,21] in order to reduce color variability between samples. In prediction, which is the last step of the process, each of the patches is used as input to a trained CNN, which classifies it as either malignant or normal tissue. Deeper insights into these steps are given in [15]. When the execution of the benchmark finishes, it reports both the hardware and system information of the computer used to run the benchmark, and the results of the execution. These results consist of the mean execution time and standard deviation for each of the four processes (read, scoring, stain normalization and prediction) shown in Figure 1 and presented in [15], both at patch level and at WSI level.
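The four-step flow and its per-step timing report can be sketched as follows. This is a simplified illustration under stated assumptions, not the benchmark's actual code: the step functions are hypothetical placeholders (the real read, scoring, stain-normalization and prediction routines are described in [15]), a step signals a discarded patch by returning `None`, and the population standard deviation is used for the summary.

```python
import time
from collections import defaultdict
from statistics import mean, pstdev

STEPS = ("read", "scoring", "stain_normalization", "prediction")


def run_pipeline(patches, steps, timings):
    """Run each patch through the steps in order and time every step.

    A step returns None to discard the patch; later steps are then skipped,
    so they accumulate fewer timing samples than the earlier ones.
    """
    kept = []
    for patch in patches:
        for name in STEPS:
            start = time.perf_counter()
            patch = steps[name](patch)
            timings[name].append(time.perf_counter() - start)
            if patch is None:
                break
        else:
            kept.append(patch)
    return kept


def summarize(timings):
    """Return {step: (mean seconds, standard deviation)} for each timed step."""
    return {name: (mean(values), pstdev(values)) for name, values in timings.items()}


# Hypothetical placeholder steps and patches, for illustration only.
steps = {
    "read": lambda p: None if p["background"] else p,
    "scoring": lambda p: p if p["score"] >= 0.5 else None,
    "stain_normalization": lambda p: p,
    "prediction": lambda p: {**p, "label": "normal"},  # stand-in, not a classifier
}
patches = [
    {"background": True, "score": 0.0},
    {"background": False, "score": 0.2},
    {"background": False, "score": 0.9},
]
timings = defaultdict(list)
kept = run_pipeline(patches, steps, timings)
print(len(kept), {name: len(timings[name]) for name in STEPS})
print(summarize(timings))
```

In this toy run, three patches are read, two are scored and only one is normalized and predicted, which mirrors how the number of samples behind each average shrinks along the pipeline.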

Results
The CNN-based PROMETEO architecture described in Section 2.2 was proposed and evaluated in terms of accuracy and many other evaluation metrics in [15]. In this work, we evaluated that model in terms of performance and execution time per patch and WSI.
First, the same architecture was tested in different platforms using the benchmark proposed in Section 2.3. These results allow us to measure and quantify the impacts of different components in the whole processing and prediction process, which is useful for designing an edge-computing prostate cancer detection system. Then, the benchmark was used to evaluate the performances of different state-of-the-art CNN architectures on the computing platform that achieved the best results on the first experiment.
Fourteen different PC configurations were used to evaluate the performance of the PROMETEO architecture introduced in Section 2.2. The hardware specifications (central processing unit (CPU) and graphics processing unit (GPU)) of these computers are listed in Table A1 of Appendix A. In Figure 2, the average patch processing time is shown for each of the fourteen configurations, where the mean time for the steps performed when processing a patch (see Section 2.3) is reported. As can be seen, prediction is the step that requires the most time in most of the cases, but it is greatly reduced in configurations that include a GPU.
Figure 3 depicts the average and standard deviation of the execution time needed per WSI when running the benchmark on the fourteen different PC configurations. As in Figure 2, each of the steps considered in the whole process is shown. As can be seen, reading the whole WSI patch by patch is the step that takes the longest in most of the devices (mainly in those configurations with no GPU). This might seem contradictory considering Figure 2, but it is important to mention that, in that step, all patches from a WSI are read and analyzed, but not all of them are processed in the following steps. Unwanted areas, such as background regions with no tissue, are discarded before being scored. Then, only those which are not background and pass the scoring step are stain normalized and predicted by the CNN.
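The apparent contradiction between the patch-level and the WSI-level results can be illustrated with a short calculation. All numbers in the following sketch are hypothetical and were chosen only to show the effect; they are not measurements from the benchmark.

```python
def step_time_per_wsi(patches_processed: int, seconds_per_patch: float) -> float:
    """Total time a step contributes to one WSI."""
    return patches_processed * seconds_per_patch


# Hypothetical values, for illustration only.
patches_read = 100_000          # every patch of the WSI goes through "read"
patches_predicted = 5_000       # only non-background patches that pass scoring
read_time_per_patch = 0.0002    # seconds
predict_time_per_patch = 0.001  # seconds, five times slower per patch

read_total = step_time_per_wsi(patches_read, read_time_per_patch)
predict_total = step_time_per_wsi(patches_predicted, predict_time_per_patch)

print(f"read:       {read_total:.1f} s per WSI")
print(f"prediction: {predict_total:.1f} s per WSI")
```

Even though prediction is slower per patch in this example, the read step dominates at WSI level because it is applied to far more patches.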

PROMETEO Evaluation
The sum of the average execution times of the four processing steps for each WSI was computed and is shown in Figure 4. The best case (device M, see Table A1) takes 22.56 ± 5.67 s on average to perform the whole process per WSI, where the prediction step only represents 4.20 ± 1.73 s.
The execution times obtained and used for generating the plots presented in this subsection are detailed in Table A2 of Appendix A.

Performance Comparison for Different State-of-the-Art Models
After evaluating the PROMETEO architecture using the benchmark designed for this work with different PCs, the same network was compared to other widely-known architectures. For this purpose, the same computer (device M) was used in order to perform a fair comparison. The same benchmark that was used in the previous evaluation (see Section 3.1) was executed in computer M (see Table A1) for each of the CNN architectures mentioned in Section 2.2. The CNNs considered are PROMETEO [15], ResNet34 and ResNet101 [14], InceptionV3 [12], VGG16 and VGG19 [16], MobileNet [17], DenseNet121 [18] and Xception [19].
The average patch processing time per processing step can be seen in Figure 5 for each of the architectures mentioned. Since the architecture does not have an effect on the first three steps (reading the patch from the WSI, scoring it in order to discard unwanted patches, and normalizing it), the times needed to process them are similar across all the different cases reported in the figure. This does not happen with the prediction time, which directly depends on the complexity of the network.
Figure 6 reports the combined processing time that device M takes to compute a WSI on average, together with its corresponding standard deviation. The same effect explained in Section 3.1, where the WSI reading step takes much longer than the patch reading step in relation to the rest of the subprocesses, can also be observed in this figure. It is important to mention that the model proposed by the authors is the fastest in terms of prediction time, with a total processing time of 22.56 ± 5.67 s per WSI on average.
Table 2 presents a summary of the results obtained for each architecture, focusing on the prediction process, which is the only one affected when changing the CNN architecture. Moreover, the number of trainable parameters and the slowdown are also reported. The latter is calculated by dividing the average prediction time per WSI of the corresponding CNN by that obtained with PROMETEO. This way, the improvement in terms of prediction time between PROMETEO and the rest of the architectures considered can be clearly seen. The proposed model predicts 2.55× faster than the CNN used in [13] and 11.68× faster than the one used in [11]. It is also important to mention that, in the latter, the authors did not use only an InceptionV3 model, but an ensemble of 30 of them. In this case, the figures and tables only report the execution times for a single network.
When compared to the other widely-known architectures, PROMETEO is between 7.41× and 12.50× faster. The execution times obtained and used for generating the plots presented in this subsection are detailed in Table A3 of Appendix B.
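The slowdown defined above is a simple ratio, which can be sketched as follows. The PROMETEO prediction time of 4.20 s per WSI is the value reported for device M in Section 3.1; the second time is hypothetical and was back-computed from the reported 2.55× figure purely for illustration, so it should not be read as a value from Table 2.

```python
def slowdown(model_time_s: float, reference_time_s: float) -> float:
    """Ratio between a model's average prediction time per WSI and the reference's."""
    if reference_time_s <= 0:
        raise ValueError("reference time must be positive")
    return model_time_s / reference_time_s


prometeo_prediction_s = 4.20      # reported for device M
other_model_prediction_s = 10.71  # hypothetical, back-computed for illustration

ratio = slowdown(other_model_prediction_s, prometeo_prediction_s)
print(f"{ratio:.2f}x")
```

A value of 1.0 means the model predicts as fast as the reference, and larger values mean proportionally slower prediction.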

Discussion
In order to design a fast edge-computing platform for PCa detection, an evaluation of a proposed CNN was performed. This allowed us to compare different hardware components and configurations and measure the impacts of them when processing WSIs. Apart from the figures presented in Section 3.1, two specific cases are highlighted in Figure 7. Figure 7a shows the impact that the frequency of the CPU has on the whole process when using the same computer. As it can be seen, the four processing steps clearly benefit when a faster CPU is used. On the other hand, Figure 7b compares two cases where the same configuration is used, except for the GPU, which was removed in one of them. As expected, the GPU highly accelerated the prediction time (by around three times in this case). Therefore, in order to build a low-cost edge-computing platform for PCa diagnosis, this analysis could be useful and should be taken into account in order to prioritize in which component the funds should be invested. As it was explained, all patches from a WSI have to be read, but not all of them have to be predicted, since the majority of them correspond to background and are discarded first. Therefore, the CPU has a higher impact than the GPU in the whole process. When comparing PROMETEO to other state-of-the-art CNN models, the former achieved the fastest prediction time, being from 2.55 times up to 12.50 times faster than any of the rest. Although the results in terms of accuracy and other commonly-used metrics in DL algorithms cannot be compared since the authors in [11,13,15] used different datasets, all of them reported state-of-the-art results for PCa detection. In [15], the authors compared PROMETEO to many of the models used in this work in terms of accuracy when using the same dataset for training and testing the CNN, showing that similar results were obtained.
The use of transfer learning in CNNs for medical image analysis has become a commonplace technique, and most of the current research focuses on using this approach for avoiding the problem of having to design, train and validate a custom CNN model for a specific task. This has proved to achieve state-of-the-art results in many different fields and has also accelerated the process of training a custom CNN from scratch [22]. However, when using this technique, very deep CNNs are commonly considered, which, as presented in this work, leads to a higher computational cost when predicting an input image, and therefore, a slower processing time. Some specific tasks could benefit from designing shallower custom CNN models from scratch, such as DL-based PCa screening, providing a faster response to the pathologists in order to help them in this laborious process. With the increases in the number of cases and the mortality produced by PCa, this factor could become even more relevant in the future.
As an alternative, cloud computing has provided powerful computational resources to big data processing and machine learning models [23]. Recent works have focused on accelerating CNN-based medical image processing tasks by using cloud solutions. While it is true that processing images using GPUs and tensor processing units (TPUs) in the cloud is faster than in any local edge-computing device, there is an aspect that is not commonly taken into account when stating this fact: the time required to upload the image to the cloud. This depends on many factors and is not easy to predict. Moreover, when digitizing histological images, scanners store them on a local hard drive using around 1 GB for each of them. As an example, with an upload speed of 300 Mbps, it would take more than 27 s in ideal conditions just to upload the WSI to the cloud, which is more than the time it would take to fully process the image on a local platform.
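The upload-time estimate can be reproduced with the short calculation below. It assumes that 1 GB is taken as 1024 MB, that Mbps means 10^6 bits per second, and that the link is fully used with no protocol overhead; these are assumptions of this sketch. The local processing time is the 22.56 s per WSI reported for device M.

```python
def upload_time_s(size_megabytes: float, speed_mbps: float) -> float:
    """Ideal transfer time in seconds: megabytes -> megabits, divided by link speed."""
    return size_megabytes * 8 / speed_mbps


wsi_size_mb = 1024         # assumption: 1 GB taken as 1024 MB
upload_speed_mbps = 300    # upload speed used in the example above
local_processing_s = 22.56 # full local processing time per WSI on device M

upload_s = upload_time_s(wsi_size_mb, upload_speed_mbps)
print(f"upload: {upload_s:.1f} s, local processing: {local_processing_s:.2f} s")
```

Under these assumptions the upload alone takes about 27.3 s, so it already exceeds the full local processing time before any cloud inference starts.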
To design a fast, low-cost, edge-computing platform, both the hardware components considered and the CNN model design have to be taken into account. Optimizing these two aspects led to a very short WSI processing time when compared to current DL-based solutions, without penalizing the performance of the system in terms of accuracy. In the near future, the authors would like to build a custom bare-bones approach based on the evaluations carried out in this work and test it in some of the hospitals that collaborated with us in this project.

Conclusions
In this work, we have presented a comprehensive evaluation of the performance of PROMETEO, a previously-proposed DL-based CNN architecture for PCa detection in histopathological images, which achieved 99.98% accuracy, 99.98% F1 score and 0.999 AUC on a separate test set at patch level.
Our proposed model outperforms other widely-used state-of-the-art CNN architectures such as ResNet34, InceptionV3, VGG16, VGG19, MobileNet, DenseNet121, Xception and ResNet101 in terms of prediction time. PROMETEO takes 22.56 s to predict a WSI on average, including the preprocessing steps needed, using an Intel® Core™ i7-8700K (Intel, Santa Clara, CA, USA) and an NVIDIA® GeForce™ GTX 1080 Ti (NVIDIA, Santa Clara, CA, USA). If we focus only on the prediction time, PROMETEO is between 2.55 and 12.50 times faster than any of the other architectures considered.
The promising results obtained suggest that edge-computing platforms and custom CNN designs could play important roles in the future for AI-based medical image analysis, being able to aid pathologists in their laborious tasks speed-wise.

Table A2. PROMETEO evaluation results. The average (Avg) and standard deviation (Std) of the execution times (in seconds) are shown for each of the four processes presented in Section 2.3 (Figure 1), both at patch level and at slide (WSI) level.

Table A3. Execution time comparison between different architectures. The average (Avg) and standard deviation (Std) of the execution times (in seconds) are shown for each of the four processes presented in Section 2.3 (Figure 1), both at patch level and at slide (WSI) level.