WP-UNet: Weight Pruning U-Net with Depthwise Separable Convolutions for Semantic Segmentation of Kidney Tumors

Background: A major challenge in biomedical imaging is to achieve high accuracy in semantic image segmentation while using fewer computational operations and achieving faster inference. This is needed in medical applications such as kidney tumor CT scan analysis, to assist radiologists. Several previous studies have used complex deep networks that require high computational resources. However, a deep network for semantic segmentation of kidney tumor CT scans with fewer flops and parameters has not yet been evaluated. Methods: This paper presents a novel network model called Weight Pruning U-Net (WP-UNet), which is fast, compact, and computationally efficient, to address this problem with kidney tumor CT scan images as an application. Results: We apply the proposed deep network model to the kidney tumor CT scan image dataset on computational devices with limited computing resources. We build a CNN model with minimal parameters, inspired by the commonly adopted U-Net architecture for CT scan image analysis, by using depthwise separable convolution layers throughout the network. We propose weight pruning with the depthwise separable and batch-normalized U-Net model to reach the expected performance and reduce the loss in the process. WP-UNet has three major benefits: (a) a lightweight model with a smaller size, (b) fewer parameters, and (c) a faster inference time with fewer floating point operations (FLOPs). WP-UNet was tested on the KiTs challenge biomedical CT scan imaging dataset for kidney tumor semantic segmentation (KiTs), and the results showed that the WP-UNet model obtained comparable and often better results than the existing state-of-the-art models. Conclusions:


Introduction
With more than 400,000 new cases in 2020, kidney cancer [25] accounts for 2.4% of cancers worldwide [1], with significant differences in incidence rates depending on geography, race, sex and time [2]. The increasing use of medical imaging has had two consequences in the last two decades: the size of the renal tumors diagnosed has steadily decreased [3], and the number of localized renal masses detected incidentally has increased [4].
CT scan image segmentation [20] plays a central role in computer-assisted diagnostic systems that support the clinical diagnosis of kidney tumors. There is an increasing demand for accurate, fast, and cost-effective automated analysis of medical images, such as computed tomography (CT) scans, to segment and locate a region of interest such as an organ or tumor volume. Automated processes can not only save time and costs, but also reduce the reliance on manual labor (radiographers) and human error.
There has recently been a growing need to deploy deep learning solutions on devices with limited computational capacity, such as handheld computers or other low-budget computing machines.
The major problem with complex deep network models is that they are heavily parameterized and therefore require substantial computational resources at test time. Researchers have suggested several methods to address this, including pruning [14][27] and quantization of the weights of models pretrained on large medical modality datasets [20]. Others have concentrated on lightweight designs trained from scratch [7][8][9][10][11], factorizing regular convolution layers into depthwise separable convolution layers [21] to obtain layers with fewer parameters.
A similar methodology using these lightweight designs, also known as mobile network architectures, is introduced in this article, with the objective of improving the U-Net model with fewer parameters [23], so that it requires less disk capacity and fewer computations and provides faster inference. However, separable convolutions are considered to give lower accuracy than regular convolution layers. Therefore, weight pruning [30] coupled with batch normalization [22] is applied to the weights of each input layer to recover the lost accuracy. This new architecture is called WP-UNet. The performance of WP-UNet is measured on the KiTs challenge dataset.

Related Work
In this section, we describe in detail the literature works that inspired our study.

Depthwise Separable Convolutions
Depthwise separable convolutions have recently become common in biomedical and other models for two reasons:
-They have fewer parameters and are thus less likely to overfit than "regular" convolution layers.
-With fewer parameters, they also require fewer operations to compute, and are thus cheaper and faster.
The difference in the number of parameters between regular convolutions and depthwise separable convolutions [21] is described with the following notation:
-Spatial dimension (S): width and height of the input, with square inputs assumed.
-Filter size (F): width and height of the filter, with square filters assumed.
-Input channels (InC): number of input channels.
-Output channels (OutC): number of output channels.
In a regular convolution (Fig. 1) there are F x F x InC x OutC parameters, because every filter is 3D and there is one such filter per output channel.
In a depthwise separable convolution [21] (Fig. 2) there are F x F x InC parameters for the depthwise convolution [21] and InC x OutC parameters for the mixing (pointwise) part. The sum of these two counts, F x F x InC + InC x OutC, is less than the parameter count of the regular convolution.
By using depthwise convolutions [21], some deep network models are able to reduce computation by 8 or 9 times compared to standard convolutions.
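As a worked illustration of these counts, the following minimal sketch computes both parameter totals. The function names and the example layer sizes (F = 3, InC = 64, OutC = 128) are assumptions chosen for illustration only, and bias terms are ignored.

```python
def regular_conv_params(f, in_c, out_c):
    """Parameters of a regular convolution: one F x F x InC filter per output channel."""
    return f * f * in_c * out_c


def separable_conv_params(f, in_c, out_c):
    """Parameters of a depthwise separable convolution (biases ignored)."""
    depthwise = f * f * in_c   # one F x F filter per input channel
    pointwise = in_c * out_c   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
F, IN_C, OUT_C = 3, 64, 128
regular = regular_conv_params(F, IN_C, OUT_C)
separable = separable_conv_params(F, IN_C, OUT_C)
print(regular, separable, round(regular / separable, 2))  # 73728 8768 8.41
```

For this example layer the separable version needs roughly 8.4 times fewer parameters, which is in line with the 8 to 9 times reduction quoted above.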

Normalization
In deep network models, normalization techniques (Fig. 3) are used to greatly decrease the training time of the network. The major benefit of normalization is that it puts every feature on a common scale, so that each feature keeps its contribution even though some features have higher numerical values than others. It also reduces internal covariate shift. It makes optimization faster because normalization prevents the weights from exploding and restricts them to a certain range. Many researchers have proposed different normalization techniques to optimize network models. The most commonly used techniques are batch, switchable and group normalization [22].
Over the years, batch normalization [24] (BN) has become a commonly accepted technique and has proven to be very successful in many deep learning tasks. To normalize the activations, BN [24] uses the mean and variance computed within a mini-batch, standardizing the activations to zero mean and unit variance. The key benefits of BN [24] include faster convergence in fewer training iterations and a degree of regularization, which reduces the generalization error. However, when device constraints force small batch sizes, BN [24] does not work effectively. Therefore, group normalization (GN) [22] was introduced as a layer that separates channels into groups and calculates the mean and standard deviation over these channel groups for each sample during training. GN does not use batch statistics, which helps it to do well with smaller mini-batch sizes than BN [24].
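The difference between the two sets of statistics can be sketched in plain Python. This is a simplified illustration, not the layers used in this work: activations are reduced to a list of samples with one value per channel (no spatial dimensions), the learnable scale and shift are omitted, and the function names and the constant EPS are assumptions.

```python
from statistics import fmean, pvariance

EPS = 1e-5  # small constant for numerical stability (assumed value)


def normalize(values):
    """Standardize a list of numbers to zero mean and unit variance."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)
    return [(v - mean) / (var + EPS) ** 0.5 for v in values]


def batch_norm(x):
    """BN: statistics are computed per channel, across the samples of the batch."""
    n_channels = len(x[0])
    columns = [normalize([sample[c] for sample in x]) for c in range(n_channels)]
    return [[columns[c][i] for c in range(n_channels)] for i in range(len(x))]


def group_norm(x, num_groups):
    """GN: statistics are computed per sample, over each group of channels."""
    size = len(x[0]) // num_groups
    out = []
    for sample in x:
        row = []
        for g in range(num_groups):
            row.extend(normalize(sample[g * size:(g + 1) * size]))
        out.append(row)
    return out


# With a batch of a single sample, the batch statistics carry no information.
single = [[1.0, 3.0, 10.0, 30.0]]
bn_out = batch_norm(single)      # every activation collapses to 0.0
gn_out = group_norm(single, 2)   # still normalized within each channel group
print(bn_out)
print(gn_out)
```

With a batch of one, every BN output is zero because each channel sees only its own value, whereas GN still yields values close to -1 and +1 within each group. This is the small-batch behavior described above.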

Weight Pruning
Weight pruning aims to induce sparsity in the connection matrices of a deep neural network, thus reducing the number of nonzero-valued parameters in the model. Recent works (Han et al., 2015a; Narang et al., 2017) prune deep networks at the expense of only a small loss of accuracy and accomplish a substantial reduction in the size of the model. Pruned models achieve up to 3x decreases in the number of nonzero parameters with limited loss of accuracy across a wide variety of neural network architectures (deep CNNs, stacked LSTM and seq2seq LSTM models).
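The idea can be illustrated with a minimal sketch of magnitude-based pruning, in which the weights with the smallest absolute values are set to zero. The pruning criterion and schedule used in this work are not specified here, so the magnitude criterion, the function name magnitude_prune, and the example weights are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    `weights` is a 2D list (a small connection matrix). Ties at the threshold
    are pruned as well, so slightly more weights than requested may be removed.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Hypothetical weight matrix, for illustration only.
weights = [[0.5, -0.1, 0.03],
           [-0.8, 0.2, -0.05]]
pruned = magnitude_prune(weights, sparsity=0.5)
nonzero = sum(1 for row in pruned for w in row if w != 0.0)
print(pruned, nonzero)
```

Here half of the six weights are removed and the three largest in magnitude (0.5, -0.8 and 0.2) are kept. In practice the network is usually fine-tuned after pruning to recover accuracy.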

Fully Convolutional Networks (FCNs)
The most basic concept behind FCNs [18] is that they consist only of locally connected layers (dropout, convolutions, activation, pooling), without fully connected (dense) layers. This helps to decrease the computation time and the number of parameters. It also implies that, regardless of

U-Net
U-Net [19] evolved from the CNN for the analysis of medical image modalities. U-Net consists of a contracting path, an expansive path, and bottleneck layers, and it merges information from the contracting and expansive paths by concatenating feature maps, building on the FCN [18] deep network architecture. The key distinction between U-Net and FCN [18] lies in this structure: the encoding path of U-Net comprises four sections, each consisting of two unpadded 3x3 convolutions with ReLU activation and a 2x2 max-pooling layer. After each downsampling stage the number of feature channels doubles, while max-pooling reduces the size of the feature maps. In the decoding path, 2x2 upsampling followed by 3x3 regular convolutions is used. Each upsampling step is followed by a concatenation with the features from the corresponding layer in the encoding path. This helps transfer the localization information captured in the contracting path to the expansive path.
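The bookkeeping of feature map sizes in the encoding path can be sketched as follows. The input size of 572 and the 64 initial channels are taken from the original U-Net design [19] and are assumptions here; they are not the settings of WP-UNet. The function name encoder_shapes is also hypothetical.

```python
def encoder_shapes(size, channels, sections=4):
    """Return (feature map size, channels) at the end of each encoder section.

    Each section applies two unpadded 3x3 convolutions, which trim 2 pixels
    each, followed by a 2x2 max-pooling that halves the size. The number of
    channels doubles from one section to the next.
    """
    shapes = []
    for _ in range(sections):
        size -= 4                    # two unpadded 3x3 convolutions
        shapes.append((size, channels))
        size //= 2                   # 2x2 max-pooling
        channels *= 2                # channels double after downsampling
    return shapes


shapes = encoder_shapes(572, 64)
print(shapes)  # [(568, 64), (280, 128), (136, 256), (64, 512)]
```

These sizes are the ones that get concatenated with the decoder features, which is why the original U-Net crops the encoder maps before concatenation when unpadded convolutions are used.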

Materials And Methods
In this section, we outline the proposed technique, describe the WP-UNet architecture and conduct experiments.

Weight Pruning (WP) with Depthwise Separable Convolutions
Weight pruning (WP) was originally proposed for regular convolutions. In this work, to minimize the number of parameters and necessary computations in the U-Net model, the regular convolution layers are replaced with depthwise separable layers [21], and WP is applied to the layers of the U-Net. As a result, WP-UNet achieves a smoother loss curve during training, which helps increase model accuracy.

WP-UNet (Proposed Architecture)
With a few changes, WP-UNet (Fig. 5) follows a similar architecture to U-Net. Except for the first convolution layer, which is a regular convolution, all convolution layers are depthwise separable convolutions. The pruned [27] encoding path is made up of five blocks:
-Block 1: An initial block with a regular convolution layer, a ReLU activation function and batch normalization.
-Blocks 2, 3 and 4: Each of these is a WP-UNet block (Fig. 4), composed of two depthwise separable convolution layers [21], two activation layers and one normalization layer.
-Block 5: A final depthwise separable layer [21] with a dropout layer [17].
The upsampling in the decoding path is performed with a scale of two to restore the size of the segmentation map. The decoding path of WP-UNet is made up of a mixture of standard convolution blocks and WP-UNet blocks, and consists of the same number of blocks:
-Block 1: A depthwise separable convolution layer whose features are concatenated with the dropout layer [17] from block 4 of the encoding path.
-Blocks 2, 3 and 4: A WP-UNet block and a depthwise separable layer [21], concatenated with the matching blocks from the encoding path.
-Block 5: Two WP-UNet blocks and two depthwise separable layers, with the last one as the final layer.

Configuration
Training was implemented in Keras with a TensorFlow backend on Google Colab, using an NVIDIA GPU such as the T4 (12 GB memory) with a high-memory VM.

Datasets (KiTs Challenge Dataset)
The performance of WP-UNet is assessed on the KiTs challenge dataset for kidney tumor segmentation [5]. It consists of 210 high-contrast CT scans of patients, collected in the preoperative arterial phase and chosen from a cohort of subjects who underwent partial or radical nephrectomy [26] for one or more kidney tumors at the University of Minnesota Medical Center between 2010 and 2018 and were candidates for inclusion in this database. The volumes included have in-plane resolutions ranging from 0.437 to 1.04 mm, with slice thicknesses ranging from a minimum of 0.5 mm to a maximum of 5.0 mm.
For each case included, the dataset also provides the ground-truth mask of both healthy kidney tissue and tumors (Fig. 6). Under the guidance of experienced radiologists, a group of medical students manually generated the sample labels using only the axial projections of the CT scan images. A detailed description of the segmentation strategy for the ground truth is given in [5]. The KiTs challenge dataset is provided with shape (num slices, height, width) in the standard NIFTI format.

Data Preprocessing
The original resolution of the images in the KiTs challenge dataset was 512 x 512, but because of technical limitations they were resized to 256 x 256. The data are provided in the standard NIFTI format and were converted into TFRecords to reduce disk usage. Owing to the small number of training images available, data augmentation techniques were used. A small number of images can lead to overfitting, where a trained model performs very well on training data but poorly on new test data. The augmentation techniques used were horizontal flip, zoom range, and height and width shift range. After augmentation, the number of image stacks in the dataset grew to 120. Center cropping and data normalization were also used to ensure zero mean and unit variance, and the original 3D volumes were converted into 2D slices for training and testing of U-Net [19] with separable convolution layers and ReLU activation. The proposed WP-UNet with the ReLU activation function was trained on 44,175 images and validated on 17,030 images.
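Three of the preprocessing steps named above (center cropping, horizontal flip, and normalization to zero mean and unit variance) can be sketched on a single 2D slice. This is a minimal illustration on nested Python lists, not the pipeline used in this work; the function names and the toy 4 x 4 slice are assumptions.

```python
from statistics import fmean, pstdev


def center_crop(image, size):
    """Keep the central size x size window of a 2D slice."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def horizontal_flip(image):
    """Mirror a 2D slice left to right (one of the augmentations)."""
    return [row[::-1] for row in image]


def standardize(image):
    """Rescale intensities to zero mean and unit variance."""
    pixels = [p for row in image for p in row]
    mean = fmean(pixels)
    std = pstdev(pixels) or 1.0  # avoid division by zero on constant slices
    return [[(p - mean) / std for p in row] for row in image]


# Toy 4 x 4 slice with intensities 0..15, for illustration only.
slice_2d = [[float(4 * r + c) for c in range(4)] for r in range(4)]
cropped = center_crop(slice_2d, 2)
flipped = horizontal_flip(cropped)
normalized = standardize(cropped)
print(cropped, flipped)
```

When a slice is flipped for augmentation, the same flip has to be applied to its ground-truth mask so that the labels stay aligned with the image.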

Optimization
The Adam optimization algorithm [16] was used to train the network model on the KiTs CT scan image dataset, with a learning rate ranging from 0.0001 to 0.00001. The training loss was a weighted sum of the negative Dice loss and the binary cross-entropy loss.
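The weighted loss described above can be sketched as follows, reading it as a weighted sum of the negative Dice coefficient and the binary cross-entropy. The weights are not stated, so the equal weighting (bce_weight = 0.5), the smoothing constant, and the function names are assumptions.

```python
import math


def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Soft Dice coefficient between a binary mask and predicted probabilities."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return (2.0 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(y_true)


def combined_loss(y_true, y_pred, bce_weight=0.5):
    """Weighted sum of binary cross-entropy and negative Dice (assumed weights)."""
    bce = binary_cross_entropy(y_true, y_pred)
    dice = dice_coefficient(y_true, y_pred)
    return bce_weight * bce + (1.0 - bce_weight) * (-dice)


# Flattened toy mask and two candidate predictions, for illustration only.
mask = [1.0, 0.0, 1.0, 0.0]
perfect = combined_loss(mask, [1.0, 0.0, 1.0, 0.0])
poor = combined_loss(mask, [0.4, 0.6, 0.5, 0.5])
print(perfect, poor)
```

With equal weights the loss approaches -0.5 for a perfect prediction and increases as the overlap with the mask gets worse.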

Performance Metrics
The key performance metrics used in measuring WP-UNet performance on the CT scan dataset are explained in detail in this section.

Accuracy (AC)
In the formula given below, accuracy measures the percentage of correct predictions and is given
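Assuming the standard pixel-wise definition, AC = (TP + TN) / (TP + TN + FP + FN), where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative pixels, the computation can be sketched as below. These symbols and the function name pixel_accuracy are assumptions, since they are not defined in the text.

```python
def pixel_accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted binary label matches the ground truth."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return (tp + tn) / (tp + tn + fp + fn)


# Flattened toy masks, for illustration only.
accuracy = pixel_accuracy([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
print(accuracy)  # 0.6
```

Because tumor pixels are rare in CT slices, accuracy alone can look high even when the tumor is missed, which is why overlap measures such as the Dice coefficient are reported alongside it.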

Results
In this section, experiments are performed to calculate the computational requirements of WP-UNet, its inference speed, and the efficiency of segmentation (Fig. 9, 10) on the specified datasets.

Ablation Study
To test the efficiency of the proposed model and to support the final design decisions made in this report, a detailed ablation study is conducted. Three separate changes were made to the design of the architecture. The performance of these modifications is recorded in Table 1, alongside the original U-Net architecture and the proposed WP-UNet based on BN [24].

Results on KiTs Dataset
U-Net and WP-UNet are trained from scratch for only 10 epochs, and their mean Dice scores and inference speed on validation data are compared. The smaller number of epochs was chosen because the Adam [16] optimization algorithm can reach a minimum efficiently and each epoch runs 1014 iterations over the KiTs dataset. A higher loss and average Dice coefficient are obtained by WP-UNet (Fig. 7) relative to the U-Net (Fig. 8) model. Its inference is also faster on a single GPU device.

Conclusion
To help in disease detection, therapy, and general research, the segmentation of biomedical images is an important first step in distinguishing tissues in image scans. Early diagnosis is necessary to help avoid complications that may occur due to late detection. However, with the availability of large-scale biomedical data, the workload has also grown for neurologists, radiologists, and other field specialists. Several deep learning architectures have been proposed to help provide faster, precise and timely detections, and several have had considerable success in these tasks. One such model that is widely adopted by researchers in CT scan image segmentation is the U-Net architecture.
Portable devices have recently been equipped with computing capacities that were only imaginable for large machines in the past. Deep learning implementations, however, require much greater computing power. This makes the development of deep learning systems on mobile or embedded computers very difficult. For example, the U-Net architecture needs more than 62 M FLOPs and a storage capacity of over 370 megabytes (MB), which are very high requirements. In comparison, little attention has been paid to the implementation of deep learning approaches in biomedical imaging fields on resource-constrained systems.
For the segmentation of CT scan image data on devices with small computational budgets, Weight Pruning U-Net (WP-UNet) (Fig. 9 & 10) was presented in this paper. The WP-UNet architecture uses separable convolutions. Our findings indicate that the proposed architecture is smaller than U-Net and demands 3x less computational complexity.

Competing interests
The authors declare no competing interests as defined by the journal, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

Authors' contributions
Patike analyzed and interpreted the dataset regarding kidney tumor disease using WP-UNet. Subarna validated the model based on the number of parameters and flops.

Acknowledgements
The datasets used for the analysis in this manuscript were obtained from KiTS19. We gratefully acknowledge the contribution of the people and organizations involved in the cancer image archive initiative as participants, organizers, or funders.

Figures
Depthwise convolution uses 3 kernels to transform a 12x12x3 image to an 8x8x3 image.
Sample CT scan images and ground truth labels from the Kidney and Kidney Tumor Segmentation (KiTs) dataset.
Sample segmentation results on the KiTs dataset from our test split. WP-UNet performs significantly better than U-Net on tumors.
Figure 10: Sample segmentation of kidney and tumor.