Effect of Data Augmentation of Renal Lesion Image by Nine-layer
Convolutional Neural Network in Kidney CT

Artificial Intelligence (AI) has become a hotspot in the field of medical image analysis and provides rather promising solutions. Although some research has explored smart diagnosis for the common diseases of the urinary system, some problems remain unsolved. A nine-layer Convolutional Neural Network (CNN) is proposed in this paper to classify renal Computed Tomography (CT) images. Four groups of comparative experiments indicate that the structure of this CNN is optimal among the tested configurations and achieves good performance, with an average accuracy of 92.07 ± 1.67%. Although our renal CT dataset is not very large, we augment the training data with geometric transformations (affine, translation, rotation and scaling) and color-space transformations (gamma and noise). Experimental results validate that Data Augmentation (DA) on the training data improves the average accuracy of our proposed CNN by about 0.85% compared to training without DA. The proposed algorithm offers a promising way to help clinical doctors recognize abnormal images automatically, faster than manual judgment and more accurately than previous methods.


Introduction
In general, the common diseases of the human urinary system include calculi, infection, tumor, congenital dysplasia and trauma. CT is one of the main tools used to detect and diagnose most of them. With the development of digitalized and intelligent medical diagnosis, AI has become a hotspot in the field of medical image analysis and provides rather promising solutions. In essence, AI cannot replace human wisdom, but it can help overcome the limited energy and fallibility of human beings. Although some research has explored smart diagnosis for the common diseases of the urinary system, some problems remain unsolved.
Mangayarkarasi et al. [4] adopted a PNN model to classify renal ultrasound images into normal and abnormal categories. Their dataset contained only 24 normal and 53 abnormal images, which were preprocessed by histogram equalization, mean filtering, Gauss filtering and segmentation of the Region of Interest (ROI). The PNN was then trained on three image attributes: the mean, entropy and variation of each image. Though the overall average accuracy was 93.5%, the generalization of this method is hard to guarantee because of possible overfitting on such a small dataset.
These neural networks took specified features extracted by experience as the input data and obtained about 7% higher accuracy than typical deep learning models. This indicates that supervised machine learning models depend on both human knowledge and training data. Because the number of training samples is often insufficient, the classifier may generalize poorly.
Since the typical CNN model has been applied successfully to a large number of discriminative tasks in various fields, such as medical diagnosis, it has become a promising tool for researchers. Wang et al. [5] designed a seven-layer CNN to classify renal lesions on a dataset of 614 CT images and obtained a state-of-the-art accuracy of 90.36 ± 1.02%.
When training data are difficult to acquire, methods such as transfer learning and data augmentation can be applied to mitigate overfitting [6]. The effect of data augmentation techniques on image data depends on whether the data label is preserved after data warping and oversampling. For example, Wang [7] used rotation, gamma transformation and noise injection to augment a CNN training dataset and achieved better performance in alcohol use disorder detection. Accordingly, this paper investigates the effect of data augmentation on our CNN model for distinguishing renal lesions.
This paper makes three main contributions. The first is to improve the recognition accuracy for renal lesions on a CT dataset with a CNN. The second is to normalize the distribution of our dataset by color space transformation. The third is to investigate the effect of data augmentation on a class-imbalanced dataset.
The remainder of this paper is organized as follows. Section 2 describes the dataset of kidney CT images from clinical patients. In Section 3, a CNN with data augmentation is constructed and explained in detail according to the classification target. Next, the implementation parameters are tuned and the results are illustrated, with particular attention to the effect of data augmentation. Furthermore, comparisons among CNN structures and with other deep learning models are presented in Section 4. The final conclusion is drawn at the end of the paper (Section 5).

Dataset
This study was approved by the Ethics Committee of Tongliao Hospital of Inner Mongolia, and all subjects in the dataset gave formal written consent to Tongliao Hospital. Our dataset consists of 614 kidney CT images, which were collected by clinical doctors through general or enhanced scans to diagnose renal lesions, using a Siemens SOMATOM Force CT device at Tongliao Hospital.
According to the clinical diagnosis by experienced doctors, these kidney CT images are labeled as abnormal or normal. Moreover, the abnormal class has four subtypes representing typical renal diseases: calculi, cysts or hydronephrosis, calculi with cysts or hydronephrosis, and tumor. Samples from our dataset are illustrated in Tab. 1. To construct our CNN classifier, the only manual preprocessing is cropping out the Region of Interest (ROI) in the CT images. No other preprocessing is needed. In addition, it is worth mentioning that the enhanced CT images are selected by doctors using excretory urography, which is taken in the excretory phase about 15 minutes after injecting contrast agents. For example, the sample of the 1-4 tumor subtype was scanned by enhanced CT, and its tumor lesion is located in the dark part of the corresponding ROI. By comparison, the remaining samples in this table were scanned by general CT, and their lesion contrast varies significantly in intensity, shape and size.
In total, our dataset involves 500 general CT images and 114 enhanced CT images. These CT images cover the two categories of normal and abnormal, with five renal lesion types in total. We define the abnormal category as the positive class and the normal category as the negative class.
As presented in Tab. 2, our dataset is clearly not large and is class imbalanced. The main reason is the difficulty for clinical doctors, in their daily work, of tracking diseases, marking images and integrating text records for us. Thus, two problems are prone to appear: overfitting caused by training on a small dataset, and prediction bias toward the majority class caused by training on a class-imbalanced dataset. A common method to address class imbalance is to increase the penalty cost of wrong predictions for the minority class in the objective function of the classifier model. In the next section, data augmentation is explained in detail as a solution to alleviate these two problems.
To evaluate the generalization of a classifier, it is best to test on new data instead of the training data. The holdout approach separates part of the dataset as test data and uses the remainder as training and validation data. To give the different classes the same probability in the test data, the same number of test samples is held out from each class, based on the class with fewer samples, as shown in Tab. 3.
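The balanced holdout idea can be illustrated with a short Python sketch. It is not the authors' implementation (the experiments in this paper were run in Matlab); the function name `balanced_holdout`, the random seed, and the toy label counts are all hypothetical and chosen only for illustration.

```python
import random

def balanced_holdout(labels, n_test_per_class, seed=0):
    """Split sample indices into train and test sets so that every class
    contributes the same number of test samples (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    test_idx = []
    for label, indices in by_class.items():
        if n_test_per_class > len(indices):
            raise ValueError(f"class {label!r} has too few samples")
        # Draw the same number of test samples from each class.
        test_idx.extend(rng.sample(indices, n_test_per_class))

    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    return train_idx, sorted(test_idx)

# Hypothetical, made-up class counts (not the counts of Tab. 2 or Tab. 3).
labels = ["abnormal"] * 12 + ["normal"] * 6
train_idx, test_idx = balanced_holdout(labels, n_test_per_class=3)
```

With this kind of split the test set is balanced even though the remaining training data stay imbalanced, which is why the augmentation described in the next section is applied to the training data only.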

Methodology
As we have seen, the CNN is a classic end-to-end deep learning model that has been heavily explored by researchers. Our method comprises two parts, as shown in Fig. 1. One part is the set of data augmentation techniques applied to enlarge our kidney CT dataset. The other is a nine-layer CNN trained to classify renal lesions. The related theory is reviewed first.

CNN Structure
According to the function of each layer, a CNN consists of a series of basic layers (convolution, activation [8,9] and fully-connected) as well as additional layers (pooling, batch normalization and softmax). These layers are organized in sequence, and some of them repeat several times. Each layer, as a module, is composed of a certain number of units [10][11][12]. Each unit transforms its input $x$ to the output $f(x)$. The values of the units in the previous adjacent layer that connect to unit $i$ of this layer compose its input vector $x_i$, and the corresponding output value of the unit is $f(x_i)$. Therefore, the operation $f$ of the units and the connection relationship determine the function of the layer. For a layer with $n$ units, the outputs of all units compose a vector $y$:
$$y = [f(x_1), f(x_2), \ldots, f(x_n)].$$
The dimensions of these input and output vectors depend on the number and the connections of the units located in two adjacent modules. So the number of units is an external parameter that must be chosen by the user. These modules are assembled carefully to implement a specific image recognition task. Next we explain the functions and the connections in those layers.

Convolution, Batch Normalization and ReLU Layer
The convolution operation can be expressed as
$$f(x) = w^{T} x + b,$$
where $w$ is the vector of the convolution kernel weights [13][14][15] and $b$ is the bias for the output. When the convolution operates on an image $I = [\mathrm{Height}, \mathrm{Width}, \mathrm{Channel}]$, the size of the kernel filter, $[\mathrm{height}, \mathrm{width}]$, needs to be chosen first. The output of this filter at a specific position of the image $I$ is then $f(x)$ evaluated on the vector $x$ of image values covered by the kernel at that position. The connections among the units for a convolution kernel are sparse [16][17][18], so that important features are extracted from a local region. At the same time, $w$ and $b$ are internal parameters learned from the training data. If the kernel filter slides over the image one pixel at a time, a new image is output as a feature map. Its output size $O_1$ is
$$O_1 = [\mathrm{Height} - \mathrm{height} + 1,\ \mathrm{Width} - \mathrm{width} + 1],$$
which is smaller than the input size. Therefore, for the pixels on the border of the input image, the kernel filter needs additional padding around the four borders so that the output size equals the input size. When the padding size is $P = [\mathrm{up}, \mathrm{down}, \mathrm{left}, \mathrm{right}]$, the output size $O_2$ is
$$O_2 = [\mathrm{Height} - \mathrm{height} + \mathrm{up} + \mathrm{down} + 1,\ \mathrm{Width} - \mathrm{width} + \mathrm{left} + \mathrm{right} + 1].$$
When the kernel filter slides more than one pixel at a time, another pair of parameters is set as $\mathrm{stride} = [\mathrm{stride\_height}, \mathrm{stride\_width}]$. Thus, the output size $O_3$ of the feature map is
$$O_3 = \left[\frac{\mathrm{Height} - \mathrm{height} + \mathrm{up} + \mathrm{down}}{\mathrm{stride\_height}} + 1,\ \frac{\mathrm{Width} - \mathrm{width} + \mathrm{left} + \mathrm{right}}{\mathrm{stride\_width}} + 1\right].$$
When multiple kernel filters are set as $K = [\mathrm{height}, \mathrm{width}, \mathrm{kernel\ number}]$, which are the external parameters for the size and the number of the kernel filters, the number of outputs of the convolution layer is exactly the number of kernel filters.
From the above explanation, the kernel filter operates on the input image iteratively over all pixels. To avoid this tedious iteration, parallel computation in batches is applied to a tensor $B = [\mathrm{Height}, \mathrm{Width}, \mathrm{Channel}, \mathrm{Batch}]$ to accelerate the operation.
However, the distribution of the batches in the training dataset may vary greatly, which affects the stability of the learned internal parameters. To solve this, Ioffe and Szegedy [19] proposed the Batch Normalization (BN) operation, applied before the activation function, to reduce internal covariate shift [20].
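The output-size rule for a padded, strided convolution can be checked numerically with a small Python sketch. The function name `conv_output_size` and its argument names are hypothetical; the use of integer (floor) division is an assumption for cases where the stride does not divide evenly, which does not arise in the Conv_1 example taken from the Network Configuration section of this paper.

```python
def conv_output_size(in_size, kernel, stride=1, pad_before=0, pad_after=0):
    """Output length along one axis of a convolution.

    Computes (in_size - kernel + pad_before + pad_after) / stride + 1.
    Floor division is an assumption for non-divisible cases.
    """
    return (in_size - kernel + pad_before + pad_after) // stride + 1

# Conv_1 from the paper: input [72, 72, 3], 16 kernels of size 3 x 3,
# stride [2, 2], padding [up, down, left, right] = [0, 1, 0, 1].
height = conv_output_size(72, 3, stride=2, pad_before=0, pad_after=1)
width = conv_output_size(72, 3, stride=2, pad_before=0, pad_after=1)
conv1_shape = [height, width, 16]  # the channel count equals the kernel number
```

The same function reproduces the unpadded case (a 3 x 3 kernel with stride 1 shrinks 72 to 70) and the size-preserving case (one pixel of padding on each side keeps 72).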
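As for the activation layer, it simulates a unit that produces a response only when its input surpasses a certain threshold. Common activation functions include the rectified linear unit (ReLU), $f(x) = \max(0, x)$, and so on.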
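A minimal Python sketch of batch normalization followed by ReLU on a single feature is given here for illustration. It follows the standard training-time formulation of BN (normalize by the batch mean and variance, then apply a learnable scale and shift); the names `scale`, `shift` and `eps`, and the toy activation values, are assumptions and are not taken from this paper.

```python
import math

def batch_norm(values, scale=1.0, shift=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift.

    `scale` and `shift` stand for the learnable BN parameters; `eps`
    avoids division by zero. All defaults are illustrative.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [scale * (v - mean) / math.sqrt(variance + eps) + shift
            for v in values]

def relu(values):
    """Rectified linear unit: pass positive inputs, zero out the rest."""
    return [max(0.0, v) for v in values]

# Hypothetical activations of one unit across a mini-batch of four images.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)   # zero mean, roughly unit variance
rectified = relu(normalized)           # negative responses are suppressed
```

Because BN centers the batch before the activation, roughly half of the responses fall below the ReLU threshold regardless of the original scale of the inputs.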

Fully-Connected and Softmax Layer
For a fully-connected layer, the function is the same as that of the convolution layer. The difference is that each unit of the fully-connected layer collects the information from all units of the previous layer as its own input [21][22][23][24]. Thus, the connections in the fully-connected layer are rather dense. The number of corresponding weights for the connections is $N = [n \times m]$, much larger than in the convolution layer. Meanwhile, these weights and biases are not shared among units, because the output of each unit represents a value for a definite category [25][26][27][28].
If we need the relative value among different categories, the softmax function realizes this transformation based on the Bayes probability model:
$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}.$$
Typically, an $x_i$ with a larger relative score yields an exponentially larger probability.
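The softmax transformation can be sketched in a few lines of Python. Subtracting the maximum score before exponentiation is a common numerical-stability step and is an addition here, not something stated in the paper; the two input scores are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    shift = max(scores)  # stability trick; does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the two categories (abnormal, normal).
probs = softmax([2.0, 0.0])
```

For two classes the result reduces to the logistic function of the score difference, so a gap of 2 between the scores already gives the first class a probability of about 0.88.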

Data Augmentation
Deep learning relies on big data to avoid overfitting. When data are limited, artificially inflating the dataset, namely data augmentation, brings the benefit of big data to the limited-data domain. Many data augmentation techniques have been proposed for constructing better datasets, and they can generally be classified as either data warping or oversampling techniques [29].
For data warping, transformations in geometric space and in color space are the two common forms. On the one hand, geometric transformations encompass translation, rotation, scaling, flipping and cropping. On the other hand, color transformations include color filtering, noise injection [30], histogram change, kernel filters, image mixing, random erasing and so on. All of them aim to cover a more general data distribution and narrow the gap between training data and test data. However, these methods incur additional memory and computation time. Meanwhile, the error-rate reduction from some methods, such as image mixing, is very difficult to explain from a human point of view. Data augmentation prevents overfitting by modifying limited datasets so that they possess the characteristics of big data. It performs best under the assumption that the training and test datasets are drawn from the same distribution. Otherwise, these methods are very unlikely to be useful.
Data augmentation also alleviates the harm of class imbalance, which biases models toward majority-class predictions and renders accuracy a deceptive performance metric. Data augmentation is a data-level solution to this problem, and many different implementation strategies are used. A naive and easy solution is simple random oversampling with small geometric and color space operations, using different ratios for the majority and minority classes [31].
However, oversampling can also make overfitting on the minority class more prevalent after sampling [32]. So more intelligent oversampling strategies that increase the minority class size while preserving the extrinsic distribution, such as adversarial training, neural style transfer, GANs and meta-learning schemes, are a promising area for future work.
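The naive random oversampling strategy mentioned above can be sketched as follows. This is an illustrative Python example, not the procedure used in this paper; the function name `random_oversample`, the seed and the toy file names are hypothetical. In practice each duplicated sample would additionally receive a small geometric or color space transformation, which is omitted here.

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class is as large as the largest class (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap to the majority size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for item in items + extra:
            out_samples.append(item)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical imbalanced toy set: 5 abnormal images, 2 normal images.
samples = ["a1", "a2", "a3", "a4", "a5", "n1", "n2"]
labels = ["abnormal"] * 5 + ["normal"] * 2
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

Since the minority samples are repeated verbatim, this sketch also makes the overfitting risk of plain oversampling easy to see: the classifier meets the same few minority images many times.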
For our dataset, because the sample size is small and the classes are imbalanced, data augmentation is applied to counter the overfitting and data bias problems. Four geometric transformations and two color transformations are applied to our dataset. The detailed values of these transformations are described in Tab. 4, where the affine transformation applies two-dimensional shear operations and the noise type is Gaussian white noise with zero mean and a variance of 0.01. As for gamma enhancement, gamma represents the degree of brightness adjustment: a value less than 0.4 makes the new image too bright, while a value greater than 1.6 makes it too dark.
For each original training image, each transformation generates 30 new images with the same size as the input of our proposed CNN.
As shown in Fig. 2, one original CT image with the renal lesion type of calculi with cysts is taken as an example. In total, 180 new images are generated by these six transformations with the corresponding value ranges and steps. Only six new images of each transformation are exhibited in Figs. 2b-2g, whose indices are 1, 6, 11, 16, 21, 27 in the corresponding 30 new images.
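The two color space transformations can be illustrated with a small Python sketch operating on a toy grayscale image with intensities in [0, 1]. It is not the authors' Matlab code. The gamma range of 0.4 to 1.6 in 30 evenly spaced steps is an assumption inferred from the brightness limits stated above, since the exact values of Tab. 4 are not reproduced here; the helper names and the 2 x 2 image are hypothetical. Only the noise parameters (zero mean, variance 0.01) come directly from the text.

```python
import random

def linspace(start, stop, count):
    """Return `count` evenly spaced values from start to stop inclusive."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]

def gamma_transform(image, gamma):
    """Power-law brightness change for intensities in [0, 1]:
    gamma < 1 brightens the image, gamma > 1 darkens it."""
    return [[pixel ** gamma for pixel in row] for row in image]

def add_gaussian_noise(image, variance=0.01, seed=0):
    """Add zero-mean Gaussian white noise and clip back to [0, 1]."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    return [[min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma)))
             for pixel in row] for row in image]

# Hypothetical 2 x 2 grayscale patch standing in for a 72 x 72 ROI.
image = [[0.2, 0.5], [0.7, 0.9]]

gammas = linspace(0.4, 1.6, 30)  # assumed range and step count
gamma_images = [gamma_transform(image, g) for g in gammas]
noisy_images = [add_gaussian_noise(image, 0.01, seed=k) for k in range(30)]

# Six transformations with 30 images each give 180 images per original.
images_per_original = 6 * 30
```

The geometric transformations (shear, translation, rotation and scaling) would follow the same pattern, with one parameter swept over 30 values per transformation.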

Implementation
The CNN program is developed in Matlab 2019a. Its training and test stages are all run on a laptop with the Windows 10 operating system, an NVIDIA GeForce GTX 1050 GPU with 5 multiprocessors, and a CPU clock rate of 2.2 GHz. To evaluate the performance of our CNN, six indicators are used to obtain average and overall values from multiple viewpoints: sensitivity (recall of the positive category), specificity (recall of the negative category), precision of the positive category, accuracy over all categories, F1 and MCC. MCC is a correlation coefficient between observation and prediction, whose value ranges from −1 (total disagreement) to 1 (perfect prediction).
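For reference, the six indicators can be computed from a binary confusion matrix as in the following Python sketch, which uses the standard definitions. The function name `binary_metrics` and the confusion-matrix counts are hypothetical; they are not results from this paper.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Six indicators from a binary confusion matrix, with the abnormal
    category as the positive class."""
    sensitivity = tp / (tp + fn)            # recall of the positive class
    specificity = tn / (tn + fp)            # recall of the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "mcc": mcc}

# Hypothetical counts on a balanced test set of 82 images.
metrics = binary_metrics(tp=38, fn=3, tn=37, fp=4)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes have very different sizes.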

Training Configuration
The CNN training algorithm minimizes the loss function [33], which combines the mean squared error with an L2 regularization term as shown in Eq. (10), where there are m training samples and n optimized parameters, and r is the L2 regularization coefficient, set to 0.005.
The weights are updated by stochastic gradient descent with momentum [34], which averages previous gradients to obtain a smoother search path. It is given as Eqs. (11) and (12), where m is the momentum coefficient, set to 0.9, ∇L is the gradient of the objective function at one iteration, and e is the learning rate, which defines how much the internal parameters are updated in each iteration. The parameters assigned in the software are as follows. The mini-batch size is 128 and the maximum number of epochs is 30. The initial learning rate is 0.001 and is decreased by a factor of 0.1 every 10 epochs.
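The update rule and the learning-rate schedule can be illustrated with a short Python sketch. It uses one common formulation of stochastic gradient descent with momentum (a velocity term that accumulates past gradients); whether this matches Eqs. (11) and (12) term by term is an assumption, since those equations are not reproduced here. The hyperparameter values (momentum 0.9, initial rate 0.001, factor 0.1 every 10 epochs, 30 epochs) are taken from the text, while the function names and the toy objective are hypothetical.

```python
def learning_rate(epoch, initial=0.001, drop_factor=0.1, drop_period=10):
    """Piecewise-constant schedule: multiply the rate by `drop_factor`
    after every `drop_period` epochs (epochs are counted from 0)."""
    return initial * drop_factor ** (epoch // drop_period)

def sgdm_step(params, grads, velocity, lr, momentum=0.9):
    """One momentum update: v <- momentum * v - lr * grad; p <- p + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Toy objective L(theta) = theta^2 with gradient 2 * theta, one update
# per epoch, standing in for the real mini-batch loss of the CNN.
params, velocity = [1.0], [0.0]
for epoch in range(30):
    lr = learning_rate(epoch)
    grads = [2.0 * p for p in params]
    params, velocity = sgdm_step(params, grads, velocity, lr)
```

The schedule yields rates of 0.001, 0.0001 and 0.00001 for the three blocks of ten epochs, so most of the movement in parameter space happens in the first block.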

Network Configuration
We construct a nine-layer deep CNN to classify our renal CT dataset. The structure of this CNN is shown in Fig. 3. The parameter values of each layer of the CNN are described in Tab. 5. The input data are the preprocessed kidney CT images with size 72 × 72 and 3 channels. One whole convolutional layer is composed of the convolution operation directly followed by a BN and a ReLU stage. The output size of a convolution layer follows from its input size, kernel size, stride and padding. Given that the input size of the Conv_1 layer is [72,72,3], the kernel size is [3,3,16], the stride size is [2,2] and the padding is [0, 1, 0, 1], the output width is (72 − 3 + 0 + 1)/2 + 1 = 36. Therefore, the output size of Conv_1 is [36,36,16]. Our CNN has six such convolution layers to extract multiple feature maps. At the end of the CNN pipeline, three fully connected layers (FCL) are added. Two FCLs with ReLU activation are used to output values. FCL(50) means there are 50 neurons in this FCL. Another FCL with softmax activation, FCL(2), outputs the probabilities for the binary image classification.

Performance of Proposed CNN
We train the proposed CNN ten times and obtain the prediction results on the test dataset. Tab. 6 shows the performance of the ten runs evaluated by the six indicators: sensitivity, specificity, precision, accuracy, F1 and MCC. Each row gives the performance of one run. The mean and the standard deviation over the ten runs are given in the last row. The results indicate that our CNN classifier performs well and steadily, because the average values of the first five indicators are all above 91.98% and their standard deviations are less than 2.40%.

Result of Data Augmentation
Here we investigate the effect of data augmentation of renal lesion images using our nine-layer CNN on the kidney CT dataset. Besides the above experiments, we run the CNN training ten times on the original dataset. As shown in Tab. 8, the corresponding averages of the ten-run experiments with DA are 0.72%, 0.99%, 0.93%, 0.85%, 0.86% and 1.63% higher than those without DA. This indicates that DA can improve the classification performance by enlarging the training data. On the other hand, data augmentation of each image in the original dataset takes nearly 2 seconds. Because image transformation is straightforward and the dataset is small, the time cost is quite short. At the same time, the augmented images are stored on the hard disk to save RAM. Therefore, the storage cost of data augmentation is acceptable relative to the large capacity of current hardware. Overall, data augmentation has a positive effect.

Optimal Structure of Convolutional Layers
The convolutional layers perform multiple feature extraction in a deep neural network. With the three FCLs fixed as in Tab. 5, we check how many convolutional layers the CNN should have to obtain the best performance. The number of convolutional layers is increased from a small value to a large value. Tab. 9 shows the experimental results of five CNNs with 3 to 7 convolutional layers. In each of them, the last convolutional layer has the same parameter settings as Conv_6 in Tab. 5. The results indicate that the CNN with 6 convolutional layers is the optimal structure, since the performance does not improve further according to the six indicators.

Optimal Number of FCL
Here we fix six convolutional layers and tune the number of FCLs carefully from a small to a large value. The experiments change the number of FCLs from 2 to 5. The results are shown in Tab. 10. The input to the first FCL is the output of the sixth convolutional layer, which has 3 × 3 × 128 = 1152 dimensions.
When the number of FCLs is set to 2, FCL(50) and FCL(2) are applied in sequence. When it is set to 3, FCL(50), FCL(10) and FCL(2) are applied in sequence. When it is set to 4, FCL(50), FCL(25), FCL(10) and FCL(2) are applied in sequence. When it is set to 5, FCL(50), FCL(25), FCL(10), FCL(5) and FCL(2) are applied in sequence. The results indicate that the CNN with 3 FCLs performs the best.
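To give a sense of the size of each fully connected configuration, the following Python sketch counts the weights and biases of the four FCL stacks listed above. It assumes every unit is connected to all units of the previous layer and has exactly one bias; the function name `fcl_parameters` is hypothetical, and the counts are derived here rather than reported in the paper.

```python
def fcl_parameters(input_dim, layer_sizes):
    """Count weights and biases of a stack of fully connected layers,
    assuming dense connections and one bias per unit."""
    total = 0
    fan_in = input_dim
    for units in layer_sizes:
        total += fan_in * units + units  # weight matrix plus biases
        fan_in = units
    return total

conv_output_dim = 3 * 3 * 128  # 1152, output of the sixth convolutional layer

# The four configurations compared in the experiments.
configs = {
    2: [50, 2],
    3: [50, 10, 2],
    4: [50, 25, 10, 2],
    5: [50, 25, 10, 5, 2],
}
param_counts = {n: fcl_parameters(conv_output_dim, sizes)
                for n, sizes in configs.items()}
```

Under these assumptions the first layer, FCL(50), accounts for 57650 of the parameters in every configuration, so the four stacks differ by fewer than 1500 parameters in total.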

Comparison to State-of-the-Art Algorithms
To validate the advantages of our method over previous methods, we compare it with related works on kidney image classification. One is the PNN model used in [4] with selected features, which include the mean, entropy and standard deviation of ultrasound images. The other is our previous 7-layer deep CNN [5]. The results given in Tab. 11 show that the average overall accuracy of our nine-layer CNN is 1.71% higher than that of the 7-layer CNN and over 26% higher than that of the PNN. In fact, our proposed 9-layer CNN without data augmentation achieves an average accuracy of 91.22 ± 1.07%, which is 0.86% better than the previous 7-layer CNN.
We also compare the time costs of these methods. The training times are listed in Tab. 11. The nine-layer CNN proposed in this paper takes 811.86 seconds for ten training runs on the original dataset, which is called the original training time. When it runs on the augmented training dataset, which is 180 times the size of the original training dataset, it takes about 180 times the original training time. So, on average, one training run on the DA dataset costs about 4 hours. This is evidently more time-consuming than previous methods. Nevertheless, the trade-off is worthwhile to obtain a more accurate classification model in the training stage. In the test stage, the test time of our DA-CNN is comparable, because it takes 10.37 seconds on 82 test samples. That is, less than 0.13 seconds is needed to identify whether one image contains a renal lesion. All in all, the new method is faster than manual judgment and gives more accurate predictions than previous methods.
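The timing figures quoted above can be reproduced with a few lines of arithmetic, shown here in Python. All input numbers are taken from the text; the variable names are illustrative, and the linear scaling of training time with dataset size is the paper's own approximation.

```python
# Figures quoted in the text.
total_original_seconds = 811.86   # ten training runs on the original dataset
runs = 10
augmentation_factor = 180         # augmented set is 180 times the original
test_seconds = 10.37              # time to classify the whole test set
test_samples = 82

# One training run on the original data, in seconds.
per_run_original = total_original_seconds / runs

# One training run on the augmented data, assuming linear scaling.
per_run_augmented_hours = per_run_original * augmentation_factor / 3600

# Average inference time per test image, in seconds.
seconds_per_image = test_seconds / test_samples
```

The first result is about 4.06 hours per training run on the augmented data, and the second is about 0.126 seconds per image, in line with the "about 4 hours" and "less than 0.13 seconds" stated above.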

Discussion
In our deep learning algorithm, the number of convolutional kernels increases as the layers pile up, while their size stays the same. This is the key to how a CNN extracts a large number of local features that replace the limited set of predefined features used to differentiate the categories of samples. In the fully connected layers, the number of nodes per FCL decreases as the layers pile up. This gathers the different features layer by layer to summarize the categories.
Meanwhile, data augmentation has a positive effect on training a more accurate model. After the training dataset is enlarged by DA, the learning model converges after a certain number of epochs. So the maximum number of epochs can be set to 10 to reduce the training time reasonably.
From the above four groups of comparative experiments, we obtain the optimal structure of the nine-layer CNN. In general, the number of training samples, convolutional layers and fully connected layers all affect the performance of our CNN algorithm to some extent, at a moderate time cost.

Conclusion
In this paper, a nine-layer convolutional neural network is proposed to classify renal CT images. Four groups of comparative experiments indicate that the structure of this CNN is optimal among the tested configurations and achieves good performance, with an average accuracy of 92.07 ± 1.67%. Although our renal CT dataset is not very large, we augment the training data with geometric transformations (affine, translation, rotation and scaling) and color-space transformations (gamma and noise). Experimental results validate that Data Augmentation (DA) on the training data improves the average accuracy of our proposed CNN by about 0.85% compared to training without DA.
Despite these results, some work remains to be done in the future. (i) The optimal structure of the convolutional and fully connected layers has been verified in our method, but pooling layers were not considered. The effect of pooling could be compared further. (ii) We compared our method with some related works, but more deep neural network algorithms should be covered to find the best result. (iii) Our dataset includes CT images collected from both general and enhanced CT scans. Whether the difference in brightness affects the classification performance could be investigated.
Currently, radiology is applying AI to medical image recognition in clinical practice. Our algorithm is validated to be faster than manual judgment and more accurate than previous methods. As the number of images in daily examinations increases, manual diagnosis is becoming more laborious and time-consuming. Therefore, this kind of automatic identification of abnormal images may be a promising preliminary screening approach to help clinical radiologists and doctors reduce their workload. Future work also needs to focus on the generalization and interpretability of deep learning methods.
Funding Statement: This study was supported by the National Educational Science Plan Foundation "in 13th Five-Year" (DIA170375), China; the Guangxi Key Laboratory of Trusted Software (kx201901); and the British Heart Foundation Accelerator Award, UK.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.