1. Introduction
The World Health Organization (WHO), in its World Malaria Report 2018, estimated that there were 212 million malaria patients and as many as 435,000 malaria deaths worldwide. In tropical Africa, an estimated 3.1 billion US dollars are lost each year to increased public health expenditure, and the disease also adversely affects tourism [1,2]. Malaria is caused by Plasmodium parasites, which enter the human body through the bites of female Anopheles mosquitoes; a mosquito that bites a malaria patient can then carry the parasite to other people. The disease does not spread directly from person to person, although it can be transmitted from mother to fetus, through blood transfusions or through shared syringes [3,4]. The symptoms of an infected person resemble those of the flu and can also include a high fever, chills, septicemia, pneumonia, gastritis, enteritis, nausea and vomiting, and the disease can be fatal [5,6]. Malaria is most often found in hot, humid regions near natural water resources, which form the habitat of the Anopheles mosquitoes that carry the disease [7,8].
In the standard laboratory method for diagnosing malaria, a centrifuge separates white blood cells from red blood cells so that only the red blood cells are analyzed on a blood film. This method is known as the dipstick method and includes thick and thin blood smears [9,10]. Detection by microscopy reports both the amount and the species of the infection, which helps to diagnose malaria and is also useful for monitoring the treatment of patients. Once diagnosed, malaria patients are treated without delay with antimalarial agents such as chloroquine, doxycycline, quinine sulfate, hydroxychloroquine and mefloquine [11]. Thick and thin blood smears show the red blood cells (RBCs) on a blood film and reveal features such as the color, size, texture, morphology and position of the parasite in the patient's blood. They are the most popular method for diagnosing malaria in clinics, hospitals and medical laboratories because they are inexpensive, which matters for an endemic disease such as malaria [12,13].
Figure 1 presents the dipstick method.
These methods rely on close examination of blood smears under a microscope, which provides images of the patient's blood in which the doctor or medical laboratory technologist looks for parasites in RBCs. Deep learning is a subset of biologically inspired machine learning methods designed to imitate information processing and decision making in the human brain. The functions of the human brain are much wider than current deep learning capabilities and include organization, awareness, personality, etc. [14]. Many research techniques now use deep learning in widely used computer vision, pattern recognition and commercial applications. The convolutional neural network (CNN) is a class of deep neural networks characterized by a shared-weights architecture and translation invariance, and it is therefore often used for image analysis [15].
The effectiveness of learning in CNN models can be improved further. Important factors include improving weight initialization through transfer learning and using data augmentation and dropout as regularization methods to combat overfitting during training [16,17,18]. Training a CNN model requires a large dataset so that the model can learn complex, detailed feature patterns and achieve an appropriate classification performance [19,20]. Researchers therefore often try to reduce the time the CNN model needs to learn useful features from the dataset by fine-tuning the hyperparameters of the methods mentioned above. This shortens training and supports efficient learning from small- and medium-sized datasets [21]. In 2018, Rajaraman et al. developed a CNN model to improve the performance of a computer-aided diagnosis (CAD) system that detects malaria in cell images obtained from thin blood smears. Their research used deep learning to help distinguish malaria-infected from uninfected blood cells. The CAD system is intended to help with the screening of malaria patients, thus reducing the workload of practitioners who diagnose large numbers of patients. It also helps to enhance the accuracy of malaria detection by radiologists with little experience in diagnosing this disease [15]. The model was developed by tuning the hyperparameters of the optimizers, originally stochastic gradient descent (SGD) and Adam, by adjusting the learning rate, and by using CNN architectures such as VGG-16, ResNet50 and Xception with the rectified linear unit (ReLU) [1]. In 2019, the Mish activation function was reported to achieve an accuracy 1.671% higher than that of ReLU on the CIFAR-100 dataset, which makes it one of the most effective activation functions. Compared with the Swish activation function, which was developed in 2018, Mish was still 0.494% more accurate, as validated with a 70-item benchmark dataset [22]. In 2015, the optimizer named Nesterov-accelerated adaptive moment estimation (Nadam) was developed by combining Adam, which was developed in 2014, with the Nesterov accelerated gradient; Nadam is used in this research [23].
The performance of Xception [24] is slightly better than that of Inception-v3 [25] on the ImageNet dataset [26]. This gain does not result from expanded capacity, as Xception has the same number of parameters as Inception-v3, but from a more effective use of the model parameters. In 2018, a study used the VGG-16 model in combination with transfer learning to automatically classify single cells in thin blood smears on standard microscope slides. The dataset consisted of 27,578 single-cell images of uninfected and infected samples from Chittagong Medical College Hospital, Bangladesh. The images were resized to 44 × 44 pixels with three color channels (red, green, blue), and the resulting CAD system diagnosed malaria with an accuracy of 97.37% [15]. In 2017, a CNN and a support vector machine (SVM) were used to diagnose malaria from 1034 infected cell images and 1531 uninfected cell images collected from the University of Alabama at Birmingham. The dataset was divided into two sets of approximately equal size; the SVM provided an accuracy of 91.66% and the CNN an accuracy of 95% [27]. In 2020, ResNet was used to increase the effectiveness of training on a dataset consisting of 1,182 blood cell images collected by microscopic observation at three magnifications (200x, 400x and 1000x) with a resolution of 750 × 750 pixels. To create the CNN model, the dataset was divided into 80% for training and 20% for validation, and an accuracy of 98.08% was achieved [28].
Masud et al. developed a CNN model by fine-tuning the hyperparameters of a pretrained model and improved its performance by using the triangular2 cyclical learning rate policy, which finds the best learning rate for SGD, to improve malaria detection [29]. Vijayalakshmi et al. proposed CNN models (VGG16, VGG19) combined with support vector machines (SVM) to determine the stages of parasite infection and reduced the training time by using pre-trained CNN models and transfer learning [30]. A further study aimed to improve the architecture by using the state-of-the-art Mish activation function to increase the performance of the CNN model, and proposed optimizing the model's effectiveness by using other optimizers, such as SGD and Nadam [31]. The contribution of the research in [32] was a CNN model obtained by fine-tuning the hyperparameters of a pre-trained model using transfer learning.
This paper used the above-mentioned powerful techniques to develop the research. The contribution of the proposed work aims at the improvement of the CNN model and fine-tuning it to develop a CAD system for the detection of malaria by applying Mish, which is considered to be an effective activation function. This research was conducted to examine the use of Xception architecture with a combination of Mish and Nadam. If ReLU is replaced by Mish for use inside Xception, the enhancement of the performance of the image classification may be achieved, particularly when compared with the original Xception architecture, as well as other types of CNN architecture. In sum, the proposed deep learning model utilized Xception in combination with Mish and Nadam and this method achieved an accuracy of 98.86% on the malaria detection task. Hence, it is feasible to employ the presented deep learning model for malaria detection.
3. Implementation Details
This study involved the development of a CAD system for detecting malaria in thin blood smear images with deep learning techniques. Below we provide the description of the implementation environment that included software and hardware. The details are shown in
Table 1.
This research uses six CNN models that are popular in computer vision for image classification: AlexNet [53], VGG-16 [54], NasNetMobile [55], ResNet-50 [56], Inception-V3 [25] and Xception [24]. For each model, parameters including the optimizer, batch size, learning rate, activation function, dropout and loss function are tuned to make optimization more efficient. In this experiment, the optimizers are SGD with a learning rate of 0.002, RMSProp with a learning rate of 0.001 and Nadam with a learning rate of 0.002. These values are based on the research presented in [34,57,58].
The dropout rate is set to 0.5 and the batch size to 20, which are used to increase training speed. The activation functions are ReLU and Mish, which is one of the most effective state-of-the-art approaches; the loss function is cross-entropy; and the Softmax function converts the outputs of the final layer into the probabilities used to predict malaria [62,63,64]. Training runs for 50 epochs, and the output layer of the CNN model in this research has two classes, infected and uninfected, as shown in Table 2.
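The softmax and cross-entropy steps described above can be sketched for a two-class output layer as follows. This is a minimal illustration and not the code used in the study; the function names and the example scores are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Cross-entropy loss for one example whose true class is true_index."""
    return -math.log(probs[true_index])

# Hypothetical scores for the two classes: [uninfected, infected]
logits = [0.3, 2.1]
probs = softmax(logits)
loss = cross_entropy(probs, true_index=1)  # the true class is "infected"
print(probs, loss)
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability falls, which is what drives the weight updates during training.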
In addition to the gradient derived from the cost function, training with gradient descent depends on another parameter that must be tuned: the learning rate, or alpha, of the optimization algorithm. The choice of learning rate directly affects the performance of the gradient descent algorithm.
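The effect of the learning rate can be illustrated with plain gradient descent on a one-dimensional quadratic cost. This is only a sketch: the cost function (x - 3)^2, the starting point and the three learning rates are assumptions chosen for illustration, with 0.002 matching the SGD setting mentioned earlier.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Repeatedly apply the update x <- x - alpha * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x

def grad(x):
    """Gradient of the illustrative cost (x - 3)^2, whose minimum is at x = 3."""
    return 2.0 * (x - 3.0)

small = gradient_descent(grad, x0=0.0, learning_rate=0.002, steps=100)
moderate = gradient_descent(grad, x0=0.0, learning_rate=0.1, steps=100)
too_large = gradient_descent(grad, x0=0.0, learning_rate=1.1, steps=100)
print(small, moderate, too_large)
```

After the same number of steps, the small learning rate is still far from the minimum, the moderate one has essentially converged, and the overly large one diverges.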
3.1. Dataset Setting
Raw thin blood smear film images are not directly suitable for training CNN models, so the images must first be adjusted. Rotation is a popular technique for increasing the number of images in a dataset and is used to make small datasets more effective, but images are typically rotated by no more than 90 degrees. In this study, each image is assigned a random angle from 0 to 270 degrees, using shuffle sampling together with rotation. These methods increased the malaria image dataset to 7000 images, consisting of original thin blood smear images and images obtained by the rotation and sampling techniques, which were applied so as to reduce data duplication [65,66]. The data enhancement flip diagram is shown in Figure 6. The images were normalized by rescaling the pixel intensity values to the range 0 to 1. In this research, we resize the images to suit the structure of the CNN models used in the CAD system by adjusting the matrix size to 224 × 224 × 3 or 299 × 299 × 3, with red, green and blue channels (the RGB color system). The malaria dataset was split into 80% for training and 20% for validation, and the final model was applied to 700 images (10% of the total number of images) to test the CNN model. The research used a region of interest (ROI) to detect the image boundaries, which does not affect other parts of the image [67,68].
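The preparation steps described above (random rotation, normalization to the range 0 to 1 and an 80%/20% split) can be sketched in plain Python. The sketch makes several assumptions that are not stated in the text: rotation angles are restricted to multiples of 90 degrees, images are small single-channel grids of 8-bit values, and the helper names and seed are hypothetical.

```python
import random

def rotate90(image):
    """Rotate a 2-D image (a list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def rotate(image, angle):
    """Rotate by 0, 90, 180 or 270 degrees (assumed multiples of 90)."""
    if angle not in (0, 90, 180, 270):
        raise ValueError("angle must be 0, 90, 180 or 270")
    for _ in range(angle // 90):
        image = rotate90(image)
    return image

def normalize(image):
    """Rescale 8-bit pixel intensities from 0..255 to 0..1."""
    return [[pixel / 255.0 for pixel in row] for row in image]

def split(items, train_fraction, seed=0):
    """Shuffle the items and split them into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

image = [[0, 255], [128, 64]]                       # toy 2 x 2 "image"
angle = random.Random(0).choice((0, 90, 180, 270))  # random augmentation angle
augmented = normalize(rotate(image, angle))
train, val = split(range(7000), train_fraction=0.8)
print(angle, augmented, len(train), len(val))
```

With 7000 images, an 80%/20% split gives 5600 training and 1400 validation images; how the 700 test images were drawn is not specified in the text, so it is not modeled here.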
3.2. Xception Architecture, Activation Function (Mish) and Revision of the Model
The continuous improvement of CNN architecture enables more accurate image recognition. The Xception architecture was built upon a variety of essential principles, including a convolutional layer, a depth-wise convolutional layer, and a separable convolutional layer. Furthermore, the activation function is required for this architecture, wherein Mish is an innovative activation function, which provides an alternative to commonly used activation functions such as ReLU. This subsection introduces the updated Xception architecture, including the latest Xception with Mish design [
22,
24].
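For reference, Mish is defined as f(x) = x · tanh(softplus(x)), where softplus(x) = ln(1 + e^x). The following sketch compares it with ReLU on a few inputs; the helper names are illustrative and the code is not taken from the study.

```python
import math

def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

for x in (-5.0, -1.0, 0.0, 1.0, 10.0):
    print(x, relu(x), mish(x))
```

Unlike ReLU, Mish lets small negative values through, is smooth everywhere and is non-monotonic on the negative side: its output at -1 is lower than its output at -5. It is bounded below (by roughly -0.31) and approaches the identity for large positive inputs.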
3.2.1. Xception Architecture
Xception builds on the original Inception design and rests on the hypothesis that cross-channel correlations and spatial correlations within a CNN's feature maps can be fully decoupled. The framework handles cross-channel correlations by splitting the input data four ways, applying 1 × 1 convolutions and average pooling, and then maps spatial correlations with 3 × 3 convolutions and forwards the results for concatenation [24], as shown in Figure 7.
The proposed depth-wise separable convolution can also identify salient objects in image detection by using a 3 × 3 convolution kernel. Point-wise convolution (PW), commonly known as 1 × 1 convolution, is mainly used to reduce data dimensionality and the number of parameters. In Xception, PW is used to change three feature maps into six feature maps, which enriches the features of the input data [69], as shown in Figure 8.
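The parameter saving of a depth-wise separable convolution can be checked with a short calculation. The sketch below counts weights only (biases are ignored, which is an assumption) for the case described above: a 3 × 3 kernel mapping three input feature maps to six output feature maps.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one kernel x kernel x c_in filter per output map."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depth-wise convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = kernel * kernel * c_in  # one spatial filter per input map
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

standard = standard_conv_params(kernel=3, c_in=3, c_out=6)
separable = separable_conv_params(kernel=3, c_in=3, c_out=6)
print(standard, separable)
```

For this small case the standard convolution needs 162 weights and the separable version 45; the gap widens as the number of channels grows.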
3.2.2. Convolution Kernel Replacement
Even with PW, the 3 × 3 and 1 × 1 convolution kernels still involve a large number of parameters, so direct computation remains difficult and the training time is long. Xception is therefore re-optimized by replacing large convolution kernels with multiple small convolution kernels [54].
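The idea of replacing a large kernel with several small ones can be made concrete with a small calculation. The sketch assumes stride 1, no dilation and a single input and output channel; the helper names are hypothetical.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked kernel x kernel convolutions (stride 1)."""
    return 1 + layers * (kernel - 1)

def stacked_weights(kernel, layers=1):
    """Weights per input/output channel pair for `layers` stacked convolutions."""
    return layers * kernel * kernel

# Two stacked 3 x 3 kernels cover the same 5 x 5 region as one 5 x 5 kernel.
print(stacked_receptive_field(3, 2), stacked_weights(3, 2), stacked_weights(5))
# Three stacked 3 x 3 kernels cover the same 7 x 7 region as one 7 x 7 kernel.
print(stacked_receptive_field(3, 3), stacked_weights(3, 3), stacked_weights(7))
```

Two 3 × 3 kernels use 18 weights instead of 25, and three use 27 instead of 49, while each stacked layer also adds a further nonlinearity.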
In the max-pooling operation, v represents the max-pooling filter. The output feature map is denoted Fm and is arranged by shape and size, where every element of Fm stores the highest value of Fi in the input feature map [70]. In Xception with Mish, each module is positioned as in the original Xception architecture, as demonstrated in Figure 8. At each activation function, only ReLU is replaced with Mish. As a small change, an additional Mish is appended after global average pooling and before logistic regression. The original Xception model is well suited to image classification; still, sustained development must involve enhancing the classification operation.
To evaluate the performance, we examined the Mish activation function. Accordingly, the design for the original Xception is used as the basis for the novel model, though it employs the Mish activation function to boost the performance of image classification.
Here, Iv represents the input channels and Ov represents the output channels of the layers. The function f(Iv,{Pi}) represents the residual mapping to be learned. One advantage of the residual connection is its capacity to avoid signal attenuation as the signal passes through many stacked nonlinearities [71], as shown in Figure 9.
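A residual connection is commonly written as Ov = f(Iv, {Pi}) + Iv; this standard form is assumed here, since the equation itself does not appear in the text. The sketch below uses a hypothetical attenuating nonlinearity to show why the skip path helps when many layers are stacked.

```python
import math

def layer(x):
    """A hypothetical nonlinearity that strongly attenuates its input."""
    return math.tanh(0.1 * x)

def plain_stack(x, depth):
    """Pass x through `depth` layers without skip connections."""
    for _ in range(depth):
        x = layer(x)
    return x

def residual_stack(x, depth):
    """Pass x through `depth` layers, adding the input back each time: f(x) + x."""
    for _ in range(depth):
        x = layer(x) + x
    return x

plain_out = plain_stack(1.0, depth=20)
residual_out = residual_stack(1.0, depth=20)
print(plain_out, residual_out)
```

Without skip connections the signal shrinks toward zero after 20 layers, whereas with the residual path the input is carried through every layer and is not attenuated.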
4. Experimental Results
Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 show the CNN models' performance using Mish and three optimizers.
Table 3 compares the malaria detection effectiveness of the traditional NasNetMobile, which uses ReLU, with that of NasNetMobile using Mish. NasNetMobile achieved its best results with Mish and Nadam: an F1 measure of 90.99%, a recall of 90.98%, a precision of 91.01% and an accuracy of 91%, with an execution time of 72 min 12 s. NasNetMobile combined with ReLU and SGD was the least effective, with an F1 measure of 78.64%, a recall of 78.63%, a precision of 78.64% and an accuracy of 78.64%.
Inception-V3 achieved its best results with Mish and Nadam: an F1 measure of 95.20%, a recall of 95.21%, a precision of 95.21% and an accuracy of 95.21%, with an execution time of 67 min 12 s. Inception-V3 combined with ReLU and SGD was the least effective, with an F1 measure of 87.28%, a recall of 87.31%, a precision of 87.28% and an accuracy of 87.29%.
Table 4 demonstrates the results of these models.
Table 5 shows the effectiveness of Xception for the detection of malaria and how Mish and the choice of optimizer can improve its performance. Xception with Mish and Nadam predicted malaria with an F1 measure of 99.28%, a recall of 99.28%, a precision of 99.29% and an accuracy of 99.28%, with an execution time of 125 min 29 s.
Xception combined with ReLU and SGD was the least effective, with an F1 measure of 93.49%, a recall of 93.50%, a precision of 93.49% and an accuracy of 93.50%. The performance of Xception using Nadam and Mish is demonstrated in Figure 10, which displays the effectiveness of CNN model training using a training dataset. The confusion matrix for Xception using Nadam and Mish is shown in Figure 10a. Xception correctly predicted an uninfected status for 709 images and an infected status for 681 images, and it predicted incorrectly for 10 images. Figure 10b demonstrates the results of Inception-V3; this model correctly predicted an uninfected status for 665 images and an infected status for 644 images.
Table 6 compares the malaria detection effectiveness of the traditional AlexNet, which uses ReLU, with that of AlexNet using Mish. AlexNet achieved its best results with Mish and Nadam: an F1 measure of 82.70%, a recall of 82.78%, a precision of 82.92% and an accuracy of 82.71%, with an execution time of 15 min 15 s. AlexNet combined with ReLU and SGD was the least effective, with an F1 measure of 76.05%, a recall of 76.05%, a precision of 76.07% and an accuracy of 76.07%.
Table 7 compares the malaria detection effectiveness of the traditional VGG-16, which uses ReLU, with that of VGG-16 using Mish. VGG-16 achieved its best results with Mish and Nadam: an F1 measure of 84.99%, a recall of 85%, a precision of 84.99% and an accuracy of 85%, with an execution time of 51 min 12 s. VGG-16 combined with ReLU and SGD was the least effective, with an F1 measure of 78.83%, a recall of 78.83%, a precision of 78.86% and an accuracy of 78.85%.
Table 8 compares the malaria detection effectiveness of the traditional ResNet-50, which uses ReLU, with that of ResNet-50 using Mish. ResNet-50 achieved its best results with Mish and Nadam: an F1 measure of 93.07%, a recall of 93.10%, a precision of 93.13% and an accuracy of 93.07%, with an execution time of 49 min 52 s. ResNet-50 combined with ReLU and SGD was the least effective, with an F1 measure of 86.70%, a recall of 86.78%, a precision of 86.96% and an accuracy of 86.71%.
5. Discussion
To improve CNN model performance, we can use various optimizers, activation functions and image processing techniques that extend the original malaria dataset. Furthermore, the image classification ability can be boosted by data augmentation approaches. Each optimizer is adjusted through its own parameters. The arguments of Nadam comprise the learning rate, epsilon, beta_1 and beta_2; the arguments of RMSprop comprise the learning rate, momentum, epsilon and rho; and the arguments of SGD comprise the learning rate, momentum and Nesterov [54,63], as shown in Table 9.
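To show how the Nadam arguments listed above enter the update, the following sketch implements a simplified Nadam step for a single scalar parameter. It is an assumption-laden illustration: it omits the momentum schedule of the original formulation, framework implementations may differ in detail, and the default values for beta_1, beta_2 and epsilon are common library defaults rather than values taken from Table 9. The quadratic cost is hypothetical.

```python
import math

def nadam_minimize(grad, x0, learning_rate=0.002, beta_1=0.9, beta_2=0.999,
                   epsilon=1e-7, steps=20000):
    """Minimize a scalar function with a simplified Nadam update."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta_1 * m + (1.0 - beta_1) * g      # first-moment (momentum) estimate
        v = beta_2 * v + (1.0 - beta_2) * g * g  # second-moment estimate
        m_hat = m / (1.0 - beta_1 ** t)          # bias correction
        v_hat = v / (1.0 - beta_2 ** t)
        # Nesterov look-ahead: blend the corrected momentum with the current gradient
        nesterov_m = beta_1 * m_hat + (1.0 - beta_1) * g / (1.0 - beta_1 ** t)
        x -= learning_rate * nesterov_m / (math.sqrt(v_hat) + epsilon)
    return x

# Illustrative cost (x - 3)^2 with its minimum at x = 3
x_min = nadam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

Here beta_1 and beta_2 control how quickly the two moving averages forget old gradients, and epsilon only guards against division by zero.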
Xception rests on a hypothesis derived from Inception, which concerns the cross-channel correlations and spatial correlations within the feature maps of a CNN model. As shown in Figure 9, Xception departs more markedly from the conventional convolution by using a depth-wise separable convolution, in which a depth-wise convolution is paired with a point-wise convolution that uses a 1 × 1 kernel. On this basis Xception was created, and its author called it Extreme Inception.
This experiment faced several limitations. First, the available computer hardware did not meet the recommended specifications, so the application software could not be used in this experiment. Contemporary high-performance computer hardware would make extensive image assessment feasible. The performance of the classification models under several optimization approaches is compared in Table 10. Xception with Nadam and Mish was the most accurate of the CNN models, with an accuracy of 99.28%. Inception-V3 with Nadam and Mish provided the second-best accuracy, 95.21%. ResNet-50 with Nadam and Mish offered the third-highest accuracy, 93.07%, followed by NasNetMobile with Nadam and Mish at 91%. In terms of time consumption, the shortest execution time, 14 min 24 s, came from AlexNet combined with ReLU and SGD. The next-shortest came from ResNet-50 combined with ReLU and SGD, at 48 min 34 s. VGG-16 combined with ReLU and SGD had the third-lowest time consumption of 49 min 51 s. NasNetMobile combined with ReLU and SGD had the fourth-lowest time consumption of 63 min 31 s. Inception-V3 combined with ReLU and SGD had the fifth-lowest time consumption of 64 min 17 s. Xception combined with ReLU and SGD had the longest time consumption of 121 min 15 s.
To optimize network training, this study set the batch size according to the criteria used for each CNN model. The batch size is the number of samples processed in one training step. A larger batch size can improve the model's learning, but it also increases GPU memory usage; when the available GPU memory is limited, it is safer to use a lower value. In this study, the accuracy obtained with Mish was higher than that obtained with ReLU. Mish is smooth at every point. It has a lower bound but no upper bound, and its smooth and non-monotonic properties also influence performance. Analysis of the validation accuracy is shown in Figure 11 and Figure 12.
Figure 11 compares the training and validation accuracy of Xception using Nadam with Mish against that of the traditional Xception, which uses ReLU and reaches an accuracy of 92.45% for training and 93.50% for validation. Figure 11a shows that Xception using Nadam with Mish raises the accuracy to 98.70% for training and 99.29% for validation. The training and validation histories cover the 50 epochs used in this research.
The comparison of training and validation losses, shown in Figure 12, indicates that Xception using Nadam with Mish reduces the loss to 0.0894% for training and 0.0708% for validation, whereas the traditional Xception reaches a loss of 0.4265% for training and 0.4179% for validation.
Figure 13b shows an AUC of 98.44% for the traditional Xception, and Figure 13a shows an AUC of 99.99% for Xception paired with Nadam and Mish. Adjusting the hyperparameters using three optimizers and Mish, while setting appropriate values for each optimizer parameter to achieve the best results, allowed this research to influence the stability of traditional CNN models.
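The AUC reported above can be interpreted as the probability that a randomly chosen infected image receives a higher score than a randomly chosen uninfected one. The following sketch computes it directly from that definition; the labels and scores are hypothetical toy data, not results from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = infected, 0 = uninfected) and predicted probabilities
labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.20, 0.05]
auc = roc_auc(labels, scores)
print(auc)
```

In this toy example eight of the nine infected/uninfected pairs are ordered correctly, so the AUC is 8/9; a classifier that separates the classes perfectly would score 1.0.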
Table 11 displays the results of testing the CNN models on the testing dataset, which includes 315 images of an uninfected status and 385 images of an infected status from the malaria dataset. The traditional Xception approach produced erroneous predictions for 3.49% of the uninfected images (11 images) and 2.85% of the infected images (11 images); it correctly predicted 96.51% of the uninfected images (304 images) and 97.15% of the infected images (374 images). Xception combined with Nadam and Mish produced erroneous predictions for 1.26% of the uninfected images (four images) and 1.03% of the infected images (four images); it correctly predicted 98.74% of the uninfected images (311 images) and 98.97% of the infected images (381 images).
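The test-set figures can be cross-checked from the class totals and error counts reported above (315 uninfected and 385 infected images, with 11 + 11 errors for the traditional Xception and 4 + 4 errors for Xception with Nadam and Mish). The sketch below treats "infected" as the positive class, which is an assumption, and the function name is illustrative.

```python
def metrics(total_neg, total_pos, errors_neg, errors_pos):
    """Accuracy, precision, recall and F1 from class totals and per-class error counts."""
    tn = total_neg - errors_neg  # uninfected images predicted correctly
    tp = total_pos - errors_pos  # infected images predicted correctly
    fp = errors_neg              # uninfected images predicted as infected
    fn = errors_pos              # infected images predicted as uninfected
    accuracy = (tp + tn) / (total_neg + total_pos)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

traditional = metrics(total_neg=315, total_pos=385, errors_neg=11, errors_pos=11)
proposed = metrics(total_neg=315, total_pos=385, errors_neg=4, errors_pos=4)
print(traditional)
print(proposed)
```

The resulting accuracies are 678/700 and 692/700, which agree with the reported 96.85% and 98.86% up to rounding.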
6. Conclusions
This study aimed to apply a deep learning model for the detection of malaria. The proposed approach employed Xception, and comparisons were drawn with alternative network models, including Inception-V3, ResNet-50, NasNetMobile, VGG-16 and AlexNet. Malaria causes large numbers of fatalities every year and poses a particular threat to younger people. The CNN deep learning approach offers a means of producing effective image classification models that might be well suited to medical applications such as malaria detection and diagnosis. However, the CNN approach has not yet undergone trials using malaria images that might support doctors during initial screenings and thereby lead to faster diagnoses, which is the purpose of this research. The classification accuracy of a CNN can be improved by applying the Mish activation function. If Mish is used inside Xception in place of ReLU, the image classification performance may be enhanced, especially in comparison with the original Xception architecture and other CNN architectures. This paper used a novel Xception modification along with the Mish activation function and Nadam to explore the potential for developing a new screening system that might detect malaria. This system could be trained using benchmark malaria datasets and an augmentation technique that can improve the quality of the image dataset.
The research methodology consisted of five steps. The first and second steps prepared the data, applying data augmentation methods and then splitting the malaria dataset into three datasets for training, validation and testing. The effectiveness of the CNN model could be significantly enhanced, depending on the number of images involved and the choice of data preprocessing methods. Some CNN structures are appropriate for use as the dataset training parameters, in order to boost the accuracy and lower the amount of time required. The third step consisted of transfer learning, along with dropout, which were used to make the CNN model more efficient. Dropout addressed the problem of overfitting, while transfer learning helped to reduce the time consumption and to achieve a more accurate classification of the images. The fourth step employed the Mish activation function, combined with a cross-entropy loss function and several optimizers (SGD, Nadam and RMSprop), in order to establish which CNN model would generate the best prediction performance. The fifth step used a confusion matrix and the ROC curve to evaluate the CNN models' effectiveness for malaria cell classification.
Training of the model can be conducted using optimization and depends on the activation function, the batch size and the optimizer. The three optimizer techniques are able to determine whether it is necessary to alter the CNN model learning rate. Activation functions are still being actively studied in the field of deep learning. Currently, ReLU is a popular activation function. This situation may change, however, with the arrival of Mish. The activation function determines the scale of the output values derived from the input variables, and Mish ensures smoothness at every point. Mish takes a single scalar input, so it requires no changes to the network parameters. Mish is partly based on the self-gating property of Swish, under which the gate is provided with the scalar input. Self-gating makes it possible to replace functions such as ReLU while the parameters of the network remain unchanged. There is no upper bound for Mish, yet a lower bound does exist. Moreover, its smooth and non-monotonic qualities can provide enhanced results. In a neuron, the inputs are weighted, and the weighted result is passed on as the input to the activation function. As the model undergoes training, the original weights change and the overall accuracy gradually improves. This study has certain limitations. For instance, the computer used in the study performed below the stated requirements, and it was therefore not possible to employ the application software during the research. The performance of today's computer hardware, by contrast, is excellent and makes large-scale image analysis feasible.
A summary of the model testing performance is provided in Table 11. The original Xception model detected malaria with an accuracy of 96.85%, whereas the model that used Xception in combination with Mish and Nadam achieved an accuracy of 98.86%. This model therefore offers the best malaria detection performance and was shown to be superior to the original Xception model. The results of this study improved the optimization of CNN models with respect to each of the parameters used, including the activation function and learning rate, and therefore yielded a more efficient CNN model for malaria prediction.