Evaluation of Convolutional Neural Networks’ Hyperparameters with Transfer Learning to Determine Sorting of Ripe Medjool Dates

Convolutional neural networks (CNNs) have proven their efficiency in various applications in agriculture. In crops such as dates, they have been mainly used in the identification and sorting of ripe fruits. The aim of this study was to evaluate the performance of eight different CNNs, considering transfer learning for their training, as well as five hyperparameters. The CNN architectures evaluated were VGG-16, VGG-19, ResNet-50, ResNet-101, ResNet-152, AlexNet, Inception V3, and a CNN from scratch. The hyperparameters analyzed were the number of layers, the number of epochs, the batch size, the optimizer, and the learning rate. Accuracy and processing time were used to assess the performance of the CNN architectures in classifying mature dates of the Medjool cultivar. The model obtained from the VGG-19 architecture with a batch size of 128 and the Adam optimizer with a learning rate of 0.01 presented the best performance, with an accuracy of 99.32%. We concluded that the VGG-19 model can be used to build computer vision systems that help producers improve their sorting process by detecting the Tamar stage of a Medjool date.


Introduction
The date palm fruit (Phoenix dactylifera L.) is a berry composed of a fleshy mesocarp, covered by a thin epicarp and an endocarp covering all of its seed [1]. The name of this fruit is "date," which comes from the Greek word "Daktylos," which means "finger" [2]. This fruit has been the primary source of food in several countries in the Middle East, playing an essential role in the economy, society, and environment [3].
This fruit's growth presents a progressive maturity level in four stages known by their Arabic names: Kimri, Khalal, Rutab, and Tamar. In its first stage of growth (Kimri), the fruit is small, green, and hard in texture. In its second stage (Khalal), the fruit reaches its maximum size and changes its green color to yellow or red. In the third stage (Rutab), the fruit loses weight and moisture and turns brown. In the last stage (Tamar), the fruit is ripe and ready to be harvested [4].
According to Food and Agriculture Organization of the United Nations data, the world's largest date producers are Egypt, Saudi Arabia, Iran, Algeria, and Iraq, producing 66% of the world production in 2018 [5]. However, despite not being a native crop of the American continent, the date has also become a priority fruit for cultivation in southern California and Arizona in the United States and northwestern Mexico, where high-quality dates, such as Medjool cultivar, are grown [6].
Date palm producers face several challenges in harvesting, sorting, and packaging because these activities are mainly performed manually [7]. Therefore, many workers are hired for these activities, which involve long working hours. People perform repetitive tasks, which leads to mistakes in the inspection of the fruit's quality attributes, such as color (maturity level), size, and texture.
Particularly in the Medjool date harvesting process, fruit pickers shake the palm bunch so that the ripe dates fall into containers. This can damage the texture of the ripe fruit or cause fruits in other ripening stages to be harvested as well. The dates are placed in trays, and the immature ones are removed and grouped in other trays to dry in the sun until they reach full maturity. In contrast, the small or damaged ones are commonly separated to develop date by-products or for animal consumption. Finally, the sorted Medjool dates (those with the required degree of maturity) are packed.
Regarding dates, we identified studies that use machine learning algorithms and image processing techniques to sort date palm fruit or to distinguish among their maturity stages [15-17]. There are also research works that propose using CNNs [8,9,18]. However, these studies do not present models to detect the maturity stage of the Medjool date.
The main contribution of this article was the identification of the hyperparameters that had the most favorable influence on the training of a CNN architecture that uses transfer learning for sorting mature Medjool dates. To achieve this, we compared the performance of eight CNN architectures: two versions of the Visual Geometry Group (VGG) architecture from Oxford University, VGG-16 and VGG-19; three versions of the Residual Network (ResNet) architecture from Microsoft Research, ResNet-50, ResNet-101, and ResNet-152; and AlexNet, Inception Version 3, and a CNN from scratch. The hyperparameters analyzed were the number of layers, the number of epochs, the batch size, the optimizer, and the learning rate.

Image Acquisition
The images corresponding to ripe and unripe dates in trays were taken in September 2020, during the first round of harvest of Medjool dates in the plantation located in Colonia La Herradura (32°36′56″ N, 115°15′36″ W) in the Mexicali Valley, Mexico. The acquisition of images was made with three different cameras, using natural light between 8:00 a.m. and 2:00 p.m. We used a Canon EOS Rebel T6 of 18 megapixels and the cameras of the Samsung smartphones SM-N950F and SM-N960F, which have a dual camera of 12 megapixels.

Image Data Set
The image data set contained 1002 images in JPG format in different sizes (5184 × 3456, 4449 × 3071, and 4376 × 3375 pixels). The network architectures were trained with JPG images because, in real scenarios, they are fed low-quality images. By low-quality, we mean images with blur, noise, contrast, or compression issues. We considered that if an architecture is trained with this type of image, the system will be able to classify the Medjool date in images with these features. Further, a study shows that the performance of convolutional neural networks is minimally affected by using the JPG format [19].
The image data set was distributed as follows: 501 images of ripe dates and 501 images of unripe dates on trays (Figure 1). The dates in trays were previously classified as ripe or unripe by experts.

Convolutional Neural Networks
The convolutional neural network (CNN) is a type of artificial neural network in which neurons correspond to receptive fields, similar to the neurons in the primary visual cortex of a biological brain [20]. A CNN is similar to ordinary neural networks such as the multilayer perceptron: it is composed of neurons with learnable weights and biases. Each neuron receives some input, performs a scalar product, and then applies an activation function [21]. Like a multilayer perceptron, a CNN has a loss or cost function on the last layer, which is fully connected. Figure 2 presents a CNN structure, which consists of three blocks. The first is the input, an image. The second is the feature extraction block, which consists of convolutional and pooling layers. The third is the classification block, which consists of fully connected layers and a softmax. The structure of the convolutional network changes as the number of convolution and pooling layers increases. The main difference between convolutional neural networks and ordinary neural networks is that CNNs explicitly assume that the inputs are images [21], which allows specific properties to be encoded in the architecture, gaining efficiency and reducing the number of parameters in the network. In this way, CNNs can model complex variations and behaviors, giving quite accurate predictions. This study considered eight CNN architectures: VGG-16, VGG-19, Inception V3, ResNet-50, ResNet-101, ResNet-152, AlexNet, and one CNN from scratch.

VGG-16 and VGG-19 Architectures
VGG is the abbreviation for the Visual Geometry Group [22]. The VGG model was developed by Simonyan and Zisserman [23]. VGG uses 3 × 3 convolutional layers stacked on top of each other in increasing depth. The reduction of volume size is handled by max pooling. Two fully connected layers, each with 4096 nodes, are then followed by a softmax classifier [23]. The numbers 16 and 19 refer to the number of weight layers in each network.
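One way to see why stacking small 3 × 3 layers is attractive is to compare the window of the input that a stack covers with the number of weights it needs. The following is a minimal sketch of that general arithmetic, not a computation from this study; the function names and the channel count of 64 are illustrative assumptions.

```python
def stacked_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Side length of the input window seen by stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel: int, channels: int) -> int:
    """Weights (biases ignored) of one kernel x kernel convolution with
    `channels` input and `channels` output feature maps."""
    return kernel * kernel * channels * channels


CHANNELS = 64  # illustrative channel count, not taken from the paper

# Three stacked 3 x 3 layers cover the same window as a single 7 x 7 layer.
field = stacked_receptive_field(3)

# The stack needs fewer weights than the single large kernel.
stacked_weights = 3 * conv_weights(3, CHANNELS)
single_weights = conv_weights(7, CHANNELS)

print(f"receptive field: {field} x {field}")
print(f"stacked 3x3 weights: {stacked_weights}, single 7x7 weights: {single_weights}")
```

Under these assumptions the stack covers a 7 × 7 window with roughly half the weights of one 7 × 7 layer, while also applying a nonlinearity after each layer.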

Inception V3 Architecture (GoogLeNet V3)
This architecture was introduced under the name GoogLeNet, but subsequent updates have been called Inception vN, where N refers to the version number released by Google [22]. The basic Inception module [24] consists of four parallel branches whose outputs are concatenated: a 1 × 1 convolution followed by two 3 × 3 convolutions; a 1 × 1 convolution followed by a 3 × 3 convolution; a pooling layer followed by a 1 × 1 convolution; and, finally, a single 1 × 1 convolution. Inception consists of 10 modules, although these modules change slightly as the network gets deeper. Five of the modules are modified to reduce the computational cost by replacing the n × n convolutions with two convolutions, a 1 × 7 followed by a 7 × 1. The last two modules replace the last two 3 × 3 convolutions of the first branch with two convolutions each, a 1 × 3 and a 3 × 1, this time in parallel. In total, Inception V3 has 42 layers with parameters.
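The cost reduction obtained by replacing an n × n convolution with a 1 × n convolution followed by an n × 1 convolution can be illustrated by counting kernel weights. This is a minimal sketch with hypothetical helper names; it counts weights per input/output channel pair and ignores biases and channel widths.

```python
def full_kernel_weights(n: int) -> int:
    """Weights in one n x n kernel."""
    return n * n


def factorized_kernel_weights(n: int) -> int:
    """Weights in a 1 x n kernel followed by an n x 1 kernel."""
    return n + n


for n in (3, 7):
    full = full_kernel_weights(n)
    factorized = factorized_kernel_weights(n)
    print(f"n={n}: full={full}, factorized={factorized}, ratio={factorized / full:.2f}")
```

The saving grows with n, which is consistent with the factorization being applied to the larger 7 × 7 case.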

ResNet-50, ResNet-101, and ResNet-152 Architectures (Residual Neural Network)
ResNet [25] does not have a fixed depth; its depth depends on the number of consecutive modules used. However, increasing the network's depth to obtain greater accuracy makes the network more difficult to optimize. ResNet addresses this problem by fitting a residual mapping in place of the original mapping and by adding connections between layers. These new connections skip several layers and perform an identity mapping or a 1 × 1 convolution. The base block of this network is called the residual block. When the network has 50 or more layers, the block is composed of three sequential convolutions (1 × 1, 3 × 3, and 1 × 1) and a connection that links the input of the first convolution to the output of the third convolution. This study used three models with this architecture, ResNet-50, ResNet-101, and ResNet-152, which are composed of 50, 101, and 152 layers, respectively.
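The idea of the skip connection can be sketched without any deep learning library: the block outputs F(x) + x, so the layers only have to learn the residual F. The sketch below uses plain Python lists in place of feature maps, and the two residual functions are hypothetical stand-ins for the convolutions, chosen only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return F(x) + x, where the skip connection is the identity."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("identity skip connection needs matching shapes")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x: Vector) -> Vector:
    """Stand-in for layers that have learned nothing yet."""
    return [0.0 for _ in x]


def small_correction(x: Vector) -> Vector:
    """Stand-in for layers that have learned a small adjustment."""
    return [0.1 * v for v in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))     # the block behaves as the identity
print(residual_block(x, small_correction))  # the input plus a learned correction
```

When the residual is zero, the block passes its input through unchanged, which is one intuition for why adding such blocks does not make a deep network harder to fit than a shallower one.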

AlexNet
This architecture consists of five convolutional layers and three fully connected layers. Some convolutional layers (the first, second, and fifth) are followed by max-pooling layers. The Rectified Linear Unit (ReLU) nonlinearity is applied to the output of every convolutional and fully connected layer. The fully connected layers have 4096 neurons each [26]. To avoid overfitting, a regularization method known as dropout is used, which consists of "turning off" neurons with a predetermined probability during training.
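A minimal sketch of dropout on a list of activations follows. It assumes the common "inverted" variant, in which the surviving activations are rescaled during training so that their expected value is unchanged; the text does not say which variant was used, and the function name and the drop probability of 0.5 are illustrative.

```python
import random
from typing import List


def dropout(activations: List[float], drop_prob: float, rng: random.Random) -> List[float]:
    """Zero each activation with probability `drop_prob` (training time only).

    Assumption: inverted dropout, so kept activations are divided by the
    keep probability.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() >= drop_prob else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
activations = [1.0] * 10
print(dropout(activations, drop_prob=0.5, rng=rng))
```

At inference time the function is simply not applied, so every neuron contributes.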

CNN from Scratch
To compare the performance of a network that learns from scratch against architectures that start from transfer learning, a convolutional network was trained from scratch. This CNN was composed of four alternating convolutional and max-pooling layers, with a dropout layer after every other convolutional and pooling pair. After the last pooling layer, we attached a fully connected layer with 256 neurons, another dropout layer, and, finally, a softmax classification layer for our classes. The loss function was the cross entropy, since it is useful with convolutional neural networks, particularly for image classification [27].
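The softmax output and the cross-entropy loss mentioned above can be written in a few lines. This is a minimal sketch for a single image and two classes; the class order (index 0 for ripe, index 1 for unripe) and the example scores are assumptions made only for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], true_class: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])


# Hypothetical scores for one image; index 0 = ripe, index 1 = unripe.
confident = softmax([4.0, 0.0])
uncertain = softmax([0.0, 0.0])

print(cross_entropy(confident, true_class=0))  # small loss
print(cross_entropy(uncertain, true_class=0))  # larger loss
```

The loss falls as the network assigns more probability to the correct class, which is what training minimizes.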

CNNs' Optimization Techniques and Hyperparameters Techniques
All the above networks were too deep to train them from scratch with our data set. Therefore, we used transfer learning, which consists of taking the features learned in other contexts and using them in a new and similar problem [28]. Transfer learning is usually done for tasks where the data set has too little data to train a full-scale model from scratch. This was our case since we only had 1002 Medjool date images.
Transfer learning is commonly used in two ways: (1) using a pretrained model, which consists of taking a pretrained model and replacing its last layers with new ones so that the output corresponds to the features of the new data set, and (2) fine-tuning the convolutional network, which is a strategy to tune the weights of the layers using backpropagation.
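The first strategy amounts to simple bookkeeping over a network's layers, which the framework-free sketch below illustrates. The `Layer` record, the layer names, and the unit counts are hypothetical; freezing the retained layers is an assumption about common practice rather than something stated here. In a real experiment, a deep learning framework would perform these steps on an actual pretrained network.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Layer:
    name: str
    units: int
    trainable: bool = True


def replace_classifier(pretrained: List[Layer], num_classes: int) -> List[Layer]:
    """Keep every layer except the last, freeze the kept layers, and
    attach a new trainable softmax layer sized for the new data set."""
    frozen = [replace(layer, trainable=False) for layer in pretrained[:-1]]
    new_head = Layer(name="softmax_new", units=num_classes, trainable=True)
    return frozen + [new_head]


# Hypothetical stand-in for a pretrained network and its 1000-class output.
pretrained = [
    Layer("conv_blocks", 512),
    Layer("fc1", 4096),
    Layer("fc2", 4096),
    Layer("softmax_pretrained", 1000),
]

model = replace_classifier(pretrained, num_classes=2)  # two classes: ripe, unripe
for layer in model:
    print(layer)
```

Only the new softmax layer is trained from scratch; the features learned on the original data set are reused as they are.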
For this study, we applied transfer learning in the first way, using pretrained models. We used networks pretrained on ImageNet, a large visual database designed for use in visual object recognition [26]. We removed the final classification layer, the softmax layer at the end that corresponds to the ImageNet classes, and replaced it with a new softmax layer for our image data set. A summary of the CNN architectures used is shown in Table 1.

Hyperparameters
Hyperparameters are variables that define the structure of a convolutional network and determine how it is trained [29]. They include the learning rate, epochs, optimizer, batch size, number of layers, and activation functions, among others, and can be adjusted to make a CNN more efficient. In this study, we changed the values of the optimizer, learning rate, batch size, and epochs. Our CNNs used the Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD) optimizers, since they are well-known optimizers that perform well in image classification with CNNs [30]. The learning rates for the optimizers were 0.01 and 0.001. The batch sizes were 64 and 128, the epochs were 25 and 400, and the number of layers depended on the CNN architecture used (Table 1).
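The hyperparameter values listed above define a small grid. The sketch below enumerates it, assuming that every combination was evaluated for every architecture; the variable and key names are illustrative, and the training step itself is left out.

```python
from itertools import product

ARCHITECTURES = [
    "VGG-16", "VGG-19", "Inception V3", "ResNet-50",
    "ResNet-101", "ResNet-152", "AlexNet", "CNN from scratch",
]
OPTIMIZERS = ["Adam", "SGD"]
LEARNING_RATES = [0.01, 0.001]
BATCH_SIZES = [64, 128]
EPOCHS = [25, 400]


def build_grid() -> list:
    """Return one configuration per combination of architecture and hyperparameters."""
    return [
        {
            "architecture": arch,
            "optimizer": opt,
            "learning_rate": lr,
            "batch_size": batch,
            "epochs": epochs,
        }
        for arch, opt, lr, batch, epochs in product(
            ARCHITECTURES, OPTIMIZERS, LEARNING_RATES, BATCH_SIZES, EPOCHS
        )
    ]


grid = build_grid()
print(len(grid))  # 8 architectures x 2 x 2 x 2 x 2 = 128 configurations
```

Each architecture is therefore paired with 16 hyperparameter combinations, eight per optimizer.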

Experimental Framework
To implement and evaluate the CNN architectures presented in Section 2.3, we used the Google Colab cloud service, which is based on Jupyter Notebooks and allows the free use of Google's GPUs or TPUs, with the libraries Scikit-learn, PyTorch, TensorFlow, Keras, and OpenCV [31].

Performance Evaluation
Accuracy is the metric used to evaluate the classification performance of the architectures proposed in this paper. This metric calculates the percentage of samples that are correctly classified, and it is given by the following equation:

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$

where tp represents true positives, those that belonged to the class and were correctly classified in that class; tn represents true negatives, those that did not belong to the class and were correctly classified in another class; fp represents false positives, those that did not belong to the class and were wrongly assigned to the class; and finally, fn represents false negatives, those that belonged to the class and were mistakenly classified in another class.
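The accuracy metric, the fraction of samples that are correctly classified, can be computed directly from the four counts as (tp + tn) / (tp + tn + fp + fn). The sketch below uses made-up counts that do not come from this study.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of samples classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


# Hypothetical counts for a ripe-versus-unripe test set of 100 images.
example = accuracy(tp=45, tn=40, fp=5, fn=10)
print(f"{example:.2%}")
```

Multiplying the result by 100 gives the percentage form used in the tables.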

Results
With Adam as the optimizer, Table 2 shows that, for the evaluation with 25 epochs, the highest accuracy was obtained by VGG-16 (96.63% and 95.27%) with a learning rate of 0.001 and by VGG-19 (93.92% and 97.30%) with a learning rate of 0.01. The lowest accuracy was obtained by AlexNet (64.19%) and ResNet-152 (64.17%) with a learning rate of 0.001 and by the CNN from scratch (46.62% and 53.38%) with a learning rate of 0.01. For 400 epochs, the highest accuracy was obtained by Inception V3 (98.65%) and VGG-19 (98.75%), both with a learning rate of 0.001, and by Inception V3 (98.65%) and VGG-19 (99.32%) with a learning rate of 0.01. Likewise, the lowest accuracy was obtained by ResNet-101 and ResNet-152 (both with 80.41%) and ResNet-101 (79.05%) with a learning rate of 0.001 and by AlexNet (67.57%) and the CNN from scratch (43.24%), both with a learning rate of 0.01. The two best results were obtained by VGG-19 (99.32% and 98.65%) with a batch size of 128, followed by Inception V3 (98.65% in both cases) with a batch size of 64, all for 400 epochs.
Regarding processing time in Table 2, the CNN from scratch had the lowest values. However, some of its values were higher than those reported by the ResNet-50, ResNet-101, ResNet-152, and AlexNet architectures. The highest processing times for 25 epochs were for ResNet-152 (25 min) and Inception V3 (13 min) with a learning rate of 0.001 and for ResNet-152 and AlexNet (16 min) and ResNet-152 (15 min) with a learning rate of 0.01. For 400 epochs, the highest processing times were for Inception V3 (131 min) and ResNet-152 (54 min), both with a learning rate of 0.001, and for ResNet-152 (65 and 60 min) with a learning rate of 0.01. ResNet-152 was the architecture that required the most processing time for most hyperparameter combinations. The highest processing times were not associated with high or low accuracy.
Table 3 shows that, with Stochastic Gradient Descent (SGD) as the optimizer and 25 epochs, the highest accuracy was obtained by VGG-19 (87.16%) and VGG-16 (87.16%) with a learning rate of 0.001 and by Inception V3 (92.56% and 91.89%) with a learning rate of 0.01. The lowest accuracy was obtained by AlexNet (52.70%) and the CNN from scratch (51.35%) with a learning rate of 0.001 and by ResNet-50 and ResNet-152 (both with 45.94%) and ResNet-50 (45.94%) with a learning rate of 0.01. For 400 epochs, the highest accuracy was obtained by Inception V3 (95.94%) and the CNN from scratch (94.59%), both with a learning rate of 0.001, and by VGG-19 (94.59%) and Inception V3 (95.27%) with a learning rate of 0.01. Likewise, the lowest accuracy was obtained by AlexNet (56.08% and 60.81%) with a learning rate of 0.001 and by ResNet-50 (50% and 52.03%) with a learning rate of 0.01. The two best architectures were the CNN from scratch (94.59%) and Inception V3 (95.27%) with a batch size of 128, followed by Inception V3 (95.94%) and VGG-19 (94.59%) with a batch size of 64.
Table 3 also shows that, for processing time, there was no defined pattern identifying the architecture with the lowest processing time across all hyperparameter combinations. Low values mostly appeared for the CNN from scratch. However, the lowest value was 8 min for the ResNet-101 model, with 25 epochs, a batch size of 64, and a learning rate of 0.01. Likewise, the accuracy of the CNN from scratch was better than that reported by the ResNet-50, ResNet-101, ResNet-152, and AlexNet architectures. The highest processing times for 25 epochs were for VGG-16 (14 and 23 min) with a learning rate of 0.001 and for ResNet-152 (14 min) and ResNet-101 (69 min) with a learning rate of 0.01. For 400 epochs, the highest processing times were for ResNet-152 (58 and 115 min with a learning rate of 0.001 and 58 and 54 min with a learning rate of 0.01).
Finally, the ResNet-152 architecture required the most processing time for most hyperparameters. The highest processing times were not associated with high or low accuracy.

Discussion
Convolutional Neural Networks (CNNs) are used in several agricultural areas, such as leaf and plant disease detection, land cover classification, crop type classification, plant recognition, segmentation of root and soil, crop yield estimation, fruit counting, obstacle detection in row crops and grass mowing, and identification of weeds, to mention a few [32,33]. For example, Mohanty et al. [34] trained the CNN architectures AlexNet and GoogLeNet with the PlantVillage image data set to detect 26 types of diseases in 14 kinds of crops. Their results showed an accuracy of 99.35% in identifying healthy and diseased plants. Meanwhile, Rahnemonfar and Sheppard [35] proposed using the Inception and Residual Network (ResNet) architectures to estimate the yield of a tomato plant using synthetic images. Their results indicated that the yield can be estimated with 91% accuracy.
Another example was presented in [36], where the authors proposed training several convolutional networks to identify four fruits (mango, orange, apple, and banana), each classified into two categories: fresh and rotten. The best performing models were the Inception version 3 and 16-layer Visual Geometry Group (VGG-16) architectures, which were trained with transfer learning. Their results showed an identification and classification accuracy of 90%. A similar study was presented in [13], which proposed the use of a VGG-16 network to classify vegetables and fruits. A total of 26 categories were classified: pumpkin, celery, cauliflower, pineapple, pomegranate, grapefruit, banana, cucumber, broccoli, onion, carrot, etc. The authors reported 95.6% accuracy in classifying these fruits and vegetables. Regarding dates, we identified research works that proposed using CNNs to sort dates or to distinguish among their maturity stages [8,9,16].
Currently, determining the stage of maturity of the Medjool date using traditional image processing and machine learning methods is complicated. This is because these methods are trained to extract features, such as appearance, color (associated with the maturity stages), shape, and texture, in various cultivars [7,16]. However, to our knowledge, no study presents a feature extraction or predictive model for sorting Medjool dates. Furthermore, recent models cannot be used to sort Medjool dates because this cultivar is harvested, sorted, packaged, and consumed in its Tamar stage.
To contribute a model that may be useful in sorting Medjool dates from images, we compared the performance of eight CNN architectures in this study. Additionally, the values of some hyperparameters were modified, and transfer learning was used, to identify and propose the CNN with the best accuracy.
As shown in Table 2, our findings indicated that, with the Adam optimizer, the VGG architectures show the best accuracy, with the VGG-19 model reaching the highest accuracy, 99.32%. Likewise, the ResNet and CNN-from-scratch architectures showed the lowest performance percentages; the CNN-from-scratch model achieved the lowest accuracy, 43.24%. The highest average accuracy among the eight architectures was 89.53%, using the combination of a batch size of 64, a learning rate of 0.01, and 400 epochs, with an average time of 48.71 min, while the lowest was 75%, with a batch size of 64, a learning rate of 0.001, and 25 epochs, with an average time of 12.25 min.
Likewise, Table 3 indicates that no single architecture showed the best accuracy when we used SGD as the optimizer. However, the ResNet-50 architecture showed the lowest performance percentages, with batch sizes of 64 and 128 and a learning rate of 0.001. The highest average accuracy among the eight architectures was 80.57%, using the combination of a batch size of 64, a learning rate of 0.001, and 400 epochs, with an average time of 38.13 min, while the lowest was 66.21%, with a batch size of 128, a learning rate of 0.01, and 25 epochs, with an average time of 21.63 min.
It was noticeable that, when the number of epochs was increased for all models, both the accuracy and the required processing time also increased. Likewise, we observed that the highest processing times corresponded to the ResNet-152 architecture, which could be associated with the fact that this architecture had the highest number of layers. However, none of its accuracies exceeded 85%.
The optimizer minimizes the error function so that the model fits the training set examples. In this study, the accuracy was higher for Adam than for SGD.
Several studies have focused on identifying the CNN that offers the best accuracy for selecting dates from cultivars in their various stages of maturity [8,9,18]. However, there are currently no reported studies that use a CNN to classify the date cultivar Medjool. One aspect to consider in this comparison is that the Medjool date is only consumed in its Tamar stage. Therefore, this study only used two categories (ripe and unripe) for its sorting. The number of images was lower compared to the rest of the studies. However, in our work, the accuracy was higher due to the application of transfer learning and the modification of various hyperparameters, which influence the architectures' performance [37,38].
In our study, with the hyperparameters set to 400 epochs, a batch size of 128, the Adam optimizer, and a learning rate of 0.01, the VGG-19 architecture had the best performance. This architecture could be included as part of the software that controls a robotic mechanism to support the date palm farmer in an automated system for sorting ripe fruits.

Conclusions
This study evaluated the accuracy and processing time of eight CNN architectures. Seven of them were pretrained on an extensive image database designed for object recognition (ImageNet). These models were VGG-16, VGG-19, Inception V3, ResNet-50, ResNet-101, ResNet-152, and AlexNet, which received transfer learning when their last classification layer was replaced. Additionally, a model that learns from scratch, that is, without transfer learning, was used.
All CNN architectures were evaluated by modifying the epochs, batch size, optimizer, and learning rate hyperparameters, since these hyperparameters have been reported to have positive effects on the performance of convolutional networks. The results indicated that the CNNs with the best performance for sorting Medjool dates were the architectures of the VGG group, using the Adam optimizer. Of these architectures, the VGG-19 model reported the best accuracy, 99.32%. Likewise, the ResNet group architectures reported the lowest performance using the same optimizer; the ResNet-152 model reported the lowest accuracy, 64.17%. The use of the SGD optimizer did not have a significant effect on obtaining high accuracies.
Finally, it will be necessary to continue working toward the best accuracy and the shortest processing time by modifying other hyperparameters. The evaluation should also include other fruit attributes, such as size, which gives the fruit a high commercial value and is essential in the packing process.