Smart GreenGrocer: Automatic Vegetable Type Classification Using the CNN Algorithm

In the food industry


INTRODUCTION
Vegetables have long been an essential global food resource. Each type of vegetable carries its own benefits for living beings. The nutrients and vitamins they contain support the growth and development of humans and animals, for instance by improving eyesight, maintaining digestive health, and reducing cancer risk [1].
Before vegetables are distributed, they need to be sorted by type and category. The variety of shapes, colors, and textures of each vegetable type can be distinguished by humans in a manual sorting process. However, sorting a large number of different vegetable types takes considerable time, human errors can arise at any point, and relying on human labor for sorting is not always effective [2].
Consequently, research has been done to find effective methods for automatically grouping each type of vegetable with the help of computers. One study developed vegetable classification using a Support Vector Machine (SVM) based on the color and shape of the leaves [3]. Another paper classifies vegetables with an SVM using color features and Saliency-HOG [4]. Other studies proposed automatic categorization with a Naive Bayes approach using Hue Saturation Value (HSV) features [5]. However, conventional machine learning methods have downsides, such as the need for manual feature engineering and high computational cost.
Several studies use deep learning, such as the CNN model, to solve problems. One paper classifies children's facial emotions using a CNN model, achieving an accuracy of 99.9% [6]. Another study developed a CNN model for mobile-based image recognition that reaches an average accuracy of 93.6% [7]. Other studies have proposed CNN models for vehicle classification [8]. We can conclude that the CNN model can solve many computer vision problems.
Thus, this paper proposes a classification model that can classify types of vegetables with a Convolutional Neural Network. We build an effective model to address multi-label vegetable classification. This study offers the following specific contributions:
1. We introduce a novel method for solving vegetable classification problems, using a CNN algorithm to train and develop the model.
2. We build a model that performs multi-label classification on several vegetable types. This classification model can handle vegetable categorization problems involving intraspecies variety and interspecies similarity in color and shape, and could replace conventional machine learning methods for distinguishing different types of vegetables.
3. We test the proposed model to obtain high accuracy in a short time, tuning the parameters so that the trained model reaches the best possible accuracy rate.

Main Idea
The primary goal of this paper is to build a robust model for vegetable classification from an image dataset containing images of 15 types of vegetables, using the CNN algorithm. CNN is well known for its high-accuracy predictions; thus, we use this supervised learning algorithm to create a multi-label classifier that maps vegetable image data to its predefined categories. The following steps are taken to build the model. First, we acquire the dataset and augment the images. Then, we fit the model on the training data and measure its performance with accuracy.

Dataset
In this study, we use the Vegetable Image Dataset [9] from Kaggle, containing 21,000 colored images of 15 types of vegetables, as shown in Figure 1. The images vary in the position and number of vegetables of the same species. An image can also contain non-vegetable objects, such as hands, as depicted in the Brinjal-labeled image in Figure 1. This variation means that the dataset used for model training represents real-world data. To generate the model, we separate the dataset into three parts: the training set to build the model, the validation set to evaluate the model during training, and the testing set to evaluate the model's performance. Table 1 depicts the dataset distribution. We use 71.4% of the dataset to train the model; the rest is split between validation (14.3%) and testing (14.3%). We set the fill mode to nearest, the batch size to 32, and the image target size to 150x150 pixels. Afterward, we encode the classes with indices to indicate the label of each type of vegetable image.
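In practice such augmentation is usually done by a library; as a minimal NumPy sketch of what the individual operations do (the 3x3 toy image and the shift amount are illustrative, not the paper's settings), rescaling, flipping, and nearest-mode filling can be written as:

```python
import numpy as np

def rescale(img):
    # Map 0-255 pixel intensities to the 0-1 range.
    return img.astype(np.float32) / 255.0

def horizontal_flip(img):
    # Mirror the image along its width axis.
    return img[:, ::-1]

def shift_right(img, pixels):
    # Shift the image to the right, filling the exposed columns with
    # the nearest (left-edge) pixel values -- the "nearest" fill mode.
    shifted = np.empty_like(img)
    shifted[:, pixels:] = img[:, :-pixels]
    shifted[:, :pixels] = img[:, :1]
    return shifted

img = np.arange(9, dtype=np.uint8).reshape(3, 3) * 10  # toy 3x3 "image"
flipped = horizontal_flip(img)  # first row [0, 10, 20] becomes [20, 10, 0]
shifted = shift_right(img, 1)   # first row [0, 10, 20] becomes [0, 0, 10]
scaled = rescale(img)           # values now lie in [0, 1]
```

Rotation, shearing, and zooming follow the same idea but interpolate pixel coordinates instead of slicing them.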

CNN
Deep learning is an artificial intelligence approach that imitates how brain cells work to solve problems. One deep learning method well suited to 2-dimensional data is the Convolutional Neural Network (CNN). The ability of this algorithm to detect and learn image features makes it advantageous for image classification tasks. Like other neural network-based algorithms, CNN consists of many neurons with weights, biases, and activation functions.

Feature learning
This stage encodes an image into features that represent it. It consists of several layers working together to extract the characteristics of the image: the convolution layer, the activation function, and the pooling layer. The first is the convolution layer. Here, a filtering process based on the sliding-window concept is applied using a filter with a predefined size and values. The filter slides over the image, and at each position the dot product between the filter and the underlying region of the input matrix is computed. The resulting matrix is then passed to the next layer for further computation. During training, the filter values are updated through back-propagation along with the model's other parameters. Next is the activation function. ReLU is one of the most well-known activation functions: it maps all non-positive numbers to zero and is linear elsewhere. The pooling layer is vital for reducing the dimensions of the resulting convolution matrices. It enables faster computation because fewer parameters need to be updated, which also lowers the risk of overfitting. There are two main types of pooling layers: max pooling and average pooling. Max pooling takes the maximum value within each filter window, while average pooling computes the average of all entries in the window [10].
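The three operations above can be sketched in plain NumPy (the 5x5 image and 2x2 kernel are toy values for illustration, not the model's actual filters):

```python
import numpy as np

def conv2d(image, kernel):
    # Valid (no padding) convolution: slide the kernel over the image
    # and take the dot product at every position.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    # Non-positive values map to zero; positive values pass through.
    return np.maximum(x, 0)

def max_pool(x, size=2):
    # Keep the maximum of each non-overlapping size x size block.
    oh, ow = x.shape[0] // size, x.shape[1] // size
    return x[:oh * size, :ow * size].reshape(oh, size, ow, size).max(axis=(1, 3))

image = np.array([[1., 2., 3., 0., 1.],
                  [4., 5., 6., 1., 2.],
                  [7., 8., 9., 2., 3.],
                  [1., 0., 1., 3., 4.],
                  [2., 1., 0., 1., 5.]])
kernel = np.array([[1., 0.],
                   [0., -1.]])
fmap = max_pool(relu(conv2d(image, kernel)))  # 5x5 -> 4x4 -> 2x2
```

Each stage shrinks the spatial size: the 5x5 input becomes a 4x4 convolution output and a 2x2 pooled feature map.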

Classification
The features extracted during feature learning are classified by several types of layers: the input layer, hidden layers (here with dropout regularization), and the output layer. The input layer is a one-dimensional array, flattened from the previously computed feature map matrices. Each neuron of this layer is connected to the neurons in the hidden layers. The hidden layers form a multi-layer perceptron (MLP) that transforms the data so it can be linearly classified, trained via forward and backward propagation. Dropout is a useful regularization technique that deactivates neurons in the hidden layers to limit the contribution of individual neurons to the output. As a result, besides reducing the risk of overfitting, a common problem in CNNs, less computation is needed to complete the process. The output layer is the final layer, where each input passes through an activation function, usually softmax, to produce probabilistic outputs ranging from 0 to 1 [11].
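A minimal NumPy sketch of this classification stage — flatten, a ReLU hidden layer, dropout, and a softmax output — follows; all dimensions, the 0.5 dropout rate, and the random weights are illustrative, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    # Fully connected layer: weighted sum of inputs plus bias.
    return x @ w + b

def dropout(x, rate, training=True):
    # Randomly zero a fraction `rate` of activations during training,
    # scaling the survivors so the expected sum is unchanged.
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def softmax(z):
    # Map raw scores to probabilities that sum to 1.
    e = np.exp(z - z.max())
    return e / e.sum()

feature_maps = np.arange(8.0).reshape(2, 2, 2)  # toy pooled feature maps
x = feature_maps.flatten()                      # input layer: 1-D array of 8
h = np.maximum(dense(x, rng.normal(size=(8, 4)), np.zeros(4)), 0)  # hidden, ReLU
h = dropout(h, rate=0.5)
probs = softmax(dense(h, rng.normal(size=(4, 3)), np.zeros(3)))    # 3 classes
```

At inference time dropout is disabled (`training=False`), and the softmax output gives one probability per class.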

Modeling
Our proposed model uses CNN as the backbone of the predictive model. The architecture is shown in Figure 2. It consists of twelve layers: 2-dimensional convolution layers at the beginning, each followed by a max pooling layer. The matrices from the last max pooling layer are then flattened and pass through a dropout layer before being fed into the dense layers. In total, the model contains 4 convolutional layers (Conv2D), 4 pooling layers (MaxPooling2D), 1 flatten layer, 1 dropout layer, and 2 dense layers, as shown in Figure 3. The first dense layer is used as the hidden layer. We feed the model with augmented data split into training and validation portions. The training data is used to develop the model's ability to classify different varieties of vegetables from the image data, while the validation portion is used to evaluate it during training. The resulting model is then tested, and its performance score is obtained on the testing data.
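Assuming 3x3 kernels without padding and 2x2 pooling (the paper fixes only the 150x150 input size; the kernel and pool sizes here are our assumptions), the feature-map side length through the four conv/pool stages can be traced as:

```python
# Trace the spatial size of the feature maps through the
# 4 x (Conv2D + MaxPooling2D) stack of the architecture.
size = 150                # 150x150 input images
for _ in range(4):
    size = size - 2       # a 3x3 valid convolution shrinks each side by 2
    size = size // 2      # 2x2 max pooling halves each side (floor)
# size is now the side length fed to the flatten layer
```

Under these assumptions the maps shrink 150 → 74 → 36 → 17 → 7, so the flatten layer receives 7x7 maps.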

RESULT AND DISCUSSION
In this research, the computing device used is a GTX 980 Ti GPU. The number of epochs is set to 50 throughout the training and testing phase. To optimize the computation, we use early stopping and reduce-learning-rate-on-plateau callbacks with the parameters stated in Table 2. We also vary the optimizer type and batch size to obtain the best possible result in terms of computation time and model accuracy. Table 3 details the results obtained with different parameter settings, and Figures 4 and 5 visualize the metrics for each configuration.
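The two callbacks can be sketched as a single training-loop check (a minimal illustration; the `patience` and `factor` values below are placeholders, not the parameters from Table 2, and real implementations track the two criteria separately):

```python
def train_with_callbacks(val_losses, patience=3, lr=1e-3, factor=0.5):
    # Reduce the learning rate when validation loss stops improving,
    # and stop training after `patience` epochs without improvement.
    best = float("inf")
    wait = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best, wait = val_loss, 0
        else:
            wait += 1
            lr *= factor          # reduce learning rate on plateau
            if wait >= patience:  # early stopping criterion met
                return epoch + 1, lr
    return len(val_losses), lr

# Toy validation-loss curve: improves twice, then plateaus.
epochs_run, final_lr = train_with_callbacks([0.9, 0.5, 0.6, 0.55, 0.58])
```

With this toy curve, training halts at epoch 5 after three non-improving epochs, with the learning rate halved at each of them.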

CONCLUSION
We conclude that the optimal configuration in terms of training time is CNN with the RMSProp optimizer and a batch size of 64. The highest accuracy is obtained with the RMSProp optimizer and a batch size of 32. However, a better balance of high accuracy and short training time is obtained with the Adamax optimizer and a batch size of 64, which produces an accuracy of 98.10% in 25 minutes and 45 seconds of training.
In future work, this model could be improved with other architectures, such as the Generative Adversarial Network (GAN) and the Graph Convolutional Network (GCN). To achieve higher accuracy, dynamic neural networks with additional features could also be developed.

Figure 1. Some samples of the dataset that represent the 15 classes

Data pre-processing
In this stage, we augment the images in the training and validation sets through rescaling, rotation, width and height shifting, shearing, zooming, and flipping. During the augmentation process, the dataset is split into training and validation parts.

Figure 4. Test Accuracy of Models.

Figure 5. Training Time of Models.

Figure 6. Training and validation accuracy and loss for model 3.

Figure 7. Training and validation accuracy and loss for model 7.
Figure 8. Training and validation accuracy and loss for model 8.

Figures 6, 7, and 8 show the training and validation accuracy and loss for the parameter configurations of models 3, 7, and 8, respectively. The training of model 3 shows extreme oscillation of the validation loss and accuracy as the epochs increase, while models 7 and 8 give more stable, smoother curves. This means that models 7 and 8, which use Adamax as their optimizer, are more stable during the training process.