Machine learning approach for classification of mangifera indica leaves using digital image analysis

ABSTRACT There is a wide range of horticulture farming in Asia. Mangifera indica, a species of flowering plant commonly known as mango, has significant local demand as well as a broad export market throughout the world, and is regarded as the ‘King of Fruits.’ There are many mango varieties, and each has its own market. Efficient identification of mango varieties remains difficult because of untrained growers and outdated farming practices, especially in remote areas of Asia. The primary purpose of this research study was to discriminate mango varieties by analyzing their leaves with machine learning techniques. For this purpose, we selected leaves of eight mango varieties, namely: Anwar-Ratul (AR), Chaunsa (CHAUN), Langra (LANG), Sindhri (SIND), Saroli (SARO), Fajri (FAJ), Desi (DESI), and Alo-Marghan (ALM). The images were captured with a cell phone camera in the open air, without a well-equipped laboratory or infrastructure. Binary, histogram, RST, spectral, and texture features were employed for machine learning (ML)-based mango leaf image discrimination. A k-fold (k = 10) cross-validation method was used for ML classification. The k-nearest neighbors (KNN) classifier achieved the highest overall classification accuracy (OCA), ranging from 88.33% to 97% across the eight varieties.


Introduction
Agriculture and farming are as old as human civilization, and the world could not sustain itself without them. Humanity survives on cultivation and livestock, which provide almost all of the world's food. Agriculture is one of the fundamental pillars of economic growth and is considered the backbone of any country's economy. Agriculture is the largest sector of Pakistan's economy. [1,2] Pakistan has substantially fertile land that grows many different crops, seeds, plants, and fruits. Researchers in Pakistan also play a significant role in the agriculture sector and are working hard to strengthen the country's economy through it. Pakistan's agricultural calendar is based on four seasons: winter, autumn, spring, and summer. [3] Across these seasons, Pakistan cultivates and exports many crops, including cotton, rice, sugarcane, wheat, vegetables, and fruits.
Fruits are a major source of minerals, vitamins, and antioxidants, and they are produced according to their farming seasons. Mango is one of them and is also known as the 'King of Fruits,' because it is a delicious fruit with many potential benefits. [4] Mango is not only the most loved fruit in Pakistan in the summer season but is also adored throughout the world for its taste, attractive scent, and range of vitamins and minerals. [5] Mango has many health benefits, being a rich source of magnesium and potassium, and it is used in different products such as juices, bakery items, jams, and medicine. [6] Mango is widely farmed in Asia, and the yearly production of mango by country is: India (18,779 tons), China (4664 tons), Thailand (34243 tons), Mexico (2197 tons), Indonesia (2184 tons), Pakistan (1606 tons), Brazil (1417 tons), Egypt (1277 tons), Bangladesh (1162 tons), and Nigeria (918 tons). [3] In Pakistan, mango is also the top-ranked summer fruit for exporters and growers in the international market. Pakistan exported 125,000 tons of mangoes worldwide and earned revenue of almost $72 million in the fiscal year 2020. According to a 2016 report of the Planning Commission of Pakistan, Pakistan is among the top ten mango-producing countries and is the sixth-largest exporter of mango. [3] Different mango varieties are produced throughout the country. Provincial mango production and share of yield in Pakistan are as follows: 399.2 tons in Sindh (36%), 1313.6 tons in Punjab (62%), 3.0 tons in KPK (1%), and 1.1 tons in Balochistan (1%). Pakistan's economic and production growth of mangoes (in tons) improved from 2010 to 2020. [3] The districts of Multan, Bahawalpur, and Rahim Yar Khan are significant areas that produce high-quality mangoes. [7] Mango is a very delicate fruit with a wide range of varieties.
Every variety needs proper care for good-quality production. Gardeners require timely and useful information for damage-free farming and quality mango production, and the most essential information is the exact variety of each mango plant. Traditionally, experts identify and discriminate plants through visual observation and domain knowledge. [8] Recently, identifying plant variants and categories has been one of the most significant research issues in the agricultural domain. [9] Accurate, timely, and efficient classification of leaf species such as mango, citrus, guava, apple, and dates is of great interest for many reasons. Some of the dominant uses include quality-standard farming, prompt disease identification, efficient storage, crop yield, and efficient food processing. [10] The leaves of any plant differ in complex properties such as venation pattern, shape, midrib, apex, lamina, and margin. A minor change in these properties is easily misinterpreted by the human eye, leading to wrong discrimination of the plant. Human pattern recognition based on shape, domain knowledge, size, pattern, and color is inefficient, tiring, and error-prone. [11] For this reason, automated leaf-based plant classification has become a major research issue in the agriculture domain, which leads to the quality standard. [3] Likewise, leaf-based mango variety classification by human visual pattern recognition is unreliable and inconsistent, and it becomes even more complicated in the field when thousands of mango plants bear no fruit in the off-season. [12] Therefore, an automated machine-based solution is needed to discriminate mango plants of different varieties based on their leaf characteristics.
In the recent past, researchers have increasingly turned to machine learning approaches for problems that require fast machine vision to replace inefficient human vision and judgment in agricultural pattern recognition. In one study, cotton and sugarcane varieties were discriminated based on spectral features. [13] Mango maturity levels were identified from color variations, since different maturity levels serve as raw material for different industrial sectors. [14] In another study, automated harvesting robots for fruit identification in real environments and a quality measurement system were evaluated. [15] Machine learning and image analysis techniques have been investigated in various domains, such as the medical, natural, and agricultural sectors. [16] In another study, healthy and defective seeds were examined with an overall accuracy of 90%. [17] The gincu mango variety was identified using the C language, computer vision, and ANN techniques with 94% accuracy. [18] Another study identified mango fruit structure automatically by applying texture indices and pattern algorithms, and achieved 92% overall accuracy. [19] K-means clustering was applied to mango grading based on mango size and ripeness, and obtained 88.88% accuracy. [20] Barley and wheat seeds were identified using morphological, statistical, and color features with an accuracy of 99%. [21] The economic worth of mango in Pakistan was analyzed by surveying stakeholders in the Sindh and Punjab provinces. [22] A machine learning model was proposed for identifying mango varieties based on shape and texture features, with 70% overall accuracy.
[23] Biodiversity systems for environmental protection and a clean climate were discussed at four levels: species, space land, ecosystem, and genetic. [24] In another study, a machine learning mango grading technique was proposed with 95% overall accuracy. [25] An experiment was conducted to identify strawberry varieties based on speeded-up robust features (SURF) and ANN, with 94.34% overall accuracy. [26] A train-test deep learning model was applied to the segmentation and classification of apple image datasets and achieved 97% overall accuracy. [27] In another experiment, mango maturity levels were discriminated by applying an SVM classifier, with 95% overall accuracy. [28] Agro-ecological problems such as floods, strong winds, poor quality, and grain shortages were investigated in West Africa, particularly in Burkina Faso, and different social, structural, and technical approaches to address them were discussed. [29] Crop management methods to enhance the rice index through good water management and biogeochemical processes were also investigated. [30] In another investigation, rice agricultural land in China was assessed using an optical sensing Landsat device, with 75% overall accuracy. [31] A machine learning model for mango quality grading was proposed using four machine learning classifiers, and achieved classification accuracies between 87.9% and 98.1%. [32] In another experiment, mango varieties were discriminated by employing 12 simple sequence repeat (SSR) methodologies, with accuracies ranging from 75% to 100%. [33] In another study, the authors employed a weed kernel ANN model to classify mango images based on geometrical, color, and texture features, and achieved accuracies between 90% and 100%.
[34] CNN and DNN models have also been employed in agricultural computer vision. In one study, mango quality was measured by applying a fully convolutional network (FrCNet) model based on segmentation and feature extraction, with 98.9% accuracy. [35] In another study, the authors examined color images of mangoes using region-based CNN (R-CNN) features and achieved 80% overall accuracy. [36] In another study, the authors applied the VGG-16 CNN model to discriminate four date classes (khalal, tamar, rutab, and faulty) using a dataset collected with a smartphone, and achieved 96.98% accuracy. [37] Another group designed a mango grading system based on a CNN and the ultra-lightweight SqueezeNet algorithm, and achieved 97.37% overall accuracy. [38] CNNs also have some limitations. A CNN identifies the class of an object but not where it is located. It is possible to regress bounding boxes directly from a CNN, but only one at a time. If multiple objects are in the visual field, CNN bounding-box regression does not work well because of interference. [39] This research study was organized to develop an effective automated system to discriminate mango varieties based on their leaf features. The system addresses the problems identified above in mango variety identification. The designed system is based on state-of-the-art machine learning techniques and discriminates eight dominant mango varieties using their leaf properties. Pakistan produces many varieties of mango, but the eight selected varieties are the most popular and commonly used. [40][41][42][43] The eight selected mango varieties are Anwar-Ratul (AR), Chaunsa (CHAUN), Langra (LANG), Sindhri (SIND), Saroli (SARO), Fajri (FAJ), Desi (DESI), and Alo-Marghan (ALM).
Thus, our primary objective is to design a mango leaf-based classification system, which comprises setting up the mango leaf dataset acquisition process, ROI creation, mango leaf feature extraction and optimization, and classification.

Proposed methodology
The proposed methodology comprised image acquisition, pre-processing, segmentation, feature extraction, feature optimization, classification, and evaluation. Image acquisition involved capturing leaf images with a cell phone camera fixed at a constant height. Pre-processing involved enhancing the captured images. The segmentation phase isolated the leaf area and removed the background surface and damaged parts. During feature extraction, we extracted leaf properties of different categories for texture analysis. The feature optimization phase retained the most relevant properties for texture analysis and removed all irrelevant features, yielding an optimized feature dataset. The classification step deployed LMT and KNN to discriminate the leaf varieties. Finally, we evaluated the performance of the classifiers. The next sections discuss these steps in detail.

Image acquisition
For the research experiment, we collected mango leaves of all the mentioned varieties, namely AR, CHAUN, LANG, SIND, SARO, FAJ, DESI, and ALM, from a garden located in Tehsil Ahmad Pur East of District Bahawalpur, Punjab, Pakistan. All the images were captured with the 64-megapixel camera of a Samsung A-12 cell phone, with the leaves placed on a clean white surface. The collected dataset comprised 100 images of each mango variety. During experimentation, we found that the leaf images were clearest when captured from a height of 1.5 feet above the leaves: at this height the images were neither too close nor out of focus. We used a stand to fix the cell phone at 1.5 feet above the leaves to standardize the whole image acquisition process. The image acquisition setup is shown in Figure 1. The images were captured on a sunny day around noon (11:30 am to 3:00 pm). This research is distinctive because the empirical process was carried out without a well-furnished environment or a fully equipped research lab. [44] Sample images are shown in Figure 2.

Image pre-processing
During the pre-processing phase, each leaf image was resized using image conversion and resizing software. All 800 (100 x 8) equal-sized non-ROI images were converted from color to 8-bit gray-scale format, as shown in Figure 3. To remove noise from these standardized non-ROI images, we applied a Laplacian filter. [45]
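The two pre-processing operations named above can be illustrated with a minimal pure-Python sketch. The luma weights (ITU-R BT.601) and the 4-neighbour Laplacian kernel are assumptions made for illustration; the text does not state which gray-scale formula or kernel was used, and the function names `to_gray` and `laplacian` are hypothetical.

```python
def to_gray(rgb):
    """Convert an RGB image (rows of (r, g, b) tuples, 0-255) to 8-bit gray."""
    # Assumed weights: ITU-R BT.601 luma; the paper does not name a formula.
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb]


def laplacian(gray):
    """Apply the 4-neighbour Laplacian kernel; border pixels are left at 0."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Sum of the four neighbours minus four times the centre pixel.
            out[y][x] = (gray[y - 1][x] + gray[y + 1][x]
                         + gray[y][x - 1] + gray[y][x + 1]
                         - 4 * gray[y][x])
    return out


# A single bright pixel on a flat background gives a strong negative response.
patch = [[10, 10, 10],
         [10, 50, 10],
         [10, 10, 10]]
response = laplacian(patch)
```

The Laplacian response is signed and emphasizes intensity changes; how the filtered output was combined with the original image is not described in the text, so that step is left out of the sketch.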

Segmentation
Since the leaves were photographed on a surface, we applied a segmentation procedure to remove the surface and defective leaf parts. For this purpose, we applied the range oriented by pixel resolution (RO-PR) algorithm to the leaf images. The RO-PR algorithm describes the clustering region of the leaf pixels. The total value of pixel (TVP) is examined against a threshold value determined for a unique leaf image cluster. This threshold level is used as the pixel value of the leaf (PVL) image, or the initial value of the cluster. The resolution of all neighboring pixels is compared, and the total value of image resolution is examined to create the cluster. If the TVP of the data under consideration is assessed to be the same as the PVL, the pixel is assigned to the same pixel region (PR) cluster, and the cluster is enlarged by evaluating its ROIs. Every subsequent adjacent cluster is determined in the same manner. The complete algorithmic steps of the RO-PR scheme are shown in Figure 5.
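The description above resembles seeded region growing, in which neighboring pixels join a cluster when their value is close to a reference value. The following is a generic region-growing sketch under that assumption; it is not the authors' RO-PR algorithm, and the names `grow_region`, `seed`, and `tol` are hypothetical.

```python
from collections import deque


def grow_region(gray, seed, tol):
    """Return the set of pixels 4-connected to `seed` whose value is
    within `tol` of the seed pixel's value."""
    h, w = len(gray), len(gray[0])
    ref = gray[seed[0]][seed[1]]          # reference value for the cluster
    region = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w
                    and (ny, nx) not in region
                    and abs(gray[ny][nx] - ref) <= tol):
                region.add((ny, nx))
                queue.append((ny, nx))
    return region


# Dark "leaf" pixels (about 60) on a bright white surface (200).
image = [[200, 200, 200, 200],
         [200,  60,  65, 200],
         [200,  62,  58, 200],
         [200, 200, 200, 200]]
leaf = grow_region(image, seed=(1, 1), tol=10)
```

Here the bright background is excluded because its values differ from the seed by more than the tolerance, which mirrors the stated goal of removing the surface behind the leaf.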

Feature extraction
After segmenting the mango leaves, the next step was to create ROIs and extract the image properties of these leaves for texture analysis. Different automated, semi-automated, and manual methodologies exist for creating ROIs. Automatic schemes depend highly on image enhancement, [46] whereas semi-automated and manual methods are based on expert judgment. [47] To establish the mango leaf dataset, we created three non-overlapping ROIs of size 512 × 512 on each image of each category, resulting in 300 (100 x 3) ROI images per variety. Sample ROIs are shown in Figure 4. In this way, we obtained a dataset of 2400 (300 x 8) ROI images for the eight mango varieties. We employed CVIP and WEKA 3.8.1, computer vision and image processing software tools. [48][49][50] Feature acquisition is the most vital process in machine learning-based dataset classification, as it provides the necessary information for texture analysis. [44] For the study, 28 binary, five histogram, seven RST, 10 texture, and seven spectral features were selected for texture analysis. [12,40] In this way, a total of 57 features were extracted from each ROI. The total features vector space (FVS) comprised 136,800 (2400 × 57) feature values, where 2400 = 8 × 100 × 3 is the number of ROIs. Next, we present an overview of some of these features.
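The dataset arithmetic in this section can be checked with a short sketch. The counts come from the text; the helper names `crop_roi` and `rois_overlap`, and the convention of describing an ROI by its top-left corner, are assumptions made for illustration.

```python
ROI_SIZE = 512
VARIETIES = ["AR", "CHAUN", "LANG", "SIND", "SARO", "FAJ", "DESI", "ALM"]
IMAGES_PER_VARIETY = 100
ROIS_PER_IMAGE = 3
FEATURES_PER_ROI = {"binary": 28, "histogram": 5, "rst": 7,
                    "texture": 10, "spectral": 7}


def crop_roi(image, top, left, size=ROI_SIZE):
    """Cut a size x size ROI out of a 2-D image stored as a list of rows."""
    return [row[left:left + size] for row in image[top:top + size]]


def rois_overlap(a, b, size=ROI_SIZE):
    """True if two square ROIs, given by (top, left) corners, share any pixel."""
    (ta, la), (tb, lb) = a, b
    return abs(ta - tb) < size and abs(la - lb) < size


total_rois = len(VARIETIES) * IMAGES_PER_VARIETY * ROIS_PER_IMAGE
features_per_roi = sum(FEATURES_PER_ROI.values())
total_values = total_rois * features_per_roi
print(total_rois, features_per_roi, total_values)
```

The three totals reproduce the figures quoted in the text: 2400 ROIs, 57 features per ROI, and 136,800 values in the feature vector space.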

Histogram feature
A histogram summarizes a series of discrete or continuous numerical data in a format that is easy to interpret. It groups the data into ranges, known as bins, whose heights show how often values fall in each range. For an image, it is a frequency graph of the intensity values of the underlying pixels. We need it to interpret the intensity distributions of the individual pixels of the mango leaves. [12]
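As an illustration of how histogram features can be derived from pixel intensities, the sketch below computes five common first-order statistics from a normalized gray-level histogram. The text reports five histogram features but does not list them, so the choice of mean, standard deviation, skewness, energy, and entropy is an assumption, as is the function name `histogram_features`.

```python
import math


def histogram_features(pixels, levels=256):
    """First-order statistics of a flat list of integer gray levels."""
    n = len(pixels)
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    prob = [c / n for c in counts]        # normalized histogram P(g)

    mean = sum(g * p for g, p in enumerate(prob))
    var = sum((g - mean) ** 2 * p for g, p in enumerate(prob))
    std = math.sqrt(var)
    third = sum((g - mean) ** 3 * p for g, p in enumerate(prob))
    skew = third / std ** 3 if std else 0.0
    energy = sum(p * p for p in prob)
    entropy = -sum(p * math.log2(p) for p in prob if p > 0)
    return {"mean": mean, "std": std, "skew": skew,
            "energy": energy, "entropy": entropy}


# Half black, half white: symmetric, two equally likely levels.
features = histogram_features([0, 0, 255, 255])
```

For the two-level example, the distribution is symmetric (zero skewness) and carries one bit of entropy, which makes the output easy to verify by hand.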

Texture feature
Texture is a significant property for identifying objects or regions of interest in an image. In images, texture represents a specific pattern that is usually repeated across the image. Texture features capture any change in the underlying texture. As the leaf variety changes, we need to capture the similarities and differences in the underlying patterns of the leaves. [12]

Binary feature
Binary features are widely used in texture analysis, object detection, face detection, and many other areas. They are fast and powerful features that work at the local level to analyze patterns. To examine the leaves at the local level, we also applied binary features, including area, centroid, orientation, Euler number, thinness, aspect ratio, and projection. [12,40]
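Three of the binary features named above (area, centroid, and aspect ratio) can be sketched directly from a binary mask. This is a minimal illustration rather than the CVIP implementation; the function name `binary_features` and the bounding-box definition of aspect ratio (width divided by height) are assumptions.

```python
def binary_features(mask):
    """Area, centroid (row, col), and bounding-box aspect ratio of the
    nonzero pixels in a binary mask given as a list of rows."""
    points = [(y, x) for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not points:
        raise ValueError("mask contains no object pixels")
    area = len(points)
    centroid = (sum(y for y, _ in points) / area,
                sum(x for _, x in points) / area)
    ys = [y for y, _ in points]
    xs = [x for _, x in points]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {"area": area, "centroid": centroid,
            "aspect_ratio": width / height}


# A 2 x 3 rectangle of object pixels inside a 4 x 5 mask.
mask = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
shape = binary_features(mask)
```

For the rectangle, the area is 6 pixels, the centroid lies at row 1.5 and column 2.0, and the aspect ratio is 1.5.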

Rotation, scaling, and translation (RST) feature
The RST properties are applied to analyze and identify objects based on their scale and angular information. These geometrical interpretations are used to characterize the mango leaves. [12,40]

Spectral feature
Spectral features are based on time, amplitude, and frequency. The Fourier transform is used to convert the image into the frequency domain. Density variations in the incoming signal are measured with high sensitivity, and subtle variations in intensity are determined using the properties of wavelength. In this way, even closely similar objects can be distinguished. In the case of mango leaves, we need to discriminate minor variations between the leaves of different mango varieties. [12]

Feature optimization
Feature optimization is very significant in machine learning and dataset classification. During this phase, the most valuable features for texture analysis are retained and the irrelevant features are eliminated. [51] In the previous phase, we extracted 57 features from each ROI image and obtained a multi-feature FVS comprising 136,800 feature values. Such a large FVS was not suitable for efficient texture analysis, so it was necessary to reduce its size. For this purpose, we applied a supervised correlation-based feature selection (CFS) technique to the FVS dataset and obtained the most significant multi-feature FVS. The equation of CFS is given below [10]

$$S_T = \frac{N\,\bar{r}_{cf}}{\sqrt{N + N(N-1)\,\bar{r}_{ff}}}$$

where $S_T$ is the merit of a subset of $N$ features, $\bar{r}_{cf}$ is the mean feature-class correlation, and $\bar{r}_{ff}$ is the mean feature-feature correlation. The 19 optimized features retained for each ROI are listed in Table 1. Hence, the volume of the FVS was reduced from 136,800 to 45,600 (2400 x 19). This reduced multi-feature dataset was the input to the ML classifiers.
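To make the selection criterion concrete, the sketch below evaluates the standard CFS merit of a candidate feature subset: the number of features times the mean feature-class correlation, divided by the square root of N + N(N-1) times the mean feature-feature correlation. This is the commonly cited formulation of CFS, assumed here; the function name `cfs_merit` and the example correlation values are hypothetical and are not taken from the study's data.

```python
import math


def cfs_merit(feature_class_corrs, feature_feature_corrs):
    """CFS merit of a feature subset.

    feature_class_corrs: one correlation with the class per feature.
    feature_feature_corrs: correlations for every pair of features
    in the subset (empty when the subset has a single feature).
    """
    n = len(feature_class_corrs)
    r_cf = sum(abs(r) for r in feature_class_corrs) / n
    r_ff = (sum(abs(r) for r in feature_feature_corrs)
            / len(feature_feature_corrs)) if feature_feature_corrs else 0.0
    return n * r_cf / math.sqrt(n + n * (n - 1) * r_ff)


# Two equally predictive features: uncorrelated versus fully redundant.
independent = cfs_merit([0.6, 0.6], [0.0])
redundant = cfs_merit([0.6, 0.6], [1.0])
```

The uncorrelated pair scores higher than the redundant pair, which shows why the criterion favors features that predict the class well while adding information not already carried by the other selected features.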

Classification
Classification is the final phase in machine learning-based texture analysis. During this step, the extracted feature dataset is given as input to an algorithm that discriminates the different image classes. [52] In our methodology, the next step was to classify the optimized multi-feature mango leaf dataset into the corresponding mango varieties. For this purpose, we deployed two machine learning classifiers on the optimized multi-feature FVS. We tried different classifiers, but obtained promising results only with LMT and KNN. LMT (logistic model tree) is a supervised train-test classification model that fuses regression and a decision tree. [53] KNN gave better accuracy than LMT. KNN classification is traditional, nonparametric, and simple to understand: it assigns a sample to a class by measuring its distance to the data points and considering the k nearest ones. For problems where the data points are well defined or show little non-linearity, KNN is a suitable choice. Furthermore, it is easy to implement and has no complex parameters to tune. [54,55] The complete framework of the proposed methodology is shown in Figure 6.
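A minimal pure-Python sketch of KNN evaluated with k-fold cross-validation follows. It is not the WEKA configuration used in the study: the Euclidean distance, the value k = 3, the shuffling seed, and the function names `knn_predict` and `cross_validate` are assumptions for illustration, and the toy data stands in for the real feature vectors.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label `x` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(X, y, folds=10, k=3, seed=0):
    """Fraction of samples classified correctly when each fold is held out once."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test_idx = idx[f::folds]          # every `folds`-th shuffled sample
        held_out = set(test_idx)
        train_X = [X[i] for i in idx if i not in held_out]
        train_y = [y[i] for i in idx if i not in held_out]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                       for i in test_idx)
    return correct / len(X)


# Toy data: two well-separated clusters standing in for two varieties.
X = [(i * 0.1, 0.0) for i in range(20)] + [(100 + i * 0.1, 0.0) for i in range(20)]
y = ["A"] * 20 + ["B"] * 20
accuracy = cross_validate(X, y, folds=10, k=3)
```

Because each sample is tested exactly once against a model trained on the other nine folds, the returned fraction plays the same role as the overall classification accuracy reported from 10-fold cross-validation.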

Results and discussion
The purpose of this research was to identify mango varieties using ML classifiers. In this section, we present and discuss the classification outputs and examine them against the field-based dataset values. In the experimental process, a multi-feature ROI dataset was created, and the optimized feature set was input to different machine learning classifiers. Initially, the optimized multi-feature dataset was evaluated with the ML classifiers meta bagging, tree J48, Logistic, and tree random forest. We observed that the results of these classifiers were not sufficient, as the accuracy remained below 80%. When the same procedure was repeated for LMT, a tree-based classification method, the results improved slightly, and the classification accuracy ranged between 80.33% and 88.33% for the different mango varieties. The LMT classification rates are shown in Table 2. The values in Table 2 present the overall classification accuracy for each mango class and the cross-class occurrences of samples, that is, the misclassifications. The differences in classification results were due to variations in the acquired features. The graph of the LMT classification performance in discriminating the eight mango varieties is shown in Figure 7.
Since the classification results on the LMT classifier were still not satisfactory, we repeated the experiment with KNN, a straightforward, conventional nonparametric classification scheme that assigns a class by measuring the distance to the k nearest data points. The classification accuracies of the mango varieties improved with the KNN classifier: the overall classification accuracy (OCA) for the eight mango varieties AR, CHAUN, LANG, SIND, SARO, FAJ, DESI, and ALM was 94.33%, 97%, 96%, 96.67%, 94.67%, 94.33%, 88.33%, and 89.67%, respectively. The KNN classification rates are shown in Table 3. The values in Table 3 show the overall classification accuracy for each mango class and the cross-class occurrences of samples, that is, the misclassifications. The graph of the KNN classification performance in discriminating the eight mango varieties is shown in Figure 8.
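The per-class OCA values can be recomputed from the KNN confusion matrix reported with the results (rows are the actual varieties, each with 300 ROIs). The matrix entries are taken from the paper's KNN table; the function names and the pooled accuracy over all 2400 ROIs are additions for illustration, and the pooled figure is derived here rather than reported by the authors.

```python
CLASSES = ["AR", "CHAUN", "LANG", "SIND", "SARO", "FAJ", "DESI", "ALM"]
KNN_CONFUSION = [
    [283, 12, 0, 0, 0, 3, 0, 2],      # AR
    [6, 291, 0, 1, 1, 0, 0, 1],       # CHAUN
    [0, 1, 288, 8, 2, 1, 0, 0],       # LANG
    [0, 0, 9, 290, 1, 0, 0, 0],       # SIND
    [1, 0, 0, 1, 284, 12, 2, 0],      # SARO
    [0, 0, 0, 1, 14, 283, 0, 2],      # FAJ
    [0, 1, 1, 0, 1, 3, 265, 29],      # DESI
    [1, 1, 0, 0, 2, 0, 27, 269],      # ALM
]


def per_class_accuracy(matrix):
    """Percentage of each row's samples that fall on the diagonal."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def pooled_accuracy(matrix):
    """Percentage of all samples that fall on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


for name, acc in zip(CLASSES, per_class_accuracy(KNN_CONFUSION)):
    print(f"{name}: {acc:.2f}%")
print(f"pooled: {pooled_accuracy(KNN_CONFUSION):.3f}%")
```

The matrix also shows where the errors concentrate: DESI and ALM are confused with each other far more often than with any other variety (29 and 27 ROIs), which explains why these two classes have the lowest accuracies.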

Conclusion
This research was carried out to discriminate eight mango varieties using multi-feature analysis of their leaves, based on machine learning classifiers. A 64-megapixel Samsung cell phone camera captured the leaf images of the selected mango varieties in the open air, and a mango leaf image dataset was constructed. We created ROIs on the leaf images, extracted binary, histogram, RST, spectral, and texture features from the ROIs, and formed a multi-feature vector space. The large feature vector was reduced by applying CFS optimization. Different machine learning classifiers were tested on the optimized feature vector using k-fold (k = 10) cross-validation. Two of the classifiers, LMT and KNN, gave good classification accuracies in discriminating the eight mango varieties. LMT achieved classification accuracies between 80.33% and 88.33% for the different mango varieties, whereas the KNN classifier achieved the highest overall classification accuracy, between 88.33% and 97%. LMT has a regressive nature and relies on logistic iterations, cross-validation, and tree pruning. Its architecture fuses two algorithms, regression and tree induction. This property enables LMT to exhibit good performance on the multi-feature dataset. The KNN classifier depends on measuring distances and performs well without normalization when the incoming features vary on a small scale. [54,55] Since we applied it to mango leaves, which have small-scale variations, it outperformed the other classifiers on this dataset. Finally, a comparison with other studies shows that the classification results are high for the designated set of features. In the future, our proposed classification framework may be tested on other datasets comprising small-scale feature variations. Another study may be conducted to test the proposed classification model on a dataset with the opposite characteristic, that is, one comprising large-scale feature variations.

Table 3. KNN classification of the eight mango varieties (rows: actual class; columns: assigned class)
Class | AR | CHAUN | LANG | SIND | SARO | FAJ | DESI | ALM | Total Datasets | OCA (%)
AR | 283 | 12 | 0 | 0 | 0 | 3 | 0 | 2 | 300 | 94.33
CHAUN | 6 | 291 | 0 | 1 | 1 | 0 | 0 | 1 | 300 | 97
LANG | 0 | 1 | 288 | 8 | 2 | 1 | 0 | 0 | 300 | 96
SIND | 0 | 0 | 9 | 290 | 1 | 0 | 0 | 0 | 300 | 96.67
SARO | 1 | 0 | 0 | 1 | 284 | 12 | 2 | 0 | 300 | 94.67
FAJ | 0 | 0 | 0 | 1 | 14 | 283 | 0 | 2 | 300 | 94.33
DESI | 0 | 1 | 1 | 0 | 1 | 3 | 265 | 29 | 300 | 88.33
ALM | 1 | 1 | 0 | 0 | 2 | 0 | 27 | 269 | 300 | 89.67

Comparison with other studies
Reference | Features | Classifier | Accuracy (%)
[12] | Texture + Histogram + Binary + Spectral + RST | Artificial Neural Network (ANN) | 95-98
[40] | Binary + RST Invariant | Artificial Neural Network (ANN) | 96
[18] | C Language + Deep Learning | Artificial Neural Network (ANN) | 94
[36] | F1 Score, Threshold | Region Convolutional Neural Network (R-CNN) | 80-90
[23] | Entropy + Variance + Correlation etc. | LDA + PCA | 70, 83
[56] | RGB