A Comparative Analysis of Machine Learning Algorithms Modeled from Machine Vision-Based Lettuce Growth Stage Classification in Smart Aquaponics

The rising problem of food scarcity drives innovation in urban farming. One such method is smart aquaponics. However, for a smart aquaponics system to yield crops successfully, it needs intensive monitoring, control, and automation. An efficient way of implementing this is the utilization of vision systems and machine learning algorithms to optimize the capabilities of the farming technique. To realize this, a comparative analysis of three machine learning estimators was conducted: Logistic Regression (LR), K-Nearest Neighbor (KNN), and Linear Support Vector Machine (L-SVM). This was done by modeling each algorithm from the machine vision feature-extracted images of lettuce raised in a smart aquaponics setup. Each model was optimized to increase cross-validation and hold-out validation performance. The results showed that KNN, with the tuned hyperparameters n_neighbors=24, weights='distance', algorithm='auto', and leaf_size=10, was the most effective model for the given dataset, yielding a cross-validation mean accuracy of 87.06% and a classification accuracy of 91.67%.


I. INTRODUCTION
The land resources for agriculture have been decreasing as more rural areas are urbanized to accommodate industrial needs. Urban Agriculture (UA) is one of the growing food security solutions as the global population and urbanization rapidly increase. UA is defined as the production, processing, and distribution of food produced in cities for local needs [1]. One developing form of UA is the aquaponics system.
Aquaponics is a combination of hydroponics (soilless planting) and aquaculture (fish farming). It is a closed-loop system recycling fresh water between fish and plants, so that nutrients are shared between them as well [2]. Consequently, aquaponics systems need intensive monitoring to work successfully [3]. With this, data acquisition and control systems are studied, developed, and implemented to make these systems smart aquaponics. Machine vision is a system widely developed for agricultural automation [4], [5]. High-quality grading of fruits and vegetables relies on images processed from vision systems [6] to properly extract the features for analysis.
As crop growth monitoring is heavily dependent on subjective human judgment, monitoring and control are prone to inaccuracy. An implemented machine vision system helps "see" the crops and effectively analyze numerous essential elements in crop growth [7]. The use of machine vision to acquire data from a smart agricultural setup is evidently capable of increasing the efficiency of food production. This method can extract features that the human eye can hardly visualize. The features extracted from the processed images are then used for developing models through a learning algorithm. The developed models are utilized for further monitoring, analysis, and control.
Manuscript received March 14, 2020; revised July 13, 2020. The authors are with Gokongwei College of Engineering, De La Salle University, Manila, Philippines (e-mail: sandy_lauguico@dlsu.edu.ph).
The general objective of the study is to determine which among the machine learning classification algorithms Logistic Regression (LR), K-Nearest Neighbor (KNN), and Linear Support Vector Machine (L-SVM) is the most accurate in classifying the three growth stages (vegetative, head development, and harvest) of lettuce farmed in a smart aquaponics setup. Specifically, this is achieved through images acquired from machine vision, which are processed to extract the features. Another specific aim of the study is to use the extracted features as the dataset to train the abovementioned algorithms and produce models. The models are then optimized according to their algorithm parameters and compared with one another to yield the highest possible classification accuracy.
The paper is divided into four more sections. Section II reviews the related literature to support the novelty and significance of the study. Section III discusses the methodology from data gathering to model optimization. The results of the trained models and the comparison of the three algorithms are analyzed in Section IV. Lastly, Section V presents the conclusion and future recommendations of the research.

II. RELATED STUDIES
Monitoring and control in agriculture are a necessity as the demand for food reaches its peak [8]. To efficiently produce high-quality crops through control and monitoring, learning algorithms should be implemented. These algorithms should satisfy the attributes of the crops to accurately produce the desired output from the system. Focusing specifically on lettuce, a study discussed that it is critical to know the development stages of this crop, as it can easily die in an uncontrolled environment and can develop poor quality if not cultivated properly. Studies [9], [10] show that AI-based farming applied to lettuce can be highly beneficial.

A. Lettuce Culture
Lactuca sativa, better known as lettuce, takes 45 to 55 days to reach full maturity after germination. Its growth development is divided into three stages: vegetative, head development, and harvest. The vegetative stage covers the first two weeks of lettuce development from transplanting. The third up to the seventh week is the head development stage. The last three weeks of the plant's life is the harvest stage [9].
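The week-to-stage timeline above can be expressed as a small helper function. This is a hypothetical illustration (the function name and interface are not from the paper), but the boundaries follow the stages described:

```python
def growth_stage(week: int) -> str:
    """Map weeks after transplanting to a lettuce growth stage,
    following the 10-week timeline described above."""
    if week < 1 or week > 10:
        raise ValueError("expected a week in the range 1-10")
    if week <= 2:
        return "vegetative"        # first two weeks
    if week <= 7:
        return "head development"  # third through seventh week
    return "harvest"               # last three weeks
```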
Attributes contributing to defining the growth stage of lettuce focus on the morphological features of the biomass: perimeter, area, convex area, convex hull area, convex hull perimeter, compactness, solidity, convexity, dominant, major axis length, minor axis length, length of skeleton, and skeleton perimeter. These features were proven to be significant contributors in determining which development stage the lettuce is in.
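Several of the features listed above are derived from the basic shape measurements. The paper does not give its exact formulas, so the sketch below uses commonly accepted definitions of compactness, solidity, and convexity as an illustrative assumption:

```python
import math

def shape_descriptors(area, perimeter, convex_area, convex_perimeter):
    """Derived shape features under commonly used definitions
    (illustrative assumptions; the paper's exact formulas are not given)."""
    return {
        # 1.0 for a perfect circle, smaller for irregular shapes
        "compactness": 4 * math.pi * area / perimeter ** 2,
        # fraction of the convex hull that the object fills
        "solidity": area / convex_area,
        # ratio of hull perimeter to object perimeter
        "convexity": convex_perimeter / perimeter,
    }
```

For a perfect circle (area = πr², perimeter = 2πr, hull equal to the object), all three descriptors evaluate to 1.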

B. Machine Vision and Learning in Agriculture
Machine vision is becoming widely used for extracting features of agricultural products for further analysis. In precision agriculture, an apple size estimator was built by processing the images captured from a vision system to detect and segment apples in color images using shape, color, and size attributes. The physical size, or resolution, of every pixel in an image was estimated by modeling the relationship between pixel size, pixel coordinates, and distance from the camera through a number of checkerboard images [11]. A 3D approach for plant and crop analysis [12] was implemented to alleviate the difficulties encountered by machine learning estimators, as 3D can generate richer information for extraction. One of the evaluations employed in that study was the use of 3D imagery to obtain ellipsoid parameters of potatoes for fitting.
Features derived from image processing become more significant when utilized as a training dataset to introduce AI-based control and monitoring. There are numerous studies dedicated to the use of vision systems in extracting attributes for training learning algorithms in agricultural applications. A tomato defect discriminator was developed with an RBF-SVM using LAB color-space pixel values; the model was able to achieve 98.9% mean accuracy [13]. An application of deep learning was used for detecting post-harvest apple pesticide residues. By calculating the roundness value and extracting the region with the highest roundness in the connected region, a region of interest (ROI) mask was created for the apple. The hyperspectral region was then extracted by taking the different pesticide residue types in the ROI masks. The extracted hyperspectral images were used as input to a Convolutional Neural Network (CNN), and the model achieved an accuracy of 99.09% [14]. Vision systems integrated into unmanned aerial vehicles for gathering data in machine learning applications have also proven significant in the field of agriculture [15].

III. METHODOLOGY
The study implements an experimental design divided into four phases to achieve the objective of classifying the development stage of the lettuce using three different estimators.
Shown in Fig. 1 is the system architecture of the research.
A. Data Gathering
A smart aquaponics system was established in Rizal, Philippines. Shown in Fig. 2 is the hydroponics part of the aquaponics setup on which the lettuce crops are planted. Data were gathered once a week by capturing 30 images of different lettuce plants. This lasted for 10 weeks, gathering a total of 300 images. The image on the right side of Fig. 2 is a sample of the 300 images captured.

B. Image Preprocessing
To extract the attributes from the images, MATLAB was used to preprocess the gathered data. First, the original image was overlaid with superpixels. The superpixel function computes the superpixels of a 2D RGB image using simple linear iterative clustering (SLIC) to group similar pixels [16]. A k-means clustering algorithm was then applied to determine the objects that belong to the same shades of a certain color. The algorithm can provide clusters for the different colors existing in an image, as it uses the distance of each pixel to the cluster centers to determine which cluster it belongs to [17]. For this study, the cluster producing the green color was extracted. Some unconnected pixels were also removed, as some of the lettuce from other pockets overlap in the image. Beforehand, the RGB image was converted to grayscale to make it compatible with the bwareaopen function. Shown in Fig. 3 and Fig. 4 are examples of the images processed for overlaying superpixels and clustering through k-means, respectively. The next steps were smoothening the objects, masking the leaves, and segmenting each leaf in the image. The segmentation was done using the watershed transformation with a Sobel filter utilized in the frequency domain. A recent study [18] used the watershed transformation and proved it effective in separating touching objects in images. Fig. 5 shows one of the watershed transformations done to segment each leaf of the lettuce. After segmentation, morphological operations were performed on the segments to extract numerical values from the physical features of the leaves.
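The k-means color-clustering step can be sketched in Python as a simplified stand-in for the MATLAB pipeline described above. The function name and the "most green centroid" heuristic are assumptions for illustration, not the paper's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_green_cluster(rgb, n_clusters=3, seed=0):
    """Cluster pixel colors with k-means and keep the cluster whose
    centroid is most green (highest G minus the mean of R and B).
    A simplified sketch of the color-extraction step described above."""
    h, w, _ = rgb.shape
    pixels = rgb.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(pixels)
    centers = km.cluster_centers_
    # Score each cluster center by how green it is.
    greenness = centers[:, 1] - centers[:, [0, 2]].mean(axis=1)
    green_label = int(np.argmax(greenness))
    return (km.labels_ == green_label).reshape(h, w)  # boolean leaf mask
```

The resulting boolean mask would then feed the smoothing, masking, and watershed segmentation steps.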

C. Model Training and Optimization
The features extracted from the morphological operations were used as the dataset to train the three algorithms: Logistic Regression, K-Nearest Neighbors, and Linear Support Vector Machine. Logistic Regression is a machine learning algorithm used for classification applications, allocating observations to a discrete set of classes [19]. K-Nearest Neighbors is identified as one of the ten most popular and significant estimators in machine learning [20]. KNN uses a distance function (usually Euclidean) to measure the similarity in proximity between two samples in determining the closest neighbors of a query point or testing data [21]. Support Vector Machine, on the other hand, is a classic discriminative estimator derived from statistical learning theory. A pattern is represented in an n-dimensional space according to its features; the algorithm then discovers a hyperplane in that space to separate the patterns into classes [22]. With these estimators proven to be important algorithms for classification, they were used in the study for modeling the development stage of lettuce as a function of its physical features.
Before training, the dataset was pre-analyzed for missing data, categorical data, outliers, and imbalances. Where present, these were addressed using imputers and encoders before setting the independent (X) and dependent (Y) variables. After this process, the X and Y values were split into 80% training data and 20% testing data. The independent variables of both the training and testing sets were feature-scaled using standardization. Shown in equation (1) is the formula used to derive the rescaled dataset, where X is the original value, μ is the mean, and σ is the standard deviation:

X' = (X − μ) / σ    (1)

Upon setup, the data were fitted one by one with the three algorithms using default parameters. Shown in Tables I and II are the results of the fitting. These validations were preliminary results used only for comparison against the validations of the final models. Interpreting the preliminary results, KNN has the best performance in terms of accuracy and the majority of the other performance metrics. The performance metrics shown in the tables are:
• Accuracy: how many samples were correctly classified
• F1-Score: a harmonic mean considering precision and recall
• Specificity: how often the prediction is negative when the actual value is negative
• False Positive Rate (FPR): how often the prediction is incorrect when the actual value is negative
• Precision: how often the prediction is correct when the predicted value is positive
To summarize, for all the performance metrics except FPR, a good model has values closer to 1, while FPR should be closer to 0.
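The setup described above (80/20 split, standardization, and fitting the three estimators with default parameters) might be sketched with scikit-learn as follows. The feature matrix here is random stand-in data, since the paper's extracted dataset is not available, and SVC with a linear kernel is one plausible reading of "Linear SVM":

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the morphological features and 3 growth stages.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
y = rng.integers(0, 3, size=300)

# 80/20 split, then standardize X' = (X - mean) / std,
# fitting the scaler on the training set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit the three estimators with default-style parameters and record
# hold-out accuracy for each.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "L-SVM": SVC(kernel="linear"),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```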
To further increase the performance of the models, each of the algorithms was optimized using GridSearchCV to determine the best-performing parameters. For LR, the best accuracy score was obtained using C=100, penalty='l1', solver='liblinear', random_state=0. For KNN, the parameter combination with the best accuracy score was n_neighbors=24, weights='distance', algorithm='auto', leaf_size=10. Lastly, for L-SVM, the combination of C=0.001, kernel='poly', degree=4, gamma=10 yielded the best performance. Shown in Fig. 6, Fig. 7, and Fig. 8 are the confusion matrices for the optimized models. Overall, it can be visually analyzed that the models performed well, as the diagonal boxes are lighter in color and the off-diagonal boxes are darker. This indicates that the models were able to predict the actual values in most instances.
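The grid-search step for KNN can be sketched as follows. The data are synthetic stand-ins, and the grid values other than the paper's winning combination are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in features/labels; the real dataset is not available.
rng = np.random.default_rng(1)
X = rng.normal(size=(240, 13))
y = rng.integers(0, 3, size=240)

# Grid covering the KNN hyperparameters tuned in the paper
# (the non-winning candidate values are assumptions).
param_grid = {
    "n_neighbors": [5, 10, 24],
    "weights": ["uniform", "distance"],
    "algorithm": ["auto"],
    "leaf_size": [10, 30],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="accuracy").fit(X, y)
best = search.best_params_  # the paper reports n_neighbors=24, etc.
```

GridSearchCV exhaustively cross-validates every parameter combination and refits the best one on the full training data.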
Fig. 6 shows that the classifier made a total of 60 predictions. All 12 samples actually in the vegetative stage were correctly predicted as vegetative. In the head development stage, 27 out of 35 were accurately predicted, while 8 were predicted to be harvest when they were actually under head development. For the harvest stage, 10 out of 13 were correctly predicted. Similarly, 60 predictions were made using KNN: 16 out of 19 were correctly identified as harvest, with three mistaken for head development; 27 out of 29 were correctly identified as head development, with two mistaken for the harvest stage; and a hundred percent accuracy was achieved in predicting the vegetative stage. The overall accuracy depicted in the confusion matrices indicates that KNN is the best-performing model among the three, as it consistently gives high accuracy for each of the growth stages.
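The way the per-class counts above are read off a confusion matrix can be illustrated with a small example. The labels here are hypothetical, chosen only to show one head-development sample confused with harvest:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for the three growth stages.
stages = ["vegetative", "head development", "harvest"]
y_true = ["vegetative"] * 3 + ["head development"] * 4 + ["harvest"] * 3
y_pred = ["vegetative"] * 3 + ["head development"] * 3 + ["harvest"] * 4

cm = confusion_matrix(y_true, y_pred, labels=stages)
# Row i = actual class i, column j = predicted class j, so the diagonal
# holds correct predictions and off-diagonal cells are confusions
# (here, one head-development sample predicted as harvest).
correct = cm.diagonal().sum()
accuracy = correct / cm.sum()
```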

D. Cross Validation and Hold-Out Validation
The cross validation used a stratified k-fold with 10 splits for all algorithms to validate the consistency of each model across different groupings of the dataset into training and testing sets. Stratified k-fold is a variation of k-fold in which the returned folds preserve the percentage of samples for every class, making it more reliable than conventional k-fold cross validation. Shown in Table III are the optimized results of the cross validation in terms of the mean and variance of accuracy and F1 score. It is evident that KNN is still the best-performing model after optimization. A hold-out validation was conducted to determine the sufficiency of fitting. Analyzing the KNN results in Table IV, it can be interpreted that the model is only underfit by a few percent, which still makes it a good model.
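The stratified 10-fold procedure might look as follows in scikit-learn; the balanced synthetic labels here stand in for the three growth-stage classes:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 13))   # stand-in features
y = np.repeat([0, 1, 2], 100)    # three balanced growth-stage classes

# Stratified 10-fold CV preserves the class proportions in every fold,
# so each fold here contains 10 samples of each class.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(), X, y,
                         cv=cv, scoring="accuracy")
mean_acc, var_acc = scores.mean(), scores.var()
```

Reporting the mean and variance of the fold scores, as in Table III, summarizes both the typical accuracy and its consistency across data splits.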

E. Classification Report
The classification reports are shown in Tables V, VI, and VII to determine how accurate the models are in predicting the actual values for each class. The reports strengthen the claims presented by the confusion matrices.
It can be inferred that all the models are more accurate in classifying the vegetative stage compared to the other stages, yielding an almost perfect 100% performance for every model. The optimized KNN is evidently the highest-performing model for predicting the correct classification of features for the vegetative, head development, and harvest stages. The worst-performing model was the optimized logistic regression, with a recall as low as 56%.

IV. RESULTS AND DISCUSSION
Summarizing the results provided, a comparative analysis can be drawn out for the three algorithms used, both when they were modeled with their default and optimized parameters. The analysis includes the comparison of three significant performance metrics.
It can be observed that the accuracies from cross-validation are always lower than the hold-out metrics, indicating that the algorithms trained with the estimators' default parameters produce underfit functions for all three models. Still, KNN shows the best performance among the three in classifying lettuce into its growth stages. Optimization evidently made the models better in many aspects. First, the cross-validation accuracy is close to the accuracy and F1 score of the hold-out validation, leading to models that are sufficiently fit. This does not apply to LR, as the graph shows that its cross-validation accuracy is significantly higher than the hold-out metrics, making it overfit.
However, looking at the best-performing model, optimization made KNN more appropriate for classifying the growth stages of lettuce on future unseen data. The cross-validation of KNN yielded a mean accuracy of 87.06%, while its hold-out classification accuracy was 91.67%.
International Journal of Environmental Science and Development, Vol. 11, No. 9, September 2020
With the optimized KNN as the best estimator for the given application of classifying the growth stage of lettuce from morphological attributes, Fig. 11 shows the plot of the test data in comparison with the prediction output of the optimized KNN model. It can be interpreted from the figure that for the majority of the samples, the predicted values matched the actual test values. Mathematically, this represents 55 out of 60 samples predicted accurately, which mirrors the 91.67% hold-out accuracy obtained by the model.

V. CONCLUSION
The comparative analysis was achieved successfully by employing the features extracted through machine vision-based image preprocessing to train the three abovementioned estimators. The trained models were further optimized to increase the performance of each algorithm. Analyzing the models in comparison with one another resulted in the conclusion that the K-Nearest Neighbors model with the parameter combination n_neighbors=24, weights='distance', algorithm='auto', leaf_size=10 is the best-performing model for the application, as supported by the majority of the performance metrics used.
For future studies, it is recommended that feature selection be performed, as 15 attributes were used for training the models, making the computational cost very high. It is also recommended to conduct further comparative analyses with other machine learning algorithms and deep learning techniques.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

AUTHORS' CONTRIBUTIONS
Sandy Lauguico was responsible for the numeric data preprocessing; algorithm training, testing, and optimization; and the model comparative analysis. Ronnie Concepcion II built the algorithm for extracting features from image data. Jonnel Alejandrino assisted in the image feature extraction. Rogelio Ruzcko Tobias proposed a system for data acquisition and acquired the dataset. Dailyne Macasaet helped in determining the research gap and identifying the problem. Dr. Edwin Sybingco provided machine vision concepts, techniques, and options to use for the given problem, and Dr. Elmer Dadios guided how the research was conducted.

Sandy Lauguico had an internship experience as a quality assurance engineering intern at a Japan-based company for seven months and then continued as a part-time faculty member at a local college for a year. She has one published paper entitled Design of an Audio Transmitter with Variable Frequency Modulation Parameters Using National Instruments LabVIEW 2011 and Universal Software Radio Peripheral 2920 as an alternative public address system for Asia Pacific College, published in Manila by the IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM) in 2019. She has two more papers in progress for publication by the IEEE 9th CIS-RAM and the IEEE 11th HNICEM. Her current research focuses on providing environmental control and automation in a smart aquaponics setup, which will further be used for future analysis based on artificial intelligence. Engr. Lauguico is an active member of the Institute of Electrical and Electronics Engineers and the Institute of Electronics and Communications Engineers of the Philippines. She also served as a session chair and a technical committee member at the IEEE 11th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM).
She is a licensed electronics engineer and technician, and a Class B amateur radio operator.

Ronnie Concepcion II has four years of industry experience as an operations and performance database administrator, and is a professor of engineering at a local private university. He is one of the editorial board members who worked on the completion of the journal publication of Acta Scientific Computer Sciences (Acta Scientific, Volume 2, Issue 1, 2020, published January 1, 2020). He has published numerous technical and scientific papers aligned with his research interests, which are biosystems engineering, computational intelligence, intelligent systems, sustainable agriculture, and structural health monitoring. Engr. Concepcion is a fellow of the European Alliance for Innovation, a fellow of the Royal Institute of Electronics Engineer, Singapore, an associate member of the National Research Council of the Philippines, and a member of the Institute for Systems and Technologies of Information, Control and Communication.

His earlier research bagged several awards in national investigatory project competitions in the field of computational chemistry, particularly in water impurities. He has been involved in research on wireless communication potential during disaster scenarios. He has published several technical and scientific papers aligned with his research specialties, which are wireless communications, network systems, sustainable agriculture, structural health monitoring, biochemical engineering, computational intelligence, and intelligent systems. He is also a member of various research programs, such as information and communication systems for disaster resilience and hydroponics and aquaponics systems in smart farming. He is a licensed electronics engineer and technician, and a Class B amateur radio operator. He is also a member of the Institute of Electrical and Electronics Engineers, Philippines Section.
He earned his master of science degree in electronics and communications engineering in 1993 and his bachelor of science degree in electronics and communications engineering in 1990 from the same university. He had 17 years of industry experience in several companies and worked as an electronics engineer consultant, a management information systems consultant, and a technical trainer. In international training and research collaboration, he became an exchange scientist working on Adaptive Differential Pulse Code Modulation in Japan. He was also a postgraduate research fellow under the Department of Science and Technology, Philippines, and a visiting researcher at the University of New South Wales School of Mechanical and Manufacturing Engineering. His administrative experience spans academic and research roles, as he became a thesis and research coordinator, faculty development coordinator, and vice chair of the ECE department at DLSU. He has published numerous research papers related to machine learning, machine vision, robotics, signal processing, and data analytics. Dr. Sybingco is an active member of the Institute of Electrical and Electronics Engineers (IEEE) and the IEEE Computational Intelligence Society. At present, he is the organizing co-chair of The International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management.

Rogelio Ruzcko Tobias
Elmer P. Dadios is presently a full professor at De La Salle University, Manila, Philippines, under the Manufacturing Engineering and Management Department, and the graduate program coordinator of Gokongwei College of Engineering. He currently leads government-funded research projects on bomb removal robots, traffic surveillance, and smart aquaponics. He obtained his doctor of philosophy degree at Loughborough University, United Kingdom. He completed his master of science in computer science (MSCS) at De La Salle University (DLSU), Manila, and his bachelor of science in electrical engineering at Mindanao State University (MSU), Marawi City, Philippines. In his professional experience, he became part of a scholarship committee and administrative staff working for the Department of Science and Technology (DOST) Philippine Council for Industry, Engineering Research and Development. He was also a research coordinator and a director at DLSU. He has served as session chair, program chair, publicity chair, and general chair in various local and international conferences, and became an external assessor at the University of Malaysia. He has won numerous awards, such as inclusion in the Top 100 Scientists listed in Asian Scientist Magazine and the Leaders in Innovation Fellowship "Fellow" given by the United Kingdom Royal Academy of Engineering. He has published numerous technical and scientific research papers on robotics, artificial intelligence, software engineering, automation, and intelligent systems. Dr. Dadios is presently the president of the Mechatronics and Robotics Society of the Philippines. Aside from being a senior member of the Institute of Electrical and Electronics Engineers (IEEE), he is also a Region 10 executive committee member, the section and chapter coordinator, and the section elevation committee chair.
He is also a vice chair of the National Research Council of the Philippines and an active member of the Steering Committee, Asian Control Association (ACA), the Philippines American Academy of Science and Engineering (PAASE), and the Society of Manufacturing Engineers (SME).