Using Machine Learning Algorithms to Forecast the Sap Flow of Cherry Tomatoes in a Greenhouse

The sap flow of plants directly indicates their water requirements and provides farmers with a good understanding of a plant's water consumption. Water management can be improved based on this information. This study focuses on forecasting tomato sap flow in relation to various climate and irrigation variables. The proposed study utilizes different machine learning (ML) techniques, including linear regression (LR), least absolute shrinkage and selection operator (LASSO), elastic net regression (ENR), support vector regression (SVR), random forest (RF), gradient boosting (GB) and decision tree (DT). The forecasting performance of different ML techniques is evaluated. The results show that RF offers the best performance in predicting sap flow. SVR performs poorly in this study. Given water/m², room temperature, given water EC, humidity and plant temperature are the best predictors of sap flow. The data are obtained from the Ideal Lab greenhouse, in the Netherlands, in the framework of the European Funds for Regionale Ontwikkeling (EFRO) EVERGREEN Greenport Noord Holland Noord project (2018-2020).


I. INTRODUCTION
In alignment with artificial intelligence (AI) and big data technology, machine learning (ML) introduces new opportunities to unravel, measure, mine and understand the hidden patterns of data processes in dynamic and static environments [1]. ML is defined as the scientific field of statistical techniques that confers machines with the ability to learn from a series of input and output examples. ML is applied in many scientific fields, for example, bioinformatics, medicine, finance and economic sciences, robotics and vision engineering, sentiment analysis of social media, agriculture, climatology and food security. One important use of ML is predicting possible factors that influence crop management, specifically yield forecasting, crop growth forecasting, health prediction, decision making and crop mapping [2]. ML has the potential to address existing and future challenges in agriculture by means of massive volumes of data containing a wide variety of indicators that can be captured, analyzed, processed and used for decision making. It is essential to gather data from various sources when making predictive decisions, e.g., preventing crop loss and increasing yield while minimizing the use of resources. Many studies have investigated applications of ML in agriculture. Kaul et al. [3] applied artificial neural networks for highly accurate corn and soybean yield prediction. Logan et al. [4] combined generalized linear model (GLM), Bayesian additive regression tree (BART), and classification and regression tree (CART) methods to exploit their high predictive power in the decision-making process for Royal Gala apples. In another study, Delgado et al. [5] adapted a fuzzy logic information network and a decision-support system to address imprecision and inaccuracy for effective decision making in olive cultivation. Furthermore, Utkarsha et al. [6] focused on clustering ML for crop growth prediction, while Jing-Xian et al. [7] performed regression supervised learning to forecast sugarcane yield.
This study aims to predict sap flow in cherry tomatoes. Currently, automated irrigation systems are commonly used in greenhouses. In contrast to the manual effort required to water plants, farmers need only one person to control the computer. The amount of water to be given is determined by solar radiation. However, with the use of new technologies, the current irrigation strategy does not meet the required accuracy for greenhouse applications. According to Tom et al. [8], the water given to plants based on solar radiation might be wasted because of the low water storage capacity of the substrate (Rockwool), which contradicts the energy-saving strategy and might cause declines in production and quality. Sap flow sensors make the water requirements of plants more obvious, accurate and direct. In contrast to the mass-balance technique to check plant water uptake, the sap flow sensor provides real-time data [9] and shows precise changes in water use in response to different environmental conditions. For commercial purposes, such sensors can help farmers to improve or adjust their water management strategy, as according to Gimenez et al. [10], sap flow can be used as an indicator of a plant's water status. Sap flow sensors are commonly used in forestry and vine production for research and commercial purposes and have even been adopted in the orchid industry. However, there is a gap in the use of sensors for edible herbaceous plants. Some features of cherry tomato plants make them good research objects. Tomato plants are perennials that are grown throughout the entire year in greenhouses, which provides a long time for research. Moreover, tomato plants have strong and thick stems, which simplify sensor installation [8]. Another motivation of this study is to avoid water drainage or loss, which is directly proportional to the amount of given water. We can prevent excessive drainage or loss by providing only the required amount of water to the plant.
To contribute to the application of sap flow information, this study attempts to predict tomato sap flow based on multiple variables using ML algorithms. Three categories of variables are considered in this study: climate data, irrigation data, and sap flow data. These data were collected from the Ideal Lab greenhouse [11] in Naaldwijk, the Netherlands. To achieve optimum monitoring, various sensors with actuators are installed on and around the tomato plants in our experimental greenhouse lab. The sensors on the plant provide information about sap flow. In addition, sensors are installed in the greenhouse to obtain (big) data about the conditions within the greenhouse: for example, climate sensors for measuring temperature, humidity, sunlight and irrigation water supply. Moreover, sensors in the substrate mat continuously measure mat weight and the pH and electrical conductance (EC) of water. The forecasting problem in this study is considered as a regression problem, and ML regression models, such as linear regression (LR), least absolute shrinkage and selection operator (LASSO), elastic net regression (ENR), support vector regression (SVR), random forest (RF), gradient boosting (GB) and decision tree (DT), are used to predict sap flow. The models are trained using the climate, irrigation and sap flow datasets provided by Ideal Lab greenhouse in Naaldwijk, the Netherlands. The dataset is divided into a training set, including 80% of the records, and a test set, with the remaining 20%, and the R-squared score (R²), adjusted R-squared score (adjusted R²), mean square error (MSE), root mean square error (RMSE) and mean absolute error (MAE) are used as performance metrics.
The key findings of this study are listed as follows:
• RF offers the best sap flow prediction capability, with the highest R² value of approximately 81% and an MSE of 0.25.
• Given water/m², room temperature, given water EC, humidity and plant temperature are the best predictors.
This paper is divided into five sections. Section I describes the background and goals of this study; in Section II, the materials and methods are described. The methodology is presented in Section III. Section IV presents the results and discussion of the results, including figures and tables. Finally, Section V concludes this paper.

II. MATERIALS AND METHODS
A. DATASET
A cherry tomato variety is used in this project. Tomato plants were grown in the Ideal Lab greenhouse (length: 12.50 m, width: 6.50 m; height: 6 m) located at the World Horti Center, Naaldwijk, the Netherlands [11]. The seedlings (16 cm) were provided by Axia Vegetable Seeds company [12] and were grafted onto rootstocks (Maxifor, provided by Rijk Zwaan) on 8 November 2018 [13]. Rockwool slabs were provided by Grodan GT Master [14]. Artificial light was applied between 7:00 am and 6:00 pm each day, the average day temperature inside the greenhouse was 23 °C, the average night temperature was 17 °C, the CO2 application was 533 ppm, and the irrigation system applied water at a rate of 0.5 L/m² on average. These amounts can be adjusted to meet the Dutch cultivation strategy depending on the weather outside the greenhouse. A total of 364 records from 12 samples of cherry tomato plants were considered.

1) Climate and Irrigation Datasets
The climate dataset includes the room temperature, air humidity, carbon dioxide (CO2), outside radiation, air density, outside temperature, outside air humidity, outside air density, wind speed and plant temperature. The irrigation dataset consists of given water EC, given water pH, given water/m², drained water EC, drained water amount, and absorbed water amount. All these data were monitored and recorded automatically via the Priva [15] climate system on a daily basis.

2) Sap Flow Dataset
The sap flow dataset was recorded using Dynagage SF sensors provided by 2GROW [16]. The sap flow rate was recorded every 2.5 seconds. One sensor was installed on a tomato plant, and data were monitored and recorded automatically for the entire project period. The data are presented visually using the Phythosense software package of 2GROW [17]. Data samples from each dataset are shown in Tables I, II, and III. The variables selected for inclusion in the study are as follows:
• Room temperature
• Air humidity

B. SUPERVISED MACHINE LEARNING MODELS
The purpose of this study is to construct models to predict sap flow based on input predictors. Seven supervised ML methods are considered in this study.

1) Linear Regression
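LR is a supervised ML algorithm [18] that models a dependent variable as a linear function of one or more independent variables. According to the number of independent variables, LR can be categorized as simple LR or multiple LR, as shown in the following equations. The goal of LR is to identify the combination of weight (w) and bias (b) that leads to the lowest cost (J).
$$\hat{y} = wx + b \tag{1}$$
or
$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_p x_p + b \tag{2}$$
with the cost
$$J = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$
J is the cost function, $\hat{y}_i$ is the predicted value, and $y_i$ is the actual value.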
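As a minimal sketch of the cost minimization described above (not the authors' code), the following fits a multiple LR by ordinary least squares with NumPy and evaluates J as the mean squared difference between predicted and actual values. The synthetic data, the function names `fit_linear_regression` and `cost`, and the use of the mean squared error for J are assumptions made for illustration.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return (w, b) minimizing the mean squared error via least squares."""
    # Append a column of ones so the bias b is estimated with the weights.
    X1 = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[:-1], coef[-1]

def cost(X, y, w, b):
    """J: mean squared difference between predicted and actual values."""
    y_pred = X @ w + b
    return float(np.mean((y_pred - y) ** 2))

# Hypothetical data: two predictors, known weights and bias, no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -0.5]) + 0.2

w, b = fit_linear_regression(X, y)
print(w, b, cost(X, y, w, b))
```

Because the illustrative data are noise-free, the recovered weights and bias should match the generating values up to floating-point error.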

2) Least Absolute Shrinkage and Selection Operator
LASSO is a regularized LR technique [19] that performs well in cases of high multicollinearity and sparse models [20]. In contrast to ordinary multiple LR, LASSO performs automatic selection among the predictors. The goal of LASSO is to shrink the coefficients by minimizing (sum of squared residuals + λ*|slope|). The equation is shown below.
$$\min_{w,\,b}\ \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{p}\lvert w_j\rvert \tag{3}$$
Here, $y_i$ and $\hat{y}_i$ are the actual and predicted values, $w_j$ are the slopes, and λ is the shrinkage parameter. When λ equals 0, the estimate is the same as that from LR.

3) Elastic Net Regression
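ENR is a regularized regression algorithm that combines LASSO and ridge regression [21]. The estimates for ENR minimize (sum of the squared residuals + λ₁*|slope| + λ₂*slope²). ENR addresses the disadvantage of LASSO by removing the limitation on the number of selected variables. The equation is shown below.
$$\min_{w,\,b}\ \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda_1\sum_{j=1}^{p}\lvert w_j\rvert + \lambda_2\sum_{j=1}^{p} w_j^2 \tag{4}$$
Here, $y_i$ and $\hat{y}_i$ are the actual and predicted values, $w_j$ are the slopes, and $\lambda_1$ and $\lambda_2$ weight the LASSO and ridge penalties, respectively.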
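To illustrate the automatic predictor selection of LASSO and the mixed penalty of ENR, here is a small sketch using scikit-learn's `LinearRegression`, `Lasso` and `ElasticNet`. The synthetic data and `l1_ratio=0.5` are assumptions; the alpha values 0.05 and 0.06 are the ones reported later in this paper. Note that scikit-learn scales the squared-error term by 1/(2n), so its `alpha` is not numerically identical to λ in the expressions above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression

# Hypothetical data: only the first two of six predictors drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lr = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.05).fit(X, y)                   # L1 penalty only
enr = ElasticNet(alpha=0.06, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2 penalties

# Compare the fitted slopes of the three models side by side.
for name, model in [("LR", lr), ("LASSO", lasso), ("ENR", enr)]:
    print(name, np.round(model.coef_, 3))
```

In a setup like this, LR typically keeps small non-zero weights on the irrelevant predictors, whereas the L1 penalty can set them exactly to zero.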

4) Support Vector Regression
SVR is the regression counterpart of the support vector machine (SVM), which is widely used in classification problems in ML [22]. A hyperplane (a line in two dimensions) is constructed to separate the training data in N dimensions. Multiple hyperplanes can be used to classify the data. The hyperplane with the best performance is the one that achieves the largest separation. Subsequently, the regression is performed based on the hyperplanes. The equation of SVR is as follows: (5)
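For reference, a standard textbook formulation of linear ε-insensitive SVR is sketched below. It is an assumption that this corresponds to the formulation the authors intended; the symbols $C$ (regularization constant), $\varepsilon$ (width of the insensitive tube) and $\xi_i$, $\xi_i^*$ (slack variables) are not defined in this paper.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*}\quad
  & \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \\
\text{subject to}\quad
  & y_i - \left(w^\top x_i + b\right) \le \varepsilon + \xi_i, \\
  & \left(w^\top x_i + b\right) - y_i \le \varepsilon + \xi_i^*, \\
  & \xi_i,\ \xi_i^* \ge 0, \qquad i = 1,\dots,n .
\end{aligned}
```

The fitted model then predicts with $f(x) = w^\top x + b$; errors smaller than $\varepsilon$ incur no penalty.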

5) Random Forest
RF is a supervised ML algorithm that is used for classification and regression tasks [23]. RF is an ensemble method in which multiple DT regressions are trained in parallel. The results of the individual DTs are aggregated into a single ensemble prediction via voting (classification) or by taking the mean value of the different DTs (regression). The goal of RF is to perform forecasting based on the regression trees.

6) Gradient Boosting
GB converts weak learners into strong learners [24], typically starting with a DT model. GB builds upon the previous model by adding another DT at each step. If the new DT does not correlate with the previous forecasting system, it is discarded. The final prediction is the weighted sum of the predictions of the individual DTs.
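The sequential idea can be sketched as follows. This is a simplified, hypothetical implementation of the common squared-error variant of GB, in which each new shallow tree is fitted to the residuals of the current ensemble; it does not include any step that discards trees, and the function names, learning rate and synthetic data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    """Fit shallow trees sequentially, each on the residuals of the ensemble so far."""
    baseline = float(np.mean(y))            # initial constant prediction
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction          # negative gradient of the squared error
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees

def predict_gradient_boosting(X, baseline, trees, learning_rate=0.1):
    """Final prediction: baseline plus the weighted sum of all tree outputs."""
    return baseline + learning_rate * sum(tree.predict(X) for tree in trees)

# Hypothetical one-dimensional regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

baseline, trees = fit_gradient_boosting(X, y)
y_hat = predict_gradient_boosting(X, baseline, trees)
print(np.mean((y - y_hat) ** 2))
```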

7) Decision Tree
DT builds regression models in a tree structure and uses a set of binary rules to calculate a target value [25]. The model can be trained to fit any historical data and to learn any relationships between data and variables.

C. EVALUATION PARAMETERS
The performance of each model was evaluated in terms of the R-squared (R²) score, adjusted R-squared (adjusted R²) score, mean square error (MSE), root mean square error (RMSE) and mean absolute error (MAE).
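
1) R-squared Score
The R-squared score represents the performance of the regression model [26]. When R² is less than 0, the model has no value, as it performs worse than predicting the mean. When it is equal to 0, the predicted value is equal to the mean value of the dependent variable. When it is equal to 1, the model fits the data perfectly. For a useful model, the R² score lies in (0, 1]; the higher the R² score is, the better the performance of the model.
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{6}$$
Here, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the actual values.

2) Adjusted R-squared Score
Adjusted R² compensates for the fact that the R² score keeps increasing as more variables are fed into the model [27]. When useless variables are added to the model, the adjusted R² decreases.
$$R^2_{\mathrm{adj}} = 1 - \frac{\left(1 - R^2\right)\left(n - 1\right)}{n - p - 1} \tag{7}$$
Here, n is the number of records and p is the number of predictors.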

3) Mean Square Error (MSE)
Mean square error is the average of the squared error [28]. The smaller the MSE value is, the better the performance of the model.

4) Root Mean Square Error (RMSE)
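RMSE is the square root of the mean square error [29]. The equation is shown below.
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
Here, $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and n is the number of records.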

5) Mean Absolute Error (MAE)
MAE is the average of all the absolute errors between the predicted values and actual values [30]. The smaller the absolute error is, the lower is the MAE, and lower values indicate better model performance.
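The five evaluation parameters can be computed together, as in the NumPy sketch below. The function name `regression_metrics` and the example values are hypothetical; the adjusted R² uses the common form 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of records and p the number of predictors.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Return R2, adjusted R2, MSE, RMSE and MAE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    errors = y_true - y_pred
    ss_res = float(np.sum(errors ** 2))                      # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    mse = ss_res / n
    return {
        "r2": r2,
        "adjusted_r2": adjusted_r2,
        "mse": mse,
        "rmse": mse ** 0.5,
        "mae": float(np.mean(np.abs(errors))),
    }

# Hypothetical example: five records, one predictor, one poor prediction.
metrics = regression_metrics([1, 2, 3, 4, 5], [1, 2, 3, 4, 7], n_predictors=1)
print(metrics)
```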

D. HYPERPARAMETERS
The parameters of the model that must be set before running the model, in contrast to parameters that are learned during the training process, are referred to as hyperparameters [31]. To optimize the performance criteria, these parameters should be carefully tuned, as using excessively large or small values may result in poor model performance [32]. Hyperparameter tuning is, therefore, the process of finding good values of parameters for a specific dataset [31]. Sometimes, default values of the hyperparameters are defined by the packages being used; for example, in Python, if a value for a certain hyperparameter is not provided by the user for a particular ML algorithm, the default value is applied for training. The following hyperparameters are used.

1) alpha
The scikit-learn library in Python provides GridSearchCV to find the optimum value of alpha. In this study, alpha is a hyperparameter used for LASSO and ENR, and the chosen alpha values for LASSO and ENR are 0.05 and 0.06, respectively.
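A grid search over alpha might look like the following sketch. The candidate grid, the five-fold cross-validation, the scoring choice and the synthetic data are assumptions for illustration and are not taken from the study.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Hypothetical data with two irrelevant predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.8]) + rng.normal(scale=0.2, size=120)

# Candidate alpha values (hypothetical grid).
param_grid = {"alpha": [0.01, 0.05, 0.1, 0.5, 1.0]}

# Each alpha is scored by 5-fold cross-validated negative MSE.
search = GridSearchCV(Lasso(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print(search.best_params_)
```

After fitting, `search.best_estimator_` holds a LASSO model refitted on all supplied data with the selected alpha.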

2) kernel
SVR uses linear and nonlinear kernels to map low-dimensional data to high-dimensional data. This study uses a linear kernel that supports listing feature importance, which is not possible when using other kernels, as data are transformed to another space via the kernel method.
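With a linear kernel, scikit-learn's `SVR` exposes the fitted weights through its `coef_` attribute, which can be ranked by absolute value as a simple importance measure. The sketch below uses synthetic data and hypothetical feature names; ranking by absolute weight is one possible convention and is an assumption here.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical data: x0 matters most, x1 slightly, x2 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=150)

svr = SVR(kernel="linear").fit(X, y)

# coef_ has shape (1, n_features) and is only available for the linear kernel.
weights = svr.coef_.ravel()
feature_names = ["x0", "x1", "x2"]  # hypothetical names
ranking = sorted(zip(feature_names, weights), key=lambda item: abs(item[1]), reverse=True)

for name, weight in ranking:
    print(name, round(float(weight), 3))
```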

3) n_estimators
n_estimators represents the number of trees to be built before averaging their predictions. Higher values make the model stronger and more stable, but the code becomes slower. Therefore, the highest value that a processor could handle can be chosen for best results. n_estimators is the hyperparameter used in RF and GB, with a value of 1,500.

4) max_depth
In DT, the dataset is partitioned into different subsets. Partitioning starts with a binary split and continues until no further splitting is possible. The max_depth refers to the depth of each tree in the forest, where deeper trees are expected to capture more information about the data. In the study, max_depth was set to 3, as higher values resulted in poor model accuracy.
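The two tree hyperparameters can be passed to scikit-learn as in the sketch below. Applying max_depth=3 to both RF and a single DT is an assumption, as are the synthetic data and the fixed `random_state`; the values 1,500 and 3 are those stated above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with the same number of records as the study (364).
rng = np.random.default_rng(0)
X = rng.normal(size=(364, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=364)

# n_estimators: number of trees to average; max_depth: maximum depth of each tree.
rf = RandomForestRegressor(n_estimators=1500, max_depth=3, random_state=0).fit(X, y)
dt = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Impurity-based feature importance scores (they sum to 1).
print(np.round(rf.feature_importances_, 3))
print(np.round(dt.feature_importances_, 3))
```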

III. METHODOLOGY
This study focuses on sap flow prediction in tomato plants using multiple predictors, such as climate variables (room temperature, humidity, CO2) and irrigation variables (given water, drainage water, given water pH), which are daily data. Three main steps were used to construct the forecasting system, as shown in Fig. 1. The initial dataset was processed into Table IV. Standard scaling in Python was used to bring each variable to zero mean and unit variance, which benefits the performance of many ML algorithms, such as LR and SVR. The output dataset after application of standard scaling is shown in Table V. The forecasting was performed using seven ML techniques. The potential important features were pre-selected based on the variance inflation factor (VIF), which accelerated the modeling process, and the correlations between predictors and sap flow are shown in Fig. 2. Then, the dataset was split into training and test sets based on a parameter between 0 and 1, expressed as a percentage. The dataset was split to prevent look-ahead bias, overfitting and underfitting. Common split percentages include 80%, 67% and 50% [33]. The value of 80% indicates that 80% of the data are included in the training set, while 20% of the data are included in the test set. These values were adopted in this study. Commonly, there is no optimal split percentage. A chosen split percentage needs to meet the project's objectives with respect to the following considerations [33]:
• Computational cost in training the model.
• Computational cost in evaluating the model.
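The preprocessing steps described above (VIF screening, the 80/20 split and standard scaling) can be sketched as follows. The helper `variance_inflation_factors`, the synthetic data and the `random_state` are hypothetical. Note one deliberate design choice of this sketch: the scaler is fitted on the training portion only, a common practice that keeps test data out of the preprocessing statistics, whereas the text describes scaling before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Remaining columns plus an intercept column.
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(len(X))])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Hypothetical data: 364 records, third column nearly collinear with the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(364, 3))
X[:, 2] = X[:, 0] + rng.normal(scale=0.1, size=364)
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=364)

vifs = variance_inflation_factors(X)
print(np.round(vifs, 1))  # large values flag collinear predictors

# 80% of the records for training, the remaining 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

# Standard scaling: zero mean and unit variance per feature (fitted on training data).
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```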
The ML models LR, LASSO, ENR, SVR, RF, GB and DT were implemented, and the performance was evaluated in terms of R², adjusted R², MSE, RMSE and MAE.

A. RESEARCH METHODOLOGY
Sap flow directly represents the water requirements of a plant and provides an opportunity to understand the plant's hydraulic function and plant growth in a given environment [34]. The movement of sap illustrates the connection between a plant and its surroundings [34], and sap flow sensors are applied broadly in the forestry sector for water management and research purposes [35]. However, such sensors are rarely used for herbaceous plants. In this research, the tomato, an herbaceous plant, was chosen as the research object to contribute to the sap flow database. The relationships between sap flow and climate factors were analyzed, and a predictive model of sap flow was constructed and tested. This model can be used to enhance greenhouse automation management, to improve water use efficiency and to reduce waste during production. In previous studies, sap flow was generally studied in reference to solar radiation, vapor pressure deficit, relative humidity, and air temperature [36]. By contrast, this study relies on directly measured variables rather than on vapor pressure deficit, which is calculated from measured data, and includes additional potential variables, such as plant temperature and CO2. Moreover, since most of the sensors are developed for woody plants [37], this research may contribute to sap flow sensor innovation.

IV. RESULTS AND DISCUSSION
Figs. 10, 11, 12, 13, 14, 15, and 16 show the performances of different ML algorithms for sap flow prediction. According to the results, the sap flow data change frequently. Most of the forecasted values are accurate at the lowest peak; however, predictions of the highest peak are unstable. SVR shows good predictive ability for the highest peak of sap flow; however, the predicted values are not highly correlated with the actual values and the mean square error is relatively high, which reduces the reliability of SVR. LR, LASSO and ENR show similar patterns with respect to the trend of sap flow data. LR shows the highest correlation between the predicted data and actual data (0.792) and is closely followed by ENR (0.788). In terms of the MSE and MAE, LR achieves the lowest values; therefore, LR performed the best in the linear regression technique group. RF performed best in this study; it achieved the best sap flow prediction with the highest correlation and lowest error for peaks. In previous research, RF was described as a simple and diverse supervised learning algorithm, as it can be easily used for both classification and regression. Niklas illustrated that RF is likely to achieve better performance than other approaches because of its high tree diversity [38]. When splitting a node, the best feature among a random subset of features, rather than the most important feature overall, is selected [38]. Additional advantages are reported by Julia: RF works well with high-dimensional data and unstable data [39]. RF achieves a lower variance than DT, as the variance of each DT is averaged in RF [39]. Moreover, RF does not suffer from excessive overfitting [40] and includes a rapid training process [39]. GB and DT show lower correlation and higher error than other algorithms. Therefore, GB and DT did not perform well with respect to sap flow prediction in this study.

C. FEATURE IMPORTANCE
To further improve the sap flow prediction performance, the feature importance in LR, ENR, SVR, RF, GB and DT was analyzed. LASSO was excluded from this process, as the features are automatically selected in LASSO.
The feature importance results are presented in Table VII as the feature importance score (FI score). Four features contributed to the prediction by LR: given water/m², given water EC, room temperature and humidity. Features such as CO2, plant temperature, given water pH and drained water have negative values and should be removed from the LR sap flow prediction model. The prediction of sap flow by ENR is based on given water/m², room temperature, humidity, given water EC and plant temperature; thus, removing CO2, given water pH and drained water might improve the performance. The most important features for SVR sap flow prediction are given water/m², room temperature, given water EC and humidity. By contrast, RF and GB rely on all 7 features: given water/m² has the highest FI score, followed by room temperature. The remaining features have similar FI scores (greater than 0 and less than 0.1). The most important features for DT are water amount, room temperature, plant temperature and humidity.
Given water/m², room temperature, given water EC and plant temperature contribute the most to the sap flow predictions of the different models. Given water/m² has previously been identified as an important feature in the prediction of plant sap flow [41]. The relationship between room temperature and sap flow is also consistent with previous research [42]. Given water EC has previously been found to negatively influence sap flow [43]. Moreover, plant temperature, an indicator of sap flow in this research, has not been reported in previous research. Plant temperature represents stomatal conductance, which is linked to transpiration and plant growth [44]. Furthermore, transpiration is the main driver of sap flow; therefore, theoretically, plant temperature and sap flow may be related. Given water EC indicates the degree of difficulty for plants to absorb water [45]. CO2 is strongly related to plant photosynthesis and does not show strong correlation with sap flow in this research. However, Remy et al. showed that CO2 concentration exerts a significant negative influence on sap flow [46]. The relationship might be very weak, in which case an accurate measurement methodology would be required to support it. Given water pH contributes to nutrient uptake, which exhibits no relationship with sap flow. As shown in Table VI, RF offers the best sap flow prediction performance. Moreover, on the basis of Table VII, the predictors CO2, given water pH and drained water amount should be removed to improve the prediction performance of RF.

V. CONCLUSION
The use of sap flow information can improve water management. Such information allows farmers to easily adapt irrigation strategies, which may help to minimize the waste of resources. In this study, an ML-based prediction system was used to predict sap flow, and the results show that RF performed best. Moreover, the literature has previously shown that RF has high tree diversity, low bias, moderate variance, and minimal problems with overfitting, which contributes to good predictions. LR and ENR also show good performance. Given water/m², room temperature, given water EC, humidity and plant temperature were identified as the most important features for sap flow predictions. Among these features, given water/m² was the most important variable for RF, and plant temperature was newly identified as an indicator for plant sap flow. A reliable prediction model (with higher R² value) for sap flow may contribute to better decision making during the irrigation process. This study will be enhanced in the future, and the dataset will be updated with additional records, including growth parameters such as stem growth, head thickness, and stem thickness.