Ensemble machine learning techniques using computer simulation data for wild blueberry yield prediction

Precision agriculture is a challenging task to achieve. Several studies have forecast agricultural yields using machine learning algorithms (MLA), but few have used ensemble machine learning algorithms (EMLA). In the current study, we used a dataset generated by a computer simulation program together with meteorological data collected over 30 years from Maine, United States (USA). The primary goal of this research is to increase forecast accuracy using the best-performing features, in support of overcoming hunger challenges. We designed stacking regression (SR) and cascading regression (CR) with a novel combination of MLA based on the wild blueberry dataset, using features that best characterize the regulation of wild blueberry agroecosystems. Four feature selection techniques were applied: variance inflation factor (VIF), sequential forward feature selection (SFFS), sequential backward elimination feature selection (SBEFS), and extreme gradient boosting based on feature importance (XFI). We applied Bayesian optimization to popular MLA to obtain the best hyperparameters for accurate wild blueberry yield prediction. The SR used a two-layer structure: level-0 contained the light gradient boosting machine (LGBM), gradient boost regression (GBR), and extreme gradient boosting (XGBoost); level-1 provided the output prediction using a Ridge regressor. The CR topology uses the same MLA as the SR, but in series: each stage takes the newest prediction as a feeder to the next MLA and removes the previous prediction. We assessed the outcomes of CR, SR, and many other techniques in terms of the root mean square error (RMSE) and the coefficient of determination (R²). In the results, the proposed SR showed the best performance, with 0.984 R² and 179.898 RMSE on the seven features selected by XFI, compared with another study that reported 0.938 R² and 343.026 RMSE. The SR achieved the highest R² of 0.985 on all features and on the features selected by SBEFS.
Our SR outperformed CR, many other techniques, and another study on wild blueberry yield prediction.


I. INTRODUCTION
The wild blueberry (also known as the lowbush blueberry) is Maine's most important fruit crop, growing in upland acidic sandy soils, and Maine is considered the largest producer of this crop in the USA [1]. The main advantage of wild blueberries over cultivated blueberries is that they are not planted but grow naturally in rocky hills and forests, whereas cultivated blueberries must be planted [2].
Wild blueberries produce diverse crops, as they are protected from selective breeding and genetic engineering. Hand harvesters use rakes to scrape berries from the plants [3] because of the rough terrain and altitude at which wild blueberries grow. The best time to pick wild blueberries in Maine is from the end of July to the beginning of September [4]. Insect pollinator species and weather conditions are the major factors affecting wild blueberry yields [5]. The number of insect pollinators, the optimal time for wild blueberry pollination (around 6 am to 5 pm), the bee species (honeybee (HB), bumblebee (BB), Andrena (AD), and Osmia (OS)), and the weather conditions (cold temperatures, wet weather, strong winds, frost, and warmer temperatures) all influence the wild blueberry yield [6]. Obtaining reliable crop yield predictions to meet the Food and Agriculture Organization (FAO) targets is difficult; the primary objective of the FAO is to eradicate global hunger. Many scientists have focused on precision agriculture, forecasting crop yields using machine learning (ML) methodologies [7][8][9][10]. However, few studies have used EMLA with diverse topologies [11][12]. MLA require large, high-quality datasets to accurately estimate crop yield [13]. Modeling agricultural production systems is essential, especially when collecting large-scale temporal and geographical data is prohibitively expensive, complex, or impossible. In many studies, the quality data available for training are insufficient, which may affect the experimental results [14] and degrade crop yield forecast accuracy [15]. To address these issues, computer simulation modeling approaches that simulate measurements based on scenes have been used to create data [16]. To validate a computer simulation model, the model's assumptions are tested; sensitivity analysis is used for the data assumptions.
Sensitivity analysis has been used to demonstrate the interactive impact of bee community dominance and meteorological variables on wild blueberry output [17]. The Naval Postgraduate School's Simulation, Experiments, and Efficient Design Center for Data Farming defines data farming as the practice of accumulating datasets through simulations and computer modelling. Thus, data farming aims to give decision-makers insights into complicated challenges through simulations that generate data [18]. On this premise, we used previously validated simulation-based modelling of wild blueberry pollination [16]. Obsie et al. [17] utilized computational experiments to generate training data, which may subsequently be used to train and assess MLA. On empirical datasets, EMLA have outperformed individual MLA in terms of accuracy. The major benefit of using EMLA is that they can solve difficult problems [19]; the fundamental idea is to convert weak learners into a powerful one to improve decision support [20]. A few studies have combined various ML techniques to achieve high accuracy. Stacking techniques have been used to implement classification models such as PredT4SE-Stack [21] and NeuroPpred-Fuse [22], as well as weighted multiple-metamodel stacking methods [23] for regression. Several techniques with different architectures have been used to achieve precision agriculture through crop yield prediction [12]. The stacking technique obtains the prediction outputs of the other MLA and feeds them to a metamodel to obtain the best decision. The authors of [17] believe that simulation data combined with climate data may be useful for training MLA, because these data are of high quality and produce datasets that are difficult to collect manually. A major goal is to avoid costly real-time datasets obtained from satellite images, with an average cost of US$1/km² [24].
A previous study used computer modeling to synthetically generate data for training a wild blueberry yield prediction model, integrated this with meteorological data from Maine, USA, and selected seven characteristics [17].
In this study, we used the stacking and cascading techniques to accurately estimate wild blueberry yield prediction based on unique subgroup criteria. Meta-learning is a technique that integrates the predictions of many MLA to develop an EMLA approach. Meta-learning requires an MLA to be trained and tested using data before feeding the output prediction into another MLA.
Our research focuses on predicting wild blueberry yield using EMLA with various methodologies and computer simulation data. Our main goal was to develop a set of MLA using the stacking and cascading approaches to improve prediction accuracy and reduce errors. The main contributions of this study are summarized as follows.
(1) We tuned the MLA hyperparameters by Bayesian optimization to obtain the highest accuracy for each prediction model.
The remainder of this paper is structured as follows: Section 2 discusses comparable efforts. Section 3 discusses the data used and the data preprocessing procedure, as well as the benefits of the different feature selection strategies. Section 4 introduces the two proposed models and their implementation; the association between the provided attributes and the predictions was identified using Shapley additive exPlanations (SHAP). Section 5 presents the MLA used in the two proposed models. Section 6 describes the methods for evaluating the performance of each MLA, including an explanation of the benefits of Bayesian optimization. Section 7 evaluates the performance of many ensemble techniques, including CR and SR, and compares the experimental results of the two proposed models with those of other algorithms. Section 8 outlines the findings, the research gap, and the study's limitations. Finally, Section 9 concludes the study and discusses future work.

II. RELATED WORKS
In developing countries, crop yields are typically estimated manually. However, other countries have focused on using emerging technologies, including MLA and deep learning algorithms, to predict crop yields.
Time-series data on rice yield paired with meteorological characteristics were collected over 30 years on India's west coast, and four MLA were used to predict rice yield. Overall, the least absolute shrinkage and selection operator (LASSO) achieved the best result [25].
In the hybrid multiple linear regression (MLR)-artificial neural network (ANN) model [26], rather than arbitrary weight and bias initialization, the input-layer weights and biases are initialized using MLR coefficients and bias. The generated agricultural data included the planting area, water system region, fertilizer consumption, and water system points of interest, while the climatological characteristics in the dataset included precipitation, highest and lowest temperatures, and solar radiation. MLR-ANN was applied to these datasets to precisely predict paddy crop yield.
Five MLA, namely, linear regression (LR), LASSO, XGBoost, LGBM, and random forest (RF), were combined using an optimized weighted ensemble to predict corn yield data, including crop management, climate, soil conditions, and historical yields. The five MLA combined with an optimized weighted ensemble outperformed existing MLA [12].
Shahhosseini et al. [27] presented a new ensemble of convolutional neural networks (CNNs) and deep neural networks to predict corn yields, using a dataset containing planting dates, weather data, soil characteristics, and historical corn yields.
Cereal yield was predicted using MLA trained on a dataset combining weather data, remote sensing data, climate data, satellite data, and historical yields [28]; these four sources were merged into a single dataset. One linear MLA and three nonlinear MLA were used, and XGBoost outperformed the other MLA.
Linear and nonlinear models were used to estimate soybean and corn yields from Landsat and SPOT images. Two models, a neural network and MLR, were examined [29], and the findings revealed that when all images were included, MLR produced a strong fit and exceeded the neural network in accuracy.
Khaki et al. [30] proposed YieldNet, a contemporary CNN model that applies transfer learning between corn and soybean yield prediction by sharing backbone weights, including the feature extractor, using a new deep learning method. The dataset was created using remote sensing data, allowing continuous crop monitoring throughout the growth cycle.
Other datasets have been created using a wild blueberry pollination simulation model. This information comprises four species of bees, their clone sizes (CSs), and weather data. Four MLA were used to predict wild blueberry yields, and XGBoost achieved the best performance across all evaluation measures [17]. Many published research papers have been reviewed and their findings summarized; few studies have used simulation data to predict crop yield [17], [31]. Over 5 years, the Agricultural Production Systems Simulator provided maize yield, soil, and N fertilizer data for the US Midwest [31].

A. EXTRACTED DATA
The simulation model was used to study the influence of wild blueberry placement, bee species and their densities, and weather fluctuations on the yield. The dataset comprised data from an open-source computer simulation model and meteorological information collected from Maine, USA, and the Canadian Maritimes over the last 30 years [16]. Meteorological data for the Bangor, Maine, USA area were collected between 2015 and 2019, and the five-year data were averaged to serve as feature inputs. Obsie et al. [17] deliberately varied the weather data to 125 percent and 75 percent of the average daily temperature and rainy-day data from May 1 to June 30 over the past five years. These factors are the main targets obtained from the model simulations, and the number of levels for each factor was specified. The characteristics generated by the simulation trials included the following: CS; HB, BB, AD, and OS densities; daily air temperature; and daily precipitation. CS ranges from 10 to 40 m², and different density ranges were provided for the different bee taxa. The levels of (i.e., the number of sample points within) each factor were computed as follows: six for CS; seven for HB density; ten for BB density; twelve each for AD and OS densities; and three for the weather data. Obsie et al. [17] performed three distinct random Latin hypercube samplings (RLHS) [33] with various boundary values to provide excellent space-filling qualities, and RLHS helped reduce the number of simulations: the full design would have required 77,700 simulations, whereas applying RLHS reduced this to 777.
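Random Latin hypercube sampling stratifies each input range into n equal intervals, draws exactly one point per interval, and then shuffles the strata independently in every dimension, giving good space-filling with few runs. The following is a minimal numpy sketch of the idea (the bounds below are illustrative, covering only clone size and one normalized bee density, not the study's full seven-factor design):

```python
import numpy as np

def latin_hypercube(n_samples, bounds, rng=None):
    """Random Latin hypercube sample: each dimension is split into
    n_samples equal strata and exactly one point is drawn per stratum,
    then the strata are shuffled independently per dimension."""
    rng = np.random.default_rng(rng)
    d = len(bounds)
    # one uniform point inside each of the n strata, per dimension
    u = (rng.random((n_samples, d)) + np.arange(n_samples)[:, None]) / n_samples
    for j in range(d):
        u[:, j] = rng.permutation(u[:, j])
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    return lo + u * (hi - lo)

# e.g. clone size 10-40 m^2 and a normalized bee density (illustrative bounds)
sample = latin_hypercube(777, [(10.0, 40.0), (0.0, 1.0)], rng=42)
```

Because every stratum of every factor is hit exactly once, 777 runs cover the design space far more evenly than 777 purely random draws would.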

C. FEATURE SELECTION
The main targets of using feature engineering techniques are as follows: (1) removing features that are unrelated to the output target, which is the fundamental goal of feature selection; and (2) improving accuracy. The major contribution of feature selection approaches is the removal of characteristics that significantly degrade performance. M1 represents all features in the dataset, as shown in Figure 1. However, many approaches and methodologies, including filter, wrapper, and embedded methods, should be used to obtain a subset of features that achieves the highest prediction accuracy and speeds up the various MLA. Filter techniques were used in the data preparation process: statistical tests were conducted to identify the correlation between the input features and the anticipated output target, and the VIF was used to detect collinearity and multicollinearity. MLA were used to choose the key features based on the accuracy of the predicted target feature.
Higher VIF values indicate that precisely assessing the contribution of a predictor to the model is difficult or impossible. VIF yielded nine features, represented in M4; SFFS yielded ten features, represented in M2; and SBEFS yielded ten features, represented in M3, as shown in Figure 1. SFFS and SBEFS were used as wrapper techniques, in which an ML approach selects the best characteristics. Embedded techniques combine the benefits of the filter and wrapper methods: XFI, based on XGBoost feature importance, yielded seven features, represented in M5. Obsie et al. [17] used various feature selection techniques, and the optimal features that may improve performance were obtained.
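The VIF can be computed by regressing each feature on all the others: VIF_j = 1/(1 − R²_j), with values above roughly 10 commonly read as strong multicollinearity. A small numpy sketch follows (our own illustration, not the exact implementation used in the study):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        y = X[:, j]
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return np.array(out)
```

Features whose VIF explodes (e.g., a column that is nearly a sum of two others) are exactly the ones this filter flags for removal.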

IV. TWO PROPOSED MODELS
The workflow of the two models implemented is as follows: (1) Important features were selected using the VIF, SFFS, SBEFS, and XFI techniques.
(2) Bayesian optimization was applied to the MLA to tune hyperparameters for boosting the EMLA.
(3) The two proposed models were examined for all features and the different subgroups of features chosen using different feature engineering techniques.
(4) The target was compared and validated by verifying the model performance.

A. STACKING REGRESSION TECHNIQUE
Stacking is the process of training several MLA to forecast the target variable while simultaneously learning how to use the predictions of each MLA to anticipate the target variable's value. Stacking is a popular EMLA technique that employs a hierarchical approach, with the MLA separated into two levels, level-0 and level-1. We performed several iterations over combinations of different MLA using the stacking technique, and the best combination was chosen based on the wild blueberry yield dataset.
LGBM, GBR, and XGBoost each generate predictions separately, and the three new prediction features are added to the dataset. Ridge is then trained on this new dataset and produces the final prediction. LGBM, GBR, and XGBoost sit at level-0, and their output predictions feed the level-1 Ridge, which yields the final output prediction, as shown in Figure 2 and Figure 3. The EMLA should improve the accuracy and reduce the RMSE; the MLA were chosen based on their advantages in solving real-world problems. We evaluated the performance of SR and other techniques according to popular metrics.
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited, and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3181970.

B. CASCADING REGRESSION TECHNIQUE
The cascading technique employs an MLA at each stage, sequentially feeding the next MLA. Each MLA's prediction is the feeder: at each step, the MLA is trained on the dataset with the newly predicted feature while the previously predicted feature is removed. Consequently, at each stage the dataset includes only one new predicted feature. Finally, the last MLA forecasts the outcome, as shown in Figure 4. The cascading combination uses the same MLA as the stacking technique for comparison. The datasets were split into 80% training and 20% test sets, and Bayesian optimization was then applied to each MLA. After these procedures, the cascading technique was applied to the different subsets of features and to all 13 features, as shown in Figure 5. Finally, the performance of the cascading technique and many other techniques was compared.

C. Advantage of SHAP
SHAP [32] is a strategy for converting the black box of MLA into a form that represents the effect of each feature on the wild blueberry prediction; features are ranked by the magnitude of their contribution. SHAP was applied to the 13 features in our study.
The primary benefits of using SHAP are as follows: (1) revealing the relationship between specified characteristics and the prediction;
(2) investigating the factors influencing predictions to obtain more specific information; and (3) interpreting the black box of the MLA.
The technique used XGBoost in SHAP. In Figure 6, red indicates a higher value of each selected feature on its unit scale, whereas blue indicates a lower value. An increase in the RD, MaxUTR, CS, and HB values negatively affects the yield prediction, whereas an increase in the OS, BB, and AD values has a positive impact; the relationship between OS, BB, and AD and the predicted wild blueberry yield is directly proportional.
According to Figure 6, reducing the RD, CS, and HB values increases the predicted wild blueberry yield, indicating an inversely proportional relationship. The x-axis represents the SHAP value (impact on the model output), and the y-axis represents the feature value.
An increase or decrease in the MaxLTR, MinLTR, and AvLT values did not affect the yield prediction. The features are ranked so that those with the greatest impact on the prediction appear first. An increase in the AvUTR and MinUTR values decreases the target output.
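SHAP's attributions are grounded in Shapley values from cooperative game theory: each feature's score is its marginal contribution averaged over all coalitions of the other features. As a self-contained illustration (not the paper's SHAP pipeline; the feature names and contribution numbers below are hypothetical), exact Shapley values can be computed directly from that definition:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: the weighted average marginal contribution
    of each feature over all coalitions. `value` maps a frozenset of
    feature names to the model output for that coalition."""
    n = len(features)
    phi = {}
    for f in features:
        rest = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for S in combinations(rest, k):
                S = frozenset(S)
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(S | {f}) - value(S))
        phi[f] = total
    return phi

# Hypothetical additive value function: BB and OS add yield, RD subtracts.
contrib = {"BB": 300.0, "OS": 150.0, "RD": -120.0}
v = lambda S: sum(contrib[f] for f in S)
print(shapley_values(list(contrib), v))
```

For an additive value function the Shapley value of each feature equals its individual contribution, which is a useful sanity check; real models are non-additive, which is why tools such as SHAP use efficient approximations instead of this exponential enumeration.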

A. BASIC MODELS
In this study, LGBM, GBR, XGBoost, and Ridge were combined using the stacking and cascading techniques.

B. LIGHT GRADIENT BOOSTING MACHINE (LGBM)
LGBM [33] is a boosting ensemble model that builds a potentially effective model by combining linked weak learners.
LGBM enhances the capabilities of gradient-boosted decision tree models by reducing running time and memory usage while maintaining excellent accuracy. While other tree algorithms develop horizontally (level-wise), the LGBM algorithm grows vertically, meaning that it undergoes leaf-wise tree development to minimize loss.
LGBM selects the leaf with the largest expected loss reduction for growth; by expanding the same leaf, the loss can be reduced by more than with a level-wise method.

C. Gradient Boosted Regression
GBR [34] is one of the most popular algorithms among the MLA used in this study. It minimizes bias and variance.

E. Ridge.
Ridge regression [36] is an extension of linear regression that modifies the loss function to reduce model complexity. This is accomplished by introducing a penalty term equal to the square of the coefficient magnitudes: loss = OLS + alpha × Σ(squared coefficient values). The alpha parameter in the loss function must be chosen carefully: a low alpha value can result in overfitting, whereas a high alpha value can result in underfitting.
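Under this loss, ridge has a closed-form solution, β = (XᵀX + αI)⁻¹Xᵀy. A minimal numpy sketch follows (the intercept is left unpenalized and handled by centering; this is our own illustration, not the scikit-learn Ridge used in the study):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution minimizing ||y - Xb||^2 + alpha * ||b||^2,
    with the intercept handled by centering and left unpenalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return beta, intercept
```

With alpha = 0 this reduces to ordinary least squares; increasing alpha shrinks the coefficient norm toward zero, which is exactly the under/overfitting trade-off described above.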

A. BAYESIAN OPTIMIZATION OF HYPERPARAMETERS
Bayesian optimization [37] generates a posterior distribution of functions (a Gaussian process) that best characterizes the to-be-improved function. The posterior distribution improves as the number of observations increases, and the algorithm becomes more convinced of which regions of the parameter space are worth exploring and which are not. The algorithm iterates to balance its exploration and exploitation requirements while considering what it learns about the target function. A Gaussian process is fitted to the known samples (previously examined points) at each step, and the posterior distribution, together with an exploration strategy (such as the upper confidence bound or expected improvement), is used to decide the next point to be sampled.
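The loop described above — fit a Gaussian process to the sampled points, then choose the next point by expected improvement — can be sketched in one dimension as follows. This is a minimal illustration using scikit-learn's GaussianProcessRegressor, not the optimizer actually used in the study, and the objective is a stand-in for a validation-error surface over one hyperparameter:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bayes_opt(f, bounds, n_init=5, n_iter=20, seed=0):
    """Minimal 1-D Bayesian optimization: fit a GP to the points seen so
    far, then evaluate the candidate that maximizes expected improvement."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_init, 1))
    y = np.array([f(x[0]) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=1e-6, normalize_y=True)
    grid = np.linspace(lo, hi, 500).reshape(-1, 1)
    for _ in range(n_iter):
        gp.fit(X, y)
        mu, sigma = gp.predict(grid, return_std=True)
        best = y.min()                      # we are minimizing
        sigma = np.maximum(sigma, 1e-12)
        z = (best - mu) / sigma
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        x_next = grid[np.argmax(ei)]        # exploration/exploitation balance
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))
    return X[np.argmin(y), 0], y.min()

# e.g. tune a single hyperparameter against a toy error surface
x_best, y_best = bayes_opt(lambda x: (x - 0.3) ** 2, bounds=(0.0, 1.0))
```

The expected-improvement term trades off low posterior mean (exploitation) against high posterior uncertainty (exploration), which is why far fewer evaluations are needed than with grid search.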
The key advantage of using Bayesian optimization is achieving the best accuracy by selecting the appropriate hyperparameters for each MLA. Most studies have concluded that Bayesian optimization is faster than grid search [38]; grid search can achieve higher accuracy but takes much longer to find the optimal hyperparameters.
We tuned the hyperparameters of each MLA using Bayesian optimization, as shown in Tables 1-17, with the train-test split function (80% training set, 20% test set) applied for each MLA.

VI. PERFORMANCE EVALUATION METRICS
We examined 17 regression MLA and the stacking and cascading approaches using four evaluation metrics.
(1) R 2 [39] is the coefficient of determination, which is used to determine how much variance in the response variable can be explained by the independent variables.
(2) The difference between the expected and actual values was calculated using RMSE.
(3) The mean absolute error (MAE) is the mean of the absolute differences between the real and predicted yield values.
(4) The relative root mean squared error (RRMSE) was obtained by dividing the RMSE by the real-value average.

RMSE = sqrt( (1/n) Σ (Yi − Xi)² )    (1)

where n is the count of the real yield values, Yi is the real yield value, and Xi is the MLA yield prediction for wild blueberries. R² is calculated as follows:

R² = 1 − Σ (Yi − Xi)² / Σ (Yi − Ȳ)²

where Ȳ is the mean real yield. When the value of R² is close to 1.0, the relationship between the MLA-forecasted yield and the real yield is more linear. An RMSE of zero shows that the MLA output yield corresponds perfectly to the real yield.
The adjusted R-squared (Adj R²) [41] takes into account the number of independent variables used to predict the target variable, allowing us to see whether adding new variables improves the model fit. It is computed as Adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of data points in our dataset and k is the number of independent variables. As a result, if R² does not increase sufficiently with the inclusion of a new independent variable, the value of Adj R² will decline.
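The five quantities above follow directly from their definitions; a small stdlib-only sketch (the function and variable names are ours, not the study's code):

```python
import math

def metrics(y_true, y_pred, k):
    """RMSE, R^2, MAE, RRMSE, and adjusted R^2 as defined above;
    k is the number of independent variables."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rrmse = rmse / mean_y                                  # relative RMSE
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)      # penalizes extra k
    return {"RMSE": rmse, "R2": r2, "MAE": mae,
            "RRMSE": rrmse, "AdjR2": adj_r2}
```

A perfect prediction gives RMSE = MAE = 0 and R² = Adj R² = 1; adding uninformative variables raises k and therefore lowers Adj R² even when R² is unchanged.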

VII. RESULTS
We conducted intensive experiments to determine the best MLA combination using the stacking and cascading techniques, and we evaluated the performance of these two techniques against other traditional and modern MLA. The best hyperparameters were selected using Bayesian optimization to improve accuracy. The results in Figure 7 show that SBEFS was the best feature selection technique: among all datasets, the M3 dataset had the highest Adj R². Adj R² decreases when more irrelevant variables are included in a model and increases when more relevant variables are included; the adjusted value is always smaller than or equal to R².

A. COMPARISON OF THE PERFORMANCE OF 17 MLA, CR, AND SR
Figure 8 displays scatter diagrams of the estimated and actual wild blueberry yields produced by SGD and SR for the M1, M2, M3, M4, and M5 datasets. The x-axis represents the actual yield (kg/ha), and the y-axis represents the predicted yield (kg/ha). Each colored point represents an MLA prediction: the blue points represent the predictions of SGD and the orange points those of SR. SR achieved the highest R² on all five datasets, and its points lie closest to a straight line, which is almost a perfect fit based on the scatter diagram, whereas SGD produced the highest error on all five datasets.
MLA have been combined [42] to gain flexibility and improve accuracy. Researchers have focused on simulated experimental implementation, feature selection approaches, sensitivity analysis to determine the importance of features influencing crop output, and measuring predictive model performance [17]. According to prior research, various factors influence wild blueberry productivity, including bee populations and species, plant diseases, weeds, soil type, clone genetics, climate change, and pesticides [2]. A statistical study revealed that yield is directly related to an increase in bee density in the field [17]. A key challenge in the agriculture industry is insufficient data: the features needed for analysis are often inaccessible, and small datasets are not suitable for deep learning methods. In our study, we focused on improving prediction accuracy through Bayesian optimization and then combining different MLA using the stacking and cascading techniques. Compared with earlier research that used the same dataset and had the same goal, our proposed stacking technique using a combination of LGBM, GBR, XGBoost, and Ridge provided the highest accuracy, as shown in the results table.
SHAP was used to analyze the 13 features of the prediction; SHAP translated the inputs and output prediction to determine the most important features for prediction, and SHAP's graphical representation aided in translating the contribution of each feature to the output target in scientific information about wild blueberry yield. SHAP is used in the modeling process to interpret significant features, achieving good performance with few features.
We investigated the stacking and cascading techniques using a novel combination of four MLA (LGBM, GBR, XGBoost, and Ridge) and 17 MLA for wild blueberry yield prediction. A simulation model of wild blueberry pollination spanning 30 years in Maine, USA, and the Canadian Maritimes yielded an input dataset of 777 records and 13 characteristics involving meteorological data, average blueberry CS, and four distinct kinds of bees and their densities [17]. The best hyperparameters for each MLA were determined using Bayesian optimization. The stacking strategy outperformed the cascading and other techniques on all unique subsets of the dataset, whereas SGD was the worst, producing the maximum RMSE on the M1, M2, M3, M4, and M5 datasets. We also investigated four unique feature selection techniques, which play a vital role in lowering the complexity of the dataset.
After applying Bayesian optimization to the datasets produced by the feature selection approaches, we concluded that with the SBEFS technique, the stacking technique achieved an accuracy of 0.985 R², similar to that obtained with all features. SHAP provided more information about the features that significantly influence wild blueberry yield: an increase in the OS, BB, and AD densities boosted the yield. In general, our analysis found that the stacking strategy achieved the best performance in terms of the examined metrics, namely R², RMSE, and MAE, for the varied feature subsets.

IX. CONCLUSION
In this empirical study, we used the cascading and stacking strategies to create prediction models for wild blueberry yield, utilizing two novel MLA combinations. The features used in our empirical investigation were derived from a simulation model of wild blueberry pollination: the data were generated using a computer model and integrated with weather data collected over 30 years in Maine, USA. To obtain more specific information, SHAP was used to explore how the features affect the XGBoost predictions; SHAP indicates that the MaxLTR, MinLTR, and AvLT features do not influence the wild blueberry yield. The stacking and cascading approaches were applied to various subsets of features. To reduce complexity, feature selection approaches, namely SFFS, SBEFS, VIF, and XFI, were used to select unique features, and Bayesian optimization was used to improve accuracy. After tuning the hyperparameters of the four MLA (LGBM, GBR, XGBoost, and Ridge), the stacking and cascading techniques were used to combine them. Performance evaluation computed R², RMSE, MAE, and RRMSE for SR, CR, and many other techniques on all features and on the different feature subsets, and SR was also compared with another study. The results show that SR outperforms CR, the 17 MLA, and the method in the other study: SR improves on the weak Ridge model by up to 0.187 R², significant progress on the seven features, and exceeds the previous study by 0.046 R². In the future, we will apply EMLA with deep learning techniques to a large wild blueberry yield dataset collected from various sources, with new features absent from our study, such as soil type and weeds.

HAYAM R. SEIREG worked as a teaching assistant at El-Shorouk Academy, holds a master's degree from the Arab Academy for Science, Technology, and Maritime Transport, and is now pursuing a PhD.
Her research is primarily concerned with artificial intelligence, namely multi-agent systems and machine learning.