The effect of dataset construction and data pre-processing on the eXtreme Gradient Boosting algorithm applied to head rice yield prediction in Australia

Dataset quality heavily impacts the predictive performance of data-driven modelling. This issue can be exacerbated in the prediction of agricultural production due to the complex interactions between the climate, the environment and the way the plant is affected by these conditions during the season. This study aims to create an empirical model to predict Head Rice Yield (HRY), the primary quality metric for rice growers and millers globally. Model development focused on an industry-level dataset made available by SunRice, Australia's most prominent rice trading company. Using the SunRice data, two dataset construction methods were implemented to evaluate the effect of dataset construction and data pre-processing on model accuracy. The first method aggregated meteorological conditions using estimates of phenology, while the second used aggregations based on defined lengths of time. Variations of each construction method were generated to explore the impact of differing numbers of aggregation stages and stage lengths. Each constructed dataset underwent feature selection prior to model training using the XGBoost algorithm with Leave-One-Year-Out Cross-Validation. The time-based construction method proved the most accurate, producing the highest mean model accuracy scores across all pre-processing and model training configurations. The single most accurate model came from the two-week aggregation dataset, which yielded a 125 % increase in Lin's Concordance Correlation Coefficient compared to the worst-performing model produced in this study. Developing a highly accurate model that allows for crop stage knowledge discovery is critical, and the knowledge discovered in this study provides actionable insights to improve the management of future rice crops for HRY.
The developed model demonstrates the potential for SunRice to predict HRY at the receival point to optimise post-harvest handling and milling. When matched to region-specific data, the dataset construction methods explored can be replicated in other rice-growing regions globally.


Introduction
Rice (Oryza sativa) is the staple for 3.8 billion people globally (Courtois et al., 2010; Sirikanchanarak et al., 2018). While eating quality traits strongly influence the market acceptance of rice, physical traits, particularly the presence of broken grains, have the most significant impact on the price rice receives in the market (Buggenhout et al., 2013).
Rice grains are primarily consumed as white rice, especially in the more traditional rice-consuming markets (Mohan et al., 2017). To produce white rice, the harvested paddy rice must be milled to remove the outer hull and bran layers. However, the forces applied during this process can cause lower quality grains to break, significantly impacting milling revenue as broken grains may receive a market discount of 40 % or more relative to whole grains (Lyman et al., 2013). The ratio of whole and broken grains at the end of milling is classified as Head Rice Yield (HRY), calculated as the weight of grains recovered at greater than 75 % of the unbroken length, expressed as a percentage of the original paddy rice sample (Lyman et al., 2013). In the Australian rice industry, HRY is measured for each grower, with a premium or discount applied to their final payment based on their result against the industry average (Kealey and Clampett, 2000). A critical inefficiency of rice post-harvest handling is the inability to measure HRY as rice is delivered to the grain elevator, due to the high moisture of the grain at harvest. Predicting HRY on incoming rice deliveries could enable grain depots to optimise their storage and drying conditions based on quality. It would also remove the need to take samples from each delivery, which must be transported, stored and retained for HRY assessment after drying at the lab. Additionally, an earlier prediction of grower-level HRY could allow rice milling companies to reduce the delay in quality payments to their growers.
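As a concrete illustration of the HRY definition above, a minimal sketch in Python (the function name and gram-based units are our own, not from the source):

```python
def head_rice_yield(whole_grain_weight_g, paddy_sample_weight_g):
    """HRY (%): weight of milled grains recovered at more than 75 % of the
    unbroken grain length, expressed as a percentage of the original paddy
    rice sample weight."""
    return 100.0 * whole_grain_weight_g / paddy_sample_weight_g
```

For example, 55 g of whole grain recovered from a 100 g paddy sample gives an HRY of 55 %.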
Globally, the generation of timely and accurate crop production predictions has become increasingly important due to the impact climate change and increasing food demand have, and will continue to have, on all agricultural systems (Srivastava et al., 2022; Zhu et al., 2022). Across agricultural supply chains, forecasts of crop production benefit farmers and advisors by assisting with crop management decisions (Donohue et al., 2018), as well as industry and agribusinesses by aiding grain marketing and grain handling decisions (Becker-Reshef et al., 2010; Donohue et al., 2018; Wu et al., 2015). These forecasts also inform government-level policymaking, particularly as the global population and increased living standards worldwide drive up food demands (L. Li et al., 2021; Tilman et al., 2011). While these circumstances prove the importance of crop production forecasts, the complexities involved in crop production and the effect of spatial and temporal variation make crop production forecasting challenging (Zhu et al., 2022).
Generally, crop yield forecasting is achieved using either process-based crop models or empirical models. Process-based crop models, such as the Agricultural Production Systems sIMulator (APSIM) (Keating et al., 2003), are formed on known interactions and relationships between a crop, the environment and the management in which they are grown (Li et al., 2021). However, developing these models requires substantial field data to uncover the relationships and help calibrate and validate the models. These models are also limited by the requirement of underlying assumptions on the discovered relationships and are often not well-suited to a new area of interest (Filippi et al., 2019).
Empirical modelling produces a crop yield prediction by training a model on historical data using machine learning algorithms (Yang et al., 2022). Unlike the previously mentioned process-based models, empirical models are data-driven, meaning the algorithm discovers the relationships from the input data rather than having them dictated by the researcher developing the model (Dhaliwal et al., 2022; Olden et al., 2008).
An advantage of empirical models is the ability to handle significant levels of data, which many agribusinesses have been storing for other internal purposes, removing the need to capture data for resource-expensive crop production trials (Delerce et al., 2016). Simultaneously, the data needed to explain variation in crop production is becoming more accessible through the increase in Internet of Things (IoT) systems and sensor networks, interpolation of climate data and the increasing number and improved resolution of remote sensing data providers (Delerce et al., 2016).
While empirical model development has gained popularity, complications can arise in developing the training dataset. Pre-processing the different data sources to construct a training dataset is estimated to take approximately 80 % of the model development effort (Zhang et al., 2003) and can drastically affect model accuracy (Miksovsky et al., 2002). Therefore, to meet the demand for accurate prediction of crop production, datasets must be constructed with expertise in the agricultural domain and application of best practice data pre-processing (Zhang et al., 2003).
This is particularly true in short-cycle annual crops, such as rice. During crop development, the crop's sensitivity to environmental conditions varies between development stages and is directly related to the physiological processes occurring in the plant (Delerce et al., 2016; Yang et al., 2022; Yoshida, 1981). In recent literature, different methods have been used to account for the physiological processes of the crop by aggregating the in-season environmental conditions by crop development stage (Delerce et al., 2016; Yang et al., 2022). In studies where data is collected from a limited number of farms or research stations, precise recording of crop phenology dates is more easily obtained, owing to the dataset's size. Conversely, where a study encompasses a more extensive area, crop phenology dates may need to be estimated through domain expertise. Although this approach may yield less precise phenology records, the dataset will include increased spatial and temporal variation, with potential applications to a broader spectrum of crop development conditions.
The complexity of accounting for crop development is compounded further by the selection of phenological stages to include. Recent studies focusing on predicting wheat yield in China illustrate this variability: across four studies, the number of phenology stages used for environmental data aggregation ranged from one or two stages (Ruan et al., 2022) to three stages (Zhou et al., 2022), four stages (Li et al., 2022) and six stages (Li et al., 2021).
While phenology-based aggregation of agroclimatic factors has proved beneficial to crop yield prediction, other studies have instead elected to aggregate conditions over a nominated duration of time. Month-of-year aggregations have been applied to yield predictions for wheat (Cai et al., 2019; Kern et al., 2018; Vannoppen et al., 2020), rice (Ji et al., 2007), rapeseed, maize and sunflower (Kern et al., 2018). However, the length of time can be notably shorter, as seen in a recent study by Srivastava et al. (2022), where the growing season was split into forty-five one-week aggregations. Alternatively, aggregation periods may be set according to the revisit time of the satellite provider (Wang et al., 2020) or aligned with agronomically or operationally important dates for the farmer (Filippi et al., 2019; Filippi et al., 2020; Han et al., 2022) or the industry (Han et al., 2022).
The most appropriate method to account for the physiological processes of the crop will be dictated by the crop being studied, available data points, dataset size and the target metric for prediction. Consequently, selecting the number of stages and the length of time each stage covers becomes a balance between attempting to capture the intricacy of the cropping system while avoiding dataset complexity and overfitting the model. Compounding the risk of overfitting is the impact that too many features can have on the interpretability of the model.
Unfortunately, there is no best practice method for dataset construction when developing empirical models for predicting crop productivity. To explore the effect of dataset construction on model accuracy, this study will create five dataset variations using phenology-based aggregation and five dataset variations using time-based aggregations. The aggregations produced will be specific to the prediction of head rice yield, using an industry-level dataset provided by the Australian rice industry. The number of stages selected and the length of time in each stage will be adaptations of the previously described studies, allowing comparison across other crop types. This process aims to uncover the method that provides the highest level of accuracy for a dataset of this nature and to reveal the differences in sensitivity and importance of the features included in each aggregation.

Study area
Australian rice production is consolidated within the Riverina and Murray regions of New South Wales, with a small number of crops in the Shepparton region of northern Victoria (Ricegrowers Association of Australia, 2021) (Fig. 1). The area provides the crucial requirements for rice production: adequate temperatures, abundant flat topography, suitable heavy clay soils, and water availability through the Murrumbidgee, Coleambally, and Murray irrigation schemes (Ashton et al., 2016). Rice is grown in a single season in southern Australia, with plantings occurring during spring (October/November) and harvest in autumn (April/May) of the following year (Kealey and Clampett, 2000).
The Australian rice industry is vertically integrated. Growers buy their seed from SunRice, sell and deliver their rice to SunRice (through its subsidiary, Australian Grain Storage (AGS)) and have their rice packaged and marketed by SunRice. For this study, SunRice provided farm-level crop production data detailing the location (irrigation valley and growing region), variety, sowing date, harvest date, harvest moisture, and the head rice yield achieved for each crop. Records were collated from the 2011/12 to the 2020/21 season.
In this study, the dataset was filtered to crop records for the medium grain variety Reiziq, the prominent variety of the Australian rice industry during this period. The subset included 10,907 crop production records, displaying large levels of inter-seasonal production variation. During this period, planted area ranged from 58,918.9 ha in 2012/13 to 2,784.2 ha in 2019/20, while seasonal average HRY ranged from a maximum of 65.4 % in 2019/20 to a minimum industry average of 49.3 % in 2012/13 (Table 1).
HRY variation was also observed intra-seasonally. This was particularly noted in the 2017/18 Riverina rice season, where Reiziq crops averaged 53.8 % HRY (Table 1), substantially lower than the long-term average. However, in that same season, individual crop HRY levels ranged from 4.8 % to 69.2 % (Fig. 2). These inter- and intra-seasonal variations highlight that HRY variation is driven both by environmental factors and by grower management, where in any given year, individual growers can achieve much higher or lower HRY than the seasonal average.

Dataset construction

Meteorological data
Daily climate data was obtained from the Scientific Information for Land Owners (SILO) point database (Jeffrey et al., 2001), available at https://silo.longpaddock.qld.gov.au. Using the interpolated SILO data, weather data was imported for the location of each crop's nearest Australian Grain Storage (AGS) receival site, as the precise location of the crop was unavailable. Fig. 1 illustrates the geographical distribution of the eighteen AGS sites.

Aggregation stage feature engineering
SILO data was split into crop-specific stages using two methods, with the daily climate records aggregated between the start and end date of each stage. Firstly, daily climate records were aggregated using estimates of crop phenology, and secondly, based on defined periods of time. For each SILO variable, the average was calculated, along with the maximum, minimum and average of each variable's three-day and seven-day rolling averages. For the precipitation variable, the number of days of rain (precipitation ≥ 1 mm) and the total precipitation were calculated (Table 2). To explore the applicability of estimated phenology dates to aggregate environmental data, the dataset was replicated with the number of stages ranging from one to five (Fig. 3A).
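As an illustrative sketch, the per-stage feature engineering described above might look as follows in Python; the function and feature names are hypothetical, and the exact set of rolling-average statistics is our reading of the description:

```python
def rolling(xs, w):
    """w-day rolling averages of a daily series."""
    return [sum(xs[i:i + w]) / w for i in range(len(xs) - w + 1)]

def stage_features(daily, precip=None):
    """Aggregate one climate variable over a stage: the stage mean plus the
    max, min and mean of its 3- and 7-day rolling averages; for precipitation,
    also the count of rain days (>= 1 mm) and the stage total."""
    feats = {"mean": sum(daily) / len(daily)}
    for w in (3, 7):
        r = rolling(daily, w)
        feats[f"roll{w}_max"] = max(r)
        feats[f"roll{w}_min"] = min(r)
        feats[f"roll{w}_mean"] = sum(r) / len(r)
    if precip is not None:
        feats["rain_days"] = sum(1 for p in precip if p >= 1.0)
        feats["total_precip"] = sum(precip)
    return feats
```

In the study, this kind of aggregation would be repeated for every SILO variable within every stage of every crop record.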

Time-based crop record aggregation
Time-based aggregation utilised the harvest date provided by SunRice, with stages created back to 16 weeks before harvest (Fig. 3B). Aggregation stages were created using varying lengths of time to explore the impact of stage length and the number of stages included on model accuracy. Utilising the harvest date for stage construction provides a methodology more relevant to grain elevator companies without upstream supply chain involvement: while growers in the Australian rice industry must provide information on sowing dates to SunRice, this is often not the case elsewhere around the globe.
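A minimal sketch of the time-based stage construction, counting back from the harvest date in fixed-length windows (the function name and date handling are assumptions, not from the source):

```python
from datetime import date, timedelta

def stage_windows(harvest, stage_weeks, total_weeks=16):
    """Split the total_weeks before harvest into consecutive stages of
    stage_weeks each, returned as (start, end) date pairs, earliest first."""
    n = total_weeks // stage_weeks
    windows = []
    for i in range(n, 0, -1):
        start = harvest - timedelta(weeks=i * stage_weeks)
        end = harvest - timedelta(weeks=(i - 1) * stage_weeks)
        windows.append((start, end))
    return windows
```

With `stage_weeks` of 8, 4, 2 and 1 this reproduces the TD_2 to TD_5 layouts of two, four, eight and sixteen stages, respectively.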

Model development experimental overview
The model development stages were restricted to a single algorithm to explore the impact of the two distinct dataset construction methods. The following sections explain the model development, model performance assessment and knowledge discovery processes in more detail (Fig. 4). Each stage was executed using the caret package (Kuhn, 2008) within the R software environment.

Erroneous record filter
Season length, calculated as the days between planting and harvest date, was used to filter records with inaccurate sowing dates, which may impact the phenology-based dataset construction. Season length outliers were defined using the interquartile range (IQR), where records with a season length more than 1.5 times the IQR above the third quartile (Q3) or more than 1.5 times the IQR below the first quartile (Q1) were considered outliers.
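The IQR fence described above can be sketched as follows; the quartile interpolation shown (R's default type-7 quantile) is an assumption, as the paper does not state which definition was used:

```python
def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier fences: Q1 - k*IQR and Q3 + k*IQR.
    Quartiles use linear interpolation between order statistics."""
    xs = sorted(values)
    def quantile(p):
        idx = p * (len(xs) - 1)
        lo = int(idx)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (idx - lo)
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr
```

Records whose season length falls outside these fences would then be dropped before phenology-based dataset construction.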

Feature selection
Feature selection is critical to exclude irrelevant features in the dataset, allowing the model to focus on the most applicable features (Ruan et al., 2022). In this study, feature selection was completed in two stages as an adaptation of the work by Ruan et al. (2022). Stage one applied a collinearity filter to the dataset to remove multicollinearity between features based on Pearson's correlation coefficient. To explore the impact of collinearity on the constructed datasets, model development was repeated on duplications of each dataset with accepted collinearity levels of 1 (no filter), 0.95, 0.85 and 0.75. Stage two was completed using recursive feature elimination (RFE). RFE is a greedy algorithm used as a wrapper method to iteratively remove the least important features (Guyon et al., 2002). This process is repeated recursively (based on backward elimination) until the optimal number of features is obtained (Corrales et al., 2022). Used in conjunction with XGBoost, RFE assesses the mean decrease in accuracy to rank feature importance, where features causing the most significant decline in prediction accuracy are deemed most important. In the RFE-XGBoost feature selection, hyperparameters were set to a tree depth of 3 and a minimum child weight of 50 records. RFE was performed using defined feature subsets ranging from 5 to 50 at an interval of 5 features, then 100 features and all features in the dataset. The selected feature subset was based on the XGBoost model trained on the lowest number of features with an RMSE within a five per cent threshold of the best-performing model. This allowed a subset to be selected that included the most important variables while minimising the training dataset's features and, hence, the model's complexity.
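Stage one of the feature selection can be illustrated with a simple greedy Pearson-correlation filter. This sketch keeps earlier-listed features and drops any later feature correlated above the cutoff with one already kept; it approximates, but is not necessarily identical to, caret's behaviour:

```python
def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sx * sy)

def collinearity_filter(features, cutoff=0.95):
    """features: dict of name -> list of values. Greedily keep a feature only
    if its absolute correlation with every already-kept feature is below the
    cutoff."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < cutoff for k in kept):
            kept.append(name)
    return kept
```

A cutoff of 1 corresponds to the study's "no filter" configuration, since no finite correlation reaches it.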

Model training
Model training was conducted using the XGBoost (eXtreme Gradient Boosting) algorithm, which has gained popularity in recent studies targeting crop yield prediction (Dhaliwal et al., 2022; Ghafarian et al., 2022; Obsie et al., 2020; Srivastava et al., 2022). Unlike a random forest algorithm, the XGBoost algorithm sequentially builds decision trees, where each new tree corrects the residual errors of the previous ones to accurately predict the target variable (Dhaliwal et al., 2022). By creating an ensemble of these simple decision trees, the model can capture intricate interactions within the dataset while remaining less prone to outliers through L1 and L2 regularisation (Ghafarian et al., 2022).
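The residual-correcting behaviour of boosting described above can be demonstrated with a deliberately minimal sketch: single-split stumps fitted sequentially to residuals on 1-D data. This omits XGBoost's regularisation terms, second-order gradients and column subsampling, and is purely illustrative:

```python
def boost(xs, ys, n_trees=100, lr=0.5):
    """Toy gradient boosting for squared-error loss on 1-D inputs: each 'tree'
    is a single-split stump fitted to the current residuals, and predictions
    are updated by a learning-rate-scaled step."""
    pred = [sum(ys) / len(ys)] * len(ys)  # start from the mean
    for _ in range(n_trees):
        resid = [y - p for y, p in zip(ys, pred)]
        best = None
        for t in sorted(set(xs)):
            left = [r for x, r in zip(xs, resid) if x <= t]
            right = [r for x, r in zip(xs, resid) if x > t]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, t, lv, rv)
        _, t, lv, rv = best
        pred = [p + lr * (lv if x <= t else rv) for p, x in zip(pred, xs)]
    return pred
```

Each stump shrinks the remaining residuals, so the ensemble converges toward the targets; XGBoost follows the same chained-residual principle with regularised trees.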
A replicated model training workflow was designed to equally assess the dataset-feature subset combinations produced in this study. Adapting the work of Vannoppen and Gobin (2021), each feature subset underwent model training using XGBoost hyperparameter optimisation and validation with Leave-One-Year-Out cross-validation (LOYO-CV) (Filippi et al., 2020). Hyperparameter optimisation was run using a manual search grid with tree depths of 2, 3, 4, and 5 tested against minimum child weights of 1, 10, 30, 50, 100, 250 and 500. This equated to each dataset-collinearity filter subset undergoing 28 different model training experiments.
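The LOYO-CV resampling scheme can be sketched as a simple fold generator over seasons (the record layout is hypothetical):

```python
def leave_one_year_out(records):
    """records: list of (year, record) pairs. Yields one fold per season:
    (held_out_year, train_records, test_records)."""
    years = sorted({year for year, _ in records})
    for held_out in years:
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test
```

Each of the 28 hyperparameter combinations in the grid would be evaluated across every such fold, so no season's records ever appear in both the training and validation sets.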
To further validate the developed models, the top model identified through RFE underwent testing using both Leave-One-Region-Out (LORO-CV) and Leave-One-Region-Year-Out cross-validation (LORYO-CV), adapted from Filippi et al. (2019). The 'region' used for spatial cross-validation is the nearest AGS depot, which groups the Riverina growing region into eighteen geographical zones (Fig. 1).

Accuracy assessment metrics
The prediction accuracy of the trained models was evaluated using Lin's concordance correlation coefficient (LCCC) (Lin, 1989), the coefficient of determination (R²), and the root-mean-square error (RMSE). A two-way analysis of variance (ANOVA) with a post hoc Tukey's honestly significant difference test (Tukey's HSD) (P < 0.05) was completed between construction methods and the aggregations used for each dataset variation to test for significant differences between the mean model evaluation metrics.
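For reference, LCCC and RMSE can be computed directly from their definitions; this sketch uses population (divide-by-n) moments for LCCC, consistent with Lin (1989):

```python
def lin_ccc(obs, pred):
    """Lin's concordance correlation coefficient:
    2*cov(obs, pred) / (var(obs) + var(pred) + (mean(obs) - mean(pred))**2)."""
    n = len(obs)
    mx, my = sum(obs) / n, sum(pred) / n
    sxx = sum((x - mx) ** 2 for x in obs) / n
    syy = sum((y - my) ** 2 for y in pred) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(obs, pred)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)

def rmse(obs, pred):
    """Root-mean-square error between observed and predicted values."""
    return (sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)) ** 0.5
```

Unlike R², LCCC penalises both scatter and systematic bias, equalling 1 only for perfect agreement on the 1:1 line.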

Knowledge discovery
Models trained using the XGBoost algorithm do not immediately explain how the model was trained or which features and values were important. However, additional data mining can answer these key questions. The XGBoost algorithm provides a score of feature importance using the gain value, explaining the relative importance of the features included in a model. Higher gain values imply that a feature was more important to the prediction generated by the model. XGBoost gain was used to explore the trained models in two ways: firstly, to understand which crop stage aggregation had the most importance for each dataset construction process, and secondly, to uncover the most important features in the best model produced from model development.
Stage-based importance was measured by training a separate model on all features in each dataset, with standard 10-fold cross-validation. The gain of each feature was then summed by stage.
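Summing gain by stage reduces to grouping feature-level gain scores by a stage identifier; in this sketch the stage is assumed to be encoded as a prefix of the feature name, which is an illustrative convention rather than the study's actual naming:

```python
def gain_by_stage(feature_gain):
    """feature_gain: dict mapping feature names such as '12W_8W_dtr' to their
    XGBoost gain. The stage is taken as everything before the final '_'
    (a hypothetical naming convention). Returns total gain per stage."""
    totals = {}
    for name, gain in feature_gain.items():
        stage = name.rsplit("_", 1)[0]
        totals[stage] = totals.get(stage, 0.0) + gain
    return totals
```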

Feature selection
Applying collinearity feature selection effectively reduced the number of features in each dataset across the various collinearity levels. Initially, the phenology-based datasets comprised 82 to 418 features, while the time-based datasets contained between 81 and 1221 features (Appendix T1). Notably, applying a collinearity filter greater than or equal to 0.95 led to a reduction of over 40 % in the number of features, a trend consistent across subsequent filter levels. The subsequent application of RFE further narrowed the training features to twenty-five or fewer in all dataset iterations, except for the TD_4 and TD_5 datasets, which started with the highest number of features at 613 and 1221, respectively. Post-feature selection, the TD_5 dataset retained the most features, especially in the 'no filter' and '0.95 collinearity filter' subsets, with forty features (Appendix T1). These findings demonstrate the efficacy of the two-stage feature selection method employed in this study, which successfully reduced the dimensionality of the datasets, thereby streamlining the data for more efficient analysis and interpretation.

Model accuracy
This study involved training individual models using 28 hyperparameter variations across four feature subsets derived from five phenology-based and five time-based datasets, resulting in 1120 unique models. It was observed that the time-based datasets generally exhibited higher model accuracy than the phenology-based datasets. ANOVA with post hoc Tukey's HSD tests showed that, on average, model accuracy was significantly higher in all time-based datasets, with the exception of the TD_1 dataset, as measured by mean LCCC, R², and RMSE (Table 3). Furthermore, in both dataset types, the accuracy metrics of LCCC, RMSE, and R² improved progressively, albeit at a diminishing rate, as more stages were included, with the most notable accuracy gains observed between the single-stage and two-stage datasets (Table 3).
Model accuracy was influenced not only by dataset construction and the number of included stages but also by feature selection and hyperparameter tuning. The range of LCCC across individual feature subsets and hyperparameter tuning results spanned from 0.35 to 0.64 in phenology-based datasets and from 0.32 to 0.68 in time-based datasets. This equates to a 91 % and 113 % increase in LCCC from the worst to the best models in each dataset construction method, respectively. The collinearity × RFE feature selection revealed that the 0.75 collinearity filter typically yielded the lowest LCCC, whereas no collinearity filter and the 0.95 collinearity filter were associated with the highest accuracy. Regarding XGBoost hyperparameters, an increase in tree depth generally enhanced model LCCC for both dataset types. However, in phenology-based datasets, tree depths beyond three had minimal or negative effects on model LCCC. The minimum number of records required for a split also varied in impact, with a lower minimum enhancing performance in time-based datasets and a higher minimum being more effective in phenology-based datasets. These findings underscore the significant role that dataset design, data pre-processing, and model training parameters play in determining the accuracy of the final predictive model.
Tukey's HSD analysis (Table 3) identified the TD_4 and TD_5 datasets as yielding the most accurate models. These datasets also had the highest number of features post-feature selection (Appendix T1), especially with no collinearity filter and a ≥0.95 filter. To examine the impact of dimensionality reduction on these datasets, new models incorporating the top fifteen and top ten features, as identified by RFE at each collinearity level, were trained. In both datasets, LCCC notably improved for these reduced feature subsets compared to the original RFE subsets (Fig. 5a). This improvement was established using a two-way ANOVA, which compared feature subsets and collinearity filters in relation to model LCCC and RMSE. Statistical differences, determined by post hoc Tukey's HSD tests, are indicated by letters: groups sharing a letter are not significantly different (P < 0.05), and 'A' represents the highest mean value. Correspondingly, RMSE typically decreased in these reduced feature subsets, although the degree of reduction varied between the top fifteen and top ten feature subsets (Fig. 5b).
The secondary RFE feature selection revealed that an improved model was trained using the best ten features from TD_4 with a 0.95 collinearity filter. This model achieved an LCCC of 0.72, an R² of 0.55, and an RMSE of 5.94 (Fig. 6), with XGBoost hyperparameters set to a maximum tree depth of three and a minimum of ten records per split. Comparatively, the most accurate model from the original RFE datasets had an LCCC of 0.67, an R² of 0.49 and an RMSE of 6.26 (Fig. 5), representing decreases of 7 % and 11 % in LCCC and R², respectively, and a 5 % increase in RMSE relative to the new model. Against all models in the original RFE results, the new model displayed a 125 % LCCC increase over the worst model and a 32 % LCCC increase over the average. These findings suggest that even with fewer features, time-based datasets can produce significantly higher accuracy in HRY prediction.
(Fig. 5 caption) Results show the maximum LCCC and minimum RMSE achieved across all hyperparameter tuning options. The green columns represent the datasets using the original RFE features. The orange and purple columns represent the fifteen and ten highest-ranked features from the original RFE results, respectively. Columns with different letters differ significantly at P < 0.05 according to two-way ANOVA followed by Tukey's test.

A. Clarke et al.

To further validate the model, spatial and spatial-year cross-validation was conducted, considering the AGS region and the year of rice cultivation. The spatial cross-validation method demonstrated higher model validation accuracy compared to the standard LOYO-CV results. The best algorithm configuration using LORO-CV achieved a validation accuracy of LCCC 0.78, an R² of 0.64, and an RMSE of 5.3, while LORYO-CV yielded an LCCC of 0.77, an R² of 0.63, and an RMSE of 5.3. Notably, in both spatial cross-validation approaches, the most accurate model was trained using a maximum tree depth of five and a minimum of ten records per node. These results further affirm the robustness of the model when validated across different spatial and temporal dimensions.

Feature importance

Dataset construction × aggregation stages
To ascertain the significance of crop development/aggregation stages in both dataset construction methods, XGBoost feature importance, evaluated through gain, was utilised with individual feature scores aggregated by stage. In the single-stage sowing-to-harvest dataset, PD_1, the in-season climatic conditions produced a total gain of 0.55. The remaining 0.45 comprised the variables related to harvest, predominantly harvest moisture (Fig. 7). In the PD_2 dataset, split by the PI date, the total accumulated gain of the in-season climate increased to 0.58. Most accumulated gain within the two in-season stages came from the PI to harvest (PI-HVST) stage. This trend continued across the subsequent datasets, where the highest total gain was observed in the stage immediately before harvest. Splitting the PI to anthesis (PI-FLWR) stage in the PD_5 dataset established that the PI to microspore (PI-MS) stage had a greater accumulated gain relative to the microspore to anthesis (MS-FLWR) stage. Notably, the inclusion of an additional stage increased the total gain compared to the single PI-FLWR stage used in the PD_3 and PD_4 datasets. The PD_5 dataset had the highest accumulated gain of the in-season climate variables, at 0.6, with the grain fill to harvest (GF-HVST) stage accounting for 0.27.
In the first iteration of the time-based construction, TD_1 produced a total gain of 0.59 from the in-season climatic conditions (Fig. 8). Dividing this period into two eight-week stages increased the gain attributed to climatic conditions slightly, to 0.6. Of the two stages, the first eight-week period from sixteen to eight weeks pre-harvest (16W-8W) accounted for 0.43 of the total gain, while the eight weeks pre-harvest to harvest (8W-HVST) stage accounted for just 0.18. This result diverged from the trend of the phenology-based datasets, where the later stages were attributed a higher percentage of accumulated gain. The TD_3 dataset, separated into four four-week stages, highlighted the importance of the twelve to eight weeks pre-harvest (12W-8W) stage. However, the total gain of the sixteen to twelve weeks pre-harvest (16W-12W) and 12W-8W stages showed no improvement over the single 16W-8W stage in TD_2. The TD_4 dataset, constructed as eight two-week stages, produced the highest accumulated gain from the in-season climate conditions across all datasets, at 0.62. Again, the 12W-8W period accounted for the most gain, with the ten to eight weeks pre-harvest (10W-8W) stage accounting for the vast majority of the gain of the two two-week stages. Including the two-week stages also increased the total gain produced by the 16W-12W period, predominantly driven by the fourteen to twelve weeks pre-harvest (14W-12W) stage. The TD_5 dataset, constructed as sixteen one-week stages, illustrated that within the critical 10W-8W period, the ten to nine weeks pre-harvest (10W-9W) stage was the most important. Furthermore, within the 14W-12W period, the thirteen to twelve weeks pre-harvest (13W-12W) stage provided the most gain. However, the total accumulated gain of the in-season climate conditions was slightly lower than in the previous dataset, at 0.61. These findings underscore the impact of dataset design and stage aggregation on the significance and timing of engineered features in predictive modelling.

(Fig. 6 caption) The plot of observed HRY against the predicted HRY using the best ten features from TD_4 with a ≥0.95 collinearity filter applied. The blue points represent each record, the black dashed line represents the 1:1 line (perfect prediction), and the red dashed lines represent the ±10 % RRMSE error lines.

Second stage feature selection model
The analysis of XGBoost Gain for the TD_4 model's best ten features revealed key insights into feature importance. The average grain moisture percentage at harvest (agm_pc) emerged as the most significant feature, contributing about 44 % to the total gain. Notably, five of the other nine features originated from the 10 W-8 W period (Fig. 9), together accounting for approximately 35 % of the total gain, aligning with the accumulated gain observed in the entire dataset's development stage importance analysis (Fig. 8). In terms of individual variables, relative humidity at the time of maximum temperature (rhmaxt) was particularly prominent in the 12 W-10 W and 10 W-8 W stages, marking its significance (Fig. 9). Other variables of note included the diurnal temperature range (dtr) in both the 12 W-10 W and 10 W-8 W stages, the water deficit index (wdi) in the 8 W-6 W stage, and evapotranspiration (evtp) in the 10 W-8 W stage. These findings underscore the vital role of specific climatic variables in the selected stages, underlining their importance in influencing the model's predictive accuracy.
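For illustration, the per-feature and per-stage contributions reported above are fractions of the model's total gain. The following is a minimal sketch of that normalisation, assuming a hypothetical naming convention in which stage-specific features carry an `@stage` suffix (e.g. `rhmaxt@10W-8W`); in practice the raw scores would come from XGBoost's `booster.get_score(importance_type='gain')`.

```python
def gain_fractions(gain):
    """Normalise raw gain scores so contributions sum to 1."""
    total = sum(gain.values())
    return {feat: g / total for feat, g in gain.items()}

def stage_gain(gain):
    """Accumulate fractional gain per aggregation stage.

    Features without an '@stage' suffix (e.g. harvest measurements
    such as agm_pc) are pooled under 'non-stage'.
    """
    acc = {}
    for feat, frac in gain_fractions(gain).items():
        stage = feat.split("@", 1)[1] if "@" in feat else "non-stage"
        acc[stage] = acc.get(stage, 0.0) + frac
    return acc
```

Summing the fractional gain of all features sharing a stage suffix reproduces the accumulated stage gain used in Fig. 8; the feature names and suffix convention here are illustrative only.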

Model performance
Dataset construction using the time-based construction method has a clear and significant advantage in model accuracy compared to the estimated phenology approach. The lift in model accuracy is likely due to the error in phenology estimates caused by using the sowing date as the reference on which the estimates are based. While the sowing date is known to affect the timing of crop phenology, the subsequent management of irrigation flushes, permanent ponding, and the climate will also significantly impact when the PI date will occur (Darbyshire et al., 2019). The date of PI will then determine the timing of the subsequent stages. Therefore, this construction method may have caused errors in the estimated dates of crucial crop and grain development stages across the datasets. This may have led to high amounts of predicted HRY% error in the records where the dates of estimated phenology have the highest error. For example, in the phenology-based datasets, two crops harvested in the same region with identical harvest moisture but planted on different days will experience varying lengths of the final crop development stage. While in some instances, this may be accurate to what occurred in the field, in other cases, the dates of crop development may have been quite similar due to discrepancies in pre-PI irrigation scheduling. A significant difference in days included in the final phenological stage (Fig. 10) would create a variation in the calculated weather features for this stage across the dataset. This is likely to impact the model's HRY prediction substantially, given that feature importance analysis identified this final stage as the most critical for the phenology-based datasets.
While the time-based construction method is arguably a more naïve method of constructing the dataset, it has the benefit of crops harvested on the same date having the same development stage calculations.The improved accuracy in this method may be due to many growers attempting to reach PI within a similar window through irrigation management, irrespective of the sowing date.This is due to the significant impact of the PI date on the crop's paddy yield potential (NSW DPI, 2021).Hence, this construction method may have improved the engineering of weather features to allow more accurate crop record comparisons.The higher total accumulated gain of in-season aggregation stages, relative to the phenology method, suggests that the time-based construction allowed the XGBoost algorithm to find better relationships within the datasets, particularly those driven by in-season climatic conditions.This discovery highlights that HRY prediction using this method of dataset construction and model training could be applied to other rice milling companies internationally without the need for them to have accurate records of sowing date, instead only requiring the delivery date.
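The time-based construction described above can be sketched in a few lines of pandas. This is a simplified illustration only: the variable names (`tmax`, `rain`) and the aggregation choices stand in for the study's full set of engineered climate features, and the stage indexing assumption (stage 0 being the two weeks immediately pre-harvest) is ours.

```python
import pandas as pd

def two_week_stages(daily, harvest_date, weeks_back=16, stage_days=14):
    """Aggregate daily climate records into fixed-length stages counted
    back from the harvest date (a TD_4-style construction)."""
    harvest = pd.Timestamp(harvest_date)
    start = harvest - pd.Timedelta(days=weeks_back * 7)
    window = daily[(daily["date"] > start) & (daily["date"] <= harvest)].copy()
    # Stage 0 covers the two weeks immediately pre-harvest (2W-HVST);
    # stage 7 covers sixteen to fourteen weeks pre-harvest (16W-14W).
    days_before = (harvest - window["date"]).dt.days
    window["stage"] = days_before // stage_days
    return window.groupby("stage").agg(
        tmax_mean=("tmax", "mean"), rain_sum=("rain", "sum")
    )
```

Because only the delivery (harvest) date anchors the stages, two crops harvested on the same day receive identical stage boundaries, which is the property argued above to improve crop record comparability.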

Feature engineering: stage aggregation
Each dataset construction method experienced a steep incline in model accuracy when a second stage was added to the dataset. However, as additional stages were added, average model accuracy remained stable in the phenology-based datasets, while in the time-based datasets it continued to increase, but at a steadier rate. This indicates that model accuracy does not increase linearly as more aggregation stages are added to the dataset. The exact number of stages and stage timing that produces the most accurate prediction will be dictated by crop type and the predicted metric (yield, quality, disease, etc.). However, these results indicate that simply adding more stages will not necessarily increase accuracy and may cause a decrease in some cases. Therefore, using crop-specific domain knowledge to experiment with the inclusion of different stages should lead to the development of a more accurate model for a given dataset.

Feature selection
The results of the collinearity filter indicate that a filter of variables with a correlation of ≥0.75 may be too harsh for the datasets produced in this study. In both dataset types, this is exacerbated when more stages, and hence features, are added, where the subsets produced from this filter yield significantly lower model accuracy than the less harsh filters. This is likely due to the removal of important features that were correlated with other features in the dataset but could have improved model accuracy. In this study, the collinearity filter of variables with a correlation of ≥0.95 provided similar accuracy to the original dataset while also reducing the number of features by up to 40 %. This reduces the time taken for feature selection processing and, subsequently, the complexity of the model, making it easier to interpret and aiding knowledge discovery.
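A minimal sketch of the ≥0.95 collinearity filter is given below. It assumes a greedy keep-first-seen convention (the earlier column of each highly correlated pair is retained); the study does not specify which member of a pair was kept, so this tie-breaking rule is an assumption.

```python
import pandas as pd

def collinearity_filter(X, threshold=0.95):
    """Drop one feature from every pair whose absolute Pearson
    correlation meets the threshold (greedy; keeps the first-seen column)."""
    corr = X.corr().abs()
    drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        if a in drop:
            continue  # already scheduled for removal; skip its pairs
        for b in cols[i + 1:]:
            if b not in drop and corr.loc[a, b] >= threshold:
                drop.add(b)
    return X.drop(columns=sorted(drop))
```

Lowering `threshold` to 0.75 reproduces the harsher filter discussed above, which removes substantially more features and, in these datasets, degraded model accuracy.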
RFE significantly reduced the number of features in all datasets. As the number of stages, and hence features, increased, the average number of features selected using RFE also increased. However, even for TD_4 and TD_5, with the highest number of features, the maximum number of selected features was forty. Following additional feature selection, TD_4 produced an increased model accuracy relative to TD_5. This indicates that a week-long stage may be too short to capture the variation occurring in a given development stage. Given this result, TD_4 was selected as the best method for dataset construction. Similarly, this dataset required fewer features to achieve levels of accuracy comparable to the weekly dataset. Including more features creates a more complex model that risks being prone to over-fitting. The complexity of this model would then increase the difficulty of identifying the features with the highest impact on the prediction and how they relate to crop physiology.
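The core of recursive feature elimination is a simple loop: refit, score, and drop the weakest feature until the target subset size is reached. The sketch below is a minimal illustration; the pluggable `importance_fn` (here, absolute correlation with the target in the test) is a stand-in for the XGBoost gain ranking used in the study.

```python
import pandas as pd

def recursive_feature_elimination(X, y, importance_fn, n_select):
    """Drop the least important feature one at a time until n_select remain.

    importance_fn(X, y) must return a dict mapping each column of X
    to an importance score (higher = more important).
    """
    features = list(X.columns)
    while len(features) > n_select:
        scores = importance_fn(X[features], y)
        worst = min(features, key=lambda f: scores[f])
        features.remove(worst)
    return features
```

Re-scoring inside the loop matters: a feature's importance can rise once a correlated competitor has been removed, which a one-shot ranking would miss.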

Hyperparameter tuning
Hyperparameter tuning had minimal impact on the model accuracy. However, the stability of the models at both low tree depth and with a higher minimum number of records per split provides confidence in the robustness of the model to outliers. The results of spatial cross-validation demonstrate the model's robustness to variations in crop production and environmental interactions across the entire Riverina rice-growing region. However, including records from the same season in the training and validation datasets in both LORO-CV and LOYRO-CV may explain the superior accuracy metrics compared to standard LOYO-CV.
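The leave-one-year-out splitting used for model validation can be sketched as a simple fold generator. This minimal version (plain Python, no scikit-learn dependency) shows the key property: every record from a held-out season leaves the training set together, so no within-season leakage occurs.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_idx, val_idx) folds.

    Each fold holds out every record from a single season, so the model
    is always validated on a year it has never seen during training.
    """
    for held_out in sorted(set(years)):
        val = [i for i, yr in enumerate(years) if yr == held_out]
        train = [i for i, yr in enumerate(years) if yr != held_out]
        yield held_out, train, val
```

By contrast, region-grouped folds (as in LORO-CV) still mix records from the same season across training and validation, which is the leakage pathway suggested above to inflate those accuracy metrics.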

Dataset construction
The aggregation stages with the highest feature importance varied between time-based and phenology-based dataset construction methods. In the five phenology-based datasets, the stage immediately before harvest exhibited the highest XGBoost gain. This stage, especially the period between grain filling and harvest, significantly affects Head Rice Yield (HRY). Increased grain dry-down rates and moisture reabsorption during this phase can cause grain fissures, raising the likelihood of breakage during milling (Clampett et al., 2004; Kealey and Clampett, 2000; Kunze, 1979; Thompson and Mutters, 2006). While some of the variances detected, and hence trees produced by the XGBoost algorithm, may reflect these biological relationships, other variances detected by the algorithm may have been due to the variation in the length of the final stage created by the phenology estimation method. Such artificial variations might have led the algorithm to generate less accurate trees, potentially diminishing the overall accuracy and reliability of the trained models.
In the time-based datasets, stages with the highest accumulated gain typically occurred earlier in the season, contrasting with the phenology-based datasets. For instance, the 10 W-8 W stage showed the highest accumulated gain in the TD_4 dataset, indicating its importance. This stage corresponds to the flowering to grain fill stage as per the phenology calendar by NSW DPI. While the highest accumulated gain occurred closer to harvest in the phenology-based datasets, the stages between the estimated PI and flowering also had relatively high levels of accumulated gain. The flowering to grain fill period coincides with the development of the early kernel through to the mature grain. In this period, the grain is susceptible to grain chalk and fissure development (Cooper et al., 2008; Lanning et al., 2011; Peng et al., 2004). Grain chalk is an opaque region in the kernel caused by loosely packed starch granules (Fitzgerald et al., 2009; Lisle et al., 2000). Rice grains with chalk have a weaker structure, making them more susceptible to the development of fissures as the grain begins to dry down (Buggenhout et al., 2013) and, subsequently, more susceptible to breakage during milling (Indudhara Swamy and Bhattacharya, 1982). These findings highlight the importance of early season stages in determining HRY, especially in time-based dataset constructions.
The average grain moisture percentage at harvest was identified as the most important feature in both phenology and time-based dataset constructions. Grain moisture at harvest significantly affects the grain's resilience and response to environmental stress (Lu et al., 1995; Kealey and Clampett, 2000; Bautista and Siebenmorgen, 2004; Bautista et al., 2009; Nalley et al., 2016). High moisture levels maintain the grain's flexibility during the pre-harvest stage, enabling it to endure environmental stressors that accelerate drying. In contrast, lower moisture levels increase grain fragility, increasing the likelihood of fissures under adverse environmental conditions. The high importance attributed to this feature suggests that the influence of harvest moisture on grain quality might overshadow the effects of earlier climatic conditions, highlighting its crucial role in predictive modelling for grain quality.

Second stage feature selection model
In the TD_4 dataset's best ten features subset model, the average grain moisture percentage at harvest was identified as the most important feature, as in the overall time and phenology-based datasets. Consistent with the overall time-based models (Fig. 8), the 10 W-8 W period stood out as the most significant stage in the final model. This is aligned with the flowering to grain fill period, which determines the susceptibility of the grain to chalk development (Lyman et al., 2013). Key climate features influencing this stage include relative humidity at the time of maximum temperature, diurnal temperature range, water deficit index, and evapotranspiration, all of which are temperature dependent. Elevated temperatures during this time can lead to grain chalk (Ali et al., 2019). However, in the crucial 12-6 weeks pre-harvest stage, these factors may also enhance transpirational cooling, especially in the flooded conditions typical of this growth stage. This cooling effect, shown to reduce panicle temperatures by up to 6.8 °C in Australian conditions (Matsui et al., 2007), could mitigate temperature stress on the grain, thereby reducing the likelihood of chalk development. This interplay of climatic variables underscores their critical influence on grain development and the model's predictive accuracy.

Potential use for grain elevator HRY prediction
The training datasets constructed included data points and features only available up until delivery to the grain elevator. This was selected so that either dataset method could be reconstructed in a live environment where crop production, grain elevator and climate data can be ingested in real time and sent to the trained model. According to Li et al. (2013), model accuracy is considered excellent when the relative RMSE (RRMSE) is less than 10 %, good if between 10 % and 20 %, and fair if between 20 % and 30 %. RRMSE is calculated by dividing the RMSE by the average value of the observed dataset (Despotovic et al., 2016) and provides a more comparable value between models.
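The RRMSE calculation and the Li et al. (2013) categorisation can be expressed directly; the helper names below are ours.

```python
import numpy as np

def rrmse(observed, predicted):
    """Relative RMSE (%): RMSE divided by the mean of the observed values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((observed - predicted) ** 2))
    return 100.0 * rmse / observed.mean()

def accuracy_class(rrmse_pc):
    """Accuracy categorisation following Li et al. (2013)."""
    if rrmse_pc < 10:
        return "excellent"
    if rrmse_pc < 20:
        return "good"
    if rrmse_pc < 30:
        return "fair"
    return "poor"
```

Applying this to the study's reported values (RMSE of 5.94 % HRY against a mean observed HRY of 57.48 %) gives 100 × 5.94 / 57.48 ≈ 10.34 %, just inside the "good" band.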
The TD_4 best ten features model produced an RMSE of 5.94 % HRY from a training dataset with an average observed HRY of 57.48 %. This equates to an RRMSE of 10.34 %, meaning the model falls close to an excellent categorisation based on the characterisation from Li et al. (2013). Inspection of the predicted against observed plot (Fig. 6) shows that most predictions fall within an RRMSE of 10 %, where the predicted error is within ±5.7 % HRY of the observed value. These results indicate that real-time data ingestion and hosting of the model at the grain elevator would produce accurate HRY predictions, providing critical information to improve grain storage and milling decisions to maximise head rice recovery in milling.

Limitations
There are three primary limitations to this research. The first is the absence of irrigation management data and actual records of crop phenology. Access to these records would have allowed the climatic data to be split more accurately into the actual crop development stages in which they occurred, leading to more accurate models. The second limitation was the exclusion of GIS data from the training dataset. Access to GIS information would have allowed the integration of remotely sensed vegetative indices. These satellite-derived variables could be used to gather important information related to early crop development and the rate of grain dry-down. Additionally, the GIS information would allow soil characteristics to be attributed to each record. The final, compounding limitation is the measurement system of the historical HRY records used in the training dataset. Instead of recording HRY on a per-delivery or per-paddock basis, records are categorised by variety through a combination of samples collected from each delivery. As a result, the HRY score stems from a composite of samples potentially aggregated from multiple paddocks, all of which have been sown and harvested at varying times. This aggregation diminishes the model's capacity to capture variations linked to specific crop management and environmental conditions. While the model's accuracy can be deemed 'good' based on the RRMSE classification of Li et al. (2013), the limitations above are likely to have constrained the model from achieving even higher accuracy. In future investigations, employing an enhanced method for estimating crop phenology where actual records are absent, and recording HRY results at least at the paddock level, will likely yield further improvements in model accuracy.

Conclusion and future directions
This study explored phenology and time-based feature engineering methods for climate data aggregation, employing varying levels of crop development stage length and number of stages. Once each dataset was created, a replicated method of feature selection and model training was completed to uncover which dataset construction method and number of stages produced the most accurate model for the prediction of HRY in Australia. This analysis deduced the following conclusions:
1. Time-based dataset construction produced the most accurate models compared to the phenology-based method. Of the five dataset variations produced using this method, the most accurate model was developed using the best ten features from the dataset using eight two-week aggregations (TD_4) of climate variables, working back from the harvest date.
2. Aggregation stage feature importance highlighted that the most important stage in the phenology-based construction method was immediately before harvest. However, it was hypothesised that this might have arisen from the inaccurate estimates of crop phenology dates employed in this study.
3. For the time-based datasets, which produced more accurate models, the most important stage was found to be between eight and ten weeks before harvest, equating to the flowering period of the crop.
This knowledge is critical information for growers, agronomists and breeders to understand when the crop may experience detrimental conditions and how to manage their crop to ensure the critical stages of crop development occur in conditions promoting higher levels of HRY. A noted limitation of this work was the absence of geospatial data points in the constructed dataset. Future work should focus on a dataset with relevant crop-level geospatial data. This would allow the impact of soil parameters and vegetation status across the crop development cycle to be explored using a data-driven methodology.

Fig. 1 .
Fig. 1. Location of the rice growing regions where crop records were collated for this study. Australian Grain Storage (AGS) receival site locations are shown in orange.

2.2.2.1. Phenology-based crop record. Phenology-based aggregation utilised the planting date provided by SunRice, long-term variety trial site data made available by the New South Wales Department of Primary Industries (NSW DPI) and recent results from Ali et al. (2019). The collated crop phenology calendar is as follows: Planting to Panicle Initiation, 72 days; Panicle Initiation to Microspore, 16 days; Microspore to Anthesis, 16 days; Anthesis to Grain Fill, 20 days; and Grain Fill to Maturity, 34 days.

Fig. 2 .
Fig. 2. Summary of head rice yield variation in the Reiziq dataset. The grey boxplots show the distribution of HRY observed inter- and intra-seasonally.

Fig. 3 .
Fig. 3. A) Phenology-based stages included in the five datasets, with stages defined between the record sowing and harvest dates. B) Time-period-based stages included in the datasets, ranging from 16 weeks out to harvest.

Fig. 4 .
Fig. 4. Overall workflow of HRY model development using the phenology and time-based dataset streams with SunRice grower crop records and SILO climate data.Model training was completed with the XGBoost algorithm.

Fig. 5 .
Fig. 5. Impact of additional feature selection on model LCCC (a) and RMSE (b) in the two-week and week-long aggregation datasets across collinearity filter levels. Results show the maximum LCCC and minimum RMSE achieved across all hyperparameter tuning options. The green columns represent the dataset using the original RFE feature set. The orange and purple columns represent the fifteen and ten highest-ranked features from the original RFE results, respectively. Columns with different letters significantly differ at P < 0.05 according to two-way ANOVA followed by Tukey's test.

Fig. 9 .
Fig. 9. Feature importance, represented by XGBoost Gain, of the best ten features according to RFE from the TD_4 dataset with a 0.95 collinearity filter.

Fig. 10 .
Fig. 10. Distribution of days between the estimated grain fill and the average harvest date for each crop. The black dashed line represents the average time of 73 days.

Table 1
Summary of the crop records, production (planted hectares) and head rice yield variation in the Reiziq dataset.

Table 2
List of features included in each dataset, including the crop production records and the feature-engineered climate variables, calculated by each aggregation stage.Calculated meteorological features are shown in italics.

Table 3
Results of the ANOVA with the post hoc Tukey's HSD for the selected model performance metrics.Letters indicate significant differences between datasets (p < 0.05).

Table A1
Features included in each dataset following collinearity filter and RFE application.