Do Gridded Weather Datasets Provide High-Quality Data for Agroclimatic Research in Citrus Production in Brazil?

Agrometeorological models are great tools for predicting yields and improving decision-making. High-quality climatic data are essential for using these models. However, most developing countries have low-quality data with low frequency and spatial coverage. In this case, two main options are available: gathering more data in situ, which is expensive, or using gridded data, obtained from several sources. The main objective here was to evaluate the quality of two gridded climatic databases for filling gaps of real weather stations in the context of developing agrometeorological models. Therefore, a comparative analysis of gridded database and INMET data (precipitation and air temperature) was conducted using an agrometeorological model for sweet orange yield estimation. Both gridded databases had high determination and concordance coefficients for maximum and minimum temperatures. However, higher errors and lower confidence coefficients were observed for precipitation data due to their high dispersion. BR-DWGD indicated more accurate results and correlations in all scenarios evaluated in relation to NasaPower, pointing out that BR-DWGD may be better at filling gaps and providing inputs to simulate attainable yield in the Brazilian citrus belt. Nevertheless, due to the BR-DWGD database’s geographical and temporal limitations, NasaPower is still an alternative in some cases. Additionally, when using NasaPower, it is recommended to use a measured precipitation source to improve prediction quality.


Introduction
The latest assessment report of the Intergovernmental Panel on Climate Change [1] indicated significant changes in the concentration of greenhouse gases and global temperature [2]. Changes in air temperature, CO2 concentration, rainfall, and relative humidity alter the rates of chemical reactions in all living organisms, especially plants [3].
Therefore, food production is highly dependent on environmental conditions, resulting from the complex interaction between the components of the soil-plant-atmosphere system. These facts emphasize that agriculture is a high-risk economic activity, requiring correct management strategies to reduce the impacts of extreme weather-related events and optimize the use of natural resources. The increasing temperature for citrus species can
• RQ1: Do gridded data present a high correlation with in situ climatic data, allowing them to serve as a substitute or to fill potential gaps in measured data?
• RQ2: How do gridded data impact simulated sweet orange yield, using in situ data as a baseline for comparison?
The main objectives of this work are (i) to compare the main climatic data inputs options for predicting sweet orange yield in Brazil (which could also be used for other crops and areas); and (ii) to conduct a case study to evaluate the potential impacts of substituting in situ weather station data for gridded data (allowing better spatial and temporal coverage by filling gaps in the available in situ data). Both contributions may have a direct impact on improving the predictions in areas with lower weather station densities. Additionally, the same methodology could be applied to other crops, areas, and countries.
The work was organized into the following sections: Section 2 presents the materials and methods used in the case study; Section 3 describes and discusses the main results obtained in the case study; and Section 4 concludes the work, presenting the final remarks, limitations, and recommendations for future works.

Citrus Yield Prediction: Concepts and Models
A critical aspect of improving the decision-making processes of all links in the citrus supply chain is to predict the fruit volume produced in each season [19]. Because several factors complicate estimating this volume, such as differences in planted areas between years, the technologies and processes used in different regions, and the impact of extreme weather events, it is more beneficial to predict the yield instead of the total volume produced. A better yield prediction would allow farmers to plan their processes better, the industries that produce inputs to better plan their production processes, the processing industries to estimate production and input sourcing, and the distribution agents to plan their logistics [20].
Predicting citrus yield is a challenging task. As perennial crops, they consist of several species and cultivars with different characteristics and resilience towards soil characteristics, pests, diseases, and the impact of extreme weather events, among other factors. Therefore, it is difficult to determine which variables should be used in a yield prediction model. Most studies used correlations of yield data with raw climatic data, mainly precipitation and temperature [21,22]. Additionally, it is crucial to observe that few works in the literature used correlations of processed data and model outputs, such as evapotranspiration and water deficit [20,23].
The use of processed data and climate indices (instead of the raw physical variables such as precipitation and temperature) should generate better prediction models because they capture more information from the original data [20,23]. For example, indices such as the Standardized Precipitation Index (SPI) and the Standardized Precipitation Evapotranspiration Index (SPEI) better capture the potential impact of droughts on crop yield than the use of only temperature and precipitation [24]. The work by Da Silva et al. [25] is an example of using an unsupervised machine learning model to predict sugarcane crop yield in different cities in São Paulo state, Brazil, using the SPI as one input of a prediction model.
One interesting aspect is that only some studies in the literature developed models with sequential equations that result in estimating yield. Several works, such as Ruß et al. [6] and Everingham et al. [8], used machine learning and artificial intelligence models, which are data-driven models that do not depend on sequential equations and extract information from the dataset. However, due to the data-driven nature of this approach, it does not incorporate important knowledge accumulated through decades of experimental studies and in situ crop yield research.
Agrometeorological models, on the other hand, tend to incorporate this knowledge in the equations that constitute the model [18,19,20,23,26]. This allows for better predictions in scenarios with a small volume of data but tends to fine-tune the model for a specific crop, variety, area, and weather pattern [27]. Nevertheless, using agrometeorological models based on sequential equations is the traditional approach for citrus yield prediction [18,19,20,23,26]. Therefore, this approach will be used in this work.
Despite similar objectives, five of the most important yield prediction models utilized in sweet oranges show different outputs, encompassing yield, fruit volume, fruit quality, and water productivity (Table 1).
Table 1. Characteristics of five of the most important yield prediction models for sweet oranges.
Camargo et al. [26] and Martins and Ortolani [18] adapted the Jensen [30] model for the 'Valencia' sweet orange. Using precipitation and temperature data to calculate potential (PET) and actual evapotranspiration (AET), the calculation method chosen by the authors penalized the yield according to the water conditions in critical phenological stages of the crop.
The approach of correlating fruit yield and water deficit using AET and PET as central components of yield estimation is not new for crop yield prediction, especially in regions that experience droughts. An important agrometeorological model, the AEZ-FAO [31], adapted for maize, sugarcane, wheat, and other crops, has evapotranspiration (referred to as ET) in its base equation. Fadel [32] applied the AEZ-FAO model to predict the yield of seven mandarin cultivars, using different indices of sensitivity to water stress for each critical phenological stage.
Considering the water balance effect on yield and seeking to develop a simulation model for growth responses to climate change, Pereira et al. [28] studied the implications of atmospheric concentrations of CO2 and variations in air temperature on water use efficiency. The authors developed the model using 'Natal' sweet orange and obtained a practical model to analyze several potential climate change scenarios. This is essential to improve the quality of decision-making throughout the supply chain and the resilience of citrus production.
Tubiello et al. [29] applied the yield estimation model of 'Valencia' sweet oranges developed by Ben Mechlia and Carroll [19] to predict the potential yield in relevant future climate scenarios for 2030 and 2090. This model estimates the number of fruits and final fruit size, as well as the growth of 'Navel' and 'Valencia' sweet oranges, using as inputs the orchard's age, planting density, previous year yield, and meteorological variables such as temperature, precipitation, and cold and heat-related indices.
Except for the Ben Mechlia and Carroll [19,37] model, all the cited works were elaborated based on data collected from cities and States located in Brazil, involving lower weather station densities than those in developed countries. Using datasets with a broader spatial coverage could result in models for yield prediction in lower-density areas, improving those countries' decision-making regarding citrus supply chains.

In Situ and Gridded Data for Crop Yield Prediction
As observed throughout this work, high-quality input data is essential to provide accurate yield predictions. This is the case for both traditional agrometeorological models based on sequential equations and machine learning models based on complete data-driven methods. This also applies to hybrid models (also called physics-based machine learning models), which are starting to be explored in the literature.
Actual data obtained from correctly calibrated and well-maintained physical weather stations are always the best choice to provide high-quality inputs since they are real-time collected information [11]. Data from simulations and interpolations may contain several errors due to the assumptions used [38].
However, weather stations or in situ data can present gaps in the datasets and potential outliers that may be difficult to detect. Additionally, in many situations in developing countries, it is possible to observe both a lack of spatial coverage (due to the low density of weather stations in some areas) and of temporal coverage (due to station malfunction or stations that may have been only recently installed) [39].
Gridded databases provide an alternative for addressing the lack of spatial and temporal coverage. Additionally, this resource can be used to correct outliers and fill data gaps. Several works in the literature have explored using gridded datasets for crop yield prediction [28,40,41,42]. However, only a few studies, such as Bai et al. [38], Bender and Sentelhas [43], Battisti et al. [44], and Duarte and Sentelhas [11], aimed to compare in situ and gridded data, which is essential for helping the modeler or decision-maker to choose which data sources to use in their crop yield prediction model. In this work, we aimed to address this gap and to provide a methodology for other researchers and practitioners to compare and to choose databases for crop prediction models.
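The gap-filling idea mentioned above can be illustrated with a minimal sketch. It assumes that the station and gridded series are already aligned day by day and that missing station values are marked as `None`; the function name and the temperature values are hypothetical and are not taken from the study data.

```python
from typing import List, Optional, Sequence


def fill_gaps(station: Sequence[Optional[float]],
              gridded: Sequence[float]) -> List[float]:
    """Keep every measured station value and use the gridded value
    of the same day only where the station value is missing (None)."""
    if len(station) != len(gridded):
        raise ValueError("both series must cover the same days")
    return [g if s is None else s for s, g in zip(station, gridded)]


# Hypothetical daily maximum temperatures (degrees Celsius) for five days.
station_tmax = [29.1, None, 30.4, None, 28.7]
gridded_tmax = [28.8, 29.5, 30.0, 29.9, 28.9]
filled_tmax = fill_gaps(station_tmax, gridded_tmax)
```

Measured values are never overwritten in this sketch, which keeps the in situ record as the baseline and limits the gridded product to the days without observations.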
Bai et al. [38] and Duarte and Sentelhas [11] compared NasaPower data with weather station data to simulate maize yield in China and Brazil, respectively. Both authors identified problems in using only the NasaPower data in the model application, demanding other sources to complement NasaPower temperature and precipitation data. Monteiro et al. [12] also analyzed the NasaPower data as input in a sugarcane prediction model. The authors indicated the need for several adjustments, since this database did not provide satisfactory quality for wind speed, relative humidity, and precipitation estimation.
Other databases were also tested for yield modeling in substitution for real weather stations, such as AgMERRA (AgMIP Modern-Era Retrospective Analysis for Research and Applications) [44,45] and BR-DWGD [11,44]. AgMERRA and NasaPower were considered satisfactory in estimating soybean yield in Brazil [44].

Materials and Methods
This work was elaborated in two phases (Figure 1). In Phase 1, called Scenario Evaluation, we generated and evaluated three relevant scenarios: (i) all data (without removing potential outliers); (ii) data without potential outliers; and (iii) all data, but separated into one dataset per state. Those scenarios encompass different traditional approaches for processing the inputs of the agrometeorological model and are essential for better extracting information from the data.
These scenarios also help to answer two critical questions: (i) What is the impact of removing potential outliers from the data? (ii) Should the analysis consider all data available, or should it separate the data considering spatial factors? Both are relevant for Brazil due to the significant differences between weather, soil, and agricultural processes in the different citrus-producing regions.
Phase 2 was called Input Evaluation. It consisted of evaluating and comparing three different options of inputs for the agrometeorological model: (i) in situ weather stations, which are traditionally used (and will be considered the baseline for comparison in this work); (ii) NasaPower gridded data; and (iii) BR-DWGD gridded data. Using gridded data would allow better coverage of the production sites. It is essential to observe that Phase 2 also considered an in-depth exploratory data analysis and an analysis of outliers and gaps in the data for all considered scenarios and inputs.

Study Area
The location selection was based on two relevant parameters for citrus production: (i) volume produced (an indication of the importance of an area); and (ii) physical-related variables, mainly focusing on climate variables. We aimed to encompass Brazil's most important citrus-producing regions while considering areas with different climate patterns. Therefore, for the citrus belt (São Paulo and Minas Gerais states), more than one location was selected within each of the five production regions (north, northwest, center, south, and southwest).
Then, the available years and the elevation above sea level were considered within these regions. When there was no weather station close to the location, another site was selected, which was recurrent for the Northeast of Brazil. This was essential because the baseline for comparison was using INMET weather station data.
We collected data from São Paulo, Minas Gerais, Bahia, and Sergipe states from ten, five, four, and one locations, respectively (Figure 2, Table 2). This distribution of collection sites roughly reflects each state's importance for citrus production. The time interval of data collection was from 1 January 2010 to 12 December 2020, resulting in 10 years of daily observations for climate data. Usable data from 20 stations were obtained, resulting in 73,521 observation points for each climate variable.
The gridded data were obtained from two databases: NasaPower and BR-DWGD [17]. NasaPower data were downloaded from the NasaPower website. The BR-DWGD data were extracted with a Python script from the (.nc) files prepared and provided by the authors.
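Comparing a station with a gridded product requires choosing the grid cell that represents the station. The text does not state how this matching was done; the sketch below assumes a regular latitude/longitude grid and a nearest-cell rule. Both are assumptions, and the coordinates and function names are hypothetical.

```python
from typing import Sequence, Tuple


def nearest_index(axis: Sequence[float], target: float) -> int:
    """Index of the grid coordinate closest to the target coordinate."""
    return min(range(len(axis)), key=lambda i: abs(axis[i] - target))


def nearest_cell(lats: Sequence[float], lons: Sequence[float],
                 station_lat: float, station_lon: float) -> Tuple[int, int]:
    """Row and column of the grid cell whose center is nearest to the station."""
    return nearest_index(lats, station_lat), nearest_index(lons, station_lon)


# Hypothetical grid cell centers and station position (decimal degrees).
grid_lats = [-23.0, -22.9, -22.8]
grid_lons = [-47.2, -47.1, -47.0]
cell = nearest_cell(grid_lats, grid_lons, station_lat=-22.87, station_lon=-47.08)
```

Other matching rules, such as interpolating between neighboring cells, are equally possible and may give different agreement statistics.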
All data were inserted into a single dataframe for processing and knowledge extraction.  Table 2.

Data Processing and Scenario Generation
As previously described (Section 2.1), three relevant scenarios were analyzed in the case study: (i) all available data (without removing potential outliers); (ii) data with outliers removed using the Boxplot method [46]; and (iii) data separated by states (without removing potential outliers).
In scenario (ii), the outliers of each city were identified using the interquartile range (IQR) method, also known as the Boxplot method [46]. The IQR is the difference between the 75th and the 25th percentiles. For maximum and minimum temperatures, a multiple of 3× IQR was used. For precipitation, a multiple of 5× IQR was used, as 3× IQR still flagged values that were not considered outliers. All potential outliers were eliminated from the dataframe in this scenario. The Climpact R package was used to analyze the data, and outliers were eliminated using a Python script.
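The IQR rule described above can be sketched as follows. This is an illustration and not the script used in the study: it assumes the usual Boxplot fences (Q1 − k × IQR and Q3 + k × IQR) and quartiles computed by linear interpolation, and the temperature values are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def iqr_fences(values: Sequence[float], k: float) -> Tuple[float, float]:
    """Lower and upper fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values: Sequence[float], k: float) -> List[float]:
    """Keep only the values that lie inside the fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical daily maximum temperatures (degrees Celsius) for one city.
tmax = [28.0, 29.0, 30.0, 31.0, 32.0, 55.0]
clean_tmax = remove_outliers(tmax, k=3)  # k = 3 for temperature, as in the text
```

For precipitation the same function would be called with `k=5`, following the multiples given above.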
The agrometeorological model developed by Martins and Ortolani [18] was then applied to each scenario. Linear regressions were used to analyze the results and compare the models' outputs using INMET climate data and gridded data (NasaPower and BR-DWGD). Our main objective was to explore the results of the agrometeorological model using two different gridded datasets as inputs compared to the INMET baseline as input.

Data Quality Analysis
After determining the best data processing scenario, a data quality analysis was applied to the gridded data in Phase 2, following the methodology by Duarte and Sentelhas [11]. For each climatic variable (Tmax, Tmin, and P) and for the calculated ETo, a linear regression analysis was conducted between observed data (INMET baseline) and gridded data. Besides daily analysis, the data were aggregated and analyzed monthly and annually in order to better identify trends in the data. All the results were analyzed using the metrics described above. The main Python libraries used in this step were Pandas and Matplotlib (https://matplotlib.org/ accessed on 20 September 2022).
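The indices reported in this work (r, R², d, C, RMSE, ME, and MAE) can be computed as in the sketch below. It assumes the commonly used definitions: the Willmott index for d and the confidence index as C = r × d. The sign convention for ME (estimated minus observed) is also an assumption, and the sample values are hypothetical.

```python
import math
from typing import Dict, Sequence


def agreement_metrics(observed: Sequence[float],
                      estimated: Sequence[float]) -> Dict[str, float]:
    """Error and agreement indices between a baseline and a gridded series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    errors = [e - o for o, e in zip(observed, estimated)]
    sq_err = sum(x * x for x in errors)

    # Pearson correlation coefficient.
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    r = cov / math.sqrt(var_o * var_e)

    # Willmott agreement index (assumed definition of d).
    potential = sum((abs(e - mean_o) + abs(o - mean_o)) ** 2
                    for o, e in zip(observed, estimated))
    d = 1.0 - sq_err / potential

    return {
        "ME": sum(errors) / n,
        "MAE": sum(abs(x) for x in errors) / n,
        "RMSE": math.sqrt(sq_err / n),
        "r": r,
        "R2": r * r,
        "d": d,
        "C": r * d,  # confidence index, assumed to be r times d
    }


# Hypothetical monthly mean temperatures: station baseline and gridded estimate.
station = [20.0, 22.0, 25.0, 27.0]
gridded = [21.0, 23.0, 26.0, 28.0]
metrics = agreement_metrics(station, gridded)
```

In this example the gridded series is perfectly correlated with the baseline but carries a constant bias, so r stays at 1 while d and C drop below 1. This is why the combined index C is more informative than r alone.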

Agrometeorological Model Application
We used the agrometeorological model described by Martins and Ortolani [18] for 'Valencia' sweet orange (Citrus sinensis Osbeck) to simulate the relation between attainable yield and potential yield. As already mentioned, this is an important and thoroughly validated model to predict citrus yield in Brazil, notably in the Southwest region.
First, for each location and year, the potential (PET) and actual evapotranspiration (AET) were calculated by the Thornthwaite and Mather [49] method considering a soil moisture storage capacity of 100 mm, a commonly used value [18,26]. The WaterbalANce app [50] was used to calculate PET and AET.
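To make the calculation concrete, the sketch below implements a simplified monthly Thornthwaite PET and a Thornthwaite-Mather style water balance. It is not the WaterbalANce implementation. The assumptions are: the heat index is computed from the same 12 months being simulated, mean daylight hours are supplied by the user, the soil starts at full storage, and the input values are hypothetical.

```python
import math
from typing import List, Sequence


def thornthwaite_pet(temp_c: Sequence[float], daylight_h: Sequence[float],
                     days: Sequence[int]) -> List[float]:
    """Monthly PET (mm) from 12 monthly mean temperatures (degrees Celsius)."""
    heat_index = sum((t / 5.0) ** 1.514 for t in temp_c if t > 0)
    a = (6.75e-7 * heat_index ** 3 - 7.71e-5 * heat_index ** 2
         + 1.792e-2 * heat_index + 0.49239)
    pet = []
    for t, hours, n_days in zip(temp_c, daylight_h, days):
        if t <= 0:
            base = 0.0
        elif t < 26.5:
            base = 16.0 * (10.0 * t / heat_index) ** a
        else:  # high-temperature polynomial used for T >= 26.5
            base = -415.85 + 32.24 * t - 0.43 * t ** 2
        # Adjust the 30-day, 12-hour standard month to the actual month.
        pet.append(base * (hours / 12.0) * (n_days / 30.0))
    return pet


def water_balance_aet(precip: Sequence[float], pet: Sequence[float],
                      capacity: float = 100.0) -> List[float]:
    """Monthly AET (mm) for a soil with the given storage capacity (mm)."""
    storage = capacity  # assumption: the soil starts at full storage
    neg_acc = 0.0       # accumulated potential water loss (<= 0)
    aet = []
    for p, e in zip(precip, pet):
        diff = p - e
        if diff < 0:    # dry month: the soil releases water exponentially
            neg_acc += diff
            new_storage = capacity * math.exp(neg_acc / capacity)
            aet.append(p + (storage - new_storage))
        else:           # wet month: demand is met and the soil recharges
            new_storage = min(storage + diff, capacity)
            neg_acc = (0.0 if new_storage >= capacity
                       else capacity * math.log(new_storage / capacity))
            aet.append(e)
        storage = new_storage
    return aet


# Hypothetical year: constant 20 degrees Celsius, 12 h days, 30-day months.
pet_mm = thornthwaite_pet([20.0] * 12, [12.0] * 12, [30] * 12)
rain_mm = [150.0] * 6 + [10.0] * 6  # wet half-year followed by a dry one
aet_mm = water_balance_aet(rain_mm, pet_mm, capacity=100.0)
```

In the wet months AET equals PET, while in the dry months AET falls below PET as the soil storage is depleted. This is the mechanism through which precipitation errors in a gridded product propagate into the yield estimate.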
The agrometeorological model was then applied for each location, using the second combination of phenological phases tested by the authors, which showed the best performance [18]. All the results were analyzed using the error metrics described in Section 2.3.
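As an illustration of how a water-deficit penalty of this kind works, the sketch below uses the generic multiplicative Jensen form, Yr/Yp = Π (AET_i/PET_i)^λ_i, where i indexes the phenological phases. The phases, evapotranspiration values, and sensitivity exponents λ are hypothetical placeholders, not the coefficients calibrated by Martins and Ortolani [18].

```python
from typing import Sequence


def relative_yield(aet: Sequence[float], pet: Sequence[float],
                   sensitivity: Sequence[float]) -> float:
    """Attainable-to-potential yield ratio (Yr/Yp) for a Jensen-type model."""
    ratio = 1.0
    for aet_i, pet_i, lam_i in zip(aet, pet, sensitivity):
        # Each phase penalizes yield by its relative evapotranspiration,
        # weighted by how sensitive that phase is to water deficit.
        ratio *= (aet_i / pet_i) ** lam_i
    return ratio


# Hypothetical values for three phenological phases (mm and exponents).
aet_phase = [80.0, 45.0, 90.0]
pet_phase = [100.0, 50.0, 90.0]
lambdas = [0.5, 1.0, 0.2]
yr_over_yp = relative_yield(aet_phase, pet_phase, lambdas)
```

A phase without water deficit (AET equal to PET) contributes a factor of 1, so the ratio only falls below 1 when a deficit occurs in a phase with a non-zero exponent.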
Lastly, a spatial analysis of the results was conducted, generating and analyzing evapotranspiration and yield maps for each input data method only in the citrus belt region, considering the mean values of the locations selected within each region. Mean error maps were also elaborated, comparing model outputs with mean values for each citrus belt region for eight harvests from the sweet Orange Crop Forecast (Fundecitrus). Those maps allow a better spatial analysis of the models' results, generating further insights to help decision-making.

Results
The results presented herein were based on scenario (i), in which all regions were considered together, and no potential outliers were removed. This is the most common scenario analyzed in the literature, as in the works by Battisti et al. [44] and Duarte and Sentelhas [11].
Except for Tmin, the BR-DWGD data showed the best performance compared with the INMET data in all time scales (Table 3 and Figure 3). That means that the BR-DWGD data provide higher-quality results than the NasaPower data, considering the agrometeorological model and the specific regions. The higher precision of BR-DWGD data is due to a methodology that interpolates weather station data from INMET and the National Water Agency (Agência Nacional de Águas-ANA), which are more accurate than satellite data. Specifically for the minimum air temperature, the BR-DWGD database showed lower r, d, and C indices. Six of the twenty cities studied here presented statistical indices lower than 0.5 for this variable. Xavier et al. [51] identified that Tmin and wind speed variables have the highest number of days with inhomogeneous data. The authors could not establish a single cause. Some potential causes were defective instruments and the use of different units [51]. Although they did not find a reason for this specific problem in Tmin, the stations from those six cities probably presented a homogeneity problem, influencing the BR-DWGD calculation method. This resulted in better results for the NasaPower database in the final dataset.
The differences between INMET and NasaPower databases are probably related to several factors, such as sensor resolutions, pixel size from the satellites, or even geographical differences between satellite records and weather station measurements [38]. A large dispersion was observed for P daily data (Figure 3c), especially when using NasaPower, resulting in the worst R² (0.15), d (0.57), and C (0.22).
On the other hand, the P daily data provided by the BR-DWGD database (Figure 3f) presented better performance and indices (R² = 0.70, d = 0.90, and C = 0.76). This agrees with several previous research articles, such as Monteiro et al. [12] and Van Wart et al. [13], who observed that P values estimated by NasaPower always showed the worst correlation with measured data. These results occur due to the difficulty of estimating light and extreme precipitations and avoiding false positives for precipitation clouds in simulation methods [52]. Our findings agree with other works which evaluated the quality of weather data for different modeling applications [11,12,38,43,44,53]. A possible explanation for the errors is the topographic influence in temperature estimation, as White et al. [53] observed in mountainous regions.
The aggregation of P data on monthly and annual scales increased the correlation indices for both gridded databases due to the reduction of data dispersion (Table 3). It is important to emphasize that the primary variable considered by the model is AET, which is highly affected by soil water balance and precipitation. The PET and AET calculated in the model by Martins and Ortolani [18] applied the Thornthwaite and Mather [49] method, which uses monthly data as an input.
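The monthly aggregation step can be sketched as below: daily precipitation is summed per calendar month before being passed to a monthly water balance. The function name and rainfall values are hypothetical. For temperature, a monthly mean would be used instead of a sum.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple


def monthly_totals(daily: Iterable[Tuple[date, float]]) -> Dict[Tuple[int, int], float]:
    """Sum daily precipitation (mm) into (year, month) totals."""
    totals: Dict[Tuple[int, int], float] = defaultdict(float)
    for day, precip_mm in daily:
        totals[(day.year, day.month)] += precip_mm
    return dict(totals)


# Hypothetical record: 2 mm per day in January 2010 and 1 mm per day in February.
start = date(2010, 1, 1)
record = []
for offset in range(59):  # 31 days of January plus 28 days of February
    day = start + timedelta(days=offset)
    record.append((day, 2.0 if day.month == 1 else 1.0))

totals = monthly_totals(record)
```

Summing over a month averages out day-to-day timing errors in the gridded precipitation, which is consistent with the higher monthly and annual indices reported in Table 3.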
The PET and AET estimated by the BR-DWGD database (r = 0.92) presented better results than the NasaPower database (r = 0.85) compared to INMET evapotranspiration values (Figure 4). Additionally, those values were significantly lower than the original daily data. The variations between PET (125 mm maximum) and AET (90 mm maximum) are the main factors responsible for the relations between attainable yield (Yr) and potential yield (Yp) (Figure 4). It is possible to observe an underestimation of PET and AET when using a gridded database compared to the baseline in all citrus belt regions (Figure 4). This is probably due to those databases' temperature and precipitation estimation errors [44]. Therefore, as expected, it is essential to use high-quality data since it will directly affect PET and AET determination.
In the present study, the BR-DWGD database presented better results regarding precipitation and, consequently, AET, which was observed in other applications [44]. An alternative to using the NasaPower database is using other sources such as the ANA database to substitute the precipitation data [12]. Duarte and Sentelhas [11] obtained better results for maize yield simulation using NasaPower and ANA precipitation data rather than only NasaPower data. This could be a strategy to improve the quality of the results observed in this work.
In order to answer our first question, i.e., if gridded data were good enough for filling gaps or substituting measured data, the BR-DWGD database presented significant correlation indices with the INMET baseline. It could be used to fill data gaps or even for locations with little data in Brazil since the database is limited to this country. However, this gridded database includes only data from 1961 to 2020, making it necessary to use NasaPower for more recent years. Additionally, the NasaPower interface is easier to use for non-programmers since the BR-DWGD database requires more advanced knowledge in this field.
Table 4 illustrates the comparison results using the NasaPower, BR-DWGD, and INMET data as inputs for the Martins and Ortolani [18] agrometeorological model. It contains the error metrics and correlation results for the four scenarios. Due to the high dispersion observed in P values when using the NasaPower database and the better temperature data correlations with INMET, the BR-DWGD database resulted in a better yield estimation using the agrometeorological model.
Legend: the scenario with the best performance for data quality analysis is highlighted. SP-São Paulo; MG-Minas Gerais; BA-Bahia; RMSE-root mean square error; ME-mean error; MAE-mean absolute error; d-agreement index; r-Pearson coefficient; R²-coefficient of determination; C-confidence index.
Regarding the relation between attainable yield (Yr) and potential yield (Yp) outputs considering the NasaPower and BR-DWGD inputs, besides the input sources, the methodology for processing the input data is also relevant (Figure 5). Herein, the best scenario for data quality analysis was using all locations together and not removing the outliers identified in the analysis (Table 4).
When removing the outliers (scenario (ii)), the errors were reduced, especially RMSE, which is closely related to outliers. However, as the main outliers identified were from P data as extreme values, removing them resulted in underestimating attainable yield (Yr), reducing the Yr/Yp relation. It is vital to observe that estimating extreme values is a problem in satellite precipitation estimation by algorithms [54]. Different remotely sensed products for P estimation show substantial differences in representing P extremes [54,55]. Overall, there is always a tendency to miss a significant P volume when using those algorithms [54].
When applying the agrometeorological model separately for each State (scenario (iii)) using gridded data, there was a reduction in the quality of results (R² < 0.45 and C < 0.51 for NasaPower and R² < 0.7 and C < 0.75 for BR-DWGD) compared with applying the model for all States simultaneously (R² = 0.69 and C = 0.76 for NasaPower and R² = 0.82 and C = 0.86 for BR-DWGD). The worst results were for Bahia State, probably due to the low density of stations, leading to a smaller quantity of data and reducing the quality of data interpolations.
Figure 6 illustrates the maps of the attainable and potential yields and the mean errors for the agrometeorological model using the different inputs analyzed in this work. First, there was a tendency to overestimate the yield when using the agrometeorological model, inherent to the model itself, as identified by its authors [18]. Our results indicate that this is even more pronounced when using the gridded databases. In Figure 6e,f, it is possible to observe that the peripheral regions of the citrus belt (north and southwest) presented higher errors. This was mainly due to the PET underestimation, which reduced the Yr/Yp relation and increased the errors.
Therefore, to answer our second question, i.e., if gridded data presented similar quality to measured data in simulating yield, the answer is yes. The BR-DWGD database presented significant correlation indices with the INMET baseline. Nevertheless, in specific scenarios, the NasaPower database can also be used as an input source when analyzing recent years or locations outside Brazil.
Using agrometeorological models is essential for agricultural decision-making, and high-quality input data are crucial for satisfactory results [38]. As already discussed, there are different types of climatic databases, each with advantages and flaws [56]. An alternative to obtaining accurate outputs is to use the best sources, which makes data quality analysis an essential step in this process. The methodology used in this work could be adapted for use on other crops, areas, and periods.

Conclusions
High-quality data are essential for agricultural decision-making. One crucial aspect that depends directly on data and processing quality is yield prediction, which is essential for decision-making in citrus supply chains. However, many areas lack climate data, which are primary inputs of the different agrometeorological models.
In this work, we analyzed the potential use of the two gridded databases to fill gaps in historical climate variables series, considering both areas with higher and lower weather station density. An agrometeorological model was used to predict the yield of 'Valencia' sweet orange in different regions in Brazil.
Our results suggest that the BR-DWGD database is better than the NasaPower database at filling gaps and being used as an input to simulate attainable yield in the Brazilian citrus belt. However, due to the geographical and temporal limitations of the BR-DWGD database, NasaPower is still an alternative in some specific cases. Additionally, when using NasaPower, it is recommended to use a measured precipitation source (such as INMET, ANA, or a weather station available on site) for obtaining outcomes with the lowest errors and highest precision and accuracy since the main limitation of this database is poor precipitation simulation.
Despite the low quality of precipitation data from NasaPower, this database is more accessible and easier to use than BR-DWGD. Combining its data with data from other databases may provide better insights for decision-making. Lastly, a data quality analysis, such as the one presented in this work, must be conducted for every yield prediction task.
Alongside the conclusions described above, this study attests to the data quality of gridded databases for citrus yield research in the Brazilian citrus belt region, the second biggest producer of citrus and the biggest producer of sweet orange in the world. We also analyzed the recent update of the BR-DWGD database for agroclimatic research.
The limitations of this study were as follows: (i) only one agrometeorological model was used; (ii) no machine learning yield prediction model was used; and (iii) the model used was based on yield penalization by water deficit, which is very relevant for the Brazilian context. Future works must focus on the following: (i) evaluating more gridded databases; (ii) conducting case studies for other crops, varieties, regions, and countries; (iii) evaluating the use of other agrometeorological and machine learning models; and (iv) evaluating other important model inputs, such as solar radiation. Additionally, it would be interesting to explore the quality of the results when combining multiple gridded databases or using model ensembles.