Learning Bayesian Networks with Heterogeneous Agronomic Data Sets via Mixed-Effect Models and Hierarchical Clustering

Maize, a crucial crop cultivated across vast regions worldwide, especially in sub-Saharan Africa, Asia, and Latin America, occupied 197 million hectares as of 2021. Various statistical and machine learning models, including mixed-effect models, random coefficients models, random forests, and deep learning architectures, have been devised to predict maize yield. These models consider factors such as genotype, environment, genotype-environment interaction, and field management. However, existing models often fall short of fully exploiting the complex network of causal relationships among these factors and the hierarchical structure inherent in agronomic data. This study introduces an innovative approach integrating random effects into Bayesian networks (BNs), leveraging their capacity to model causal and probabilistic relationships through directed acyclic graphs. Rooted in the framework of linear mixed-effects models and tailored for hierarchical data, this novel approach demonstrates enhanced BN learning. Application to a real-world agronomic trial produces a model with improved interpretability, unveiling new causal connections. Notably, the proposed method significantly reduces the error rate in maize yield prediction from 28% to 17%. These results support the use of BNs in constructing practical decision support tools for hierarchical agronomic data, facilitating causal inference.


Introduction
Studies encompassing heterogeneous collections of related data sets (RDs), in which the relationships between the covariates and the outcome of interest may differ (say, in slope or variance; Gelman and Hill, 2007), are widespread in many fields, from clinical trials to environmental science (Spiegelhalter et al., 2004; Qian et al., 2010). Hierarchical (multilevel) models are commonly adopted to pool information across different subsets of the data while accounting for their specific features (Gelman et al., 2014). However, heterogeneity is not the only challenge in fitting a model on such data: the variables involved are typically related by a complex network of causal relationships, making their joint distribution challenging to learn, especially from small data sets.
Bayesian networks (BNs) provide a powerful tool to learn and model highly structured relationships between variables (Green et al., 2003). A BN is a graphical model defined on a set of random variables X = {X_1, ..., X_K} and a directed acyclic graph (DAG) G that describes their relationships: nodes correspond to random variables, and the absence of an arc between two nodes implies conditional independence or the lack of a direct causal effect (Pearl and Mackenzie, 2018). In particular, a variable X_i is independent from all other non-parent variables in G given the set of variables associated with its parents, pa(X_i) (Pearl, 2009). A DAG G then induces the following factorisation:

P(X | G, Θ) = ∏_{i=1}^{K} P(X_i | pa(X_i); Θ_{X_i}),    (1)

where Θ_{X_i} are the parameters of the conditional distribution of X_i | pa(X_i).
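As a toy illustration of this factorisation, consider a three-node discrete BN with DAG A → B ← C; all variable names and probabilities below are made up for illustration only.

```python
# Local distributions of a toy BN with DAG A -> B <- C: P(A), P(C), P(B | A, C).
p_a = {0: 0.6, 1: 0.4}
p_c = {0: 0.7, 1: 0.3}
p_b_given_ac = {            # keys: (a, c) -> conditional distribution of B
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.4, 1: 0.6},
    (1, 1): {0: 0.2, 1: 0.8},
}

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(A=a) * P(C=c) * P(B=b | A=a, C=c), as in (1)."""
    return p_a[a] * p_c[c] * p_b_given_ac[(a, c)][b]

# The product of the local distributions is a proper joint: it sums to one.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)
```

The joint distribution over three binary variables has 7 free parameters, but the factorised form only needs 1 + 1 + 4: this is the dimension reduction that (1) delivers when all parent sets are small.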
In equation (1), the joint multivariate distribution of X is reduced to a collection of univariate conditional probability distributions, the local distributions of the individual nodes X_i. If all sets pa(X_i) are small, (1) is very effective in replacing the high-dimensional estimation of Θ with a collection of low-dimensional estimation problems for the individual Θ_{X_i}. Another consequence of (1) is the existence of the Markov blanket of each node X_i, the set of nodes that makes X_i conditionally independent from the rest of the BN. It comprises the parents, the children and the spouses of X_i, and includes all the knowledge needed to do inference on X_i, from estimation to hypothesis testing to prediction. The process of learning a BN can be divided into two steps. Structure learning aims to find the dependence structure represented by the DAG given the data D; several algorithms are described in the literature for this task. Constraint-based algorithms such as the PC algorithm (Spirtes et al., 2000) use a sequence of independence tests with increasingly large conditioning sets to decide which pairs of variables should be connected by an arc, and then identify arc directions based on the difference in conditional independence patterns between v-structures (of the form X_j → X_i ← X_k, with no arc between X_j and X_k) and other patterns of arcs. Score-based algorithms instead use heuristics (like hill climbing; Russell and Norvig, 2009) or exact methods (as in Cussens, 2012) to optimise a network score that reflects the goodness of fit of candidate DAGs and select an optimal one. Parameter learning provides an estimate of Θ through the parameters in the Θ_{X_i}, conditional on the learned DAG. Structure learning algorithms are distribution-agnostic, but the choice of the conditional independence tests and of the network scores depends on the types of distributions we assume for the X_i.
The three most common choices are discrete BNs, in which the X_i are multinomial random variables; Gaussian BNs (GBNs), in which the X_i are univariate normal random variables linked by linear dependence relationships; and conditional Gaussian BNs (CGBNs), in which the X_i are either multinomial random variables (if discrete) or mixtures of normal random variables (if continuous). Common scores for all these choices are the Bayesian information criterion (BIC; Schwarz, 1978) and the marginal likelihood of G given D (Heckerman and Geiger, 1995). As for the conditional independence tests, we refer the reader to Edwards (2000), which covers various options for all types of BNs. Parameter learning uses maximum-likelihood estimates or Bayesian posterior estimates with non-informative priors for all types of BNs (Koller and Friedman, 2009). All the conditional independence tests, the network scores and the parameter estimators in the literature referenced above can be computed efficiently thanks to (1) because they factorise following the local distributions.
In this work, we learned a BN from agronomic RDs, a task that is related to transfer learning (Pan and Yang, 2010) but that is not widely found in the literature. Transfer learning has mainly focused on applications involving deep learning, with very few publications involving BNs. Notably, recent work by Yan et al. (2023) proposed a structure learning approach based on conditional independence tests for operational adjustments in a flotation process characterised by a small data set. To induce transfer learning, they considered the results of the independence tests performed on variables X_i and X_j in both the source and target data sets, which differed in terms of sample size. Other authors have suggested the use of order-search algorithms to learn BN structures, introducing a structural bias term to facilitate the transfer of information between data sets and achieve more robust networks (Oyen and Lane, 2015). BNs and structural equation models have proven successful in the agronomic sector, optimising various management practices such as phytosanitary treatments (Lu et al., 2020), irrigation management strategies (Ilić et al., 2022) and soil management (Hill et al., 2017), to minimise environmental impact and mitigate climate change. However, in the agronomic literature transfer learning has predominantly focused on crop disease classification using deep learning techniques like convolutional neural networks (Coulibaly et al., 2019; Paymode and Malode, 2022), with little research involving BNs.
We learned the structure and the parameters of a CGBN from a real-world agronomic data set that has a hierarchical structure. In order to account for the high heterogeneity that characterises such data, we developed a novel approach that integrates random effects into the local distributions of the BN, building on Scutari et al. (2022). Random effects are the salient feature of linear mixed-effects models (LME; Pinheiro and Bates, 2000). LME models are hierarchical models that extend the classical linear regression model by adding a second set of coefficients, called "random effects", which are jointly distributed as a multivariate normal. The other coefficients are called "fixed effects". The coefficients associated with the random effects have mean zero, and they naturally represent the deviations of the effects of the parents in individual data sets from their average effects across data sets, which are represented by the fixed effects.
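The partial pooling performed by a random intercept can be illustrated with the standard best linear unbiased predictor (BLUP) of a one-way random-intercept model: the cluster-specific deviation is shrunk toward zero by a factor that grows with the cluster sample size. The numbers below (overall mean, variance components) are hypothetical, not estimates from the paper's data.

```python
# Illustrative variance components for a one-way random-intercept model
# (hypothetical values; in practice these are estimated, e.g. by REML).
mu_hat = 7.0   # overall (fixed-effect) mean yield, t/ha
tau2 = 1.0     # variance of the random intercepts across clusters
sigma2 = 4.0   # residual variance within clusters

def blup_intercept(cluster_mean, n_j):
    """BLUP of the random intercept b_j: the observed cluster deviation
    shrunk toward zero; the shrinkage weakens as n_j grows."""
    shrink = n_j * tau2 / (n_j * tau2 + sigma2)
    return shrink * (cluster_mean - mu_hat)

# A small cluster is shrunk more strongly toward the overall mean
# than a large one observing the same raw deviation of +2 t/ha.
print(blup_intercept(9.0, 4))    # shrinkage factor 4/8 = 0.5 -> 1.0
print(blup_intercept(9.0, 100))  # shrinkage factor 100/104 -> ~1.92
```

This is the sense in which random effects "balance" cluster-specific estimates against the overall trend: clusters with little data borrow strength from the rest.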
Hierarchical estimation in BNs learned from RDs was originally introduced by Azzimonti et al. (2019), who proposed a novel approach to tackle this challenge for discrete BNs using a hierarchical multinomial-Dirichlet model. That approach outperforms a traditional multinomial-Dirichlet model and is competitive with random forests but, as the number of domains increases, the estimation becomes more complex, necessitating the use of approximations such as variational or Markov chain Monte Carlo inference.
The remainder of the paper is structured as follows. In Section 2, we briefly describe the data set (Section 2.1), we introduce the local distributions and the structure learning approach used to learn the BN (Section 2.2), and we describe how we evaluated its performance (Section 2.3). In Section 3, we present and evaluate the BN, and in Section 4 we discuss its performance before suggesting some possible future research directions.

The Data Set: Agronomic Performance of Maize
This study uses the data from Millet et al. (2019), who conducted a genome-wide association study on 256 varieties of maize (Zea mays L.) to evaluate the genetic variability of plant performance and its interactions with environmental variability. The data were collected at experimental sites in France, Germany, Italy, Hungary, Romania, and Chile between 2011 and 2013. After filtering out incomplete observations, the study analysed eight sites, each with a different sample size: Gaillac (France, n = 2437), Nerac (France, n = 1716), Karlsruhe (Germany, n = 2626), Campagnola (Italy, n = 1260), Debrecen (Hungary, n = 2181), Martonvasar (Hungary, n = 1260), Craiova (Romania, n = 1055), and Graneris (Chile, n = 760). The average temperature (°C), the diurnal temperature range, the average relative humidity (%) and the diurnal relative humidity range (%) were recorded at a height of 2 m for each site and year for three different periods (May to June, July to August, and September to October). At the end of the experiment, the phenological variables listed below were measured for each plot at each site:
• The grain yield adjusted to 15% grain moisture, in tons per hectare (t/ha).
• The grain weight of individual grains (mg).
• The plant height from ground level to the base of the flag (highest) leaf (cm).
• The tassel height, plant height including tassel, from ground level to the highest point of the tassel (cm).
• The ear height, ear insertion height, from ground level to the ligule of the highest ear leaf (cm).

Learning Algorithm
We learned the structure of the BN, denoted B LME , following the steps in Algorithm 1.
Data: a data set D, a blacklist and a whitelist.
Result: the DAG G_max that maximises BIC(G_max, D).

1. Run a linear regression on grain yield and extract the residuals ε_i.
2. For each Site × Variety combination, compute the mean and the standard deviation of the ε_i.
3. Perform hierarchical clustering on the means and standard deviations of the residuals from each site-variety combination.
4. Add a new variable with the cluster labels to D.
5. Compute the score of G, S_G = BIC(G, D), and set S_max = S_G and G_max = G.
6. Hill climbing: repeat as long as S_max increases:
   (a) for each arc addition, deletion or reversal in G_max that results in a DAG:
       i. compute the BIC of the modified DAG G*: S_G* = BIC(G*, D);
       ii. if S_G* > S_max, set S_max = S_G* and G_max = G*.

For the hill-climbing algorithm, we used the implementation in the bnlearn R package (Scutari, 2010) and the BIC score. We provided a list of arcs to be excluded (blacklist) or included (whitelist) by hill climbing to avoid evaluating unrealistic relationships (such as average temperature of July-Aug → average temperature of May-June).
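The hill-climbing step can be sketched in miniature. The following is a self-contained Python analogue with a multinomial BIC for discrete variables, not the bnlearn implementation actually used in the paper (which also supports blacklists, whitelists and conditional Gaussian scores); the toy variables X, Y and Z are illustrative.

```python
import math
from itertools import product

def bic_node(data, node, parents):
    """BIC contribution of one node: log-likelihood of node | parents
    minus 0.5 * log(n) * (number of free parameters)."""
    n = len(data)
    levels = {v: {row[v] for row in data} for v in data[0]}
    counts, parent_counts = {}, {}
    for row in data:
        cfg = tuple(row[p] for p in parents)
        counts[cfg, row[node]] = counts.get((cfg, row[node]), 0) + 1
        parent_counts[cfg] = parent_counts.get(cfg, 0) + 1
    loglik = sum(c * math.log(c / parent_counts[cfg])
                 for (cfg, _), c in counts.items())
    n_params = (len(levels[node]) - 1) * math.prod(len(levels[p]) for p in parents)
    return loglik - 0.5 * math.log(n) * n_params

def bic(data, dag, nodes):
    # The score decomposes over the local distributions, as in (1).
    return sum(bic_node(data, v, [u for u, w in dag if w == v]) for v in nodes)

def is_acyclic(dag, nodes):
    children = {v: [w for u, w in dag if u == v] for v in nodes}
    state = {v: 0 for v in nodes}  # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(v):
        state[v] = 1
        ok = all(state[w] != 1 and (state[w] == 2 or dfs(w)) for w in children[v])
        state[v] = 2
        return ok
    return all(state[v] == 2 or dfs(v) for v in nodes)

def hill_climb(data):
    """Greedily add, delete or reverse single arcs while the BIC improves."""
    nodes = list(data[0])
    dag = set()
    best = bic(data, dag, nodes)
    improved = True
    while improved:
        improved = False
        for u, v in product(nodes, repeat=2):
            if u == v:
                continue
            if (u, v) in dag:      # try deleting or reversing an existing arc
                candidates = [dag - {(u, v)}, (dag - {(u, v)}) | {(v, u)}]
            elif (v, u) in dag:
                continue
            else:                  # try adding a new arc
                candidates = [dag | {(u, v)}]
            for cand in candidates:
                if is_acyclic(cand, nodes):
                    score = bic(data, cand, nodes)
                    if score > best + 1e-9:
                        dag, best, improved = cand, score, True
    return dag

# Toy data: Y is a deterministic copy of X, Z is independent of both.
data = [{"X": i % 2, "Y": i % 2, "Z": (i // 2) % 2} for i in range(100)]
dag = hill_climb(data)
print(dag)  # a single arc between X and Y; the BIC penalty keeps Z isolated
```

The search stops when no single-arc change improves the score, which is why the result can be a local rather than a global maximum, a point that resurfaces in the Discussion.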
Firstly, we regressed the grain yield against all available variables for all combinations of site and variety. We used the mean and variance of the residuals from the regression for each combination of site and variety to cluster them using the agglomerative Ward clustering algorithm (Murtagh and Legendre, 2014) from the stats R package. The resulting discrete variable was added to the data used to identify the RDs.
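The clustering step can also be sketched compactly. Below is a greedy, pure-Python stand-in for Ward agglomeration based on the minimum-variance merge cost (the paper uses the R stats implementation following Murtagh and Legendre, 2014); the (mean, sd) residual summaries are hypothetical.

```python
def ward_cluster(points, n_clusters):
    """Greedy Ward agglomeration: repeatedly merge the pair of clusters
    whose merge least increases the total within-cluster variance."""
    clusters = [([p], list(p)) for p in points]  # (members, centroid)
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (mi, ci), (mj, cj) = clusters[i], clusters[j]
                ni, nj = len(mi), len(mj)
                # Ward merge cost: (ni*nj / (ni+nj)) * squared centroid distance.
                d2 = sum((a - b) ** 2 for a, b in zip(ci, cj))
                cost = ni * nj / (ni + nj) * d2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        ni, nj = len(mi), len(mj)
        merged_centroid = [(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((mi + mj, merged_centroid))
    return [members for members, _ in clusters]

# (mean, sd) of residuals per site-variety combination -- hypothetical values.
summaries = [(0.1, 0.9), (0.2, 1.0), (0.15, 0.95), (2.0, 3.1), (2.1, 3.0)]
print(ward_cluster(summaries, 2))
```

In the paper the inputs are the per-combination residual means and standard deviations, and the resulting cluster labels become the discrete parent variable added to D.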
Following the approach and the notation described in Scutari et al. (2022), we assumed that each RD is generated by a GBN, and that all GBNs share a common underlying network structure but have different parameter values. In order to ensure the partial pooling of information between RDs, the clusters were made a common parent of all phenological variables and incorporated into the local distributions as a random effect. We therefore modelled the local distributions of those variables as linear mixed-effects models using the lme4 R package (Bates et al., 2014):

X_i = μ_i + Π_{X_i} β_i + b_{i,j,0} + Π_{X_i} b_{i,j} + ε_i,    ε_i ~ N(0, σ²_{i,j} I_{n_j}),    (2)

where bold letters indicate matrices. The only exception was grain yield, which also required a model for the variances, implemented using the nlme R package (Heisterkamp et al., 2017) as follows:

X_i = μ_i + Π_{X_i} β_i + b_{i,j,0} + ε_i,    ε_i ~ N(0, Σ_{i,j}).    (3)

In both (2) and (3), the notation is as follows:
• j = 1, ..., J are the clusters identifying the RDs;
• Π_{X_i} is the design matrix associated with the parents of X_i;
• b_{i,j,0} is the random intercept for the jth cluster;
• b_{i,j} are the random slopes for the jth cluster;
• Σ_{i,j} is the n_j × n_j block of Σ_i associated with the jth cluster;
• σ²_{i,j} I_{n_j} is the n_j × n_j matrix arising from the assumption that residuals are homoscedastic within clusters in (2);
• μ_i is the intercept;
• and β_i are the fixed effects.
In (3), we assumed the variance of the residuals to be heteroscedastic and to follow a power function, where ν is the variance covariate and θ_j is the variance-function coefficient, which changes with the level of the common discrete parent.
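The power variance function can be written out explicitly; the sketch below follows the varPower parameterisation used by nlme, Var(ε) = σ² |ν|^(2θ), with illustrative values for σ² and θ_j.

```python
def residual_variance(nu, theta_j, sigma2=1.0):
    """Power-of-covariate variance function, as in nlme's varPower:
    Var(eps) = sigma2 * |nu|^(2 * theta_j), with one exponent theta_j
    per level of the common discrete parent (cluster)."""
    return sigma2 * abs(nu) ** (2 * theta_j)

print(residual_variance(2.0, 1.0))  # variance grows with the covariate: 4.0
print(residual_variance(2.0, 0.0))  # theta_j = 0 recovers homoscedasticity: 1.0
```

Letting θ_j vary across clusters is what allows the grain-yield residual spread to differ between site-variety clusters.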
We modelled the weather variables using only fixed effects for simplicity: We prevented the clusters from being their parent with the blacklist because the resulting arcs are not of interest from an agronomic perspective. From these assumptions, the BN B LME we learned from the data has a global distribution that is a mixture of multivariate normal distributions like a CGBN.

Predictive and Imputation Accuracy
The most important variable in this analysis was grain yield, because it is one of the key quantities used to evaluate an agronomic season. To assess the predictive ability of B_LME, we evaluated the Mean Absolute Percentage Error (MAPE) of:
• the grain yield predictions in the scenarios listed in Table A.1, which are meant to study the potential effect of measuring a reduced set of variables in future years;
• the imputations of the grain yield in each site-variety combination, which we removed in turn and imputed from the rest.
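For clarity, MAPE is the standard definition below (a minimal implementation, assuming strictly positive observed yields); the example values are made up.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error: mean of |actual - predicted| / |actual|.
    Assumes no zero values in `actual` (true of yields measured in t/ha)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Two hypothetical plots: errors of 25% and 10% average to 17.5%.
print(mape([8.0, 10.0], [6.0, 11.0]))  # 0.175
```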
As a term of comparison, we used a CGBN learned from the data (B CGBN ) and compared its performance with that of B LME . We implemented both prediction and imputation using likelihood weighting (Koller and Friedman, 2009;Darwiche, 2009).
To validate the learning strategy in Algorithm 1, we performed 50 replications of hold-out cross-validation where 20% of the site-variety combinations were sampled and set aside to be used as a test set. The remaining 80% were used as a training set to learn B LME and B CGBN . We computed the predictions for each phenological node X i (except for grain yield) from its Markov blanket, and used these predictions to predict the grain yield in turn. We used the kernel densities of the predicted values and the resulting credible intervals with coverage 0.80 to assess the amount of variability in prediction.
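An equal-tailed interval with 0.80 coverage can be read off the predicted values as the 10th and 90th percentiles; this quantile-based sketch is a simple stand-in for the kernel-density intervals used in the paper, and the sample below is illustrative.

```python
import statistics

def credible_interval_80(samples):
    """Equal-tailed 0.80 interval from a sample of predicted yields:
    the 10th and 90th percentiles (first and last deciles)."""
    deciles = statistics.quantiles(samples, n=10)  # 9 cut points
    return deciles[0], deciles[-1]

# Illustrative sample of predicted yields (here just 1..100 for clarity).
lo, hi = credible_interval_80(list(range(1, 101)))
print(lo, hi)
```

Narrower intervals from the predictive densities than from the training data, as reported in Section 3, would show up here as a smaller (hi - lo) spread.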

Results
The complete BNs B_LME and B_CGBN learned from the data are shown in the Supplemental Material; here we show only the subgraph around the grain yield variable for each BN, in Figures 1 and 2. Following Section 2.2, we identified 60 site-variety clusters (only 5 of which contained fewer than 100 observations) and used them as the levels of a discrete variable set to be the parent of the phenological nodes.
The structure of B_LME is more complex than that of B_CGBN: B_LME has 118 arcs compared to the 92 of B_CGBN, and the average Markov blanket size reflects that (17 for B_LME, 12 for B_CGBN). Notably, we discovered more relationships for the phenological nodes, and in particular for the grain yield variable (Table A.2), which had eight more parents than in B_CGBN.

The predictive accuracy for each of the scenarios reported in Table A.1 is shown in Figure 3 for both B_LME and B_CGBN. Overall, B_LME outperformed B_CGBN in terms of MAPE. The exceptions were scenarios 7 to 11 and 19 to 24, where B_CGBN achieved a lower MAPE than B_LME, albeit with a difference of only 0.06. In contrast, when B_LME outperformed B_CGBN, the difference in MAPE was 0.14. This trend was particularly evident in scenarios 27 to 32, in which increasing numbers of weather and phenological variables were made available as predictors. As expected, the scenarios with the lowest MAPE were those that used the Markov blanket (scenario 31) and the parents (scenario 32) of grain yield.

The MAPE for the imputation of the different site-variety combinations is shown in Figure 4. We observe that B_LME and B_CGBN perform similarly for all combinations except those involving the sites of Craiova (numbered 250-500) and Campagnola (numbered 1250-1500), for which B_CGBN has a higher MAPE than B_LME.

The kernel densities of grain yield for the first 15 runs of cross-validation are shown in Figure 5. The densities for the training set exhibit somewhat heavier tails than those for the predicted values. Furthermore, the predictive densities have narrower credible intervals than those from the training set, particularly in the lower tail, and are more often positively skewed. The mean values are nearly identical for both, at approximately 7 t/ha, with a 0.8 credible interval of [4 t/ha, 11 t/ha].
Finally, we employed a step-wise parent elimination algorithm to search for non-significant effects in each of the local distributions of the phenological variables. The BIC values consistently indicated that, within B_LME, the best sets of effects were those selected by our method. The only exception was the variable "tassel height", for which the BIC was lower when the "diurnal RH range July-August" variable was omitted. The same procedure was applied after removing the random effects from the local distributions. In this case, the set of effects selected by our method still yielded the best BIC values. Furthermore, we compared the BIC values of the local distributions with and without the random effects. Generally, the presence of random effects improved the goodness of fit, except for the variables "tassel height" and "silking", which exhibited better BIC values in the absence of random effects. All BIC results are reported in Table A.3 and Table A.4: the BIC values in the first row correspond to the set of parents of the local distribution found with our method, and the other rows report the BIC values found after each parent elimination.

Discussion and Conclusions
In our analysis, we used a Bayesian network (BN) to analyse the results of a multi-site agronomic experiment comprising eight different sites in Europe and Chile. Our goal was to modify the structure learning of the BN to encode the hierarchical structure of the data, thus addressing the violation of the exchangeability assumption that characterises the RDs.
The data set consists of weather variables and phenological variables of maize measured from 2011 to 2013. In our study, we selected certain variables based on their agronomic relevance in addition to the weather variables for temperature and humidity. The latter were measured daily for each site, so we calculated their mean for specific time slices corresponding to the key phenological phases of maize, namely seeding, germination, emergence (May-June); vegetation stage, tasselling, silking, ear formation (July-August); and grain filling, maturation (September-October). The reason for this choice was to capture the effect of each weather variable on the phenological variables. Based on strong prior knowledge, specific arcs were prohibited due to their lack of causal meaning. For example, it is not plausible for a weather variable from a later time slice to affect one in an earlier time slice. We applied the same logical reasoning to the connections between phenological variables that are recorded in different time slices: for instance, the arc from grain yield to silking was prohibited, as it is causally impossible for the grain yield to cause female flowering (silking). Moreover, all arcs that made the cluster variable a child of other variables were prohibited.
In comparing the structures of the two networks, we observed that B_LME exhibited 26 additional arcs compared to B_CGBN, with a marked difference in the case of grain yield, which had 8 more parents. We further assessed the predictive accuracy of the phenological variables in both B_LME and B_CGBN using the Diebold-Mariano test (Diebold and Mariano, 2002). This statistical evaluation allowed us to determine that the improvement in predictive accuracy observed in B_LME was statistically significant for all of the variables (p-value < 0.05, results not shown).
Regarding grain yield, plant height emerged as a new parent: its role as a reliable predictor of maize grain yield is well-documented in the literature (Yin et al., 2011; Pugh et al., 2018). Additionally, its ease of measurement using remote sensing makes it a suitable candidate for predicting maize grain yield (Han et al., 2018; Chu et al., 2018). Supporting evidence comes from the work of Anderson et al. (2019), who studied 280 hybrids grown in 1500 plots using unmanned aerial systems and found a positive correlation between plant height and maize grain yield. Another new parent identified in the analysis is silking. This finding is also supported by existing evidence, as Malik et al. (2005) demonstrated a significant negative correlation between silking and grain yield. They posited that this negative relationship could be attributed to late female flowering, resulting in a less favourable photoperiod and the low temperatures induced by changing seasons. All the variables related to temperature and relative humidity (RH) are parents of grain yield in B_LME but not in B_CGBN, where only diurnal RH range May-June (RH4), diurnal RH range Sept-Oct (RH6), average temperature May-June (T1), average temperature July-Aug (T2) and diurnal temperature range May-June (T4) were present. This is plausible since environmental conditions are very important for maize growth: for instance, evidence shows that high humidity during flowering promotes maize yield (Butts-Wilmsmeyer et al., 2019). Temperature plays a crucial role in influencing maize yield, particularly during the reproductive phase, where sub-optimal or supra-optimal values can have a significant impact. For instance, temperatures ranging from 33°C to 36°C during the pre- and post-flowering stages can result in a reduction of grain yield by 10% to 45% (Neiff et al., 2016). In a review by Waqas et al.
(2021), the detrimental effects of thermal stress on maize growth were thoroughly examined from both an agronomic and a physiological perspective. They emphasised that high temperatures, especially during the flowering period, can have various adverse consequences on floret number, silk number, and grain development. Furthermore, the process of fertilisation and grain-filling may also be compromised under such conditions. On the other hand, low temperatures below 10°C can also be detrimental, negatively impacting the normal growth process of maize. Such cold temperatures can limit germination, adversely affect root morphology, and decrease the efficiency of photosystem II. These combined factors demonstrate the sensitivity of maize to temperature fluctuations, which can significantly influence its growth and overall productivity.
We applied hierarchical clustering to the mean and variance of the residuals from a simple linear regression of grain yield (selected due to its agronomic relevance) on all available variables, so as to avoid making any assumptions about the possible parents of grain yield. After grouping the residuals by site-variety combination, hierarchical clustering produced 60 relatively balanced clusters: they were included in the data as a discrete variable that was set as a common parent of the phenological variables, in a setup similar to a conditional Gaussian BN as described in Section 2.2. We decided to use the clusters, rather than just the site of origin or the maize variety as individual variables, for two reasons: • When using either the site or the maize variety as a common discrete parent variable, we found the dispersion of residuals in the local distributions, particularly that of grain yield, to be non-homogeneous.
• Combining the site of origin and the maize variety without clustering their combinations produces a variable with approximately 1200 possible values, which would make BN structure learning computationally prohibitive.
As a result, we improved both the computational feasibility and the predictive accuracy of the model. Using clustering as a pre-processing step has been proposed in the literature to find suitable scenarios in risk assessment analysis (Pettet et al., 2017) or to reduce the dimension of the estimation problem by learning the structure of one subgraph for each cluster (Gu and Zhou, 2020). Rodriguez-Sanchez et al. (2022) also proposed a multi-partition clustering that produces a set of categorical variables encoding clusters; each partition represents a distinct clustering solution and was used as a parent, leading to a more interpretable and flexible way of finding clusters.

As we discussed in Section 2.2, we assumed that the local distributions of the phenological variables are linear mixed-effects regression models in order to allow for the partial pooling of information across clusters: this model balances the individual cluster-specific estimates with the overall trend observed in the data, leading to more stable and reliable estimates (Scutari et al., 2022). We assumed different local distributions for grain yield and the weather variables. For grain yield, we introduced a power function to model the variance after observing a skewed residual distribution during the exploratory data analysis. Moreover, we modelled grain yield with a random intercept as the only random effect; in contrast, all other phenological variables have both random slopes and intercepts. For the weather variables, we used a linear regression model containing only fixed effects as the local distribution. We made this decision based on visual inspection, which revealed that the weather variables appeared disconnected from the clusters, implying that their values were independent of both the site of origin and the variety of maize.
These assumptions reduced the prediction error for grain yield from 28% to approximately 17% when its Markov blanket or its parents were used as predictors, as shown in Figure 3. Furthermore, we assessed whether the incorporation of random effects in the structure learning procedure enhanced the local distributions, not just in terms of structure but also in terms of model specification. Our findings indicated that random effects have a favourable impact on both structure learning and model specification. They contribute to a more accurate explanation of the data without introducing undue complexity. However, there were exceptions for "tassel height" and "silking", which were best modelled without random effects. This suggests that the maxima identified with our method might be local maxima rather than global ones. Additionally, it implies that the estimation of their local distributions did not benefit from the partial pooling of information provided by the random effects.
Our findings confirm that a CGBN incorporating mixed-effects models to exploit the hierarchical structure of the data provides better accuracy than a standard CGBN. As a result, hierarchical BNs can serve as an effective decision support system, particularly in domains with inherent hierarchical structures such as the agronomic field (Burchfield et al., 2019;Li et al., 2020).
In future research, we propose expanding the clustering approach to identify specific clusters for each phenological variable, instead of assuming that the clusters identified for grain yield are applicable to all other phenological variables. This refinement would allow for a more detailed analysis and interpretation of the model. Additionally, we could consider a fully Bayesian approach such as the Integrated Nested Laplace Approximation (INLA) for the Bayesian computations, so that expert information could be further assimilated (Rue et al., 2017), enhancing the overall robustness and reliability of the analysis.

Table A.1: Scenarios used to predict the grain yield of maize, each using a different set of variables of the Bayesian network: average temperature May-June (T1), average temperature July-Aug (T2), average temperature Sept-Oct (T3), diurnal temperature range May-June (T4), diurnal temperature range July-Aug (T5), diurnal temperature range Sept-Oct (T6), average RH May-June (RH1), average RH July-Aug (RH2), average RH Sept-Oct (RH3), diurnal RH range May-June (RH4), diurnal RH range July-Aug (RH5), diurnal RH range Sept-Oct (RH6), silking (Si), grain weight (GW), anthesis (An), tassel height (TH), plant height (PH) and ear height (EH).

Table A.2: New relationships found in B_LME. Variables: average temperature May-June (T1), average temperature July-Aug (T2), average temperature Sept-Oct (T3), diurnal temperature range May-June (T4), diurnal temperature range July-Aug (T5), diurnal temperature range Sept-Oct (T6), average RH May-June (RH1), average RH July-Aug (RH2), average RH Sept-Oct (RH3), diurnal RH range May-June (RH4), diurnal RH range July-Aug (RH5), diurnal RH range Sept-Oct (RH6), silking (Si), grain weight (GW), anthesis (An), tassel height (TH), plant height (PH) and ear height (EH).
