Spatial prediction of the concentration of selenium (Se) in grain across part of Amhara Region, Ethiopia

Grain and soil were sampled across a large part of Amhara, Ethiopia in a study motivated by prior evidence of selenium (Se) deficiency in the Region's population. The grain samples (teff, Eragrostis tef, and wheat, Triticum aestivum) were analysed for concentration of Se and the soils were analysed for various properties, including Se concentration measured in different extractants. Predictive models for concentration of Se in the respective grains were developed, and the predicted values, along with observed concentrations in the two grains were represented by a multivariate linear mixed model in which selected covariates, derived from remote sensor observations and a digital elevation model, were included as fixed effects. In all modelling steps the selection of predictors was done using false discovery rate control, to avoid over-fitting, and using an α-investment procedure to maximize the statistical power to detect significant relationships by ordering the tests in a sequence based on scientific understanding of the underlying processes likely to control Se concentration in grain. Cross-validation indicated that uncertainties in the empirical best linear unbiased predictions of the Se concentration in both grains were well-characterized by the prediction error variances obtained from the model. The predictions were displayed as maps, and their uncertainty was characterized by computing the probability that the true concentration of Se in grain would be such that a standard serving would not provide the recommended daily allowance of Se. The spatial variation of grain Se was substantial, concentrations in wheat and teff differed but showed the same broad spatial pattern. Such information could be used to target effective interventions to address Se deficiency, and the general procedure used for mapping could be applied to other micronutrients and crops in similar settings.

summarized below.
If we have a case where two or more variables are to be modelled jointly (e.g. Se concentration in teff grain and of wheat grain, and other spatial variables) then we treat these as a multivariate normal random variable (perhaps after appropriate data transformation). In this case a set of observations of the primary variable, y 1 , and the (m − 1) secondary variables, y T 2 , · · · , y T m observed at n 1 , n 2 , · · · n m locations respectively, is modelled as a length n = n 1 + n 2 + · · · + n m random variate: There are three terms on the right of this equation. The first deals with the fixed effects (environmental covariates in this study). The term X is a design matrix. This has n rows and p = p 1 + p 2 + · · · + p m columns where p i is equal to the number of fixed effects terms in the model for the ith random variate.
In a simple example, if all the variates have one single fixed effect predictor variable in the model then p 1 = p 2 = · · · = p m = 2, and if the jth element in the vector y corresponds to an observation of the ith variate then X [j, 2i] is equal to the corresponding value of the predictor variable, X[j, (2i − 1)] is equal to 1, and all other elements in the jth row of X are zero. The vector τ i contains the fixed effects coefficients for the prediction of the ith variate. In this simple example the two terms in τ i are the intercept and coefficient of a linear regression of y i on the predictor variable in X. The terms η 1 , η 2 , · · · , η m are spatially correlated normal random variables of mean zero, which we assume conform to a LMCR. Let some element of η i , where i ∈ {1, m} be denoted by η i (s) where s is a vector with the coordinates of the observation in space.
Under the LMCR the covariance of any two observations separated spatially by a lag vector h: η i (s), η i (s + h), is assumed to depend only on the lag vector and is given by where there are s ≥ 1 independent additive components in the model and the terms c i,j k are variances and covariances. The function ρ k (h|φ k , κ k ) is a spatial correlation function which gives the correlation for the kth component of the LMCR over lag h. In this study we assume that the correlation function depends on the lag distance, |h|, not the direction (i.e. that it is isotropic), and can be described by the function due to Matérn (Stein, 1999): where φ is a distance parameter, κ is a smoothness parameter and K κ is a modified Bessel function of the second kind of order κ.
Each term ε 1 , ε 2 , · · · , ε m is a zero-mean, independently and identically distributed random variable. Under the assumption of normality, these terms are entirely characterized by their variances and covariance, which we denote c i,k 0 , i, k ∈ {1, m}. These terms are spatially uncorrelated, and so they represent components of the variation of our variables which are either not spatiallydependent, or which are spatially dependent at scales too fine to be resolved by the sampling of the variables.
The variances and covariances of the two random terms, and the parameters of the Matérn correlation function may be estimated from observations by maximum likelihood (ML) or by residual maximum likelihood (REML), as described by Marchant and Lark (2007 Once the fixed and random effects parameters of the LMM have been estimated the model can then be used to obtain predictions of the variables Y 1 , Y 2 , · · · , Y m at sites where they have not been measured but where any predictor variables in X are known. This prediction has two components, the first of which (a regression-type prediction) depends on the value of the covariates and the fixed effects parameters τ i , i ∈ {1, m}. The second component is a spatial interpolation of the random component of the model for the target variable, it is essentially a cokriging prediction, and so is based on estimates of the random effects η i , i ∈ {1, m}, at the sample sites. The overall prediction, the E-BLUP, has errors, the variance of which can be computed from the LMM. The E-BLUP is the best predictor in the sense that this variance is minimized.
Note that all spatial modelling was done after conversion of the coordinates of all observations to the Universal Transverse Mercator projection zone 37N. The spatial predictions were then transformed back to geographical coordinates (latitude and longitude) for presentation as maps.

S.2 The log-likelihood ratio test
Consider a case where the maximized log-likelihood (not the residual log-likelihood) for a linear mixed model that we call the null model is N , and that for a proposed model in which we have added an additional 'candidate' predictor in the fixed effects to the set used in the null model is 1 . At the start of a sequential process of variable selection, as used in this study, the null model is one with a constant mean as the only fixed effect, and the proposed model adds the first-listed predictor variable of interest. One may compare the two models with the log-likelihood ratio statistic: The statistic L is a measure of the evidence that the added predictor has conditions hold for all applications in this paper.

S.3 Factorial kriging
When a spatial variable is modelled as the additive combination of two or more spatially correlated random variables, with a combined variogram which is the sum of the variograms for these variables, then factorial kriging is a method which allows one to decompose the data into estimates of the two components (and an uncorrelated nugget component).

S.4 The standardized squared prediction error of cross-validation
The standardized squared prediction error at any sample site at location s was computed as where y i (s) and Y i (s) denote, respectively, the observed selenium concentration in the grain sample at location s and its E-BLUP prediction, and σ 2 BLUP (s) denotes the prediction error variance of the E-BLUP, which depends only on the parameters of the LMM used for prediction, the spatial distribution of the neighbouring sample points and the value of the environmental covariates at s. The expected value of the standardized squared prediction error is one, for a valid model, and the most sensitive diagnostic is the median value, with an expectation of 0.455 for normally distributed prediction errors (Lark, 2009).

S.5 Ordering of soil properties as potential predictors of Se concentration in grain
The selected order for testing soil properties for prediction of grain Se concentration at a site is shown in Table 2 of the paper. The rationale for the ordering is summarized below.
1. It was decided that extractable soil Se would be the most likely predictor of grain Se concentration, among soil properties. There was more than one measure of extractable Se in the data set, and all were included at the top of the list in order: i. soluble Se (nitrate) ii. exchangeable Se (phosphate) iii. organic Se (TMAH) 2. Following extractable soil Se, we considered soil properties expected to affect Se mobility in soil in order: iv. pH v. sum of oxalate-extractable Fe, Al and Mn (metal oxides) 3. We then included soil sulphur concentration, as this may be a competitor with Se for transporters into the plant. We excluded exchangeable sulphur for which 38% of observations were below detection limit, and many were negative, and included the others in the following order: vi. soluble S (nitrate) vii. organic S (TMAH) 4. We then included exchangeable iodine which could be a proxy of Se content of soil originating in rainfall. Because this is a proxy effect, and the extractable Se is already included, it was given a relatively low rank. Probability of cropping Figure  S1: Probability that arable cropping is found at nodes on a 500 m grid over the sample region. Coordinates are in metres relative to datum Latitude 5 degrees N, Longitude=20 degrees E on the Lambert Azimuthal Equal Area projection. 2 °F igure S3: Location of sample sites (black triangles) relative to the borders of Ethiopia (black lines). Degrees longitude (East, WGS84) and latitude (North) are shown as vertical and horizontal dashed lines. The boundary of Ethiopia is taken from file gadm36_ETH_0_sp.rds, provided by GADM, https://gadm.org/index.html, and reproduced under the terms of the GADM licence https://gadm.org/license.html The boundaries, denominations, and any other information shown on this map do not imply any judgment about the legal status of any territory, or constitute any official endorsement or acceptance of any boundaries, on the part of any Government.