Validation of baseline and modified Sentinel-2 Level 2 Prototype Processor leaf area index retrievals over the United States

The Sentinel-2 Level 2 Prototype Processor (SL2P) is made available to users for the retrieval of vegetation biophysical variables including leaf area index (LAI) from Multispectral Instrument (MSI) data within the Sentinel Application Platform (SNAP). A limited number of validation exercises have indicated SL2P LAI retrievals frequently meet user requirements over agricultural environments, but perform comparatively poorly over heterogeneous canopies such as forests. Recently, a modified version of SL2P was developed, using the directional area scattering factor (DASF) to constrain retrievals as an alternative to regularisation (SL2P-D). Whilst SL2P makes use of prior information on expected canopy conditions, SL2P-D is trained using uniform distributions of input parameters to define radiative transfer model (RTM) simulations. Using in situ measurements available through the Copernicus Ground Based Observations for Validation (GBOV) service, we performed an extensive validation of SL2P and SL2P-D LAI retrievals over 19 sites throughout the United States. For effective LAI (LAIe), SL2P demonstrated good overall performance (RMSD = 0.50, NRMSD = 31%, bias = − 0.10), with all LAI retrievals meeting the Sentinels for Science (SEN4SCI) uncertainty requirements over homogeneous canopies (cultivated crops, grasslands, pasture/hay and shrub/scrub), whilst underestimation occurred over heterogeneous canopies (deciduous forest, evergreen forest, mixed forest, and woody wetlands). SL2P-D retrievals demonstrated reduced bias, slightly improving overall performance when compared with SL2P (RMSD = 0.48, NRMSD = 30%, bias = − 0.05), indicating its retrieval approach appears to offer some advantages over regularisation using prior information, especially at LAIe > 3. Additionally, SL2P-D resulted in 32% more valid retrievals than SL2P, with the largest differences observed at LAIe < 1. Validation against in situ measurements of LAI as opposed to LAIe yielded similar patterns but poorer performance (RMSD = 1.08 to 1.13, NRMSD = 49% to 52%, bias = − 0.64 to − 0.68) because the RTM used by SL2P and SL2P-D does not account for foliage clumping. In addition to the retrievals themselves, we examined the relationship between predicted uncertainties and observed differences in retrieved and in situ LAI. With respect to LAIe, SL2P’s predicted uncertainties were conservative, underestimating observed differences in only 35% of cases, whilst those for LAI were unbiased.


Introduction
Timely information on the status of vegetation is a crucial requirement in agriculture, forestry and environmental and biodiversity assessment, enabling resources to be monitored and managed effectively in the face of environmental change and an increasing global population (GCOS, 2019). Offering repeat observations and global coverage, satellite remote sensing represents a valuable source of such information. Over the last two decades, several operational products have been developed, making use of radiative transfer model (RTM) inversion, statistical approaches, and hybrid techniques to retrieve vegetation biophysical variables from remotely sensed optical images (Baret and Buis, 2008;Verrelst et al., 2015). Using data from moderate to coarse spatial resolution instruments such as the Advanced Very High Resolution Radiometer (AVHRR) (García-Haro et al., 2018), Moderate Resolution Imaging Spectrometer (MODIS) Knyazikhin et al., 1998;Pinty et al., 2011a;2011b;Yan et al., 2016a), Ocean and Land Colour Instrument (OLCI) (Gobron, 2010), Visible Infrared Radiometer Suite (VIIRS) (Yan et al., 2018) and PROBA-V (Lacaze et al., 2015), current examples of vegetation biophysical products provide estimates at 300 m to 4.8 km, with a frequency of between four days and one month.
Whilst the spatial resolution of existing operational products is adequate for regional and global scale monitoring, increased spatial resolution is required in precision agriculture, forest management, and adaptation studies, where within-field and stand-scale information is necessary (Clevers and Gitelson, 2013;GCOS, 2019;Majasalmi and Rautiainen, 2016). The Sentinel-2 missions, which form part of the European Union's Copernicus programme, provide a unique opportunity in this respect. Comprised of a constellation of two platforms orbiting 180 • apart, the near-identical Multispectral Instrument (MSI) sensors carried on-board benefit from both high spatial (10-60 m) and temporal (≤5 days) resolutions (Drusch et al., 2012). The spectral characteristics of the instrument, which incorporates thirteen visible near-infrared and shortwave-infrared bands (including three red-edge bands), are also well-suited to vegetation monitoring (Delegido et al., 2011;Frampton et al., 2013;Xie et al., 2019).
Currently, estimates of vegetation biophysical variables are not produced operationally by the Sentinel-2 ground segment. Instead, a retrieval algorithm has been implemented in the freely available Sentinel Application Platform (SNAP). Developed by Weiss and Baret (2016) and known as the Sentinel-2 Level 2 Prototype Processor (SL2P), the algorithm enables users to generate so-called 'L2B' products from atmospherically corrected L2A MSI data (Müller-Wilm, 2018). SL2P adopts artificial neural networks (ANNs) that are trained with RTM simulations from the coupled Leaf Optical Properties Spectra (PROS-PECT) (Feret et al., 2008;Jacquemoud and Baret, 1990) and Scattering by Arbitrarily Inclined Leaves (4SAIL) (Verhoef et al., 2007) models. The algorithm provides retrievals and predicted uncertainties of leaf area index (LAI), the fraction of absorbed photosynthetically active radiation Fig. 1. Schematic diagram illustrating the cascaded retrieval approach adopted by SL2P-D.

Table 1
Comparison of the major similarities and differences between SL2P and SL2P-D.
Recently, a modified version of SL2P, known as SL2P-D, was developed by Fernandes and Djamai (2019), in which the directional area scattering factor (DASF) is used to constrain retrievals. DASF is a spectral index representing an estimate of the fraction of leaf area inside the canopy visible from a given direction outside the canopy (Knyazikhin et al., 2013). It is a dimensionless quantity typically ranging from zero to one, and is monotonically related to LAI and foliage clumping (Adams et al., 2018;Stenberg and Manninen, 2015). It is equivalent to the canopy bidirectional reflectance assuming a foliage single scattering albedo of one (i.e. non-absorbing leaves) and zero boundary reflectance (i.e. due to dark soils or a sufficiently dense canopy). Provided the background reflectance is negligible, DASF can, therefore, be determined directly from measured or simulated reflectance at wavelengths where foliage is weakly absorbing (i.e. ~800 nm to 850 nm), without prior knowledge of the foliage single scattering albedo (Knyazikhin et al., 2013). Unlike other spectral indices, DASF is by definition invariant to foliage biochemistry but sensitive to canopy structure.
Whilst detailed descriptions of SL2P and SL2P-D are provided in their respective algorithm theoretical basis documents Weiss and Baret, 2016), it is instructive to describe their major similarities and differences (Table 1). Both adopt ANNs trained with RTM simulations based on joint distributions of leaf, canopy, soil and acquisition geometry parameters. To regularise retrievals, SL2P uses prior information in the form of truncated Gaussian distributions of input parameters, which were designed to reflect global conditions. However, previous work suggests that such a strategy may lead to locally-biased retrievals (Combal et al., 2003). In an attempt to address this issue, SL2P-D uses uniform distributions of input parameters, and applies partitioning as an alternative to regularisation. This strategy involves a cascaded retrieval approach. First, a dedicated ANN estimates DASF. Then, one of 18 ANNs trained using only simulations matching the retrieved DASF value (± the expected precision of DASF retrievals) is selected for retrieving the corresponding biophysical variables (Fig. 1). Testing against independent simulations indicates that DASF retrievals are unbiased, and have good precision (Appendix A). DASF is used for partitioning for two reasons: i) it is sensitive to canopy structure but invariant to foliage biochemistry (Adams et al., 2018;Stenberg and Manninen, 2015), and therefore removes the confounding effects of biochemistry on retrievals of structural variables when partitioning the simulation database, and ii) it can be estimated without prior knowledge of the foliage single scattering albedo (Knyazikhin et al., 2013). Further details on SL2P-D are provided in Appendix A.
The Land Product Validation (LPV) sub-group of the Committee on Earth Observation Satellites (CEOS) Working Group on Calibration and Validation (WGCV) defines a four-stage hierarchy for product validation (Fernandes et al., 2014). Several moderate spatial resolution products have reached the second stage of this hierarchy, in which accuracy is assessed over a significant set of locations and time periods (Brown et al., 2020;Camacho et al., 2013;Fang et al., 2019;Yan et al., 2016b). For SL2P, however, validation exercises have been comparatively limited, reaching only the first stage of the hierarchy, in which accuracy is assessed over a small (typically < 30) set of locations and time periods. These exercises have indicated that SL2P LAI retrievals frequently meet user requirements over agricultural environments, but appear to demonstrate comparatively poor performance over heterogenous canopies such as forests and at higher values (i.e. LAI > 3) Djamai et al., 2019;Hu et al., 2020;Pasqualotto et al., 2019bPasqualotto et al., , 2019aUpreti et al., 2019;Vanino et al., 2018;Vuolo et al., 2016;Xie et al., 2019).
Recently, environmental monitoring networks, such as the National Ecology Observatory Network (NEON) (Kao et al., 2012), Terrestrial Ecosystem Research Network (TERN) (Karan et al., 2016), and Integrated Carbon Observation System (ICOS) (Gielen et al., 2018) have been established to collect long-term environmental data over permanent measurement sites, and are planning or performing in situ measurements of vegetation biophysical variables on a routine basis. The Copernicus Ground Based Observation for Validation (GBOV) service was initiated to exploit these data for satellite product validation: now in its third year, 4,178 in situ reference measurements are available through the GBOV service over 20 NEON sites (https://land.copernicus. eu/global/gbov/). The GBOV dataset was recently used to evaluate several moderate (≥300 m) spatial resolution LAI products (Brown et al., 2020). However, the high spatial resolution LAI retrievals provided by SL2P and SL2P-D were not explicitly addressed. As such, the objective of the present study is to use the GBOV dataset to perform an extensive validation of SL2P and SL2P-D LAI retrievals over the United States of America. Unlike previous local-scale validation efforts, the multiple sites and time periods incorporated within the dataset provide substantial progress towards the second stage of the CEOS WGCV LPV hierarchy. Four specific research questions are addressed: 1. What accuracy can be expected from SL2P and SL2P-D LAI retrievals over different vegetation types characteristic of biomes found in the United States? 2. Does SL2P-D result in similar or reduced bias in LAI retrievals, particularly at higher values (i.e. LAI > 3)? 3. Do both algorithms perform best over homogeneous as opposed to heterogenous canopies, as indicated by previous studies? 4. How well do the predicted uncertainties provided by SL2P reflect observed differences between retrieved and in situ values?

In situ data collection and processing
In situ measurements provided by the GBOV service at 19 NEON sites throughout the United States were used to estimate reference LAI and effective LAI (LAI e ), in which a random distribution of leaves is assumed, for validating SL2P and SL2P-D retrievals (Table 2). Sites included a wide range of vegetation types as defined by the National Land Cover Database classification (i.e. cultivated crops, deciduous forest, evergreen forest, grasslands, mixed forest, pasture/hay, shrub/scrub and woody wetlands) (Homer et al., 2020). At each site, in situ measurements covered a period ranging between three and six years. All in situ measurements were derived from estimates of gap fraction obtained using digital hemispherical photography (DHP) (NEON, 2019). At each site, a minimum of three plots (nominally 20 m × 20 m) were sampled every two weeks from leaf emergence until the end of senescence. Each plot contained 12 samples (Fig. 2), and both upwards-and downwardsfacing images were acquired if understory and overstory vegetation was present. Meier et al. (2018) provide further information on the NEON DHP acquisition protocol, whilst Brown et al. (2020) describe the approach used by the GBOV service to process NEON DHP images, which includes quality control to reject images meeting any of the following conditions: plots with less than 12 images, no downwardfacing images at forest sites, images acquired in lossy formats, and images demonstrating fixed pattern noise, overexposure, colour balance issues, variable illumination, or foreign objects within the field-of-view.
Because neither SL2P nor SL2P-D account for foliage clumping Weiss and Baret, 2016), their LAI retrievals correspond to LAI e . Thus, retrievals were primarily validated against in situ estimates of LAI e . However, since LAI rather than LAI e is the physical quantity desired by many users, we also validated SL2P and SL2P-D against estimates of LAI itself (i.e. accounting for the effects of foliage clumping). The data provided by the GBOV service represent plant area index (PAI) and effective PAI (PAI e ) as opposed to LAI and LAI e , as the upwards-facing image classification cannot distinguish between foliage and woody material (Brown et al., 2020). PAI e is determined according to Warren-Wilson (1963), whilst the derivation of PAI relies on the clumping correction approach of Lang and Yueqin (1986), such that where lnP(θ 57.5 • ) is the natural logarithm of mean gap fraction values and lnP(θ 57.5 • ) is the mean of the natural logarithm of gap fraction values at 57.5 • (±5 • ). Following the CEOS WGCV LPV good practices for LAI validation (Fernandes et al., 2014), in the absence of site-or plot-specific information on woody area, in this study we applied a first-order correction for woody material, deriving LAI and LAI e from PAI and PAI e (assuming no woody area in downward-facing images) as where PAI up and PAI down are PAI or PAI e values derived from upwardsand downwards-facing DHP images, respectively, and ∝ is the woody-tototal area ratio (Baret et al., 2005;Brown et al., 2020;Chen, 1996

Table 3
Mean and standard deviation of woody-to-total area ratio (∝) values for each forest type derived from previously published values (Bréda, 2003;Gower et al., 1999). determined for each forest type based on previously published values for a range of deciduous and evergreen species (Table 3) (Bréda, 2003;Gower et al., 1999). Using uncertainty propagation (Working Group 1 of the Joint Committee for Guides in Metrology, 2008), we quantified the combined standard uncertainty in our LAI e and LAI estimates as a result of a) the uncertainty in DHP-derived PAI e and PAI values, and b) the correction for woody area, such that where u(PAI up ) and u(PAI down ) are the standard uncertainties in DHPderived PAI e or PAI values provided by the GBOV service, which incorporate the effects of variability in gap fraction and instrument levelling (Brown et al., 2018;Origo et al., 2017), whilst u(∝) is the standard deviation of the mean ∝ value for the forest type in question (Table 3).

MSI data pre-processing and execution of SL2P and SL2P-D
All MSI L1C top-of-atmosphere reflectance scenes acquired over the 19 sites during the study period were ingested from the Copernicus Open Access Hub (https://scihub.copernicus.eu/) and processed to L2A bottom-of-atmosphere reflectance using Sen2Cor 2.5.5 (Müller-Wilm, 2018). Once processed, the quality scene classification map obtained using Sen2Cor was used to exclude pixels contaminated by cloud, cloud shadow, thin cirrus, water, or snow, in addition to dark, saturated, or defective pixels. Standalone versions of SL2P (https://github.com/d jamainajib/sl2p_v1/) and SL2P-D (https://github.com/djamainajib/s l2p_dasf/) were used to retrieve LAI from 20 m L2A MSI data. Both SL2P and SL2P-D provide a quality flag layer Weiss and Baret, 2016). Invalid retrievals (i.e. those flagged by the algorithm to indicate inputs/outputs were outside of the domain/range of the training database) were discarded from further analysis.
The standalone version of SL2P, which is also now implemented in Google Earth Engine within the Landscape Evolution And Forecasting (LEAF) Toolbox (https://github.com/rfernand387/leaf-toolbox/), is equivalent to the version implemented in SNAP 7.0 (http://step.esa. int/main/toolboxes/sentinel-2-toolbox/sentinel-2-toolbox-features/), with the exception that it also provides predicted uncertainties as well as the retrievals themselves, following the approach proposed by Baret et al. (2010). Using the same inputs as the ANN adopted for LAI retrieval, a dedicated ANN is trained to estimate the root mean square difference (RMSD) that could be expected for a given MSI observation. This is expressed as the RMSD between the LAI value associated with each simulation and the LAI values associated with similar candidates within the training database (i.e. those lying within MSI's uncertainty (assumed 0.02 additive, 4% multiplicative), within ±5 • of the solar and viewing zenith angles, and within ±10 • of the relative azimuth angle of the simulation in question). It is important to note that the predicted uncertainties correspond to the expected RMSD over all similar inputs, and not the actual RMSD (this would imply the ANN used for LAI retrieval was poorly trained, as in such a case, it would be possible to simply correct retrievals using the predicted uncertainties themselves).

Statistical analysis
First, an intercomparison of all valid SL2P and SL2P-D LAI retrievals over the considered measurement plots was carried out to assess their consistency. Then, SL2P and SL2P-D LAI retrievals were validated against in situ measurements made within one day of the associated MSI scene acquisition. The agreement between LAI retrievals and in situ measurements was assessed using the coefficient of determination (r 2 ), RMSD, normalised RMSD (NRMSD), bias, uncertainty agreement ratio (UAR) , and slope. The r 2 and slope were determined using ordinary least squares regression, whilst the RMSD, NRMSD, and bias were calculated as where p i represents the value provided by SL2P or SL2P-D, o i represents the in situ measurement, and n representes the number of comparisons. A positive bias corresponded to overestimation of in situ measurements by SL2P or SL2P-D. Using uncertainty propagation, the standard uncertainty in the RMSD, NRMSD and bias values resulting from the uncertainties associated with each in situ measurement (Section 2.1) was determined.
The UAR corresponds to the percentage of retrievals falling within the Sentinels for Science (SEN4SCI, 2011) uncertainty requirements (1 unit or 20%; used for both LAI e and LAI, as specific requirements for LAI e were not available), such that where I[x] is the indicator function. These are less stringent than current GCOS uncertainty requirements (15%), and so may be considered 'threshold' requirements . It is worth noting that the GCOS requirements were originally developed for global moderate spatial resolution products and are currently under revision (GCOS, 2019), whilst uncertainties reported for in situ LAI measurements often exceed 15% (Camacho et al., 2013;Fang et al., 2019;Garrigues et al., 2008), meaning that conformity to the GCOS requirements is difficult to reliably test. Additionally, by evaluating retrievals against the SEN4SCI requirements, we could better compare our results to those of previous SL2P validation efforts . SL2P provided 433 valid retrievals matching the in situ measurements, whilst SL2P-D provided 572 (an increase of 139 or 32.10%), with the majority of additional valid retrievals occurring at LAI e values of less than 1 (i.e. 131 or 94.24% of additional valid retrievals), and over cultivated crops, grassland/herbaceous, pasture/hay or shrub/shrub canopies (i.e. 114 or 82.01% of additional valid retrievals). Our study concentrated on thematic uncertainty rather than temporal frequency requirements. However, users require products at a frequency of between one and 10 days; making it critical to increase the frequency of valid retrievals (Djamai and Fernandes, 2021). A more comprehensive assessment of the frequency of valid retrievals of SL2P and SL2P-D will require global, seasonally representative sampling that is beyond the scope of our study.
To determine the significance of observed biases, one sample t-tests were performed, whilst paired t-tests were carried out to identify cases where SL2P and SL2P-D biases were significantly different from each other. To ensure they were comparable, only retrievals valid for both SL2P and SL2P-D were used in the calculation of statistics (n = 430). In addition to overall values, all statistics were calculated for land cover type, meteorological season, and LAI e and LAI magnitude subsets. Note that at deciduous sites, seasonal differences in performance are partly accounted for by LAI e and LAI magnitude, but this is not the case at evergreen and cropland sites, where seasonal variations in performance may occur independent of LAI e and LAI magnitude due to factors such as snow cover and residual cloud contamination.
When time series of LAI retrievals were analysed, both SL2P and SL2P-D were able to realistically resolve seasonal variations in vegetation status over all of the considered land cover types (Fig. 4). As also demonstrated by the intercomparison statistics (Tables 4 and 5), SL2P-D provided notably higher retrievals than SL2P over deciduous forest and at higher LAI (i.e. > 3). This was reflected in the spatial distribution of differences between SL2P and SL2P-D LAI retrievals. At lower LAI values, SL2P-D tended to provide similar or slightly lower retrievals than SL2P. As discussed in Section 2.3, at those sites with lower LAI values (such as those characterised by cultivated crops and shrub/scrub vegetation), SL2P-D provided fewer invalid retrievals than SL2P; many retrievals for the latter were flagged due to inputs/outputs being outside of the domain/range of the training database, resulting in substantial areas of no data (Fig. 4).

Validation of SL2P and SL2P-D LAI retrievals against in situ data
Both SL2P and SL2P-D LAI retrievals were in better agreement with in situ estimates of LAI e than LAI (Fig. 5). Retrievals appeared linearly biased with respect to LAI. Regardless of whether they were validated against in situ estimates of LAI e or LAI, SL2P-D LAI retrievals were subject to reduced bias when compared with SL2P LAI retrievals (-0.05 as opposed to − 0.10 for LAI e and − 0.64 as opposed to − 0.68 for LAI), demonstrating slopes closer to one (0.82 as opposed to 0.77 for LAI e and 0.58 as opposed to 0.54 for LAI). Additionally, the paired t-tests indicated that SL2P and SL2P-D biases were significantly different from each other in both cases. This led to slightly reduced RMSD values (0.48 as opposed to 0.50 for LAI e and 1.08 as opposed to 1.13 for LAI) and a slightly greater proportion of LAI retrievals compliant with the SEN4SCI uncertainty requirements (UAR = 95.35% as opposed to 94.42% for LAI e and 66.51% as opposed to 61.63% for LAI).
In terms of performance by land cover type, the best agreement was achieved for cultivated crops, grassland/herbaceous, pasture/hay and shrub/scrub canopies, for which all SL2P and SL2P-D LAI retrievals were compliant with the SEN4SCI uncertainty requirements when validated against in situ estimates of LAI e (Table 7). It is worth noting that cultivated crops and grassland/herbaceous cover types had substantially fewer samples than the other considered land cover types, and lower LAI values might be expected over these sites when compared to forests. LAI retrievals not meeting the SEN4SCI uncertainty requirements were restricted to deciduous forest, evergreen forest, mixed forest and woody wetlands. An increased proportion of SEN4SCI compliant LAI retrievals was observed in the case of SL2P-D when compared with SL2P, with the exception of deciduous forest.
For both algorithms, biases were significantly different from zero over all land cover types except cultivated crops. A reduction in bias was evident for SL2P-D LAI retrievals when compared with SL2P LAI retrievals in most cases, although the paired t-tests indicated that SL2P and SL2P-D biases were significantly different from each other only over Fig. 3. Intercomparison between valid SL2P and SL2P-D retrievals over the measurement plots considered within the study. Biases and slopes significantly different from zero and one, respectively (p < 0.05), are indicated with * Table 4 Intercomparison statistics for SL2P and SL2P-D LAI retrievals by land cover type. Biases and slopes significantly different from zero and one, respectively (p < 0.05), are indicated with *.     . When compared to SL2P, SL2P-D yielded slopes closer to one for all land cover types except pasture/hay and shrub/scrub (Table 7). Results for in situ estimates of LAI as opposed to LAI e revealed similar patterns but poorer performance statistics overall (Table 8).
When analysed by magnitude, the best agreement was observed for LAI e values of less than 2, at which between 98.68% and 100% of SL2P and SL2P-D retrievals were within the SEN4SCI uncertainty requirements, and RMSD values were between 0.31 and 0.39 (Table 9), though relative performance was poor for LAI e values of less than 1 Table 8 Validation statistics for SL2P and SL2P-D LAI retrievals when compared against in situ estimates of LAI, by land cover type. Biases and slopes significantly different from zero and one, respectively (p < 0.05), are indicated with *, whilst SL2P and SL2P-D biases significantly different from each other (p < 0.05) according to a paired ttest are shown in bold.

SL2P
SL2P-D  (NRMSD = 114.03% to 116.07%). As expected, RMSD values increased with magnitude in the case of both algorithms, as did biases between LAI e values of 1 and 4. Notably, the paired t-tests indicated that SL2P and SL2P-D biases were significantly different from each other over all magnitude ranges, with SL2P-D providing less biased retrievals (and slopes closer to one) in all cases. The reduction in bias with respect to SL2P was greatest at LAI e values of greater than 2, where biases were reduced by between 0.08 and 0.14. Nevertheless, for both algorithms, biases were significantly different from zero over all magnitude ranges (Table 9). Results for in situ estimates of LAI as opposed to LAI e are provided in Table 10, which revealed similar patterns but poorer performance statistics. With respect to seasonal variations, the best compliance occurred during the winter and spring, when between 97.75% and 100% of SL2P and SL2P-D retrievals met the SEN4SCI uncertainty requirements and RMSD values were between 0.35 and 0.39, though because of the lower magnitude of values experienced during the winter, relative performance was in fact poorer (NRMSD = 59.92% to 62.67%). Additionally, due to their small sample size, the winter statistics should be treated with caution (Table 11). Increased RMSD values (0.46 to 0.53) were experienced during the summer and autumn, as were a reduced proportion of SEN4SCI compliant retrievals (92.89% to 95.00%). Biases   Fig. 6. Comparison of SL2P predicted uncertainties against the observed absolute difference in LAI retrievals when compared against in situ estimates of LAI e (a) and LAI (b). The dashed line represents a 1:1 relationship, whilst error bars represent the combined standard uncertainty associated with each in situ measurement. Biases significantly different from zero (p < 0.05) are indicated with *. 'Underestimated' refers to the proportion of predicted uncertainties that underestimated the observed absolute difference.

Table 12
Validation statistics for SL2P and SL2P-D LAI retrievals when compared against in situ estimates of LAI, by meteorological season. Biases and slopes significantly different from zero and one, respectively (p < 0.05), are indicated with *, whilst SL2P and SL2P-D biases significantly different from each other (p < 0.05) according to a paired t-test are shown in bold.

SL2P
SL2P-D  were significantly different from zero for both algorithms over all seasons except winter, whilst the paired t-tests indicated that SL2P and SL2P-D biases were significantly different from each other only during the summer and autumn, when SL2P-D reduced biases by between 0.05 and 0.08. Nevertheless, over all seasons, SL2P-D provided slopes closer to one (Table 11). Once again, results for in situ estimates of LAI as opposed to LAI e revealed similar patterns but poorer performance statistics (Table 12).

Relationship between predicted uncertainties and observed differences in retrieved and in situ LAI
In terms of the relationship between the predicted uncertainties provided by SL2P and observed differences between SL2P retrievals and in situ data, little association was observed in the case of LAI e (r 2 = 0.01) (Fig. 6a). The RMSD between the predicted uncertainty and the absolute difference in LAI with respect to in situ LAI e was 0.70, whilst on average predicted uncertainties overestimated the observed absolute difference (bias = 0.40). This was reflected by relatively few points lying below the 1:1 line (35.12%), indicating predicted uncertainties are typically conservative. An increased association was observed in the case of LAI (r 2 = 0.33), whilst a reduced RMSD of 0.61 was demonstrated. In this case, a greater proportion of predicted uncertainties underestimated the observed absolute difference (59.53%), although the bias was low (− 0.10) (Fig. 6b).

Differences in retrieval accuracy between SL2P and SL2P-D
Previous validation exercises have demonstrated that SL2P retrievals frequently meet user requirements over agricultural environments, but underestimate higher LAI values such as those observed over forests Djamai et al., 2019;Hu et al., 2020;Pasqualotto et al., 2019bPasqualotto et al., , 2019aUpreti et al., 2019;Vanino et al., 2018;Vuolo et al., 2016;Xie et al., 2019). When validated against in situ measurements of LAI e , the fact that 100% of SL2P retrievals over cultivated crops, grassland/herbaceous, pasture/hay and shrub/scrub canopies met the SEN4SCI uncertainty requirements in our study is consistent with these previous findings, as is the fact that negative biases were observed over deciduous forest, evergreen forest, mixed forest and woody wetlands. Our results reveal that SL2P-D tends to provide higher retrievals over forest environments and at LAI > 3, indicating that it is somewhat effective in counteracting SL2P's issue of underestimation. When validated against in situ data, the increased retrieval accuracies, reduced biases, and greater proportion of retrievals compliant with the SEN4SCI uncertainty requirements lend further support to this conclusion.
In addition to reducing biases over forest environments and at higher LAI values, SL2P-D resulted in lower biases during the autumn season. Whilst the prior distributions of leaf chlorophyll and brown pigment concentrations used to train SL2P were optimised for sensitivity to green leaves (leading to underestimation of total LAI during senescence), such constraints are not imposed by the uniform distributions adopted by SL2P-D. An additional advantage of SL2P-D was that, for several sites characterised by lower LAI values, it produced substantially fewer invalid retrievals than SL2P. The uniform distributions of RTM input parameters adopted by SL2P-D mean a greater diversity of reflectance spectra are incorporated in its training database, and so fewer observed reflectance values are likely to be flagged as having an out of range input. This is an important consideration, since Xie et al. (2019) found that SL2P produced flagged retrievals over 37% of their winter wheat measurement plots, substantially reducing the utility of the algorithm when compared to other investigated retrieval approaches.
It should be noted that when validated against in situ measurements of LAI rather than LAI e , reduced retrieval accuracies and increased biases were observed in the case of both SL2P and SL2P-D. These findings are consistent with the results of Hu et al. (2020), who report an RMSD of 1.14 and 1.06 when validating SL2P retrievals against in situ measurements LAI and LAI e , respectively. We hypothesise this is caused by both algorithms being trained using the 4SAIL RTM, which represents the canopy as a horizontally homogeneous turbid medium, and thus does not incorporate foliage clumping. For more homogeneous canopies with relatively low foliage clumping such as cultivated crops, grassland/ herbaceous vegetation, and pasture/hay, SL2P and SL2P-D retrievals provide relatively accurate estimates of LAI, as the difference between LAI and LAI e over these canopies is relatively small (Richter et al., 2009;Verger et al., 2011). On the other hand, for heterogeneous and highly clumped canopies such as forests, both SL2P and SL2P-D underestimate LAI. Indeed, for forests, Brown et al. (2019) found that using a heterogeneous RTM improved retrieval accuracy. An important finding of this study is that SL2P retrievals appear linearly biased with respect to LAI. If LAI rather than LAI e is the desired quantity, it may, therefore, be possible to derive and apply a bias correction (Brown et al., 2020).

In situ measurement uncertainties
Whilst the in situ measurements used in this study provide a useful reference for evaluating SL2P and SL2P-D retrievals, it is worth noting that DHP provides an indirect estimate of LAI. Previous work has demonstrated that DHP can underestimate LAI in tall, complex canopies when compared to direct (i.e. destructive) measurements (Bréda, 2003;Chianucci and Cutini, 2012;Dufrêne and Bréda, 1995;Fassnacht et al., 1994;Jonckheere et al., 2004;Liu et al., 2016). In our analysis, we mitigated such effects to some extent by accounting for foliage clumping in the derivation of LAI (Brown et al., 2020). However, uncertainties related to the choice of clumping correction method remain (Leblanc and Fournier, 2014;Macfarlane et al., 2007;Walter et al., 2003;Woodgate et al., 2017;Yan et al., 2019). These factors should be borne in mind when interpreting observed biases.
A further source of uncertainty is related to the incorporation of senescent foliage within the DHP-derived estimates. For upwards-facing images, the classification approach used to provide the GBOV in situ reference measurements incorporates both green leaves and senescent foliage, whilst for downwards-facing images, the classification is most sensitive to green leaves only (Brown et al., 2020). When both upwardsand downwards-facing images are combined (i.e. at forest sites), these effects may be partly compensatory. Finally, although we applied a firstorder correction for the effects of woody area, site-or plot-specific information on woody area is ideally required. In the case of deciduous species, ∝ may vary depending on the amount of foliage present at any given time, making measurements throughout the growing season highly desirable. These could be obtained in future work by, for example, capturing images at multiple exposures to better discriminate between foliage and woody material throughout the phenological cycle (Woodgate et al., 2016), or by making use of techniques such as nearinfrared imaging (Baret et al., 1993;Milton, 2002;Osmond, 2009) and terrestrial laser scanning (Calders et al., 2018;Li et al., 2018).

Utility of the cascaded retrieval approach adopted by SL2P-D
The retrieval of vegetation biophysical variables such as LAI is considered ill-posed, as different combinations of biophysical and biochemical properties may lead to similar reflectance spectra, whilst confounding factors such as measurement and model uncertainties may introduce error in the retrieved value (Combal et al., 2003;Gobron et al., 1997;Verger et al., 2011;Verrelst et al., 2015). To overcome illposedness and achieve robust retrievals, regularisation techniques have been proposed to constrain the potential solution space, the most popular of which is the use of prior information on expected canopy conditions (Bacour et al., 2006;Baret et al., 2007;Combal et al., 2003;Verger et al., 2011;Verrelst et al., 2015). Such information may be provided on the basis of land cover or local in situ measurements of leaf and canopy biophysical/biochemical properties. In SL2P, prior information is based on a compilation of previously published experimental data available at the time the algorithm was developed .
The prior distributions of input parameters adopted in SL2P were designed to reflect global conditions insofar as possible, given the available experimental data . Nevertheless, it is likely that these distributions do not perfectly reflect reality, and that they are even less representative of local canopy conditions over particular sites or vegetation types . For example, in the case of LAI, SL2P uses a truncated Gaussian prior distribution with a mean = 2 and standard deviation = 3. As retrievals will tend towards the mean of the adopted prior distribution, this may lead to locally (and globally) biased outputs, as observed here and in previous studies Djamai et al., 2019). Our results suggest that the alternative strategy adopted by SL2P-D, which is based on cascaded retrieval for different partitions of DASF, appears to offer some advantages over regularisation using prior information (as demonstrated by slightly improved retrieval accuracies and reduced biases). These results rely on the fact that DASF is sensitive to canopy structure but invariant to foliage biochemistry (Adams et al., 2018;Stenberg and Manninen, 2015), enabling the confounding effects of biochemistry on retrievals of structural variables such as LAI to be removed. However, to achieve further improvements in performance, it is likely that RTMs better able to represent heterogeneous canopies are required.

Perspectives on product uncertainty estimates
As users look to incorporate vegetation biophysical variables such as those provided by SL2P within data assimilation schemes (Chernetskiy et al., 2017;Lewis et al., 2012;Mathieu and O'Niell, 2008), there is an increasing need for accurate estimates of their associated uncertainty. Such information enables individual observations to be appropriately weighted based on our confidence in their quality (Demarty et al., 2007;Raupach et al., 2005;Richardson et al., 2011). Several operational LAI products are now providing some form of uncertainty estimate (García-Haro et al., 2019, 2018Yan et al., 2016a), although few incorporate all relevant sources of uncertainty (such as those related to sensor radiometry, atmospheric correction, RTM assumptions, and the retrieval scheme itself) (Chernetskiy et al., 2017;Lewis et al., 2012;Pinty et al., 2011a;2011b). It should be noted that SL2P-D also provides predicted uncertainties using the same approach as SL2P. However, because its training database is comprised of uniform distributions, it contains many cases that are unlikely to be encountered in reality, causing predicted uncertainties to be overly pessimistic. For SL2P-D to provide useful predicted uncertainties, they should instead be determined using a locally representative database. As such a strategy was beyond the scope of our study, only SL2P's uncertainty estimates were considered.
Despite the fact that predicted uncertainties are provided by SL2P, there has been little investigation into how well they reflect observed differences between retrieved and in situ LAI e . To our knowledge, our study is the first to investigate this relationship. As previously mentioned, we do not expect a strong linear correlation between predicted uncertainties and observed differences, as this would imply SL2P was poorly trained (in such a case it would be possible to simply correct retrievals using the predicted uncertainties themselves). Nevertheless, it is desirable that the predicted uncertainties are unbiased over a large range of samples. For LAI, the low bias is encouraging, as it suggests that SL2P's predicted uncertainties can be used when mapping over large spatial and temporal domains with surface conditions similar to the investigated sites (when aggregating over time and space, random errors are supressed, making bias the most important factor). For LAI e , the observed bias is not ideal, but at least conservative, enabling outer bounds to be determined that will in most cases contain the observed difference between retrieved and in situ LAI e . Both results indicate that SL2P's prediceted uncertainites may be applicable to data assimilation schemes (Demarty et al., 2007;Raupach et al., 2005;Richardson et al., 2011).
Though SL2P's predicted uncertainties should prove useful, it is possible that advances in retrieval techniques will provide more realistic uncertainty estimates. For example, as opposed to training separate ANNs to derive predicted uncertainties, recent work has successfully applied probabilistic non-parametric machine learning approaches that intrinsically provide uncertainty estimates. Of particular promise is Gaussian process regression (GPR), which has been shown to provide similar or improved retrieval accuracy to ANNs, in addition to faster computation times (García-Haro et al., 2019, 2018Upreti et al., 2019;Verrelst et al., 2015;. By adopting a Bayesian approach to the regression problem, GPR delivers predictions in the form of a posterior probability distribution, such that the mean of the distribution represents the predicted value, and the standard deviation its uncertainty. Another promising technique, also based on Bayesian inference, is optimal estimation, as implemented in the Joint Research Centre Two-Stream Inversion Package (JRC-TIP) (Clerici et al., 2010;Disney et al., 2016;Kaminski et al., 2017;Pinty et al., 2011a;2011b;, and the Earth Observation Land Data Assimilation Scheme (EO-LDAS) (Chernetskiy et al., 2017;Lewis et al., 2012).

Conclusions
Whilst previous validation efforts have provided useful information on the performance of SL2P, their spatial and temporal coverage have been limited. Using in situ reference measurements available through the GBOV service, we performed an extensive validation of SL2P LAI retrievals over 19 sites throughout the United States. We also investigated the performance of a modified retrieval approach (SL2P-D). When validated against in situ measurements of LAI e , uncertainty requirements were met by SL2P over homogeneous canopies (cultivated crops, grasslands, pasture/hay and shrub/scrub), consistent with the results of previous validation exercises. However, over heterogeneous canopies (deciduous forest, evergreen forest, mixed forest, and woody wetlands) SL2P retrievals were subject to underestimation. SL2P-D reduced biases over these canopies, slightly improving overall performance. At lower LAI values, SL2P-D also resulted in substantially fewer invalid retrievals than SL2P. Based on our results, the retrieval approach adopted by SL2P-D appears to offer some advantages over regularisation using prior information, but RTMs better suited to heterogeneous environments are likely required to further improve performance. In addition to the retrievals themselves, we investigated the relationship between predicted uncertainties and observed differences in retrieved and in situ LAI and LAI e . For LAI e , our results revealed that SL2P's predicted uncertainties were conservative, indicating they should prove useful in determining outer bounds of uncertainty that typically contain the reasonable worst-case error, though improved uncertainty estimates may be provided by adopting more advanced machine learning approaches in future work.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Observatory Network is a program sponsored by the National Science Foundation and operated under cooperative agreement by Battelle Memorial Institute. This material is based in part upon work supported by the National Science Foundation through the NEON Program. The authors are grateful for support from Natural Resources Canada and the Canadian Space Agency. This project has received funding from the European Union's Horizon 2020 research and innovation program under the grant agreement No 101004242.

Appendix A
SL2P-D is a modification of SL2P  in an attempt to improve its retrieval accuracy for vegetation biophysical variables including LAI. This appendix provides a description of the training and application of SL2P-D to MSI data. Additional details, such as the theoretical basis for the retrieval algorithm and cross-validation results are found in Fernandes and Djamai (2019).
Both SL2P and SL2P-D use a library of non-linear regression models based on a multi-layer ANN architecture to predict the variables defined in Table A1 and their associated prediction uncertainties, given the inputs defined in Table A2. The training data for each ANN corresponds to MSI surface reflectance and associated viewing and illumination geometry as simulated by the coupled PROSPECT and 4SAIL RTMs, hereafter referred to as PROSAIL (Jacquemoud et al., 2009), based on sampling a joint distribution of model parameters (Table A3), and one output corresponding to the target variable or the target prediction uncertainty. The target variable is determined from the PROSAIL simulation. The target prediction uncertainty is determined by first assessing predictions against a testing database of PROSAIL simulations and then computing the RMSD of all simulations whose reflectance differs from a given input by less than measurement uncertainty ('binning'), whose solar and viewing zenith angles differ by less than 5 • , and whose relative azimuth angle differs by less than 10 • . Since perpixel uncertainties are not provided in the L2A products derived using Sen2Cor, fixed additive and multiplicative uncertainties are assumed (Table A2). These values are considered representative of typical uncertainties in L2A MSI data (Li et al., 2015;Upreti et al., 2019), as demonstrated by previous surface reflectance intercomparison exercises (Djamai and Fernandes, 2018;Doxani et al., 2018).
Both SL2P and SL2P-D share the same ranges and copulas (covariation relations) for PROSAIL parameters using a single parameterisation for all land cover types, the same architecture for each ANN, and the same ANN training algorithm. The range of each parameter and copula between each parameter and LAI is specified based on the empirical range over a global data survey conducted by Weiss and Baret (2016). Further details on the survey are provided in Section 3.3.2 of their algorithm theoretical basis document. Each ANN corresponds to a three layer network with 11 inputs for the first layer (Table A2), standardised to have zero mean and unit standard deviation, each with weighted connections to the same five nodes in a hidden layer corresponding to a tangent sigmoid function with associated bias, and a final linear output layer also with a bias term. Each ANN is initialised with random weights and biases drawn from a standardised normal distribution. Backpropagation using the Levenberg-Marquardt minimiser implemented in MATBLAB is used to minimise the mean square error (MSE) of predictions over the training data as a function of ANN weights and biases using parameters indicated in Table A4. Training is halted if the MSE over an independent cross-validation dataset falls below 1% relative error, or if six iterations of training over the training database do not reduce the MSE ('early stopping'). In all cases, the ANNs were found by early stopping, which is encouraging, as this is an indication that they are not overfitting the training data.
In contrast to SL2P, SL2P-D uses multiple ANNs to retrieve a target variable in a two-stage manner. The first network, shared with all

Table A2
Inputs and associated assumed additive and multiplicative noise for SL2P-D. Note that noise consists of both wavelength dependent and independent components    SL2P-D also differs from SL2P in the simulation databases used. Firstly, each simulation also includes DASF as an output variable, corresponding to a PROSAIL simulation using the same input parameters except for a canopy single scattering albedo of one and soil reflectance of zero. DASF varies with the ratio of leaf reflectance to transmittance. As this ratio is wavelength dependent, we select the value at 800 nm, where it is relatively insensitive to biochemistry (Adams et al., 2018). SL2P-D uses uniform marginal distributions for input parameters (Table A3) when producing the training, cross-validation and testing databases, in contrast to empirical distributions used in SL2P. SL2P-D uses Sobol rather than orthogonal sampling, as the former allows for a linear rather than geometric increase in samples as the number of parameters and parameter levels sampled increases. SL2P-D uses 1,572,864 samples across all databases rather than the 42,472 used in SL2P, to allow for sufficient sample sizes (at least 10,000) for each second stage ANN used for prediction of target variables.
Both training and testing databases demonstrate relatively high values of DASF with the majority exceeding 0.8 (Fig. A1). This is a consequence of the homogeneous canopy assumption of PROSAIL, resulting in no view of the underlying zero reflectance boundary. Whilst the in situ dispersion of DASF is likely broader, we note that here, DASF retrievals are used as a relative partitioning variable, and not an unbiased estimator of DASF for heterogeneous canopies. As a consequence, DASF retrievals are required to have high precision and a bias that is either small or relatively monotonic with respect to reference DASF values. For the testing database, the precision of DASF retrievals (quantified as the standard deviation of differences) is better than 0.05, whilst the magnitude of the bias is less than 0.03 or 3% (Fig. A2).

Fig. A2
. Bias (red line), RMSD (sold black line and black circles), and box plots of differences in DASF retrievals (as used for partitioning) vs. reference DASF values from the testing database. Boxes indicate ± one standard deviation, dot symbols span the 98th percentile, and asterisk symbols correspond to residuals exceeding the 98th percentile.