Country-wide Retrieval of Forest Structure From Optical and SAR Satellite Imagery With Deep Ensembles

Monitoring and managing Earth's forests in an informed manner is an important requirement for addressing challenges like biodiversity loss and climate change. While traditional in situ or aerial campaigns for forest assessments provide accurate data for analysis at regional level, scaling them to entire countries and beyond with high temporal resolution is hardly possible. In this work, we propose a method based on deep ensembles that densely estimates forest structure variables at country-scale with 10-meter resolution, using freely available satellite imagery as input. Our method jointly transforms Sentinel-2 optical images and Sentinel-1 synthetic-aperture radar images into maps of five different forest structure variables: 95th height percentile, mean height, density, Gini coefficient, and fractional cover. We train and test our model on reference data from 41 airborne laser scanning missions across Norway and demonstrate that it is able to generalize to unseen test regions, achieving normalized mean absolute errors between 11% and 15%, depending on the variable. Our work is also the first to propose a variant of so-called Bayesian deep learning to densely predict multiple forest structure variables with well-calibrated uncertainty estimates from satellite imagery. The uncertainty information increases the trustworthiness of the model and its suitability for downstream tasks that require reliable confidence estimates as a basis for decision making. We present an extensive set of experiments to validate the accuracy of the predicted maps as well as the quality of the predicted uncertainties. To demonstrate scalability, we provide Norway-wide maps for the five forest structure variables.


Introduction
Forest structure relates to the three-dimensional (3D) spatial arrangement of the plant community within a forest and is both a result and a driver of ecosystem processes and biological diversity (Spies, 1998). Forest structure is strongly correlated with a forest's ability to store carbon and to provide habitat for a variety of species (Turner et al., 2003;Bergen et al., 2009;Dubayah et al., 2020). Given the large extent and dynamic nature of forest ecosystems, it is important to develop ways of mapping and monitoring their structure consistently through space and time . Such maps can support a more informed and dynamic management of forest resources, required to tackle important global challenges such as the loss of biodiversity; and the mitigation of, as well as adaptation to climate change.
Measuring forest structure has long been an expensive and time-consuming task. Traditional in-situ methods are of open data from Earth observation satellites. A prominent example is the European Union's Copernicus program, with multiple satellite missions whose data products are useful to predict forest structure dynamics . In particular, the Sentinel-1 (synthetic-aperture radar, SAR) and Sentinel-2 (multispectral optical) missions provide near-global coverage, with both high spatial resolution (∼10 m ground sampling distance, GSD) and frequent revisit times (≤5 days; the effective temporal resolution of the optical sensor is reduced by the cloud coverage). These missions deliver observations over a wide spectral range at an unprecedented rate, but can they be converted to forest structure products? A new challenge arises to retrieve the relevant information from multispectral image data at ∼10 m GSD, rather than LiDAR waveforms with <1 m 2 footprints. In the absence of a tractable radiative transfer model, supervised machine learning offers a statistical approach to retrieve environmental parameters from image data. In recent years, deep neural networks (DNNs) have emerged as the mainstream tool for image analysis, including their applications in remote sensing and, in particular, mapping of forest structure (Lang et al., 2019). DNNs have brought substantial performance gains through their ability to learn the complete functional mapping from raw image values to the desired output variables. While DNNs often achieve high precision, their outputs are typically limited to point estimates and/or are overly confident (Guo et al., 2017;Gast and Roth, 2018). As well-calibrated estimates of uncertainty are important whenever the model outputs feed into critical decisions or probabilistic models, the Bayesian interpretation of model outputs is an active line of research (Blundell et al., 2015;Gal and Ghahramani, 2016;Guo et al., 2017;Kendall and Gal, 2017;Lakshminarayanan et al., 2017;Ovadia et al., 2019;Gustafsson et al., 2020;Wilson and Izmailov, 2020).
In this paper, we go a step further and exploit recent advances from the field of Bayesian deep learning (BDL) to densely map forest structure variables and associated uncertainties from optical and SAR satellite images. BDL is understood as the deep learning counterpart to traditional Bayesian inference techniques, for instance those for linear regression (Rasmussen, 2004, Ch. 2 to 2.1.1) or Gaussian process regression (Rasmussen, 2004, Ch. 2.1.2 to 2.3). Instead of committing to a single solution of model parameters, BDL is characterized by (approximate) marginal-ization over a posterior distribution of all possible models, given some assumptions about the prior and the data likelihood. Such a principled approach is particularly attractive for data-driven models, where the model uncertainty is not a consequence of explicit modeling decisions, but instead is caused by the limited training data. Bayesian marginalization may improve the predictive accuracy, but more importantly it can give the user access to well-calibrated estimates of the predictive uncertainty, which take into consideration how well a given model prediction is supported by training data. This ability to self-diagnose the reliability of each individual prediction is indispensable for many downstream tasks that ingest the model output as their input. As an example, when making informed decisions, it is crucial to know how reliable the data are that the decisions are based on. Uncertainty estimates have been successfully used for decision making in many fields of application (Council, 2006;Soroudi and Amraee, 2013;Martin and Johnson, 2019;Sniazhko, 2019;Alzate-Mejía et al., 2021), and we argue that they are equally relevant in forest management in order to optimize decisions for cost, carbon stock, biodiversity etc.
Our method (see overview in Fig. 1) is fully supervised and consists of a training and a testing phase. During the training phase, a deep convolutional neural network is optimized on training data, i.e., optical and SAR images with associated, pixel-wise forest structure reference values computed from ALS data. The fitting employs a loss function that does not just penalize deviations between model outputs and ALS reference data. Instead, the loss is proportional to the negative log posterior probability over the model parameters given the data, assuming a zero-mean normally distributed prior (acting as a regularizer) and a Gaussian data likelihood. Thus, for each pixel and variable, the model outputs the parameters of a Gaussian distribution, thereby capturing uncertainty inherent in the input data (aleatoric uncertainty). During the testing phase, multiple trained networks are combined into an ensemble model, exploiting the stochastic nature of neural network initialization and training. The predicted distributions of all networks are aggregated into an ensemble estimate of the distribution over structure variables. The aggregation of predictions from independently trained models corresponds to a sample-based approximation of Bayesian marginalization (Gustafsson et al., 2020;Wilson and Izmailov, 2020). It captures the uncertainty in the model parameters (epistemic uncertainty), which is combined with the aleatoric uncertainty to obtain the overall uncertainty of the predicted forest structure variables.
In this work, we propose a deep ensembles approach to predict structural forest variables often used in modeling forest biophysical properties (95 th height percentile, mean height, density, Gini coefficient and fractional cover), using optical and SAR satellite images as input. Our approach can be seen as a scalable complement to regional ALS, enabling forest structure mapping with 10-meter ground sampling distance (GSD) at country-scale and with a high update frequency (in the extreme case down to five days), at a low cost. Moreover, the estimated forest structure maps come with an individual, calibrated uncertainty estimate for every structure variable at every single pixel. We conduct extensive experiments on a test region in Norway, for which reference data from a full-waveform ALS campaign is available. We then apply our method to compute a country-wide forest structure map for all of Norway, that we make available online. 1 Source code for training and testing the model is provided, too. 2

Remote sensing of structural forest variables
Since its early days, ALS research has seen a tremendous growth, and forest structural variables derived from ALS are broadly used as a source of auxiliary information for the large scale characterization of forest ecosystems. Many different types of modeling techniques have been tested over the years, most of which rely on linking field measured forest biophysical properties (e.g. biomass) with forest structural predictor variables derived from ALS, like height percentiles, vegetation density, forest cover, or foliage height diversity (Naesset et al., 2004;Naesset, 2007). More recently, there has been a trend to directly use ALSderived structural variables to map forest structural (Coops et al., 2016;Valbuena et al., 2017;Adnan et al., 2021) and functional diversity (Schneider et al., 2017;Zheng et al., 2021). While ALS data are the most detailed source of information to characterize forest structure, their geographic and temporal availability remains limited.
If the goal is frequent, or even continuous, monitoring of large forest areas, a more promising data source are freely available satellite images, such as those provided by the Landsat and Copernicus programs. Landsat-based approaches have mostly been relying on time series features to map forest structure (Tyukavina et al., 2015;Hansen et al., 2016;Potapov et al., 2019Potapov et al., , 2021. Also Copernicus' Sentinel-1 and Sentinel-2 missions, with 10 m GSD and less than five days revisit time, offer dense time series of observations that are suitable for forest monitoring. Previous studies have demonstrated the usefulness of Sentinel data to map and estimate key forest biophysical variables (e.g., above ground biomass) (Laurin et al., 2018;Puliti et al., 2020; and their dynamics trough time . In the past two years, a growing body of literature has shown the possibility to map forest canopy height with Sentinel-2 (Lang et al., 2019;Shimizu et al., 2020;Astola et al., 2021;Lang et al., 2021). Since canopy height is only one of many ecosystem characteristics that can be obtained from ALS, our work aims to understand whether Sentinel data can provide maps for a more comprehensive spectrum of structural variables, like vegetation density, cover, and complexity. Like Lang et al. (2019) our model learns to extract predictive texture features from single input images.

Deep learning in remote sensing
In the last decade, deep learning has revolutionized the way information is extracted from images. In particular, convolutional neural networks (CNNs) have achieved unprecedented results in areas like image classification (Krizhevsky et al., 2012;Simonyan and Zisserman, 2015;He et al., 2016), semantic segmentation (Long et al., 2015;Chen et al., 2016), object detection (Szegedy et al., 2013;Girshick et al., 2014;Redmon et al., 2016) and further perception tasks.
In addition, deep learning is increasingly being applied to vision tasks in remote sensing, such as super resolution (Lanaras et al., 2018;de Lutio et al., 2019) or change detection (Caye Daudt et al., 2018). Traditional applications include land cover classification from aerial (Kaiser et al., 2017;Marmanis et al., 2018;Zhang et al., 2019) or satellite images (Helber et al., 2019). Agricultural crop type classification, a specific type of land cover, is a well-studied task, where most authors exploit features from time-series data (Rußwurm and Körner, 2018;Rustowicz et al., 2019;Garnot et al., 2019;Rußwurm and Körner, 2020;Turkoglu et al., 2021a,b). More related to our work are methods that seek to predict biophysical indicators such as crop yield from climate data and Enhanced Vegetation Index maps (Kuwata and Shibasaki, 2015), the prediction of sea ice concentration from RADARSAT SAR satellite images (Wang et al., 2016) or tree density estimation from Sentinel-2 optical images (Rodríguez and Wegner, 2018;Rodríguez et al., 2021). Most works so far do not accompany their map products with calibrated uncertainty estimates. Exceptions include (Rodríguez et al., 2021), where uncertainties are used to guide active learning; and (Lang et al., 2022), where BDL is integrated into a 1-dimensional CNN for spaceborne LiDAR analysis. The prior work most relevant for our paper is Lang et al. (2019), where a CNN was developed to predict country-wide vegetation height maps from Sentinel-2 optical images. Their model was trained and evaluated for Gabon and Switzerland, where training data were derived from LiDAR measurements and photogrammetric surface reconstruction, respectively. Here, we predict five forest structural variables jointly and estimate their predictive uncertainties using an ensemble of probabilistic neural networks. Moreover, we explore the potential of data fusion using Sentinel-1 SAR images as an additional input signal on top of the Sentinel-2 optical bands.

Uncertainty in deep learning
Although deep learning has become the most widely used method in computer vision, model outputs are often trusted "blindly". In regression problems, outputs are usually point estimates with no attached notion of uncertainty, while classification scores have been shown to be overconfident (Guo et al., 2017;Lakshminarayanan et al., 2017). To mitigate this effect and to develop more trustworthy models, reliably quantifying the predictive uncertainty is important -and an open research question (Gustafsson et al., 2020).
Usually, uncertainty in machine learning is decomposed into aleatoric and epistemic uncertainty (Gal and Ghahramani, 2016). The former is assumed to be inherent in the observations -e.g. resulting from sensor noise or lack of signal -and can thus not be "explained away" with more training data. On the contrary, epistemic uncertainty captures uncertainty in the model itself, i.e., the lack of knowledge due to not having seen enough training data for the respective region of the input space. Unlike aleatoric uncertainty, it cannot simply be learned from data -instead, it generally requires marginalization over an (approximate) posterior distribution over model parameters. These methods are colloquially referred to as Bayesian deep learning (Kendall and Gal, 2017); common approaches include variational Bayesian inference (Ranganath et al., 2014;Blundell et al., 2015) and methods based on Markov chain Monte Carlo (MCMC) sampling (Neal, 1996;Welling and Teh, 2011;Chen et al., 2014). In this work, we use a deep ensemble (Lakshminarayanan et al., 2017), a method specifically developed for deep neural networks that can be understood to perform approximate Bayesian inference (Gustafsson et al., 2020). The general idea is to train an ensemble of M independent models on the same data, each initialized with a different set of random weights. The randomness inherent in the weight initialization, as well as random sampling of training batches, causes each model to converge to a different local minimum in the solution space, and the resulting weights can be interpreted as samples from an approximate posterior distribution (Gustafsson et al., 2020;Wilson and Izmailov, 2020). In practice, among all methods performing approximate Bayesian inference in deep learning, ensembles are generally reported to achieve the best results in terms of predictive performance and reliability of the produced uncertainty estimates (Ovadia et al., 2019;Gustafsson et al., 2020;Ashukha et al., 2020).

ALS structural forest variables
In this study, the target variables are forest structural variables normally extracted from ALS data. In Norway, the corresponding ALS data are available as part of a national program aimed at the production of a high resolution digital terrain model (DTM) of the entire country. Amongst the many available ALS projects, we selected a sample of 41 projects covering a range of latitude (58 • N -69 • N) , longitude (5 • E-18 • E), and the corresponding diversity in landscapes and forest types in Norway. Our selected areas are shown in Fig. 2 and are covered by the following main forest types: • East: productive boreal forest in the south eastern part, characterized by mild slopes and continental climate.
• North: predominantly deciduous areas with a large portion of low-productivity mountain birch (Betula nana) in the north. These forests represent the transition from subpolar oceanic vegetation towards the tundra and sub-artic climates.
• West: deciduous forest alternating with patches of coniferous forests and un-productive forests in the coastal areas. These areas are characterized by a milder oceanic climate that, due to the steep slopes (i.e., fjords), can transition to tundra and alpine vegetation within short distances.
The selected samples of ALS data were collected between 2015 and 2018, with ∼80% being from the 2016 -2017 period. For ∼35% of the area, the ALS data were collected under leaf-off conditions (Oct.-Dec.). The point density in the selected ALS projects was 2 pts./m 2 for 51% of the area, 5 pts./m 2 for a further 48%, and 10 pts./m 2 for the remaining 1%. Overall, the sample covers a broad range of climatic, vegetation, terrain, and ALS data characteristics, thus providing a suitable training set for deep learning models that shall generalize across a range of conditions.
The height values z (in meters above sea level) from the raw ALS point clouds were normalized to Dz (in meters above ground) by subtracting the ground elevation from each point's z value. Furthermore, all points with Dz > 1.3 m were classified as vegetation points. We extracted a selection of commonly used ALS variables to describe forest structure and ultimately ecosystem characteristics : height, density/cover, complexity, and habitat area. These variables were selected with the aim to minimize the number of variables while maintaining complementary information on the vertical and horizontal distribution of the forest canopy. While these variables do not provide direct quantitative biophysical properties used in traditional forest management inventories (e.g. above ground biomass), they remain useful proxies for characterizing for forest monitoring purposes (Senf and Seidl, 2022;Hansen et al., 2013). The variables were computed from the normalized point clouds (i.e., m above ground) on a 10 m raster, which was then bi-linearly resampled to match the Sentinel-2 pixels. The following variables were computed for the entire area where ALS was available. We show corresponding histograms in Fig. 3 and an illustration of the variables in Fig. 4.
• P95 (m above ground): the 95 th percentile of the Dz values of all vegetation points. This variable is close to the maximum height but it removes potential noise from spurious high ALS returns (e.g., from birds or power lines). The use of the height percentiles dates back to the some of the first area-based forest inventories (Naesset, 2002). Among them, the 95 th percentile  Figure 3: Histograms of the five ALS-derived structural forest variables. The coloring indicates the different geographical areas, and bars are stacked such that the overall bar height denotes the overall number of data points (i.e., pixels) for a given bin.
is useful as a measure of canopy top height, which is known to be correlated with the developmental stage of the forest and hence its biomass stock. Furthermore, canopy height diversity has been shown to be useful to monitor forest disturbance regimes (Senf et al., 2020).
• MeanH (m above ground): mean of the Dz values of vegetation points. This variable is of interest because it not only captures the height of a forest but also the vertical distribution of the plant material within the canopy. We note that, thanks to its ability to encompass these two levels of information, MeanH is at present the only predictor variable used to produce country-wide maps of forest biomass in Norway (Astrup et al., 2019).
• Dens (%): proportion of vegetation points in the entire set of LiDAR returns. This variable describes the density of the forest canopy and, along with the height percentiles, has long been used in area-based ALS forest inventories. Its is complementary to the canopy height, as it provides information about the vertical distribution of the plant material through the canopy (Naesset, 2002).
• Gini (index): The Gini coefficient is equal to half of the relative mean difference in Dz values among all the vegetation returns, and was calculated using the function implemented in the leafR package (Alves de Almeida et al., 2020). The Gini coefficient is a measure of the inequality among the members of a data distribution, and it has been used as a proxy for tree size variation (Knox et al., 1989) and to map differences in forest structures and management regimes (Valbuena et al., 2017;Adnan et al., 2019). While typically the Gini coefficient has been calculated using single-tree data, a recent work by (Adnan et al., 2021) demonstrated the usefulness of the Gini coefficient calculated from the ALS Dz values, showing that it could reliably describe the structural heterogeneity of the forest.
• Cover (%): Forest cover in terms of the proportion of projected canopy area relative to the entire area of a pixel (100 m 2 ). Cover was computed by projecting the vegetation points onto a (x, y) plane and converting them to a binary occupancy grid with 1 m resolution. The forest cover was then derived as the percentage of pixels occupied by forest. While somewhat correlated to Dens, cover remains a fully 2-dimensional variable, describing the horizontal vegetation cover, rather than the density of points in the vertical canopy profile. Time series of forest cover maps have been widely used in remote sensing as a measure of the canopy openness, and to assess land use changes with particular interest on forest cover losses (Hansen et al., 2013) and forest disturbance (Senf et al., 2018). We divide each ALS project geographically into horizontal stripes of width 900 pixels (9.0 km). Within each stripe we assign the northernmost 5.4 km to the training set, the next 1.8 km to the validation set and the southernmost 1.8 km to the test set. In this way, the different regions and modalities in each ALS area are evenly distributed between the three sets. Overall, the data set consists of 105,022,419 pixels (10,502 km 2 ) of ALS reference data, divided into 64,487,551 training pixels (6,449 km 2 ), 20,784,407 validation pixels (2,078 km 2 ) and 19,750,461 test pixels (1,975 km 2 ), following the "60-20-20" ratio often used in machine learning studies.

Sentinel satellite imagery
Sentinel-1 and Sentinel-2 are satellite missions that belong to the European Union's Copernicus Earth observation programme (European Space Agency, 2021), providing high-resolution SAR and optical imagery of land and coastal water areas between 56°South and 84°North. Each mission consists of a ground segment and a constellation of two satellites in sun-synchronous low earth orbits, phased 180°from each other. Jointly, both satellites in each mission achieve high revisit frequencies of >1 visit every five days, depending on latitude. The Sentinel-2 satellites are equipped with multispectral instruments that detect light in 13 spectral bands ranging from visible blue to short-wave infrared, with band-dependent spatial resolutions between 10 and 60 m. This spectral profile gives rise to a multitude of applications, including forest and vegetation monitoring as well as the inference of vegetation parameters such as leaf area index or carbon stock. For this work, we have collected Sentinel-2 Bottom-of-Atmosphere (BOA) reflectance images (level 2A, Main-Knorn et al., 2017) that have already been atmospherically corrected. For each ALS acquisition area, we select largely cloud-free images captured between May and October of the respective year, such that the area of interest is fully covered. Because BOA reflectance images still contain small amounts of atmospheric variation, we collect between two and seven images for each ground point (depending on cloud conditions), such that our model learns to cover a range of conditions. The Sentinel-1 satellites are equipped with a C-band SAR sensor that actively monitors the surface with radiation of about 6 cm wavelength (European Space Agency, 2021). Sentinel-1 is invariant against meteorological factors like clouds, and also against illumination conditions. Although C-Band SAR does not enter deep into the canopy, it may superficially penetrate it. Thus, we explore this as a complementary signal to the optical image. For every collected optical image, we query a Sentinel-1 image that was acquired within ten days of the optical image and that covers the same geographic area. As preprocessing, we carry out orbit correction and terrain correction, where for the latter we rely on the Copernicus digital elevation model with a resolution of one arc second. We intentionally keep the preprocessing simple because we expect our model to automatically infer the optimal transformations given the task at hand. For all preprocessing steps, we use SNAP (European Space Agency, 2020a) with the Sentinel-1 Toolbox (European Space Agency, 2020b) as provided by ESA.

Forest structure model architecture
We utilize a multi-branch convolutional network (CNN) to transform co-registered optical and SAR images into maps of forest structure variables, and of their associated uncertainties. A CNN gradually transforms its inputs into outputs with a series of simple transformations (layers). One can think of the early layers as a pre-processing of the two input modalities, the middle layers as a feature extractor and the last layers as a regression, but there is no clear distinction between those parts. At the heart of all of them is the (discrete) convolution operator, i.e., a linear filtering of a c-channel input image with a k ×k ×c kernel of weights. In each convolutional layer, multiple convolutions with different filter weights are applied, so the output is again an "image" with as many channels as there were filters. The filter weights are the trainable parameters of the model. Convolutional layers are interleaved with some element-wise, non-linear activation functions. The series of transformations gradually abstracts the bare image pixels into a sequence of feature maps until, at the output level, they have become the desired forest structure maps.
Our model architecture is inspired by the design principles of ResNeXt (Xie et al., 2017), modified to address the specific challenges of forest structure mapping. As in a number of remote sensing works (e.g., Rodríguez and Wegner, 2018;Lang et al., 2019), we retain the spatial dimensions of the input images throughout the network, i.e., we do not use any pooling operations or convolutions with stride. This is motivated by the observation that spatial feature aggregation tends to negatively impact performance on remote sensing images, as their spatial resolution is already limited and one cannot afford to lose further spatial detail. Furthermore, we have added separate entry blocks for the optical and SAR channels, allowing for domain-specific low-level processing. Finally, we have added a global pixel shortcut that re-injects the raw pixel values before the final regression layers, in order to better preserve high-frequency details that otherwise tend to get blurred (as information over an increasingly larger receptive field is mixed through repeated convolutions). We visualize our model architecture in Fig. 5 and in the following, the data flow in the model is described in more detail.
Entry blocks. Each entry block consists of a convolutional layer, followed by batch normalization (Ioffe and Szegedy, 2015) and a rectified linear unit (ReLU) non-linearity, which truncates negative values to zero. The convolution in the optical entry block has kernel size 1, i.e., it only mixes the channels at every individual pixel, while a larger kernel size of five is used in the SAR entry block to allow for learned spatial smoothing of the more noisy SAR data. For all convolution operations in our model that have kernel size >1, we apply appropriate zero-padding to preserve the spatial dimensions of the input.
Multi-branch convolution blocks. The outputs of both entry blocks are concatenated along their channel dimension and fed through N stages stages, where each stage i ∈ {1 . . . N stages } consists of N blocks,i ResNeXt blocks. Each ResNeXt block in turn is made up by a 1 × 1 convolution layer, a grouped convolution layer (Krizhevsky et al., 2012;Xie et al., 2017) with N groups groups and kernel size three, and another 1 × 1 convolution layer. The former two are each followed by a batch normalization and ReLU, whereas the latter is followed only by batch normalization. Following the ResNet principle, a skip connection bypasses each block, such that the block learns an additive residual update to its input, making it possible to training much deeper networks (He et al., 2016). Due to the grouped convolution, each block effectively implements a multi-branch computational graph, where each branch can be interpreted as a lower-dimensional feature embedding, and where all branches are eventually combined by summation. This layout has been shown to offer superior predictive performance (Xie et al., 2017)  classical ResNet design (He et al., 2016).
Regression heads. Finally, from the resulting feature map, two parallel heads regress the parameters of the likelihood function for each structural variable and pixel. We assume a Gaussian likelihood, therefore for every pixel five means and log-variances are regressed. The variance of each predictive distribution will be trained to resemble aleatoric uncertainty, and is output in its logarithmic form for numerical stability (see Sec. 4.3). Each head consists of two convolution layers with kernel size one and an intermediate ReLU operation, gradually reducing the number of channels in the representation from 2304 down to five. In order to constrain the predicted mean values to their valid ranges (see Sec. 3), we apply a final exp(·) activation to the P95 and MeanH predictions, as well as a sigmoid activation σ(x) = 1 / (1 + e −x ) to the Dens, Gini and Cover means.

Model training
As commonly done in deep learning, we iteratively learn the model parameters with stochastic gradient descend, starting from a random initialization. In each iteration, we randomly sample a batch of B = 64 reference data patches of size 15 × 15 pixels, where a patch is only considered for training if the center pixel is forested. We consider a pixel forested if and only if it contains vegetation points (points with Dz > 1.3 m, see Sec. 3.1) and also is considered forested based on NIBIO's Norway-wide timber volume map (Astrup et al., 2019). We use the latter as an additional precautionary measure to avoid unnecessary noise from non-forested areas, as we are interested in learning forest characteristics only. For every reference data patch, we randomly pick an optical image from the correct year and two SAR images (one ascending and one descending orbit) with acquisition dates near the one of the optical image. Using SAR with both ascending and descending orbits is expected to add robustness against terrain-induced geometric distortions such as shadowing, foreshortening and layover (Small et al., 1995;Carrasco et al., 1997). In total the model input is composed of 12 optical bands forming a tensor of size B×12×15×15, and four SAR bands (two polarizations × two orbital directions), forming a corresponding B ×4×15×15 tensor. The model output are two tensors of shape B×5×15×15 each, one for the means of the five structural forest variables and one for their variances. An appropriate loss function measures the deviation between the model output and the ALS reference data, see below. Note that we only calculate this loss for pixels that are forested according to the above definition. Then, the gradients of all model parameters θ w.r.t. that loss (plus some regularization term) are computed with back-propagation (Werbos, 1982) and used to update the parameters in the direction of steepest descent. We use the Adam (Kingma and Ba, 2015) variant of stochastic gradient descent (SGD), which adaptively scales the magnitude of the parameter updates based on the statistics of previous updates to speed up convergence. During training, we periodically evaluate the prediction error of the model (i.e., the current set of parameters) on a held-out validation set and keep the configuration θ * with the lowest error as the final model.

Loss function
The loss function, which is optimized during training, measures the quality of a set of network parameters θ w.r.t. the training data D , under some regularizing prior assumptions. We use a standard loss function L(D; θ) whose minimization corresponds to maximizing the posterior probability of the parameters given the training data. As it is commonly done in machine learning (see e.g. Goodfellow et al. (2016)), we assume a zero-mean isotropic Gaussian prior over the network parameters (corresponding to L 2 regularization) and a Gaussian likelihood function with meanμ i :=μ(x i ; θ) ∈ R 5 and diagonal covariance matrix with logarithmic elementsŝ i :=ŝ(x i ; θ) ∈ R 5 : Here, i ∈ {1 . . . N } indexes the data point while j ∈ {1 . . . 5} indexes one of the five structure variables we are aiming to predict. The hyperparameter λ is inversely proportional to the variance of the parameter prior. Note that by explicitly predicting aleatoric uncertainty, in the form of log-variancesŝ(x i ; θ) of the output variables, the model learns to reduce the influence of data points on the loss that it deems particularly noisy, which in turn improves model performance (Kendall and Gal, 2017). Learning logvariances constrains the variances to positive values and prevents a potential division by zero in the loss function. The derivation and some further explanation of the loss function Eq. 1 is provided in Appendix A.1.

Acquiring uncertainty estimates
Letμ(x * ; θ) andσ 2 (x * ; θ) be the mean and variance of the predicted distribution that the network with parameters θ outputs when shown the test image x * . Furthermore, let p(θ | D) denote the posterior distribution over the network parameters, given the prior and the likelihood function defined in Sec. 4.3.
The exact predictive distribution p(y * | x * , D) of our model is intractable to compute, mainly due to the extremely high dimensionality of the parameter space. However, it is possible to sample from an approximate p(θ | D) by training an ensemble of multiple neural networks from the same data (but with different random initializations and random batches for SGD). With M different networks this gives rise to a Monte Carlo approximation of the posterior, whereμ * ,k :=μ(x * ; θ k ) andσ 2 * ,k :=σ 2 (x * ; θ k ) are the mean and variance predicted by the k-th network.
In the approximate predictive distribution derived in Eq. 2, epistemic (i.e., model) uncertainty is captured by sampling multiple models θ k , which will lead to widely scattered predictions and thus to high uncertainty in regions of the input space that are not sufficiently backed by training data. Aleatoric uncertainty, on the other hand, is learned from data and thus predicted as the variance of the likelihood function for each output variable at each image pixel. Note that for conceptual and computational simplicity, the likelihood is limited to have a diagonal covariance matrix. While this restricts the network's flexibility to express correlations in aleatoric uncertainty, we empirically find this modeling decision to be sufficient for the task at hand. Finally, based on Eq. 2, the tower rule and the law of total variance deliver approximations for the total mean and variance of the predictive distribution: In Sec. 5.2, we will empirically study the quality of the uncertainty estimate (4), and demonstrate how it can be used to identify and filter out unreliable predictions.

Implementation details
We have implemented our models in PyTorch (Paszke et al., 2017). We trained M = 5 models with batch size B = 64 and a base learning rate α = 10 −4 . The learning rate is automatically reduced by a factor of 0.1 when the validation loss has not improved for 15 consecutive epochs. We apply weight decay to control the strength of the unit Gaussian prior, with an empirically chosen magnitude of 10 −3 that is inversely proportional to the hyperparameter λ from Eq. 1. We chose β 1 = 0.9, β 2 = 0.999 and = 10 −8 as hyper-parameters for the Adam optimizer. Each neural network was trained on a single Nvidia RTX2080Ti GPU for ∼14 days.

Experimental Results and Discussion
During evaluation, we slide a window of size 15 × 15 pixels over the test regions of our data. We use a stride of nine and only retain the innermost 11 × 11 pixels for every window location, thus allowing for a sufficiently large spatial input context for all predictions. Fig. 6 depicts the model output for a diverse 180 × 180 pixel example region sampled from the East area of the test set, along with the ALS reference data and the absolute error between the predicted mean and the reference data. Qualitatively, a high correspondence between the predicted mean and the reference data can be observed, indicating that the model has successfully learned to predict the respective variables for a very diverse sample region. Further, a correlation between predicted uncertainty and absolute error can be observed, although the latter is more "grainy", an attribute that we believe is transferred from the reference data, that can vary strongly between adjacent pixels (this is not that visible as in the error maps, because the overall value range is larger). The correlation between those quantities is not as visually obvious as the previous one, which calls for a more extensive investigation into the predicted uncertainties. Similar to Fig. 6, we have added qualitative examples from the North and West areas in Figs. B.12 and B.13 in the appendix.
In the following subsections, among other experiments, we will quantitatively validate the observations made above (good reconstruction of the true structural variables as well as uncertainty estimates being representative of the error) for the entire test set.

Evaluation of forest variables prediction
For each variable, indexed by j ∈ {1 . . . 5}, we report the Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and the Mean Bias Error (MBE) defined as follows: To represent the predictive distribution for each pixel i and variable j as a point estimate, we use its approximate meanμ ij in accordance with Eq. 3. The MAE and RMSE metrics measure the average deviation between model prediction and reference data (with RMSE giving higher penalty to large deviations), while the MBE serves to identify systematic biases in the predictions. In addition to the above metrics, we calculate their normalized counterparts MAE%, RMSE% and MBE% by dividing by the corresponding mean values over the training data. Tab. 1 shows the results of the test set evaluation of the final model. Overall (subtable a), we achieve low mean absolute errors for all predicted variables, e.g., 1.65 m for the 95th height percentile (P95). Assessing the relative errors, the experiments show that the developed model produces consistently accurate predictions with an MAE% always smaller than 15%. The variables ranking in order of increasing relative error (MAE%) are forest cover (Cover, 11.2%), vegetation density (Dens, 11.8%), Gini index (Gini, 12.4%), the 95th height percentile (P95, 12.9%) and lastly the mean height (MeanH, 14.4%). When evaluating the individual geographic regions shown in Fig. 2, we observe comparable values for the eastern region (subtable b), which can be explained by this region making up for the majority of reference data points. Interestingly, the model performs even better in the northern test region in terms of mean absolute error of P95 and MeanH, however the error for the remaining structural variables is slightly bigger. For the western test region, evaluation errors are consistently higher, yielding e.g. an MAE in P95 of 1.85 meters and an MAE% of 14.4%. Possible reasons for the slightly lower performance in the western region are the scarcity of cloud-free data due to oceanic climate, and the increased topographic complexity. The fjord landscape in these western areas is characterized by very steep and narrow valleys causing both topographic shading and large variations in sun-target-sensor geometry. All of these factors have a negative impact on the quality of the satellite observations and thus the predictions obtainable in such areas. In addition, the forests in these areas are more diverse in terms of species compositions and structurally complex compared to the other areas. It is noteworthy that, for all of the tested regions, the MBE% is <0.7%, indicating the absence of any systematic model bias, regardless of the predicted variable and geographic region. Our model can therefore be considered an unbiased estimator for forest structure, a significant benefit for downstream tasks such as decision making based on the resulting maps.
For further analysis of our model, we provide confusion plots for all variables in Fig. 8. For the plots, all test samples are binned with regard to both their true, ALSderived reference value (x axis) and their predicted mean (y axis), the color of a bin indicating how many samples fall into the respective bin. Higher densities along the identity line imply higher accuracy of the model, and indeed we can observe very high agreement between prediction and reference data, for all variables.
In addition, we conduct a residual structure analysis of each variable with respect to all others, i.e., we investigate how the residuals in one variable are distributed across other variables' value ranges. Note that to get more stable estimates, we remove data points exceeding the 99th percentile of the respective variable before binning. For this analysis, we used b = 10 bins. In Fig. 7, the column indicates the query variable, whose ground truth values (clipped to the 99th percentile for stability) are discretized into b bins and define the ordering along the x-axis. The row indicates the target variable whose residual distribution is plotted (so, for instance, the fourth graph in the top row are the P95 residuals ordered by the corresponding pixels' Gini coefficient). The solid green line denotes the mean residual, the shaded area the residuals' standard deviation. Mean residuals close to zero (grey line) mean that the estimation of the target is unbiased at the given value of the query. We observe that our predictions generally have low bias across a large portion of the range, which is consistent with our findings in Sec. 5.1. A notable exceptions is the under-estimation of the height-related variables P95 and MeanH on high trees -a well-known effect when retrieving vegetation height from satellite images Potapov et al. (2021); Lang et al. (2019). A similarly pronounced under-estimation bias at the top of their own value range can be noticed for the canopy density (Dens) and for the Gini coefficient (Gini); moreover, high Gini values also tend to cause underestimation of P95. In the opposite direction, we also observe some (weaker) over-estimation biases for low reference values. This concerns in particular the Dens and Cover estimates at low values of Dens, Gini and Cover. Overall, the biases are moderate and exhibit the typical behavior of regressors fitted to minimize mean squared error. That is, they tend towards the data mean and, when the evidence is weak or ambiguous, over-estimate low values and under-estimate high ones. Left tile: The model input is composed of a Sentinel-2 optical image (12 bands), and two preprocessed Sentinel-1 SAR images taken from an ascending and descending orbit direction, respectively. For visualization, the amplitudes of the VH and VV polarizations have been assigned to the red and green channels of the RGB image. Right tile: ALS reference data, the predicted mean and standard variation as well as the absolute prediction error |μ ij − y ij | for each of the five structural variables. The figure shows high correspondence between reference data and the predicted mean, indicating that our model is able to accurately regress the structural variables. There is also a clear (albeit not as crisp) correlation between predicted standard variation and absolute error, indicating that erroneous predictions are assigned higher uncertainty. We analyze the latter in more detail in Sec. 5.2.

Uncertainty evaluation
We evaluate the quality of the model's uncertainty estimates (Eq. 4), i.e., how well the predicted uncertainty correlates with the expected error of a given prediction. If a model is perfectly calibrated, then, for any value of the predicted variance, the expected squared error between the true value and the predicted mean should be equal to the predicted variance (by definition of the variance). To obtain an estimate of the expected squared error, we group predictions into bins based on the predicted uncertainty and compare, for each bin, the mean squared error with the mean predicted variance of all samples in the bin. For a more intuitive interpretation, in the units of the original variables, we take square roots and compare the empirical RMSE and the root mean variance. In Fig. 9, the result of this comparison is shown for 20 bins for the P95 variable. It can be observed that the plot corresponding to the ensemble follows the identity line closely, implying highly reliable uncertainty estimates. Only for higher-variance predictions (Var[y * | x * , D] 12) we can observe a slight under-estimation of uncertainty. Each of the individual neural networks, in contrast to the ensemble, suffers from much more severe under-estimation of uncertainty (i.e., they are over-confident) throughout all uncertainty levels.  In each plot, the x axis denotes the ALS reference data value and the y axis denotes the mean of the predicted likelihood distribution. The number of data points that fall into each 2D bin is indicated by the color. The plots demonstrate that most predictions are located very close to the identity line, where prediction and ALS reference data agree.  Fig. 2 (subtables b -d). P95 and MeanH metrics are given in meters, while Dens, Gini and Cover are reported as fractions. All normalized metrics (postfixed with %) are given as a fraction over the respective training set mean.

P95
MeanH Dens Gini Cover  This is consistent with the literature, where it has repeatedly been observed that neural network models tend to be overconfident about their predictions (Guo et al., 2017;Lakshminarayanan et al., 2017). The calibration plots for the remaining structure variables are qualitatively similar and can be found in the appendix (Fig. B.14). Note that even though the ensemble reaches the best calibration for all structure variables (compared to the individual neural networks), the individual networks' uncertainty estimates are, in our specific case, already very good. We attribute this to the large and diverse dataset that was available for training, such that only few test samples are insufficiently represented by the training set and require the (approximate) Bayesian marginalization to achieve good model calibration.
Another way to validate the usefulness of the predicted uncertainty values is to observe how error metrics change when retaining a decreasing fraction p ∈ (0, 1] of least uncertain predictions. The intuition is that if one discards predictions with high uncertainty and the latter is indeed a good predictor of the expected error, then the average error of the remaining predictions should decrease. Fig. 10 displays this relationship between the fraction of retained pixels and the MAE%, for all structural variables. Indeed, all curves increase monotonically, implying that the predicted uncertainty correlates (in expectation) with the actual error at a given test sample.

Sensor ablation study
We investigate how much the reported accuracy depends on the individual optical and SAR image inputs. To this end, Tab. 2 reports the MAE when training a network with different input data configurations. Unsurprisingly, the default configuration that uses one Sentinel-2 optical and two Sentinel-1 SAR images ("S2+S1") performs best w.r.t. all predicted structure variables. What stands out is that removing the optical information ("S1 only") is significantly more detrimental than removing the SAR input ("S2 only"), indicating that the former is more predictive of forest structure. We assume that this is mainly due to two reasons: (1) Sentinel-1 imagery is noisier, which impairs the analysis of small structures at the level of individual pixels and below; and (2) Sentinel-1 is more heavily affected by topographic influences. In particular, the mountainous topography prevalent in Norway can give rise to SAR-specific effects, such as shadowing, foreshortening and layover (Small et al., 1995;Carrasco et al., 1997).
The ablation study further shows that including SAR data from both ascending and descending orbits is beneficial in terms of regression performance, compared to using only one (randomly chosen) orbit direction during training and testing. 3 As an explanation, we suspect a combination of two effects: (1) The model has a second observation for any given pixel, thus benefiting from redundancy to suppress noise; and (2) the scene is illuminated from two different directions, which helps to mitigate SAR-specific effects as outlined previously.
Lastly, the ablation study demonstrates that the impact of removing one of the SAR images is far less pronounced when additional optical information is available ("S2+S1" vs. "S2+S1Rand"), compared to a configuration that solely relies on SAR ("S1 only" vs. "S1Rand only"). This outcome further supports the finding that optical images are the more predictive data source for forest structure. Table 2: MAE of the reported structural variables for various input configurations. The version that uses both Sentinel-2 optical and Sentinel-1 SAR imagery performs best, while leaving out the optical input is more harmful compared to leaving out the SAR. The experiments also demonstrate the effectiveness of using SAR images from both ascending and descending orbits, as compared to only one, random orbit direction ("S1Rand").

Country-wide forest structure map
To demonstrate the applicability of our method at country-scale, we compute a Norway-wide map of forest structure variables for the year 2020 and make it publicly available. To be more robust against clouds and to avoid gaps, we use T = 10 optical images per location that were acquired throughout the leaf-on-period June to October. We also add one SAR image to every location and pair it with the optical data. We feed all eleven inputs per location to our model, which consists of five independently trained neural networks as described in Sec. 4. This ultimately yields ten predictions for mean and variance for every geographical location. To produce an easy-to-interpret map, we collapse the ten predictive distributions into a single point estimate by applying an inverse-variance weighting: for every pixel i and structural variable j. This scheme poses a natural way of combining multiple measurements into a single forest structure map, exploiting the (previously demonstrated) property of our model to return well-calibrated uncertainties. Intuitively, predictions that are assigned a high uncertainty by our model should have less influence on the final results compared to predictions of high confidence. It can be shown that the resulting, inverse-variance weighted forest structure estimates have the smallest expected error among all possible weighted averages (Strutz, 2010). Therefore, by means of outputting well-calibrated predictions, we are able to combine multiple measurements in an theoretically optimal way, which is not possible when only point estimates are available. Fig. 11 shows overview images of the generated map for each of the five structural forest variables. We mask out areas that are not considered forested according to the NIBIO timber volume map (see Sec. 4.2) as already done for training our model 4 . Figure 11: Overview images of the five channels of the produced Norway-wide forest structure map, corresponding to the five structural variables. Non-forested areas have been masked out and labeled as "No forest".

Conclusion
The main finding of our research is that forest structure indicators can be mapped at country-scale from publicly available Sentinel data, in combination with modern deep machine learning and open airborne laser scanning (ALS) data as reference data for training. The availability of these maps at almost any point in time (starting from the beginning of the Copernicus program in 2015) opens up new possibilities for large scale forest monitoring and resource mapping and is thus a promising tool for understanding and tackling challenges such as climate change or biodiversity loss. A straightforward application of the predicted structural variables could be updating existing nationwide forest resource maps hitherto produced on the basis of the same variables, but derived from ALS data (Nord-Larsen and Schumacher, 2012;Nilsson et al., 2017;Astrup et al., 2019). Given the large costs for updating these maps, our Sentinel-derived forest structure variables could provide inexpensive auxiliary data for producing dense time series of forest resource maps. The availability of such dynamic maps can help to better understand the effect of forest management practices on forest biomass dynamics and on the variations in functional diversity. In addition to their use for mapping traditional variables (e.g., above-ground biomass), dense time series of canopy height (e.g., P95) from Sentinel data allow one to derive the canopy height vertical growth, which in turn can be used to measure site productivity (Lennart Noordermeer and Naesset, 2018;Svein Solberg and Puliti, 2019). In addition, the produced maps can also be used to provide proxies to identify and map forest disturbances on a large scale (Hansen et al., 2013;Senf et al., 2018).
An innovative aspect of our method is that it additionally predicts variables describing the vegetation density (Dens) and its variation in the vertical profile of the canopy (Gini), different to previous studies that focused on canopy height or cover only. These additional variables are important as they provide complementary information to the height and the cover and thus allow for a more comprehensive understanding of forest structure on a large scale. The combined use all of these variables further opens up new possibilities to address complex issues, such as the quantification of biomass losses and gains from subtle land-use changes, e.g., forest degradation and forest restoration.
A further novelty of the presented method, compared to previous applications of deep learning to remotely sensed data, is that it outputs pixel-wise, well-calibrated output distributions instead of mere point estimates. The quantification of the predictive uncertainty makes the system more reliable and more trustworthy. In particular, estimates of the expected predictive error can be propagated to downstream tasks, such as higher-level mapping systems that rely on our forest structure estimates as input. Even further downstream, maps ultimately serve as a basis for forest management, where well-calibrated uncertainty estimates ("knowing what we don't know") translate to better decision making.

Appendix A. Derivations
In the following, we provide mathematical derivations for the employed loss function (Eq. 1) and the moments of the posterior predictive distribution (Eqs. 3 and 4).
Appendix A.1. Maximization of the parameter posterior A standard approach in machine learning (see e.g. Goodfellow et al. (2016)), we start with the maximizer of the logarithmic posterior parameter probability (i.e., its mode) and then deduce the equivalence to the loss function presented in Eq. 1: In Eqs. A.1b and A.1c, we apply Bayes' rule (in log space) and then drop the evidence term as it does not depend on θ.
We assume independent and identically distributed labels given the respective input and the model parameters and can thus factor the likelihood into individual data point likelihoods (Eq. A.1d). In Eq. A.1e, N (· | µ, Σ) denotes the probability density function of the multivariate Gaussian normal distribution with mean µ and covariance Σ. We assume an isotropic Gaussian with variance σ 2 p as prior over the parameters, as it is common in machine learning (this is equivalent to L 2 regularization). In the same line, Σ i = diag(σ 2 i ) is a diagonal matrix containing the predicted variances for all structural variables as diagonal elements. To arrive at Eq. A.1f, we plug in the Gaussian density function and observe that λ ∝ 1/σ 2 p governs the strength of regularization. In Eq. A.1g, j ∈ {1 . . . 5} indexes the structure variables and we again useŝ ij = logσ 2 ij .
Further ablation study results. Supplementary to Tab. 2, where we reported the MAE of our method for various input configurations, Tab. B.3 show the other error metrics for these experiments.
Further calibration plots. Fig. B.14 shows the calibration plots similar to Fig. 9 for the remaining structure variables. As for P95, we observe a systematic underestimation of the variance for the individual networks, and a clearly better uncertainty calibration for the ensemble. The effect is more or less pronounced, depending on the variable. For all plots we have again used 20 uncertainty bins and the same noise removal technique as explained for Fig. 9.
Baseline comparison. At the time of writing, we are not aware of any comparable method that predicts diverse forest structure variables from optical and SAR images, and that we could compare to. As a baseline to better understand how our method fares in comparison to a common "best practice" method for vegetation mapping, we train a random forest (RF) regressor on our data, using per-pixel spectral intensities as inputs. Because of the size of the training dataset (≈65 million data points), we resort to bootstrapping with a sample ratio of 0.05 and 20 ensemble members. Tab. B.4 compares the test set results of the RF regressor to those of our method. In terms of both MAE and RMSE, the proposed neural network outperforms the RF baseline by large margins for all five structure variables, in most cases reducing the deviations from ground truth by 40 − 50% (except for Gini, where the gains are only 25 − 30%). The MBEs are generally low, underlining that both methods are unbiased within the expectable measurement accuracy: all relative MBEvalues lie within ±2%. Overall, the experiment confirms the superior predictive power of deep feature extractors, in particular when trained on large data sets.  Figure B.13: Qualitative prediction sample (180 × 180 pixels, 324 ha) from the West region, analogous to Fig. 6. Figure B.14: Calibration plots for the MeanH, Gini, Dens and Cover variables, similar to those in Fig. 9.