Global canopy height regression and uncertainty estimation from GEDI LIDAR waveforms with deep ensembles

NASA's Global Ecosystem Dynamics Investigation (GEDI) is a key climate mission whose goal is to advance our understanding of the role of forests in the global carbon cycle. While GEDI is the first space-based LIDAR explicitly optimized to measure vertical forest structure predictive of aboveground biomass, the accurate interpretation of this vast amount of waveform data across the broad range of observational and environmental conditions is challenging. Here, we present a novel supervised machine learning approach to interpret GEDI waveforms and regress canopy top height globally. We propose a probabilistic deep learning approach based on an ensemble of deep convolutional neural networks(CNN) to avoid the explicit modelling of unknown effects, such as atmospheric noise. The model learns to extract robust features that generalize to unseen geographical regions and, in addition, yields reliable estimates of predictive uncertainty. Ultimately, the global canopy top height estimates produced by our model have an expected RMSE of 2.7 m with low bias.


Introduction
Forests play a key role in the carbon cycle and in the regulation of the climate (Mitchard, 2018). However, estimates of carbon fluxes between the atmosphere and the terrestrial biosphere suffer from high uncertainty, with a relative standard deviation of up to 47% (Friedlingstein et al., 2019). For the last decade (2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018), annual CO 2 emissions from global land-use change, mainly from deforestation, account for 1.5±0.7 GtC yr −1 , and the global terrestrial CO 2 sink amounts to 3.2±0.6 GtC yr −1 (Friedlingstein et al., 2019). Although the aboveground biomass (hereafter "biomass") of forests is crucial to estimate terrestrial carbon flux, existing global biomass products differ significantly in spatial patterns (Mitchard et al., 2013;Avitabile et al., 2016;Mitchard, 2018;Ploton et al., 2020;Spawn et al., 2020). Tropical carbon stocks are particularly difficult to estimate due to spatial heterogeneity and saturation of remote sensing signals (Köhler and Huth, 2010;Mitchard et al., 2014;Phillips and Lewis, 2014), contributing to high uncertainty about the tropical land-use change flux (Pan et al., 2011). This lack of accuracy in global biomass maps hinders our ability to assess the impact of deforestation and subsequent regrowth on atmospheric CO 2 concentrations, limiting the quality of climate simulations and forecasts (Kunreuther et al., 2014). In addition, better maps of aboveground biomass are needed to * Corresponding author: nico.lang@geod.baug.ethz.ch protect various ecosystem services, by optimising spatial planning and by identifying priority areas for conservation and restoration (Strassburg et al., 2020).
Canopy height is an important predictor for aboveground biomass (Jucker et al., 2017;Dubayah et al., 2010;Qi et al., 2019b;Carreiras et al., 2017;Asner and Mascaro, 2014;Silva et al., 2018;Drake et al., 2002a,b). It is also a key morphological trait which, along with composition and function, helps to monitor ecosystem response to climate and land use change as well as restoration (Schneider et al., 2017;Valbuena et al., 2020). Additionally, ecosystem structure, which is related to canopy height and its heterogeneity, is an essential biodiversity variable and a strong predictor for species richness at local to global scales (Kreft and Jetz, 2007;Gatti et al., 2017;Marselis et al., 2020). Previous pan-tropical and global biomass maps (Baccini et al., 2012;Saatchi et al., 2011) were based on canopy height derived from space-borne LIDAR data collected by NASA's Ice, Cloud and Elevation Satellite (ICESat) (Abshire et al., 2005). ICESat was optimised to retrieve ice surface elevations and ceased operation in 2010. A new space-borne LIDAR, NASA's Global Ecosystem Dynamics Investigation (GEDI), was launched to the International Space Station (ISS) in late 2018 (Dubayah et al., 2020a). GEDI is the first space-borne, full waveform LIDAR that is specifically designed to measure ecosystem structure by providing vertical profiles of forest canopies with 25 m footprints on the ground. After its nominal mission duration of two years, the mission is expected to have sampled 4 % of the land surface between 51.6 degree north and south on a sparse sampling grid.
One of the major challenges of using space-borne LI-DAR concerns the processing of its waveforms. Features such as the location of the ground and the top of the canopy must be identified to derive canopy height and other structure variables, such as canopy cover and vertical plant area profiles. The effects of known (e.g., pulse shape, digitizer noise) and unknown (e.g., slope, multiple scattering) properties are often difficult to explicitly model in the separation of canopy and ground signals and determination of ranging points along the waveform. For example, under dense canopy cover conditions, the ground return may be quite weak; detecting this weak signal against high background noise is difficult. Similarly, detecting the very tops of trees, where there may be little leaf area, and thus a weak return, will lead to errors in height recovery (Dubayah et al., 2020a).
Traditional approaches to LIDAR waveform processing  have relied on conventional signal processing, but GEDI presents a challenge to these methods for several reasons. First, it will produce an unprecedented high number of fine spatial resolution observations. These span the huge range of topographic, climatic, and edaphic conditions across the Earth's temperate and tropical forests, leading to tremendous structural variability of the waveforms. Secondly, GEDI utilizes both high and low power beams which, when coupled with the structural diversity, variations in noise levels, and potential instrument artifacts, complicate the calibration. The unprecedented amount of space-borne measurements from GEDI thus presents a challenge, but also an opportunity to explore new data-driven algorithms based on machine learning.
In this work, we propose one such data-driven approach, based on state-of-the-art probabilistic deep neural networks, that estimates global canopy heights together with well-calibrated uncertainties from geolocated (Level 1B/L1B) GEDI waveforms (Dubayah et al., 2020c). Our approach increases the retrieval performance compared to initial GEDI canopy height products (Level 2A/L2A) that were processed with traditional methods, but using provisional calibrations (Dubayah et al., 2020b), reducing the root mean square error (RMSE) from 4.4 m to 2.7 m and the underestimation bias from −1.0 m to −0.1 m (mean error). We parameterize this direct, functional mapping from GEDI L1B waveforms to canopy top heights with a one-dimensional convolutional neural network (CNN), with a residual block architecture (ResNet, He et al., 2016) at its core. Our proposed CNN uses the full GEDI L1B waveform as its input and transforms it into representative features (recognizing peaks and patterns in the waveform) that are suitable for the regression of a waveform height metric (the 98 th percentile height, or RH98) as a proxy for the canopy top height.
A key advantage of our probabilistic approach is that the model directly provides a well-calibrated uncertainty together with every estimate. This property is crucial to model deployment at global scale, because reference data for validation are limited. We model two sources of uncertainty: the aleatoric and the epistemic uncertainty (Der Kiureghian and Ditlevsen, 2009;Kendall and Gal, 2017). The former describes the stochasticity in the data (e.g., caused by constantly varying atmospheric transmission and surface reflectance, as well as sensor noise), which is inherent, and is not reduced by collecting more training data. This uncertainty is estimated by minimizing the Gaussian negative log likelihood, a common approach for regression tasks (Gast and Roth, 2018). In other words, instead of outputting a point prediction, the network predicts a distribution over the output, which is parameterized by a mean and a variance, to approximate the conditional distribution of the target canopy height given an input waveform. The second source is the epistemic uncertainty (also known as model uncertainty) that describes imperfect knowledge of the model parameters. In our supervised machine learning approach, this knowledge is extracted from the training data. In contrast to the aleatoric uncertainty, it can be reduced by collecting more samples to get a more complete representation of the data distribution. We estimate the epistemic uncertainty with an ensemble of ten separate CNNs that are trained independently, starting from different random initializations (Lakshminarayanan et al., 2017). The variance computed over their individual mean predictions yields an estimate of the epistemic uncertainty and the aleatoric uncertainty is computed by averaging the individual variance predictions. The overall predictive uncertainty assigned to each canopy height estimate is the sum of both sources. This predictive uncertainty plays an important role to inform downstream applications that build on the canopy height. Furthermore, accompanying canopy height estimates with uncertainties enables the user to directly filter them to reduce the overall error. For example, in our case, discarding the 30% most uncertain predictions reduces the error by 25%, while preserving the full range of canopy heights.
To train and evaluate the proposed neural network, a dataset of 68,483 on-orbit GEDI waveforms were matched to the canopy top heights (so-called RH98, which is the height of the 98th percentile of the cumulative waveform energy relative to the ground) derived from airborne laser scanning (ALS) reference data. To reduce noise caused by geolocation errors of the first release GEDI data (mean horizontal offset ≈20 m), blocks of GEDI on-orbit waveforms are optimally correlated to simulated waveforms derived from the ALS data (Hancock et al., 2019). This reduces geolocation errors to a few meters on average and yields a high quality training and validation dataset. We examine the error characteristics of our CNN predictions regarding forest structure, topography, and laser power. To test if our model is globally applicable, we carry out a geographic cross-validation, where we train the model without any samples from a particular continental region and then test it on that held-out region. This avoids overly optimistic interpretations due to spatial auto-correlation (Ploton et al., 2020).
We present global canopy height maps based on the first four months of L1B mission data to demonstrate that our approach may be applied across the vast range of conditions that exist at the global scale. While this work focuses on the estimation of the RH98 as a proxy for the canopy top height, we also show that the proposed machine learning approach is generic and can be applied to estimate other forest structure variables and the ground elevation from on-orbit GEDI waveforms.

GEDI L1B waveforms product
The GEDI laser system is mounted onboard the international space station (ISS) flying at 415 km altitude and measures along eight ground tracks that are 600 m apart. Four of the ground tracks are produced by two full-power lasers and the other four tracks by one so-called coverage laser with reduced power. Individual laser shots are spaced 60 m along track and have an approximately 25 m footprint on the ground. First measurements were acquired in April 2019 and after its nominal life time of two years, ten billion waveforms will be measured over the land surface at 25 m spatial resolution (Dubayah et al., 2020a). GEDI provides the L1B full waveform product (Dubayah et al., 2020c) with a maximum waveform length of 1420 vertical bins and a fixed interval of 1 ns (∼15 cm) between consecutive returns.

GEDI L2A elevation and height metrics product
The L2A GEDI mission product contains elevation and relative height metrics derived from the L1B waveforms (Dubayah et al., 2020b). The first release of the L2A product (R001) contains six different algorithm setting groups that are designed to minimize false positive errors in detecting the ground and canopy top under a wide range of environmental conditions. In addition, the L2A product provides geographical locations of the lowest mode (ground) and the percentage of non-vegetated area from the MODIS product MOD44B Version 6 Vegetation Continuous Fields (Dimiceli et al., 2015).

Airborne LIDAR reference dataset
To learn robust features that capture the global canopy height distribution and generalize to different biomes and forest types, the proposed CNN requires a representative dataset of corresponding waveforms. An extensive airborne LIDAR dataset has been collected by the GEDI mission (Dubayah et al., 2020a) that spans five continental regions (Fig. 1a) and covers a diverse set of forest structures in terms of canopy top height (RH98), canopy cover, and plant functional type (extracted from the MCD12Q1 (Sulla-Menashe and Friedl, 2018)). The dataset also spans a broad range of topography (flat to steep terrain) between −36.8 and 51.8 degree latitude (Fig. 1b). All ground slope data is derived from the Shuttle Radar Topography Mission (SRTM) (Farr et al., 2007). North America contributes more than half of the data. For South America and East Asia, there is no reference data available and Europe is scarcely represented in the current dataset.
To construct the training and validation dataset used in this work, GEDI L1B waveforms from the first seven months of the mission were matched with the ALS data. Assigning reference data from ALS to the GEDI waveforms is complicated by geolocation errors in the first release of GEDI L1B data, which introduces noise into the reference dataset. To correct the systematic geolocation error of the GEDI waveforms, sequences (blocks) of GEDI waveforms along individual ground tracks are correlated with simulated GEDI-like tracks of waveforms created from the intersecting airborne laser scanning (ALS) reference data (Blair and Hofton, 1999;Hancock et al., 2019). The process determines the horizontal and vertical offset that maximizes the correlation between on-orbit and simulated waveform sequences. For each ground track, a rigid block of successive GEDI waveforms is shifted to maximize the average Pearson correlation coefficient between the on-orbit waveforms and the simulated reference waveforms. After this transformation, the collocated GEDI and ALS reference data are saved along with the Pearson correlation coefficient of the individual waveforms.
Once we have filtered poor quality offsets (computed using <25 shots and with a correlation >0.9), we use the Pearson correlation coefficient between individual onorbit and simulated waveforms to assess the quality of the match, and thus the quality of the assigned true canopy top height (RH98). Remaining noise in the reference value assignment may exist due to simulation inaccuracies, temporal changes, and the random component of geolocation errors. Empirically, we found that rejecting samples with <0.95 Pearson correlation yields largely clean reference data, which results in a total of 68,483 waveforms with a maximum RH98 of 70.1 m. For these data the systematic geolocation error for individual beams amount to 19.7±10.7 m (mean±1σ), but in the case of orbits with degraded geolocation flags offsets may reach up to 60 m. When applying a higher Pearson threshold, fewer samples (with presumably higher quality) remain in the dataset, e.g., an overly strict threshold of 0.99 would discard a substantial amount of samples (only 19,186 samples retain in the dataset) leading to a less representative dataset in which no samples above 50 m RH98 would be retained.

Waveform preprocessing
Several preprocessing steps are applied to the raw L1B waveforms to harmonize the input features. After sub- tracting the mean noise level of the GEDI L1B product, we normalize the total energy of each waveform to one. Additionally, we set all waveforms to the fixed maximum length of 1420 vertical bins by padding shorter waveforms with zeros at the end of the vector. A fixed waveform length allows the CNN to implicitly learn the absolute scale of canopy heights, as the interval between consecutive returns is fixed to 15 cm. We normalize the distribution of the input features (waveform amplitudes) as well as the targets (canopy heights) to standard normal distributions, a common preprocessing step for neural networks (LeCun et al., 2012). Since the learnable network parameters are randomly initialized in such a way that the distribution of the input data is preserved in the output, that preprocessing leads to faster convergence during optimization. The means and standard deviations of the training data are stored and used to harmonize the test data in the same way.

Convolutional neural network architecture
Our method uses a deep CNN at its core to regress canopy top height. We adapt the ResNet architecture (He et al., 2016) to our needs by, first, replacing all two-dimensional network layers with one-dimensional convolutional, pooling and normalization layers. Second, we found empirically that a network architecture with more regular down-sampling than the original ResNet performs better. More precisely, our SimpleResNet consists of eight residual blocks (Fig. B.8 in the appendix). Each block consists of two consecutive convolutional layers with kernel size 3, each followed by batch normalization and a rectified linear unit (ReLU) activation function. All blocks include a skip connection that adds the input of the block to its final activation map. Such a bypass not only alleviates the vanishing gradient problem of deep neural networks, but simplifies the task of a block to learning a (usually smaller) residual correction to its input. After each residual block, a max-pooling layer reduces the extent of the intermediate feature representations (i.e., activation maps) by a factor of two. This reduces memory consumption and speeds up the computation of the network output. Furthermore, down-sampling gradually increases the receptive field of the learned features, so that deeper features represent larger data segments of the input waveform, encoding more contextual evidence. The last block is followed by an adaptive average pooling layer that averages each final feature map to a scalar value, resulting in a 1-dimensional feature vector. A dropout layer (Srivastava et al., 2014) with drop ratio of 0.5 is used for regularization, before the final, fully connected layer creates the model outputs.

Predictive uncertainty estimation
Uncertainty in deep neural networks originates from two main sources. While the aleatoric uncertainty describes the stochasticity in the data, the epistemic uncer-tainty captures uncertainty of the model, i.e., the ambiguity of its individual parameters. Aleatoric uncertainty is represented with a probability distribution over the model output, describing the conditional probability of different target values given the input, p(y|x). Epistemic uncertainty is represented as a probability distribution over the model parameters (Gast and Roth, 2018). The epistemic uncertainty arises from a lack of knowledge, in the case of a supervised learning task due to limited training data. Therefore, epistemic uncertainty can be reduced by collecting additional data that adds information not present in the current training set. In contrast, aleatoric uncertainty is inherent in the problem and cannot be reduced. We model it by approximating p(y|x) with a Gaussian probability density function. To estimate the epistemic uncertainty, we use an ensemble of ten CNNs (Lakshminarayanan et al., 2017). Note that an ensemble of five models may already be enough to estimate the epistemic uncertainty (Ovadia et al., 2019).
In general, the goal is to approximate the true conditional probability distribution p(y|x, θ) of the target value y (e.g., the ALS canopy height), given an observed input waveform x. To that end we fit the parameters θ of a neural network f θ , which outputs a prediction f θ (x). The most common approach is to train the network to predict a single point-estimateŷ = f θ (x), by minimizing the mean squared error (MSE) between the true values y and the predicted valuesŷ (Goodfellow et al., 2016). Formally, optimizing the MSE is equivalent to parameterizing the conditional distribution as a Gaussian normal distribution p(y|x) = N (ŷ(x), σ 2 ), with the meanŷ(x) as a function of the input x and a constant noise term σ 2 (Goodfellow et al., 2016). We relax this constraint of constant noise and allow more flexibility, by also estimating the noise from the input. Instead of a pointestimateŷ, our network has two outputs [μ,σ 2 ] = f θ (x) to approximate the true conditional distribution with a Gaussian p(y|x) = N (μ(x),σ 2 (x)), with both meanμ and varianceσ 2 as functions of the input x (Gast and Roth, 2018). We thus minimize the Gaussian negative log likelihood (NLL) as follows: Within Bayesian estimation, this is also known as measuring the heteroscedastic aleatoric uncertainty (Kendall and Gal, 2017). In practice, our network outputs the two parameters µ and s, with s = log σ 2 . Converting the second output to the variance σ 2 = exp(s) guarantees that the estimated variance is positive (Kendall and Gal, 2017). For numerical stability, a small number = 10 −8 is added to the estimated variance, to avoid division by zero. We model the epistemic uncertainty with an ensemble of deterministic CNN models (illustrated in Fig. 2), which is computationally more efficient than Bayesian neural networks, and tends to perform better in practice (Lak-shminarayanan et al., 2017;Ovadia et al., 2019). Deep ensembles provide an effective approach to approximate Bayesian marginalization, i.e. marginalizing out parameters instead of optimizing towards a single "best" parameter (Gustafsson et al., 2020;Wilson, 2020;Wilson and Izmailov, 2020). 1 The basis of this strategy is to train M different networks and regard them as random samples from the distribution of models. We do this by starting the training with a different random initialization for each network, along with the randomness due to learning with stochastic gradient descent. At prediction time we run the input waveform through each network and compute the variance of the M outputs. Intuitively, each individual network should generalize about equally well, as long as a test sample lies within the training data distribution. In contrast, test samples that are out-of-distribution should cause arbitrary errors that differ between the m networks, and therefore have high variance. While reliably detecting out-of-distribution (OOD) samples is an unsolved problem and and active direction of research, deep ensembles at present perform best in practice.
Averaging outputs of the individual CNNs corresponds to a mixture of Gaussians with equal weight, as illustrated in Fig. 2. The final prediction of the model ensemble is reported as the meanŷ (Eq. 2) and the variance V ar(ŷ) (Eq. 3) of this mixture distribution: Under our model, the variance of this Gaussian mixture distribution is the predictive uncertainty, composed of the epistemic uncertainty (first two terms in Eq. 3) and the aleatoric uncertainty (last term in Eq. 3).

Model training
We split the dataset into non-overlapping training and test sets. Within the training set, a random subset of 10% is set aside as validation set to monitor the optimization process. Thus, CNN parameters are fitted to 90% of the training set. The validation set is not used in the optimization, but serves to monitor the loss (cumulative prediction error) on unseen data. For each ensemble member, the network parameters with the lowest validation loss are used in the final model, which is then applied to the held-out test set to evaluate model performance. For each individual CNN in the ensemble the learnable parameters (weights) are initialized randomly, and trained by minimizing the Gaussian negative log-likelihood as a loss function (Eq. 1). That minimization is carried out with ADAM (Kingma and Ba, 2015), an adaptive version of stochastic gradient descent (SGD). In each iteration of the optimizer, the loss is computed on a random batch of 64 data samples. Based on that loss, partial derivatives (gradients) with respect to all model parameters are determined by back-propagation, i.e., calculation of the numerical gradients with the chain rule. Model parameters are updated in small steps along the negative gradient direction to minimize the loss. The step size is controlled by a hyper-parameter called the learning rate, which we initialize to 0.0001. ADAM then automatically adapts the learning rate after every iteration. Each model is trained for 200 epochs, with one epoch corresponding to the number of iterations (respectively, random batches) needed to go once through the complete training set. Since shifting the waveform up and down should not affect the canopy top height, we apply random shifts to augment the training data. This can be seen as simulating canopies with the same LIDAR response at different elevations. By drawing shifts uniformly at random from ±20% of the waveform length, we ensure that the network predictions are invariant against such shifts.

Random cross-validation
We perform random 10-fold cross-validation to measure whether the model performance is invariant against different train/test splits. The complete dataset of waveforms with ground reference is split randomly into 10 mutually exclusive subsets. Each subset serves once as the test set, while the remaining nine subsets are used to train the ten CNN models from scratch, i.e. the full ensemble is retrained ten times and evaluated on the corresponding, held-out test fold. Overall performance metrics are calculated over all ten cross-validation folds, such that each waveform was in the test set exactly once. Additionally, we measure the variability of the error metrics between the ten test sets.

Geographical cross-validation
Geographic transferability of the model is tested by splitting into geographical regions. We reserve all available samples from one geographic region for testing and train the models from scratch on the other regions, such that no sample from the test region has been seen during training. With this procedure one can assess whether the model over-fits to the training locations, or whether it learns generic features that also support prediction in unseen locations.

Canopy top
Ground Individual models Ensemble averaging Figure 2: Illustration of the model ensemble. A L1B GEDI waveform is fed into multiple CNNs that have identical architecture, but were trained independently. Each CNN estimates a mean and a variance of the conditional distribution of canopy height. Averaging the individual outputs yields a mixture distribution, whose mean and variance constitute the final estimate.

Evaluation metrics
As main error measure to quantify the deviations between predicted heights and reference heights, we report the root mean square error (RMSE). Moreover, we measure the bias of the prediction with the mean error (ME): whereŷ i denote the prediction of the model ensemble and y i the reference value at sample i. Under this definition, a positive ME means that model predictions are higher than the true values according to reference data. For completeness and to allow comparisons with other works, we additionally report the mean absolute error (MAE): To study relative deviations (height error in percent of the reference height) we use the mean absolute percentage error (MAPE): which normalizes the residuals by the "true" height before computing their absolute differences.

Evaluation of the predictive uncertainty calibration
The predictive uncertainty can be used to rank the predictions of individual samples. A standard tool for the evaluation of ranked retrieval results are precisionrecall curves (Schütze et al., 2008). Beside precision-recall curves, calibration plots are a common strategy for analyzing the estimated predictive uncertainty (Guo et al., 2017). The estimated predictive standard deviation is plotted against the empirical error by marginalizing over subsets of samples. The closer the resulting line in the diagram follows the diagonal, the better is the calibration of the uncertainties. Specifically, for regression tasks with continuous output values (like our canopy height) we expect the predictive standard deviation to follow the RMSE (Laves et al., 2020). In order to get robust, statistically meaningful empirical errors, we must group the test samples. We do this by binning the samples into 1 m intervals based on their predictive standard deviation. For all bins with >200 samples the average predictive STD is plotted against the RMSE (Fig. 5b).

Canopy top height regression
Random cross-validation on this dataset yields a root mean square error (RMSE) of 3.6 m and a mean error (ME) of −0.3 m, indicating a slight underestimation bias compared to the ALS reference (Fig. 3a). To investigate whether the performance is affected by specific data splits, the variation of the error metrics across the ten random folds was analyzed (Tab. 1). Both the RMSE and the ME have 0.1 m standard deviations across the ten folds, i.e., the performance is invariant to the random data splits. The mean absolute percentage error (MAPE) accounts for 26.5 % averaged over all test samples, where samples with low reference values inherently have high relative errors. A detailed residual analysis with respect to forest structure and topography suggests that the canopy height estimates are robust against diverse characteristics (Fig. 4). Negative residuals denote an underestimation of the canopy top height compared to the ALS reference. We observe a slight positive bias for samples with RH98<10 m and a systematic negative bias (underestimation) for higher canopies (RH98>10 m). Yet the median of the residuals is <10 % (≈5 m) for canopies >60 m. A slightly positive bias appears with slope>30 degrees. Both systematic effects seem to be related to underrepresented samples in our dataset (Fig. 1b). The predictions are less affected by canopy cover or plant functional type. Additionally, no considerable biases are observed regarding latitude and continental region (Fig. 4). For both the power and coverage lasers, the residuals have a median close to zero (Fig. 4). While the overall performance is similar for different laser powers, further investigations reveal that laser power affects the performance for high canopies (Fig. C.9a in the appendix). Specifically, the samples with reference RH98>50 m have a stronger underestimation bias with the coverage laser (ME of −12.1 m) compared to the power laser (ME of −5.8 m). Finally, the overall CNN performance (over all heights) is not affected by daylight background noise and appears to be independent of beam sensitivity (Fig. D.10 in the appendix).

Geographic generalization
Deep neural networks have large model capacity, i.e., enough trainable parameters to adapt to details of the training data distribution. It is thus important to verify that our model does not over-fit to particular training regions. Therefore we conducted experiments with multiple geographic train/test splits to investigate geographical generalization (also called geographic transferability (Dubayah et al., 2020a)). In each experiment, all samples from a particular test region are used for evaluation and the model is trained from scratch without seeing any sample from that test region. We present the geographical generalization results for Europe, Australia, Africa, and the tropics (between 23.5 degrees north and south) in Tab. 1, the corresponding scatter plots are included in the appendix (Fig. E.11).
None of those experiments resulted in a higher RMSE. The bias is slightly stronger than with random crossvalidation without geographical generalization (Random CV) with a mean error (ME) of −0.3 m. Canopy height is overestimated in Europe with a ME of 0.7 m, but underestimated in Australia, Africa and the tropics with a ME of −0.9 m. The relative error (MAPE) grows from 25.6% for the random cross-validation to 45.9% in Europe and to 32.1% in Australia. For both Africa and the tropics, it slightly decreases to 24.2%. The reason for this is that in Europe and Australia the canopy heights are largely below 30 m, and with decreasing height the computation of the relative error is increasingly dominated by the constant (not height-dependent) part of the absolute error. (see Fig. E.11 in the appendix). When using all African regions as the test set, the model has seen some data from tropical Asia, compared to the last experiment that holds out all available data from tropical regions as the test set. However, we do not observe any notable difference in performance.
Overall, our experiments suggest that the proposed model generalizes reasonably well to geographical regions not seen during training and is not over-fitted to specific (training) regions.

Estimation of the predictive uncertainty
The estimated predictive uncertainty should, in expectation, correlate with the empirical error (Guo et al., 2017). We employ two different schemes to evaluate the quality of the estimated uncertainty (Fig. 5). One intuitive way to analyze whether the estimated uncertainty is meaningful is to sort the test samples by their predictive uncertainty. Filtering of the test samples with increasingly strict thresholds for the highest permissible uncertainty should then progressively reduce the overall error (Fig. 5a). We observe that the RMSE initially goes down quickly as recall decreases, then flattens and becomes stable against recall. First, we evaluate this filtering over all test samples with a Pearson correlation coefficient >0.95 (resulting from the waveform matching). Removing the most uncertain 10% of the predictions reduces the RMSE from 3.6 m (100% recall) to 3.0 m (90% recall). For example, at the 70% recall, the RMSE reaches 2.7 m. Filtering based on the estimated uncertainty effectively reduces the overall error, which shows that the model uncertainty is predictive of empirical error. Secondly, we apply a stricter Pearson correlation threshold of 0.98 to the test data, keeping only those 53% of the test samples for which the reference data has (presumably) higher quality. We observe that in that setting the RMSE decreases by ≈0.25 m over the entire range of recall, i.e., at 70% recall the RMSE falls below 2.5 m. This behavior indicates remaining inaccuracies in the reference data due to simulator error, GEDI/ALS temporal mismatch, imprecise geolocation, and related factors that are not resolved by the waveform matching. This is directly reflected in the quantitative error metrics, even though its causes are not deficiencies of the CNN model, but of the test reference data.
The second evaluation looks directly at the calibration of the uncertainty estimates. Those estimates are well calibrated if they correlate linearly with the empirical error (RMSE). Our results illustrate that the standard deviations predicted by the CNN are well calibrated (Fig. 5b).     Table 1: Geographic generalization: The first row shows random cross-validation results as baseline, with the standard deviation over ten cross-validation folds in parentheses. The remaining rows correspond to geographic generalization experiments, where the listed continental region was the test data and was held out during training.
with respect to reference data. Comparing the calibration plots of the aleatoric and epistemic uncertainty on its own, shows that the aleatoric uncertainty is the dominating source in the random cross-validation experiment ( Fig. I.15 in the appendix). Nevertheless, the epistemic uncertainty contributes a non-negligible part and may become more prominent for under-represented cases. Qualitatively, we observe that the CNN assigns lowest predictive uncertainty to waveforms with unambiguous signal for the top of the canopy (signal start) and the ground (last mode). In contrast, samples with particularly high predictive standard deviation correspond to waveforms where the canopy top and ground, as features, are less distinct. (Fig. 6). This may occur for a variety of reasons, e.g. waveform noise, such as from background illumination, low-lying dense canopy that may obscure the ground, or canopy shape, such as in conifers, where the amount of reflective leaf material near the top is limited. While the CNN is not immune to those effects that also hinder traditional waveform processing, our results illustrate the general advantage of a probabilistic approach to make reliable predictions, i.e., the model knows in what cases it may be inaccurate.

Filtering canopy height estimates based on the predictive uncertainty
A key advantage of the probabilistic approach is that predictions can be filtered according to the estimated predictive uncertainty, so as to reduce the overall error at the cost of recall. The question is how to best design the corresponding filter. One important aspect for several applications (e.g., estimation of biomass) is that the full range of existing canopy heights should be preserved. We find that both filtering based on the absolute and the relative standard deviation (i.e., the coefficient of variation) have limitations. Such naive filters reduce the overall error, but effective thresholds on the absolute uncertainty remove almost all high canopies. In contrast, thresholding the relative uncertainty removes almost all low canopies (see Fig. F.12 in the appendix).
To preserve the full height range, but still reduce the expected error, we apply an adaptive threshold depending on the predicted canopy height. We empirically find that linearly increasing the absolute threshold with height preserves the full range of predicted heights (Fig. 3b), while causing a reduction of the overall error (Fig. 5a). We therefore propose to filter according toσ < τ (μ + ); i.e., a fixed minimal threshold is chosen for canopy height 0 empirically set to = 10 m. With increasing canopy height the cut-off value converges towards a relative threshold τμ (appendix Fig. G.13). Ultimately, this filtering strategy reduces the underestimation bias (appendix Fig. H.14) and successfully discards the underestimated samples of the low power laser (coverage laser) in canopies>50 m (appendix Fig. C.9).

4.5.
Comparison to the first release of GEDI L2A height products The first release of GEDI L2A mission data contained estimates of height corresponding to six algorithm setting groups, each with a different combination of settings for signal and noise smoothing widths, and noise thresholds (Dubayah et al., 2020b). These were designed to account for different observation conditions and waveform properties . Prediction of optimal calibration settings of these algorithms is not available for the first GEDI L2A release, however improved calibration settings are planned for future releases. This reflects the iterative procedure towards waveform calibration: as more on-orbit data are processed, with increasingly better geolocation, calibration of waveform processing algorithms will be refined based not only on ALS cross-overs but also GEDI orbit cross-overs, eventually leading to final data products following mission completion with a single, best estimate of height (or some other derived parameter). Nonetheless, the efficacy of our proposed CNN approach can be assessed through comparison with this first GEDI L2A release (Tab. 2). On average, algorithm setting group 2 (out of the six possible) performs best over all continental regions with 4.4 m RMSE and −1.0 m ME, however it is important to note that the performance of individual algorithm setting groups changes regionally. If we always select the optimal of the six algorithms which is closest to the reference data, we find an RMSE of 3.4 m and an ME of −0.9 m. Compared to this optimal selection, the CNN has slightly higher RMSE (3.6 m), but lower ME (−0.3 m) at 100 % recall. Specifically, the CNN suffers   less from underestimation of canopies <40 m but more so for canopies >40 m. Furthermore, our probabilistic approach allows one to reduce the overall error by using the CNN output with the highest confidence, thus, sacrificing some recall. When using only the 70 % least uncertain samples, the RMSE drops to 2.7 m and the overall bias to −0.1 m ME, a significant improvement over the initial GEDI L2A height estimates, even when the optimal setting is handpicked.

Global canopy height predictions
We illustrate a global canopy height map for the extent covered by GEDI (between 51.6 degrees latitude north and south) in Fig. 7. The map is computed from 1.8×10 9 L1B quality waveforms measured along 1,535 orbits during the first four months of the GEDI mission, from April to June 2019. We make use of the quality flag included in the corresponding L2A product, which is expected to yield waveforms with a good sensitivity of the ground returns. For the map, the individual heights per footprint have been gridded to 0.5 deg resolution, because of the sparse nature of GEDI observations within the four-month period. Final grid resolutions of 1 km or finer will be possible once data have been accumulated over two years. We apply three filtering steps before rasterizing canopy heights. First, we filter waveforms from non-vegetated areas, based on the "non-vegetated" layer from MODIS product MOD44B Version 6 Vegetation Continuous Fields (Dimiceli et al., 2015). We keep all samples with >70% probability of being vegetation. This step is necessary to exclude urban areas, which would cause an increase in the RH98 and could deteriorate the large-scale canopy height patterns. Second, we filter based on the predictive uncertainty, with the setting that removed the 30% most uncertain waveforms during random cross-validation (see above). Third, we discard predictions with negative canopy height. Alternatively, one could explicitly limit the mean output of all CNNs in the ensemble to zero, as it is done for the variance estimate. However, we prefer not to do this in favour of better epistemic uncertainty estimation, letting individual CNNs in the ensemble express their disagreement also for predictions close to 0 m. Finally, we obtain canopy top height predictions for 346×10 6 footprints of 25 m diameter, ranging up to 97.2 m max canopy height. These sparse predictions are averaged over ≈55 km raster cells (at the equator) to obtain a global dense map (Fig. 7a).
To study the geographical distribution of the predictive uncertainty, we average the individual predictive standard deviations into the same 0.5 degree (≈55 km) raster (Fig. 7b). The corresponding maps for separated aleatoric and epistemic uncertainty are given in the appendix (Fig. J.16). As expected, the predictive uncertainty is highest in regions with high canopies. Finally, a closer look at the epistemic uncertainty (model uncertainty) at global scale reveals that the model, while able to predict canopy heights up to ≈95 m (footprint level, after filtering), is more uncertain about predictions that exceed the 70 m height range of the training data ( Fig. K.17 in the appendix).

Potential to estimate ground elevation and other
structure variables To demonstrate that deep ensembles of convolutional neural networks can serve as a generic tool to process full waveform data, we present experiments in which, instead of canopy height, we regress two further parameters from the GEDI L1B waveforms. First, we estimate the relative height metric RH70, which provides a test of how well the method can predict structure within the canopy (that is below the top). Second, we estimate the ground elevation by regressing the elevation difference z ∆ = z 0 − z g between the first waveform return z 0 and the true ground return z g (appendix Fig. L.18 and Fig. L.19). Regression of RH70 leads to 2.2 m RMSE. This is significantly lower than for canopy top height (RH98), which is expected because RH70 values are in general lower, have a smaller range, and are less dependent on a well-defined, closed canopy top. Estimation of the ground elevation yields an RMSE of 1.4 m. This error is lower compared to the RH98 regression primarily because remaining geolocation error in the reference dataset will have a lower impact on the ground heights, which are locally smooth, in contrast to the canopy top heights that can drastically change within a few meters. The impact of the geolocation error on the RH98 label noise is thus dependent on the heterogeneity of the vegetation (Roy et al., 2021).

Discussion
Our proposed probabilistic deep learning approach provides a novel computational tool to estimate canopy top height from complex waveforms. It compares favorably with results from the first release of the GEDI L2A product (noting that this product is likely to improve considerably with subsequent releases). A major advantage of the CNN approach is that it avoids manual algorithm calibration to find the best settings, which can be a difficult and time-consuming process and often needs to be tuned to individual beams, plant functional types, canopy covers, and other observation conditions. The proposed end-to-end learning is data-driven, but generalizes across all possible scenarios, with the neural network encoding an adaptive waveform processor that is effective across these variable conditions. Traditional signal processing approaches to waveform interpretation exploit the near-direct measurement of height possible from LIDAR and are based on (and constrained by) physical principles. Whether or not the CNN ultimately out-performs traditional processing for near-direct measurements such as canopy height and cover remains to be seen, but our initial comparison with early GEDI products suggests this possibility. Our results indicate that our data-driven approach may be a powerful alternative and potentially an important addition to the toolkit for space-borne LIDAR calibration.

RMSE [m] ME [m]
GEDI L2A R001: Setting group 2 4.4 -1.0 GEDI L2A R001: Optimal setting group 3.4 -0.9 CNN (100 % recall) 3.6 -0.3 CNN (70 % recall) 2.7 -0.1 A key feature of our approach is that it provides direct estimation of predictive uncertainty, which, according to our evaluation, is well calibrated. This is a critical characteristic of our approach. It can be used to filter canopy height predictions to meet specific accuracy requirements for a given application. By design, GEDI provides only a sample of the Earth's surface, with calibrated uncertainties, the trade-off between reduced sample size (recall) and precision can be explicitly balanced. Thus, canopy height may be filtered with an adaptive threshold, to reduce predictive uncertainty to the degree needed to preserve the range of estimated heights and their geographical coverage, as illustrated by our experiments. The estimated predictive uncertainty does not include uncertainties introduced by systematic geolocation errors when applied to uncorrected waveforms, but will implicitly capture label noise introduced by the remaining geolocation errors in the collocated training dataset.
Our model is able to predict canopy top heights up to 97.2 m, which means that it is able cover the relevant height range for the vast majority of forests on Earth (Tao et al., 2016), and to make predictions that exceed the range of canopy heights contained in the training dataset (in our case up to 70.1 m). As expected, the epistemic uncertainty is higher for canopies above 70 m across the globe and we observed a systematic underestimation of tall canopy heights. We note that the tallest trees generally have the highest biomass, and therefore a systematic underestimation of the associated biomass may have implications for carbon cycle applications. Since the model is trained by minimizing the overall error, our assumption is that this bias is caused by the dataset sampling bias (see the imbalanced distribution of RH98 in Fig. 1b). Hence, collecting more training data with tall canopy heights will decrease both uncertainty and bias. However, the over-and underestimation in the lower and upper range, respectively, can also be caused by the tendency to predict values towards the mean of the training data if the input waveform is noisy and thus less informative. In general, the epistemic uncertainty may guide the collection of new training data that is most informative to the model. Compensating such dataset biases in the training procedure is an active, yet unsolved, research topic (Cui et al., 2019). It remains to be investigated how methods developed particularly for classification can be transferred to regression. Nevertheless, more on-orbit data over the available ALS reference regions will allow to create more training samples, espe-cially in the tropics where the sample density is currently sparser than in high latitudes due to the ISS orbit and the frequent cloud coverage.
GEDI utilizes a coverage laser that is optically split into two beams to improve coverage (each power laser produces two tracks of footprints, while the coverage laser produces four tracks) (Dubayah et al., 2020a). Due to the reduced power, the coverage beams exhibit lower signal-tonoise ratio, which makes it harder to detect weak returns from both the ground and the canopy. This difference in power did not affect overall CNN performance, which is an encouraging result that might pave the way towards a less restrictive use of GEDI's coverage beams. For example, in its current biomass estimation protocol, GEDI does not use coverage beams during daylight conditions for canopy cover exceeding 70 % because of weak ground returns (Dubayah et al., 2020a). Additionally, maximizing the accuracy and usability of all GEDI data has important implications for derived map products that aggregate the footprint predictions, e.g., to 1 km grid cells. Reliable footprint estimates will also advance downstream mapping tasks based on multi-modal data fusion, e.g., combining sparse GEDI canopy height predictions with wall-towall raster satellite data from other missions like Sentinel-2 , TanDEM-X (Lee et al., 2018;Qi et al., 2019a), or Landsat (Healey et al., 2020Potapov et al., 2020).
Another useful property of the generic, data-driven CNN approach to waveform processing is that it can be directly applied to other variables one wishes to derive from the raw waveform, such as ground elevation, plant area index, but also biophysical variables that are often correlated with the LIDAR returns, such as biomass or species richness. For instance, most models that predict biomass from LIDAR use a variety of derived waveform metrics and relate them to ground-based biomass estimates through some form of regression. It may be possible to bypass the explicit computation of waveform predictor metrics, such as canopy height, and their associated predictive uncertainty, and instead regress biomass end-to-end from the raw waveform, potentially uncovering more intricate relationships between waveform morphometry and biomass in a data-driven manner.

Conclusion
In conclusion, the proposed probabilistic deep learning approach has demonstrated the ability to navigate the complex task of deriving global canopy heights from on-orbit L1B GEDI waveforms with favorable accuracies that may exceed traditional waveform processing. Formulating canopy height regression from GEDI waveforms as an end-to-end supervised learning task avoids the explicit modelling of unknown atmospheric, environmental, and instrument effects and instead learns to directly isolate the desired signal from the noisy data. Moreover, our experiments have confirmed that the CNN model learns to extract robust features that generalize to unseen geographical regions that are not included in the training dataset. Modelling the predictive uncertainty with a probabilistic learning strategy supplements the canopy height regression with uncertainty estimates that, empirically, are correctly calibrated and reliable. These estimates of predictive uncertainty, among other uses, make it possible to filter out highly uncertain predictions, which reduces the empirical error, while preserving the full range of canopy heights and their global distribution.
Information on the ALS derived reference data that support the findings of this study are available on request from the GEDI mission science team. Requests must be directed to John Armston (armston@umd.edu) or Ralph Dubayah (dubayah@umd.edu Information on non-public ALS datasets are in the acknowledgments and subject to data use agreements.

Code availability
The code to train and deploy the convolutional neural network is available at: https://github.com/langnico/ GEDI-BDL.

Acknowledgements
We thank the GEDI mission science team, in particular Carlos Edibaldo Silva, Laura Duncanson, Steven Hancock and David Minor for their contributions to the collation and processing of the ALS reference dataset, and Michelle Hofton and Bryan J. Blair for guidance on the GEDI Level 2A algorithm. Eliakimu Zahabu from the Sokoine University of Agriculture (SUA), Erik Naesset and Terje Gobakken from the Norwegian University of Life Sciences (NMBU) gave permission to use ALS data in Liwale and Amani Nature Reserve (Tanzania). These data were acquired by financial contribution from the Royal Norwegian Embassy in Tanzania under the grant entitled "Enhancing the measuring, reporting and verification (MRV) of forest in Tanzania through the application of advanced remote sensing techniques". Hans Verbeeck and Elizabeth Kearsley (University of Ghent) gave permission to use ALS data acquired at the Yangambi in the Democratic Republic of the Congo. Ross Hill (Bournemouth University) gave permission to use ALS data acquired in the New Forest (United Kingdom). Sassan Saatchi (NASA JPL) granted access to the ALS data over Lope in Gabon, which will eventually be released through Sil-vaCarbon after an embargo period. The project received funding from Barry Callebaut Sourcing AG, as part of a Research Project Agreement, and funding from NASA contract #NNL 15AA03C to the University of Maryland for the development and execution of the GEDI mission (Dubayah, Principal Investigator).

Appendix A. Computation time
Scaling to the large data volume of a global product is possible thanks to efficient parallel computing on modern graphics processing units (GPUs). The computation time for a single orbit file with 1.2×10 6 waveforms is <10 minutes for the complete model ensemble, on a single consumer-grade GPU (Nvidia GeForce GTX 1080Ti). Predicting canopy height for our archive of four months of quality GEDI L1B waveforms (12.3 TB of raw waveform data) amounts to approximately 250 hours of GPU computing, where every waveform is processed independently, so that the process can easily be parallelized on a GPU cluster to reduce effective computation time. .19: Regression of ground elevation from L1B GEDI full waveforms. We train the model to predict the elevation difference between the first waveform return and the true ground elevation. I.e., the final ground elevation would be computed by adding the estimated value to the elevation that corresponds to the first return. (a) ALS reference data vs. the CNN estimates, for all test samples. (b) Precision vs. recall when varying the threshold for the maximum (absolute) predictive uncertainty.