Very high resolution canopy height maps from RGB imagery using self-supervised vision transformer and convolutional decoder trained on Aerial Lidar

Vegetation structure mapping is critical for understanding the global carbon cycle and monitoring nature-based approaches to climate adaptation and mitigation. Repeated measurements of these data allow for the observation of deforestation or degradation of existing forests, natural forest regeneration, and the implementation of sustainable agricultural practices like agroforestry. Assessments of tree canopy height and crown projected area at a high spatial resolution are also important for monitoring carbon fluxes and assessing tree-based land uses, since forest structures can be highly spatially heterogeneous, especially in agroforestry systems. Very high resolution satellite imagery (less than one meter (1m) Ground Sample Distance) makes it possible to extract information at the tree level while allowing monitoring at a very large scale. This paper presents the first high-resolution canopy height map concurrently produced for multiple sub-national jurisdictions. Specifically, we produce very high resolution canopy height maps for the states of California and Sao Paulo, a significant improvement in resolution over the ten meter (10m) resolution of previous Sentinel / GEDI based worldwide maps of canopy height. The maps are generated by the extraction of features from a self-supervised model trained on Maxar imagery from 2017 to 2020, and the training of a dense prediction decoder against aerial lidar maps. We also introduce a post-processing step using a convolutional network trained on GEDI observations. We evaluate the proposed maps with set-aside validation lidar data as well as by comparing with other remotely sensed maps and field-collected data, and find our model produces an average Mean Absolute Error (MAE) of 2.8 meters and Mean Error (ME) of 0.6 meters.


Introduction
Spatially explicit maps of forest vegetation structure, such as tree canopy height and crown projected area, are powerful tools for assessing forest degradation, forest and landscape restoration (FLR), and estimating above-ground woody biomass for carbon emission and sequestration modeling.Existing assessments of the climate implications of woody vegetation flux, including FLR, deforestation, and natural regrowth, often rely on remotely sensed dynamic vegetation models of deforestation and regrowth (Friedlingstein et al., 2019).Such wall-to-wall data on tree height and canopy structure are used to estimate aboveground woody biomass.However, land-use patterns operate on more granular spatio-temporal scales than those captured by global carbon models, which typically have coarse spatio-temporal resolution.This contributes to the large uncertainty in existing nation-wide and global accounting of carbon stored in forests (Popkin, 2015;Duncanson et al., 2020;Yanai et al., 2020).For instance, Cook-Patton et al. (2020) produce a global 1-km scale map of potential above-ground carbon accumulation rates by developing machine learning models based on more than 13,000 locations derived from literature.Cook-Patton et al. (2020) find significant variability in predicted carbon accumulation rates compared to defaults from the International Panel on Climate Change (IPCC) at the ecozone scale.In the African tropical montane forests, Cuni-Sanchez et al. (2021) model forest carbon density based on 72,336 measurements of height and tree diameter, identifying two-thirds higher carbon stocks than the respective IPCC default values.
The uncertainty of biomass modelling also affects the uncertainty of the carbon implications of deforestation and regrowth.Tree-based FLR, including agroforestry, reforestation, natural regeneration, and enrichment planting, is considered to be a cost-effective natural climate solution for adaptation and mitigation.However, evaluating the effectiveness of FLR interventions at a large scale is difficult due to its highly distributed nature, typically being practiced on individual land parcels by respective land owners (Reytar et al., 2020).While carbon reporting frameworks exist for FLR, for example through verified carbon markets, such data are highly project-specific owing to their reliance on intensive manual field measurements.Utilizing remotely sensed data to assess vegetation structure on areas with FLR interventions such as intercropped agroforestry or natural regeneration is difficult due to the presence of multiple species, multiple canopy strata, and trees of different ages (Viani et al., 2018;Vallauri et al., 2005;Camarretta et al., 2020).For instance, Tesfay et al. (2022) found that 70% of the shade trees in an agroforestry system in Ethiopia were below 3 meters in height, while 3% were above 12 meters in height, with more than a two-order of magnitude range of per-tree carbon stocks depending on tree size.
Critical to reducing uncertainty in woody carbon models are measurements of forest height and biomass to improve assessments of the spatial variability of carbon removal rates across forest landscapes that have heterogeneous structure (Harris et al., 2021).Tree height is especially critical to accurately assessing carbon removal rates, as growth rate increases continuously with size (Stephenson et al., 2014).Recent earth observation missions from NASA, namely GEDI and ICESat-2, provide repeated vegetation canopy height maps for the first time.Global Ecosystem Dynamics Investigation (GEDI) collects canopy height and relative height at a 25 meter resolution (Dubayah et al., 2021).ICESat-2 collects canopy height and relative height at a 13 × 100 meter native footprint (Markus et al., 2017).Recently, multi-sensor fusion has demonstrated potential to improve aboveground biomass mapping (Silva et al., 2021).To generate wall-to-wall maps of canopy height, researchers commonly combine active optical LiDAR data from ICESat-2 or GEDI with optical imagery from Sentinel-2 (Lang et al., 2022a;Schwartz et al., 2022) or Landsat satellites (Schwartz et al., 2022;Li et al., 2020).
A number of recent studies have utilized spaceborne lidar data from GEDI and ICESat-2 to produce canopy height maps in combination with multispectral optical imagery.
Among them, Potapov et al. (2021) combined GEDI RH95 (95 th percentile of Relative Height) data with Landsat data to establish a global map at 30 meter resolution, using a bagged regression tree ensemble algorithm.More recently, Lang et al. (2022a) produced a global canopy height map at a 10-meter resolution, applying an ensemble of convolutional neural network (CNN) models to Sentinel-2 imagery to predict the GEDI RH98 footprint.Other works have produced regional 10-meter CHMs utilizing Sentinel-2 and aerial lidar (Astola et al., 2021;Fayad et al., 2023).
Aerial lidar data has also demonstrated utility as training data for high resolution (< 5 m) and very high resolution (< 1 m) canopy height maps.At a national scale, Csillik et al. (2019)  The estimation of canopy height from high resolution optical imagery shares similarities with the computer vision task of monocular depth estimation.Vision transformers, which are a deep learning approach to encoding low-dimensional input into a high dimensional feature space, have established new frontiers in depth estimation compared to convolutional neural networks (Ranftl et al., 2021).While depth estimation models benefit significantly from large receptive fields (Li et al., 2018;Fu et al., 2018;Miangoleh et al., 2021), Luo et al. (2016) demonstrate that the effective receptive fields of CNN models have Gaussian distributions, limiting the ability for CNNs to model long-range spatial dependencies.In contrast to convolutional neural networks (CNNs), which subsequently apply local convolutional operations to enable the modelling of increasingly long-range spatial dependencies, transformers utilize self-attention modules to enable the modeling of global spatial dependencies across the entire image input (Dosovitskiy et al., 2021a).
For dense prediction tasks on high resolution imagery where the context can be sparse, such as ground information in the case of near closed canopies, the ability of transformers to model global information is promising.Among the applications to aerial imagery, the work of Xu et al. (2021) uses a Swin transformer to classify high-resolution land cover.Finding that a baseline transformer model struggled with edge detection, Xu et al. (2021) utilized a self-supervised edge extraction and enhancement method to improve definition of class edges.Wang et al. (2022) utilize the vision transformer architecture as a feature encoder, and apply a feature pyramid decoder to the resulting multi-scale feature maps.Gibril et al. (2023) segment individual date palm trees by applying vision transformers to 5-to 30-cm drone-based imagery, finding that the Segformer architecture improves generalizability to different resolution imagery when compared to CNN-based models.More recently, also leveraging vision transformers, Reed et al. (2022) scale the Masked Auto-Encoder approach of He et al. (2022) and apply it to building segmentation.
A major challenge of applying high resolution, airborne lidar data to the generation of wall-to-wall canopy height maps is the relative scarcity of airborne lidar data to the scientific community.Such scarcity can negatively impact the generalizability of models to unseen geographies, especially data-poor regions where little to no airborne lidar exists (Schacher et al., 2023).Given this context of low annotation, Self-Supervised Learning (SSL) is a promising tool to shape more robust features than traditional deep approaches.In particular, the SSL DINOv2 approach of Oquab et al. (2023) recently led to state-ofthe-art performances in several computer vision tasks such as image classification, depth prediction, and segmentation.In the context of satellite image analysis, self-supervised learning has been shown to improve the generalizability of building segmentation models in Africa (Sirko et al., 2021).To mitigate the reliance of vision transformers on selfsupervised learning, Fayad et al. (2023) utilized knowledge distillation with a U-Net CNN teacher model to generate 10-meter CHM of Ghana using Sentinel-1, Sentinel-2, and aerial lidar.
Understanding the importance of highly spatially explicit vegetation structure mapping to both large-scale carbon modeling and project-specific avoided deforestation and restoration monitoring, the objective of this study is to produce high resolution canopy height maps that are able to scale and generalize to large geographic regions.Our method consists of an image encoder-decoder model, where low spectral dimensional input images are transformed to a high dimensional encoding and subsequently decoded to predict perpixel canopy height.We employ DINOv2 self-supervised learning to generate universal and generalizable encodings from the input imagery (Oquab et al., 2023), and train a dense vision transformer decoder (Ranftl et al., 2021) to generate canopy height predictions based on aerial lidar data from sites across the USA.To correct a potential bias coming from a geographically limited source of supervision, we finally refine the maps using a convolutional network trained on spaceborne lidar data.We present canopy height maps for the states of São Paulo, Brazil, and California, USA, and provide qualitative and quantitative error analyses of height estimation and the decomposition of height estimates into tree segmentation maps.

Experimental Design
This paper presents canopy height maps for São Paulo State, Brazil, and California State, USA.These geographies were chosen due to their prevalence of timber production, presence of old growth forests, mountainous terrains, and high degree of tree biodiversity (Maioli et al., 2020;Luyssaert et al., 2008;Ribeiro et al., 2011).The dataset was generated with a machine learning model utilizing a transformer encoder and convolutional decoder trained with an input composite of approximately 0.59 meter GSD Maxar imagery spanning the years 2018 to 2020 and output labels from 1 meter GSD aerial lidar.Our data and methods sections are structured as follows.First, we describe the satellite and aerial lidar data used for model training and map generation.Next, we describe the model training specifics, including self supervised learning and the methods for combining models trained on aerial lidar with models trained on GEDI observations, and the baseline models selected and ablation studies performed.Finally, we present our approach for qualitative and quantitative evaluation of height accuracy and tree segmentation, and discuss the generalization of our model.

Satellite image data description
Maxar Vivid2 mosaic imagery 1 served as input imagery for model training and inference.This dataset provides global coverage by mosaicing together imagery from multiple instruments (WorldView-2 (WV 2), WorldView-3 (WV 3), Quickbird II) and observation dates.By starting with this mosaiced imagery, we leveraged the extensive data selection pipeline from Maxar, resulting in imagery that had less than 2% percent cloud cover, a global revisit rate predominately (more than 75%) below 36 months (imagery dates from 2017 to 2020 are utilized in this dataset), view angles of less than 30 degrees off nadir, and sun angle of less than 60 degrees from zenith.This imagery consisted of three spectral bands: Red, Green, and Blue (RGB), with approximately a 0.5 meter GSD.The imagery was processed in the Web Mercator projection (EPSG:3857) and stored with the Bing tiling scheme2 .Given the high resolution of the original geotiffs, Bing zoom 15 level tiles, with 2048 × 2048 pixels per tile were used, giving a pixel size of 0.597 meters GSD at the equator.

Satellite image data preparation 2.3.1. Image preparation
For easier training and validation of computer vision models, we extracted small regions from the input satellite imagery.Centered around a given location, a box of fixed ground distance was selected, using a local tangent plane coordinate system.Due to the Web Mercator projection of the image tiles, the extracted images at each position had varying dimensions according to their latitude, which were re-sampled to a fixed number of pixels.We chose a box side length of 152.7 meters, which, when re-sampled to 256×256 pixel images, provided "thumbnail" images that matches the lowest resolution (0.597m) of the input imagery described in Section 2.2.Using these thumbnail images both for training and inference ensured that the dataset had constant number of pixels and that the pixel size was the same for all latitudes, preventing potential biases with latitude which may be introduced by variation in pixel size.

Dataset for Self-Supervised Learning
For training the self-supervised encoder, we randomly sampled 18 million 256 × 256 pixel satellite thumbnail images.No labels were used for the SSL stage.

Validation segmentation dataset
We also manually annotated a random selection of 9000 Maxar thumbnail images for segmentation testing.A binary tree / no tree label was applied by human annotators.Pixels estimated to have a canopy height above one meter (1m) tall and with a canopy diameter of more than three meters (3m) were labeled as tree.

Supervised dataset
We gathered approximately 5800 canopy height maps (CHM), selected from the National Ecological Observatory Network (NEON) (2022).Each CHM typically consisted of 1km × 1km geotiffs, with a pixel size of one meter (1m) GSD, in local UTM coordinates.We selected the sites used by Weinstein et al. (2021) and additionally manually filtered for sites that have CHM imagery that was well registered and mostly free from mosaicing artifacts.Additionally, we selected sites with imagery acquired less than two years from the observation date in the associated Maxar satellite imagery.A complete list of NEON sites used for training and validation is contained in Appendix A.
The CHM geotiffs were reprojected to a local tangent plane coordinate system and resized to match the resolution of Maxar images.For each ALS CHM, a corresponding RGB satellite image was linked, and these pairs of imagery served as the training data for our decoder model.The 5800 images in the NEON ALS dataset were split in sets of 80% training images, 10% calibration and 10% set-aside validation images.During the training, validation and testing phases, we sampled 256×256 random crops from the RGB -ALS image pairs.Model training was conducted over epochs sampled from the training dataset.At the completion of each epoch, metrics were computed from a 10% calibration dataset to calibrate the hyperparameters of the model training process.The calibration dataset was drawn from the same set of sites as the training datasets, but from separate 1km × 1km geotiffs to ensure non overlapping pixels.
We constructed a set-aside validation dataset from a subset of sites in our NEON dataset, which we call "NEON test".None of the sites used in the validation dataset were contained in the training or calibration dataset.A list of NEON sites in the validation set appears in Appendix A. We also prepared two validation datasets from other publicly available ALS Lidar datasets, outside of the NEON collection.These datasets covered different geographic locations and ecosystems: "CA-Brande" (Brande, 2021)  Where these datasets were available as CHMs, we directly used the supplied CHMs.However, for the São Paulo datasets, which only contained point cloud datasets, we processed CHMs following the pit-free algorithm (Khosravipour et al., 2014).The pit-free algorithm was also adopted by the NEON team for generating their CHM product, and we found that different input parameters to the pit-free algorithm had negligible impact on the CHM output.

Data Augmentation
The 256 × 256 pixel image thumbnail images of RGB and CHM imagery were augmented at training time, with random 90 degree rotations, brightness, and contrast jittering.We found that these augmentations improved model prediction stability across the various Maxar observations in the input dataset.

Model and data generation methods
Our goal was to create a model that produces high resolution canopy height maps and generalizes across large geographic scales.To accomplish that goal, we leveraged the relative strengths of two types of lidar data.Aerial lidar provided high resolution canopy height estimation, but lacks global spatial coverage.In comparison, GEDI has nearly global coverage of transects, but its beam width of approximately 25 meters did not allow for the identification of individual trees.
After self-supervised pre-training on satellite images globally, our high-resolution ALS CHM prediction model was trained on images from the NEON dataset, as detailed in Section 3.2 and Figure 1.As the Neon dataset only has a spatial coverage from sites only within the United States, we expect this ALS CHM model to perform well on ecosystems similar to the training set.To improve generalization of other ecosystems and locations, a low resolution CHM model was independently trained on global GEDI data (Section 3.3).The GEDI model was used to compute a rescaling factor map (Section 3.4), which adjusted the predictions made by the ALS CHM model.

Self Supervised Learning
Following the recent success of self-supervised learning on dense prediction tasks from Oquab et al. (2023), we employed a self-supervised learning step on 18 million globally distributed, randomly sampled 256 × 256 pixel Maxar satellite images to obtain an image encoder delivering features specialized to vegetation images.In the training phase, different views of the image were fed to two versions of the encoder: a teacher model receiving global crops, and a student model receiving local and global views where part of the crops were masked (replaced by zero values).We employ a huge ViT architecture, where the inputs are decomposed into 16 × 16 patches.The two networks were trained jointly to output similar feature representations.The procedure is illustrated in the Phase 1 in Figure 1.In a second phase described in Section 3.2, we freeze the SSL encoder layers using the weights of the teacher model and train the decoder with ALS data to generate high-resolution canopy height maps.

High resolution canopy height estimation using ALS
We used the reference dataset described in Section 2.4, prepared following the methods described in Section 2.3.1.The output of the ALS model was a raster of predicted canopy heights at the same resolution as the input imagery.For training the supervised decoder, we used the ALS CHM data described in Section 2.4 to create a connection between the SSL features and the full resolution canopy height image.In this second phase, we trained the decoder introduced in Dense Prediction Transformer (DPT) (Ranftl et al., 2021) on top of the obtained features.This approach is described in Figure 1, phase 2. The DPT paper describes a full model composed of a transformer encoder extracting features at different layers.In the decoder, each output was reassembled and all outputs were fused.In our second phase of ALS training, we replaced the transformer of DPT by our own SSL encoder, and trained the DPT decoder part only, from scratch.Our best results were obtained by freezing all layers from the SSL encoder.We employed a one cycle learning rate schedule with a linear warmup in the encoder training stage and a "Sigloss" loss function.Further architecture and training details are provided in Appendix D.
Sigloss function.We take advantage from the similarity of canopy height mapping to the task of depth estimation and borrow the loss from Eigen et al. (2014).Given a true canopy height map c and our prediction ĉ, the Sigloss is given by where δ i = log( ĉi ) − log(c i ), and T is the number of pixels with valid ground truth values.
Classification output.To avoid a bias toward small predicted values, we implemented a classification step first, combined with the Sigloss defined above.The strategy is described by Bhat et al. (2021) as the uniform strategy.Specifically, we modified the output of our decoder to return, instead of one scalar per pixel, a range of B bins.After a normalization on the predictions, we computed the scalar product between the obtained histogram of predicted bins and a linear vector ranging [0,B], with B set to 256.To mitigate the effect of the limited geographic distribution of available ALS data, we employed a second regression network trained on GEDI data to rescale the ALS network outputs.The GEDI prediction model was a simple convolutional network, containing five convolutional layers, followed by five fully connected layers.The inputs to the model were 128 × 128 pixel Maxar images containing three RGB bands, in topocentric coordinates, processed as described in Section 2.3.1.The ground truth data consisted of 13 million GEDI measurements, which were randomly sampled from the full GEDI dataset described in Appendix B.1.We trained the GEDI model to output a single scalar value for a 128×128 pixel image patch, with a L1 loss on a regression task against the RH95 value from the GEDI instrument.The training details are specified in Appendix B.3.

Combining ALS and GEDI model outputs
In this section, we describe the process of connecting our GEDI model outputs (Section 3.3) with ALS model outputs (3.2).Conceptually, the ALS model output provides high resolution canopy estimates but lacks the global context to correctly estimate the absolute height of vegetation in different ecosystems.Conversely, the GEDI model is trained on a global dataset and contains position and metadata inputs (Figure 2).A schematic of the process is shown in Figure 3. Correlation between different lidar sources.The first step in making the GEDI/ALS connection is understanding the relationship between the two sets of lidar data: ALS CHM (Section 2.4) and GEDI lidar (Section Appendix B.1).These two datasets make measurements of fundamentally different properties of canopy structure.GEDI measures the relative height of canopy based on the full waveform measurement of the return energy from 25 m diameter beam footprints while aerial lidar constructs higher resolution point clouds.To connect these two, we ran simulations with the GEDI simulator from Hancock et al. ( 2019) on the NEON ALS point clouds.We found that there was a strong correlation (R 2 = 0.88) between the 95th percentile of ALS canopy height maps and the simulated GEDI RH95 (see Appendix B.2).
GEDI based correction of ALS trained maps.We used this correlation to scale the ALS model canopy height maps by computing a scalar multiplier that match percentiles of the CHM map with the GEDI model predicted value for GEDI RH95.This process works as follows: Given an input RGB image, x, we combined the outputs of the ALS and GEDI models by computing a dense correction factor γ(x), so that the novel prediction, C ′ (x) was related to the ALS model CHM, C(x): where Here G(x) is the output CHM of our GEDI model and Q(x) 95 is a per block upsampled 95 th percentile of the ALS model CHM in meters, computed over the exact same 128 × 128 pixel input regions as the input to the GEDI model in G(x).More specifically, each input image was divided in four crops, each one independently fed to the height prediction model, leading to four scalars, that were concatenated and upsampled.From the predicted CHM map by our ALS model, we computed four percentiles from the same crops, concatenated and upsampled in the same way.We used the ratio in Equation 3 rather than G(x)/Q(x) 95 to down-weight noisy model estimates near zero canopy height.Since G(x) and Q(x) 95 are lower resolution than C(x), the correction factor map was upsampled to match the resolution of the ALS CHM, C(x).The ALS and GEDI maps were smoothed with a 20 pixel sigma Gaussian kernel s σ to prevent sharp transitions, and the correction factor was clipped between 0.5 and 2 to avoid drastic rescaling.

Baselines 3.5.1. ResUNet-based approach
We utilized a ResUNet-18 architecture for our baseline (Zhang et al., 2017), which is an encoder-decoder architecture predicting a N × N canopy height map from a 3 × N × N RGB image, with N = 256.The baseline model was trained with the sigloss between the predicted and ground truth CHMs.We also experimented with a classification output, however we did not obtain improvements from this approach.

Supervised Transformer-based approach
To assess the benefit of the self supervised training phase on Satellite data, we consider a baseline given the state-of-the-art vision SWAG encoder (Singh et al., 2022).We used the large version of this Vision Transformer (ViT) that was trained to perform hashtag prediction from Instagram images.At the time of writing this manuscript, this model was in the top ten models with highest accuracy on ImageNet, CUB, Places, and iNaturalist datasets, providing a warranty of feature quality.This model contains the same number of parameters as our SSL encoder, allowing a fair comparison in terms of model size.

Data validation
We evaluated the model performance against a variety of metrics, which we divided into two broad classes: (1.) Metrics which primarily evaluated the accuracy of canopy height maps, which we call canopy height metrics (Section 4.1), and (2.) Metrics which primarily evaluated the accuracy of image segmentation into tree or no tree pixels, which we call segmentation metrics (Section 4.2).The set-aside validation dataset of ALS canopy height maps described in Section 2 served as the primary dataset for all types of metrics.For the segmentation metrics, we also evaluated the model predictions against a dataset of human-annotated labels independently labeled by photo-interpretation of Maxar imagery (Section 4.2.1).

Results
We generated CHMs for the State of California, USA (Figure 4) and São Paulo, Brazil (Figure 5) by running inference on 0.59 m GSD Maxar images with the SSL + GEDI ViT huge model trained with 1 m aerial lidar data.In California, 39 percent of the area used Maxar imagery observed in 2020, and 90 percent within the years spanning 2018

Canopy height metrics
We compared the predicted canopy height maps with aerial lidar data in terms of mean absolute error (MAE), Root Mean Squared Error (RMSE), and R 2 -block (R 2 ).The R 2 -block score is the coefficient of determination, which we computed on cropped images with a resolution of 50 × 50 pixels (∼ 30 × 30 meters).We have chosen the exact size of these blocks somewhat arbitrarily, but were motivated to compute on a scale of 10s of meters due to: a.) georegistration errors in both the Maxar imagery and ALS data, b.) projection differences between the two datasets, with the ALS data being orthorectified and the Maxar imagery have off nadir view angles of up to 30 degrees.As such, the R 2 -block score better reflects the local accuracy of CHMs and provides a more direct performance comparison to lower resolution models.However, averages across blocks of this resolution do not provide a good indicator of the edge accuracy of the produced maps, which can be a desirable property for downstream tasks such as segmentation.We separately report the Edge Error (EE) metric we developed to measure the sharpness of the maps, described in Appendix C.3.Finally, to estimate the bias of different models, we report the Mean Error (ME).We provide formulas for the above mentioned metrics in Appendix C.

Canopy height metrics for ALS models
We present in Table 1 an ablation study of different pre-training data, model size and output on the Neon and São Paulo test sets.From this ablation study, we selected the SSL model trained on 18 million images utilizing the classification output, which achieved the highest canopy height accuracy metrics.We also trained a huge model instead of a large one, that significantly reduced the bias of the predictions on the São Paulo dataset.We refer to this model as the SSL model throughout the paper.Table 1 suggests that pre-training on satellite images gives better results compared to pre-training on ImageNet.Compared to the ViTs that are pre-trained on ImageNet, including the SWAG approach, the ResUNet remains the strongest baseline.The SSL model clearly outperforms the ResUNet on Neon, reducing the MAE from 3.1 to 2.6 meters, is also improving results on CA Brande, and leads to similar results on São Paulo, with a slightly worse R 2 but a much lower ME.We also experimented with different loss functions, and a smaller dataset for self-supervised pre-training.We found that was training on more data was leading to much better results in São Paulo.Comparing L1, L2 and Sigloss, we found that Sigloss and L2 were leading to the best results.Additional discussion of these trials can be found in Appendix E.

Canopy height metrics for ALS + GEDI models
Table 2 presents a quantitative validation of the best performing models, namely the ResUNet and the self-supervised model (SSL), combined with the GEDI correction step (ResUNet+GEDI, SSL+GEDI).We note the improved performance of the SSL model compared to the ResUNet in the NEON test and CA-Brande datasets.
Although the SSL model performed the best across the datasets in the USA (NEON test and CA-Brande), it performed worse than the ResUNet and ResUNet + GEDI for São Paulo, possibly due to the large domain shift in ecosystems.In the case of São Paulo, we found that the inclusion of GEDI ("SSL+GEDI") produced the best results, possibly indicating better generalization by including the globally trained GEDI model, which also includes additional metadata such as geographic position (Figure 2).
Figure 8 shows 2D-histograms of the SSL+GEDI model predictions vs the set-aside validation ALS-derived canopy height averaged over ∼ 30m blocks and the corresponding pixel MAE and block-R 2 scores.

Quantitative comparison of CHM model with GEDI RH95 data
The ALS set-aside validation datasets used in the previous section are limited in both total area and geographic coverage.In this section, we leverage the global coverage of the GEDI dataset to validate our CHM models.As described in Appendix B.2, CHM maps can be connected to GEDI RH95 metrics by taking the 95th percentile.In this analysis, we draw 2 × 10 4 GEDI samples globally in the set-aside validation split the same way as in training the GEDI model, i.e., weighted proportional to the square root of the inverse sample size of its RH95 bin.Similarily to Potapov et al. (2021), we removed GEDI observations corresponding to < 0.5 normalized difference vegetation index (NDVI) that  had no tree cover in the 2010 data of Hansen et al. (2013), corresponding to 337 samples out of 20000.In Fig. 9a, we show the scatter plot and histogram of the 128 × 128 pixels (76m × 76m) block 95 th percentile vs. the measured GEDI RH95.In Fig. 9b, we analyze the difference of the CHM p95 and the GEDI RH95 with respect to the referenced GEDI RH95 heights.We found that the p95 of the CHM model had a small negative bias against the GEDI RH95 values and adding the GEDI correction to the CHM model significantly reduces the bias.There is a slight positive bias in the GEDI RH95 data due to the terrain slope (Lang et al., 2022a).
We used terrain slope (Mapzen, 2017) as one of the input metadata when training the GEDI model (see Section 3.3), while setting the terrain slope to zero during inference.With this setup, we were able to calibrate out the positive bias caused by terrain slope in our GEDI model.
To assess the importance of the GEDI calibration model for geographic generalization, and the generalizability of the different baseline models, we computed R 2 on globally distributed GEDI test data (Table 3).We found that the SSL + GEDI model had the highest agreement with GEDI RH95 data in 13 of 15 geographic regions.In 42 out of 45 combinations of subregions and models, including the GEDI correction model increased R 2 .

Correlation with field data
To measure the agreement between our computed CHMs and field-collected tree height data, we utilize the Brazilian National Forest Inventory (NFI) data, which consists of systematic field plot inventories of tree count and height (da Luz et al., 2018).Because the NFI data for São Paulo was not yet available, we additionally generate a CHM of the nearby Espirito Santo state and evaluate its agreement with the NFI data for Espirito Santo.The NFI data analyzed encompassed 1,450 10×10 m subplots within 87 plots positioned within a 20×20 km grid in Espirito Santo.The field data was collected primarily in November and December 2014, and includes the height of each tree within each subplot having a diameter at breast height (DBH) of at least 10 cm.Of the 1,450 initial plots considered, we removed 291 that had tree cover loss since 2014 in the dataset of Hansen et al. (2016).

Segmentation metrics
In addition to the canopy height metrics discussed in Section 4.1, we compute a number of metrics that reflect the ability of the model to correctly assign individual pixels as trees.CHMs were converted into binary masks by thresholding height values of at least five meters (5m) as tree canopy extent.Table 4 shows the pixel user's and producer's accuracy values (also know as precision and recall, respectively) for pixels labeled as trees.Table 4 also shows the Intersection Over Union (IOU) for the binary masks, which was calculated as the average of IOU for pixels labeled as tree and the IOU for pixels labeled as ground.
Additionally, we introduce an Edge Error (EE) metric that computes the ratio of the sum of the absolute difference between edges from predicted and ground truth CHM, normalized by the sum of detected edges in both maps.Scores range between 0 and 1, where lower scores indicate improved accuracy along patch edges.In Table 4, the edge error is computed over all set-aside validation datasets.We detail the formula with a figure illustrating the behavior of this metric in Appendix C.3.
We observe an improvement of the SSL approach over the ResUNet baseline in terms all segmentation metrics.Both approaches leads to maps with the same level of sharpness, and the GEDI correction slightly degrades results.

Tree detection metrics evaluated against human annotated validation data
To assess the ability of the model to generalize to new geographies, we compiled human-annotated validation labels for tree detection (binary classification of tree vs notree) across 8, 903 Maxar thumbnail images.Human annotators were instructed to label any trees above one meter (1m) tall and with a canopy diameter of more than three meters (3m).Annotators were to include standing dead trees and tree-like shrubs, but exclude any grasslands, row crops, lawns, or otherwise vegetative ground cover whose peak height was estimated to be less than 1m from the ground surface.To create the model binary masks for the annotated dataset, we thresholded the model CHM at 1m.
The geographic locations for the images in the dataset correspond to randomly sampled GEDI measurement footprints from our global set-aside validation set where the GEDI measurement had RH95 greater than 3 meters, which we enforce to bias the dataset towards vegetated areas.The data is independent of the aerial lidar measurements used to train the model.Over the entire dataset, the user's and producer's accuracy was 0.88 ± 0.006 and 0.82 ± 0.008, while the IOU was 0.77 ± 0.006 indicating good agreement with the human annotations, cf.Table 5. Figure 11 shows examples of model predictions and their corresponding annotations.

Global, Annotated
U/P IOU ResUNet 0.89/0.860.75 ResUNet + GEDI 0.90/0.860.74 SSL 0.83/0.870.77 SSL + GEDI 0.82/0.880.77 Table 5: Segmentation metrics on global, human annotated dataset.U/P corresponds to pixel user's / producer's accuracy.IOU to the average of tree & no tree IOU scores.Since the GEDI correction only adjusts large scale height percentiles, the "+GEDI" rows show only small improvement over the base ALS models.
We additionally calculated user's and producer's accuracy by geographic subregion according to the United Nations geoscheme.Boostrapping with 10,000 iterations was used to calculate uncertainty for tree segmentation accuracy metrics rather than the methods of Stehman (2014) because the cluster sampling approach was used to generate validation data (Olofsson et al., 2014;Mugabowindekwe et al., 2022;Maxwell et al., 2021).This validation analysis indicated strong generalizability across different geographic regions, without significantly different accuracy metrics in geographic regions where we had train- ing data and where we did not (Figure 12).This suggests that the use of self supervised learning on global images facilitated the creation of a generalizable segmentation network.

Qualitative comparison of models
Although we have attempted to capture the performance of each model qualitatively with the included metrics, we note that visual inspection often leads to additional insights.Therefore, we additionally present a few examples of maps produced by our various models.Figure 13 compares the results with a ResUNet and SSL based strategies.

Canopy height as a function of plantation age
Densely planted monoculture stands, such as those commonly found in the Atlantic forest, can be many hundreds of hectares large.Assessing the age-height relationship of tree stands with CHMs derived from optical imagery may be challenging due to the relative homogeneity of the canopy structures, the large area to perimeter ratio, and the lack of canopy gaps.To assess the CHMs ability to map the height of planted trees, we utilized the annual 30-meter tree cover gain and loss data from MapBiomas in São Paulo (Azevedo et al., 2018).We calculated the number of years since the most recent tree cover gain with no subsequent loss event for each image date analyzed.Figure 14 shows a positive relationship (R 2 = 0.59) between the number of years since the most recent tree cover gain, and our predicted 95th percentile CHM.For areas with gain events older than seven years, there was no significant age-height relationship, as areas with trees with gain events more than seven years prior to the analysis year had similar height distributions to areas with stable (no gain or loss since 2000) trees.For this analysis, it's important to note that the tree cover gain year identified in MapBiomas is a lagging indicator of the tree age, since tree cover gain is not immediately recognizable from Landsat imagery.

Generalization to aerial imagery
Using a model fully trained on Satellite images.To assess the generalization ability of our approach to other input imagery, we measure model performance using airborne imagery at inference.For inference, we resized the NEON aerial images to match the size of corresponding satellite image, and apply a normalization of the aerial image to match the color histogram of the satellite imagery.Details about image normalization are provided in Appendix G.
The second line of Table 6 shows canopy height metrics computed on predictions made from NEON input RGB imagery.The SSL model almost doubles the R 2 of the ResUNet baseline.Compared to the performance of the SSL model with satellite images as input as reported in Table 1, the MAE is only slightly higher (3.0 instead of 2.7), the R 2 is a bit more impacted (0.55 instead of 0.70), while the bias is much higher, but evenly distributed between different height bins (Figure 15). Figure 16 displays a qualitative example, where we observe that despite a blurrier result, the accuracy of the model given an out-of-domain aerial image seems similar to the one obtained using in domain satellite imagery.
Despite changes in color intensity, image angle, and sun angle, our approach manages to generate predictions with consistent visual quality.From an application point of view, Training a new decoder on aerial images.We compared these results to another baseline, training a decoder on top of our pretrained SSL features on Neon RGBs (last line of Table 6).Given a better alignment with the ground truth CHMs, and view angles close to Nadir across the Neon dataset, this aerial model performed reasonably well compared to the recent result of Wagner et al. (2023), only using the RGB channels, as illustrated in Figure 17.

Discussion
Our proposed method provides a novel approach to estimating canopy height from VHR satellite imagery.We demonstrate the effectiveness of our approach based on selfsupervised learning, dense vision transformers, and introduce an approach to rescale highresolution canopy height maps from a model trained on Maxar and ALS data with lowresolution canopy height maps from a model trained on Maxar and GEDI data.In contrast to Lang et al. (2022a), which downscales the 25-meter GEDI data to generate 10-meter canopy height maps by only considering Sentinel-2 pixels at the centroid of each GEDI pixel, our approach uses a GEDI-based canopy height map to rescale an ALS-based model of canopy height map.While both Lang et al. (2022a) and Potapov et al. (2021) only utilize ALS data to validate their generated maps, we directly model the relationship   between Maxar imagery and ALS data, enabling the mapping of sub-GEDI scale canopy height variability, some times at a per-tree level outside of dense, closed-canopy forests.
Segmentation.Limitations.The production of high-resolution canopy height maps from optical imagery has challenges and associated limitations.Foremost, the availability of recent ALS training data is limited in geographic scope.While the utilization of self-supervised learning and the GEDI-based corrective model improve generalization and reduce test error, increased geographic availability of ALS remains necessary to further validate the proposed maps in new geographies.While we were able to validate error as a function of canopy height for trees under 25 m based on field inventory data in Espirito Santo, Brazil (Figure 10), we were unable to utilize field data to assess potential height saturation for very tall trees which may affect derived above ground carbon data.However, Figure 9a does suggest significant saturation of predictions for GEDI RH95 observations above 30 m.The generated maps are limited by variation in input imagery, particularly by variation in view angle, sun angle, acquisition times and dates, and optical aerosol.As shown in Figure 17, qualitative data quality improves considerably when processed on VHR aerial optical imagery, as opposed to VHR satellite optical imagery.Additionally, terrain slope appears to influence predicted height, since it affects the length of shadow an individual tree casts.At present, the ability to conduct tree height change detection assessments is limited by the need for improved input image processing to better align these differences between image pairs.

RGB aerial image
Lidar Ground Truth Wagner et al. (2023) Our predicted CHM

Conclusion
This study presents high-resolution canopy height maps at a jurisdictionial scale based on VHR (Maxar) optical imagery trained on aerial lidar and calibrated with spaceborn lidar (GEDI) data using latest advances from self-supervised learning and vision transformers.We demonstrate quantitatively and qualitatively the advantages of large-scale self-supervised learning, the versatility of obtained representations allowing generalization to different geographic regions and input imagery.Compared to existing canopy height maps, the presented data better captures tree structure variability at small spatial scales.Such very high resolution maps of canopy height can improve the monitoring of forest degradation, restoration, and forest carbon dynamics.The next steps include (a) developing and validating allometrically-derived high-resolution woody carbon data and (b) testing and validating the utility of the proposed approach for the operation monitoring of tree growth.The GEDI RH95 measurement used for training the GEDI model corresponds to the 95 th percentile of the lidar's energy response.We simulated the GEDI RH95 values and find that they have excellent correlation (R 2 = 0.88) with the 95th percentile of the canopy height map around the corresponding GEDI footprints.This high correlation between GEDI RH95 and p95 of CHM was consistent across the diverse ecosystems covered in all 40 NEON sites in Appendix A.   Architecture details.Our encoder architecture is a ViT architecture as introduced by Dosovitskiy et al. (2021b).It treats an image as a set of patches, called tokens, that are first embedded into a feature space and then processed by a cascade of transformer layers to produce updated representations of the tokens.The transformer layers use multi-head attention and self attention as their fundamental operation.Multihead attention is an operation that relates each token to every token in the image and consequently, has a global receptive field.The ViT does not use down sampling operations in its intermediate stages and thus supports fine-grained feature maps also in the deeper layers of the network.For the huge model, the features consists in outputs from layers (9,16,22,29).At each layer, a 8 × 8 × 1280 feature map and 1 × 1 × 1280 class output is extracted.In the DPT decoder, the set of tokens at various stages of the backbone is first reassembled into image like representations.Then, the decoder iteratively fuses the feature maps from different stages and produces the final dense prediction using an application specific output head.This step involves several residual convolution layers.The code of our backbone is available at https://github.com/facebookresearch/dinov2,with pre-training on natural images.
covered a coastal ecosystem in CA, and "São Paulo" (Dos-Santos et al., 2019) covered a region in the Brazilian São Paulo State.See Figure A.18 for a visual breakdown of the Neon dataset splits.

Figure 1 :
Figure 1: Overview of our approach for generating ALS-based CHMs.During the first stage, we employed the self-supervised learning approach Oquab et al. (2023) on 18 million 256 × 256 satellite images leading to a set of four spatial feature maps, and four feature vectors, extracted at different layers of the Vision Transformer model (ViT).In the second phase, we trained a convolutional DPT decoder to predict CHMs.

Figure 2 :
Figure 2: Overview of our methodology to generate predicted RH95 values using GEDI measurements across the globe.Terrain is used only during the training and set to zero during inference.

Figure 3 :
Figure3: Post processing step using GEDI predictions during inference.We used the GEDI model to correct our CHM predictions, by computing a dense scaling factor, and multiply it pointwise with the CHM prediction map.

Figure 4 :
Figure 4: Canopy Height Map (CHM) for the state of California, inset showing zoomed in region with input RGB imagery.

Figure 5 :
Figure 5: Canopy Height Map (CHM) for the state of São Paulo, inset showing zoomed in region with input RGB imagery.

Figure 6 :
Figure 6: Selected sample regions from the canopy height predictions (log scale), overlayed on the input Maxar imagery (RGB).Canopy height prediction below 0.1 meter is transparent.The top row corresponds to regions in California and the bottom row, São Paulo.

Figure 8 :
Figure 8: Block (∼ 30m × 30m) aggregated SSL + GEDI model predictions compared to ALS ground truth measurements for different set-aside validation datasets.Colormap density is normalized to the 99.6th percentile of the heatmaps.
Figure 10 visualizes box plots of the 95th percentile CHM by reference NFI height bins.The overall ME was 0.72 m while the RMSE was 4.25 m, with a slight positive bias for trees ≤15 m (ME = 1.10 m, RMSE = 4.28 m), and negative bias for trees >15 m (ME = -1.00m, RMSE = 3.79 m).
Figure 9: Global model evaluation on held-out GEDI data.(a) p95 of block (76m × 76m) model CHM predictions compared to the measured GEDI RH95 metrics.(b) Left: Difference between the p95 of block model CHM predictions and the measured GEDI RH95 metrics w.r.t model CHM predictions.Negative values indicate that the model estimates are lower than the GEDI RH95 values.Residuals in function of RH95 appear in Appendix F. Right: CHM p95 in function of RH95.

Figure 10 :
Figure 10: CHM error compared to reference tree height as indicated in the Brazilian National Forest Inventory for Espirito Santo.

Figure 11 :
Figure 11: Tree segmentation predictions from the SSL + GEDI model vs human annotated ground truth.Binary prediction masks were created from the CHM by thresholding at 1m. U/P corresponds to pixel user's / producer's accuracy of the tree class.The IOU represents the Intersection-Over-Union score for the tree class.

Figure 12 :
Figure 12: Pixelwise user's accuracy (UA) and producer's accuracy (PA) for 8,903 validation plots stratified by geographic sub-region.Error bars represent the 80, 90, and 95 percent confidence intervals as derived from 10,000 bootstrap iterations.Numbers in the x-axis tick labels denote sample size.

Figure 13 :
Figure 13: Qualitative comparison between different models for example imagery.Left: Input Maxar "thumbnail" image, 256 × 256 pixels, in local tangent plane coordinate system.Second from left: ALS CHM image, in same projection and pixelization.Right columns: Model CHMs.

Figure 14 :
Figure 14: Canopy height estimates for areas with tree cover gain of various ages in São Paulo relative to the imagery year analyzed.

Figure 15 :
Figure 15: Performance of models given aerial images inputs.Top: Model fully trained on satellite images; Bottom: Performance of encoder trained on satellite images, decoder trained on aerial images.

Figure 16 :
Figure16: Generalisation of our SSL approach.Even if trained on Satellite images, inference on airborne images does not seem to suffer from a domain shift.

Figure 17 :
Figure 17: Comparison of our aerial model, where we trained the DPT decoder on Neon aerial RGB images, with the approach of Wagner et al. (2023).Note that despite a slight change in the scale of the input image, which was zoomed to obtain a 256 × 256 input, and despite the fact we did not use the infra-red input, we obtain a result similar to the one of Wagner et al. (2023).
Figure A.18: Distribution of ALS Datasets: Train/Calibration/set-aside validation (aka Train/Validation/Test): (a) All ALS datasets.Here Train and Calibration points overlap and are shown in blue.Set-aside validation (Test) datasets are from non-overlapping geographic regions.(b) Zooming in on one Train / Calibration site (NEON GRSM) -we have randomly split the data into non spatially overlapping tiles so that calibration data is drawn from the same sites and ecosystems as training data, but separated spatially.

Figure
Figure B.19: Correlation between 95th percentiles of ALS Canopy Height Maps and simulated GEDI RH95 values from the same maps.The 95th percentile is computed within weighted Gaussians with σ = 12.5m, in order to roughly approximate the GEDI beam width.

Figure B. 20 :
Figure B.20: Correlation between terrain slope and GEDI RH95 for samples in CA.The dashed line indicates the height of the terrain with the GEDI beam (GEDI beam radius times the terrain slope).The heatmap is predominately above this line, indicating that there are no GEDI height estimates which fall below the terrain change within the beam.

Figure C. 21 :
Figure C.21: Illustration of edge error metric for two results: the ResUNet (trained with an L1 loss) edge error score is 0.66 in this example, the score of the SSL model is 0.55, computed using difference of the prediction and ground truth edge maps appearing in the two images at the right.
generated biomass maps in Peru by applying gradient boosted regression trees between 3.7 meter Planet Dove imagery and airborne lidar, with low uncertainty in dense forests but large amounts of uncertainty in transitional landscapes and areas that are hotspots of land use change.Recently, Liu et al. (2023) computed a canopy height map (CHM) map of Europe using 3 meter Planet imagery, training two UNets to predict tree extent and CHM using lidar observations and previous CHM predictions from the literature.Utilizing aerial optical imagery, Wagner et al. (2023) generated a submeter CHM of California, USA by training a regression U-Net CNN on 60-cm imagery from the USDA-NAIP program and aerial lidar.
2 -block ME MAE R 2 -block ME MAE R 2 -block ME Comparison of results with SSL pre-training on different datasets and with other supervised strategies (ResUNet, SWAG).IN: ImageNet.Sat: dataset described in Section 2.3.2.IG: Instagram dataset.R: DPT decoder with a regression (scalar) output.C: DPT decoder with a classification (256 bins) output.ViT L: large, H: huge.Note that the results are non GEDI corrected in this table,and all models were trained with a Sigloss.We later denote the model in the last line as the "SSL" model.

Table 2 :
MAE RMSE R 2 MAE RMSE R 2 MAE RMSE R 2Canopy Height Metrics to assess the gedi correction step.R 2 corresponds to ∼ 30 × 30 meter block R 2 ."Average" is the unweighted average across datasets.

Table 3 :
R 2 between predicted CHM p95 and GEDI RH95 by geographic subregion for 20,000 test GEDI observations for models with and without the GEDI calibration model.

Table 4 :
Segmentation metrics.U/P corresponds to pixel user's / producer's accuracy of the tree class.IOU to the average of tree & no tree IOU class scores.EE: Edge error.
Neon test -aerial Encoder training dataset Decoder training dataset MAE block R 2 ME EE

Table 6 :
CHM prediction accuracy on NEON test dataset using aerial input images as inputs.Trained on satellite images only, the SSL approaches demonstrates generalization abilities.
Potapov et al. (2021)(2022)eeplearning image segmentation approaches to map trees in high-resolution imagery, such asBrandt et al. (2020)andMugabowindekwe et al. (2022)have utilized a U-Net model(Ronneberger et al., 2015)and entirely handlabeled reference data.Focusing in Rwanda,Mugabowindekwe et al. (2022)map carbon stock estimates for individual trees by developing empirical relationships between crown area and carbon, finding that half of the national tree carbon stock is located outside of natural forests.In comparison to these approaches, our results suggest that incorporating SSL can improve model generalizability for vegetation structure mapping, in line with various research demonstrating the effectiveness of SSL in other domains.Additionally, our per-pixel height predictions combine the predictive quality of height for assessing biomass as demonstrated inLang et al. (2022b)andPotapov et al. (2021)with the predictive quality of crown area for assessing biomass as demonstrated inMugabowindekwe et al.