Deep point cloud regression for above-ground forest biomass estimation from airborne LiDAR

Quantifying forest biomass stocks and their dynamics is important for implementing effective climate change mitigation measures by aiding local forest management, studying processes driving af-, re-, and deforestation, and improving the accuracy of carbon accounting. Owing to the 3-dimensional nature of forest structure, remote sensing using airborne LiDAR can be used to measure vegetation structure at large scale. Harnessing the full dimensionality of the data, we present deep learning systems that predict wood volume and above-ground biomass (AGB) directly from the full LiDAR point cloud and compare the results to state-of-the-art approaches operating on basic statistics of the point clouds. For this purpose, we devise different neural network architectures for point cloud regression and evaluate them on remote sensing data of areas for which AGB estimates have been obtained from field measurements in the Danish national forest inventory. Our adaptation of Minkowski convolutional neural networks for regression gives the best results. The deep neural networks produce significantly more accurate wood volume, AGB, and carbon stock estimates than the state-of-the-art approaches. In contrast to other methods, the proposed deep learning approach does not require a digital terrain model and is robust to artifacts along the boundaries of the evaluated areas, which we demonstrate for the case where trees protrude into the area from the outside. We expect this finding to have a strong impact on LiDAR-based analyses of biomass dynamics.


Introduction
Robust quantification of forest carbon stocks and their dynamics is important for climate change mitigation and adaptation strategies (FAO and UNEP, 2020). Supporting this notion, the Paris Agreement (United Nations / Framework Convention on Climate Change, 2015) and the IPCC (Shukla et al., 2019) acknowledge that climate change mitigation goals cannot be achieved without a substantial contribution from forests. However, many countries currently do not have nationally specific forest carbon accumulation rates but rather rely on default rates from the IPCC (Masson-Delmotte et al., 2019; Requena Suarez et al., 2019), without accounting for finer-scale variations of carbon stocks (Cook-Patton et al., 2020).
Spatial details in the carbon budget of forests are often necessary to encourage transformational actions towards a sustainable forest sector (Harris et al., 2012, 2021). Nevertheless, precise spatio-temporal monitoring of forest carbon dynamics at large scales has proven to be challenging (Erb et al., 2018; Griscom et al., 2017). This is due to the complex structure of forests, topographic features, and land management practices (Tubiello et al., 2021; Lewis et al., 2019). Technological developments in remote sensing and the concurrent increased availability of field-based measurements have led to an improvement in estimating carbon stocks using remote sensing observations of forest attributes that serve as proxies for above-ground biomass (AGB) (Knapp et al., 2018; Bouvier et al., 2015; Pan et al., 2013).
Currently, three remote sensing techniques are applied to collect data for AGB estimates: i) passive optical imagery, ii) synthetic aperture radar (SAR), and iii) light detection and ranging (LiDAR). LiDAR, as an active sensor, can penetrate dense forest canopies regardless of illumination conditions. In comparison, SAR observations are sensitive to topographical variations (e.g., steep slopes or cliffs), while imaging spectroradiometer products are highly dependent on the absence of clouds (Sinha et al., 2015; Treuhaft et al., 2014). Moreover, compared to LiDAR, both optical and radar-based data tend to saturate with increasing AGB exceeding 100 Mg ha⁻¹ (Mermoz et al., 2015; Sinha et al., 2015; Mitchard et al., 2012). Consequently, both alternatives to LiDAR may not be sufficient to detect variations in AGB in the region of interest (Goetz and Dubayah, 2011), leaving LiDAR as the better tool to accurately characterize the fine-scale spatial variability in forest structure and terrestrial carbon stocks (Zhang et al., 2019; Zolkos et al., 2013) across different spatial scales, depending on the platform used (i.e., airborne, drone-based, mobile, and terrestrial laser scanning). Recent developments in spaceborne LiDAR, such as the GEDI mission (Coyle et al., 2019), can be useful for identifying areas with high AGB (Lang et al., 2021). However, the coarse spatial resolution of these sensors (sparse 25 m footprints in the case of GEDI; Hancock et al., 2019) makes it impossible to derive fine structural information and to produce local as well as sufficiently accurate AGB estimates (Lang et al., 2022; Li et al., 2020; Zhang et al., 2019). In comparison, LiDAR data obtained from UAVs and terrestrial platforms have a very high dimensionality and provide even more detailed information about forest structure and the characteristics of individual trees. Due to the limited mobility of these platforms, however, they are only suitable for local estimates of forest growing stock (Dalla Corte et al., 2020; Liu et al., 2018; Calders et al., 2018).
The state-of-the-art for predicting forest biomass from LiDAR point clouds is to voxelize the data and compute summarizing statistics of the point distribution along the vertical axis, such as mean heights, relative height quantiles, or metrics of heterogeneity (Magnussen et al., 2018; Knapp et al., 2018; d'Oliveira et al., 2012; Stark et al., 2012; Mascaro et al., 2011). Together with forest attributes obtained from inventory plots, these simple statistical features then serve as inputs to prediction models (Fassnacht et al., 2014) such as linear and non-linear regression, random forests, support vector machines, or neural networks, aimed at predicting the forest characteristics of interest. Based on this approach, several studies have focused on generalizing AGB estimation across different forest types, describing different aspects of forest structure at different spatial scales (Bouvier et al., 2015; Asner and Mascaro, 2014; Lefsky et al., 2002). For instance, Knapp et al. (2020) assessed a variety of LiDAR metrics and auxiliary information (e.g., mean canopy height, maximum possible stand density, maximum possible tree height, vertical canopy heterogeneity, and average wood density) to detect structural forest attributes that may explain stand AGB across different regions. However, in this line of research, finding the optimal combination of LiDAR metrics that can provide accurate AGB predictions has proven to be a challenging task.
An alternative to manually defining informative features based on summarizing point cloud statistics, which lose relevant structural information contained in the 3D data, is to use deep learning (LeCun et al., 2015) approaches that directly work on point clouds (Guo et al., 2020). Compared to deep learning on imagery, which has been thoroughly explored within remote sensing for classification (Li et al., 2023; Tucker et al., 2023; Yuan et al., 2020; Brandt et al., 2018; Kussul et al., 2017) and regression (Liu et al., 2023; Lang et al., 2022; Zhang et al., 2019), deep learning on point clouds for regression is an almost unexplored learning setting, except for the estimation of geometric properties from point clouds (e.g., hand pose estimation; Ge et al., 2018; Firintepe et al., 2021) and bounding box prediction in general detection tasks (Zhou and Tuzel, 2018; Qi et al., 2019). In addition, airborne laser scanning (ALS), a technology with its own peculiarities (Naesset, 2009; Goodwin et al., 2006), produces more challenging point clouds than most standard deep learning benchmark data sets for point cloud tasks, since the scenes are measured from the top instead of from the ground, and the corresponding point densities are usually lower compared to other LiDAR-based technologies (e.g., terrestrial- or drone-based LiDAR). The applicability of a deep learning system to estimate crop biomass has also been tested more recently using point clouds from LiDAR mounted on a roving vehicle (Pan et al., 2022) in a small-scale experiment. When the point density is high enough to make individual trees distinguishable, we generally expect models that use point clouds directly to give better AGB prediction results than models based on summary statistics, because the latter may be similar for areas with different AGBs.
This study brings forward deep learning systems to improve forest biomass and wood volume quantification based on airborne LiDAR data with high spatial resolution per sample and highly accurate estimates from on-site measurements. Specifically, this is the first work (the first idea of which was presented as a poster by Oehmcke et al., 2022) to apply deep neural networks directly to the point clouds to utilize structural information instead of relying on models based on simple statistical features to derive forest characteristics. We hypothesize that maintaining the full dimensions of the point cloud, and thereby preventing the information loss associated with reducing dimensionality to a few statistical features, strengthens predictions. The experimental evaluation shows that predictions using deep learning are considerably more accurate than currently employed methods at individual and aggregated plot level. Our approach requires neither a digital terrain model (DTM) nor a canopy height model (CHM). We also demonstrate that our chosen method is robust to border artifacts caused by trees reaching into plots, which is essential when using high-resolution predictions for spatial aggregations, a feature not present in currently employed methods.

Methods
In the following, we first introduce the data used in this study in Section 2.1, which combines field-based biomass and wood volume measurements with ALS point clouds. Afterwards, in Section 2.2 the assessed deep learning architectures are presented along with their required adjustments for point cloud regression. The established baseline methods, which rely on statistics derived from the point clouds, are discussed in Section 2.3. Finally, we give details on why and how to correct the bias for least-squares regression tasks in Section 2.4.

Forest data
This section outlines two data sources and their processing: first, the inventory data for AGB and wood volume and then the corresponding LiDAR data.

Forest inventory
The volume (in m³ ha⁻¹) and biomass (in Mg (metric tons) ha⁻¹) estimates used to train and evaluate the models stem from the ground measurements of the Danish National Forest Inventory (NFI); see Fig. 2a for histograms of the above-ground forest biomass, carbon stock, and wood volume distributions. The resulting NFI statistics are used for the formation and assessment of national or regional forest programs, sustainability assessments, investment calculations for the forest industry, strategic-level planning in general, and reporting to international conventions. The Danish NFI is based on a grid with cells of size 2 km × 2 km covering the entire land surface of the country (Nord-Larsen and Johannsen, 2016). A plot is composed of four circular subplots with a radius of 15 m. Each subplot is located in one of the corners of a 200 m × 200 m square, which is randomly placed within each grid cell (Fig. 1a). The full set of plots is geographically partitioned into five spatially balanced interpenetrating panels (Kish, 1998; McDonald, 2003; Olsen et al., 1999; Zhang et al., 2003). Each year within a 5-year cycle, a different panel is measured on-site, representing one fifth of the total set but being representative of the entire country. The labeled data used in our study range from 2013 to 2017.
In the Danish NFI, each subplot is composed of three concentric circles with radii of 3.5 m, 10 m, and 15 m, respectively. Following standard procedures, a single caliper measurement of the diameter at breast height (dbh) (i.e., 1.3 m from the ground) is made for all trees in the 3.5 m circle. Trees with dbh larger than 10 cm are measured in the 10 m circle, and only trees with dbh larger than 40 cm are recorded in the 15 m circle (see Fig. 2b). For a random sub-sample, further measurements of the total height (h), crown base height (hc), age, and diameter at stump height (dst) of two to six trees within each subplot are also obtained.
The height of trees for which height was not measured is estimated from models developed for each species and growth region based on the sub-sample of trees for which h and dbh were both measured. These models include the observed mean height and mean diameter within each subplot for creating localized regressions using the approach suggested by Sloboda et al. (1993). For subplots where no height measurements were available, generalized regressions were developed based on a modified Näslund equation (Näslund, 1936; Johannsen, 1999). Subsequently, individual tree volume and biomass were estimated using species-specific models (Nord-Larsen et al., 2017a) (see Fig. 1b for an overview of the species distribution). Since trees were measured in different concentric circles depending on their diameter, the volume of growing stock and biomass in each subplot were calculated by scaling the estimated AGB and wood volume of each tree according to the circular area in which the tree was measured. Carbon stock can be considered a linear function of the forest biomass; thus, it was not used as a separate regression target in this study.

Point cloud data
Airborne LiDAR point cloud data covering Denmark from 2014 to 2018 are publicly available from the Danish Agency for Data Supply and Efficiency (https://dataforsyningen.dk). Specifically, a sampling campaign in 2014 and 2015 covered the entire country, while a fifth of the country was scanned again in 2018 (in eastern Jutland and Funen). The vertical and horizontal point accuracy is reported to be 5 cm and 15 cm (given as root mean squared error (RMSE)), respectively. We extracted the point clouds corresponding to the subplots studied in the NFI (see Fig. 3). The average number of points per subplot is 11,684 with a standard deviation of 6743 points, which corresponds to an average of 16.7 points per m².

Preprocessing
Several preprocessing steps were conducted for both the deep neural networks (operating on the point clouds) as well as for the baseline models that operate on simple statistical features.
For the plot-specific point clouds, the NFI identified four types of errors. In the first type, trees that are visible in the point cloud do not belong to the field-based measurements but reach into the plot from the outside, while in the second type non-forest objects, such as land-lines or buildings, are part of the point cloud. We chose not to exclude these data, since a method that directly works with the point cloud should detect irrelevant parts by itself, meaning that some previously excluded data should now be usable. The third error type occurs when the plot had been harvested in-between field measurements and airborne point cloud acquisition. Finally, the fourth error type consists of unreasonable values. Since the third and fourth error flags are intractable, the corresponding samples were excluded from the model training. In addition, all samples with no points above 1.3 m were removed, as this was the minimal height considered for the biomass measurements. The remaining dataset consists of n = 6099 individual point clouds, each being labeled with a corresponding biomass and wood volume measurement as described above. The biomass and wood volume were measured at different times than the collection of point clouds. To ensure that both data sources matched, we only considered point clouds observed within one year of the biomass measurement for our validation and test sets. For training, we also took gaps larger than one year (up to nine years) into account.
Finally, to train, validate, and test all the models, we split the data as follows: the training set contains 2636 samples with a time interval of no more than one year and 1634 samples with a time interval longer than a year (i.e., 4270 training samples in total, at 3938 unique subplot locations). The validation set consists of 918 and the test set of 911 samples, at 903 and 901 unique subplot locations, respectively, with a maximum distance of one year between paired point cloud and biomass measurements. Since some subplots appeared twice due to a subplot being measured again on the ground or via ALS, we ensured that measurements from the same site stayed together in either the training, validation, or test dataset (i.e., no site/plot considered during training or for validation occurred in the test set).

Deep regression for point clouds
From a mathematical perspective, a single point cloud can be represented via a set P = {x_1, …, x_{n_P}} ⊂ ℝ^N, where n_P is the number and N the dimensionality of the points (typically, N = 3). Besides these coordinates, additional information is often provided for each point, such as colour or intensity. These features paired with the points define the set P_F = {(x_1, f_1), …, (x_{n_P}, f_{n_P})} with f_i ∈ ℝ^D containing the additional features. Such a point cloud is given for each instance (e.g., one subplot corresponds to one point cloud, see Fig. 3). The point cloud sets typically have different cardinalities and are unordered. Accordingly, the output of a model should not depend on the order of the provided points, a property referred to as permutation invariance. Furthermore, the spatial structure of such point clouds can be highly irregular and the points are typically very sparsely distributed (such characteristics usually vary among different regions as well).
Compared to deep convolutional neural networks (CNNs) for standard 2D and 3D images, the research field of deep learning for point cloud data has yet to mature. There are several approaches for adapting neural networks to point cloud processing, differing, among other things, in how they extend the concept of spatial convolutions to sparse point clouds. Most deep learning systems for point clouds are developed for segmentation and other classification tasks (Hackel et al., 2017; Chang et al., 2015; Wu et al., 2015). In this study, we focused on regression tasks. We adapted and compared three widely-used and conceptually different point cloud classification approaches for regression: the classic PointNet (Qi et al., 2017), the kernel point convolution (KPConv) (Thomas et al., 2019), and a Minkowski CNN (Choy et al., 2019).
We extended the open-source library Torch-Points3D (Chaton et al., 2020) to support regression and will contribute our networks and anonymized dataset to the library upon acceptance of this manuscript. Two modifications were applied to all networks. First, we replaced Rectified Linear Unit (ReLU) activation functions with Gaussian Error Linear Units (GELU) (Hendrycks and Gimpel, 2016). Second, we replaced the last global pooling operation (PointNet: max; KPConv and Minkowski: average) with global sum pooling to better reflect the task of aggregating AGB or wood volume from different areas of the point cloud scene.
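The motivation for the pooling replacement can be illustrated with a small sketch (plain NumPy, with random values standing in for per-point network activations; not code from our implementation): all three pooling operations are permutation invariant, but only the sum-pooled response grows with the amount of material contributing features, which matches an additive target such as AGB.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-point feature matrix for one plot: n_P points, 64 features each.
feats = rng.random((500, 64))

# Global pooling reduces the variable-size point set to a fixed-size vector.
sum_pool = feats.sum(axis=0)    # scales with the amount of material in the plot
avg_pool = feats.mean(axis=0)   # invariant to the number of points
max_pool = feats.max(axis=0)

# Permutation invariance: shuffling the points leaves the pooled vector unchanged.
perm = rng.permutation(feats.shape[0])
assert np.allclose(feats[perm].sum(axis=0), sum_pool)

# Duplicating the scene doubles the sum-pooled response but not the average,
# which is why sum pooling better reflects an additive quantity like biomass.
doubled = np.vstack([feats, feats])
assert np.allclose(doubled.sum(axis=0), 2 * sum_pool)
assert np.allclose(doubled.mean(axis=0), avg_pool)
```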

PointNet
The seminal PointNet is one of the first deep learning architectures developed for point clouds (Qi et al., 2017). To compute new representations x′_i ∈ ℝ^{D_out} for i ∈ {1, …, n_P}, PointNet-type networks apply the same linear map x′_i = W x_i with W ∈ ℝ^{D_out × D_in} to each input x_i ∈ ℝ^{D_in} individually, without any interaction between inputs. The original architecture is based on multiple such neural network blocks. In the first layer, the position and additional features are concatenated (D_in = N + D), which does not support translation invariance. In between these blocks, input/feature transformations are used to align the data via learnable transformation matrices. At the end, to facilitate a variable number of points and to achieve permutation invariance, a symmetric set function is applied, which usually is a global max-pooling operation. Note that there is only a global interaction between points; local contexts were introduced by follow-up works using hierarchical structures (e.g., see KPConv below).
Two PointNet variants have been proposed in the original work to address classification and segmentation scenarios, respectively. To adapt the architecture for regression, we mostly adopted the classification PointNet architecture but removed the learnable transformation matrices, since they did not improve the validation scores. Also, we used batch normalization after each hidden layer.
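The PointNet principle can be condensed into a minimal numerical sketch (random, untrained weights; not the architecture used in our experiments): a shared per-point map followed by a symmetric set function yields a permutation-invariant scalar prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

def pointnet_regress(points, W1, W2, w_out):
    """Minimal PointNet-style regressor: a shared per-point MLP, a symmetric
    (sum) pooling, and a linear output head. Weights here are random
    placeholders, not trained values."""
    h = np.maximum(points @ W1, 0.0)  # shared layer applied to every point
    h = np.maximum(h @ W2, 0.0)       # second shared layer
    pooled = h.sum(axis=0)            # symmetric set function
    return float(pooled @ w_out)      # scalar biomass-like prediction

N, H = 3, 32
W1 = rng.normal(size=(N, H))
W2 = rng.normal(size=(H, H))
w_out = rng.normal(size=H)

cloud = rng.normal(size=(200, N))
y = pointnet_regress(cloud, W1, W2, w_out)

# The prediction does not depend on the order of the points.
shuffled = cloud[rng.permutation(200)]
assert np.isclose(pointnet_regress(shuffled, W1, W2, w_out), y)
```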

KPConv
The kernel point convolution (Thomas et al., 2019) extends standard discrete convolution to the domain of point clouds. Kernel point convolution as well as discrete convolution on images use a kernel to extract local information. However, KPConv uses a set of kernel points instead of a discrete kernel mask to define the convolution kernel, which allows for a representation in Euclidean space instead of a rasterized space. More precisely, if P_F is convolved by a kernel g, the result at a point x ∈ ℝ^N is defined as (P_F ∗ g)(x) = Σ_{x_i ∈ N_x} g(x_i − x) f_i. Here, the local neighborhood N_x = {x_i ∈ P | ‖x_i − x‖ ≤ r} contains all points within a radius r of x. The kernel g is parameterized by K kernel points x̃_k, each associated with a weight matrix W_k ∈ ℝ^{D_in × D_out}, which maps an input feature vector f ∈ ℝ^{D_in} to a new feature vector f′ ∈ ℝ^{D_out} (for the first such convolution layer, we have D_in = D). These kernel points are conceptually similar to a kernel filter matrix in a CNN. In KPConv the kernel g : B_r^N → ℝ^{D_in × D_out} is defined as g(y) = Σ_{k=1}^{K} h(y, x̃_k) W_k, where h(y, x̃_k) = max(0, 1 − ‖y − x̃_k‖/σ) is an inverted distance function and σ ∈ ℝ_+ is a user-defined model parameter (Thomas et al., 2019). Thus, the kernel yields a weighted sum of weight matrices depending on the distance of a kernel point x̃_k to an input point y. Each kernel is parameterized by its own kernel points and their corresponding weight matrices. The weight matrices are learnt similarly to the coefficients of convolution operators in standard CNNs. The positions of the kernel points can either be learned ("deformable") or be systematically fixed ("rigid") around the center of B_r^N (Thomas et al., 2019). Pooling operations are realized by creating a 3D grid in which all points in a cell are aggregated (see Section 2.2.3).
We adapted the KPConv architecture proposed for classification to utilize it for regression. This architecture follows the structure of ResNet; thus, modifying it for regression was straightforward and was achieved by changing the output layer to have the identity as activation function. Fixed kernel points were preferred over deformable ones, since the latter consistently performed worse on our validation dataset.
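A naive NumPy sketch of a single rigid KPConv evaluation may clarify the formulas above (a direct, unoptimized transcription of the definitions, not the library implementation used in our experiments):

```python
import numpy as np

def kpconv_at(x, points, feats, kernel_pts, W, radius, sigma):
    """Evaluate one rigid kernel point convolution at location x, following
    Thomas et al. (2019): the kernel is g(y) = sum_k max(0, 1 - |y - x_k|/sigma) W_k.
    Shapes: points (n, 3), feats (n, D_in), kernel_pts (K, 3), W (K, D_in, D_out)."""
    out = np.zeros(W.shape[2])
    for xi, fi in zip(points, feats):
        if np.linalg.norm(xi - x) > radius:  # outside the neighborhood N_x
            continue
        y = xi - x
        # linear correlation between the centered point and every kernel point
        h = np.maximum(0.0, 1.0 - np.linalg.norm(kernel_pts - y, axis=1) / sigma)
        out += (h[:, None, None] * W).sum(axis=0).T @ fi
    return out

# Sanity check: one kernel point at the origin with an identity weight matrix,
# so a point located exactly at x contributes its feature vector unchanged.
D = 4
W = np.eye(D)[None]                         # (K=1, D_in, D_out)
f = np.array([[1.0, 2.0, 3.0, 4.0]])
out = kpconv_at(np.zeros(3), np.zeros((1, 3)), f,
                np.zeros((1, 3)), W, radius=1.0, sigma=1.0)
assert np.allclose(out, f[0])
```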

Minkowski convolutional neural network
A straightforward approach to process point clouds is to voxelize the points into a 3D image (with D channels) and to apply a standard 3D convolutional neural network (CNN) to the resulting image. This requires spatial binning: while a coarse binning may discard important information, a fine-grained binning can render the computations intractable. However, the points are typically very sparse in 3D space. This lends itself to keeping a sparse representation and applying spatially sparse 3D convolutions, which basically operate only in areas where points exist (Graham, 2015; Graham et al., 2018). A state-of-the-art implementation of sparse convolutions for point clouds is given by the Minkowski Engine, an auto-differentiation library that allows one to efficiently implement standard architectures, such as ResNet with 3D convolutions (Choy et al., 2019).
Both the Minkowski and KPConv architectures progressively downsample the point cloud based on a 3D grid via pooling operations. The initial grid size is bounded below only by the horizontal and vertical accuracy of the point data, but a grid size that is too small cannot be processed efficiently. Therefore, it is crucial to set the initial grid size according to the data and task. The subsequent pooling steps are then based on this initial size and work similarly to standard 2D pooling operations. Once a grid is created, a voxel-based approach, such as Minkowski CNNs, aggregates all voxels within a grid cell into a new voxel. In contrast, KPConv takes the mean of the positions of all points within a cell to form a new point. By doubling the size of the grid cells in each subsampling/pooling layer, the number of points/voxels is gradually reduced, corresponding to an increase in the size of the receptive field. The features associated with each new point can be obtained, for example, via max-pooling (applied to the feature vectors of the pooled points) or by applying the chosen convolution operator.
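The KPConv-style grid pooling of point positions can be sketched in a few lines of NumPy (an illustration of the principle, not the library code):

```python
import numpy as np

def grid_subsample(points, cell):
    """Grid pooling sketch: all points falling into the same 3D grid cell of
    edge length `cell` are replaced by their mean position (the KPConv
    variant; a voxel-based approach would instead emit one voxel per cell)."""
    keys = np.floor(points / cell).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.ravel()
    counts = np.bincount(inv)
    out = np.zeros((len(counts), points.shape[1]))
    for d in range(points.shape[1]):
        out[:, d] = np.bincount(inv, weights=points[:, d]) / counts
    return out

pts = np.array([[0.01, 0.01, 0.00],
                [0.02, 0.00, 0.01],
                [1.00, 1.00, 1.00]])
# Doubling the initial grid size (0.025 in our setup) merges the first two points.
pooled = grid_subsample(pts, cell=0.05)
assert pooled.shape[0] == 2
```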
Using the Minkowski Engine library, we built two 3D versions of SENet (SENet14 and SENet50) for regression (Hu et al., 2018), see Fig. 4. To that end, we replaced 2D convolutions with their 3D equivalents. The stride of the initial convolution of SENet was changed from 2 to 1 in order to avoid early loss of information. Finally, the output layer was changed to fit the number of target variables and does not apply a nonlinear activation. The two network versions differ in the number of trainable parameters as shown in Table 1 and are referred to as MSENet14 and MSENet50, respectively.

Experimental setup for forest attribute regression
Three different neural network architectures were applied: the proposed regression versions of PointNet, KPConv, and two 3D versions of SENet (MSENet14 and MSENet50) built with the Minkowski Engine. All code was written in Python and PyTorch (Paszke et al., 2019) and trained on A100 GPUs. The positions of the input points were given in the projected coordinate system EPSG:25832. We centered the x-, y-, and z-coordinates for each subplot individually to get a local reference frame. Scaling was applied by dividing by 15 m, 15 m, and 20 m, respectively, so that the x- and y-coordinates lie within [−1, 1]. The z-scaling of 20 m was chosen as it represents the expected value range for the trees we were studying, with the highest trees being around 40 m. We do not need to calibrate our model using a DTM, as is required by state-of-the-art methods based on point cloud statistics.
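The per-subplot normalization can be sketched as follows (mean-centering is an assumption for illustration; the text only states that coordinates are centered per subplot):

```python
import numpy as np

def normalize_plot(xyz):
    """Per-subplot normalization sketch: translate to a local reference frame
    (here by mean-centering, an assumed choice) and scale by 15, 15, and 20 m,
    so the x- and y-coordinates of a 15 m subplot fall into [-1, 1].
    Because no DTM is used, z is only known relative to the subplot itself."""
    centered = xyz - xyz.mean(axis=0)
    return centered / np.array([15.0, 15.0, 20.0])

# Two synthetic points on opposite edges of a subplot, 0 m and 40 m high.
pts = np.array([[0.0, -15.0, 0.0],
                [0.0, 15.0, 40.0]])
norm = normalize_plot(pts)
assert np.all(np.abs(norm[:, :2]) <= 1.0)
```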
We tuned the initial grid size of KPConv and the MSENet models by gradually decreasing it (i.e., increasing the spatial resolution) until no improvement could be seen on the validation set. The idea behind this is that larger grids can be processed faster but at the cost of losing structural details, whereas finer grids offer more structural details but take longer to process. By gradually reducing the grid size, we can ensure good results without unnecessary computations. Both architectures worked best with the initial grid size set to 0.025. The number of trainable model parameters and runtimes are given in Table 1.
To increase diversity in the training set, several augmentations were considered. Augmentation generates distinct training examples from the given training data set by applying random transformations to the input data that with high probability do not affect the dependent variable (Zhang et al., 2023). This helps to avoid overfitting, because exactly the same input is never presented to the learning system several times. Augmentation improves generalization because it forces the system to learn general features of the input which are invariant under the applied transformations. First, a random rotation around the z-axis was applied, since the predictions should be invariant to such rotations. We did not consider rotations around the x- and y-axes, since a tilt would result in unrealistic samples (e.g., a change of growth direction). Other transformations, such as scaling or shearing, were also not applied, as they may invalidate the target label. Thereafter, random dropout of points with a dropout rate of 20% was applied to 50% of all samples, which increases variety and simulates changes in point density. As a last step, each dimension of the point positions was shifted by adding a small random value drawn from a zero-mean truncated Gaussian distribution with variance 0.001 and support [−0.05, 0.05] to simulate slight positional errors and shifts. Note that these augmentations are computationally inexpensive and did not add noticeably more time to the training procedure. We only considered augmentations for the neural network models, since the summarizing statistics (which typically are, e.g., rotation invariant by definition) would not change drastically and the baseline methods are not prone to overfitting.
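The three augmentations can be sketched as follows (an illustrative NumPy version; clipping is used here as a simple stand-in for proper truncated-Gaussian sampling):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(xyz):
    """Sketch of the augmentation pipeline: random rotation about the z-axis,
    20% point dropout in half of the samples, and small positional jitter
    (zero-mean Gaussian with variance 0.001, clipped to [-0.05, 0.05] as an
    approximation of the truncated distribution)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)  # rotation angle about the z-axis
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    out = xyz @ R.T                        # z-heights are left unchanged
    if rng.random() < 0.5:                 # dropout applied to 50% of samples
        out = out[rng.random(len(out)) > 0.2]
    noise = np.clip(rng.normal(0.0, np.sqrt(0.001), out.shape), -0.05, 0.05)
    return out + noise                     # simulated positional error

pts = rng.normal(size=(1000, 3))
aug = augment(pts)
assert aug.shape[1] == 3 and len(aug) <= len(pts)
```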
Trees reaching into subplots can change the results of the calculated statistics. For that reason, these kinds of plots are often excluded from model development (see type 1 errors in Section 2.1.3). We decided to keep them, as our structure-aware model should be resilient to these border effects. To demonstrate this resilience, we artificially changed the point clouds to simulate border effects where trees reach into subplots. We used 1482 single ALS tree scans from Germany (Weiser et al., 2022), where each scan contains the point cloud of an isolated tree without other trees or ground points. We estimated the stem location of each of these trees by calculating the center of the points below 2.5 m. If the positional deviation of these points from the estimated center was larger than 5 m, indicating a high uncertainty of the stem location, the tree was discarded. This step removed 130 trees from the dataset, leaving 1352 trees for the subsequent steps. We added up to 10 randomly chosen trees to a subplot by positioning each tree center outside of the subplot (15.1 m to 20.1 m away from its center), which we refer to as tree-adding augmentation. During the evaluation phase, two runs were conducted, one without and one with these augmentations, and the results are shown separately.
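The stem-location filter can be sketched as follows (an illustration under the assumption that "positional deviation" means the maximum horizontal distance of the low points from their centroid; other readings, e.g. the mean deviation, are equally plausible):

```python
import numpy as np

rng = np.random.default_rng(5)

def stem_location(tree_xyz, base_height=2.5, max_dev=5.0):
    """Estimate a tree's stem position as the centroid (x, y) of its points
    below `base_height`; return None when those points spread more than
    `max_dev` from the centroid, flagging the estimate as unreliable."""
    base = tree_xyz[tree_xyz[:, 2] < base_height]
    if len(base) == 0:
        return None
    center = base[:, :2].mean(axis=0)
    if np.linalg.norm(base[:, :2] - center, axis=1).max() > max_dev:
        return None
    return center

# Synthetic tree: trunk points cluster tightly around (3, 4), crown above 5 m.
trunk = np.column_stack([3 + rng.normal(0, 0.1, 50),
                         4 + rng.normal(0, 0.1, 50),
                         rng.uniform(0.0, 2.4, 50)])
crown = np.column_stack([rng.normal(3, 2.0, 200),
                         rng.normal(4, 2.0, 200),
                         rng.uniform(5.0, 20.0, 200)])
loc = stem_location(np.vstack([trunk, crown]))
assert np.allclose(loc, [3.0, 4.0], atol=0.5)
```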
One goal was to create precise wood volume and above-ground biomass maps over large areas. Therefore, instead of using the entire circle in which the measurements were taken, we resorted to a hexagonal shape in the x- and y-coordinates, see Fig. 2b. This approach facilitates an easy tiling of large areas without gaps and overlaps. The hexagonal shape covers almost all of the points, in particular those in the inner circles (in control experiments, we verified that considering the whole subplot does not lead to a significant performance increase).
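Cropping points to a hexagon reduces to a simple half-plane test; the sketch below assumes a regular hexagon whose circumradius equals the 15 m subplot radius and a particular orientation (both assumptions for illustration; the exact shape used is shown in Fig. 2b):

```python
import numpy as np

def in_hexagon(xy, r=15.0):
    """Point-in-hexagon mask for a regular hexagon with circumradius r and two
    vertices on the x-axis. By symmetry, a point is inside iff
    |y| <= r*sqrt(3)/2 (flat top/bottom edges) and
    sqrt(3)*|x| + |y| <= sqrt(3)*r (the four slanted edges)."""
    x, y = np.abs(xy[:, 0]), np.abs(xy[:, 1])
    return (y <= r * np.sqrt(3.0) / 2.0) & (np.sqrt(3.0) * x + y <= np.sqrt(3.0) * r)

pts = np.array([[0.0, 0.0],    # center: inside
                [14.9, 0.0],   # near a vertex: inside
                [0.0, 14.9],   # inside the circle but above the flat edge: outside
                [9.0, 9.0]])   # inside
mask = in_hexagon(pts)
assert mask.tolist() == [True, True, False, True]
```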
We employed the AdaBelief optimizer (Zhuang et al., 2020) with weight decay set to 0.01 to update the weights of the neural networks. The initial learning rate was 0.005, which was adapted according to the cosine annealing schedule with warm restarts (T_0 = 10, T_mult = 2) (Loshchilov and Hutter, 2017) for 310 epochs. The batch size was 32 and the smooth L1 function (Girshick, 2015) was used as training loss. These hyper-parameter configurations were chosen based on the highest R² score on the validation set. After choosing the network weights with the highest R² score on the validation set, we only adjusted the running average and variance of the batch normalization layers for a number of additional epochs to stabilize training without altering the network weights. Three input channels were given per point: the distance to the subplot center and the distance to the lowest point to encode the relative position, as well as a channel filled with ones to encode the presence of a point.
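For reference, the smooth L1 loss (Girshick, 2015) is quadratic near zero and linear in the tails, making it less sensitive to outlier plots than a plain L2 loss:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss (Girshick, 2015): 0.5*d^2/beta for |d| < beta,
    |d| - 0.5*beta otherwise, averaged over all samples."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta).mean()

# Quadratic regime: residual 0.5 gives 0.5 * 0.25 = 0.125.
assert np.isclose(smooth_l1(np.array([0.5]), np.array([0.0])), 0.125)
# Linear regime: residual 2.0 gives 2.0 - 0.5 = 1.5 (not 2.0 as under L2/2).
assert np.isclose(smooth_l1(np.array([2.0]), np.array([0.0])), 1.5)
```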

Baseline methods
The baseline models considered in this study for estimating above-ground biomass rely on precomputed features from the normalized height distribution in accordance with Nord-Larsen et al. (2017b): mean height, height standard deviation, coefficient of variation, skewness, kurtosis, percentiles of the height distribution (5th, 10th, 25th, 50th, 75th, 90th, 95th, 99th), and the interception ratio (IR), which is the fraction of points above one meter compared to all first returns. For each point cloud sample, these features except the IR were computed twice, once for all points and once only for the points 1 m above ground. To determine the height of a point relative to the ground, we calibrated it semi-automatically, using two DTMs with different spatial resolutions (25 m² and 5 m²) and the first return of each light pulse from the airborne LiDAR, following the study of Magnussen et al. (2018). In a preliminary analysis, it was observed that this approach led to better results than using all the recorded returns from each LiDAR-derived light pulse. In contrast, the deep learning methods simply use all the information included in the point clouds, without the need to filter the point cloud datasets according to the number of returns. In total, d statistical features were extracted for each point cloud sample.
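The feature set can be sketched as follows (an illustration, not the exact extraction code; in particular, the IR computation shown here, first returns above 1 m over all first returns, is one reading of the definition above):

```python
import numpy as np

rng = np.random.default_rng(9)

def height_metrics(h, is_first_return):
    """Sketch of the baseline features computed from normalized heights h:
    distribution moments, height percentiles, and the interception ratio (IR)."""
    mean, std = h.mean(), h.std()
    c = (h - mean) / std
    feats = {
        "mean": mean,
        "std": std,
        "cv": std / mean,
        "skewness": (c**3).mean(),
        "kurtosis": (c**4).mean(),
        "ir": (h[is_first_return] > 1.0).mean(),
    }
    for q in (5, 10, 25, 50, 75, 90, 95, 99):
        feats[f"p{q}"] = np.percentile(h, q)
    return feats

# Synthetic normalized heights between 0 and 30 m; ~70% marked as first returns.
h = rng.uniform(0.0, 30.0, 5000)
feats = height_metrics(h, rng.random(5000) < 0.7)
assert abs(feats["p50"] - 15.0) < 1.0
assert 0.9 < feats["ir"] <= 1.0
```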

Linear and power regression
The first two baseline methods were linear regression and power regression models. For the linear regression models, all d features and the number of years between the LiDAR and NFI measurements were used as explanatory variables. The multivariate power regression can be regarded as the state of the art for predicting AGB, above-ground carbon stocks, and tree volume from the point cloud features (Magnussen et al., 2018). Following Magnussen et al. (2018), we fitted models of the form

y = w1 · z_meanAG^(w2) · z_95AG^(w3) · IR^(w4) (4)

for each quantity of interest. Here, z_meanAG is the mean height of the points more than one meter above ground, z_95AG is the 95th percentile of the height distribution of points more than one meter above ground, and IR is the interception ratio. The power regression performed worse when including training data with gaps of more than a year between the LiDAR and NFI measurements, which is why only the model using data from within a year is used. For each model, the parameters w1, …, w4 ∈ ℝ were set to the optimal least-squares estimates.
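Evaluating the model form of Eq. (4) is a one-liner; the sketch below assumes the weights are already fitted (the weight values in the test are purely illustrative). Taking logarithms turns the model into a linear one in the log-features, so log-space least squares is a natural way to obtain starting values for the nonlinear fit.

```python
def power_model(z_mean_ag, z_p95_ag, ir, w):
    """AGB estimate from the multiplicative power model of Eq. (4):
    y = w1 * z_meanAG**w2 * z_95AG**w3 * IR**w4."""
    w1, w2, w3, w4 = w
    return w1 * z_mean_ag ** w2 * z_p95_ag ** w3 * ir ** w4
```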

Random forest
We trained random forest (RF) models with 5000 trees (Breiman, 2001), using the popular Scikit-Learn implementation (Pedregosa et al., 2011) and the same feature set as the linear regression. The out-of-bag (OOB) error was used to tune the RF hyper-parameters. We considered {0.1, 0.2, …, 1} for the ratio of features considered at every split, {0.1, 0.2, …, 1} for the ratio of samples drawn for each tree, {5, 6, …, 20, ∞} for the maximum tree depth, and {1, 2, 4, …, 16} for the minimum number of samples required at each leaf node. The best model used 80% of the features at every split and 10% of the training data for each tree, had a maximum depth of 11, and required a minimum of 4 samples at each leaf node.

Bias correction
Least-squares training of deep learning models typically leads to biased models in the sense that the sum of residuals on the training data set is not zero (Igel and Oehmcke, 2023).
This can introduce a systematic error that accumulates if we are interested in the total aggregated performance over many data points. We therefore used the method proposed by Igel and Oehmcke (2023) to correct this bias on the training and validation set for all non-linear methods (all except linear regression), replacing b in (5) with

b* = b + (1/n) ∑ᵢ (yᵢ − f(xᵢ)), (6)

which makes the residuals sum to zero. This bias correction can also be applied to other non-linear models h_θ (e.g., h_θ could be a random forest); h_θ can be wrapped according to (5) with scalar a = 1 and b set to b* computed by (6).
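For an already fitted model, the correction amounts to shifting every prediction by the mean residual on a reference set. A minimal sketch follows; the function name and the functional interface are our own, while Igel and Oehmcke (2023) describe the correction in terms of the output-layer bias b.

```python
def bias_corrected(predict, X_ref, y_ref):
    """Wrap a fitted regression model so that its residuals on the reference
    data (here: training and validation samples) sum to zero, following the
    additive correction of Eq. (6)."""
    n = len(y_ref)
    # Mean residual on the reference data; this is b* - b in Eq. (6).
    mean_residual = sum(y - predict(x) for x, y in zip(X_ref, y_ref)) / n
    return lambda x: predict(x) + mean_residual
```

The wrapper works for any point predictor, including a random forest, because it only requires calls to `predict`.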

Evaluation metrics
We compared methods for predicting AGB (and thereby above-ground carbon stocks) and wood volume using different metrics: the root mean square error RMSE = √((1/n) ∑ᵢ (yᵢ − ŷᵢ)²), the coefficient of determination R² = 1 − ∑ᵢ (yᵢ − ŷᵢ)² / ∑ᵢ (yᵢ − ȳ)², and the mean absolute percentage error MAPE = (100/n) ∑ᵢ |yᵢ − ŷᵢ| / |yᵢ|. The MAPE is asymmetric and by definition weights errors of the same magnitude more strongly when the target values are smaller, but we report it in accordance with the related literature. We also study the mean bias (MB), i.e., the average of the signed errors, to emphasize differences over aggregated predictions.
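The metrics can be sketched in a few lines of pure Python. The zero-target exclusion in the MAPE follows the remark in Table 2, while the sign convention of the mean bias (prediction minus target) is an assumption.

```python
import math

def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def r2(y, p):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def mape(y, p):
    # Zero targets are excluded, as in Table 2, to avoid division by zero.
    pairs = [(a, b) for a, b in zip(y, p) if a != 0]
    return 100.0 * sum(abs((a - b) / a) for a, b in pairs) / len(pairs)

def mean_bias(y, p):
    # Sign convention (prediction minus target) is an assumption.
    return sum(b - a for a, b in zip(y, p)) / len(y)
```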

Results
An overview of the results is given in Table 2, showing the R², RMSE, and MAPE for the different algorithms on the test set, which was not used for model development. Overall, the MSENet50 model, closely followed by the MSENet14 model, achieved the best results, with an R² of 0.835 for biomass (and carbon stocks) and an R² of 0.831 for wood volume. The corresponding RMSEs were 41.10 Mg ha⁻¹ and 78.38 m³ ha⁻¹, respectively.
The RF performed the worst. The better performance of the linear regression, with an R² of 0.762 for biomass and 0.766 for wood volume, indicates that the computed features are to a certain extent linearly correlated with the target variables. The power regression model based on Magnussen et al. (2018) reached R² values of 0.761 and 0.763 for biomass and tree volume, respectively; that is, the deep learning approach clearly improved over the state-of-the-art methods. All more recent point-cloud-based methods (MSENet and KPConv) outperformed the models based on the precomputed features, indicating that the networks indeed utilized additional information from the point clouds. PointNet only achieved statistically different performance for wood volume. The MAPE, which was not an optimization criterion, showed the largest differences between the methods, which are mainly attributed to errors related to small target values.
The national forest inventory data offer information about the fraction of conifer and broadleaf trees for each site, which we used to analyze the performance at different fractions in Fig. 5. Again, the high MAPE for the baseline methods relying on the precomputed features and for PointNet was noticeable, in particular for monocultures. In general, monocultures turned out to be more difficult than mixed forests w.r.t. MAPE, but not w.r.t. RMSE and R² (e.g., forests with conifer trees only exhibited the lowest RMSE and R²). The decrease in RMSE with increasing proportion of conifers supports the intuition that the more structurally complex broadleaf trees are more difficult to predict than conifers. In Fig. 6, we show the distributions of the predictions by the power model and the MSENet model as well as the ground truth NFI data. The power model predictions consistently have lower variance, while the MSENet14 model better matches the actual distribution.
Residual errors increased with the magnitude of the target variables, as seen in Fig. 7 for the power, KPConv, and MSENet14 models. The results also show a symmetric distribution of the residuals around zero, suggesting that errors at plot level would partly cancel out when aggregating predictions. Comparing the distribution of errors as well as the mean bias of the assessed models, one can see that the KPConv and Minkowski models had a generally lower spread of errors than the power model. See Fig. B.12 and B.13 in the appendix for the remaining plots.

Table 2
Comparison of methods showing the R² score, RMSE, MAPE, and MB on the test set. The R² and MAPE for biomass are the same as for carbon stocks. Note that we excluded 0 biomass measurements from the MAPE to avoid numerical issues. RMSE and MB are in Mg ha⁻¹ for AGB and m³ ha⁻¹ for wood volume. The median (med.) and best result (min or max) of 5 trials are shown; the best results are highlighted in bold font. The median for the MB is based on the absolute values, which is more meaningful when combining multiple trials. The min values are also based on the absolute values; however, the signs are indicated in the table. To highlight significant differences between the power model and the other models, we employ the paired Wilcoxon signed-rank test (H0: the paired samples in each subplot have a median difference of zero), where a sample is the absolute error for each subplot (n = 911). Different p-values are indicated: p > 0.1 by 'ns', p ≤ 0.05 by '*', and p ≤ 0.001 by '**'.
The results of the tree-adding augmentation described in Section 2.2.4 are shown in Table 3. Our analysis demonstrates that the Minkowski and PointNet models are only marginally affected, while the performance of the baseline methods and KPConv decreases significantly. This behaviour is expected for the baseline methods, since they do not capture structure and the utilized statistics are not robust to these kinds of changes (see Fig. B.15 in the appendix for concrete examples). Even though the changes in error for KPConv are of smaller magnitude than for the baselines, the model is affected negatively.

Training a deep learning model does not necessarily require vast amounts of data to achieve useful results. To support this, we conducted an additional experiment in which we reduced the original training data (n = 4270) used with MSENet14 until the performance dropped to the power regression baseline. The first reduction was achieved by removing all labels where the LiDAR and NFI measurements are more than one year apart, since these labels are considered noisy. From the remaining n = 2636 samples, we then gradually reduced the data to 75% (n = 1977), 50% (n = 1318), 25% (n = 659), 12.5% (n = 330), and 6.25% (n = 165). For each setting, five repetitions were conducted, each time with a different random training subset to reduce the impact of selecting a non-representative subset. The results in Fig. 8 show that even a model trained with only 165 training samples outperforms the power regression baseline on the test data. While larger datasets often provide more robust generalization, we show that superior performance can be achieved even with a comparatively small amount of data. It is worth noting that pretrained models, such as the ones we created and provide with this work (see Appendix B), will reduce the need for labeled data even further.
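The subset-drawing procedure can be sketched as follows. The helper name and the seeding scheme are our own, but the subset sizes reproduce those reported above (e.g., round(0.125 · 2636) = 330 and round(0.0625 · 2636) = 165).

```python
import random

def training_subsets(samples, fractions=(0.75, 0.5, 0.25, 0.125, 0.0625),
                     repetitions=5, seed=0):
    """Draw repeated random training subsets for the data-reduction ablation.

    A different subset per repetition reduces the risk of judging the model
    on one unrepresentative draw. The seed handling is an assumption.
    """
    rng = random.Random(seed)
    out = {}
    for frac in fractions:
        k = round(frac * len(samples))
        # Sampling without replacement, independently for each repetition.
        out[frac] = [rng.sample(samples, k) for _ in range(repetitions)]
    return out
```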
An exemplary biomass map is shown in Fig. 9. The figure shows the results of the MSENet50 model applied to a randomly selected 1 km area in Denmark without any additional processing. The predictions are plausible, since heavily forested areas show a higher amount of biomass than areas with sparser tree cover.

Discussion
The results give clear evidence that the Minkowski deep learning approach is preferable to the methods working on point cloud statistics, on the individual subplot level as well as when aggregating plots. Specifically, the RMSE of the power regression is on average 8.41 Mg ha⁻¹, or 588.7 kg per subplot, larger than that of the best MSENet50 model (see Table 2). The difference in absolute MB is 1.92 Mg ha⁻¹ per 0.07 ha sample; when considering the 640,835 ha forest area of Denmark in 2021 (Nord-Larsen et al., 2023), the baseline model would accumulate an additional error of 17.6 M tons of AGB. We consider MSENet to be the method of choice for AGB estimation from ALS, because the models produced were robust to border artifacts, achieved high predictive performance, and do not require much computation time.

Table 3
Results of the border artifact simulation with tree-adding augmentations, showing R², RMSE, and MB on the augmented test set. RMSE and MB are given in Mg ha⁻¹ for AGB and m³ ha⁻¹ for wood volume. The median (med.), the difference to the unaltered test set (diff.), and the best result (min or max) of 5 trials are shown. The best results are highlighted in bold font. To highlight significant differences between having border artifacts (tree augmentations enabled) and using the test data as is, we employ the paired Wilcoxon signed-rank test (H0: the paired samples in each subplot have a median difference of zero), where a sample is the absolute error for each subplot (n = 911). Different p-values are indicated: p > 0.1 by 'ns', p ≤ 0.05 by '*', and p ≤ 0.001 by '**'. That is, 'ns' indicates the desired property that the border artifacts do not negatively influence the model predictions.
As pointed out by, for example, Xu et al. (2021) and Lu (2006), it is necessary to reduce the current uncertainty in estimates of carbon sequestration at regional scales. Classical statistical regression methods do not effectively describe the complex nonlinear relationship between forest AGB and remote sensing data (Li et al., 2020), which also becomes apparent in our results (e.g., see Fig. 6). Our proposed method goes beyond summary statistics by learning to utilize structural information and can be used to quantify forest biomass more accurately, which has been identified as a key component of understanding carbon budget responses to human intervention and climate (Harris et al., 2021). In addition, automatic generation of high-resolution AGB maps can improve the monitoring of AGB variations over time resulting from land use or land use changes, and thereby increase the effectiveness of climate policies for adaptation and mitigation.

Quantitative comparison to other studies
Caution is warranted when comparing our prediction accuracy with that of other studies. Reported errors of ALS-based prediction models for AGB range from 17 to 40 Mg ha⁻¹ in the tropics (Asner et al., 2011), and this error is comparable to estimates reported in other regions (Lefsky et al., 2002) as well as in previous studies with the Danish NFI (Nord-Larsen and Schumacher, 2012). Reassuringly, we achieve error rates at the lower end of this spectrum. It is important to emphasize, however, that the indicators documented in these studies often measure the error on the training data set as a goodness-of-fit index (e.g., in the study by Lefsky et al., 1999, the R² for AGB drops from 80% down to 33% when the model is applied to a different data set). Similarly, though less severely, earlier studies of the Danish NFI with fewer data also reported a 0.6% increase in RMSE when using hold-out sets (Nord-Larsen and Schumacher, 2012). In contrast, the training errors of our models are similar to the test set results in Table 2 (see Table B.4 in the supplementary material), indicating that the models are not overfitted.
The error of the overall carbon stock prediction has been shown to decrease with increasing area, which is partly explained by error cancellation (Asner et al., 2010; Mascaro et al., 2011; Dalponte and Coomes, 2016; Ferraz et al., 2016; Magnussen et al., 2018). This is an important property when it comes to country-wide quantification of carbon. The residuals in Fig. 7 indeed suggest that our errors average out if several (disconnected) subplots are combined.

Applicability and limitation
The results are evaluated on a previously unseen and non-overlapping test set, meaning that the model should exhibit similar performance on other arbitrarily chosen forest subplots from the study area (i.e., Denmark). For new study areas that are dissimilar to ours (e.g., a different climate zone, tree species, or topographical features), re-evaluation and potentially fine-tuning of the models would be needed. However, since little preprocessing is required, this verification and adjustment step is comparatively minor.
The question naturally arises whether the method can be transferred to predict other forest characteristics, such as stem size distribution, species classification, dead wood detection, tree quality estimation, tree diameter at breast height, crown size, or stand age. The employed method would be applicable to extract such characteristics at subplot level without any modification. As shown in our experiments (see Fig. 8), even a reduced dataset gave better results than methods based on point statistics using the full dataset. In any case, the input point cloud data need to be dense enough to provide structural information about the target (e.g., if branch structure is to be extracted, individual branches need to be identifiable in the point cloud). Since our method does not require manual feature engineering, it can be used to solve a variety of tasks in different settings. For example, one would expect the relevant features to differ between temperate and tropical forests, and a data-driven model would learn the different representations automatically. Also, even though our current model is bound to the subplot size of the Danish NFI, one could easily adjust our method to support plots of a different size.
Our method allows for dense, non-interpolated biomass mappings with unprecedented accuracy at large scale, for different types of forest (Fig. 9), and at a resolution below 30 m. The deep learning models were trained on existing NFI and ALS data, emphasizing that such models are possible today. That said, utilizing more data from different countries could lead to further improvements. Regarding runtime, our approach is also feasible to compute at larger scale. For example, based on Table 1, the estimated prediction time for the Danish forest area of 640,835 ha amounts to 8.4 h for MSENet14 or 16.8 h for MSENet50, which could be further reduced (e.g., by using multiple GPUs). This ignores preprocessing and saving steps that need to be applied for all methods (e.g., creating point cloud cutouts), but these are reasonably fast if modern data structures, such as cloud-optimized point clouds (Bell et al., 2021), are employed. Compared to the precomputation of features and the subsequent application of the baseline models, our deep learning approach currently incurs a modest amount of additional computational overhead, but we consider this appropriate for achieving higher-quality predictions.
The lack of interpretability of deep learning models has been perceived as a potential disadvantage compared to linear regression models based on manually selected statistical features, which may offer a sense of understanding. However, combinations of a large number of such features often provide little insight, owing to complex internal correlations, and they may fail to capture the relevant properties of forests (Li et al., 2020). Therefore, insights gained from these models may be limited, particularly when attempting to understand the complex relationships between vegetation and its environment. When utilizing other machine learning methods based on these statistical features, such as random forests, the interpretability is further reduced. Because our deep learning method learns to utilize structural information directly from the data, which could comprise abstract features of individual trees, such as differences in height, DBH, crown area, species composition, and other characteristics, the resulting model can provide valuable insights into complex relationships in the data that are not captured by standard metrics. To access these insights, one could leverage ongoing research on techniques for interpreting and visualizing deep learning models (Samek et al., 2019; Montavon et al., 2018; Sundararajan et al., 2017).
When plot sizes are small, border artifacts can influence the results drastically (Knapp et al., 2021). However, our results with tree-adding augmentations in Table 3 show that our PointNet and Minkowski models are not negatively affected by trees reaching into the subplot, thus allowing for the aggregation of high-resolution plots. Previously, this kind of aggregation was only possible with individual tree detection methods (Ferraz et al., 2016), but such data are difficult to acquire for larger areas. The main reason why the methods based on point cloud statistics are susceptible to border artifacts is that the underlying statistics are canopy-focused, even though the ground truth is stem-localized (Mascaro et al., 2011). In future research, it would be of interest to increase the model robustness even further; specifically, robustness to non-tree objects, such as electric poles or radio towers, could be added.

Conclusions
Precise quantification of the spatial distribution of forest biomass and carbon content is a valuable tool for understanding the processes driving af-, re-, and deforestation, and the variations in the carbon cycle from the ecosystem to the regional and global scale (Zolkos et al., 2013).
This study brings forward deep learning systems for predicting above-ground biomass, wood volume, and carbon stocks in forests efficiently and directly from airborne LiDAR point clouds. Three conceptually different neural network architectures for classification (i.e., PointNet, KPConv, and Minkowski CNNs) were modified to perform regression and were evaluated using a unique data set combining field measurements and LiDAR data.
Our adaptation of Minkowski CNNs outperformed the other deep learning approaches as well as the baseline methods. Specifically, the coefficient of determination R² for above-ground biomass and stored carbon exceeded 0.82 on the 0.07 ha subplots of a test set, which was not considered in the modelling process. The main advantage of the proposed method is the efficient implementation of 3D convolutions for sparse data, as it relies on the same operations that CNNs use for standard image processing.
The assessed deep learning models achieved high accuracies, leading to considerably better performance in predicting the AGB of forest areas compared to the state-of-the-art approach operating on basic statistics of the point clouds. Our approach simplifies the production of AGB and carbon content maps at high resolution. It requires neither digital elevation models (e.g., DTMs or canopy height models) nor forest canopies with structural heterogeneity. Finally, our model is robust to border artifacts caused by trees reaching into a subplot, which allows for an aggregation of small subplots that was not possible with previous methods.
These encouraging results may be a first step towards utilizing airborne LiDAR data for more accurate quantification and analysis of forest carbon dynamics, which is a key component for achieving the United Nations Strategic Plan for Forests under the Paris Agreement and the Sustainable Development Goals (SDGs) (United Nations Department of Economic and Social Affairs, 2017, 2021) in accordance with the IPCC 2018 (Masson-Delmotte et al., 2019).

Table B.6
Comparison of methods when ground points are removed. The given metrics are the coefficient of determination (R²), root mean square error (RMSE), mean absolute percentage error (MAPE), and mean bias (MB) on the test set. Note that we excluded 0 biomass measurements from the MAPE to avoid numerical issues. RMSE and MB are in Mg ha⁻¹ for AGB and m³ ha⁻¹ for wood volume. The median (med.) and best result (min or max) of 5 trials are shown; the best results are highlighted in bold font. The median for the MB is based on the absolute values, which is more meaningful when combining multiple trials. The min values are also based on the absolute values; however, the signs are indicated in the table.
Histogram of conifer fractions across data splits.

Fig. 1 .
Fig. 1. The left plot (a) shows the spatial distribution of the Danish NFI plots (map background based on ESRI, 2023). Colour indicates how the data were split into training, validation, and testing. The right plot (b) shows the species distribution by grouping the conifer fraction into 5 categories. The histograms for the different data sets are stacked.
where the radius r ∈ ℝ₊ is a hyper-parameter defining the domain B_r^N = {y ∈ ℝ^N | ‖y‖ ≤ r} of the kernel g, which is similar to the kernel size in a discrete convolution. The kernel g is based on a set {x₁, …, x_K} ⊂ B_r^N of K kernel points, where each kernel point x_k has an associated weight.

Fig. 2 .
Fig. 2. The left plot (a) shows the distribution of the target variables, where the y-axis gives the frequency of NFI plots within the value bins on the x-axis. Above-ground biomass and carbon stocks overlap completely, since the carbon stock estimate is computed as a linear function of the biomass. As expected, the distributions appear to be log-normal. The right plot (b) shows an exemplary input point cloud (top view of a single subplot) and a representation of the measurement circles. The dashed hexagon encloses the points used by the deep learning point cloud methods.

Fig. 3 .
Fig. 3. Examples of subplots with different fractions of broadleaf and conifer trees. Green and blue indicate points classified as belonging to trees and ground, respectively. We show three perspectives: isometric front, top, and side view. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 4 .
Fig. 4. (a) Implemented Minkowski SENet50 (MSENet50) architecture. The grid sizes are relative to the initial grid size. Conv3D refers to a sparse 3D convolution with D channels followed by GELU activation and batch normalization. The kernel size k and stride are given as single numbers but extend to all three dimensions (e.g., 3 = 3 × 3 × 3). FC layers are fully connected layers of dimensionality D. (b) Bottleneck block, parameterized by its input size C, stride s, and reduction rate t (defaults to 16).
Consider a regression model ℝ^d → ℝ of the form

f(x) = aᵀ h_θ(x) + b (5)

with parameters (θ, a, b). Here h_θ: ℝ^d → ℝ^m for some positive integer m, b ∈ ℝ, a ∈ ℝ^m, and x ∈ ℝ^d. This model can be a deep neural network, where all layers but the final layer are represented by the function h_θ and the parameters θ comprise all weights except those in the final layer. Training a deep neural network for regression using iterative optimization with mini-batches, stopping after a fixed training time or based on the performance on a hold-out data set, cannot be expected to choose the parameter b such that the residuals on the training data sum to zero. That is, on the training data {(x₁, y₁), …, (x_n, y_n)} we have in general ∑ᵢ (yᵢ − f(xᵢ)) ≠ 0.

Fig. 5. Fig. 6.
Fig. 5. Errors for different fractions of conifer and broadleaf trees for biomass on the test set. The columns show R² (higher is better), RMSE (lower is better), and MAPE (lower is better). The corresponding figure for wood volume is qualitatively the same and can be found in the supplementary material (see Fig. B.10).

Fig. 7 .
Fig. 7. Biomass residual (left side) and error distribution (right side) plots of the test performance for power regression, KPConv, and MSENet14. The mean bias is given as well to quantify the observed bias.

Fig. 8. Fig. 9.
Fig. 8. Ablation of the training set size showing the effect on the R² score, RMSE, and absolute MB (columns) for AGB and wood volume (rows). Averaged runs of the MSENet14 model are shown in blue, while the power regression baseline is given for comparison in orange. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

This work was supported by the research grant DeReEco (grant number 34306) from the Villum Foundations, the Independent Research Fund Denmark through the grant Monitoring Changes in Big Satellite Data via Massively-Parallel Artificial Intelligence (grant number 9131-00110B), a Villum Experiment grant by the Velux Foundations (MapCland project, project number 00028314), the DeepCrop project (UCPH Strategic plan 2023 Data+ Pool), the PerformLCA project (UCPH Strategic plan 2023 Data+ Pool), the EU LIFE Forests Fit for Future project (grant number LIFE19ENV/DK/000013), and the Pioneer Centre for AI (DNRF grant number P1).

S. Oehmcke et al.

Fig. B. 11 .
Fig. B.11. Boxplots of value distributions over the considered conifer fractions. The left side shows AGB and the right side shows wood volume.

Fig. B. 12 .
Fig. B.12. Biomass residual (left side) and error distribution (right side) plots of the test performance.

Fig. B. 13 .
Fig. B.13. Wood volume residual (left side) and error distribution (right side) plots of the test performance.

Table 1
Number of trainable parameters and average runtime (elapsed real time using a single GPU) for one pass through the test set (n = 911) as well as the total training time for the point cloud models.

Table B.4 (continued)
Comparison of methods showing the R² score, RMSE, and MAPE on the validation set. The R² and MAPE for biomass are the same as for carbon stocks. Note that we excluded 0 biomass measurements from the MAPE to avoid numerical issues. Best results are highlighted. Note that the deep learning models did not train on this data directly, but the feature extraction methods did.

Fig. B.10. Errors for different fractions of conifer and broadleaf trees for wood volume on the test set. The columns show R² (higher is better), RMSE (lower is better), and MAPE (lower is better).