Sensor Agnostic Semantic Segmentation of Structurally Diverse and Complex Forest Point Clouds Using Deep Learning

Abstract: Forest inventories play an important role in enabling informed decisions to be made for the management and conservation of forest resources; however, the process of collecting inventory information is laborious. Despite advancements in mapping technologies allowing forests to be digitized in finer granularity than ever before, it is still common for forest measurements to be collected using simple tools such as calipers, measuring tapes, and hypsometers. Dense understory vegetation and complex forest structures can present substantial challenges to point cloud processing tools, often leading to erroneous measurements, and making them of less utility in complex forests. To address this challenge, this research demonstrates an effective deep learning approach for semantically segmenting high-resolution forest point clouds from multiple different sensing systems in diverse forest conditions. Seven diverse point cloud datasets were manually segmented to train and evaluate this model, resulting in per-class segmentation accuracies of Terrain: 95.92%, Vegetation: 96.02%, Coarse Woody Debris: 54.98%, and Stem: 96.09%. By exploiting the segmented point cloud, we also present a method of extracting a Digital Terrain Model (DTM) from such segmented point clouds. This approach was applied to a set of six point clouds that were made publicly available as part of a benchmarking study to evaluate the DTM performance. The mean DTM error was 0.04 m relative to the reference with 99.9% completeness. These approaches serve as useful steps toward a fully automated and reliable measurement extraction tool, agnostic to the sensing technology used or the complexity of the forest, provided that the point cloud has sufficient coverage and accuracy. Ongoing work will see these models incorporated into a fully automated forest measurement tool for the extraction of structural metrics for applications in forestry, conservation, and research.


Introduction
Forest measurements are important in a number of fields including, but not limited to, forestry, climate science [1][2][3], fire risk management [4,5], and understanding habitat structural complexity [6][7][8]. Modern remote sensing techniques such as Light Detection and Ranging (LiDAR) and photogrammetry are enabling high-quality 3D reconstructions of forests to be collected by operators with little or no surveying training. Particularly transformative are techniques such as close-range photogrammetry, which enable researchers and foresters to collect high accuracy and high-resolution 3D reconstructions of forests with low-cost, consumer-grade cameras [9,10] and low-cost Unoccupied Aircraft Systems (UAS) [11,12]. While the capability to collect such rich datasets is becoming more widespread and accessible, a major obstacle impeding the utility of such datasets is the complexity of extracting reliable and useful measurements from them.
There are a number of tools available for extracting measurements from forest point clouds [13][14][15][16][17]; however, many of these tools require manual tuning of parameters and manual intervention.

Figure 1. Objects within a Mobile Laser Scanned (MLS) point cloud can still be interpreted with relative ease by a human despite having no color information. We can easily identify which points belong to terrain, vegetation, coarse woody debris, and stems in most cases. While this figure is only two-dimensional (making interpretation more challenging), these objects are considerably more recognizable when viewing the point cloud directly, as it is easier for us to perceive the structure while translating/rotating the point cloud.
Semantic segmentation refers to the separation of a dataset into meaningful subsets. In the case of this paper, we are focused on separating parts of a forest point cloud into terrain, vegetation, coarse woody debris, and stem categories. There have been many different approaches to the segmentation of forest point clouds [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34] so far. Some approaches use heuristics [20,22,25,28,29] or morphological operations [27], while others use supervised [23,26,30,31,32,33] or unsupervised [21,34] machine learning techniques. With supervised machine learning techniques, the first major challenge to building a model is to obtain or generate sufficient and appropriate training data. Some works in this area [23,33] approach this through the use of artificially generated datasets, which are made by simulating a forest and a terrestrial laser scanning operation to create perfectly labeled point clouds on demand. This approach is certainly logical, as manually labeling point cloud datasets is time-consuming and monotonous while also requiring skilled and attentive operators; however, it can be difficult to generate synthetic datasets with all of the same challenges present within real-world point clouds. Occlusions and ranging noise can be generated in these workflows with relative ease; however, it is difficult to account for all possible sensing difficulties and sources of error. Movement of the trees due to breeze, imperfect reconstructions during photogrammetry, and variable optical properties of the environment can all present sensing challenges and artefacts that are difficult to simulate at this time.
The segmentation works described above were mostly designed with individual sensing methods in mind, such as TLS [20,21,22,23,27,28,33,34], MLS [25,28], or ALS [26,28,29,30,31,32], resulting in limited transferability to point clouds captured using other methods, with the exception of [28], whose approach was demonstrated on ALS, TLS, and MLS. To move toward the idea of a fully automated and universal forest point cloud processing tool, a segmentation approach must instead be transferable across sensing methods and forest conditions.
Our paper contributes a successful semantic segmentation approach based upon a modification of the Pointnet++ [35] architecture. To train the model to perform on diverse datasets, we manually segmented point clouds from a diverse set of sensors. As this segmentation approach extracts the terrain points, we also provide a method to exploit this information to create a Digital Terrain Model (DTM) that is robust to complex understory vegetation, photogrammetry noise, and uneven terrain. Finally, we validated the accuracy and coverage of our DTM approach against six point clouds from an international benchmarking dataset [18].

Methodology Overview
This work was motivated by the idea that a well-segmented point cloud would simplify the forest point cloud measurement process in the presence of diverse and imperfect datasets. Here, we describe the creation of our training and evaluation datasets, the architecture and training approaches for the deep learning model, an approach to the generation of a Digital Terrain Model (DTM) from the segmented point cloud, and how these models and approaches were validated. Figure 2 shows a schematic of how the methods described in this paper fit into a larger-scale project that will incorporate the semantic segmentation and DTM generation tools into a comprehensive forest structural measurement tool that will be able to handle diverse forest point clouds (of high resolution).

Figure 2. Schematic diagram describing how this research, which focuses on semantic segmentation and Digital Terrain Model generation, fits into our larger goal of creating a fully automated forest point cloud measurement tool.

Class Selection Approach
The classes for semantic segmentation were chosen based on visual inspection of the point clouds with color information omitted. While some implementations of Pointnet-like architectures exploit Red, Green, and Blue (RGB) color information or LiDAR return intensity/reflectance, our model is intended to work on spatial (X, Y, Z) coordinates alone such that it can work on most (if not all) high-resolution* forest point clouds. The point cloud visualization and editing tool CloudCompare [36] was used with "Eye-Dome-Lighting" mode enabled for this step, which makes it possible to perceive the 3D structure without a colored point cloud. The classes that the authors could reliably distinguish from 3D structure alone were noise, terrain, vegetation, coarse woody debris (CWD), and stems.
A point is considered to be terrain if it appears to be part of the ground surface (according to the human labeling the dataset). The vegetation class is used as a catch-all for any points that were not terrain, CWD, or stem points. As such, any nearby points above or below the ground surface that were not considered to be terrain or CWD were labeled as vegetation. The CWD class consists of any obvious fallen timber/branches lying on the ground. While this class could have been merged with the stem class, it was kept separate for the following reasons. First, we need to distinguish between a log on the ground and a standing tree stem that may be adjacent. When clustering with the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm in post-segmentation processing steps (beyond the scope of this paper), these would be considered one tree if they were not in separate classes, which is undesirable for our processing approach; the sketch below illustrates this concern. Secondly, the reconstructed CWD that we wish to classify included more variable structures than the stem class. For example, we wish to detect partially decomposed CWD, which can have a different structure to what the stem class is intended to represent, particularly in photogrammetric datasets.
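To make this concern concrete, the following minimal sketch (ours, not part of the released pipeline; the eps and min_samples values are illustrative only) shows how DBSCAN resolves stem-labeled points into separate instances once CWD points are excluded from the stem class:

```python
# Minimal sketch: clustering stem-labeled points into individual tree
# instances with DBSCAN. If CWD points shared the stem class, a log touching
# a standing tree would merge into the same cluster.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_stems(stem_points, eps=0.3, min_samples=20):
    """stem_points: (N, 3) array of X, Y, Z coordinates labeled 'stem'.
    Returns an integer cluster id per point (-1 = noise)."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(stem_points)

# Example: two synthetic stems 2 m apart resolve into two clusters.
rng = np.random.default_rng(0)
stem_a = rng.normal([0.0, 0.0, 1.0], 0.05, (200, 3))
stem_b = rng.normal([2.0, 0.0, 1.0], 0.05, (200, 3))
ids = cluster_stems(np.vstack([stem_a, stem_b]))
print(np.unique(ids))  # e.g. [0 1]
```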
The stem and vegetation classes were intended to separate the well-reconstructed woody material from the leaf material. We observed that as the reconstruction quality reduces, points that may be from a branch or stem become indistinguishable from points that may belong to leaf material. As a result, there is typically a gradual transition from stem to vegetation class (per our class definitions) as noise/measurement errors increase, stem/branch sizes decrease, or occlusions lead to poor reconstruction. If a section of a branch or stem is very poorly reconstructed, there is little use in trying to measure the diameter of it, as it will almost certainly be incorrect; however, the points can still be useful for measuring the amount of canopy vegetation.
Manually labeling the point clouds requires some operator discretion, so for the sake of consistency, only one person manually segmented the entire dataset.

* When we refer to "high-resolution" in this paper, we are referring to any forest point clouds where tree stem diameters could be directly measured from the point cloud.

Segmentation Model Dataset Generation
The point clouds used in this study came from a variety of sources, forest conditions, and sensor systems. These point clouds were captured using Terrestrial Laser Scanning (TLS), Aerial Laser Scanning (ALS), Mobile Laser Scanning (MLS), and Unmanned Aerial System (UAS)-based aerial photogrammetry (UAS_AP). Forest conditions included open woodlands, pine plantations, and dense eucalyptus forests of varying structural complexity, and data were collected in various locations throughout Australia and New Zealand. Seven point clouds (described in Table 1) were manually segmented using the segmentation tool in CloudCompare [36] into the four class categories (Terrain, Vegetation, CWD, Stem). When manually segmenting the point clouds, color information can be helpful for the human operator to differentiate objects in the point cloud; however, care must be taken to avoid the creation of contradictory training information. The model relies on spatial coordinates alone, so any segmentation performed using the color information must be carefully checked to ensure that the class of interest can be identified by a human with only spatial information (i.e., no color). Particularly in the case of the photogrammetry datasets, CWD can be visible in colorized point clouds, but spatially, it cannot be reconstructed in a way that is distinct from the underlying terrain points. We wish to train the model to predict CWD only when it is clearly present in the 3D structure.
Once segmented, these point clouds were split (0.5/0.25/0.25) into training, validation, and test sets at the individual point cloud level. Figure 3 visualizes the data split and the manually labeled point clouds. We did not split these datasets blindly, as it was necessary to ensure that representative samples of each point cloud were present in the training, validation, and test sets. This is necessary as the datasets are imbalanced simply due to the structure of forests: there were far more stem and vegetation points than CWD points. If we were to split the data blindly, we would run the risk of providing insufficient CWD samples to the model during training, or none during validation/testing, leading to poor performance and/or an inappropriate evaluation of the model's performance.

For the purposes of training and evaluating the model, the canopy was fully removed from the HOVERMAP_1 dataset and partially trimmed from the TLS_3 and HOVERMAP_2 datasets, as manually segmenting the multitude of small branches was both unfeasible from a time perspective and also highly ambiguous. The ambiguity arises when choosing a boundary between points belonging to either the stem or vegetation class, as stems/branches become noisier with increasing height due to the increasing sensing distance, beam divergence effects, and occlusion effects (from a ground-based LiDAR). Figure 4 shows a close-up section of the HOVERMAP_2 dataset, which visualizes the ambiguity existing in the original point clouds. We cannot evaluate the model against a human baseline in regions where a human cannot consistently label the points, so we took the approach of removing the ambiguous regions from several of the training, testing, and validation datasets as appropriate.

The training dataset (shown in Figure 3) was cloned twice, with each clone being scaled by a factor of 0.5 and 2.0, respectively. The cloned point clouds that were downscaled to 0.5 of their original size were subsampled to 0.01 m resolution (minimum distance between points) to provide training examples of smaller sized objects at the same 0.01 m resolution as the original dataset. The upscaled clone was not subsampled, as doubling the point cloud scale halves the effective point cloud resolution.
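For illustration, a minimal sketch of this scale-cloning step is given below. The paper specifies only a 0.01 m minimum distance between points; the voxel-grid thinning used here as an approximation of that filter is our assumption:

```python
import numpy as np

def voxel_subsample(xyz, min_spacing=0.01):
    """Keep one point per voxel of side `min_spacing` as an approximation
    of a minimum-distance-between-points filter."""
    keys = np.floor(xyz / min_spacing).astype(np.int64)
    _, keep = np.unique(keys, axis=0, return_index=True)
    return xyz[np.sort(keep)]

def scale_clones(xyz):
    """Clone the training cloud at 0.5x and 2.0x scale. Only the downscaled
    clone is re-thinned; doubling the scale already halves the resolution."""
    small = voxel_subsample(xyz * 0.5)
    large = xyz * 2.0
    return small, large
```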

Network Architecture
The architecture we used was based upon Pointnet++ [35], which was chosen due to its ability to perform semantic segmentation of unordered point clouds directly and efficiently without the need for voxelization. The main change we made from the Pointnet++ architecture was to increase the size of the model, raising its learning capacity enough to handle up to 20,000 points per sample versus 1024 points per sample in the original paper. For detailed explanations of the set abstraction and feature propagation modules, please see the original Pointnet [39] and Pointnet++ [35] papers. The Pytorch Geometric [40] implementation of the Pointnet++ segmentation architecture was used as the starting point for this work, with our modified architecture shown in Figure 5. We have described the architecture in the same structure as the Pytorch Geometric segmentation examples for ease of implementation.
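For readers wishing to reproduce the general structure, the sketch below follows the Pytorch Geometric Pointnet++ segmentation example that our architecture extends. The layer counts and channel widths here are illustrative placeholders only; the exact sizes of our modified architecture are those given in Figure 5:

```python
import torch
from torch_geometric.nn import MLP, PointNetConv, fps, radius, knn_interpolate

class SAModule(torch.nn.Module):
    """Set abstraction: farthest-point-sample centroids, group neighbours
    within radius r, and apply a shared local MLP (PointNetConv)."""
    def __init__(self, ratio, r, nn):
        super().__init__()
        self.ratio, self.r = ratio, r
        self.conv = PointNetConv(nn, add_self_loops=False)

    def forward(self, x, pos, batch):
        idx = fps(pos, batch, ratio=self.ratio)
        row, col = radius(pos, pos[idx], self.r, batch, batch[idx],
                          max_num_neighbors=64)
        edge_index = torch.stack([col, row], dim=0)
        x_dst = None if x is None else x[idx]
        x = self.conv((x, x_dst), (pos, pos[idx]), edge_index)
        return x, pos[idx], batch[idx]

class FPModule(torch.nn.Module):
    """Feature propagation: interpolate coarse features back onto the denser
    level and fuse them with the skip connection through an MLP."""
    def __init__(self, k, nn):
        super().__init__()
        self.k, self.nn = k, nn

    def forward(self, x, pos, batch, x_skip, pos_skip, batch_skip):
        x = knn_interpolate(x, pos, pos_skip, batch, batch_skip, k=self.k)
        if x_skip is not None:
            x = torch.cat([x, x_skip], dim=1)
        return self.nn(x), pos_skip, batch_skip

class SegNet(torch.nn.Module):
    """Two-level encoder/decoder; widths are placeholders, not Figure 5's."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.sa1 = SAModule(0.25, 0.2, MLP([3, 64, 128]))
        self.sa2 = SAModule(0.25, 0.4, MLP([128 + 3, 256, 512]))
        self.fp2 = FPModule(3, MLP([512 + 128, 256]))
        self.fp1 = FPModule(3, MLP([256, 128, 128]))
        self.head = MLP([128, 128, num_classes], dropout=0.5, norm=None)

    def forward(self, pos, batch):
        sa0 = (None, pos, batch)
        sa1 = self.sa1(*sa0)
        sa2 = self.sa2(*sa1)
        x, _, _ = self.fp2(*sa2, *sa1)
        x, _, _ = self.fp1(x, sa1[1], sa1[2], *sa0)
        return self.head(x)  # (N, num_classes) per-point scores
```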
Figure 5. The network architecture used in this paper was based upon the Pytorch Geometric [40] implementation of Pointnet++ [35] with some modifications to increase the size and learning capacity of the network. Abbreviations: Multilayer Perceptron (MLP), 1 Dimensional Convolution (Conv1D), Rectified Linear Unit (ReLU).


Data Pre-Processing
While the Pointnet++ [35] architecture is able to process point clouds directly, it was not made to ingest large point clouds all at once (such as TLS point clouds, which may contain greater than 1 billion points). For segmentation in the original Pointnet++ paper, subsets of the point cloud with 1024 points or fewer were used. Our approach involves slicing the point cloud into cube-shaped regions of side length 6 m with a minimum of 500 and a maximum of 20,000 points. If one of these cubes contains greater than 20,000 points, points are removed at random until 20,000 points remain. If a cube contains fewer than 500 points, it is not used, to avoid processing empty or nearly empty cube samples; such sparse samples were typically difficult even for humans to correctly classify, so they were considered sub-optimal examples for the model to learn from. The 6 m size is an arbitrary size chosen during early experimentation to capture enough context that humans could identify objects in the samples in most cases while a sample was limited to 20,000 points. Smaller sample boxes offered less context, making them more difficult to classify. Larger sample boxes become lower resolution if we retain the 20,000-point cut-off and again become more difficult to classify by humans and machines alike.
These cube regions overlap in the X, Y, and Z dimensions, with 0.75 overlap used for the training data and 0.5 overlap used for the validation and testing data.
Each cube is shifted to the origin prior to inference to avoid floating point precision issues when dealing with the large numbers from global coordinates. Pre-processing is performed before training or inference, and each sample is stored in a file. During training, the samples are seen repeatedly, so this avoids pre-processing each sample multiple times. During inference, pre-processing would otherwise bottleneck the process, so loading each sample from a file allows the Graphics Processing Unit (GPU) to work at near full capacity. By pre-processing the data before training or inference, we can also take advantage of parallel processing more easily. To minimize computational time, our pre-processing approach takes advantage of vectorization as much as possible through extensive use of the NumPy [41] package.
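A minimal sketch of this slicing and sampling procedure is shown below, assuming an axis-aligned grid of overlapping 6 m boxes; the function and parameter names are ours rather than those of the released implementation:

```python
import numpy as np

def slice_into_cubes(xyz, box=6.0, overlap=0.5, min_pts=500, max_pts=20000, seed=0):
    """Slice a point cloud into overlapping cubes of side `box` metres.
    Returns a list of (n, 3) arrays, each shifted to its cube's origin."""
    rng = np.random.default_rng(seed)
    step = box * (1.0 - overlap)                 # 0.75 overlap for training, 0.5 otherwise
    lo, hi = xyz.min(axis=0), xyz.max(axis=0)
    samples = []
    for x0 in np.arange(lo[0], hi[0], step):
        for y0 in np.arange(lo[1], hi[1], step):
            for z0 in np.arange(lo[2], hi[2], step):
                origin = np.array([x0, y0, z0])
                mask = np.all((xyz >= origin) & (xyz < origin + box), axis=1)
                pts = xyz[mask]
                if len(pts) < min_pts:
                    continue                     # skip empty or nearly empty cubes
                if len(pts) > max_pts:
                    pts = pts[rng.choice(len(pts), max_pts, replace=False)]
                samples.append(pts - origin)     # shift to origin for float precision
    return samples
```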

Data Augmentation and Model Training
Data augmentation is applied to training samples in the form of random rotations about the X, Y (±15°), and Z (±180°) axes and random scale changes by multiplying coordinates by a factor of 0.8 to 1.2. If there is no terrain or CWD present in a sample, the X and Y axis rotations are randomly chosen between ±90° instead of ±15°. We did this because we do not wish to train the model to predict the terrain class on vertically oriented surfaces (such as the side of a particularly large diameter tree), but valid stems can be completely horizontal.
For each cube-shaped sample, there was a 50% chance of adding random noise to the X, Y, and Z coordinates with a randomly chosen standard deviation of between 0.01 m and 0.025 m and a mean of 0 m, applied at a per-point level. The training dataset consisted of 112,758 samples prior to the random augmentation, which was applied throughout training to minimize the risk of overfitting and to aid the generalizability of the model to unseen data.
To minimize contradictory training information, if a sample contains CWD but no ground points, the CWD is relabeled to Stem class during training. The intent behind this condition is for the model to learn that CWD should be near the ground and that CWD is similar to the Stem class in some circumstances.
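The augmentation and relabeling rules described above can be sketched as follows. The class ids, the order of operations, and the use of NumPy's random generator are our assumptions (the sign convention of the rotation matrices is irrelevant for symmetric random ranges):

```python
import numpy as np

TERRAIN, VEGETATION, CWD, STEM = 0, 1, 2, 3      # assumed label ids

def rotation(axis, deg):
    """Rotation matrix about one axis (0=X, 1=Y, 2=Z) by `deg` degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    i, j = [(1, 2), (0, 2), (0, 1)][axis]
    m = np.eye(3)
    m[i, i] = c; m[j, j] = c; m[i, j] = -s; m[j, i] = s
    return m

def augment(xyz, labels, rng):
    has_flat = np.isin(labels, [TERRAIN, CWD]).any()
    tilt = 15.0 if has_flat else 90.0            # ±15° tilt, or ±90° without terrain/CWD
    m = (rotation(0, rng.uniform(-tilt, tilt))
         @ rotation(1, rng.uniform(-tilt, tilt))
         @ rotation(2, rng.uniform(-180.0, 180.0)))
    xyz = xyz @ m.T * rng.uniform(0.8, 1.2)      # random rotation and scale
    if rng.random() < 0.5:                       # 50% chance of per-point jitter
        xyz = xyz + rng.normal(0.0, rng.uniform(0.01, 0.025), xyz.shape)
    if (labels == CWD).any() and not (labels == TERRAIN).any():
        labels = np.where(labels == CWD, STEM, labels)  # avoid contradictory CWD targets
    return xyz, labels
```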
All training and testing were performed on a desktop computer with an Intel i9-10900K CPU, 128 gigabytes (GB) of DDR4 RAM, and an Nvidia Titan RTX graphics processing unit (GPU) with 24 GB of Video Random Access Memory (VRAM). The model was trained for 300 epochs with a batch size of 8 (limited by GPU VRAM), taking approximately 3 days. Figure A1 shows the changes in accuracy and loss of the train and validation sets over the 300 epochs.
The model was trained using cross-entropy loss with an initial learning rate of 5 × 10⁻⁵, which was reduced to 2.5 × 10⁻⁵ after 150 epochs. The initial learning rate was chosen through experimentation, where we found that higher learning rates led to erratic loss values or exploding gradients.
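A minimal sketch of this training configuration is given below; the choice of the Adam optimizer and the data loader interface are assumptions, as the paper does not specify them:

```python
import torch

def train(model, train_loader, epochs=300, device="cuda"):
    """train_loader is assumed to yield PyTorch Geometric Batch objects
    with .pos (N, 3) coordinates, .batch (N,) sample ids, and .y (N,) labels."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)   # Adam is an assumption
    # Halve the learning rate (5e-5 -> 2.5e-5) after epoch 150.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150], gamma=0.5)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for data in train_loader:                               # batch size 8 in the paper
            data = data.to(device)
            optimizer.zero_grad()
            loss = criterion(model(data.pos, data.batch), data.y)
            loss.backward()
            optimizer.step()
        scheduler.step()
```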

Model Inference
When used for inference, the model is applied with a sliding-box overlap of 0.5 in the X, Y, and Z axes. For each point in the segmented point cloud, up to 16 nearest neighbors are found within a maximum search radius of 0.1 m. The median prediction scores are computed for each class, followed by an argmax function to select the final point label. The initial segmented point cloud may be down-sampled in some regions through the process of enforcing a maximum of 20,000 points per sample region, so to label the full original point cloud, each point in the original point cloud is assigned the label of its nearest neighbor in the segmented point cloud.
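A sketch of this inference cleanup step, using SciPy's cKDTree for the neighbor searches (an assumption; the paper does not name the implementation), is given below:

```python
import numpy as np
from scipy.spatial import cKDTree

def smooth_and_upsample(seg_xyz, seg_scores, full_xyz, k=16, r=0.1):
    """seg_xyz: (N, 3) segmented (possibly down-sampled) points;
    seg_scores: (N, C) per-class prediction scores;
    full_xyz: (M, 3) original full-resolution cloud."""
    tree = cKDTree(seg_xyz)
    dist, idx = tree.query(seg_xyz, k=k, distance_upper_bound=r)
    labels = np.empty(len(seg_xyz), dtype=np.int64)
    for i in range(len(seg_xyz)):
        nbrs = idx[i][np.isfinite(dist[i])]      # neighbors actually found within r
        labels[i] = int(np.argmax(np.median(seg_scores[nbrs], axis=0)))
    # Transfer labels to every point of the original cloud via nearest neighbor.
    _, nearest = tree.query(full_xyz, k=1)
    return labels[nearest]
```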

Semantic Segmentation Evaluation Method
The segmented point cloud was evaluated on an individual point basis against the manually segmented point cloud dataset. The Python package Scikit-Learn [42] was used to evaluate the model and generate a confusion matrix of the results. As manually labeling point clouds is highly time-consuming, there was a practical limit to how many point clouds we could quantitatively evaluate the segmentation model on. In the interests of transparency, and in order to demonstrate the utility and limitations of the tool on a larger range and scale of datasets, we have provided a fly-through video of several additional datasets segmented by the model. These datasets are described in Table 2.

* CloudCompare's "Statistical Outlier Removal" was applied to VUX_1LR_2 with the default settings prior to processing to speed up the processing time.
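A minimal sketch of this point-wise evaluation with Scikit-Learn [42] follows; y_true and y_pred are assumed to be 1-D arrays holding one integer label per point:

```python
from sklearn.metrics import confusion_matrix, classification_report

def evaluate(y_true, y_pred):
    """y_true, y_pred: 1-D integer label arrays, one entry per point."""
    names = ["Terrain", "Vegetation", "CWD", "Stem"]
    # Row-normalized confusion matrix: per-class accuracy on the diagonal.
    cm = confusion_matrix(y_true, y_pred, normalize="true")
    print(classification_report(y_true, y_pred, target_names=names, digits=4))
    return cm
```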

Digital Terrain Model Generation
Once a point cloud has been segmented by the model, the points labeled with the "terrain" class can easily be extracted for use in the generation of a Digital Terrain Model (DTM). In Figure 6, we provide pseudocode describing the process used to generate a DTM from the segmented point cloud.
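As the pseudocode of Figure 6 is not reproduced here, the following stand-in sketches the general idea of rasterizing the terrain-labeled points into a DTM; the median cell statistic and the interpolation strategy for empty cells are our assumptions rather than the exact procedure of Figure 6:

```python
import numpy as np
from scipy.interpolate import griddata

def make_dtm(terrain_xyz, cell=0.2):
    """Rasterize terrain-labeled points to a DTM grid of `cell` m resolution."""
    ij = np.floor(terrain_xyz[:, :2] / cell).astype(np.int64)
    heights = {}
    for key, zval in zip(map(tuple, ij), terrain_xyz[:, 2]):
        heights.setdefault(key, []).append(zval)
    occupied = np.array([(i, j, np.median(zs)) for (i, j), zs in heights.items()])
    xy = (occupied[:, :2] + 0.5) * cell                     # centers of occupied cells
    # Build the full bounding grid and interpolate heights for empty cells.
    gi, gj = np.meshgrid(np.arange(ij[:, 0].min(), ij[:, 0].max() + 1),
                         np.arange(ij[:, 1].min(), ij[:, 1].max() + 1))
    grid_xy = (np.c_[gi.ravel(), gj.ravel()] + 0.5) * cell
    z = griddata(xy, occupied[:, 2], grid_xy, method="linear")
    hole = np.isnan(z)                                      # cells outside the convex hull
    z[hole] = griddata(xy, occupied[:, 2], grid_xy[hole], method="nearest")
    return np.column_stack([grid_xy, z])
```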

Digital Terrain Model Evaluation Method
To evaluate the performance of our Digital Terrain Model (DTM), we applied it to 6 point clouds made publicly available by a TLS benchmarking study [18]. In their study, they generated the reference DTMs by first classifying the ground points using the TerraScan software [44], followed by manual removal of non-ground objects. They applied a 20 cm resolution grid for rasterization and used the mean height of the ground points within each cell. In cases without points, the height value was interpolated using the average of the neighboring cells.
To compare our DTM height measurements against the reference DTMs provided by the benchmarking study, we used a 20 cm grid resolution in our algorithm, followed by 2D linear interpolation to the positions of the reference DTM, as there was a small offset between predicted and reference grid point positions due to minor differences in how the grid is generated.
The benchmarking study also introduced a measurement called DTM Coverage. The DTM coverage is the ratio of the covered reference DTM points and the total reference DTM points. For each point in the reference DTM, the nearest neighbor from our DTM was found. If the distance to the nearest neighbor was less than 0.2 m, the point was "covered". A score of 1 implies the predicted DTM completely covered the region of the reference DTM. This approach accounts only for the points in the reference DTM that are or are not covered and it does not consider the possibility of covering a greater area than the reference DTM.
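Both evaluation measures can be sketched compactly; the implementation below (ours, using SciPy) computes the height error after 2D linear interpolation onto the reference grid positions and the DTM coverage as defined above:

```python
import numpy as np
from scipy.interpolate import griddata
from scipy.spatial import cKDTree

def evaluate_dtm(pred_dtm, ref_dtm):
    """pred_dtm, ref_dtm: (N, 3) arrays of DTM grid points (X, Y, Z)."""
    # Interpolate our DTM heights onto the reference grid positions (2D linear).
    z = griddata(pred_dtm[:, :2], pred_dtm[:, 2], ref_dtm[:, :2], method="linear")
    valid = ~np.isnan(z)
    rmse = np.sqrt(np.mean((z[valid] - ref_dtm[valid, 2]) ** 2))
    # Coverage: fraction of reference points with a predicted point within 0.2 m.
    dist, _ = cKDTree(pred_dtm).query(ref_dtm, k=1)
    coverage = float(np.mean(dist < 0.2))
    return rmse, coverage
```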

Semantic Segmentation Evaluation
The semantic segmentation results are visualized in Figure 7, with the manually segmented reference point clouds on the left and the model's predictions on the right of each pair. These point clouds are the "Test" dataset first shown in the top half of Figure 3. The predictions were visually very similar to the reference dataset. Of the 4 class labels, we observed that the model was least accurate at segmenting the CWD class, with the clearest examples of this in the TLS_1, TLS_2, and HOVERMAP_2 datasets. Ground points that were misclassified as stem points can be seen in the VUX_1LR_1 dataset, and the stem was not extracted as far up the tree as in the human segmented point cloud. It appears to be uncommon for ground points to be misclassified as stem points in our other testing datasets, but when they do occur, it tends to be on the edges of the point clouds where the ground does not completely cover the sample box region. These observations align with what we would expect based upon the quantitative results shown in the confusion matrix in Figure 8.

The terrain, vegetation, and stem classes had notably greater accuracy than the CWD class, with stems being predicted with the greatest accuracy on this test dataset. Table 3 presents the per-class recall and precision scores, as well as the overall accuracy, precision, and recall. Per-class accuracies are shown in the confusion matrix in Figure 8. We have also provided a comparison of our segmentation results against others in the literature in Table 4; however, it should be stressed that without identical test datasets and agreement on per-class definitions, it can only be used as an indication of their relative performances.

Video Demonstration of Semantic Segmentation Performance
The segmentation model was also applied to additional point clouds that were not manually segmented (nor seen by the model during training in any way) to qualitatively identify the performant aspects of the model and its weaknesses under various scenarios. To present our results as transparently as possible, a fly-through video of the unseen datasets is provided here (Krisanski, S. et al., Sensor Agnostic Semantic Segmentation of Forest Point Clouds using Deep Learning (Part 2), https://www.youtube.com/watch?v=v0HwNu6SK6g, accessed on 30 March 2021). This video shows five datasets from five different sensor types as described in Table 2 in the methodology section. The datasets shown in the video are also visualized in Figure 9. Below, we have provided comments on the fly-through video with associated timestamps. These timestamps are linked to the video sections in the description on YouTube. We also provide Figure A2 in the appendix to show an example of the per-class feature maps of the TLS_4 dataset.

TLS_4
• Successfully identified CWD can be seen (00:30).
• Some understory vegetation is misclassified as stem (00:38).
• TLS_4 has some point cloud registration errors in the canopy, which is potentially due to wind during data capture; however, this does not appear to have affected the predictions negatively (00:48).

UAS_AP_2
• As this was captured by above-canopy nadir aerial photogrammetry, many stems were not well reconstructed (01:13).
• Rocks can be seen to be classified as CWD. This was not considered a misclassification since we never provided examples of rocks; however, this suggests rocks could be worth including in future models for quantifying habitat (01:23).
• The bases of many stems were classified as CWD (01:35).


HOVERMAP_3
• Mostly desirable performance on the Hovermap dataset.
• Some minor branches/stems were mislabeled as vegetation; however, most of these examples are in the ambiguous region between our definition of stem and vegetation, where it would be difficult to measure accurate diameters from the point cloud even if they were detected as stems (02:39).

VUX_1LR_2
• The bases of many stems in this dataset were misclassified as vegetation (03:05).
• CWD was not well detected in noisy point clouds, which is likely a result of limited training examples of this data type (03:15).

UC_UAS_AP_1
• A major stem (leaning almost horizontally) and some minor branches/small stems were missed by the model and labeled as vegetation (04:32).

• The main CWD object in the point cloud was partially correctly segmented but was misclassified as vegetation in some regions and misclassified as stem where the CWD contacts a standing stem (04:33).
• A small patch of terrain points was misclassified as stem (04:36).

Digital Terrain Model Evaluation against Benchmarking Dataset
Our approach to DTM generation was able to cover the entirety of the reference DTM in five out of six cases, and the remaining case was also effectively completely covered, at 0.991 coverage. Table 5 shows the results of the DTM evaluation against the benchmarking study.

Processing Times
We have provided Table 6 to demonstrate the processing times on the desktop computer described in Section 2.6, and Figure A3 in the appendices visualizes these numbers more clearly. The high-resolution TLS point cloud TLS_4 was processed from start to finish in 29 min. However, the VUX_1LR_2 dataset exceeded the 128 GB of RAM available on our desktop computer, which meant that it needed to spill over onto swap space on the M.2 solid state drive for the excess (≈200 GB of swap space was used).

* All point clouds were subsampled to 0.01 m minimum distance between points. ** Area was computed automatically using a convex hull on the terrain labeled points.
In Figure A3, it can be seen that the post-processing step (consisting of the DTM generation process) had a smaller impact on the processing time than the pre-processing and inference steps, as is to be expected. Both the pre-processing and segmentation steps appear to have a similar relationship with respect to the number of points in the point cloud. We fitted a 2nd order polynomial to these points to visualize the trend. From these trends, which appear to increase quadratically with respect to the number of points, the best approach to using this model in practice would be to slice large point clouds into sub-point clouds to be processed in batches before reassembling them (if needed). The optimal slice size will depend on the computational resources available, as the classification model performs worse on the edges of point clouds (smaller slices mean more edges for an equivalent point cloud).
This model is currently only suitable for relatively high-performance desktop computers and above; however, due to the computational expense of working with such large point clouds, it is reasonable to expect that those interested in our approach will already have a sufficiently powerful workstation. Lower-end computers cannot cope well with point clouds containing hundreds of millions to billions of points, so our method is likely out of reach of such machines at this time.

Segmentation
In Table 4, we presented our segmentation results alongside other forest point cloud segmentation studies. This is not an exhaustive list of related works but is intended to serve as an indication of the performance of our approach relative to similar studies. For this comparison, we must acknowledge the limitations that we are not comparing these methods on the same datasets and that our definitions of the stem and vegetation classes may differ slightly from those of the other studies in the field. The top performing model we found in the literature was [34], achieving an overall accuracy of 92.5%, which is a particularly impressive result considering it used an unsupervised learning technique, negating the need for labeled training data. [24] achieved a 91% overall accuracy using a random forest-based technique. [23] used a Pointnet++ inspired approach and claimed an overall accuracy of "close to 90%". [26] tested a variety of approaches, with their best results being on their Carabost dataset. An overall accuracy was not reported; however, we can compare with respect to overall precision. Using a 3D convolutional neural network on voxels, they reported an overall precision of 79% without LiDAR intensity information and 81.9% with intensity. [26] also tested a Pointnet-based method that achieved 74.7% without intensity information and 77% with intensity. Our model was able to achieve 96.1% overall accuracy (if only comparing stem and vegetation classes) or 95.4% overall accuracy if comparing our model in its entirety (segmenting all four classes). Our model scored a higher overall precision than all of the models tested in [26]. While we cannot conclusively compare these models in this form due to the above-mentioned limitations, of the semantic segmentation studies we compared against, our model ranks among the best performing at this task. This remains the case even while simultaneously segmenting an additional two classes that the compared models did not need to segment.
A limitation of this work is the subjectivity associated with manually labeling forest point clouds. While the majority of points can be segmented consistently, it is inevitable that mislabeled points will be present due to the ambiguity of noisy sections and the limited time that can be spent ensuring a point cloud is correctly labeled. Further to this, humans are not well suited to highly repetitive tasks, and while all possible care was taken to accurately label these point clouds during the two-week-long labeling process, some minor (human) misclassifications are almost guaranteed to be present. In synthetic forest point cloud datasets, it is possible to precisely define vegetation and stem as separate categories; however, in real-world point clouds, this distinction becomes less clear. As discussed in Section 2.3, we described a continuous scale between the definitions of stem and vegetation, where stem points begin to resemble vegetation points as the noise increases and reconstruction quality decreases. As a result of this, the intent of our approach was to separate well-reconstructed stems from poorly reconstructed stems by labeling the difficult to measure stems/branches as the vegetation class. This effect is clearly illustrated in the video provided, which shows the segmentation results of the model on five point clouds. For example, in TLS_4, a dataset with little noise, most of the stem is correctly classified as stem, while noisier and less dense point clouds such as VUX_1LR_1 and VUX_1LR_2 have comparatively more stem sections labeled as vegetation. We considered it preferable to misclassify stems as vegetation rather than vegetation as stems: it is preferable to miss a tree than to attempt to fit circles/cylinders to vegetation and risk overestimating the volume of the forest. This undesirable behavior was difficult to avoid entirely with this approach, but it may be possible to remove some of these vegetation-stem misclassifications during post-processing with a well-designed and robust stem fitting approach applied to the segmented stems. At the time of writing, our team is working on this problem as the next step in this project.
The idea behind combining these classes into a single multi-class segmentation model was that the CWD class was intended to make use of its proximity to the terrain class. Due to the complex nature of the model, it is not clear if this idea was useful; however, the approach was nonetheless successful. Our deep learning approach to CWD detection differs considerably from the cylinder fitting approach used in [45] to detect fallen deadwood. An advantage of our method is the capability of identifying highly irregular, partially reconstructed, and decaying CWD rather than only cylindrical CWD. Most of the CWD we are detecting with our model is ill-suited to being measured with cylinder-based models, so our future work on measuring the volume of segmented CWD will approach this using mesh-based techniques.
The misclassification of some terrain points as stem points (seen in UC_UAS_AP_1 for example) appears to mostly occur when the sample box region has cropped the terrain on the edges of the point cloud or partially cropped into the terrain with the upper or lower boundary of the box vertically. These cases can change the appearance of the terrain such that even a human may have difficulty identifying it correctly. We suggest that this problem is one of context, as when the sample is seen in context (i.e., with the rest of the point cloud), it is easy for a human to identify these examples correctly as terrain, but without context, a small slice of terrain may look very similar to CWD, a branch/stem in the air, or vegetation. Alternate sampling strategies to the box approach that could provide a greater context to the model with minimal loss in effective resolution would be a useful direction for future research to explore.

Digital Terrain Model
To quantify the performance of our DTM method, we tested our approach on point clouds and reference DTMs provided by a benchmarking study [18]. Our Root Mean Squared Errors (RMSEs) of heights relative to the reference DTMs were higher (worse) than the best algorithms tested in the benchmarking study, but they were still within a similar range as the other algorithms tested in that study. With that said, we must acknowledge that we are not truly comparing the same data, as we are measuring only the six point clouds that were made open access: a subset of the 24 point clouds in the original study. Our method had effectively 100% coverage in all point clouds while also being relatively consistent in performance amidst the variable complexity, missing data from occlusions, and steep terrain conditions of some of the point clouds. Our approach generated a smoother DTM surface than the reference DTM method, but we cannot confidently say if one method was more accurate than the other with this test. In this comparison, we are comparing our algorithm's results to another algorithm's results (with some manual intervention in the case of the reference DTM); however, we consider this comparison to be sufficient to validate our DTM generation method's efficacy.
Extracting DTMs using Pointnet and Pointnet++-based approaches has been done before on comparatively low-resolution ALS point clouds [32,41]; however, our approach differs by applying a modified Pointnet++ architecture to simultaneously extract terrain, vegetation, CWD, and stem points from a point cloud. A notable property of our DTM generation method is its robustness to noise points below the ground surface, which are common in photogrammetry datasets. This robustness emerges as a result of the segmentation model classifying the below-ground noise points as vegetation and not terrain points, allowing the DTM method to simply ignore those points.
The most significant limitation of our DTM method would be the computational cost compared to other, simpler DTM generation methods. It is more computationally expensive to segment an entire point cloud prior to generating the DTM; however, we already segment the point cloud as part of our overall point cloud analysis approach, so this is acceptable for our application. Our algorithm was capable of performing similarly to the reference DTMs of the benchmarking study with no manual intervention required, which, as stated in the benchmarking study, is difficult to achieve with a fully automatic algorithm. The priority of our work at this time is reliability and the ability to truly automate forest point cloud analysis, which leads us to the future directions of this project.

Future Research Directions
In future work on this project, we intend to use the trained model to expand our training dataset by manually correcting the minor errors made by the model and retraining/adjusting the model iteratively until the desired model performance is reached. Once the CWD segmentation performance is more reliable, there will be a need for further research to measure and validate CWD quantities against reference data. This work is part of an ongoing research effort into the development of a tool for fully automated and sensor agnostic measurement of forest point clouds. Future work by the authors will focus on the exploitation of reliable point cloud segmentation as the starting point for the extraction of detailed tree models and structural complexity metrics under diverse forest conditions and point cloud types.
While beyond the scope of our project, we also suggest that the modified Pointnet++ model we have presented is likely to be transferable to applications outside of forest mapping, particularly where the smaller original Pointnet++ model may not capture sufficient contextual information to segment the point cloud effectively. It would be interesting to explore the effect of varying the number of segmentation classes on the overall accuracy of segmentation, as well as exploring if even larger models (allowing even more contextual information) could perform better.

Conclusions
In this study, we presented and evaluated a methodology for sensor agnostic semantic segmentation of high-resolution forest point clouds and a Digital Terrain Model (DTM) approach that exploits the segmented point cloud. Our semantic segmentation approach was able to achieve an overall accuracy of 95.4% relative to human labeled point clouds but with the considerable benefit of being a fully automated workflow. Our model achieved per-class accuracies of 95.92% for terrain, 96.02% for vegetation, 54.98% for coarse woody debris, and 96.09% for stems. Where human operators may require several days to manually segment relatively small (20 × 20 m) point clouds, the presented methodology allows much larger-scale point clouds to be segmented to an almost human-level accuracy at a rate of up to several hectares per day (depending on point density) on a moderately powerful consumer grade desktop computer. Furthermore, we can now use this model to build larger-scale training datasets through an iterative process of model prediction and manual human correction of errors. Through this process, it will become faster and cheaper to generate reliable reference datasets for training and evaluation of new forest segmentation models, until errors such as those seen in the videos can be mostly overcome. Future work will see the segmentation and DTM extraction methods incorporated into a fully automated forest point cloud measurement tool, which is intended to extract structural measurements from diverse and complex point clouds from a variety of sensors and sensing techniques.

Data Availability Statement: Some restrictions apply to the availability of these data. Data obtained from Interpine Group Ltd is commercial in confidence. The TLS_1 and TLS_2 datasets were extracted from "Terrestrial laser scans-Riegl VZ400, individual tree point clouds and cylinder models, Rushworth Forest" [38] obtained through TERN AusCover (http://www.auscover.org.au, accessed on 5 October 2020). UC_UAS_AP_1, UAS_AP_1, and UAS_AP_2 are available upon request. The trained model will eventually be made available as part of a larger Python package upon release of the second paper of this project at (https://github.com/SKrisanski, accessed on 7 April 2021).

Figure A2. This figure shows the raw output from the model just prior to the argmax function (which chooses the label with the highest confidence), as shown in Figure 5. This dataset is TLS_4, which was never seen by the model during training. The terrain, vegetation, and stem classes were more confident, as per our accuracy results; however, coarse woody debris was still successfully detected in many cases.

Figure A3. This figure shows the computation time of each processing step relative to the number of points (after subsampling to a 0.01 m minimum distance between points). A second-order polynomial was fitted to show the approximate trend of the data; however, the largest dataset (top right data point) exceeded the available 128 GB of RAM during the final steps of semantic segmentation, using the swap file on a solid state drive for the excess, which did slow the process.