The Hessigheim 3D (H3D) Benchmark on Semantic Segmentation of High-Resolution 3D Point Clouds and Textured Meshes from UAV LiDAR and Multi-View-Stereo

Automated semantic segmentation and object detection are of great importance in geospatial data analysis. However, supervised machine learning systems such as convolutional neural networks require large corpora of annotated training data. Especially in the geospatial domain, such datasets are quite scarce. Within this paper, we aim to alleviate this issue by introducing a new annotated 3D dataset that is unique in three ways: i) The dataset consists of both an Unmanned Aerial Vehicle (UAV) laser scanning point cloud and a 3D textured mesh. ii) The point cloud features a mean point density of about 800 pts/sqm, and the oblique imagery used for 3D mesh texturing achieves a ground sampling distance of about 2-3 cm. This enables the identification of fine-grained structures and represents the state of the art in UAV-based mapping. iii) Both data modalities will be published for a total of three epochs, enabling applications such as change detection. The dataset depicts the village of Hessigheim (Germany), henceforth referred to as H3D. It is designed to promote research in the field of 3D data analysis on the one hand and to evaluate and rank existing and emerging approaches for semantic segmentation of both data modalities on the other hand. Ultimately, we hope that H3D will become a widely used benchmark dataset alongside the well-established ISPRS Vaihingen 3D Semantic Labeling Challenge benchmark (V3D). The dataset can be downloaded from https://ifpwww.ifp.uni-stuttgart.de/benchmark/hessigheim/default.aspx.


Introduction
Supervised Machine Learning (ML), especially embodied by Convolutional Neural Networks (CNNs), has become the state of the art for the automatic interpretation of various data. However, the applicability and acceptance of such approaches are greatly hindered by the lack of labeled datasets for both training and evaluation (and, consequently, for the verification of their quality). For that purpose, large datasets of labeled 2D imagery were established, for example the ImageNet dataset (Deng et al., 2009). As such an extensive annotation process cannot be accomplished by a single person or group, crowdsourcing was employed. Whereas 2D imagery can be very well interpreted by non-experts (i.e., crowdworkers), labeling 3D data is much more demanding. Although first investigations were conducted on employing crowdworkers for 3D data annotation (Dai et al., 2017; Herfort et al., 2018; Walter et al., 2020; Kölle et al., 2020), these approaches typically try to avoid deriving a full pointwise annotation. This is achieved either by working on object level or by focusing only on necessary points by exploiting active learning techniques. However, at least for evaluating ML models for semantic segmentation, full annotations are beneficial, which are typically acquired by experts. In case of 3D data, existing datasets can be categorized into three different domains (comprehensive literature reviews are given by Griffiths and Boehm (2019) and Xie et al. (2020)): indoor data, outdoor terrestrial data, and outdoor airborne data.

Figure 1: Data splits of H3D(PC) into training set (see Table 1), validation (yellow box), and test set (grey). Data splits of H3D(Mesh) are identical but organized in tiles (individually colored according to the data splits in a transparent manner). As can be seen, H3D(Mesh) exceeds H3D(PC) in terms of covered area.
i) Indoor data. Indoor 3D data typically depicts various scenes in living spaces or working environments, often captured by RGB-D sensors such as the Microsoft Kinect (Silberman et al., 2012; Song et al., 2015). Generally, even non-experts are familiar with such scenes, which is why the annotation process can be outsourced to crowdworkers, for instance via Amazon Mechanical Turk (Buhrmester et al., 2011). This is realized by providing (pre-segmented) data embedded in easy-to-handle tools as described in Dai et al. (2017) and Silberman et al. (2012). For better interpretability, multi-modality is often realized in the sense that acquired point clouds are meshed to obtain a well-defined closed surface representation (Hua et al., 2016).
ii) Outdoor terrestrial data. Capturing outdoor terrestrial data has become most popular in the context of autonomous driving. Self-driving cars are equipped with a great variety of sensors such as cameras, laser scanners, and odometers. Often, only the combination of these sensors allows a comprehensive understanding of the complete scene, which is studied extensively on the basis of the well-known KITTI dataset (Geiger et al., 2012).
Although Mobile Laser Scanning (MLS) point clouds of typical urban scenes are often provided as stand-alone products (Roynard et al., 2018; Munoz et al., 2009; Hackel et al., 2017), the concurrent availability of LiDAR data and imagery in the form of meshes is often pursued (Riemenschneider et al., 2014). Caesar et al. (2020) provide a unique multi-modal dataset for autonomous driving applications by the combination of both cameras and ranging sensors (i.e., LiDAR & RADAR) for 3D object detection.
iii) Outdoor airborne data. Datasets of this category are often referred to as (large-scale) geospatial data and deviate from the previous ones due to a significantly increased distance between target and sensor, which is attached to an airborne platform (mostly small aircraft). So far, publicly available datasets provide labeled point clouds obtained from a single sensor, either a camera (Hu et al., 2020) or a LiDAR sensor (Varney et al., 2020). One prominent example of the latter case is the Vaihingen 3D (V3D) dataset acquired by Cramer (2010), which served as the basis for the ISPRS 3D Semantic Labeling benchmark (Niemeyer et al., 2014). Another airborne LiDAR dataset of high point density is DublinCity (Zolanvari et al., 2019). Due to the high point density and the resulting depiction of fine structures, the authors opt for expanding the corresponding class catalog.
The Hessigheim 3D (H3D) dataset presented in this paper belongs to the third group but differs from other datasets because it is the first ultra-high resolution, fully annotated 3D dataset acquired from a LiDAR system and cameras integrated on the same Unmanned Aerial Vehicle (UAV) platform. This results in a unique multi-modal scene description by a LiDAR point cloud H3D(PC) and a textured 3D mesh H3D(Mesh). Hence, properties unique to these two acquisition methods can be efficiently combined, which offers new possibilities for high-accuracy georeferencing (Glira et al., 2019) and semantic segmentation. We consider H3D to be the logical successor of V3D, which was already captured in 2008 and therefore no longer represents the state of the art. As H3D was acquired from a UAV platform using state-of-the-art sensors, a point density about a hundred times higher compared to V3D can be achieved, allowing the expansion of the class catalog of V3D.
The rest of this paper focuses on presenting H3D as a new benchmark dataset. This includes a detailed presentation of data acquisition, the registration process, and a discussion of the unique characteristics of H3D (Sections 2.1-2.3). Sections 2.4 and 2.5 are dedicated to presenting the class catalog and the annotation process for H3D(PC) and H3D(Mesh). As we aim at generating a benchmark for semantic segmentation, Section 2.6 describes the general structure of H3D, i.e., the partitioning into disjoint subsets for training, validation, and testing. As labels of the test set are not disclosed to the public, they are to be predicted by participants (Section 2.7) and transmitted to the authors for evaluation (Section 2.8). We present first results based on two state-of-the-art approaches for semantic segmentation to kick off the benchmark process and to provide first baseline results (Section 3), before concluding with a summary in Section 4.

The H3D Dataset
The H3D dataset was originally captured in a joint project of the University of Stuttgart and the German Federal Institute of Hydrology (BfG) for detecting ground subsidence in the sub-mm accuracy range. For this monitoring application, the area of interest, which is the village of Hessigheim, Germany (see Figure 1), was surveyed in three epochs.

Capturing H3D
In all three epochs, our sensor setup consisted of a RIEGL VUX-1LR scanner and two oblique Sony Alpha 6000 cameras integrated on a RIEGL RiCOPTER platform (see Figure 2). At a flying height of 50 m above ground, we achieved a laser footprint of less than 3 cm and a ground sampling distance of 2-3 cm for the cameras. Using this setup, we obtain two distinct data representations: i) H3D(PC) and ii) H3D(Mesh) (see Sections 2.2 and 2.3, respectively).

H3D(PC)
H3D was acquired in a total of 11 longitudinal (i.e., north-south) strips and several diagonal strips (see Figure 3). Scanner parameters (pulse repetition rate and the mirror's rotation rate) and flight parameters (flying altitude and speed) were set to achieve a point spacing of about 5 cm both in and across flight direction. Hence, we obtain about 400 pts/m² for a single LiDAR strip and about 800 pts/m² for the complete point cloud due to strip overlap. As additional strips were flown for further block stabilization, significantly higher point densities are achieved in some areas (see Figure 3). Compared to conventional Airborne Laser Scanning (ALS) flight campaigns applying manned platforms (see Figure 4 top), this high-resolution point cloud allows a more comprehensive 3D scene analysis. For accurate co-registration of the acquired strips with respect to available control planes, trajectories were corrected by the bias model offered by the OPALS software (Pfeifer et al., 2014). In this context, a constant offset for each trajectory parameter (∆X, ∆Y, ∆Z, ∆ω, ∆ϕ, ∆κ) was estimated per strip.
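To illustrate the principle of such a strip-wise bias estimation, the following minimal sketch estimates a constant translational offset per strip by a linear least-squares fit of point-to-plane residuals against control planes. It is a strongly simplified stand-in for the full six-parameter trajectory bias model of OPALS; all function and variable names are ours.

```python
import numpy as np

def estimate_strip_offset(points, plane_normals, plane_points):
    """Estimate a constant (dX, dY, dZ) offset for one LiDAR strip by
    minimizing point-to-plane distances to matched control planes.

    points         (n, 3): strip points matched to control planes
    plane_normals  (n, 3): unit normals of the matched planes
    plane_points   (n, 3): a point on each matched plane
    """
    # Residual of point p after applying offset t: n . (p + t - q),
    # which is linear in t, so an ordinary least-squares fit suffices.
    b = -np.einsum('ij,ij->i', plane_normals, points - plane_points)
    t, *_ = np.linalg.lstsq(plane_normals, b, rcond=None)
    return t  # (dX, dY, dZ)

# Hypothetical usage: one offset per strip, applied before merging strips
# strip_pts_corrected = strip_pts + estimate_strip_offset(strip_pts, n, q)
```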
Apart from the XYZ coordinates of each point, LiDAR-inherent features such as the echo number, the number of echoes, and the reflectance were recorded. While up to 6 echoes were recorded per emitted pulse, the majority of subsequent echoes (echo number > 1) are second and third echoes (see yellow and green color in Figure 5 (a)). Instead of the intensity of received echoes, we provide reflectance values, which can be interpreted as range-corrected intensity. Please note that these values were not corrected for differences in reflectance due to different inclination angles of the laser beam with respect to the illuminated object surface. Reflectance values range from about -30 dB for objects with diffuse reflection properties such as vegetation or asphalt (dark blue and light green points in Figure 5 (b), respectively) to about +20 dB for objects with directed reflection such as roof or façade elements (red points in Figure 5 (b)).
Point cloud colorization was done in a two-step process. We first derived the mesh as outlined in Section 2.3. Afterwards, we extracted an individual RGB tuple for each LiDAR point from the mesh texture by nearest neighbor transfer.
The nearest-neighbor interpolation in 3D space is a simple approximation of an occlusion-aware projection of the 3D LiDAR points into image space (the result is visualized in Figure 5 (c)).
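As a minimal sketch of this transfer step, assuming the mesh texture has already been sampled into a colored reference point set (e.g., one colored sample per texel or face center; all names are ours), the RGB tuples can be propagated to the LiDAR points via a k-d tree:

```python
import numpy as np
from scipy.spatial import cKDTree

def colorize_points(lidar_xyz, ref_xyz, ref_rgb):
    """Assign each LiDAR point the RGB tuple of its nearest
    colored reference sample (nearest-neighbor transfer in 3D)."""
    tree = cKDTree(ref_xyz)          # spatial index on the colored samples
    _, idx = tree.query(lidar_xyz)   # nearest reference sample per point
    return ref_rgb[idx]

# lidar_rgb = colorize_points(lidar_xyz, mesh_sample_xyz, mesh_sample_rgb)
```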
Additionally, we provide a class label for every point (classes and the annotation process are discussed in Sections 2.4 and 2.5). Both plain ASCII files and LAS files are used for data exchange.
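For instance, the LAS variant of the point cloud can be inspected with the laspy library. Note that the file name and the exact names of the extra dimensions (e.g., reflectance) are assumptions here and should be checked against the delivered files:

```python
import numpy as np
import laspy

las = laspy.read("H3D_PC_train.las")             # hypothetical file name
xyz = np.stack([las.x, las.y, las.z], axis=1)    # point coordinates
print(list(las.point_format.dimension_names))    # inspect available attributes
# Attribute names such as 'reflectance' are assumptions; verify them first:
# reflectance = las['reflectance']
```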

H3D(Mesh)
We generated the 3D mesh with the software SURE (Rothermel et al., 2012). For the geometric reconstruction of the scene, both LiDAR data and imagery were used in order to benefit from their complementary properties (Mandlburger et al., 2017).
The fusion of both data sources results in a more complete mesh compared to a mesh derived from images only. For instance, urban canyons are difficult to reconstruct from imagery (due to the required visibility in at least two images), but their reconstruction works smoothly with LiDAR data (where a single received echo is already sufficient). Furthermore, the oblique images serve for texturing the generated 3D mesh, which allows a realistic representation of vertical faces (e.g., façades in Figure 5). The mesh data is provided in a tiled manner. Each tile is given in both textured and labeled mode. In the textured form, each tile consists of (i) an obj file describing the geometry, which refers to (ii) an mtl file encoding material properties, which in turn links to (iii) the texture atlas providing the textural information (JPEG files). The labeled counterparts consist of an obj file (containing the same geometry as the respective textured version) and an mtl file encoding the class properties (i.e., the color-coding). Therefore, the labeled obj files do not require texture atlases since they are pseudo-textured by the class labels. Additionally, we provide the Centers of Gravity (CoGs) of all faces along with the transferred labels as a CoG point cloud. The CoG cloud is available as a plain ASCII file, enabling simple data handling and data exchange.
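As a small illustration of handling these tiles, the following is a minimal obj reader for the geometry only; real tiles may contain additional statements (e.g., texture coordinates and material references), which this sketch skips:

```python
def read_obj(path):
    """Parse vertices and triangular faces from an obj file (geometry only)."""
    vertices, faces = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0] == 'v':              # vertex line: v x y z
                vertices.append([float(c) for c in parts[1:4]])
            elif parts[0] == 'f':            # face line: f v1[/vt] v2[/vt] v3[/vt]
                faces.append([int(p.split('/')[0]) - 1 for p in parts[1:4]])
    return vertices, faces
```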

Class Catalog
For H3D(PC) and H3D(Mesh), we employ the same fine-grained class catalog, which is based on V3D but refined owing to H3D's higher point density (and owing to the purpose of the Hessigheim project, which is monitoring the shipping lock depicted in Figure 1). This allows differentiating more details than in V3D. Hence, we added the classes Urban Furniture, Soil/Gravel, Vertical Surface (e.g., found at the shipping lock in Figure 5), and Chimney (see Table 1).

Generating Ground Truth Data for H3D
As previously mentioned, the main objective of H3D is to provide labeled multi-temporal and multi-modal datasets for training and evaluation of ML systems for the task of semantic point cloud segmentation. For labeling H3D(PC) (see Section 2.2), we established a manual process carried out by student assistants, resulting in an annotation as depicted in Figure 5. For the 3D mesh, we automatically transfer labels from the manually annotated point cloud by a geometry-driven approach that associates the representation entities, i.e., points and faces. Therefore, the mesh inherits the class catalog (see Table 1) of the point cloud. In comparison to the point cloud representation, the mesh is more efficient because only a small number of faces is required to represent flat surfaces. For this reason, the number of faces is significantly smaller than the number of LiDAR points (see Table 2). Consequently, several points are commonly linked to the same face.
Hence, the per-face label is determined by a majority vote of the respective LiDAR points. However, due to structural discrepancies, some faces remain unlabeled because no points can be associated with them (e.g., absence of LiDAR points or geometric reconstruction errors). These faces are marked by the pseudo-class label -1. Unlabeled faces cover about 40 % of the entire mesh surface. As can be seen from Figure 1, the majority of the unlabeled area (99.7 %) belongs to parts where the mesh exceeds the labeled point cloud (due to the tiled mesh structure). For the overlap of LiDAR and mesh (i.e., the relevant data), 84 % of the surface carries an annotation.
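A minimal sketch of this majority vote, assuming the point-to-face association (the index of the associated face per point) has already been established; all names are ours:

```python
import numpy as np

def transfer_labels(point_labels, point_face_idx, n_faces):
    """Per-face label by majority vote over the associated points;
    faces without any associated point get the pseudo label -1."""
    n_classes = point_labels.max() + 1
    votes = np.zeros((n_faces, n_classes), dtype=np.int64)
    # accumulate one vote per point for its associated face
    np.add.at(votes, (point_face_idx, point_labels), 1)
    face_labels = votes.argmax(axis=1)
    face_labels[votes.sum(axis=1) == 0] = -1  # unlabeled faces
    return face_labels
```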

Data Splits
The datasets of all epochs are split into a distinct training, validation, and test area (see Figure 1). The splits are identical in both modalities and chosen in accordance with the mesh tiling. Since H3D is designed as a benchmark, labels of the test set are not disclosed to the participants of the benchmark.
Whereas labels of the training and validation sets can be used by participants as desired, we recommend utilizing the pre-defined splits. Detailed statistics of class frequencies in the training and validation sets can be found in Table 2 for both modalities (points vs. faces/CoGs) and are visualized in Figure 6. For the mesh, we additionally provide the area each class covers, measured by the area of faces assigned to the corresponding class.
In case of the point cloud, the relative number of points for classes cover-

Benchmark Challenge
In contrast to V3D, the H3D benchmark challenge is twofold in terms of data representation. For both H3D(PC) and H3D(Mesh), participants are invited to use the training and validation datasets to develop ML approaches for supervised classification and then to apply them to the test dataset. Predicted labels are to be returned to the authors for evaluation (see Section 2.8). For the mesh, the predictions are to be returned as a labeled CoG cloud (the respective plain ASCII file of CoGs is provided by the authors). The authors will match the predicted per-face labels with the corresponding faces of the obj files. Simply put, the CoG cloud is utilized as an efficient link to the underlying mesh structure in order to keep the memory footprint of submitted data low.
The evaluation itself will be done on the mesh (obj files).

Evaluation Metrics
To evaluate the classification accuracy, we derive per-class Precision (P) and Recall (R) from the confusion matrix, P_c = TP_c / (TP_c + FP_c) (Equation 1) and R_c = TP_c / (TP_c + FN_c) (Equation 2). Additionally, we derive an F1-score as the harmonic mean of P and R for each class, F1_c = 2 · P_c · R_c / (P_c + R_c) (Equation 3) (Goutte and Gaussier, 2005). To describe the total performance of a classifier, we combine the individual class scores by computing i) the Overall Accuracy (OA = Σ_c TP_c / N, with N being the total number of labeled instances) and ii) the mean F1-score (macro-F1).
These measures are determined both for H3D(PC) and H3D(Mesh). In case of the latter, the evaluation is based on the covered area of correctly / incorrectly classified faces (as provided as CoG cloud by the participants).
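The following sketch computes these metrics from predicted and reference labels; the optional weights argument enables the area-weighted evaluation used for the mesh (weights = face areas). Variable names are ours.

```python
import numpy as np

def evaluate(ref, pred, n_classes, weights=None):
    """Confusion-matrix based OA, per-class precision/recall/F1 and mean F1.
    weights: optional per-sample weights (e.g., face areas for the mesh)."""
    if weights is None:
        weights = np.ones(len(ref), dtype=float)
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (ref, pred), weights)      # rows: reference, cols: prediction
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1e-12)   # Equation 1
    recall = tp / np.maximum(cm.sum(axis=1), 1e-12)      # Equation 2
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    oa = tp.sum() / cm.sum()                 # overall accuracy
    return oa, precision, recall, f1, f1.mean()
```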

Baselines
We initialize the H3D benchmark challenge by providing two baseline solutions for semantic segmentation of H3D. On the one hand, we apply a conventional Random Forest (RF) classifier (Breiman, 2001) relying on hand-crafted features (see Section 3.1); on the other hand, we use a Sparse Convolutional Network (SCN) as an end-to-end learning approach (see Section 3.2).

Random Forest
For semantic segmentation of the point cloud, we compute geometric features as proposed in Weinmann et al. (2015) and Chehata et al. (2009), relying on spatial neighborhoods of different sizes. To obtain mesh features, we follow the approach of Tutzauer et al. (2019) and encode each face by its CoG. In this way, we can, on the one hand, compute all aforementioned geometric features for our CoG cloud. On the other hand, we preserve features of the mesh geometry by assigning mesh-inherent features such as face area, face density, and normal orientation to the respective faces.
Furthermore, we transfer LiDAR-specific features to the mesh representation by the approach presented in Laupheimer et al. (2020).
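To illustrate the kind of hand-crafted geometric features involved, the following sketch computes classic eigenvalue-based descriptors in the spirit of Weinmann et al. (2015); this is not the authors' exact feature set, and all names are ours:

```python
import numpy as np
from scipy.spatial import cKDTree

def covariance_features(xyz, radius=1.0):
    """Per-point linearity, planarity and sphericity from the
    eigenvalues of the local 3D structure tensor."""
    tree = cKDTree(xyz)
    feats = np.zeros((len(xyz), 3))
    for i, nbr in enumerate(tree.query_ball_point(xyz, r=radius)):
        if len(nbr) < 3:
            continue  # not enough neighbors for a covariance estimate
        evals = np.linalg.eigvalsh(np.cov(xyz[nbr].T))[::-1]  # l1 >= l2 >= l3
        l1, l2, l3 = np.maximum(evals, 1e-12)
        feats[i] = [(l1 - l2) / l1, (l2 - l3) / l1, l3 / l1]
    return feats  # columns: linearity, planarity, sphericity
```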
For both the point cloud and the mesh, we additionally incorporate radiometric features. For this purpose, RGB tuples are converted to HSV color space and used together with Gaussian-smoothed color values for the aforementioned spatial neighborhoods. Since a multitude of HSV tuples is encoded in each face, we additionally calculate the HSV variance for each face.
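A minimal sketch of this radiometric feature extraction (matplotlib's vectorized color conversion is used here for brevity; the per-face aggregation assumes the point-to-face association from Section 2.5, and names are ours):

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def hsv_features(rgb, point_face_idx, n_faces):
    """Convert RGB in [0, 1] to HSV and compute the per-face HSV variance."""
    hsv = rgb_to_hsv(rgb)
    var = np.zeros((n_faces, 3))
    for f in range(n_faces):
        members = hsv[point_face_idx == f]
        if len(members) > 1:
            var[f] = members.var(axis=0)
    return hsv, var
```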
Based on these features, an RF model is trained for H3D(PC) and H3D(Mesh).
Prediction results for the test set can be found in Table 3 (see the discussion of results in Section 3.3). The RF models are parametrized by 100 binary decision trees with a maximum depth of 18. Niemeyer et al. (2014) have shown that pure pointwise results of the RF classifier can be further refined by a Markov Random Field (MRF). Therefore, we enhance our RF by an a-posteriori-probability-aware MRF-like smoothing (the a posteriori probability is used as unary potential; points within a 0.5 m radius are considered for the regularization term).

Table 3: Baseline results of semantic segmentation for both H3D(PC) and H3D(Mesh). For the mesh, we report the performance metrics weighted by the covered surface.
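A sketch of this RF baseline with scikit-learn, assuming feature matrices X_train, y_train, X_test and test coordinates xyz_test are given; the smoothing shown here is a simple probability averaging within the 0.5 m neighborhood, a loose approximation of the described MRF-like regularization rather than the authors' exact implementation:

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestClassifier

# RF configuration as reported: 100 trees, maximum depth 18
rf = RandomForestClassifier(n_estimators=100, max_depth=18)
rf.fit(X_train, y_train)               # hand-crafted feature matrix
proba = rf.predict_proba(X_test)       # a posteriori class probabilities

# Smooth class probabilities over a 0.5 m spatial neighborhood
tree = cKDTree(xyz_test)
smoothed = np.stack([proba[nbr].mean(axis=0)
                     for nbr in tree.query_ball_point(xyz_test, r=0.5)])
y_pred = rf.classes_[smoothed.argmax(axis=1)]
```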

Sparse Convolutional Network
As deep learning has become a de-facto standard in most fields of pattern recognition, we also include a neural network in our baselines. In particular, we employ a 3D CNN in U-Net form with submanifold sparse convolutional layers (Graham et al., 2018) to account for the typical spatial distribution of ALS point clouds. The network's general layout and training regime are described in Schmohl and Sörgel (2019). It consists of 3 downsampling levels and 21 convolutional layers in total. The input data (point cloud or mesh) is discretized to sparse 3D voxel grids of 25 cm side length, containing just the raw attributes (i.e., no features are computed as for the RF). In case of the point cloud, these are the measured point attributes (i.e., echo number, number of echoes, reflectance & RGB values) as described in Section 2.2. For the mesh, voxels are derived from the CoG point cloud and attributed with the texture information (RGB) only. We also tested a configuration with additional normal information, since this is the most basic native mesh feature besides RGB. However, the achieved performance is slightly worse (-0.6 and -2.34 percentage points for OA and mean F1-score, respectively). For evaluation, the inferred voxel labels are transferred back to the enclosed points or faces. Results are also reported in Table 3.
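A minimal sketch of the voxelization step and the back-transfer of predicted labels (pure NumPy, independent of the actual sparse-convolution framework; names are ours):

```python
import numpy as np

VOXEL = 0.25  # voxel side length in meters

def voxelize(xyz, attributes):
    """Discretize points to a sparse voxel grid; voxel attributes are
    the mean of the enclosed point attributes."""
    keys = np.floor(xyz / VOXEL).astype(np.int64)
    voxels, point_to_voxel = np.unique(keys, axis=0, return_inverse=True)
    feats = np.zeros((len(voxels), attributes.shape[1]))
    np.add.at(feats, point_to_voxel, attributes)       # sum per voxel
    feats /= np.bincount(point_to_voxel)[:, None]      # mean per voxel
    return voxels, feats, point_to_voxel

# After inference, voxel labels are mapped back to the enclosed points:
# point_labels = voxel_labels[point_to_voxel]
```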

Discussion of Baseline Results
Within this section, we analyze the performance of our two baseline classifiers for both H3D(PC) and H3D(Mesh) (see Sections 3.3.1 and 3.3.2, respectively) in order to develop a better understanding of the challenges of H3D and to kick off the benchmark competition.

H3D(PC)
Generally, we can observe from Table 3 that class Soil/Gravel is demanding in both cases, for there is often confusion with other ground classes (see Figure 7).
Since points of all ground classes share similar geometric properties (e.g., similar normals and smooth surfaces), distinguishing these classes is only possible with the help of radiometric features such as reflectance or color information (see Figure 5). Whereas this works well for Low Vegetation vs. Impervious Surface, segmenting points of Soil/Gravel is rather demanding due to radiometric properties similar to Low Vegetation (for bare soil) and to Impervious Surface (for debris and gravel).
Similar radiometric and geometric properties are also the reason for the confusion of Shrub with Tree (greenish color in both cases and rough surfaces).
Nevertheless, the extraction of tree points succeeds quite well, probably due to the distinctive multi-echo ability of the employed sensor (see Figure 5 (a)).
Regarding buildings, roofs can be extracted successfully, but the detection of façades seems to be more demanding for both classifiers (see Table 3). For fine structures such as chimneys, the RF classifier performs better, probably due to the discretization within the SCN approach; the RF might capture such small structures more precisely owing to its pointwise working principle.
To conclude, our baseline solutions indicate that the depiction of detailed structures and the consequently expanded class catalog of H3D (compared to V3D) pose new challenges for the development of methods for semantic segmentation. Particularly, this applies to entities belonging to the newly introduced classes, but also to their interaction with representatives of common object classes (e.g., Vertical Surface vs. Façade).

H3D(Mesh)
In addition to the results for H3D(PC), Table 3 reports the baseline performance on H3D(Mesh). Both classifiers struggle with class Urban Furniture due to its large intra-class variance (see discussion in Section 3.3.1).
For instance, cars are often misclassified as Urban Furniture. However, due to the superior geometric information utilized, the RF copes better with this variance, resulting in an F1-score that is 5.4 percentage points higher compared to the SCN. For class Shrub, the RF is 9.5 percentage points better. In case of the SCN, the majority of predictions of class Shrub truly belongs to Urban Furniture. As can be seen for the shipping lock in Figure 8 (b), the SCN confuses Impervious Surface with Roof and hence performs worse than the RF on these classes. The geometric similarity of the ground classes in the mesh representation (Impervious Surface, Low Vegetation, and Soil/Gravel) makes it demanding to correctly separate them. Therefore, the color information is decisive for a correct prediction. Both classifiers predict other ground classes for Soil/Gravel.
Soil/Gravel is the only class for which the SCN outperforms the RF. This indicates that the SCN mostly learns geometric features. Most probably, the Gaussian-smoothed features cause the misprediction of chimneys as Roof by the RF. In case of the SCN, Chimney has a high recall, but at the cost of precision. On the contrary, the RF has a significantly worse recall but very high precision, resulting in a better F1-score.
The distinction of vertical surfaces and façades is demanding for both classifiers due to their small inter-class variance.

Conclusion
In this paper, we presented a new benchmark on semantic segmentation of high-resolution 3D point clouds and textured meshes, acquired and derived from UAV-based LiDAR and oblique imagery. The results indicate great potential for testing ML approaches on H3D due to its large sets of labeled data. Eventually, we hope that H3D will become a second established ISPRS benchmark dataset alongside V3D.