Classification of Aerial Photogrammetric 3D Point Clouds

We present a powerful method to extract per-point semantic class labels from aerialphotogrammetry data. Labeling this kind of data is important for tasks such as environmental modelling, object classification and scene understanding. Unlike previous point cloud classification methods that rely exclusively on geometric features, we show that incorporating color information yields a significant increase in accuracy in detecting semantic classes. We test our classification method on three real-world photogrammetry datasets that were generated with Pix4Dmapper Pro, and with varying point densities. We show that off-the-shelf machine learning techniques coupled with our new features allow us to train highly accurate classifiers that generalize well to unseen data, processing point clouds containing 10 million points in less than 3 minutes on a desktop computer.


INTRODUCTION
Extraction of semantic information from point clouds enables us to understand a scene, classify objects and generate high-level models with CAD-like geometries from them. It can also provide a significant improvement to existing algorithms, such as those used to construct Digital Terrain Models (DTMs) from Digital Surface Models (DSMs) (Unger et al., 2009). With the growing popularity of laser scanners, the availability of drones as surveying tools, and the rise of commercial photogrammetry software capable of generating millions of points from images, there exists an increasing need for fully automated extraction of semantic information from this kind of data. Although some of the commercial photogrammetry software available today offer tools such as automated DTM extraction (Pix4Dmapper, 2017, Photoscan, 2017, semantic classification is typically left to specialized software packages (eCognition, 2017, GlobalMapper, 2017) that rely on 2.5D orthomosaics and DSMs as an input.
The need for semantic modeling of 3D point data has inspired many research and application engineers to model specific structures. Often the proposed solutions were handcrafted to the application at hand: buildings have been modeled by using common image processing techniques such as edge detection (Haala et al., 1998, Brenner, 2000 or by fitting planes to point clouds (Rusu et al., 2007); road networks have been modeled by handcrafted features and DTM algorithms used heuristics about the size of objects to create a DTM from a DSM. While successful and valuable, these approaches are inherently limited since they cannot be easily applied to detect new classes of objects. The huge boost in the performance of machine learning algorithms over the last years allows for more flexible and general learning and classification algorithms. If supervised or semi-supervised learning and especially classification becomes fast and reliable, machine learning approaches to point cloud classification will find their way into common photogrammetric workflows. Therefore we focus here on machine learning techniques, that will allow the users to detect objects categories of their own choice. * This work was done while the author was working at Pix4D SA In this paper we present a method to classify aerial photogrammetry point clouds. Our approach exploits both geometric and color information to classify individual points as belonging to one of the following classes extracted from the LAS standard: buildings, terrain, high vegetation, roads, human made objects or cars. Unlike previous point cloud classification methods that rely exclusively on geometric features, we show that incorporating color information yields a significant increase in accuracy.
We evaluate our approach on three challenging datasets and show that off-the-shelf machine learning techniques together with our new features result in highly accurate and efficient classifiers that generalize well to unseen data. The datasets used for evaluation are publicly available at https://pix4d.com/research.

RELATED WORK
Methods used to extract semantic information from point clouds can be split into two groups: those that try to segment coherent objects from a scene, and those that focus on assigning an individual class label to each point. Early works using the first approach often converted the point data into a regular 2.5D height grid so that standard image processing techniques, e.g., edge detection, can be applied (Haala et al., 1998, Haala and Brenner, 1999, Wang and Schenk, 2000. A scan line based approach (Sithole and Vosselman, 2003) was proposed for structure detection in urban scenes. Building extraction approaches typically use geometric primitives during the segmentation step. A multitude of such primitives has been proposed, both in 2D, such as planes and polyhedral (Vosselman et al., 2001, Dorninger andNothegger, 2007), and in 3D Mallet, 2012, Xiao andFurukawa, 2014). In (Rusu et al., 2007) the authors fit sampled parametric models to the data for object recognition. Similarly, (Oesau et al., 2016) investigates supervised machine learning techniques to represent small indoor datasets with planar models for object recognition.
The second type of methods assign a label to each point in the point cloud. Typically this is done with supervised machine learning techniques, requiring training on labeled data from which a classification model is learned and then applied to new, unseen data to predict the label of each point. Binary classification has been explored in environmental monitoring to extract road surfaces (Shu et al., 2016), tree species (Böhm et al., 2016, Liu andBöhm, 2015), land cover (Zhou et al., 2016), and construction sites (Xu et al., 2016). Several other authors employed a multiclass setting to classify multiple types of objects and structures (Brodu and Lague, 2012, Weinmann et al., 2015a, Hackel et al., 2016, which we adopt in this paper. In particular, we follow the work of (Weinmann et al., 2013), which introduced local geometric features that were used to train a Random Forest (RF) classifier for single terrestrial LiDAR scans. Their set of features was extended later by (Hackel et al., 2016). Examples of other feature sets used in the point classification context are Fast Point Feature Histogram (FPFH) (Rusu et al., 2009) or Color Signature of Histogram of Orientations (SHOT) (Tombari et al., 2010). All these methods use handcrafted features that can be considered suboptimal when compared to more recent deep learning techniques (Hu andYuan, 2016, Qi et al., 2016), which learn features directly on image or point cloud data. Those approaches have not been considered here, since they require large computational power to train the classifier, and may be restrictive at prediction time, depending on the hardware available.
The ambiguity of the classification task can be minimized by modeling also the spatial correlations between the different class labels. Spatial priors are used in (Shapovalov and Velizhev, 2011) to classify LiDAR data and in (Niemeyer et al., 2014) the authors apply Conditional Random Field (CRF) priors to model different probabilities that neighboring labels can have. While those methods show reasonable classification improvements, they are computationally expensive and not easy to parallelize.
In this paper we extend the work on geometric features by (Weinmann et al., 2013, Hackel et al., 2016 and show that incorporating color information provides a significant boost in prediction accuracy, while keeping a low computational load at prediction time. In the following sections we describe our method and present the results obtained on three photogrammetry datasets.

METHOD
Our approach combines geometric and color features that are fed to a classifier to predict the class of each point in the point cloud. The geometric features are those introduced in (Hackel et al., 2016), which are computed at multiple scales, as explained shortly further. To exploit color information, we compute additional color features, based on the color of the respective point and its neighbors.
In the next sections we describe the geometric features introduced in (Hackel et al., 2016). We then show how our color features are computed and discuss implementation details.

Geometric Features
Our approach computes geometric features at different scales to capture details at varying spatial resolutions. Below we first describe how features are computed for a single scale, and then we show how the scale pyramid is constructed.
We follow the method proposed in (Weinmann et al., 2013) and later in (Hackel et al., 2016). To compute the features for a point x, we first obtain its neighboring points Sx = {p 1 , p 2 , . . . , p k }. This set is used to compute a local 3D structure covariance tensor The eigenvalues λ1 ≥ λ2 ≥ λ3 ≥ 0, unit-sum normalized, and the corresponding eigenvectors e1, e2, e3 of Cx are used to compute the local geometry features shown in Table 1.
Besides the features based on the eigenvalues and eigenvectors of Cx, features based on the z coordinate of the point are used to increase their discriminative power. We have slightly changed the geometry feature set from (Hackel et al., 2016) and removed the sum of eigenvalues because it is constant since the eigenvalues are normalized to unit sum.

Multi-scale Pyramid
To incorporate information at different scales we follow the multi-scale approach of (Hackel et al., 2016), which has shown to be more computationally efficient than that of (Weinmann et al., 2015b). Instead of computing the geometric features of Table 1 at a single scale, we successively downsample the original point cloud to create a multi-scale pyramid with decreasing point densities. The geometric features described earlier are computed at each pyramid level and later concatenated.
Pyramid Scale Selection In order to generalize over different point clouds with varying spatial resolution, we need to choose a fixed set of pyramid levels. This is particularly important when dealing with data with varying Ground Sampling Distance (GSD), which affects the spatial resolution of the point cloud. The GSD is a characteristic of the images used to generate the pointcloud. It is the distance between two consecutive pixel centers measured in the orthographic projection of the images onto the Digital Surface Model (DSM). Among other factors, the GSD depends on the altitude from which the aerial photos were taken. With this in mind, we set the starting resolution of the pyramid to four times the largest GSD in our datasets, or 4 × 5.1 cm = 20.4 centimeters. In total we compute 9 scales, with a downsampling factor of 2. With these values we were able to capture changes in patterns of surfaces and objects which vary with distance (e.g. buildings have significant height variations at the scale of dozens meters, while cars, trees do at only a few meters).

Color Features
To increase the discriminative power of the feature set we combine the geometric features introduced above with color features. Our color features are computed in the HSV color space first introduced by (Smith, 1978), since the analysis of the Pearson product-moment correlation coefficient and the Fisher information of our training data showed that we should expect higher information gain from the HSV over RGB color space.
Besides the HSV color values of the point itself, we compute the average color values of the neighboring points in the original point cloud (i.e. not downsampled). These points are selected as the points within a certain radius around the query point. We experimented with the radii 0.4m, 0.6m and 0.9m to balance between speed and accuracy of the classification.

Training and Classification
We use supervised machine learning techniques to train our classifier. We experiment with two well-known ensemble methods: Random Forest (RF) and Gradient Boosted Trees (GBT) (Friedman et al., 2001). Though RF has been used extensively in point cloud classification (Weinmann et al., 2015b, Hackel et al., 2016, we provide a comparison to GBT and show that the latter can achieve higher accuracies at a similar computational complexity.
Both RF and GBT can generate conditional probabilities and are applicable to multi-class classification problems. They are easily parallelized and are available as reusable software packages in different programming languages.
Random Forest (RF) (Breiman, 2001) is a very successful learning method that trains an ensemble of decision trees on random subsets of the training data. The output of a RF is the average of the predictions of all the decision trees in the ensemble, which has the effect of reducing the overall variance of the classifier.
On the other hand, the Gradient Boosted Trees (GBT) method trains an ensemble of trees by minimizing its loss over the training data in a greedy fashion (Friedman et al., 2001). GBT has been described as one of the best off-the-shelf classification methods and it has been shown to perform similarly or better than RF in various classification tasks (Caruana and Niculescu-Mizil, 2006).

Implementation Details
We implemented our software in C++ to ease its later integration into the Pix4DMapper software. For prototyping and evaluation we used Julia (Bezanson et al., 2014). For fast neighbor search we used the header-only nanoflann library 1 which implements a kd-tree search structure.
The implementation of the RF comes from the ETH Random Forest Library 2 . We parallelized training and prediction, reducing computation times significantly. For GBT we used Microsoft's LightGBM 3 .

Experimental Setup
To evaluate our method we test different combinations of feature sets and classifiers on photogrammetry data. We compare two different setups to evaluate the performance of our approach: within the same dataset, or intra-dataset, and across different datasets, or inter-dataset. For training we sampled 10k points of each class at random, resulting in 60k training samples.
The different feature sets used in our experiments are summarized below: • Geometric features (G): the geometric eigenvalue-based features shown in Table 1. We use k = 10 neighbors to construct Sx.
• Point color (Cp): HSV color values of the respective 3D point.
• Point and Neighborhood color (C N (r) ): Cp set added with averaged HSV color values of the neighboring points within the radius r around the respective 3D point.
• All features: Concatenation of all geometric, point color and neighborhood color features for radii 0.4m, 0.6m and 0.9m.

Intra-dataset experiments
In this setup we divide each dataset into two physically disjoint point clouds. We first find a splitting vertical plane such that the resulting point clouds are as similar as possible with respect to the number of points per class. More specifically, we solve for the vertical planep p = arg min p∈P max c∈Y 1 #c where #c is the number of points of class c in the whole point cloud, Y is the set of all classes present in the point cloud, p + is the set of points falling on one side of the plane p, and P is a set of potential vertical planes of different offsets and rotations.
We then train on one of the splits and test on the other.

Inter-dataset experiments
To test the generalization capabilities of our approach to new unseen point clouds we also experiment with a leave-one-out evaluation methodology: we train on two point clouds and test on the remaining one.

RESULTS
In this section we describe first the datasets and classification methods used for the experiments. Next, we show that our implementation is able to reproduce the results presented in (Hackel et al., 2016) on the Paris-rue-Madame dataset. For this dataset we use purely geometric features, as no color information is available. Finally, we evaluate our approach on challenging aerial photogrammetry point clouds. Our experiments demonstrate that using color information boosts performance significantly, both quantitatively and qualitatively. Table 2 shows the characteristics of the datasets employed for evaluation. The Paris-rue-Madame dataset (Serna et al., 2014) does not contain color information and was solely used to verify that our geometric features perform as well as those of (Hackel et al., 2016).

Datasets
Our main interest is the aerial photogrammetry and the three last datasets of  as the input for our approach. Note that the GSD varies significantly between datasets. A 3D visualization of each of these point clouds is presented in Fig. 3(a), Fig. 4(a), Fig. 5(a).
Moreover, each dataset contains different types of objects and terrain surfaces as shown in Table 3. For example, while all datasets contain roads, cropland only appears in one of them. This table will be useful later to analyze the performance of our approach on each dataset.
We have made the three photogrammetry datasets publicly available at https://pix4d.com/research.

Classifier Parameters
For both GBT and RF we used 100 trees, and at each split half of the features were picked at random as possible candidates. For RF the maximum tree depth was set to 30. For GBT we set the maximum number of leaves to 16, learning rate to 0.2 and the bagging fraction to 0.5. These parameters were fixed for all the experiments.

Validation on Laser Scans
In the first experiment we reproduced the results presented on the laser-scan Paris-rue-Madame dataset (Serna et al., 2014). The training and test data sets are generated the same way as in (Weinmann et al., 2015b) and (Hackel et al., 2016) by randomly sampling without replacement 1000 points per each class for training, and using the rest of the points for testing. When training a RF we achieved overall accuracies of 95.76% compared to the reported 95.75% in the paper although our per-class results differed slightly. We also observed that this evaluation procedure typically yields overly optimistic accuracies, which are much higher than the expected accuracy on unseen test data. We noticed that such evaluation resembles an inpainting problem: given a few known labeled points in the cloud, estimate the labels of the rest that lie in-between. This gives a bias to the results and does not represent the classifier's ability to generalize to unseen datasets.
To overcome these issues we propose to split the data set into two non-overlapping regions, train on one half and test on the other, as described in Sec. 3.5. If the Paris-rue-Madame dataset is split this way our overall accuracy is reduced to ∼ 90%. We believe this is a less biased estimator of the performance on unseen data, and adopt this strategy to evaluate performance in the rest of our experiments.  Table 4. Timings for feature computation, classifier training and prediction. Our whole pipeline runs in less than 3 minutes with any of the provided point clouds, being suitable for interactive applications.
It is worth noting that the Paris-rue-Madame dataset contains only small quantities of some classes such as vegetation and human made objects which were found to be harder to classify correctly by (Hackel et al., 2016).

Experiments on Aerial Photogrammetry Data
Intra-dataset results The misclassification errors for different sets of features are presented in Fig. 1, where we can see that color features bring a significant improvement. The best results are obtained with the C N (0.6) features or with All features. A second important observation is that GBT consistently outperforms RF, in some cases by a large margin.
Inter-dataset results The results are shown in Fig. 2. First, there is an overall increase of classification error, in particular for the Cadastre dataset. To analyze the results in more detail we computed the confusion matrix for the top-3 classes that contribute to the misclassification error, as shown in Table 5. We now discuss the result of each dataset in detail.
Ankeny The classifier performs very well for buildings and roads, as shown in Fig. 3(b). However it confuses large amounts of ground points as roads. This is not surprising since most of such mistakes occur in croplands, which are not present in any other dataset. Finally, although high vegetation appears in the top-3 misclassified classes in Table 5, this is mostly due to ambiguities in the ground truth: some bushes were manually labeled as ground, while the classifier predicts them as high vegetation.
Building The classifier performs very well on this dataset. The highest error is due to predicting buildings as high vegetation or human-made objects. This dataset has the lowest GSD (or highest resolution), and facades of the buildings are well-reconstructed. This is not the case for the other two datasets with higher GSD, where few facade points are available. We hypothesize that the classifier is confused with the facades, finding the vegetation or human-made object to be the closest match.
Cadastre The classifier predicts vast amounts of ground and vegetation points as buildings and human-made objects, leading to a very high error rate. This result is expected considering Table 3, as the Cadastre dataset contains hills and non-flat ground surfaces, which are not present in any of the other two datasets. Thus, the classifier confuses points in the regions of inclined ground with other classes that are closer in feature space to the training data (e.g. building roofs present a slope that resembles the properties of the points on a hill).
The analyses above highlight the importance of reliable and varied training data, in that it should resemble the unseen data on which the classifier will be applied, e.g different landscapes, seasons, shapes of buildings, etc.
A l l f e a t u r e s A l l f e a t u r e s Inter-dataset results: classification error (number of misclassified points divided by overall number of points, the lower the better) when training on two datasets and testing on the remaining one. Similar to Fig. 1, the best results are obtained with both geometric and color features C N (0.6) . The results on Ankeny and Cadastre show higher classification errors than those obtained when training and testing on the same dataset in Fig. 1. This is due to the lack of training data for certain objects or terrain type. For a detailed discussion see section 4.4. Ankeny dataset (a) Original data.
(b) Classification with geometry features only.
(c) Classification with geometry + color features.

Ground Road Building
High veg. Car Human-made obj. Figure 3. Qualitative results obtained with our approach on the Ankeny dataset, using the other two datasets for training. We used neighbor color features within a 0.6-meter radius neighborhood and the Gradient Boosted Trees classifier. Incorporating color information into the classifier yields a significant boost in accuracy, particularly for the road and high vegetation classes.

Qualitative Results
Figures 3, 4 and 5 show 3D views of each dataset and the respective classified point clouds obtained when using geometric features only, as well as with our approach. Overall the results are very satisfying, especially when one considers the heterogeneity of the different datasets, as discussed earlier. These figures also help clarify the observations described in section 4.4. For example, grass and ground on slopes in Cadastre are sometimes misclassified as roof in Fig. 5(c). Table 4 shows the break down of the timings obtained on a 6-core Intel i7 3.4 GHz computer. Our approach is very efficient, taking less than 3 minutes to classify every point in any of the presented photogrammetry datasets. This makes it ideal for the applications where the user needs to interact with the software to correct the training data or fix the classifier's predictions.

Timings
Buildings dataset (a) Original data.
(b) Classification with geometry features only.
(c) Classification with geometry + color features.

Ground Road Building
High veg. Car Human-made obj. Figure 4. Qualitative results obtained with our approach on the Buildings dataset, using the other two datasets for training. We used neighbor color features within a 0.6-meter radius neighborhood and the Gradient Boosted Trees classifier. Incorporating color information into the classifier results improves classification, particularly for the roads between buildings.

Testing dataset Ankeny Building Cadastre
Geom. features 46% 33% 48% Geom. + Color C N (0.6) 24% 16% 41% Table 6. Overall classification errors for inter-dataset experiments with LGBM classifier run with 2 sets of features: geometric only and both geometric and color. Color information yields a significant performance boost in prediction accuracy in all datasets.

CONCLUSION
In this paper we described an approach for a point-wise semantic labeling of aerial photogrammetry point clouds. The core contribution of our work is the use of color features, what improves significantly the overall classification results. Further we provide a concise comparison between two standard machine learning techniques that hopefully facilitates the decision making process of future research, showing that the Gradient Boosted Trees classi-Cadastre dataset (a) Original data.
(b) Classification with geometry features only.
(c) Classification with geometry + color features.

Ground Road Building
High veg. Car Human-made obj. Figure 5. Qualitative results obtained with our approach on the Cadastre dataset, using the other two datasets for training. We used neighbor color features within a 0.6-meter radius neighborhood and the Gradient Boosted Trees classifier. Using color information results in a more accurate segmentation.
fier outperforms the Random Forest classifier, in some cases by a large margin. Our method performs not only with high accuracies over the whole range of datasets used in the experiments but also with a high computational efficiency, making our approach suitable for interactive applications.
There are several directions that we would like to explore to increase accuracy and further decrease computational complexity. First we would like to explore the possibility to use auto-context information to train a second classifier that takes into account class labels of the neighboring points. Our preliminary experiments with this technique provided smoother results and higher accuracy. Another interesting topic to is to combine point cloud and image classification. Existing image classification algorithms are trained specifically to detect objects of such classes as cars and other human made objects. This method could increase the robustness of our classifier in particular on the classes that often confused: cars and human-made objects.
The classification method presented in this paper will soon be part of Pix4Dmapper Pro. Earlier we mentioned that access to properly labeled training data that represents aerial photogrammetry point clouds is limited. To overcome this issue we will implement an incremental training method, where users will be given the possibility to classify their data, visualize and correct errors manually. In a next step we will offer our users the possibility to include their datasets into our training data to improve the classifier quality. As the amount of training data increases we will be able not only to provide more accurate classifiers but to also train specialized ones. For example, we could provide a selection of classifiers for indoor and outdoor scenes, and for different seasons and scales.