A Distance Correlation Approach for Optimum Multiscale Selection in 3D Point Cloud Classification

Abstract: Supervised classification of 3D point clouds using machine learning algorithms and handcrafted local features as covariates frequently depends on the size of the neighborhood (scale) around each point used to determine those features. It is therefore crucial to estimate the scale or scales providing the best classification results. In this work, we propose three methods to estimate said scales, all of them based on calculating the maximum values of the distance correlation (DC) functions between the features and the label assigned to each point. The performance of the methods was tested using simulated data, and the method presenting the best results was applied to a benchmark data set for point cloud classification. This method consists of detecting the local maxima of DC functions that have previously been smoothed to avoid choosing scales that are very close to each other. Five different classifiers were used: linear discriminant analysis, support vector machines, random forest, multinomial logistic regression and multilayer perceptron neural network. The results obtained were compared with those from other strategies available in the literature and proved favorable to our approach.


Introduction
As a result of the advances in photogrammetry, computer vision and remote sensing, accessing massive unstructured 3D point cloud data is becoming easier and easier. Simultaneously, there is a growing demand for methods for the automatic interpretation of these data. Machine learning algorithms are among the most used methods for 3D point cloud segmentation and classification given their good performance and versatility [1][2][3][4]. Many of these algorithms are based on defining a set of local geometric features obtained through calculations on the vicinity of each point (or the center of a voxel when voxelization is carried out to reduce computing time) as explanatory variables. Local geometry depends on the size of the neighborhood (scale) around each point. Thus, a point can be seen as belonging to an object of different geometry, such as a line, a plane or a volume, depending on the scale [5]. As a result, the label assigned to each point can also change with this variable. The neighborhood around a point is normally defined by a sphere of fixed radius centered on each point [6] or by a volume limited by a fixed number of the closest neighbors to that point [7]. A third method, which is normally applied to airborne LiDAR (Light Detection and Ranging) data, consists of selecting the points in a cylinder of a fixed radius [8]. The local geometry of each point on the point cloud is mainly obtained from the covariance matrix, although other alternatives are possible, such as in [9], where Delaunay triangulation and tensor voting were used to extract object contours from LiDAR point clouds.
Given the importance of the scale in the result of the classification, there has been great interest in estimating the scale (size of the neighborhood) or combination of scales (multiscale approach) [10][11][12][13] that provide the best results (the smallest error in the classification). Unfortunately, this is not a trivial question, and several methods have been proposed to determine optimum scales. A heuristic but widely used method is to try a few scales, considering aspects such as the density of the point cloud, the size of the objects to be classified and the experience of the people solving the problem, and to select the scale or scales that provide the best solution. Differences in density across the point cloud lead to errors in the classification [14]. In [15], the sparseness of the point cloud was addressed through upsampling by a moving least squares method. Another possibility is to select several scales at regular intervals or at intervals determined by a specific function, such as in [16], where a quadratic function of the radius of a sphere centered on each point was used. Obviously, these procedures cannot guarantee an optimal solution, so some more objective and automatic alternatives have been proposed. One of them is to estimate the optimum scales taking into account the structure of the local covariance matrix obtained from the coordinates of the points, and a measure of uncertainty, such as the Shannon entropy [17,18]. Another alternative is to relate the size of the neighborhood to the point density and the curvature at each point [19]. The size of the neighborhood can be fixed across the point cloud, but the results can improve when it is changed from point to point [20].
There are different approaches for feature selection in classification problems. Some of them, for example, the meta-algorithm optimal feature weighting [21], random forests [22] or regularization methods such as Lasso [23], select the features at the time of performing the classification. By contrast, other methods (known as filter methods), such as ANOVA, Kruskal-Wallis, the AUC test [24] or independent component analysis [25], perform the selection prior to establishing the classification model. In this work, we propose a method in this second category. As in [26], we assume that a good approach to optimum scale selection is one for which the distance correlation [27] between the features and the class labels takes high values, but in this case we look for a combination of scales that provides the best classification, instead of selecting just one scale. In summary, we propose a simple, objective and model-independent approach to address an unsolved problem: the determination of the optimal scales in 3D point cloud multiscale supervised classification.

Methodology
Given a sample (X_i, Y_i), i = 1, ..., n, where X_j represents the predictors, j = 1, ..., p, and Y_i ∈ Z^+, we are interested in determining a model that assigns values to Y given the corresponding features X_j. Each feature depends on the values of a variable k observed at k = 1, ..., K discretization points. Accordingly, X_j is a vector in R^K. For each X_j, the response variable Y follows a multinomial distribution with L possible levels C_1, ..., C_L and associated probabilities P_l(X_j) = P(Y = C_l | X_j), l = 1, ..., L.
In the context of our particular problem, X j (k) represents each of the features obtained from the point cloud to be classified, whose values are calculated, for each training point, at a finite number of scales k = 1, ..., K, each of them representing the size of the local vicinity around each point. In contrast to the standard procedure, by which only a few scales are used, here the features are calculated at a much larger number of scales in order to be able to select those containing the relevant information to solve the classification problem.
The initial hypothesis is that, for each feature, only a few values of k (scales) provide useful information to perform the classification, and that these scales correspond to high values of the distance correlation between the features and the labels representing each category. Generally speaking, the distance correlation [27,28] between two random vectors X ∈ R^p and Y ∈ R^q is defined as

R(X, Y) = V(X, Y) / sqrt(V(X, X) V(Y, Y)),

where V(X, Y) represents the distance covariance (see [29] for an application of distance covariance to variable selection in functional data classification),

V²(X, Y) = (1 / (c_p c_q)) ∫∫ |f_{X,Y}(t, s) − f_X(t) f_Y(s)|² / (|t|_p^{1+p} |s|_q^{1+q}) dt ds,

a measure of the distance between f_{X,Y}(t, s) = ∫∫ e^{i(t·x + s·y)} f(x, y) dx dy, the joint characteristic function of the random vectors X and Y, and the product f_X f_Y of the characteristic functions of X and Y, respectively. For their part, c_p and c_q are constants depending on the dimensions p and q, respectively. The distance correlation has some advantages over other correlation coefficients, such as the Pearson correlation coefficient: (1) it measures non-linear dependence; (2) X and Y do not need to be one-dimensional variables; (3) R(X, Y) = 0 ⇔ X and Y are independent, that is, independence is a necessary and sufficient condition for the nullity of the distance correlation.
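For readers who want to experiment, the double-centering sample estimator of the distance correlation can be sketched in a few lines of NumPy. This is an illustrative sketch, not the fda.usc implementation used later in the paper; function and variable names are our own:

```python
import numpy as np

def distance_correlation(x, y):
    """Empirical distance correlation between samples x (n x p) and y (n x q).

    Sketch of the double-centering (V-statistic) estimator of Szekely et al.
    For a categorical response, pass y as a dummy (one-hot) matrix."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    y = np.atleast_2d(np.asarray(y, dtype=float))
    if x.shape[0] == 1:          # allow 1-D inputs of length n
        x = x.T
    if y.shape[0] == 1:
        y = y.T

    def centered_dist(z):
        # Pairwise Euclidean distances, double-centered (rows, columns, grand mean).
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    a, b = centered_dist(x), centered_dist(y)
    dcov2_xy = (a * b).mean()    # squared distance covariance
    dcov2_xx = (a * a).mean()
    dcov2_yy = (b * b).mean()
    if dcov2_xx * dcov2_yy == 0:
        return 0.0
    return np.sqrt(dcov2_xy / np.sqrt(dcov2_xx * dcov2_yy))
```

Because of property (3) above, a value near zero indicates (approximate) independence between the feature at a given scale and the class labels.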
Given that R(X, Y) is defined for random vectors X and Y in arbitrary finite-dimensional spaces, in this study the distance correlation was calculated both for a single scale, R(k) = {R(X(k_s), Y)}_{s=1}^{K}, and for a set of scales (discretization points), R(k_1, ..., k_v) = R(X(k_1, ..., k_v), Y). Our aim is to select a set of critical scales (those which contain the essential information for the classification), k̂ = (k̂_1, k̂_2, ..., k̂_K̂), with K̂ < K, which maximize the distance correlation. Solving this problem can be time consuming, given that it requires calculating the distance correlation for a very high number of combinations. In this work, three different approaches are proposed to determine the arguments providing the maximum distance correlation values.
The first approach (Algorithm 1) is an iterative method that searches for the best of all the combinations of the K scales, taking K̂ scales at a time. In each iteration, the procedure fixes the last K̂ − 1 values of the vector k̂ and randomly selects the remaining value until a maximum of R(k̂) is reached. This is a brute-force procedure that does not look for local maxima directly.
The second approach (Algorithm 2.a) calculates the DC at each value of k for each feature, looks for the local maxima of the distance correlation, sorts them in decreasing order and, finally, selects the first K̂ values of k.
Finally, the third approach (Algorithm 2.b) differs from the second only in that the distance correlation is smoothed before calculating the local maxima. The values of DC calculated at discrete scales are considered as coming from a smooth function m(k), so that R(k) = m(k) + ε, ε being a zero-mean independent error term and k ∈ [1, K]. In this way, close local maxima providing redundant information are avoided. In particular, this work follows the same idea as [29], but using a B-spline basis instead of kernel-type smoothing.
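Algorithm 2.b can be sketched with SciPy's smoothing splines, which also use a B-spline representation; this is an approximation of the paper's procedure, and the smoothing factor `smooth` and function names are our own assumptions:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.signal import argrelmax

def critical_scales(dc, k_hat, smooth=None):
    """Algorithm 2.b sketch: smooth the distance-correlation curve dc
    (one value per scale), locate its local maxima and return the k_hat
    scale indices with the largest smoothed DC values.

    smooth: spline smoothing factor (None lets scipy pick its default)."""
    k = np.arange(len(dc))
    spline = UnivariateSpline(k, dc, s=smooth)   # B-spline-based smoother
    m = spline(k)                                # smoothed DC values
    peaks = argrelmax(m)[0]                      # indices of local maxima
    order = np.argsort(m[peaks])[::-1]           # sort maxima by DC, descending
    return peaks[order][:k_hat]
```

Smoothing merges neighboring peaks of the raw DC curve, which is precisely why this variant avoids selecting redundant, nearly identical scales.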
Each algorithm is run several times, and the values of k in k̂ = (k̂_1, ..., k̂_K̂) obtained in each run are stored. Those values of {k̂_j}_{j=1}^{K̂} that are the most frequent are considered the critical points (scales). Once these critical scales have been selected, a classification model is fitted to the features at those scales, avoiding the inherent drawbacks of high-dimensional feature spaces. The algorithms for each of the methods are written below, although the second and third methods are written together, since they only differ in one sentence corresponding to the adjustment of a smooth function to the vector of distance correlations. In each algorithm, n represents the size of the data, K the number of scales used to calculate the features and K̂ the number of critical points selected.

Algorithm 1 For scale selection by brute force of the DC.
Step 0: Randomly select the initial estimates k̂^0 = (k̂^0_1, ..., k̂^0_K̂), taking a sample of size K̂ << K without replacement.
Step 2: Smooth R(k) as a function of k (only for the third approach, Algorithm 2.b).
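The brute-force search of Algorithm 1 can be sketched as a coordinate-wise ascent on the joint distance correlation. The callback `dc_of` (which should return R(X(scales), Y) for a candidate set of scales) and the stopping rule are assumptions, since only a verbal description of the procedure is available here:

```python
import numpy as np

def brute_force_scales(dc_of, K, k_hat, n_iter=20, rng=None):
    """Algorithm 1 sketch: starting from a random subset of k_hat scales,
    repeatedly free one coordinate at a time and replace it by the value
    that most increases the joint distance correlation dc_of(scales)."""
    rng = np.random.default_rng(rng)
    # Step 0: random initial subset of K-hat scales without replacement.
    current = list(rng.choice(K, size=k_hat, replace=False))
    best = dc_of(tuple(current))
    for _ in range(n_iter):
        improved = False
        for j in range(k_hat):              # fix all but one coordinate
            for cand in range(K):
                if cand in current:
                    continue
                trial = current.copy()
                trial[j] = cand
                val = dc_of(tuple(trial))
                if val > best:
                    best, current, improved = val, trial, True
        if not improved:                    # a maximum of R has been reached
            break
    return sorted(current), best
```

In practice `dc_of` would evaluate the distance correlation between the training features at the candidate scales and the labels; caching its distance matrices is what drives the memory cost discussed in the case study.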

Simulation Study
This section reports the procedure followed to evaluate the practical performance of the proposed methodology using simulated data. We consider simulated curves X_i(k) with (k_1, k_2, k_3, k_4) = (20, 40, 60, 80) being the critical points (points of interest to be detected), (σ_1, σ_2, σ_3, σ_4) = (3, 2, 2, 3) a measure of how sharp X_i(k) is around each k_j, and w_ij weights simulated independently from a uniform distribution on the interval [−2.5, 3.0]. The errors ε_i(k) were generated from a random Gaussian process with covariance matrix Σ = σ_0 I, σ_0 being a scalar. A set of predictors (features) for σ_0 = 0.05 and σ_0 = 0.25 is represented in Figure 1. Given X_i, the corresponding outcome variable Y_i was generated from a multinomial distribution with three possible results, C_1, C_2 and C_3, with associated probabilities P_j(X_i). Samples were generated under the above scenario for different variance parameters σ_0. Moreover, we randomly split each sample into a training set (n_e = 140), used in the estimation process, and a test set (n_p = 60), used for prediction. The curves were discretized at K = 100 equispaced points in k ∈ (0, 100).
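The exact functional form of the simulated curves is not given in this excerpt; a plausible generator, assuming a weighted sum of Gaussian bumps centered at the critical points plus white noise, could look as follows (purely illustrative):

```python
import numpy as np

def simulate_curves(n, sigma0, rng=None):
    """Generate n curves in the spirit of the simulation study.

    Assumption: each curve is a sum of Gaussian bumps centered at the
    critical points (20, 40, 60, 80) with widths (3, 2, 2, 3), weighted
    by uniform draws on [-2.5, 3.0], plus Gaussian noise of scale sigma0."""
    rng = np.random.default_rng(rng)
    k = np.arange(1, 101)                                   # K = 100 scales
    centres = np.array([20, 40, 60, 80])                    # critical points
    sigmas = np.array([3, 2, 2, 3])                         # sharpness around each centre
    w = rng.uniform(-2.5, 3.0, size=(n, 4))                 # random weights w_ij
    bumps = np.exp(-(k[None, :] - centres[:, None])**2 / (2 * sigmas[:, None]**2))
    x = w @ bumps + rng.normal(0.0, sigma0, size=(n, 100))  # signal + noise
    return k, x
```

Generating n = 200 curves and splitting them 140/60 reproduces the training/test sizes described above.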
The performance of the algorithms was evaluated for each of the critical points (as mentioned above, the theoretical critical points in the simulation scenario were k = (20, 40, 60, 80)) by means of the mean squared error over a number of repetitions R (we specifically chose R = 400):

MSE(k̂) = (1/R) Σ_{r=1}^{R} ||k̂_r − k||²,

where k̂_r is the estimated optimal subset of scale points in repetition r. The MSE(k̂) values for the different proposed algorithms and different values of σ_0 are summarized in Table 1. For the three algorithms, the mean squared error and the dispersion of the estimated critical points around the mean increase with σ_0, as expected. Algorithm 2.a produced the worst results, both in detecting the critical points and in the uncertainty of the points detected. In contrast, Algorithm 2.b presented the best performance, so smoothing the distance correlation functions before calculating the critical points had a positive effect. Although Algorithm 1 produces reasonable results in terms of accuracy in estimating the critical points, its standard deviation is strikingly large compared with that of Algorithm 2.b. Figure 2 shows boxplots representing the results of the simulation to detect each of the critical points (20, 40, 60, 80), for three different values of σ_0 and for each algorithm. Clearly, Algorithm 2.b is the most accurate, while the worst results corresponded to Algorithm 2.a. Increasing σ_0 produces uncertainty in the determination of the critical points for Algorithms 1 and 2.a, but it hardly affects Algorithm 2.b.
In addition, to measure the effect of errors in the detection of the critical points on the estimation of the probabilities, the mean squared error between the estimated and theoretical class probabilities was also calculated. The results are shown in Table 2, considering different numbers K̂ of critical points and values of σ_0. Again, the best results, that is, small values of MSE (and of its standard deviation) and total accuracy close to that of the theoretical model, corresponded to Algorithm 2.b, whereas Algorithm 2.a provided the worst results. Figure 3 shows boxplots of total accuracy for all the repetitions and for the three algorithms tested, according to the number of critical points. The red items correspond to the theoretical model. Again, the best results correspond to Algorithm 2.b, followed by Algorithm 1. Note that more than four critical points were tried, but, in any case, for Algorithms 1 and 2.b these metrics stabilized at K̂ = 4.
The Bayesian Information Criterion (BIC) for each algorithm is represented versus the number of critical points, for three different values of σ_0, in Figure 4. As can be appreciated, Algorithm 2.a does not work properly: the number of critical points is not detected, and there is great dispersion in accuracy. Algorithm 2.b reaches a minimum value of BIC at the correct number of critical points, K̂ = 4, regardless of the value of σ_0.
A comparison of the total accuracy for the three algorithms tested with the theoretical model (in red), fixing K̂ = 4, for different values of σ_0, is shown in Figure 5. A decrease in accuracy is accompanied by an increase in the standard deviation of the error. Based on the classification results for each algorithm, there seems to be a relationship between inaccuracy in the estimation of the critical points and errors in the classification. As before, the best algorithm is Algorithm 2.b, followed by Algorithm 1, while Algorithm 2.a performs poorly. Note that the inaccuracy of Algorithm 2.b is mainly due to bias (see the right panel of Figure 2), since there is hardly any variability, while the other two algorithms show both bias and variability, especially Algorithm 2.a (see the center panel of Figure 2).

Case Study
In addition to testing our method on simulated data, we also applied it to a real dataset where the objects to be labeled are elements of an urban environment. Figure 6 shows the operations followed to perform the classification of the point cloud using the proposed methodology.

Dataset and Feature Extraction
The real dataset used to test our approach was the Oakland 3D point cloud [30], a benchmark dataset of 1.6 million points that has previously been used in other studies concerning point cloud segmentation and classification. The objective is to automatically assign a label to each point in the point cloud from a set of features obtained from the coordinates (X, Y, Z) of a training dataset. Specifically, there are six categories (labels) of interest, as is shown in Figure 7.
The point cloud was collected around the CMU campus in Oakland, Pittsburgh (USA) using a Mobile Laser Scanner (MLS) system that incorporates two-dimensional laser scanners, an Inertial Measurement Unit (IMU) and a Global Navigation Satellite System (GNSS) receiver, all calibrated and mounted on the Navlab 11 vehicle. Figure 7 shows a small part of the point cloud, where a label, represented by a color, has been assigned to each point. A total of six labels have been considered. The features representing the local geometry around each point were obtained through the eigendecomposition of the covariance matrix Σ [13,31]:

Σ = (1/N) Σ_{i=1}^{N} (p_i − p̄)(p_i − p̄)^T = V Λ V^T,

where the vector p_i = (X_i, Y_i, Z_i) represents each point in the neighborhood, p̄ their mean, V is a matrix whose columns are the eigenvectors v_i, i = 1, ..., 3, and Λ is a diagonal matrix whose non-zero elements are the eigenvalues λ_1 > λ_2 > λ_3 > 0. The three eigenvalues and the eigenvector v_3 were used to calculate the five features listed in Table 3. The Z range for each point is calculated considering the points in a vertical column of a specific section (scale) around that point. In order to avoid the negative effect of outliers, instead of using the full range of the Z coordinates, we used the values between the 5th and 95th percentiles. An explanation of the geometrical meaning of these and other local features can be found in [12,16]. Table 3. Features extracted from the point cloud.

The spatial distributions of these features at different scales are shown in Figure 8.
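The formulas of Table 3 did not survive extraction; the following sketch uses the standard eigenvalue-based definitions from the literature [12,16], which may differ in normalization from the exact expressions of the paper:

```python
import numpy as np

def eigen_features(neighbours):
    """Standard covariance eigenfeatures for one neighbourhood (m x 3 array).

    Assumption: the usual definitions linearity = (l1-l2)/l1,
    planarity = (l2-l3)/l1, sphericity = l3/l1, horizontality = |<v3, e_z>|,
    and the 5th-95th percentile Z range described in the text."""
    p = np.asarray(neighbours, dtype=float)
    cov = np.cov(p, rowvar=False)          # 3 x 3 local covariance matrix
    lam, vec = np.linalg.eigh(cov)         # eigenvalues in ascending order
    l3, l2, l1 = lam                       # relabel so that l1 >= l2 >= l3
    v3 = vec[:, 0]                         # eigenvector of the smallest eigenvalue
    linearity = (l1 - l2) / l1
    planarity = (l2 - l3) / l1
    sphericity = l3 / l1
    horizontality = abs(v3[2])             # 1 for horizontal surfaces (e.g. ground)
    z_range = np.percentile(p[:, 2], 95) - np.percentile(p[:, 2], 5)
    return linearity, planarity, sphericity, horizontality, z_range
```

Evaluating these quantities on the neighbors found at each of the K candidate scales yields the multiscale feature curves X_L(k), ..., X_Z(k) used below.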

Neighborhood Selection
The proposed methodology for optimum scale selection (detection of critical points) was applied to solve the previous classification problem. Thus, we have a vector of input variables X = (X_L(k), X_P(k), X_S(k), X_H(k), X_Z(k)) representing the features (linearity L, planarity P, sphericity S, horizontality H and Z range Z) measured at different scales, and an output variable Y ∈ {C_1, ..., C_5}, which takes five discrete values, the labels assigned to each type of object (cars, buildings, canopy, ground and poles). Our aim is to estimate an optimum neighborhood (scale) for each feature by means of the distance correlation (DC), taking into account its advantages with respect to the Pearson coefficient. For each sample, each feature was evaluated at a regular grid of K = 100 scales measured in centimeters on a linear scale from 50 to 300 cm. Figure 9 shows a sample of n = 150 curves for each registered feature X_i(k), i = 1, ..., 5, in the interval k ∈ [k_1 = 50, k_100 = 300], and the corresponding functional mean X̄_i(k), both colored by class label. Note the different performance of the features for the different classes and scales. For instance, horizontality takes high values for the ground, and it is uniform across scales. However, this feature shows abrupt jumps at certain scales for the poles, which could correspond to edge effects. As expected, linearity takes high values for poles and low values for buildings, while planarity is high for buildings and low for poles. Figure 10 shows the distance correlation functions between a dummy matrix representation of the categorical response and the covariates (X_L(k), X_P(k), X_S(k), X_H(k), X_Z(k)) for 100 repetitions of random samples of size n_e = 750 (150 per class), corresponding to each of the features extracted. DC functions were calculated using the fda.usc package [32]. They are quite uniform, with few peaks, some of which are close together.
This could cause problems in finding the relevant scales, similar to the effect of increasing the standard deviation observed with the simulated data. A histogram of the global maxima of the distance correlation curves for those repetitions is depicted at the bottom of the figure.
As can be appreciated, most of the relative maxima correspond to low scales (impact points), except for the Z range variable (5th-95th percentile range of the Z axis). Again, smoothing the DC function before searching for the local maxima helped to discriminate the most important points. However, even when the number of local maxima decreases after smoothing the DC, some of them have little influence on the classification. In fact, no more than three critical points were needed to obtain the best classification results for the different classifiers tested. The three most frequent scales in each histogram are shown in Table 4. As can be appreciated, the important scales take low values for all the features, with the exception of the Z range, for which the important scales correspond to low to medium values. The performance of our approach was contrasted with the proposal in [18], in which the optimal scale k_λ is calculated as the minimum of the Shannon entropy E_λ, which depends on the normalized eigenvalues e_i = λ_i / (λ_1 + λ_2 + λ_3), i = 1, ..., 3, of the local covariance matrix Σ:

E_λ = −e_1 ln(e_1) − e_2 ln(e_2) − e_3 ln(e_3).
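The entropy-based baseline of [18] is straightforward to implement. This sketch assumes the three eigenvalues of the local covariance matrix have already been computed at every candidate scale, and uses E_λ = −Σ_i e_i ln(e_i) with e_i = λ_i / Σλ:

```python
import numpy as np

def optimal_scale_entropy(eigvals_per_scale, scales):
    """Baseline of [18]: return the scale minimising the Shannon entropy
    of the normalised eigenvalues e_i = lambda_i / sum(lambda).

    eigvals_per_scale: array-like of shape (n_scales, 3), strictly positive."""
    lam = np.asarray(eigvals_per_scale, dtype=float)
    e = lam / lam.sum(axis=1, keepdims=True)   # normalised eigenvalues per scale
    ent = -(e * np.log(e)).sum(axis=1)         # E_lambda at each scale
    return scales[np.argmin(ent)]              # scale with minimum entropy
```

Low entropy corresponds to a neighborhood whose eigenvalue pattern is strongly 1D, 2D or 3D, which is why the criterion favors the scale at which the local geometry is most clearly defined.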

Classification
In this step, we evaluate how the classifier translates the information of the critical points of X (all features) into a classification error. This is quite an interesting question in the functional context or when dealing with high-dimensional spaces. However, new important questions arise, related to how representative the selected critical points (scales) shown before are, or how to choose a useful classifier for the problem among all the possibilities. Unfortunately, detecting the critical scales is not a simple task, as was shown in the simulation study by introducing Gaussian errors with standard deviations of different magnitudes in a model relating features and classes. The greater the standard deviation, the greater the error in determining the exact values of the relevant scales. This is because these scales correspond to the peaks of R_i(k), and these peaks become less sharp as the standard deviation of the error increases. In this situation, the best results were obtained when the DC was smoothed before searching for the local maxima, and this was the method applied to perform the classification with real data. We now have to deal with a purely multivariate classification problem whose dimension is given by the number of critical points selected for each functional variable, and many procedures are known to handle this problem (see, for example, [33]). Many classification methods have been described in the literature, but we limit our study to five proven methods that are representative of most types of classifiers, linear or non-linear, hard or probabilistic, ensemble or not. Specifically, the chosen methods are: linear discriminant analysis (LDA) [34,35], multiclass logistic regression (LR) [36,37], multiclass support vector machines (SVMs) [38,39], random forest (RF) [22] and feed-forward neural networks with a single hidden layer (ANN) [40].
They are among the top classification algorithms in machine learning, although there are some other classification methods that could be employed here such as, for example, Quadratic Discriminant Analysis (QDA) or Generalized Additive Models (GAMs).
The choice among the different classifiers could be influenced by their theoretical properties and/or the ease of drawing inferences. Better inferences can be drawn from simpler classifiers such as LDA or LR models. Furthermore, as discussed in [41], such simple classifiers are usually hard to beat in real-life scenarios, and they add interpretability to the classification rule, which is sometimes more important than predictive power.
We consider three possibilities for the classification, depending on the number of critical scales used, according to Table 4:
1. A unique scale (impact point), k̂_N1 = (k̂_L1, k̂_P1, k̂_S1, k̂_H1, k̂_Z1), for each of the features, corresponding to the most frequent value of the impact points.
2. Two scales, k̂_N2 = (k̂_N1, k̂_L2, k̂_P2, k̂_S2, k̂_H2, k̂_Z2), corresponding to the two most frequent critical scales for each of the features.
3. Three scales, k̂_N3 = (k̂_N2, k̂_L3, k̂_P3, k̂_S3, k̂_H3, k̂_Z3), corresponding to the three most frequent critical scales for each of the features.
In order to contrast the performance of our approach, the same classification algorithms were applied to the feature values obtained using other scales. Specifically, we used the following values of the scale k: k_seq = (k_L,seq, k_P,seq, k_S,seq, k_H,seq, k_Z,seq), linearly spaced scales corresponding to the following values of k in centimeters (cm): 50, 112, 175, 237 and 300.
Training data (n_e = 200 per class) and test data (n_p = 400 per class) were sampled from different areas of the point cloud in order to ensure their independence. Table 5 compares the total accuracy obtained using the LR, LDA, SVM, RF and ANN classifiers for all the scales studied: no important discrepancies between them were appreciated, with SVM having a narrow advantage over the others. A quick look at this table shows that our proposal is better than using sequential scales or k_λ. Furthermore, using a multiscale scheme provides a slight improvement in accuracy with respect to using just one scale (k̂_N1) for almost all the classifiers. The feature selection and classifier functions available in the R package fda.usc were used with their default parameters (without previous tuning). Table 6 shows the results of the classification for the test sample, in terms of precision and recall, for each of the scales tested, using the multinomial logistic regression (LR) and SVM classifiers. We limited the results to these two classifiers because there are few differences with respect to the other three. Although some non-optimal scales provided the best results for some types of objects (ground and buildings), in global terms it can be concluded that the largest values of precision and recall correspond to k̂_N2 and k̂_N3, and that they are practically the same in both cases. Figure 11 represents boxplots of the F1-index for the LR classifier, for each of the classes, depending on the scale. The plot at the bottom right shows the average value of F1 for the five classes. With a few exceptions, the highest F1 values correspond to the cases where two (k̂_N2) or three (k̂_N3) optimum scales for each feature were used. However, the values of the median and the interquartile range are almost the same, so we cannot claim that there are significant differences between the two options.
For its part, k_λ led to the worst performance, with low mean values and great dispersion, because the extreme values of the scale were sometimes selected. A drawback for the practical implementation of the proposed algorithms, especially Algorithm 1, is memory consumption. The overall memory consumption when storing the distance matrices for p covariates with n elements each is (p + 1)(n(n − 1)/2 − n). As an example, Table 7 shows the maximum memory consumed (in megabytes) and the execution time (in seconds) when the algorithms are executed for different sample sizes and two different levels of error on an Intel Core i7-1065G7 with 16 GB of RAM.
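As a rough check, the entry count quoted above can be translated into megabytes as follows (the formula is taken verbatim from the text; the 8-byte double-precision entry size is an assumption):

```python
def distance_matrix_memory_mb(n, p, bytes_per_entry=8):
    """Approximate memory (MB) for the stored distance matrices, using the
    count given in the text: (p + 1) * (n(n-1)/2 - n) entries."""
    entries = (p + 1) * (n * (n - 1) // 2 - n)
    return entries * bytes_per_entry / 1024**2
```

For example, with n = 1000 training points and the p = 5 features of the case study, the stored matrices already occupy roughly 23 MB, which explains why memory grows quickly with the sample size.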

Conclusions
In this work, we propose three different algorithms to select optimum scales in multiscale classification problems with machine learning. They are based on determining the arguments (scales) of the features with a high distance correlation with the labels assigned to the objects. Distance correlation provides a measure of the association, linear or non-linear, between two random vectors of arbitrary dimensions, so high correlations were expected to correspond to low classification errors. First, the proposed algorithms were tested by simulating the predictors and their relationship with the labels. The results were encouraging, supporting the validity of our proposal and allowing us to establish the order of performance of the three algorithms. The best results were obtained when the distance correlation functions for each feature were smoothed before calculating the local maxima. Then, the algorithm that provided the best results with the simulated data was tested on a real classification problem involving a 3D point cloud collected using a mobile laser scanning system. Determining the optimum scales simplifies the classification problem for other point clouds, since they give us important information to limit the scales at which the features are determined, assuming the quality and density of the point clouds are similar to those of the training data and, of course, that the same type of objects are being classified. The results obtained for the real problem were also positive, outperforming those obtained with other methods reported in the literature that use unique, sequentially defined scales or the Shannon entropy. A maximum of three scales for each feature was sufficient to obtain the best classification results, measured in terms of precision, recall and F1-index. In addition, no significant differences were found between the five classifiers tested.