EFFICIENT LARGE-SCALE AIRBORNE LIDAR DATA CLASSIFICATION VIA FULLY CONVOLUTIONAL NETWORK

Nowadays, we are witnessing an increasing availability of large-scale airborne LiDAR (Light Detection and Ranging) data, which greatly improve our knowledge of urban areas and the natural environment. In order to extract useful information from these massive point clouds, appropriate data processing is required, including point cloud classification. In this paper we present a deep learning method to efficiently perform the classification of large-scale LiDAR data, ensuring a good trade-off between speed and accuracy. The algorithm projects the point cloud onto a two-dimensional image, where every pixel stores the height, intensity, and echo information of the point falling in the pixel. The image is then segmented by a Fully Convolutional Network (FCN), which assigns a label to each pixel and, consequently, to the corresponding point. In particular, the proposed approach is applied to a dataset of 7700 km² that covers the entire Friuli Venezia Giulia region (Italy), distinguishing among five classes (ground, vegetation, roof, overground and power line) with an overall accuracy of 92.9%.


INTRODUCTION
In recent years, governments and other institutions worldwide have been promoting the survey of large areas of national territory for purposes such as natural hazard management, urban planning and facilities monitoring. In this context, airborne LiDAR (Light Detection and Ranging) technology represents a suitable survey platform to obtain high-resolution data at wide scale; it requires, however, efficient algorithms to handle and process the large amount of acquired data.
In the LiDAR data processing pipeline, classification is one of the most important and time-consuming stages, necessary for the subsequent generation of cartographic products. Ground points must be extracted, e.g., to create digital terrain models (DTMs), whereas the identification of vegetation is essential to evaluate its density or, in the field of power line monitoring, to automatically calculate its distance from the conductors, just to name a few applications.
When dealing with large-scale airborne laser scanning (ALS) datasets, processing time becomes an essential factor to take into account. In this work, we show how an algorithm based on Convolutional Neural Networks (CNNs) was profitably employed to classify a dataset of 7700 km² that covers the entire Friuli Venezia Giulia region (Italy). The proposed approach distinguishes among five classes, namely ground, vegetation, roof, overground (e.g., cars, walls and chimneys) and power line, achieving an overall accuracy of 92.9% with a classification time of 11 minutes per km². It is worth noting that half of the surveyed region consists of alpine areas (higher than 600 m a.s.l.), which are particularly challenging for the classification task, as observed in (Winiwarter et al., 2019). Nevertheless, our method achieved a good classification in mountainous environments as well.
The paper is organized as follows. Section 2 reviews the existing methods for point cloud classification, while Sec. 3 describes the proposed approach in detail. In Sec. 4 the dataset is presented, together with the achieved results. Finally, Sec. 5 draws the conclusion.

STATE OF THE ART
Point cloud classification has long been an active research topic in the LiDAR data processing field, with several applications in, e.g., land cover classification, vegetation studies in forestry and agriculture, and road infrastructure management (Wang et al., 2020). Classification methods usually rely both on geometric information (i.e., the 3D coordinates of the surveyed points and their distribution in a neighboring region) and on the intensity of the backscattered pulse (Scaioni et al., 2018). Furthermore, thanks to the recent availability of laser scanners that are able to digitize the entire waveform of the reflected signal, several algorithms also exploit full-waveform data and the features derived from them (Maset et al., 2015).
Early works mainly proposed classification algorithms based on predefined discriminant rules and simple thresholds (Rutzinger et al., 2008, Wagner et al., 2008), subsequently replaced by machine learning techniques such as Support Vector Machine (Serna, Marcotegui, 2014) and Random Forest (Tran et al., 2018). The main limitation of these approaches lies in the need for hand-crafted features, which can be sensitive to changes in the data characteristics. Moreover, these approaches usually classify each point independently, without considering the labels assigned to neighboring points (Wang et al., 2020).
In the last decade, deep learning techniques have been spreading in disciplines such as computer vision, robotics and audio processing, replacing methods based on hand-engineered features with algorithms that learn both features and classifier end-to-end (Goodfellow et al., 2016). In particular, Convolutional Neural Networks (CNNs) and Fully Convolutional Networks (FCNs) proved to be successful tools for image classification and segmentation tasks, respectively (Szegedy et al., 2015, Garcia-Garcia et al., 2017).
Very recently, various methods based on deep learning have been applied in the remote sensing field, including for point cloud classification (Griffiths, Boehm, 2019). They can be divided into three main approaches: (i) classification of single points based on 2D CNNs, (ii) simultaneous classification of portions of point clouds via FCNs that operate on a 2D image, and (iii) network architectures that operate directly in the 3D space. The methods proposed in (Yang et al., 2017, Zhao et al., 2018) fall in the first category: the 3D neighborhood features of a point are transformed into a 2D image that is then classified by a CNN. Leveraging the high performance of the networks usually applied for image processing, (Zorzi et al., 2019) proposed to map the point cloud and the information derived from full-waveform data into an image segmented by a FCN, thus assigning a label to each pixel and, consequently, to the point falling in the pixel. A similar approach was also applied in (Rizaldy et al., 2018): in this case, the three classes ground, vegetation and building are distinguished with an overall accuracy of 93%.
Deep learning architectures that operate directly on 3D data have been proposed, e.g., in (Wu et al., 2015, Tchapmi et al., 2017). The cited methods rely on a voxelization of the point cloud, i.e., data are represented by means of a regular 3D voxel grid, subsequently fed to a network that performs convolution in the 3D space. This kind of approach is usually computationally very expensive, which limits the size of the point cloud that can be taken as input by the network. PointNet (Qi et al., 2016) and its improved version PointNet++ (Qi et al., 2017) were the first architectures able to operate on unstructured data. These networks neither require the point cloud to be transformed onto a regular grid nor use convolution functions, but instead employ multi-layer perceptrons to extract features at both local and global scale. PointNet and PointNet++ proved to outperform state-of-the-art methods on common benchmark datasets such as ModelNet40 (Wu et al., 2015) and were successfully applied, e.g., to distinguish between coniferous and deciduous tree points (Briechle et al., 2019). Another point-based deep learning method that does not involve rasterization or voxelization is the one proposed by (Landrieu, Simonovsky, 2018), which pre-organizes the point cloud in the so-called Superpoint Graph and exploits a graph convolutional network to perform the classification task.
Methods that operate directly in the 3D space are particularly suited for point clouds representing indoor scenes and road environments, acquired by, e.g., Terrestrial Laser Scanner (TLS) and Mobile Mapping System (MMS). In the case of ALS data, instead, we will demonstrate that a good trade-off between accuracy and computing time can be achieved by exploiting the 2.5D characteristic of the data, which can be processed by custom FCNs originally developed for image segmentation tasks.

PROPOSED METHOD
As mentioned in Sec. 2, in order to take advantage of well-established CNN architectures usually employed in the image processing field, we treat ALS data classification as an image segmentation problem, solved with a FCN (Fig. 1). Similarly to the algorithm proposed in (Zorzi et al., 2019), our method is composed of two main stages (Fig. 2). First, the point cloud is projected onto two-dimensional orthographic images, with one image channel storing the height of the point falling in the pixel. A 3D point cloud is thus represented as 2.5D data on a regular grid, which makes it possible to efficiently take into account spatial positions and geometrical relationships between neighboring points. Moreover, other attributes recorded by the instrument are associated with three additional image channels, namely intensity, return number and total number of returns. In this way, point cloud classification is cast as an image segmentation problem, where the usual RGB channels are replaced by LiDAR attributes.
Because of the uneven spatial distribution of the 3D points, this projection cannot avoid collisions, which occur when more than one point is mapped to the same pixel, unless a very small pixel size is chosen, which would have a negative effect on both computing time and classification accuracy. We cope with this by creating two different images that are processed independently: in the first one, the point with the highest altitude is assigned to the pixel, in order to enhance the classification of thin objects such as power lines. In the second one, the lowest point is stored, improving the identification of the ground class, which is critical for the generation of DTMs. If more than two points fall in the same pixel, the ones with intermediate height inherit the label from the highest point. Thanks to this approach, we can use a pixel size of 0.10 m that is commensurate with the acquisition density of 18 points/m².
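The projection step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `rasterize`, the array layout and the use of NumPy are assumptions made for illustration only. For each occupied pixel it keeps either the highest or the lowest point and stores height, intensity, return number and number of returns in four channels.

```python
import numpy as np

def rasterize(points, attrs, pixel=0.10, keep="highest"):
    """Project a point cloud onto a 4-channel image, one point per pixel.

    points: (N, 3) array of x, y, z coordinates in metres.
    attrs:  (N, 3) array of intensity, return number, number of returns.
    keep:   "highest" or "lowest", the rule used to resolve collisions.
    Returns the image (H, W, 4) and an index map (H, W) holding the index
    of the stored point, or -1 for empty pixels.
    (Hypothetical helper, not the authors' code.)
    """
    xy, z = points[:, :2], points[:, 2]
    origin = xy.min(axis=0)
    cols = np.floor((xy[:, 0] - origin[0]) / pixel).astype(int)
    rows = np.floor((xy[:, 1] - origin[1]) / pixel).astype(int)
    height, width = rows.max() + 1, cols.max() + 1
    flat = rows * width + cols

    # Sort by pixel first, then by height, so that the first point of each
    # pixel group is the one to keep.
    key = -z if keep == "highest" else z
    order = np.lexsort((key, flat))
    pix, first = np.unique(flat[order], return_index=True)
    chosen = order[first]

    image = np.zeros((height, width, 4), dtype=np.float32)
    index_map = np.full((height, width), -1, dtype=np.int64)
    r, c = np.divmod(pix, width)
    image[r, c, 0] = z[chosen]
    image[r, c, 1:] = attrs[chosen]
    index_map[r, c] = chosen
    return image, index_map
```

Calling the function twice, with `keep="highest"` and `keep="lowest"`, yields the two images; the index map allows the predicted pixel labels to be transferred back to the points.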
The image segmentation task is then performed by a FCN, which assigns a class label to each pixel and, consequently, to the corresponding point. In recent years, several FCN models have been proposed to solve semantic segmentation (Ciresan et al., 2012, Garcia-Garcia et al., 2017). For our application, we started from the popular U-net architecture (Ronneberger et al., 2015), already applied in (Zorzi et al., 2019), and specifically adapted it to segment the four-channel images created as previously described.
The implemented FCN is composed only of convolutional layers, without any fully-connected one, which allows it to operate on an input of any dimension and to produce a segmented output image of corresponding size (Long et al., 2015).
More in detail, the first part of the network consists of a contracting path made of typical convolutional layers, whose task is to extract low- and high-level features, capturing context information (as done by a custom CNN). Each layer in the contracting path performs two convolution operations with filters of size 3 × 3, each followed by batch normalization and a ReLU activation function. Max-pooling of size 2 × 2 is then applied to halve the representation size, with the first layer designed to take as input an image of 256 × 256 pixels and the final layer producing feature maps of size 8 × 8. At each layer the number of feature maps is doubled with respect to the previous one, from 32 maps produced by the first layer to 1024 produced by the last one.
The contracting path is followed by an almost symmetrical one, known as expansive path, whose role is to enable precise localization, allowing a per-pixel labeling (Ronneberger et al., 2015). Each layer of the expansive path consists of an upsampling of the output of the previous layer, a concatenation with the corresponding feature maps from the contracting path and three convolutions with filters of size 3 × 3, followed by batch normalization and a ReLU activation function. The final layer is a 1 × 1 convolution followed by a softmax activation function, which reduces, for each pixel, the 32-component feature vector to a vector of dimension equal to the desired number n of classes. The detailed architecture is shown in Fig. 1.
Please note that the first layer of the FCN takes as input an image of fixed size (in our case, 256 × 256 pixels). Tiling with overlapping windows is thus adopted to process the entire dataset, whose dimensions are arbitrary.
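The sizes quoted for the contracting path can be checked with a few lines of arithmetic. The sketch below (the helper name is hypothetical) only assumes what is stated in the text: the spatial size is halved and the number of feature maps doubled at each level, starting from 256 × 256 pixels and 32 maps.

```python
def contracting_path_shapes(input_size=256, first_maps=32, last_maps=1024):
    """List (spatial size, feature maps) for each level of the contracting path."""
    shapes = []
    size, maps = input_size, first_maps
    while True:
        shapes.append((size, maps))
        if maps >= last_maps:
            break
        size //= 2   # 2 x 2 max-pooling halves the spatial size
        maps *= 2    # the number of feature maps doubles
    return shapes

for size, maps in contracting_path_shapes():
    print(f"{size:>3} x {size:<3} pixels, {maps:>4} feature maps")
```

The loop yields six levels separated by five pooling operations, ending with 8 × 8 maps and 1024 channels, consistent with the figures given above.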
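A possible tiling scheme is sketched below. The overlap of 32 pixels and the function names are assumptions made for illustration, since neither the overlap nor the way overlapping predictions are merged is specified here; images smaller than one window would additionally require padding.

```python
def tile_origins(length, tile=256, overlap=32):
    """Start offsets of overlapping windows covering the range [0, length)."""
    if length <= tile:
        return [0]  # a single window; shorter images need padding
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last window is flush with the border
    return starts

def tile_windows(height, width, tile=256, overlap=32):
    """Top-left corners (row, col) of all windows covering an image."""
    return [(r, c)
            for r in tile_origins(height, tile, overlap)
            for c in tile_origins(width, tile, overlap)]

print(tile_origins(600))  # [0, 224, 344]
```

Placing the last window flush with the image border keeps every window at the fixed input size, at the cost of a larger overlap for the final row and column.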

EXPERIMENTS AND RESULTS
The network was implemented in Keras (Chollet et al., 2015) and run on a PC with an Intel Core i7 CPU, 16 GB of RAM and an NVIDIA GeForce 1080 GPU.

Dataset
To train the network and test its performance in terms of accuracy, efficiency and speed, we employed a large-scale dataset. It covers the entire Friuli Venezia Giulia region (7700 km²) with a mean density of 18 points per m² and it is divided into tiles of 0.2 km². The flights were performed between December 2017 and July 2019, at an average altitude of 500 m above ground level and ensuring an overlap of 30% between adjacent flightlines. In addition to the 3D coordinates of the points, the instrument also recorded the intensity value of the reflected signal (a quantity related to the reflectance properties of the hit target), the total number of returns for each emitted laser pulse, and the return number associated with each echo.
We selected several areas, for a total of 54 km², characterized by different land cover types (urban, rural and forest environments), which were manually classified into five classes: ground, vegetation (height from ground > 2 m), roof, overground and power line. Please note that the overground class contains the objects that do not fall in the other classes, including cars, fences, walls, chimneys and low vegetation (height < 2 m). The classified dataset was then split into 25 km² for training/validation and 29 km² for testing. Table 1 shows the distribution of the points over the classes. One can notice that the dataset is very imbalanced, because of the different shape and size of the scanned objects: the number of points classified as ground and vegetation is much higher than the number of samples falling in the roof and power line classes. As specified in Sec. 4.2, the uneven distribution over the classes is an important aspect to take into account when designing the training of the network.

Training
The model was trained using categorical cross-entropy as loss function, the Adam optimizer (Kingma, Ba, 2014) with a learning rate of 0.0002 and the weight initialization approach described in (Glorot, Bengio, 2010). Using a batch size of eight images (limited by the GPU memory), the process reached convergence after 30 epochs, requiring approximately 60 hours.
As data augmentation strategy, each tile composing the training set was randomly rotated four times and, for each rotated configuration, we extracted 240 images of size 256 × 256. This approach proved to be fundamental to prevent the network from learning a specific scan pattern, which would have led to inaccurate results when classifying point clouds acquired along a different flight direction, as demonstrated by our experiments. Moreover, to take into account the imbalance of the point distribution over the classes, we ensured that 30% and 35% of the training images contained pixels belonging to roof and power line, respectively, which are the under-represented classes.
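The rotation-based augmentation can be sketched as follows. This is an illustrative assumption rather than the procedure actually used: the text does not state whether the rotation is applied to the point cloud or to the rasterized image, nor how the angles are drawn. Here the points of a tile are rotated about the vertical axis by angles sampled uniformly at random, and the function names are hypothetical.

```python
import numpy as np

def rotate_about_z(points, angle_rad):
    """Rotate an (N, 3) point cloud about the vertical axis through its centroid."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s], [s, c]])
    centre = points[:, :2].mean(axis=0)
    out = points.copy()
    out[:, :2] = (points[:, :2] - centre) @ rot.T + centre
    return out  # heights are left unchanged

def augment_tile(points, n_rotations=4, seed=0):
    """Return n_rotations randomly rotated copies of a tile (assumed scheme)."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_rotations)
    return [rotate_about_z(points, a) for a in angles]
```

Each rotated copy would then be rasterized and cropped into 256 × 256 training images, so that the scan lines appear under different orientations.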

Testing
The proposed method reached an overall accuracy of 92.9% on the test set, while the average per-class accuracy is 86.3%. Table 2 shows precision (i.e., the number of points correctly classified as x divided by the number of points classified by the algorithm as x), recall (i.e., the number of points correctly classified as x divided by the number of points belonging to class x) and F1-score (i.e., the harmonic mean of precision and recall) for each class. As can also be noticed from the confusion matrix represented in Fig. 3, the algorithm performs well even for a challenging class such as power line; on the other hand, overground is often misclassified. This is mainly due to the fact that this class contains different objects, including low vegetation that the network usually labels as vegetation. However, this error can easily be corrected by simple height thresholding.
To further investigate the performance of the method, we divided the test set into two main scenarios, (i) urban and flat areas and (ii) mountainous environments, and independently evaluated the results for the two area types. Figure 4 shows the confusion matrices for the analyzed cases. The proposed algorithm achieved an overall accuracy of 95.8% in urban and flat areas, while ensuring good performance in mountainous environments as well (92.2%). One can notice that the accuracy of the vegetation, roof and power line classes does not change significantly between the two scenarios, reaching high values even in forest environments and high alpine terrain. In the presence of complex topography and steep slopes, instead, low vegetation (belonging to the overground class) and ground are sometimes confused (a behavior also highlighted in (Winiwarter et al., 2019)), causing a decrease in the classification accuracy of the ground and overground classes. Some results for different area types are presented in Fig. 5.
In order to manage large datasets, the classification algorithm is required to be computationally efficient: our approach showed an inference time of only 11 minutes per km² (including the time for reading and writing point cloud files in LAS format), which allowed its application to the whole set of 7700 km² (≈ 138 × 10⁹ points), significantly reducing the large amount of time that companies usually spend on manual classification. Parallelizing the projection of the point cloud onto the image could lead to further improvements in terms of time efficiency and productivity.
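The metrics defined above can all be derived from a confusion matrix, as in the following sketch. The function name and the toy matrix are illustrative only, and the average per-class accuracy is assumed here to be the mean of the per-class recalls, which the text does not state explicitly.

```python
import numpy as np

def classification_metrics(conf):
    """Metrics from a confusion matrix.

    conf[i, j] is the number of points of true class i labeled as class j.
    Classes that are never predicted or never present give a division by zero.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                    # correctly classified points
    precision = tp / conf.sum(axis=0)     # over points labeled as x
    recall = tp / conf.sum(axis=1)        # over points belonging to x
    f1 = 2 * precision * recall / (precision + recall)
    overall_accuracy = tp.sum() / conf.sum()
    mean_class_accuracy = recall.mean()   # assumed definition
    return precision, recall, f1, overall_accuracy, mean_class_accuracy

# Toy two-class example (not data from this paper)
p, r, f1, oa, mca = classification_metrics([[8, 2], [1, 9]])
```

With a strongly imbalanced dataset, the overall accuracy is dominated by the most populated classes, which is why per-class figures are reported alongside it.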
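To put these figures in perspective, the short calculation below converts the reported inference time into a total processing time and checks the point count against the nominal density. It assumes strictly sequential processing on a single machine, which is an assumption of this sketch and is not stated in the text.

```python
MINUTES_PER_KM2 = 11      # reported inference time, LAS input/output included
AREA_KM2 = 7700           # surveyed area
TOTAL_POINTS = 138e9      # approximate number of points in the dataset

def sequential_inference_days(area_km2=AREA_KM2, minutes_per_km2=MINUTES_PER_KM2):
    """Total inference time in days if tiles are processed one after another."""
    return area_km2 * minutes_per_km2 / 60 / 24

def mean_density_per_m2(total_points=TOTAL_POINTS, area_km2=AREA_KM2):
    """Mean point density implied by the point count and the area."""
    return total_points / (area_km2 * 1e6)

print(round(sequential_inference_days(), 1))  # 58.8
print(round(mean_density_per_m2(), 1))        # 17.9
```

Under this assumption the whole region requires about two months of computation on one machine, and the implied density of about 17.9 points per m² agrees with the nominal 18 points per m².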

CONCLUSION
The good trade-off between speed and accuracy that characterizes end-to-end deep learning approaches makes these algorithms superior to established methods based on hand-crafted features.
In this paper we presented a deep learning approach for point cloud classification that, thanks to an overall accuracy of 92.9% and a low inference time (11 minutes per km²), was effectively applied to the classification of an entire large-scale dataset of 7700 km², covering the whole Friuli Venezia Giulia region (Italy).
Half of the region is mountainous, yet the algorithm performs satisfactorily even in challenging environments such as alpine areas. The network was trained on a subset of only 25 km², and the accuracy reached allowed a massive reduction of the manual work that is usually spent correcting the misclassification errors produced by commercial software routines.
As future work, we will test other network architectures, focusing on the implementation of methods whose computing time is compatible with the processing of large-scale datasets such as the one presented in this paper. Particular attention will be dedicated to mountainous environments, which pose stimulating challenges.