1 Introduction

Accurate and real-time 3D scene perception and understanding are essential tasks in environmental remote sensing. As 3D sensor technology advances, the use of 3D data has become more prevalent, and 2D spatial data represented by maps and images are no longer sufficient to meet the demands of 3D spatial cognition, research, and real-world applications. Point cloud semantic segmentation, widely applied in robotics, augmented reality, autonomous driving and smart cities, has become a prominent research topic in environmental perception. The goal of point cloud semantic segmentation is to use computer vision methods to assign a semantic label to each point in a three-dimensional point cloud scene, providing an accurate description of the object categories in the scene (Huang et al., 2023; Zhou et al., 2022; Guo et al., 2023; Wu et al., 2021; Han et al., 2021).

Successful deep learning approaches, including Recurrent Neural Networks (RNNs) for natural language processing and Convolutional Neural Networks (CNNs) for structured images, have achieved remarkable performance. However, because RNNs were originally designed for processing ordered sequences and CNNs were originally developed for structured image data, they face challenges when applied to disordered and unstructured point cloud data (Qi et al., 2017a). In particular, large-scale point cloud scenes contain a considerable number of points, ranging into the millions or even billions (Zeng et al., 2022). To process such large-scale scenes more efficiently, encoder-decoder network architectures with down-sampling are widely used for accurate segmentation (Hu et al., 2020; Du et al., 2021; Qiu et al., 2021; Shuai et al., 2021; Fan et al., 2021; Zeng et al., 2022a, b). Their success is largely attributed to the encoder, which performs a high-to-low process, aiming to explore deeper features and enlarge the receptive field (Zeng et al., 2023). Additionally, it introduces input perturbation and reduces overfitting. The encoder usually consists of feature extraction modules and feature aggregation modules, which are the focus of our study.

Fig. 1
figure 1

Architecture of the proposed LFEA-Net. LFEA: Local Feature Extraction and Aggregation module, MLP: multi-layer perceptron

The feature extraction module encodes the information of point clouds, comprehensively exploring the representation of each point. Existing methods typically linearly combine the spatial coordinates and semantic features of the central point and its neighbors to enhance the model’s local perception capabilities (Hu et al., 2020; Qiu et al., 2021; Fan et al., 2021; Shuai et al., 2021). However, the color information of the point cloud is only employed as an initial feature input to the network and is not combined in the feature extraction module as the spatial coordinates are. In some scenes, objects with very similar spatial coordinates cannot be effectively separated by encoding spatial coordinates alone, whereas exploiting the color differences between these points benefits semantic segmentation. In this paper, we design a local feature extraction (LFE) module that fully learns the local neighborhood information of the point cloud by encoding spatial, color, and semantic information. Instead of treating color information solely as an initial feature, it is actively integrated into the learning process.

The feature aggregation module aggregates the locally learned neighborhood features from the feature extraction module onto the central point. Existing methods commonly employ a simple concatenation of spatial and semantic information followed by max pooling (Qi et al., 2017a, b; Wang et al., 2019; Qian et al., 2022). However, a significant gap exists between the encodings of spatial and semantic information, and a simple concatenation may not effectively bridge it. In addition, max pooling can discard much of the local information, limiting its effectiveness in fine-grained semantic segmentation. This paper presents a local feature aggregation (LFA) module designed to address these issues. First, a soft cross operation is employed to adaptively fuse the encoded spatial, color, and semantic information. Then, a combination of max, average, and sum pooling is used to aggregate both the most significant local features and the entire local neighborhood onto the central point, enhancing the model’s ability to perform fine-grained semantic segmentation.

We collectively refer to the proposed Local Feature Extraction (LFE) and Local Feature Aggregation (LFA) as the Local Feature Extraction and Aggregation (LFEA) module. Based on the LFEA module, we have constructed an end-to-end semantic segmentation neural network called LFEA-Net, which utilizes an encoder-decoder architecture. Our key contributions can be summarized as follows:

  • We design a local feature extraction (LFE) module that takes full advantage of local spatial, color and semantic information.

  • We design a local feature aggregation (LFA) module for a more effective aggregation of local information.

  • We propose LFEA-Net for point cloud semantic segmentation, which achieves excellent performance on the city-scale photogrammetric benchmark SensatUrban.

2 Method

2.1 Network architecture

The proposed LFEA-Net follows an encoder-decoder structure. Figure 1 illustrates the detailed architecture of the proposed network. Each encoding layer comprises a downsampling operation followed by an LFEA module, which consists of an LFE module and an LFA module. Farthest point sampling is used to downsample the point cloud. The network progressively processes the point cloud at lower resolutions (\(N\rightarrow \ \frac{N}{4}\rightarrow \ \frac{N}{16}\rightarrow \ \frac{N}{64}\rightarrow \ \frac{N}{256}\)), while increasing the channel dimensions (\(16\rightarrow 32\rightarrow 128\rightarrow 256\rightarrow 512\)). In this respect, the processing of 3D point clouds resembles that of traditional CNNs, which expand channel dimensions while reducing image size for greater computational efficiency.

Each decoding layer includes an upsampling operation and a multi-layer perceptron. The transfer of learned features between the encoder and decoder is achieved through skip connections. Finally, three fully connected layers, along with a dropout layer and a softmax layer, are used to predict the label classification scores for each point.
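To make the encoder schedule concrete, the following minimal sketch shows how each encoding layer could chain farthest point sampling with an LFEA module. It is our own illustration, not the authors' released code: farthest_point_sample and lfea_module are hypothetical stand-ins for the operations described above, and only the resolution and channel progression is intended to be faithful to the text.

```python
import tensorflow as tf

# Hypothetical helpers, assumed for illustration only:
#   farthest_point_sample(xyz, m)          -> (B, m) indices of the retained points
#   lfea_module(xyz, rgb, feats, c_out)    -> (B, N, c_out) features (LFE + LFA)
DOWNSAMPLE_RATIO = 4                        # N -> N/4 -> N/16 -> N/64 -> N/256
CHANNELS = [16, 32, 128, 256, 512]          # first entry: initial point-wise MLP width (assumed)

def encoder(xyz, rgb, feats):
    """Data-flow sketch of the encoder; returns bottleneck and per-layer skip features."""
    skips = []
    for c_out in CHANNELS[1:]:
        m = tf.shape(xyz)[1] // DOWNSAMPLE_RATIO
        idx = farthest_point_sample(xyz, m)           # downsampling first ...
        xyz = tf.gather(xyz, idx, batch_dims=1)
        rgb = tf.gather(rgb, idx, batch_dims=1)
        feats = tf.gather(feats, idx, batch_dims=1)
        feats = lfea_module(xyz, rgb, feats, c_out)   # ... then the LFEA module
        skips.append(feats)
    return feats, skips
```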

Fig. 2
figure 2

Illustration of the proposed Local Feature Extraction and Aggregation (LFEA) module. (a) Bilateral Feature Encoding Unit, (b) Multidimensional Feature Encoding Unit, and (c) Local Feature Aggregation Module (LFA). \(\oplus\): concatenate, mlp: multi-layer perceptron

2.2 Local feature extraction module

To effectively encode local neighbor features, and in particular to enable the network to focus on spatial, color, and semantic features simultaneously, we introduce the LFE module. The LFE module comprises two encodings: a bilateral feature encoding and a multidimensional feature encoding.

2.2.1 Bilateral feature encoding

Available point cloud datasets typically include spatial coordinates xyz and color information rgb as intrinsic information. Existing methods for local feature encoding primarily focus on the spatial coordinates xyz and semantic features f, leading to underutilization of the color information rgb: it is used solely as an initial feature input to the network and is not encoded in each encoding layer as the spatial coordinates xyz are.

The proposed bilateral feature encoding effectively captures both the spatial geometry and color information of point clouds, learning both geometric structure and color differences. For each point \(p_i\) and its corresponding color information \(c_i\) in the point cloud P, the spatial geometry and color information of the nearest K points (determined by K-nearest neighbor (KNN) based on the Euclidean distance in 3D space) are fused into the central point to gather context information. In Fig. 2a, we concatenate the absolute position \(p_i\) of the centroid and the relative distance \(\left(p_i-p_i^k\right)\) as the spatial encoding. Similarly, we concatenate the color information \(c_i\) corresponding to the center point \(p_i\) and the relative color difference \(\left(c_i-c_i^k\right)\). The spatial encoding \(P_i^k\) and color encoding \(C_i^k\) can be represented as follows,

$$\begin{aligned} P_i^k = p_i \oplus \left( p_i-p_i^k\right) . \end{aligned}$$
(1)
$$\begin{aligned} C_i^k = c_i \oplus \left( c_i-c_i^k\right) . \end{aligned}$$
(2)

where \(\oplus\) denotes the concatenation operation. Bilateral feature encoding can be expressed by combining the spatial encoding and the color encoding. We present the spatial information encoding and color information encoding separately for clarity. In reality, the spatial coordinates xyz and the color information rgb of point clouds are input to the network as the same set of 6D vectors. This implies that the spatial coordinates xyz and the color information rgb are treated as the same variable in the actual calculation. The bilateral feature encoding BF can be represented as follows,

$$\begin{aligned} BF=\left( P_i^k\oplus C_i^k\right) . \end{aligned}$$
(3)
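As a concrete illustration of Eqs. (1)-(3), the TensorFlow sketch below (our own, assuming batched tensors and precomputed K-nearest-neighbour indices, not any released implementation) applies the same centre-plus-difference encoding once to the coordinates and once to the colors, then concatenates the results:

```python
import tensorflow as tf

def relative_encoding(center, neighbors):
    """Concatenate the centre value with its difference to each neighbour (Eqs. 1, 2, 4).

    center:    (B, N, C)     value at the centroid
    neighbors: (B, N, K, C)  values gathered at the K nearest neighbours
    returns:   (B, N, K, 2C)
    """
    center = tf.expand_dims(center, axis=2)                  # (B, N, 1, C)
    tiled = tf.broadcast_to(center, tf.shape(neighbors))     # repeat over the K neighbours
    return tf.concat([tiled, tiled - neighbors], axis=-1)

def bilateral_feature_encoding(xyz, rgb, neigh_idx):
    """BF of Eq. (3): spatial encoding P_i^k concatenated with color encoding C_i^k."""
    neigh_xyz = tf.gather(xyz, neigh_idx, batch_dims=1)      # (B, N, K, 3)
    neigh_rgb = tf.gather(rgb, neigh_idx, batch_dims=1)
    spatial = relative_encoding(xyz, neigh_xyz)              # Eq. (1)
    color = relative_encoding(rgb, neigh_rgb)                # Eq. (2)
    return tf.concat([spatial, color], axis=-1)              # Eq. (3)
```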

2.2.2 Multidimensional feature encoding

The multidimensional feature encoding unit effectively captures the semantic features of the input point cloud from the previous layer, allowing the network to fully utilize local semantic information. This unit encodes the semantic feature \(f_i\) corresponding to each point \(p_i\), as well as its neighboring points \(p_i^k\) and their corresponding features \(f_i^k\). As shown in Fig. 2b, the multidimensional feature encoding concatenates the semantic feature \(f_i\) with the relative feature difference \(\left(f_i-f_i^k\right)\), and can be represented as follows,

$$\begin{aligned} MF = f_i \oplus \left( f_i-f_i^k\right) . \end{aligned}$$
(4)

Finally, the output of the LFE module comprises two sets of encoding: bilateral feature encoding BF and multidimensional feature encoding MF. Together, they represent the local information of point clouds.
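Under the same assumptions as the sketch above (batched tensors, precomputed neighbor indices, and the relative_encoding helper defined there), Eq. (4) follows the same pattern applied to the semantic features of the previous layer, and the LFE output is simply the pair (BF, MF):

```python
def multidimensional_feature_encoding(feats, neigh_idx):
    """MF of Eq. (4): semantic feature f_i concatenated with (f_i - f_i^k)."""
    neigh_feats = tf.gather(feats, neigh_idx, batch_dims=1)   # (B, N, K, D)
    return relative_encoding(feats, neigh_feats)              # (B, N, K, 2D)

def lfe_module(xyz, rgb, feats, neigh_idx):
    """Output of the LFE module: the two encodings describing the local neighbourhood."""
    return (bilateral_feature_encoding(xyz, rgb, neigh_idx),
            multidimensional_feature_encoding(feats, neigh_idx))
```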

2.3 Local feature aggregation module

The LFE module produces two sets of feature encodings, but simply concatenating them would not effectively utilize the local information they contain. Moreover, aggregating local features with max pooling alone limits fine-grained semantic segmentation. To address this, we propose the local feature aggregation (LFA) module to aggregate the encodings effectively and exploit the local information. The LFA comprises two operations: a soft cross unit and a united pooling unit.

Table 1 Quantitative comparison on SensatUrban. Bold indicates the best result, underline indicates the sub-best result

2.3.1 Soft cross operation

The two units described above output the bilateral feature encoding BF and the multidimensional feature encoding MF, but BF and MF alone cannot represent the local features well for two reasons: (1) MF only perceives the spatial structure and color information carried over from the previous layer, while BF only contains the spatial geometric structure and color information; (2) there is a semantic gap between the spatial geometric structure and color information (BF) and the abstract semantic features (MF), and a simple concatenation can hardly bridge this gap (Qiu et al., 2021).

To address the above problems and strengthen the generalization capability of the features, the soft cross operation is proposed to augment the local contextual information and bridge the semantic gap between BF and MF. Specifically, we first use a learnable parameter \(\alpha\) to softly select the representations generated by the bilateral and multidimensional feature encodings. Then we concatenate the two soft-weighted encodings and apply a multi-layer perceptron. The soft cross operation can be expressed as,

$$\begin{aligned} CF=mlp[\alpha \cdot {BF}\oplus (1-\alpha ) \cdot {MF}]. \end{aligned}$$
(5)
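A minimal Keras-style sketch of Eq. (5) is given below. It is our own illustration: the single scalar \(\alpha\) squashed to (0, 1) with a sigmoid and the single Dense layer standing in for the mlp are assumptions, as the text does not specify these details.

```python
import tensorflow as tf

class SoftCross(tf.keras.layers.Layer):
    """Soft cross operation (Eq. 5): a learnable alpha weights BF against MF."""

    def __init__(self, out_channels):
        super().__init__()
        # Scalar alpha kept in (0, 1) via a sigmoid; the squashing is our assumption,
        # the paper only states that alpha is learnable.
        self.alpha_logit = self.add_weight(name='alpha', shape=(), initializer='zeros')
        self.mlp = tf.keras.layers.Dense(out_channels, activation='relu')

    def call(self, bf, mf):
        alpha = tf.sigmoid(self.alpha_logit)
        fused = tf.concat([alpha * bf, (1.0 - alpha) * mf], axis=-1)  # soft selection + concat
        return self.mlp(fused)                                        # CF
```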

2.3.2 United pooling operation

Point-wise feature representations are crucial for semantic segmentation. Existing work typically uses max pooling to aggregate local features, but this discards most of the local information. To address this problem, and to extract locally significant features while considering the entire local neighborhood, we introduce a united pooling operation UP that combines max, mean, and sum pooling to refine the soft cross features CF as follows,

$$\begin{aligned} UP=mlp[max(CF) + mean(CF) + sum(CF)]. \end{aligned}$$
(6)
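Assuming CF is laid out as (batch, points, neighbours, channels), Eq. (6) can be sketched as below; the shared mlp is again represented by a point-wise Dense layer, which is our simplification.

```python
def united_pooling(cf, mlp):
    """United pooling (Eq. 6): sum of max, mean and sum pooling over the K neighbours.

    cf:  (B, N, K, C) soft-cross features
    mlp: a shared point-wise layer, e.g. tf.keras.layers.Dense(out_channels)
    """
    pooled = (tf.reduce_max(cf, axis=2)
              + tf.reduce_mean(cf, axis=2)
              + tf.reduce_sum(cf, axis=2))        # (B, N, C)
    return mlp(pooled)
```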

Overall, our LFE and LFA modules are designed to effectively learn local contextual features by explicitly considering spatial, color and semantic information; the soft cross operation then adaptively bridges their semantic gap, and united pooling finally aggregates the neighborhood onto the centroid. We refer to these two modules together as the LFEA module.

Fig. 3
figure 3

Visual comparison on SensatUrban

3 Experiments

3.1 Experimental setting

We use the city-scale photogrammetric benchmark SensatUrban to evaluate the effectiveness of LFEA-Net in semantic segmentation. Per-class intersection over union (IoU), mean IoU (mIoU), and overall accuracy (OA) are used as evaluation metrics to quantitatively analyze the segmentation performance. The effectiveness of each component of the network is then evaluated through ablation studies.

$$\begin{aligned} IoU=\frac{TP}{TP+FP+FN} \end{aligned}$$
(7)
$$\begin{aligned} mIoU=\frac{\sum _{i=1}^{n}{IoU}_i}{n} \end{aligned}$$
(8)
$$\begin{aligned} OA=\frac{T P}{N} \end{aligned}$$
(9)

where TP, FP, and FN represent true positive, false positive, and false negative samples respectively, while n and N denote the number of semantic classes and the total number of samples respectively. The number of points fed into the network is 40960, and the proposed network is trained with a batch size of 4 for 100 epochs using the Adam optimizer with default parameters to optimize the cross-entropy loss. The initial learning rate is set to 0.01, and the search range K of the KNN is set to 16. All experiments are conducted on a single Nvidia RTX 2080Ti GPU with 11 GB of memory using Python 3.7 and TensorFlow 2.4.
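For reference, the three metrics of Eqs. (7)-(9) can be computed from flattened prediction and ground-truth label arrays as in the following NumPy sketch; this is our own convenience implementation, not the benchmark's evaluation code.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Per-class IoU (Eq. 7), mIoU (Eq. 8) and OA (Eq. 9) from flat integer label arrays."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)               # confusion matrix: rows = ground truth, cols = prediction
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp                   # predicted as class c but actually another class
    fn = conf.sum(axis=1) - tp                   # class c predicted as something else
    iou = tp / np.maximum(tp + fp + fn, 1.0)     # guard against classes absent from the split
    return iou, iou.mean(), tp.sum() / conf.sum()

# Example: iou, miou, oa = segmentation_metrics(pred_labels, gt_labels, num_classes=13)
```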

3.2 Evaluation on SensatUrban

The SensatUrban dataset (Hu et al., 2021) is a city-scale photogrammetric point cloud dataset covering 7.6 km\(^2\) of the city landscape, consisting of nearly 3 billion richly-annotated 3D points. Each point is labeled with one of 13 semantic categories, including: ground, vegetation, building, wall, bridge, parking, rail, car, footpath, bike, water, traffic road, and street furniture. The color attributes are utilized for training. The network is trained following the official splits, and its performance is evaluated on the online test server.

Table 1 presents the quantitative evaluation of our LFEA-Net compared to state-of-the-art methods on the SensatUrban dataset. Our method surpasses others by achieving 92.4% and 61.6% in overall accuracy and mIoU, respectively, representing a 2.6% and 8.9% improvement over RandLA-Net (Hu et al., 2020). Figure 3 provides a visual comparison of the segmentation results of RandLA-Net (Hu et al., 2020) and LFEA-Net on the SensatUrban dataset. As the dataset’s test set is evaluated online and its ground truth is not publicly available, we present visualizations on the validation set for a more accurate representation of the results. The red box highlights regions where the segmentation of (Hu et al., 2020) is incorrect or the boundary delineation is indistinct. Notably, our method can unambiguously discern railways from the ground, whereas RandLA-Net struggles to do so. Additionally, LFEA-Net exhibits significant improvements on the parking and traffic road categories. Considering both the qualitative and quantitative evaluations, our LFEA-Net efficiently categorizes the semantics within expansive point cloud scenes.

3.3 Ablation study

In this section, we conduct ablation studies to quantitatively and qualitatively evaluate the effectiveness of the individual units within the Local Feature Extraction (LFE) and Local Feature Aggregation (LFA) modules. All ablation networks are trained on SensatUrban and tested on its validation set. The visualization results are shown in Fig. 4.

Fig. 4
figure 4

Visual comparison of ablation networks

3.4 Ablation study on LFE

The LFE module is designed to optimally leverage local neighbor information by concatenating neighboring spatial, color, and semantic information with the centroid, as described in Section 2.2. In this study, we investigate the influence of different information encoding methods on the local information encoding module. The following ablation experiments are conducted to evaluate the effects of each encoding type: (A1) no encoding, replaced with a multi-layer perceptron; (A2) only semantic information; (A3) semantic combined with spatial information, as in recent works; (A4) all three encodings (spatial, color, and semantic), as used in our approach.

The results of the ablation studies are presented in Table 2. The following observations can be made: (1) the encoding operation leads to a significant improvement compared to no encoding; (2) encoding all three types together achieves higher performance than encoding only spatial and semantic information. In conclusion, the effectiveness of encoding spatial, color, and semantic information is evident.

Table 2 Ablation studies of the LFE module on the SensatUrban validation set

3.5 Ablation study on LFA

As explained in Section 2.3, the LFA module is designed for a more comprehensive and effective aggregation of local information. In this section, we quantitatively investigate the effects of the soft cross and united pooling operations on the LFA module. In models B1 and B2, we use the cross-encoding operation with various types of pooling, while in models B3 and B4, we do not use the cross-encoding operation (instead, we simply concatenate the bilateral feature encoding and the multidimensional feature encoding).

The ablation study results, presented in Table 3, reveal that networks with the soft cross operation exhibit significantly improved accuracy compared to those without it, regardless of the pooling operation used. Similarly, networks with the united pooling operation exhibit significantly improved accuracy compared to those that only use max pooling, regardless of the soft cross operation employed. Overall, the results demonstrate the effectiveness of the soft cross and united pooling operations.

Table 3 Ablation studies of the LFA module on the SensatUrban validation set

4 Conclusion

In this study, we propose LFEA-Net, a model designed for large-scale point cloud semantic segmentation. The key contribution of this study is the introduction of the LFEA module, which explicitly encodes spatial coordinates, color information, and semantic information, applies a soft cross operation to fuse them, and uses united max, mean, and sum pooling to learn and aggregate local contextual features. The proposed LFEA-Net demonstrates excellent performance on the city-scale photogrammetric benchmark SensatUrban. However, a limitation of the proposed LFEA-Net is that the color information encoding can only be applied to point cloud data with available color information, which may not be present in some datasets. In future work, we will explore more general and effective methods for the semantic segmentation of large-scale point cloud scenes that are applicable to point cloud data without color information while still capturing local context efficiently.