Article

Integrating Contextual Information and Attention Mechanisms with Sparse Convolution for the Extraction of Internal Objects within Buildings from Three-Dimensional Point Clouds

1 School of Surveying and Geo-Informatics, Shandong Jianzhu University, Jinan 250101, China
2 801 Institute of Hydrogeology and Engineering Geology, Shandong Provincial Bureau of Geology & Mineral Resources, Jinan 250101, China
* Author to whom correspondence should be addressed.
Buildings 2024, 14(3), 636; https://doi.org/10.3390/buildings14030636
Submission received: 4 January 2024 / Revised: 23 February 2024 / Accepted: 25 February 2024 / Published: 28 February 2024
(This article belongs to the Section Construction Management, and Computers & Digitization)

Abstract

Deep learning-based point cloud semantic segmentation has gained popularity over time, with sparse convolution being the most prominent example. Although sparse convolution is more efficient than regular convolution, it sacrifices global contextual information. To solve this problem, this paper proposes the OcspareNet network, which uses sparse convolution as the backbone and captures global contextual information using an offset attention module and a context aggregation module. The offset attention module improves the network’s capacity to obtain global contextual information about the point cloud. The context aggregation module exploits contextual information in both the training and testing phases, which strengthens the network’s ability to discern the overall scene structure and improves segmentation accuracy for difficult categories. Compared to state-of-the-art (SOTA) models, our model has a smaller parameter count and achieves higher accuracy on challenging segmentation categories such as ‘pictures’, ‘counters’, and ‘desks’ in the ScanNetV2 dataset, with IoU scores of 41.1%, 70.3%, and 72.5%, respectively. Furthermore, ablation experiments confirmed the efficacy of the designed modules.

1. Introduction

Processing point cloud data poses unique challenges due to their inherent characteristics, including disorder, sparseness, and lack of structure. Traditional point cloud segmentation methods, such as edge detection-based approaches, region growing algorithms, and model fitting-based methods, often rely on manually designed features and involve substantial computational overhead. Consequently, these methods scale poorly to large volumes of point cloud data, whereas deep learning-based methods can address these problems. Current research on point cloud semantic segmentation can be divided into supervised and unsupervised (or weakly supervised) methods, depending on the degree of reliance on annotated data. Supervised methods can be categorized into three main approaches: multi-view-based methods, voxelization-based methods, and point cloud-based methods.
The multi-view-based approach starts by projecting the point cloud onto a two-dimensional plane. Features are then extracted by applying convolution to the 2D images and are ultimately fused to generate the output. The Multi-view Convolutional Neural Network (MVCNN) [1] obtains 2D images of the point cloud by simulating cameras from different angles, extracts features using a CNN, and aggregates them through max pooling; however, the extracted features are highly redundant. Inspired by the lightweight network SqueezeNet [2], SqueezeSeg [3] uses spherical projection to obtain a front view suitable for 2D convolution, extracts features with a convolutional network, and further refines the segmentation results using a conditional random field (CRF) formulated as an RNN layer. SqueezeSegV2 [4] builds upon SqueezeSeg, effectively minimizing the adverse effects of noise while harnessing synthetic data more efficiently. RangeNet++ [5] uses a lightweight CNN for feature extraction. Robert et al. [6] proposed a model that effectively integrates image features from arbitrary positions, thereby fully exploiting the features of multiple views and point clouds.
Point cloud-based techniques can be further divided into four categories: RNN-based, graph convolution-based, multilayer perceptron-based, and point convolution-based techniques. Among multilayer perceptron-based methods, PointNet [7] pioneered the direct processing of point clouds; through a carefully designed architecture, it extracts global features from point clouds but cannot capture local features. PointNet++ [8], building upon PointNet, addresses this limitation by introducing hierarchical feature extraction, yet it still does not fully leverage interpoint structural information. Different from RandLA-Net [9], PointWeb [10] excels in capturing contextual information within local regions and models relationships between points more effectively. Qian et al. [11] introduced PointNeXt, which enhances PointNet++ through improved training strategies and architectural modifications. Inspired by prior work, Lin et al. [12] proposed a unified framework for the fair analysis of point cloud models. Zhu et al. [13] proposed a universal 3D pretraining framework named PonderV2. In contrast to previous enhancements, Zhong et al. [14] improved semantic segmentation results by mitigating the neural collapse phenomenon. Deng et al. [15] proposed vector-oriented point set abstraction to reduce the parameter count of PointNeXt.
Graph convolution-based methods combine graph structures with convolutional neural networks, relying on the nodes of the graph for information transfer and capturing graph dependencies. These methods excel in modeling the relationships between each node and its neighbors, enhancing point cloud segmentation accuracy through rich edge features. SPGraph [16] represents point clouds using a superpoint graph, employing graph convolution to learn contextual features and extending the scale of point clouds that can be processed. The Dynamic Graph CNN (DGCNN) [17] addresses PointNet’s disregard for correlations between neighboring points by employing edge convolutions. Robert et al. [18] combined the superpoint structure with self-attention mechanisms to expedite preprocessing and reduce the model parameter count. A typical point convolution-based network is Kernel Point Convolution (KPConv) [19], which overcomes the limitations of point convolution by utilizing kernel point convolutions, leading to improved results. Xiang et al. [20] designed a pyramid structure for refining features, which effectively improves the performance of the backbone network.
A typical RNN-based network is Recurrent Slice Networks (RSNet) [21], which achieves improved segmentation results by incorporating a local dependency module.
Voxelization-based methods convert disordered point clouds into structured voxel representations and subsequently conduct feature extraction on the voxelized data. VoxNet [22] represents point clouds as voxels and utilizes 3D CNNs for feature extraction. Nevertheless, the employment of a fixed grid introduces the challenge of potential data misalignment issues and imposes a notable computational burden. SegCloud [23] addresses the shortcomings of previous voxelization methods, producing refined voxel outputs with improved granularity in the results. PointGrid [24] overcomes the grid size limitation of VoxNet by employing random sampling and ‘0’ padding. In comparison to SegCloud, PointGrid offers a simpler training and testing process. However, PointGrid is still limited by the inadequate acquisition of geometric contextual information. By incorporating K-D trees and octrees, Kd-Net [25] and OctNet [26] effectively transform irregular point cloud data into a structured format, significantly reducing computational overhead. However, a notable drawback of these methods is their sensitivity to noise points. 3DcontextNet [27] follows a tree-like structure for feature learning, capturing local structures of point clouds more effectively. However, due to its tree-based structure, 3DcontextNet suffers from the disadvantage of voxel boundary dependence and fails to fully utilize the point cloud local structure information. The SSCN [28] addresses the inefficiency of convolutional networks in handling sparse point cloud data by proposing a submanifold sparse convolutional network. Nonetheless, the SSCN still suffers from the drawback of losing global contextual information. MinkowskiNet [29] utilizes generalized sparse convolution for processing high-dimensional data. Rozenberszki et al. [30] achieved a higher accuracy by combining sparse convolution with pretrained CLIP models. Wang et al. [31] introduced a transformer structure based on octrees, improving computational efficiency.
Based on the above analysis, voxel-based methods currently achieve better accuracy. Early approaches suffered from high computational costs and memory consumption. Although subsequent tree-structured methods reduce the computational effort, they remain sensitive to noise. Sparse convolution is more efficient but may lose global contextual information.
Taking into account the merits and drawbacks of these methods, this study introduces a model named OcspareNet. The network enhances the ability of a sparse convolutional backbone to capture global contextual information by employing offset attention and context aggregation modules, improving semantic segmentation performance in complex indoor scenes and thereby increasing segmentation accuracy for categories that are prone to confusion. Our main contributions are as follows:
(1) Context aggregation module: The context aggregation module enhances the network’s capability to acquire long-range information during both the training and testing phases. This module improves the network’s ability to assess the overall structure, effectively increasing segmentation accuracy in complex scenes.
(2) Offset attention module: The offset attention module efficiently captures global contextual information and refines attention weights, thereby mitigating the impact of noise.
(3) Enhancement of accuracy in complex segmentation categories: Compared to state-of-the-art (SOTA) models, our model demonstrates higher accuracy in complex segmentation categories on the ScanNetV2 dataset. Ablation experiments confirmed that the designed modules enhance the sparse convolutional backbone’s ability to capture global contextual information, thereby improving the segmentation accuracy of complex indoor objects.

2. Methods

2.1. Network Architecture

The architectural diagram of the network model comprises three components: voxelization, feature extraction network, and devoxelization. To integrate irregular point cloud data into the network, we utilize an initial voxelization process, transforming the point cloud data into a structured format. Subsequently, the voxelized point cloud data undergo feature extraction within the designed network. Via the devoxelization process, the output is transformed into labels for each point, thereby accomplishing the task of point cloud semantic segmentation. The model architecture, based on sparse convolution, is depicted in Figure 1.

2.2. Voxelization and Devoxelization

The input point cloud is denoted as $M$ and is defined as follows:
$$M = \{(p_i, f_i)\}$$
where $p_i$ and $f_i$ represent the coordinates and color of each point, respectively.
The index $p_i$ of the $i$-th voxelized point is represented as follows:
$$p_i = (a, b, c)$$
The variables $a$, $b$, and $c$ are calculated as follows:
$$a = \operatorname{int}\!\left(\frac{x_i}{v}\right), \quad b = \operatorname{int}\!\left(\frac{y_i}{v}\right), \quad c = \operatorname{int}\!\left(\frac{z_i}{v}\right)$$
where $v$ denotes the resolution of point cloud sampling and $x_i$, $y_i$, $z_i$ are the coordinates of the input point cloud.
The feature vector $f_{w,d,h}$ of each voxel input to the network is expressed as follows:
$$f_{w,d,h} = \frac{1}{M}\sum_{i=1}^{M} \chi_{w,d,h}\big(p_i(a,b,c)\big)\, f_i$$
where $M$ is the number of input points in the point cloud and $f_i$ represents the color information of each point.
The indicator $\chi_{w,d,h}\big(p_i(a,b,c)\big)$ is defined as follows:
$$\chi_{w,d,h}\big(p_i(a,b,c)\big) = \begin{cases} 1, & (a,b,c) = (w,d,h) \\ 0, & \text{otherwise} \end{cases}$$
The voxel indices $w$, $d$, and $h$ take values in the following ranges:
$$w \in \big[0, \operatorname{int}(W/v)\big), \quad d \in \big[0, \operatorname{int}(D/v)\big), \quad h \in \big[0, \operatorname{int}(H/v)\big)$$
where $W$, $D$, and $H$ denote the spatial extent of the point cloud and $v$ represents the resolution of the point cloud voxels.
A hash table is created from the coordinates and feature vectors of the voxelized point cloud. The keys of the hash table are the coordinates of the nonempty voxels, and the values are the feature vectors associated with those voxels.
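To make this step concrete, the following minimal PyTorch sketch computes the voxel indices defined above and averages the colors of all points that fall into the same voxel; the function name, the use of torch.unique as the hashing step, and the tensor layout are our own illustrative assumptions rather than the paper’s implementation.

```python
import torch

def voxelize(coords: torch.Tensor, colors: torch.Tensor, v: float):
    """Voxelize a point cloud: compute integer voxel indices p_i = (a, b, c)
    and average the features of all points that fall into the same voxel.

    coords: (N, 3) point coordinates, colors: (N, C) per-point features, v: voxel resolution.
    """
    colors = colors.float()
    # a = int(x_i / v), b = int(y_i / v), c = int(z_i / v)
    voxel_idx = torch.div(coords, v, rounding_mode='trunc').long()

    # torch.unique plays the role of the hash table: the unique rows are the
    # coordinates of the nonempty voxels, and 'inverse' maps each point to its voxel.
    unique_idx, inverse = torch.unique(voxel_idx, dim=0, return_inverse=True)

    # Average the colors of all points sharing a voxel (the chi indicator above).
    num_voxels = unique_idx.shape[0]
    feat_sum = torch.zeros(num_voxels, colors.shape[1]).index_add_(0, inverse, colors)
    counts = torch.zeros(num_voxels).index_add_(0, inverse, torch.ones(coords.shape[0]))
    voxel_feat = feat_sum / counts.unsqueeze(1)
    return unique_idx, voxel_feat, inverse
```

The mapping ‘inverse’ is kept so that per-voxel predictions can later be scattered back to the original points during devoxelization.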
Devoxelization is performed using trilinear interpolation. For a given position $(x, y, z)$, we first determine the coordinates of the nearest voxel corner $(x_0, y_0, z_0)$ and the farthest voxel corner $(x_1, y_1, z_1)$. The relative offsets with respect to the nearest corner are calculated as follows:
$$\alpha = \frac{x - x_0}{x_1 - x_0}, \quad \beta = \frac{y - y_0}{y_1 - y_0}, \quad \gamma = \frac{z - z_0}{z_1 - z_0}$$
The trilinearly interpolated value $f(x, y, z)$ is obtained as follows:
$$\begin{aligned} f(x,y,z) = {}& (1-\alpha)(1-\beta)(1-\gamma)\, f_{000} + \alpha(1-\beta)(1-\gamma)\, f_{100} + (1-\alpha)\beta(1-\gamma)\, f_{010} + \alpha\beta(1-\gamma)\, f_{110} \\ & + (1-\alpha)(1-\beta)\gamma\, f_{001} + \alpha(1-\beta)\gamma\, f_{101} + (1-\alpha)\beta\gamma\, f_{011} + \alpha\beta\gamma\, f_{111} \end{aligned}$$
where $f_{ijk}$ denotes the value stored at the voxel corner with indices $i$, $j$, $k$ along the three dimensions.
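A minimal sketch of this interpolation for a single query point over a dense voxel feature grid is given below; the dense (W, D, H, C) layout and the helper’s name are assumptions made purely for illustration, since the actual network stores only nonempty voxels.

```python
import torch

def trilinear_interpolate(grid: torch.Tensor, x: float, y: float, z: float, v: float) -> torch.Tensor:
    """Interpolate voxel-grid features at a continuous position (x, y, z).

    grid: (W, D, H, C) voxel features; v: voxel resolution.
    Assumes the query point lies strictly inside the grid.
    """
    # Indices of the nearest (lower) and farthest (upper) voxel corners.
    x0, y0, z0 = int(x / v), int(y / v), int(z / v)
    x1, y1, z1 = x0 + 1, y0 + 1, z0 + 1

    # Relative offsets alpha, beta, gamma within the cell (one voxel spans x1 - x0).
    a, b, g = x / v - x0, y / v - y0, z / v - z0

    # Weighted sum of the eight corner values f_ijk.
    return ((1 - a) * (1 - b) * (1 - g) * grid[x0, y0, z0]
            + a * (1 - b) * (1 - g) * grid[x1, y0, z0]
            + (1 - a) * b * (1 - g) * grid[x0, y1, z0]
            + a * b * (1 - g) * grid[x1, y1, z0]
            + (1 - a) * (1 - b) * g * grid[x0, y0, z1]
            + a * (1 - b) * g * grid[x1, y0, z1]
            + (1 - a) * b * g * grid[x0, y1, z1]
            + a * b * g * grid[x1, y1, z1])
```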

2.3. Offset Attention Module

The remarkable ability of attention mechanisms to model long-range dependencies has made them pivotal across many domains. In ‘Attention Is All You Need’ [32], the Google machine translation team abandoned the traditional CNN and RNN structures and completed the machine translation task using only attention mechanisms, achieving better results. Since then, attention mechanisms have become a popular research direction.
Point cloud attention techniques can be divided into two categories based on their operational scale: global attention and local attention [33]. Global attention primarily applies the attention module to the entire point cloud to capture global features, while local attention restricts the application of the attention module to specific regions.
The Point Cloud Transformer (PCT) [34], as a typical global attention network, uses four stacked attention blocks to learn global features and uses max-pooling and mean-pooling to accomplish point cloud segmentation tasks. In subsequent network developments that account for a broader range of scales, the cross-level cross-scale cross-attention network for point cloud representation (3CROSSNet) [35] extracts point cloud features through the construction of three modules. Each module considers distinct hierarchies and inter-point relationships at different scales. Point-Bert [36] draws inspiration from the successful application of Bert in natural language processing. It considers local point clouds as the vocabulary of a language, offering a novel approach to the application of a transformer in the point cloud domain.
Local attention networks primarily aggregate features from local point clouds. Point Transformer (PT) [37] adopts a hierarchical structure, using attention blocks to extract features from local point clouds, with each module applied in the KNN local region of sampled points. However, the PT network has its drawbacks. As the model’s depth and channel count increase, it results in excessive computational demands and a certain degree of overfitting. Point Transformer V2 (PTv2) [38] improves upon PT by refining attention and pooling operations, reducing the model’s parameter count, and making the network more lightweight. Stratified Transformer [39], inspired by Swin Transformer [40], divides the voxelized point cloud into nonoverlapping cube windows and performs a local attention operation at each window, with the attention part consisting of two consecutive blocks of attention, the former for capturing long-term and short-term dependencies and the latter for reinforcing the links between windows. Unlike previous improvements on attention, Wu et al. [41] were inspired by large-scale representation learning and proposed PointTransformerV3 (PTv3) to achieve the highest segmentation accuracy.
In recent years, architectures based on the transformer, with intricately designed attention mechanisms, have surpassed CNN architectures in point cloud model accuracy. However, the increasing complexity of attention designs tends to reduce the efficiency of the model. Moreover, training a high-precision model on a typical graphics processing unit (GPU) becomes challenging. For instance, PT models exhibit a certain degree of overfitting and demand a high GPU memory, making them particularly difficult to train with a single GPU. PTv2 introduced improvements over PT but still required substantial GPU memory. It was not until the recent release of PTv3 that a focus on model efficiency became apparent. Unlike the complex transformer architectures, efficient CNN architectures are more conducive to improvement and practical use by the majority of researchers. During our research, we found that the attention mechanisms involved in transformer architectures can enhance the network segmentation performance of CNN architectures. Some simplifications of attention mechanisms (such as PT) can aid in model improvement. However, complex attention mechanisms (such as PTv2) can render the model difficult to train and may potentially have adverse effects. Therefore, designing a network that combines the high precision of transformer architectures with the efficiency of CNN architectures can provide complementary advantages.
The self-attention operators currently in use can be categorized into scalar self-attention and vector self-attention. Let $V = \{v_i\}$ denote the set of input point cloud feature vectors. The scalar self-attention operator is defined as follows:
$$O_i = \sum_{v_j \in V} \rho\big(\varphi(v_i)^{T} \psi(v_j)\big)\, \alpha(v_j)$$
The vector self-attention operator is defined as follows:
$$O_i = \sum_{v_j \in V} \rho\Big(\gamma\big(\beta(\varphi(v_i), \psi(v_j)) + \delta\big)\Big) \odot \alpha(v_j)$$
where $\varphi$, $\psi$, and $\alpha$ are linear transformations, $O_i$ is the output point cloud feature, $\beta$ is the relation function, $\gamma$ is the mapping function, $\delta$ is the positional encoding function, and $\rho$ is the softmax normalization function.
The offset attention that we use belongs to scalar self-attention. In contrast to vector self-attention, offset attention has lower computational requirements and is more practical, making it more suitable for hardware with lower specifications.
This paper primarily emphasizes two aspects: the rapid computation of sparse convolution, which efficiently extracts local features from point clouds, and the superior capability of self-attention mechanisms in capturing global point cloud features. However, the introduction of self-attention mechanisms may create computational challenges due to excessive computations. Therefore, in this study, sparse convolution is employed as the backbone architecture to ensure efficiency. Simultaneously, an offset attention module is utilized to capture global features and effectively mitigate the impact of noise. To avoid the excessive computational burden associated with a pure attention architecture, this paper introduces only one offset attention module, which is composed of stacked attention layers. A context aggregation module is also incorporated to further enhance the network’s modeling capability in complex scenarios.
Inspired by the PCT network, this study introduces the offset attention module. The offset attention module is positioned after the first sparse convolutional layer. Consequently, the input to the offset attention module is the output features of the preceding sparse convolutional layer. The output is the result of stacking three offset attention layers to extract features, which are then combined with the input features through residual connections and subsequently normalized via batch normalization.
The offset attention module consists of three stacked offset attention layers, each utilizing the offset attention operator. The input to the first offset attention layer consists of the features from the preceding sparse convolutional layer, normalized through batch normalization; the input to each subsequent offset attention layer is the output of the preceding one. The output consists of two components: one is the batch-normalized features extracted through sparse convolution, and the other is the features calculated through offset attention; these two components are combined in the final output through a residual connection. The offset attention operator is inspired by the use of the Laplacian matrix $L = R - E$ as a substitute for the adjacency matrix $E$ in graph convolutional networks, with $R$ being a diagonal degree matrix. Offset attention computes the offset through the elementwise subtraction of the attention features from the input point cloud features.
First, we describe the first offset attention layer as an example; the other two layers are identical in structure. The self-attention operator computes the $Q$, $K$, and $V$ matrices from the input features $F$ as follows:
$$Q = W_Q \cdot F, \quad K = W_K \cdot F, \quad V = W_V \cdot F$$
where the input $F$ is the result of the first-layer sparse convolutional feature extraction followed by batch normalization, and $W_Q$, $W_K$, and $W_V$ denote linear transformations.
The attention weights are computed through the matrix multiplication of $Q$ and $K$:
$$attention = Q \cdot K^{T}$$
where $Q$ and $K$ are the query and key matrices, respectively.
The normalization in offset attention differs from that of scalar attention: offset attention applies softmax and then $L_1$ normalization. The normalized weights $\kappa_{i,j}$ are defined as follows:
$$\tilde{\kappa}_{i,j} = \operatorname{softmax}(attention)_{i,j}, \quad \kappa_{i,j} = \frac{\tilde{\kappa}_{i,j}}{\sum_{k} \tilde{\kappa}_{i,k}}$$
where $attention$ refers to the attention weights computed above.
The final output $F_{offset}$ of the first offset attention layer can be computed using the following formulas:
$$F_{relation} = \kappa \cdot V$$
$$F_{offset} = F + \operatorname{LBR}(F - F_{relation})$$
where $V$ is the value matrix and $\operatorname{LBR}$ stands for Linear transformation, Batch normalization, ReLU.
The output $F_{offset}$ of the first offset attention layer undergoes feature extraction through the two subsequent offset attention layers, along with a residual connection and batch normalization, completing the feature extraction of the entire offset attention module. The features then pass through the rest of the sparse convolution backbone network. Finally, the output features are fed to a linear layer, the initial classifier $C$, to produce the segmentation result.
The architecture of the offset attention layer is illustrated in Figure 2.
Since the offset attention module follows a sparse convolutional layer, its input consists of features extracted by the first sparse convolution layer, while its output comprises features extracted through three offset attention layers. In the following, we provide a detailed overview of the offset attention module. The offset attention module initially normalizes the input point cloud data and subsequently stacks three attention layers to facilitate information exchange among point cloud feature vectors. The input point cloud features are combined with the features from the offset attention layer through residual connections, forming the output of the offset attention module. Residual connections primarily address the issue of gradient vanishing during model training, facilitating faster convergence. Subsequently, the output point cloud features are enhanced by a normalization layer, bolstering the network’s generalization capabilities. By constructing the offset attention module, the model’s ability to capture global contextual information is enhanced. Each layer of the offset attention module effectively promotes information exchange among point cloud feature vectors, enriching the generated point cloud feature information and making it more distinctive. This exchange enhances the semantic segmentation performance of the network as the model becomes more adept at discerning intricate patterns and capturing meaningful relationships within the point cloud data. The architecture of the offset attention module is depicted in Figure 3.
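As a concrete reference, the following PyTorch sketch implements one offset attention layer according to the equations above; for clarity it attends over all N input features at once, and the layer sizes and the exact LBR composition are assumptions rather than the paper’s implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetAttentionLayer(nn.Module):
    """One offset attention layer: scalar attention with softmax followed by
    L1 normalization, and an offset (input minus attention feature) passed through LBR."""

    def __init__(self, channels: int):
        super().__init__()
        self.w_q = nn.Linear(channels, channels, bias=False)   # W_Q
        self.w_k = nn.Linear(channels, channels, bias=False)   # W_K
        self.w_v = nn.Linear(channels, channels, bias=False)   # W_V
        # LBR: Linear -> BatchNorm -> ReLU
        self.lbr = nn.Sequential(nn.Linear(channels, channels),
                                 nn.BatchNorm1d(channels),
                                 nn.ReLU())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, C) features from the preceding layer (after batch normalization)
        q, k, v = self.w_q(feats), self.w_k(feats), self.w_v(feats)

        attention = q @ k.t()                                     # attention = Q K^T, shape (N, N)
        kappa = F.softmax(attention, dim=0)                       # softmax, as in PCT
        kappa = kappa / (kappa.sum(dim=1, keepdim=True) + 1e-9)   # L1 normalization

        f_relation = kappa @ v                                    # F_relation = kappa V
        return feats + self.lbr(feats - f_relation)               # F_offset = F + LBR(F - F_relation)
```

Stacking three such layers, adding the residual connection to the module input, and applying batch normalization yields the offset attention module described above.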

2.4. Context Aggregation Module

The paper enhances the network’s ability to capture contextual information by incorporating a context aggregation module. Inspired by the work of Tian et al. [42], this study introduces the context aggregation module. This module effectively leverages contextual information among point cloud data, adapting to the latent distributions present in different datasets.
The contextual aggregation module can be understood as a classifier situated at the end of the entire network. Preceding it are the offset attention module and the feature extraction through sparse convolution. As a result, its input is the outcome of the entire preceding structure’s feature extraction, and its output is the final segmentation categories. The context aggregation module exhibits slight differences between the training phase and the testing phase. During training, real labels from the dataset are available for utilization. Nevertheless, there are no actual label data accessible during the testing period. Therefore, this study employs approximate labels as surrogate real labels. The architecture of the context aggregation module is illustrated in Figure 4.
During the training phase, the true labels of the point cloud are available, enabling the acquisition of accurate contextual priors. Let $f \in \mathbb{R}^{n \times c}$ denote the extracted feature map and $C \in \mathbb{R}^{cls \times c}$ the initial classifier with $cls$ classes. The ground-truth annotation $y \in \mathbb{R}^{n}$, which contains $cls$ classes, can be converted into the corresponding $cls$ binary masks $y^{*} \in \mathbb{R}^{cls \times n}$.
Subsequently, the categorical prototypes $C_y$ are calculated as follows:
$$C_y = \frac{y^{*} \times f}{\sum_{j=1}^{n} y^{*}(\cdot, j)}$$
where masked average pooling is applied to $y^{*}$ and $f$.
The oracle context-aware classifier $A_y$ is derived from $C_y$, $C$, and the projection $\theta_y$ consisting of two linear layers:
$$A_y = \theta_y(C_y \oplus C)$$
where $\oplus$ denotes concatenation.
The prediction $P_y$ is computed from the extracted feature map $f$ and the oracle context-aware classifier $A_y$ as follows:
$$P_y = \tau \cdot \eta(f) \times \eta(A_y)^{T}$$
where $\eta$ denotes $l_2$ normalization along the second dimension and $\tau$ is set to 15.
During the testing phase, true labels $y^{*}$ are unavailable, so we approximate the oracle contextual prior using the prediction $p$ obtained from the initial classifier:
$$p = f \times C^{T}$$
The subsequent computation is similar to the training phase. The categorical prototype $C_p$ is computed from $p$ and $f$ using the same formula as in training; however, because $p$ is itself a prediction, $C_p$ is an estimated value. The calculation formula is presented as follows:
$$C_p = \frac{\sigma(p)^{T} \times f}{\sum_{j=1}^{n} \sigma(p)^{T}(\cdot, j)} = \frac{\sigma(f \times C^{T})^{T} \times f}{\sum_{j=1}^{n} \sigma(f \times C^{T})^{T}(\cdot, j)}$$
where $\sigma$ represents the softmax operation.
The context-aware classifier $A_p$ is computed analogously to $A_y$ in the training phase; however, $A_p$ is an estimated value, determined from $C_p$, $C$, and the two-linear-layer projection $\theta_p$:
$$A_p = \theta_p(C_p \oplus C)$$
The prediction $P_p$ is computed as in the training phase:
$$P_p = \tau \cdot \eta(f) \times \eta(A_p)^{T}$$
where $\eta$ denotes $l_2$ normalization along the second dimension and $\tau$ is set to 15.
The context aggregation module effectively alleviates the instability issues of individually generated A y and A p , with cosine similarity proving more effective than the conventional dot product. The experimental results indicate that the incorporation of the context aggregation module significantly improves the network’s capacity to effectively capture contextual information.
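The following PyTorch sketch summarizes the module in both phases; the tensor shapes follow the notation above (n points, c channels, cls classes), while the two-linear-layer projection architecture and other implementation details are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAggregation(nn.Module):
    """Context aggregation head: build class prototypes from the features and a
    (true or approximate) class assignment, then classify with a context-aware
    classifier using cosine similarity."""

    def __init__(self, channels: int, num_classes: int, tau: float = 15.0):
        super().__init__()
        self.classifier = nn.Linear(channels, num_classes, bias=False)  # initial classifier C
        self.proj = nn.Sequential(nn.Linear(2 * channels, channels),    # projection theta (two linear layers)
                                  nn.ReLU(),
                                  nn.Linear(channels, channels))
        self.tau = tau

    def forward(self, f: torch.Tensor, y: torch.Tensor = None) -> torch.Tensor:
        # f: (n, c) extracted features; y: (n,) ground-truth labels, available only in training.
        C = self.classifier.weight                              # (cls, c)
        if y is not None:                                       # training: oracle prior from y*
            mask = F.one_hot(y, C.shape[0]).float().t()         # y*: (cls, n) binary masks
        else:                                                   # testing: approximate prior
            mask = F.softmax(f @ C.t(), dim=1).t()              # sigma(f C^T)^T: (cls, n)

        # Masked average pooling -> categorical prototypes C_y (training) or C_p (testing).
        proto = (mask @ f) / (mask.sum(dim=1, keepdim=True) + 1e-9)

        # Context-aware classifier A = theta([prototypes, C]); cosine-similarity logits scaled by tau.
        A = self.proj(torch.cat([proto, C], dim=1))             # (cls, c)
        return self.tau * F.normalize(f, dim=1) @ F.normalize(A, dim=1).t()   # (n, cls)
```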

2.5. Feature Extraction Network

Currently, within voxel-based methods, sparse convolution is extensively employed in various downstream tasks related to point cloud analysis. Compared to regular convolution, sparse convolution effectively reduces computational complexity, enhancing the efficiency of network processing for point cloud data. Despite advancements, sparse convolution still grapples with the challenge of limited local receptive fields, leading to a weaker grasp of the overall structural information of point clouds. Compared to convolutional neural networks, models based on transformers exhibit global feature learning capabilities without the necessity of stacking multiple layers of convolutional layers. The attention mechanism, which has demonstrated satisfactory results in 2D images and natural language processing, plays a significant role. Furthermore, the 3D self-attention mechanism operates in parallel and is order-agnostic, making it well suited for processing point cloud data. However, purely attention-based networks have high computational requirements. Therefore, in this study, the offset attention is designed as a module that is exclusively added at the beginning of the sparse network, and a context aggregation module is applied at the end to further enhance the network’s long-range modeling capabilities.
The network proposed in this paper, with sparse convolution as its backbone and augmented by the offset attention module and context aggregation module, is illustrated in Figure 5. The main architecture is built upon the sparse convolutional U-Net, employing an encoder–decoder structure. In this study, sparse convolution is chosen as the backbone network primarily for its enhanced efficiency compared to regular convolution. After the voxelization of the point cloud data, there are numerous empty voxels. Sparse convolution, however, computes outputs only for predefined coordinates and stores the results in a compact tensor. Compared to conventional convolution operations, sparse convolution exhibits higher-dimensional spatial expressiveness and generalization, enabling a reduction in computational complexity. Sparse convolution better addresses the sparsity challenges inherent in point cloud data, thereby enhancing the efficiency of processing point cloud data.
While sparse convolution effectively reduces the computational burden for processing point cloud data, it shares a limitation with regular convolution in terms of a limited local receptive field. During feature extraction from point cloud data using convolution, the difficulty of obtaining information from distant points increases as the number of convolutional layers increases. To address this limitation in the feature extraction process, this study introduces the offset attention module and context aggregation module to enhance the network’s ability to capture global features.
Our network initially subjects the input point cloud to a sparse convolutional layer with a kernel size of 5 for feature extraction. Next, the input undergoes processing through an offset attention module to augment the network’s global receptive field. The encoder component comprises sparse convolution blocks and sparse convolutional layers. In the encoder, the sparse convolutional layer has a kernel size of 2, and the sparse convolution block comprises stacked residual sparse convolutional units. The residual sparse convolution block comprises two sparse convolution layers, each with a kernel size of 3. Following each sparse convolutional layer are Batch Normalization (BN) and ReLU layers. A residual structure connects every two convolutional layers in the block. The sparse convolution module is illustrated in Figure 6. The decoder component consists of sparse convolution blocks and layers. The convolutional kernel size for the sparse convolution blocks is 3, while the convolutional kernel size for the sparse convolutional layer is 2, consistent with the encoder. Subsequently, the point cloud undergoes a sparse convolutional layer with a kernel size of 1, followed by the final output through the context aggregation module.
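As an illustration of the residual sparse convolution unit shown in Figure 6, the sketch below uses MinkowskiEngine as the sparse convolution library; the paper does not state which sparse convolution implementation it uses, so the API choice and the channel configuration are assumptions made for the example.

```python
import torch.nn as nn
import MinkowskiEngine as ME

class ResidualSparseBlock(nn.Module):
    """Two sparse convolutions with kernel size 3, each followed by BatchNorm
    and ReLU, wrapped in a residual connection (cf. Figure 6)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = ME.MinkowskiConvolution(channels, channels, kernel_size=3, dimension=3)
        self.bn1 = ME.MinkowskiBatchNorm(channels)
        self.conv2 = ME.MinkowskiConvolution(channels, channels, kernel_size=3, dimension=3)
        self.bn2 = ME.MinkowskiBatchNorm(channels)
        self.relu = ME.MinkowskiReLU()

    def forward(self, x: ME.SparseTensor) -> ME.SparseTensor:
        out = self.relu(self.bn1(self.conv1(x)))   # first 3x3x3 sparse convolution + BN + ReLU
        out = self.bn2(self.conv2(out))            # second 3x3x3 sparse convolution + BN
        return self.relu(out + x)                  # residual connection over the two convolutional layers
```

Because the convolutions use stride 1, the input and output share the same set of nonempty voxel coordinates, so the residual addition is well defined.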

3. Experiments and Results

3.1. Experimental Settings

The hardware configuration and software versions of this experiment are shown in Table 1.
The experiment was configured with 750 training epochs, a batch size of 6, and the SGD optimizer. We trained two models on the S3DIS and ScanNetV2 datasets using the same hyperparameters and then conducted testing on the S3DIS dataset and validation on the ScanNetV2 dataset, respectively. The evaluation metrics included overall accuracy (OA), mean accuracy (mAcc), and mean intersection over union (mIoU). The calculation formulas for these metrics are presented as follows:
$$OA = \frac{1}{N}\sum_{i=0}^{C} TP_i$$
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
$$mAcc = \frac{1}{C+1}\sum_{i=0}^{C} Accuracy_i$$
$$IoU = \frac{TP}{TP + FP + FN}$$
$$mIoU = \frac{1}{C+1}\sum_{i=0}^{C} IoU_i$$
where $N$ represents the total number of points; $C$ denotes the number of segmentation categories; and $TP$, $TN$, $FP$, and $FN$ refer to true positives, true negatives, false positives, and false negatives, respectively.
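For reference, the following NumPy sketch computes these metrics from flattened per-point predictions and labels via a confusion matrix; note that mAcc is implemented here as the mean per-class recall, the usual convention for segmentation benchmarks, which may differ slightly from the per-class Accuracy formula given above.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Compute OA, mAcc, and mIoU from flat integer predictions and ground-truth labels."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    cm = np.bincount(gt * num_classes + pred, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes).astype(np.float64)

    tp = np.diag(cm)              # true positives per class
    fp = cm.sum(axis=0) - tp      # false positives per class
    fn = cm.sum(axis=1) - tp      # false negatives per class

    oa = tp.sum() / cm.sum()                   # overall accuracy
    macc = np.nanmean(tp / (tp + fn))          # mean per-class accuracy (classes absent from gt are ignored)
    miou = np.nanmean(tp / (tp + fp + fn))     # mean intersection over union
    return oa, macc, miou
```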

3.2. Dataset Description

In this study, experiments were conducted on two large indoor point cloud datasets, namely, the Stanford Large-Scale 3D Indoor Spaces (S3DIS) dataset [43] and the ScanNetV2 dataset [44]. These two indoor point cloud datasets encompass common indoor scene categories, such as walls, windows, and doors. The S3DIS dataset consists of six areas, totaling 272 rooms and covering 13 common indoor segmentation categories. For this study, areas 1, 2, 3, 4, and 6 were selected as the training dataset, while area 5 was used as the validation and testing dataset. The types and quantities of rooms contained in each area are shown in Table 2. The ScanNetV2 dataset consists of 1513 scenes covering 20 common segmentation categories; the training split includes 1201 scenes and the validation split includes 312 scenes. The data numbers used for training and validation in this study are shown in Table 3.

4. Results

4.1. Results on the S3DIS Dataset

To better showcase the segmentation results of OcspareNet, we visualized the segmentation results on the S3DIS test set. The segmentation results are shown in Figure 7.
From the second and third rows of Figure 7, it can be observed that our network successfully segments objects such as bookshelves, windows, and blackboards, which require relying on the overall scene for segmentation. This result is attributed primarily to our network’s capability to optimize point cloud segmentation results by leveraging global context information. This finding effectively demonstrates the efficacy of incorporating offset attention and context aggregation modules. As shown in the first row of Figure 7, our network can also classify other objects, such as the circular clock on the wall and objects stacked on the table, into appropriate categories. Furthermore, for categories like tables, chairs, and floors, as shown in the first row of Figure 7, our network achieves satisfactory segmentation results, underscoring the effectiveness of integrating the offset attention and context aggregation modules in our network.

4.2. Results on the ScanNetV2 Dataset

To better showcase the segmentation results of OcspareNet, we visualized the segmentation results on the ScanNetV2 test set. The segmentation results are shown in Figure 8.
First, our proposed network exhibits good segmentation results for commonly encountered and easily distinguishable categories such as floors and walls, as shown in the overall presentation in Figure 8. Second, for categories with unique shapes such as round tables, chairs, and sofas, our network successfully segments them. As shown in the first row of Figure 8, the boundary between the sofa and the wall is distinctly segmented. Objects that are more challenging to segment, such as windows, refrigerators, and counters, which require reliance on the overall structural information of the room, can also be well segmented. As shown in the second row of Figure 8, the overall segmentation results are good, with only a small portion experiencing misclassification. Last, our network is suitable not only for large objects but also for segmenting small objects. As shown in the second row of Figure 8, our network accurately segments the sink on the counter. In the fourth row of Figure 8, the trash bin near the doorway is erroneously classified as another object. In summary, based on the analysis of Figure 8, our network demonstrates good segmentation accuracy for commonly encountered and easily distinguishable categories. Our network also exhibits satisfactory segmentation results for small objects and categories that are prone to confusion. This finding is explained by the notion that our offset attention and context aggregation modules can compensate for the absence of global context information in sparse convolutional networks and through the efficiency of our sparse convolutional network in capturing point cloud features. These modules effectively optimize the features captured by sparse convolution, thereby enhancing the network’s adaptability to complex indoor scenes.

5. Discussion

5.1. Ablation Study

5.1.1. ScanNetV2 Dataset Ablation Experiment

We conducted ablation experiments on the context aggregation module and the offset attention module using the ScanNetV2 dataset. The segmentation accuracy of the network is shown in Table 4. Incorporating the offset attention module resulted in improvements of 0.51% in mIoU, 0.16% in OA, and 0.55% in mAcc. After integrating both the offset attention module and the context aggregation module, there was an enhancement of 1.72% in mIoU, 0.51% in OA, and 2.45% in mAcc.
The experimental results indicate that the offset attention module and the context aggregation module can effectively enhance the sparse convolutional network’s ability to capture complete contextual information, addressing the issue of limited receptive fields in sparse convolutional networks.
To visually demonstrate the impact of the offset attention and context aggregation modules on the network, we visualize the point cloud segmentation results in Figure 9.
As depicted in Figure 9, it is evident that our network, augmented with the offset attention and context aggregation modules, significantly enhances the segmentation capability for categories with similar shapes, such as sofas and chairs. Additionally, our modules can address boundary misclassification issues. As illustrated in the first row of Figure 9, the base network misclassifies the majority of the point cloud from the left chair and a small portion of the point cloud from the right chair as a sofa. After adding the two modules, the network correctly resegments them. In the second row, the base network incorrectly classifies a portion near the boundary of the left refrigerator as another category and misclassifies the boundary of the right refrigerator as a window. With the added modules, the network improves the segmentation at the boundaries based on overall structural information, resulting in significant enhancement compared to the base network, despite some remaining misclassifications. As shown in the third row, our modules also exhibit improvement in addressing misclassifications of planar categories. The base network misclassifies the plane on top of the cabinet as a counter. With the added modules, the network correctly segments the plane. In the fourth row, although the modules do not fully segment the sofa, they improve the overall segmentation results. These results convincingly demonstrate that our modules enhance the network’s ability to capture global context information, enabling the network to improve the accuracy of similar-shaped categories and the point cloud segmentation in complex scenes based on overall structural information.

5.1.2. S3DIS Dataset Ablation Experiment

We conducted ablation experiments on the two modules using the S3DIS dataset; the results are presented in Table 5. Compared to the base network, incorporating the offset attention and context aggregation modules yielded improvements of 1.52% in mIoU, 0.55% in OA, and 2.83% in mAcc. These results demonstrate that the introduced modules compensate for the deficiencies of sparse convolutional networks.
To visually demonstrate the impact of the offset attention and context aggregation modules on the network, we present the segmentation results on the S3DIS dataset in Figure 10.
As shown in Figure 10, the offset attention module and context aggregation module improve the base network on the S3DIS dataset. As shown in the first row of Figure 10, our network, after incorporating the modules, can effectively and more comprehensively segment the boundary between the bookshelf and the wall. Moreover, there is notable improvement in handling the overlapping sections of walls and tables. For walls that were mistakenly classified into other categories, the addition of the modules also facilitates effective correction. As shown in the second row of Figure 10, our modules demonstrate improvement in segmenting categories such as windows and boards that are on the same plane and share similar shapes. Before adding the modules, the network struggles to differentiate among the walls, boards, and windows. With the integration of these modules, the network gains a more precise understanding of the overall structural information, leading to more distinct segmentation boundaries for windows and boards. Furthermore, our modules contribute to the improvement of misclassification issues between bookshelves and floors, as well as between ceilings and other categories. As illustrated in the third row of Figure 10, our modules enhance the segmentation performance in overlapping areas between walls and bookshelves, as well as between walls and other categories. In the fourth row of Figure 10, we showcase the improvement in our modules regarding the segmentation of bookshelves, doors, and walls. While there is still some mis-segmentation between doors and bookshelves after the improvement, the overall segmentation of doors sees significant enhancement compared to the base network. Additionally, our network exhibits a clearer perception of the boundaries between tables and other objects.
In summary, our modules enhance the base network’s ability to capture overall structural information, leading to improvements in the segmentation accuracy for shape-similar categories and categories prone to misclassification in overlapping areas, thereby strengthening the network’s performance in complex scenes.

5.2. Comparison with Other Networks

The OcspareNet proposed in this study was experimentally evaluated on two indoor point cloud datasets. First, we present the segmentation results of OcspareNet on S3DIS. We compared OcspareNet with other networks on the S3DIS test set, and the experimental results are shown in Table 6.
Compared to other networks, our network achieves moderate segmentation accuracy on the S3DIS dataset. However, compared with the state-of-the-art PTv3 model, our model uses 8.7 M fewer parameters. It also holds an advantage in parameter size over PointNeXt-XL, which achieved the highest accuracy among the improvements to PointNet++. Moreover, PointNeXt illustrates that part of the recent accuracy gains stems from scaling models, trading a larger parameter count for higher accuracy, whereas the ideal is a balance between accuracy and parameter size. The PT network holds an advantage in accuracy, but in contrast to PT’s complex vector attention mechanism, our network employs a more concise attention model and requires fewer training epochs. Compared to PointNet, which does not capture local point cloud structural information, our network demonstrates an overall segmentation improvement, showcasing the effectiveness of sparse convolution in extracting local point cloud features. In comparison to SegCloud, our network exhibits significant enhancements in segmenting categories such as doors, chairs, bookshelves, and boards, with improvements of 36.0%, 17.0%, 33.5%, and 57.9%, respectively. In comparison to SPGraph, certain categories, such as doors, tables, and sofas, show slightly lower accuracy, mainly because graph convolution is better at capturing spatial point neighborhood information. However, for challenging categories such as windows, bookshelves, and boards, our network achieves superior accuracy, with improvements of 4.8%, 21.8%, and 68.8%, respectively, over SPGraph. This improvement is attributed mainly to the offset attention and context aggregation modules, which significantly improve the network’s ability to capture comprehensive contextual information and effectively boost segmentation accuracy in complex scenes.
We also conducted experiments on the ScanNetV2 dataset; the results are shown in Table 7. Compared to other networks, our network achieves a higher validation mIoU, surpassing PointNet++ by 23.4%, SSCN by 6.1%, MinkowskiNet by 4.7%, OctFormer by 1.2%, and ConDaFormer by 0.9%. However, compared to the state-of-the-art PTv3 + PPT, our network performs slightly worse, with a deviation of 1.7%. Notably, in this comparison, PointNet++ incorporates color and normal vector information from the point cloud. Relative to PTv3 + PPT, our model achieves superior segmentation performance on categories with distinctive shapes such as chairs, sofas, and tables. Additionally, while other networks show lower segmentation accuracy for categories like Picture and Counter, our network produces better results, with IoU scores of 41.4% and 70.3% for the Picture and Counter categories, respectively, a significant improvement over competing approaches. Compared to PointNet++, which uses hierarchical subsampling to obtain local point cloud information, our network performs better, demonstrating that it captures local point cloud features more effectively. In comparison to MinkowskiNet, our network achieves similar results on some easily distinguishable categories but attains superior results on more challenging categories that depend on global context information, confirming that the added modules effectively capture comprehensive contextual information. In contrast to the SSCN, which focuses only on the sparsity of point clouds, our network excels in capturing global point cloud context, enhancing its ability to obtain overall structural information; it therefore segments categories such as doors, bookshelves, and pictures, which rely on overall scene structure, more accurately. Overall, our network exhibits the most significant improvements for large objects such as doors, windows, desks, and refrigerators, outperforming the SSCN by 10%, 10.1%, 13.8%, and 17.4%, respectively; for small objects such as pictures, the accuracy improves by 10.1%. Compared to transformer architectures that use window-based attention (ConDaFormer) or octree-based attention (OctFormer), our network demonstrates advantages in categories such as doors, chairs, and counters, whereas the transformer-based networks segment irregularly shaped small objects like sinks, curtains, and showers better. Nevertheless, our network maintains an advantage for regular-shaped small objects, specifically the picture category, with margins of 13.2% and 18.8% over ConDaFormer and OctFormer, respectively.
In terms of model parameter count, our model has 37.5 M parameters, fewer than the state-of-the-art PTv3 + PPT model. Compared to OctFormer, our model has fewer parameters and achieves a higher validation mIoU. Compared with MinkowskiNet, our model has a similar parameter count but achieves higher accuracy in the majority of segmentation classes. We also report additional accuracy metrics, namely overall accuracy (OA) and mean accuracy (mAcc). Because the test dataset is unavailable to us, we do not report the test mIoU.

5.3. Limitations

The OcspareNet proposed in this paper achieves good segmentation results on two indoor datasets, but it also has some drawbacks. First, compared with the SOTA model on the S3DIS dataset, our accuracy is lower. One contributing factor is that our model does not utilize normal vector information during training; another is that the chosen hyperparameters require further optimization. Furthermore, as highlighted by PointNeXt, our model does not yet use stronger data augmentation (such as that used in KPConv training) or better optimization strategies during training. In the future, we therefore plan to further optimize our model by employing superior training strategies.

6. Conclusions

In this study, we introduce a novel network called OcspareNet for the semantic segmentation of indoor point clouds. Our network employs sparse convolution as the backbone, enhancing its ability to capture global contextual information by incorporating offset attention modules and contextual aggregation modules. This augmentation facilitates the network in achieving a higher precision in segmenting challenging categories. Specifically, the offset attention modules primarily sharpen attention weights, thereby enhancing the network’s ability to capture global contextual information. The contextual aggregation modules further enhance the network’s ability to model long-range information during both the training and testing phases. We conducted experiments on two indoor datasets, S3DIS and ScanNetV2, and visualized the results before and after incorporating the modules. The visualization demonstrates that our modules effectively improve the segmentation accuracy of complex categories. Compared to state-of-the-art (SOTA) models, our model boasts a smaller parameter count while demonstrating advantages in segmenting challenging categories such as pictures, counters, and desks on the ScanNetV2 dataset. Additionally, our CNN-based architecture exhibits competitive performance compared to transformer-based architectures such as OctFormer and ConDaFormer. We also compared our network with methods based on multilayer perceptron (MLP), voxelization, and graph convolution. The results indicate that our network effectively leverages the holistic structural information of buildings, leading to improved segmentation accuracy, especially in differentiating between easily confused categories.
However, this study still has some limitations. In comparison with other networks, our model is less competitive on the S3DIS dataset than on the ScanNetV2 dataset. This is primarily due to the underutilization of normal vector information. Additionally, inspired by PointNeXt, we recognize the need to refine our training strategy. In the future, we plan to improve our training strategy and optimize segmentation results by leveraging normal vector information.

Author Contributions

Conceptualization, Writing—review and editing, Supervision, Project administration, Funding acquisition, M.Y.; Methodology, Writing—original draft, Formal analysis, Writing—review and editing, Z.L.; Validation, Visualization, Software, Q.X.; Investigation, Validation, X.C.; Visualization, F.S.; Validation, W.C.; Resources, Q.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (41801308).

Data Availability Statement

The S3DIS dataset is available at http://buildingparser.stanford.edu/dataset.html (accessed on 10 June 2023); the ScanNetV2 dataset is available at http://www.scan-net.org/ (accessed on 10 June 2023).

Acknowledgments

The authors thank the managing editor and anonymous reviewers for their constructive comments.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Su, H.; Maji, S.; Kalogerakis, E.; Learned-Miller, E. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 945–953. [Google Scholar]
  2. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  3. Wu, B.; Wan, A.; Yue, X.; Keutzer, K. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1887–1893. [Google Scholar]
  4. Wu, B.; Zhou, X.; Zhao, S.; Yue, X.; Keutzer, K. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4376–4382. [Google Scholar]
  5. Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. Rangenet++: Fast and accurate lidar semantic segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macao, China, 4–8 November 2019; pp. 4213–4220. [Google Scholar]
  6. Robert, D.; Vallet, B.; Landrieu, L. Learning multi-view aggregation in the wild for large-scale 3d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5575–5584. [Google Scholar]
  7. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  8. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv 2017, arXiv:1706.02413. [Google Scholar]
  9. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11108–11117. [Google Scholar]
  10. Zhao, H.; Jiang, L.; Fu, C.-W.; Jia, J. Pointweb: Enhancing local neighborhood features for point cloud processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5565–5573. [Google Scholar]
  11. Qian, G.; Li, Y.; Peng, H.; Mai, J.; Hammoud, H.; Elhoseiny, M.; Ghanem, B. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Adv. Neural Inf. Process. Syst. 2022, 35, 23192–23204. [Google Scholar]
  12. Lin, H.; Zheng, X.; Li, L.; Chao, F.; Wang, S.; Wang, Y.; Tian, Y.; Ji, R. Meta Architecture for Point Cloud Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 17682–17691. [Google Scholar]
  13. Zhu, H.; Yang, H.; Wu, X.; Huang, D.; Zhang, S.; He, X.; He, T.; Zhao, H.; Shen, C.; Qiao, Y. Ponderv2: Pave the way for 3d foundation model with a universal pre-training paradigm. arXiv 2023, arXiv:2310.08586. [Google Scholar]
  14. Zhong, Z.; Cui, J.; Yang, Y.; Wu, X.; Qi, X.; Zhang, X.; Jia, J. Understanding imbalanced semantic segmentation through neural collapse. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 19550–19560. [Google Scholar]
  15. Deng, X.; Zhang, W.; Ding, Q.; Zhang, X. PointVector: A Vector Representation In Point Cloud Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9455–9465. [Google Scholar]
  16. Landrieu, L.; Simonovsky, M. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4558–4567. [Google Scholar]
  17. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
  18. Robert, D.; Raguet, H.; Landrieu, L. Efficient 3D Semantic Segmentation with Superpoint Transformer. arXiv 2023, arXiv:2306.08045. [Google Scholar]
  19. Thomas, H.; Qi, C.R.; Deschaud, J.-E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  20. Xiang, P.; Wen, X.; Liu, Y.-S.; Zhang, H.; Fang, Y.; Han, Z. Retro-fpn: Retrospective feature pyramid network for point cloud semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 17826–17838. [Google Scholar]
  21. Huang, Q.; Wang, W.; Neumann, U. Recurrent slice networks for 3d segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2626–2635. [Google Scholar]
  22. Allen, M.; Girod, L.; Newton, R.; Madden, S.; Blumstein, D.T.; Estrin, D. Voxnet: An interactive, rapidly-deployable acoustic monitoring platform. In Proceedings of the 2008 International Conference on Information Processing in Sensor Networks (IPSN 2008), City of Saint Louis, MO, USA, 22–24 April 2008; pp. 371–382. [Google Scholar]
  23. Tchapmi, L.; Choy, C.; Armeni, I.; Gwak, J.; Savarese, S. Segcloud: Semantic segmentation of 3d point clouds. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 537–547. [Google Scholar]
  24. Le, T.; Duan, Y. Pointgrid: A deep network for 3d shape understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9204–9214. [Google Scholar]
  25. Klokov, R.; Lempitsky, V. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 863–872. [Google Scholar]
  26. Riegler, G.; Osman Ulusoy, A.; Geiger, A. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3577–3586. [Google Scholar]
  27. Zeng, W.; Gevers, T. 3DContextNet: Kd tree guided hierarchical learning of point clouds using local and global contextual cues. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  28. Graham, B.; Engelcke, M.; Van Der Maaten, L. 3d semantic segmentation with submanifold sparse convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9224–9232. [Google Scholar]
  29. Choy, C.; Gwak, J.; Savarese, S. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3075–3084. [Google Scholar]
  30. Rozenberszki, D.; Litany, O.; Dai, A. Language-grounded indoor 3d semantic segmentation in the wild. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 125–141. [Google Scholar]
  31. Wang, P.-S. OctFormer: Octree-based Transformers for 3D Point Clouds. arXiv 2023, arXiv:2305.03045. [Google Scholar] [CrossRef]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  33. Lu, D.; Xie, Q.; Wei, M.; Gao, K.; Xu, L.; Li, J. Transformers in 3d point clouds: A survey. arXiv 2022, arXiv:2205.07417. [Google Scholar]
  34. Guo, M.-H.; Cai, J.-X.; Liu, Z.-N.; Mu, T.-J.; Martin, R.R.; Hu, S.-M. Pct: Point cloud transformer. Comput. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
  35. Han, X.-F.; He, Z.-Y.; Chen, J.; Xiao, G.-Q. 3CROSSNet: Cross-level cross-scale cross-attention network for point cloud representation. IEEE Robot. Autom. Lett. 2022, 7, 3718–3725. [Google Scholar] [CrossRef]
  36. Yu, X.; Tang, L.; Rao, Y.; Huang, T.; Zhou, J.; Lu, J. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 19313–19322. [Google Scholar]
  37. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 16259–16268. [Google Scholar]
  38. Wu, X.; Lao, Y.; Jiang, L.; Liu, X.; Zhao, H. Point transformer v2: Grouped vector attention and partition-based pooling. Adv. Neural Inf. Process. Syst. 2022, 35, 33330–33342. [Google Scholar]
  39. Lai, X.; Liu, J.; Jiang, L.; Wang, L.; Zhao, H.; Liu, S.; Qi, X.; Jia, J. Stratified transformer for 3d point cloud segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 8500–8509. [Google Scholar]
  40. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  41. Wu, X.; Jiang, L.; Wang, P.-S.; Liu, Z.; Liu, X.; Qiao, Y.; Ouyang, W.; He, T.; Zhao, H. Point Transformer V3: Simpler, Faster, Stronger. arXiv 2023, arXiv:2312.10035. [Google Scholar]
  42. Tian, Z.; Cui, J.; Jiang, L.; Qi, X.; Lai, X.; Chen, Y.; Liu, S.; Jia, J. Learning context-aware classifier for semantic segmentation. arXiv 2023, arXiv:2303.11633. [Google Scholar] [CrossRef]
  43. Armeni, I.; Sener, O.; Zamir, A.R.; Jiang, H.; Brilakis, I.; Fischer, M.; Savarese, S. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1534–1543. [Google Scholar]
  44. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
  45. Duan, L.; Zhao, S.; Xue, N.; Gong, M.; Xia, G.-S.; Tao, D. ConDaFormer: Disassembled Transformer with Local Structure Enhancement for 3D Point Cloud Understanding. arXiv 2023, arXiv:2312.11112. [Google Scholar]
Figure 1. The proposed model structure, which is based on sparse convolution.
Figure 2. Offset attention layer architecture.
Figure 3. Offset attention module architecture.
Figure 4. Context aggregation module architecture.
Figure 5. Sparse convolution model architecture. Note: in a sparse convolutional block, the first position represents the size of the convolutional kernels, the second position represents the number of channels, and the third position represents the number of blocks. In a sparse convolutional layer, the first position represents the size of the convolutional kernels and the second position represents the number of channels.
Figure 6. Sparse convolutional module.
Figure 7. S3DIS segmentation result. (a) Point cloud. (b) Ground truth. (c) OcspareNet.
Figure 8. ScanNetV2 segmentation result. (a) Point cloud. (b) Ground truth. (c) OcspareNet.
Figure 9. Visualization of ablation experiments on ScanNetV2. (a) Point cloud. (b) Ground truth. (c) OcspareNet (before adding the modules). (d) OcspareNet (after incorporating the modules). Note: red boxes mark areas where segmentation improves.
Figure 10. Visualization of ablation experiments on S3DIS. (a) Point cloud. (b) Ground truth. (c) OcspareNet (before adding the modules). (d) OcspareNet (after incorporating the modules). Note: red boxes mark areas where segmentation improves.
Table 1. Software and hardware settings.
Hardware | Settings | Software | Settings
CPU | Intel Core i9-10850K | System | Ubuntu 18.04
RAM | 32 GB | CUDA version | CUDA 11.3
Hard disk | 512 GB | Python version | Python 3.8
GPU | NVIDIA GeForce RTX 3070 | PyTorch version | PyTorch 1.12
Table 2. Detailed description of the S3DIS dataset.
S3DIS | Total | Auditorium | Conference Room | Copy Room | Hallway | Lobby
Region 1 | 44 | 0 | 2 | 1 | 8 | 0
Region 2 | 40 | 2 | 1 | 0 | 12 | 0
Region 3 | 23 | 0 | 1 | 0 | 6 | 0
Region 4 | 49 | 0 | 3 | 0 | 14 | 2
Region 5 | 68 | 0 | 3 | 0 | 15 | 1
Region 6 | 48 | 0 | 1 | 1 | 6 | 0
S3DIS | Lounge | Office | Open Space | Pantry | Storage | WC
Region 1 | 0 | 31 | 0 | 1 | 0 | 1
Region 2 | 0 | 14 | 0 | 0 | 9 | 2
Region 3 | 2 | 10 | 0 | 0 | 2 | 2
Region 4 | 0 | 22 | 0 | 0 | 4 | 4
Region 5 | 0 | 42 | 0 | 1 | 4 | 2
Region 6 | 1 | 37 | 1 | 1 | 0 | 0
Table 3. Detailed description of the ScanNetV2 dataset.
ScanNetV2 | Number of Scenes | Data Number
Total | 1513 scenes | 0–806
Validation | 312 scenes | 11 15 19 25 30 46 50 63 64 77 81 84 86 88 95 100 131 139 144 146 149 153 164 169 187 193 196 203 207 208 217 221 222 231 246 249 251 256 257 277 278 300 304 307 314 316 328 329 334 338 342 343 351 353–357 377 378 435 441 458 461 462 474 488 490 494 496 500 518 527 535 549 550 552 553 558 559 565 568 574 575 578 580 583 591 593 595 598 599 606–609 616 618 621 629 633 643–645 647 648 651–653 655 658 660 670 671 678 684–686 689 690 693 695–697 663–665 699 700–702 704
Train | 1201 scenes | Other
Table 4. ScanNetV2 ablation experiments.
Offset Attention | Context Aggregation | mIoU (%) | OA (%) | mAcc (%)
| | 75.22 | 91.21 | 82.69
| | 75.73 | 91.37 | 83.24
| | 76.94 | 91.72 | 85.14
Table 5. S3DIS ablation experiments.
Two Modules | mIoU (%) | OA (%) | mAcc (%)
– | 63.05 | 88.58 | 68.26
✓ | 64.57 | 89.13 | 71.09
Table 6. Comparison of experimental results in S3DIS Area 5.
Method | mIoU (%) | OA (%) | mAcc (%) | Params. (M) | Ceiling (%) | Floor (%) | Wall (%) | Beam (%) | Column (%)
PointNet [7] | 41.1 | - | 49.0 | 3.6 | 88.8 | 97.3 | 69.8 | 0.1 | 3.9
SegCloud [23] | 48.9 | - | 57.4 | - | 90.1 | 96.1 | 69.9 | 0.0 | 18.4
SPGraph [16] | 58.0 | 86.4 | 66.5 | - | 89.4 | 96.9 | 78.1 | 0.0 | 42.8
OcspareNet (ours) | 64.6 | 89.1 | 71.1 | 37.5 | 93.0 | 97.3 | 83.5 | 0.0 | 22.3
Point Transformer [37] | 70.4 | 90.8 | 76.5 | - | 94.0 | 98.5 | 86.3 | 0.0 | 38.0
PointNeXt-XL [11] | 70.5 | 90.6 | - | 41.6 | - | - | - | - | -
PTv3 [41] | 74.4 | - | - | 46.2 | - | - | - | - | -
Method | Window (%) | Door (%) | Table (%) | Chair (%) | Sofa (%) | Bookcase (%) | Board (%) | Clutter (%)
PointNet | 46.3 | 10.8 | 59.0 | 52.6 | 5.9 | 40.3 | 26.4 | 33.2
SegCloud | 38.4 | 23.1 | 75.9 | 70.4 | 58.4 | 40.9 | 13.0 | 41.6
SPGraph | 48.9 | 61.6 | 84.7 | 75.4 | 69.8 | 52.6 | 2.1 | 52.2
OcspareNet (ours) | 53.7 | 59.1 | 79.2 | 87.4 | 61.9 | 74.4 | 70.9 | 56.7
Point Transformer | 63.4 | 74.3 | 89.1 | 82.4 | 74.3 | 80.2 | 76.0 | 59.3
PointNeXt-XL | - | - | - | - | - | - | - | -
PTv3 | - | - | - | - | - | - | - | -
Table 7. Comparison of ScanNetV2 experimental results.
Method | Val. mIoU (%) | Test mIoU (%) | OA (%) | mAcc (%) | Params. (M) | Wall (%) | Floor (%) | Cabinet (%) | Bed (%)
PointNet++ [8] | 53.5 | 55.7 | - | - | - | 75.6 | 94.6 | 49.1 | 66.1
SSCN [28] | 70.8 | - | - | - | - | 83.6 | 95.1 | 65.3 | 80.7
MinkowskiNet [29] | 72.2 | 73.6 | - | - | 37.9 | 85.2 | 95.1 | 70.9 | 81.8
OctFormer [31] | 75.7 | 76.6 | - | - | 44.0 | 87.7 | 96.0 | 78.6 | 80.8
ConDaFormer [45] | 76.0 | 75.5 | - | - | - | 87.3 | 95.8 | 80.1 | 82.2
OcspareNet (ours) | 76.9 | - | 91.7 | 85.1 | 37.5 | 86.7 | 95.8 | 68.6 | 83.0
PTv3 + PPT [41] | 78.6 | 79.4 | - | - | 46.2 | 90.3 | 97.9 | 78.2 | 81.3
Method | Chair (%) | Table (%) | Door (%) | Window (%) | Bookshelf (%) | Picture (%) | Counter (%) | Desk (%)
PointNet++ | 74.4 | 49.7 | 37.5 | 51.5 | 68.6 | 20.5 | 39.2 | 45.1
SSCN | 90.4 | 72.2 | 64.3 | 60.5 | 78.0 | 31.3 | 62.5 | 58.7
MinkowskiNet | 84.0 | 68.3 | 64.3 | 72.7 | 83.2 | 28.6 | 52.1 | 66.0
OctFormer | 84.6 | 72.2 | 67.4 | 77.6 | 84.9 | 22.6 | 56.6 | 69.0
ConDaFormer | 84.9 | 67.8 | 68.0 | 75.6 | 83.6 | 28.2 | 51.6 | 65.1
OcspareNet | 92.5 | 77.9 | 74.3 | 70.6 | 86.3 | 41.4 | 70.3 | 72.5
PTv3 + PPT | 89.0 | 69.6 | 71.3 | 80.5 | 85.1 | 38.4 | 59.7 | 69.6
Method | Sofa (%) | Curtain (%) | Refrigerator (%) | Shower (%) | Toilet (%) | Sink (%) | Bathtub (%) | Other (%)
PointNet++ | 64.3 | 53.9 | 40.3 | 35.6 | 82.4 | 55.3 | 73.5 | 37.6
SSCN | 82.0 | 75.8 | 49.4 | 70.8 | 93.0 | 63.9 | 87.4 | 51.4
MinkowskiNet | 77.2 | 85.3 | 73.1 | 89.3 | 87.4 | 67.5 | 85.9 | 54.4
OctFormer | 81.5 | 87.6 | 75.3 | 90.4 | 92.3 | 77.7 | 92.5 | 57.6
ConDaFormer | 80.2 | 86.4 | 75.9 | 85.5 | 88.0 | 72.8 | 92.7 | 58.4
OcspareNet | 84.2 | 79.2 | 66.8 | 77.5 | 92.0 | 69.9 | 86.9 | 62.2
PTv3 + PPT | 79.0 | 91.6 | 79.3 | 90.7 | 96.7 | 82.1 | 94.1 | 63.5
