Language-Level Semantics Conditioned 3D Point Cloud Segmentation

In this work, a language-level semantics conditioned framework for 3D point cloud segmentation, called SeCondPoint, is proposed, where language-level semantics are introduced to condition the modeling of the point feature distribution as well as the pseudo-feature generation, and a feature-geometry-based mixup approach is further proposed to facilitate the distribution learning. To our knowledge, this is the first attempt in the literature to introduce language-level semantics to the 3D point cloud segmentation task. Since a large number of point features can be generated from the learned distribution thanks to the semantics conditioned modeling, any existing segmentation network can be embedded into the proposed framework to boost its performance. In addition, the proposed framework has the inherent ability to deal with novel classes, which is beyond the reach of current segmentation networks. Extensive experimental results on two public datasets demonstrate that three typical segmentation networks achieve significant improvements over their original performances after enhancement by the proposed framework in the conventional 3D segmentation task. Two benchmarks are also introduced for the newly introduced zero-shot 3D segmentation task, and the results likewise validate the effectiveness of the proposed framework.


I. INTRODUCTION
Point cloud semantic segmentation is a fundamental problem in the computer vision and computer graphics communities. Inspired by the tremendous success of deep neural networks (DNNs) in the 2D image analysis field, DNNs have recently been introduced to various 3D point cloud processing tasks [1], [2], [3]. However, due to the irregular structure of 3D point clouds, existing DNNs, which are usually designed for regular grid inputs (e.g. image pixels), cannot be straightforwardly applied to 3D point clouds.
To tackle this problem, a large number of DNN-based methods [4], [5], [6] have been proposed to design new network architectures for 3D point cloud segmentation, which can be roughly divided into two categories: projection-based methods and point-based methods. Projection-based methods [1], [7], [8] project irregular 3D point clouds into regular 3D occupancy voxels or 2D image pixels, where conventional 3D or 2D convolutional neural networks (CNNs) can be applied directly. Despite their decent performances, the projection-based methods usually suffer from information loss in the projection process. Recently, many point-based methods [2], [9], [10], [11] have been proposed. The point-based methods directly process 3D points by extracting point features with multilayer perceptrons (MLPs) and then aggregating neighborhood point features via a specially designed point feature aggregator or an elaborate convolution operation. Since the point-based methods directly operate on 3D points, their key is how to successfully aggregate local and long-range geometrical information. Currently, due to the lack of large-scale datasets (like ImageNet for 2D image analysis) in the 3D point cloud processing field, both projection-based and point-based methods generally suffer from the data hungry problem owing to the adoption of DNN architectures.
Data hungry, on the one hand, can be alleviated by data augmentation techniques, and some works have been proposed to augment 3D point clouds. Hand-crafted rules (e.g. random rotation and jittering) are used as general pre-processing operations by many works [2], [12]. PointAugment [13] introduced an auto-augmentation framework, which took an adversarial learning strategy to jointly optimize a point cloud augmentation network and a classification network. Following Mixup [14], an effective augmentation technique in the 2D image analysis field, PointMixup [15] and RSMix [16] mix samples in the 3D space to augment 3D point clouds. However, these approaches focus on point-cloud-level augmentation and only target the point cloud classification task; they are generally not very effective for point cloud segmentation, because point cloud segmentation is a point-level discrimination task.
Data hungry in the 3D segmentation task, on the other hand, is reflected by the fact that novel object classes usually appear in real 3D scenes, and it is impossible for human annotators to annotate all potential object classes for model learning. To tackle this novel-class problem, we introduce a new task in this work by generalizing zero-shot learning to 3D semantic segmentation, called zero-shot 3D scene semantic segmentation, where some novel 3D object classes (with no training data available) need to be classified in the testing 3D scenes. Note that although Cheraghian et al. [17] introduced zero-shot learning to 3D shape classification to recognize novel-class 3D object shapes, zero-shot 3D scene segmentation is considerably different from zero-shot 3D shape classification, because 1) the feature learning of seen-class objects and novel-class objects affects each other in the local feature aggregation process of the 3D scene segmentation task, while the feature learning of different classes in the 3D shape classification task is totally independent, since all points in a single 3D object shape belong to the same class; and 2) 3D scenes are naturally and structurally class-imbalanced.
In this work, we propose a unified framework called SeCondPoint for both conventional and zero-shot 3D point cloud segmentation, where language-level semantics are introduced to condition the modeling of the point feature distribution as well as the pseudo-feature generation. The proposed SeCondPoint first employs a conditional generative adversarial network to model the point feature distribution conditioned on semantic information, which is adversarially learned from the real distribution of point features extracted by an arbitrary existing point cloud segmentation network. A feature-geometry-based mixup approach is further proposed to facilitate the distribution learning. After training the generative model, a large number of point features can be generated from the learned distribution conditioned on the specific semantic information. With these generated point features, a semantics enhanced point feature classifier can be trained for point cloud segmentation. As illustrated in Figure 1, the advantages of the semantics enhanced classifier include: 1) for a class with a relatively small number of point samples in the training 3D point clouds, a great number of point features can be generated from the learned distribution conditioned on the corresponding semantic information of this class; hence, the decision boundary between this class and its neighboring classes in the feature space can be improved to generalize better to the testing samples; 2) due to the semantics conditioned modeling, point features of novel classes can be generated conditioned on their corresponding semantic information, which endows the learned classifier with the ability to segment novel-class objects.
In sum, the key contributions of this work include:
• We propose a unified framework (SeCondPoint) for both conventional 3D scene semantic segmentation (C3DS) and zero-shot 3D scene semantic segmentation (Z3DS), where language-level semantics are introduced to conditionally model the point feature distribution and generate point features. Any existing C3DS network can be seamlessly embedded into the proposed framework, either to improve its C3DS performance or to classify novel-class objects. To our best knowledge, this is the first attempt to utilize language-level semantics for 3D scene semantic segmentation in this field.
• Under the framework of zero-shot learning, we introduce the new task of Z3DS (zero-shot 3D scene semantic segmentation) and construct two benchmarks for algorithmic evaluation. Two baselines are also established for comparison. To our best knowledge, this is the first attempt to investigate the Z3DS problem, and the introduced benchmarks and methodology could help foster more research in this new direction.
• Extensive experimental results show that the proposed framework is not only able to significantly improve the C3DS performances of existing segmentation networks, but also to establish a competitive baseline for the Z3DS task.
The remainder of this paper is organized as follows. First, we review related works in Section II. Second, we elaborate the proposed framework in Section III, where a language-level semantics conditioned feature modeling network and a semantics enhanced feature classifier are discussed. In Section IV, the experimental setup, extensive results, and some in-depth discussions are provided. Finally, we conclude the paper and outline some future works in Section V.

B. Data Augmentation for Point Clouds
Hand-crafted rule based augmentations (e.g. random rotation and jittering) are usually used to augment 3D point clouds by many existing methods [2], [12], [39]. Recently, Li et al. [13] proposed the PointAugment framework to automatically augment point clouds, where a point cloud augmentation network was used to augment point clouds and a classification network was adopted to classify the augmented point clouds; by training the two networks adversarially, the learned classification network was expected to be enhanced. Mixup [14] has achieved huge success in the 2D image analysis field, but it cannot be used directly on 3D point clouds due to their irregular structure. Addressing this problem, PointMixup [15] mixed different parts of two point clouds by solving an optimal transport problem to augment point clouds. To reduce the loss of structural information of the original samples in PointMixup, Lee et al. [16] proposed a shape-preserving augmentation technique, called RSMix, which can partially mix two point clouds while preserving the partial shapes of the original point clouds. Our proposed framework also aims to augment training data; however, it is significantly different from these related works in three main aspects: 1) the related works are designed only for the 3D shape classification task via point-cloud-level augmentation, while ours is proposed for the 3D scene segmentation task via point-level feature distribution modeling. Considering that 3D scenes usually include millions of 3D points and such methods rely on complex optimization over 3D points, it is hard for them to adapt to the 3D segmentation scenario; 2) 3D scene segmentation faces essentially different problems from 3D shape classification in terms of feature learning and inherent class imbalance; 3) our proposed method significantly differs from the related works in its working principle.

C. Zero-Shot 2D Image Semantic Segmentation
Zero-shot image semantic segmentation [42], [43], [44] has received increasing attention only recently. Xian et al. [44] proposed a semantic projection network to perform semantic segmentation of novel classes, where visual features were projected into a semantic feature space and novel classes were then classified according to the similarities between the projected features and the novel-class semantic features. Both Bucher et al. [42] and Gu et al. [43] proposed similar methods which combined a deep visual segmentation model with a generative model to generate visual features from semantic features and trained a visual feature classifier to segment images. Compared with such zero-shot 2D image learning tasks, our introduced Z3DS task is in fact more challenging because 1) for regular 2D images, CNNs are powerful at extracting discriminative features, whereas no comparably powerful architectures seem to exist for irregular 3D point clouds; 2) point feature extraction is mainly based on geometrical information, but well-defined semantic features that align well with geometrical features for zero-shot 3D point inference are usually lacking.

D. Generative Models for Point Clouds
Recently, generative models like GAN (Generative Adversarial Network) [45], VAE (Variational Auto-Encoder) [46], and NF (Normalizing Flows) [47] have been applied to point cloud generation. Achlioptas et al. [48] introduced GAN to point cloud generation, while Gadelha et al. [49] and Zamorski et al. [50] employed VAE to generate point clouds. In such works, a point cloud was modeled with a point distribution and was synthesized by sampling points from the learned distribution. Yang et al. [51] applied NF to point cloud generation by learning a two-level hierarchy of distributions, where the first level was the distribution of shapes and the second level was the distribution of points. Unlike these methods, our method learns the point feature distribution conditioned on the semantics of each class, which is then used to train a discriminative model by sampling a large number of point features from the learned distribution for the point cloud segmentation task.

III. METHODOLOGY
We begin by introducing the tasks of conventional 3D scene semantic segmentation (C3DS) and zero-shot 3D scene semantic segmentation (Z3DS), and some necessary notations. Suppose we are given a 3D point cloud scene set P at the training stage which, for the sake of clarity, can also be represented as a 3D point set $D_{tr} = \{(x_n, y_n)\}_{n=1}^{N}$, where $x_n$ is a point in a 3D point cloud scene, $y_n$ is the label of $x_n$, belonging to the seen-class label set $Y^S$, and N is the number of points in all point clouds. Usually, $x_n$ is a (3 + a)-D feature vector with 3-D X-Y-Z Euclidean coordinates and a-D additional features (e.g. 3-D R-G-B color features). Given $D_{tr}$ and the class semantic embeddings E, the goal is to learn a mapping $f : D_{te} \rightarrow Y$, where $D_{te}$ is the testing point set and Y is the corresponding label set. Note that the 3D scene semantic segmentation problem is actually a point-level classification problem; hence we can represent a point cloud set as the corresponding point set and segment 3D scenes by classifying 3D points. It is assumed in C3DS that testing points belong only to the seen-class label set, that is to say, C3DS is unable to tackle novel-class objects that often occur in realistic scenarios. Z3DS assumes the prior that testing points belong only to the unseen-class label set. Although this prior is not always available in practice, Z3DS can demonstrate the ability of a model to infer unseen-class objects, and is of practical value with the aid of some additional modules (like a novel-class detector). GZ3DS simultaneously deals with seen classes and unseen classes, which is a more practical setting.
Here, we propose the SeCondPoint framework for 3D scene semantic segmentation, which utilizes language-level semantics to condition the modeling of the point feature distribution and the point feature generation. As shown in Figure 2, the proposed SeCondPoint consists of three parts: a backbone segmentation network for extracting point features from input 3D point clouds, a language-level semantics conditioned feature modeling network for learning the point feature distribution, and a semantics enhanced feature classifier for classifying both seen-class and unseen-class points in 3D point clouds for semantic segmentation. We would point out that an arbitrary existing (or newly designed) segmentation network can be used as the backbone network under the SeCondPoint framework. Since our main goal is to propose the SeCondPoint framework itself, rather than a novel segmentation network, we simply use existing segmentation networks as backbones here. In the following, we describe the language-level semantics conditioned feature modeling network and the semantics enhanced feature classifier respectively. Note that in the proposed SeCondPoint framework, the parameters of the backbone network are fixed, while the remaining two modules need to be learned.
A. Language-Level Semantics Conditioned Feature Modeling Network

1) Adversarial Feature Modeling: Here, we propose a language-level semantics conditioned feature modeling network to learn the conditional distribution of point features given semantic features, by introducing language-level semantic information of both seen and unseen (novel) object classes. Note that many models exist for extracting language-level semantics in the literature [52], [53], [54]; here we use the semantic embeddings of the object class names extracted by an existing language model [53], considering that our current goal is to show the usefulness of semantics conditioned feature modeling in 3D scene segmentation, rather than to compare different semantics.
As shown in Figure 2, the feature modeling network employs a conditional generative adversarial network to model the conditional distributions, where the input of the generator is the concatenation of a class semantic feature and a Gaussian noise, the output of the generator is a synthesized point feature of the corresponding object class, and the discriminator is to discriminate the real point features extracted by the backbone network from those generated by the generator. Specifically, suppose we are given a backbone network u(·), a generator G(·), and a discriminator D(·). For a given input 3D point with its corresponding class label y, we first extract the point feature $x \in \mathbb{R}^b$ of the input 3D point with the backbone network u(·). Then, the generator G(·) is used to generate a fake point feature $\tilde{x}$ conditioned on the corresponding semantic feature $e_y$, and the discriminator D(·) is to discriminate the generated fake point feature $\tilde{x}$ from the real point feature x. In order to learn a point feature distribution which cannot be discriminated from the real one by the discriminator, the generator is adversarially trained with the discriminator under the framework of the Wasserstein Generative Adversarial Network (WGAN) [55] as follows:

$$\min_{G}\max_{D}\ \mathbb{E}\big[D(x)\big] - \mathbb{E}\big[D(\tilde{x})\big] - \lambda\,\mathbb{E}\big[\big(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\big)^2\big] \quad (1)$$

where $D^{S}_{tr}$ is the labeled training point feature set extracted by the backbone network from the training 3D point clouds. The first two terms in (1) are the original objectives of WGAN, which aim to minimize the Wasserstein distance between the distribution of the generated point features and that of the real point features; $\tilde{x}$ is generated by the generator G(·) conditioned on the corresponding semantic feature $e_y$ and a standard Gaussian noise $z \sim N(0, I)$, i.e. $\tilde{x} = G(e_y, z)$. The third term in (1) is the gradient penalty for the discriminator, where $\hat{x} = \alpha x + (1 - \alpha)\tilde{x}$ with α sampled from a uniform distribution, i.e.
α ∼ U(0, 1), is used to estimate the gradients, and λ is a hyper-parameter weighting the gradient penalty term, which is usually set to 10 as suggested in [55]. By optimizing this min-max objective, the generator finally learns the conditional distribution of the real point features of each class given its semantic feature. In other words, an arbitrary number of point features for each class can be synthesized by sampling from the learned conditional distribution.
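To make the adversarial objective concrete, here is a minimal NumPy sketch of one step of the conditional WGAN-GP computation. It is not the authors' implementation: the linear generator and critic, the toy dimensions, and all variable names are illustrative assumptions; a linear critic is used only so that the input gradient needed by the penalty term is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative): b-D point features, s-D semantic features.
b, s = 16, 8

# Hypothetical linear generator: maps [e_y; z] to a fake point feature.
W_g = rng.normal(scale=0.1, size=(b, s + b))

def generate(e_y, z):
    return W_g @ np.concatenate([e_y, z])

# Hypothetical linear critic D(x) = w . x, chosen so that its input
# gradient is exactly w and the penalty term has a closed form.
w = rng.normal(size=b)

def critic(x):
    return float(w @ x)

e_y = rng.normal(size=s)        # semantic feature of one object class
x_real = rng.normal(size=b)     # "real" feature from the backbone network
z = rng.normal(size=b)          # standard Gaussian noise z ~ N(0, I)
x_fake = generate(e_y, z)       # x_tilde = G(e_y, z)

# Interpolated sample x_hat = alpha * x + (1 - alpha) * x_tilde.
alpha = rng.uniform()
x_hat = alpha * x_real + (1 - alpha) * x_fake

lam = 10.0  # gradient-penalty weight, set to 10 as in the paper
grad_norm = np.linalg.norm(w)   # ||grad_{x_hat} D(x_hat)||_2 = ||w|| here

# Critic loss (to minimize): D(fake) - D(real) + penalty; generator
# loss (to minimize): -D(fake). Together they form the min-max game in (1).
d_loss = critic(x_fake) - critic(x_real) + lam * (grad_norm - 1.0) ** 2
g_loss = -critic(x_fake)
```

In a real implementation both networks are deep MLPs and the gradient of the critic at `x_hat` is obtained by automatic differentiation rather than in closed form.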
2) Feature-Geometry-Based Mixup Training: Feature learning in 3D scene semantic segmentation is mainly based on geometrical information. That is to say, geometrically adjacent points generally have similar features in the learned feature space. At the same time, due to the feature similarity between geometrically adjacent object classes, confusing classifications often happen between these classes. Inspired by Mixup [14], here we propose a feature-geometry-based mixup training approach to increase the discrimination between adjacent object classes in the feature space. Specifically, we first compute the class feature centers $\bar{x}_c = \frac{1}{n_c}\sum_{i=1}^{n_c} x_i$ for $c = 1, \dots, C$, where $n_c$ is the number of points belonging to class c and C is the number of classes. Then we compute the Euclidean similarity matrix A between these feature centers. Next, given a point feature $x_c$ from class c, we first identify the I classes closest to class c according to the similarity matrix A, pick one of them, denoted by c′, and sample a point feature from class c′, denoted by $x_{c'}$. Finally, an intermediate feature sample is synthesized by interpolating between $x_c$ and $x_{c'}$ with a scalar β sampled from the uniform distribution U(0, 1) as:

$$\tilde{x} = \beta x_c + (1 - \beta) x_{c'}, \qquad \tilde{e} = \beta e_c + (1 - \beta) e_{c'} \quad (2)$$

where $e_c$ and $e_{c'}$ are the corresponding semantic features of $x_c$ and $x_{c'}$. According to (2), we can interpolate a large number of point features which lie between two geometrically adjacent classes in the feature space. Finally, we add these interpolated samples to the training set to train the feature generative modeling network. Here we use a hyper-parameter γ to control the scale of interpolated samples, which is defined as the ratio of the number of interpolated samples to that of the real samples.
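The mixup procedure above can be sketched as follows. This is an illustrative NumPy toy, not the paper's code: the random features, the semantic vectors, and the choice of I are assumptions, and a pairwise Euclidean distance matrix between centers (smaller distance = higher similarity) stands in for the similarity matrix A.

```python
import numpy as np

rng = np.random.default_rng(1)

C, b, s = 4, 16, 8   # toy numbers of classes / feature dims / semantic dims
feats = {c: rng.normal(loc=c, size=(32, b)) for c in range(C)}  # per-class features
sem = {c: rng.normal(size=s) for c in range(C)}                 # semantic features e_c

# 1) Per-class feature centers.
centers = np.stack([feats[c].mean(axis=0) for c in range(C)])

# 2) Pairwise Euclidean distances between centers; a small distance
#    means high similarity (this plays the role of the matrix A).
dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)   # a class is never its own neighbour

# 3) For a feature x_c of class c, pick one of the I nearest classes c2,
#    sample a feature from that class, and interpolate feature + semantics.
I, c = 2, 0
x_c = feats[c][0]
c2 = int(rng.choice(np.argsort(dists[c])[:I]))
x_c2 = feats[c2][rng.integers(len(feats[c2]))]

beta = rng.uniform()              # beta ~ U(0, 1)
x_mix = beta * x_c + (1 - beta) * x_c2
e_mix = beta * sem[c] + (1 - beta) * sem[c2]
```

The interpolated pairs `(x_mix, e_mix)` would then be appended to the real training pairs, with their proportion controlled by the ratio γ.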

B. Semantics Enhanced Feature Classifier
Once the language-level semantics conditioned feature modeling network is trained, a large number (K) of point features for each object class can be generated conditioned on the corresponding semantic feature $e_y$ and K different random noises $\{z_k\}_{k=1}^{K}$ sampled from the standard Gaussian distribution N(0, I). Specifically, we generate point features according to:

$$\tilde{x} = G(e_y, z) \quad (3)$$

where $\tilde{x}$ is the generated point feature. In the following, we describe the feature generation and classifier learning in three different tasks, i.e. C3DS, Z3DS, and GZ3DS.

1) Conventional 3D Scene Semantic Segmentation: In C3DS, the testing points are assumed to come from only seen classes; hence, we generate a large number of point features for each seen class in $Y^S$ conditioned on the seen-class semantic features according to (3). The set of generated point features and corresponding labels is denoted by $\tilde{X}^S$. Then, we train a semantics enhanced classifier $f_s(\cdot)$ with $\tilde{X}^S$ as follows:

$$\min_{f_s}\ \frac{1}{|\tilde{X}^S|} \sum_{(\tilde{x}, y) \in \tilde{X}^S} C\big(S(f_s(\tilde{x})), y\big) \quad (4)$$

where C(·) and S(·) are a cross-entropy loss function and a softmax function respectively. Note that $f_s(\cdot)$ could be any classifier, not constrained to the linear classifiers used in existing segmentation networks [39], [9], [56]. After training the classifier, given a real testing point feature x, we predict its label $\hat{y}$ by:

$$\hat{y} = \arg\max_{y \in Y^S} S\big(f_s(x)\big)_y \quad (5)$$

Finally, for a given testing 3D point cloud, we achieve point cloud semantic segmentation by classifying the feature of each 3D point extracted by the backbone network via the learned semantics enhanced feature classifier.
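As a sketch of steps (3)–(5), the toy NumPy example below generates K features per class from a stand-in generator and fits a linear softmax classifier by gradient descent. The random linear generator, the toy dimensions, and all names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(2)
b, s, K, n_cls = 16, 8, 200, 3   # feature dim, semantic dim, samples/class, classes

# Stand-in for the trained generator G(e_y, z): a fixed random linear map.
W_g = rng.normal(scale=0.3, size=(b, s + b))
sem = rng.normal(size=(n_cls, s))           # seen-class semantic features e_y

# Step (3): generate K point features per class from semantics + noise.
X, y = [], []
for c in range(n_cls):
    for _ in range(K):
        z = rng.normal(size=b)              # z ~ N(0, I)
        X.append(W_g @ np.concatenate([sem[c], z]))
        y.append(c)
X, y = np.stack(X), np.array(y)

# Step (4): train a linear softmax classifier with cross-entropy loss
# on the generated features, by plain gradient descent.
W = np.zeros((n_cls, b))
for _ in range(300):
    logits = X @ W.T
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0          # softmax cross-entropy gradient
    W -= 0.1 * (p.T @ X) / len(y)

# Step (5): classify a held-out feature by the arg-max class score.
x_test = W_g @ np.concatenate([sem[0], rng.normal(size=b)])
y_pred = int(np.argmax(W @ x_test))
```

The same recipe covers Z3DS and GZ3DS by changing which class semantics are fed to the generator (unseen classes only, or all classes).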
2) Zero-Shot 3D Scene Semantic Segmentation: Thanks to the language-level semantics conditioned point feature modeling, the proposed framework has the flexibility to segment novel-class objects in 3D point cloud scenes (in both Z3DS and GZ3DS) as long as their corresponding semantics are available, a capability beyond existing segmentation networks. In Z3DS, the task is to classify unseen-class points. To this end, we sample a large number of point features for each unseen class in $Y^U$ from the learned conditional distributions conditioned on the unseen-class semantic features according to (3). That is to say, the semantic feature $e_y$ now comes from the unseen-class label set $Y^U$, unlike in C3DS where $e_y$ comes from the seen-class label set $Y^S$. Then we train a semantics enhanced classifier $f_u(\cdot)$ in a similar way to (4), and classify the real testing unseen-class points as done in (5).
3) Generalized Zero-Shot 3D Scene Semantic Segmentation: In GZ3DS, the testing points can come from either seen classes or unseen classes. Hence, according to (3), we generate a large number of point features for every class in Y conditioned on all semantic features; the training of the semantics enhanced classifier $f_g(\cdot)$ and the point feature classification are similar to those in C3DS and Z3DS. The only difference among C3DS, Z3DS, and GZ3DS lies in their conditioning classes and classification spaces, which in turn demonstrates the flexibility of the proposed framework.

IV. EXPERIMENT
A. Experimental Setup

1) Backbone Networks and Datasets: In C3DS, we evaluate the proposed SeCondPoint framework with 3 typical 3D point cloud segmentation backbone networks: DGCNN [39], RandLA-Net [9], and SCF-Net [56]. The architectures of these networks are exactly the same as in their original papers, since we directly use their public codes and models. We choose the three networks not only because their codes and models are public, but also because they are representative: DGCNN is a graph-based architecture usually used to process block-size small-scale point clouds, while RandLA-Net and SCF-Net are two different point-based networks which can directly deal with large-scale point clouds. Two public 3D point cloud scene datasets, S3DIS [57] and ScanNet [58], are used to evaluate the proposed SeCondPoint framework. The details of the two datasets are provided in the supplemental materials due to limited space. The semantic feature of each given class is obtained simply by 1) recording its class name and 2) retrieving the corresponding word2vec embedding [53] according to the name, which is represented by a 300-D vector.
In Z3DS, we evaluate the proposed SeCondPoint framework with DGCNN and RandLA-Net as the backbone networks. Other backbones could be adapted to Z3DS seamlessly in the same way. We construct two benchmarks by re-organizing S3DIS [57] and ScanNet [58]. In S3DIS, we take the 12 semantic categories as valid categories and ignore the 'clutter' class; the valid categories are split into seen and unseen classes in 3 different manners, i.e. 10/2, 8/4, and 6/6 seen/unseen classes respectively. In ScanNet, we choose 19 semantic categories as valid categories, which are also split in 3 different ways, i.e. 16/3, 13/6, and 10/9 splits. More details are given in the supplemental materials due to limited space. For the sake of clarity, we denote the re-organized S3DIS and ScanNet as S3DIS-0 and ScanNet-0 respectively. The semantic features are the same as those in C3DS.
For Z3DS, according to the evaluation protocols in zero-shot learning [59] and 3D semantic segmentation, we use the average per-class Top-1 accuracy and the average per-class IoU on unseen classes, denoted by $mACC_u$ and $mIoU_u$ respectively, to evaluate the Z3DS performance on S3DIS-0, and only the Area-5 validation is used, considering that it can better test the model's generalization ability since 3D scenes in Area 5 have no overlap with those in other areas. On ScanNet-0, the average per-class Top-1 accuracy ($mACC_u$) and the average per-class voxel accuracy ($mVACC_u$) on unseen classes are used. In GZ3DS, as done in generalized zero-shot learning [59], we first compute $mACC_u$ (also $mIoU_u$ and $mVACC_u$) and $mACC_s$ (also $mIoU_s$ and $mVACC_s$) on unseen classes and seen classes respectively, and then their harmonic mean HACC (also HIoU and HVACC) is computed to evaluate the overall performance by:

$$HACC = \frac{2 \times mACC_s \times mACC_u}{mACC_s + mACC_u} \quad (6)$$

Here, we only describe the computation of HACC in (6); HIoU and HVACC are computed in the same way. On S3DIS-0, the overall performance is evaluated by HACC and HIoU, while on ScanNet-0, it is evaluated by HACC and HVACC. Note that the harmonic mean is used because many GZSL methods [60], [61], [62] generally suffer from a bias towards seen classes, i.e. their performances on seen classes are significantly higher than those on unseen classes; hence the harmonic mean is used here, as done in these methods, to emphasize the importance of unseen-class data classification. In C3DS, the proposed methods are compared with 10 existing state-of-the-art networks, including DGCNN [39], RandLA-Net [9], SCF-Net [56], PointNet++ [12], SPGraph [37], PointCNN [29], ACNN [32], PointWeb [38], Point2Node [23], and PointGCR [24]. Besides, we also adapt a data augmentation approach originally designed for 3D shape classification to 3D scene semantic segmentation, and compare the proposed framework with the adapted method. In Z3DS, since this work is the
first attempt to investigate the Z3DS task, we firstly establish two baseline Z3DS methods by straightforwardly introducing two zero-shot learning methods [63], [64] into Z3DS, and then we compare the proposed framework with the two baselines.
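The harmonic-mean criterion in (6) can be sketched in a few lines; the function name is ours, and the formula assumes the standard harmonic mean used in generalized zero-shot learning [59]:

```python
def harmonic_mean(m_s: float, m_u: float) -> float:
    """Overall GZ3DS score from seen-class (m_s) and unseen-class (m_u)
    accuracies, as in eq. (6); both inputs are assumed to lie in (0, 1]."""
    return 2.0 * m_s * m_u / (m_s + m_u)

# A model biased towards seen classes is penalized: a high seen-class
# score cannot compensate for a low unseen-class score.
h_biased = harmonic_mean(0.80, 0.20)     # ~0.32
h_balanced = harmonic_mean(0.50, 0.50)   # 0.5
```

The same function is applied to the IoU and voxel-accuracy pairs to obtain HIoU and HVACC.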
3) Implementation Details: The three backbone architectures used in this paper are fully consistent with the corresponding original networks. In C3DS, we directly use the public pre-trained models provided by the authors, or train the models with the public codes according to the hyper-parameters given by the authors. In Z3DS, the backbone networks are trained from scratch on the seen-class objects using the public codes and hyper-parameters. In practice, a whole point cloud is input into the models; however, only the labeled seen-class points are used to compute gradients, and the unlabeled unseen-class points are excluded from the back-propagation process. We consider that this training setting satisfies the standard of zero-shot learning and is not a transductive setting, because 1) the unlabeled unseen-class points do not provide any form of supervision signal for model learning and 2) the testing point clouds are not used at the training stage. Actually, in the zero-shot 2D semantic segmentation works [42], [43], [44], unlabeled unseen-class pixels are also input into the model at the training stage so as not to destroy the image structure.
The feature generator, discriminator, and classifier are multilayer fully-connected neural networks, whose architectures are detailed in the supplemental materials. The generator and discriminator are adversarially trained for 20 epochs, with a batch size of 32 and a learning rate of 0.0005. The point feature classifier is trained for 10 epochs, with a batch size of 4096 and a learning rate of 0.0001. All models are learned with the Adam optimizer.

B. C3DS Results on S3DIS
Here we evaluate the proposed SeCondPoint framework by embedding DGCNN [39], RandLA-Net [9], and SCF-Net [56] into it on S3DIS, under both the Area-5 validation and the 6-fold cross validation settings. Since the pre-trained models of DGCNN are not released by the authors, we first pre-train DGCNN with the public codes and the corresponding hyper-parameters provided by the authors. For RandLA-Net and SCF-Net, we directly use the pre-trained models released by the authors. The results of these pre-trained models and of the corresponding models enhanced by the proposed framework under the Area-5 validation and the 6-fold cross validation are reported in Table I and Table II respectively. We also adapt a data augmentation method (RSMix [16]), originally proposed for 3D shape classification, to the C3DS task, and the results of DGCNN, RandLA-Net, and SCF-Net augmented by RSMix are reported in the two tables too. In addition, we report the results of some state-of-the-art 3D segmentation methods in Table I and Table II for further comparison.
As noted from Table I and Table II, five points are revealed. Firstly, the performances of all three backbone networks are significantly improved by the proposed framework under both validation settings in terms of mACC, mIoU, and OA. For instance, the mACC of DGCNN is improved by 6.5% and 4.4% under the Area-5 validation and the 6-fold cross validation respectively, indicating that the proposed framework is able to generate many augmented point features from the semantics conditioned feature distributions for learning a better point classifier.
Secondly, compared with the state-of-the-art data augmentation method RSMix [16], the proposed framework improves all the backbone networks by larger margins under both validation settings. In fact, RSMix even decreases the performances of the backbone networks in some cases. This is probably because RSMix is originally designed to augment single 3D object shapes for 3D shape classification, and it can hurt the structures of 3D scenes when applied to 3D scene segmentation.
Thirdly, the improvements in mACC are more significant than those in mIoU and OA for all three backbone networks. This is mainly because, for those classes with a relatively small number of points in the original training point clouds, the diversity of their point features is largely augmented by the proposed feature modeling network, and consequently their accuracies are improved. This is consistent with the results shown in Table I and Table II, where some classes (e.g. sofa, column) that originally have low performances are significantly improved by the proposed framework. Both observations support the explanation in the caption of Figure 1.
Fourthly, the improvements under the Area-5 validation are more significant than those under the 6-fold cross validation for the three backbone networks in most cases. Considering that the validation on Area 5 is harder than that on the other areas (since the 3D scenes in Area 5 have no overlap with those in the other areas), the larger improvements under the Area-5 validation further demonstrate the generalization ability of the proposed framework.
Finally, the SCF-Net enhanced by the proposed framework outperforms the state-of-the-art methods significantly in most cases, especially in terms of mACC, by margins of about 4.2% and 3.8% under the Area-5 validation and the 6-fold cross validation respectively.

C. C3DS Results on ScanNet
We also validate the proposed SeCondPoint framework on ScanNet with RandLA-Net and SCF-Net. We do not conduct experiments with DGCNN on ScanNet because applying DGCNN to ScanNet requires splitting each whole point cloud into blocks and randomly sampling points within blocks, and since no unified splitting protocol exists in the literature, a fair comparison with existing methods could hardly be made. For both RandLA-Net and SCF-Net, the pre-processing operations (only down-sampling) are simple and unified. Since the results of RandLA-Net and SCF-Net on ScanNet are not directly reported by their authors, we first obtain their results by training them with the public code and the hyper-parameters provided by the authors; these results are reported in Table III. Then, we enhance RandLA-Net and SCF-Net with the proposed SeCondPoint framework and with a state-of-the-art data augmentation technique (i.e., RSMix [16]) respectively, and the corresponding results are also reported in Table III. We further report the results of several state-of-the-art methods in Table III for comparison. As seen from Table III, firstly, the OVA of the RandLA-Net and SCF-Net enhanced by the proposed framework are close to those of the original methods, but the mVAcc of the two enhanced methods is significantly improved, by about 6.1% and 7.5% respectively. This is because classes that originally have a relatively small number of points are augmented by the proposed framework, and the significant increases in the accuracies of those classes result in a better average per-class accuracy. Secondly, the improvements achieved by the proposed framework are significantly better than those achieved by RSMix. In fact, RSMix makes a negligible difference on both RandLA-Net and SCF-Net on ScanNet, which is probably because it was designed to augment 3D object shapes for 3D shape classification. Finally, both the enhanced SCF-Net and the enhanced RandLA-Net outperform all the comparative state-of-the-art methods by large margins, demonstrating the effectiveness of the proposed framework.

D. Z3DS/GZ3DS Results on S3DIS-0
To demonstrate the Z3DS/GZ3DS ability of the proposed framework, here we evaluate it in both Z3DS and GZ3DS by embedding existing segmentation networks (i.e., DGCNN [39] and RandLA-Net [9]) into the proposed framework on S3DIS-0 with 3 different seen/unseen splits under the Area-5 validation. The enhanced DGCNN and RandLA-Net are denoted by DGCNN-Z3DS and RandLA-Net-Z3DS respectively for easy reading. We also implement two baselines by straightforwardly embedding DGCNN and RandLA-Net into two typical zero-shot learning methods [63], [64]: [63] is a visual-to-semantic embedding based zero-shot learning method, while [64] is a GAN based method. For clarity, we denote them by 3D-V2S-embedding and 3D-GAN respectively. The results of our proposed methods and the two baselines are all reported in Table IV. Besides, the GZ3DS results of the fully-supervised model (where the labels of both seen and unseen classes are given in training) are reported, so that the performance gap between the zero-shot segmentation methods and their fully-supervised counterparts can be seen clearly.
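As a rough illustration of the 3D-V2S-embedding style of baseline, the sketch below labels each point feature by its nearest class word embedding after a learned linear projection into the semantic space. The function and variable names are ours, not those of the original implementation, and the projection matrix `W` is assumed to have been trained on seen classes beforehand.

```python
import numpy as np

def v2s_classify(point_feats, class_embeds, W):
    """Project point features into the word-embedding space with a linear
    map W, then label each point by its most cosine-similar class embedding.
    Illustrative sketch only; names and the linear map are our assumptions."""
    proj = point_feats @ W                                        # (N, d_sem)
    proj = proj / np.linalg.norm(proj, axis=1, keepdims=True)     # unit-normalize
    emb = class_embeds / np.linalg.norm(class_embeds, axis=1, keepdims=True)
    sims = proj @ emb.T                                           # (N, C) cosine scores
    return sims.argmax(axis=1)                                    # nearest class per point
```

In zero-shot inference, `class_embeds` would simply be extended with the unseen classes' word embeddings, which is what makes novel-class prediction possible without any unseen-class training samples.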
Five observations can be made from Table IV. Firstly, in Z3DS, the accuracies of both the proposed framework and the baselines exceed 90% under the 10/2 split. This demonstrates that novel-class inference can be realistic in the 3D scene segmentation task when language-level semantics are introduced, at least when the number of novel classes is small.
Secondly, the HACC and HIoU of 3D-V2S-embedding in GZ3DS are both 0, demonstrating that 3D-V2S-embedding severely suffers from the bias towards seen classes. This problem is partly caused by the fact that the feature learning of seen-class and unseen-class objects in 3D scenes is mutually dependent; since the unseen-class objects have no supervisory signal during feature learning, their features are easily biased towards those of seen classes. Note that this phenomenon differs significantly from that in generalized zero-shot 3D shape classification [17], where relatively better results are achieved, showing the difference between the two tasks.
Thirdly, the proposed framework achieves significantly superior performances over the two baseline methods in all data splits in both Z3DS and GZ3DS, especially when compared with 3D-V2S-embedding. These large improvements demonstrate the effectiveness of the proposed semantics conditioned feature modeling and the feature-geometry-based mixup training approach.
Fourthly, compared with the fully-supervised model, the GZ3DS performances of the proposed framework under the 10/2 split are relatively lower but still encouraging, considering that the unseen-class objects receive no supervisory signal at all during training under the proposed framework. These results demonstrate that novel classes in realistic 3D scenes can be handled well, at least when their number is small.
Finally, the Z3DS and GZ3DS performances of the proposed framework (and also of 3D-V2S-embedding and 3D-GAN) decrease significantly as the number of unseen classes increases. This indicates that inferring a large number of unseen classes from a limited number of seen classes remains a hard problem, and that a considerable number of seen classes is helpful for achieving high-performance Z3DS/GZ3DS. We leave this direction as future work.

E. Z3DS/GZ3DS Results on ScanNet-0
Here we validate the proposed framework on ScanNet-0 with 3 different seen/unseen splits in both Z3DS and GZ3DS. The employed backbone networks (i.e., DGCNN and RandLA-Net) and the two comparative baselines (i.e., 3D-V2S-embedding and 3D-GAN) are the same as those in the S3DIS-0 benchmark. All the results are reported in Table V. As seen from Table V, firstly, the Z3DS performances of the proposed framework are significantly better than those of the two baselines; for instance, the mACC_u of DGCNN+3D-V2S-embedding and DGCNN+3D-GAN under the 16/3 split are 57.7% and 60.4% respectively, while that of DGCNN-Z3DS reaches 63.5%. Secondly, in GZ3DS, 3D-V2S-embedding severely suffers from the bias problem, as on S3DIS-0, while the proposed framework outperforms the baselines by large margins; for instance, the HVAcc of RandLA-Net-Z3DS under the 16/3 split reaches 42.2% while that of RandLA-Net+3D-V2S-embedding is 0, which demonstrates that the proposed semantics conditioned feature modeling helps alleviate the bias problem. Besides, the improvements achieved by the proposed framework over 3D-GAN further demonstrate its effectiveness.
F. Result Analysis

1) Effect of Feature-Geometry-Based Mixup: Here we analyze the effect of the feature-geometry-based mixup approach. We conduct experiments in both the C3DS and Z3DS tasks using RandLA-Net as the backbone network under the Area-5 validation setting on S3DIS. The results of C3DS and Z3DS are shown in Figure 3-A and Figure 3-B respectively, where 'w' means training with feature-geometry-based mixup and 'w/o' means training without it. Note that the Z3DS task is performed under the 10/2 split. As seen from Figure 3, the feature-geometry-based mixup training clearly improves the model's performances in terms of mACC, mIoU, and OA in both C3DS and Z3DS. This is mainly because interpolating between two adjacent classes in the feature space enhances the local discrimination between classes.
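The interpolation idea can be sketched as a generic mixup in feature space; the paper's geometry-based rule for selecting which pairs to interpolate is not reproduced here, and all names are illustrative.

```python
import numpy as np

def feature_mixup(feat_a, feat_b, label_a, label_b, alpha=0.2, rng=None):
    """Convexly interpolate a pair of point features (e.g. drawn from two
    geometrically adjacent classes) and their one-hot labels, with the
    mixing coefficient sampled from Beta(alpha, alpha). Generic mixup
    sketch only; pairing strategy and alpha are our assumptions."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # lambda in (0, 1)
    mixed_feat = lam * feat_a + (1 - lam) * feat_b
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_feat, mixed_label
```

Training the classifier on such interpolated samples, alongside the real ones, is what smooths the decision boundary between adjacent classes.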
2) Effect of the Number of Generated Features: We investigate the effect of the number K of generated features on segmentation performance by conducting both C3DS and Z3DS experiments using RandLA-Net under the Area-5 validation setting on S3DIS. In C3DS, K for each class is set to {10000, 20000, 50000, 100000, 200000}. In Z3DS, K for each unseen class is set to {10000, 20000, 50000, 100000, 200000}, and the Z3DS task is performed under the 10/2 split. Figure 4-A and Figure 4-B show the results of C3DS (mIoU, mACC, and OA) and Z3DS (mIoU, mACC) respectively. As seen from Figure 4, the segmentation performance initially increases with the number of generated features. This is reasonable because, with more diverse point features, the trained point classifier can be more discriminative. Once the number of generated features reaches a certain scale, the segmentation performance becomes relatively stable.
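The generation step studied above might look roughly as follows, assuming a trained conditional generator `generator(z, s)` that maps Gaussian noise plus a class's semantic embedding to a point feature. This is a sketch under our own naming, not the paper's code.

```python
import numpy as np

def generate_features(generator, class_embeds, K, noise_dim, rng=None):
    """Draw K pseudo point features per class by conditioning the trained
    generator on each class's semantic embedding plus Gaussian noise.
    `generator` is a stand-in callable for the learned conditional model."""
    rng = rng if rng is not None else np.random.default_rng(0)
    feats, labels = [], []
    for c, s in enumerate(class_embeds):
        z = rng.standard_normal((K, noise_dim))       # per-sample noise
        s_rep = np.tile(s, (K, 1))                    # repeat the class embedding
        feats.append(generator(z, s_rep))             # K features for class c
        labels.append(np.full(K, c))
    return np.concatenate(feats), np.concatenate(labels)
```

The returned feature/label pairs would then be used to (re)train the point classifier; the observed saturation suggests diminishing returns once K is large enough to cover each class's feature distribution.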
3) Sensitivity to the Hyper-parameter γ: Here we analyze the sensitivity of the proposed framework to its key hyper-parameter γ, i.e., the ratio of interpolated samples to real samples. We conduct both C3DS and Z3DS experiments using RandLA-Net as the backbone network under the Area-5 validation setting on S3DIS, with γ = {0, 0.1, 0.3, 0.5, 0.7, 1.0}. The results of C3DS (mIoU, mACC, OA) and Z3DS (mIoU, mACC) are shown in Figure 5-A and Figure 5-B respectively. From Figure 5, we can see that a moderate γ improves the performance, while a too large or too small γ reduces the improvement. This is because a too small γ cannot take advantage of the feature-geometry-based mixup training, while a too large γ would let the interpolated samples overwhelm the real samples, obfuscating the decision boundary of the classifier.
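A minimal sketch of how γ could control the proportion of interpolated samples in a training batch is given below; the sampling-with-replacement detail and all names are our assumptions, not the paper's procedure.

```python
import numpy as np

def build_batch(real_feats, real_labels, mixed_feats, mixed_labels, gamma, rng=None):
    """Assemble a training batch in which interpolated (mixup) samples are
    gamma times as numerous as the real ones; gamma=0 disables mixup
    entirely. Illustrative sketch only."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n_mix = int(gamma * len(real_feats))                       # interpolated count
    idx = rng.choice(len(mixed_feats), size=n_mix, replace=True)
    batch_feats = np.concatenate([real_feats, mixed_feats[idx]])
    batch_labels = np.concatenate([real_labels, mixed_labels[idx]])
    return batch_feats, batch_labels
```

With this formulation, the trade-off reported above is immediate: γ near 0 leaves the batch dominated by real samples (no mixup benefit), while γ near 1 lets interpolated samples rival the real ones and blur the learned boundary.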
4) Visualization: To qualitatively demonstrate the effectiveness of the proposed framework, we visualize some segmentation results of the proposed framework and the comparative methods in the C3DS and Z3DS/GZ3DS experiments under the Area-5 validation setting on S3DIS. In C3DS, the comparative method is the RandLA-Net backbone. In Z3DS, the comparative methods include a conventional segmentation network (i.e., RandLA-Net trained with the 10 seen classes) and 3D-V2S-embedding; the 10/2 seen/unseen split is employed. The results of C3DS and Z3DS/GZ3DS are shown in Figure 6 and Figure 7 respectively. As shown in Figure 6, in C3DS, the bookcase (purple) in the left view and the door (yellow-green) in the right view are classified more accurately by RandLA-Net+SeCondPoint, demonstrating that the proposed framework can improve the segmentation performance of the backbone network. From Figure 7, we can see that 1) the conventional segmentation model always wrongly classifies unseen-class objects into seen classes, since it has no ability to recognize them; 2) in Z3DS, the unseen-class objects (window (green) and table (dark gray)) are well recognized by both the baseline 3D-V2S-embedding and the proposed SeCondPoint, demonstrating that the introduced language-level semantics are effective for novel-class inference; and 3) in GZ3DS, the segmentation of unseen classes is significantly worse than in Z3DS for both 3D-V2S-embedding and our proposed method, showing the severe bias problem. Nevertheless, compared with 3D-V2S-embedding, the proposed framework significantly alleviates the bias problem, successfully inferring a large number of novel-class (window (green) and table (dark gray)) points.

V. CONCLUSION AND FUTURE WORK
In this paper, we propose a unified SeCondPoint framework for both conventional 3D scene semantic segmentation (C3DS) and zero-shot 3D scene semantic segmentation (Z3DS), where language-level semantics are introduced to condition the modeling of the point feature distribution, and a large number of point features from both seen and unseen classes can be generated from the learned distribution. The proposed framework has the flexibility not only to enhance the performances of existing segmentation networks by augmenting their feature diversity, but also to endow existing networks with the ability to handle unseen-class objects. The C3DS experimental results on two public datasets demonstrate that the proposed framework is able to significantly boost the performances of existing segmentation networks. Two benchmarks are also introduced for research on the newly introduced Z3DS task, and the results on them demonstrate the effectiveness of the proposed framework.
In the future, a number of directions are worth further exploration. First, our newly introduced Z3DS task could be considered a practical setting for 3D scene segmentation, since novel classes are often encountered in realistic scenarios; inspiration could then be taken from recent advanced zero-shot learning methods and from the characteristics of 3D scene data (such as geometrical knowledge). For example, the bias problem in GZ3DS is expected to be alleviated by resorting to novel-class detectors. In addition, in the current semantics conditioned feature modeling framework, simple word embeddings are used as semantic features and a GAN is employed to model feature distributions. In fact, under the proposed framework, the semantics are not constrained to word embeddings. Higher-quality semantics, such as sentence embeddings that describe 3D objects and their spatial relationships, or semantic attributes with explicit meanings pre-defined by experts, could also be used to achieve better performances. Besides, the way language-level semantics are used is not necessarily constrained to modeling the feature distribution with a GAN: other generative models, such as VAEs and normalizing flows, could also be used, and other advanced embedding networks could be used to jointly learn point representations and semantic representations. In a word, the present work introduces a promising semantics conditioned framework, and its key ingredients are open to much further improvement.

Fig. 1 :
Fig. 1: Blue/red/purple ellipses represent the real data distributions of the blue/red/novel classes respectively. Solid blue/red triangles represent samples from the seen blue/red classes in the training set, and solid purple triangles represent novel-class samples not included in the training set. Hollow blue/red/purple triangles represent samples of the blue/red/novel classes generated by our SeCondPoint. The yellow dashed line represents the decision boundary determined by the unenhanced classifier, and the yellow solid line represents the decision boundary determined by the enhanced classifier. In A, the decision boundary of the unenhanced classifier lies within the distribution of the blue class, and the novel class cannot be discriminated from the other two classes. In B, the semantics enhanced classifier is improved in two respects: 1) the decision boundary between the blue and red classes is pushed close to the real boundary, giving better generalization when classifying testing samples; 2) the semantics enhanced classifier can classify the novel class.

Fig. 2 :
Fig. 2: The proposed SeCondPoint framework consists of three parts: a pre-trained backbone segmentation network for extracting point features from input 3D point clouds; a language-level semantics conditioned feature modeling network for learning point feature distributions conditioned on the corresponding semantic features; and a semantics enhanced feature classifier for classifying both seen-class and unseen-class points in 3D point clouds. Note that the parameters of the pre-trained backbone network are fixed, while those of the feature modeling network and the feature classifier need to be learned.

Fig. 6 :
Fig. 6: Visualization of the C3DS results of RandLA-Net and its enhanced version by the proposed SeCondPoint framework.

TABLE IV :
Comparative results on the S3DIS-0 dataset.

TABLE V :
Comparative results on the ScanNet-0 dataset.