Self-Supervised Learning for Point Clouds Data: A Survey

3D point clouds are a crucial type of data collected by LiDAR sensors and widely used in transportation applications due to their concise descriptions and accurate localization. Deep neural networks (DNNs) have achieved remarkable success in processing large amounts of disordered and sparse 3D point clouds, especially in various computer vision tasks, such as pedestrian detection and vehicle recognition. Among all the learning paradigms, Self-Supervised Learning (SSL), an unsupervised training paradigm that mines effective information from the data itself, is considered an essential solution to the time-consuming and labor-intensive data labelling problem via smart pre-training task design. This paper provides a comprehensive survey of recent advances in SSL for point clouds. We first present an innovative taxonomy, categorizing the existing SSL methods into four broad categories based on the pretexts' characteristics. Under each category, we then further categorize the methods into more fine-grained groups and summarize the strengths and limitations of the representative methods. We also compare the performance of the notable SSL methods in the literature on multiple downstream tasks on benchmark datasets both quantitatively and qualitatively. Finally, we propose a number of future research directions based on the identified limitations of existing SSL research on point clouds.


INTRODUCTION
With the rapid development of 3D data processing technologies, an increasing number of relevant applications have emerged in both industrial and daily usage, such as indoor navigation [1], autonomous driving [2], and object modeling [3]. In particular, LiDAR is one of the indispensable sensors for capturing disordered 3D point cloud data from complicated traffic scenes, which can be further parsed to accomplish challenging tasks like pedestrian detection [4] and road semantic segmentation [5] by exploiting the strong inference ability of deep neural networks (DNNs). However, several issues in current supervised point cloud DNNs hinder their further development. First, accurate environment perception via reliable point cloud DNNs requires millions of labeled samples as the input source, while point cloud annotation is labor-intensive and time-consuming due to the disordered and sparse nature of the data [6]. Besides, manual labeling inevitably leads to mistakes such as mislabeling and omission due to the excessive number of points to be labeled. Another long-standing problem is that the supervised learning paradigm struggles to capture the underlying patterns of new data and fails to generalize the pre-trained model to downstream tasks because of overfitting caused by noisy labels [7].

Fig. 1: (1) Pre-training stage: the point cloud data is first pre-processed through the augmentation block and then fed into the point-specific encoder to capture feature representations. The features are then used to complete a well-designed pretext task, whose output is compared with the pseudo label derived from the original data to generate a loss and update the encoder parameters via back-propagation. (2) Supervised fine-tuning stage: the well-trained encoder is transferred to the target domain. A task head is trained with the training labels in a supervised manner to complete the downstream task. (3) Inference stage: the encoder and task head are concatenated as a model to execute inference on the test set. The effectiveness of the SSL pre-training framework can be evaluated through the performance of the model on the downstream task.
arXiv:2305.11881v1 [cs.CV] 9 May 2023

The aforementioned issues motivate research development in extracting effective feature representations in the point cloud field in a self-supervised learning (SSL) manner, learning the implicit representation embedded in the data without manual labels. Not only does this solve the problem of difficult and expensive labeling, but it also helps relieve domain adaptation (DA) issues [8] and improves model generalization ability. In the SSL paradigm, basic geometric as well as advanced semantic information can be extracted as knowledge encoded in the form of parameters and migrated to downstream tasks under the transfer learning setup. This process approximates human learning, which discovers objective principles of the world by observing phenomena and summarizing them into a system of experience and knowledge. Fig. 1 shows the general pre-training pipeline of SSL on point cloud data. Overall, the goal of SSL is to pre-train an encoder on an unlabeled large-scale point cloud dataset (source domain) and transfer the well-trained network to other datasets (target domain) for various downstream tasks. Commonly, the complete SSL framework contains the following modules:
• Data augmentation: The raw input point cloud data is augmented into numerous versions through easy-to-implement preprocessing operations such as translation, rotation, flipping, and adding noise [9]. The objective is to expand the diversity of the data and provide subjects for subsequent pretext tasks. The details will be discussed in Section 2.4.
• Encoder: The encoder is a point-specific network that captures the hierarchical representation of the input point cloud data. In this paper, we will introduce some commonly used point cloud encoders that either learn from downsampling layer-by-layer [10], [11] or focus on the local area to capture the association between blocks [12], [13]. The details will be discussed in Section 2.5.
• Pretext task: The core of the whole framework is to design a pretext task that mines the hidden self-supervision signal via the interaction between the encoder and data. This part is the focus of the whole review and will be discussed in detail in Section 3.
• Knowledge transfer: The well-trained encoder will be transferred to another dataset with the knowledge gained in the source domain after completing the pretext task. A task head is then constructed and trained by a few labels in the target domain as the supervision signals to fine-tune the whole architecture. The details will be discussed in Section 4.
• Downstream task: To evaluate the effectiveness of the SSL framework, the pre-trained encoder will be transferred and evaluated on another dataset for performance tests such as object classification, part segmentation, and object detection. A higher evaluation metric for the downstream task represents a better SSL pre-training effect. The details will be discussed in Section 4.
Thriving progress has been made in the field of point cloud SSL recently. However, there is no systematic survey summarizing these prominent research works except [14], which outlined unsupervised point cloud representation learning. Moreover, [14] lacked a summary of the state-of-the-art SSL models as well as a detailed demonstration of rare approaches. Therefore, we were motivated to summarize remarkable SSL work in the point cloud field from the last few years and produce a review paper providing a careful explanation of each module of the point cloud SSL framework displayed in Fig. 1. Compared with [14], we propose an advanced taxonomy of the SSL pretext tasks, shown in Fig. 2, which classifies the current pretext tasks more reasonably and methodically according to their characteristics. Besides, our paper contains more comprehensive content, for example, the module on point cloud data augmentation, a description of point cloud data properties, and the way different proxy tasks generate pseudo labels. Our contributions can be summarized as follows:
• Systematic and novel taxonomy: We propose a novel and systematic taxonomy to formalize the diverse kinds of point cloud SSL methods. This new classification criterion reflects our understanding from an advanced perspective and categorizes all current methods into four broad categories. Each broad category can be further subdivided into more fine-grained sub-categories according to the variations in feature-utilizing aspects, as detailed in Fig. 2.
• Comprehensive and detailed summary: We conduct a comprehensive review of SSL point cloud methods, including the background of SSL as well as point clouds, commonly used point cloud datasets and models, four types of point cloud SSL pretext tasks, various downstream tasks with performance comparison, and practical future directions. For each section, we explain each conceptual approach clearly using standard academic expressions with images, tables, and formulas.
• Exhaustive dataset summary and comparison: We summarized the unique characteristics of the 18 most frequently utilized datasets in the point cloud field. In addition, we also compared the performance of different SSL point cloud methods on these datasets separately according to various downstream tasks.
• Feasible future directions: Based on current research progress, we discuss the present technique limitations and propose potential improvement directions in the point cloud SSL field. These ideas combine the latest thoughts and trends in other fields to make point cloud SSL methods more generalizable and advanced.
The rest of the survey is organized as follows: Section 2 introduces the background of SSL and point cloud data, containing the SSL development history, point cloud properties, point cloud datasets summary, point cloud data augmentations, commonly used point cloud models, pseudo labels, and loss functions. Section 3 demonstrates the proposed taxonomy with exhaustive instances elaboration. Section 4 summarizes the frequently utilized downstream tasks and compares the performance of different SSL methods on these tasks. Section 5 discusses the limitations of current approaches and proposes potential future directions. Finally, Section 6 concludes the whole review.

Self-supervised learning in the language and image domain
Before introducing SSL methods in the point cloud field, we will briefly describe the development history of SSL methods in the language and image domains. The purpose is to give readers a general understanding of SSL methods and their core ideas across different domains.
Natural Language Processing (NLP) is where SSL was first developed. After the revolutionary research Word2Vec [15] was released, NLP models began to trend towards label-dependency removal and rapidly progressed by leveraging these SSL paradigms across many NLP pretext tasks. Specifically, after converting words into vectors and utilizing the relationship between the representation and context, models could learn semantic representations from neighboring words or sentences through task formulations such as next sentence prediction [16], auto-regressive language modeling [17], or sentence permutation [18]. During this exploration, landmark network models such as GPT [17] and BERT [16] were born, and their variants were successfully extended to other fields with promising performance.
In the field of images, SSL algorithms are constantly updated in the design of pretext tasks. From simple tasks like relative position prediction [19], [20] and rotation angle prediction [21], to reconstructing the block being masked by surrounding visible pictures [22], [23], different self-supervised methods impose simple variations on image data and extract features by recovering to the original input. In recent years, the academic community has converged its attention on the paradigm of contrastive learning [24]-[26], which aims to differentiate positive and negative samples by comparison using data augmentation techniques.
Additionally, free semantic label-based [27]-[30] and cross-modal-based methods [31]-[33] have been proposed, which learn significant representations via automatically generated semantic labels and extra information from other modalities, respectively. All aforementioned significant works inspired the development of SSL in the point cloud field. Similar ideas can be transferred from 2D to 3D, provided they are adapted to the peculiarities of the data.
Although data types vary from domain to domain, the core idea of SSL remains the same: leverage data characteristics for transformation processing and make transformed data consistent with the original input in terms of feature representation by contrasting or reconstruction.

Properties of the point cloud data
As mentioned above, data properties are distinct between language, image, and point cloud. In this section, we will focus on describing the unique features that distinguish point cloud data from the other two.
Language is a complex and abstract data type that contains ambiguous information due to its versatility and richness. It is expressed by a sequence of words, which are discrete and unstructured in the representation space [25]. In contrast, images are more intuitive and explicit for human perception. They are 2D data consisting of a matrix of pixel values that represent the color, texture, and shape information of an object in high-dimensional space [34].
Point cloud data is similar to image data in terms of visual format and could be regarded as 3D stereo images with depth information. However, the attributes of point cloud data are completely different in geometric representation. Specifically, a point cloud is a collection of discrete, disordered, and topology-free 3D points. The most basic information contained in a point is its position coordinate (x_i, y_i, z_i) in Euclidean space, where i is the index of the point in the object. There are also other optional attributes such as color, intensity, and reflectivity, specifying the physical properties of the point cloud in more detail. The input order is irrelevant for point cloud data and does not affect its semantic meaning, whereas it is crucial for images and language, where different word or pixel sequences lead to totally divergent connotations. Additionally, point cloud data is invariant to rigid transformation, which means that its shape and semantic meaning remain unchanged after rotation and translation. Such exclusive properties could be summarized as follows:
• Sparsity: The point cloud data is discretely distributed on the surface of the scanned object or scene.
• Non-uniformity: The distance between points is not fixed and is determined by various factors such as the instrument's sampling strategy, relative position, and scanning range.
• Incomplete data: Some parts of real-scanned surfaces are incomplete due to self or external occlusion.
• Data noise: It is inevitable that noise from environmental factors or inaccuracies in instruments will be present.
• Permutation invariance: The order of points does not affect the overall semantic representation of point cloud objects, so identical point cloud objects can be expressed by various matrices.
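The permutation invariance property can be illustrated with a small sketch. It assumes a symmetric aggregation function (here a channel-wise maximum, the kind of function discussed later for PointNet) applied to a toy point set; the helper name `max_pool` and the random data are hypothetical and only serve the illustration.

```python
import random

def max_pool(features):
    """Channel-wise maximum over a list of per-point feature vectors."""
    return [max(channel) for channel in zip(*features)]

rng = random.Random(0)
# Toy point cloud: 8 points with (x, y, z) coordinates (illustrative data).
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]

# The same point cloud stored in a different row order.
shuffled = points[:]
rng.shuffle(shuffled)

pooled_original = max_pool(points)
pooled_shuffled = max_pool(shuffled)

# A symmetric function yields the same global descriptor for both orderings.
print(pooled_original == pooled_shuffled)
```

Any order-sensitive operation (for example, flattening the matrix and feeding it to a fully connected layer) would not share this property, which is why point-specific encoders rely on symmetric aggregation.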

Point Cloud Dataset
As deep learning continues to be explored in the 3D field, the need for 3D point cloud datasets is becoming increasingly pressing. There is no doubt that SSL is facilitated by the availability of complete, well-varied, and densely-labeled point cloud datasets. This section lists the commonly used point cloud datasets with brief descriptions and summarizes them in Table 1 in terms of sample number, object categories, suitable tasks, and highlighted contributions. It is worth mentioning that point cloud data in this paper is understood in the broad sense, covering single frames and time series, individual objects and complex scenes, as well as synthetic and real scanned formats. Furthermore, there are a few autonomous driving datasets containing extra modalities, like images or radar, forming complex traffic scenarios together with point cloud data.
• KITTI [35] is a benchmark suite for autonomous driving vision tasks. The dataset was collected using several pieces of equipment, including four video cameras, a laser scanner, and a localization system. This comprehensive outdoor dataset includes not only point clouds but also stereo and optical flow data. For the 3D object aspect, there are more than 200 thousand annotated point cloud scenarios consisting of diverse cars and pedestrians, providing a novel challenging benchmark for 3D object detection and orientation estimation.
• ModelNet [36] is the most widely used 3D point cloud CAD dataset for object classification and few-shot learning. This dataset was aggregated and modified from previous smaller datasets by removing defective categories and unrealistic models. It contains a total of 12,311 single objects from 40 categories, with each point composed of six dimensions of information, including XYZ spatial coordinates and RGB values.
• ShapeNet [37] is a relatively large-scale repository of 3D CAD objects frequently employed as a pretraining dataset. It contains more than 3 million samples categorized into 55 classes under the WordNet synsets [52] criteria. The annotations in the dataset are versatile, including rigid alignments, parts, physical sizes, and key points.
• SUN RGBD [38] is an RGB-D scene understanding benchmark suite containing 10,335 samples at a comparable scale to PASCAL VOC [53]. It has 146,617 2D polygons and 64,595 3D bounding boxes densely annotated to indicate object orientation, room layout, as well as scene category for overall scene awareness.
• S3DIS [41] is a 3D indoor venues dataset that scans 272 rooms in 6 areas covering 6,000 m². It has 13 semantic categories labeled by fine-grained point-wise annotations carrying full 9D information, including XYZ, RGB, and normalized location coordinates.
• ScanNet [6] is a 3D RGB-D dataset that comprises 2.5M views in 1,513 scenes acquired in 707 indoor environments. Various tests, including semantic voxel labeling and CAD model retrieval, showed that ScanNet can provide quality data for 3D scene understanding.
• ScanObjectNN [45] was proposed as a collection of real-world indoor point cloud scenes to break the performance saturation of 3D object classification on synthetic data. This dataset introduces new challenges for 3D object classification due to the presence of background noise and occlusions, which require networks to handle context-based reconstruction and partial observations.
• Waymo [48] is a large autonomous driving dataset produced by Waymo in collaboration with Google Inc. The dataset consists of 1,150 urban and suburban geography scenes, each spanning 20 seconds, which are collected via well-synchronized and calibrated LiDAR and camera.
• NuScenes [49] is another remarkable multimodal dataset provided by the full sensor suite including cameras, radars, and LiDAR. Compared to other autonomous driving datasets, it contains additional annotations like pedestrian pose, vehicle state, and also scenes from nighttime and rainy weather.

Fig. 3: There are a total of 14 sub-categories of data augmentation methods that could be classified as three general corruption families. The figure is adapted from [9].

Point cloud data augmentations
It is well known that data augmentation is a crucial technique for enhancing DNNs' performance by increasing the amount and diversity of training samples. Especially for SSL tasks, it not only prevents the model from overfitting but also facilitates capturing robust and invariant representations of the point cloud under multiple transformations. In this section, we will introduce several commonly used data augmentation methods for point cloud data and compare the effectiveness of each approach via a specific metric. Essentially, data augmentation is a process of generating new data by adding interventions or corruptions without destroying the original semantic expression. In the point cloud field, data augmentation methods are applied based on the point cloud properties mentioned in Section 2.2 and can be classified into three general categories: density/masking, noise, and affine transformation [9]. These three corruption families can be further divided into the 14 sub-categories displayed in Fig. 3.
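As a minimal sketch of the three corruption families, the snippet below implements one representative of each: random point dropping (density/masking), Gaussian jitter (noise), and a rotation about the z-axis (affine transformation). The function names, the drop ratio, the noise scale, and the rotation angle are illustrative assumptions, not the specific settings studied in [9].

```python
import math
import random

def random_drop(points, drop_ratio, rng):
    """Density/masking: keep a random subset of the points."""
    keep = max(1, round(len(points) * (1.0 - drop_ratio)))
    return rng.sample(points, keep)

def jitter(points, sigma, rng):
    """Noise: add zero-mean Gaussian noise to every coordinate."""
    return [[c + rng.gauss(0.0, sigma) for c in p] for p in points]

def rotate_z(points, angle):
    """Affine transformation: rotate every point about the z-axis."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [[cos_a * x - sin_a * y, sin_a * x + cos_a * y, z] for x, y, z in points]

rng = random.Random(0)
# Toy point cloud with 100 points (illustrative data).
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]

dropped = random_drop(cloud, drop_ratio=0.3, rng=rng)   # assumed ratio
noisy = jitter(cloud, sigma=0.01, rng=rng)              # assumed noise scale
rotated = rotate_z(cloud, angle=math.pi / 4)            # assumed angle

print(len(dropped), len(noisy), len(rotated))
```

Note that the rotation changes every coordinate while preserving all pairwise distances, whereas dropping and jitter leave most coordinates nearly intact but alter density or local geometry.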
Density/masking is the most frequent data augmentation method adopted in masked autoencoder (MAE) type SSL research [23], [54], [55]. Because point cloud data is sparse with uneven density, randomly removing a certain percentage of points while preserving part of the semantic expression presents a challenging learning objective for such MAE-based tasks. In contrast, the noise method imposes interventions on originally clean input to increase the difficulty of feature extraction. Affine transformation, consisting of seven different approaches, leverages the invariance characteristics of point clouds to shift the spatial coordinates of each point, which has the most significant impact on the input since the basic position information changes completely. Similar to SimCLR [24], Zhang et al. [9] investigated the effectiveness of the aforementioned data augmentation methods as pretext data preprocessing on downstream classification tasks. Task relatedness is employed as the evaluation metric to statistically measure the performance of SSL models on downstream tasks, which provides valuable advice for proxy data augmentation selection. Following [56], for each pretext task c, its task relatedness to downstream task t is defined as:

$$\mathrm{Relatedness}(c \to t) = \mathbb{E}_{x \in X}\, I_t\big(R_c(E_c(x)),\, f_t(x)\big)$$

where x is a sample in a point cloud dataset X, E_c is the encoder pre-trained on task c, R_c is a readout function, i.e., the classification head composed of several fully connected (FC) layers, f_t is the labeling function, and I_t is the accuracy measurement estimating whether the downstream output R_c(E_c(x)) conforms to the ground truth f_t(x). Such accuracy-based task relatedness is the metric used to measure the effectiveness of data augmentation methods.
To further explore the relationship between task relatedness and classification accuracy on downstream tasks, the Pearson correlation coefficient r and the p-value are utilized to estimate the linear relationship and its statistical significance [57], respectively, where |r| > 0.5 refers to a strong relationship and p < 0.05 is considered statistically significant. Fig. 4 demonstrates the statistically significant linear relationship between task relatedness and classification accuracy on downstream tasks, with r = 0.89 and p < 0.001. The results reveal a counter-intuitive fact: the frequently used density/mask and noise-based data augmentation methods are ineffective for downstream tasks in terms of both accuracy and task relatedness. Conversely, the seemingly simplest affine transformation enhances task relatedness to point cloud classification, resulting in higher accuracy. Furthermore, the combined corruptions of affine transformation and mask can even approach the supervised benchmark. Hence, using affine transformation-based methods for data augmentation is preferable for point cloud data preprocessing in SSL pre-training.

Table 2: Commonly used models to extract point cloud features.

| Model | Year | Architecture | Contributions |
|---|---|---|---|
| PointNet [10] | 2017 | CNN | Pioneer in direct processing of raw point clouds with lightweight architecture |
| PointNet++ [11] | 2017 | CNN | Aggregating local neighborhood by multi-scale and multi-resolution sampling and grouping |
| VoxelNet [12] | 2018 | 3D CNN | Partitioning disordered point clouds into regular voxels for local feature learning |
| DGCNN [13] | 2019 | Graph CNN | Constructing a dynamic local graph to capture edge features around a neighbor |
| PCT [58] | 2021 | Transformer | Successfully gaining the long-range dependencies between point patches |
| GANs [59] | 2014 | GAN | Generating synthetic data through adversarial training |

Commonly used models for point cloud learning
Although the SSL idea can be extended from language and images to point clouds, different data peculiarities lead to distinct network structures for corresponding preprocessing. For instance, traditional CNN networks cannot handle irregular and discrete point cloud data well since there is no guarantee that a corresponding point exists at the same relative position of the convolution. In this section, we introduce five point cloud networks that are frequently used as feature extraction encoders in the self-supervised paradigm and summarize their respective characteristics in Table 2 for intuitive presentation.

PointNet
To reduce data size and complexity, Qi et al. proposed PointNet [10], which is the pioneering work to extract features directly on the raw, unprocessed point cloud. It is widely deployed as the feature extractor [60]-[62] in the SSL framework due to its simple and lightweight network structure. Taking advantage of point permutation invariance, PointNet aligns the input points to a canonical space and aggregates global features by a symmetric function such as max pooling. However, it fails to capture local structures induced by the metric space in which the points reside, thereby limiting its ability to recognize fine-grained patterns and generalize to complex scenes. The updated version, PointNet++ [11], was put forward several months later. It adopts multi-scale and multi-resolution sampling and grouping strategies to propagate features from one level to another, which further improves the feature learning ability. Furthermore, the point patch generation strategy combining Farthest Point Sampling (FPS) and K-Nearest Neighbor (KNN) provides a template for point cloud cropping preprocessing in subsequent studies [54], [55], [63].
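The FPS and KNN patch generation strategy mentioned above can be sketched as follows. This is a plain-Python illustration under stated assumptions: the function names, the choice of the first point as the starting center, and the toy sizes (256 points, 8 centers, 16 neighbors) are hypothetical and not taken from any of the cited implementations.

```python
import math
import random

def farthest_point_sampling(points, num_centers, start=0):
    """Greedily pick indices so that each new center is farthest from those already chosen."""
    chosen = [start]
    # Distance from every point to its nearest chosen center so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_centers:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(min_dist, points)]
    return chosen

def knn_patch(points, center_idx, k):
    """Indices of the k points closest to the given center (the center itself included)."""
    center = points[center_idx]
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]

rng = random.Random(0)
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(256)]

centers = farthest_point_sampling(cloud, num_centers=8)
patches = [knn_patch(cloud, c, k=16) for c in centers]

print(len(centers), "patches of", len(patches[0]), "points each")
```

FPS spreads the centers over the whole shape, and KNN then gathers a local neighborhood around each center, so the resulting patches may overlap but jointly cover the object.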

VoxelNet
VoxelNet [12] is a generic point-specific network that uses voxels, a kind of finite unit cube, to divide and access a local representation of the point cloud for 3D detection tasks [64]-[66]. This network partitions disordered point clouds and performs feature learning in quantized and fixed-size 3D structures. One innovation is the stacked Voxel Feature Encoding (VFE) layers, which encode interactions between points within a voxel and capture descriptive appearance information. The output of each VFE layer is the concatenation of point-wise features and locally aggregated features so that local features are better captured. However, the expensive computation of voxel construction and the quantization artifacts limit the model's ability to capture high-resolution or fine-grained representations.
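The voxel partitioning step can be sketched as below. This illustration only groups points into fixed-size cells and computes a per-voxel centroid as a stand-in for a locally aggregated feature; it is not the learned VFE layer of VoxelNet, and the function names, voxel size, and toy data are assumptions.

```python
import math
import random
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group point indices by the integer voxel cell that contains them."""
    voxels = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(c / voxel_size) for c in p)
        voxels[key].append(idx)
    return voxels

def voxel_centroids(points, voxels):
    """A simple locally aggregated feature: the mean position of the points in each voxel."""
    return {
        key: [sum(points[i][d] for i in idxs) / len(idxs) for d in range(3)]
        for key, idxs in voxels.items()
    }

rng = random.Random(0)
cloud = [[rng.uniform(0, 2) for _ in range(3)] for _ in range(200)]

voxels = voxelize(cloud, voxel_size=0.5)      # assumed cell size
centroids = voxel_centroids(cloud, voxels)

print(len(voxels), "occupied voxels")
```

Only occupied voxels are stored, which reflects the sparsity of point clouds; a smaller voxel size preserves finer detail at the cost of more cells, in line with the resolution trade-off noted above.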

DGCNN
A point together with its neighbors can reflect the geometric properties of a local point cloud. Such a local neighborhood relationship can be expressed naturally by a graph network. Therefore, Wang et al. proposed a dynamic graph CNN (DGCNN) [13] that encodes the edge features between vertices. Instead of learning point representations directly, DGCNN depicts the interactions between points and their connecting edges in both Euclidean and semantic space and alters the graph structure dynamically as the learnable parameters are updated. This graph network-based architecture has served as a backbone in many subsequent point cloud SSL models with good results [61], [62], [67].
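A minimal sketch of the graph construction idea is given below. It builds a k-nearest-neighbor graph in the current feature space and forms, for each edge, the concatenation of the center feature x_i and the offset x_j - x_i, which is one common choice of edge feature; the learned layers applied on top of these features are omitted. Function names, k, and the toy data are assumptions.

```python
import math
import random

def knn_graph(features, k):
    """For every vertex, the indices of its k nearest neighbors in feature space."""
    neighbors = []
    for i, f in enumerate(features):
        others = (j for j in range(len(features)) if j != i)
        order = sorted(others, key=lambda j: math.dist(features[j], f))
        neighbors.append(order[:k])
    return neighbors

def edge_features(features, neighbors):
    """Per-edge feature: center feature concatenated with the neighbor offset."""
    return [
        [f + [b - a for a, b in zip(f, features[j])] for j in nbrs]
        for f, nbrs in zip(features, neighbors)
    ]

rng = random.Random(0)
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(32)]

# Layer 1 builds the graph from raw coordinates (Euclidean space). In a deeper
# layer, the same function would be called on learned features instead, so the
# graph changes as the features change.
neighbors = knn_graph(cloud, k=4)
edges = edge_features(cloud, neighbors)

print(len(edges), len(edges[0]), len(edges[0][0]))
```

Recomputing `knn_graph` on the output features of each layer is what makes the graph dynamic: two points far apart in space can become neighbors if their learned features are similar.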

GANs
Generative Adversarial Networks (GANs) [59] are a widely used framework in reconstruction-based pretext tasks for point cloud knowledge mining. GANs consist of two submodules: the generator, which generates point clouds similar to the training data, and the discriminator, which distinguishes between generated and real point clouds. These two modules are trained under an adversarial paradigm without any supervision. The framework can be formulated as a two-player minimax game:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim X}\big[\log D(x)\big] + \mathbb{E}_{z \sim Z}\big[\log\big(1 - D(G(z))\big)\big]$$

where D and G denote the discriminator and the generator of the GAN, and X and Z represent the data and noise distributions, respectively.
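To make the two-player objective concrete, the sketch below estimates the standard GAN value function of [59], V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], from a batch of discriminator outputs. The discriminator scores are made-up numbers for illustration, and `gan_value` is a hypothetical helper name.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real: D(x) for real samples; d_fake: D(G(z)) for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from generated samples well (illustrative scores).
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])

# A discriminator that cannot tell them apart and outputs 0.5 everywhere.
confused = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident > confused)
```

The discriminator is trained to increase this value, while the generator is trained to decrease it by pushing D(G(z)) towards 1; when the discriminator outputs 0.5 everywhere, the value equals 2 log 0.5.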

Transformers
Transformers are currently among the most prevalent architectures across fields. They benefit from the multi-head self-attention mechanism, which allows them to easily capture long-range dependencies between point patches and discover implicit regional correlations. The state-of-the-art performance of SSL point cloud classification and part segmentation is achieved by transformer-based models such as the one proposed by Zhang et al. [63]. Furthermore, the point cloud transformer (PCT) [58], a variant adapted specifically for point clouds, enhances local feature extraction with the support of farthest point sampling and nearest neighbor search. Such a design further improves performance on various downstream tasks.
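The way self-attention relates every point patch to every other one can be sketched with a single attention head. As a simplifying assumption, the learned query, key, and value projections are replaced by the identity, so the token embeddings themselves play all three roles; the token values and function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections (Q = K = V = tokens)."""
    dim = len(tokens[0])
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        for q in tokens
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs

# Three toy patch embeddings (illustrative values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each output token is a weighted average of all tokens, so a patch can draw information from any other patch regardless of their spatial distance, which is the long-range dependency referred to above.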

Pseudo labels
To address the absence of ground truth labels, it is imperative to generate a pseudo label from the raw point cloud input. This enables the calculation of a loss by comparing the output of the pretext task with the pseudo label, which is then used to update the encoder via backpropagation. Normally, the information contained in a pseudo label is richer than that in a human tag. For instance, the label 'airplane' only indicates the shape of the object, without other descriptions such as color, pose, and differences from other samples in the same category. In contrast, these attributes are implicitly contained in the point cloud itself and can be captured as pseudo labels in SSL tasks. Therefore, the pseudo label is a more reliable and informative source for the pretext task to learn point cloud representations.
In most reconstruction-based pretext tasks, the pseudo label is the point cloud itself, which provides the rebuilding objective for the pretext task. In contrast-based methods, the pseudo label is a multidimensional matrix typically generated using clustering methods such as memory bank [68], online dictionary [25], and prototype approaches [26], representing the mean and variance of all or part of the features of the point cloud dataset. For some alignment-based prediction or motion-based tasks pre-trained on temporal point cloud datasets, the pseudo label is geometric information such as position, pose, and orientation in several preceding and following frames.

Loss functions
Moreover, selecting an appropriate and easily-differentiable loss function is critical to facilitate backpropagation and encoder optimization. In the case of reconstruction-based pretext tasks, Chamfer distance (CD), which is a symmetric function used to measure the similarity between two point sets, is commonly employed to assess the distances between each point in one set and its corresponding nearest point in the other set. More formally, for two non-empty subsets X and Y, the Chamfer distance d_CD(X, Y) is defined as:

$$d_{CD}(X, Y) = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} \|x - y\|_2 + \frac{1}{|Y|} \sum_{y \in Y} \min_{x \in X} \|y - x\|_2$$

Here, x and y represent the points in the reconstructed point set X and the original input point set Y, respectively. || · || denotes the L2 distance between two points and | · | refers to the number of points. The smaller the CD value, the more similar the two point sets are, and the better the SSL algorithm performs.
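A minimal sketch of the Chamfer distance as described here: for each point, take the L2 distance to its nearest neighbor in the other set, average within each set, and sum the two directions. The normalization by set size and the use of unsquared distances follow the description in this section; some implementations use squared distances instead. The function name and toy point sets are illustrative.

```python
import math

def chamfer_distance(xs, ys):
    """Symmetric Chamfer distance between two non-empty point sets."""
    # Average distance from each point in xs to its nearest point in ys.
    x_to_y = sum(min(math.dist(x, y) for y in ys) for x in xs) / len(xs)
    # Average distance from each point in ys to its nearest point in xs.
    y_to_x = sum(min(math.dist(y, x) for x in xs) for y in ys) / len(ys)
    return x_to_y + y_to_x

reconstruction = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
original = [[0.0, 0.0, 0.0]]

print(chamfer_distance(reconstruction, original))   # 0.5
print(chamfer_distance(original, original))         # 0.0
```

In the first call, the extra reconstructed point at (1, 0, 0) is one unit away from its nearest original point, contributing (0 + 1) / 2 = 0.5, while the reverse direction contributes 0. The brute-force nearest-neighbor search is quadratic in the number of points, so practical implementations rely on batched array operations or spatial indexing.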
For contrast-based pretext tasks, the objective is to discriminate the similarities and differences between point cloud samples at the level of overall semantic understanding. Diverse data augmentation is applied to generate positive samples that are similar to the input point cloud and negative samples that are transformed from other point clouds. Therefore, a cross-entropy variant loss function is required to pull positive samples close to each other and push negative samples apart. InfoNCE, where NCE stands for Noise-Contrastive Estimation, is a contrastive loss function that estimates the mutual information between a pair of samples. It can be formulated as:

$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}$$

where q indicates the encoded query (feature), k indicates a set of K + 1 encoded samples {k_0, k_1, k_2, ..., k_K}, which could be regarded as the prototype of historical samples, and τ is the temperature parameter controlling the sharpness of the distribution. Assuming there is only one positive sample k_+ in the set k matching the query q, the other K samples are all negative. InfoNCE aims to assign the query q to the positive class k_+ in the (K + 1)-way classification problem [25]. In other words, the loss function tries to maximize the logit q · k_+ and minimize the value of the denominator. The only difference between InfoNCE and cross-entropy is that K denotes the number of negative samples for InfoNCE, while it represents the number of dataset categories in cross-entropy.
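The InfoNCE loss can be sketched as a (K + 1)-way cross-entropy over dot-product logits scaled by the temperature, following the formulation in [25]. The default temperature of 0.07 and the toy query and key vectors are assumptions for illustration, and `info_nce` is a hypothetical helper name.

```python
import math

def info_nce(query, keys, positive_index, temperature=0.07):
    """InfoNCE loss for one query against K + 1 keys, exactly one of which is positive.

    temperature=0.07 is an assumed default, not a value prescribed by this survey.
    """
    logits = [sum(q * k for q, k in zip(query, key)) / temperature for key in keys]
    # Log-sum-exp of the logits, computed stably, gives the log of the denominator.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one aligned, one orthogonal, one opposite key

loss_aligned = info_nce(query, keys, positive_index=0)
loss_opposite = info_nce(query, keys, positive_index=2)

print(loss_aligned < loss_opposite)
```

The loss is small when the query is most similar to its positive key and large when a negative key scores higher; if all keys were identical, the loss would equal log(K + 1), the value for a uniform guess.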

FOR POINT CLOUD
As shown in Fig. 2, we classify the current point cloud SSL pretext tasks into four general categories based on the nature of the point cloud that each method exploits: reconstruction-based, contrast-based, alignment-based, and motion-based methods. These categories can be further subdivided into fine-grained tasks according to which aspect of the features they exploit. The following sections elucidate the principles and peculiarities of the various proxy tasks by presenting representative examples with illustrative formulas or figures. Notably, some pretext tasks span multiple sub-categories, so such works are discussed in more than one section.

Reconstruction-based methods
Reconstruction-based methods are the most prevalent SSL proxy tasks for point clouds. Such methods learn the point cloud representation by reconstructing the corrupted point cloud as close as possible to the original entire input. Not only global features but also the mapping between local and global structure is learned during the reconstruction process. According to the different types of corruption and reconstruction objects, the reconstruction-based methods can be divided into six sub-categories: mask recovery, spatial restoration, point upsampling, disentanglement, deformation reconstruction, and generation and discrimination. In the following subsections, we will introduce the details of each sub-category with representative examples and summarize the contribution of each method in Table 3.

Mask recovery
The primary concept of reconstruction is to mask a portion of the point cloud and recover the missing components via an encoder-decoder architecture. Similar to the image inpainting task [83] and the Masked Autoencoder (MAE) [66], the encoder is required to capture the local geometric structure representation and the regional relations during the restoration process. Generally speaking, the better the reconstruction, the more effective the learned features. Point-BERT [55], inspired by BERT [16], designed a point-specific tokenizer built upon a discrete Variational AutoEncoder (dVAE) to map point patches into discrete point tokens that imply meaningful local geometric patterns. A portion of the input is then randomly masked out, and a BERT-style transformer is trained to reconstruct the missing point tokens under the supervision of the point tokens obtained by the tokenizer. However, the tokenizer must be pre-trained in advance, and Point-BERT relies heavily on auxiliary contrastive learning as well as data augmentation. To address these issues, Pang et al. proposed Point-MAE [54], a neat and efficient masked autoencoder scheme shown in Fig. 5. Concretely, Point-MAE employs the standard transformer as its backbone with an asymmetric encoder-decoder architecture to process randomly masked point cloud inputs with a high masking ratio (60%-80%). The mask tokens are shifted from the input of the encoder to the lightweight decoder, which saves considerable computation and, more significantly, avoids early leakage of location information.
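The high-ratio random masking step can be sketched as follows. This is a minimal illustration under stated assumptions: the point cloud is taken to be already grouped into patches and embedded as tokens (that grouping step is omitted), and `random_patch_mask`, the token shapes, and the 0.6 ratio are hypothetical choices within the 60%-80% range quoted above.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio=0.6, rng=None):
    """Return a boolean mask of shape (num_patches,); True marks a masked patch."""
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 384))     # hypothetical patch embeddings
mask = random_patch_mask(64, mask_ratio=0.6, rng=rng)
visible_tokens = tokens[~mask]          # only visible tokens are passed to the encoder
num_mask_tokens = int(mask.sum())       # mask tokens are introduced later, at the decoder
```

Feeding only the visible tokens to the encoder is what makes the design asymmetric: the encoder cost scales with the visible fraction rather than with the full patch count.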
To further capture local geometric information, Zhang et al. introduced Mask Surfel Prediction (MaskSurf) [63], which estimates the surfel position (i.e., points) and per-surfel orientation (i.e., normals) simultaneously. Such a two-head pre-training paradigm was shown to capture more effective representations than a reconstruction-only pretext. Likewise, Voxel-MAE [65] transforms point clouds into volumetric representations and applies a range-aware random masking strategy on the voxel grid. In addition to reconstructing the occupancy value of masked voxels, a supplementary binary voxel classification task, which distinguishes whether a voxel contains points, helps the model capture complicated semantics.

Fig. 6. The illustration of the CloudContext pretext task. The pre-training model is enforced to estimate the spatial relevance between two given point cloud segments from six provided categories. In this case, the exact relation of these two components is "the red part is diagonally above the blue part". This figure is adapted from [69].

Spatial restoration
The point cloud is the coordinate set containing abundant spatial information that describes the structural distribution of objects and the environment in Euclidean space. It is natural to exploit such rich spatial knowledge as the supervision signal in the pretext tasks to learn favorable representations.
Sauder et al. [62] proposed a 3D version of the jigsaw pretext, which rearranges point clouds whose parts have been randomly displaced by voxels along the axes. The goal of this pretext is to restore the original position of each part, labeled by its voxel ID, from the disordered arrangement. Likewise, they further proposed CloudContext [69] to predict the spatial relation between two point cloud segments. As shown in Fig. 6, the pre-training model is trained to predict the relative structural position between two given patches from the same object, which exploits the innate flexibility of point clouds, as they are not restrained by a discrete grid. By doing so, powerful per-point features can be obtained in an easy-to-implement unsupervised manner without expensive computation. Orientation estimation [61] is another simple but effective proxy task for capturing the spatial information of point clouds. Thanks to the canonical orientation provided in most datasets, the orientation estimation pretext task aims to predict and recover the rotation angle around an axis, where the rotation is applied via matrix multiplication. Such a pretext requires a high-level holistic understanding of shapes and cleverly obviates the need for manual annotations.

Point upsampling
Point upsampling is the operation of upsampling sparse, noisy, and non-uniform point clouds to generate a dense, complete, and high-resolution point cloud, which is challenging but also helps the model capture implicit geometric representations of the underlying surface.
PU-GAN [70] was a pioneering SSL upsampling paradigm built on a generative adversarial network (GAN) [59] to grasp a diverse range of point distributions from the latent space and upsample points over patches. An up-down-up unit is embedded in the generator to expand point features, together with a self-attention unit that enhances the quality of feature aggregation. The discriminator is encouraged to learn inherent patterns and improve the uniformity of the generated output according to a compound loss including adversarial, uniform, and reconstruction terms. Motivated by PU-GAN, Zhang et al. proposed the Upsampling AutoEncoder (UAE) [72] to gain both advanced semantic information and basic geometric structure from the subsampled point cloud. As shown in Fig. 7, the encoder is devised to perform point-wise feature extraction on the subsampled point cloud, and the upsampling decoder is designed to reconstruct the original dense point cloud with offset attention [58] to refine the global shape structure. Liu et al. [73] proposed a coarse-to-fine reconstruction framework, dubbed SPU-Net, integrating self-attention with a graph convolution network (GCN) for context feature extraction and generating fine point sets with hierarchically learnable 2D grids. Zhao et al. [71] introduced SSPU-Net, which leverages the shape coherence between the sparse input and the generated dense point cloud, as well as an image-consistency loss among multi-view rendered images, to capture the latent pattern of the underlying point structure. PUFA-GAN [74], a frequency-aware framework, utilized a graph filter to extract the high-frequency (HF) points of sharp edges and corners so that the discriminator could focus on the HF geometric properties and force the generator to produce more uniform and clean upsampled point clouds. To remove the restriction of a fixed upsampling factor, Zhao et al. [75] presented a self-supervised arbitrary-scale (SSAS) framework with a magnification-flexible upsampling strategy.
Instead of directly mapping sparse point clouds to dense ones, the proposed scheme seeks the nearest projection points on the implicit surface for seed points via two functions, which estimate the projection direction and distance, respectively.

Fig. 8. The two input point clouds are separately halved and mixed into a hybrid object, which is fed to the encoder E to mine the geometry-aware embedding. The Erase operation is applied to obtain the 2D projection from both original input point clouds simultaneously. The instance-adaptive decoder D receives the embedding with two partial projections as input to disentangle the blended shape into the original two point clouds. The Chamfer distance is used to measure the reconstruction error between the generated point clouds and the original ones. This figure is adapted from [78].

Disentanglement
The model pre-trained under the SSL paradigm usually prefers to grasp the low-level geometric features of point clouds, such as pose, contour, and shape information but neglects the high-level semantic content understanding, which leads to unsatisfactory performance in downstream tasks like object classification that require global discriminative capability. To tackle this issue, disentanglement-based SSL pretexts were proposed to separate the low-level geometric features from the high-level semantic embedding and feature extraction is performed on various content via distinct modules to obtain hierarchical representations.
Tsai et al. [76] proposed a self-supervised disentanglement framework that uncouples content and pose attributes in partial point clouds to enhance both geometric and semantic feature abstraction. Two encoders are employed to learn the content and multi-view poses individually, where the learned pose representation should predict the viewing angle and guide the partial point cloud reconstruction in cooperation with the content from another specific view. Likewise, Xu et al. [77] presented a universal Contour-Perturbed Reconstruction Network (CP-Net) that disentangles the point cloud into contour and content components, where a concise contour-perturbed augmentation unit is applied to the contour component while the content part of the point cloud is retained. Therefore, the self-supervisor is able to concentrate on the content component for advanced semantic comprehension. Unlike the above two pretexts, Mixing and Disentangling (MD) [78] blends two disparate point shapes into a hybrid object and obtains a geometry-aware embedding from the encoder. An instance-adaptive decoder is then leveraged to restore the original geometries from the obtained embedding by disentangling the mixed shape. As shown in Fig. 8, in addition to the main encoder-decoder structure, the proposed scheme also encompasses a coordinate extracting operation, "Erase", which randomly drops one coordinate dimension of each point to provide an extra 2D partial projection that helps the decoder reconstruct the two original point cloud shapes.

Fig. 9. The demonstration of the shape self-correction pretext. The input point cloud is first preprocessed by a shape-disorganizing module to generate a deformed point cloud and then fed to the encoder to grasp the geometry-aware representation. Two separate task heads are constructed to distinguish and segment points belonging to distorted parts and subsequently restore the partially deformed objects to their initial state. The well-trained feature extractor is transferred to downstream tasks to assess its feature-capturing ability. This figure is adapted from [80].
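The input preparation described for Mixing and Disentangling (halving two shapes, mixing them, and producing 2D partial projections with Erase) can be sketched in NumPy. This is a simplified reading of the description in this section, not the authors' implementation: the function names are hypothetical, and dropping a single randomly chosen axis for the whole cloud is our assumption about how Erase yields a 2D projection.

```python
import numpy as np


def mix_halves(cloud_a, cloud_b, rng):
    """Keep a random half of each cloud and merge them into one hybrid shape."""
    half_a = cloud_a[rng.choice(len(cloud_a), len(cloud_a) // 2, replace=False)]
    half_b = cloud_b[rng.choice(len(cloud_b), len(cloud_b) // 2, replace=False)]
    mixed = np.concatenate([half_a, half_b])
    return mixed[rng.permutation(len(mixed))]   # hide which source each point came from


def erase(cloud, rng):
    """Drop one randomly chosen coordinate axis, leaving a 2D projection."""
    axis = int(rng.integers(3))
    return np.delete(cloud, axis, axis=1), axis


rng = np.random.default_rng(0)
a = rng.normal(size=(1024, 3))
b = rng.normal(size=(1024, 3)) + 2.0
hybrid = mix_halves(a, b, rng)      # (1024, 3) input to the encoder
proj_a, _ = erase(a, rng)           # (1024, 2) partial hint for shape a
proj_b, _ = erase(b, rng)           # (1024, 2) partial hint for shape b
```

The reconstruction targets are the two original clouds `a` and `b`, each compared with its decoded counterpart using the Chamfer distance.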

Deformation reconstruction
Point cloud deformation is a common phenomenon in real-world data scanning, which is usually caused by object distortion, sensor noise, or external occlusion. It has been discovered that reconstructing the original point cloud from the artificially deformed one, implemented by adding Gaussian noise or local translation, supports the model in grasping geometric perception as well as context awareness of the point cloud.
Chen et al. [80] proposed a shape self-correction pretext to mine the implicit geometric embedding of point clouds. The pretext assumes that a robust shape representation could identify and correct distorted regions of a shape. As shown in Fig. 9, the proposed scheme imposes destruction over certain shape regions by a shape-disorganizing module and sends the deformed point cloud to the feature extractor for embedding learning. Two task heads are built separately to discern the distorted components and further restore them to their original normal shapes for fine-grained geometric and contextual feature explorations.
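The data side of such a pretext can be sketched as follows: a local region is perturbed, and the routine returns both the deformed cloud and the per-point labels that a distortion-detection head would be trained on. This is a simplified stand-in for the shape-disorganizing module, assuming Gaussian noise on the k nearest neighbors of a random seed point; `deform_local_region`, `k`, and `sigma` are hypothetical names and values, and `k` is assumed not to exceed the number of points.

```python
import numpy as np


def deform_local_region(points, k=32, sigma=0.05, rng=None):
    """Perturb the k points nearest to a random seed; return (deformed, labels)."""
    rng = np.random.default_rng() if rng is None else rng
    seed = points[rng.integers(len(points))]
    # Indices of the k points closest to the seed define the distorted region.
    region = np.argsort(np.linalg.norm(points - seed, axis=1))[:k]
    deformed = points.copy()
    deformed[region] += rng.normal(scale=sigma, size=(len(region), 3))
    labels = np.zeros(len(points), dtype=bool)
    labels[region] = True            # True marks a distorted point
    return deformed, labels


rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(1024, 3))
deformed, labels = deform_local_region(points, k=32, sigma=0.05, rng=rng)
# `labels` supervises the distortion-detection head; `points` is the restoration target.
```

Both supervision signals come from the corruption process itself, which is what removes the need for manual annotation.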
Achituve et al. [81] presented the first study of SSL for domain adaptation (DA) in the point cloud field via Deformation Reconstruction (DefRec). By mapping the dislocated points back to their original locations, the model is capable of capturing the latent statistical structure of input point cloud objects. Moreover, the distribution gap between source and target domains is bridged by the learned underlying representation, since it is invariant to distribution shift.
FoldingNet [79] presented a novel folding-based decoder that deforms a canonical 2D grid to fit an arbitrary 3D object surface. Instead of deforming the point cloud, the folding operation exerts a "virtual force", induced by the embedding captured from the input, to stretch a 2D grid lattice so that it reproduces the 3D surface structure; the implicit 2D grid constraint helps tackle issues caused by the irregular nature of point clouds.

Generation and discrimination
The generation and discrimination pretext is a unique paradigm that replaces the comparison between the generated and input point clouds with a discriminator module, which distinguishes whether the point cloud it is fed is reconstructed from a noise distribution or truly sampled. During the adversarial training process, the generator (encoder) and discriminator (decoder) compete with each other and are updated alternately so that both components can be transferred as initial networks for downstream tasks.
GAN variants for point clouds are representative of generation and discrimination approaches. PC-GAN [82] is a modified framework specifically for point clouds that employs a hierarchical and interpretable sampling strategy inspired by Bayesian and implicit generative models to tackle the issue of missing constraints on the discriminator. Sarmad et al. [83] introduced a reinforcement learning (RL) agent to control the GAN so that it grasps the implicit representation from noisy and partial input and generates high-fidelity, complete point clouds. Meanwhile, applying an RL agent to seek the best-fit input of the GAN to produce a low-dimensional latent embedding relieves the challenge of unstable GAN training. Shu et al. [84] introduced a tree-structured graph convolution network (TreeGCN) as the generator, leveraging ancestor information to boost the representation of each point, which is more computationally efficient than the neighborhood features adopted in regular GCNs. PU-GAN [70] and PUFA-GAN [74] both employed GAN-based models to generate dense and uniform point clouds, with distinct modules for feature aggregation enhancement and high-frequency point filtering, respectively.

Table 4 (scene-contrast rows)
Year | Method | Sub-category | Contribution
 | CoCoNets [92] | Scene contrast | Mapping RGB-D images to 3D point scenarios by optimizing view-contrastive prediction
2020 | P4Contrast [93] | Scene contrast | Utilizing the synergies between two modalities for better feature extraction
2021 | DepthContrast [94] | Scene contrast | Applying the Instance Discrimination method on depth maps

Liu et al. [85] proposed a discriminative mask pre-training transformer framework, MaskPoint, which combines mask and discrimination techniques to perform simple binary classification between masked object points and sampled noise points as the proxy task. As shown in Fig. 10, the original complete point cloud is divided into a 90% masked portion and a 10% visible set. Two kinds of queries, real ones sampled from the masked point cloud and fake ones derived from random noise, are fed to the decoder, which must distinguish the source of each query. During the discrimination process, the model is required to deduce the full geometry from small visible portions.
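The construction of real and fake queries can be sketched with a few lines of NumPy. This is an illustrative simplification: masking is done per point rather than per patch, and the fake queries are drawn uniformly from the bounding box of the cloud, which is our assumption about the noise distribution. The function name and arguments are hypothetical.

```python
import numpy as np


def make_discrimination_queries(points, mask_ratio=0.9, rng=None):
    """Split a cloud into visible/masked parts and build real/fake query points."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(len(points))
    num_masked = int(round(len(points) * mask_ratio))
    masked, visible = points[perm[:num_masked]], points[perm[num_masked:]]
    # Fake queries: random noise inside the bounding box of the full cloud.
    fake = rng.uniform(points.min(axis=0), points.max(axis=0), size=(num_masked, 3))
    queries = np.concatenate([masked, fake])
    labels = np.concatenate([np.ones(num_masked), np.zeros(num_masked)])  # 1 = real
    return visible, queries, labels


rng = np.random.default_rng(0)
cloud = rng.normal(size=(1000, 3))
visible, queries, labels = make_discrimination_queries(cloud, mask_ratio=0.9, rng=rng)
# The encoder sees only `visible`; the decoder classifies each row of `queries`.
```

Because the target is a binary label per query, the loss is an ordinary binary cross-entropy rather than a point-set distance.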

Contrast-based methods
Contrastive learning is a popular mode of SSL that encourages augmentations of the same input to have more similar representations than views of different inputs. The general approach is to expand the views of input point clouds (anchors) through various data augmentation techniques, so that positive samples augmented from the same anchor are more similar to each other in the feature space than they are to negative samples from different anchors. In this section, we will introduce SSL contrast-based methods for point clouds with representative examples and summarize each method's contribution in Table 4.

Object contrast
Traditional contrastive learning research focuses on instance-wise objects, where the priority is on overall semantic learning through discriminative pretext tasks that capture context similarity and difference of point clouds. Therefore, such an object-contrast paradigm performs data augmentation on relatively large patches or whole single point objects to capture global geometric awareness.
Sanghi [86] proposed Info3D, which takes inspiration from Contrastive Predictive Coding [95] and Deep InfoMax [96], to obtain rotation-insensitive representations by maximizing mutual information between 3D objects and their local chunks as well as geometrically transformed versions. Lu et al. [87] proposed the Augmentation Fusion Self-Supervised Representation Learning (AFSRL) framework, which imposes data-level augmentation and feature enhancement simultaneously to construct a stable and invariant point cloud embedding. The correspondence between augmented pairs is acquired, and the invariant semantics are maintained under perturbations during both kinds of augmentation. Zhang et al. [88] introduced a simple two-phase unsupervised GCN framework, named contrasting and clustering, to capture superior point embeddings by solving part contrast and object cluster tasks consecutively. Du et al. [89] presented a self-contrastive paradigm that leverages self-similar point cloud patches within a single point cloud to facilitate the capture of local shape and global context primitives. As shown in Fig. 11, according to the nonlocal self-similar property of the point cloud, where regional geometry remains invariant after affine transformation, self-similar point cloud patches are treated as positive samples and the others as negative samples, based on the inferred similarity score. Moreover, hard negative samples, which lie close to positive samples in the representation space, are sampled for more discriminative and expressive representation learning.

Fig. 11. The illustration of a self-contrastive paradigm. Patch D is selected as the anchor and the symmetrical part Patch B is the positive sample. Patch B and C are the negative samples, where Patch C is hard to distinguish due to its comparative similarity to the anchor. The figure is adapted from [89].

Scene contrast
Different from the object-contrast, the scene-contrast paradigm concentrates on scene-level contrastive learning to grasp broader environmental context and neighborhood perception, which is more challenging but also more relevant to real-world complex scenarios.
To address the domain gap concern, namely that global representations captured from object instances are insufficient for complex scenes, Xie et al. [90] proposed PointContrast, a sparse residual U-Net based framework that aims to obtain dense point-level features on complex scenes. As shown in Fig. 12, two views x_1 and x_2 are produced from a complicated point cloud scene input, and pairs of point-level correspondences between these two views are computed as the positive samples. Two rigid transformations T_1 and T_2 are utilized to increase the pretext's difficulty, which demands that the network learn an embedding invariant to random geometric shifts. The contrastive loss is defined to shorten the distance between matched points and enlarge the distance between mismatched points of two overlapping partial scans, so that the pre-trained model can capture local descriptions and be broadly applicable to various advanced 3D understanding downstream tasks. However, PointContrast only considered point correspondence matching and ignored the spatial configurations and contexts in a scene, e.g., relative pose and distance, which limits its scalability and transferability. To remedy this shortcoming, Hou et al. [91] presented Contrastive Scene Contexts, which fuses spatial information into the pre-training objective by partitioning the scene with the ShapeContext local descriptor [97] and performing contrastive learning in each region. This simple modification improves downstream performance and achieves data efficiency: employing only 0.1% of the point labels reaches the same effect as full supervision.
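The view-generation step of this pretext (two overlapping views, each under its own rigid transformation, with known point correspondences) can be sketched as follows. This is an illustrative stand-in: real partial scans come from different camera viewpoints, whereas here the two views are random overlapping subsets of one scene. All function names and the `overlap` parameter are hypothetical.

```python
import numpy as np


def random_rigid_transform(rng):
    """Sample a random rotation R (3, 3) and translation t (3,)."""
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q * np.sign(np.diag(R))          # fix column signs of the QR factor
    if np.linalg.det(Q) < 0:             # flip one axis to get a proper rotation
        Q[:, 0] = -Q[:, 0]
    return Q, rng.normal(size=3)


def make_two_views(scene, overlap=0.5, rng=None):
    """Build two transformed partial views and the index pairs that match."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(scene)
    perm = rng.permutation(n)
    n_shared = int(n * overlap)
    n_rest = (n - n_shared) // 2
    idx1 = np.concatenate([perm[:n_shared], perm[n_shared:n_shared + n_rest]])
    idx2 = np.concatenate([perm[:n_shared], perm[n_shared + n_rest:]])
    T1, T2 = random_rigid_transform(rng), random_rigid_transform(rng)
    x1 = scene[idx1] @ T1[0].T + T1[1]
    x2 = scene[idx2] @ T2[0].T + T2[1]
    # Shared points occupy the first n_shared rows of both views.
    pairs = np.stack([np.arange(n_shared), np.arange(n_shared)], axis=1)
    return x1, x2, pairs, T1, T2


rng = np.random.default_rng(0)
scene = rng.uniform(-5.0, 5.0, size=(1000, 3))
x1, x2, pairs, T1, T2 = make_two_views(scene, overlap=0.5, rng=rng)
```

Rows listed in `pairs` serve as positives for a point-level contrastive loss such as InfoNCE, while all other cross-view combinations act as negatives.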
Continuous Contrastive 3D Networks (CoCoNets) [92] aimed to infer latent scene representations by mapping RGB-D images to 3D point scenarios through optimizing view-contrastive prediction. P4Contrast [93], another RGB-D bimodal SSL framework, proposed contrasting "pairs of point-pixel pairs", which provides additional flexibility for creating hard negatives and utilizes the synergies between the two modalities for better feature extraction. DepthContrast [94] circumvents the need for point correspondences and instead applies the Instance Discrimination [68] method on depth maps, combined with a momentum encoder, to grasp the geometric perception of point cloud scenes.

Alignment-based methods
Point cloud representation is generally invariant to transformations in terms of time flow, spatial motion, multi-view photography, etc. Based on this important property, alignment-based methods have been proposed to learn the implicit embedding of point clouds by preserving the coherence of point features through spatiotemporal consistency, multi-view alignment, and multimodal fusion. In the following, we present the details of these sub-categories with representative examples and summarize the contribution of each method in Table 5.

Fig. 13. The schematic of cross-modality and cross-view correspondences. The 3D point cloud objects and corresponding pairs of multi-view rendered images are sampled from the same mesh input, respectively. The relation of diverse views is captured as the supervision signal by sustaining the alignment among multi-view and cross-domain representations. The figure is adapted from [99].

Multi-view alignment
Compared to direct processing and feature extraction on 3D point clouds, projecting point clouds into 2D images for dimension reduction and utilizing mature image networks as well as 2D SSL technologies is relatively more accessible. To ensure that the learned embedding sufficiently represents the entire 3D point cloud object or scene, multi-view alignment pretexts are necessary to preserve the integrity and uniformity of obtained point cloud features.
Info3D [86] was a pioneering work that obtains rotation-insensitive representations by maximizing mutual information between 3D objects and their local chunks for patch-level consistency. Occlusion Completion (OcCo) [60] incorporated the idea of mask recovery, shielding and restoring occluded points in a camera view to better comprehend spatial and semantic properties. Similarly, Yang et al. [98] introduced an SSL multi-view stereo structure that generates an initial depth map as pseudo-labels and iteratively refines this self-supervision from neighboring views as well as high-resolution images by multi-view depth fusion. Furthermore, the pixel/point correspondences between a point cloud and its multi-view images should be aligned for cross-modality consistency. Jing et al. [99] proposed a novel SSL framework leveraging cross-modality and cross-view correspondences to jointly learn 3D point cloud and 2D image embeddings. As shown in Fig. 13, point cloud objects and comparable pairs of multi-view rendered images are sampled from the same mesh input, respectively. In addition to 2D-3D consistency, the contrastive notion is adopted in cross-view alignment, which shortens the intra-object distance while maximizing the inter-object discrepancy of distinct rendered images. Similarly, Tran et al. [100] presented a dual-branch scheme that not only enforces agreement between fine-grained pixel-point local representations but also encourages the 2D and 3D global feature distributions to be as close as possible by exploiting knowledge distillation.

Spatiotemporal consistency
Unlike previous methods, the spatiotemporal approach is more concerned with long-range spatial and temporal invariance across point cloud frames, which are 4D data (XYZ coordinates + temporal dimension), to capture the intrinsic characteristics of dynamic sequences that are absent in static 3D point data. Motivated by the success of Xu et al.'s work [106] in video SSL, Wang et al. [101] proposed the first SSL scheme to learn effective temporal embeddings on dynamic point cloud data by sorting the temporal order of sampled and shuffled point cloud clips. As shown in Fig. 14, a few static point cloud frames are uniformly sampled and shuffled as the input of a 4D CNN, which captures significant motion information by restoring the disrupted time fragments to the correct order on an unannotated large-scale sequential point cloud action recognition dataset. Another spatiotemporal representation learning (STRL) [102] framework, inspired by BYOL [107], designed a dual-branch pipeline, referred to as the online and target networks, whose branches collaborate and promote each other. Specifically, the online network is enforced to predict the target network's representation of another temporally correlated input, which is augmented by a random spatial transformation, to extract spatiotemporally invariant contextual cues. Taking training and inference time into account, Mersch et al. [103] presented an innovative encoder-decoder neural network built on 3D spatiotemporal convolutions, with fewer parameters, to predict future point cloud scenes. Such a lightweight model concatenates range images as input to estimate forthcoming range images and per-point scores for multiple future steps so that aggregated spatial and temporal scene information can be captured simultaneously.

Table 5
Year | Method | Sub-category | Contribution
 | Info3D [86] | Multi-view alignment | Maximizing the mutual information between objects and their transformations
2021 | OcCo [60] | Multi-view alignment | Shielding and restoring occluded points in a camera view
2021 | Multi-view stereo [98] | Multi-view alignment | Generating prime depth map as the self-supervision signal
2021 | Cross-view [99] | Multi-view alignment | Jointly grasping both 3D point cloud and 2D image embedding concurrently
2022 | Multi-view rendering [100] | Multi-view alignment | Encouraging the 2D-3D global feature distribution similar
2021 | Order prediction [101] | Spatiotemporal consistency | Sorting the temporal order of sampled and disorganized point cloud clips
2021 | STRL [102] | Spatiotemporal consistency | Dual-branch network to predict the representation of another temporally correlated input
2022 | Future prediction [103] | Spatiotemporal consistency | Forecasting future point cloud scenes with lightweight model
2020 | PointPainting [104] | Multimodal fusion | Projecting LiDAR points into the semantic segmentation diagram for traffic scenes
2021 | PointAugmenting [105] | Multimodal fusion | Replacing sub-optimal segmentation scores with high-dimension CNN features
2022 | DeepFusion [64] | Multimodal fusion | Exploiting cross-attention to dynamically capture long-range correlations of image-LiDAR pairs

Fig. 14. The demonstration of point cloud sequence order prediction. The first row shows the point cloud clips uniformly sampled from the continuous point cloud sequence. In the second row, the order of these segments is randomly shuffled and they are fed into a 4D CNN to learn the dynamic features of human actions, which are leveraged to predict the original permutation in a self-supervised manner. The figure is adapted from [101].

[92], [99] decorated LiDAR points with camera features at the input level for 3D detection. (b) DeepFusion [64] fused the camera and LiDAR features extracted by respective encoders at the representation level, leveraging a cross-attention consistency technique. The figure is adapted from [64].

Multimodal fusion
Rather than simply requiring coherence between 2D-3D correspondences [92], [99], [100], autonomous driving algorithms demand sophisticated collaboration between in-vehicle sensors. For example, cameras and LiDAR can provide complementary information, such as colorful texture and distance perception, to improve 3D object detection. Therefore, multimodal fusion is a promising direction to exploit the potential of images and point clouds for acquiring complex traffic scene features.

Table 6
Year | Method | Sub-category | Contribution
 | PRNet [108] | Registration | Pioneer for partial-to-partial point cloud registration enabling coarse-to-fine refinement
2021 | Part mobility [109] | Registration | Converting points to trajectories to derive the rigid transformation hypotheses
2022 | SuperLine3D [110] | Registration | Obtaining precise line representation under arbitrary scale perturbations
2022 | DVD [111] | Registration | Learning local and global point embedding jointly
2020 | PointPWC-Net [112] | Scene flow estimation | Discretizing cost volume onto 3D point clouds in a coarse-to-fine fashion
2020 | Just go with the flow [113] | Scene flow estimation | Optimizing two SSL losses based on nearest neighbors and cycle consistency
2021 | Self-Point-Flow [114] | Scene flow estimation | Converting pseudo label matching problem as optimal transport task
Although not explicitly identified as SSL methods, Vora et al. [104], Wang et al. [105], and Li et al. [64] offered compact frameworks for tight sensor fusion which could be implemented under the SSL paradigm without human annotations. PointPainting [104] was a sequential fusion method that projected LiDAR points into the semantic segmentation map of a traffic scene with color marking. Each point was painted with a class score obtained from the image segmentation network and could then be utilized in any LiDAR detection approach. This painting-based fusion addresses depth blurring and scale ambiguity by consolidating the bird's-eye view and the camera view. PointAugmenting [105] adopted a late cross-modal fusion mechanism based on PointPainting, replacing sub-optimal segmentation scores with high-dimensional CNN features that contain rich appearance cues and a larger receptive field to emphasize delicate details. Moreover, a simple yet effective cross-modal data augmentation pasted virtual objects into both images and point clouds while guaranteeing the alignment between the camera and LiDAR. However, both PointPainting and PointAugmenting simply decorated LiDAR points with camera embeddings, as shown in Fig. 15(a). To improve the downstream performance, DeepFusion [64] proposed end-to-end cross-modal fusion at the feature level, focusing on consistency improvement. As shown in Fig. 15(b), a block named LearnableAlign was introduced to exploit cross-attention to dynamically capture long-range correlations during the image-LiDAR fusion process, enhancing the model's recognition and localization ability.
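The "painting" step described for PointPainting, in which each LiDAR point is projected into the image and decorated with the segmentation scores at that pixel, can be sketched as below. This is a minimal illustration under stated assumptions: `P` is a hypothetical 3x4 matrix that maps homogeneous point coordinates to homogeneous pixel coordinates (intrinsics combined with any LiDAR-to-camera extrinsics), and points that fall outside the image simply keep zero scores.

```python
import numpy as np


def paint_points(points, seg_scores, P):
    """Append per-pixel class scores (H, W, C) to points (N, 3) via projection P (3, 4)."""
    H, W, C = seg_scores.shape
    homo = np.hstack([points, np.ones((len(points), 1))])
    uvw = homo @ P.T                                  # homogeneous pixel coordinates
    in_front = uvw[:, 2] > 0                          # keep points in front of the camera
    z = np.where(in_front, uvw[:, 2], 1.0)            # avoid dividing by non-positive depth
    u = np.floor(uvw[:, 0] / z).astype(int)           # pixel column
    v = np.floor(uvw[:, 1] / z).astype(int)           # pixel row
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    painted = np.zeros((len(points), 3 + C))
    painted[:, :3] = points
    painted[valid, 3:] = seg_scores[v[valid], u[valid]]
    return painted, valid


# Toy setup: a 4x4 image with two classes and a simple pinhole camera.
K = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
seg_scores = np.zeros((4, 4, 2))
seg_scores[2, 2] = [0.1, 0.9]
pts = np.array([[0.0, 0.0, 1.0],     # projects to pixel (u=2, v=2)
                [0.0, 0.0, -1.0],    # behind the camera
                [10.0, 0.0, 1.0]])   # outside the image
painted, valid = paint_points(pts, seg_scores, P)
```

Swapping `seg_scores` for a feature map with more channels gives the input-level decoration used by feature-based variants, with the same projection logic.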

Motion-based methods
Various point cloud frames contain rich geometric patterns and kinematic schemas that are concealed in their movement. The motion-based SSL paradigm focuses on dynamically capturing the intrinsic motion characteristics during point cloud spatial variations by taking advantage of traditional registration and scene flow estimation as pretexts. In this section, we will introduce SSL motion-based methods for point clouds with representative examples and summarize the contribution of each method in Table 6.

Registration
Point cloud registration is the task of merging two point clouds X and Y into a globally consistent coordinate system by estimating the rigid transformation between them, which consists of a rotation matrix R ∈ SO(3) and a translation vector t ∈ ℝ³. Here, ψ denotes the feature extraction network that learns hierarchical, informative features from dynamic point clouds. Unlike the classic ICP registration method [115], which iteratively searches for correspondences and estimates the rigid transformation, the SSL registration paradigm can obtain informative point cloud features without high-quality ground-truth correspondences. PRNet [108] was the pioneering work on partial-to-partial point cloud registration, enabling iterative coarse-to-fine refinement. Based on co-contextual information, the framework reduces the registration problem to a keypoint detection task, in which the matching points of the two input point clouds must be recognized. Shi et al. [109] presented a point cloud part mobility segmentation approach to understand the essential attributes of dynamic objects. Instead of directly processing the sequential point clouds, the raw inputs are converted to trajectories via point correspondences between successive frames, from which rigid transformation hypotheses are derived. Analogously, Zhao et al. [110] proposed an SSL line segmentation and description method for LiDAR point clouds, called SuperLine3D, which provides usable line features for global registration without any prior hints. Compared to point embeddings, which are constrained by limited resolution, this segmentation model can obtain precise line representations under arbitrary scale perturbations. Motivated by the observation that the distinctive local geometric structures of two point cloud subsets can improve representation quality, Liu et al. [111] introduced deep versatile descriptors (DVD), which learn local and global point embeddings jointly. As shown in Fig. 16, the co-occurring local point cloud regions, which retain structural knowledge under rigid transformations, are used as the input of DVD to extract latent geometric patterns constrained by a local consistency loss. To further enhance the model's transformation awareness, reconstruction and normal estimation are added as auxiliary tasks for better alignment.
Fig. 17. Illustration of Just Go with the Flow. The nearest neighbor loss pushes the predicted flow (green) toward the pseudo-ground truth (red) in frame t + 1. The cycle consistency loss penalizes the discrepancy between the original points (blue) in frame t and the points obtained by estimating the flow from the predicted points (green) in the reverse direction, which enforces temporal alignment. The figure is adapted from [113].
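For reference, when point correspondences between X and Y are already known, the rigid transformation (R, t) of the registration task has a closed-form least-squares solution, commonly called the Kabsch or SVD method. The sketch below is an illustration under that assumption and is not the procedure of any surveyed method: the function name `kabsch` and the use of NumPy are our own choices, and learned pipelines replace the known correspondences with matches predicted from learned features.

```python
import numpy as np

def kabsch(X, Y):
    """Least-squares rigid transform (R, t) such that R @ x_i + t ~= y_i.

    X, Y: (N, 3) arrays of corresponding points (row i of X matches row i of Y).
    Correspondences are assumed to be known; this is an illustrative assumption.
    """
    cx, cy = X.mean(axis=0), Y.mean(axis=0)      # centroids
    H = (X - cx).T @ (Y - cy)                    # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (rotation, not reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cy - R @ cx
    return R, t

# Usage: recover a known rotation about the z-axis and a known translation.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -1.0, 2.0])
Y = X @ R_true.T + t_true
R_est, t_est = kabsch(X, Y)
```

ICP alternates between this closed-form step and a nearest-neighbor correspondence search, which is why its result depends on the quality of the correspondences.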

Scene flow estimation
Scene flow estimation is a vital computer vision task in the point cloud field. Its objective is to estimate the motion of objects by computing dense correspondences between consecutive LiDAR scans of a scene over time. The dynamic variations of points can be represented as 3D displacement vectors to describe the motion state in terms of scene flow.
Wu et al. [112] were the first to introduce SSL into scene flow estimation, using PointPWC-Net, a network built on a learnable point-based cost volume. The cost volume is discretized on the input point pairs to reduce computational complexity. Additionally, efficient upsampling and warping layers are employed to estimate the flow in a coarse-to-fine fashion. Mittal et al. [113] proposed a novel SSL scene flow estimation network to achieve safe navigation in highly dynamic environments by optimizing two SSL losses based on nearest neighbors and cycle consistency. As shown in Fig. 17, the nearest neighbor loss encourages the points predicted from the current frame t to flow toward occupied regions of the future frame t + 1. The cycle consistency loss ensures that the predicted points in frame t + 1 can be mapped back to frame t in the reverse direction, which maintains temporal consistency and avoids degenerate solutions. Self-Point-Flow [114] uses not only 3D point coordinates but also surface normals and color in one-to-one matching to generate pseudo labels, and formulates pseudo label generation as an optimal transport problem. It further employs a random walk module that refines the pseudo label quality by imposing local alignment.
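The two losses of [113] described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, plain Euclidean distances are used, the nearest-neighbor search is brute force, and the backward flow is passed in as an array, whereas in practice it would be predicted by the same network from the predicted points.

```python
import numpy as np

def nearest_neighbor_loss(pred_next, target_next):
    """Mean distance from each predicted point to its nearest point in frame t + 1.

    pred_next:   (N, 3) points of frame t moved by the predicted flow.
    target_next: (M, 3) observed points of frame t + 1 (the pseudo ground truth).
    """
    diff = pred_next[:, None, :] - target_next[None, :, :]   # (N, M, 3)
    dist = np.linalg.norm(diff, axis=-1)                     # (N, M)
    return dist.min(axis=1).mean()

def cycle_consistency_loss(points_t, forward_flow, backward_flow):
    """Mean distance between the original points and the points obtained by
    applying the forward flow followed by the backward flow."""
    cycled = points_t + forward_flow + backward_flow
    return np.linalg.norm(cycled - points_t, axis=-1).mean()

# Usage: two points that both move one unit along the y-axis.
P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # frame t
flow = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
Q = P + flow                                       # frame t + 1

nn_perfect = nearest_neighbor_loss(P + flow, Q)                    # correct flow
nn_static = nearest_neighbor_loss(P, Q)                            # zero flow
cyc_perfect = cycle_consistency_loss(P, flow, -flow)               # exact reversal
cyc_broken = cycle_consistency_loss(P, flow, np.zeros_like(flow))  # no reversal
```

The nearest neighbor term alone can be satisfied by collapsing many points onto the same neighbor, which is the kind of degenerate solution the cycle term is meant to discourage.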

DOWNSTREAM TASKS
The primary purpose of SSL is to pre-train a backbone network and transfer it to downstream computer vision tasks. Therefore, the performance of the model on downstream tasks reflects, to a certain degree, the effectiveness of the SSL paradigm. In other words, the evaluation indicates whether the SSL method extracts useful implicit information from large-scale unlabeled point cloud data. In this section, we introduce four commonly used downstream tasks as well as their respective evaluation metrics for assessing point cloud SSL algorithms. Additionally, each subsection summarizes and compares in a table the performance of the aforementioned SSL methods on the corresponding task.

Object classification
Object classification is a fundamental and prevalent downstream task that requires the model to output the most likely category label for a given point cloud object, which assesses the overall semantic awareness of the pre-trained model. The two commonly used metrics are Overall Accuracy (OA) and Mean Class Accuracy (mAcc), where the former is the ratio of correctly classified objects to the total number of objects, and the latter is the average of the per-class accuracies. Object classification can be divided into three protocols based on task settings:
• Few-shot: Few-shot learning (FSL) is a challenging task that involves training with limited information provided by the downstream dataset. Specifically, the n-way, m-shot setting is employed, where n is the number of classes randomly selected from the dataset and m is the number of objects randomly sampled for each class. Then, the trained model is evaluated on the test split. The few-shot protocol performance of the aforementioned SSL methods is shown in Table 7.
• Fine-tuning: The SSL pre-trained feature extractor serves as the initial weights for the downstream backbone encoder, and the entire network is retrained in a supervised manner with labels from the downstream dataset. The fine-tuning protocol performance of the aforementioned SSL methods is presented in Table 8.
• Linear classification: The SSL pre-trained feature extractor is frozen by stopping the backpropagation gradient. Only a linear SVM classifier is trained in a supervised manner with downstream dataset labels. The linear classification protocol performance of the aforementioned SSL methods is shown in Table 9.
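The two metrics and the n-way, m-shot sampling described above can be sketched in a few lines. The function names, the toy labels, and the fixed random seed are illustrative assumptions and do not correspond to the evaluation code of any specific benchmark.

```python
import random
from collections import defaultdict

def overall_accuracy(y_true, y_pred):
    """OA: correctly classified objects divided by the total number of objects."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_class_accuracy(y_true, y_pred):
    """mAcc: average of the per-class accuracies, so each class counts equally."""
    total, correct = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def sample_episode(labels, n_way, m_shot, seed=0):
    """Pick n_way classes at random, then m_shot sample indices for each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    classes = rng.sample(sorted(by_class), n_way)
    return {c: rng.sample(by_class[c], m_shot) for c in classes}

# Usage: an imbalanced toy set with four objects of class 0 and two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
oa = overall_accuracy(y_true, y_pred)       # 5 of 6 objects are correct
macc = mean_class_accuracy(y_true, y_pred)  # (4/4 + 1/2) / 2

labels = [0] * 5 + [1] * 5 + [2] * 5
episode = sample_episode(labels, n_way=2, m_shot=3)
```

On imbalanced data the two metrics diverge: OA is dominated by the frequent class, while mAcc penalizes errors on the rare class more heavily.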

Part segmentation
Part segmentation is a fine-grained 3D task that involves distinguishing and separating the components of an object, such as airplane wings or desk legs. Whereas object recognition relies on overall discriminative ability, this task requires a model that can extract local point-level features effectively. Point cloud part segmentation is evaluated by the mean Intersection over Union (mIoU), which computes the ratio of the intersection of the predicted and ground-truth part labels to their union, averaged across all categories (mIoU C ) or all instances (mIoU I ). Table 10 summarizes the part segmentation results on the ShapeNetPart dataset obtained by supervised fine-tuning of SSL pre-trained models, in terms of mIoU C (%), mIoU I (%), and the per-class mIoU of the 16 object classes for clear demonstration and comparison.
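The difference between the category-level and instance-level averages can be made concrete with a small sketch. It assumes a common convention that is not spelled out in the text: the IoU of a shape is the mean IoU over the parts of its category, and a part that is absent from both prediction and ground truth counts as IoU 1. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def shape_iou(pred, gt, part_ids):
    """Mean IoU over the parts of one shape (pred and gt are per-point labels)."""
    ious = []
    for part in part_ids:
        inter = sum(p == part and g == part for p, g in zip(pred, gt))
        union = sum(p == part or g == part for p, g in zip(pred, gt))
        # Assumed convention: a part missing in both pred and gt scores 1.
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)

def miou_instance_and_category(shapes, parts_of_category):
    """shapes: list of (category, pred_labels, gt_labels) tuples.

    Returns (mIoU_I, mIoU_C): the mean over all shapes, and the mean over
    categories of the per-category mean shape IoU.
    """
    per_cat = defaultdict(list)
    for cat, pred, gt in shapes:
        per_cat[cat].append(shape_iou(pred, gt, parts_of_category[cat]))
    all_ious = [v for vals in per_cat.values() for v in vals]
    miou_i = sum(all_ious) / len(all_ious)
    miou_c = sum(sum(v) / len(v) for v in per_cat.values()) / len(per_cat)
    return miou_i, miou_c

# Usage: two chairs (parts 0 and 1) and one lamp (parts 2 and 3), four points each.
parts = {"chair": [0, 1], "lamp": [2, 3]}
shapes = [
    ("chair", [0, 0, 1, 1], [0, 0, 1, 1]),   # perfect prediction
    ("chair", [0, 0, 0, 1], [0, 0, 1, 1]),   # one mislabeled point
    ("lamp",  [2, 2, 3, 3], [2, 2, 2, 2]),   # half of the points mislabeled
]
miou_i, miou_c = miou_instance_and_category(shapes, parts)
```

Because the chair category contributes two shapes and the lamp category only one, the instance average weights chairs more heavily, while the category average treats both categories equally.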

Semantic segmentation
Similar to part segmentation, semantic segmentation requires the model to assign a semantic label to each point in the point cloud so that points are grouped into meaningful regions. It is usually performed on complicated outdoor or indoor point cloud scenes with background noise rather than on a clean single object. To compare performance, mIoU, OA, and mAcc are commonly employed as indicators of the feature extraction capability of pre-trained models on the S3DIS dataset [41], which contains point clouds of six large-scale indoor areas, under two protocols:
• Area 5 test: The SSL pre-trained model is fine-tuned on all areas except the largest area, Area 5, which is chosen as the test set.
• Six-fold cross validation: Each of Areas 1-6 is selected in turn as the test set, and the model is fine-tuned on the remaining five areas.
The semantic segmentation performance of SSL pre-training methods under both protocols is displayed in Table 11.

Object detection
Object detection is a task that involves localizing the 6 Degrees-of-Freedom (DoF) bounding box of each object and identifying its category in a complex point cloud scene. The evaluation metric is the average precision (AP), which summarizes the precision of the predicted 3D bounding boxes across recall levels. A prediction counts as correct when its IoU with a ground-truth box exceeds a threshold, which is usually set to 0.25 or 0.5.
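A simplified sketch of the metric follows, assuming that the threshold refers to the 3D IoU between a predicted and a ground-truth box. Several simplifications are ours and are not taken from any benchmark toolkit: boxes are axis-aligned and given as (xmin, ymin, zmin, xmax, ymax, zmax), a single class and a single scene are evaluated, each detection is greedily matched to the best still-unmatched ground-truth box, and AP is the area under the precision envelope.

```python
def iou_3d_axis_aligned(a, b):
    """3D IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:               # no overlap along this axis
            return 0.0
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)

def average_precision(detections, gt_boxes, iou_thresh=0.25):
    """detections: list of (score, box) pairs; gt_boxes: list of boxes."""
    if not gt_boxes:
        return 0.0
    matched, hits = set(), []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue
            iou = iou_3d_axis_aligned(box, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        hit = best_j is not None and best_iou >= iou_thresh
        if hit:
            matched.add(best_j)
        hits.append(hit)

    precisions, recalls, true_pos = [], [], 0
    for k, hit in enumerate(hits, start=1):
        true_pos += int(hit)
        precisions.append(true_pos / k)
        recalls.append(true_pos / len(gt_boxes))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# Usage: two ground-truth boxes, three detections (one false positive).
gt = [(0, 0, 0, 1, 1, 1), (2, 2, 2, 3, 3, 3)]
dets = [(0.9, (0, 0, 0, 1, 1, 1)),
        (0.8, (5, 5, 5, 6, 6, 6)),
        (0.7, (2, 2, 2, 3, 3, 3.5))]
ap_50 = average_precision(dets, gt, iou_thresh=0.5)
ap_75 = average_precision(dets, gt, iou_thresh=0.75)
```

Raising the threshold turns the loosely localized third detection into a false positive, which is why AP at 0.5 is a stricter test of localization quality than AP at 0.25.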

FUTURE DIRECTIONS
Although self-supervised learning has shown promising success in the point cloud field, we have identified current deficiencies and shortcomings in this area. We argue that SSL tasks should not be studied in isolation but rather in conjunction with advanced techniques from other domains. In this section, we will discuss possible future research directions to improve the SSL pre-training feature extraction capability and corresponding downstream performance in point clouds.

Few-shot and zero-shot learning
Generally, the dataset samples and labels used for SSL are assumed to be complete and sufficient; however, such an ideal situation is difficult to achieve in real scenarios. Instead, data shortage challenges are often faced, such as damaged labels, missing information, and imbalanced class distributions. Few-shot learning (FSL) [117] is a solution that allows the network to train with only small amounts of data. It resembles the way some quick learners grasp the key points of a skill after seeing it once and master it without repeated training. It is even possible to identify new sample types that have never been seen before in a test task from a description alone, without any training samples. This approach is called zero-shot learning (ZSL). Both self-supervision and FSL (ZSL) [118] can free models from reliance on large annotated datasets and reduce the annotation cost. In addition, combining the two can improve the generalization and understanding of the model and stimulate the model's ability to imagine the unknown.

Unified transformer-based backbone
Unlike image processing, where CNN-based architectures are dominant, the point cloud field uses a variety of networks, as introduced in Section 2.5. For consistency and convenience, a unified architecture for point cloud feature extraction across multiple downstream tasks is needed. We believe that the transformer is a potential candidate for this purpose, since it is well suited to finding latent long-range correlations within irregular point cloud geometry. Moreover, the best-performing methods on the four downstream tasks summarized in Table 7 to Table 12 employ transformers as the backbone to capture hierarchical representations both globally and locally. Therefore, we recommend adopting transformers as the unified backbone network for feature extraction in SSL point cloud research. This would not only improve the performance of downstream tasks but also facilitate experimental comparisons.

Large scale 3D dataset requirement
As Section 2.3 shows, there are various 3D point cloud collections with different data structures and dense annotations, each with its own characteristics, which provide substantial support for point cloud SSL. Nevertheless, the aforementioned datasets are too homogeneous in terms of source, which limits the data diversity needed for domain adaptation in transfer learning. Hence, a point cloud dataset is needed that is large and varied in its samples, collects both synthetic and real data, and contains indoor as well as outdoor scenes. We believe such a comprehensive dataset will be of great help for the development of point cloud SSL and will further enhance the generalizability and robustness of the models.

Various modalities interactions
Although outdoor autonomous driving datasets [35], [48], [49] gather assorted modalities from different sensors, researchers normally focus on the point cloud data only and ignore the connections and alignment relationships between the modalities. This approach cannot fulfill the potential of such multimodal collections. With the rapid development of multimodal models [64], [104], [105], we believe that a wave of cross-modal SSL paradigms for point clouds is coming. Such paradigms will fuse diverse modalities beyond images (e.g., natural language, radar, voice) so that the unique perspectives of each modality complement each other, while the information they share can be used for mutual validation.

Hierarchical feature extraction
To cope with sophisticated and varied downstream tasks that require overall semantic understanding (e.g., object classification) as well as fine-grained geometric awareness (e.g., part segmentation), the SSL pre-training model should be capable of both global perception and local analysis. In particular, the interactions between hierarchical levels need to be considered in order to discover the implicit relations among unordered points. Therefore, we suggest that hierarchical feature extraction be embedded in forthcoming SSL paradigms to improve the model's ability to capture both the global and local information of point clouds.

Multiple tasks pre-training
Up to now, most point cloud SSL schemes have only one specific pre-training pretext, while few works train diverse tasks concurrently as BERT [16] does. The main obstacle is that multi-task training has to consider the compatibility and synergy between the various pretexts simultaneously and assign a suitable weight to each loss term for stable parameter updating. Nevertheless, distinct pretexts provide multi-level information from various perspectives of point clouds, so jointly training multiple tasks could help the network learn more comprehensive representations.

Theory and interpretability
Similar to traditional deep learning, point cloud SSL suffers from a lack of theoretical support and poor interpretability. Model training is a black-box process, which makes it difficult for researchers to analyze the results; in practice, such analysis is mostly limited to ablation studies. Most of the paradigms are improved upon previous work by drawing empirical conclusions. Such trial-and-error research methods lack rigorous proof and are therefore challenging to generalize and replicate. We suggest that future studies include an inquiry into explainable theory. For example, well-established theories from mutual information [119] or causal inference [120] can be applied to the structure of the network and the design of the loss function. Such a theoretical and explainable paradigm would be more generalizable and robust.

CONCLUSION
Point cloud self-supervised learning fundamentally reduces a model's reliance on manual annotation: through the design of pre-training pretext tasks, it enables the network to extract effective information from raw data, and it has achieved exciting progress that is competitive with supervised learning in various downstream computer vision tasks. This paper extensively outlines recent representative deep neural network-based methods for self-supervised learning on point clouds from a comprehensive perspective, including the background introduction, dataset collections, data augmentation summary, etc. A novel taxonomy is proposed to systematically classify the current point cloud SSL pretext tasks into four general categories, accompanied by detailed explanations as well as performance comparisons. Finally, future research directions are discussed to provide a broad and insightful view of the current state-of-the-art methods in point cloud SSL. We hope this paper can provide a comprehensive overview of the current point cloud SSL research and inspire more researchers to explore the potential of this promising paradigm.

ACKNOWLEDGEMENT
This work received financial support from Jiangsu Industrial Technology Research Institute (JITRI) and Wuxi National Hi-Tech District (WND).