A Texture Integrated Deep Neural Network for Semantic Segmentation of Urban Meshes

3-D geo-information is essential for many urban-related applications. Point clouds and meshes are two common representations of the 3-D urban surface. Compared to point cloud data, a mesh possesses indispensable advantages, such as high-resolution image texture and sharp geometry representation. Semantic segmentation, as an important way to obtain 3-D geo-information, however, is mainly performed on point cloud data. Due to the complex geometry representation and the lack of efficient utilization of image texture information, the semantic segmentation of meshes remains a challenging task for urban 3-D geo-information acquisition. In this article, we propose a texture and geometry integrated deep learning method for the mesh segmentation task. A novel texture convolution module is introduced to capture image texture features. The texture features are concatenated with nontexture features on a point cloud represented by the centers of gravity (COG) of the mesh triangles. A hierarchical deep network is employed to segment the COG point cloud. Our experimental results show that the proposed network significantly improves accuracy with the introduced texture convolution module (1.9% in overall accuracy and 4.0% in average F1 score). We also compare it with other state-of-the-art methods on the public SUM-Helsinki dataset and achieve competitive results.


I. INTRODUCTION
3-D geo-information is important for many urban-related applications, e.g., infrastructure management, environment monitoring, and urban planning. The acquisition of 3-D geo-information of urban areas through 3-D remote sensing technologies, such as LiDAR and oblique photography, has thus become a hot topic in recent years. With the help of high-performance collaborative data processing systems, it is possible to construct mesh data from oblique photography imagery quickly and with high quality. Compared to point cloud data, textured mesh data provides high-quality texture from the original imagery and excels in preserving sharp shape features through its triangular representation. This means that textured mesh data derived from oblique photography has great potential for accurate 3-D geo-information acquisition. Semantic segmentation is a common way to extract geo-information from remote sensing data. In the past few decades, extensive research and significant progress have been made in the semantic segmentation of 2-D remote sensing imagery. Many convolutional neural networks (CNNs) have also been introduced to improve segmentation performance in recent years [1], [2], [3], [4], [5]. For 3-D meshes, however, CNNs cannot be applied directly for semantic segmentation due to the irregular data structure. Many works try to project the 3-D mesh into images from multiple perspectives to employ 2-D image feature learning methods for segmentation [6], [7], [8], [9]. These multiview-based methods effectively capture the features of the 3-D scene surface with classifiers trained on massive image datasets. Despite their strong practicability, geometric information is lost during projection. Instead of reducing the segmentation problem from 3-D space to the 2-D plane, researchers have also tried to employ CNNs directly on the mesh surface [10], [11], [12], [13], [14], [15]. Methods that directly process meshes significantly improve performance on the corresponding datasets; however, most of them are only suitable for segmenting coherent parts in small-scale object datasets [16], not urban scenes [17], [18].
As for point clouds, another common representation of 3-D urban scenes, extensive research has been put forward for their semantic segmentation [19], [20], [21]. PointNet and PointNet++ are particularly worth mentioning [22], [23]: the former pioneered learning directly on unordered point sets, and the latter improved the former by capturing the local characteristics of points. These two are regarded as milestones in point cloud processing, and a set of point-based networks has been presented based on them [24]. In view of the superior performance of point cloud segmentation methods and the simplicity of the point cloud structure, researchers have diverted their attention to linking them with mesh segmentation [25], [26], [27]. Generally, they compute the center of gravity (COG) point, associated with features, to represent each face of the mesh. Further research builds on the COG points, to which classifiers originally designed for point clouds are applied. These approaches combining point cloud and mesh achieve satisfactory performance by encoding information on points generated from meshes. However, the high-resolution 2-D texture of the mesh, which is essential for the recognition of 3-D urban objects, is not fully utilized in these works.
Aiming at the problem that current mesh segmentation methods for urban scenes do not sufficiently utilize texture information, we design a novel deep learning architecture integrating advanced texture features. The texture of each triangle is first mapped to a regular square map so that 2-D convolution can be applied for texture feature extraction. Then, the texture features are concatenated with other triangular-face-based features and recorded on the corresponding COG points. Finally, a hierarchical deep network is employed to predict the COG point cloud labels, and the mesh segmentation is achieved according to the correspondence between COG points and triangular faces. Since the proposed method integrates the mesh and the point cloud, it can take advantage of both the high-resolution texture and the efficient point cloud deep learning architecture.

II. RELATED WORK
In the following, we revisit and summarize work on the 3-D semantic segmentation of meshes.

A. Multiview Based Methods
The segmentation of 2-D images has achieved fruitful results in the fields of machine learning and deep learning [3], [4], [5], [28], [29]. However, due to the irregular structure, image processing methods cannot be applied to 3-D meshes directly. Many works try to project the mesh into images from multiple perspectives to employ 2-D image feature learning methods for mesh segmentation. MVCNN renders multiple views of 3-D scenes and takes a CNN as the classifier to learn features [6]. All the features are aggregated by a view pooling layer to classify the projective images. However, the semantic labels of different views of a 3-D scene may not be consistent. The multiview recurrent neural network (MV-RNN) treats the multiple views as a temporal sequence and applies an RNN to capture the redundancy between adjacent views [7], which solves the problem of inconsistency between different views. SnapNet takes RGB and RGB-D images as input; pixel-level labels are obtained through a CNN, and the segmentation is completed via back-projection of the labels [8]. Although the models mentioned above achieve powerful and efficient feature learning, the spatial structure features provided by the mesh cannot be captured well due to occlusion during projection with real physical cameras. Virtual multiview fusion was put forward to solve this issue [9]. The key idea is to use synthetic images rendered from "virtual views." Problems of misalignment, occlusion, and scale invariance encountered with multiview projection are relieved. PLVNet introduces the parameterized-view-learning mechanism to parameterize the views as learnable parameters [30]. This model achieves considerable performance without relying on a huge number of parameters. Multiview-based methods effectively capture the features of the 3-D scene surface with classifiers trained on massive image datasets. Despite their strong practicability, high-dimensional spatial information is lost during rendering and projection. Moreover, perspective issues produced by back-projection hinder further improvement of segmentation accuracy.

B. Mesh Based Methods
Mesh surfaces express geometric, topological, and high-resolution textural information simultaneously. Several models have been presented to learn features directly from the mesh surface. Masci et al. [31] designed a novel feature descriptor to learn features from local patches parameterized by radius and angles, pioneering the deep learning of local mesh features. Monti et al. [10] parameterized the mesh surface into local patches, which can be processed by a neural network in the graph spatial domain. The two works above rely on handcrafted local coordinates. Verma et al. [12] presented a novel graph-convolution operator that dynamically establishes correspondences between filter weights and graph neighborhoods with arbitrary connectivity, replacing the previous hand-designed coordinates. TextureNet parameterizes the mesh with a consistent four-way rotationally symmetric field to extract the local high-resolution textural information of the mesh surface for mesh segmentation [14]. Unlike previous local parameterizations, Li et al. [32] segment the textured mesh into multiple atlases for global parameterization. Cross-atlas convolutions, which recover the geodesic neighborhood, are the key to this method, which achieves invariance to arbitrary parameterization. Hanocka et al. [13] took five elements associated with the edges of the mesh as input and performed convolution on the edges directly. The highlight of this method is the design of a pooling layer via edge collapses [33], which forms a task-driven processing model and retains more task-related edges. Schult et al. [15] proposed a deep hierarchical convolution network that combines geodesic and Euclidean convolutions and is able to learn useful geodesic and Euclidean features at different scales. Although the methods mentioned above lead to significant performance gains for semantic segmentation, almost all are merely suitable for small-scale object datasets such as the Princeton shape benchmark [16], not urban scenes [17], [18]. Their high demands on the connectivity of the surface facets of the whole mesh may be responsible for this. Due to the limited capacity of the neural network, large-scale meshes need to be split into small tiles before being fed to the network, which breaks the mesh data into incoherent fragments. Besides, 3-D urban meshes inevitably contain isolated objects floating in the air (such as tree crowns). The existence of incoherent fragments and isolated objects makes urban meshes hard to feed into mesh convolution networks, thereby leading to poor mesh segmentation performance.

C. Point Cloud Based Methods
Point clouds, another widely used representation of 3-D urban scenes, are simple in data organization. Research on semantic point cloud segmentation is in full swing [19], [21], [34]. Among them, PointNet and PointNet++ are regarded as milestones in point cloud processing [22], [23]. Specifically, PointNet is of pioneering significance in learning directly on unordered point clouds. In this architecture, a multilayer perceptron is applied for feature extraction, and T-Net is employed to handle the disorder and geometric rotation of the point cloud. PointNet++ is composed of the feature extraction module of PointNet and a farthest point sampling (FPS) module [23]. This network architecture improves performance by considering the relationships among neighboring points and learning local patterns.
On the basis of PointNet and PointNet++, a series of point-based methods applying deep learning frameworks directly to unordered point clouds has been presented. Compared with point clouds, meshes provide more information (e.g., high-resolution texture, explicit surface connectivity) and are superior at representing sharp regions. In view of the excellent performance of point cloud segmentation methods and the simplicity of their data organization, researchers have tried to combine point cloud segmentation methods with urban mesh segmentation. Tutzauer et al. [25] computed the COG point for every triangle and obtained a point cloud whose size equals the number of facets in the urban mesh, so that classifiers originally designed for points (e.g., a multibranch 1-D CNN) can be applied. The benefit of available color information for segmenting urban scenes is demonstrated in that article too. Continuing along this technical route, Laupheimer et al. [26] designed comparative experiments for urban mesh segmentation attesting that per-face color information outperforms per-vertex color information and that the fine-tuned PointNet++ performs best due to its hierarchical learning ability. The same team made further improvements via transferring features and labels between point clouds and meshes [27]. Grzeczkowicz and Vallet [35] proposed to apply the Poisson disk sampling algorithm on textured meshes, and further utilized point cloud semantic segmentation methods to complete the task. Gao et al. [36] proposed a two-stage deep learning framework, which first over-segments the mesh to generate semantically meaningful segments; then, a graph is constructed to encode the geometric and texture information, and a graph convolutional network is used to classify the segments. Guan et al. [37] constructed a graph based on the point set extracted from the mesh and performed convolution on this graph for feature extraction; besides, local and global features are captured simultaneously and aggregated together for point labeling. Point-based methods leverage the efficient classifiers designed for points in previous research and combine them with the information provided by the mesh. However, the texture of the mesh is not fully exploited, because most existing methods simply use color statistics as the texture feature. To fill this gap, we propose a new deep learning method that extracts a set of COG points to represent the mesh and performs convolution on each facet to obtain texture features. Then, the learned texture features are concatenated with nontexture features on the computed point cloud. A U-Net-based architecture is employed for the prediction of the COG point clouds.

III. METHODOLOGY
The proposed method is mainly designed for the 3-D mesh semantic segmentation of urban scenes. Specifically, the focus of this task is inferring the category of the ground object to which each face of the textured mesh belongs. The core of our method is the TextureConv module, which makes good use of mesh texture information. Note that the proposed TextureConv module is quite different from that in TextureNet, mainly in the way the local texture images are generated. TextureNet parameterizes the mesh through a 4-RoSy field and generates oriented patches centered on sample points. Unlike these patches, our method directly maps the coordinates of several points in each triangular texture of the mesh to form a square map. Compared with TextureNet, the proposed method is more suitable for the mesh semantic segmentation of large-scale outdoor scenes, because such data usually contain isolated floating objects caused by reconstruction or tile splitting, while feature learning based on each triangular facet can reduce their negative effects. In addition, we carefully selected several initial features, including radiometric features (e.g., median RGB) and geometric features (e.g., the normal vector of the triangular face). All the features are assigned to the COG point of each face. Besides, residual connections and a dilated k nearest neighbor (kNN) search were introduced to ensure network performance.

A. TextureConv Module
The TextureConv module is located at the forefront of the network and mainly consists of two parts: mapping and 2-D convolution. Given a texture image and the texture coordinates of the vertices of the triangles, we create a square texture map for each triangle, as shown in Fig. 2. The purpose of the mapping is to establish the correspondence between the irregular triangular texture and the regular square map for the further employment of 2-D convolution and pooling.
Assuming a triangle texture and a corresponding square map, we divide each into R × R equal sub-blocks with side length 1/R, and each subtriangle corresponds to a square sub-block, as shown in Fig. 2 (taking R = 3 as an example). We carefully set up this mapping, which preserves the adjacency relationships between sub-blocks to a certain degree. With the mapping correspondence established, the next step is to calculate the RGB value of each square sub-block. We obtain the RGB value from the texture image according to the texture coordinates of a certain point p inside a subtriangle and assign it to the corresponding square sub-block. Given the texture coordinates of the three vertices of a triangle, the barycentric coordinates can be used to represent any point inside the triangle [38]. Suppose the three vertices of a triangle are v0, v1, v2. We take the two vectors v0 − v2 and v1 − v2 as the +β and +γ directions to establish a local coordinate system with coordinate range [0, 1]. In this coordinate system, the coordinates of v2, v0, v1 are (0, 0), (1, 0), and (0, 1), respectively, and the texture coordinates t(p) of a given point p contained in the triangle can be computed as

t(p) = t(v2) + β(t(v0) − t(v2)) + γ(t(v1) − t(v2)).    (1)

Equation (1) can be further organized as

t(p) = (1 − β − γ)t(v2) + βt(v0) + γt(v1)    (2)

where 1 − β − γ, β, and γ are the barycentric coordinates of point p. Therefore, we can calculate the texture coordinates based on the β and γ of a certain point.
Considering a special case, we use the centroid of the subtriangle to represent the corresponding square sub-block. Since the barycentric coordinates of a triangle centroid are (1/3, 1/3, 1/3), the texture coordinates of each square sub-block can be calculated according to (2). Taking the subtriangles "0" and "2" as an example, their centroid coordinates under the established coordinate system follow directly from averaging their vertex coordinates; for the other subtriangles, we only need to add a certain offset (a multiple of 1/R). Based on the texture coordinates of the subtriangles' centroids, we use bilinear interpolation to obtain the RGB values and assign them to the corresponding square sub-blocks. Since the initial vertex selection of triangular faces is random in triangulation, rotation invariance of the faces is ensured.
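To make the mapping concrete, the following is a minimal sketch, assuming the barycentric form of (2) as written above; the function name and the sample triangle coordinates are hypothetical.

```python
import numpy as np

def barycentric_to_uv(t_v0, t_v1, t_v2, beta, gamma):
    """Map local coordinates (beta, gamma) inside a triangle to texture
    coordinates per (2): weights (1 - beta - gamma, beta, gamma) on the
    vertex texture coordinates (t_v2, t_v0, t_v1)."""
    t_v0, t_v1, t_v2 = map(np.asarray, (t_v0, t_v1, t_v2))
    return (1.0 - beta - gamma) * t_v2 + beta * t_v0 + gamma * t_v1

# Centroid of the corner subtriangle for R = 3: averaging its vertices
# (0, 0), (1/3, 0), (0, 1/3) gives local coordinates (1/9, 1/9).
beta, gamma = 1.0 / 9, 1.0 / 9
uv = barycentric_to_uv((0.2, 0.1), (0.8, 0.1), (0.5, 0.9), beta, gamma)
```

The returned coordinates would then index the texture image (with bilinear interpolation, as described above) to color the corresponding square sub-block.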
After mapping, 2-D image convolution and pooling operations can be applied to the square map to extract texture features. We apply two 3 × 3 convolutional layers (stride = 1). After each convolutional layer, batch normalization, a ReLU activation function, and a pooling layer are inserted in sequence. The kernel size of the first pooling layer is 2 × 2 with stride 2, while the second pooling layer is a global pooling. The number of channels of the square map is transformed from 3 (RGB) to 16 and then to 32.
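The convolutional part can be summarized by the following PyTorch sketch; the zero padding and the use of max-pooling for the global layer are our assumptions, and the 12 × 12 input resolution follows the map size mentioned in the discussion.

```python
import torch
import torch.nn as nn

class TextureConv(nn.Module):
    """Sketch of the TextureConv stack: two 3x3 convolutions (stride 1),
    each followed by batch normalization, ReLU, and pooling; the channel
    count goes 3 (RGB) -> 16 -> 32."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(16), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.block2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.AdaptiveMaxPool2d(1))  # global pooling: one value per channel

    def forward(self, x):  # x: (F, 3, 12, 12), one square map per face
        return self.block2(self.block1(x)).flatten(1)  # (F, 32) features

feats = TextureConv()(torch.rand(4096, 3, 12, 12))  # -> (4096, 32)
```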

B. Features Combination
In addition to the texture features, other features based on the basic component of the mesh (i.e., the triangular face) are introduced, such as the COG XYZ, the normal, and the median RGB. Among them, the COG XYZ represents the 3-D coordinates of the centroid of the triangular face in the local coordinate system, containing important location information. The normal is a vector perpendicular to the plane where the triangle is located, which reflects geometric information. The median RGB depicts the median RGB value over all sub-blocks of the square map and is suitable for representing the general color of the square map. Similar to the processing method of Tutzauer et al. [25], we concatenate all the features and attach them to the central point of each triangle. In this way, meshes can be represented in the form of COG point clouds, and each COG point corresponds to a face, as shown in Fig. 3. As a result, the high-resolution texture features and the geometric information of the discrete point cloud can be organically combined.
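Concretely, the per-face combination can be sketched as below, assuming the 32-D texture features produced by the TextureConv sketch above; the function name is hypothetical.

```python
import torch

def build_cog_features(cog_xyz, normals, median_rgb, tex_feats):
    """Attach all per-face features to the COG point of each face.
    cog_xyz: (F, 3), normals: (F, 3), median_rgb: (F, 3),
    tex_feats: (F, 32) from the TextureConv module."""
    return torch.cat([cog_xyz, normals, median_rgb, tex_feats], dim=1)
```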
C. Network Details
1) Hierarchical Network Architecture: We adopt a hierarchical architecture based on PointNet++ [23], which includes several down-sampling and up-sampling stages. As shown in Fig. 1, it integrates the TextureConv module, residual blocks, and a linear block. The residual block is essentially a local feature extractor for point clouds at different sampling levels, operating on local neighborhoods. To search for the local neighborhoods, we propose the dilated kNN, which will be introduced later in this section. The down-sampling is achieved by FPS and max-pooling [23]. Max-pooling extracts the largest feature value within a neighborhood and deals with the disorder of the COG point clouds [22]. By abstracting the COG point clouds multiple times through down-sampling, hierarchical features can be extracted via the residual blocks, which helps to obtain more sufficient semantic information. In addition, considering the complexity of urban scenes and the nonuniform density of COG point clouds, we introduce multiscale grouping (MSG) to capture local patterns from different neighborhood ranges and concatenate them together [23]. For up-sampling, we use k nearest neighbor interpolation to restore the COG point cloud. Considering the importance of the 3-D coordinates of the COG points, before each residual block in the down-sampling stage, the input feature vector is concatenated with the 3-D coordinates. Besides, we also concatenate the input of the residual block at the same level of abstraction with the output of the nearest residual block to realize the feature interaction between shallow and deep layers (the green narrow arrow in Fig. 1). At the end of the network, there is a linear block containing three fully connected layers, with dropout layers interspersed among them. The main contribution of this linear block is to realize the dimensional transformation of the features according to the final number of output categories.
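As an illustration of the up-sampling step, here is a minimal sketch of k nearest neighbor feature interpolation, assuming the inverse-distance weighting of PointNet++ [23] (the value of k and the weighting scheme are our assumptions; the text specifies only kNN interpolation).

```python
import torch

def knn_interpolate(sparse_xyz, sparse_feats, dense_xyz, k=3):
    """Propagate features from a down-sampled set back to the denser set
    via inverse-distance weighted kNN interpolation.
    sparse_xyz: (M, 3), sparse_feats: (M, C), dense_xyz: (N, 3) -> (N, C)."""
    dist = torch.cdist(dense_xyz, sparse_xyz)        # (N, M) pairwise distances
    d, idx = dist.topk(k, dim=1, largest=False)      # k nearest sparse points
    w = 1.0 / (d + 1e-8)
    w = w / w.sum(dim=1, keepdim=True)               # normalized weights
    return (sparse_feats[idx] * w.unsqueeze(-1)).sum(dim=1)
```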
2) Residual Block: The residual block consists of convolutional layers with a kernel size of 1 × 1, batch normalization, ReLU, and a residual connection. Each block has the same structure, as shown in Fig. 4, except for the dimensions of the input and output. Moreover, there is a slight difference in the input of the residual block between the up-sampling and down-sampling stages. In the down-sampling stage, the residual block is applied on each local neighborhood and is thus considered a local feature extractor, while in the up-sampling stage, it is applied on each point.
Gradient vanishing and explosion problems make deep networks difficult to train, thereby lowering performance. The residual connection (skip connection) proposed by ResNet relieves this problem and has achieved excellent results with extremely deep network architectures [39]. Inspired by ResNet, we introduce residual connections to speed up the convergence of the network and improve segmentation accuracy. In Fig. 4, since the dimension of the input features changes after the two convolutions of the main path, the feature dimension on the branch needs to be transformed accordingly. To this end, we use a convolutional layer with a 1 × 1 kernel size. To maintain consistency with the figures in ResNet, the dashed line in Fig. 4 represents the residual connection with a change in feature dimension.
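A minimal PyTorch sketch of the block in Fig. 4 follows; the layout of the grouped neighborhoods as (batch, channels, points, neighbors) is our assumption, matching common PointNet++-style implementations.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 1x1 convolutions with batch normalization and ReLU on the main
    path; a 1x1 convolution on the shortcut when dimensions change (the
    dashed line in Fig. 4)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 1), nn.BatchNorm2d(c_out))
        self.shortcut = (nn.Conv2d(c_in, c_out, 1) if c_in != c_out
                         else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C_in, npoint, k) grouped neighborhoods
        return self.relu(self.main(x) + self.shortcut(x))
```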
3) Dilated kNN: The proposed dilated kNN is inspired by dilated convolution [40], a method that expands the receptive field in 2-D images. Yu and Koltun conducted experiments proving that dilated convolution effectively facilitates image semantic segmentation. We therefore introduce the concept of dilation into the kNN neighborhood search algorithm. Given a point set S and a point p ∈ S, let k and d denote the number of nearest neighbors and the dilation value, respectively. As shown in Fig. 5, the algorithm first finds the k × d points closest to point p, and then randomly selects k points among them as the neighborhood of point p. The proposed neighbor searching approach maintains a considerable receptive field without changing the number of neighbors and thus can better cope with scene segmentation tasks.
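The search can be sketched as follows; the brute-force distance computation is for illustration only (a spatial index would be used at scale).

```python
import torch

def dilated_knn(points, queries, k, d):
    """Dilated kNN: for each query, find the k*d nearest points, then
    randomly keep k of them, enlarging the receptive field without
    increasing the neighborhood size. points: (N, 3), queries: (Q, 3)."""
    dist = torch.cdist(queries, points)              # (Q, N) distances
    _, idx = dist.topk(k * d, dim=1, largest=False)  # k*d nearest candidates
    choice = torch.rand(idx.shape).argsort(dim=1)[:, :k]  # random subset of k
    return idx.gather(1, choice)                     # (Q, k) neighbor indices
```

With d = 1 this reduces to the standard kNN, so the dilation value directly controls how far the sampled neighborhood reaches.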

IV. EXPERIMENTS
To evaluate the performance of the proposed method, a series of mesh semantic segmentation experiments is conducted on two datasets. On the Wuhan dataset, we prove the effectiveness of our proposed method and conduct comparative experiments with different levels of texture features. Next, the SUM-Helsinki dataset is used for comparison with other state-of-the-art methods [41].

A. Dataset
Wuhan: This dataset is self-made and was collected from a residential area in Wuhan (China); it covers an area of 0.53 km², including about 20 million faces. We generated the mesh data from oblique aerial images with a ground sample distance (GSD) of 5 cm using DJI Terra. We manually annotated the dataset using CloudCompare and double-checked the annotations for quality assurance [41]. The dataset is divided into seven object categories, namely roof, façade, window, impervious surface, tree, vehicle, and low vegetation. The impervious surface mainly contains roads, while the facade includes building facades, balconies, intricacies on the roof, and the spaces of air-conditioner external units. Fig. 6 shows an overview of the Wuhan dataset; most of the buildings are relatively high and the vegetation is densely distributed.
The dataset is randomly split into 54 tiles; the training, validation, and test sets contain 24, 8, and 22 tiles, respectively, as shown in Fig. 7(a). We also count the number of faces in each category, as shown in Fig. 7(b). The roof and vehicle classes have the fewest faces, while facade and tree have the most.
SUM-Helsinki: This is a benchmark dataset for semantic urban meshes, located in the central area of Helsinki (Finland), with a range of about 4 km², generated from oblique aerial images with a GSD of about 7.5 cm [41]. The entire dataset is divided into 64 tiles; the training set includes 40 tiles, and the validation and test sets each contain 12 tiles. The SUM-Helsinki dataset contains six object categories, namely terrain, high vegetation, building, water, vehicle, and boat. In Fig. 8, we show the entire SUM dataset and the corresponding ground truth meshes.

B. Data Processing
For batch training in deep learning methods, the dataset needs to be divided into batches taken as input to the network. For each tile of the datasets, we pick T COG points using FPS to make sure they are evenly distributed throughout the tile. Suppose a tile contains M COG points and the sample size is N [23]; we define T = M × K/N, where K represents the number of coverage times. Next, taking the T COG points as centers, we construct T boxes and randomly select N COG points within each box range. If there are fewer than N points in a box, existing points are repeated to meet the requirement. In the inference stage, there may still be unselected points in the dataset after sampling; we use the neighborhood majority to decide the labels of the remaining points. To simplify calculation, we centralize the 3-D coordinates of the points. Moreover, when constructing samples, we randomly rotate the selected points for data augmentation.
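The sample construction can be sketched as follows, assuming an axis-aligned box of hypothetical half-size box_half (the text does not specify the box dimensions) and a simple greedy FPS.

```python
import numpy as np

def fps(points, t):
    """Greedy farthest point sampling: return t indices (a simple sketch)."""
    idx = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    for _ in range(t - 1):
        idx.append(int(d.argmax()))
        d = np.minimum(d, np.linalg.norm(points - points[idx[-1]], axis=1))
    return np.array(idx)

def sample_tile(cog_xyz, n=4096, k_cov=2, box_half=25.0):
    """Build training samples for one tile: pick T = M*K/N box centers with
    FPS, then randomly draw N COG points within each box range."""
    m = len(cog_xyz)
    t = max(1, int(np.ceil(m * k_cov / n)))
    samples = []
    for c in cog_xyz[fps(cog_xyz, t)]:
        in_box = np.flatnonzero(np.all(np.abs(cog_xyz - c) < box_half, axis=1))
        # repeat existing points if the box holds fewer than n of them
        pick = np.random.choice(in_box, n, replace=len(in_box) < n)
        samples.append(cog_xyz[pick] - c)  # centralize the 3-D coordinates
    return samples
```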

C. Implementation Details
Our model runs on a Tesla M60 GPU and is mainly implemented with the PyTorch framework. Since the Wuhan and SUM-Helsinki datasets contain different categories, we train and experiment on them separately. Besides, considering that these two datasets have similar GSDs, the parameters involved in the model are set to be the same. The coverage times K of the tiles is set to 2. In each box, 4096 COG points are randomly selected. In the inference stage, the batch size is increased to 128, since this process only performs forward propagation.
For the loss function, weighted cross-entropy loss is applied. As shown in Fig. 7(b), the categories are imbalanced, so we determine a weight for each category to alleviate this problem. For any category c, its weight Wc is defined as

Wc = 1 / ln(α + Nc/N)    (3)

where Nc represents the number of faces of category c, N is the total number of faces, and α is set to 1.2 according to previous research [34].
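Under this reading of the weighting formula, the loss can be set up as in the sketch below; the per-class face counts are hypothetical placeholders.

```python
import numpy as np
import torch
import torch.nn as nn

ALPHA = 1.2
face_counts = np.array([2e6, 5e6, 1e6, 3e6, 6e6, 0.5e6, 2.5e6])  # hypothetical per-class counts
freq = face_counts / face_counts.sum()                  # Nc / N per category
weights = torch.tensor(1.0 / np.log(ALPHA + freq), dtype=torch.float32)
criterion = nn.CrossEntropyLoss(weight=weights)         # weighted cross-entropy
```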

D. Evaluation Metrics
Overall accuracy and F1 score are used to evaluate the results on the Wuhan dataset. Overall accuracy (OA) is defined as the proportion of correct predictions and reflects the segmentation performance globally. The F1 score is computed from two secondary metrics, precision and recall; compared to overall accuracy, it evaluates the results more comprehensively [43]. In addition, to facilitate comparison with the results on the SUM-Helsinki dataset, the intersection over union (IoU) is introduced. Similar to the F1 score, we use the mean IoU to measure performance on the entire test set, and the per-class IoU to evaluate every single category.
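All three metrics can be derived from a single confusion matrix, as in this minimal sketch (rows are ground truth, columns are predictions).

```python
import numpy as np

def metrics_from_confusion(cm):
    """Return OA, per-class F1, and per-class IoU from confusion matrix cm."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                  # false positives per class
    fn = cm.sum(axis=1) - tp                  # false negatives per class
    oa = tp.sum() / cm.sum()
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    iou = tp / np.maximum(tp + fp + fn, 1)
    return oa, f1, iou    # use f1.mean() and iou.mean() for the averages
```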

V. RESULTS AND ANALYSIS
In Section V-A, we show the results of mesh semantic segmentation on the Wuhan dataset. To show the ability of the proposed TextureConv module, we make a comparison with texture features of different levels as input in Section V-B. Section V-C compares the mesh semantic segmentation results of the point sampling method with different point densities. In Section V-D, we compare our method with other state-of-the-art methods.

A. Results on Wuhan Dataset
We show the results on three tiles of the Wuhan dataset in Fig. 9. As shown in the figure, our method completes the mesh semantic segmentation task very well, and most of the faces are correctly predicted, with an overall accuracy of 85.7%. Even in some complex scenes, such as the third row in Fig. 9, the roof of a low-rise building hidden in trees can be correctly labeled.
To further reveal the performance of the proposed method on the Wuhan dataset, we show partial details in Fig. 10, and the confusion matrix and evaluation results of each category are shown in Fig. 11 and Table I, respectively. It can be observed that all seven object categories have excellent F1 scores, especially façade and tree, which are around 90%. However, there are also some typical errors in the segmentation results, shown in Fig. 10. The scene presented in the first row of the figure mainly shows the segmentation of a building and its components, with the black ellipses indicating confusions. The roof is easily predicted to be façade, mostly at the edge of the roof and at the junction with roof furniture. Similar misclassifications occur at the edges of window and vehicle, which shows that accurately distinguishing the faces at the boundaries of different categories is challenging for the model. As for the reasons, the proximity of local features and the accuracy of the manual labeling of the mesh are important factors. The upper ellipse in the second row of Fig. 10 shows low vegetation predicted as tree. Since the geometric features of low vegetation and tree are quite different, we think the main reason for the error is their similar texture. The lower ellipse indicates cases where tree is misclassified as vehicle. Such a situation reflects that the model does not rely solely on a single texture to distinguish categories. The color composition of the vehicle is diverse, and the tree together with the ground shown in the ellipse makes the model mistakenly classify them as vehicle.

B. Effectiveness of TextureConv Module
To prove the effectiveness and importance of the TextureConv module, comparative experiments are carried out with three levels of texture features. Level 1: for the most basic level, we choose the median RGB of each face. Level 2: on the basis of level 1, we add the standard deviation of the color distribution and histogram features. Level 3: based on level 1, we introduce the proposed TextureConv module. Note that geometric features (e.g., the COG 3-D coordinates and the normal) are used at all three levels. As for the calculation of the histogram features, similar to Tutzauer et al. [25], we first divide the RGB color space into M bins (e.g., M = 12), and the histogram feature of each face is then the number of pixels falling in each bin, i.e., an M-dimensional feature vector (see the sketch after this paragraph). Table II shows the results for texture features at different levels. Compared with level 1, OA and average F1 score are improved by 5.6% and 7.1%, respectively, with our TextureConv module. The comprehensive increase in the evaluation metrics demonstrates that the proposed module obtains rich texture information that benefits prediction performance. Compared to level 2, our OA and average F1 scores increase by 1.9% and 4.0%, respectively. The improvement in OA is limited, while that in average F1 is significant. Compared with OA, average F1 reduces the impact of categories with a large number of samples on the results and reflects the segmentation performance more comprehensively. It can also be observed that the F1 of vehicle is greatly improved, which indicates that the proposed TextureConv is able to learn high-level, easily distinguishable texture features from objects with diverse colors, compared to hand-crafted features. Furthermore, to show the improvement brought by the proposed method more intuitively, we circle details of the three categories of tree, facade, and vehicle in Fig. 12. The model with level 3 texture features as input performs best. For the first column, although level 2 correctly predicts most of the trees, there are misclassifications at the edges. For the second column, neither level 1 nor level 2 predicts the vehicle category well: the former tends to label vehicles as impervious surface, and the latter is prone to predicting them as tree, which shows that the introduction of color distribution and histogram features brings only a slight positive impact on vehicle prediction. We think this is due to the variety of vehicle colors, and it also reflects that the level 2 texture features are poor at expressing sufficient information. For the third column, the proposed level 3 texture features can more accurately identify façade (mainly roof furniture) compared with the other two.
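For reference, the level-2 histogram feature can be sketched as below; how the RGB color space is divided into the M bins is not fully specified in the text, so the gray-level quantization used here is our assumption.

```python
import numpy as np

def rgb_histogram(face_pixels, m=12):
    """M-bin histogram feature of one face. face_pixels: (P, 3) uint8 RGB."""
    intensity = face_pixels.mean(axis=1)                     # per-pixel intensity
    bins = np.minimum((intensity / 256.0 * m).astype(int), m - 1)
    return np.bincount(bins, minlength=m) / max(len(face_pixels), 1)
```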

C. Comparison of Point Sampling Methods
In addition to COG point clouds, a mesh can also be sampled as uniform RGB point clouds with different point densities to capture the mesh texture to some degree. To validate the effectiveness of the designed model based on the COG point sampling method, further experiments are conducted on the Wuhan dataset. Specifically, we first use CloudCompare, a GPL software, to sample point clouds of 10, 30, and 60 pts/m² from the mesh, and then feed them to the network separately. Our COG point cloud density is about 9.1 pts/m², which is lower than the above three. Most of the parameters are fixed, except for the numbers of sampling points and nearest neighbors in Table III. The input features include the 3-D coordinates, the normal, and the RGB information. Notice that the TextureConv module is removed, since the texture information is already encoded on each point.
As shown in Table IV, we compare the training time and evaluation results for 200 epochs with different point densities. The comparison shows that as the sampling point density increases, both OA and average F1 score significantly improve, but this is accompanied by a substantial increase in training time. Compared with 60 pts/m², although its F1 scores for impervious surface, tree, and roof are higher than ours, neither its OA nor its average F1 score exceeds our results. Also, note that the training time required for 60 pts/m² is about 13 times ours. In addition, the comparison with 10 pts/m² shows that our TextureConv module trades a relatively low time cost (about 3.1 h) for a 4% OA and 4.1% average F1 score improvement. This demonstrates that the texture convolution brings a small amount of additional time consumption with a great improvement in prediction accuracy.

D. Comparison With State-of-the-Art Methods
We conduct an experiment on the SUM-Helsinki dataset to compare with other methods in Table V. Some published results on the SUM dataset are chosen for comparison, including PointNet [22], PointNet++ [23], SPG [19], RandLA-Net [21], KPConv [44], RF-MRF [45], RF [42], PSSNet [36], and PDS [35]. Among them, except RF and RF-MRF, the others are deep learning (DL) methods. The input of these DL methods is a uniform dense point cloud with a point density of 10 pts/m², generated by mesh sampling. The labels are obtained by nearest neighbor searching. For a fair comparison, the evaluation is based on surface area rather than the number of faces, and our results are the average of ten runs.
In Table V, the proposed method achieves considerable performance compared with other state-of-the-art approaches, ranking third in mIoU among all methods. Grzeczkowicz and Vallet [35] proposed to utilize the PDS method to sample the textured mesh and then use the KPConv network to semantically segment the sampled point cloud. The results obtained by this method are excellent, and all of the evaluation metrics show that it is far better than the other methods. The results of PSSNet and KPConv are similar to those of the proposed model. Compared with PSSNet, our OA and mIoU are 0.8% and 0.4% lower, while our mF1 is 0.4% higher. As for KPConv, our results are higher in mIoU and mF1, by 3.6% and 5.3%, respectively, while our OA is slightly lower, by 0.3%.
Among all categories, our accuracy on vehicle and boat is relatively high, which we attribute mainly to two points: on the one hand, sufficient texture information for distinguishing object categories can be captured by our TextureConv module; on the other hand, the adopted weighted cross-entropy loss makes the network pay more attention to minority classes. In addition, our prediction accuracy on high vegetation ranks second, owing to the in-depth expression of ground-object color information by the high-level texture features.

VI. DISCUSSION
The abovementioned experiments validate the proposed method on the Wuhan and SUM datasets, and the results prove its effectiveness. Table II shows that extracting color information from texture with simple statistical calculations is insufficient compared with the proposed TextureConv module. In addition, Table IV shows that our model trades little time for an obvious accuracy improvement, which further demonstrates its potential.
In comparison with other methods, the accuracy of water and terrain drags down the model performance. We believe the poor accuracy of these two categories is mainly caused by two factors. The primary cause is the use of a fixed size (e.g., 12 × 12) for the triangular texture maps, which ignores their uneven sizes and induces a loss of texture information. The secondary cause is the nonuniform point density: compared with other categories, terrain and water usually contain faces with larger areas, which leads to nonuniform point density in the COG point clouds. The similar normal features of these two categories also contribute to this result, making them prone to confusion and thereby lowering the prediction accuracy.
For future work, considering the poor accuracy of classes with large-area faces (e.g., water and terrain), multiscale grid division can be introduced to further subdivide those faces so as to extract accurate texture features. In addition, the quality of the mesh may have an impact on the network. A mesh with low granularity or with large texture and geometric differences from the actual surface is likely to provide confusing and incorrect information, which will affect feature learning and reduce network performance. Thus, we will consider exploring the sensitivity of the proposed method to meshes of different quality. Moreover, taking the similarity between the training set and the test set into account, we will introduce scenes from different cities and different seasons for model training in subsequent research to improve generalization.

VII. CONCLUSION
In this article, to make the most of texture information, we proposed a texture integrated deep learning framework for the semantic segmentation of urban meshes. The proposed method employs a novel texture convolution module (TextureConv) for texture feature extraction. Experiments show the effectiveness of the proposed method. Our TextureConv module captures the information contained in the texture map better than traditional color features, resulting in improved network performance. Compared with the uniform sampling method, the adopted COG point cloud method can significantly improve efficiency. The comparison with other state-of-the-art methods on SUM-Helsinki also shows that comparable results have been achieved.

Yetao Yang received the Ph.D. degree in photogrammetry and remote sensing engineering from Wuhan University, Wuhan, China, in 2009.
He is currently an Associate Professor with the College of Geophysics and Spatial Information, China University of Geosciences, Wuhan. His research interests include urban geo-information acquisition and its applications.
Rongkui Tang received the bachelor's degree in geoinformation science and technology in 2020 from the China University of Geosciences, Wuhan, China, where he is currently working toward the master's degree in resources and environment.
His research interests include point cloud and mesh semantic segmentation.
Mengjiao Xia received the bachelor's degree in geographic information science in 2019 from the Chengdu University of Technology, Chengdu, China. She is currently working toward the master's degree in resources and environment with China University of Geosciences, Wuhan, China.
Her research interests include point cloud classification and semantic segmentation.
Chen Zhang received the bachelor's degree in geoinformation science and technology in 2020 from the China University of Geosciences, Wuhan, China, where he is currently working toward the master's degree in resources and environment.
His research interests include LiDAR data processing and building reconstruction.