A new deep distortion convolutional neural network for semantic segmentation of panoramic images

Semantic segmentation of panoramic images plays a crucial role in many applications, such as scene understanding, autonomous navigation, and community security. However, traditional deep learning algorithms achieve lower accuracy on panoramic images because of their severe distortion. This paper proposes a new deep convolutional neural network for the semantic segmentation of panoramic images of outdoor scenes. The proposed network includes two branches: a semantic segmentation sub-branch (SS-branch) and a feature enhancement sub-branch (FE-branch). The SS-branch contains two parts: an encoder and a decoder. In the encoder, a new distortion convolution layer (DCL) is defined to reduce the distortion of panoramic images. In the FE-branch, the Canny detector is used to extract edges and a lightweight network is designed to enhance the features of image boundaries. Finally, a weighted loss function is used. The experimental results show that the proposed semantic segmentation method performs better on different outdoor scenes.


Introduction
The rapid development of semantic segmentation methods has allowed us to accurately recognize various objects in an image [1]. Panoramic images acquired by a camera contain a wider view than common images. Nowadays, panoramic images are widely used in many applications, especially image recognition, object localization and detection [2], and visual navigation [3]. However, traditional semantic segmentation algorithms are not adequate for panoramic images because of the significant distortion effect. Therefore, the segmentation of panoramic images is very important for scene understanding and environment perception.
Although semantic segmentation methods for normal images have developed rapidly in recent years, those networks handle panoramic images poorly. Most networks that process normal images assume that the images lie on a planar grid. However, a panoramic image is defined on a different structure from a normal image. Therefore, convolutional neural networks (CNNs) cannot be applied directly to panoramic images.
The commonly used methods for the semantic segmentation of panoramic images can be divided into two categories: direct methods [4]-[6] and indirect methods [7]-[9]. In the direct methods, special deep learning networks are designed to segment panoramic images end-to-end, without requiring image distortion correction in advance. In the indirect methods, the panoramic images are first corrected into normal images, and then traditional deep learning networks are used for the semantic segmentation of the normal images.
Among the direct methods, Varga et al. [4] proposed a method to segment panoramic images using the VGG network. Yang et al. [5] proposed the PASS architecture for the semantic segmentation of panoramic images. ERF-PSPNet [6] predicts semantic feature maps and uses multi-source data to train the network.
Among the indirect methods, Budvytis et al. [7] developed a semantic localization method, which first trains a semantic segmentation network on the collected dataset and then recognizes surrounding objects. Fernandez et al. [8] proposed a line-estimation method for the semantic segmentation of panoramic images; it segments panoramic images using geometric features and generates layout hypotheses of the room under a Manhattan-world assumption. Suzuki et al. [9] presented a new method for semantic change detection in panoramic images, which processes panoramic images using normal images at different scales.
In recent studies, many researchers have focused on the enhancement of feature maps. Zhao et al. [10] proposed an improved FCN for image segmentation, which combines features from different stages. Zhou et al. [11] developed a high-resolution encoder-decoder network for medical image segmentation. Hu et al. [12] designed a fully convolutional classification network for high-resolution remote sensing images, which adopts global and local attention modules to enhance feature maps. Xu et al. [13] proposed a multichannel network for gland instance segmentation, which reduces the loss of boundary information during downsampling by introducing a sub-network.
Therefore, a new semantic segmentation network for panoramic images is proposed, called the deep distortion convolutional network (DDCN). The architecture of our network is shown in figure 1. DDCN contains an SS-branch and an FE-branch, where the SS-branch consists of an encoder and a decoder. To reduce the effect of image distortion, the DCL is placed at the front end of the encoder; it changes the distortion sampling coordinates according to the degree of image distortion. To enhance the features of image boundaries, the FE-branch is developed as a lightweight network built on the Canny edge detection method [14]. Then, a weighted loss function is used to improve the training process.

Distortion convolutional layer (DCL)
A CNN is usually composed of convolution layers, pooling layers and fully connected layers; among these, the convolution and pooling layers are unique to CNNs. The basic operation of a CNN is convolution, which filters each small region of the image and extracts features from these regions. For a sampling position (x, y), the convolution is

S(x, y) = \sum_{(m, n) \in R} K_c(m, n) \cdot I_c(x + m, y + n),   (1)

where K_c is the convolutional kernel, I_c is a color channel of the input image I, S is the output feature map, and R is the regular sampling grid. This regular grid is designed for normal images and is not suitable for panoramic images. Therefore, we propose a new distortion convolution.
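As a concrete illustration, the regular-grid sampling described above can be sketched as a plain sliding window over one colour channel. This is a minimal NumPy sketch; the function name and the 'valid'-style output size are our own choices, not part of the paper:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over one colour channel and collect local responses.

    image  : 2-D array, a single colour channel I_c
    kernel : 2-D array, the convolutional kernel K_c
    Returns the 'valid'-size output feature map S.
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Regular-grid sampling: every output uses the same fixed window.
            region = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(region * kernel)
    return out
```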
The distortion convolutional layer replaces the regular grid with a warped sampling region:

S(x, y) = \sum_{(m, n) \in R} K_c(m, n) \cdot I_c(x + m, y + n + \Delta n),   (2)

where \Delta n is the distortion sampling coordinate for panoramic images, used to offset the effects of object distortion in panoramic images. In our network, the size of the distortion sampling region is set to 5×7.
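The distortion convolution can likewise be sketched by letting each kernel tap carry a vertical sampling offset. This is a minimal NumPy sketch under our own assumptions: the offsets are supplied as a given array of the kernel's shape, and sampling positions are clamped at the image border (details the paper does not specify):

```python
import numpy as np

def distortion_conv2d(image, kernel, offsets):
    """Convolution whose sampling grid is warped by per-tap vertical offsets.

    offsets has the same shape as the kernel; offsets[m, n] is the extra
    row displacement applied to tap (m, n), emulating the distortion
    sampling coordinate of the DCL (names here are our own).
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            acc = 0.0
            for m in range(kh):
                for n in range(kw):
                    # Shift the vertical sampling position by the offset,
                    # clamping so we stay inside the image.
                    yy = int(np.clip(y + m + offsets[m, n], 0, h - 1))
                    acc += kernel[m, n] * image[yy, x + n]
            out[y, x] = acc
    return out
```

With all offsets zero, this reduces to the ordinary regular-grid convolution.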

Distortion sampling coordinate
To determine \Delta n in (2), edges are first extracted from panoramic images using the Canny edge detection algorithm [14], as shown in figure 3. Then, the edges in the panoramic image that are horizontal lines in the real world are selected, such as the curb of the road or a straight marking line. To calculate the mean curvature of an edge E, we use the method described in detail in [15]. To obtain an accurate mean curvature, K edges are selected from different panoramic images.
The mean curvature of an edge with N points is

\bar{\kappa} = \frac{1}{N} \sum_{i=1}^{N} \kappa_i,

where \kappa_i is the discrete curvature at the point (x_i, y_i). The adjustment factor f is designed to change with the distortion degree of different panoramic images:

f = p \cdot \frac{1}{K} \sum_{k=1}^{K} \bar{\kappa}_k,

where \bar{\kappa}_k is the mean curvature of the k-th edge and p is a constant. The distortion sampling coordinate \Delta n is then calculated from f, the set distortion variable c, and the height w of the panoramic image.
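How the mean curvature and the adjustment factor might be computed can be sketched as follows. The discrete curvature estimator used here (turning angle over mean incident-segment length) is a common stand-in for the method of [15], which the paper does not reproduce; `p` is the paper's constant:

```python
import numpy as np

def mean_curvature(points):
    """Mean discrete curvature of an edge given as an ordered point list.

    A common discrete estimate (a stand-in for the method of [15]) is
    the turning angle at each interior point divided by the mean length
    of its two incident segments.
    """
    pts = np.asarray(points, dtype=float)
    curvatures = []
    for i in range(1, len(pts) - 1):
        a, b, c = pts[i - 1], pts[i], pts[i + 1]
        v1, v2 = b - a, c - b
        l1, l2 = np.linalg.norm(v1), np.linalg.norm(v2)
        # Turning angle between the two incident segments.
        cosang = np.clip(np.dot(v1, v2) / (l1 * l2), -1.0, 1.0)
        curvatures.append(np.arccos(cosang) / ((l1 + l2) / 2.0))
    return float(np.mean(curvatures))

def adjustment_factor(edge_mean_curvatures, p):
    """f grows with the average distortion over the K selected edges."""
    return p * float(np.mean(edge_mean_curvatures))
```

A perfectly straight edge has zero mean curvature, so f is small for nearly undistorted images and grows with the bending of real-world horizontal lines.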

FE-branch
In traditional segmentation networks that attend to boundaries, sub-branches that can be trained independently from the main branch are commonly used, such as DCAN [13] and deep multichannel side supervision (DMCS) [16]. However, the methods in those papers need additional boundary annotations, which are scarce in datasets of outdoor scenes. Therefore, a new FE-branch is proposed, as shown in figure 1. In the FE-branch, the Canny method is used to extract edges of images. Then, labels for the image edges are generated and used as the labels of the FE-branch.
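Generating the FE-branch labels can be sketched by marking class transitions in the existing segmentation annotation, a simple stand-in for running Canny on the label map (NumPy sketch; this simplification is our own):

```python
import numpy as np

def edge_labels(seg):
    """Binary edge map for supervising the FE-branch.

    Marks a pixel as boundary when its class label differs from the pixel
    below or to the right -- a simple stand-in for running Canny on the
    annotation, producing the same kind of thin boundary supervision.
    """
    seg = np.asarray(seg)
    edges = np.zeros_like(seg, dtype=bool)
    edges[:-1, :] |= seg[:-1, :] != seg[1:, :]   # vertical class changes
    edges[:, :-1] |= seg[:, :-1] != seg[:, 1:]   # horizontal class changes
    return edges.astype(np.uint8)
```

Because the edge labels are derived directly from the segmentation annotation, no extra boundary annotation is needed, which is the point of the FE-branch design.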
In the semantic segmentation of images, too large an upsampling factor in a single step loses texture information, which decreases segmentation accuracy. In the FE-branch, the high-level features from the bottom of the encoder are therefore upsampled by 4× and passed through a 1×1 convolution; the resulting feature maps are again upsampled by 4× and passed through a second 1×1 convolution. Finally, the segmentation result for the edges is obtained.
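The two-stage upsampling path can be sketched with nearest-neighbour upsampling and a 1×1 convolution implemented as a per-pixel channel mixing (NumPy sketch; the real network would use learned upsampling and trained weights, and all names here are our own):

```python
import numpy as np

def upsample_nn(x, factor=4):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv1x1(x, weight):
    """1x1 convolution: a per-pixel linear map over channels.

    x      : (C_in, H, W) feature map
    weight : (C_out, C_in) matrix
    """
    return np.einsum('oc,chw->ohw', weight, x)

def fe_branch(features, w1, w2):
    """FE-branch sketch: two rounds of 4x upsample + 1x1 conv,
    recovering 16x the encoder resolution in two gentle steps."""
    x = conv1x1(upsample_nn(features, 4), w1)
    x = conv1x1(upsample_nn(x, 4), w2)
    return x
```

Splitting the 16× upsampling into two 4× stages, each followed by a convolution, is exactly the "not too large in one step" design the paragraph describes.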

Weighted Loss Function (WLF)
To train the SS-branch and the FE-branch simultaneously, the loss function used in the training of the network is formulated as

L = L_{SS} + \lambda L_{FE},

where L_{SS} and L_{FE} are the losses of the SS-branch and the FE-branch respectively, and \lambda weights the FE-branch term.

Figure 4. Semantic segmentation results of Panoramic Images I-II. In the above figures, red, green, yellow, blue, purple, and black represent tree, ground, building, car, person, and others respectively.
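Reading the weighted loss as the sum L_SS + λ·L_FE (our reading of the garbled equation, consistent with the single balance weight λ = 0.3 used in the experimental setting), a minimal NumPy sketch is:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy; probs is (N, C), labels is (N,)."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

def weighted_loss(ss_probs, ss_labels, fe_probs, fe_labels, lam=0.3):
    """Total loss L = L_SS + lam * L_FE.

    ss_* supervise the segmentation output, fe_* the edge output; both
    branches use cross-entropy, as stated in the experimental setting.
    """
    return cross_entropy(ss_probs, ss_labels) + lam * cross_entropy(fe_probs, fe_labels)
```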

BS and FE represent the baseline and FE-branch respectively.

Setting
To further illustrate the effectiveness of our method, we present results for both segmentation indicators and segmentation scenes. The experimental environment is a CentOS 7.0 system with 32 GB of RAM and a Tesla K40M GPU. To train our network, the IMPART dataset is used, which is synthesized from the Cityscapes dataset [17]. The IMPART dataset contains 5000 annotated panoramic images. The losses of the SS-branch and the FE-branch both use cross-entropy. Momentum and weight decay are set to 0.9 and 0.0005 respectively. The learning rate is 0.007 and \lambda is 0.3. The annotation contains six common classes: ground, building, tree, car, person, and others.

Ablation study
To evaluate the contribution of each module in DDCN, experiments are conducted under three settings: without the DCL, with the DCL, and with the DCL+FE-branch+WLF. Our baseline is ResNet101+ASPP. Table 1 and figures 4-5 show the semantic segmentation results of the ablation study. From Table 1, the baseline+DCL exceeds the baseline by 0.39/0.74/0.38 in terms of Acc, mIoU and fwIoU. As marked by the red boxes in figures 4 and 5, the baseline+DCL achieves better segmentation than the baseline. This is because the specially designed DCL module reduces the distortion effects and improves feature extraction from panoramic images.

Figure 5. Semantic segmentation results of Panoramic Images III-IV. In the above figures, red, green, yellow, blue, purple, and black represent tree, ground, building, car, person, and others respectively.
BS and FE represent the baseline and FE-branch respectively.
We find that the performance of our method can be further improved by adding the FE-branch and the WLF. Compared with the baseline, the baseline+DCL+FE-branch+WLF, i.e. DDCN, outperforms it by 0.73/1.48/0.84 in terms of Acc, mIoU and fwIoU. As marked by the green and yellow boxes in figures 4 and 5, DDCN segments image boundaries better than the baseline+DCL. This is because the developed FE-branch and WLF enhance the feature extraction of image boundaries. These results demonstrate that the DCL, FE-branch, and WLF modules improve the semantic segmentation network through distortion reduction, feature extraction, and feature enhancement respectively.
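The three indicators used throughout the ablation (Acc, mIoU, fwIoU) can all be computed from a pixel-level confusion matrix; a minimal NumPy sketch of the standard definitions:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Pixel accuracy, mean IoU and frequency-weighted IoU from a
    pixel-level confusion matrix (standard definitions)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1                       # row = ground truth, col = prediction
    acc = np.trace(cm) / cm.sum()           # correctly labelled pixels
    iou = np.diag(cm) / (cm.sum(0) + cm.sum(1) - np.diag(cm))
    freq = cm.sum(1) / cm.sum()             # class frequency in the ground truth
    return float(acc), float(np.nanmean(iou)), float(np.nansum(freq * iou))
```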

Comparison study
To compare our DDCN with existing semantic segmentation methods for panoramic images, two classical methods, i.e. ERF-PSPNet [6] and PASS [5], are chosen for comparison. Table 1 and figures 4-5 show the segmentation results of panoramic images for the two methods and DDCN. As can be observed, our DDCN achieves an Acc of 89.60 and an mIoU of 71.66, about 1 point higher than each of the other two methods. As marked by the yellow boxes in figures 4-5, the semantic segmentation results of our DDCN are much closer to the ground truth than those of the other methods.

Conclusion
This paper proposed a new deep convolutional neural network for the semantic segmentation of panoramic images of outdoor scenes. The main technical contributions are the definition of the DCL, the design of the FE-branch, and the development of the WLF. The DCL reduces the distortion of panoramic images, the FE-branch enhances the feature maps, and the WLF improves the training process. These features make our semantic segmentation method accurate and reliable. The experimental results show that the proposed method performs well on different outdoor scenes.