Parallel global convolutional network for semantic image segmentation

In this paper, a novel convolutional neural network for fast semantic segmentation is presented. Deep convolutional neural networks have achieved great progress in visual scene understanding, but their gains in accuracy mainly come from increases in depth and width, which make large networks slow and power-hungry. A fast and efficient convolutional neural network, PGCNet, aimed at segmenting high-resolution images at high speed, is introduced. Compared with competitive methods, the generated model achieves high performance with fewer parameters and floating point operations. First, a lightweight general-purpose architecture pre-trained on ImageNet serves as the main encoder. Second, a novel lateral connection module is proposed to better transmit features from the encoder to the decoder. Third, a powerful module termed the PGCN block is proposed to extract features from each block of the encoder, and an edge decoder is applied during training as a supervision for pixels on the edges of stuff and things. Experiments show that this method has clear advantages: based on the proposed PGCNet, 75.8% mean IoU is achieved on the Cityscapes test set at 35.4 Hz on a standard Cityscapes image on a GTX 1080Ti.


INTRODUCTION
Semantic segmentation is the task of assigning a label to every pixel in an image and is one of the main tasks of computer vision [1][2][3][4][5]. Driven by powerful deep neural networks [6][7][8], great progress has been made in semantic segmentation. The accuracy improvements of these networks mainly come from increases in model depth and width, but this makes large networks slow and power-hungry. Many advanced practical applications (such as self-driving cars, robots and augmented reality) are very sensitive to latency and need to process local data online on edge devices. State-of-the-art networks require substantial resources and are therefore unsuitable for such devices, which have limited energy budgets, memory and computing power, yet require very low latency to make real-time decisions. This demand intensifies the computational burden and makes real-time semantic segmentation a challenging research goal.
A semantic segmentation model must face two main problems: restoring the input resolution and enlarging receptive fields. Semantic segmentation requires predicting the label of each pixel of the original image. The classification of pixels mainly depends on the high-level features output by the backbone of a model; if the resolution of these features is low, the segmentation of stuff and things will be blurred. To preserve the input resolution as much as possible, state-of-the-art models [9][10][11] remove the pooling operations in the last two stages of the backbone. However, increasing the resolution of the high-level feature maps output by the backbone greatly increases computational complexity, which prevents existing accuracy-oriented models from achieving real-time semantic segmentation. In addition, an image contains objects of the same category but different sizes, and the classification of some pixels may depend on the categories of surrounding pixels. Neurons with large receptive fields can cover large objects and help classify ambiguous pixels, so neurons with different receptive fields are needed to obtain multi-scale information.
The simplest way to obtain a semantic segmentation prediction is to use a fully convolutional network (FCN) [12] with its fully connected layer cast aside. Evaluated on modern hardware, the resulting model is very fast but its accuracy is low, because the resolution of its deepest output feature map is very small compared with the original image, usually 1/32 of it. Small objects are difficult to predict with such models.
As an improvement on this kind of method, U-shaped networks [13,14] were proposed. The subsampled feature maps output by deep layers have small resolution compared to the original images but encode rich semantic information. Recognition operates on these feature maps, and the resolution is restored by upsampling them. Through lateral connections, spatial information from shallow layers is introduced and the accuracy of the model is enhanced.
In this paper, we propose an effective lightweight architecture based on ResNet-18 for semantic segmentation of large images. The proposed method leads to only a moderate increase in model capacity. We make three main contributions. First, we introduce a modified ResNeXt [15] block to improve the lateral connections; it helps to better transform the features of a backbone trained for classification into features for segmentation. Second, we use an edge decoder to predict the edge of each object and use an edge loss as supervision, which helps to classify pixels on the edges of stuff and things. Third, we propose the parallel global convolutional block to extract features from each block of the encoder; this module enlarges receptive fields, and features from different stages with different receptive fields are fused to obtain multi-scale information. The proposed method achieves a good balance between prediction accuracy and model complexity. Among existing real-time methods, our model achieves state-of-the-art semantic segmentation accuracy.

RELATED WORKS
Research on image segmentation has a long history. Before deep learning was introduced to this task, images were divided into several disjoint regions according to characteristics such as gray scale, colour, spatial texture and geometric shape, so that the target was separated from the background. Many classical works use Markov Random Fields [16] and Conditional Random Fields [17] to build probabilistic graphical models for segmentation and labelling. Normalized Cuts (N-cut) [18] and GrabCut [19] are based on graph theory: N-cut extracts global information and treats image segmentation as a graph partitioning problem, while GrabCut makes full use of texture and edge information.
With the explosion of deep learning in recent years, image classification networks have been used as backbones for extracting rich features in semantic segmentation. However, to yield accurate segmentation results, semantic segmentation models need both semantic and spatial information, and backbones trained for classification cannot provide much spatial information. Meanwhile, multi-scale features are needed to segment objects of different scales and to classify ambiguous pixels. To handle these problems, multiple modifications are made for pixel-level prediction.
Many schemes are designed to extract multi-scale context information from these feature maps, such as spatial pyramid pooling (SPP) [20] and its upgraded variants. SPP averages features over aligned grids of different granularities. Combining it with atrous convolution, DeepLab [10,21] develops SPP into the atrous SPP module. By using different neurons to represent sub-regions of different sizes, PSPNet [22] develops SPP into the pyramid pooling (PP) module.
Self-attention [23] is another scheme widely used in semantic segmentation. It was originally proposed for machine translation. Wang et al. further proposed the non-local neural network [24] for tasks such as video classification, object detection and instance segmentation. The self-attention method computes the context at one position as a weighted sum over all positions. Various self-attention methods [25][26][27][28] have been proposed and achieve good performance.
Because architectures following FCN require a large amount of training data annotated at the pixel level, weakly supervised semantic segmentation [29][30][31] has received much attention recently. One of its main challenges is to effectively build a bridge between image-level keyword annotations and the corresponding semantic objects. Many works use a saliency detector [32][33][34] to capture pixel-level information and generate proxy ground truth from the original images. Although weakly supervised semantic segmentation is developing rapidly, its performance still cannot match that of models trained on finely annotated datasets.
To enlarge the spatial density of computed feature responses in classification networks, state-of-the-art models set the stride of several downsampling layers to 1 and replace all subsequent convolutional layers with dilated convolutional layers. A dilated convolution is a special form of standard convolution in which the receptive field of each kernel is increased by inserting zeros between the values of the convolution kernel. Generally, the output map is then subsampled by a factor of 8 compared to the original image, but this ends up being too costly. To save computing resources and improve inference speed, many lightweight models have been designed. SwiftNet [35] fuses features obtained from images of different resolutions to enlarge the receptive field. ESPNet [36] does not borrow networks trained for classification on ImageNet; it proposes a module based on a convolution factorization principle and achieves high accuracy with a small model. But models pretrained on ImageNet perform much better than those trained from scratch, so we choose ResNet-18 pre-trained on ImageNet as the backbone of our model and benefit from this initialization. The encoder first learns representations by performing convolution and downsampling operations; then our PGCNet uses a lightweight decoder that performs upsampling and convolution operations to generate the segmentation mask.

APPROACH
This section elaborates the details of PGCNet and describes the core PGCN module on which it is built. The framework of our model, termed PGCNet, is shown in Figure 1; the left, middle and right parts of Figure 1 are the edge decoder, the backbone and the main decoder, respectively. First, images are fed into the backbone, and then two decoders fuse features from its four stages. The backbone is ResNet-18, built from residual blocks of the form y = F(x, {W_i}) + x, where x and y are the input and output vectors of the layers considered and F(x, {W_i}) represents the residual mapping to be learned. The stride of the first convolutional layer in each stage is 2, which halves the height and width of the feature maps relative to the previous stage, so the feature maps output by the four stages of ResNet are 1/4, 1/8, 1/16 and 1/32 of the original image. By setting the stride of the first convolutional layer in the last two stages to 1, state-of-the-art models usually make the backbone output feature maps at 1/8 of the original resolution; since we are aiming at real-time inference, we leave the backbone unchanged. The four stage outputs are denoted R1, R2, R3 and R4, respectively. By analysing segmentation results, we find that pixels located on the edges of objects are harder to classify than other pixels, which severely affects segmentation accuracy. We think it is important to increase the weight of such pixels during training, and to this end we use a separate decoder to classify them. Our model therefore has two decoders. As in other models, the first decoder classifies all pixels of the input image; the second decoder classifies the pixels located on the edges of things and stuff. The two decoders are exactly the same. During training, the feature maps output by the last layer of each encoder stage, namely R1, R2, R3 and R4, are fed into the corresponding stages of both the main decoder and the edge decoder. During inference, only the main decoder is used and the edge decoder does not contribute to the final result.
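For concreteness, the following PyTorch sketch (the framework is our assumption; the paper does not name one) shows how the four stage outputs R1–R4 could be taken from a torchvision ResNet-18 pretrained on ImageNet, with all strides left unchanged as described above. Class and variable names are illustrative.

```python
import torch.nn as nn
from torchvision.models import resnet18

class ResNet18Encoder(nn.Module):
    """Backbone left unchanged: all strides stay at 2, so the four stage
    outputs R1..R4 are 1/4, 1/8, 1/16 and 1/32 of the input resolution."""
    def __init__(self, pretrained=True):
        super().__init__()
        net = resnet18(pretrained=pretrained)  # ImageNet initialization
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1 = net.layer1  # -> R1, stride 4
        self.layer2 = net.layer2  # -> R2, stride 8
        self.layer3 = net.layer3  # -> R3, stride 16
        self.layer4 = net.layer4  # -> R4, stride 32

    def forward(self, x):
        x = self.stem(x)
        r1 = self.layer1(x)
        r2 = self.layer2(r1)
        r3 = self.layer3(r2)
        r4 = self.layer4(r3)
        return r1, r2, r3, r4
```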
In each decoder, we first feed R4 into a pyramid pooling module (PPM) from the Pyramid Scene Parsing Network (PSPNet). The feature maps output by the PPM are denoted P4; a convolutional layer in the PPM reduces their channel dimension to fpn_dim (the output channel width of the feature pyramid network [FPN]). P4 encodes the highest-level semantic information. P4 is then upsampled to twice its resolution. The channel dimension of R3 is reduced to fpn_dim by a lateral connection termed the RN block, which contains three convolutional layers with kernel sizes 1, 3 and 1; the number of groups of the middle convolution is set to fpn_dim. Adding the upsampled P4 and the reduced R3 produces new feature maps, denoted P3. In the same way, the FPN gradually enlarges the feature maps from bottom to top and uses lateral connections to fuse features encoded by shallower layers of ResNet. The feature maps output by the FPN are denoted P1, P2, P3 and P4, respectively.
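A minimal sketch of the RN lateral block and one top-down fusion step, under the same PyTorch assumption; the text specifies only the kernel sizes 1, 3, 1 and the grouped middle convolution, so the batch normalization and ReLU placement below is our guess.

```python
import torch.nn as nn
import torch.nn.functional as F

class RNBlock(nn.Module):
    """Lateral connection: 1x1 -> depthwise 3x3 (groups = fpn_dim) -> 1x1.
    BN/ReLU placement is assumed; the paper lists only the kernels."""
    def __init__(self, in_ch, fpn_dim=128):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, fpn_dim, 1, bias=False),
            nn.BatchNorm2d(fpn_dim), nn.ReLU(inplace=True),
            nn.Conv2d(fpn_dim, fpn_dim, 3, padding=1,
                      groups=fpn_dim, bias=False),
            nn.BatchNorm2d(fpn_dim), nn.ReLU(inplace=True),
            nn.Conv2d(fpn_dim, fpn_dim, 1, bias=False),
            nn.BatchNorm2d(fpn_dim), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

def topdown_step(p_deep, r_shallow, rn_block):
    """One FPN fusion step: upsample P_{i+1} by 2 and add the reduced R_i."""
    up = F.interpolate(p_deep, scale_factor=2,
                       mode='bilinear', align_corners=False)
    return up + rn_block(r_shallow)
```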
However, we believe that feature maps extracted by classification networks need further manipulation to be used well, and we propose the parallel global convolutional network as such a manipulation. We feed the feature maps output by each stage of the FPN into a parallel global convolutional network module. In each GCN, the feature maps are fed into two convolution branches. In the first branch, the feature maps are processed by several 1 × k convolutions, each followed by batch normalization and ReLU; with padding, the size of the output feature maps is unchanged. The resulting feature maps are then processed by several k × 1 convolutions, each followed by batch normalization and ReLU. Conversely, in the second branch, the feature maps are successively processed by several k × 1 and then 1 × k convolutions, each followed by batch normalization and ReLU. The feature maps generated by the two branches are then combined, which enables dense connections within a large k × k region of the feature map. We arrange 128 GCNs with the same k in parallel, termed PGCN. In a regular GCN, each kernel in the latter convolutional layer can see every kernel in the previous convolutional layer; in our proposed PGCN, each kernel in the latter convolutional layer can only see one kernel in the previous convolutional layer. If we take a 1 × k convolutional layer and a k × 1 convolutional layer as a whole, PGCN is equivalent to connecting several GCNs with gcn_dim of 1 in parallel. Since the downsampling rates are 4, 8, 16 and 32, we set k to 23 in every GCN of a typical PGCN for Cityscapes. Each of P1, P2, P3 and P4 is fed into a PGCN separately. After each of them passes through the 128 GCNs, the resulting feature maps have the same size; we concatenate them and use a 3 × 3 convolutional layer to reduce the channel dimension. The feature maps output by the four PGCNs are denoted G1, G2, G3 and G4. These feature maps have different sizes, decreasing from G1 to G4 by a factor of 2 at each step. To fuse them, we enlarge them to the size of G1 by bilinear interpolation and concatenate the resized feature maps; a convolutional layer is then attached to reduce the channel dimension of the concatenated feature maps to the number of object categories.
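The sketch below shows one way to realise a PGCN from this description: since each of the 128 parallel GCNs has gcn_dim of 1, the whole bank can be implemented with grouped (depthwise) 1 × k and k × 1 convolutions, with the concatenation of the 128 parallel outputs implicit in the channel dimension. Combining the two branches by summation and the BN/ReLU placement are our assumptions.

```python
import torch.nn as nn

class PGCN(nn.Module):
    """Parallel GCN: 128 one-channel GCNs side by side, realised as grouped
    1xk / kx1 convolutions (groups == channels), two mirrored branches,
    then a 3x3 fusion conv standing in for concat + channel reduction."""
    def __init__(self, ch=128, k=23, dilation=1, out_ch=128):
        super().__init__()
        pad = (k // 2) * dilation  # keeps spatial size unchanged
        def conv(kh, kw, ph, pw):
            return nn.Sequential(
                nn.Conv2d(ch, ch, (kh, kw), padding=(ph, pw),
                          dilation=dilation, groups=ch, bias=False),
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        # branch A: 1xk then kx1; branch B: kx1 then 1xk
        self.branch_a = nn.Sequential(conv(1, k, 0, pad), conv(k, 1, pad, 0))
        self.branch_b = nn.Sequential(conv(k, 1, pad, 0), conv(1, k, 0, pad))
        # fuses the 128 parallel outputs and sets the output channel width
        self.fuse = nn.Sequential(
            nn.Conv2d(ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.fuse(self.branch_a(x) + self.branch_b(x))
```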
Peng et al. [37] concluded that performance increases consistently with the kernel size k, but we cannot use a larger k without risking GPU memory overflow. Since the dilated convolution proposed in [10] can effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation, we incorporate dilated convolution into PGCN. A dilated convolution with rate r introduces r − 1 zeros between consecutive filter values, effectively enlarging the kernel size of a k × k filter to k_e = k + (k − 1)(r − 1). We experiment with different dilation rates for the parallel global convolutional network. Although larger dilation rates mean larger receptive fields, the number of valid filter weights (i.e. the weights applied to the valid feature region rather than padded zeros) becomes smaller as the rate grows. We find that our model performs best with dilation rates of 1, 1, 2 and 4 for the four PGCNs from bottom to top. Apart from the dilation rates, the PGCN used at each stage of the main decoder is exactly the same, but the resolution of the feature maps processed at each stage differs, so the GFLOPs required at each stage are four times those of the next deeper stage. We find that the GFLOPs required by the fourth PGCN module in the main decoder (the one operating on the highest-resolution feature maps) are very large relative to the whole model, yet it does not improve the model much, so we replace it with a 1 × 1 convolutional layer. We call this model PGCNet-S and the previously introduced model PGCNet-L.
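As a quick check of the effective kernel sizes this formula yields for k = 23 (a small illustrative snippet):

```python
def effective_kernel(k, r):
    """Effective size of a k-tap dilated filter with rate r:
    k_e = k + (k - 1) * (r - 1)."""
    return k + (k - 1) * (r - 1)

# dilation rates 1, 1, 2, 4 for the four PGCNs (k = 23):
for r in (1, 1, 2, 4):
    print(r, effective_kernel(23, r))  # -> 23, 23, 45, 89
```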
The entire PGCNet process is summarized in Algorithm 1. During training, after obtaining the two segmentation results, we use the negative log likelihood loss to compute the main loss and the edge loss. If we denote the main loss by L_m and the edge loss by L_e, the final loss L_f is calculated as L_f = L_m + λ L_e, where λ represents the edge supervision scale. The model is trained by stochastic gradient descent with back-propagation.
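A hedged sketch of this loss computation, assuming PyTorch and an ignore index of 255 for unlabeled pixels (both assumptions; the paper states only that negative log likelihood loss is used):

```python
import torch.nn.functional as F

def total_loss(main_logits, edge_logits, target, edge_target,
               lam=1.0, ignore_index=255):
    """L_f = L_m + lambda * L_e with NLL loss; lam (the edge supervision
    scale) and ignore_index are assumed values, not from the paper."""
    l_m = F.nll_loss(F.log_softmax(main_logits, dim=1), target,
                     ignore_index=ignore_index)
    l_e = F.nll_loss(F.log_softmax(edge_logits, dim=1), edge_target,
                     ignore_index=ignore_index)
    return l_m + lam * l_e
```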

ALGORITHM 1 Parallel Global Convolutional Network
Require: The image to be segmented, Image;

EXPERIMENTAL RESULTS
All extra non-classifier convolutional layers have batch normalization [38], and ReLU [39] is applied after batch normalization. As in many works, we use the "poly" learning rate policy, where the learning rate at the current iteration equals the initial learning rate multiplied by (1 − iter/max_iter)^power with power = 0.9.
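A small sketch of the poly policy (plain Python; the optimizer update in the comment is illustrative):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' policy: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g. updating a torch.optim.SGD optimizer once per iteration (sketch):
# for g in optimizer.param_groups:
#     g['lr'] = poly_lr(0.01, cur_iter, max_iter)
```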
We set the initial learning rate to 0.01 for Cityscapes. Momentum and weight decay are set to 0.9 and 0.0001, respectively. For data augmentation, we adopt random resizing between 0.5 and 2 and random cropping, and we also use mean subtraction and horizontal flipping. We set the batch size to 16 during training. Mean IoU, the standard metric for evaluating semantic segmentation, measures the intersection-over-union (IoU) between the predicted and ground-truth pixels, averaged over all object classes. Routinely, we evaluate our model on the Cityscapes test set with mean IoU.

Cityscapes
The Cityscapes [4] segmentation dataset contains 19 foreground object categories and one background class. It contains 5000 high-quality images with fine pixel-level annotations, collected from 50 cities in different seasons, split into 2975, 500 and 1525 images for training, validation and testing. In addition, Cityscapes provides 20,000 coarsely annotated images, but we do not use the coarse data in our experiments. Figure 2 shows some segmentation results of our model on the Cityscapes test set and a comparison with DeepLabv3+.

RN
To demonstrate the importance of lateral connections between the encoder and the decoder, we first train a single-scale model with 3 × 3 convolutional layers on the skip connections as a baseline. Then we introduce the ResNeXt block into our model and set the number of groups equal to the channel width of the FPN. The RN block contains three convolutional layers; unlike ResNeXt, each convolution in the middle layer of our block can only see one channel of the previous layer. The details of the modified RN block are shown in Figure 3.

PGCN
We carry out experiments with the four models shown in Figure 4. First, we feed P2, P3, P4 and P5 into a global convolutional network (GCN), respectively. The output channel is set to 128. The results are listed in Table 2.
We incorporate dilated convolution into PGCN for larger receptive fields and experiment with different dilation rates. Although a larger dilation rate gives the convolution kernel a larger receptive field, the number of valid filter weights becomes smaller. The experimental results show that the model works best with the configuration 1, 1, 2, 4, which improves performance from 74.48 to 75.06 on the Cityscapes val set. The segmentation results with different dilation rates are listed in Table 3.

Edge
Pixels at object edges are always hard to segment, which greatly affects segmentation accuracy. Therefore, we use a separate decoder to classify such pixels and calculate an edge loss to predict the edge of every object. The edge decoder is exactly the same as the main decoder. To make labels for the edge decoder, we keep the labels of the two pixels at the boundary of each object and ignore the labels of the remaining pixels. The second column in Figure 5 shows some of the labels we made for the edge decoder, and the third column shows qualitative output (the features learned) of the edge decoder. Table 4 shows the segmentation accuracy of the model in different configurations; this change brings the segmentation accuracy to 76.27.
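A possible way to generate such edge labels (a NumPy sketch under our interpretation of "the two pixels at the boundary" as the pixel on each side of every label change; the paper does not give the exact procedure):

```python
import numpy as np

def make_edge_label(label, ignore_index=255):
    """Keep labels only in a thin band (about two pixels wide) around
    object boundaries; all other pixels are set to ignore_index."""
    diff = np.zeros(label.shape, dtype=bool)
    # vertical boundaries: compare horizontally adjacent pixels,
    # marking the pixel on each side of the change
    diff[:, 1:] |= label[:, 1:] != label[:, :-1]
    diff[:, :-1] |= label[:, 1:] != label[:, :-1]
    # horizontal boundaries: compare vertically adjacent pixels
    diff[1:, :] |= label[1:, :] != label[:-1, :]
    diff[:-1, :] |= label[1:, :] != label[:-1, :]
    edge_label = np.full_like(label, ignore_index)
    edge_label[diff] = label[diff]
    return edge_label
```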

Simplified PGCNet
The resolution of the feature maps output by the first block of ResNet is high, so feeding them into a PGCN module requires far more GFLOPs than the other blocks. Therefore, we replace this PGCN with a 1 × 1 convolutional layer. This greatly reduces the GFLOPs without hurting accuracy too much: our PGCNet-S reaches 75.88% mean IoU on the Cityscapes val set with only 58.6 GFLOPs@1Mpx. Experimental results are listed in Table 5.

CONCLUSION
In this paper, we propose a novel model termed PGCNet, aimed at real-time semantic segmentation. We use a modified ResNeXt block as the lateral connection. Considering the inefficiency of simply combining feature maps from different levels, we propose the parallel global convolutional block to extract features from each block of the backbone. The enriched maps are hierarchically combined to produce a high-resolution feature map for pixel-level semantic inference. We also propose using a separate decoder to classify pixels located on object edges and calculating a loss on them as supervision. Our experimental results show that the proposed model achieves good results on Cityscapes (75.8%) with only 58.6 GFLOPs@1Mpx.