Improving Semantic Segmentation via Decoupled Body and Edge Information

In this paper, we propose a method that combines the idea of decoupling with explicit edge information for semantic segmentation. We build a new dual-stream CNN architecture that fully considers the interaction between the body and the edge of the object, and our method significantly improves the segmentation performance of small objects and object boundaries. The dual-stream CNN architecture mainly consists of a body-stream module and an edge-stream module, which process the feature map of the segmented object into two parts with low coupling: body features and edge features. The body stream warps the image features by learning flow-field offsets, warps the body pixels toward object inner parts, completes the generation of the body features, and enhances the object's inner consistency. When generating edge features, current state-of-the-art models process information such as color, shape, and texture in a single network, which neglects the recognition of important information. Our method separates the edge-processing branch in the network, i.e., the edge stream. The edge stream processes information in parallel with the body stream and effectively eliminates the noise of useless information by introducing a non-edge suppression layer to emphasize the importance of edge information. We validate our method on the large-scale public dataset Cityscapes, where it greatly improves the segmentation of hard-to-segment objects and achieves state-of-the-art results. Notably, our method achieves 82.6% mIoU on Cityscapes using only the fine-annotated data.


Introduction
Semantic segmentation is a key technology for scene understanding and analysis in the field of computer vision [1]. It belongs to the dense prediction tasks and is widely used in medical diagnosis, image generation, autonomous driving [2], and other fields [3,4]. In recent years, a series of networks based on the FCN (Fully Convolutional Network) [5] has become an important technique for semantic segmentation, but the task still faces the following challenges. First, the limited receptive field of a segmentation network cannot fully capture the global contextual relationships between the pixels in an image, which produces ambiguities and noise in pixel classification inside the object body. This problem can be alleviated by enlarging the receptive field to capture more contextual information: for example, dynamically optimized pooling [6] uses a learned scale factor to identify the best resolution and receptive field for a particular layer, spatial pyramid pooling [7] obtains multiple receptive fields, and conditional random fields model pixel relationships [8]. Secondly, the downsampling operations in FCNs significantly reduce the resolution while acquiring strong semantic information, and boundary information is usually lost. Common approaches to this problem include high- and low-level feature fusion [9] and using boundary information to enhance semantic segmentation. The main contributions of this paper are as follows:

• A body-edge dual-stream CNN framework for semantic segmentation tasks is proposed. The framework is able to decouple the semantic feature map into two parts, the body and the edge, and explicitly treats the edge information as a separate processing branch.

• The body stream uses pixel similarity inside the segmented object. By learning flow-field offsets to warp each pixel toward object inner parts, the body-stream module produces a consistent feature representation for the body part of the segmented object, increasing the consistency of pixel features inside the body.

• A non-edge suppression layer is added to the edge stream. The body stream contains higher-level information, and the non-edge suppression layer in the edge stream takes advantage of this to suppress the interference of other low-level information, so that the edge stream processes only edge information and lets it flow in the specified direction.

Related Work
In recent years, semantic segmentation technology has developed rapidly, and the related research has entered a bottleneck period. Some works start from real-time inference or self-supervision, all aiming to further improve segmentation performance. In addition, improving the segmentation of difficult objects, such as small objects, and refining edge segmentation results are still problems worth studying. There are currently three main directions for addressing them.
Using edge information to help segmentation: Some works use the idea of multi-task learning by adding an edge-detection task to the network so that the model can use image edge information to improve the segmentation. For example, Lin et al. [16] designed RefineNet, which helps segmentation by using boundaries as intermediate representations.
Gated-SCNN [14], proposed by Takikawa et al., designs a shape stream and a regular stream, explicitly merging shape information into the feature map. The boundary maintenance network [17] maintains edge features during feature extraction and introduces boundary information to improve localization accuracy when the network predicts segmentation results.
Refinement and correction of segmentation results: Another way to improve segmentation performance is to refine or correct coarse segmentation results. The work of [18] uses conditional random fields (CRFs) to refine output boundaries and overcome the poor localization properties of deep networks. Building on this, a CRF-as-RNN method improves semantic segmentation results by capturing long-range correlations between pixels using the features of DenseCRF [19]. Considering the complexity of DenseCRF, the work of [20] applies fast domain-transform filtering to the network output while predicting edge maps from intermediate CNN layers. PointRend [21] uses a graphics rendering method to re-predict pixels with insufficient confidence in the initial segmentation result, replacing their original classifications with new ones. The CascadePSP [22] algorithm takes high-definition images and initial segmentation results as input and refines the initial results at global and local levels by cascading multiple refinement models, obtaining more accurate and detailed segmentation.
Optimizing the loss function: Some works start from the loss function. The mainstream cross-entropy loss and its derived loss functions used in training semantic segmentation networks optimize the KL divergence between the category distribution of the segmentation result and that of the real labels, so these losses focus on regions rather than edges. A series of works [23,24] proposes edge-related loss functions that make the neural network also focus on classification accuracy at category edges during optimization. Unlike the above work, our network explores the properties of the segmented object itself, exploits the relationship between the body and the edge of the segmented object, decouples the segmented object into these two parts, and designs components with specific supervision to handle each part separately.

Method
In this section, we first introduce the general architecture of our proposed semantic segmentation method, as shown in Figure 1, and then describe each part in detail. Sections 3.2-3.4 present our supervision methods.
The semantic feature map F can be decoupled into F_body and F_edge. In other words, it satisfies the addition rule F = F_body + F_edge. Our aim is to design components with specific supervision to separately generate the feature representation of each part.

Body Stream
The pixels inside the same object are similar to each other, while the pixels at the edges and outside differ more. The body-stream module is designed to exploit this property: it learns a flow field to aggregate the contextual information inside the segmented object, warps the internal pixels in a consistent direction, generates more consistent feature representations for pixels inside the object, and improves the internal consistency of the segmented object. The body stream consists of two parts: flow-field generation and feature warping.

Flow Field Generation
Given a feature map of size H × W × C, where C denotes the channel size and H × W denotes the spatial resolution, we use WideResNet from the ResNet family as the backbone. After the image goes through the ASPP module, we obtain the initial feature map F. A strided convolution is used to compress F into a lower-frequency feature map F_down, which has a lower resolution and retains the lower spatial-frequency part. It can be viewed as capturing the core of the image, focusing on the main region while ignoring high-frequency detail. To generate the flow field, the low-frequency feature map F_down is first upsampled by bilinear interpolation to the same size as the original feature map F to obtain F_up; F and F_up are then concatenated, and a convolutional layer with 3 × 3 kernels is applied to the concatenation to predict the flow field.
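This downsample-upsample-predict pipeline can be sketched in plain, framework-free Python. Average pooling stands in for the strided convolution, nearest-neighbor for bilinear upsampling, and the kernels below are placeholders for learned weights:

```python
def avg_pool2(f):
    """Stride-2 average pooling: stands in for the strided convolution
    that compresses F into the lower-frequency map F_down."""
    h, w = len(f) // 2, len(f[0]) // 2
    return [[(f[2*i][2*j] + f[2*i][2*j+1] + f[2*i+1][2*j] + f[2*i+1][2*j+1]) / 4.0
             for j in range(w)] for i in range(h)]

def upsample2(f):
    """Nearest-neighbor 2x upsampling back to the original size
    (the paper uses bilinear interpolation)."""
    return [[f[i // 2][j // 2] for j in range(2 * len(f[0]))] for i in range(2 * len(f))]

def conv3x3(f, k):
    """'Same' 3x3 convolution with zero padding."""
    h, w = len(f), len(f[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += f[ii][jj] * k[di + 1][dj + 1]
    return out

# F -> F_down -> F_up; a 3x3 conv over the concatenation [F, F_up]
# (written here as the sum of two per-channel convolutions) yields one flow channel.
F = [[float(i + j) for j in range(4)] for i in range(4)]
F_up = upsample2(avg_pool2(F))
k1 = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]    # placeholder "learned" kernels
k2 = [[0.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 0.0]]
flow_y = [[a + b for a, b in zip(ra, rb)]
          for ra, rb in zip(conv3x3(F, k1), conv3x3(F_up, k2))]
```

With these placeholder kernels the predicted channel is simply F − F_up, i.e., the high-frequency residual; a trained network would learn kernels that map this information into pixel offsets.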

Feature Warping
After we generate the flow field, as shown in Figure 2, we warp the coarse features to refined features with high resolution based on this offset field, which can transfer the semantic information from deep to shallow layers more effectively. The value of the high-resolution feature map is obtained by differentiable bilinear interpolation between neighboring pixels in the low-resolution feature map, where the neighborhood is determined based on the learned flow-field offset.
Formally,

F_body(p_l) = Σ_{p ∈ N(p_l')} ω_p F(p), with p_l' = p_l + δ_l(p_l), (1)

where F_body(p_l) is the pixel feature of the body feature map F_body at position p_l, F(p) is the corresponding pixel feature of F, ω_p is the weight of the bilinear kernel computed from the flow-field feature map δ, and N(p_l') represents the pixels involved in the computation. Each position p_l on the standard spatial grid is mapped to a new point by p_l + δ_l(p_l). Figure 2 shows an example of an image being warped.
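The bilinear warping of Equation (1) can be sketched in plain Python. This is a minimal, framework-free illustration; in a real network this would be a differentiable sampler (e.g., PyTorch's grid_sample), and the feature values and flow offsets below are made up for the example:

```python
def bilinear_sample(feat, y, x):
    """Sample feat (a list of rows) at the fractional position (y, x) by
    bilinear interpolation -- the omega_p weights over the neighborhood N."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0][x0] + (1 - wy) * wx * feat[y0][x1]
            + wy * (1 - wx) * feat[y1][x0] + wy * wx * feat[y1][x1])

def warp(feat, flow):
    """F_body[p] = bilinear sample of feat at p + flow[p] ((dy, dx) offsets)."""
    h, w = len(feat), len(feat[0])
    return [[bilinear_sample(feat, i + flow[i][j][0], j + flow[i][j][1])
             for j in range(w)] for i in range(h)]

feat = [[0.0, 1.0], [2.0, 3.0]]
zero = [[[0.0, 0.0]] * 2 for _ in range(2)]  # zero flow gives the identity warp
warped = warp(feat, zero)
```

A non-zero offset such as flow[0][0] = [0.0, 1.0] would instead pull the value of the right-hand neighbor into the top-left position, which is how the learned flow drags features toward object inner parts.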

Edge Stream
The edge stream is designed to generate edge features, as shown in Figure 3. The processing of edge information is treated as a separate branch that extracts only the corresponding edge information, so that the edge stream, in cooperation with the body stream, can successfully decouple the edge features from the body features and direct the flow of edge information to the specified location. The edge features learned by the edge stream are supervised by the edge mask to learn edge prediction, so the edge information is fully used.

Procedure of Edge Processing
Five feature maps are extracted from the input image using WideResNet. We denote these five feature maps as R1, R2, R3, R4, and R5. The residual blocks are denoted as res. Our network also adds image gradients before the fusion layer, and we use a Canny edge detector to retrieve these gradients. The R1, R2, R3, R4, and R5 feature maps and the gradient information of the image are the inputs of the edge stream.
Use a 1 × 1 convolution to adjust the R3, R4, and R5 feature maps to channel number 1, and use bilinear interpolation to keep each feature map at the original image size. R1 is bilinearly interpolated to the original image size, and after a 1 × 1 convolution and the res1 layer, a feature map with channel C is produced; this feature map is then recovered to the original image size to obtain R1. R1 is used as the input of the non-edge suppression gate together with R3 [N, 1, H, W]. The inputs R1 and R3 [N, 1, H, W] are processed in the non-edge suppression layer, as shown in Figure 4.

Next, the weight ∂t is multiplied with the input points of the edge stream, R3 is added, and a feature map of shape [N, 1, H, W] is produced after a convolution. This output, after passing through a non-edge suppression layer and then res2, enters the non-edge suppression layer with R3, R4, and R5, respectively, yielding the final feature map R_final. R_final is reduced to one channel and transformed into a weight-fraction representation, producing an edge map R_edge of shape [N, 1, H, W]. Boundary loss supervision is applied to R_edge to prevent boundary prediction errors. R_edge and the gradient map are concatenated and fused along the first dimension to generate new weights.

The edge stream thus processes the edge information using residual blocks, the non-edge suppression layer, and supervision. Later, a fusion module combines information from both streams in a multi-scale manner using an Atrous Spatial Pyramid Pooling (ASPP) module, and the regularization loss ensures the generation of high-quality boundaries on the segmentation masks.

The non-edge suppression layer acts as a control switch. As features from the body stream come in, the non-edge suppression layer generates a weight that filters out shallow irrelevant boundary information, focusing on the edge information and enhancing the ability to identify it. The edges are supervised by cross-entropy loss to suppress the scores of other information during back-propagation, weakening the non-edge information layer by layer. In summary, the edge stream focuses only on edge information, which allows it to better extract edge features and retain much effective boundary information, improving segmentation performance.
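The gating idea behind the non-edge suppression layer can be sketched as follows. This is a scalar toy version under the assumption that the gate weight is a sigmoid of the combined body and edge responses; the 1 × 1 convolutions are collapsed into a single scalar weight w for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def non_edge_suppression(edge_feat, body_feat, w=1.0, b=0.0):
    """Pixel-wise gate: alpha = sigmoid(w * (edge + body) + b);
    out = edge * alpha + body (residual add). Low-scoring (non-edge)
    responses are driven toward zero while edge responses pass through."""
    out = []
    for erow, brow in zip(edge_feat, body_feat):
        out.append([e * sigmoid(w * (e + bdy) + b) + bdy
                    for e, bdy in zip(erow, brow)])
    return out

edge = [[4.0, -4.0], [0.0, 2.0]]  # large positive value = confident edge pixel
body = [[0.0, 0.0], [0.0, 0.0]]
gated = non_edge_suppression(edge, body)
```

Running this, the confident edge response (4.0) passes nearly unchanged while the negative non-edge response (-4.0) is squashed close to zero, which is the "control switch" behavior described above.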

Fusion Module
We use the body features and edge features, fused by an ASPP-based fusion module, as reconstructed features to guide semantic segmentation and produce the fine semantic segmentation output, as shown in Figure 5.
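The atrous (dilated) convolution at the heart of ASPP can be sketched as below. For brevity the parallel branches are summed rather than concatenated, and the single shared kernel is a placeholder for learned per-branch weights, so this is an illustration of the multi-rate idea rather than the exact module:

```python
def dilated_conv3x3(f, k, rate):
    """3x3 atrous convolution with dilation 'rate' and zero padding."""
    h, w = len(f), len(f[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di * rate, j + dj * rate
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += f[ii][jj] * k[di + 1][dj + 1]
    return out

def aspp(f, k, rates=(1, 6, 12)):
    """ASPP sketch: parallel atrous convolutions at several rates, summed.
    (A real ASPP concatenates the branches and adds image-level pooling.)"""
    h, w = len(f), len(f[0])
    out = [[0.0] * w for _ in range(h)]
    for r in rates:
        branch = dilated_conv3x3(f, k, r)
        out = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, branch)]
    return out
```

The increasing dilation rates give the fusion module receptive fields of several sizes over the same feature map, which is what lets it combine the body and edge information in a multi-scale manner.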

Multitask Learning
We jointly supervise three parts, F_body, F_edge, and F_final. The total loss L of the network is calculated as

L = λ1 L_body + λ2 L_edge + λ3 L_final, (2)

where Ŝ represents the GT (ground truth) semantic labels, and Ê is the GT edge label generated from Ŝ. M_body and S_final represent the semantic segmentation prediction maps of F_body and F_final, respectively, and λ1, λ2, and λ3 are the three hyperparameters of the network that control the weights of the three loss terms. L_final and L_edge are common cross-entropy losses. For the L_body loss supervision, since our aim is to optimize M_body, we use the training method of boundary-relaxation loss [25], which allows the model to predict multiple categories for boundary pixels, downplaying their specific classification and supervising the edges of segmented objects in a relaxed state.
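A one-pixel numeric sketch of this weighted loss combination follows; the λ values and class probabilities are illustrative placeholders, not the paper's tuned settings:

```python
import math

def cross_entropy(pred_probs, target_idx):
    """Cross-entropy for a single pixel: -log p(target class)."""
    return -math.log(pred_probs[target_idx])

def total_loss(l_body, l_edge, l_final, lam1=1.0, lam2=1.0, lam3=1.0):
    """Weighted sum of the three supervised losses:
    L = lam1 * L_body + lam2 * L_edge + lam3 * L_final."""
    return lam1 * l_body + lam2 * l_edge + lam3 * l_final

# one-pixel example: class probabilities over 3 classes, ground-truth class 0
l_body = cross_entropy([0.6, 0.3, 0.1], 0)
l_edge = cross_entropy([0.9, 0.05, 0.05], 0)
l_final = cross_entropy([0.7, 0.2, 0.1], 0)
loss = total_loss(l_body, l_edge, l_final)
```

Raising one λ simply shifts the gradient budget toward the corresponding stream, which is how the three supervisions are balanced during training.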
The final semantic segmentation image output by the network can be used to compute boundary information, and to make this boundary information more accurate, the final semantic segmentation image is regularized. Here we introduce a dual-task regularizer [14]:

ζ = (1/√2) ‖∇(G ∗ argmax_k p(y_k | r, s))‖, (3)

where G is a Gaussian filter. The edge information ζ of the output semantic segmentation image is calculated by Equation (3). For the real semantic labels, a boundary map ζ̂ is calculated with the same equation. To keep consistency between them, Equation (4) uses the absolute value of the difference between the two computed results as the loss, making the boundary maps output by the model similar to the real labels:

L_reg⇐ = λ4 Σ_{p+} |ζ(p+) − ζ̂(p+)|. (4)

The boundary prediction of the edge-stream output should match the boundary of the final semantic segmentation result, so a regularizer coupling the two is added as follows:

L_reg⇒ = λ5 Σ_{k,p} 1_{s_p} [ŷ_p^k log p(y_p^k | r, s)], (5)

where λ4 and λ5 are the two hyperparameters controlling the weights of the regularization terms, p+ runs over the boundary pixels, 1_s = {1 : s > thrs} is an indicator function, and thrs is the confidence threshold, which we set to 1 in our experiments and which is responsible for filtering out too-fine gradients. The total dual-task regularization loss can be written as the sum of the two.
L_reg = L_reg⇐ + L_reg⇒ (6)

In training, in order to back-propagate through Equation (6), we need to calculate the gradient of Equation (3). Let g = ‖∇(G ∗ f)‖ with f = argmax_k p(y_k | r, s); the partial derivatives with respect to a given parameter η can then be calculated by the chain rule through the filtering and argmax operations.
Since argmax is not a differentiable function, we use the Gumbel Softmax [26] with temperature τ = 1. In the backward pass, we approximate the argmax operator with a temperature-controlled softmax:

f ≈ softmax((log p(y | r, s) + g)/τ),

where g_j ∼ Gumbel(0, I) and τ is a hyperparameter. The operator ∇G∗ can be implemented by Sobel kernel filtering.
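A self-contained sketch of the Gumbel-Softmax relaxation, sampling the Gumbel noise with the inverse-CDF trick g = −log(−log U); the class probabilities here are illustrative:

```python
import math
import random

def gumbel_softmax(log_probs, tau=1.0, rng=None):
    """Relaxed argmax: softmax((log p + g) / tau) with g_j ~ Gumbel(0, 1).
    Small tau pushes the output toward a one-hot argmax vector."""
    rng = rng or random.Random(0)
    g = [-math.log(-math.log(rng.random() + 1e-12)) for _ in log_probs]
    z = [(lp + gi) / tau for lp, gi in zip(log_probs, g)]
    m = max(z)                      # stabilize the exponentials
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

probs = [0.7, 0.2, 0.1]
soft = gumbel_softmax([math.log(p) for p in probs], tau=1.0)   # smooth surrogate
hard = gumbel_softmax([math.log(p) for p in probs], tau=0.01)  # nearly one-hot
```

The τ = 1 output keeps gradients flowing through all classes, while shrinking τ recovers an (almost) discrete argmax, which is exactly the trade-off used in the backward pass above.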

Cityscapes
We conduct experiments on the Cityscapes dataset, which focuses on the understanding of urban street scenes and includes semantic-level annotations for the semantic segmentation task. The images have a resolution of 1024 × 2048 and cover 30 target categories such as roads, people, cars, buildings, and sky, drawn from video recorded in 50 cities across different seasons and weather conditions. The dataset contains 5000 finely annotated images, divided into three parts: 2975 images for training the model, 500 images for validating the network, and the remaining 1525 images for testing. In our experiments, these 30 categories are generalized into 19 categories, such as cars, people, and bicycles.

Experiment Settings
We use the PyTorch [27] framework for our experiments, with DeepLabV3+ [28] as our baseline. All networks were trained under the same settings: SGD was chosen as the optimizer, with a momentum of 0.9 and an initial learning rate of 0.001. During training, the 'poly' learning rate policy decays the initial learning rate by multiplying it by (1 − iter/total_iter)^0.9. Data augmentation included random horizontal flipping, random scaling in the range [0.75, 2.0], and random cropping of size 720; training ran for 175 epochs.
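The 'poly' schedule can be written in a few lines; the total iteration count used below is an illustrative placeholder, not the paper's setting:

```python
def poly_lr(base_lr, cur_iter, total_iter, power=0.9):
    """The 'poly' policy: lr = base_lr * (1 - cur_iter / total_iter) ** power."""
    return base_lr * (1.0 - cur_iter / total_iter) ** power

# with the paper's initial learning rate of 0.001
schedule = [poly_lr(0.001, it, 10000) for it in (0, 2500, 5000, 7500)]
```

The schedule starts at the base rate, decays smoothly, and reaches exactly zero at the final iteration, which is why it is a common default for segmentation training.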

Evaluation Metrics
We use three kinds of semantic segmentation metrics to evaluate our model. (1) IoU and mIoU are standard evaluation metrics for semantic segmentation, which effectively assess the quality of the predicted regions and reflect the accuracy of the segmentation. (2) Since our approach aims at better edge segmentation through a finely designed edge-processing module, we introduce a boundary-accuracy evaluation metric [29]: a relaxed F-score is computed within a given distance along the boundary of the predicted label to measure the quality of the predicted segmentation boundaries, and we compare it against different boundary-processing algorithms. We set the thresholds to 0.00088, 0.001875, 0.00375, and 0.005, corresponding to 3, 5, 9, and 12 pixels, respectively. (3) When the segmented object is far from the camera, segmentation difficulty increases, and we want our model to maintain high accuracy nonetheless. To evaluate performance on objects at different distances from the camera, we use a distance-based evaluation method [22] in which crops of different sizes around an approximate vanishing point serve as a proxy for distance. Given a predefined crop factor c (in pixels), we crop c from the top and bottom and c × 2 from the left and right. The center of the final crop is the default approximate vanishing point, and a smaller centered crop gives greater weight to smaller objects that are farther from the camera.
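The cropping rule can be sketched as follows, with a nested list standing in for an image and c the crop factor:

```python
def distance_crop(img, c):
    """Crop c pixels from the top and bottom and 2*c from the left and
    right, keeping the image's center (the approximate vanishing point)."""
    h, w = len(img), len(img[0])
    return [row[2 * c: w - 2 * c] for row in img[c: h - c]]

img = [[10 * r + col for col in range(8)] for r in range(6)]  # toy 6 x 8 "image"
crop = distance_crop(img, 1)  # 4 rows x 4 columns remain
```

Cropping 2c horizontally but only c vertically roughly preserves the 2:1 aspect ratio of Cityscapes frames, so successive crops zoom toward the vanishing point without distorting the evaluation region.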

Quantitative Evaluation
Our experiments are conducted on the Cityscapes validation set, using the complete uncropped images. Our architecture is compared experimentally with current networks with strong semantic segmentation performance. On mIoU, we achieve a 1.1% improvement, as reflected in Table 1, a clear optimization of this metric. In terms of per-class IoU, our architecture performs well on hard-to-segment objects such as smaller traffic signs, thinner utility poles, and riders. As can be seen in Figure 6, our model performs very well across different thresholds. At the strictest threshold, th = 3 px, our model is still 1.6% higher than the state-of-the-art method, showing that our model is very effective at improving the accuracy of edge segmentation.
As can be seen from Figure 7, in the distance-based evaluation, the performance of our model surpasses the other advanced algorithms as the cropping increases. With no cropping, the performance difference between GSCNN and our algorithm is 1.7%; at maximum cropping, the difference grows to 4.7%. This confirms that segmentation difficulty increases when the segmented object is far away and shows that our model maintains good segmentation performance for objects at greater distances.

Ablation
We evaluate the effectiveness of each component of our method through ablation experiments. Since we are building a completely new architecture, we build the complete model step by step on a benchmark network to observe and analyze the impact of each block and each loss supervision on performance. As can be seen from Table 2, when our model uses flow-field warping to extract body features, the consistency of the body features increases, but the edge information is not yet fully utilized, so segmentation performance improves only modestly. The segmentation result is consistent with the interior of the real labeled object, which improves the spatial continuity between pixels, reduces the "fragmentation" phenomenon inside the object, and improves the semantic segmentation effect. With the introduction of the edge-processing module, we obtain edge features with low coupling to the body features and achieve full usage of edge information; experiments on edge performance show that our boundary F-score improves by 5.1% under the most demanding condition. This suggests that high-performance semantic segmentation requires explicit modeling of the relationship between body and edge, and that multi-task learning of joint edge detection and semantic segmentation is a desirable way to improve segmentation. To further optimize our model, we fuse gradient information with the edge features obtained from the edge stream, improving mIoU by 0.3% and boundary accuracy by 0.5% at th = 3 px. Designing a multi-task jointly supervised method that introduces the dual regularization loss, with relaxed training of body features and final features, improves mIoU by another 0.5% and the boundary F-score by another 1.2% under the strictest threshold.

Qualitative Experiments
Figure 8 shows the visual semantic segmentation results of the DeepLabv3+ network and our method (objects with significant improvement are circled for comparison), which validates the clear improvement of our method on the Cityscapes dataset. As can be seen in Figure 8, DeepLabv3+ achieves good results for easily distinguishable objects, but our segmentation accuracy is higher for hard-to-segment objects, where DeepLabv3+ tends to segment incorrectly: for example, it groups trucks and cars together, groups buildings and fences together, and, when crowds are occluded by vegetation, classifies crowds and trees into one category. For smaller and thinner objects, such as traffic signs or the outlines of crowds, DeepLabv3+ easily misses the segmentation, while our model recognizes and handles the edges in more detail. This shows that our model achieves a significant performance gain and has a clear advantage in segmenting difficult objects. As shown in Figure 9, we demonstrate our predicted high-quality boundaries on the Cityscapes val set, further confirming the strong segmentation performance of the model; the boundaries are generated by extracting the edges from the predicted segmentation masks.
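The boundary extraction and the boundary F-score used in these evaluations can be sketched as follows. This is a simplified NumPy version that assumes 4-connected boundaries and a Manhattan-distance matching tolerance; the exact matching protocol used in the paper's evaluation may differ.

```python
import numpy as np

def seg_boundary(label_map):
    """Mark pixels whose label differs from a 4-connected neighbour."""
    b = np.zeros(label_map.shape, dtype=bool)
    dy = label_map[:-1, :] != label_map[1:, :]   # vertical transitions
    dx = label_map[:, :-1] != label_map[:, 1:]   # horizontal transitions
    b[:-1, :] |= dy; b[1:, :] |= dy
    b[:, :-1] |= dx; b[:, 1:] |= dx
    return b

def dilate(b, radius):
    """Dilate a boolean map by `radius` steps of 4-neighbour growth."""
    out = b.copy()
    for _ in range(radius):
        d = out.copy()
        d[1:, :] |= out[:-1, :]; d[:-1, :] |= out[1:, :]
        d[:, 1:] |= out[:, :-1]; d[:, :-1] |= out[:, 1:]
        out = d
    return out

def boundary_f_score(pred, gt, th=3):
    """F-score of predicted vs. ground-truth boundaries within `th` pixels."""
    pb, gb = seg_boundary(pred), seg_boundary(gt)
    if pb.sum() == 0 or gb.sum() == 0:
        return 0.0
    precision = (pb & dilate(gb, th)).sum() / pb.sum()
    recall = (gb & dilate(pb, th)).sum() / gb.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Tightening `th` (down to 1 px, the strictest condition in Table 2's boundary evaluation) makes the score increasingly sensitive to exactly the thin-structure errors discussed above.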

Conclusions
In this paper, we propose a multi-task learning architecture for semantic segmentation and boundary detection. Semantic features are decoupled into body features and edge features by the designed body-stream and edge-stream modules, and the two are jointly optimized within a carefully designed architecture to guide semantic segmentation. The body stream improves consistency within segmented objects by learning flow-field offsets that warp pixels toward object inner parts. The edge stream is an independent edge-processing module that introduces a non-edge suppression layer to make better use of edge information and thus refine boundary segmentation. The final features are obtained by fusing the body features and edge features using the ASPP module. During training, we use relaxation training and a dual-task regularizer to produce better segmentation predictions, and the fused features are supervised with a standard cross-entropy loss as the feature-reconstruction objective. Our experimental results show that the method generalizes well and can significantly improve the segmentation of thinner and smaller objects. On the challenging Cityscapes dataset, our method achieves state-of-the-art results with significant improvements over strong baselines.
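As a rough illustration of how this kind of multi-task supervision can be combined, the sketch below sums a standard cross-entropy on the fused prediction, an auxiliary cross-entropy on the body branch, and a binary cross-entropy on the edge branch. The loss weights and the absence of the relaxed-label and regularizer terms are hypothetical simplifications, not the paper's exact formulation.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean per-pixel CE. logits: (C, H, W); labels: (H, W) int class ids."""
    z = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    H, W = labels.shape
    return -log_probs[labels, np.arange(H)[:, None], np.arange(W)].mean()

def joint_loss(final_logits, body_logits, edge_prob, labels, edge_gt,
               w_body=0.4, w_edge=1.0):
    """Hypothetical weighted sum of the three supervision signals."""
    l_final = cross_entropy(final_logits, labels)     # fused prediction
    l_body = cross_entropy(body_logits, labels)       # auxiliary body branch
    # Binary cross-entropy on the edge branch's per-pixel edge probability.
    eps = 1e-7
    p = np.clip(edge_prob, eps, 1 - eps)
    l_edge = -(edge_gt * np.log(p) + (1 - edge_gt) * np.log(1 - p)).mean()
    return l_final + w_body * l_body + w_edge * l_edge
```

In practice the body-branch term would use the relaxed training described above, and frameworks such as PyTorch provide these losses built in; the point here is only the additive multi-task structure.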