Edge Prior Multilayer Segmentation Network Based on Bayesian Framework


1. Introduction 1.1. Background. Semantic segmentation of images is a very important task in computer vision, which aims to classify each pixel in the image and can be applied to autopilot, 3D reconstruction, and other fields. Synthetic Aperture Radar (SAR) image semantic segmentation is widely utilized in military and civilian information applications because SAR can obtain images at any time of the day and night independently of the weather conditions. Traditional segmentation methods can be mainly divided into three steps: segmentation of super-pixel blocks, feature extraction, and classifier selection. Methods like Meanshift [1] and Watershed [2] are typical for extracting super-pixel blocks. For classification, Support Vector Machine [3], naive Bayes [4], and Maximum Likelihood method [5] are commonly used algorithms. Afterwards, Markov Random Field (MRF) [6] and Conditional Random Field (CRF) are also introduced to consider the information of surrounding pixels.
However, owing to the special imaging mechanism of SAR imagery, which involves polarization characteristic expression [7] and multiplicative non-Gaussian noise, simply applying traditional strategies designed for optical images to SAR imagery is no longer suitable. Many methods for SAR image segmentation have been proposed. Liu et al. [8] proposed a SAR image segmentation method with a reaction diffusion level set evolution equation in an active contour model. Zhang et al. [9] developed a semisupervised SAR image segmentation method based on a hierarchical CRF model. Liu et al. [10] proposed a new SAR image segmentation approach via a hierarchical visual semantic and adaptive neighborhood multinomial latent model.
Meanwhile, the color information of polarized SAR imagery has also begun to attract researchers' attention. Through methods like Pauli decomposition [11] based on the scattering matrix, the pseudo color and texture features in SAR imagery can be extracted, so that many segmentation methods designed mainly for optical images can be utilized. Chen et al. [12] proposed a multifeature segmentation for high-resolution polarimetric SAR images based on the fractal net evolution approach. Lang et al. [13] developed the generalized mean shift algorithm for polarimetric SAR image segmentation. In recent years, since the convolutional neural network (CNN) has achieved remarkable performance in image classification, various networks suitable for semantic segmentation, such as Fully Convolutional Networks (FCN) [14], DeepLab [15] (Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs), and CRF as Recurrent Neural Networks (CRFasRNN) [16], were proposed based on VGG-Net. A CNN automatically learns multiple-level features due to its multilayer structure; it can judge well what kinds of objects are contained in an image, but accurate segmentation is difficult because of the loss of object edge details. FCN modifies the CNN to obtain the classification result of each pixel and thereby implement semantic segmentation. In general, CRF serves as a postprocessing method of FCN [17,18] to improve the segmentation results, and DeepLab is a typical case. DeepLab performs semantic segmentation with atrous convolution, deep convolutional nets, and fully connected CRFs. The atrous convolution enlarges the receptive field without further downsampling, and the fully connected CRF overcomes the negative impact on localization accuracy caused by the invariance of the network. For better edge correction, Chen et al. [19] replaced the fully connected CRF with Domain Transform (DT) [20] to improve the efficiency.
Though DT is traditionally used for common image processing, it is still effective at filtering rough segmentation results, so that object edges are recovered precisely.
With complex boundaries in SAR imagery, segmentation near object edges usually has a high error rate. Considering the limitations of traditional classifiers applied to image classification, FCN-8, the most accurate of the FCN networks, is chosen to improve the accuracy of the initial classification. FCN is responsible for convolutional feature extraction and works directly on the RGB vector of each single pixel. However, the broad receptive field of the traditional FCN convolution kernel and the reduced dimension in the pooling layers blur the output of a single pixel, limiting the segmentation accuracy. Therefore, DT and the fully connected CRF [21] are introduced, with the influence of surrounding pixels taken into consideration. Traditional edge feature extraction methods include Canny [22], LSD [23], and the ratio of exponentially weighted averages (ROEWA) edge extraction [24]. In recent years, many new edge detection methods have been proposed. Zhang et al. [25] proposed a new edge detector using structured random forests as the classifier, which can make full use of RGB-D image information from Kinect. Tabb and Ahuja [26] described an algorithm for image segmentation at multiple scales based on edge and region detection. Mandal et al. [27] presented a novel scheme for segmenting the image boundary in optoacoustic images. Marmanis et al. [28] proposed a boundary-aware semantic segmentation algorithm that explicitly represents and extracts the boundaries between regions of different semantic classes. The class boundaries improve the deep convolutional neural network for semantic segmentation.
DT can be seen as a recurrent neural network (RNN) [19]. The internal DT combining edge information and score map is applied for internal edge correction. The modified DT is utilized for external edge correction by combining the edge distance map and rough segmentation result. Meanwhile, holistically nested edge detection (HED) network [29] is employed to provide edge prior information to improve segmentation accuracy.
The contributions of the proposed approach are as follows:
(i) On the basis of the Bayesian framework, this study presents a parallel network architecture, which is composed of two parallel networks, called the likelihood network (FCN8) and the edge prior network (HED), respectively, followed by a constraint network (the directed DT).
(ii) To achieve accurate edge detection, the edge map obtained from HED is utilized, serving as edge prior information to sharpen blurred edge-pixel categories.
(iii) Considering that the edge distribution in traditional DT cannot be completely trusted and some edges may be lost, Directed Domain Transform (DDT) is proposed for image classification with complex edges. In this method, an edge distance map is defined to limit the diffusion direction and avoid the spreading of erroneous labels. Besides, the gradient descent method is adopted for parameter determination, and difference vectors of color and class probability between adjacent pixels are considered to deal with misclassification.
The remainder of this paper is organized as follows: Section 2 shows the whole framework of our proposed method, and Section 3 introduces the algorithms involved in each step, including HED, the fully connected CRF, internal fusion, and modified DT. In Section 4, the proposed method is tested, and its performance is compared with other semantic segmentation methods. Finally, this paper is concluded in Section 5.

Framework
The framework of our proposed method is shown in Figure 1; the upper part of the structure is the edge prior information, including edge map obtained by HED and the corresponding edge distance map. In the lower part of the structure, internal DT filters the rough FCN8 segmentation result by the fused edge prior information; the obtained FCN8+DT model is input into the fully connected CRF. Finally, the Directed DT combines the coarse segmentation result, the edge map, and the edge distance map to obtain the fine segmentation result.
The specific process of our proposed method mainly consists of two phases, training and testing. These two phases are described in the following.

Methodology
3.1. HED. HED is an end-to-end edge detection algorithm based on the fully convolutional neural network and deeply supervised nets, as described in [29,30]. By relying on the weight updates of the convolution kernels, HED automatically learns rich hierarchical features and determines whether a feature is important for abstracting object edges. The network architecture of HED is shown in Figure 2. Five side-output layers are inserted after the convolutional stages, and deep supervision is imposed at each side-output layer, which guides the results toward edge detection. Finally, the output of HED is obtained through a weighted fusion layer from the multiscale outputs. As can be seen from Figure 2, based on VGG-Net, HED incorporates two improvements. First, HED connects a side-output layer to the last convolutional layer in each stage, namely, conv1_2, conv2_2, conv3_3, conv4_3, and conv5_3, to obtain multiple prediction results and then fuses all the edge maps. The size of the receptive field of each of these convolutional layers is the same as that of the corresponding side-output layer, since the side-output layer is implemented as a convolution layer with a kernel size of 1 and one output. Second, HED cuts the fifth pooling layer and all the fully connected layers of VGG-Net to obtain meaningful side-output results and reduce the cost. The main goal of HED is to complete whole-image training and prediction and achieve nested multilevel feature learning.
HED is a network which can learn features to produce edge maps approaching the ground truth. In the training process, all of the network parameters are denoted as $W$. Suppose the network has $M$ side-output layers, and a classifier is connected with each side-output layer. The weights of the corresponding classifiers are $w = (w^{(1)}, \dots, w^{(M)})$. Then, the image-level loss function $l_{side}$ is defined by this formulation:

Journal of Sensors
For normal images, the number of edge points is much smaller than that of nonedge points; only 10% of the ground truth consists of edge pixels. Therefore, HED takes a simple strategy to resolve the imbalance between edge and nonedge pixels. Let $I$ represent the original image and $G$ the corresponding ground truth; the edge and nonedge pixels are denoted as $|G^-|$ and $|G^+|$, respectively. The class-balanced cross-entropy loss function is defined as follows:
where $\beta$ represents a class-balancing weight, and the edge probability at pixel $j$ is calculated by the sigmoid function $\sigma(\cdot)$ on the activation value at pixel $j$.
A "weight-fusion" layer is introduced to connect all prediction results of the side-output layers; the fusion weight is obtained by the training process and denoted as $h = (h_1, \dots, h_M)$. Then, the loss function of the fusion layer is as follows:
The fused prediction is $\hat{G}_{fuse} = \sigma\left(\sum_{m=1}^{M} h_m \hat{A}^{(m)}_{side}\right)$, where $\hat{A}^{(m)}_{side} = \{a^{(m)}_j, j = 1, \dots, |G|\}$ are the activations of side-output layer $m$. Then, the total loss function is as follows:
The greyscale edge map $\hat{G}_{HED}$ is obtained through HED, and it can be utilized as edge prior information.
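The displayed loss formulas are not reproduced in this section, so the following is only a minimal NumPy sketch of a class-balanced cross-entropy and of the weighted fusion, written under the assumption that they follow the usual HED formulation in [29]: the edge term is weighted by the fraction of nonedge pixels and the nonedge term by the fraction of edge pixels. The function names and the `eps` guard are illustrative and not taken from the paper.

```python
import numpy as np

def sigmoid(a):
    """Logistic function applied elementwise to the activations."""
    return 1.0 / (1.0 + np.exp(-a))

def class_balanced_bce(activations, gt_edges, eps=1e-12):
    """Class-balanced cross-entropy for one side output (assumed HED form).

    activations: raw activation values a_j, one per pixel
    gt_edges:    binary ground truth, 1 = edge pixel, 0 = nonedge pixel
    """
    gt = np.asarray(gt_edges).astype(bool)
    p = sigmoid(np.asarray(activations, dtype=float))
    beta = 1.0 - gt.mean()  # fraction of nonedge pixels, weights the rare edge term
    loss_edge = -beta * np.log(p[gt] + eps).sum()
    loss_nonedge = -(1.0 - beta) * np.log(1.0 - p[~gt] + eps).sum()
    return loss_edge + loss_nonedge

def fuse(side_activations, h):
    """Weighted fusion: sigmoid(sum_m h_m * A_side^(m))."""
    stacked = np.stack(side_activations)          # shape (M, H, W)
    return sigmoid(np.tensordot(np.asarray(h, dtype=float), stacked, axes=1))
```

With only one edge pixel among four, the single edge term receives weight 0.75 and each of the three nonedge terms receives weight 0.25, so both classes contribute equally to the loss.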

Fully Connected CRFs.
The probabilistic graphical model has been proven to effectively improve image classification accuracy. CRF performs well in image classification and segmentation; it can refine fuzzy and weak pixel-level category predictions into sharp edge distributions and fine segmentation results. Therefore, CRF can be used to correct the classification errors caused by the fuzzy output of FCN.
Suppose $X_i$ represents the category label of point $i$, with $L$ categories, as $X_i \in \mathcal{L} = \{l_1, l_2, \dots, l_L\}$. The variable $X$ is a random vector composed of $X_1, X_2, \dots, X_N$, where $N$ represents the number of pixels of the image. The original image is denoted as $I$; $Z(I)$ represents a normalization function. CRF can be used to describe the above relationship based on the Gibbs distribution:
In the fully connected CRF model, the energy distribution of $x$ is expressed by the following formula:
where $\varphi(x_i)$ is a unary energy term representing the energy of labeling pixel $i$ as $X_i$, and $\varphi_p(x_i, x_j)$ is the energy of simultaneously labeling pixels $i$ and $j$ as $X_i$ and $X_j$. The value of the unary energy term is derived from FCN. FCN does not consider the smoothness and continuity of the point class assignment.
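The two displayed formulas are missing from this passage. In the standard fully connected CRF formulation they are commonly written as below; this is a reconstruction from the symbols defined above, offered as an assumption rather than as the exact equations of the paper.

```latex
P(X = x \mid I) = \frac{1}{Z(I)} \exp\bigl(-E(x \mid I)\bigr),
\qquad
E(x) = \sum_{i} \varphi(x_i) + \sum_{i < j} \varphi_p(x_i, x_j)
```

The first sum runs over all $N$ pixels and the second over all pixel pairs, which is what makes the model fully connected.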
The pairwise potential provides smoothing rules based on image data and encourages the same category to be assigned to points with similar attributes. The pairwise potential is modeled with a Gaussian kernel as follows: In this formulation, if
In the internal edge net, the feature layers obtained by FCN8 are brought to the same spatial scale before being stacked. A convolutional layer with a 1 × 1 convolution kernel and one output channel is introduced to predict edge strength. A ReLU layer is added to limit the output to a range of 0 to 1. Since the ground truth does not contain reference edges, the implicitly generated edge map might be inaccurate. However, HED predicts object edges by training on various contour types, instead of calculating gradients for edge discrimination. Thus, there is a direct connection between multilevel features. In the proposed approach, the edge map from HED is fused with the internal one, as expressed in Equation (8).
where $g_i$ is the internal edge map and $g_i^{HED}$ represents the one from HED; $g_i'$ denotes the fused edge map used as the new input of DT. $\alpha$ is the weight, with an initial value of 0.5. Generally, $\alpha$ is calculated using the gradient descent algorithm. The contribution of $\partial L/\partial g_i$ and $\partial L/\partial \alpha$ in backpropagation is as follows:
where $L$ is the loss function; $(\partial L/\partial \alpha)_n$ and $\partial L/\partial y_i$ represent the current gradient and updated gradient, respectively. With $g_i^{HED}$ fixed, $\alpha$ is updated through Equation (10). In practical application, a layer of convolutional neural network is defined to train the parameters as shown in Equation (11). The network consists of a 1 × 1 convolution kernel with a single channel, an eltwise layer for the add operation, and a ReLU layer. $\alpha$ ranges from zero to infinity and is initialized to 1.
3.3.2. Standard DT. In the Domain Transform [19] (DT) network, the raw signal $x$ and the positive "domain transform density" $d$ are the inputs and the filtered signal $y$ is the output. For 1-D signals, the output is calculated as follows:
The weight $\omega_i$ is related to the density and defined as follows:
where $g_i$ is the edge map obtained by exploiting features from the intermediate network layers. $\sigma_s$ and $\sigma_r$ are the standard deviations of the convolution kernel over the spatial domain and the edge map, respectively. The details of $\sigma_s$, $\sigma_r$, and $\sigma_k$ can be found in DeepLab DT [19]. Filtering from one side by Equation (12) is asymmetric. Therefore, DT filters from two sides for 1-D signals, left-to-right and right-to-left. For 2-D signals, DT employs 1-D filtering in each dimension; that is, a horizontal pass (left-to-right and right-to-left) is performed along each row, followed by a vertical pass (top-to-bottom and bottom-to-top) along each column.
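The filtering passes described above can be sketched in NumPy as follows. Because Equations (12) and (13) are not reproduced here, the sketch assumes the recursive form commonly given for the DT in [19]: $y_i = (1 - \omega_i) x_i + \omega_i y_{i-1}$ with $\omega_i = \exp(-\sqrt{2}\, d_i / \sigma_s)$ and $d_i = 1 + g_i \sigma_s / \sigma_r$. The function names, the single filtering iteration, and the reuse of $\omega_{i+1}$ in the reverse pass are choices of this sketch.

```python
import numpy as np

def dt_weights(edge, sigma_s, sigma_r):
    """Recursive-filter weights from an edge map (assumed DT form)."""
    density = 1.0 + np.asarray(edge, dtype=float) * sigma_s / sigma_r
    return np.exp(-np.sqrt(2.0) * density / sigma_s)

def dt_filter_1d(x, w):
    """Two-sided 1-D filtering: left-to-right, then right-to-left."""
    y = np.asarray(x, dtype=float).copy()
    for i in range(1, len(y)):                 # left-to-right pass
        y[i] = (1.0 - w[i]) * y[i] + w[i] * y[i - 1]
    for i in range(len(y) - 2, -1, -1):        # right-to-left pass
        y[i] = (1.0 - w[i + 1]) * y[i] + w[i + 1] * y[i + 1]
    return y

def dt_filter_2d(x, edge, sigma_s, sigma_r):
    """Horizontal pass along each row, then vertical pass along each column."""
    w = dt_weights(edge, sigma_s, sigma_r)
    rows = np.array([dt_filter_1d(r, wr) for r, wr in zip(np.asarray(x), w)])
    cols = np.array([dt_filter_1d(c, wc) for c, wc in zip(rows.T, w.T)])
    return cols.T
```

A strong edge value drives the weight toward zero, so the recursion stops mixing values across that pixel, whereas in flat regions the weight stays close to one and neighboring scores are averaged.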
In practice, the edge map is generated implicitly by the DT network, which does not work as well as an explicit edge network for complex boundaries. Besides, some reverse diffusion occurs in four-direction filtering, spreading erroneous categories. To address these shortcomings of the DT network, the proposed method makes the following improvements:
(1) Introduce the HED network to explicitly obtain the edge map, which truly reflects the edge strength of different connection types.
(2) Define an edge distance map and obtain the positive diffusion direction, that is, towards the nearest edge. Besides, the diffusion coefficients are adjusted by the color vector and the class probability vector to deal with misclassification.

Directed DT.
To acquire a more precise edge map, an improved DT algorithm, DDT, is adopted in the presented approach, as shown in Figure 3. First, the edge distance map is defined to eliminate inappropriate diffusion.
In four-direction filtering, the diffusion coefficient is adjusted by two parameters. For pixels that satisfy the diffusion conditions, the point is diffused in four directions until the iteration ends. Finally, an area-filling method is employed to fill holes. In the distance map, the edge distance is measured by the minimum distance of each pixel to its nearest edge, as shown in formula (14).
where $D_{i,j}$ is the minimum distance, $Y^+$ denotes all boundary pixels in the edge map, and $(m, n)$ is the coordinate of a boundary pixel compared with pixel $(i, j)$.
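The edge distance map can be illustrated with a brute-force NumPy sketch. Formula (14) is not reproduced here, so the sketch assumes a Euclidean distance and a binary edge mask (for example, a thresholded HED map); both the function name and these choices are assumptions of the sketch.

```python
import numpy as np

def edge_distance_map(edge_mask):
    """D[i, j] = minimum Euclidean distance from pixel (i, j) to any boundary pixel."""
    edge_mask = np.asarray(edge_mask, dtype=bool)
    if not edge_mask.any():
        raise ValueError("edge map contains no boundary pixels")
    boundary = np.argwhere(edge_mask)                         # (K, 2) coordinates (m, n)
    rows, cols = np.indices(edge_mask.shape)
    pixels = np.stack([rows.ravel(), cols.ravel()], axis=1)   # (H*W, 2) coordinates (i, j)
    diff = pixels[:, None, :] - boundary[None, :, :]          # (H*W, K, 2)
    dist = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return dist.reshape(edge_mask.shape)
```

The pairwise comparison needs memory proportional to the number of pixels times the number of boundary pixels, so for a 1,000 × 1,000 image a dedicated distance transform would be the more practical choice.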
In Figure 3(c), the dashed line means the actual edge and the solid one denotes the edge to be filtered. A and B are different class labels. The correct diffusion direction is from pixels far away to pixels near the actual edge.
As shown in Equation (15), ΔD is the diffusion potential. For adjacent pixels with different classes, the diffusion condition is that ΔD should be larger than 0.
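The diffusion condition can be sketched as a mask over horizontally adjacent pixel pairs. Equation (15) is not reproduced here, so the sketch assumes that $\Delta D$ is the edge distance of the source pixel minus that of the target pixel, which makes diffusion run from pixels far from the edge towards pixels near it. The function name and the `reverse` flag are hypothetical.

```python
import numpy as np

def may_diffuse(labels, dist, reverse=False):
    """Mask over horizontal neighbor pairs (j, j + 1) in each row.

    reverse=False tests diffusion from column j to column j + 1;
    reverse=True tests diffusion from column j + 1 to column j.
    A pair qualifies when the labels differ and Delta D > 0, where
    Delta D = D(source) - D(target) is an assumed definition.
    """
    labels = np.asarray(labels)
    dist = np.asarray(dist, dtype=float)
    src_l, dst_l = labels[:, :-1], labels[:, 1:]
    src_d, dst_d = dist[:, :-1], dist[:, 1:]
    if reverse:
        src_l, dst_l = dst_l, src_l
        src_d, dst_d = dst_d, src_d
    return (src_l != dst_l) & ((src_d - dst_d) > 0)
```

For a row with labels A, A, B, B and the true edge at the last pixel, only the second pixel may push its label rightwards, and no pixel may push leftwards.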
Since a large color gradient within the same category may result in false edges, and a small gradient between different classes may result in leaked edges, the weight $\omega_i$ in Equation (12) is adapted as follows: Weight $\omega_i$ is determined by the color vector $I$ and the class probability vector $P$. Two parameters $\theta_1$ and $\theta_2$ are used to balance the spreading strength and are determined through gradient descent. Equation (12) is unrolled to compute derivatives with respect to $y$ and $\omega_i$.
where $G(i) = 1$ or $0$ represents whether area $i$ has a diffusion or not. $L$ is the loss function, $(\partial L/\partial y_i)_{n-1}$ is the current gradient, while $(\partial L/\partial y_i)_n$ denotes the updated gradient.
The segmentation error of the DDT output $y_i$ is backpropagated to $\theta_1$ and $\theta_2$.
After the gradients of $y_i$ and $\omega_i$ are updated, the gradient of $\theta$ can be further obtained.

Experiment
4.1. Experiment Data. The first experimental data set in this paper is an ESAR L-band PolInSAR image from the German Aerospace Center, with a dimension of 1,000 × 1,000 pixels. The spatial resolution of the SAR imagery is 3 m × 2.2 m. There are five classes, namely, building, forest, farmland, road, and others, as shown in Figure 4. The second experimental data set is an L-band fully polarized image of the Foulum area in Denmark, acquired on April 17, 1998. This data set mainly consists of farmland, forest, residential area, higher crop, shrub, and untilled glebe, as shown in Figure 5.
For these two data sets, the entire imagery is divided into four subimages. In each cycle, one subimage is used for testing, and the remaining three are slid in four directions to generate training samples. After four cycles, the four segmented subimages are spliced into the final result.
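The four-cycle protocol can be sketched as below. The text does not state how the image is divided, so the sketch assumes four equal quadrants and leaves out the sliding step used to generate training samples; all function names are illustrative.

```python
import numpy as np

def quadrants(image):
    """Split an image into four quadrants (an assumed partition)."""
    h, w = image.shape[:2]
    hh, hw = h // 2, w // 2
    return [image[:hh, :hw], image[:hh, hw:], image[hh:, :hw], image[hh:, hw:]]

def four_fold_rounds(image):
    """Yield (training subimages, test subimage) for each of the four cycles."""
    parts = quadrants(image)
    for k in range(4):
        train = [p for idx, p in enumerate(parts) if idx != k]
        yield train, parts[k]

def splice(parts):
    """Reassemble four quadrant results into one full-size map."""
    top = np.concatenate(parts[:2], axis=1)
    bottom = np.concatenate(parts[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)
```

Every pixel is predicted exactly once by a model that did not see it during training, which is why the spliced map can be scored as a whole.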

Experiment Result.
Through HED, the edge map of ESAR data is obtained. The edge map and edge distance map are shown in Figure 6.
Seven methods are evaluated on the ESAR data: (1) a semantic segmentation algorithm based on Meanshift and SVM, (2) CRF based on Meanshift and Potts, (3) FCN8, (4) DeepLab, (5) FCN8+DT+the fully connected CRF, (6) FCN8+HED+DT+the fully connected CRF, and (7) the proposed method. The traditional method first uses Meanshift to obtain coarsely divided pixel blocks and extracts intensity features, texture features, and polarization decomposition features for each block [31]; then, it either directly classifies these features with SVM [32] or introduces neighborhood information with a Potts prior in a CRF [31]. These two methods are the first and second methods of our comparative experiment. The third comparison method is FCN8, the most accurate net of FCN. The fourth is DeepLab, which adds the fully connected CRF. Among the remaining three methods, the last one is our proposed method, the sixth removes the DDT from our proposed method, and the fifth inserts DT between FCN8 and the fully connected CRF. The results of the seven methods are shown in Figure 7.
To further verify the effectiveness of our method, five methods related to neural networks from the above seven methods are applied on the second dataset. The results are shown in Figure 8.

Confusion Matrix.
As an important measuring index, the average accuracy of the different methods is assessed. Since objects of the same class may have quite different colors in the Pauli SAR image, a method based on color continuity is largely limited. As presented in Table 1, seven algorithms  Table 2.
Compared with the traditional methods, FCN8 achieves a large improvement, mainly because it is well suited to images and attains high accuracy in image classification. DeepLab is equivalent to adding a fully connected CRF for postprocessing after FCN8. The classification information of surrounding pixels is introduced as a reference to further improve the classification accuracy. For data set 1, compared with DeepLab, the average accuracy of DeepLab DT only increases by 0.05%, revealing that DT with internal edge extraction has almost the same effect as the postprocessing CRF. For the sixth method, the edge-weighted input obtained from the FCN-HED edge and the internal edge network is trained in the DT, but the accuracy is not greatly improved. There may be two reasons for this: (1) DT has many unsolved problems; (2) the actual image category distribution is complex. For data set 1, the proposed approach achieves the highest performance with 81.39% accuracy, outperforming DeepLab DT by 0.49%, which is attributed to internal edge fusion, edge prior information, and directed DT. The accuracy of each class is improved, especially for nonbackground categories. For data set 2, the average accuracy of our proposed method is 7.2% higher than that of FCN.
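For reference, accuracy figures of this kind are usually derived from a confusion matrix, as in the sketch below. The text does not state whether "average accuracy" means the overall pixel accuracy or the mean of the per-class accuracies, so the sketch computes both; the function names are illustrative.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """cm[a, b] counts pixels whose true class is a and predicted class is b."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    idx = gt * num_classes + pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def overall_accuracy(cm):
    """Correctly labeled pixels divided by all pixels."""
    return np.trace(cm) / cm.sum()

def per_class_accuracy(cm):
    """Correct pixels of each true class divided by that class's pixel count.

    Assumes every class occurs at least once in the ground truth.
    """
    return np.diag(cm) / cm.sum(axis=1)
```

The mean of `per_class_accuracy` weights small classes such as road as heavily as large ones, so it can differ noticeably from the overall accuracy on imbalanced scenes.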

Edge Improvement.
The results, including the segmentation images and some zoomed-in details, are shown in Figure 9, respectively. It is clear that the upper right area is labeled as building (water blue in Figure 4(b)), which corresponds to uneven color in the Pauli image. Besides, the road (pink in the ground truth) shows two different colors in the original image (black and dark green in Figure 4(a)). Since the remaining comparative methods and the proposed one are network based, they achieve better performance than the traditional algorithms. To show the improvement of DDT, some details of the segmentation results are magnified and presented in Figure 9; the segmentation images produced by FCN8+DT+the fully connected CRF and by the presented method are quite similar overall. From the four zoomed-in areas, however, the result of the proposed approach matches the ground truth more closely. The average accuracy, which reflects improvement in the whole image (edge area and nonedge area), is insufficient to measure the edge enhancement. To verify the contributions of the proposed approach, all improved pixels in the marginal area are counted in Table 3 and Table 4. The number of boundary points is much smaller than the total number of pixels. Thus, the contribution is evaluated by the ratio of edge-improved points to all boundary pixels. With DeepLab as the comparison basis, the edge accuracy of FCN8+DT+the fully connected CRF and of the proposed algorithm increases by 0.9% and 4.4% for data set 1, respectively. For data set 2, the edge accuracy increases for these two methods are 0.2% and 2.2%, respectively. This indicates that DDT guided by prior information does have a good effect on edge segmentation.
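The edge measure can be sketched as follows. The text does not define the marginal area or how improved points are counted, so the sketch assumes that boundary pixels are those with a differently labeled 4-neighbor in the ground truth and that the gain is the net increase in correctly labeled boundary pixels over a baseline; the function names are hypothetical.

```python
import numpy as np

def boundary_mask(gt):
    """True for pixels that have a 4-neighbor with a different ground-truth label."""
    gt = np.asarray(gt)
    mask = np.zeros(gt.shape, dtype=bool)
    diff_h = gt[:, :-1] != gt[:, 1:]
    diff_v = gt[:-1, :] != gt[1:, :]
    mask[:, :-1] |= diff_h
    mask[:, 1:] |= diff_h
    mask[:-1, :] |= diff_v
    mask[1:, :] |= diff_v
    return mask

def edge_accuracy_gain(gt, baseline, method):
    """Net gain in correct boundary pixels, relative to all boundary pixels.

    Assumes the ground truth contains at least one class boundary.
    """
    gt = np.asarray(gt)
    border = boundary_mask(gt)
    base_ok = ((np.asarray(baseline) == gt) & border).sum()
    meth_ok = ((np.asarray(method) == gt) & border).sum()
    return (meth_ok - base_ok) / border.sum()
```

Restricting the count to the boundary mask keeps the large, easily classified interior regions from hiding differences that occur only near edges.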

Discussion.
There are three reasons for the improvements of our proposed method. (1) Based on the Bayesian framework, HED is introduced to provide the edge prior information. In the segmentation process, the edge map and the edge distance map play an important role in improving the accuracy of segmentation. (2) FCN8 is chosen to improve the initial classification accuracy, replacing traditional classifiers, and the fully connected CRF is employed to improve the classification accuracy of the points since it introduces the prior information of the surrounding pixel classification categories. (3) Internal DT and modified DT (DDT) contribute to improving segmentation accuracy. As a constraint network, DDT analyzes the core problem of the original DT and improves it. DDT combines the improved diffusion method and the hole-filling method for external fusion, which enhances the correctness of the edge distribution, improves the classification accuracy of points near the edge, and further improves the overall accuracy.

Conclusions
An edge prior Bayesian semantic segmentation network for SAR images is proposed in this study. The raw segmentation result is first derived from the likelihood network (FCN8). Afterwards, the edge map from the HED network is fused with the map from intermediate FCN8 layers to obtain more accurate boundaries as prior information. In the last stage, the newly defined edge distance is utilized for eliminating inappropriate diffusion directions. Besides, the DDT method, which redefines the domain transform density, is proposed to improve the segmentation performance. Experiments on the proposed approach and six comparative methods are conducted on the two data sets. The experimental results demonstrate that the proposed method has a good effect on edge correction, improving the overall accuracy as well.
However, the Pauli SAR image is segmented directly without preprocessing in this paper. In future work, we will focus on preprocessing the SAR image before segmentation to improve the accuracy.

Data Availability
The image data used to support the findings of this study are included within the article.