Edge-aware Multi-level Interactive Network for Salient Object Detection of Strip Steel Surface Defects

Deep learning based models have substantially improved the salient object detection of strip surface defects. However, due to the complexity of strip surface defects, existing models perform poorly in challenging scenes such as noise disturbance and low contrast between defect regions and background. Meanwhile, the detection results of existing models often suffer from coarse boundary details. Therefore, we propose a novel saliency model, namely the Edge-aware Multi-level Interactive Network, to detect defects on the strip steel surface. Concretely, our model adopts a U-shape architecture whose two crucial components are the interactive feature integration and the edge-guided saliency fusion. Firstly, besides the skip connection that combines the same stage of the encoder and decoder, we deploy another connection, where the features from adjacent levels of the encoder are transferred to the same stage of the decoder. In this way, we provide an effective fusion of multi-level deep features, yielding a good depiction of defects. Secondly, to give well-defined boundaries for prediction results, we add an edge extraction branch after each decoder block, where the progressive feature aggregation endows the edge with precise details and complete object cues. Meanwhile, together with the edge extraction branches, we deploy a saliency prediction branch at each decoder stage. After that, coupled with the fine edge information, we fuse all outputs of the saliency prediction branches into the final saliency map, where the edge cue steers the saliency result to pay more attention to boundary details. In this way, we obtain a high-quality saliency map which accurately locates and segments the defects. Extensive experiments are performed on a public dataset, and the results demonstrate the effectiveness and robustness of our model, which consistently outperforms the state-of-the-art models.


I. INTRODUCTION
SURFACE defect detection is an important research area in the field of machine vision, which aims to locate the defect regions in the collected surface images. This paper focuses on strip steel, an industrial material widely used in ships, bridges, cars, military equipment, and so on. There are many types of surface defects on strip steel, such as inclusions, patches, and scratches, which are caused by the equipment, raw materials, technology, casting, and other factors in the production process of strip steel [1]. The defects exert a negative influence on the quality of strip steel, the deep processing, and the aesthetics of products. Therefore, strip steel surface defect detection technology is deployed on the production line to inspect the surface and locate the defects, so as to realize effective control of the strip steel quality.
Traditionally, defect detection has been conducted by manual inspection based on human vision, where the surface defect details cannot be observed in time. Besides, manual inspection is easily affected by the working environment, equipment stability, and subjective factors. Thus, in recent years, manual inspection methods have been gradually replaced by machine vision based detection methods, which are more efficient and more robust. Meanwhile, as is well known, saliency detection [2] tries to capture the most visually attractive regions of an image, and the defect regions of the strip steel surface can also be regarded as salient regions. Saliency detection is often treated as a preprocessing operation for many vision tasks such as tracking [3], [4], segmentation [5], [6], quality assessment [7], [8], and defect detection [9]-[17]. Therefore, this paper treats strip steel surface defect detection as salient object detection, so as to effectively highlight the defect regions.
In recent years, many efforts have been devoted to the research of saliency detection. Concretely, the traditional saliency models can be categorized into two classes. The first class comprises heuristic prior based models, such as the contrast based saliency models [18], [19], the center-surround difference based models [20], [21], and the background prior based models [22], [23]. The second class comprises traditional machine learning based models, such as random forest [24], [25], support vector machine [26], conditional random field [27], and so on. However, the traditional saliency models are mainly designed on handcrafted features, which cannot depict strip steel surface defects well, especially in complex surface scenes such as low contrast between defects and backgrounds, small defect regions, and noise disturbance. As a result, the generated saliency maps often cannot highlight the defects completely. To tackle this dilemma, deep learning has been applied to this field [28]-[37]. Although the performance of saliency models has improved considerably, the inference results still suffer from two problems. The first problem is coarse boundaries: the defects cannot be accurately segmented when the image contains multiple defect regions of different sizes or defects with fine structures. The second problem is the robustness of the model: the performance degrades considerably when dealing with noise-corrupted data.
To address the above challenges, we propose a novel saliency model, namely the Edge-aware Multi-level Interactive Network shown in Fig. 1, to detect strip steel surface defects. The entire network is a U-shaped architecture [38], and the two crucial components of our model are the interactive feature integration and the edge-guided saliency fusion. To be specific, the proposed model first extracts multi-level deep features by using the encoder. Then, we deploy the decoder to aggregate the multi-level deep features. Particularly, the existing U-shape based saliency models [17], [28], [39] combine the deep features derived from the same stage of the encoder and decoder by using the skip connection. In this way, each level of deep features can only give a scale-specific representation of defects, and information exchange between different layers is lacking, so the shallow-layer features may be impaired by the continuous combination with features from deep layers. Thus, for each level, we integrate the features from adjacent levels of the encoder, which provide more relevant and effective cues for the current-level feature. In this way, we realize the interaction of adjacent-level features and facilitate the flow of information across levels. Furthermore, to improve the boundary quality of inference results, many models [17], [28], [39] introduce edge cues. Inspired by this, we also introduce edge information into our network.
We deploy the edge extraction branch after each decoder block, and meanwhile, we add a saliency prediction branch at each decoder block, where the edge information not only conveys precise boundary details but also is endowed with complete object cues. Subsequently, we fuse the saliency inference results and the fine edge information into the final high-quality saliency map which can accurately locate and segment the defects.
Overall, the contributions of this paper can be summarized as follows: 1) We propose a novel saliency model, i.e. the Edge-aware Multi-level Interactive Network, to detect strip steel surface defects, where the two key components are the interactive feature integration and the edge-guided saliency fusion. 2) To enable an effective interaction of features from different levels, we integrate each level feature with its adjacent-level features. Besides, to present high-quality boundary details of defect regions, we introduce edge information to refine the saliency fusion.

II. RELATED WORKS
Numerous efforts have been devoted to saliency detection, and its performance has been pushed forward significantly. Here, we give a brief introduction to two kinds of saliency models, namely the traditional models (the heuristic prior based models and the traditional machine learning based models) and the deep learning based models.

A. TRADITIONAL SALIENCY MODELS
The pioneering work on saliency detection was conducted by Itti et al. [2], where the saliency is defined as the center-surround difference computed from color, intensity, and motion features. Following this mechanism, Achanta et al. [40] defined the saliency by using a frequency-tuned method. Besides, Cheng et al. [18] treated the saliency of a region as its contrast with respect to nearby regions. In [19], a saliency prediction is designed from two contrast measures, namely uniqueness and spatial distribution. There are also some other heuristic priors used to build saliency measurements. For example, in [20], the discriminant center-surround hypothesis is proposed to estimate the saliency values of each image. In [21], the saliency of each pixel is defined by how much it differs from its surroundings, which is computed by employing an anisotropic center-surround operator. In addition, the background prior is also adopted in many saliency efforts. For example, in [22], the boundary and connectivity priors about backgrounds are employed by geodesic saliency. In [23], the spatial layout of image regions with respect to image boundaries is employed to define the boundary connectivity, which can be viewed as a saliency measurement.
In recent years, with the development of machine learning algorithms, the performance of saliency models has also been improved to some degree. For example, in [24] and [25], the random forest is used to map the features to saliency values. In [26], Lu et al. employed the support vector machine based multiple kernel boosting method to estimate the saliency values. The conditional random field is also used to aggregate various unary saliency cues with pairwise information to highlight salient objects [27]. Besides, regularized random walk ranking [41] is employed to introduce a prior saliency prediction to each pixel by simultaneously considering the region and pixel image features, thus generating high-quality saliency maps. In [42], Peng et al. employed low-rank matrix theory to perform matrix decomposition, where the background and salient regions are represented by a low-rank matrix and a sparse matrix, respectively. In [43], Huang et al. used multiple instance learning to estimate each proposal's saliency value.
Generally, the aforementioned traditional models adopt hand-crafted features to compute the contrast, measure the background prior, train the machine learning algorithms, and so on. They cannot capture the rich semantic information of salient objects, i.e. defects. Therefore, when dealing with challenging scenes, the existing traditional models are incapable of detecting the defect regions accurately and completely.

B. DEEP LEARNING BASED SALIENCY MODELS
In recent years, deep learning technologies have made great progress, which also benefits the detection of salient objects, and the performance of saliency models has been pushed forward significantly. For example, in [44], Luo et al. proposed a convolutional network that fuses the local and global cues via a multi-resolution 4 × 5 grid structure. In [45], Hou et al. inserted short connections into the skip-layer structures within the holistically-nested edge detector. In [30], the recurrent residual refinement network is deployed to progressively refine the saliency maps by building a set of residual refinement blocks, where the low-level features and high-level features are alternately utilized. In [46], the bi-directional message passing model is proposed to fuse multi-level features into the final saliency map, where messages are passed among multi-level features. In [47], the pixel-wise contextual attention network is designed to selectively acquire an attention map, in which each attention map corresponds to the contextual relevance at each pixel. In [48], Zhao et al. proposed the pyramid feature attention network consisting of a context-aware pyramid feature extraction module and a channel-wise attention module, which are employed to strengthen the high-level and low-level deep features. In [39], Liu et al. designed two pooling-based modules, the global guidance module and the feature aggregation module, to improve the performance of saliency detection. In [29], Wu et al. designed the cascaded partial decoder framework to obtain a precise saliency map, where the low-level features are discarded and the high-level features are retained. In [28], the boundary-aware saliency detection network employed a hybrid loss to guarantee the accuracy of saliency maps.
In [17], Song et al. proposed an encoder-decoder residual network to precisely segment the defect regions from the strip steel surface. Besides, in [31], a depth-quality-aware subnet is inserted into the classical bistream RGBD saliency network, which promotes the fusion of the RGB and depth information. In [32], Chen et al. proposed a lightweight temporal network to acquire temporal information that interacts effectively with the corresponding spatial cues, giving a good saliency inference on videos. In [49], salient object detection is regarded as an object-level semantic re-ranking problem, where a lightweight deep network and a post-processing refinement are deployed successively. In [50], a stereoscopic attention mechanism is deployed to adaptively integrate features of various scales.
Although the deep learning based saliency models have advanced the research of saliency detection, they are still weak when facing complex defect scenes, especially cluttered backgrounds and noise disturbance. In our model, we focus on the interaction of features from different layers and the effect of edge information, which gives an effective boost to the performance of salient object detection.

III. THE PROPOSED METHOD
This section first gives an introduction for the proposed saliency model in Section III-A. Then, the interactive feature integration is detailed in Section III-B. After that, we detail the edge-guided saliency fusion in Section III-C. Lastly, the loss function will be elaborated in Section III-D.

A. OVERALL ARCHITECTURE
The proposed saliency model shown in Fig.1 is built on the U-shape architecture consisting of encoder and decoder with pre-trained model ResNet-34 [51] as backbone, and the two key components of the network are interactive feature integration and edge-guided saliency fusion. Firstly, we discard the last average pooling layer and softmax function of ResNet-34. Then, the encoder contains six convolutional blocks. Concretely, the first convolutional block "Conv-E1" contains a 3×3 convolutional layer (channel=64, stride=1) and the residual learning block "conv2_x" from ResNet-34. The following convolutional blocks "Conv-Ei" (i=2,3,4) adopt the residual learning blocks of ResNet-34 (i.e., "conv3_x", "conv4_x", and "conv5_x"). After that, to enlarge the receptive field of the entire network, a max pooling layer of stride 2 and another two convolutional blocks "Conv-E5" and "Conv-E6" are added after "Conv-E4", where each block is equipped with three basic residual blocks (channel=512).
The overall process shown in Fig. 1 can be summarized as follows: the input is the strip steel image $I$, and the output of our model is the high-quality saliency map $S$ which accurately highlights the defects. Firstly, the encoder extracts the multi-level deep features $\{F^E_i\}_{i=1}^{6}$. Then, through a bridge module "Conv-B" which consists of three dilated convolutional layers (channel=512, dilation rate=2) [52], we can obtain the global semantic information $F^{ED}$. After that, each decoder block integrates the feature from the corresponding encoder block, the features from adjacent encoder blocks, and the output from its previous decoder block. In this way, we can obtain six levels of deep features $\{F^D_i\}_{i=1}^{6}$. Besides, to guarantee the accuracy of saliency prediction, we also deploy an edge estimation branch after each side-path of the decoder blocks. Correspondingly, deep supervision is introduced to the entire network for the optimization of saliency prediction and edge extraction. Finally, by combining the saliency predictions $\{A_i\}_{i=1}^{6}$ of all decoder blocks and the edge information $E_1$ generated by the first decoder block, we can obtain the final saliency map $S$.
Fig. 1. The overall architecture of the proposed saliency model: the encoder (i.e. Conv-Ei (i=1,...,6)) first extracts multi-level deep features $\{F^E_i\}_{i=1}^{6}$. Then, a deep feature $F^{ED}$ is generated by using the bridge module (Conv-B). After that, the decoder (i.e. Conv-Di (i=1,...,6)) progressively aggregates the multi-level deep features by integrating the adjacent-level features, yielding the multi-level features $\{F^D_i\}_{i=1}^{6}$. Besides, we deploy the edge extraction branch "conve" and the saliency prediction branch "convs" after each decoder block. In addition, we apply deep supervision to the network, namely $\{le_i\}_{i=1}^{6}$ and $\{ls_i\}_{i=0}^{7}$, presented by the blue and green arrows. Finally, the high-quality saliency map $S$ is the aggregation of the saliency prediction results $\{A_i\}_{i=1}^{6}$ and the fine edge cue $E_1$. Here, "up" means the upsampling operation.

B. INTERACTIVE FEATURE INTEGRATION
Through the encoder of our network, we can obtain multi-level deep features which present different cues of the object. Particularly, the features from shallow layers focus on the spatial details, the middle-level deep features convey the spatial and semantic cues simultaneously, and the features from deep layers provide rich semantic information of salient objects. Many approaches have been designed to aggregate the multi-level features. Concretely, the existing models [44], [53] often transfer the current-level features to the corresponding-level decoder block, where the current-level decoder block can only present scale-specific cues. Further, some other existing models [45], [54] fuse the multi-level features in a dense way, where the integration process requires huge computation resources. Inspired by mutual learning [55], [56], we conduct the interactive feature integration for the aggregation of multi-level deep features, as presented in Fig. 1.
Formally, to capture the global semantic information, we first introduce a bridge stage (Conv-B) between the encoder and the decoder, yielding the deep feature $F^{ED}$. The corresponding process can be defined as
$$F^{ED} = f_B(F^E_6),$$
where $f_B$ denotes the bridge Conv-B, which contains three dilated convolutional layers (channel=512, dilation rate=2) [52]. Each dilated convolutional layer is followed by a batch normalization (BN) layer and a ReLU layer. Then, for each decoder block Conv-Di, its input $F^{SD}_i$ not only contains the output of the previous decoder block $F^D_{i+1}$, but also includes the features of the current-level encoder block Conv-Ei and its adjacent-level encoder blocks (i.e. Conv-E(i-1) and Conv-E(i+1)). In this way, we can obtain the input of the $i$th decoder block Conv-Di,
$$F^{SD}_i = [F^D_{i+1}, F^E_i, u_{\times 2}(F^E_{i+1}), d_{\times 0.5}(F^E_{i-1})],$$
where $[\cdot]$ denotes concatenation, and $u_{\times 2}(\cdot)$ and $d_{\times 0.5}(\cdot)$ denote the upsampling and downsampling operations (sampling rate=2 and 0.5, respectively). For $i=6$, the bridge output $F^{ED}$ takes the place of $F^D_{i+1}$ and there is no deeper encoder feature, while for $i=1$ there is no shallower encoder feature. Note that the upsampling and downsampling operations do not change the channel number of the features from adjacent levels. Finally, we pass the generated initial fused deep features $\{F^{SD}_i\}_{i=1}^{6}$ to the corresponding decoder blocks, which can be defined as
$$F^D_i = f_{D_i}(F^{SD}_i),$$
where $f_{D_i}$ denotes the $i$th decoder block Conv-Di, and $F^D_i$ is the output of Conv-Di. Here, Conv-Di (i=2,...,6) contains three 3×3 convolutional layers and a 2× upsampling layer, where each convolutional layer is followed by a batch normalization (BN) layer and a ReLU activation function. Conv-D1 only contains three 3×3 convolutional layers and is not equipped with the upsampling layer. Through this interactive feature integration process, we can obtain six-level deep features $\{F^D_i\}_{i=1}^{6}$ which give a good representation of salient objects.
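As a minimal sketch of the integration step described above, the following NumPy code assembles the input of one interior decoder block by channel-wise concatenation. The nearest-neighbour upsampling, the average-pooling downsampling, the function names, and the toy tensor sizes are all assumptions made for illustration; the network itself may use other resampling operators.

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array; channels unchanged."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2(x):
    """2x downsampling of a (C, H, W) array by 2x2 average pooling; channels unchanged."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def decoder_input(prev_dec, enc_cur, enc_deeper=None, enc_shallower=None):
    """Concatenate along channels: previous decoder output, current encoder feature,
    upsampled deeper neighbour, and downsampled shallower neighbour (when present)."""
    parts = [prev_dec, enc_cur]
    if enc_deeper is not None:
        parts.append(upsample2(enc_deeper))
    if enc_shallower is not None:
        parts.append(downsample2(enc_shallower))
    return np.concatenate(parts, axis=0)

# Toy sizes (hypothetical): level i is 8x8, level i+1 is 4x4, level i-1 is 16x16.
rng = np.random.default_rng(0)
enc_shallower = rng.standard_normal((4, 16, 16))  # stands in for F^E_{i-1}
enc_cur = rng.standard_normal((8, 8, 8))          # stands in for F^E_i
enc_deeper = rng.standard_normal((16, 4, 4))      # stands in for F^E_{i+1}
prev_dec = rng.standard_normal((8, 8, 8))         # stands in for F^D_{i+1}

fused = decoder_input(prev_dec, enc_cur, enc_deeper, enc_shallower)
print(fused.shape)  # (36, 8, 8)
```

The resampling leaves the channel counts untouched, so the fused tensor simply stacks 8 + 8 + 16 + 4 = 36 channels at the resolution of the current level. Passing `None` for a missing neighbour covers the first and last levels.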

C. EDGE-GUIDED SALIENCY FUSION
To obtain a high-quality saliency map with fine boundary details, many efforts have paid attention to the extraction and utilization of edge information [17], [28], [39]. Inspired by this, we also introduce edge information into our model. Differently, we deploy an edge extraction branch after each decoder block, namely "conve" shown in Fig. 1. Meanwhile, we also add a saliency prediction branch after each decoder block, namely "convs". Formally, the two operations are performed on the deep features $\{F^D_i\}_{i=1}^{6}$,
$$A_i = f_s(F^D_i), \quad E_i = f_e(F^D_i),$$
where $A_i$ can be regarded as the attention map from the $i$th decoder block obtained by conducting the saliency prediction $f_s$, $f_e$ is the function of the edge extraction branch, and $E_i$ is the output of the $i$th edge extraction branch, i.e. the edge information. Here, both the saliency prediction branch "convs" and the edge extraction branch "conve" are set as a 3×3 convolutional layer. Besides, to give deep supervision to the saliency prediction branches and the edge extraction branches, we deploy the upsampling operation and the sigmoid activation function after the branches, namely the green and blue double-headed arrows shown in Fig. 1.
After that, we combine all side-outputs of the saliency prediction branches, namely the attention maps $\{A_i\}_{i=1}^{6}$. Meanwhile, we choose the first edge cue $E_1$ to take part in the saliency fusion process, since $E_1$ has the highest resolution among all edge cues. Note that in the saliency fusion, the attention maps and the edge cue are not processed by the sigmoid activation function. In addition, according to Fig. 1, to concatenate the attention maps and the edge information, we first resize the attention maps $A_3 \sim A_6$ to 256 × 256 by the upsampling operation. Finally, under the guidance of the edge information $E_1$, the saliency fusion can be defined as
$$S = f([A_1, A_2, up(A_3), up(A_4), up(A_5), up(A_6), E_1]),$$
where $S$ is the final saliency map, $[\cdot]$ means the concatenation operation, $f$ denotes the convolution operation followed by a sigmoid activation function, and $up$ denotes the upsampling operation. Furthermore, we present the features $E_1$ and $\{A_i\}_{i=1}^{6}$ in Fig. 2, where the features give a good depiction of defects. Notably, for better visualization, we exhibit the features $E_1$ and $\{A_i\}_{i=1}^{6}$ after the sigmoid activation function, as shown in Fig. 2(d-j). In this way, we obtain a high-quality saliency map with well-defined boundary details, which effectively highlights the defect regions on the strip steel surface, as presented in Fig. 1.
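The fusion step can be illustrated with a short NumPy sketch. It assumes that the fusing convolution is a 1×1 convolution (a per-pixel weighted sum over the seven stacked maps), that upsampling is nearest-neighbour, and that the side outputs have the toy resolutions listed in `sizes`; the kernel size, the resampling operator, and all names below are illustrative assumptions rather than the settings of the actual network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample_to(x, size):
    """Nearest-neighbour upsampling of a square (H, H) map to (size, size).
    size is assumed to be an integer multiple of H."""
    factor = size // x.shape[0]
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse(att_maps, edge, weights, bias=0.0, size=256):
    """Stack the resized attention maps with the edge cue and apply an assumed
    1x1 convolution (weighted channel sum) followed by a sigmoid."""
    channels = [upsample_to(a, size) for a in att_maps] + [edge]
    stacked = np.stack(channels, axis=0)              # (7, size, size)
    logits = np.tensordot(weights, stacked, axes=1) + bias
    return sigmoid(logits)

rng = np.random.default_rng(0)
sizes = [256, 256, 128, 64, 32, 16]                   # assumed sizes of A_1..A_6
att_maps = [rng.standard_normal((s, s)) for s in sizes]  # raw maps, no sigmoid applied
edge = rng.standard_normal((256, 256))                # stands in for E_1
weights = rng.standard_normal(7)                      # hypothetical learned weights

saliency = fuse(att_maps, edge, weights)
print(saliency.shape)  # (256, 256)
```

The maps enter the fusion as raw logits and the sigmoid is applied only once, after the convolution, which mirrors the remark that the attention maps and the edge cue are not passed through a sigmoid beforehand.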

D. LOSS FUNCTION
To alleviate overfitting, some models [28], [45] employ deep supervision for the side-outputs. Here, we also introduce deep supervision to our network.
Formally, we deploy the supervision to the saliency prediction branches and the edge extraction branches, namely $\{ls_i\}_{i=1}^{6}$ and $\{le_i\}_{i=1}^{6}$. Besides, we also introduce the supervision to the bridge module and the final output of the entire network, which are defined as $ls_0$ and $ls_7$, respectively. Thus, the total loss $L$ of the entire network can be defined as
$$L = \sum_{i=0}^{7} ls_i + \sum_{i=1}^{6} le_i.$$
Besides, similar to [17], [28], we also adopt the hybrid loss to define the saliency prediction loss $\{ls_i\}_{i=0}^{7}$, namely
$$ls_i = ls^B_i + ls^I_i + ls^S_i,$$
where $ls^B_i$, $ls^I_i$ and $ls^S_i$ denote the BCE loss [57], the IoU loss [58] and the SSIM loss [59], respectively. For the edge extraction loss $\{le_i\}_{i=1}^{6}$, each term adopts the BCE loss [57]. The three losses, BCE, IoU, and SSIM, are detailed below. The BCE [57] (Binary Cross Entropy) loss is usually employed in binary classification tasks, and it can be written as
$$l_B = -\sum_{(x,y)} \left[ GT(x,y) \log S(x,y) + (1 - GT(x,y)) \log (1 - S(x,y)) \right],$$
where $l_B$, $GT$ and $S$ denote the BCE loss, the ground truth and the predicted saliency map, respectively. The IoU [58] (Intersection over Union) loss is often deployed to evaluate the similarity of $GT$ and $S$, which can be written as
$$l_I = 1 - \frac{\sum_{(x,y)} S(x,y)\, GT(x,y)}{\sum_{(x,y)} \left[ S(x,y) + GT(x,y) - S(x,y)\, GT(x,y) \right]},$$
where $l_I$ is the IoU loss.
The SSIM [59] (Structural Similarity) loss was initially designed for the image quality assessment task, and it can be used to capture structural information. Specifically, $P_S = \{P^j_S : j = 1, ..., N^2\}$ and $P_{GT} = \{P^j_{GT} : j = 1, ..., N^2\}$ denote two patches (size=N × N) which are cropped from the saliency map $S$ and the ground truth $GT$, respectively. The SSIM loss $l_S$ of patch $P_S$ and patch $P_{GT}$ can be defined as
$$l_S = 1 - \frac{(2 \mu_{P_S} \mu_{P_{GT}} + C_u)(2 \sigma_{P_S P_{GT}} + C_\sigma)}{(\mu_{P_S}^2 + \mu_{P_{GT}}^2 + C_u)(\sigma_{P_S}^2 + \sigma_{P_{GT}}^2 + C_\sigma)},$$
where $\mu_{P_S}$, $\mu_{P_{GT}}$ and $\sigma_{P_S}$, $\sigma_{P_{GT}}$ refer to the means and standard deviations of $P_S$ and $P_{GT}$ respectively, $\sigma_{P_S P_{GT}}$ denotes the covariance of the two patches, and $C_u$ and $C_\sigma$ are usually set to $0.01^2$ and $0.03^2$, respectively.
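To make the BCE and IoU terms concrete, the sketch below evaluates both on a tiny 2×2 example with NumPy. The function names are hypothetical, the BCE is averaged over pixels (a common implementation choice and an assumption here), and a small `eps` is added for numerical stability.

```python
import math
import numpy as np

def bce_loss(s, gt, eps=1e-7):
    """Binary cross entropy between a predicted map s in [0, 1] and a binary
    ground truth gt, averaged over pixels (assumed reduction)."""
    s = np.clip(s, eps, 1.0 - eps)
    return float(-np.mean(gt * np.log(s) + (1.0 - gt) * np.log(1.0 - s)))

def iou_loss(s, gt, eps=1e-7):
    """Soft IoU loss: 1 - intersection / union, with products as soft intersection."""
    inter = np.sum(s * gt)
    union = np.sum(s + gt - s * gt)
    return float(1.0 - inter / (union + eps))

gt = np.array([[1.0, 0.0], [1.0, 0.0]])
uncertain = np.full((2, 2), 0.5)   # a prediction that is 0.5 everywhere

bce_uncertain = bce_loss(uncertain, gt)
iou_uncertain = iou_loss(uncertain, gt)
iou_perfect = iou_loss(gt, gt)
print(bce_uncertain, iou_uncertain, iou_perfect)
```

For the uniform 0.5 prediction the BCE equals log 2 per pixel, and the soft intersection and union are 1 and 3, so the IoU loss is 2/3; a perfect prediction drives the IoU loss to zero.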
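The patch-level SSIM term can be sketched as follows. The code computes the loss for a single pair of patches using population (biased) statistics, which is an assumption, and takes the loss as one minus the structural similarity; the function name and the toy patches are illustrative only.

```python
import numpy as np

def ssim_loss_patch(p_s, p_gt, c_u=0.01 ** 2, c_sigma=0.03 ** 2):
    """SSIM loss (1 - SSIM) between two equally sized patches."""
    mu_s, mu_g = p_s.mean(), p_gt.mean()
    var_s, var_g = p_s.var(), p_gt.var()              # population variances
    cov = ((p_s - mu_s) * (p_gt - mu_g)).mean()       # population covariance
    ssim = ((2 * mu_s * mu_g + c_u) * (2 * cov + c_sigma)) / (
        (mu_s ** 2 + mu_g ** 2 + c_u) * (var_s + var_g + c_sigma))
    return float(1.0 - ssim)

# A 4x4 checkerboard patch and its inverse (toy data)
patch = np.indices((4, 4)).sum(axis=0) % 2.0
loss_same = ssim_loss_patch(patch, patch)
loss_inverted = ssim_loss_patch(patch, 1.0 - patch)
print(loss_same, loss_inverted)
```

Identical patches give a loss of zero, while the inverted patch has a negative covariance with the original and therefore a loss above one, which shows how this term penalizes structural disagreement rather than per-pixel error alone.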

IV. EXPERIMENTAL RESULTS
This section first provides the experimental setup in Section IV-A. Then, in Section IV-B, we compare the proposed model with the state-of-the-art saliency models in quantitative and qualitative ways. Lastly, the ablation analysis is presented in Section IV-C.

A. EXPERIMENTAL SETUP
Here, we perform extensive experiments on the public strip steel dataset SD-saliency-900 [60] to verify the effectiveness of our model. SD-saliency-900 has 900 images, and it consists of three types of strip steel surface defects including inclusion, patches and scratches, and each type of defects has 300 images with 200×200 resolution.

1) Parameter Settings and Implementation Details
Following [17], we generate a training set containing 1620 images. Concretely, we first choose 180 images from each type of defect, yielding the initial training set containing 540 images. Then, we further select 90 images of each defect type from the initial training set and add salt & pepper noise (ρ=20%), generating the noise interference training set that consists of 270 images. We combine the initial training set and the noise interference training set, yielding a training set of 810 images in total. After that, we perform horizontal flipping to augment the training set, yielding 1620 images. In addition, during the training phase, we resize each image I to 256 × 256, and then we perform normalization ((I − µ)/σ, µ=0.4669, and σ=0.2437).
We implement our network with PyTorch 1.4.0, and the code is run on a PC with an NVIDIA Titan XP GPU (with 12GB memory). To train the network, we initialize the encoder by using the ResNet-34 model [51], and the remaining convolutional layers are initialized by Xavier [61]. Meanwhile, we adopt the Adam optimizer [62] to optimize our network, where the initial learning rate, betas, eps, and weight decay are set to $10^{-3}$, (0.9, 0.999), $10^{-8}$, and 0, respectively. The training process continues until the loss converges. Besides, the training batch size is set to 10, and the training process takes about 15 hours. During the test phase, we resize each image to 256 × 256, and the final saliency map is further resized to the same resolution as the input image by using bilinear interpolation. The average running speed of our model is about 48 fps when dealing with 256×256 images.
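The composition of the training set can be checked with a few lines of plain Python. The noise routine below is one simple way to apply salt & pepper noise at density ρ and is an assumption for illustration (function names, the equal salt/pepper split, and the list-of-rows image format are hypothetical), not the exact procedure used to build the dataset.

```python
import random

def add_salt_pepper(img, rho=0.2, rng=None):
    """Return a copy of a grayscale image (list of rows, values 0..255) in which
    a fraction rho of the pixels is replaced by 0 (pepper) or 255 (salt)."""
    rng = rng or random.Random(0)
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    n_noisy = round(rho * h * w)
    for idx in rng.sample(range(h * w), n_noisy):
        out[idx // w][idx % w] = rng.choice((0, 255))
    return out

def training_set_size(per_type=180, noisy_per_type=90, n_types=3):
    """Clean images plus noisy copies, doubled by horizontal flipping."""
    clean = per_type * n_types          # 540
    noisy = noisy_per_type * n_types    # 270
    return 2 * (clean + noisy)

image = [[128] * 200 for _ in range(200)]     # toy 200x200 mid-gray image
noisy_image = add_salt_pepper(image, rho=0.2)
n_changed = sum(p != 128 for row in noisy_image for p in row)

print(training_set_size())  # 1620
print(n_changed)            # 8000
```

With ρ=20% on a 200×200 image, 8000 of the 40000 pixels are corrupted, and the set size follows as 2 × (540 + 270) = 1620.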

2) Evaluation Metrics
In the experiments, we adopt the following metrics to evaluate the performance of our model: the precision-recall (PR) curve, the F-measure curve, mean absolute error (MAE), the weighted F-measure (WF) score, overlapping ratio (OR), structure-measure (SM), and Pratt's figure of merit (PFOM).
For the structure-measure, the balance parameter α is set to 0.5. Lastly, Pratt's figure of merit (PFOM) [66] intuitively reflects the boundary quality of the segmentation results, and it is often employed in edge detection. It can be defined as
$$\mathrm{PFOM} = \frac{1}{\max(N_G, N_S)} \sum_{k=1}^{N_S} \frac{1}{1 + \alpha d_k^2},$$
where $N_G$ and $N_S$ are the numbers of ideal and actual edge points extracted from the ground truth map and the binary saliency result, respectively. Here, $\alpha$ denotes a scaling constant which is set to 0.1 or 1/9, and $d_k$ denotes the Euclidean distance between the $k$th detected edge point and its nearest true edge point.
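As an illustration of how PFOM behaves, the following plain-Python sketch implements the standard form of Pratt's figure of merit on two lists of edge coordinates. The function name, the brute-force nearest-point search, and the choice α=1/9 are assumptions for this example.

```python
def pfom(true_edges, detected_edges, alpha=1.0 / 9.0):
    """Pratt's figure of merit for two lists of (row, col) edge points.
    Each detected point contributes 1 / (1 + alpha * d^2), where d is its
    distance to the nearest true edge point."""
    if not true_edges or not detected_edges:
        return 0.0
    total = 0.0
    for r, c in detected_edges:
        d2 = min((r - tr) ** 2 + (c - tc) ** 2 for tr, tc in true_edges)
        total += 1.0 / (1.0 + alpha * d2)
    return total / max(len(true_edges), len(detected_edges))

true_edges = [(0, 0), (0, 1), (0, 2)]
shifted_edges = [(1, 0), (1, 1), (1, 2)]   # the same edge displaced by one pixel

score_exact = pfom(true_edges, true_edges)
score_shifted = pfom(true_edges, shifted_edges)
print(score_exact, score_shifted)
```

A perfectly localized edge scores 1, and a one-pixel displacement with α=1/9 lowers each term to 1/(1 + 1/9) = 0.9, so the score drops smoothly as the detected boundary drifts from the true one.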

B. COMPARISON WITH THE STATE-OF-THE-ART MODELS
The quantitative comparison between our model and the state-of-the-art models is presented in Fig. 3 and Table 1. To be specific, Fig. 3 shows the PR curves (top row) and the F-measure curves (bottom row). Our model achieves the best performance when compared with other models in terms of PR curves and F-measure curves. Particularly, under the disturbance of salt & pepper noise (ρ=10% and ρ=20%), our model still performs best, as presented in Fig. 3(b,c). In addition, when compared with the recently published work EDRNet [17], the improvement achieved by our model is still significant. Furthermore, Table 1 provides the results in terms of MAE, WF, OR, SM and PFOM. We can find that our model still consistently outperforms other models with a large margin.
In addition, to evaluate the running efficiency of different models, we compare them in terms of the model size (MB) and the average running time (seconds per image), as presented in Table 2. Concretely, the average running time is computed by executing the models on the SD-saliency-900 dataset. It can be seen that our model runs fastest when compared with other models, taking about 0.021s to handle a 200×200 image. As for the model size, our model occupies 378MB, which is slightly large when compared with the top-performing models. Thus, in our future work, we will attempt to compress the model and reduce the model size.
The qualitative comparison results are shown in Figure 4, where the examples of the top three rows, the middle three rows, and the bottom three rows are selected from the three types of defects (i.e. inclusion, patches, and scratches), respectively. It can be found that the results of our model are the closest to the ground truth when compared with the results of other models. Specifically, firstly, the examples of the top three rows exhibit low contrast and scattered defects. We can find that only our model provides complete and accurate results which are capable of highlighting the defects. By contrast, other models either falsely highlight the backgrounds or cannot detect complete defect regions. For example, in the third row, other models mistakenly recognize the background regions as defects, while our model distinguishes the defects accurately. Secondly, for the middle three rows, the defect regions are large and the backgrounds are cluttered. Our model still gives an accurate prediction of the defect regions. In contrast, the traditional models often highlight the backgrounds as shown in Figure 4(n-r), and the deep learning based models either incorrectly highlight backgrounds or incompletely detect defects as shown in Figure 4(d-m). Lastly, from the 7th row to the 9th row, the examples contain fine structures. It can be seen that our model still performs best, where the results shown in Figure 4(c) present clear details. By contrast, most models fail to detect the salient objects: they often lose parts of the defect regions and even falsely highlight backgrounds. Therefore, the qualitative comparison results again demonstrate the effectiveness and superiority of our model.

C. ABLATION STUDIES
To illustrate the effectiveness of our model and demonstrate the rationality of its design, we conduct comprehensive ablation studies shown in Table 3, where the quantitative comparison is conducted in terms of five metrics including MAE, WF, OR, SM, and PFOM. Firstly, we design several variations of our model. Concretely, as depicted in Table 3, "Baseline (B)" denotes the basic encoder-decoder network without any other components, which only contains Conv-E1∼E6, Conv-B, and Conv-D6∼D1. Its final saliency map is generated by applying a 3×3 convolutional layer and a sigmoid activation function to the output of Conv-D1. "IFI" means interactive feature integration. "E" means the introduction of edge information. "Sup" denotes the deep supervision adopted by our network. Here, our model is denoted by "B+IFI+E+Sup", "B+IFI+Sup" means our model without edge extraction branches, "B+E+Sup" denotes our model without interactive feature integration, and "B+Sup" is the basic network with deep supervision. Correspondingly, "B+IFI+E" denotes our model without deep supervision, "B+IFI" means our model without edge extraction branches and deep supervision, and "B+E" refers to our model without interactive feature integration and deep supervision.
From Table 3, we can find that our model achieves the best performance when compared with the other variations in terms of MAE, WF, OR, SM, and PFOM. Particularly, compared with the baseline network (B), the WF, OR, SM, and PFOM of our model are improved by 1.2%, 2.2%, 1.0%, and 1.0%, and the MAE is decreased by 16.2%. This demonstrates the effectiveness of all components of our model, and further validates the rationality of the proposed network.
Besides, we also provide the qualitative comparison between our model and the variations "B+IFI+Sup", "B+E+Sup", and "B+IFI+E", namely our model without edge, our model without interactive feature integration, and our model without deep supervision, as presented in Fig. 5, Fig. 6, and Fig. 7. To be specific, firstly, compared with "B+IFI+Sup" shown in Fig. 5(d), the results of our model are more complete. Secondly, compared with "B+E+Sup" presented in Fig. 6(d), our model suppresses the backgrounds effectively. Thirdly, compared with "B+IFI+E" depicted in Fig. 7(d), our model clearly performs better. This shows the effects of the edge information, the interaction of different-level features, and the deep supervision, where the edge provides an accurate location cue for defect regions, the feature interaction gives a good depiction of defects, and the deep supervision gives an effective constraint for feature learning. Thus, Fig. 5, Fig. 6, and Fig. 7 support the effectiveness of the crucial components of our model and the rationality of its design.

V. CONCLUSION
This paper proposes a novel saliency model, i.e. the Edge-aware Multi-level Interactive Network, to highlight defects on the strip steel surface. Specifically, the proposed network adopts a U-shape architecture whose two key components are the interactive feature integration and the edge-guided saliency fusion. Firstly, for each level of the network, we fuse the features from the current level of the encoder, the adjacent levels of the encoder, and the previous decoder stage. Particularly, the features of adjacent layers promote the flow of object cues, which benefits the depiction of defects. Secondly, to acquire a saliency result with precise boundaries, we extract edge information together with the saliency prediction at each decoder block. After that, the fusion of edge cues and saliency results provides a complete and accurate saliency map which effectively highlights the defect regions on the strip steel surface. Comprehensive experiments are conducted on a public dataset, and the quantitative and qualitative results demonstrate the effectiveness of our model, which consistently outperforms the state-of-the-art models in all evaluation metrics.