Image segmentation via foreground and background semantic descriptors

Abstract. In the field of image processing, it has been a challenging task to obtain a complete foreground that is not uniform in color or texture. Unlike other methods, which segment the image by only using low-level features, we present a segmentation framework, in which high-level visual features, such as semantic information, are used. First, the initial semantic labels were obtained by using the nonparametric method. Then, a subset of the training images, with a similar foreground to the input image, was selected. Consequently, the semantic labels could be further refined according to the subset. Finally, the input image was segmented by integrating the object affinity and refined semantic labels. State-of-the-art performance was achieved in experiments with the challenging MSRC 21 dataset.


Introduction
Image segmentation is a fundamental problem in the field of computer vision. So far, abundant research has been published on this topic. [1][2][3][4][5] However, segmenting complete foreground objects that are not uniform in color or texture remains a challenging task. In addition to local low-level image features, such as color, texture, and spatial position, an increasing number of studies focus on segmenting images using high-level visual information.
Cosegmentation methods suggested by Refs. 6-11 employ foreground correspondence and jointly segment objects that have similar characteristics in a set of images. Rother et al. 7 utilized histogram matching and a modified Markov random field (MRF) framework formed by the difference of foreground region histograms. Sun et al. 8 constructed an MRF framework that reflected camera flash illumination changes in order to extract the foreground from the background. Kim et al. 9 proposed a hierarchical framework for dividing a large image set into multiple subsets, so that segmentation could be performed by cosegmenting each subset separately with interimage connections. Inspired by the characteristic of linear anisotropic heat diffusion, Kim et al. 10 suggested a cosegmentation model in which the finite heat sources of temperature maximization corresponded to the maximized segmentation confidence. In Ref. 11, segmentation was modeled by an energy-minimization function that combined local appearance and spatial consistency. However, most existing studies on cosegmentation handled only multiple images with common objects, and different, irregularly appearing objects were hardly dealt with.
In recent years, semantic segmentation, which aims at assigning a semantic label to each pixel of a given image, [12][13][14][15][16][17][18] has become a subject of intense investigation in the field of computer vision. In particular, deep neural network techniques have recently played an important role in semantic segmentation. The segmentation accuracy has been greatly improved by applying deep learning techniques, [19][20][21][22] provided that a huge dataset is collected to train the network.
Semantic information, as a form of high-level visual information, can provide an important cue for segmenting a complete foreground from the image. In this study, inspired by semantic segmentation methods, we propose a segmentation mechanism for achieving a complete and accurate foreground boundary. Inspired by nonparametric methods, the initial semantic labels were obtained by maximizing the normalized label likelihood score. 23,24 Then, the foreground and background semantic descriptors were defined according to the initial semantic labels. With the aid of the two semantic descriptors, a subset of training images with a foreground similar to that of the input image was obtained. Subsequently, the semantic labels were further refined via object affinity and a semantic codebook. Finally, image segmentation was achieved by means of semantic labeling. For postprocessing, we adopted the Grab-Cut method 25 and used it to merge separate regions.
The remainder of this paper is organized as follows: Sec. 2 describes initial semantic labeling using the nonparametric method; Sec. 3 describes our image segmentation scheme via foreground and background semantic descriptors; and the experimental results are presented in Sec. 4.

Initial Semantic Labels Acquisition
Inspired by the nonparametric method, 23,24,26 the initial semantic labels of the input image can be acquired as follows. In the beginning, an image subset D is obtained from the training set by applying the global GIST feature descriptor, such that the image subset D contains the scenes most similar to the input image.
The GIST descriptor can summarize the gradient information for local regions of an image, which provides a rough description of the scene. A GIST descriptor of the scene refers to the meaningful information that an observer can identify from a glimpse at the scene. 27 The GIST can be represented at both perceptual and conceptual levels because it includes all levels of visual information. It can be constructed by two-dimensional Gabor wavelets. 28 The Gabor wavelets of a specific direction and scale can be considered as a local bandpass filter with respect to the corresponding direction and scale, whose response corresponds exactly to the edges of specific directions in the image. At the beginning, the image is divided into patches. For each patch $P_i$ of size $r_0 \times c_0$, the cascading of its convolution in each channel is defined as

$$G_{P_i} = \mathrm{cat}[F(x, y) * G_{m,n}(x, y)], \quad (x, y) \in P_i, \tag{1}$$

where $\mathrm{cat}(\cdot)$ represents the cascade operation, and $G_{m,n}(x, y)$ denotes the two-dimensional Gabor wavelet.
Let $\bar{G}_{P_i}$ denote the average convolution of a specific direction and scale over patch $P_i$, that is, over $(x, y) \in P_i$; then the GIST descriptor can be expressed as

$$G_G = \{\bar{G}_{P_1}, \bar{G}_{P_2}, \ldots, \bar{G}_{P_{N_g}}\}, \tag{2}$$

where $N_g$ represents the number of patches in the image. By detecting and combining the edge information among local patches, the GIST descriptor can describe the overall distribution of gradient information within the image.
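The patch-wise Gabor averaging described above can be sketched as follows. This is only a minimal illustration, not the implementation used in the experiments: the kernel size, wavelength, Gaussian width, number of orientations, and the 2 × 2 patch grid are all assumed values, and the function names are hypothetical. The real (cosine) Gabor kernel is point-symmetric, so the correlation computed here coincides with convolution.

```python
import math

def gabor_kernel(size, wavelength, theta, sigma):
    """Real (cosine) part of a 2-D Gabor kernel of odd side length `size`."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the filter orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def filter_response(image, kernel):
    """Magnitude of the same-size filter response with zero padding."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky, krow in enumerate(kernel):
                yy = y + ky - half
                if 0 <= yy < h:
                    for kx, kval in enumerate(krow):
                        xx = x + kx - half
                        if 0 <= xx < w:
                            acc += image[yy][xx] * kval
            out[y][x] = abs(acc)
    return out

def gist_descriptor(image, grid=2, orientations=4, wavelengths=(4.0,)):
    """Concatenate, patch by patch, the mean response of every Gabor channel."""
    h, w = len(image), len(image[0])
    responses = []
    for wavelength in wavelengths:
        for o in range(orientations):
            theta = math.pi * o / orientations
            kernel = gabor_kernel(7, wavelength, theta, sigma=2.0)  # assumed sizes
            responses.append(filter_response(image, kernel))
    descriptor = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            area = (y1 - y0) * (x1 - x0)
            for resp in responses:
                total = sum(resp[y][x] for y in range(y0, y1) for x in range(x0, x1))
                descriptor.append(total / area)
    return descriptor
```

For an image of vertical stripes, the channel oriented across the stripes responds more strongly than the channel oriented along them, which is the kind of coarse gradient layout the descriptor is meant to capture.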
In this study, the initial semantic labels were assigned to each superpixel, instead of to individual pixels, because of the spatial support among pixels. Specifically, the superpixels within both the input image and the images in D were obtained using the simple linear iterative clustering (SLIC) method. 29 Then, we adopted three features, denoted as $f^k$ ($k = 1, 2, 3$), to describe each superpixel: the scale-invariant feature transform (SIFT) descriptor, 30 the color mean in Lab color space, and the central location of the superpixel. $I_s = \{s_1, s_2, \ldots, s_N\}$ denotes the set of superpixels of the input image, and $D_s = \{n_1, n_2, \ldots, n_M\}$ denotes all the superpixels obtained from set D. For each superpixel $s_i \in I_s$, its neighborhood $N_i^k$ was defined as the set of superpixels $N_i^k \subset D_s$ that had the nearest Euclidean distance to $s_i$ in terms of the k'th feature $f_i^k$. In this work, $N_i^k$ included the 15 closest superpixels. Next, each superpixel $s_i$ was assigned a semantic label $l \in L$, where L represented the set of semantic classes. The probability distribution of semantic labels was defined as the normalized label likelihood score. In this work, the normalized label likelihood score $P(f_i^k \mid l)$ of each superpixel $s_i$ was expressed by the nonparametric density estimates of Eq. (3), where $\bar{l}$ was the complementary set of l. $n(l, N_i^k)$ [or $n(l, D)$] indicated the number of superpixels labeled l in the set $N_i^k$ (or D), whereas $n(\bar{l}, N_i^k)$ [or $n(\bar{l}, D)$] represented the number of superpixels that were not labeled l. The function of ε was to prevent zero likelihoods and smooth the counts.
Then, the initial semantic label for each superpixel was obtained by maximizing the normalized label likelihood score [Eq. (4)], where Z was a normalization factor over all semantic classes. However, the resulting semantic labels were coarse, and some superpixels were labeled incorrectly, as shown in Fig. 1.
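The counting scheme behind the initial labels can be sketched as follows. The exact form of the likelihood score is not reproduced here, so this sketch assumes the smoothed likelihood-ratio form that is common in nonparametric scene parsing, multiplies the ratios over the three feature channels, and normalizes by Z; the function names, the product over features, and the value of `eps` are assumptions for illustration only.

```python
from collections import Counter

def likelihood_ratio(label, neighbor_labels, dataset_counts, eps=1.0):
    """Smoothed ratio of P(f | label) to P(f | not label) from label counts."""
    n_l_nbr = sum(1 for x in neighbor_labels if x == label)   # n(l, N)
    n_not_nbr = len(neighbor_labels) - n_l_nbr                # n(not l, N)
    n_l_all = dataset_counts[label]                           # n(l, D)
    n_not_all = sum(dataset_counts.values()) - n_l_all        # n(not l, D)
    return ((n_l_nbr + eps) / n_l_all) / ((n_not_nbr + eps) / n_not_all)

def initial_label(neighbor_labels_per_feature, dataset_labels, eps=1.0):
    """Return the label with the highest normalized score and all scores."""
    dataset_counts = Counter(dataset_labels)
    if len(dataset_counts) < 2:
        raise ValueError("at least two semantic classes are required")
    scores = {}
    for label in dataset_counts:
        score = 1.0
        for neighbor_labels in neighbor_labels_per_feature:
            score *= likelihood_ratio(label, neighbor_labels, dataset_counts, eps)
        scores[label] = score
    z = sum(scores.values())  # normalization over all semantic classes
    probs = {label: s / z for label, s in scores.items()}
    return max(probs, key=probs.get), probs
```

Each entry of `neighbor_labels_per_feature` holds the labels of the nearest retrieved superpixels for one feature channel (15 per channel in the text); `dataset_labels` holds the labels of all superpixels in D.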

Image Segmentation via Semantic Descriptors
Intuitively, segmentation of the input image would be guided effectively if there existed a subset of training images with a foreground similar to that of the input image. Therefore, this subset of the training data, named the semantic retrieval set, had to be determined from the initial semantic labels before complete segmentation.

Semantic Retrieval Set Determination via Foreground and Background Semantic Descriptor
In the beginning, the image was divided into two segments according to its Lab color features by using the k-means method. 31 Considering that peripheral regions often appear as background in images, we assigned the peripheral segment a background label "0"; a foreground label "1" was assigned to the other segment. Then, the foreground semantic descriptor $f_{fs}$ and the background semantic descriptor $f_{bs}$ were defined in order to obtain the semantic retrieval set Ψ, such that the images in the set would have the foreground objects most similar to those of the input image.
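The labeling of the two color segments can be sketched as follows. The text does not state how the peripheral segment is identified, so this sketch assumes it is the cluster that covers the majority of the image border; the input is assumed to be a map of 0/1 cluster ids produced by a two-cluster k-means on Lab colors, and the function name is hypothetical.

```python
def assign_foreground_background(segment_map):
    """Map two cluster ids (0/1) to labels: 0 = background, 1 = foreground.

    The cluster covering most of the image border is treated as the
    peripheral (background) segment; this rule is an assumption.
    Expects a map with at least two rows and two columns.
    """
    h, w = len(segment_map), len(segment_map[0])
    border = list(segment_map[0]) + list(segment_map[h - 1])
    border += [segment_map[y][0] for y in range(1, h - 1)]
    border += [segment_map[y][w - 1] for y in range(1, h - 1)]
    # Cluster id 1 is peripheral if it covers more than half of the border.
    background_id = 1 if 2 * sum(border) > len(border) else 0
    return [[0 if v == background_id else 1 for v in row] for row in segment_map]
```

The result does not depend on which arbitrary id k-means gave to each cluster, only on which cluster touches the border more.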
The semantic descriptors were defined in a spatial pyramid structure. The segment labeled "foreground" in the image would be divided into equal grids with respect to different levels $r_s$ in the spatial pyramid. At each level, the semantic histogram was calculated within each grid. For instance, four semantic histograms $h_{21}$, $h_{22}$, $h_{23}$, and $h_{24}$ were obtained in the second level of the spatial pyramid, corresponding to the four equal grids, as shown in Fig. 2.
Then, the foreground semantic descriptor $f_{fs}$ was defined as the concatenation of all the semantic histograms within each grid at each pyramid level [Eq. (5)]. Similarly, we can also define the background semantic descriptor $f_{bs}$. Experiments showed that a spatial pyramid depth of $r_s = 3$ was a good compromise between capturing enough detail and avoiding sensitivity to noise.
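A spatial-pyramid semantic descriptor of this kind can be sketched as follows. The sketch assumes that level r splits the bounding box of the segment into 2^(r-1) × 2^(r-1) equal cells (one cell at the first level, four at the second, as in the text) and that each histogram is normalized to sum to one; the cell layout at the third level, the normalization, and the function name are assumptions.

```python
def semantic_descriptor(label_map, mask, num_classes, levels=3):
    """Concatenate per-cell semantic histograms over a spatial pyramid.

    label_map: 2-D list of integer class ids (initial semantic labels).
    mask:      2-D list, nonzero where the pixel belongs to the segment.
    """
    ys = [y for y, row in enumerate(mask) for v in row if v]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:  # empty segment: all-zero descriptor of the right length
        return [0.0] * (num_classes * sum(4 ** lv for lv in range(levels)))
    y0, y1 = min(ys), max(ys) + 1
    x0, x1 = min(xs), max(xs) + 1
    descriptor = []
    for level in range(levels):
        cells = 2 ** level  # cells per side at this pyramid level
        for cy in range(cells):
            for cx in range(cells):
                ya = y0 + (y1 - y0) * cy // cells
                yb = y0 + (y1 - y0) * (cy + 1) // cells
                xa = x0 + (x1 - x0) * cx // cells
                xb = x0 + (x1 - x0) * (cx + 1) // cells
                hist = [0.0] * num_classes
                for y in range(ya, yb):
                    for x in range(xa, xb):
                        if mask[y][x]:
                            hist[label_map[y][x]] += 1.0
                total = sum(hist)
                if total > 0:
                    hist = [v / total for v in hist]
                descriptor.extend(hist)
    return descriptor
```

Calling the same function with the background mask gives the background descriptor. With three levels the descriptor has 21 histograms (1 + 4 + 16) under the assumed cell layout.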
In order to obtain the semantic retrieval set, we calculated the global GIST feature $g_f$, the foreground semantic descriptor $f_{fs}$, and the background semantic descriptor $f_{bs}$ for the input image and for all the images in the training set. Then, the similarity between the input image and each training image was defined through the Euclidean distances between the features [Eq. (6)], where $d_g$, $d_{fs}$, and $d_{bs}$ were the Euclidean distances, with respect to $g_f$, $f_{fs}$, and $f_{bs}$, between the input image and the training set. The purpose of λ was to control the influence of the background semantic labels in the case where some training images had been wrongly selected because of their large background area. The procedure for obtaining the semantic retrieval set Ψ is described in Algorithm 1. By applying Algorithm 1, the training images were arranged in ascending order of d, that is, in order of decreasing similarity to the input image. Finally, the semantic retrieval set Ψ was obtained by selecting the images corresponding to the smallest d. Including more images in set Ψ would provide more clues for labeling the input image, but it would also decrease the similarity to the input image; therefore, in this study, a maximum of four training images was selected to form the semantic retrieval set Ψ.
Figure 3 shows the semantic retrieval set Ψ corresponding to the input "cow" image. Set Ψ also had "cows" appearing in the foreground, which meant that the coarse initial semantic labels were able to provide an effective cue on what semantic categories the foreground belonged to.
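The ranking step of the retrieval can be sketched as follows. The precise way the three distances are combined is not reproduced here, so this sketch assumes a plain sum in which only the background distance is scaled by λ; that combination, the dictionary keys, and the function names are assumptions. The iterative adjustment of λ in Algorithm 1 is not modeled.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieval_set(query, training, lam=0.0, max_images=4):
    """Indices of the training images closest to the query, nearest first.

    query and each training item are dicts with the (hypothetical) keys
    'gist', 'fg', and 'bg' holding the three descriptors.
    """
    scored = []
    for idx, item in enumerate(training):
        d = (euclidean(query["gist"], item["gist"])
             + euclidean(query["fg"], item["fg"])
             + lam * euclidean(query["bg"], item["bg"]))  # assumed combination
        scored.append((d, idx))
    scored.sort()  # ascending distance; ties broken by index
    return [idx for _, idx in scored[:max_images]]
```

With `lam=0.0` the background descriptor is ignored, matching the initialization of λ in Algorithm 1; a small positive value lets a mismatched background push an image down the ranking.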

Semantic Labels Assignment via Object Affinity
Once set Ψ was obtained, the semantic labels would be reassigned to each superpixel of the input image via object affinity.
Suppose $s_i$ is a superpixel of the input image, and $s_j^m$ is a superpixel in the m'th image of Ψ; that is, $m \in \Psi$. We denoted the distance between $s_i$ and $s_j^m$ with respect to the k'th feature $f_i^k$ as $\Delta d^k_{im_j}$. Then, the distance measure $\Delta d_{im_j}$ between $s_i$ and $s_j^m$ was defined by Eq. (7), where $\omega = [\omega_1, \omega_2, \omega_3]$ was the weight of the different features; in this study, k = 1, 2, 3, as mentioned in Sec. 2. Subsequently, within image m, the nearest neighborhood $N_i^m$ of $s_i$ in set Ψ was obtained via $\Delta d_{im_j}$.

Algorithm 1 Procedure for obtaining the semantic retrieval set Ψ.
1. Initialize $\lambda^{(0)} = 0$, and compute $d^{(0)}$;
2. Search for a subset Ψ in ascending order of distance $d^{(0)}$;
3. Set t = 0;
4. while none of the background labels of subset Ψ is the same as the input image's background && $\lambda^{(t)} \neq 0.1$ do
5.-7. ...
8. Search for a new subset Ψ in ascending order of distance $d^{(t+1)}$;
9. ...
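The neighborhood search over the retrieval set can be sketched as follows. The sketch assumes that the per-feature Euclidean distances are combined as a weighted sum with the weights ω; the weighted-sum form, the neighborhood size `k`, and the function names are assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def combined_distance(feats_a, feats_b, weights):
    """Weighted sum of per-feature distances (assumed combination)."""
    return sum(w * euclidean(fa, fb)
               for w, fa, fb in zip(weights, feats_a, feats_b))

def nearest_neighbors(query_feats, candidates, weights, k=5):
    """Return the k candidates closest to the query superpixel.

    candidates: list of (label, feats) pairs for the superpixels of one
    retrieved image, where feats is a tuple of the three feature vectors
    (SIFT, Lab color mean, central location).
    """
    ranked = sorted(candidates,
                    key=lambda c: combined_distance(query_feats, c[1], weights))
    return ranked[:k]
```

The labels of the returned neighbors are the ones that vote for the label of the query superpixel in the likelihood that follows.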
Obviously, the semantic labels of $N_i^m$ should provide an important cue for assigning semantic labels to $s_i$, owing to their high feature similarity. Moreover, the labels of the superpixels neighboring $s_i$ in the input image should obey the smoothness constraint, which reflects the distribution of semantic labels in natural images. These two ideas can be expressed as the concept of object affinity.
Thus, the semantic label likelihood of $s_i$, determined by the labels of $N_i^m$, was described as a Gaussian function $G_s(s_i)$ [Eq. (8)], where $K_l$ was the number of superpixels sharing the same semantic class l within the neighborhood $N_i^m$, and $\beta_1$ was a damping parameter. The indicator function $\delta_{N_i^m}(m_j)$ was defined with respect to the neighborhood $N_i^m$. Considering that the distribution of semantic labels tended to be smooth throughout natural images, we adopted the agglomerative clustering method 10 in order to cluster the superpixels with respect to the Lab color feature. The semantic label propagation of the neighboring superpixels was achieved via the object affinity $S_{object}(s_i)$ [Eq. (9)], where $C(s_i)$ was the cluster to which $s_i$ belonged.

Initial Semantic Labels Refinement via Semantic Codebook
Although the initial semantic labels were coarse, they still offered a strong cue about the distribution of semantic labels. Because the semantic retrieval set Ψ could also provide the probability of labels for each superpixel, we compared the similarity between the initial semantic labels and the semantic labels generated from the semantic retrieval set. Generally, the higher the similarity of the two sets of semantic labels, the higher the reliability of the semantic labels.
Hence, the initial semantic labels were refined according to a semantic codebook, which was constructed for measuring the similarity between the initial semantic labels and the semantic labels in the semantic retrieval set Ψ. The semantic codebook of set Ψ was set as $B = \{B_l^k \mid k = 1, 2, 3;\ l \in L_\Psi\}$, where $B_l^k$ was defined as the feature descriptor of all the superpixels labeled l in the k'th (k = 1, 2, 3) feature channel (mentioned in Sec. 2) for a specific semantic class l; $L_\Psi$ represented the set of semantic classes in set Ψ. Moreover, for any particular superpixel $s_j$ labeled l, its feature $f_{l_j}^k$ (k = 1, 2, 3) formed a codeword in the codebook.
For any superpixel $s_i$ initially assigned a label $l_m$, if its initial label was included in $L_\Psi$, such that $l_m \in L_\Psi$, the similarity of label $l_m$ was calculated only with respect to $B_{l_m} = \{B_{l_m}^k \mid k = 1, 2, 3\}$ in the semantic codebook. However, if the initial label $l_m \notin L_\Psi$, the similarity was determined by examining all the codewords in B. Specifically, the similarity for superpixel $s_i$, which was initially labeled $l_m$, was defined by Eq. (10), where $\Delta H_i^k$ recorded the smallest distance between feature $f_i^k$ of the superpixel $s_i$ and the codewords in $B_{l_m}^k$, and $\Delta H_{il}^k$ was the smallest distance between feature $f_i^k$ and a particular codeword in the semantic codebook B.
Finally, for the initial semantic label of the superpixel $s_i$, the semantic probability was refined according to Eq. (11), where $|R_i|$ was the size of the semantic region of superpixel $s_i$ in the initial semantic labels, and $\beta_2$ was a damping parameter.
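The codeword lookup that underlies the similarity can be sketched as follows for a single feature channel. It returns the smallest distance between a superpixel feature and the codewords of its initial label, and falls back to all codewords when that label does not occur in the retrieval set. How these distances enter the similarity and the refined probability is not modeled here; the data layout and function names are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_codeword_distance(feature, codebook, label):
    """Smallest distance from `feature` to the relevant codewords.

    codebook: dict mapping each semantic class in the retrieval set to a
    list of codeword vectors for one feature channel.
    If `label` is present, only its codewords are searched; otherwise
    every codeword in the codebook is examined.
    """
    if label in codebook:
        pool = codebook[label]
    else:
        pool = [cw for words in codebook.values() for cw in words]
    return min(euclidean(feature, cw) for cw in pool)
```

A small distance means that the initial label of the superpixel is supported by similar-looking superpixels of the same class in the retrieval set.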

From Semantic Labeling to Segmentation
In this section, we describe image segmentation by maximizing the linear combination of the object affinity and the refined probability of the initial semantic labels [Eq. (12)]. The segmentation results with the semantic labels of sample images, shown in Fig. 4(b), also provided a bounding box indicating the possible foreground area. Finally, the Grab-Cut method 25 was adopted as a postprocessing procedure to portray the boundary of the foreground precisely. The Grab-Cut method 25 is a segmentation technique based on graph cuts. Before it is performed, a rectangular region of interest must be placed manually to indicate the location of the foreground in the image. The more precisely the rectangle encircles the object of interest, the more accurate the segmentation result is. If the rectangular region of interest is not placed well, a good segmentation result cannot be obtained, as shown in Fig. 4.
In this work, the semantic information can provide a bounding box indicating the possible foreground area. The Grab-Cut method is performed to further merge the neighboring regions assigned the same semantic label in the bounding box and, at the same time, to portray the boundary of the foreground precisely via color features in the Lab space. Moreover, after the Grab-Cut method is performed, for the areas where the new labels are not consistent with the segmentation result obtained using Eq. (12), a higher $V_{seg}$ will force these areas to retain their previous labels. In the experiments, we set the threshold to 1.4. Figure 5 shows the experimental results obtained using our method with Grab-Cut as the postprocessing procedure, and the results obtained using the Grab-Cut method with a precisely placed artificial rectangular region of interest. Compared with Fig. 5(c), our result with Grab-Cut as the postprocessing procedure is crisper, and the boundary of the foreground is portrayed more precisely, as shown in Fig. 5(d). Also in Fig. 5(d), the green leaves among the red flowers in the middle of the image are correctly labeled as background, whereas they are wrongly labeled as foreground when only the Grab-Cut method is applied, as shown in Fig. 5(e).
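Deriving the bounding box from the semantic labeling can be sketched as follows. The sketch takes a binary foreground mask and returns the tight rectangle around it in (x, y, width, height) form, which is the rectangle layout accepted by OpenCV's `cv2.grabCut`; the optional `margin` parameter and the function name are assumptions, and the Grab-Cut call itself is not shown.

```python
def foreground_bounding_box(mask, margin=0):
    """Tight (x, y, width, height) box around the nonzero pixels of `mask`.

    mask:   2-D list, nonzero where the semantic labeling marks foreground.
    margin: extra pixels added on every side, clipped to the image.
    Returns None if the mask contains no foreground pixel.
    """
    h, w = len(mask), len(mask[0])
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for x in range(w) if any(mask[y][x] for y in range(h))]
    if not ys:
        return None
    x0 = max(0, min(xs) - margin)
    y0 = max(0, min(ys) - margin)
    x1 = min(w, max(xs) + 1 + margin)
    y1 = min(h, max(ys) + 1 + margin)
    return (x0, y0, x1 - x0, y1 - y0)
```

A small positive margin leaves some background inside the rectangle, which gives a color-model-based refinement room to move the boundary outward as well as inward.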
More samples of postprocessing results are shown in Fig. 4(c). The foreground boundaries were eventually determined (Fig. 6).

Experimental Results
To verify the effectiveness of the proposed method, we conducted experiments on the MSRC 21 dataset, which contained 21 different classes with 276 training images and 256 testing images. In the experiments using the MSRC 21 dataset, the image subset D was allowed to include a maximum of 25 training images, such that enough scenes similar to the input image could be selected.
The segmentation performance was validated via the intersection-over-union score mentioned in Refs. 10 and 12:

$$S_{IOU} = \max_l \frac{1}{|I|} \sum_{i \in I} \frac{|GT_i \cap R_i^l|}{|GT_i \cup R_i^l|}, \tag{13}$$

where $GT_i$ is the ground truth and $R_i^l$ represents the region associated with the l'th class in image i.
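The score can be computed as in the following sketch, which assumes the usual definition: for each candidate class, the ratio of intersection to union between the ground-truth foreground and the region carrying that class is averaged over the image set, and the best class is kept. The list-of-lists data layout, the convention of scoring an empty union as zero, and the function names are assumptions.

```python
def iou(gt, region):
    """Intersection over union of two binary masks of equal size."""
    inter = 0
    union = 0
    for g_row, r_row in zip(gt, region):
        for g, r in zip(g_row, r_row):
            if g and r:
                inter += 1
            if g or r:
                union += 1
    return inter / union if union else 0.0  # empty union scored as 0

def s_iou(ground_truths, label_maps, labels):
    """Mean IoU over the image set, maximized over the candidate classes.

    ground_truths: list of binary foreground masks, one per image.
    label_maps:    list of predicted class-id maps, one per image.
    labels:        iterable of candidate class ids.
    """
    best = 0.0
    for label in labels:
        total = 0.0
        for gt, label_map in zip(ground_truths, label_maps):
            region = [[1 if v == label else 0 for v in row] for row in label_map]
            total += iou(gt, region)
        best = max(best, total / len(ground_truths))
    return best
```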
In the experiments, we tested all of the 14 image classes in the MSRC 21 dataset. The intersection-over-union score was adopted in order to evaluate the precision of our algorithm. Moreover, we compared our algorithm with other segmentation algorithms; 2,6,10-12,18,22 the results are listed in Table 1. A higher intersection-over-union score corresponds to higher segmentation precision. In addition, we listed the precision of our semantic labeling, named "semantic label," and the segmentation results obtained using the initial labels with Grab-Cut as postprocessing, named "initial + grab cut," in Table 1. "Ours" represents the results of our final segmentation with Grab-Cut as postprocessing. The subscript represents the rank of the segmentation accuracy among the different methods.
We also compared our results with the deep neural network technique. 22 In the experiment, we directly evaluated the released pretrained model, trained with the PASCAL-Context dataset, on the MSRC dataset and computed the segmentation accuracy of the nine overlapping classes between the two datasets. The quantitative results are listed in Table 1. Admittedly, the average accuracy of our results is lower than that of Ref. 22, which involved a large number of training samples. However, our results achieved the best average accuracy among the hand-designed feature-based methods. In addition, the segmentation accuracies of several classes, such as "cow," "dog," and "sheep," are comparable to those of Ref. 22. We also achieved a better result on class "chair" compared with Ref. 22. Details on the experimental results are listed in Table 1. Figure 7 shows segmentation results obtained using our proposed algorithm. We also compared our results with Ref. 22 visually, as shown in Figs. 7(c) and 7(f). For the results obtained using Ref. 22, only the overlapping classes between the MSRC and PASCAL-Context datasets are shown. For example, segmentation results obtained using Ref. 22 are not shown from the first row to the fourth row in Fig. 7, as the PASCAL-Context dataset does not include the classes "face," "flower," "sign," and "house," as is also reflected in Table 1.
In the experiments, some images remained very challenging. Our mechanism of semantic labeling and foreground segmentation depends on the color and SIFT features. Consequently, our method would probably fail on images in which the color or texture distribution of the foreground is similar to that of the background, as shown in Fig. 8.

Conclusion and Future Work
In this paper, an image segmentation framework based on semantic information was proposed. Unlike traditional methods based on low-level features, we adopted semantic information in order to distinguish the foreground from the background. In our study, the initial semantic labels were obtained using the nonparametric method. By searching for similar images in the training data, the input image was segmented via the combination of object affinity and semantic labels. Experimental testing using the MSRC 21 dataset demonstrated that our method performed well. In future work, segmentation of video data by means of semantic information will be investigated.

Algorithm 1 ... and obtain final subset Ψ.
Fig. 2 Definition of semantic descriptors. At each level, the semantic histograms were calculated within each grid.

Fig. 3 (a) Input image and (b) its semantic retrieval set.

Fig. 4 Segmentation results by using the Grab-Cut method 25 with respect to different artificial rectangular regions marked with green rectangles. (a) Segmentation results with respect to the precisely placed rectangle and (b) segmentation results with respect to the imperfect rectangle, which does not encircle the foreground completely.

Fig. 5 Segmentation results by using our method and by directly using the Grab-Cut method. (a) Input image, (b) ground truth, (c) segmentation results with semantic labels, (d) our segmentation result with Grab-Cut as a postprocessing procedure, and (e) result by using Grab-Cut with respect to a manually placed rectangular region of interest marked in green.

Fig. 7 Sample segmentation results. (a) and (d) Input images, (b) and (e) segmentation results by using our proposed method, and (c) and (f) segmentation results by using the deep neural network technique. 22

Table 1 Evaluation of segmentation results by intersection-over-union score.