Universal Foreground Segmentation Based on Deep Feature Fusion Network for Multi-Scene Videos

Foreground/background (fg/bg) classification is an important first step for several video analysis tasks such as people counting, activity recognition, and anomaly detection. As is the case for several other Computer Vision problems, the advent of deep Convolutional Neural Network (CNN) methods has led to major improvements in this field. However, despite their success, CNN-based methods have difficulty coping with multi-scene videos, where the scene changes multiple times along the time sequence. In this paper, we propose a deep feature fusion network based foreground segmentation method (DFFnetSeg), which is robust to both scene changes and unseen scenes compared with competitive state-of-the-art methods. At the heart of DFFnetSeg lies a fusion network that takes as input deep features extracted from a current frame, a previous frame, and a reference frame, and produces as output a segmentation mask of background and foreground objects. We show the advantages of using a fusion network and the three-frame input group in dealing with the unseen-scene and bootstrap challenges. In addition, we show that a simple reference frame updating strategy makes DFFnetSeg robust to sudden scene changes inside video sequences, and we present a motion map based post-processing method which further reduces false positives. Experimental results on test datasets generated from CDnet2014 and LASIESTA demonstrate the advantages of the DFFnetSeg method.


I. INTRODUCTION
Foreground segmentation, also known as fg/bg classification, i.e. the segmentation of frames into background and foreground pixels, is a commonly used first step for detecting regions of interest in videos; it has the same effect as the well-known background subtraction task but with a different mechanism. Foreground extraction helps video analysis methods discard irrelevant information in applications such as video surveillance [1], pose estimation [2], and face detection [3].
Traditionally, fg/bg classification methods mainly focus on static surveillance camera videos, where the background pixels depict either static regions or regions with semi-periodic motion (e.g. flowing water). However, advances in camera hardware have made surveillance cameras portable, which introduces the challenge of scene changes in surveillance videos caused by changes in camera position or location. We name these videos multi-scene surveillance videos. The multi-scene challenge is not new for the fg/bg classification task, as the twoPositionPTZCam video in the largest change detection benchmark dataset, CDnet2014 [4], is a multi-scene video. Many fg/bg classification methods [5], [6] are designed under the assumption that the camera is static, but recent methods have begun to pay attention to non-static scenes. For instance, special components to deal with the scene change problem have been proposed in traditional methods such as SubSENSE [7] and SWCD [8]. A flux tensor [9] based scene change detection method is also used by a deep learning background subtraction method [10]. However, since their performance relies on the quality of the background model, they are still hampered by the unstable background model caused by scene changes.
On the other hand, recent supervised deep learning methods [11], [12] deal with the multi-scene problem simply by single-frame foreground segmentation. They do not consider a background model and generate the foreground mask from every single frame, achieving near human-level performance that surpasses the traditional methods by a large margin. Nevertheless, in the original papers, the parameters of these models are trained and tested on one specific video or a group of videos, and the performance on unseen videos has not been evaluated. (Unseen videos are videos whose scenes have no overlap with the scenes in the training dataset.) For example, FgSegNet [11] trains one model per video with 200 frames and achieves excellent results when tested on the remaining frames of the same video, whereas, as shown in the experiment in Section IV-E, its performance drops significantly on unseen videos even with far more varied training data. As the real-world environment is generally changeable and uncontrollable, it is not possible to guarantee that a model will only encounter known scenes. Therefore, a universal foreground mask generation method which is robust to both unseen and multi-scene videos is necessary.
In this paper, we propose a deep feature fusion network based foreground segmentation method (DFFnetSeg) to tackle the problems mentioned above. DFFnetSeg takes as input a current frame, a previous frame and a reference frame, and produces as output a segmentation mask for fg/bg separation. The reference and previous frames carry long and short term information from the sequence, which enables DFFnet not only to preserve the masks of temporarily stopped objects but also to eliminate ghost masks. In addition, our model generalizes to a wide range of unseen videos (indoor, outdoor, under different weather conditions, and so on) and works stably even when both background scenes and foreground objects are entirely unseen during training. A simple Pearson correlation coefficient based reference frame updating strategy further makes DFFnetSeg robust to scene changes inside videos. As a reference frame is used instead of a background model, no extra effort is needed for background modelling, which leads to a fast response to scene changes.
The contributions of this paper are threefold: • We propose a deep feature fusion network which first compares features extracted by the Pyramid Scene Parsing Network (PSPNet) at different depth levels to generate soft motion maps, and then fuses the various levels of soft motion maps and single-frame feature maps to produce the fg/bg segmentation mask. We show that, with the help of the semantic information extracted by PSPNet, high-quality segmentation masks are achievable even without background modelling.
• We propose a new post-processing method based on region-level motion map, which eliminates the false positive classification so as to boost the foreground mask.
• We propose a simple Pearson correlation coefficient based reference frame updating strategy which is both effective and efficient.
The paper is structured as follows. Section II discusses related work. Section III presents the DFFnetSeg method in detail. Section IV describes the experiments and discusses the results. Section V concludes and discusses future work.

II. RELATED WORK
As is the case in several Computer Vision tasks, fg/bg classification methods can be classified into two categories: traditional methods and deep learning methods, the latter appearing to dominate the field in recent years.
The traditional methods generally follow the pipeline of background model construction, background model maintenance and subtraction. Classified by background model, Gaussian mixture model (GMM) based methods [9], [13], codebook-based methods [14] and sample model based methods [7], [15], [16] occupy the top places in terms of performance. Typical GMM-based methods fit a Gaussian mixture model as the probability density function describing the colour/intensity/feature distribution at each pixel. Recently, Wang et al. [9] combined a flux tensor based motion detection method with split Gaussian models to deal with more complex scene challenges, such as illumination changes and ghosting effects. Chen et al. [13] proposed a sharable GMM model which extends its robustness to camera jitter and dynamic background challenges by exploiting the spatio-temporal correlation between pixels. Original codebook based methods describe each pixel by a codebook containing a set of codewords, each representing a range of background pixel intensity values, whereas a recent variant like PAWCS [14] utilizes colour/LBSP/persistence triplets to construct a robust background word model and dynamically adjusts thresholds and learning rates for the segmentation decision and model updating rules. Different from other encoding models, sample-based models simply sample values from previous frames to maintain a background model. In such methods, background models are updated based on updating probabilities estimated spatio-temporally, and the subtraction step is implemented by comparing the number of matching samples in the background model with a threshold. Variants have been proposed to enhance robustness against different challenges. For example, SubSENSE [7] adds the spatio-temporal binary features intra-LBSP and inter-LBSP to the background model and comparison stage, and updates them adaptively by monitoring the model fidelity and local segmentation noise. Besides, WeSamBE [16] applies a weighting mechanism to both the background model and its updating. As unsupervised methods, they are not limited to a certain video and can obtain reasonable performance with default parameters. However, as is typically the case in Computer Vision, handcrafted features have difficulty coping with more complex situations.
To address this shortcoming, the first deep background subtraction method [17] was proposed with a convolutional neural network (CNN) in 2016, and since then numerous deep learning fg/bg classification methods have appeared, surpassing the traditional methods by a large margin (around 20% in terms of F-Measure). The original deep background subtraction method [17] simply uses a pre-calculated image as the background model for each sequence and employs a network similar to LeNet-5 for subtraction; it is evaluated in a scene-specific manner on selected sequences from the CDnet2014 dataset with an F-Measure of 0.9. Based on a similar network architecture, DeepBS [10] updates the background image along the sequence based on SubSENSE [7] and the flux tensor [9], and is evaluated in a relatively universal manner with an F-Measure of 0.75. Following a background image generation method similar to [17], generative adversarial network (GAN) based methods [18]-[20] have been proposed with F-Measures around 0.95. BScGAN [18] and BGAN [19] utilize the conditional GAN and Bayesian GAN respectively, and BPVGAN [20] further introduces parallel vision to BScGAN. Different from methods which need to maintain a background model, recurrent neural network based methods extend their scope to the temporal sequence. SFEN [21] first extracts semantic maps from a single frame as the input of a ConvLSTM, and an STN model [22] and CRF [23] are combined to enhance the motion robustness and spatial smoothness of the output mask. Hu et al.
[24] apply a 3D atrous convolutional network with multi-frame input before a ConvLSTM. In contrast to sequence-based methods, the foreground segmentation methods [11], [12] only consider a single frame to segment the foreground. Among them, FgSegNet [11] and its variants occupy the first several entries on CDnet2014 with F-Measures around 0.98. FgSegNet generates the foreground segmentation mask from a single frame using an encoder-decoder architecture with a triplet VGG-16 net as the encoder and a transposed convolutional neural network as the decoder. Different from all the deep learning methods mentioned above, SemanticBGS [25] combines deep learning semantic segmentation with traditional methods, without training: rules are devised to combine semantic maps from a pre-trained PSPNet [26] with the traditional methods without modifying their internal elements. As a result, SemanticBGS reduces the mean overall error rate of 34 traditional algorithms by roughly 50%.
Most deep learning methods surpass other methods in terms of the evaluation metrics, but their good performance benefits from the background scene overlap, and even foreground object overlap, between training and testing data. For example, Cascade CNN [12] manually chooses 200 frames from each video as the training set, and SFEN [21] uses the first half of each video as the training set with the rest for testing. By contrast, SemanticBGS shows its advantage as an unsupervised method in terms of universality, as it is not limited by the training and testing split. Moreover, the high-quality performance of SemanticBGS mainly benefits from the pre-trained deep semantic segmentation stage. This is because deep semantic segmentation methods [26], [27] are trained on large and varied datasets which include enough semantic classes to serve as a reference for foreground segmentation. Therefore, dynamic background such as shaking trees, and ghost masks on road regions caused by a removed car, are easy to eliminate from the semantic perspective. Building on the strengths of deep semantic segmentation, DFFnet internally combines semantic information with a deep learning method to extend the universality of deep learning methods and their robustness to challenging situations.

III. PROPOSED METHOD
Let us denote by f^(t) ∈ R^(w×h×3), t ∈ [1, T], the RGB frame at time t of an image sequence with T frames in total, where w and h are the width and height. The DFFnetSeg method aims to produce a label mask M_post^(t) ∈ R^(w×h), each entry of which denotes whether the corresponding pixel depicts a foreground object or the background, from the input group composed of the current frame f^(t), the previous frame f^(t−4) and the reference frame. The reference frame is initialized to f^(1) and updated along the sequence.
The DFFnetSeg method consists of three parts in parallel: a deep feature fusion network, a region-based motion map generator and a scene change detector, as shown in Fig. 1.
In detail, the first part is a convolutional neural network comprising two stages: a deep semantic feature extractor and a feature-fusion based foreground mask generator. In the feature extractor stage, we feed each entry of the input group into a pre-trained PSPNet to extract deep features. In the mask generator stage, the features of each entry are compared at selected depth levels. The inner-group comparisons are further fused with the image content features extracted from f^(t) at each depth level, as shown in Fig. 2. Finally, the shallow feature maps and the deep feature maps are fused to generate the mask prediction M^(t).

[Fig. 2 caption] Features at the chosen layers are extracted, given inputs including a current frame (a), a previous frame (b) and a reference frame (c), respectively. After that, for each chosen layer, a FusionNet (e) first obtains the sum of the differences of features between (a) and (b), and between (a) and (c), respectively. Then it concatenates the difference sum and the features from (a) to form a representation which carries both the content information of (a) and the motion information across frames, followed by a convolutional layer to combine this information and (optional) upsampling layers to normalize the output shape. Finally, the features from different levels are concatenated to form the final feature representation, followed by two convolutional layers to fuse local information with global information and produce the final per-pixel prediction (f).
The second part is a region-based motion map generator, which can be regarded as a post-processing step to reduce the false positives caused by semantic noise. Specifically, the false positives here are pixels classified as foreground because they belong to objects with a high prior probability of being foreground, such as humans and cars, but which are actually static. This part simply conducts a region-level comparison between f^(t), f^(t−4) and the reference frame f_ref to obtain a motion map, which indicates potential motion regions. The motion map is then used to post-process M^(t) into the final prediction M_post^(t).

A. DEEP FEATURES FUSION NETWORK
1) FEATURE EXTRACTOR
SFEN [21] and SemanticBGS [25] show that semantic information plays an important role in fg/bg classification. Specifically, naive background subtraction is prone to the ghost problem when moving objects have not been eliminated from the background model. Even when the background model is perfectly clean (without any foreground objects), the camouflage problem may occur when a foreground object shares a similar colour with the background region. However, the ghost and camouflage problems are easy to overcome if we know from semantic knowledge whether the problem region belongs to a potential moving object or not. Therefore, a semantic segmentation network comes to mind. Different from SFEN and SemanticBGS, which only use the final layers of a semantic segmentation network, DFFnet uses both shallow and deep layers of the deep semantic segmentation network PSPNet. The shallower layers capture low-level features such as edges, corners and shapes; the deeper layers capture high-level features such as semantic information.
Both of them contribute to high-quality foreground mask generation. The PSPNet we use as the feature extractor is trained on the ADE20K dataset [28], [29], because ADE20K includes various scenes and objects with viewing angles similar to surveillance videos. In terms of architecture, PSPNet consists of a ResNet50 to extract feature maps and a pyramid pooling module to generate semantic segmentation maps. The ResNet50 is slightly different from the original; its details are shown in Table 1. In the pyramid pooling module, we use average pooling with bin sizes of 1 × 1, 2 × 2, 3 × 3 and 6 × 6 respectively, and the convolutional layer produces 512 feature maps with a 1 × 1 kernel (see [26] for the detailed architecture). The final feature representation, obtained by concatenation, is followed by the convolutional layer CONV5.4 with 3 × 3 kernel and 512 maps. Batch normalization is applied after each convolutional layer, and the activation function is ReLU.
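To make the pooling scheme concrete, the following is a minimal NumPy sketch of PSPNet-style pyramid pooling on a single (H, W, C) feature map, assuming the feature map's sides are divisible by each bin size; the 1 × 1 convolutions and batch normalization of the real module are omitted for brevity.

```python
import numpy as np

def pyramid_pooling(feature_map, bin_sizes=(1, 2, 3, 6)):
    """Sketch of PSPNet-style pyramid pooling on a (H, W, C) feature map.

    For each bin size b, average-pool the map into b x b cells, upsample
    back to (H, W) with nearest neighbour, and concatenate all levels
    with the original map along the channel axis.
    """
    h, w, c = feature_map.shape
    pooled_levels = [feature_map]
    for b in bin_sizes:
        # average-pool into b x b cells
        pooled = np.zeros((b, b, c))
        for i in range(b):
            for j in range(b):
                cell = feature_map[i * h // b:(i + 1) * h // b,
                                   j * w // b:(j + 1) * w // b]
                pooled[i, j] = cell.mean(axis=(0, 1))
        # nearest-neighbour upsample back to (h, w)
        rows = np.arange(h) * b // h
        cols = np.arange(w) * b // w
        pooled_levels.append(pooled[rows][:, cols])
    return np.concatenate(pooled_levels, axis=-1)
```

With the four bin sizes above, the output has five times the input channel count (the input plus four pooled levels), which is why the module is followed by a channel-reducing convolution in the real network.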

2) FUSION NETWORK
The fusion network conducts two kinds of fusion. Firstly, at each feature level, it fuses image content with the differences between input entries. Secondly, across feature levels, it fuses local information with global information and low-level features with high-level features.
Specifically, the current frame f^(t), the previous frame f^(t−4) and the reference frame are fed to PSPNet respectively to extract the corresponding features. The last layer of each scale of PSPNet is chosen as input to the fusion network, namely CONV1.3, CONV2.3 and CONV5.4. The RGB images themselves are also treated as a grouped entry of the fusion network. Thus, the fusion network considers four feature levels in all. These four levels span four scales, which contain varying degrees of local and global information, and four depths, which contain varying degrees of shape and semantic information.
Let us denote by A_l(f^(t)), l ∈ {1, 2, 3, 4}, the features at level l of input f^(t). We define the soft motion map d^l at level l as follows:

d^l_ijk = |A_l(f^(t))_ijk − A_l(f^(t−4))_ijk| + |A_l(f^(t))_ijk − A_l(f_ref)_ijk|,   (1)

where i, j, k denote the index of an element in each dimension. d^l_ijk is relatively high when the features of the current frame f^(t) differ from those of both the previous frame and the reference frame, indicating that location (i, j) may contain motion. The soft motion map can also be regarded as the comparison of current frame information with short term and long term information. The previous frame f^(t−4), which has only a short time interval from the current frame, carries the short term information, while the reference frame carries the long term information because it is only updated when a large scene change is detected. The advantage of short term information for foreground object detection is robustness to continuously changing background, because closer frames are likely to share a more similar background. However, when a foreground object temporarily stops for a while, it may be lost if only short term information is considered. Then, the long term information becomes an important reference.
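The soft motion map of Eq. (1) reduces to an element-wise operation on the three feature tensors of one level. A minimal sketch, assuming each feature is an (H, W, C) array from the same PSPNet layer and reading the "sum of differences" as a sum of absolute differences:

```python
import numpy as np

def soft_motion_map(feat_cur, feat_prev, feat_ref):
    """Level-l soft motion map d^l: element-wise sum of the absolute
    feature differences between the current and previous frames and
    between the current and reference frames (our reading of Eq. (1))."""
    return np.abs(feat_cur - feat_prev) + np.abs(feat_cur - feat_ref)
```

The map is large only where the current frame's features disagree with both the short term (previous) and long term (reference) evidence.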
The soft motion maps are concatenated with the features of f^(t) and fed to the CONV6.x layers to fuse motion information with frame content. All the CONV6.x layers generate 32 feature maps with a 3 × 3 kernel. For levels 2 to 4, spatial upsampling is used to normalize the output back to the same size as level 1.
The feature maps from the 4 levels are further concatenated and fed to CONV7, which fuses low-level features with semantic information and local information with global information, generating 32 feature maps with a 3 × 3 kernel. Finally, CONV8 uses a 1 × 1 convolutional layer with softmax to produce the foreground object mask M^(t) ∈ {0, 1}^(w×h) (foreground = 1, background = 0).
To optimize the weights in the fusion network, the loss is the cross-entropy loss function:

L = −Σ_i [C(i) log p(i) + (1 − C(i)) log(1 − p(i))],   (2)

where C(i) is the ground truth label and p(i) is the output of the network at pixel location i.
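As a sanity check on the objective, a NumPy sketch of a per-pixel binary cross-entropy (averaged over the mask here; the paper's exact implementation, e.g. sum vs. mean over pixels and the two-class softmax form, is an assumption):

```python
import numpy as np

def pixel_cross_entropy(p, c, eps=1e-7):
    """Binary cross-entropy averaged over the mask: p is the network's
    per-pixel foreground probability, c the ground-truth label
    (1 = foreground, 0 = background). Probabilities are clipped for
    numerical stability."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(c * np.log(p) + (1 - c) * np.log(1 - p)))
```

For a uniformly uncertain prediction p = 0.5 the loss equals log 2 per pixel, and it approaches 0 as the prediction matches the label.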

B. REGION-BASED MOTION MAP
Different from the soft motion map in the fusion network, this part proposes a hard motion map which assigns a binary value to each region to denote motion, by comparing the differences between f^(t), f^(t−4) and f_ref.
More specifically, given the notations f^(t) = (a_ij), f^(t−4) = (b_ij) and f_ref = (c_ij), we first define the motion masks D_pre = (p_ij) and D_ref = (r_ij) as follows:

p_ij = 1 if |a_ij − b_ij| > θ, else 0;  r_ij = 1 if |a_ij − c_ij| > θ, else 0,   (3)

where i and j denote the location of pixels. A pixel is regarded as moving when the difference is larger than θ. Note that here we only consider the greyscale of images.

[Fig. 3 caption] An example of the motion map based post-processing. The current frame f^(19), the previous frame f^(15) and the reference frame f_ref are the 19th, 15th and 1st frames of the selected clip of the I_SI_01 sequence of the LASIESTA dataset respectively. M^(19) is the raw foreground mask estimated by the fusion network. M_pix and M_reg are the pixel-level motion mask (with θ = 20) and the region-based motion map (with θ = 20, β = 5 and N = 32) respectively. M_post_pix and M_post_reg are M^(19) post-processed by M_pix and M_reg respectively.
The aim of the hard motion map is to remove the false positives of M^(t) caused by potentially moving objects. Therefore, we propose it as a greedy mask, which activates an entry even on a relatively weak hint of motion. In detail, the bitwise OR operator is used to obtain the pixel-level motion mask M_pix = (m_pix_ij) as follows:

m_pix_ij = p_ij ∨ r_ij.   (4)

Next, the pixel-level motion mask is further transformed into the region-based motion map, which can reduce the effect of the local camouflage problem (an example is shown in Fig. 3). We divide the whole motion mask into N × N regions without overlapping (the edges are padded as needed), denoted by m_k ⊆ M_pix (with ∪_k m_k = M_pix). Then, the region-based map M_reg = (m_reg_ij) is obtained from the quantity of motion in each region as follows:

m_reg_ij = 1 if Σ_{(i',j') ∈ m_k} m_pix_i'j' > β, else 0, where m_k is the region containing (i, j).   (5)

A region entry of M_reg is activated when the quantity of motion in that region is larger than β.
The region-based map is finally used to post-process the mask estimated by the fusion network. The final prediction M_post = (m_post_ij) is defined by

m_post_ij = m_ij ∧ m_reg_ij,   (6)

where M = (m_ij) denotes the output of the fusion network.
The bitwise AND operator is used here to restrict the final prediction. As shown in Fig. 3, the pixel-level motion mask cannot detect sufficient motion pixels because of the camouflage phenomenon, which leads to holes inside the foreground object in the final prediction, whereas the region-level motion map handles this well. A ghost mask exists in both the pixel-level mask and the region-based map because they are defined greedily, but the ghost has little effect on the final prediction. In addition, as we can see in M^(19), there is a small region of false positives, misclassified by the fusion network because it corresponds to a clothes region in the frame, which has a high probability of being a moving object (as clothes are usually worn by humans), but it is easily eliminated by the motion map.
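The whole post-processing pipeline of Section III-B can be sketched in a few lines of NumPy. This reads N as the side length of each square region in pixels and assumes the image sides are multiples of N (the paper pads the edges instead); both are labeled assumptions, not the paper's exact implementation.

```python
import numpy as np

def region_motion_postprocess(frame, prev, ref, mask, theta=20, N=32, beta=5):
    """Post-process a fusion-network mask with the region-based motion map.
    frame/prev/ref are greyscale images; mask is the raw fg/bg prediction."""
    # pixel-level motion mask: a pixel moves if either difference exceeds theta
    d_pre = np.abs(frame.astype(int) - prev.astype(int)) > theta
    d_ref = np.abs(frame.astype(int) - ref.astype(int)) > theta
    m_pix = d_pre | d_ref
    # region-based motion map: a whole N x N block is active when it
    # contains more than beta moving pixels
    h, w = m_pix.shape
    m_reg = np.zeros_like(m_pix)
    for i in range(0, h, N):
        for j in range(0, w, N):
            if m_pix[i:i + N, j:j + N].sum() > beta:
                m_reg[i:i + N, j:j + N] = True
    # final prediction: bitwise AND with the network mask
    return mask.astype(bool) & m_reg
```

Because the region map is greedy (a few moving pixels activate a whole block), it tolerates camouflage-induced holes in the pixel-level mask while still suppressing static false positives elsewhere.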

C. SCENE CHANGE DETECTOR
If the reference frame in the input group does not change over time, it loses its advantage and can even have the opposite effect when a dramatic background change, caused by a camera position or location change, happens. To tackle this problem, a simple Pearson correlation coefficient based scene change detector is proposed to decide whether to update the reference frame at time t. The Pearson correlation coefficient is defined as (7), where x, y ∈ R^(n×1) and n ∈ N+.

Col(x, y) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_{i=1}^{n} (x_i − x̄)²) · sqrt(Σ_{i=1}^{n} (y_i − ȳ)²) )   (7)
Then, Col_1^t = Col(Ds(f^(t)), Ds(f^(t−4))) and Col_2^t = Col(Ds(f^(t)), Ds(f_ref)) are the correlation coefficients between the current frame and the previous frame, and between the current frame and the reference frame, respectively, where Ds(·) denotes spatially downsampling an image to 32 × 32 and flattening it. The 32 × 32 size both reduces the computational load and captures enough information to characterize a scene. Intuitively, the correlation coefficient denotes the similarity between two images. Therefore, based on the coefficient, the reference frame updating strategy is as follows: • We define an indicator C_t ∈ {0, 1} which is 1 when Col_1^t < γ and Col_2^t < γ at time t, because the correlation is generally regarded as weak enough to judge that the scenes are different when the coefficient is lower than γ; when C_t = 1, the reference frame is updated to the current frame. Updating the reference frame when the scene changes enables DFFnetSeg to robustly handle multi-scene videos captured by multi-position or multi-location cameras.
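The detector above fits in a few lines. A minimal sketch, using nearest-neighbour downsampling as a stand-in for the paper's Ds(·) and γ = 0.7 (the value found best in Section IV-C):

```python
import numpy as np

GAMMA = 0.7  # similarity threshold from Section IV-C

def downsample_flatten(img, size=32):
    """Nearest-neighbour downsample a greyscale image to size x size
    and flatten it; a simple stand-in for the paper's Ds(.) operator."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols].ravel().astype(float)

def scene_changed(cur, prev, ref, gamma=GAMMA):
    """Update indicator C_t: 1 when the current frame is weakly
    correlated with both the previous and the reference frame."""
    x = downsample_flatten(cur)
    col1 = np.corrcoef(x, downsample_flatten(prev))[0, 1]
    col2 = np.corrcoef(x, downsample_flatten(ref))[0, 1]
    return int(col1 < gamma and col2 < gamma)
```

Requiring both coefficients to fall below γ prevents a single noisy frame from triggering an update, since a genuine scene change makes the current frame dissimilar to the short term and long term references alike.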

IV. EXPERIMENTS AND RESULTS
A. DATASETS
We evaluate the performance of the DFFnetSeg method on two datasets, CDnet2014 [4] and LASIESTA [30], to sufficiently test its universality. The CDnet2014 dataset is the largest change detection benchmark, including the evaluation metrics, a ranking of state-of-the-art methods and the pixel-level ground truth of 53 sequences. These sequences from different scenes are separated into 11 challenge categories: bad weather (BW), baseline (Ba), camera jitter (CJ), dynamic background (DB), intermittent object motion (IOM), low frame rate (LF), night video (NV), shadow (Sh), thermal camera (TH), air turbulence (TB), and pan-tilt-zoom camera (PTZ); each category includes 4 to 6 sequences. In our experiments, we do not consider continuous camera motion or air turbulence videos, because our DFFnetSeg method focuses on multi-scene videos captured by cameras with intermittent position changes, and air turbulence is outside the scope of the scene domain considered in this paper. Therefore, we only include the ''twoPositionPTZCam'' sequence in the PTZ category and exclude all sequences in the TB category.
In addition, 7 objective evaluation metrics provided by CDnet2014 are used to evaluate the performance of algorithms quantitatively:
• Re (Recall): TP / (TP + FN).
• Precision: TP / (TP + FP).
While each metric gives a different insight into the results, the F-Measure is the most commonly used one. Therefore, we mainly use F-Measure to evaluate DFFnetSeg.
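From the pixel counts, recall, precision and the F-Measure follow directly; a minimal sketch (the remaining CDnet2014 metrics are omitted here):

```python
def segmentation_metrics(tp, fp, fn):
    """Recall, precision and F-Measure from pixel counts, as defined
    by the CDnet2014 benchmark."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure
```

The F-Measure is the harmonic mean of precision and recall, which is why it is the single number most commonly compared across methods.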
The LASIESTA dataset is composed of 17 real indoor and 22 outdoor sequences organized in 12 categories: simple sequences (SI), camouflage (CA), occlusions (OC), illumination changes (IL), modified background (MB), bootstrap (BS), moving camera (MC), simulated motion (SM), cloudy conditions (CL), rainy conditions (RA), snowy conditions (SN), and sunny conditions (SU). As with CDnet2014, we exclude the MC and SM categories, because those camera motion patterns are outside the scope of DFFnetSeg.

B. TRAINING AND TESTING SET
Many state-of-the-art deep learning fg/bg classification algorithms generate the training set by splitting each sequence in half, at 80%, or at a selected number of frames, with the remaining frames as the testing set, so that the same background scenes and foreground objects can coexist in both the training and testing sets. In our experiments, by contrast, we generate the training and testing sets under the principle that the background scenes and foreground objects in the testing set have no overlap with those in the training set. In detail, we use 1 to 2 sequences from each category as testing sequences, namely backdoor, canoe, dinningRoom, overpass, park, parking, pedestrians, peopleInShade, snowFall, streetCornerAtNight, traffic, tramStation, turnpike_0_5fps, twoPositionPTZCam, and winterDriveway from CDnet2014, and I_BS_01, I_CA_01, I_CA_02, I_IL_01, I_MB_01, I_MB_02, I_OC_01, I_SI_01, I_SI_02, O_CL_01, O_CL_02, O_RA_02, O_SN_01, and O_SU_01 from LASIESTA, with the remainder as training sequences. For training, we randomly choose around 1000 frames from each training sequence (when fewer than 1000 frames with ground truth are available, we use all of them). For testing, we randomly select one clip from each sequence, and from these clips we construct a testing set with two parts. The first part consists of single test clips as sequences, mainly to test the generality of the algorithms. In the second part, we simulate surveillance videos captured by a camera whose position or location changes 1, 2 or 3 times by randomly concatenating 2, 3 and 4 test clips together, mainly to test the robustness of the algorithms when scene changes occur within sequences. In terms of implementation details, we have 30 test clips in all, which is not exactly divisible by 4, so 2 remaining clips are left out when concatenating 4 clips.
The details of the clips selected from the CDnet2014 dataset are shown in Table 2. 500 consecutive frames containing foreground are randomly chosen from each sequence when the sequence is longer than 500 frames; otherwise, the whole sequence is used as the testing clip. In most circumstances, 500 frames are enough for most background initialization methods to estimate the background model, and also short enough to avoid long static scenes, which makes them suitable for evaluating the robustness of algorithms to scene changes.
The details of the clips selected from LASIESTA are shown in Table 3. A frame with a sufficiently large foreground object is chosen as the first frame of each sequence, which introduces the common bootstrap challenge for background subtraction. Unlike single-scene videos, which generally have plenty of history frames from which to obtain a relatively clean background (without foreground), multi-scene videos rarely provide informative history frames for background extraction. Therefore, a multi-scene foreground object detection method should be able to handle the bootstrap challenge.

C. RESULTS
To evaluate the effectiveness of the DFFnetSeg method, we compare it with the following state-of-the-art algorithms: • SubSENSE [7], PAWCS [14] and SWCD [8], the top traditional background subtraction methods on CDnet2014 with publicly available source code.
• FgSegNet [11], the top foreground segmentation method on CDnet2014 with source code open to the public.
• BScGAN [18], a deep background subtraction algorithm based on conditional generative adversarial networks which reports a top result on the CDnet2014 dataset in the original paper.
All methods are evaluated on the two datasets mentioned above. To ensure a fair comparison between the supervised models, we use the same training data as our model to train them, and the hyper-parameters are the same as those described in the source code and papers. A pre-trained model is also used to initialize the model parameters when such initialization is mentioned in the original paper.
In order to assess DFFnetSeg over a large set of parameters for the region-based motion map stage described in Section III-B, a foreground mask estimate was generated for all the single test clips using each combination of θ ∈ {20, 50, 80, 110}, N ∈ {8, 16, 32, 64} and β ∈ {5, 10, 20, 40}. The three parameters of the DFFnetSeg method (θ, N, β) provide enough flexibility to boost the foreground mask for various input video sequences. The raw result of the fusion network described in Section III-A has an average F-Measure of 0.8448. When we use the pixel-level motion map (N = 1, β not considered), the best result over the parameters above is worse than the raw one, with an average F-Measure of 0.84 (θ = 20). When (θ, N, β) = (20, 16, 20), the region-based motion map achieves the largest boost, with an F-Measure of 0.884. In detail, when the parameters are within the ranges shown in Table 4, the region-based motion map boosts the foreground mask to varying degrees; otherwise, it harms the final prediction. The table shows that θ = 110 is too large to capture the motion pixels, and that the greater θ is, the greater N and the smaller β must be to extract a good enough motion map.
To choose the best parameter for the scene change detector, we test all the test sequences, including the single clips and the concatenated sequences, using γ ∈ {0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, as shown in Table 5. Since γ thresholds the level of similarity between frames, a higher γ makes the reference frame updating more sensitive to scene differences, and vice versa. As shown in Table 5, γ = 0.7 yields the best performance; when γ becomes higher, the result deteriorates because the scene change detector becomes so sensitive that it mistakenly reports a scene change when a large foreground object is moving.
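The paper measures frame similarity with the Pearson correlation coefficient; a minimal sketch of such a detector is shown below. The function names are ours, and the exact form in which the coefficient is compared against γ is an assumption based on the description above.

```python
import numpy as np

def frame_similarity(frame_a, frame_b):
    """Pearson correlation coefficient between two grayscale frames,
    used as a scene-similarity score in [-1, 1]."""
    a = frame_a.astype(np.float64).ravel()
    b = frame_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0:
        return 1.0  # two constant frames: treat as identical
    return float((a * b).sum() / denom)

def scene_changed(curr, ref, gamma=0.7):
    """Flag a potential scene change when the similarity between the
    current frame and the reference frame drops below gamma."""
    return frame_similarity(curr, ref) < gamma
```

Under this reading, raising γ makes the detector fire on smaller drops in similarity, which matches the observed over-sensitivity to large moving foreground objects.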

D. IMPLEMENTATION DETAILS
In our experiments, only the fusion network needs to be trained in a supervised way on labelled change detection data, as described in Section III-A. The PSPNet is pre-trained on ADE20K as in [26] and no parameter inside the PSPNet is fine-tuned. Since the purpose of the fusion network is to generate the foreground object mask based on the frame differences and the current frame content, the reference frame is fixed to f(1) during training; the components proposed in Section III-B and Section III-C are used only at test time. Before the frames are fed to PSPNet, they are resized to 473 × 473 and mean-subtracted as preprocessing. The initial learning rate is 10⁻⁴ and we use Adam for optimisation.
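A sketch of the preprocessing step might look like the following. The paper specifies only resizing to 473 × 473 and mean subtraction; the nearest-neighbour resize and the ImageNet-style channel means below are our assumptions (PSPNet implementations commonly use these means, but the exact values used here are not stated).

```python
import numpy as np

# ImageNet-style BGR channel means commonly used with PSPNet backbones;
# the exact values used in the paper are an assumption.
MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def preprocess(frame, size=473):
    """Nearest-neighbour resize to size x size and per-channel mean
    subtraction, mirroring the preprocessing described for PSPNet."""
    h, w = frame.shape[:2]
    ys = np.arange(size) * h // size  # source row for each output row
    xs = np.arange(size) * w // size  # source column for each output column
    resized = frame[ys][:, xs].astype(np.float32)
    return resized - MEAN_BGR
```

In practice a bilinear resize (e.g. via an image library) would be used instead of the index-based nearest-neighbour shown here; the sketch keeps the example dependency-free.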
Our experiments are implemented in the TensorFlow framework on a single NVIDIA GeForce GTX 1080 Ti GPU. Training completed in about 10 hours over 5 epochs.

E. COMPARISON
We compare the performance of the methods on the test data with and without halfway scene changes separately. The performance on sequences without halfway scene changes shows the universality of the methods under different challenges, while the degree to which performance decays when a scene change is included in the test videos shows their robustness to the halfway scene change challenge. As shown in Table 6, SubSENSE, PAWCS, SWCD and BScGAN perform similarly, whereas DFFnetSeg dramatically outperforms them with a much higher F-measure. The performance of FgSegNet is extremely poor because it was originally proposed as a scene-specific method that predicts the foreground segmentation from a single frame, which limits its universality. On the multi-scene test, as shown in Table 7, the performance of SubSENSE, PAWCS, SWCD and BScGAN decreases significantly, especially PAWCS, whereas DFFnetSeg stays steady with only a 0.2% F-measure drop. By contrast, the performance of FgSegNet does not change, because it operates on a single frame and is therefore unaffected by scene changes; its slight increase in F-measure is actually caused by the missed video clips mentioned in Section IV-B. Apart from DFFnetSeg, SWCD and BScGAN are the more robust methods, with only 8.5% and 10.2% F-measure drops respectively, and BScGAN achieves the relatively better F-measure when the scene change happens, benefiting from its choice of background modelling method and its GAN architecture.
More details are shown in the sampled visualization in Fig. 4. We choose sequences from different typical challenges for comparison (category details in Table 2 and Table 3). As the figure shows, DFFnetSeg outperforms the others in almost all samples, in terms of both accurate classification and clean edges. In addition, DFFnetSeg still performs well even when the scene changes halfway through the sequence, whereas the test data appear quite challenging for the other methods. Since FgSegNet considers only a single frame for foreground segmentation, it relies on object detection rather than motion detection; consequently, its output masks contain a large number of false positives caused by wrong object classification, but they are entirely unaffected by the scene change. The remaining four methods are background subtraction methods and all need to construct a background model, so the scene change greatly impacts their performance. We therefore discuss the results of these four methods before and after the scene change separately.
Firstly, we analyze the performance before the scene change happens. Note that the performance of the existing methods differs from that shown on the benchmark website because, in their experiments, the beginning frame of each video differs from the one in ours. In our experiment, most beginning frames contain foreground objects, which is a more common situation in practice and more challenging than first frames without foreground objects. For example, the 2nd and 3rd rows in the figure are frames from the same sequence, but the performance of SubSENSE, PAWCS and SWCD in the 3rd row is much better than in the 2nd row, indicating that those methods take time to construct fine and stable background models. This aligns with the mechanism of most background subtraction methods, which generally need video clips of varying length to construct sufficiently stable background models. Except for FgSegNet, the state-of-the-art methods are competitive with each other; for example, BScGAN performs better on winterDriveway whereas PAWCS performs better on overpass. Although peopleInShade belongs to the Sh category, the poor results are actually caused by a stopped foreground object, which has been stationary since the 34th frame of the test clip (the current frame is the 137th of that clip). In conclusion, for sequences without scene changes, the qualitative results align well with the quantitative results: SubSENSE, PAWCS, SWCD, and BScGAN perform similarly.
Secondly, we analyze the performance after the scene change happens. PAWCS is clearly not as good at handling the scene change as the others, as shown by its large-scale false positives caused by a poor background model. The other methods still take time to reconstruct the background model, and their performance is acceptable once a proper background model has been rebuilt. From the perspective of visual evaluation, BScGAN performs better than the other methods, benefiting from both the background modelling method it utilizes and its network's robustness to background noise.
Admittedly, qualitative evaluation is limited by sampling frames from sequences, because the performance of the methods varies considerably over time. Different methods need different numbers of frames to reconstruct a proper background model, which can be understood as the speed of background model reconstruction and cannot be shown by simple sampling. Quantitative evaluation, however, properly captures this time-varying behaviour. Therefore, a high quantitative result reflects not only fine classification but also a fast response to scene changes.
In conclusion, DFFnetSeg outperforms the state-of-the-art methods by a large margin in both quantitative and qualitative evaluation.

F. ABLATION STUDIES
In this subsection, we justify the design decisions made in DFFnetSeg by conducting a series of ablation tests. In particular, we evaluate DFFnetSeg by measuring the effect of removing individual components on the foreground mask generation task. The results in Table 8 are obtained with parameters (θ, N, β, γ) = (20, 16, 20, 0.7). The main evaluation metric, F-measure, increases as the proposed components are added. Table 8 shows that, for the single clips, the motion map has a strongly positive impact on performance, increasing the F-measure from 0.8449 to 0.884, because it reduces the false positives caused by semantic noise; however, it demands that the reference frame and the previous frame come from the same scene as the current frame. Therefore, without the reference frame updating, the motion map contributes almost nothing to the foreground mask when a scene change happens in the sequence. On the other hand, the reference frame updating contributes greatly on the multi-scene sequences (F-measure from 0.6912 to 0.8557) but rarely on the single clips, because the updating strategy is designed to handle multi-scene sequences. Besides, when no scene change happens in the sequence, our fusion network also performs well even with a noisy frame (a frame containing foreground objects) as the reference frame. Note that, for multi-scene sequences, adding the motion map alone does not boost our performance, but combining the motion map with the updating strategy improves it further.

V. CONCLUSION
We propose a robust foreground segmentation approach based on a deep feature fusion network, using features extracted from the semantic segmentation network PSPNet in a comparison-and-fusion architecture. In contrast to other semantics-based background subtraction methods, our fusion network learns to combine the semantic information of the current frame with the soft motion map extracted from the current frame, the previous frame, and the reference frame. In contrast to other deep learning methods, DFFnetSeg generates high-quality foreground masks not only on unseen videos but also on multi-scene videos.
Since the DFFnetSeg method is designed for surveillance camera videos whose position or viewpoint changes, and takes advantage of semantic information to obtain high-quality foreground masks, it has the potential to be extended in future to videos from continuously moving surveillance cameras.

FIGURE 1 .
FIGURE 1. An illustration of the pipeline of the DFFnetSeg method. Given a group of input images consisting of a current frame (a), a previous frame (b) and a reference frame (c), we use three parts in parallel, consisting of a deep feature fusion network (d), a region-based motion map generator (e) and a scene change detector (f), to extract the foreground mask prediction (g), generate the motion map (h) and update the reference frame (c), respectively. Finally, we use the motion map (h) to boost the final prediction (i) via the bitwise AND operator.

FIGURE 2 .
FIGURE 2. Architecture of our proposed DFFnet. Firstly, given inputs consisting of a current frame (a), a previous frame (b) and a reference frame (c), features are extracted from chosen layers of PSPNet (d), which consists of ResNet (d.1) and the Pyramid Pooling Module (d.2). After that, for each chosen layer, a FusionNet (e) first obtains the sum of the feature differences between (a) and (b) and between (a) and (c). It then concatenates this difference sum with the features from (a) to form a representation that carries both the content information of (a) and the motion information across frames, followed by a convolutional layer to combine this information and (optional) upsampling layers to normalize the output shape. Finally, the features from different levels are concatenated to form the final feature representation, followed by two convolutional layers that fuse local information with global information and produce the final per-pixel prediction (f).
third part is a scene change detector, which utilizes the Pearson correlation coefficient to measure the degree of scene change and updates the reference frame when a large-scale scene change is detected. These three parts are discussed in detail in the following.

A. NETWORK ARCHITECTURE
In the architecture of our DFFnet shown in Fig. 2, the feature extractor and the fusion network are shown on white and cream-coloured backgrounds, respectively.

• We update the reference frame as f_ref = f(t−4) and reset C_j = 0 (j ∈ [t−4, t]) only when three requirements are fulfilled: a) C_{t−4} = 1, b) Σ_{i=1}^{3} C_{t−i} > 1, and c) Col_{t} > γ. The rule first detects a potential scene change signal, then confirms the scene change on neighbouring frames to tolerate potentially noisy detections, and finally updates the reference frame with the farthest same-scene frame so as to preserve long-term information.
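The three conditions above can be sketched as a simple predicate. This is our reading of the rule: `change_flags` holds the 0/1 scene-change indicators [C_{t−4}, C_{t−3}, C_{t−2}, C_{t−1}], and `col_t` is the current similarity score compared against γ; the function name and argument layout are our own.

```python
def should_update_reference(change_flags, col_t, gamma=0.7):
    """Check the three conditions for updating the reference frame.

    change_flags: [C_{t-4}, C_{t-3}, C_{t-2}, C_{t-1}], 0/1 indicators.
    The update fires only when a change was first flagged four frames
    ago (a), confirmed by at least two of the three following frames
    (b), and the current similarity score still exceeds gamma (c).
    """
    c_t4, c_t3, c_t2, c_t1 = change_flags
    return c_t4 == 1 and (c_t3 + c_t2 + c_t1) > 1 and col_t > gamma
```

When the predicate holds, the reference frame is replaced by f(t−4), the farthest frame guaranteed to belong to the new scene, and all flags C_j for j ∈ [t−4, t] are reset to 0.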

FIGURE 4 .
FIGURE 4. Qualitative comparison of the 6 algorithms. The first 6 rows show results before the scene changes, while the last 6 rows show results after the scene changes. The number after # denotes the index of the frame in the original sequence from the dataset; the corresponding index of a frame within our clips can be obtained by subtracting the beginning index. For example, the corresponding index for the first row is 88.

TABLE 1 .
The details of the architecture of the ResNet part of PSPNet.

TABLE 2 .
Scenes and frame indices used in testing on CDnet2014 dataset.

TABLE 3 .
Scenes and frame indices used in testing on LASIESTA dataset.

TABLE 4 .
Average F-measure on the single test clips with different parameters range for the motion map.

TABLE 5 .
Average F-measure on all the test sequences with different γ for the reference frame updating.

TABLE 6 .
Evaluation values of models on single clips part of the test dataset.

TABLE 7 .
Evaluation values of models on multi-scene sequences part of the test dataset.

TABLE 8 .
Evaluation values of the model if a component is removed.