An Effective Framework Using Spatial Correlation and Extreme Learning Machine for Moving Cast Shadow Detection

Moving cast shadows of moving objects significantly degrade the performance of many high-level computer vision applications, such as object tracking, object classification, behavior recognition and scene interpretation. Because cast shadows share similar motion characteristics with their objects, moving cast shadow detection remains challenging. In this paper, we present a novel moving cast-shadow detection framework based on the extreme learning machine (ELM) to efficiently distinguish shadow points from the foreground object. First, according to the physical model of shadows, pixel-level features of different channels in different color spaces and region-level features derived from the spatial correlations of neighboring pixels are extracted from the foreground. Second, an ELM-based classification model is trained on labelled shadow and non-shadow points, which can rapidly determine whether points in a new input belong to shadows. Finally, to guarantee the integrity of shadows and objects for further image processing, a simple post-processing procedure is designed to refine the results, which also drastically improves the accuracy of moving shadow detection. Extensive experiments on two common public datasets covering 13 different scenes demonstrate that the performance of the proposed framework is superior to representative state-of-the-art methods.


Introduction
As a fundamental procedure in many high-level computer vision and image-processing applications, moving cast shadow detection has drawn increasing attention in recent years. This is because cast shadows have properties similar to those of their corresponding moving objects, which may cause misclassification in object detection and further degrade the performance of object classification [1], object tracking [2], behavior analysis [3] and scene interpretation [4]. Therefore, it is important to develop an effective moving cast-shadow detection method to separate shadows from the foreground.
Over the past decades, numerous methods have been studied and surveyed in the literature (e.g., [5,6]). Prati et al. [5] first divided shadow detection methods into deterministic and statistical methods, depending on whether the decision process utilizes uncertainty. Subsequently, Sanin et al. [6] further categorized shadow detection methods into four categories according to the type of shadow features: chromaticity-based, physical-based, geometry-based and texture-based methods. Regardless of these categories, the process of shadow detection mainly consists of two stages: feature extraction and classification. Commonly applied shadow features are divided into pixel-level features and region-level features based on shadow properties [4]. In particular, a pixel-level feature is a value obtained from different channels in different color spaces, while a region-level feature is extracted statistically in terms of the spatial correlations of neighboring pixels.
A significant property of a shadow is that it is darker than the corresponding background while maintaining nearly constant chromaticity. Based on this property, the most widely applied pixel-level features are intensity and chromaticity. Cucchiara et al. [7] first proposed a chromaticity-based method that detects shadows by calculating the change rates of the three components in HSV color space. Thereafter, many color space-based methods were developed to exploit this property for shadow detection, such as RGB [8,9], c1c2c3 [10], YCbCr [11], normalized RGB [12], YUV [13], HSI [14] or combinations of them [15,16]. However, intensity and chromaticity features alone may not accurately separate shadows from foregrounds when both moving shadows and moving objects have dark colors. Moreover, pixel-based methods are sensitive to noise. To address these problems, the other property of a shadow is exploited for shadow detection: the texture of the shadow is assumed to be similar to that of the background but different from that of the foreground. According to this texture consistency, the region-level feature is described by the spatial correlations of neighboring pixels, which is also called the statistical feature. Local texture descriptors, which are robust to noise and illumination variation, have been widely adopted to detect shadows, such as the Gabor function [17], scale-invariant local ternary patterns (SILTP) [18], the discrete wavelet transform (DWT) [19], gradient information [20], non-linear tone-mapping (NTM) [21] or their combinations [22]. These methods fail when the texture is similar between the background and the foreground.
Obviously, using a single property cannot detect shadows completely. Therefore, moving shadow detection methods that combine multiple features based on these two properties have attracted extensive attention [23][24][25][26][27]. In particular, Liu et al. [23] utilized pixel-level, local region-level and global-level information to remove shadows. Tang et al. [25] extracted the grey level, color composition and gradient information of pixels in a video frame to construct a respective shadow mask for the same frame, and then discriminated shadow pixels by minimizing the likelihood of missing a shadow pixel in each of the shadow masks. Wang et al. [26] jointly used color, texture and gradient by exploiting local neighboring information and designed an adaptive mechanism to estimate threshold parameters for detecting shadows. Gomes et al. [27] integrated chromatic and gradient information with image hypergraph segmentation and used a stochastic majority voting scheme to identify shadow regions. Without loss of generality, after extracting features, the above methods detect shadows through parameter assumptions and threshold tuning in the classification stage. However, it is difficult to find appropriate parameter thresholds for diverse environments such as indoor and outdoor scenes.
Recently, learning-based methods have become popular in shadow detection [28][29][30][31][32][33][34]; they apply a classifier constructed from various shadow features to discriminate shadows from the foreground. For example, Joshi et al. [30] extracted a set of features derived from characteristic differences in color and edges, then employed support vector machines (SVM) and a co-training algorithm for classification. Dai et al. [31] extracted various features based on illumination, color and texture, then employed partial least squares (PLS) and logistic discrimination (LD) to separate shadows from their moving objects. Russell et al. [33] introduced binary patterns of local colour constancy (BPLCC), light-based gradient matching (LGM) and the intensity-reduction histogram (IRH) to construct two over-complete dictionaries from image patches, then applied a sparse representation classifier to discriminate shadows and objects. Lin [34] designed a multi-layer pooling scheme (MLPS) to integrate the features in a local region and reduce the dimension of the extracted features, then used a random forest algorithm as the ensemble decision scheme. Although the above methods require label information from part of the ground truth to train the classifier model, they adapt very well to various environments without tuning thresholds or parameters.
Instead of handcrafted feature extraction, machine learning techniques have been developed to automatically learn features using deep neural networks [35][36][37][38][39][40][41][42]. At first, researchers mainly treated the convolutional neural network (CNN) as a powerful feature extractor and achieved significant performance improvements with deep features. For example, Shen et al. [35] first extracted shadow edges via a structured CNN and then solved shadow recovery as an optimization problem. Khan et al. [36] first applied the CNN to shadow detection: they utilized a 7-layer CNN to extract features from superpixels and then fed the features to a conditional random field (CRF) model to smooth the detection results. Later, end-to-end CNN models were proposed following the emergence of fully convolutional networks (FCN) [37]. For example, Vicente et al. [40] presented a semantic-aware stacked CNN model to extract the semantic shadow prior and then refined the output with a patch-based CNN. Hu et al. [41] formulated a direction-aware attention mechanism in a spatial recurrent neural network (RNN) and recovered direction-aware spatial context (DSC) for detecting shadows. CNN-based methods can learn features directly from images, but they need sufficient data for training and require tuning of several hyperparameters, such as the number of convolutional layers, the window size of each convolutional layer, the learning rate and the number of iterations. Moreover, most of the methods mentioned above were designed to detect shadows in a single image. Besides, deep learning methods generally require a large number of labeled samples for training; for video sequences, obtaining such samples manually is difficult and expensive.
As a special type of single-hidden-layer feed-forward neural network, the extreme learning machine (ELM) has been extensively applied in several fields of machine learning [43][44][45][46][47][48]; its parameters are randomly generated and do not need to be tuned. Ghimire and Lee [48] proposed an online sequential extreme learning machine-based semi-supervised technique for moving cast shadow detection, which provided better generalization performance. Motivated by its fast learning, good generalization and universal approximation capability, we propose an effective moving shadow detection framework based on the extreme learning machine. The main contributions of this work are: (1) based on the shadow properties, a set of features consisting of pixel-level features and region-level features is extracted by considering the characteristics of pixels and the spatial correlations of neighboring pixels simultaneously. (2) A generalized classification model based on the extreme learning machine is constructed, which is simple and efficient and requires no tuning of thresholds or parameters.
(3) Extensive qualitative and quantitative evaluations on 13 different scenes demonstrate that the proposed method outperforms several well-known methods.
The rest of this paper is organized as follows. Section 2 describes the ELM-based moving shadow detection in detail, and Section 3 presents the experimental results and analysis. The conclusions are given in Section 4.

Extreme Learning Machine (ELM)-Based Moving Cast Shadow Detection
In this section, we develop a novel ELM-based moving cast shadow detection approach; the overall architecture, illustrated in Figure 1, mainly includes the following five steps. The first step is to acquire the labelled object pixels and shadow pixels from the ground truths. The next step is feature extraction: pixel-based features and region-based features are extracted to form an input data matrix for training. The third step is MCSD-ELM model learning, in which the proposed MCSD-ELM classifier is trained for moving cast-shadow detection to obtain the output connecting weights W. The fourth step is classification: for a given foreground image, features are first extracted as in step 2, the corresponding network output values are then calculated using the weights W, and the final class label is determined by the highest network output value. Finally, to obtain complete objects and shadows for high-level computer vision applications, a post-processing step is carried out.
Figure 1. Overview of the proposed moving cast-shadow detection approach.


Feature Extraction
To efficiently describe the shadow properties with abundant information, the pixel-level and region-level illumination invariant distinguishing features are extracted in three different color spaces based on intensity, local color constancy, and local texture consistency. The set of feature descriptors is constructed to form an input data matrix for ELM.
Here, we assume that the background image B (without moving objects) and the current foreground frame F (containing moving objects and moving shadows) of a sequence are generated using a standard background subtraction method as described in [49]. Let B_c(x, y) be the intensity value located at (x, y) of component c in the background B. Similarly, F_c(x, y) is the intensity value located at (x, y) of component c in the current foreground frame F.

Pixel-Level Features
Shadows are illuminated only by skylight, while non-shadow regions are illuminated by both skylight and sunlight. Therefore, shadows are darker than the surface on which they are cast. Based on the Phong model [50], one property of shadows is that their intensity must be lower than that of non-shadow regions in each component, while the chromaticity changes only within a small range.
According to the property, several pixel-level features derived from different color space are exploited to depict shadows as much as possible.
(A) Color ratio in RGB color space
Since the intensity of a shadow is lower than that of a non-shadow in each component, the color ratio is used to represent the ratio between the foreground and background in RGB color space [4]. To avoid division by zero, the color ratio is defined as:

K_c(x, y) = F_c(x, y) / (B_c(x, y) + 1), c ∈ {R, G, B},    (1)

where K_c(x, y) is the color ratio at location (x, y) in component c, and B_c(x, y) and F_c(x, y) are the intensity values at location (x, y) of component c in the background B and the current foreground frame F, respectively.
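As an illustration, the per-channel color ratio can be sketched in a few lines of NumPy (a hedged sketch assuming the F_c/(B_c + 1) form above; the function name and the H x W x 3 array layout are our own conventions):

```python
import numpy as np

def color_ratio(F, B):
    """Per-channel ratio of foreground to background intensity,
    K_c = F_c / (B_c + 1); the +1 avoids division by zero, and
    shadow pixels yield ratios below 1 because they are darker."""
    F = F.astype(np.float64)
    B = B.astype(np.float64)
    return F / (B + 1.0)
```

With uint8 images, `color_ratio(F, B)` returns an array of the same shape holding one ratio per pixel and channel.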

(B) Lightness ratio in LRGB color space
The lightness-red-green-blue (LRGB) color model was proposed in the literature [51], in which the lightness and color components can be scaled separately. Hence, the lightness ratio calculated in LRGB color space can better describe the darkness characteristic of shadow.
The LRGB components (L, T_1, T_2, T_3)^T are generated by transforming the RGB components (A_1, A_2, A_3)^T:

(L, T_1, T_2, T_3)^T = M (A_1, A_2, A_3)^T,    (2)

where M is the transform matrix specified in [51]. Then, the lightness ratio L_LRGB is given by:

L_LRGB(x, y) = F_L(x, y) / (B_L(x, y) + 1),    (3)

where F_L(x, y) and B_L(x, y) are the lightness values at location (x, y) in the lightness component L of the foreground frame F and the background image B in the LRGB color space, respectively.
(C) Color constancy in the HSV color space
A shadow maintains color constancy compared with the surface on which it is cast. Generally, the hue and saturation components in HSV color space are utilized to describe this property [7]. Meanwhile, Tsai [52] assumed that shadows have higher hue values in HSV color space. Therefore, the color constancy can be adequately depicted by the following three features:

H(x, y) = |F_h(x, y) − B_h(x, y)|,    (4)
S(x, y) = |F_s(x, y) − B_s(x, y)|,    (5)
R(x, y) = (F_h(x, y) + 1) / (F_v(x, y) + 1),    (6)

where F_h(x, y), F_s(x, y) and F_v(x, y) are the values at location (x, y) in the hue, saturation and value components of the foreground frame F in HSV color space, respectively. Likewise, B_h(x, y) and B_s(x, y) are the values at location (x, y) in the hue and saturation components of the background image B. H(x, y) and S(x, y) denote the hue and saturation differences between F and B, respectively, and R(x, y) reflects the higher hue of shadows in HSV color space, computed in the foreground frame F alone. In addition, F_h, F_s, F_v, B_h, B_s ∈ [0, 1].
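A minimal sketch of the three HSV features (assuming absolute differences for H and S and the (F_h + 1)/(F_v + 1) form for R; the function name and argument names are illustrative, and the inputs are HSV components already scaled to [0, 1]):

```python
import numpy as np

def hsv_features(Fh, Fs, Fv, Bh, Bs):
    """Hue and saturation differences between frame and background,
    plus a Tsai-style hue/value ratio computed on the frame alone."""
    H = np.abs(Fh - Bh)            # hue difference
    S = np.abs(Fs - Bs)            # saturation difference
    R = (Fh + 1.0) / (Fv + 1.0)    # hue/value ratio; +1 avoids division by zero
    return H, S, R
```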

Region-Level Features
The other property of a shadow is that its texture is similar to the surface on which it is cast (the background) but different from that of the foreground. It is noted that pixel-level features are sensitive to noise. To overcome this drawback, region-level features are developed to describe the texture consistency of shadows; they are explored through the spatial correlations of neighboring pixels, such as normalized cross-correlation (NCC), Gabor features and modified local binary patterns (MLBP).
(A) Normalized cross-correlation (NCC) in the LRGB color space
In fact, the shadow is an approximately scaled version of the background because of its darkness [50]. It has been shown that the normalized cross-correlation (NCC) [53] adequately reflects the similarity between a shadow and the corresponding background; it is computed over a neighboring region and is robust to noise. As noted in [51], lightness can be scaled very well in the LRGB color space. Given a pixel p at location (x, y) and the set Ω_p of its neighboring pixels q at locations (i, j) ∈ Ω_p, the NCC is formulated as:

NCC(x, y) = ER(x, y) / sqrt(EF(x, y) · EB(x, y)),    (7)

where

ER(x, y) = Σ_{(i,j)∈Ω_p} F_L(i, j) B_L(i, j), EF(x, y) = Σ_{(i,j)∈Ω_p} F_L(i, j)², EB(x, y) = Σ_{(i,j)∈Ω_p} B_L(i, j)²,

and F_L(i, j) and B_L(i, j) are the foreground and background lightness values of the neighboring pixel at location (i, j) in the lightness component L of the LRGB color model, respectively.
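The NCC over a square window can be sketched as follows (the window half-width parameter `half` and the function name are our own; image-boundary handling is omitted for brevity):

```python
import numpy as np

def ncc(FL, BL, x, y, half=1):
    """NCC between foreground and background lightness over a
    (2*half+1)^2 window centred at (x, y); for a pure intensity
    scaling (shadow-like), the value is exactly 1."""
    f = FL[x - half:x + half + 1, y - half:y + half + 1].astype(np.float64)
    b = BL[x - half:x + half + 1, y - half:y + half + 1].astype(np.float64)
    er = np.sum(f * b)       # cross term ER
    ef = np.sum(f * f)       # foreground energy EF
    eb = np.sum(b * b)       # background energy EB
    return er / np.sqrt(ef * eb) if ef > 0 and eb > 0 else 0.0
```

Because a shadow is roughly a scaled copy of the background, the NCC of a true shadow window stays close to 1.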

(B) Illumination invariant Gabor features within the current frame of the RGB color space
The 2D Gabor filter [54] depicts the intensity variation over a range of scales and orientations for a pixel in its neighborhood. The resulting Gabor texture descriptor is illumination invariant, which makes it powerful for expressing the texture information of shadow and non-shadow regions. Given a pixel at location (x, y) and its neighborhood D(x, y) centered at (x, y), its Gabor transform over P scales and Q orientations is defined by the convolution:

G^c_pq(x, y) = Σ_{i=0}^{I−1} Σ_{j=0}^{J−1} D(x − i, y − j) g_pq(i, j),    (8)

where I and J represent the dimensions of the Gabor kernel g_pq, and G^c_pq(x, y) is the Gabor coefficient at location (x, y) in component c, c ∈ {R, G, B}.
Here, the Gabor kernel g_pq is:

g_pq(x, y) = (a^{−p} / (2π σ_x σ_y)) exp(−(x'² / (2σ_x²) + y'² / (2σ_y²))) exp(j2πf x'),    (9)

with x' = a^{−p}(x cos θ + y sin θ) and y' = a^{−p}(−x sin θ + y cos θ), where σ_x and σ_y denote the sizes of the Gaussian envelope in the x and y directions, respectively, f is the base frequency of the sinusoid, p is the scale factor (p = 0, 1, . . . , P − 1 for a > 1) and q is the orientation factor (q = 0, 1, . . . , Q − 1). Therefore, the filter orientation is θ = qπ/Q. In our work, we extract the Gabor features at a single scale (p = 0) with Q = 4 orientations. In particular, the texture information is described in the current foreground frame F with θ ∈ {0°, 45°, 90°, 135°} for the R, G, B components in RGB color space.

(C) Modified local binary pattern (MLBP) in the RGB color space

The modified local binary pattern (MLBP) [55] is effective for representing the texture information of shadows. This is because the MLBP is not only illumination invariant but also robust in flat regions; besides, it is fast to compute. Given a pixel at location (x, y) with intensity value V_m, the MLBP descriptor is calculated as follows:

MLBP_{N,r}(x, y) = Σ_{n=0}^{N−1} s(V_n − V_m) 2^n, with s(z) = 1 if z > Δ and 0 otherwise,    (10)

where N and r are the number of pixels in the neighborhood Ω(x, y) centered at (x, y) and the radius of the circle, respectively. V_n is the intensity value of the nth neighboring pixel at location (i, j) in Ω(x, y), n = 0, 1, · · · , N − 1, and Δ is a threshold maintaining robustness in flat regions. Hence, we can obtain an N-bit binary pattern for a given pixel according to Equation (10). Afterwards, a histogram with 2^N bins is produced to express the texture information.
To obtain the texture similarity between shadow and non-shadow, a simple histogram intersection operation is adopted for fast computation. Therefore, the texture similarity is given by:

Sim_c(x, y) = Σ_k min(h^c_F(k), h^c_B(k)),    (11)

where h^c_F and h^c_B are the MLBP histograms of the pixel at (x, y) in component c (c ∈ {R, G, B}) of the foreground frame F and the corresponding background B, respectively. Sim_c(x, y) represents the common part of the two histograms for the pixel at location (x, y) in component c.
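A compact sketch of the MLBP code for one pixel and the histogram intersection similarity (assuming N = 8, r = 1 and the thresholding rule above; the function names and the neighbour ordering are illustrative):

```python
import numpy as np

def mlbp_code(patch, delta=3):
    """MLBP for the centre pixel of a 3x3 patch (N=8, r=1): bit n is set
    when the nth neighbour exceeds the centre by more than delta."""
    centre = patch[1, 1]
    # ring of 8 neighbours, clockwise from the top-left corner
    neighbours = patch[[0, 0, 0, 1, 2, 2, 2, 1], [0, 1, 2, 2, 2, 1, 0, 0]]
    bits = (neighbours.astype(int) - int(centre)) > delta
    return sum(int(b) << n for n, b in enumerate(bits))

def hist_intersection(hF, hB):
    """Overlap of two normalised MLBP histograms (element-wise minima)."""
    return np.minimum(hF, hB).sum()
```

In a flat patch all differences stay below delta, so the code is 0 regardless of the absolute brightness, which is what makes the descriptor robust in flat regions.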

Feature Descriptor
The pixel-level features and region-level features extracted above have varying dynamic ranges. Therefore, the features need to be normalized. After that, all of the features are combined to form the final descriptor for the foreground frame with a dimension of d = 23.
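The 23-D descriptor assembly can be sketched as follows (the grouping 3 + 1 + 3 + 1 + 12 + 3 = 23 is inferred from the features listed above; the function name and argument layout are our own, and per-feature normalisation is assumed to happen upstream):

```python
import numpy as np

def assemble_descriptor(K, L_ratio, H, S, R, ncc_val, gabor12, sim3):
    """Concatenate the per-pixel features into the d = 23 descriptor:
    3 colour ratios + 1 lightness ratio + 3 HSV features + 1 NCC +
    12 Gabor magnitudes (4 orientations x 3 channels) + 3 MLBP similarities."""
    x = np.concatenate([K, [L_ratio], [H, S, R], [ncc_val], gabor12, sim3])
    assert x.size == 23
    return x
```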

Classification Using Extreme Learning Machine
After feature extraction, the objective of the next step is to assign each pixel to one of two categories: shadow and object. The ELM [43][44][45][46][47][48] is adopted for classification, which has been proven to be effective and efficient in addressing a variety of classification problems. Next, the classification procedure using the ELM algorithm is described in detail.
Suppose that a training set {(x_i, y_i) | x_i ∈ R^d, y_i ∈ R^C, i = 1, 2, . . . , N} is utilized to train the single hidden layer feedforward neural network (SLFN). Here, the network contains d inputs, L hidden-layer neurons and C outputs.
The output function of the ELM [43] is expressed as:

f(x_i) = h(x_i) W,    (12)

where W = [w_1, w_2, . . . , w_L]^T is the output weight matrix connecting the hidden-layer nodes and the output nodes, and h(x_i) = [h_1(x_i), h_2(x_i), . . . , h_L(x_i)] is the output vector of the hidden layer for the input x_i, which is also called the ELM non-linear feature mapping [43].
To obtain the weights W ∈ R^{L×C} connecting the hidden-layer nodes and the output-layer nodes, the sum of squared prediction errors is minimized:

min_W ||HW − Y||²,    (13)

where ||·|| represents the Frobenius norm, H = [h(x_1); h(x_2); . . . ; h(x_N)] ∈ R^{N×L} is the output matrix of the hidden layer, and Y = [y_1, y_2, . . . , y_N]^T ∈ R^{N×C} is the target matrix of the training set (i.e., the labels of the training samples) [43]. Consequently, the output weights W can be written as:

W = H†Y,    (14)

where H† denotes the Moore–Penrose generalized inverse of H. Given a new sample x_{N+1} ∈ R^d, its network response o_{N+1} is generated by the obtained output weights W:

o_{N+1} = W^T φ(x_{N+1}),    (15)
where φ(x_{N+1}) is the network hidden-layer output for x_{N+1}. For binary classification, the decision function of the ELM selects the class with the highest network output:

y_{N+1} = arg max_{k ∈ {1, 2}} (o_{N+1})_k.    (16)
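The training and prediction steps above can be sketched as a minimal NumPy ELM (the class name `TinyELM`, the sigmoid activation and the default hidden size are our own choices; the output weights are obtained in closed form via the Moore–Penrose pseudoinverse, W = H†Y):

```python
import numpy as np

class TinyELM:
    """Minimal ELM sketch: a random, untrained hidden layer followed by
    closed-form least-squares output weights (no iterative tuning)."""

    def __init__(self, L=50, seed=0):
        self.L = L
        self.rng = np.random.default_rng(seed)

    def _h(self, X):
        # sigmoid hidden-layer mapping h(x)
        return 1.0 / (1.0 + np.exp(-(X @ self.A + self.b)))

    def fit(self, X, Y):
        d = X.shape[1]
        self.A = self.rng.normal(size=(d, self.L))   # random input weights
        self.b = self.rng.normal(size=self.L)        # random biases
        self.W = np.linalg.pinv(self._h(X)) @ Y      # W = H^+ Y
        return self

    def predict(self, X):
        # class = index of the largest network output (one-hot targets)
        return np.argmax(self._h(X) @ self.W, axis=1)
```

Because only W is solved for, training reduces to a single pseudoinverse, which is what gives the ELM its speed relative to back-propagated networks.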

Post-Processing
In the process of classification, misclassification may commonly occur. Specifically, the shadows may be detected as objects incorrectly, and the objects may be misclassified as shadows. Figure 2 shows the shadow detection results of some frames in different scenes.
From Figure 2c, we can see that the detected shadows and objects comprise some isolated regions compared with the ground truths shown in Figure 2b. From this observation, we identify two situations that cause misclassification. (1) There are some dark regions (such as windshields in the second column of Figure 2a) in moving objects, which are often mistakenly detected as shadows.
(2) The color or texture similarity between moving objects and the corresponding background (such as the shirt of the person in the fourth column of Figure 2a) is consistent with the similarity between their moving shadows and the background, which leads to moving objects being incorrectly classified as moving shadows. In many cases, the situations mentioned above are inevitable. To solve this problem, post-processing is performed to ensure the integrity of moving objects and moving shadows for further computer vision applications. It is designed based on the spatial correlation and geometric properties of shadows and objects. There are two operations in post-processing: size discrimination of candidate moving shadows and moving objects, and border discrimination of candidate moving shadows.

(1) Size discrimination of candidate moving shadows and moving objects

Generally, the candidate moving shadow consists of correctly classified shadow regions and some small incorrectly classified object blobs, as do the candidate moving objects shown in Figure 2c. To remove the misclassified blobs, we first utilize a connected component labelling algorithm to mark the candidate moving shadows and moving objects respectively, generating different labelled sub-regions. Then, a size filter is applied to redress the small misclassified blobs. Taking the candidate moving shadows as an example, we describe the operation in detail.

For a candidate moving shadow mask M_S, a connected component algorithm is performed, producing a series of connected sub-regions:

M_S = {R_1, R_2, . . . , R_n},    (17)

where R_i is the ith connected sub-region and n is the number of sub-regions.

Next, the sub-regions in the set M_S are sorted by size, and the sub-regions with small sizes are filtered out and recognized as object regions:
R_i → object, if Num(R_i) < α · num,    (18)

where Num(R_i) denotes the number of pixels in sub-region R_i, num is the number of pixels in the largest sub-region, and α is an empirical threshold, α ∈ [0, 0.2]. Likewise, the same operations are performed on the candidate moving object mask M_O.
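A hedged sketch of the size filter (4-connected components found with a simple BFS; the function name and the default alpha are our own):

```python
import numpy as np
from collections import deque

def size_filter(mask, alpha=0.1):
    """Remove small connected components from a binary candidate mask:
    any 4-connected region smaller than alpha times the largest region
    is dropped (i.e., relabelled to the opposite class)."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    sizes, cur = [], 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                cur += 1
                labels[i, j] = cur
                q, n = deque([(i, j)]), 0
                while q:                      # BFS flood fill of one region
                    a, b = q.popleft()
                    n += 1
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        u, v = a + da, b + db
                        if 0 <= u < h and 0 <= v < w and mask[u, v] and labels[u, v] == 0:
                            labels[u, v] = cur
                            q.append((u, v))
                sizes.append(n)
    if not sizes:
        return mask.astype(bool)
    keep = [k + 1 for k, s in enumerate(sizes) if s >= alpha * max(sizes)]
    return np.isin(labels, keep)
```

In practice, a library routine such as OpenCV's connected-components labelling would replace the hand-written BFS; the thresholding logic stays the same.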
(2) Border discrimination of candidate moving shadows
In practice, true shadow pixels occur at the edges of blobs. In other words, if a part of a moving object is misclassified as a shadow, most of the boundary of that region will be located inside the candidate moving object, as displayed in Figure 2c. Conversely, if a shadow candidate is a true shadow, more than half of its boundary should be adjacent to the boundary of the moving objects. Therefore, the boundary information of a candidate shadow region can be exploited to determine whether the region is a shadow or not. First, the candidate moving objects and moving shadows are segmented by the Sobel edge detector. Then, a connected component labelling algorithm is performed to mark each region and compute its edge. For one candidate shadow region, the number of all boundary shadow pixels N_s and the number of boundary shadow pixels N_o that are adjacent to the boundary of a candidate moving object region are obtained, respectively. Consequently, we classify the candidate region as a shadow according to the following rule:

the candidate region is a shadow if N_o / N_s > 0.5, and an object otherwise.    (19)

The results after post-processing are given in Figure 2d. Obviously, post-processing can refine the shadow detection results and plays a very important role in correcting misclassifications.
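The border rule reduces to a one-line test (a sketch; the function name is our own):

```python
def is_true_shadow(n_o, n_s):
    """Border rule: keep a candidate region as a shadow only when more
    than half of its boundary pixels (n_s in total) are adjacent to the
    boundary of a candidate moving object (n_o of them)."""
    return n_s > 0 and n_o / n_s > 0.5
```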

Moving Cast-Shadow Detection Algorithm
Based on the above discussions, the proposed moving cast shadow detection algorithm based on the extreme learning machine (MCSD-ELM) is summarized in Algorithm 1.

Algorithm 1 MCSD-ELM algorithm
Input: t original frames F_t, t ground truths G_t, the background image B and the (t+1)th foreground frame F_{t+1}.
Output: The moving cast shadows and moving objects in the foreground frame F_{t+1}.
Step 1. Randomly select N labelled pixels from t groundtruths G t and extract pixel-level and region-level features according to Equations (1)-(11).
Step 2. Generate the feature descriptor x_i = [x_i1, x_i2, · · · , x_id] ∈ R^d and the label vector y_i ∈ R^C, i = 1, 2, . . . , N.
Step 3. Initiate an ELM with L hidden neurons, random input weights and bias values, and compute the output vector h(x_i) of the hidden layer. Then the output matrix H ∈ R^{N×L} is formed.
Step 4. Compute the output weights W = H†Y, where H† is the Moore–Penrose generalized inverse of H and Y is the target matrix.
Step 5. For a new testing foreground frame F_{t+1}, extract pixel-level and region-level features according to Equations (1)–(11) and form the feature descriptor x^{t+1}_m for each foreground pixel (m = 1, 2, . . . , fp, where fp is the number of moving pixels in F_{t+1}).
Step 6. Calculate the network response of each foreground pixel using the obtained output weights W, denoted by o^{t+1}_m = W^T φ(x^{t+1}_m), in which φ(x^{t+1}_m) is the network hidden-layer output for the mth pixel in F_{t+1}.
Step 7. Determine the class label y t+1 m according to Equation (16) and generate the candidate moving shadow mask M S and candidate moving objects mask M O .
Step 8. Perform the post-processing on the candidate M S and M O to obtain the refined classification results.

Experiments and Result Analysis
In this section, we first introduce two common datasets which are used to validate the superiority of the proposed method. Then, the quantitative evaluation metrics are illustrated. Finally, we compare the proposed method with several representative well-known methods quantitatively and qualitatively.

Datasets
In our experiments, we assume that the foreground detection masks (including the moving cast shadows and moving objects) are available and select two datasets consisting of 13 scenes (indoor and outdoor) to test the performance of our proposed method.
The first dataset [56,57] has six popular and widely tested scenes, which are summarized as follows; the details of this dataset are listed in Table 1.
(1) Campus: It is an outdoor scene, in which the shadow is relatively large and weak because of the presence of multi-light effects.
(2) Highway: It is an outdoor scene of the traffic with different lighting conditions, in which the shadow is relatively large and strong.
(3) Intelligent Room: It is an indoor scene with different perspectives and lighting conditions, in which the shadow is medium and weak.
(4) Laboratory: It is an indoor scene with various lighting conditions, in which the shadow is medium and weak.
(5) Hallway: It is an indoor scene with a textured background, in which the shadow is variable and weak.
(6) CAVIAR: It is an indoor scene with different lighting conditions, in which the shadow is variable and weak. Besides, it has an obvious shadow color blending because of a strong background reflection.
The second dataset is selected from the CDnet dataset [58], which includes a large number of benchmarks; its details are illustrated in Table 2.
(1) Cubicle: It is a typical indoor scene with strong light in the front of the view and strong camouflage between the walking staff and the background.
(2) Bungalows: It is an outdoor scene of the traffic with various vehicles, in which the shadow is relatively large and strong.
(3) BusStation: It is an outdoor scene, in which the shadow is static, medium and strong, cast by buildings in front of the bus station, with people coming out of the station or passing in front of it.
(4) PeopleInShade: It is a challenging outdoor scene, in which people walk under a large shaded area. It has foreground-background camouflages and non-textured-dark regions except strong shadows.
(5) Seam, Senoon, Sepm: They are outdoor scenes acquired from the same camera at different times. They have strong shadows of various sizes.

Evaluation Metrics
To quantitatively evaluate the performance of shadow detection methods, two metrics [5], the shadow detection rate (η) and the shadow discrimination rate (ξ), are defined as follows:

η = TP_S / (TP_S + FN_S), ξ = TP_O / (TP_O + FN_O),

where TP_S and TP_O are the numbers of moving shadow pixels and moving object pixels correctly detected, respectively. FN_S is the number of moving shadow pixels incorrectly classified as moving object pixels. Likewise, FN_O is the number of moving object pixels incorrectly classified as moving shadow pixels.
To consider the performance of shadow detection and shadow discrimination simultaneously, a third measure is given by:

Avg = (η + ξ) / 2,

where Avg is the mean of the shadow detection rate η and the shadow discrimination rate ξ.
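As a concrete illustration, the three metrics above can be computed directly from per-pixel ground-truth and predicted labels over the foreground mask. The following sketch is ours, not part of the paper's code; the label encoding ('S' for shadow, 'O' for object) is an assumption for illustration.

```python
def shadow_metrics(gt, pred):
    """Compute shadow detection rate (eta), shadow discrimination
    rate (xi) and their mean (Avg) from per-pixel labels.
    Labels: 'S' = moving shadow, 'O' = moving object.
    gt and pred are flat sequences over the foreground pixels only."""
    tp_s = sum(1 for g, p in zip(gt, pred) if g == 'S' and p == 'S')
    fn_s = sum(1 for g, p in zip(gt, pred) if g == 'S' and p == 'O')
    tp_o = sum(1 for g, p in zip(gt, pred) if g == 'O' and p == 'O')
    fn_o = sum(1 for g, p in zip(gt, pred) if g == 'O' and p == 'S')
    eta = tp_s / (tp_s + fn_s)    # shadow detection rate
    xi = tp_o / (tp_o + fn_o)     # shadow discrimination rate
    return eta, xi, (eta + xi) / 2.0
```

For example, with ground truth "SSSOOO" and prediction "SSOOOS", one shadow pixel and one object pixel are misclassified, so η = ξ = Avg = 2/3.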

Comparisons on a Popular Dataset
The quantitative results obtained by the proposed method and nine comparative methods are summarized in Table 3, and the best Avg of each scene is marked in bold. Some quantitative results of the comparative methods are taken from the original literature [9,22] and some are provided by the authors of [26,31]. It can be observed that the proposed method has the highest shadow detection rate η on CAVIAR, Hallway, Laboratory and Intelligent Room, and the best shadow discrimination rate ξ on Campus. In terms of the mean value Avg, the proposed method performs well on Campus, CAVIAR, Hallway, Highway and Intelligent Room compared with the existing state-of-the-art methods. In particular, the Avg of the proposed method is about 3.0% higher than that of [22] on Campus and CAVIAR, while it is about 0.14% lower than that of [27] on Laboratory. To evaluate the performance comprehensively, the averages over the six scenes are calculated, which demonstrate that the proposed method is superior to all comparative methods; notably, its average is about 1.97% higher than that of the second-best method [22]. The qualitative evaluation results of five representative state-of-the-art methods and the proposed method on the six scenes are shown in Figure 3, in which moving shadow pixels are marked in green and moving object pixels in red. From Figure 3, we can see that the proposed method has an excellent capability of discriminating moving shadow pixels from foreground frames in both outdoor and indoor scenes. Although pixel-based methods are sensitive to noise, the proposed method improves its robustness through post-processing, and the integrity of moving shadows and moving objects is maintained well.
(Figure 3 caption, partial: … [21]; (d) results obtained by [20]; (e) results obtained by [31]; (f) results obtained by [26]; (g) results obtained by [22]; (h) results obtained by our method.)

Table 4 compares the shadow detection results of the proposed method and eight representative advanced methods. Most quantitative results of the comparative methods are taken from the original literature [22] and some are provided by other authors [24,26]. In terms of the shadow detection rate η, the proposed method achieves the highest accuracy on PeopleInShade, Cubicle, Senoon and Sepm, while it is worse than the method of Wang et al. [24] on BusStation, Bungalows and Seam by about 5.74%, 1.40% and 1.82%, respectively. In terms of the shadow discrimination rate ξ, the proposed method has the best capability of separating moving shadows from moving objects on all scenes except Bungalows. To evaluate the performance comprehensively, the mean metric Avg is calculated for comparison. The proposed method performs best on PeopleInShade, BusStation, Cubicle, Seam, Senoon and Sepm, while it is about 4.52% less accurate than the method of Wang et al. [24] on Bungalows.
Moreover, the proposed method is superior to the comparative methods in terms of the average over the seven scenes; in particular, its average is about 6.35% higher than that of the second-best method [24]. In addition, for the three scenes Seam, Senoon and Sepm, which are acquired from the same camera at different times, our method exhibits the best performance. In summary, these results demonstrate that our method is effective and robust in various indoor and outdoor scenes.

Comparisons on a Large Dataset
To validate the effectiveness and superiority of the proposed method visually, the qualitative comparison results obtained by our method and the recent well-known method presented in [22] on seven different scenes are displayed in Figure 4. It can be seen that the proposed method accurately separates moving shadows from foreground frames in the outdoor scenes with strong shadows as well as in the indoor scene with weak shadows. It is also robust to noise and guarantees the integrity of moving shadows and moving objects.
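The paper describes its post-processing step only as "simple". One plausible sketch of such a step, assuming per-pixel shadow/object labels on a 2D grid, is a 3x3 majority vote that suppresses isolated misclassifications; this is our illustrative stand-in, not the paper's exact procedure.

```python
def majority_smooth(labels):
    """One pass of 3x3 majority voting over a 2D grid of per-pixel
    labels ('S' = shadow, 'O' = object). Isolated misclassified
    pixels are flipped to the dominant label of their neighbourhood,
    which suppresses salt-and-pepper noise from pixel-wise decisions."""
    h, w = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for y in range(h):
        for x in range(w):
            votes = {'S': 0, 'O': 0}
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        votes[labels[ny][nx]] += 1
            out[y][x] = 'S' if votes['S'] > votes['O'] else 'O'
    return out
```

Applied to a single 'O' pixel surrounded by 'S' pixels, the stray pixel is relabelled 'S', which is the kind of clean-up that preserves the integrity of shadow and object regions.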

Comparisons with Some Representative Deep-Learning Methods
Furthermore, the proposed method is compared with two state-of-the-art methods based on convolutional neural networks; the quantitative comparative results, obtained from the literature [34], are listed in Table 5. Considering the shadow detection rate η, the shadow discrimination rate ξ and their average, the proposed method is superior to the method of Long et al. [37] on all scenes except Campus, and it performs better than the method of Lee et al. [38] on all test scenes. The reasons are summarized as follows. (1) Our method is implemented with learning features obtained by ELM from discriminative hand-crafted features, whereas the convolutional neural network methods learn shallow features directly from the original frames. (2) The deep-learning methods have a large number of parameters to estimate; when the sample size is small, they cannot optimize these parameters well, which leads to under-learning. In contrast, our ELM-based method requires fewer parameters to optimize and can thus achieve better performance with fewer labelled samples.
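For readers unfamiliar with ELM, the contrast with deep networks comes from its training scheme: the hidden layer is random and fixed, and only the output weights are solved in closed form. The following is a minimal generic sketch (our own, with illustrative names; the paper's MCSD-ELM feeds its hand-crafted shadow features in as X):

```python
import numpy as np

def elm_train(X, y, L=50, seed=0):
    """Train a minimal single-hidden-layer ELM classifier.
    Hidden weights W and biases b are random and never updated;
    only the output weights beta are solved in closed form via
    the Moore-Penrose pseudo-inverse of the hidden activations."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], L))   # random input weights
    b = rng.normal(size=L)                 # random hidden biases
    H = np.tanh(X @ W + b)                 # hidden-layer activations
    beta = np.linalg.pinv(H) @ y           # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Classify samples by the sign of the ELM output."""
    return np.sign(np.tanh(X @ W + b) @ beta)
```

On two well-separated synthetic clusters labelled +1/-1, this classifier reaches near-perfect training accuracy while fitting only the L output weights, which is why so few labelled samples suffice.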

Parameter Sensitivity Analysis
In this section, to analyze the influence of the number of hidden nodes L on the proposed method, Figure 5 illustrates the variation curves of the classification rates on the two datasets. The classification rate is not overly sensitive to the value of L: as L increases, the performance of the proposed method first improves, and after the classification rate peaks, it decreases as L increases further. Therefore, we can see from Figure 5 that neither too small nor too large an L is appropriate for our method.

To analyze the performance of the proposed MCSD-ELM with a varying number of training samples, Figure 6 displays the classification rates of our method with different numbers of training samples on the two datasets. It can be seen that the classification rates increase with the number of training samples and tend to be stable once the number of training samples reaches a certain level. Since the time cost grows with the number of training samples, neither too small nor too large a number of training samples is appropriate for our method.
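This kind of sensitivity analysis is easy to reproduce on toy data: hold out a test split and sweep L. The sketch below is self-contained and purely illustrative (the data is a noisy synthetic stand-in for shadow/object feature vectors, not the paper's datasets); the reported accuracies depend on the random seed.

```python
import numpy as np

def fit_predict(Xtr, ytr, Xte, L, seed=0):
    """Train a tiny ELM with L hidden nodes (random features +
    pseudo-inverse output weights) and predict labels for Xte."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(Xtr.shape[1], L))
    b = rng.normal(size=L)
    beta = np.linalg.pinv(np.tanh(Xtr @ W + b)) @ ytr
    return np.sign(np.tanh(Xte @ W + b) @ beta)

# Noisy two-class toy data (stand-in for shadow/object feature vectors).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1] + rng.normal(0, 0.3, 200))  # noisy XOR-like labels
Xtr, ytr, Xte, yte = X[:100], y[:100], X[100:], y[100:]

for L in (2, 10, 50, 100):
    acc = (fit_predict(Xtr, ytr, Xte, L) == yte).mean()
    print(L, round(acc, 2))
```

With very small L the random features cannot represent the decision boundary, while very large L lets the closed-form solution interpolate the label noise, so held-out accuracy typically rises and then flattens or drops, matching the peak behaviour reported for Figure 5.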

Conclusion
In this study, we have proposed a novel moving cast-shadow detection method using the supervised ELM. In contrast to the conventional methods, the proposed method not only incorporates pixel-level features but also explores region-level features derived from the correlations among neighboring pixels to form the input data for constructing the MCSD-ELM model. On the one hand, the proposed model only needs to tune one parameter, which has little effect on the accuracy, and can automatically determine whether a pixel belongs to a shadow or not. On the other hand, the post-processing operation further improves the classification performance and guarantees the integrity of moving cast shadows and moving objects. We have evaluated the performance of the proposed method on two publicly available datasets. Compared with some representative state-of-the-art methods, the extensive experimental results indicate the effectiveness of our method and its robustness to noise.
Considering that the process of labelling data can be time-consuming and infeasible, in future work we will investigate semi-supervised methods for moving cast-shadow detection using both labelled and unlabelled data. Moreover, we have noticed that some researchers have proposed active annotation methods [42] for single images or video sequences. Therefore, we will also study moving shadow detection on video sequences based on deep learning.
