A Cloud Detection Approach Based on Hybrid Multispectral Features with Dynamic Thresholds for GF-1 Remote Sensing Images

Nowadays, GF-1 (GF is the acronym for GaoFen, which means high resolution in Chinese) remote sensing images are widely used in agriculture because of their high spatio-temporal resolution and free availability. However, owing to the imaging principle of optical satellites, GF-1 remote sensing images are inevitably affected by clouds, which obscures ground-object information in crop areas and adds noise to research datasets. It is therefore crucial to detect the cloud pixels in GF-1 imagery of crop areas efficiently, with strong performance in both running time and accuracy, for large-scale agricultural processing and application. To address these problems, this paper proposes a cloud detection approach based on hybrid multispectral features (HMF) with dynamic thresholds. The approach combines three spectral features, namely the Normalized Difference Vegetation Index (NDVI), WHITENESS and the Haze-Optimized Transformation (HOT), to detect cloud pixels, thereby exploiting the complementary strengths of the hybrid multispectral features. Meanwhile, to accommodate the variation of threshold values across seasons, a dynamic threshold adjustment method is adopted, which builds a relationship between the features and the solar altitude angle to derive a group of image-specific thresholds. Tests on GF-1 remote sensing datasets and comparative trials with Random Forest (RF) show that the proposed method not only has high accuracy but also has advantages in running time. The average accuracy of cloud detection reaches 90.8%, and the processing time for each GF-1 image can reach 5 min, a reduction of 83.27% compared with the RF method. Therefore, the approach presented in this work can serve as a reference for those interested in cloud detection for remote sensing images.


Introduction
GF-1 is the first satellite of China's high-resolution Earth observation system (GF is an acronym for GaoFen, which means high resolution in Chinese). It was successfully launched by a Long March carrier rocket in April 2013. In recent years, machine-learning-based cloud detection has also been explored, using random forest, clustering and other methods [26,27]. Jin et al. established a back-propagation neural network for MODIS remote sensing images, which has a better ability to learn knowledge [28]. Yu et al. used a clustering method based on texture features to realize automatic detection of cloud areas [29]. Wang Wei et al. used a K-means clustering algorithm to initially classify the data, and then used spectral threshold judgment to eliminate the interference of smoke and snow when detecting clouds in MODIS data [27,30]. Fu et al. succeeded in detecting clouds in FY meteorological satellite imagery with a Random Forest method [31]. However, machine-learning methods usually require a large number of training and test samples for model construction and accuracy evaluation, which cannot effectively guarantee the generalization ability of the model across a wide range of applications.
Aiming to detect clouds promptly in GF-1 remote sensing images across the whole of China, and considering the data quality requirements of subsequent agricultural applications, this paper holds that it is more important to reduce the leakage (missed-detection) rate of clouds (i.e., if there are 100 cloud pixels in an image, we should do our best to find all of them, regardless of how many pixels in total we mark as clouds) than to reduce the false-detection rate of clouds. Therefore, after analyzing the spectral features of typical targets in GF-1 remote sensing images, three cloud detection feature parameters are selected to ensure that even suspected cloud pixels can be identified. Meanwhile, this work utilizes Kawano's dynamic threshold method to build a relationship between the features and the solar altitude angle of each image, which realizes the dynamic adjustment of the thresholds of the different features. Experiments demonstrate that the algorithm is simple and efficient. It can be used to realize automatic cloud detection of GF-1 remote sensing images throughout China.
This paper is organized as follows. Section 2 introduces a cloud detection algorithm for GF-1 remote sensing images based on hybrid multispectral features with dynamic thresholds, including the analysis and selection of the feature parameters, the seasonal adjustment of the coefficients in the HOT calculation and the spatial adjustment of the feature parameter thresholds. Section 3 presents the cloud detection experimental results. Finally, Sections 4 and 5 discuss the experimental results and list future work.

Methodology
In this section, the spectral features of typical targets in GF-1 remote sensing images are analyzed, and three feature parameters are selected, namely NDVI (Normalized Difference Vegetation Index), WHITENESS and HOT (Haze-Optimized Transformation). Then, the seasonal clear-sky background field is designed, and the coefficients in the HOT calculation suitable for different seasons are obtained. Finally, a dynamic threshold method adapting to the change of solar altitude angle is adopted to adjust the three features' thresholds dynamically according to each GF-1 image. Before detecting clouds, the radiometric calibration, atmospheric correction and orthorectification of the GF-1 multispectral images are carried out with the GF-1 remote sensing data preprocessing system developed by Ye et al. [1,32,33], and the reflectance values of the images are obtained.

NDVI
NDVI is the normalized difference vegetation index. For Landsat TM (the Thematic Mapper (TM) is an advanced, multispectral scanning, Earth resources sensor) images, most of the NDVI and normalized snow index values of cloud pixels are near zero because of the white color of clouds in the visible band [34]. With increasing wavelength, the reflectance of clouds decreases slowly, while the reflectance of vegetation increases [35], so the pixels which are clouds or crops in the image can be preliminarily distinguished by NDVI. The NDVI equation is shown as Equation (1):

NDVI = (NIR − R) / (NIR + R), (1)

where NIR represents the near-infrared reflectance of the remote sensing image, and R represents the red-band reflectance.
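As a minimal illustration of Equation (1), the sketch below computes NDVI for a hypothetical 2 × 2 reflectance patch (all values are invented for illustration): a vegetated pixel yields a high NDVI, while a cloud-like pixel with similar red and NIR reflectance yields a value near zero.

```python
import numpy as np

# Hypothetical 2x2 reflectance patches for the red and near-infrared bands.
# [0, 0] mimics vegetation (low red, high NIR); [0, 1] mimics cloud
# (red and NIR nearly equal, both bright).
red = np.array([[0.05, 0.30], [0.45, 0.10]])
nir = np.array([[0.40, 0.32], [0.50, 0.12]])

# Equation (1): NDVI = (NIR - R) / (NIR + R)
ndvi = (nir - red) / (nir + red)
```

Thresholding this image near zero would flag the cloud-like pixel while leaving the vegetated pixel untouched.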
In this section, 20 images of Heilongjiang Province, China are selected (10 images in summer and 10 in winter; each image covers 800 × 800 km²). Each image is classified into four types of objects: thick cloud, thin cloud, crop and water. Then the NDVI parameter is calculated, and the mean, standard deviation and range for the four types of objects are computed, as Table 1 shows. It can be seen from Table 1 that, in the same season, the NDVI mean values of different objects, such as thick cloud, crop and water, differ greatly, which means there is a potential threshold that can be used to distinguish clouds from other ground objects. In different seasons, the NDVI mean values of thick cloud differ only slightly, which means the potential threshold is stable across seasons. Therefore, using the NDVI parameter as one of the cloud detection features is feasible.

WHITENESS
WHITENESS was first proposed by Gomez-Chova in 2007 [36]. In the visible band, clouds always appear white because of their flat reflectance characteristics. This parameter uses the sum of the absolute differences between each visible band and the overall brightness to capture this attribute of clouds. For Landsat TM images and GF-1 multispectral images with only three visible bands, the original formulation is not ideal. However, with the improved calculation method of [14], shown in Equations (2) and (3), it can be well adapted to the multispectral images of GF-1:

MeanVis = (Band 1 + Band 2 + Band 3) / 3, (2)

WHITENESS = Σ |Band i − MeanVis| / MeanVis, (3)

where Band 1 is the blue band of GF-1, Band 2 is the green band, Band 3 is the red band, Band i represents the ith band of GF-1, and MeanVis is the average of the visible bands of GF-1.
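A minimal sketch of the WHITENESS computation described above, assuming the formulation of [14] (MeanVis is the mean of the three visible bands, and WHITENESS sums the relative absolute deviations from it); the reflectance values are invented: a spectrally flat, bright pixel scores low (cloud-like), while a spectrally varied pixel scores high.

```python
import numpy as np

# Hypothetical reflectances of the three GF-1 visible bands, shape (3, 1, 2).
# Column 0 mimics cloud (flat, bright); column 1 mimics vegetation.
bands = np.array([
    [[0.52, 0.20]],   # Band 1 (blue)
    [[0.50, 0.35]],   # Band 2 (green)
    [[0.51, 0.10]],   # Band 3 (red)
])

# Equation (2): per-pixel mean of the visible bands.
mean_vis = bands.mean(axis=0)

# Equation (3): summed relative absolute deviation from the visible mean.
whiteness = np.abs(bands - mean_vis).sum(axis=0) / mean_vis
```

The cloud-like pixel's WHITENESS is well below 0.1, while the vegetation pixel's is above 1, so a small threshold separates them.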
The WHITENESS statistics of clouds and other ground objects in the same and different seasons are calculated in the same way as for NDVI; the results are shown in Table 2. It can be seen from Table 2 that, in the same season, the WHITENESS mean values of different objects, such as thick cloud versus crop and thick cloud versus water, differ greatly, which means there is a potential threshold that can be used to distinguish clouds from other ground objects. In different seasons, the values for thick cloud differ only slightly, which means the potential threshold is stable across seasons. Therefore, using WHITENESS as one of the cloud detection features works well for distinguishing thick cloud, crop and water.

HOT
The HOT (Haze-Optimized Transformation) [37] algorithm was first published by Zhang in 2002. By analyzing a large amount of spectral information from multispectral remote sensing images, it was found that under clear weather conditions, the blue-band pixel values of images are highly correlated with the red-band pixel values, and this relationship is independent of the type of object. In the image band scatter plot, most pixels in the red and blue bands are distributed along a straight line, which is defined as the "clear sky line". Pixels in cloud-covered areas deviate from the clear sky line; the thicker the cloud, the greater the deviation. The HOT formula is Equation (4):

HOT = Band 1 · sin ω − Band 3 · cos ω, (4)

where Band 1 represents the pixel values of the blue band, Band 3 represents the pixel values of the red band, and ω is the angle between the horizontal axis (blue band) and the clear sky line.
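The geometry of Equation (4) can be sketched as follows. The clear-sky line coefficients b and a used here are illustrative placeholders, not the paper's fitted values. Every pixel lying exactly on the clear-sky line maps to the same constant (−a · cos ω), so deviation above that constant measures how far the pixel's blue value is lifted off the line, i.e., haze or cloud.

```python
import math

# Illustrative clear-sky line: red = b * blue + a (placeholders, not Table 4).
b, a = 0.9, 0.01
omega = math.atan(b)          # angle between the blue axis and the clear-sky line
sin_w, cos_w = math.sin(omega), math.cos(omega)

def hot(blue, red):
    # Equation (4): signed distance-like deviation from the clear-sky line.
    return blue * sin_w - red * cos_w

# A pixel exactly on the clear-sky line -> the constant -a * cos(omega).
clear_pixel = hot(0.20, 0.9 * 0.20 + 0.01)
# A hazy pixel: blue lifted well above the line for the same red value.
hazy_pixel = hot(0.40, 0.20)
```

Because `clear_pixel` equals −a · cos ω regardless of brightness, a single HOT threshold above that constant flags haze and cloud independently of the underlying object type.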
The HOT statistics of clouds and other ground objects in the same and different seasons are calculated in the same way as for NDVI and WHITENESS; the results are shown in Table 3. It can be seen from Table 3 that, in the same season, the HOT mean values of different objects, such as thin cloud versus crop and thin cloud versus water, differ greatly, but the HOT value ranges of thin clouds and some high-brightness water partially overlap. In different seasons, the HOT mean values of the same land objects differ only slightly. Therefore, the HOT parameter can distinguish thin clouds from most crops, but it is weak at distinguishing thin clouds from water.
From the three tables above, it can also be seen that the feature values of thick clouds vary somewhat with season; this may be related to many factors, such as the solar altitude angle and the composition and concentration of the thick clouds.

Hybrid Multispectral Features
From the experiments, we found that each of the three features fails to recognize some cloud pixels. In order to reduce the leakage recognition rate of clouds, this work uses the three features together to detect clouds, which is the cloud detection method based on hybrid multispectral features, as Figure 1 shows. Assume there is an original image with 4 × 4 pixels. The first step is to calculate the three feature values respectively and obtain three feature images. The second step is to decide, in each feature image, which pixels are clouds and which are not. The last step is to combine the three cloud images into the final cloud image; the combination principle is that if a pixel has been detected as a cloud pixel in any of the three cloud images, it is treated as a cloud pixel in the final cloud image. However, the HOT formula (4) varies because of its two coefficients (sin ω and cos ω), so how to stabilize these two coefficients is an essential point of the method, which is introduced in the next section.
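The three-step combination in Figure 1 amounts to a logical OR of the three single-feature cloud masks. A minimal sketch, using the illustrative thresholds reported later in the experiment section (−0.1 < NDVI < 0.21, WHITENESS < 0.1, HOT > 1050) and invented 2 × 2 feature values:

```python
import numpy as np

# Invented 2x2 feature images standing in for the three computed features.
ndvi      = np.array([[0.05, 0.60], [0.00, 0.70]])
whiteness = np.array([[0.05, 1.20], [0.30, 1.10]])
hot       = np.array([[1200., 300.], [900., 200.]])

# Single-feature cloud masks (thresholds are the experiment section's values).
cloud_ndvi      = (ndvi > -0.1) & (ndvi < 0.21)
cloud_whiteness = whiteness < 0.1
cloud_hot       = hot > 1050

# Hybrid rule: a pixel is cloud if ANY single-feature mask flags it.
cloud = cloud_ndvi | cloud_whiteness | cloud_hot
```

The OR rule deliberately favors over-detection, matching the paper's goal of minimizing the leakage recognition rate of clouds.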

Seasonal Datasets
The first step of the HOT feature algorithm is to select a clear-sky area in the image and calculate the clear sky line. However, due to the influence of external factors such as solar irradiation intensity and aerosol concentration, the clear-sky area and the clear sky line differ from image to image, so the conventional method requires manual selection of the clear-sky area. In this section, a clear-sky background field is established to determine the clear sky line and the HOT feature parameter of each image, which avoids manual selection of the clear-sky area and improves efficiency. A large number of clear, cloudless multispectral images from the satellite are selected as the data source to establish the seasonal clear-sky background field, as Figure 2 shows. After preprocessing and geometric registration, the coordinates of the different images are unified. Then, the blue-band and red-band datasets of all images are averaged separately, and the mean images of the two datasets (the clear-sky background field) are obtained with a program written in C#.
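The construction of the seasonal background field reduces to a pixel-wise mean over co-registered clear scenes. A minimal sketch (shown in Python rather than the C# used in the paper; the scene stacks here are random stand-ins for real co-registered GF-1 band data):

```python
import numpy as np

# Hypothetical stacks of co-registered clear-sky scenes for one season:
# shape (n_scenes, rows, cols), one stack per band (values are stand-ins).
rng = np.random.default_rng(0)
blue_stack = rng.uniform(0.05, 0.15, size=(30, 4, 4))
red_stack  = rng.uniform(0.04, 0.12, size=(30, 4, 4))

# Pixel-wise mean over all scenes gives the seasonal clear-sky background
# field for each band.
blue_background = blue_stack.mean(axis=0)
red_background  = red_stack.mean(axis=0)
```

One such pair of mean images per season then serves as the reference data for fitting that season's clear sky line.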

Linear Regression
The next step is to calculate the clear sky line for every season. Firstly, the red-blue band scatter diagram of all pixels in the seasonal background field is constructed, and the clear sky line is fitted using the least squares method. The result is shown in Figure 3. In Figure 3, the horizontal axis represents the blue band and the vertical axis represents the red band. The red-band and blue-band pixel values of the clear-sky background field in each season are linearly correlated. The regression coefficients b and a of the clear sky line equation in each season are shown in Table 4. It can be seen from Table 4 that the coefficient of determination R² of the clear sky line equation in each season is high, which indicates that the background field can be used as the reference data for calculating the HOT feature value. From the regression equation of the clear sky lines and Equation (4), Equations (5) and (6) can be derived. Therefore, the sin ω and cos ω in Equation (4) can be derived, as Table 5 shows.
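The clear-sky line fit can be sketched as an ordinary least-squares regression of red on blue over the background-field pixels, from which sin ω and cos ω follow directly from the slope; the synthetic data below stand in for a real background field (true slope 0.85, intercept 0.02 are invented).

```python
import math
import numpy as np

# Synthetic background-field pixels: red is approximately linear in blue.
rng = np.random.default_rng(1)
blue = rng.uniform(0.05, 0.25, size=500)
red = 0.85 * blue + 0.02 + rng.normal(0.0, 0.001, size=500)

# Least-squares fit of the clear-sky line: red = b * blue + a.
b, a = np.polyfit(blue, red, 1)

# The slope gives the clear-sky line's angle omega, hence the two HOT
# coefficients (the step Equations (5) and (6) describe).
omega = math.atan(b)
sin_w, cos_w = math.sin(omega), math.cos(omega)
```

Repeating this fit on each seasonal background field yields one (sin ω, cos ω) pair per season, as in Table 5.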

Dynamic Thresholds
Variability in surface characteristics over the vast territory of China makes using a fixed threshold for cloud detection unreasonable. In theory, there is no fixed threshold for nationwide cloud detection of multispectral features. Therefore, a dynamic threshold calculation method combined with solar elevation angle is adopted.

Coefficients of Formula
Kawano [22] holds that, for the visible and near-infrared bands, the relationship between the local solar elevation angle and the image reflectance can be defined by Equation (7):

γ = α · sin ϑ + β, (7)
where γ is the reflectance of the pixel and ϑ is the solar elevation angle. The values of α and β are obtained by linear regression analysis of a series of point pairs composed of γ and sin ϑ. In this section, 10 high-resolution multispectral images of Heilongjiang in 2015 are selected for band-wise linear fitting, and the coefficients relating solar elevation angle and reflectance in the blue, green, red and near-infrared bands of GF-1 images are obtained separately, as Table 6 shows. The flow chart of cloud detection with dynamic thresholds is shown in Figure 4.
(1) Suppose that the reference image's central point is x and its solar elevation angle is ϑ_x. After radiometric correction and atmospheric calibration, the three cloud detection feature parameters are calculated respectively. Then the thresholds (t_NDVI1, t_NDVI2, t_WHITENESS and t_HOT) are adjusted iteratively until the cloud detection accuracy of the reference image with these thresholds exceeds 90%.
(2) The pixel value of the ith band at the reference image's central point x, x_bi, can be obtained through Equation (7) and the coefficients in Table 6, as shown in Equation (8).
(3) Calculate the NDVI, WHITENESS and HOT of the central point x of the reference image, as shown in Equation (9). The values of sin ω and cos ω can be determined from the acquisition time of the image and Table 5.
(4) Calculate the differences (N) between the features of the reference image and the thresholds, as shown in Equation (10).
(5) Suppose that the central point of the target image (the image whose clouds need to be detected) is y and its solar elevation angle is ϑ_y. After radiometric correction and atmospheric calibration, the pixel value of the ith band at the image's central point y, y_bi, can be obtained through Equation (7) and the coefficients in Table 6, as shown in Equation (11).
(6) Calculate the NDVI, WHITENESS and HOT of the central point y of the target image, as shown in Equation (12). The values of sin ω and cos ω can be determined from the acquisition time of the image and Table 5.
(7) Calculate the thresholds (T) of NDVI, WHITENESS and HOT of the image whose clouds need to be detected, as shown in Equation (13).
When we use this method to detect clouds in GF-1 images, we only need to determine the thresholds of the reference image once and save them as fixed parameters; there is no need to tune the thresholds every time. After that, the method automatically calculates each image's feature thresholds from the reference image, which makes it possible to detect clouds in different images over a large-scale area.
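Steps (1)-(7) above can be sketched end-to-end for a single feature. The Kawano coefficients and the tuned reference threshold below are invented stand-ins for Table 6 and the real tuning step, and only the NDVI branch is shown; WHITENESS and HOT follow the same pattern.

```python
import math

# Invented (alpha, beta) per band, standing in for Table 6:
# reflectance_i ~= alpha_i * sin(theta) + beta_i  (Equation (7)).
coeff = {"blue": (0.20, 0.05), "green": (0.22, 0.04),
         "red": (0.25, 0.03), "nir": (0.40, 0.02)}

def band_reflectance(theta_deg):
    # Steps (2)/(5): predicted band reflectance at a solar elevation angle.
    s = math.sin(math.radians(theta_deg))
    return {band: alpha * s + beta for band, (alpha, beta) in coeff.items()}

def ndvi_of(refl):
    # Steps (3)/(6), NDVI branch of Equations (9)/(12).
    return (refl["nir"] - refl["red"]) / (refl["nir"] + refl["red"])

# Step (1): threshold tuned once on the reference image (illustrative value).
t_ndvi = 0.21

# Steps (2)-(4): feature at the reference image's centre (theta_x = 60 deg)
# and the offset N between that feature value and the tuned threshold.
ndvi_x = ndvi_of(band_reflectance(theta_deg=60.0))
n_ndvi = ndvi_x - t_ndvi

# Steps (5)-(7): feature at the target image's centre (theta_y = 35 deg)
# and the transferred threshold T for the target image (Equation (13)).
ndvi_y = ndvi_of(band_reflectance(theta_deg=35.0))
t_ndvi_target = ndvi_y - n_ndvi
```

The transferred threshold shifts with the target image's solar elevation angle while preserving the reference image's feature-to-threshold offset, which is the core of the dynamic adjustment.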

Experiment Design
This paper designed three experiments to demonstrate that the proposed method has high accuracy and efficiency. The first experiment compares the HMF with the single features described in Section 2, which explains why we chose the HMF rather than a single feature. The second experiment compares the proposed method with a random forest method in terms of accuracy and efficiency. The experimental environment is a personal computer with the Ubuntu 14.04 operating system, 8 GB of RAM, an Intel(R) Core(TM) i7-7700HQ CPU and Python 3.6. The third experiment applies the proposed method in parallel to detect clouds across the whole of China, which illustrates that our method can be applied over a large-scale area.

Comparison between the HMF and Single Features
In this section, the experimental data is a GF-1 multispectral image with 8 m spatial resolution in TIFF format, randomly selected from Heilongjiang province. The image is shown in Figure 5a. In the image, the underlying surface is vegetation with some scattered thick clouds and a small amount of thin cloud. The image is classified into five types of objects (thick cloud, thin cloud, vegetation, water body and bare land) by supervised classification. Then the NDVI, WHITENESS and HOT parameters of the image are calculated. Finally, the best threshold values of the three parameters for cloud recognition are obtained by bisection and visual interpretation: −0.1 < NDVI < 0.21, WHITENESS < 0.1, HOT > 1050. Cloud detection results with different feature combinations are shown in Figure 5. To quantitatively evaluate the accuracy of cloud detection with different feature combinations, the random selection and classification evaluation functions of ArcGIS are used to compute accuracy statistics for the cloud detection results. In this section, 100 corresponding points were randomly selected from the two images for each type of object and compared. The results are shown in Tables 7-9. Figure 5b and Table 7 demonstrate that cloud detection with NDVI alone can detect thick clouds and non-cloud objects well. The accuracy of thick cloud detection is 84%, but it does not work well for thin clouds, with an accuracy of only 35%. Comparing Figure 5b,c and Tables 7 and 8, it can be seen that the combination of NDVI and WHITENESS improves cloud detection compared with NDVI alone. However, the detection of thin clouds remains unsatisfactory, with an accuracy of only 59%. Table 9 gives the accuracy assessment of the cloud detection result with all three feature parameters.

Results
Through visual interpretation of the cloud detection results (Figure 5d), it can be clearly seen that both thick and thin clouds in the image are effectively detected, and the cloud-edge detection mask is highly consistent with the shape of the cloud edges. From Table 9, it can be seen that the detection accuracy of both cloud and non-cloud objects exceeds 90% when the three multispectral features are used together. Compared with thick clouds, some thin clouds are still misjudged, but the overall cloud detection accuracy is more than 90%, which meets the requirements of subsequent agricultural applications.

The Comparison to the Random Forest Method
We also conducted an experiment to compare the accuracy and efficiency of the proposed method with those of the random forest approach. We used the same programming language, Python 3.6, to implement both methods. The random forest approach is described in reference [31]. Four randomly selected GF-1 16-m images were used as the experimental data: one was used to collect training samples, and the others were used to test the accuracy and efficiency of cloud detection. The accuracy is shown in Table 10. The two methods show no apparent difference in detecting clouds, but random forest is more likely to misclassify crops as clouds. Furthermore, the average time taken by random forest to detect clouds in a GF-1 image is 12 h 21 min, whereas the proposed method needs only 2 h 4 min. Therefore, the proposed method has the advantage of computational efficiency over the random forest approach.

Large-Scale Cloud Detection
In order to evaluate the cloud detection capability over a large-scale area, randomly selected GF-1 multispectral remote sensing images across China are used to detect clouds. The results are shown in Figure 6, and the accuracy assessment is shown in Table 11. Through visual interpretation of the cloud detection results (Figure 6) and analysis of Table 11, it can be seen that thick cloud detection accuracy reaches 93.8% and thin cloud detection accuracy reaches 87.9%. Although the detection accuracy of thin clouds is slightly lower than that of thick clouds, the overall detection accuracy of cloud pixels is 90.8% and that of non-cloud pixels is 94.6%, which shows that both thick and thin clouds in the nationwide imagery are effectively detected. At the same time, this section presents statistics on the cloud detection speed for 8-m and 16-m multispectral images in batches of 1, 20, 50, 100 and 200 images. The statistical results are shown in Figure 7. It can be seen that the computing speed of the cloud detection system reaches the second level when processing GF-1 8-m multispectral images and the minute level for 16-m multispectral images, and the computing efficiency meets current application requirements.

Discussion
The fixed thresholds of each feature parameter found in Section 3.2 are partially inconsistent with the ranges of each parameter corresponding to clouds in Section 2.1. This is mainly because the two analyses used different datasets. It shows that, although the threshold method can distinguish cloud and non-cloud objects to a certain extent, the thresholds vary across regions and seasons, which implies that there is no fixed threshold for nationwide cloud detection with multispectral features. Thus, the dynamic threshold cloud detection method proposed in this paper is essential: it can calculate the corresponding thresholds for each GF-1 remote sensing image.
However, due to the overlap of the spectral characteristics of cloud and snow, and the lack in GF-1 imagery of the Short Wave Infrared (SWIR) band needed to calculate the Normalized Difference Snow Index (NDSI), it is easy to misjudge snow as cloud when using this method. The main aim of this work is to acquire the real reflectance of crops for subsequent agricultural applications. If crops are covered by snow, the reflectance in a remote sensing image cannot reflect the real information of the crop, so those pixels cannot be utilized either. Therefore, this kind of misjudgement actually has a positive effect and can be ignored in this study. However, for other applications, clouds and snow might need to be distinguished effectively; in that case, interested researchers could consider using texture features to distinguish clouds from snow. Furthermore, cloud shadow is also a challenge in cloud detection, and the method in this paper cannot address it. In future work, feature parameters that can detect cloud shadow will be considered.
In evaluating the accuracy of cloud detection results, this work, like most studies, still selects corresponding points randomly from the original and detected images, evaluates the accuracy by visual interpretation, and judges the cloud detection effect from the results. Therefore, how to evaluate cloud detection accuracy more objectively remains a difficulty at the current research stage and is the next research task of this study. This paper establishes a set of parameters from images of Heilongjiang province, even though the nationwide application has demonstrated that the method works; in the future, we will also explore how these parameters differ across regions.

Conclusions
Based on the characteristics of the GF-1 satellite sensor and the requirements of cloud detection in agriculture, this paper proposes a cloud detection method based on hybrid multispectral features with dynamic thresholds for GF-1 remote sensing images, to achieve a highly accurate and highly efficient distinction between clouds and crops. Through experiments and analysis of various multispectral features, this work preliminarily shows that NDVI, WHITENESS and HOT each provide a certain degree of discrimination between clouds and other ground objects in GF-1 satellite data. Experimental analysis further shows that cloud detection accuracy is highest when the three features are used in combination, exceeding 90%. Through the establishment of the clear-sky background field and linear fitting over the four seasons, the HOT parameter calculation equations for different seasons are obtained; the R² of the fitting results in spring, summer and autumn all exceed 0.95, the fitting effect is excellent, and the influence of different imaging times on the calculation of the HOT parameter is reduced as far as possible. At the same time, this paper adopts an effective method to dynamically acquire the thresholds of NDVI, WHITENESS and HOT according to the change of solar elevation angle, which minimizes the influence of different imaging positions on the three feature parameters' thresholds. In addition, experiments show that the proposed method can process 8-m and 16-m GF-1 images efficiently, reaching second-level and minute-level speeds respectively. In terms of accuracy, cloud detection carried out on random images across China achieves an overall accuracy of 90.8%. The main contributions of this work are as follows: 1. We explored an effective strategy, built on previous researchers' work and requiring no auxiliary data, to detect clouds in GF-1 images, for which some popular algorithms are not suitable. 2. The proposed method has been applied in scientific production and supports other research in the preprocessing of remote sensing images [33] and crop mapping [38,39].