Cloud removal using scattering model and evaluation via semi-realistic simulation

ABSTRACT Cloud removal is an essential task in remote sensing data analysis. Because the image sensors are far from the Earth's surface, part of the area of interest is often covered by cloud. Moreover, the atmosphere in between adds a constant haze layer to the acquired images. To recover the ground image, we propose to use a scattering model for a temporal sequence of images of a scene within the framework of low-rank and sparse models. We further develop a variant that is much faster and yet more accurate. To measure the performance of different methods objectively, we develop a semi-realistic simulation method to produce cloud cover so that various methods can be quantitatively analysed. This enables detailed study of many aspects of cloud removal algorithms, including verifying the effectiveness of the proposed models against the state of the art, including deep learning models, and addressing the long-standing problem of determining regularization parameters. A theoretical analysis of the range of the sparsity regularization parameter is provided and verified numerically.


Introduction
As the imaging sensors are deployed kilometres above the Earth's surface, cloud usually appears in the acquired images, which complicates data analysis tasks. It is desirable to remove the cloud entirely to recover the clean ground scene, which gives rise to the cloud removal task. Due to the versatility of remote sensing imagery, cloud removal methods have to align with the characteristics of the sensors, for example, multiple channels or a single band. Meanwhile, the platform is a decisive factor in the design of the algorithm, for example, its computation and power consumption restrictions. Furthermore, the analysis tasks to be performed after cloud removal have some influence as well. So one has to consider all possible contributing factors in the modelling process.
Our data are single-band satellite images of the same scene sampled at different time points, subject to light-to-moderate random cloud coverage in various regions. The aim is to recover images without cloud, i.e. clear images revealing the ground scene, so that subsequent analysis such as object detection and tracking can be performed reliably. Therefore fidelity is the most important factor: the recovered images must be as close as possible to the truth, not just 'visually fit' (plausible-looking from afar). Unfortunately, there is no objective assessment other than visual checking, and one of the goals of this paper is to fill this gap.
We focus on non-deep-learning-based methods for cloud removal, although the latest deep learning methods were used as contenders in our empirical studies, subject to code availability (e.g. Zheng, Liu, and Wang 2021; Sarukkai et al. 2020). The reason is that the fidelity of the recovered images is a concern for deep-learning-based methods. The workflow of these methods consists of two steps: first, identify cloud-covered areas and remove them; second, apply generative models to fill in the removed pixels. Generative adversarial network (GAN) based models are a popular choice for image completion. However, the working mechanism of GANs and their variants relies heavily on the training data, whose distribution is modelled by transforming a specified random distribution, e.g. a uniform or multivariate Gaussian distribution. Essentially, a GAN is a kind of density estimator. The question then is: what if the scene sampled by the satellite never appears in the training data? The GAN will certainly generate something for the missing areas but cannot stretch outside its modelled distribution. Therefore we consider other alternatives, for example temporal mosaicing (Guo et al. 2016, 2017). Although enforcing spatial smoothness is its most time-consuming component, fidelity is assured in the sense that no 'alien pixels' are inserted into the images, as GAN-based methods do. Another possibility is matrix completion for missing pixel filling, for example Wen et al. (2018) and its later development Zhang et al. (2019). The main model behind these methods is low-rank robust principal component analysis (Emmanuel et al. 2011), which came from a long line of development of robust PCA (RPCA) (De la Torre and Black 2001; Junbin Gao, Guo, and Guo 2009) that improves the robustness of classic linear PCA by reducing its sensitivity to outliers.
The elegance of RPCA compared to its peers lies in the simplicity of its formulation as well as its theoretical guarantee for the recovery of the low-rank signal and sparse noise. Applying RPCA implies that the observed images are the summation of low-rank ground images and sparse cloud cover images (images containing only cloud, without the ground scene). This arrangement makes sense assuming that the ground scene changes little (after excluding misalignment and geometric distortion) and that clouds cover only a small portion of the scene. The low-rank condition on the ground component dictates how missing pixels are filled, and hence RPCA has better interpretability than GAN methods.
It seems that the aforementioned two-step workflow can be consolidated into a single step using RPCA. Nonetheless, the two-step strategy was still adopted (e.g. in Wen et al. 2018; Zhang et al. 2019), for no obvious reason, with RPCA used only for cloud identification and a low-rank matrix completion following after the cloud-affected areas are masked out. Two questions remain, though. Firstly, where is the atmosphere modelled in the image data? The atmosphere appears as a thin haze layer in the acquired images, which may not be negligible. Secondly, is the simple additive model in RPCA really the right description of the physics? Apparently not. The most realistic model so far for satellite images is the so-called atmosphere scattering model (Narasimhan and Nayar 2002). Therefore one should build the atmosphere effect into the model for cloud removal and ground scene recovery.

Models considering atmosphere effects
We first describe RPCA-based methods in the setting of imagery applications. Let $I_i \in \mathbb{R}^{d_1 \times d_2}$ be the $i$-th sampled image of size $d_1 \times d_2$, $i = 1, \ldots, n$, and $D = [\mathrm{vec}(I_1), \ldots, \mathrm{vec}(I_n)]$, where $\mathrm{vec}(X)$ vectorizes the matrix $X$ into a column vector; hence $D \in \mathbb{R}^{d \times n}$ with $d = d_1 d_2$. The RPCA model used in Wen et al. (2018) and Zhang et al. (2019) is

$$\min_{L, C}\ \|L\|_* + \lambda \|C\|_1 \quad \text{s.t.}\ D = L + C, \tag{1}$$

where $\|X\|_*$ is the nuclear norm of $X$, i.e. the sum of all singular values of $X$, which is the convex envelope of the matrix rank; $\|X\|_1$ is the $\ell_1$ norm of $X$; $L$ is the initial recovered ground images; and $C$ is the cloud cover images, both the same size as $D$. The regularization parameter $\lambda$ is usually fixed to $1/\sqrt{d}$ as recommended in Emmanuel et al. (2011). By incorporating group sparsity (defined by super-pixels) into (1), Zhang et al. (2019) claim slightly better performance. After solving (1), both methods proceed to matrix completion (2) with a mask derived from $C$, where $\Omega$ is the mask matrix of size $d \times n$ with 0's for masked-out elements and 1's elsewhere, $\bar{\Omega}$ is the negated version of $\Omega$ (0's and 1's flipped), and $S_\Omega$ is the projection of $S$ onto $\Omega$, i.e. masking out the elements indicated by 0's in $\Omega$. The $ij$-th element of the mask matrix is $[\Omega]_{ij} = 1$ if $[C]_{ij} > \gamma\,\sigma(\mathrm{vec}(C))$ and $[\Omega]_{ij} = 0$ otherwise, where $\sigma(v)$ is the standard deviation of $v$ and $\gamma \in [0, 1]$ is a pre-set ratio. $B$ is the final recovered ground images, which are supposed to be cloud free, and $S$ is the noise. In the implementation, $\gamma = 0.8$, $\alpha = 0.1/\sqrt{d}$ and $\beta = 1$.
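To make the RPCA decomposition in (1) concrete, below is a minimal Python/NumPy sketch of an inexact augmented Lagrangian (ADMM-style) solver. The paper's released code is in Matlab; the function names, the $\mu$ schedule and the iteration budget here are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def soft_threshold(X, tau):
    # Element-wise soft thresholding: the proximal operator of tau*||X||_1.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(A, tau):
    # Singular value thresholding: the proximal operator of tau*||X||_*.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def rpca(D, lam=None, n_iter=200):
    """min ||L||_* + lam*||C||_1  s.t.  D = L + C  (inexact ALM sketch)."""
    d, n = D.shape
    lam = 1.0 / np.sqrt(max(d, n)) if lam is None else lam
    mu = 1.25 / np.linalg.norm(D, 2)   # illustrative initial penalty
    L, C, Y = (np.zeros_like(D) for _ in range(3))
    for _ in range(n_iter):
        L = svt(D - C + Y / mu, 1.0 / mu)              # low-rank update
        C = soft_threshold(D - L + Y / mu, lam / mu)   # sparse update
        Y += mu * (D - L - C)                          # dual ascent
        mu = min(mu * 1.5, 1e7)                        # penalty schedule
    return L, C
```

On a synthetic low-rank-plus-sparse matrix, this sketch separates the two components to high accuracy within a few dozen iterations.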
Both problems are convex with two blocks of variables. There are many gradient-projection-based solvers for them under the ADMM (Alternating Direction Method of Multipliers) framework (Boyd and Vandenberghe 2004). They all work reasonably well for moderate image sizes, for example, $d_1 = d_2 = 1024$ and $n = 7$.
The critical step is (1), where the cloud cover $C$ is supposed to be separated. Note that the decomposition of the observed data $D = L + C$ reflects the basic model assumption. As mentioned earlier, this departs from reality by ignoring the atmosphere effect. So instead of the simple additive model we propose to use atmosphere scattering (Narasimhan and Nayar 2002), $D = L \odot (1 - C) + C$, and hence optimize

$$\min_{L, C}\ \|L\|_* + \lambda \|C\|_1 \quad \text{s.t.}\ D = L \odot (1 - C) + C,\quad L, C \in [0, 1]^{d \times n}, \tag{3}$$

where $X \odot Y$ is the element-wise product of matrices $X$ and $Y$ of the same size. In this formulation, it is assumed that the pixels of the observed images are rescaled to $[0, 1]$, which is easily done by dividing by the maximum digital number of the sensor (not the maximum of the observed values). Note that (3) is no longer a convex problem, as the equality constraint is not affine. It is much more difficult to solve on its own, let alone with the box constraints clamping the elements of both $L$ and $C$ within $[0, 1]$. Nonetheless, there are still some strategies for the optimization, for example, introducing a dummy variable $X$ to untangle the interaction between $L$ and $C$ and proceeding with normal ADMM. However, we observed that this does not converge well enough to be practically useful. Instead, we employ linearization using the primal accelerated proximal gradient method (Kei Pong et al. 2010) for its ease in handling the entangled nuclear norm optimization and its stability. Ignoring constants, the Lagrangian of (3) with proximity is

$$\mathcal{L}(L, C; Y) = \|L\|_* + \lambda \|C\|_1 + \frac{\mu}{2}\Big\|D - L \odot (1 - C) - C + \frac{Y}{\mu}\Big\|_F^2, \tag{4}$$

where $\|X\|_F$ is the Frobenius norm of $X$, $Y \in \mathbb{R}^{d \times n}$ is the Lagrange multiplier for the equality constraint, and $\mu \geq 0$ is the proximity coefficient. The subproblems (5) and (6) for $C$ and $L$ are in proximal form, and the box constraints in (3) are ignored at this stage; they will be handled later by feasibility projection after all unknowns are updated. We adopt alternating minimization with respect to $L$ and $C$. Minimizing $\mathcal{L}$ with respect to $L$ is difficult because of the term $L \odot (1 - C)$, although there is not much trouble for $C$.
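The difference between the additive constraint in (1) and the scattering constraint in (3) can be seen on a few pixels. This small Python sketch (not from the paper's code base) shows that under the scattering model an opaque cloud ($C = 1$) saturates the observation regardless of the ground value, while the additive model can produce non-physical values above 1:

```python
import numpy as np

def additive_model(L, C):
    # RPCA assumption: observation = ground + cloud.
    return L + C

def scattering_model(L, C):
    # Atmosphere scattering: cloud attenuates the ground signal and
    # adds its own radiance; everything stays in [0, 1].
    return L * (1.0 - C) + C

L = np.array([0.2, 0.5, 0.9])   # ground reflectance
C = np.array([0.0, 0.5, 1.0])   # cloud opacity: clear, haze, opaque

print(scattering_model(L, C))   # stays in [0, 1]; opaque cloud hides the ground
print(additive_model(L, C))     # 0.9 + 1.0 = 1.9 exceeds the physical range
```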
The gradients of (4) with respect to $C$ and $L$ involve $\partial\|X\|_1$ and $\partial\|X\|_*$, the subgradients of the $\ell_1$ norm and the nuclear norm respectively. The stationary point of (7) gives the closed-form solution

$$C^*_{ij} = \mathrm{sign}(-b_{ij})\,\max\!\Big(\frac{|b_{ij}| - \lambda}{a_{ij}},\, 0\Big), \tag{9}$$

where $a_{ij} = \mu[(L - 1) \odot (L - 1)]_{ij}$, $b_{ij} = \mu[(D - L + \frac{Y}{\mu}) \odot (L - 1)]_{ij}$, and $\mathrm{sign}(x)$ is the sign function, taking $1$ when $x > 0$ and $-1$ when $x < 0$. This is straightforward soft thresholding for $\ell_1$ norm minimization; the only difference is that the regularization is not global but local, or adaptive, as the regularization parameter $\lambda$ is rescaled by each $1/a_{ij}$ in (9). In contrast, there is no closed-form solution for $\frac{\partial \mathcal{L}}{\partial L} = 0$, because singular value thresholding (SVT) (Cai, Candès, and Shen 2010) only works for the general form

$$\min_X\ \tau \|X\|_* + \frac{1}{2}\|X - A\|_F^2, \tag{10}$$

where $\tau > 0$ is an arbitrary scalar (normally a regularization parameter) and $A$ is a matrix of the same size as $X$. The solution to (10) is $X^* = \mathcal{S}_\tau(A)$, where $\mathcal{S}_\tau(A)$ is the so-called SVT operator

$$\mathcal{S}_\tau(A) = U \max(\Sigma - \tau I,\, 0)\, V^T,$$

with $U$ and $V$ from the SVD of $A$, i.e. $A = U \Sigma V^T$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$.
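As a sanity check of the adaptive thresholding in (9), a per-element implementation can be written in Python directly from $a_{ij}$ and $b_{ij}$. The small epsilon guarding against $L = 1$ and the final clamp to $[0, 1]$ are our own implementation choices:

```python
import numpy as np

def update_C(D, L, Y, mu, lam):
    # Closed-form C update for (3): adaptive soft thresholding, cf. (9).
    a = mu * (L - 1.0) ** 2 + 1e-12   # epsilon avoids division by zero at L = 1
    b = mu * (D - L + Y / mu) * (L - 1.0)
    C = np.sign(-b) * np.maximum(np.abs(b) - lam, 0.0) / a
    return np.clip(C, 0.0, 1.0)       # feasibility projection onto [0, 1]
```

For a single pixel with $D = 0.8$, $L = 0.5$, $Y = 0$, $\mu = 1$ and $\lambda = 0.01$, this gives $C = 0.56$, matching the minimizer of the scalar objective $\lambda|c| + \frac{\mu}{2}(D - L(1 - c) - c)^2$.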
To work around this, we linearize the smooth part $f_s$ in (6) by a first-order Taylor expansion with a proximal term around $L_k$, the $k$-th iterate of $L$, and optimize $L$ while holding the other variables constant:

$$L_{k+1} = \arg\min_L\ \|L\|_* + \Big\langle \frac{\partial f_s}{\partial L}\Big|_{L_k},\, L - L_k \Big\rangle + \frac{\ell_p \mu}{2} \|L - L_k\|_F^2. \tag{11}$$

Here $\frac{\partial f_s}{\partial L}\big|_{L_k} = -\mu\big(D - L_k \odot (1 - C) - C + \frac{Y}{\mu}\big) \odot (1 - C)$, and $\ell_p$ is the Lipschitz constant of the gradient of $f_s(L)$ (modulo the factor $\mu$). It is straightforward to see that $\ell_p = 1$ due to the box constraints on $C$ and $L$. This leads to

$$L_{k+1} = \mathcal{S}_{\frac{1}{\ell_p \mu}}\Big(L_k - \frac{1}{\ell_p \mu}\, \frac{\partial f_s}{\partial L}\Big|_{L_k}\Big). \tag{12}$$

This linear approximation is very convenient, as the interaction between $L$ and $C$ has been removed and therefore (12) has a closed-form solution using SVT. We apply Nesterov acceleration to speed up the process, which is proven to converge for (12) with carefully chosen optimization parameters (Nesterov 2003). This iterative procedure for $L$ has to be embedded into the optimization of (6), and hence there are two loops in the entire algorithm. The detailed optimization algorithm for solving (3) is listed in Alg. 1. Note that the box constraints are satisfied by clamping in Alg. 1, the feasibility projection commonly used in many implementations (Jun Liu and Jieping 2009). We call the model in (3) and its realization in Alg. 1 the atmosphere cloud removal model, or ATM for short.
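The linearized update (12) amounts to a gradient step on the smooth coupling term followed by one SVT. A minimal Python sketch (the clamp is the feasibility projection mentioned above; the function names are ours):

```python
import numpy as np

def svt(A, tau):
    # SVT operator: shrink the singular values of A by tau.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def update_L(D, L_k, C, Y, mu, ell_p=1.0):
    # One linearized proximal step for L in (3), cf. (11)-(12).
    grad = -mu * (D - L_k * (1.0 - C) - C + Y / mu) * (1.0 - C)
    L = svt(L_k - grad / (ell_p * mu), 1.0 / (ell_p * mu))
    return np.clip(L, 0.0, 1.0)   # feasibility projection onto [0, 1]
```

Note that when $C = 1$ everywhere (fully opaque cloud) the gradient vanishes and the step reduces to pure singular value shrinkage of $L_k$, which is the expected degenerate behaviour.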
Algorithm 1 Solving the atmosphere scattering model for cloud removal in (3). (The main loop runs while $k < k_{\max}$ and the objective change exceeds $10^{-3}$.)

Due to the double iterative procedure for solving $L$, ATM is expected to be slow. However, the recovered cloud component $C$ is closer to reality than that obtained from RPCA, as shown in Figure 1, where the source images are of the same scene at seven time points captured by the GaoFen-4 (GF-4) satellite. GF-4 is the first Chinese geosynchronous-orbit remote sensing satellite, equipped with one array camera with a resolution of 50 m for the panchromatic band and 400 m for the MWIR band. The test panchromatic data were captured on 20 August 2016, with geographic coordinates 18°09′34″–18°37′27″ north latitude and 108°56′30″–109°48′28″ east longitude (Hainan Province, China). Since a single-field GF-4 satellite image is 10,240 × 10,240 pixels, we cropped the western part of Hainan Province, which contains lots of clouds, as the test data.
It is clear that the ATM-detected clouds are much brighter than those detected by RPCA, thanks to its detailed atmosphere model, at the cost of a much higher computational load, as shown in Figure 10. This motivates us to reduce its computational cost while maintaining model capacity. The key is to disentangle the interaction between $L$ and $C$ that breaks convexity. Let us take a closer look at the core of the ATM model in (3). Writing the constraint as $D = L - C \odot L + C$ and introducing an element-wise upper bound $P$ of $C$, we have $D = (1 - P) \odot L + (P - C) \odot L + C$, where, under the choice of $P$, $(P - C) \odot L \geq 0$. The above can be written as $D = \tilde{L} + N + C$ (13) for $\tilde{L} = (1 - P) \odot L$ and $N = (P - C) \odot L$. One can easily write out an optimization problem equivalent to (3) using (13), but with many coupling conditions, which complicate the optimization. However, if we drop some coupling conditions via relaxation and approximation, the problem becomes much easier to solve, and the coupled problem is still a special case of the relaxed version. So we optimize

$$\min_{L, C, N}\ \|L\|_* + \lambda \|C\|_1 + \frac{\beta}{2}\|N\|_F^2 \quad \text{s.t.}\ D = L + C + N,\quad L, C \in [0, 1]^{d \times n}. \tag{14}$$

Note that in the above $L$ replaces $\tilde{L}$, which is an approximation. We highlight that this is a relaxed version of (3) with its own interpretation: $N$ acts as a thin haze layer accounting for the atmosphere.
In (14) the values in $N$ are controlled by the Frobenius norm. It is well known that the Frobenius norm does not encourage sparsity but compresses values towards zero uniformly. Depending on the value of $\beta$, $C + N$ can reach so-called $\alpha$-sparsity (Yonina and Kutyniok 2012), i.e. sparsity beyond the value $\alpha$. Note that we fix $\beta = 1$ throughout this paper.
Equation (14) is significantly easier to solve than Equation (3), being convex with no interaction terms in the low-rank component. Although direct generalization of ADMM to more than two blocks of variables, as in (14), may not converge, as discussed in Chen et al. (2014) with hand-crafted counter-examples, in many other applications and in the vast number of experiments we carried out the optimization converged quite quickly. The detailed optimization algorithm is listed in Alg. 2. We call the model in (14) and its optimization algorithm in Alg. 2 alternative ATM, or aATM for short.
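A minimal Python sketch of the relaxed model (14) under three-block ADMM may look as follows. The penalty schedule and stopping rule are illustrative choices of ours, not Alg. 2 itself:

```python
import numpy as np

def soft_threshold(X, tau):
    # Proximal operator of tau*||X||_1.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(A, tau):
    # Proximal operator of tau*||X||_* (singular value thresholding).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def aatm(D, lam, beta=1.0, n_iter=150):
    """min ||L||_* + lam*||C||_1 + (beta/2)*||N||_F^2  s.t.  D = L + C + N."""
    mu = 1.25 / np.linalg.norm(D, 2)
    L, C, N, Y = (np.zeros_like(D) for _ in range(4))
    for _ in range(n_iter):
        L = np.clip(svt(D - C - N + Y / mu, 1.0 / mu), 0.0, 1.0)
        C = np.clip(soft_threshold(D - L - N + Y / mu, lam / mu), 0.0, 1.0)
        N = mu * (D - L - C + Y / mu) / (beta + mu)   # ridge update for haze
        Y += mu * (D - L - C - N)
        mu = min(mu * 1.2, 1e7)
    return L, C, N
```

The $N$ update is a simple ridge (Frobenius) step, which is why the third block costs almost nothing compared with the SVT step for $L$.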
Algorithm 2 Solving (14). For the regularization parameters, we provide a theoretical analysis of the range of the main regularization parameter λ in Appendix A. The results align with the empirical study outcomes presented in the next section. Furthermore, we will present an empirical equation, based on numerical fitting, to determine the best value of λ as guidance for practical use. Figure 2 shows the flow chart of our method, which is indeed very straightforward. The input stack of images is first normalized by dividing by the maximum digital number of the instrument. Then the regularization parameter λ is determined by the procedure provided in Section 3.4. Finally, Algorithm 2 is run to obtain the clean images and the cloud cover. We point out that all our models can be used directly to recover ground images, unlike the main contenders (Wen et al. 2018; Zhang et al. 2019), where a matrix completion (MC) step has to follow, although it is debatable whether MC is necessary. However, without some sort of ground truth, such a debate cannot be settled. To address this long-standing issue, we design a semi-realistic simulation of cloud-covered images so that all methods can be analysed quantitatively.

Simulation and performance indicator
Cloud removal experiments are normally conducted on real satellite images, and the effectiveness of the recovery is evaluated by visual checking and by scoring the cloud cover with IoU (Intersection over Union), which originated in computer vision (Liggins et al. 1997) and is basically the Jaccard index (Jaccard 1912). The ground truth of the cloud cover is obtained by time-consuming manual labelling of cloud clusters. Due to the complex nature of cloud, it is extremely difficult to delineate the boundaries of cloud clusters accurately, especially for thin clouds, and hence manual cloud segmentation contains a large amount of error. An ideal solution is to build a cloud model capturing the shape and formation of all sorts of cloud, thick or thin. Unfortunately this is quite involved in physics and mathematics and is a multi-facet problem (Dobashi et al. 1999, 2017; Yuan et al. 2014; Yumeihui Xing et al. 2017). Even if the cloud cover is known, the other side of the problem, far more important than the cloud, is the ground truth of the ground scene: the ultimate goal of cloud removal is to recover the ground scene accurately. Current practice largely relies on subjective evaluation, i.e. 'eye-balling', which is apparently very vulnerable to bias. Therefore, an objective and robust evaluation is highly desirable. The work in Zheng, Liu, and Wang (2021) used an overly simplified method to train a U-Net for cloud separation by simulating random strips of white rectangles, or boxes with a brightest-to-darkest colour gradient, on top of clear ground images. This is a bit primitive: not only are these far from real clouds but, most importantly, the regular shapes reduce the complexity of the problem. Inspired by the success of Perlin noise (Perlin 1985, 2002) in the simulation of virtual landscapes, we adopt Perlin noise to generate synthetic cloud.
We take a cloud-free image, say from the Inria aerial image labelling dataset (Maggiori et al. 2017), convert it to greyscale as the true ground image $I$ (pixels rescaled to $[0, 1]$), and generate multiple 2D Perlin noise images of the same size as the image, as $C_i$, $i = 1, \ldots, n$. The observed image $I_i$ is then

$$I_i = I \odot (1 - C_i) + C_i,$$

where the pixels in $C_i$ are rescaled to $[0, 1]$. Optionally, one can apply a transformation $f$ to $I$ before combining with the clouds, e.g. geometric distortion, to study particular aspects of the methods; or generate a base $C$ and apply dynamics to $C$ to produce a cloud time series mimicking cloud movement. We leave these for future work. By varying the parameters of the Perlin noise generator, we can control the density of the generated clouds, from sparse to heavy cover. We also optionally apply some image correction, e.g. Gamma correction and histogram equalization, to enhance the similarity to real clouds and haze.
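A simplified version of this simulation can be sketched in Python. Note that, rather than gradient-based Perlin noise proper, the sketch below uses fractal 'value noise' (bilinearly upsampled random grids summed over octaves), which is our own stand-in producing a similar cloud-like texture; all function names and parameters are illustrative:

```python
import numpy as np

def _upsample(grid, shape):
    # Bilinear upsampling of a coarse random grid to the target shape.
    gh, gw = grid.shape
    ys = np.linspace(0, gh - 1, shape[0])
    xs = np.linspace(0, gw - 1, shape[1])
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, gh - 1); x1 = np.minimum(x0 + 1, gw - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = grid[np.ix_(y0, x0)] * (1 - wx) + grid[np.ix_(y0, x1)] * wx
    bot = grid[np.ix_(y1, x0)] * (1 - wx) + grid[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def fractal_cloud(shape, octaves=5, persistence=0.5, seed=0):
    # Sum smoothly upsampled random grids at increasing frequencies;
    # lower 'persistence' gives a smoother, more haze-like cover.
    rng = np.random.default_rng(seed)
    noise = np.zeros(shape)
    amp, total = 1.0, 0.0
    for o in range(octaves):
        res = 2 ** (o + 2)
        noise += amp * _upsample(rng.random((res, res)), shape)
        total += amp
        amp *= persistence
    noise /= total
    # Rescale to [0, 1] so the result can serve as cloud opacity C.
    return (noise - noise.min()) / (noise.max() - noise.min())

def simulate(I, n=7, seed=0):
    # Observed sequence via the scattering model: I_i = I*(1 - C_i) + C_i.
    out = []
    for i in range(n):
        C = fractal_cloud(I.shape, seed=seed + i)
        out.append(I * (1.0 - C) + C)
    return out
```

Because $I$ and each $C_i$ lie in $[0, 1]$, every simulated observation also stays in $[0, 1]$, consistent with the rescaling assumption of the models.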
As the ground truth is readily accessible, we can apply any suitable quantitative evaluation to the cloud removal methods for detailed study. Given that the main focus is the fidelity of the recovered images, we quantify the goodness of recovery by

$$r = \frac{1}{n} \sum_{i=1}^{n} \frac{\|\hat{I}_i - I\|_F}{\|I\|_F}, \tag{15}$$

where $\hat{I}_i$ is the recovered image from a given method. The quantity $r$ defined in (15) is a normalized distance metric, which is not meant to be the best; other, more sophisticated measures could certainly be applied. However, (15) is sufficient by virtue of the equivalence of norms (Şuhubi 2010, Ch. 6.6), although in the modelling process different norms affect model behaviour vastly.
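The metric in (15) is essentially a one-liner; here is a Python version (if the paper's exact normalization differs, any fixed relative norm serves the same comparative purpose):

```python
import numpy as np

def recovery_error(recovered, truth):
    # Mean relative Frobenius distance between each recovered frame
    # and the cloud-free ground truth, cf. (15).
    t = np.linalg.norm(truth, 'fro')
    return float(np.mean([np.linalg.norm(R - truth, 'fro') / t
                          for R in recovered]))
```

A perfect recovery gives $r = 0$; a uniform offset of 0.1 on an all-ones truth gives $r = 0.1$.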

Performance evaluation on simulations of a single image
Thanks to the above semi-realistic simulation, we can now investigate another important aspect: the regularization parameters used in the models. Using the goodness of recovery $r$, we can determine the best values from large-scale randomized trials. Meanwhile, we can also verify the necessity of the MC step. Let us first visually check the outcomes of the different methods on one set of simulated images. The true image is from the Inria dataset, named tyrol-w1, from Lienz in the Austrian Tyrol, resized to $1024 \times 1024$ ($d = 2^{20}$). It is a mixture of urban and natural scenes with some high-intensity areas such as roads and roof tops, shown in Figure 3. We simulate seven thin cloud covers. One simulated image and the cloud layer are also shown in Figure 3.
The cloud looks very natural. Note that the cloud cover image appears to be sparse, as large dark areas exist, as shown in the histograms in the top panel of Figure 4, where the right one shows details in the range $[0, 0.2]$. However, these values are not exactly zero and correspond to thin haze. If one thresholds them to zero, the cloud cover becomes very artificial visually. The bottom panels in Figure 4 show thresholded results, by 0.1 and 0.2 respectively from left to right. The visible cloud boundaries are unpleasant and counter-intuitive, due to the lack of the critical smoothness commonly present in natural images with cloud. This also shows the tremendous difficulty of manually separating clouds in real images. Figure 5 shows the images recovered by the different methods with the regularization parameters set to $\lambda = 1/\sqrt{d} = 1/1024$ and $\beta = 1$ in aATM. Simple visual checking tells us that aATM and RPCA are better than ATM, as the ATM results (with and without MC) contain a fair amount of cloud pixels. This much is straightforward. However, it is not clear which one is the best. It appears that aATM is slightly better, with fewer 'washed away' areas. It is also impossible to identify the effect of MC. These observations indicate the limits of visual examination. Nonetheless, the $r$ values of these methods are 0.1758, 0.3195, 0.1754 in the order aATM, ATM, RPCA with MC, and 0.1678, 0.3373, 0.1681 without MC. Now it is clear that aATM without MC is the best, and MC does nothing useful to enhance the results. Figure 6 shows the cloud covers detected by these methods. The clouds separated by ATM are in better contrast, i.e. very bright and very dark, although it appears very conservative, i.e. visually sparser than the others. In contrast, aATM and RPCA seem to identify more cloud pixels. Again, it is impossible to tell the difference between the aATM cloud and the RPCA cloud by visual examination. Note that for aATM the cloud is the summation of $C$ and $N$.
One major benefit of simulation is that it allows us to study the sensitivity of the regularization parameters, mainly λ, in the models. We ran large-scale simulations with $n = 7$ and $15$, using the same true image. We tested 51 values of λ equally spaced on a log scale with the recommended value $1/\sqrt{d}$ in the middle, i.e. from 9.7656e-05 to 0.0098, and for each value of λ we ran 50 randomized trials. The results are collected in Figures 7 and 8, where each data point is the mean and $\pm 1$ standard deviation of the $r$ values across all trials for a given λ value. Many things can be read from the plots. The first is that ATM is not as good as its competitors for small $n$, e.g. $n = 7$, regardless of the choice of λ. However, it begins to gain an advantage when $n$ is larger. This will be investigated later. The second is that λ has roughly three zones: 1) the failure zone, where the sparsity is too weak and all methods fail with no recovered images; 2) the clamping zone, where the sparsity is overwhelming, the sparse component is wiped out, and all methods lose the capacity to identify cloud; 3) the Goldilocks zone, where the algorithms work reasonably well ($r \approx 0.2$ for $\lambda \in [0.0007, 0.0012]$), including at their best. Of course, these zones have different boundaries for different methods, and the $r$ values inside the zones have different shapes. For example, RPCA has rather flat $r$ values in its Goldilocks zone, meaning that its performance varies only a little when λ is drawn from that zone. There exists a value of λ that is better than the default recommended value. This holds true for all methods, interestingly with different margins. For example, for RPCA the margin is smaller: the optimal value of λ brings a 17.23% reduction in the $r$ value on average in the $n = 7$ case, versus 42.11% for aATM. Similar observations hold for $n = 15$.
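For reference, the 51-point log-spaced grid just described can be reproduced as follows (a small Python sketch; the span of one decade on either side of the default is inferred from the reported endpoints):

```python
import numpy as np

d = 1024 ** 2                        # d = d1*d2 for 1024 x 1024 images
default_lam = 1.0 / np.sqrt(d)       # recommended default, 1/1024
# 51 lambda values equally spaced in log scale, with the default in the middle:
lams = np.logspace(np.log10(default_lam) - 1, np.log10(default_lam) + 1, 51)
print(lams[0])    # 9.765625e-05, matching the reported lower end
print(lams[25])   # the default 1/1024
print(lams[-1])   # ~0.0098, matching the reported upper end
```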
When all methods take the default value of λ, aATM without MC works best on average, 22.84% better than RPCA in the expectation sense. The overall best performance of aATM against that of RPCA is a 43.06% reduction in the $r$ value, down from 0.1625 to 0.0941, which is very significant. This is verified by a one-sided t-test, with the null hypothesis of no $r$ value reduction, performed on the trials with the optimal and default λ values at significance level $\alpha = 10^{-5}$. The resulting p-value is extremely low, $1.7557 \times 10^{-29}$, strongly supporting the alternative hypothesis that the reduction is significant. A very interesting observation is that ATM without MC comes second when $n = 15$ in terms of overall best performance, better than RPCA. Figure 9 reveals the details of the $r$ values of both methods in the trials when the λ value is held constant at $\lambda = 4.6741 \times 10^{-4}$, the optimal value for both methods. The $r$ values of each method vary over the trials due to the randomness of the simulation. RPCA has higher $r$ values almost throughout, with greater variation than ATM. There is no doubt that ATM outperforms RPCA when λ is optimal. The third observation is that MC does not bring much improvement, and even acts adversely when λ is in the Goldilocks zone. This claim is strongly supported by statistical evidence. Table 1 shows the one-sided t-test results performed on the trials of the various methods with optimal λ values for both the $n = 7$ and $n = 15$ cases. The null hypothesis is that MC brings an $r$ value reduction on average, i.e. that the mean of $r_{\mathrm{MC}} - r_{\overline{\mathrm{MC}}}$ is no greater than 0, where $r_{\mathrm{MC}}$ and $r_{\overline{\mathrm{MC}}}$ are the $r$ values of a method with and without MC respectively. The significance level α is set as low as $10^{-5}$. The p-values are extremely low, suggesting that the null hypothesis should be rejected almost surely. The only exception is ATM when $n = 7$, where MC does further improve performance.
So the clear recommendation is to omit the MC step in these cloud removal methods: it is extra computation with little benefit. We point out, though, that MC has regularization parameters as well, for which we took the default/recommended values; see the previous sections for details. To see the comparison more clearly, we present the means and standard deviations of the results over a λ range inside the Goldilocks zone in Tables 2 and 3. The double-ruled column in the middle of each table gives the values when λ is equal to the default value. The column-wise best (minimum among all methods) is highlighted in italic and the overall best in bold. The tables show clearly that aATM is the best in terms of both the expected $r$ value and stability, reflected in smaller standard deviations.

Computation cost comparison on simulations using a single fixed image
We now report computation times. Figure 10 shows the time consumed by the various methods, with and without MC. As in previous plots, each data point is the mean and $\pm 1$ standard deviation of the times (in seconds) across all trials for a given λ value; they naturally vary across simulations. It is quite obvious that MC is extra work; given that it brings no extra benefit, its computation should be saved. ATM is indeed pretty difficult to solve, reflected in its skyrocketing computational time compared with the others, owing to the double optimization loop inside its solver. Interestingly, when λ is in the right range, ATM takes the most time to compute on average. When λ grows from the failure zone into the Goldilocks zone, a huge jump in the required computation can be observed, which is statistically significant. As ATM's performance turns very sharply along the λ axis, its computational cost varies accordingly, peaking where ATM works best and dropping quickly afterwards. This is a very interesting observation that may lead to a way of selecting its regularization parameter, as well as a hypothesis relating required computational cost to the λ value. Along with the well-known regularization path in sparse models (Hastie et al. 2004; Friedman, Hastie, and Tibshirani 2010; Tibshirani and Taylor 2011), this may be a useful route towards optimal regularization selection in future. Such analysis was never possible previously without simulation. In general, aATM is more expensive to compute than RPCA because of the extra block of variables $N$, almost doubling the cost for all λ values. However, the base is quite small: when λ is in the Goldilocks zone, aATM goes from RPCA's roughly 6 seconds to about 10 seconds, which is not dramatic.
Again, we present the means and standard deviations of the results over a λ range inside the Goldilocks zone in Tables 4 and 5. The double-ruled column in the middle of each table gives the values when λ is equal to the default value. The column-wise best (minimum time among all methods) is highlighted in italic and the overall best in bold. The tables show clearly that RPCA is the fastest and that aATM is about 50% more expensive to run in this range of λ values. Considering its superior recovery performance, this cost is absolutely worthwhile. Both aATM and RPCA exhibit the same pattern observed for ATM, though less pronounced: when λ goes from the failure zone to the Goldilocks zone, there is a leap in time cost, which stabilizes for a while and then shows some ups and downs. Again, this zone-changing pattern of the time cost is a good indicator of entering the Goldilocks zone from the failure zone. It may be possible to exploit it for finding a better λ value than the default one, although this is trickier than for ATM, where the pattern is very clear.

Determining the best λ value
What is the best value for the regularization parameter λ? This is an inevitable and yet critical question in practice. It is almost impossible to answer without many assumptions and lengthy theoretical analysis; refer to Appendix A, where bounds on the Goldilocks zone are derived, to see the complexity. However, thanks to simulation, we can fit the data to derive an equation for the best λ value. Rather than drilling into the computational cost pattern suggested by the previous section, we look at the best λ values of the different methods as $n$ is stretched from 2 to 250. The 'best', denoted $\lambda^*$, is defined as the λ value corresponding to the minimum average $r$ value across trials. Figure 11 shows the ratio of $\lambda^*$ (for all methods) to the suggested default value $1/\sqrt{d}$, i.e. $\sqrt{d}\,\lambda^*$. As $n$ becomes larger, $\sqrt{d}\,\lambda^*$ decreases exponentially. We turn this into an almost linear relationship by applying the log twice to $n$, as shown in the right panel of Figure 11. From these data, we fit a linear model and derive the $\lambda^*$ estimator in (16). The red curves in Figure 11 are the values of $\sqrt{d}\,\hat{\lambda}^*$ at different scales of $n$.
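The fitting step behind this estimator can be sketched in Python. The coefficients of the paper's fitted equation are not reproduced here, so the numbers in the usage below only illustrate the double-log fitting procedure, not the published estimator:

```python
import numpy as np

def fit_lambda_model(ns, lam_stars, d):
    # Fit log(sqrt(d)*lambda*) as a linear function of log(log(n)),
    # mirroring the double-log transform used for Figure 11.
    x = np.log(np.log(np.asarray(ns, dtype=float)))
    y = np.log(np.sqrt(d) * np.asarray(lam_stars, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def predict_lambda(n, slope, intercept, d):
    # Invert the transform to estimate lambda* for a new sequence length n.
    return np.exp(intercept + slope * np.log(np.log(n))) / np.sqrt(d)
```

Given a table of empirically optimal $\lambda^*$ values over a range of $n$, `fit_lambda_model` recovers the slope and intercept, and `predict_lambda` then plays the role of the estimator for unseen $n$.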

Performance evaluation on simulations of multiple images
Now we are ready for more comprehensive tests. One last question is how these evaluations hold up across different $n$ (image sequence lengths) and different scenes. To this end, we picked three other images from the Inria dataset, chicago1, kitsap1 and vienna1, and ran the same randomized trials with $n = 5, 7, 10, 12, 15, 20$, each with 50 repeats. The ground truth images are displayed in Figure 12. In this experiment, we bring in the state-of-the-art deep learning methods of Sarukkai et al. (2020) (called STGAN+Resnet and STGAN+Unet) and Zheng, Liu, and Wang (2021) (called UNET and UNET+GAN) for a thorough comparison. STGAN provides two variants, using Resnet and Unet backbone networks; UNET separates cloud and ground only, and UNET+GAN uses a GAN to fill thick-cloud-covered areas. The training of these deep learning models strictly followed the procedures in their code repositories and was optimized for best performance as per the instructions. For our models and RPCA, λ was automatically determined by (16) for each $n$. The overall impression is that the deep learning methods are not as good, although their performance is quite stable across different $n$ values. They may have some advantage when $n$ is small, say $n = 5$; STGAN is better than RPCA on kitsap1, although no match for aATM and ATM when $n > 7$. The deep learning methods show large performance variations across different scenes, while the others are rather consistent. All sparse models achieve better $r$ values as $n$ grows, which suggests that a strategy to boost performance is to increase the sampling frequency moderately. This makes perfect sense, as more images provide more information about the pixels covered by cloud, and it is more likely that an area covered in one image is uncovered in another. The rank minimization in aATM/ATM/RPCA is designed to exploit exactly this. That MC is unnecessary is once again verified in this test.
The added value of MC is only observable for ATM when n is small, i.e. n < 15. The above observations give a clear answer to the questions raised at the beginning of this paper and justify our motivation. Deep learning methods in general have lower fidelity (higher r values). This uncertainty raises many questions for subsequent applications. They may perform well on some specific scenes, for example a purely natural scene like kitsap1. However, it is not clear how the GAN's distribution transformation works, and the large performance variation reveals their difficulty in dealing with different situations. Our models do not have these issues and are interpretable in terms of their working mechanism.

Conclusions
In this paper, we introduced the atmospheric scattering model into the cloud removal modelling process and proposed two ATM models as superior alternatives to the RPCA-based model. Furthermore, we proposed a method to simulate controllable cloud cover scenes. This semi-realistic simulation enables detailed study of various cloud removal methods and provides valuable insights into several aspects of the algorithms, as large-scale randomized trials and quantitative analysis become available. Examining the methods with this powerful experimental tool, we saw clearly that the proposed aATM outperforms not only the RPCA model, the state of the art in this category of non-deep-learning-based cloud removal methods, but also the latest deep learning models built on large-scale backbone networks, by quite a large margin. There were many interesting findings along the way, for example the zoning of the regularization parameter λ, the computational cost pattern across zones, and automated regularization parameter determination. These would hardly be possible without the assistance of the simulation. We envisage robust development of cloud removal algorithms under this framework in the near future. For this purpose, we release the Matlab code at https://github.com/yguo-wsu/clouldremoval.

Disclosure statement
No potential conflict of interest was reported by the authors.

$$\max\left\{\left\| \sum_{i=1}^{l} H_i + \sum_{j>l}^{n} e_j H_j \right\|_F\right\} = \sqrt{\operatorname{tr}\left[V_n U_n^T U_n V_n^T\right]} = \sqrt{n}$$

where $U_n = [u_1, \ldots, u_n]$ and $V_n$ likewise.

Remark A.2. From the above, we can see that, when $\lambda < \frac{1}{\sqrt{dn}}$, the only admissible solution is to nullify the elements in $L$, in which case $L^\star = 0$ and $C^\star = D$. This is what we have seen in Figure 7, where, when $\lambda$ is very small, the $r$ value is 1 as $L = 0$. aATM has the same result although it has another regularization term, because the infimum of $\lambda$ occurs only when $\beta = 1$; otherwise $\beta$ would further reduce the value of $\lambda$.
This matches the purpose of regularization. When $\lambda$ is too small, the penalty on sparsity is next to null, so the sparse component is free. The sensible choice is then to set the low rank component to zero, so that the objective is quite small, although this is a trivial solution. Similar logic applies to the maximum value of $\lambda$.
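This degenerate small-λ regime can be seen directly in the soft-thresholding (shrinkage) step that RPCA-style solvers typically use to update the sparse component; the sketch below is generic, not the exact solver of this paper:

```python
import numpy as np

def soft_threshold(X, lam):
    """Elementwise shrinkage S_lam(X) = sign(X) * max(|X| - lam, 0),
    the proximal operator of lam * ||X||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)

rng = np.random.default_rng(0)
D = rng.random((100, 10))          # stand-in data matrix

# With lambda far below 1/sqrt(d*n), shrinkage barely penalizes sparsity,
# so the sparse update leaves C essentially equal to D ...
lam = 1e-6
C = soft_threshold(D, lam)
# ... and the trivial solution L = 0, C = D costs almost nothing.
```

Each entry of `C` differs from `D` by at most `lam`, which is why the recovered low rank component collapses to zero in this regime.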
Note that the extremal values of $\lambda$ deduced in Lemma A.1 are for general cases, in other words, without specific conditions. The minimum value of $\lambda$ is very close to the recommended value of $\lambda$ in RPCA, while the maximum is rather loose, due to the generality. In fact, we have the following tighter upper bound of $\lambda$. Lemma A.3. For a given $D \in \mathbb{R}^{d \times n}$, assuming $d > n$, the maximum value of $\lambda$ in ATM, aATM with $\beta = 1$ and the RPCA model, written as $\lambda_m$, is $\|UV^T\|_\infty$, where $U$ and $V$ are from the skinny SVD of $D = U\Sigma V^T$ and $\|A\|_\infty$ is the matrix infinity norm, i.e. the maximum absolute value of all its elements.
Proof. The proof is similar to that of Lemma A.1 except that now we consider only one data set $D$, i.e. $\sup_\lambda \{\lambda \ge 0 : \text{(A.4)}\}$, as the sparse component $C^\star$ is totally zero. $\sum_{i=1}^{l} H_i = UV^T$ and all its elements are surely less than or equal to $\|UV^T\|_\infty$. Therefore, when $\lambda = \|UV^T\|_\infty$, the condition in (A.4) holds. This is equivalent to setting $\forall j,\, e_j = 0$ as in (A.3).
The upper bound of $\lambda$ in Lemma A.3 is much better than that in Lemma A.1, especially when $d \gg n$ and $D \in [0, 1]$. However, it is possible to further quantify $\lambda = \|UV^T\|_\infty$ without an actual SVD. We give asymptotic results here for the upper bound of $\|UV^T\|_\infty$ and hence $\lambda_m$. To proceed, we need the following proposition to bound $\|UV^T\|_\infty$. Proposition A.4. For any matrix $A$ of size $d \times n$ ($d > n$) and its skinny SVD $A = U\Sigma V^T$, the following holds: $\|UV^T\|_\infty \le \frac{\|A\|_\infty}{\sigma_n(A)}$, where $\sigma_i(A)$ is the $i$th largest singular value of $A$, so $\sigma_n(A)$ is the smallest singular value of $A$.
Proof. Following the same way of thinking as in the previous lemmas, we see that $\|UV^T\|_\infty$ can only bound the infinity norm of matrices, spanned by the same bases, whose Frobenius norm is at most $\sqrt{n}$. In other words, for any such matrix $X$ of size $d \times n$, if $\|X\|_F \le \sqrt{n}$ then $\|X\|_\infty \le \|UV^T\|_\infty$. Also, $\left\|\sum_i \sigma_i u_i v_i^T\right\|_F^2 = \sum_i \sigma_i^2$; therefore, if $\sigma_i \le 1$ for all $i$, then $\left\|\sum_i \sigma_i u_i v_i^T\right\|_F \le \sqrt{n}$. Combining,

$$\ge \min_{\|v_1\|_2 = 1,\, v_1 \in \mathbb{R}^n} \|\tilde{A} v_1\|_2^2 + \min_{\|v_2\|_2 = 1,\, v_2 \in \mathbb{R}^n} \|\varepsilon E v_2\|_2^2 + \min_{\|v_3\|_2 = 1,\, v_3 \in \mathbb{R}^n} v_3^T \tilde{A}^T E v_3.$$

Note that we use $v_j$, $j = 1, 2, 3$, to highlight that these minimizations are separate, and hence the above holds. $\tilde{A}^T E$ in the third term gives $\tilde{A}^T E = (s_1, s_2, \ldots, s_n)^T \mathbf{1}_n^T$, where $s_i = \sum_j^d a_{ij}$, $a_{ij}$ is the $ij$th element of $A$, and $\mathbf{1}_n$ is the all-ones vector of length $n$. As the $a_{ij}$'s are drawn from a centralized population, under the asymptotic condition $s_i = 0$ almost surely, and hence the third term vanishes. The first two terms are the squares of the smallest singular values of the corresponding matrices, i.e. $\tilde{A}$ and $E$. Since $\sigma_i(E) = 0$ for $i > 1$, we have the required inequality by using Theorem A.5.
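The quantity $\lambda_m = \|UV^T\|_\infty$ of Lemma A.3 and the bound of Proposition A.4 can be checked numerically. A minimal sketch (the matrix `D` is random stand-in data; since Proposition A.4 holds under asymptotic conditions, its bound is computed here for inspection rather than asserted):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 10
D = rng.random((d, n))              # stand-in data matrix with entries in [0, 1]

# Skinny SVD: D = U diag(s) V^T, with U of size d x n and V of size n x n
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# Lemma A.3: the tighter upper bound lambda_m = ||U V^T||_inf
lam_m = np.max(np.abs(U @ Vt))

# Two facts from the derivation: every element of U V^T is at most 1
# (rows of U and V have norm <= 1), and ||U V^T||_F^2 = tr(V U^T U V^T) = n
assert lam_m <= 1.0
assert np.isclose(np.linalg.norm(U @ Vt, 'fro') ** 2, n)

# Proposition A.4 (asymptotic): ||U V^T||_inf <= ||D||_inf / sigma_n(D);
# computed for inspection only, since the result is asymptotic
bound = np.max(np.abs(D)) / s[-1]
print(lam_m, bound)
```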
Combining Lemma A.3 and Proposition A.4, we have the following corollary.
B.1 UNET

Each cloud-covered training image was obtained by combining a clear (cloud-free) ground image with a cloud-only image via the atmospheric scattering model, as follows: (1) Clear ground images were taken from the INRIA aerial image data set (INRIA), which contains 180 training images. Each image has 5000 × 5000 pixels with a spatial resolution of 0.3 m. As directed in [32], each ground image was cropped into non-overlapping patches of 256 × 256 pixels; each ground image therefore yields 361 non-overlapping patches of size 256 × 256. In total, a dataset of 64,980 ground images was generated.
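A minimal sketch of the patch extraction in step (1) (the function and variable names are ours, not from [32]): ⌊5000/256⌋ = 19, so each 5000 × 5000 image yields 19 × 19 = 361 non-overlapping 256 × 256 patches, and 180 × 361 = 64,980 patches in total.

```python
import numpy as np

def crop_patches(img, size=256):
    """Crop a 2-D image into non-overlapping size x size patches,
    discarding the leftover border (5000 is not a multiple of 256)."""
    h, w = img.shape
    return [img[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

ground = np.zeros((5000, 5000), dtype=np.float32)  # stand-in for an INRIA tile
patches = crop_patches(ground)
print(len(patches))                                # 19 * 19 = 361 per tile
```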
(2) Following the instructions in [32], we collected 150 cloud images from the NOAA, all from 2018, to form a training set. These are multispectral (RGB + infrared) images with a resolution of 1920 × 1080 pixels, which were preprocessed into greyscale images. After preprocessing, each cloud image was downsampled and cropped into 16 non-overlapping patches of size 256 × 256. Finally, a dataset of 900 cloud images was obtained. (3) Cloud and clear images were randomly combined using the atmospheric scattering model to form cloud-covered image data, giving a total of 64,980 images for model training.
Finally, the model structure and hyperparameters, including optimization parameters, were set as required in [32].
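The blending in step (3) can be sketched with the standard scattering form I = J·t + A·(1 − t), where J is the clear ground patch, t the transmission and A the airlight. The mapping from cloud greyscale to transmission and the airlight value below are illustrative assumptions, not the exact settings of [32]:

```python
import numpy as np

def combine(ground, cloud, airlight=1.0):
    """Blend a clear ground patch with a cloud patch via the scattering
    model I = J*t + A*(1 - t). Thicker cloud -> lower transmission t.
    The mapping t = 1 - cloud is an illustrative assumption."""
    t = 1.0 - cloud                      # transmission from cloud thickness
    return ground * t + airlight * (1.0 - t)

rng = np.random.default_rng(1)
ground = rng.random((256, 256))                        # stand-in clear patch
cloud = rng.random((256, 256)) * 0.5                   # light-to-moderate cloud
cloudy = combine(ground, cloud)                        # simulated cloudy patch
```

Where the cloud patch is zero, the transmission is 1 and the ground patch passes through unchanged; bright (thick) cloud pixels are pulled towards the airlight value.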

B.2 STGAN
We used the training data from the original paper [24]. The dataset in that paper was constructed from publicly available real Sentinel-2 images. For training STGAN, a total of 3130 sets of 256 × 256 image pairs were constructed; each set was obtained by pairing one cloudless ground image with three cloud-covered images of the same area at different times. The training, validation and test datasets were split in the ratio 8:1:1, giving 2504, 313 and 313 image groups respectively. Once again, the model structure and hyperparameters, including optimization parameters, were set as required in [24].