CD-GAN: a robust fusion-based generative adversarial network for unsupervised remote sensing change detection with heterogeneous sensors

In the context of Earth observation, change detection boils down to comparing images acquired at different times by sensors of possibly different spatial and/or spectral resolutions or different modalities (e.g., optical or radar). Even when considering only optical images, this task has proven to be challenging as soon as the sensors differ by their spatial and/or spectral resolutions. This paper proposes a novel unsupervised change detection method dedicated to images acquired by such so-called heterogeneous optical sensors. It capitalizes on recent advances that formulate the change detection task within a robust fusion framework. Adopting this formulation, the work reported in this paper shows that any off-the-shelf network trained beforehand to fuse optical images of different spatial and/or spectral resolutions can easily be complemented with a network of the same architecture and embedded into an adversarial framework to perform change detection. A comparison with state-of-the-art change detection methods demonstrates the versatility and the effectiveness of the proposed approach.


Introduction
In the context of Earth observation, change detection (CD) aims at identifying spatial areas that have been altered between two remote sensing image acquisitions [38,4]. With the rapid development of urban areas and the frequent occurrence of natural disasters such as floods and earthquakes, CD has become an essential task to monitor land cover evolution [12,14,41,28]. To be effective, CD methods have to account for the characteristics of the two sensors and for their possible dissimilarities. When the two sensors share the same modality and the same spatial and spectral resolutions, they are described as homogeneous; otherwise, they are described as heterogeneous. For instance, if two optical sensors have the same spectral and spatial resolutions, they can be qualified as homogeneous. Otherwise, they are qualified as heterogeneous even if they share the same (optical) modality. Due to the rapid development of remote sensing technologies, there is a growing availability of images coming from heterogeneous sensors. This offers new opportunities to perform timely CD, which is valuable in emergency situations.

The pipeline of most CD methods consists of three steps: image pre-processing, change image (CI) generation and binary change map (CM) generation [5]. In particular, when dealing with heterogeneous sensors, the pre-processing step aims at allowing a reliable pixel-by-pixel comparison to be performed between the two images. For that purpose, apart from colorimetric corrections, the pre-processing aims at producing two images of the same size where two homologous pixels (i.e., with the same coordinates) depict the same spatial location in the observed scene. The second step then generates the CI by comparing the pre-processed images pixel-by-pixel. The last step performs high-level processing, such as segmentation or classification, on the CI to label the altered pixels and/or groups of pixels.

This paper focuses on the challenging CD scenario that involves optical sensors with different spatial and/or spectral resolutions. An archetypal scenario, considered in this paper for illustrative purposes and generally referred to as complementary acquisition, involves two optical images of complementary resolutions, i.e., a high spatial and low spectral resolution (HRLS) image and a low spatial and high spectral resolution (LRHS) image. Adopting a conventional dichotomy over the optical sensors, this could be i) a pair composed of a panchromatic (PAN) image and a multispectral (MS) image, ii) a pair composed of a PAN image and a hyperspectral (HS) image or iii) a pair composed of a MS image and an HS image.

When facing dissimilar spatial and/or spectral resolutions, the pre-processing, i.e., the first step of the CD pipeline outlined above, relies on various strategies [21,11,34,51,20]. They mainly consist in processing each image individually to make a pixel-by-pixel comparison possible. To reach the same spatial and spectral resolutions, interpolation and resampling are classically used [2,1]. For instance, a synthetic HRLS image can be obtained by spatially interpolating and spectrally filtering a LRHS image. Another possibility is to spatially downsample the HRLS image and spectrally filter the LRHS image so that they end up with the same low spatial and low spectral resolutions (LRLS). As a consequence, the pre-processed images are both embedded into the same feature space where they can be compared pixel-by-pixel [33,32]. However, since they handle the two images individually, the above pre-processing strategies do not fully exploit their interdependence and complementarity in terms of spatial and spectral information.

Instead of individually pre-processing the two images, a general CD framework introduced in [9] and extended in [10,8] tackles the CD problem as a multiband image fusion task. To fully and jointly exploit the spatial and spectral information brought by the two observed images, they are merged, through a fusion process, into a latent (i.e., unobserved) image. While avoiding any spatial and/or spectral degradation of the observed images, this fusion process allows a CD map to be subsequently recovered with the best spatial and spectral resolutions. This strategy has been shown to outperform state-of-the-art CD methods relying on feature space embedding.
Meanwhile, some recent works exploit the versatility of deep neural networks to perform CD in the case of homogeneous or heterogeneous sensors [23,51,22,47]. For instance, in [25], cycle-consistent adversarial networks (CycleGANs) learn a subimage-to-subimage mapping that allows the LRHS image to be embedded into a HRLS image space. However, most of these deep architectures are designed to perform supervised CD, i.e., their training requires a data set of pre-change and post-change image pairs with annotated changes. This paper proposes an unsupervised CD method for heterogeneous sensors (e.g., handling a HRLS image and a LRHS image) by capitalizing on the fusion-based CD framework proposed in [8] while simultaneously leveraging recent advances in deep learning-based fusion, with no need for image pairs with annotated changes. More precisely, this paper shows that any pretrained deep network designed to fuse heterogeneous images can be reused as a building block of an adversarial architecture to perform unsupervised CD. This fusion network, whose design can be left to the final user, can be systematically complemented with a network of the same architecture to finally solve the CI inference problem. When compared to more conventional approaches, the strategy adopted in this paper appears to be more flexible while competing favorably. Indeed, thanks to the versatility of neural networks, this strategy can easily depart from simplifying assumptions formulated in the former contributions. Moreover, the proposed framework can also benefit from expected future advances in deep learning for Earth observation by allowing newly proposed and pre-trained fusion networks to be embedded.

This article is organized as follows. Section 2 recalls how the problem of CD can be cast as a robust fusion task. Capitalizing on this formulation, an adversarial framework is proposed in Section 3 and a possible architecture is detailed for the particular case of complementary acquisitions. Experimental results obtained on simulated data sets are reported in Section 4 to assess the efficiency, the versatility and the robustness of the proposed approach when compared to alternative CD methods. Section 5 illustrates the relevance of the approach when analyzing real data sets. Finally, Section 6 concludes the paper.

Background: robust fusion-based change detection
This paper focuses on the problem of detecting changes between two optical images denoted $\mathbf{Y}_1 \in \mathbb{R}^{m_1 \times n_1}$ and $\mathbf{Y}_2 \in \mathbb{R}^{m_2 \times n_2}$, respectively acquired over the same scene at different times $t_1$ and $t_2$ by two sensors $S_1$ and $S_2$. The integer values $m_i$ and $n_i$ stand for the number of bands and the number of pixels of the image $\mathbf{Y}_i$ ($i \in \{1, 2\}$), respectively. In our case of interest, the two images are assumed to have different spatial and/or spectral resolutions, i.e., $n_1 \neq n_2$ and/or $m_1 \neq m_2$.
Remark 1 (Complementary acquisitions). Throughout this paper, for illustrative purposes but without loss of generality, we will consider the particular case of images characterized by complementary resolutions, as in [10]. This can be stated by the following relations
$$n_1 < n_2 \quad \text{and} \quad m_1 > m_2. \tag{1}$$
In other words, $\mathbf{Y}_1$ and $\mathbf{Y}_2$ are LRHS and HRLS images, respectively. It is worth noting that the other applicative scenarios listed in [10] can be easily handled following the approach detailed in the following paragraphs.
Because of these different spatial and/or spectral resolutions, a pairwise comparison of the pixels in Y_1 and Y_2 cannot be conducted to locate the changes. To overcome this difficulty, the adversarial strategy proposed in this paper elaborates on the robust fusion framework proposed in [10] and generalized in [8]. This framework relates the two observed images Y_1 and Y_2 to two latent images $\mathbf{X}_1 \in \mathbb{R}^{m \times n}$ and $\mathbf{X}_2 \in \mathbb{R}^{m \times n}$ of the same high spatial and same high spectral resolutions, i.e., $n \geq \max\{n_1, n_2\}$ and $m \geq \max\{m_1, m_2\}$, through the observation models
$$\mathbf{Y}_1 \approx \mathcal{H}_1(\mathbf{X}_1), \qquad \mathbf{Y}_2 \approx \mathcal{H}_2(\mathbf{X}_2), \tag{2}$$
where $\mathcal{H}_1 : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m_1 \times n_1}$ and $\mathcal{H}_2 : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m_2 \times n_2}$ stand for degradation operators, which are assumed to be linear.

Remark 2 (Complementary acquisitions). In case of complementary acquisitions, the operators H_1(•) and H_2(•) stand for spatial and spectral degradations, respectively. In [10], as in most fusion works of the literature dealing with multiband imaging [45,6,46,42,15], they are defined as
$$\mathcal{H}_1(\mathbf{X}) = \mathbf{X}\mathbf{R}, \qquad \mathcal{H}_2(\mathbf{X}) = \mathbf{L}\mathbf{X},$$
where the matrix R is generally decomposed into R = BS, with B a spatially-invariant blurring matrix and S a subsampling matrix, and L is a spectral degradation matrix specified by the sensor spectral response. The approximation symbol ≈ accounts for mismodeling or measurement noise. The latent space $\mathbb{R}^{m \times n}$ embeds images whose spatial and spectral resolutions are defined by the HRLS and LRHS images respectively, i.e., $n = n_2$ and $m = m_1$.
It is worth noting that the latent images X_1 and X_2 have the same spatial and spectral resolutions, which are at least equal to those of the observed images. As a consequence, these two latent images can be compared pixel-wise to locate any changes that may have occurred between the two acquisition times. Based on this finding, CD can be reframed as an inference problem referred to as robust fusion. The two main steps of the problem-solving procedure are detailed below.

Change map inference step
If the latent images X_1 and X_2 were available, since they share the same spatial and spectral resolutions, a CI denoted $\Delta\mathbf{X} \in \mathbb{R}^{m \times n}$ of high spatial and high spectral resolutions could easily be derived by a pixel-by-pixel difference, i.e.,
$$\Delta\mathbf{X} = \mathbf{X}_2 - \mathbf{X}_1. \tag{3}$$
Conventional CD methods dedicated to homogeneous sensors, such as change vector analysis (CVA) and its extensions [3,7], can then be used to derive the binary CM from the CI $\Delta\mathbf{X} = [\Delta\mathbf{x}_1, \ldots, \Delta\mathbf{x}_n]$, where $\Delta\mathbf{x}_i$ is the $m \times 1$ vector associated with the pixel $i \in \{1, \ldots, n\}$ of $\Delta\mathbf{X}$. In its canonical formulation, CVA consists in computing the energy image $\mathbf{e} = [e_1, \ldots, e_n] \in \mathbb{R}^n$ of the change where, for $i = 1, \ldots, n$,
$$e_i = \left\| \Delta\mathbf{x}_i \right\|_2. \tag{4}$$
The components $e_i$ of $\mathbf{e}$ with the smallest (resp. highest) values most likely correspond to unchanged (resp. changed) pixels whose indices are gathered in the set $\Omega_u$ (resp. $\Omega_c$). Thus a natural decision rule consists in thresholding the energy change image, i.e., for $i = 1, \ldots, n$,
$$\hat{d}_i = \begin{cases} 1 & \text{if } e_i \geq \tau, \\ 0 & \text{otherwise,} \end{cases} \tag{5}$$
where the threshold $\tau$ balances the trade-off between the probability of false alarm and the probability of detection and can be adjusted using a dedicated method [29]. The final binary CM $\hat{\mathbf{d}} = [\hat{d}_1, \ldots, \hat{d}_n] \in \{0, 1\}^n$ can then be derived.
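For concreteness, the following is a minimal NumPy sketch of this CVA-based decision rule, assuming the CI is available as an m × n array; the function name is illustrative and the threshold selection method of [29] is not reproduced here.

```python
import numpy as np

def cva_change_map(delta_x: np.ndarray, tau: float) -> np.ndarray:
    """Canonical CVA decision rule of Eqs. (4)-(5).

    delta_x: change image of shape (m, n) -- m bands, n pixels.
    tau:     decision threshold balancing detection vs. false alarms.
    Returns a binary change map of shape (n,).
    """
    # Energy image: per-pixel Euclidean norm of the spectral change vector.
    e = np.linalg.norm(delta_x, axis=0)
    # Thresholding: 1 = changed, 0 = unchanged.
    return (e >= tau).astype(np.uint8)
```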

Fusion step
Reciprocally, if the CI ∆X were known, the latent images X_1 and X_2 could be inferred by solving a multiband image fusion problem. Indeed, adopting the formalism introduced in [10] and exploiting the identity (3) and the linearity of the degradation operators, the direct models (2) can be equivalently rewritten as
$$\mathbf{Y}_1 \approx \mathcal{H}_1(\mathbf{X}_1), \qquad \tilde{\mathbf{Y}}_2 \approx \mathcal{H}_2(\mathbf{X}_1), \tag{6}$$
where
$$\tilde{\mathbf{Y}}_2 \triangleq \mathbf{Y}_2 - \mathcal{H}_2(\Delta\mathbf{X}) \tag{7}$$
is the so-called corrected image, that is, the image that would be acquired by S_2 at t_1, i.e., before a change occurs. From the set of equations (6), it clearly appears that recovering the latent image X_1 from the observed and corrected images Y_1 and Ỹ_2 is an image fusion task. The other latent image X_2 can subsequently be derived from X_1 and ∆X using (3).
Remark 3 (Complementary acquisitions). Fusing a pair of images resulting from complementary acquisitions aims at recovering a latent image whose spatial and spectral resolutions are the highest among those of the observed images. This problem has motivated a large body of research works over the last three decades, specifically dealing with pansharpening when fusing PAN and MS images [39], hyperspectral pansharpening when fusing PAN and HS images [26] and multiband image fusion when fusing MS and HS images [44].

From fusion consistency to CD
It is worth noting that any fusion method should fulfill a consistency property, which states that a well-designed fusion process should be reversible: the observed image Y_1 is expected to be as close as possible to the image obtained by applying the acquisition process to the estimated fused image X̂_1 [40]. In other words, the so-called predicted image that should be observed by the sensor S_1 at time instant t_1, denoted Ŷ_1 and defined by
$$\hat{\mathbf{Y}}_1 \triangleq \mathcal{H}_1(\hat{\mathbf{X}}_1), \tag{8}$$
should satisfy Ŷ_1 ≈ Y_1. This property is the main rationale of the strategy introduced in [9], which conducts CD by alternating between the CM inference and the fusion steps. These two steps are achieved by iteratively solving two optimization problems whose complexities depend on the acquisition scenario. In the next section, we show that this consistency property allows the CD task to be formulated under an adversarial learning paradigm, by jointly conducting CM inference and fusion in a unified framework.
In particular, this framework embeds two networks dedicated to the two aforementioned tasks.Interestingly, since one task consists in performing fusion, any pre-trained network already designed and available off-the-shelf to perform this task can be reused as an essential building block.This framework is detailed in what follows.

Proposed adversarial framework
The previous section showed that CD between images acquired by heterogeneous sensors can be conducted by inferring a pair {X_1, X_2} of latent images or, equivalently, a pair {X_1, ∆X} formed by one latent image and the CI, with X_2 = X_1 + ∆X. This inference problem, coined robust fusion, can be solved using a procedure composed of two steps, namely CI inference and fusion, detailed in Sections 2.1 and 2.2, respectively. This paper shows that these two steps can be performed by two essential building blocks of an adversarial architecture. This architecture offers a high degree of modularity and thus opens the door to numerous refinement opportunities and subsequent performance improvements. The different stages of this architecture, depicted in Fig. 1 and referred to as CD-GAN in the sequel, as well as the training strategy, are detailed in this section. Finally, we provide a specific instance of this framework for the particular case of complementary acquisitions.

Overall architecture
CI inference can be interpreted as a mapping from the product set $\mathbb{R}^{m_1 \times n_1} \times \mathbb{R}^{m_2 \times n_2}$ of observed images towards the set $\mathbb{R}^{m \times n}$ of CIs. In this work, this unknown mapping is defined as
$$\Delta\hat{\mathbf{X}} = \mathsf{C}(\mathbf{Y}_1, \mathbf{Y}_2; \Theta_{\mathsf{C}}), \tag{10}$$
where C(•,•;Θ_C) stands for a deep network parameterized by Θ_C. This network solves the CD problem, possibly in the case of heterogeneous sensors. Indeed, once trained, it directly provides as output the estimated CI ∆X̂ from which the estimated binary change map d̂ can be obtained as detailed in Section 2.1. In a supervised setting, training this network would require triplets composed of a pair of observed images and the corresponding change image, which may significantly limit the applicability of the method. Conversely, we propose to train this network in an unsupervised setting, i.e., without assuming the availability of such triplets. For that, one can benefit from another network, already trained and fully available, to perform fusion. Indeed, keeping the notations introduced in the previous section, the fusion step can be formulated as a mapping from the product set $\mathbb{R}^{m_1 \times n_1} \times \mathbb{R}^{m_2 \times n_2}$ of observed and corrected images to the set $\mathbb{R}^{m \times n}$ of latent images, i.e.,
$$\hat{\mathbf{X}}_1 = \mathsf{F}(\mathbf{Y}_1, \tilde{\mathbf{Y}}_2; \Theta_{\mathsf{F}}), \tag{11}$$
where F(•,•;Θ_F) stands for a fusion network parameterized by Θ_F. In this work, we propose to benefit from recent efforts promoting deep architectures specifically designed to solve the fusion task. Thus, in the sequel of the paper, we will assume that the fusion network F(•,•;Θ_F) has been trained beforehand, hence Θ_F is fixed in the following. It is worth noting that this fusion network implicitly requires the knowledge of the CI ∆X, or its estimate ∆X̂ provided by the network C(•,•;Θ_C), since it takes as one input the corrected image Ỹ_2 defined by (7).
Training the network C(•,•;Θ_C) dedicated to the CI inference under an adversarial paradigm requires the design of a suitable discriminator, denoted D(•;Θ_D) in Fig. 1. One strategy would be to ensure that the output X̂_1 of the subsequent fusion step defined by (11) is in good agreement with the latent image X_1. This training strategy, generally adopted by the GAN-based fusion methods from the literature, requires a training set of triplets made up of the two observed images and of the associated latent image standing for the expected fusion result. Such supervised learning is difficult to achieve in practice due to the need for representative labelled data. We thus propose a different training strategy that relies only on the observed image Y_1. By leveraging the fusion consistency discussed in Section 2.3 and the expected properties of the predicted image Ŷ_1 defined by Ŷ_1 = H_1(X̂_1), we design a discriminative network which assesses the quality of the fusion network output through the value of
$$\hat{\omega} = \mathsf{D}(\Upsilon; \Theta_{\mathsf{D}}), \tag{12}$$
where Υ ∈ {Y_1, Ŷ_1}, ω ∈ {0, 1} is the binary label stating the likeness between the predicted image Ŷ_1 and the observed image Y_1 (ω = 1 meaning "alike" and ω = 0 meaning "different"), and Θ_D is the set of the discriminative network parameters.
To summarize, the overall proposed adversarial architecture can be decomposed into five main stages listed below (a code sketch of this pipeline follows the list):
• CI inference: a candidate CI ∆X̂ is inferred from the two observed images Y_1 and Y_2 by the network C(•,•;Θ_C),
• Correction: the candidate CI ∆X̂ is used to produce the corrected image Ỹ_2 following (7),
• Fusion: the corrected image Ỹ_2 is subsequently fused with the observed image Y_1 thanks to the already trained network F(•,•;Θ_F), providing the estimated latent image X̂_1,
• Prediction: the predicted image Ŷ_1 = H_1(X̂_1) is derived,
• Discrimination: the quality of the predicted image Ŷ_1 with respect to (w.r.t.) the observed image Y_1 is assessed thanks to the discriminative network D(•;Θ_D).
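To fix ideas, here is a minimal PyTorch-style sketch chaining these stages; all helper names (net_c, net_f, h1, h2) are hypothetical placeholders, the actual architectures being described in Section 3.3.

```python
import torch

def generator_forward(y1: torch.Tensor, y2: torch.Tensor,
                      net_c, net_f, h1, h2):
    """Chain the first four stages of CD-GAN (hypothetical helper names).

    y1, y2 : observed LRHS / HRLS image batches.
    net_c  : CI-inference network C(.,.; Theta_C), trainable.
    net_f  : fusion network F(.,.; Theta_F), pre-trained and frozen.
    h1, h2 : callables implementing the spatial / spectral degradations.
    """
    delta_x = net_c(y1, y2)        # 1) CI inference: candidate change image
    y2_corr = y2 - h2(delta_x)     # 2) Correction: corrected image, Eq. (7)
    x1_hat = net_f(y1, y2_corr)    # 3) Fusion: estimated latent image
    y1_hat = h1(x1_hat)            # 4) Prediction: predicted LRHS image
    return y1_hat, delta_x, x1_hat

# 5) Discrimination: the network D then scores y1_hat against the observed y1.
```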
Note that another interpretation of the proposed architecture can be drawn by observing that the sub-network highlighted as a beige dashed-lined box in Fig. 1 and denoted G(•,•;Θ_G) acts as a generator with parameters Θ_G = {Θ_C, Θ_F}. However, since the fusion network has been trained beforehand, the generator parameters that remain to be trained are the CI inference network parameters Θ_C. This generator, hence denoted G(•,•;Θ_C), takes as inputs the two observed images Y_1 and Y_2 and embeds the first four stages of the architecture to produce the predicted image Ŷ_1, i.e.,
$$\hat{\mathbf{Y}}_1 = \mathsf{G}(\mathbf{Y}_1, \mathbf{Y}_2; \Theta_{\mathsf{C}}) = \mathcal{H}_1\big(\mathsf{F}(\mathbf{Y}_1, \tilde{\mathbf{Y}}_2; \Theta_{\mathsf{F}})\big), \tag{13}$$
where we recall that $\tilde{\mathbf{Y}}_2 = \mathbf{Y}_2 - \mathcal{H}_2\big(\mathsf{C}(\mathbf{Y}_1, \mathbf{Y}_2; \Theta_{\mathsf{C}})\big)$. Introducing this generator allows the proposed architecture to be cast into an (almost) canonical GAN framework, with its typical loss function specified in the next paragraph. An alternative casting would gather the last stages of the architecture into an extended discriminator combined with a generator reduced to the sole network dedicated to the CI inference.

Loss function
As previously stated, the fusion network F(•,•;Θ_F) can be chosen as any state-of-the-art network (or even any model-based fusion algorithm) from the literature and is assumed to have been trained (or calibrated) beforehand. From there, two sub-networks have to be trained, the generative network C(•,•;Θ_C) and the discriminative network D(•;Θ_D), by minimizing a well-chosen loss function. First of all, the proposed approach prescribes the presence of an adversarial cost, denoted L_adv(Θ_C, Θ_D), in the loss function:
$$\mathcal{L}_{\mathrm{adv}}(\Theta_{\mathsf{C}}, \Theta_{\mathsf{D}}) = \mathbb{E}\big[\log \mathsf{D}(\mathbf{Y}_1; \Theta_{\mathsf{D}})\big] + \mathbb{E}\big[\log\big(1 - \mathsf{D}(\hat{\mathbf{Y}}_1; \Theta_{\mathsf{D}})\big)\big]. \tag{14}$$
Then, as emphasized by previous works [17], the adversarial cost can be beneficially enriched with some application-oriented costs. Thus, the loss associated with the network C(•,•;Θ_C) dedicated to CI inference is complemented with two additional terms. The first one is the so-called prediction loss, introduced to assess the quality of the predicted image Ŷ_1 w.r.t. the actual image Y_1,
$$\mathcal{L}_{\mathrm{pre}}(\Theta_{\mathsf{C}}) = \big\| \hat{\mathbf{Y}}_1 - \mathbf{Y}_1 \big\|_{\mathrm{F}}^2, \tag{15}$$
where ‖•‖_F denotes the Frobenius norm. Since the changes are expected to affect only a few pixels, a second term promoting spatial sparsity of the estimated CI is also introduced, as advocated in [10,8],
$$\mathcal{L}_{\mathrm{spa}}(\Theta_{\mathsf{C}}) = \big\| \Delta\hat{\mathbf{X}} \big\|_{2,1}, \tag{16}$$
where ‖•‖_{2,1} denotes a group-sparsity promoting norm. Finally, training both networks boils down to solving the minimax problem
$$\min_{\Theta_{\mathsf{C}}} \max_{\Theta_{\mathsf{D}}} \; \mathcal{L}_{\mathrm{adv}}(\Theta_{\mathsf{C}}, \Theta_{\mathsf{D}}) + \alpha\,\mathcal{L}_{\mathrm{pre}}(\Theta_{\mathsf{C}}) + \beta\,\mathcal{L}_{\mathrm{spa}}(\Theta_{\mathsf{C}}), \tag{17}$$
where α and β are hyperparameters adjusting the relative weights of the terms. In practice, following the common training procedures of GAN-based architectures, the two networks are alternately updated by stochastic gradient-based optimization, i.e., the expectations involved in (17) are approximated by empirical averages over minibatches.
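The following PyTorch sketch illustrates one such alternating update under stated assumptions: the helper names are hypothetical, the pre-trained fusion network net_f is kept frozen, and the usual non-saturating generator objective is used in place of the log(1 − D) term of (14), a common substitution when training GANs.

```python
import torch
import torch.nn.functional as F

def training_step(y1, y2, net_c, net_f, net_d, h1, h2,
                  opt_c, opt_d, alpha=1.0, beta=1e-3):
    """One alternating minibatch update for the minimax problem (17)."""
    def forward_generator():
        delta_x = net_c(y1, y2)               # CI inference, Eq. (10)
        x1_hat = net_f(y1, y2 - h2(delta_x))  # correction (7) + fusion (11)
        return h1(x1_hat), delta_x            # predicted image (13), CI

    # Discriminator step: observed Y1 labelled 1, predicted Y1 labelled 0.
    opt_d.zero_grad()
    y1_hat, _ = forward_generator()
    real, fake = net_d(y1), net_d(y1_hat.detach())
    d_loss = (F.binary_cross_entropy(real, torch.ones_like(real))
              + F.binary_cross_entropy(fake, torch.zeros_like(fake)))
    d_loss.backward()
    opt_d.step()

    # Generator (CI network) step: adversarial + prediction + sparsity terms.
    opt_c.zero_grad()
    y1_hat, delta_x = forward_generator()
    fake = net_d(y1_hat)
    adv = F.binary_cross_entropy(fake, torch.ones_like(fake))
    pre = (y1_hat - y1).pow(2).sum()            # squared Frobenius norm, Eq. (15)
    spa = delta_x.flatten(2).norm(dim=1).sum()  # ell_{2,1} norm over pixels, Eq. (16)
    (adv + alpha * pre + beta * spa).backward()
    opt_c.step()
```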
Remark 4 (Interpreting the generator training). Interestingly, the minimax optimization (17) underlying the generative-discriminative training can be interpreted from an image processing point of view. Indeed, let us consider the deterministic setting or, equivalently, a minibatch composed of a unique sample. Under this simplifying assumption, training the generative network G(•,•;Θ_C), whose only trainable part is dedicated to CI inference (see (13)), boils down to solving the optimization problem
$$\min_{\Theta_{\mathsf{C}}} \; \alpha \big\| \hat{\mathbf{Y}}_1 - \mathbf{Y}_1 \big\|_{\mathrm{F}}^2 + \beta \big\| \Delta\hat{\mathbf{X}} \big\|_{2,1} + \mathcal{L}_{\mathrm{adv}}(\Theta_{\mathsf{C}}), \tag{18}$$
where the predicted image Ŷ_1 and the CI ∆X̂ implicitly depend on Θ_C. More precisely, since Ŷ_1 = H_1(X̂_1) and, provided the fusion is consistent, X̂_1 ≈ X_2 − ∆X̂, the loss function can be rewritten as
$$\min_{\Theta_{\mathsf{C}}} \; \alpha \big\| \mathcal{H}_1(\Delta\hat{\mathbf{X}}) - \Delta\tilde{\mathbf{Y}}_1 \big\|_{\mathrm{F}}^2 + \beta \big\| \Delta\hat{\mathbf{X}} \big\|_{2,1} + \mathcal{L}_{\mathrm{adv}}(\Theta_{\mathsf{C}}), \tag{19}$$
where Ỹ_1 ≜ H_1(X_2) is the image that would be observed by the sensor S_1 at time t_2 and ∆Ỹ_1 ≜ Ỹ_1 − Y_1 is the S_1-virtual CI, i.e., the difference between the image actually observed at time t_1 by the sensor S_1 and the image virtually observed at time t_2 by the same sensor. Since the CI ∆X̂ is fully defined as the output of the network C(•,•;Θ_C) according to (10), training this network through the nonlinear optimization (19) can be interpreted as a parametric inversion task: given the S_1-virtual CI ∆Ỹ_1, one aims at inverting the measurement operator H_1(•), seeking the solution in a parametric form specified by Θ_C. It is worth noting that this inversion step has been explicitly referred to as a correction in [10,8]. Note that in (19), compared with a standard inversion formulation, we have introduced an additional task-driven data fitting term: the adversarial term L_adv(Θ_C).

Detailed architecture in case of complementary acquisitions
This section provides details on the network architectures for a specific instance of the proposed framework. More precisely, it considers the particular case of complementary acquisitions, i.e., when the observed images Y_1 and Y_2 are LRHS and HRLS images, respectively. As stated before, the fusion network F(•,•;Θ_F) can be chosen as any state-of-the-art architecture (or model-based algorithm) [24,43,6,48,42] and is assumed to have been trained (or calibrated) beforehand. In this work, we adopt a network similar to the one introduced in [24], whose detailed architecture is depicted in Fig. 2. Specifically, the LRHS and HRLS images are respectively provided as inputs to two sub-networks designed for feature extraction. Each sub-network consists of two successive convolutional layers, each followed by a leaky rectified linear unit (LeakyReLU) [27], and a down-convolution layer. Since the spatial resolution of the LRHS image Y_1 is lower than that of the target latent image X_1, the second convolution layer of the corresponding sub-network is an up-convolution layer. The two resulting feature maps are then concatenated and pass through a pipeline of convolutional layers complemented by skip connections [36]. Finally, a ReLU is applied in the last layer to ensure the nonnegativity of the output fused image. Regarding the network C(•,•;Θ_C) dedicated to CI inference, it basically performs a mapping with the same input and target spaces as the fusion network detailed above. Even if the respective underlying tasks are different (i.e., CD vs. fusion), their objectives are similar, namely extracting relevant spatial-spectral information from a pair of images of different spatial and spectral resolutions to produce a HRHS (latent or change) image. Thus, it is quite legitimate to adopt a similar architecture for C(•,•;Θ_C) as for F(•,•;Θ_F). The only difference lies in the last layer: the ReLU layer has been removed since the CI is not necessarily nonnegative.
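As an illustration, here is a schematic PyTorch sketch of such a two-stream backbone; the channel widths, kernel sizes and strides are assumptions for illustration and do not reproduce the exact configuration of [24] (in particular, the skip connections are omitted).

```python
import torch
import torch.nn as nn

class TwoStreamNet(nn.Module):
    """Schematic two-stream fusion / CI-inference backbone (illustrative sizes).

    Use nonneg=True for the fusion network F (final ReLU), and
    nonneg=False for the CI network C (output may be negative).
    """
    def __init__(self, bands_lrhs, bands_hrls, out_bands, nonneg=True):
        super().__init__()
        # LRHS branch: feature extraction then a x4 up-convolution to reach
        # the target spatial resolution.
        self.branch_lrhs = nn.Sequential(
            nn.Conv2d(bands_lrhs, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(64, 64, 4, stride=4))
        # HRLS branch: feature extraction kept at the target resolution.
        self.branch_hrls = nn.Sequential(
            nn.Conv2d(bands_hrls, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, 3, padding=1))
        # Concatenated features pass through a convolutional pipeline.
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, out_bands, 3, padding=1))
        self.out_act = nn.ReLU() if nonneg else nn.Identity()

    def forward(self, y_lrhs, y_hrls):
        feats = torch.cat([self.branch_lrhs(y_lrhs),
                           self.branch_hrls(y_hrls)], dim=1)
        return self.out_act(self.fuse(feats))
```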
Finally, the discriminative network D(•;Θ_D) takes an LRHS image as input, either the observed image Y_1 or the predicted image Ŷ_1, and provides a binary decision ω̂ ∈ {0, 1}. When ω̂ = 1, the network decides that the input image is an actually observed one, and that it is a predicted image otherwise. As depicted in Fig. 3, it consists of three down-convolution layers and two flat convolution layers. The last layer involves a sigmoid activation function.
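A matching sketch of the discriminator is given below; again, the channel widths and kernel sizes are illustrative assumptions.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Schematic discriminative network D (illustrative sizes)."""
    def __init__(self, bands):
        super().__init__()
        self.net = nn.Sequential(
            # three down-convolution layers (stride 2)
            nn.Conv2d(bands, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            # two flat convolution layers (stride 1)
            nn.Conv2d(128, 128, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 3, padding=1),
            nn.Sigmoid())  # probability that the input is an observed image

    def forward(self, y):
        return self.net(y).mean(dim=(1, 2, 3))  # one score per image
```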

Figure 3: Architecture of the discriminative network (input: Y_1 or Ŷ_1; output: ω̂).

Experiments on simulated data sets
This section assesses the efficiency of the proposed robust fusion-based CD-GAN framework by reporting experiments conducted on synthetic data and comparing its performance to that of standard and state-of-the-art CD techniques. In this perspective, the proposed framework is instantiated for the particular case of complementary acquisitions. More precisely, throughout this section, the observed image Y_1 will refer to an HS image of low spatial resolution while Y_2 will refer to an MS image of higher spatial resolution.

Synthetic data generation
In the context of CD, the ground truth gives the location of the changes that actually occurred in a geographical area between the two image acquisitions. In practice, this ground truth is generally difficult to obtain, although it is required for a statistical assessment of the CD method performance. Moreover, it is also difficult to obtain enough pairs of images acquired by heterogeneous sensors over the same geographical area at different times. Consequently, the first experiments reported in what follows have been conducted on synthetic data obtained through the simulation protocol proposed in [9]. This protocol relies on the availability of a single HRHS image denoted X_ref to generate pairs of observed HRLS and LRHS images through an unmixing-upmixing process. This process allows the observed images to be affected by physically-inspired, thus realistic, changes whose maps are predefined and can be used to evaluate the performance of the CD methods. The successive steps of this protocol are briefly sketched below. Interested readers are invited to consult [9] for more details.
Generation of HRHS images - Three sets of 22 HRHS images have been extracted from three real HS images acquired by the AVIRIS sensor and depicted in Fig. 4. Each reference image, denoted X_ref, is composed of m = 224 spectral bands and is of size 120 × 120 pixels (i.e., n = 14400). The resulting 66 reference images are then processed as follows.
Unmixing - From each reference image X_ref, a set of k endmembers gathered in the m × k matrix M_ref is extracted using vertex component analysis [31], where the number k of endmembers has been adjusted using Hysime [30]. Based on this set of endmembers, the reference image has been unmixed using SUnSAL [16] to recover the corresponding k × n abundance matrix A_ref.
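As an illustration, the following NumPy sketch performs a plain per-pixel nonnegative least-squares unmixing; it is only a stand-in for SUnSAL [16], which additionally enforces sparsity constraints.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_abundances(x_ref: np.ndarray, m_ref: np.ndarray) -> np.ndarray:
    """Per-pixel nonnegative unmixing (simplified stand-in for SUnSAL).

    x_ref: reference image, shape (m, n); m_ref: endmembers, shape (m, k).
    Returns an abundance matrix of shape (k, n).
    """
    # Solve min ||m_ref @ a - x|| s.t. a >= 0, independently for each pixel.
    return np.stack([nnls(m_ref, x_ref[:, i])[0]
                     for i in range(x_ref.shape[1])], axis=1)
```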
Change generation - Three kinds of simulated yet realistic changes are independently applied to the reference abundance matrices A_ref to produce modified abundance matrices denoted A_chg. These three change rules are referred to as zero abundance (R_z), same abundance (R_s) and block abundance (R_b) in the sequel (see [9] for more details). For each reference image, the pixels whose respective abundance vectors have been modified are identified by non-zero values in the predefined reference binary map d_ref.

Upmixing - Pairs (X_1, X_2) of HRHS latent images are computed by linearly mixing the endmember matrix M_ref with the abundances in A_ref or A_chg. More precisely, these simulated pairs of latent images X_1 and X_2 are defined as
$$\mathbf{X}_1 = \mathbf{M}_{\mathrm{ref}} \mathbf{A}_{\mathrm{ref}}, \qquad \mathbf{X}_2 = \mathbf{M}_{\mathrm{ref}} \mathbf{A}_{\mathrm{chg}}.$$

Generation of observed images - Given the HRHS latent images X_1 and X_2 produced as above, respective pairs (Y_1, Y_2) of observed images are generated according to the forward models (2). In these experiments, as stated above, the observed images are assumed to be of complementary resolutions. Thus the spatial degradation operator H_1(•) is chosen as a spatially-invariant Gaussian blur with standard deviation σ = 2.35 followed by a regular down-sampling of factor 4 in both directions, leading to n_2 = 16 n_1. The spectral degradation operator H_2(•) mimics the response of a MS sensor by averaging four contiguous bands along the spectral dimension.
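A minimal NumPy sketch of these two degradation operators is given below, assuming band-wise Gaussian filtering with simple decimation and plain averaging of contiguous bands; the boundary handling and exact filter support are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def h1_spatial(x: np.ndarray, sigma: float = 2.35, d: int = 4) -> np.ndarray:
    """Spatial degradation H1: band-wise Gaussian blur then x4 subsampling.

    x: HRHS image of shape (m, H, W). Returns shape (m, H // d, W // d).
    """
    blurred = np.stack([gaussian_filter(band, sigma) for band in x])
    return blurred[:, ::d, ::d]

def h2_spectral(x: np.ndarray, width: int = 4) -> np.ndarray:
    """Spectral degradation H2: average groups of `width` contiguous bands.

    x: HRHS image of shape (m, H, W). Returns shape (m // width, H, W).
    """
    m = (x.shape[0] // width) * width
    return x[:m].reshape(-1, width, *x.shape[1:]).mean(axis=1)
```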
Finally, with the above steps, 396 pairs of images have been generated, from which 376 pairs have been used for training and the remaining 20 for testing.

Experimental settings
The proposed CD-GAN architecture is implemented using the PyTorch framework on a computer equipped with a Quadro RTX 6000 GPU. Unless otherwise stated, the hyperparameters adjusting the terms associated with the prediction and the spatial regularization in the loss function (17) have been set to α = 1 and β = 10⁻³. During the training, Adam [19] is used as the optimizer with an initial learning rate of 2 × 10⁻⁴ over 15 epochs and batches composed of 4 training samples. These parameters have been adopted for all experiments reported in what follows. The change maps estimated by the proposed method are denoted d̂_CD-GAN.
Compared methods - The proposed method is compared to four other ones. Very few works have considered the problem of CD between images of different resolutions. Three of the compared methods apply independent pre-processing to the two observed images to reach the same spatial and spectral resolutions, which makes a pixel-wise comparison possible. The fusion-based method is, like ours, specifically dedicated to the case of complementary acquisitions.
• The first method is the fusion-based one proposed in [9].
It derives a LR change map denoted dF .
• The second method consists in applying the superresolution algorithm proposed in [50] to each band of the observed image Y_1. The result is then spectrally degraded by applying H_2(•). The resulting HRLS image can then be compared pixel-by-pixel to the observed image Y_2 since they have the same spatial and spectral resolutions. The estimated HR change map d̂_HRLS is thus obtained through spatially regularized change vector analysis (sCVA) [18].
• The third method applies the same operations as the second one but in reverse order. The observed image Y_1 is first spectrally degraded by applying H_2(•) and then spatially superresolved using [50]. The pair composed of the resulting HRLS image and the observed image Y_2 is then analyzed using sCVA to produce a HR change map denoted d̂_LSHR.
• The fourth method spectrally degrades the observed image Y_1 by applying H_2(•) while spatially degrading the observed image Y_2 by applying H_1(•). The pair of resulting LRLS images is then compared using sCVA to derive a LR change map denoted d̂_LRLS.

Figures-of-merit - To quantitatively evaluate the performance of the CD methods, the estimated change maps d̂ are compared to the reference change map d_ref. This allows the empirical receiver operating characteristics (ROC) curves to be derived, which represent the estimated probability of detection P_D as a function of the estimated probability of false alarm P_FA. A detection occurs when an actually changed pixel is identified as changed, while a false alarm occurs when an unchanged pixel is identified as changed. Empirical ROC curves are the privileged figure-of-merit for detection performance assessment [35]. Contrary to classification-oriented metrics, a ROC curve comprehensively displays, on a single plot, the estimated detection performance in terms of possible trade-offs between detection and false alarms. Considering a wide range of decision threshold values allows these possible trade-offs to be covered. Two quantitative metrics are also derived from these ROC curves, namely the area under the curve (AUC), sometimes referred to as the c-statistic [13, Chap. 9], and the distance (dist.) between the no-detection point (P_FA = 1, P_D = 0) and the point at the intersection of the ROC curve with the diagonal line defined by P_D = 1 − P_FA. For both metrics, the closer the values are to 1, the better the CD method.
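For concreteness, the following sketch computes both figures-of-merit from an energy image and the reference map, using scikit-learn for the empirical ROC curve; the dist. definition follows the reconstruction given above.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def cd_figures_of_merit(e: np.ndarray, d_ref: np.ndarray):
    """ROC-based metrics for a CD method (sketch).

    e:     energy image, shape (n,) -- higher means "more likely changed".
    d_ref: reference binary change map, shape (n,).
    Returns (AUC, dist), dist being the distance from the no-detection
    point (PFA = 1, PD = 0) to the ROC intersection with PD = 1 - PFA.
    """
    pfa, pd, _ = roc_curve(d_ref, e)
    area = auc(pfa, pd)
    # Intersection of the empirical ROC curve with the diagonal PD = 1 - PFA.
    i = np.argmin(np.abs(pd - (1.0 - pfa)))
    dist = np.hypot(pfa[i] - 1.0, pd[i] - 0.0)
    return area, dist
```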

Results
Performance comparison - The ROC curves obtained by the compared methods are shown in Fig. 5 for the three change rules R_b, R_s and R_z. The associated quantitative results are reported in Table 1. Clearly, these first results show the superiority of the proposed CD-GAN framework when compared to the four other CD methods. More precisely, CD-GAN provides better detection even for very low probabilities of false alarm.
The methods based on d̂_HRLS and d̂_LSHR behave similarly and underperform the other methods. Fig. 6 shows the true CM and the CMs estimated by the compared methods. Note that the estimated CMs d̂_LRLS and d̂_F are defined at low spatial resolution, contrary to the other three.

Sensitivity analysis - We now discuss the impact of the three terms in the loss function (17), namely the adversarial loss L_adv(•,•), the prediction loss L_pre(•) and the spatial-sparsity regularization L_spa(•). Table 2 reports the quantitative results associated with the three change rules R_z, R_s and R_b when removing one term of (17) at a time: first column without the adversarial loss, second column without the prediction loss and third one without the spatial-sparsity regularization. Optimal values are highlighted in bold. Clearly, the prediction loss and the spatial-sparsity regularization play important roles in the proposed CD-GAN, whereas combining the three terms leads to the best results.
In addition, we also analyze the impact of the spatial-sparsity regularization in the loss function (17) by adjusting the hyperparameter β. The empirical results demonstrate the interest of incorporating this regularization since the case β = 0, i.e., with no regularization, leads to the worst detection performance. Conversely, for a quite wide range of non-zero values of β, i.e., β ∈ {10⁻⁴, 10⁻³, 10⁻²}, the detection performance is clearly better, with the best one obtained for β = 10⁻³, which is the default value chosen for the experiments reported in this section. Note that a similar analysis has been conducted for the hyperparameter α adjusting the weight of the prediction term in the overall loss (17). The results, which show the limited impact of this hyperparameter on the detection performance, are not reported herein for brevity.
Impact of the fusion methods - In the proposed CD-GAN framework, the fusion network F(•,•;Θ_F) can be chosen by the end-user and has been assumed to be trained beforehand. Section 3.3 describes the fusion network adopted in most of the experiments discussed herein. This network heavily relies on the PS-GAN architecture proposed in [24]. To illustrate the versatility of the framework, we instantiate it with two other fusion networks, namely the deep blind hyperspectral image fusion network (DBIN) proposed in [43] and the spatial-spectral reconstruction network for hyperspectral and multispectral image fusion (SSR-NET) proposed in [49]. Besides, as a complementary analysis, we consider a semi-supervised scenario where the latent image X_1 (i.e., the HRHS image associated with the observed image Y_1) is also available for training. Note that this more favorable scenario is the one considered in [24] to train PS-GAN. In this case, the discriminative network D(•;Θ_D) and the prediction loss (15) can easily be adapted to distinguish the estimated fused image X̂_1 from the true latent image X_1. The quantitative metrics are reported in Table 3 for the three implemented change rules. These results show that, whatever the adopted fusion network, the proposed unsupervised CD-GAN framework reaches detection performance comparable to that obtained in the semi-supervised scenario. In addition, they show that the proposed CD framework is quite robust to the choice of the fusion method.

Robustness to model mismatch - Complementary experiments have been conducted with misspecified degradation operators; Table 4 reports the quantitative results obtained by the compared methods in this setting. They show that the proposed CD-GAN obtains higher metrics than the compared methods. In conclusion, the proposed CD-GAN framework is shown to be robust w.r.t. the misspecification of the degradation operators for the three implemented change rules.

Experiments on real data sets
For illustrative purposes, complementary experiments have been conducted on a real data set, namely the Santa Barbara data set. It is composed of two HS images acquired by the AVIRIS sensor in 2013 and 2014, respectively, to monitor the change of composition over the Santa Barbara area in California [37]. These images are composed of 984 × 740 pixels with m = 224 spectral bands. In the experiments, two HRHS subimages of size 120 × 120 are selected. The spatial and spectral degradation operators H_1(•) and H_2(•) detailed in Section 4.1 are applied to each of these HRHS images to mimic heterogeneous acquisitions with complementary resolutions, i.e., producing one HRLS image and one LRHS image. The HRLS and LRHS images as well as the actual binary CMs constituting the two data sets, denoted SB1 and SB2, are shown in Fig. 7. The binary CMs are designed as in [37].
Regarding the data set SB1, the CIs and the resulting binary CMs recovered by the compared methods are shown in Fig. 8. Visual inspection shows that recovering the changed areas between the two observed images is a challenging task for all compared methods. The estimated change map d̂_F does not succeed in identifying the changed regions accurately, which leads to a large number of actually changed pixels being identified as unchanged. The change maps d̂_HRLS and d̂_LSHR seem to preserve more changed pixels, but at the price of many false alarms since a large number of actually unchanged pixels are detected as changed. The estimated LRLS CM d̂_LRLS is able to locate the major changes and the corresponding estimated CI exhibits a better contrast than those of the previous methods. However, many isolated unchanged pixels are misclassified and, similarly to the fusion-based map, the CI and CM are of low spatial resolution. Conversely, the results associated with the proposed CD-GAN show accurate detection of the changes with a low number of false alarms. These findings are confirmed by the ROC curves depicted in Fig. 9 (left) and the quantitative results reported in Table 5. From Fig. 9 (left), it appears that the proposed CD-GAN provides a high detection rate whatever the operating point of the detector (i.e., for all values of P_FA). From a quantitative point of view, the proposed CD-GAN framework obtains the highest dist. score and AUC.
Regarding the data set SB2, the CIs and CMs estimated by the compared methods are shown in Fig. 10. For the middle part, the change maps derived from the HRLS and LSHR CIs are composed of many isolated pixels wrongly detected as changed. This may be due to the superresolution operation, which is sensitive to noise. The estimated LRLS CM seems to achieve better performance, although changes in some areas are not well detected due to the loss of useful information induced by the underlying spectral and spatial degradations. The fusion-based CD method presents misclassified areas of change, but the result remains visually satisfactory. The CM estimated by the proposed CD-GAN framework is composed of more well-located changed pixels, with a significant reduction of false alarms. This visual inspection is confirmed by the quantitative results reported in Table 5, where the proposed CD-GAN framework reaches higher metrics (AUC and dist.) than the compared methods. The ROC curves depicted in Fig. 9 (right) show that the CD-GAN strategy provides better detection whatever the operating point of the detector, i.e., for a wide range of false alarm probabilities.

Conclusion
This paper introduced a robust fusion-based adversarial framework for detecting changes between heterogeneous images in an unsupervised scenario. The proposed approach capitalized on the availability of a predefined and previously trained fusion network. Following a robust fusion-based change detection strategy, this network was complemented with another network of the same architecture, specifically dedicated to detecting changes between images of possibly different spatial and spectral resolutions. The overall architecture was trained within an adversarial paradigm, enriching the canonical adversarial loss function with task-driven terms. Experiments conducted on simulated and real data sets illustrated the efficiency, the versatility and the robustness of the proposed change detection framework.


Figure 4: Color composition of the three hyperspectral reference images.


Figure 5: ROC curves obtained by the compared CD methods for the three change rules: R_b (left), R_s (middle) and R_z (right).

Figure 6: CMs estimated by the compared methods on the data sets generated according to the three change rules, namely R_b (1st row), R_s (2nd row) and R_z (3rd row), with from left to right: ground truth, d̂_CD-GAN, d̂_F, d̂_HRLS, d̂_LSHR, d̂_LRLS.

Figure 7: Data sets SB1 (top) and SB2 (bottom): color composition of the HRLS observed image (left), color composition of the LRHS observed image (middle) and the binary CM (right), where white (resp. black) pixels correspond to changed (resp. unchanged) areas.

Table 2: Sensitivity analysis w.r.t. the hyperparameters in CD-GAN: quantitative detection performance (columns: no adv., α = 0, β = 0, β = 10⁻⁴, β = 10⁻³, β = 10⁻²).

Table 3: Impact of the fusion method: quantitative detection performance.

Table 5: Data sets SB1 and SB2: quantitative detection performance.