MARU-Net: Multiscale Attention Gated Residual U-Net With Contrastive Loss for SAR-Optical Image Matching

Accurate synthetic aperture radar-optical matching is essential for combining the complementary information from the two sensors. However, the main challenge lies in overcoming the heterogeneous characteristics of the two imaging sensors. In this article, we propose an end-to-end machine learning pipeline inspired by recent advances in image segmentation. We develop a siamese multiscale attention-gated residual U-Net for feature extraction from satellite images. The siamese architecture shares weights and transforms the heterogeneous images into a homogeneous feature space. The fast Fourier transform is used to compute the cross-correlation between the feature maps and produce a similarity map. A contrastive loss is introduced to aid the training procedure and maximize the discriminability of the model. Experimental results on a benchmark dataset show that the proposed method has superior matching accuracy and precision compared to other state-of-the-art methods.

Optical and synthetic aperture radar (SAR) are among the most common sensors for Earth observation. Optical sensors are passive sensors that measure the sunlight reflected from objects on the Earth's surface, making them susceptible to cloud coverage, time of day, and weather conditions, such as mist, fog, and smoke. However, they provide useful multispectral information. In contrast, SAR is an active sensor that transmits radio wave pulses and measures the backscattered signals, making it operational without the sun illuminating the surface and in every weather condition. Furthermore, SAR can capture the surface properties (such as roughness) of objects. However, it provides no spectral information, resulting in noisy black-and-white imagery.

Image matching is the process of aligning two or more images. SAR-optical matching is especially problematic due to the significant radiometric and geometric differences and the visual disparities introduced by different remote sensing sensors. Consequently, combining data from various sensor types remains one of the major challenges in remote sensing [14]. The distinct characteristics of the different imaging principles cause the imagery to reveal different aspects of the Earth's surface. As a result, objects on the surface appear inherently dissimilar from the viewpoints of active and passive sensors. Therefore, locating salient features in both images is complex, especially in areas with few distinct features. Given the complementary information from SAR and optical imagery, combining the two sensors increases the information content and can open new use cases in remote sensing applications. However, the two images must be aligned accurately before combining them. Traditionally, feature-based matching methods, such as SIFT [15], SAR-SIFT [16], optical-to-SAR SIFT (OS-SIFT) [17], and the radiation-variation insensitive feature transform (RIFT) [18], have been used for SAR-optical matching. These methods compute feature descriptors from the images and match them together, evaluating the feature correspondence. Affine correction is performed by selecting noncollinear matched features as control points. However, the extraction of such features requires expert knowledge and handcrafted procedures. Furthermore, these methods cannot handle the heterogeneous characteristics caused by the SAR and optical imaging mechanisms well, particularly on scenes with few salient features. Area-based methods instead calculate feature similarity using the sum of squared differences (SSD), normalized cross-correlation (NCC), or mutual information (MI) [19], [20]. However, MI-based approaches suffer from high computational costs.
Several studies have recently suggested deep learning methods to overcome the shortcomings of nonlearning methods. The availability of paired SAR-optical datasets, such as SpaceNet-6 [21] and SEN1-2 [22], helps the development of machine learning-based SAR-optical matching. Zhang et al. [23] and Merkle et al. [24] proposed fully convolutional siamese networks to extract features from SAR-optical imagery. Both methods rely on a time-consuming pixel-by-pixel search to perform matching and use shallow convolutional neural networks (CNNs) with few parameters as feature extractors. Hughes et al. [25] implemented a component-based framework using three separate networks to extract patches suitable for matching, perform template matching, and remove outliers, respectively. However, the framework downsamples the produced feature maps due to time complexity constraints, thus losing matching precision when interpolating the similarity score. More recently, Zhou et al. [26] proposed a machine learning modification of the channel features of orientated gradients (CFOG) proposed by [27], called multiscale convolutional gradient features (MCGF). Like CFOG, MCGF achieves fast matching since it also performs similarity evaluation in the frequency domain using NCC. Zhang et al. [28] proposed the deep dense feature network (DDFN), also inspired by the CFOG method. DDFN extracts a 9-D feature vector for each pixel and uses the SSD for similarity computation. The experiments show that the deep siamese network outperforms the state-of-the-art handcrafted CFOG descriptor. Fang et al. [29] introduced the fast Fourier transform (FFT) U-Net, using the image segmentation model U-Net [30] as a feature extractor with an FFT-accelerated NCC layer to perform matching. Similarly, [31] demonstrated the superiority of using the U-Net as a feature extractor in SAR-optical matching. However, SAR-optical matching remains a challenging problem due to the inherent geometric and radiometric differences between the two sensors. A review of recent methods and current research trends can be found in [32].
In this article, we tackle the problem of SAR-optical matching as a multiclass classification task. We use a siamese architecture to extract shared features, mapping different multimodal images (e.g., optical and SAR) into a common feature space. As the core of the model, we chose a U-Net-based architecture because it is one of the most effective deep learning architectures for image classification and image segmentation. We enhance the classic architecture with additional components, such as the attention mechanism and residual blocks, which were initially developed for semantic image segmentation tasks but are not yet fully exploited in the context of SAR-optical template matching. Moreover, we compute the feature maps at different scales, a well-known method that improves the representation ability of features in many tasks, such as object detection [33] and image segmentation [34]. The multiscale feature map makes the network more robust and improves the pixel-level matching accuracy. At the same time, the attention mechanism helps the model locate and focus on salient regions in the SAR-optical imagery. Furthermore, we combine the standard cross-entropy with an additional contrastive loss to build a combined loss function tailored to the SAR-optical matching problem. The loss function reduces false positive matching locations and increases the discriminability of the proposed framework.

II. METHODOLOGY
The approach used in this work is based on template matching, which consists of finding the most likely position of a small image (template) within a larger image (reference). As such, the starting point is acquiring an optical image (hereafter called the reference) with dimensions $R_x \times R_y$ and an SAR image (hereafter called the template) with dimensions $T_x \times T_y$. Fig. 1 outlines the structure of the proposed pipeline. Details of each component, including the architecture, the FFT NCC layer, and the loss function, are described as follows.

A. MARU-Net Architecture
As a preprocessing step, the two input images (i.e., optical and SAR) are downscaled, reducing their original size by half. The resulting four images (optical, SAR, and their corresponding downscaled versions) are passed through a siamese CNN composed of four units. Each unit has the same architecture and shares the same weights with the others. This way, they work in tandem on different input vectors to compute comparable feature maps [35].
Each unit is a CNN that consists of a U-Net backbone in which the standard convolution blocks are replaced by residual convolution blocks [36]. The architecture has four layers with channel dimensions of {32, 64, 128, 256} in the contracting path and, similarly, four layers of {256, 128, 64, 32} in the expanding path. We insert attention gates [37] in the expanding path instead of the standard direct skip-connections. Each residual block consists of two iterations of 3 × 3 2-D convolutions, each followed by batch normalization and an ELU activation. The shortcut path consists of one convolutional layer. Instead of using traditional transposed convolutions with learnable parameters to upscale the encoded feature maps, we use upsampling layers with bilinear interpolation to increase the resolution of the feature maps and thus preserve the initial details of the encoded features. This is because CNN architectures employing transposed convolutions from lower to higher resolution are prone to checkerboard artifacts [38].
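To make this concrete, the following is a minimal Keras sketch of the residual block and the bilinear upsampling described above; the 1 × 1 kernel of the shortcut convolution and the exact placement of the residual addition are our assumptions, as the text does not fix them.

```python
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two iterations of 3x3 convolution -> batch normalization -> ELU,
    with a convolutional shortcut (1x1 kernel assumed)."""
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    y = x
    for _ in range(2):
        y = layers.Conv2D(filters, 3, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.Activation("elu")(y)
    return layers.Add()([shortcut, y])

def upsample(x):
    """Bilinear upsampling instead of a transposed convolution,
    which is prone to checkerboard artifacts [38]."""
    return layers.UpSampling2D(size=2, interpolation="bilinear")(x)
```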
Each unit produces a 4-channel feature tensor $\psi$ with the same dimensions (height and width) as the corresponding input. The feature maps extracted from the downscaled optical and SAR images, $\psi^{d}_{\mathrm{opt}}$ and $\psi^{d}_{\mathrm{SAR}}$, are upscaled and concatenated with the corresponding feature maps extracted from the original images, $\psi^{o}_{\mathrm{opt}}$ and $\psi^{o}_{\mathrm{SAR}}$, resulting in an 8-channel feature tensor for each modality.
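As an illustration, the multiscale feature extraction for one modality can be sketched as follows, assuming `unit` is a Keras model implementing the shared MARU-Net unit above and that the inputs have static spatial dimensions; the bilinear resizing is our assumption.

```python
import tensorflow as tf

def multiscale_features(image, unit):
    """Build the 8-channel multiscale tensor for one input image
    (NHWC): 4 channels from the original resolution concatenated
    with 4 channels from the half-resolution branch."""
    h, w = image.shape[1], image.shape[2]
    feats_full = unit(image)                                # psi^o
    image_half = tf.image.resize(image, (h // 2, w // 2))   # downscale by half
    feats_half = tf.image.resize(unit(image_half), (h, w))  # psi^d, upscaled
    return tf.concat([feats_full, feats_half], axis=-1)
```

Because the same `unit` is applied to both modalities, the optical and SAR tensors live in a common feature space.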

B. FFT NCC Layer
Comparing the feature maps pixelwise is time-consuming and drastically increases the training and inference time of the model. To speed up the process, we compute the NCC in the frequency domain to evaluate the similarity map $S$ of the derived feature maps using the FFT as

$$S = \mathcal{F}^{-1}_{2\mathrm{d}}\left[\mathcal{F}_{2\mathrm{d}}(\psi_{\mathrm{opt}}) \cdot \overline{\mathcal{F}_{2\mathrm{d}}(\psi_{\mathrm{SAR}})}\right]$$

where $S$ denotes the derived similarity map, $\psi_{\mathrm{opt}}$ and $\psi_{\mathrm{SAR}}$ are the optical and SAR feature maps produced by the network presented in the previous section, "$\cdot$" is the elementwise product operation, the overline denotes the complex conjugate, and $\mathcal{F}_{2\mathrm{d}}$ and $\mathcal{F}^{-1}_{2\mathrm{d}}$ denote the 2-D forward and inverse FFT, respectively. In the FFT layer, the dimensions of $S$ correspond to the dimensions of the search space, $S_x \times S_y \times 8$, where $S_x = R_x - T_x + 1$ and $S_y = R_y - T_y + 1$. The similarity map $S$ is then normalized into $\hat{S}$ according to [39]. As a result, every value in $\hat{S}$ can be interpreted as the observed similarity of the template (i.e., the SAR image) within the reference (i.e., the optical image).
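For a single feature channel, the frequency-domain correlation can be sketched in NumPy as follows; the normalization of [39] is omitted, and zero-padding the template to the reference size is our assumption about how the size mismatch is handled.

```python
import numpy as np

def fft_cross_correlation(psi_opt, psi_sar):
    """Correlate a (Tx, Ty) template feature map against a (Rx, Ry)
    reference feature map in the frequency domain."""
    Rx, Ry = psi_opt.shape
    Tx, Ty = psi_sar.shape
    F_opt = np.fft.fft2(psi_opt)
    F_sar = np.fft.fft2(psi_sar, s=(Rx, Ry))  # zero-pad to reference size
    # Multiplying by the complex conjugate turns the inverse FFT into a
    # cross-correlation rather than a convolution.
    S = np.real(np.fft.ifft2(F_opt * np.conj(F_sar)))
    # Keep the valid search space: Sx = Rx - Tx + 1, Sy = Ry - Ty + 1.
    return S[: Rx - Tx + 1, : Ry - Ty + 1]
```

Stacking the per-channel correlations over the eight feature channels yields the $S_x \times S_y \times 8$ similarity volume.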

C. Loss Function
If a softmax function is applied to $S$, each value in the derived similarity map can be interpreted as the probability of a specific shift between the reference and the template. That is, the coordinates in the similarity map $S$ corresponding to the maximum value indicate the predicted shift of the template with respect to the reference. Given the discrete dimensions of the search space and having the ground truth with the correct shift, locating the 2-D pixel shift between the reference (e.g., optical) and the template (e.g., SAR) image can be formulated as a multiclass classification problem, where the classes denote the shift coordinates of the SAR template within the larger optical image. As such, we adopt the cross-entropy loss function $L_{\mathrm{CE}}$ as

$$L_{\mathrm{CE}} = -\sum_{i,j} y_{i,j} \log\left(\mathrm{softmax}(S)_{i,j}\right)$$

where $y_{i,j}$ is the ground truth value at position $(i, j)$, while $S_{i,j}$ is the similarity score at position $(i, j)$.
However, for such a classification task, the size of $S$ yields $S_x \times S_y$ different classes, where the correct matching location (one class) is considered correct, while all the rest are considered wrong. This formulation therefore results in a heavily imbalanced distribution of classes, which negatively impacts the training. Inspired by [28], we include a new term in the loss function to reduce the impact of the imbalanced class distribution and improve the discriminability of the network. We apply the discrete approximation of the Gaussian function $G$ on the area around the correct matching position $(c_i, c_j)$ obtained from the ground truth to construct a soft ground truth map as

$$G_{i,j} = \exp\left(-\frac{\left\|(i,j) - (c_i, c_j)\right\|_2^2}{2\sigma^2}\right)$$

where $\sigma$ is set to 1 and $\|\cdot\|_2$ is the $L_2$ (Euclidean) distance. $L_1$ normalization is applied to $G$ to render it a probability distribution. $G$ has the same size as the similarity map $S$. The similarity map space is then divided into two nonoverlapping regions, namely the matching region $\mathrm{MR} = S \cdot \lceil G \rceil$ and the nonmatching region $\mathrm{NMR} = S \cdot (1 - \lceil G \rceil)$, where $\lceil G \rceil$ is the ceiling function applied to $G$ (nonzero values are mapped to 1, zero values remain zero). The matching region consists of the pixels corresponding to the true class provided by the ground truth (hard label) and the nearby classes, weighted by the Gaussian function $G$ (soft labels). We denote the number of these classes by $N_{\mathrm{ps}}$. Inside the NMR region, we select the $N_{\mathrm{hns}}$ pixels with the lowest nonzero values in the similarity map (hard negative samples). Fig. 2 shows the two regions and the different types of labels graphically.
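The soft ground truth map and the two regions can be constructed as in the following sketch; the truncation radius of the discrete Gaussian is our assumption.

```python
import numpy as np

def soft_ground_truth(shape, center, sigma=1.0, radius=2):
    """Discrete Gaussian G around the correct match (ci, cj),
    L1-normalized into a probability distribution."""
    ii, jj = np.indices(shape)
    d2 = (ii - center[0]) ** 2 + (jj - center[1]) ** 2
    G = np.where(d2 <= radius**2, np.exp(-d2 / (2.0 * sigma**2)), 0.0)
    return G / G.sum()

# Example: split a similarity map S into the two regions.
S = np.random.rand(65, 65)
G = soft_ground_truth(S.shape, center=(30, 42))
mask = np.ceil(G)                      # ceiling: nonzeros -> 1, zeros stay 0
MR, NMR = S * mask, S * (1.0 - mask)   # matching / nonmatching regions
```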
We define a new term $\Omega$, formulated as the difference between the observed similarity scores within the two regions, which we aim to maximize. Mathematically,

$$\Omega = \frac{1}{N_{\mathrm{ps}}} \sum_{(i,j) \in \mathrm{MR}} S_{i,j} \;-\; \frac{1}{N_{\mathrm{hns}}} \sum_{(i,j) \in \mathrm{HNS}} S_{i,j}$$

where HNS denotes the set of hard negative samples. Like in [28], we add a margin of 1 to prevent significantly low values in $\Omega$ and increase the separability of the positive and negative samples. Experimentally, we find that setting $N_{\mathrm{hns}} = 16$ yields the best results. Finally, the contrastive loss $L_{\Omega}$ is defined as $L_{\Omega} = -\Omega$ to make it compatible with the cross-entropy term $L_{\mathrm{CE}}$ (i.e., minimizing $-\Omega$ is equivalent to maximizing $\Omega$). We construct the combined loss function as the sum of the cross-entropy term and the contrastive term

$$L = L_{\mathrm{CE}} + L_{\Omega} \qquad (5)$$

The training process aims to minimize (5) through backpropagation. In practice, without the contrastive term $\Omega$, we note that the simple cross-entropy loss $L_{\mathrm{CE}}$ does not distinguish well between a mismatch close to the ground truth and one far from it. With the term $\Omega$, the model is penalized less if the maximum value in $S$ is close to the ground truth. At the same time, the model is trained to distinguish better between positive samples and hard negative samples.
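A sketch of the combined loss follows; the $G$-weighting of the matching region and the placement of the margin are our assumptions, since only the overall structure $L = L_{\mathrm{CE}} - \Omega$ is fixed above.

```python
import tensorflow as tf

def combined_loss(S, G, mask, n_hns=16):
    """L = L_CE + L_Omega with L_Omega = -Omega; S is the similarity
    map, G the L1-normalized soft labels, and mask = ceil(G)."""
    # Cross-entropy over the flattened search space with soft labels.
    l_ce = tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.reshape(G, [-1]), logits=tf.reshape(S, [-1]))
    # Similarity in the matching region, weighted by the soft labels.
    pos = tf.reduce_sum(S * G)
    # Hard negatives: the n_hns lowest similarity scores in the NMR.
    nmr_scores = tf.boolean_mask(S, mask < 0.5)
    hns = -tf.math.top_k(-nmr_scores, k=n_hns).values
    omega = tf.minimum(pos - tf.reduce_mean(hns), 1.0)  # margin of 1 (assumed form)
    return l_ce - omega
```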

III. EXPERIMENTS AND RESULTS

A. Dataset
We use the SEN1-2 open benchmark dataset [22] to examine the performance of our approach. The dataset consists of 282,384 coregistered SAR-optical image patch pairs, covering all four seasons and a variety of environments (e.g., urban, rural, desert, and mountainous). The image patches are 256 × 256 pixels in size and have a 10-m spatial resolution. Fig. 3 shows some example images taken from the SEN1-2 dataset.
We select 100 random patches from every folder in the dataset across all four seasons. The selected subset of the dataset is then split into training and test sets with a ratio of 70:30, yielding 18,060 image pairs for training and 7,740 for testing. To generate the ground truth, the SAR patches are randomly cropped to a size of 192 × 192, and the row and column offsets are stored as the ground truth values. The RGB optical images are converted to grayscale, and the SAR images are denoised using the Lee filter [40].
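The sample generation step can be sketched as follows (function and variable names are illustrative; grayscale conversion and Lee filtering are assumed to have been applied beforehand).

```python
import numpy as np

def make_sample(opt_patch, sar_patch, template_size=192, rng=None):
    """Randomly crop the 256x256 SAR patch to 192x192 and record the
    (row, col) offset as the ground truth shift."""
    rng = rng or np.random.default_rng()
    max_offset = sar_patch.shape[0] - template_size   # 64 for 256x256 patches
    row, col = rng.integers(0, max_offset + 1, size=2)
    template = sar_patch[row:row + template_size, col:col + template_size]
    return opt_patch, template, (row, col)
```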

B. Evaluation Metrics and Implementation Details
The $L_2$ distance between the peak of the generated similarity map and the ground truth is used to determine the matching accuracy. With a 256 × 256 reference image and a 192 × 192 template, the resulting similarity map is a 65 × 65 matrix. We select MI, siamese CNN, DDFN, and FFT U-Net to compare against our proposed MARU-Net method. Models are trained for five epochs with a batch size of 4 using the Adam optimizer with a learning rate of $5 \times 10^{-4}$. All methods are implemented using the Keras API of TensorFlow 2, and training is performed on a machine with an Intel Core i5-7600K CPU and a GeForce GTX 1080 GPU.

C. SAR and Optical Matching Performance Validation
We compute the results using our proposed approach as well as selected state-of-the-art methods [32]. The results obtained on the testing dataset are shown in Table I. Besides the more advanced learning methods, we also evaluate the standard cross-correlation method, using the implementation provided by [41]. To evaluate the performance, we define the correct matching rate (CMR) as the percentage of image pairs with an $L_2$ pixel distance from the ground truth smaller than a given threshold. We also compare the average $L_2$ distance as a measure of precision, as well as the time complexity, measured as the average time to perform a single matching.
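Both metrics can be computed as in the sketch below; the threshold values shown are placeholders, not the ones used in Table I.

```python
import numpy as np

def matching_metrics(similarity_maps, ground_truths, thresholds=(1, 2, 5)):
    """Correct matching rate (CMR) per threshold and average L2 error."""
    errors = []
    for S, (gt_row, gt_col) in zip(similarity_maps, ground_truths):
        row, col = np.unravel_index(np.argmax(S), S.shape)  # peak = predicted shift
        errors.append(np.hypot(row - gt_row, col - gt_col))
    errors = np.asarray(errors)
    cmr = {t: float(np.mean(errors < t)) for t in thresholds}
    return cmr, float(errors.mean())
```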
As shown in Table I, the proposed MARU-Net method obtains the best accuracy across all CMR thresholds. In addition, it exhibits the best precision compared to the other methods. We note that the standard NCC, although faster than the more advanced methods, shows very poor performance. This is because the similarity evaluation is performed directly on the input images, which are too dissimilar. MI shows acceptable results on the coarse 10 m/px resolution imagery (the resolution of the SEN1-2 dataset used in this study), although other studies have shown that MI performs noticeably worse in terms of pixel-level accuracy on very high-resolution imagery [29], [43]. Nevertheless, it is also the most time-consuming among the tested methods. The DDFN achieves the fastest performance due to its shallow seven-layer CNN structure with only 300,000 parameters. However, the shallow nature of DDFN yields low precision scores due to poor matching accuracy on imagery with few salient features. The siamese CNN is also a shallow network with few parameters but employs an architecture with shared weights and cross-entropy as the loss function, yielding considerably better performance than DDFN. Still, the siamese CNN is the slowest among the machine learning methods due to a time-consuming dot-product computation over the extracted feature vectors. The state-of-the-art FFT U-Net utilizes a deep classic U-Net with cross-entropy as the loss function, producing the best results among the selected baseline methods, as shown in Table I. The experimental results show that the proposed MARU-Net architecture yields significant improvements in matching performance compared to other state-of-the-art methods. In addition, our method is computationally efficient, in line with the other methods.
The approach used in this work, like the other cited works, is based on template matching. However, this approach leads to difficulties when the two images are heavily warped with respect to each other. Most of the presented methods cannot effectively deal with significant rotation and scale differences between the two images, and further research is needed to address these issues.

D. Visual Comparison of Matching Results
In Fig. 4, we qualitatively show two samples and the similarity maps produced by the different methods. The chosen scenes consist of a nonurban and an urban image pair. Low response values in the similarity map are represented by a dark blue color, while high values are bright yellow. Ideally, an optimal matching result is a single sharp yellow peak overlapping the ground truth value (red dot).
The first scene, a snowy mountain landscape, contains no distinct features and few details, making SAR-optical matching a challenging task. As such, the baseline methods exhibit an unfocused response pattern with a low response in the correct matching region. MI, DDFN, and the siamese CNN particularly fail, with numerous peaks in incorrect areas. The FFT U-Net also yields an unfocused similarity map with a moderate response in the matching area. In contrast, the proposed method produces a similarity map with a single sharp peak close to the red dot. The attention gates and the combined loss function encourage the network to focus on a single region, resulting in more focused similarity maps with sharper peaks.
The second scene depicts an urban area with detailed structures (river, buildings). Compared to the first scene, the matching is considerably more manageable, and all methods improve their matching performance since features, such as the river, are salient in both images. Still, the siamese CNN appears relatively unfocused. The main benefit of our proposed method is a more robust matching performance in scenes like S1, where the level of detail is low and few features are present.

E. Ablation Study
We perform an ablation study to verify how the components of the proposed method contribute to the matching performance. The FFT U-Net method serves as a baseline, and the different components of our method are added one by one to verify the performance gain. The experimental results of the componentwise comparison are shown in Table II.
Including the contrastive loss improves the matching precision, ensuring improved training compared to cross-entropy alone. Fig. 5 shows the feature maps produced by the feature extractor (for clarity, only the feature extractor from Fig. 1 is shown). The feature maps computed from the downscaled inputs capture the coarse structure of the scene; on the other hand, due to the decreased resolution, small details are canceled out. Such missing information is retrieved by combining the feature maps from the original resolution, which improves the pixel-level matching accuracy. In our study, we select two scales (original resolution and halved). Testing with more scales did not improve the results, as further decreasing the resolution would remove too much information and detail, while making the model even more difficult to train (due to more parameters). However, multiple scales, for example, pyramidal approaches [44], can be employed when dealing with larger images and a finer resolution. Fig. 6 shows the difference in the feature maps with and without the attention mechanism. Adding the attention mechanism yields much sharper and more focused feature maps. This is because, as presented in [37], the attention mechanism computes attention coefficients that are multiplied into the feature maps during the concatenation in the expanding path of the architecture (decoder), identifying salient image regions and pruning feature responses to preserve only the relevant activations.
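For reference, the additive attention gate of [37] can be sketched as follows; the intermediate channel count and the resampling of the gating signal are our assumptions.

```python
from tensorflow.keras import layers

def attention_gate(skip, gate, inter_channels):
    """Additive attention gate in the spirit of [37]: `skip` is the
    encoder feature map, `gate` the coarser decoder signal (assumed
    already resized to the same spatial dimensions)."""
    theta = layers.Conv2D(inter_channels, 1)(skip)
    phi = layers.Conv2D(inter_channels, 1)(gate)
    f = layers.Activation("relu")(layers.Add()([theta, phi]))
    alpha = layers.Conv2D(1, 1, activation="sigmoid")(f)  # attention coefficients
    return layers.Multiply()([skip, alpha])  # keep only relevant activations
```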
We also note some limitations in the dataset that significantly affect the model's performance and, in general, all remote sensing-based approaches. Fig. 7 shows some problematic scenes from the dataset we used. A scene can be ambiguous, for example, if it contains straight lines and the template can match at any position along the line. In Fig. 7, first column, we notice that the regular patterns of the fields in the SAR image can be matched at different locations. Another example is when the scene lacks structure, such as in the middle of the ocean (Fig. 7, second column). Finally, there might be temporal differences between the optical and SAR acquisition times. Even if the difference is small, it can be significant if moving objects are in the scene, such as the boats in Fig. 7, third column.
When we plot the distribution of the error (see Fig. 8), we notice a large peak around the 0- and 1-pixel error with a few outliers with very high pixel errors. These outliers are due to the difficult scenes present in the dataset, as described in Fig. 7. Our approach falls into the category of template matching. Therefore, it does not work well if the images are warped (i.e., involving a nonaffine transformation) with respect to each other or if significant rotations are involved. Future work will address these challenges.

IV. CONCLUSION
In this article, we propose an SAR-optical image matching method to increase the matching accuracy and precision at the pixel level. We extend the classical U-Net with attention mechanisms to improve the feature extraction capabilities of the encoder-decoder architecture. We incorporate a multiscale strategy producing feature maps from both the original and downscaled imagery to increase robustness. In addition, we propose a loss function consisting of a combination of cross-entropy and a contrastive loss, tailored particularly to the SAR-optical matching problem. Experiments show that our method outperforms other state-of-the-art methods while remaining computationally efficient. Future work will explore the potential of unsupervised or semisupervised methods in SAR-optical matching to overcome the inherent shortcomings of relying on large datasets. Another direction is to investigate how to make the network more robust to scale and rotation differences.