Efficient and Robust: A Cross-Modal Registration Deep Wavelet Learning Method for Remote Sensing Images

Deep convolutional networks are powerful for local feature learning and have shown advantages in image matching and registration. However, the significant differences between cross-modal images increase the challenge of image registration. The deep network should extract modality-invariant features to identify matching samples and discriminative features to separate nonmatching samples. Through multiple nonlinear mapping layers, the deep network can extract features invariant to image modality changes; however, this process inevitably loses rich details and weakens the discrimination of features, degrading registration performance. This article proposes a novel deep wavelet learning network (DW-Net) for local feature learning. It incorporates spectral information into deep convolutional features to improve cross-modal image matching and registration. Specifically, this article aims to learn multiresolution wavelet features through the multilevel wavelet transform (WT) and the convolutional network. The cross-modal images are divided into low-frequency and high-frequency parts through the WT. DW-Net can adaptively extract shared features from the low-frequency part and useful details from the high-frequency part, which enhances the modality invariance and discrimination of features. Additionally, the multiresolution wavelet features contain multiscale information and contribute to improving the matching accuracy. Extensive experiments demonstrate the significant advantages of DW-Net in terms of accuracy and robustness on cross-modal remote sensing image registration. DW-Net increases the image patch matching accuracy by 3.7% and improves the image registration probability by 12.1%. Moreover, DW-Net shows strong generalization performance from low resolution to high resolution and from optical–synthetic aperture radar to other cross-modal image registration.


I. INTRODUCTION
Remote sensing image registration aims to align images of the same scene in space, which may be obtained at different times, from different viewpoints, or by various sensors [1], [2]. Therefore, image registration is crucial for multitemporal image analysis, multiview image applications, and multimodal image fusion, such as change detection [3], [4], image fusion [5], and object detection [6].
Multimodal images have complementary information and provide rich features for land cover classification and object detection. Building correspondences and performing cross-modal image registration are crucial for improving the performance of remote sensing image applications [7], [8]. For example, optical images contain rich color and texture under good illumination. However, optical images are easily affected by cloud occlusion and camouflage. Synthetic aperture radar (SAR) sensors can capture clear object contours under weak illumination and are largely insensitive to the imaging environment. Therefore, the fusion of optical and SAR images is robust for target detection and recognition in complex environments. Because they have different imaging mechanisms, optical and SAR images have significant appearance differences for the same scene. It is hard to find matching correspondences between cross-modal remote sensing images for registration. Additionally, the severe speckle noise in SAR images and the complex scene content in remote sensing images further complicate optical–SAR cross-modal remote sensing image registration.
The traditional image registration methods can be divided into intensity-based methods and feature-based methods. Intensity-based methods search for the optimal transformation matrix by comparing image similarity of intensity, e.g., mutual information (MI) [9] and normalized cross correlation (NCC) [10]. Feature-based methods establish many local correspondences through nearest neighbor descriptor matching. The traditional feature-based methods mainly rely on handcrafted descriptors, such as gradient histogram statistics in the local neighborhood, shape features, and filter responses [11], [12], [13], which can be viewed as low-level features. The traditional methods have shown good performance on single-modal image registration. However, as shown in Fig. 1, there are significant differences between the intensity images and gradient magnitude maps of cross-modal images. Intensity-based methods and low-level feature-based methods therefore struggle to register cross-modal remote sensing images accurately.
Recently, deep learning methods have been applied to local image patch matching and image registration [14], [15], [16], [17], [18], [19], [20]. They mainly use a deep convolutional network to extract high-level feature representations from local image patches and then either build local correspondences according to the feature distance or directly predict the matching label through fully connected layers. The former is denoted as the deep descriptor learning method [14], [15], [16]. The latter is denoted as the metric learning method, which transforms the image patch matching problem into a binary classification task [17], [18], [19]. Deep networks contain many learnable parameters that can be trained for various images, which gives them higher adaptiveness and better image registration results than the traditional methods. Moreover, deep networks can extract high-level features through multiple nonlinear mappings, which are more robust to noise and image changes (e.g., illumination changes, rotation transformation, and image modality changes). In cross-modal image registration, the deep network should extract image modality-invariant features to identify the matching samples and discriminative features to distinguish nonmatching samples. To achieve this, the deep network is optimized by pulling the matching cross-modal samples close and pushing the nonmatching samples away in the feature space. However, the deep network inevitably discards many useful details when we enforce the features of matching cross-modal samples to be as similar as possible. This weakens the discrimination of features and increases the risk of false matching, degrading cross-modal image registration performance.
To solve the above problems, this article incorporates spectral information into deep convolutional features and proposes an efficient and robust cross-modal registration method. We design a novel deep wavelet learning network (DW-Net) to extract multiresolution wavelet features through the wavelet transform (WT) and deep convolutional layers for matching. The introduced multiresolution spectral information carries rich details to enhance the deep feature representation and improve image registration performance. Specifically, DW-Net first uses the discrete wavelet transform (DWT) to decompose the cross-modal images into different frequency bands, i.e., the low-frequency and high-frequency bands. The former corresponds to the essential contents of the original images, while the latter corresponds to the details of the images; the two are different but complementary. Then, DW-Net adaptively extracts the shared features from the low-frequency part and useful detailed information from the high-frequency component. Additionally, DW-Net captures the multiresolution wavelet features by the multilevel DWT. The multiresolution wavelet features contain multiscale information and contribute to improving the matching accuracy and accelerating the convergence of the deep network.
The proposed DW-Net integrates the advantages of the multiresolution spectral information and the deep convolutional features for cross-modal remote sensing image registration. First, DW-Net is learnable and adaptable for cross-modal remote sensing image matching. Second, DW-Net can capture robust high-level modality-invariant features through multiple nonlinear mapping layers, which can deal with the negative influence of image noise and significant appearance differences between cross-modal images. Third, the introduced multiresolution spectral information in wavelet features contains the image texture and rich details, which can enhance the modality invariance and discrimination and contribute to improving cross-modal image matching and registration. It should be noted that this article does not simply combine WT and the deep convolutional network for image matching and registration. How to combine their advantages to achieve better performance is very important and is also the focus of this article. This article considers the network structure from different perspectives and explores an effective DW-Net for cross-modal remote sensing image registration, including multiresolution wavelet features learning, wavelet information normalization method, and the fusion method of wavelet feature and deep feature.
The main contributions can be summarized as follows.
1) This article proposes a novel deep wavelet learning method for cross-modal image registration, which incorporates spectral information into deep convolutional features to improve image matching and registration performance.
2) This article explores different network structures and designs an efficient DW-Net for cross-modal remote sensing image registration. Meanwhile, we provide an insightful analysis showing that the introduced wavelet information improves the image matching accuracy by enhancing the modality invariance and discrimination of features.

3) This article conducts extensive experiments and analysis to show the effectiveness and robustness of the proposed DW-Net on cross-modal image matching and registration. DW-Net also generalizes well from low-resolution to high-resolution image registration and from optical–SAR image registration to other cross-modal image registration.
The rest of this article is organized as follows. Section II introduces the related work on image registration and the WT. We present details of the proposed network in Section III. The experimental results and analysis on cross-modal image matching and registration are shown in Section IV. Finally, Section V concludes this article.

II. RELATED WORK
This part mainly introduces the related work on image registration, the WT, and the combination of the WT and deep learning in various applications.

A. Remote Sensing Image Registration
Intensity-based image registration methods aim to find the optimal transformation matrix by maximizing the MI or NCC [9], [10], [21]. Yang et al. [22] mix the structure similarity of the frequency domain and the intensity domain to improve the accuracy and robustness of remote sensing image registration. Due to their sensitivity to illumination and image modality changes, intensity-based methods have gradually been replaced by robust feature-based methods. Feature-based methods first extract a local descriptor from the neighborhood of each keypoint. Then, the matching point pairs between two images can be acquired based on the extracted local descriptors. After that, the transformation matrix can be estimated based on the obtained matches. The traditional feature-based methods rely on handcrafted features, such as SIFT [11], which is built on gradient histogram statistics. Various modified versions of SIFT have since been proposed, such as SURF [23] (a fast version of SIFT), Affine-SIFT [24] (an affine-invariant version), and SIFT-OCT [12] and SAR-SIFT [13] (versions improved for SAR image registration).
Descriptor learning methods use a deep convolutional network to learn a feature vector from the input local image patch. Then, they build local correspondences (matching point pairs) between two images according to the feature descriptor distance. The widely used optimization losses of descriptor learning networks are contrastive loss and triplet loss. Their core idea is to minimize the feature distance of matching samples and maximize the feature distance of nonmatching samples. As the representative descriptor learning network, HardNet [16] uses the triplet loss to train the deep network, which expects that the feature distance of the nonmatching samples is larger than that of the matching samples by a margin. Triplet loss relaxes the strict constraint of the feature distance of the matching sample approaching 0, which has a more stable training process than the contrastive loss. Additionally, HardNet adopts the hard sample mining strategy to find the negative samples with a small feature distance to boost network training. Deep learning methods can also be introduced for remote sensing image registration. Recently, deep learning methods for remote sensing image registration mainly improve the deep network structure, network optimization methods, and multiple features learning [33], [34], [35], [36], [37], [38], [39]. In terms of deep network structure, Fan et al. [35] propose a deep residual encoder network for remote sensing image registration, which adopts the multiscale loss function for network training. Xiang et al. [36] propose a feature decoupling network to decouple the semantic features and noise information for optical and SAR image registration, which can effectively remove speckle noise and keep more useful information for improving registration accuracy. In terms of deep network optimization, the provided supervised information in the matching and nonmatching binary labels is limited for deep network training. Quan et al. 
[37] exploit richer similarity information between a series of nonmatching patch pairs to enhance deep network optimization and improve matching accuracy through a self-distillation feature learning network (SDNet). Additionally, Zhou et al. [39] propose to extract multioriented gradient features and multiscale convolutional features for matching. Ye et al. [38] propose a novel structural feature based on first- and second-order gradient information for multimodal image registration. Li et al. [40] propose an adaptive regional multiple-feature matching method for large-scale high-resolution remote sensing image registration. It combines the gradient feature, phase feature, and line feature for more effective feature representation and robust feature matching.
Metric learning methods directly predict the matching label of the input image patch pair, matching or nonmatching, and can be optimized by the binary cross-entropy loss. The representative network structures of metric learning methods are the Siamese network, the pseudo-Siamese network, and the two-channel network [41]. The Siamese network uses two feature extraction branches with the same structure and shared weights to learn features from the image patch pair and then predicts the matching label based on the learned features. The pseudo-Siamese network uses two feature extraction branches with the same structure but unshared weights for feature learning. The two-channel network concatenates the input image patches along the channel dimension and then predicts their matching label through a deep convolutional network. Zhang et al. [42] propose a Siamese fully convolutional network to learn the shared features between multimodal images, which adopts a convolutional computation to compute the similarity score based on the extracted features. Because descriptor learning methods have a faster inference speed than metric learning methods in image registration, this article mainly focuses on improving the descriptor learning method.
The existing deep learning methods aim to extract modality-invariant features by constraining the features of matching cross-modal images to be as similar as possible, which loses many details and reduces feature discrimination. Thus, we propose to combine deep features and wavelet information for image matching, which helps extract modality-invariant and discriminative features and thereby improves matching performance. The main difference between the proposed DW-Net and the existing deep learning methods is that DW-Net incorporates spectral information into deep convolutional features, which effectively enhances the deep features for image matching. As mentioned in [43] and [44], the WT can capture contextual and textural information. Inspired by this, we adopt the WT to enhance the discrimination of deep features and improve the cross-modal image matching performance. Additionally, the proposed DW-Net extracts multiscale features for image matching rather than merely taking the original-resolution image information. The multiscale features can further enhance the feature representation and boost matching performance.

B. Wavelet Transform
The WT decomposes a one-dimensional (1-D) signal or a 2-D image into a set of orthogonal wavelet basis functions. Recently, the combination of the WT and deep neural networks has been widely used in various tasks, such as image classification [45], [46], image dehazing [47], image restoration and denoising [48], [49], change detection [50], and image super-resolution [51], [52]. There are three main motivations for combining the WT and deep learning: using the WT to capture edge features, details, and other high-frequency information to enhance classification performance or the quality of generated images; taking advantage of the spectral features and multiresolution information in the WT; and replacing the pooling operator in the deep convolutional neural network with the WT, which reduces the size of feature maps and increases the receptive field without information loss.
Kang et al. [48] propose a wavelet residual network for medical image denoising, which can recover the detailed texture of the original images. Yang and Fu [47] propose a wavelet U-Net for image dehazing, which uses the WT to extract edge features for enhancing the details of the dehazed images. Zhang et al. [53] combine the WT and knowledge distillation for image-to-image translation. They constrain the high-frequency information between students and teachers to be similar to improve the details of the generated image. To capture frequency features in different directions, Wang et al. [52] train a deep convolutional neural network for approximating the multiscale wavelet representations. After that, the trained CNN is used for aerial image super-resolution, which can capture high-frequency local details and low-frequency global layouts.
However, there is little research on multiresolution wavelet features for image matching and registration. This article proposes an efficient and robust deep wavelet learning method for cross-modal image matching and registration. The core motivation for using the WT in this article is to capture the details and multiresolution spectral information for more discriminative feature learning, improving multimodal remote sensing image matching and registration. Meanwhile, this article provides an insightful analysis of why deep wavelet features contribute to improving cross-modal image matching performance from the perspective of the modality invariance and discrimination of features. We verify the effectiveness and advantages of deep wavelet learning and test the influence of different network structure settings, such as the multilevel WT, the normalization of the wavelet information, and the fusion of wavelet features and deep features.

III. METHOD
This article proposes a DW-Net for cross-modal image registration, as shown in Fig. 2. The existing deep learning methods merely use the deep convolutional network to extract high-level features from the input local image patch for matching, which will lose details and decrease the discrimination of features. DW-Net introduces multiresolution wavelet information that carries rich details to enhance the modality invariance and discrimination of features, significantly increasing the cross-modal image matching and registration performances. Specifically, DW-Net first uses a two-level WT to decompose the image patch into different multiresolution sub-bands. Then, DW-Net learns multiresolution wavelet features and fuses them with deep features for more accurate matching and registration. In the following sections, we will introduce the WT, the proposed DW-Net, the network optimization, and the image registration pipeline.

A. Wavelet Transform
This article adopts the DWT to decompose the original signal into different sub-bands through a low-pass filter and a high-pass filter. The 2-D DWT can be viewed as performing the 1-D DWT twice. As shown in Fig. 3, we first conduct the 1-D DWT along the rows of the image. Then, we perform the 1-D DWT along the columns of the transformed image. The input image patch $x$ can be decomposed into four bands $x_{LL}$, $x_{LH}$, $x_{HL}$, and $x_{HH}$ through the 2-D DWT. $x_{LL}$ represents the low-frequency part, which contains the essential contents of the input signal. $x_{LH}$, $x_{HL}$, and $x_{HH}$ are the high-frequency parts in the horizontal, vertical, and diagonal directions, respectively, which contain the details of the input signal. This article mainly adopts the low-frequency part $x_{LL}$ and the diagonal high-frequency part $x_{HH}$ to enhance the deep convolutional features for matching.
To take advantage of the multiresolution wavelet representations, we adopt the two-level DWT on the input image patch. As shown at the bottom of Fig. 3, $x_{LL}^1$ and $x_{HH}^1$ are acquired from the first-level DWT on the original image patch. $x_{LL}^2$ and $x_{HH}^2$ are obtained from the second-level DWT, which decomposes $x_{LL}^1$. They can be formulated as follows:
$$(x_{LL}^1, x_{LH}^1, x_{HL}^1, x_{HH}^1) = \mathrm{dwt}(x), \qquad (x_{LL}^2, x_{LH}^2, x_{HL}^2, x_{HH}^2) = \mathrm{dwt}(x_{LL}^1) \tag{1}$$
where $\mathrm{dwt}$ represents the discrete wavelet transform operator. It should be noted that after one-level DWT, the spatial size of each decomposed band is reduced by half. So, the spatial size of $x_{LL}^1$ and $x_{HH}^1$ is half that of $x$, and the spatial size of $x_{LL}^2$ and $x_{HH}^2$ is a quarter that of $x$. Fig. 4 shows the visual decomposition results of the WT on a pair of optical and SAR images. We can see that the low-frequency part mainly contains the essential image contents (texture information), while the high-frequency part contains the details (some edges and noise). The low-frequency part and high-frequency part are different but complementary. Thus, we input both parts into DW-Net for deep wavelet feature learning.
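As a concrete sketch of the two-level decomposition (the article does not specify the wavelet basis; the Haar wavelet and the function name below are illustrative assumptions), the 2-D DWT can be written in plain NumPy as a 1-D transform along the rows followed by one along the columns:

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar DWT: returns the (LL, LH, HL, HH) sub-bands.

    Implemented as a 1-D transform along the rows, then along the
    columns, mirroring the row/column scheme described in the text.
    """
    s = np.sqrt(2.0)
    # 1-D Haar DWT along the rows (adjacent column pairs)
    lo = (x[:, 0::2] + x[:, 1::2]) / s   # row low-pass
    hi = (x[:, 0::2] - x[:, 1::2]) / s   # row high-pass
    # 1-D Haar DWT along the columns (adjacent row pairs)
    LL = (lo[0::2, :] + lo[1::2, :]) / s
    LH = (lo[0::2, :] - lo[1::2, :]) / s
    HL = (hi[0::2, :] + hi[1::2, :]) / s
    HH = (hi[0::2, :] - hi[1::2, :]) / s
    return LL, LH, HL, HH

# Two-level decomposition of a 64 x 64 patch, as used by DW-Net
x = np.random.rand(64, 64)
x1_LL, _, _, x1_HH = haar_dwt2(x)       # first-level bands, 32 x 32
x2_LL, _, _, x2_HH = haar_dwt2(x1_LL)   # second-level bands, 16 x 16
```

Each level halves the spatial size, so a 64 × 64 patch yields 32 × 32 first-level bands and 16 × 16 second-level bands, consistent with the half and quarter sizes stated in the text.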

B. Deep Wavelet Learning Network
Deep descriptor learning methods mainly extract an $L_2$-normalized feature vector from the input image patch through a deep convolutional network. The most representative deep descriptor learning method, HardNet, uses several convolutional blocks to extract high-level features from the input image patch. The convolutional blocks mainly contain convolutional layers (Conv), batch normalization (BN), and the nonlinear activation function ReLU. The acquired feature vector $f$ can be formulated as follows:
$$f = F(x), \qquad \|f\|_2 = 1$$
where $F$ is the deep feature mapping function. The extracted high-level feature representations of the image patch are robust to illumination changes, rotation transformation, and even image modality changes. Thus, deep descriptors achieve better performance than handcrafted descriptors on image matching and registration. However, the significant differences between optical and SAR images greatly increase the difficulty of modality-invariant feature learning. Additionally, the deep network loses many useful details in the process of learning the shared features between cross-modal images, which reduces feature discrimination and leads to many false matches. To enhance the learned high-level features, DW-Net introduces spectral information and multiresolution wavelet features into deep feature learning and extracts deep wavelet features for matching. As shown in Fig. 2, DW-Net contains a deep convolutional learning branch (upper branch) and a multiresolution wavelet feature learning branch (lower branch). In the upper branch, DW-Net uses several convolutional blocks to extract high-level features from the original image patch and fuses them with the multiresolution wavelet features acquired from the lower branch.
In the lower branch, DW-Net first extracts the multiscale low-frequency and high-frequency information through the two-level DWT according to (1). It is inappropriate to fuse the decomposed wavelet bands and convolutional features directly. Thus, we first design several convolutional layers (denoted as the wavelet block) to adaptively learn useful information from the decomposed bands. The learned multiresolution wavelet features can be represented as follows:
$$f_w^1 = F_w^1(\mathrm{nor}(x_{LL}^1) \oplus \mathrm{nor}(x_{HH}^1)), \qquad f_w^2 = F_w^2(\mathrm{nor}(x_{LL}^2) \oplus \mathrm{nor}(x_{HH}^2))$$
where $f_w^1$ and $f_w^2$ are the learned first-level and second-level wavelet features, respectively, $F_w^1$ and $F_w^2$ are the wavelet feature learning mapping functions for the first-level and second-level decomposed bands, respectively, $\oplus$ represents the concatenation operator along the channel dimension, and $\mathrm{nor}$ is the normalization operator, such as Min-Max normalization or Z-score normalization. The acquired multiscale low-frequency and high-frequency information ($x_{LL}^1$, $x_{HH}^1$, $x_{LL}^2$, $x_{HH}^2$) should be normalized before deep wavelet feature learning. The influence of normalization is verified in Section IV-E.
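The two normalization choices mentioned for the operator nor can be sketched as follows (a minimal NumPy illustration; the `eps` guard against constant bands is an assumption, not a detail from the article):

```python
import numpy as np

def minmax_norm(band, eps=1e-8):
    """Min-Max normalization: rescale a wavelet sub-band to [0, 1]."""
    return (band - band.min()) / (band.max() - band.min() + eps)

def zscore_norm(band, eps=1e-8):
    """Z-score normalization: shift to zero mean, scale to unit std."""
    return (band - band.mean()) / (band.std() + eps)
```

Either function would be applied to each decomposed band before it enters the wavelet block.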
After that, the learned multiresolution wavelet features $f_w^1$ and $f_w^2$ in the lower branch and the deep features in the upper branch are fused through the wavelet feature fusion mode. Specifically, we concatenate the wavelet feature and the deep feature along the channel dimension. Then, the concatenated features are input to the subsequent convolutional blocks for high-level feature learning. In experiments, we also compare different fusion methods and find that this concatenation fusion is the optimal setting. The deep wavelet feature learning process can be formulated as follows:
$$f_c^1 = F_c^1(x), \qquad f_c^2 = F_c^2(f_c^1 \oplus f_w^1), \qquad f = F_c^3(f_c^2 \oplus f_w^2)$$
where $f_c^i$ represents the learned $i$th feature maps in the upper branch, $F_c^i$ represents the mapping function in the $i$th convolutional block, $f_c^1$ and $f_w^1$ have the same spatial size, and $f_c^2$ and $f_w^2$ have the same spatial size. The detailed parameter settings of each convolutional block and wavelet block are presented in Table I.
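The channel-wise concatenation fusion can be illustrated with dummy tensors in (channel, height, width) layout; the channel counts and the random-weight 1 × 1 convolution standing in for the next learned convolutional block are illustrative assumptions, not the settings of Table I:

```python
import numpy as np

# Upper-branch deep feature maps and lower-branch wavelet features
# with matching spatial size, in (C, H, W) layout
f1_c = np.random.rand(32, 32, 32)   # deep features (assumed 32 channels)
f1_w = np.random.rand(16, 32, 32)   # wavelet features (assumed 16 channels)

# Fuse by concatenation along the channel dimension (the chosen mode)
fused = np.concatenate([f1_c, f1_w], axis=0)   # (48, 32, 32)

# A random-weight 1x1 convolution stands in for the subsequent
# learned convolutional block that processes the fused features
W = np.random.rand(64, fused.shape[0])          # (out_ch, in_ch)
f2_c = np.einsum('oc,chw->ohw', W, fused)       # (64, 32, 32)
```

Concatenation keeps both feature sources intact and lets the following convolutional block learn how to weight them, which is one plausible reason it outperforms the other fusion modes compared in the experiments.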

C. Network Optimization
The ideal local descriptors for cross-modal image matching and registration should have strong image modality invariance and discrimination. On the one hand, the modality invariance means that the matching cross-modal features are similar and are robust to the image modality changes. On the other hand, discrimination means that the nonmatching features are dissimilar, which contributes to separating the matching features from a lot of nonmatching features. To achieve this, we use the triplet loss with the hard negative sample mining strategy [16], [54] for DW-Net optimization.
First, we construct $N$ triplet samples from the optical–SAR image patches. Suppose the triplet samples are $(a_i, p_i, n_i)$, $i = 1, \ldots, N$, where $(a_i, p_i)$ is matching, and $(a_i, n_i)$ and $(p_i, n_i)$ are nonmatching. Second, we extract the deep wavelet features $(f_{a_i}, f_{p_i}, f_{n_i})$ from the triplet samples through DW-Net. Finally, we apply the triplet loss for DW-Net optimization.
The triplet loss aims to constrain the feature distance of the nonmatching sample to be larger than that of the matching sample by a margin. It can be represented as follows:
$$L = \frac{1}{N} \sum_{i=1}^{N} \left[ m + D(f_{a_i}, f_{p_i}) - D(f_{a_i}, f_{n_i}) \right]_+$$
where $[x]_+ = \max(x, 0)$, $D(f_{a_i}, f_{p_i})$ is the feature distance of matching samples, $D(f_{a_i}, f_{n_i})$ is the feature distance of nonmatching samples, and $m$ is the margin, set to $m = 1$.
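As an illustrative sketch (not the authors' implementation), the triplet loss combined with HardNet-style in-batch hard negative mining can be written in NumPy; the mining here is a simplified variant that searches negatives only among the positives of the other pairs, whereas full HardNet mining also searches among the anchors:

```python
import numpy as np

def triplet_loss_hard(fa, fp, margin=1.0):
    """Triplet loss with in-batch hard negative mining.

    fa, fp: (N, D) L2-normalized descriptors of anchors and their
    matching positives. For anchor i, the hard negative distance is
    the smallest distance to any non-matching positive p_j (j != i).
    """
    # Pairwise Euclidean distances between all anchors and positives
    d = np.linalg.norm(fa[:, None, :] - fp[None, :, :], axis=2)
    d_pos = np.diag(d).copy()            # D(f_a_i, f_p_i): matching distances
    d_mined = d.copy()
    np.fill_diagonal(d_mined, np.inf)    # exclude the matching pair itself
    d_hard = d_mined.min(axis=1)         # D_hard_i: hardest negative per anchor
    # Hinge: penalize when the negative is not farther by at least `margin`
    return np.mean(np.maximum(margin + d_pos - d_hard, 0.0))
```

With orthogonal unit descriptors the hard negatives sit at distance √2 > m + 0, so the loss vanishes; with collapsed descriptors the loss saturates at the margin, which is the behavior the mining is meant to penalize.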
To boost the network optimization, we also adopt the hard negative sample mining strategy. It means that, for each matching pair in a batch, the nonmatching sample with the smallest feature distance to the anchor is selected as the negative for the triplet loss, i.e., $D(f_{a_i}, f_{n_i})$ is replaced by $D_{hard,i}$, where $D_{hard,i}$ represents the feature distance of the $i$th hard negative image patch pair.

Algorithm 1: Image Registration Process.
Input: Unregistered images and the trained DW-Net.
Output: Registered images.
Step 1: Keypoint detection. Detect keypoints from the input images using ORB [55].
Step 2: Feature extraction. Clip local image patches surrounding the keypoints and then extract the local descriptor of each patch through the trained DW-Net.
Step 3: Feature matching. Obtain matching point pairs based on the nearest neighbor matching strategy.
Step 4: False point-pairs elimination. Eliminate the potential false point pairs based on GMS [56] and RANSAC [57].
Step 5: Image transformation and alignment. Compute the transformation matrix from the reliable matches and align the images.

D. Image Registration Pipeline
After training, DW-Net can be used for cross-modal image registration. The image registration pipeline mainly contains five steps: keypoint detection, feature extraction, feature matching, false point-pair elimination, and image transformation and alignment; refer to Algorithm 1 for more details. First, the traditional ORB method [55] is used to detect keypoints from the optical and SAR images. Second, DW-Net extracts the local descriptors of the keypoints from their corresponding local image patches. Third, the matches between the optical and SAR images are acquired according to the feature distance: each point is paired with the point whose descriptor has the smallest feature distance. After that, to obtain reliable matches, we use GMS [56] and RANSAC [57] to eliminate the potential false point pairs. Finally, we compute the transformation matrix based on the acquired reliable matches and perform image transformation and alignment according to the calculated transformation matrix.
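The nearest neighbor matching of Step 3 can be sketched as follows (a minimal NumPy version; in the actual pipeline the inputs would be the L2-normalized DW-Net descriptors of the detected keypoints):

```python
import numpy as np

def nn_matches(desc1, desc2):
    """Nearest neighbor matching: pair each descriptor from image 1
    with the image-2 descriptor at the smallest Euclidean distance.
    Returns a list of (index_in_image1, index_in_image2) pairs."""
    # Full pairwise distance matrix between the two descriptor sets
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nn = d.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nn)]
```

These raw matches still contain outliers, which is why the pipeline follows them with GMS and RANSAC filtering.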

IV. EXPERIMENTS
This part tests the effectiveness and advantages of the proposed DW-Net on cross-modal local image patch matching and image registration. We first introduce the experimental dataset, implementation details, and evaluation metrics. Then, we study the influence of deep wavelet features and the DW-Net structure, such as multiresolution deep wavelet features learning, the normalization method on decomposed wavelet sub-bands, and the fusion method of wavelet features in the lower branch and the deep features in the upper branch, respectively. Additionally, we provide an insightful analysis of why the deep wavelet features can significantly improve the cross-modal image matching performances from the perspective of the modality invariance and discrimination of features. Meanwhile, we present the network convergence and complexity analysis. Finally, we compare the DW-Net with other methods in various cross-modal image registration.

A. Data Introduction
This article adopts the public aligned optical–SAR dataset SEN1-2 [58], [59]. The whole image size is 256 × 256, and the image resolution is 10 m. First, we clip a large number of optical and SAR image patches from SEN1-2 to optimize DW-Net. The size of the image patches is 64 × 64. The training and test sets for image patch matching contain 583 180 and 248 274 patch pairs, respectively. Second, we test the cross-modal remote sensing image registration performance of DW-Net on 570 randomly selected optical–SAR image pairs. The cross-modal images can be preregistered by using metadata or physical sensor models [7], [8]. We can correct the scaling and rotation transformations according to the parameters of the sensors. Following the same setting as previous works [7], [8], [60], [61], [62], we generate the test cross-modal remote sensing images with a slight translation: we conduct a random translation transformation on the SAR images and then test the cross-modal image registration performance. The main challenge of cross-modal image registration is to deal with the significant differences between cross-modal images caused by different imaging mechanisms and the negative influence of speckle noise.
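The random translation used to generate the test pairs can be sketched as below; the shift bound `max_shift`, the zero-filled borders, and the fixed seed are assumptions for illustration, as the article does not state them:

```python
import numpy as np

def random_translate(img, max_shift=10, rng=None):
    """Shift a 2-D image by a random integer offset (dy, dx) with
    |dy|, |dx| <= max_shift; uncovered borders are zero-filled."""
    if rng is None:
        rng = np.random.default_rng(0)
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    out = np.zeros_like(img)
    h, w = img.shape
    ys, xs = max(dy, 0), max(dx, 0)            # destination window
    ye, xe = h + min(dy, 0), w + min(dx, 0)
    out[ys:ye, xs:xe] = img[ys - dy:ye - dy, xs - dx:xe - dx]
    return out, (dy, dx)
```

Applying this to each SAR image of an aligned pair yields a test pair whose ground-truth transformation is the known (dy, dx) offset.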

B. Implementation Details and Metrics
This article adopts the ADAM optimizer to train DW-Net. The initial learning rate is 1.0, the batch size is 500, and the training epoch is 20.
We take FPR95 and the matching accuracy as evaluation metrics to verify the image patch matching performance. A smaller FPR95 and a larger accuracy represent better matching results. Additionally, there are many metrics for image registration performance evaluation. For example, the number of successfully registered images $Num_m$ and the matching probability of images $I_{mp}$ reflect the effectiveness and robustness of the method. The root-mean-square error $RMSE$ and the transformation matrix error $H_{err}$ represent the registration precision. We also use the average number of matching point pairs $Num_p$ to show image registration performance. The smaller the $RMSE$ and $H_{err}$, the higher the registration precision.
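For reference, FPR95 (the false positive rate at 95% recall of matching pairs) follows the standard definition and can be sketched in NumPy; this is a generic implementation, not the authors' evaluation code:

```python
import numpy as np

def fpr95(pos_dist, neg_dist):
    """FPR95: false positive rate at 95% true positive recall.

    pos_dist / neg_dist: descriptor distances of matching /
    nonmatching pairs (a smaller distance means a predicted match).
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    # Distance threshold that recalls 95% of the matching pairs
    thresh = np.percentile(pos_dist, 95)
    # Fraction of nonmatching pairs falling under that threshold
    return np.mean(neg_dist <= thresh)
```

When the two distance distributions are well separated, FPR95 approaches 0; when they fully overlap, it approaches 1.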

C. Influence of Wavelet Features
DW-Net extracts the deep wavelet features for matching, which introduces wavelet information to enhance the high-level features. It is necessary to verify the influence of deep wavelet features. Thus, we test the image matching results based on the deep convolutional network with wavelet features and without (W/O) wavelet features. We can draw the following conclusions from the results in Table II.
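For intuition, a single-level 2-D wavelet transform splits a patch into one low-frequency band (LL) and three high-frequency detail bands (LH, HL, HH), each at half resolution. A minimal numpy sketch with an unnormalized Haar wavelet (the paper does not specify its wavelet basis, so Haar is an illustrative assumption):

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar wavelet transform (unnormalized average /
    difference variant) of an even-sized patch. Returns the
    low-frequency band LL and high-frequency bands (LH, HL, HH)."""
    a, b = x[0::2, :], x[1::2, :]              # adjacent row pairs
    lo, hi = (a + b) / 2.0, (a - b) / 2.0      # row average / difference
    ll = (lo[:, 0::2] + lo[:, 1::2]) / 2.0     # column average of lows
    lh = (lo[:, 0::2] - lo[:, 1::2]) / 2.0     # column difference of lows
    hl = (hi[:, 0::2] + hi[:, 1::2]) / 2.0
    hh = (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return ll, (lh, hl, hh)

patch = np.random.default_rng(0).random((64, 64))
ll, (lh, hl, hh) = haar_dwt2(patch)
```

Here each LL coefficient is simply the mean of a 2 × 2 block, which makes concrete why the low-frequency band preserves the shared image structure while the detail bands carry edges and texture.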
First, the introduced wavelet information can improve matching performances. When we merely use the wavelet features learned from the decomposed wavelet bands for image patch matching, the matching accuracy increases from 94.56% to 97.45%, and FPR95 decreases from 5.98 to 1.26. Additionally, the wavelet information can significantly enhance the matching performances of deep features. The matching accuracy increases from 94.56% to 98.29%, and the FPR95 decreases from 5.98 to 0.67.
Second, the most effective image information is preserved in the low-frequency wavelet band. When we merely use low-frequency wavelet features for matching, the matching accuracy can still achieve 93.09%. Compared with the learned deep convolutional features, there is a slight degradation.
Third, the high-frequency wavelet band contains useful details, which can further improve the matching performance. Compared with the matching results of "DW-Net W/O Image, HH" model, the introduced high-frequency wavelet band in "DW-Net W/O Image" model can significantly improve the matching accuracy by 4.36%.
Finally, DW-Net combines deep convolutional features from the image patch, low-frequency, and high-frequency wavelet features, which acquires the best matching performance. Thus, we adopt this setting in the following experiments. Additionally, from the perspective of network convergence, the training process slightly fluctuates in the deep feature learning network without wavelet information, as shown in Fig. 5. The introduced wavelet features in DW-Net can stabilize the optimization process, accelerate network convergence, and enhance the matching performance.

D. Influence of Multiresolution Wavelet Features
To take advantage of multiresolution wavelet features, we perform a multiple-level WT on the image patch to enhance feature learning. This section mainly tests the influence of multiresolution wavelet features. As shown in Table III, the matching performance gradually improves as the WT level increases. Specifically, the one-level wavelet features increase the matching accuracy of deep features from 94.56% to 98.08% and decrease the FPR95 from 5.98 to 0.76. The two-level and three-level wavelet features further increase the matching accuracy to 98.29% and 98.39%, respectively. When the WT level is increased further, the performance gains diminish gradually. Additionally, multiple-level WT brings more network parameters, as shown in Fig. 6. Thus, considering both the matching performance and the network complexity, DW-Net adopts the two-level WT for image matching and registration.
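Each WT level halves the spatial resolution of the bands, which is where the multiscale information (and the extra parameters of each level's learning branch) comes from. A minimal sketch, recursing only on the low-frequency band as in a standard multi-level decomposition (Haar block means are an illustrative choice):

```python
import numpy as np

def haar_ll(x):
    """Low-frequency (LL) band of one Haar level: 2x2 block means."""
    return (x[0::2, 0::2] + x[0::2, 1::2]
            + x[1::2, 0::2] + x[1::2, 1::2]) / 4.0

def multilevel_shapes(x, levels):
    """Band shapes produced by each WT level; every level halves the
    resolution, so deeper levels add coarser, more global features
    at the cost of an extra per-level learning branch."""
    shapes = []
    for _ in range(levels):
        x = haar_ll(x)            # recurse on the low-frequency band only
        shapes.append(x.shape)
    return shapes

shapes = multilevel_shapes(np.zeros((64, 64)), levels=3)
```

For a 64 × 64 patch, levels 1-3 produce 32 × 32, 16 × 16, and 8 × 8 bands, matching the diminishing returns observed in Table III.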

E. Influence of Normalization
We conduct normalization on the acquired low-frequency and high-frequency wavelet bands in DW-Net. This section mainly verifies the influence of different normalization methods. As shown in Table IV, when DW-Net does not normalize the wavelet decomposed bands, the matching accuracy increases only slightly. When DW-Net applies a normalization operator to the wavelet decomposed bands, the matching performance improves further: Min-Max normalization increases the matching accuracy by 0.96%, and Z-score normalization increases it by 1.21%.
Additionally, Fig. 7 shows the FPR95 and matching accuracy changes during the training process. The normalization operator can speed up the network convergence and enhance the matching performance. Thus, it is necessary to normalize the wavelet decomposed bands before deep wavelet features learning. This article adopts the DW-Net with Z-score normalization to acquire the best matching performances.
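The two normalization operators compared above can be sketched as follows (the eps guard against constant bands is our own addition):

```python
import numpy as np

def zscore(band, eps=1e-8):
    """Z-score normalization: zero mean, unit standard deviation."""
    return (band - band.mean()) / (band.std() + eps)

def minmax(band, eps=1e-8):
    """Min-Max normalization into [0, 1]."""
    lo, hi = band.min(), band.max()
    return (band - lo) / (hi - lo + eps)

band = np.array([[2.0, 4.0], [6.0, 8.0]])   # toy wavelet band
z = zscore(band)
m = minmax(band)
```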

F. Influence of Fusion Method
The fusion method for wavelet features and convolutional features is also crucial in DW-Net. We test the image patch matching performance of different fusion methods: the addition fusion method, the channel connection fusion method, and the attention weighted fusion method. In the addition fusion method, the learned wavelet features from the lower branch and the convolutional features from the upper branch are added and input to the next convolutional block. In the channel connection fusion method, we directly concatenate the wavelet features and convolutional features along the channel dimension. In the attention weighted fusion method, the wavelet features and convolutional features are concatenated along the channel dimension and then passed through a channel attention module, which assigns an attention weight to the feature map of each channel. Specifically, we perform average pooling on the feature map and then learn the attention weights for each channel.
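The three fusion methods can be sketched in numpy as follows; the channel-first (C, H, W) layout and the sigmoid gate parameters `w`, `b` are illustrative assumptions (in DW-Net they would be learned):

```python
import numpy as np

def fuse_add(conv_feat, wav_feat):
    """Addition fusion: element-wise sum (channel counts must match)."""
    return conv_feat + wav_feat

def fuse_concat(conv_feat, wav_feat):
    """Channel-connection fusion: stack along the channel axis."""
    return np.concatenate([conv_feat, wav_feat], axis=0)

def fuse_attention(conv_feat, wav_feat, w, b):
    """Attention-weighted fusion: concatenate, average-pool each channel,
    pass the pooled vector through a sigmoid gate, and rescale every
    channel by its attention weight."""
    x = np.concatenate([conv_feat, wav_feat], axis=0)     # (C, H, W)
    pooled = x.mean(axis=(1, 2))                          # (C,)
    attn = 1.0 / (1.0 + np.exp(-(w @ pooled + b)))        # (C,)
    return x * attn[:, None, None]

C, H, W = 4, 8, 8
conv = np.ones((C, H, W))
wav = np.full((C, H, W), 2.0)
cat = fuse_concat(conv, wav)
gated = fuse_attention(conv, wav, np.eye(2 * C), np.zeros(2 * C))
```

Note that concatenation doubles the channel count (and hence the parameters of the next block), which is the trade-off discussed above.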
The comparison results are shown in Fig. 8. Because all three fusion methods introduce wavelet features, they achieve better matching performance than the deep network without wavelet features. Additionally, the channel connection fusion method obtains the best matching accuracy. The main reason is that the channel connection mode preserves richer feature information than the addition mode. Although the attention weighted fusion method also adopts the channel connection mode, it brings more parameters and acquires slightly worse matching accuracy than the channel connection fusion method. Thus, this article adopts the DW-Net with the simple channel connection fusion method in experiments.

G. Modality-Invariant and Discrimination Analysis
Generally, the feature distance of the matching samples can reflect the modality invariance of the learned descriptors, while that of the nonmatching samples and the margin between the matching and nonmatching samples can reflect the discrimination of the local descriptors. The ideal local descriptors should have good modality invariance to find the matching samples with a small distance. Meanwhile, they should have strong discrimination to separate the nonmatching samples with a large feature distance. The large distance margin between matching and nonmatching samples can help the network to distinguish the matching and nonmatching samples.
To further verify the effectiveness of the wavelet features, we compute the mean feature distances of the matching and nonmatching samples and their margin in Table V. Meanwhile, we show the feature distance distributions of matching and nonmatching samples through kernel density estimation in Fig. 9. We can see that DW-Net achieves the smallest feature distance for matching samples, the largest feature distance for nonmatching samples, and the most significant distance margin between matching and nonmatching samples. We can also see the positive influence of the high-frequency information on image matching by comparing the results of DW-Net W/O HH and DW-Net. These experimental results demonstrate that the deep wavelet features have better modality invariance and stronger discrimination than the deep features.
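The distance statistics reported in Table V can be computed as follows; this is a toy numpy sketch with synthetic descriptors, not the paper's data:

```python
import numpy as np

def distance_stats(desc_a, desc_b, labels):
    """Mean L2 distance of matching (label 1) and nonmatching (label 0)
    descriptor pairs, plus the margin between the two means."""
    d = np.linalg.norm(desc_a - desc_b, axis=1)
    d_pos = d[labels == 1].mean()
    d_neg = d[labels == 0].mean()
    return d_pos, d_neg, d_neg - d_pos

# toy L2-normalized descriptors: matches are identical, mismatches flipped
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 128))
a /= np.linalg.norm(a, axis=1, keepdims=True)
flip = rng.random(100) < 0.5
b = np.where(flip[:, None], -a, a)
labels = (~flip).astype(int)
d_pos, d_neg, margin = distance_stats(a, b, labels)
```

A good cross-modal descriptor drives `d_pos` down (modality invariance) and `d_neg` and the margin up (discrimination), which is exactly what Table V measures.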
We also present the visualization of the fused deep wavelet features (f_c^1, f_c^2) extracted from cross-modal remote sensing images. As shown in Fig. 10, the proposed DW-Net can extract distinctive texture features from remote sensing images. Meanwhile, the features extracted from the optical and SAR images are similar. The feature visualization also demonstrates that DW-Net learns modality-invariant features from cross-modal images and discriminative features for matching and registration.

H. Comparative Experiments on Image Patch Matching
In this part, we compare the local image patch matching results of DW-Net with other deep learning based methods; the comparison results are shown in Table VI. Unlike the descriptor learning networks (HardNet and DW-Net), which distinguish the matching and nonmatching samples according to the feature distance, the metric learning networks (Siamese, Siamese-2stream, Pseudo-Siamese, 2-channel, and 2-channel-2stream) directly use the network to predict the matching label of the input patch pairs. We adopt the binary cross-entropy loss to train these metric learning networks and the triplet loss to optimize HardNet and DW-Net. From the results in Table VI, we can see that DW-Net acquires the smallest FPR95 and the largest matching accuracy. As DW-Net and HardNet share the same optimization loss function, the matching performance gains of DW-Net mainly come from the introduced wavelet features. Specifically, DW-Net increases the matching accuracy of HardNet from 94.59% to 98.29% and decreases the FPR95 of HardNet from 5.98 to 0.67. We also compare the proposed DW-Net with the recent multimodal remote sensing image registration method SDNet [37]. SDNet mainly improves the local feature representation through multiple optimization losses, such as the matching loss L_m, the self-distillation learning based on the feature consistency loss L_con, and the reconstruction loss L_recon, whereas DW-Net focuses on enhancing the local feature representation by fusing the multiresolution wavelet features. DW-Net always achieves better matching performance than SDNet. Specifically, when both SDNet and DW-Net are optimized by the matching loss, the proposed DW-Net has significant advantages over SDNet. The extracted multiresolution deep wavelet features contribute to boosting remote sensing image matching.
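The triplet loss with in-batch hardest negative mining used by HardNet-style descriptor learning can be sketched as follows; this is a generic sketch, not the paper's exact loss (the exponential form mentioned later is not reproduced here):

```python
import numpy as np

def hardest_triplet_loss(anchors, positives, margin=1.0):
    """Triplet margin loss with in-batch hardest negative mining
    (HardNet-style sketch). anchors[i] matches positives[i]; every
    other row in the batch is a candidate negative."""
    n = len(anchors)
    # pairwise L2 distances between all anchor / positive descriptors
    d = np.linalg.norm(anchors[:, None, :] - positives[None, :, :], axis=2)
    d_pos = np.diag(d)
    d_off = d + np.eye(n) * 1e6      # mask out the matching pairs
    # hardest (closest) negative for each anchor and each positive
    d_neg = np.minimum(d_off.min(axis=1), d_off.min(axis=0))
    return float(np.maximum(0.0, margin + d_pos - d_neg).mean())
```

The loss is zero once every matching pair is closer than the hardest negative by at least `margin`, which is what pushes matching distances down and nonmatching distances up.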
Although SDNet uses the matching loss, the feature consistency loss, and the reconstruction loss for network training, DW-Net still achieves higher matching accuracy and lower FPR95. These experimental results show the effectiveness of the multiresolution deep wavelet features on cross-modal remote sensing image matching.

I. Comparative Experiments on Image Registration
In the image registration process, we need to compare the similarity of all image patch pairs, so image registration can be viewed as dense image patch matching. There are many similar local image patches, which tend to cause confusion and result in false matches; therefore, image registration is more challenging than image patch matching. To further test the effectiveness of the proposed DW-Net, we compare the image registration results of DW-Net with other representative traditional methods and deep learning based methods on this more difficult task. Metric networks need to predict the matching label of each image patch pair, which significantly increases the image registration time cost; thus, we merely compare the image registration results of DW-Net with the deep descriptor learning network. 1) Registration Accuracy: As shown in Table VII, the traditional methods SIFT [11], ORB [55], and PSO-SIFT [63] have difficulty dealing with the significant nonlinear differences between optical and SAR images and fail to register the cross-modal remote sensing images. Compared with the representative deep learning based method, HardNet, the proposed DW-Net increases Num_m from 386 to 455, and the successful matching probability I_mp improves from 67.72% to 79.82%. When we adopt the exponential form of the triplet loss for DW-Net optimization, the image registration performance improves further: I_mp increases from 82.11% to 85.79%. These results demonstrate the effectiveness and robustness of DW-Net on cross-modal image registration. Additionally, DW-Net acquires a smaller image registration error RMSE and a smaller transformation matrix error H_err than HardNet.
2) Time Cost: We compare the average time cost of each image registration stage (keypoint detection, feature extraction, feature matching, elimination, registration, and total time) based on HardNet (without wavelet feature learning, which can be viewed as Wavelet-level0) and DW-Net. As shown in Fig. 11, DW-Net takes more time than HardNet, mainly on feature extraction. The main reason is that the introduced wavelet information learning branch adds network parameters and increases the computational cost, as shown in Figs. 6 and 11. However, the deep wavelet features in DW-Net bring significant image registration performance gains. In the future, we will focus on decreasing the time cost while maintaining high image registration accuracy.
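Since the test transforms are pure translations, the registration error metrics can be illustrated with a minimal numpy sketch that estimates the shift from matched keypoints (the median estimator and the function name are our own illustrative choices, not the paper's pipeline):

```python
import numpy as np

def register_translation(pts_ref, pts_mov, true_shift):
    """Estimate a pure-translation transform from matched keypoints
    (median for robustness to false matches) and report the RMSE of
    the residual alignment error plus the shift estimation error."""
    est = np.median(pts_mov - pts_ref, axis=0)
    residual = pts_mov - est - pts_ref   # after undoing the estimated shift
    rmse = np.sqrt(np.mean(np.sum(residual ** 2, axis=1)))
    shift_err = np.linalg.norm(est - np.asarray(true_shift))
    return est, rmse, shift_err

rng = np.random.default_rng(1)
ref = rng.random((50, 2)) * 256
mov = ref + np.array([7.0, -3.0])            # ground-truth shift
mov[:5] += rng.normal(0, 40, size=(5, 2))    # a few false matches
est, rmse, err = register_translation(ref, mov, (7.0, -3.0))
```

The median recovers the true shift despite the injected false matches, illustrating why robust estimation (RANSAC-style elimination in the actual pipeline) is a distinct registration stage.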
3) Visual Results: Fig. 12 presents the visual results of optical and SAR image registration. The first two columns of Fig. 12 are aligned optical and SAR images. We conduct a random translation transformation on the SAR images to test the image registration performance. We randomly show 10% of the matching point pairs, as shown in the third column of Fig. 12. It can be seen that these matching point pairs are correct and have high positioning accuracy. As shown in the fourth column of Fig. 12, the junctions in the checkerboard images are smooth and without misplacement. These results also demonstrate the effectiveness of the proposed DW-Net on optical and SAR image registration.

J. High-Resolution Optical-SAR Image Registration
To test the generalization of the proposed DW-Net, we show the image registration performance of DW-Net on high-resolution optical and SAR images. The size of the test optical and SAR images is 512 × 512, and their resolution is 1 m [64], [65]. It should be noted that DW-Net is trained on image patches acquired from the low-resolution (10 m) SEN1-2 dataset. Fig. 13 shows the image registration results on high-resolution optical-SAR images. Although DW-Net is trained on low-resolution optical-SAR images, it still achieves accurate registration results. These results show the good generalization performance of DW-Net from low-resolution to high-resolution optical-SAR image registration.

K. Cross-Modal Image Registration
To further show the effectiveness of the proposed DW-Net, we test the image registration performance of DW-Net on other cross-modal images, such as visual and LiDAR image registration. We directly use the DW-Net trained on the SEN1-2 dataset for image registration. The detailed image information is presented in Table VIII. There are slight translation changes between these cross-modal images. To increase the difficulty of the cross-modal image registration task, we perform an extra random translation of about [−16, 16] pixels on the test image and then evaluate the image registration results. Fig. 14 shows the visual registration results. We can see that DW-Net still achieves many matching point pairs. The edges in the checkerboard images are smooth and without misalignment. Additionally, the RMSE registration errors are close to the subpixel level. These experimental results show the strong generalization ability from optical-SAR image registration to other cross-modal images.
We also test the registration performances of our method on large-scale remote sensing images. The size of the test images is 7666 × 7692, which are acquired from the estuary of the Yellow River in Shandong, China, by Radarsat-2. As shown in Fig. 15, DW-Net still achieves many matching point pairs in large-scale remote sensing images. The large-scale remote sensing images are accurately registered, and the junction of the checkerboard image is very smooth without any misalignment. The main reason is that large-scale images have rich texture features, which benefit image matching and registration.

V. CONCLUSION
This article proposes an efficient and robust deep wavelet learning method for cross-modal remote sensing image registration. We design a DW-Net that introduces spectral information and multiresolution deep wavelet features for matching and registration. Extensive experimental results have shown the effectiveness and advantages of the proposed DW-Net on cross-modal local image patch matching and image registration. The learned deep wavelet features have better modality invariance to match the cross-modal samples and stronger discrimination to separate the nonmatching samples, showing significant advantages in cross-modal image matching and registration over the representative deep feature learning networks. Additionally, DW-Net shows strong generalization from low-resolution to high-resolution image registration and from optical-SAR image registration to other cross-modal image registration. In the future, we will study more effective wavelet features for improving cross-modal remote sensing image registration, and we will implement DW-Net on MindSpore, a new deep learning computing framework.