Modified Dynamic Routing Convolutional Neural Network for Pan-Sharpening

Abstract: Based on deep learning, various pan-sharpening models have achieved excellent results. However, most of them adopt simple addition or concatenation operations to merge the information of low spatial resolution multi-spectral (LRMS) images and panchromatic (PAN) images, which may cause a loss of detailed information. To tackle this issue, inspired by capsule networks, we propose a plug-and-play layer named modified dynamic routing layer (MDRL), which modifies the information transmission mode of capsules to effectively fuse LRMS images and PAN images. Concretely, the lower-level capsules are generated by applying a transform operation to the features of LRMS images and PAN images, which preserves the spatial location information. Then, the dynamic routing algorithm is modified to adaptively select the lower-level capsules to generate the higher-level capsule features that represent the fusion of LRMS images and PAN images, which can effectively avoid the loss of detailed information. In addition, the previous addition and concatenation operations are shown to be special cases of our MDRL. Based on MIPSM with addition operations and DRPNN with concatenation operations, two modified dynamic routing models named MDR-MIPSM and MDR-DRPNN are further proposed for pan-sharpening. Extensive experimental results demonstrate that the proposed method can achieve remarkable spectral and spatial quality.


Introduction
Due to its broad applications, remote sensing image processing has become an active field in computer vision. Based on different application scenarios, there are many representative research directions, such as hyperspectral image classification [1][2][3][4][5][6][7][8][9], estimation of the number of endmembers [10,11], hyperspectral unmixing [12,13] and pan-sharpening [14][15][16]. Pan-sharpening fuses the information of LRMS images and PAN images to obtain high spatial resolution multi-spectral (HRMS) images, which contain the rich spectral information of LRMS images and the spatial details of PAN images.
In recent decades, numerous pan-sharpening algorithms have been proposed. These methods can be roughly divided into two kinds: traditional pan-sharpening approaches and deep learning approaches. Traditional pan-sharpening approaches fall into three categories, namely component substitution (CS) [17][18][19], multi-resolution analysis (MRA) [20][21][22] and variational optimization-based (VO) [23][24][25] methods. CS-based methods replace specific components of the LRMS images with the PAN images. MRA-based methods inject the spatial details contained in the PAN images into the LRMS images via multi-resolution analysis. Different from them, VO-based methods treat pan-sharpening as an inverse problem and solve it by designing optimization algorithms.
With the flourishing of deep learning [26][27][28], convolutional neural networks (CNNs) have been used to improve the performance of pan-sharpening methods. Representative pioneer-

• To replace addition or concatenation operations in many deep learning models, we modify the dynamic routing algorithm to construct a modified dynamic routing layer (MDRL). To our knowledge, MDRL may be the first attempt to fuse LRMS images and PAN images by modifying the information transmission mode of capsules for pan-sharpening. In addition, the addition and concatenation operations are shown to be special cases of our MDRL.
• In MDRL, the spatial location information is preserved by the convolutional operator in the transform operation and the vectorize operation. Furthermore, the coupling coefficients are learned by the MDR algorithm, which makes MDRL fuse the information of PAN images and LRMS images more effectively than a simple concatenation or summation operation.
• The proposed MDRL is inserted into two baseline models (i.e., MIPSM and DRPNN) to generate two neural networks named MDR-MIPSM and MDR-DRPNN. Quantitative experiments on three benchmark datasets demonstrate the superiority of our method.
The rest of this paper is organized as follows: MDRL and its corresponding models MDR-MIPSM and MDR-DRPNN are introduced in Section 2. Section 3 reports the experimental results. Section 4 discusses our model through ablation experiments. Section 5 concludes this paper.

Dynamic Routing Algorithm
A dynamic routing algorithm was proposed to transfer information between capsules in adjacent layers of a capsule network [40,41]. Let us remark that the terminology "capsule" refers to a feature that is ready for classification or other high-level vision tasks; capsules can simply be regarded as feature maps. Assume that u_i is an m-dimensional vector representing the output of the ith lower-level capsule. The "prediction vector" û_{j|i} for the jth higher-level capsule is obtained by multiplying u_i by a weight matrix W_{ij} ∈ R^{n×m}:

\hat{u}_{j|i} = W_{ij} u_i.    (1)

Thus, û_{j|i} is an n-dimensional vector, and the capsule network sums all predictions û_{j|i} with weights c_{ij} to obtain the input s_j of the jth higher-level capsule:

s_j = \sum_{i=1}^{M} c_{ij} \hat{u}_{j|i},    (2)

c_{ij} = \frac{\exp(b_{ij})}{\sum_{k=1}^{N} \exp(b_{ik})},    (3)

where c_{ij} is the coupling coefficient between the ith lower-level capsule and the jth higher-level capsule, while M and N represent the number of lower-level capsules and higher-level capsules, respectively. b_{ij} is the unnormalized coupling coefficient, which is iteratively updated by the dynamic routing algorithm and initialized to zero. Instead of a traditional activation function, capsule networks use a non-linear "squashing" function to map s_j to its corresponding output v_j:

v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}.    (4)

Then capsule networks calculate the agreement a_{ij} between v_j and its prediction vector û_{j|i}:

a_{ij} = \hat{u}_{j|i} \cdot v_j.    (5)

Last, the coupling coefficients b_{ij} are updated by adding a_{ij}. Supposing that there are r iterations, the final coupling coefficients b_{ij} are fixed after r iterations, and the outputs of the higher-level capsules are then calculated by Equation (4).

This reveals that dynamic routing is an algorithm that automatically models the relationships between low-level and high-level capsules, where the coupling coefficient indicates which low-level capsule contributes more to a high-level capsule. As a matter of fact, this procedure can be applied to the fusion of information from LRMS and PAN images: LRMS and PAN images are viewed as low-level capsules, while the fused image is the high-level capsule. In this manner, the dynamic routing algorithm can be viewed as a fusion strategy that automatically selects important features to reconstruct an HRMS image, where importance is measured by the coupling coefficient. It is worth noting that the coupling coefficients are determined by the input data, which means they can vary with different samples; thus, in the test stage, MDRL can perform better than a simple fusion strategy (e.g., addition or concatenation). The original dynamic routing algorithm proposed in [40,41] was designed for image classification, so we modify it to make it compatible with pan-sharpening.
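For concreteness, the following NumPy sketch traces Equations (1)-(5) of the original routing procedure. The dimensions, random inputs and helper names are illustrative only and are not taken from the capsule network implementation of [40,41].

```python
import numpy as np

def squash(s):
    # Equation (4): shrink short vectors towards 0 and long vectors towards unit length.
    norm2 = np.sum(s ** 2)
    return (norm2 / (1.0 + norm2)) * s / (np.sqrt(norm2) + 1e-9)

def dynamic_routing(u, W, r=3):
    """Original routing between M lower-level and N higher-level capsules.

    u : (M, m) outputs of the lower-level capsules
    W : (N, M, n, m) transformation matrices W_ij
    """
    u_hat = np.einsum('jinm,im->jin', W, u)                    # Equation (1): prediction vectors
    M, N = u.shape[0], W.shape[0]
    b = np.zeros((M, N))                                       # unnormalized coupling coefficients
    for _ in range(r):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # Equation (3): softmax over j
        s = np.einsum('ij,jin->jn', c, u_hat)                  # Equation (2): weighted sums
        v = np.stack([squash(s_j) for s_j in s])               # Equation (4): capsule outputs
        b = b + np.einsum('jin,jn->ij', u_hat, v)              # Equation (5): agreement update
    return v

# Toy usage: M = 2 lower-level capsules of dimension 4, N = 3 higher-level capsules of dimension 6.
v = dynamic_routing(np.random.randn(2, 4), np.random.randn(3, 2, 6, 4))
print(v.shape)  # (3, 6)
```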

MDRL and MDRCNN
Here, for fusing the features of LRMS images and PAN images, a modified dynamic routing (MDR) algorithm and a modified dynamic routing layer (MDRL) are first introduced. Then, a deep model named modified dynamic routing CNN (MDRCNN) is proposed, based on MDRL.
For convenience, some notations are summarized first. An LRMS image is denoted as L ∈ R^{h×w×B}, where h, w and B are its height, width and number of bands, respectively. Similar notation applies to a PAN image P ∈ R^{H×W×b}. Thus, our target HRMS image is H ∈ R^{H×W×B}. In addition, the convolutional operator Conv(X, c_in, c_out) is defined, where X, c_in and c_out represent the input feature, the number of input channels and the number of output channels, respectively.
Modifying the dynamic routing, MDRL is constructed for fusing the LRMS images and PAN images. The proposed MDRL consists of two parts: transform operation and modified dynamic routing (MDR) algorithm. Using the transform operation, we obtain the "prediction vector" of higher-level capsules. Then, MDR is proposed to obtain the representation of higher-level capsules which contains the information of LRMS images and PAN images.
(1) Transform operation: First, the input patch of LRMS images L_p and the input patch of PAN images P_p covering the same area are taken as two lower-level capsules. Suppose the size of L_p is h_1 × w_1 × B and the size of P_p is h_2 × w_2 × b. Then, up-sampling is applied to L_p to generate L_u with the same spatial resolution as P_p (i.e., h_2 × w_2 × B). Different from the multilayer perceptron used in capsule networks, the transform operation adopts the convolutional operator to use fewer parameters. Therefore, the prediction vectors û_{j|i} ∈ R^{h_2×w_2×c_out} can be expressed as

\hat{u}_{j|1} = Conv(L_u, B, c_{out}), \quad \hat{u}_{j|2} = Conv(P_p, b, c_{out}),    (6)

where b is one (PAN images have only one band) and c_out is usually set to B + b (i.e., c_out = B + 1). In other words, û_{j|1} and û_{j|2} are the feature maps of the LRMS and PAN images, respectively. Last, for compatibility with MDR, û_{j|1} and û_{j|2} are vectorized to obtain two new prediction vectors û^c_{j|1} ∈ R^{h_2 w_2 c_out × 1} and û^c_{j|2} ∈ R^{h_2 w_2 c_out × 1} as the input of MDR.
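A minimal PyTorch-style sketch of this transform operation is given below. The 3 × 3 kernel with padding 1 and the bicubic up-sampling are assumptions made for illustration (the text only fixes the channel counts), and the class name is hypothetical.

```python
import torch.nn as nn
import torch.nn.functional as F

class TransformOperation(nn.Module):
    """Sketch of the transform step of MDRL (Equation (6)), assuming 3x3 kernels with padding 1."""
    def __init__(self, ms_bands=4, pan_bands=1):
        super().__init__()
        c_out = ms_bands + pan_bands                                # c_out = B + b
        self.conv_ms = nn.Conv2d(ms_bands, c_out, 3, padding=1)     # Conv(L_u, B, c_out)
        self.conv_pan = nn.Conv2d(pan_bands, c_out, 3, padding=1)   # Conv(P_p, b, c_out)

    def forward(self, lrms, pan):
        # Up-sample the LRMS patch to the PAN spatial resolution (h_2 x w_2).
        l_u = F.interpolate(lrms, size=pan.shape[-2:], mode='bicubic', align_corners=False)
        u_hat_1 = self.conv_ms(l_u)            # prediction map of the LRMS capsule
        u_hat_2 = self.conv_pan(pan)           # prediction map of the PAN capsule
        # Vectorize both maps so that MDR can treat them as prediction vectors.
        return u_hat_1.flatten(start_dim=1), u_hat_2.flatten(start_dim=1)
```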
(2) Modified dynamic routing (MDR) algorithm: In the original capsule network, the dynamic routing algorithm is used to transfer information between capsules in adjacent layers. However, the capsule network with the original dynamic routing algorithm mainly handles image classification, and it is not straightforward to apply it to image fusion. Thus, the dynamic routing algorithm needs to be modified to fit pan-sharpening. Our modified dynamic routing algorithm has four major differences compared with the dynamic routing algorithm in the capsule network. First, we delete the exponential function in Equation (3) because of the large differences between the values of û_{j|i} · s_j: if two values differ greatly, the exponential function further amplifies the difference, which easily drives the coupling coefficient to either 1 or 0. This evidently works against fusing the information of LRMS images and PAN images. Second, in the original routing algorithm, a lower-level capsule is coupled to all higher-level capsules. In contrast, we make c_{ij} represent the importance of each lower-level capsule to a higher-level capsule. Based on the above two changes, Equation (3) is modified as follows:

c_{ij} = \frac{b_{ij}}{\sum_{k=1}^{M} b_{kj}},    (7)

where M represents the number of lower-level capsules. Third, the "squashing" function is replaced by a classical activation function. Last, for ease of calculation, the number of higher-level capsules N is set to 1. The specific MDR algorithm is shown in Algorithm 1.

Algorithm 1: Modified dynamic routing (MDR) algorithm.
Input: prediction vectors û^c_{j|i} ∈ R^{h_2 w_2 c_out}; the number of iterations r.
For r iterations do:
    for all capsules i in layer l and capsules j in layer (l + 1): use Equation (7) to obtain c_{ij};
    for all capsules j in layer (l + 1): use Equation (2) to obtain s_j;
    for all capsules i in layer l and capsules j in layer (l + 1): update b_{ij} by adding the agreement û^c_{j|i} · s_j.
End for
For all capsules j in layer (l + 1): v_j = ReLU(s_j).
Output: v_j.
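The sketch below mirrors Algorithm 1 for a single higher-level capsule (N = 1). The initialization of b_{ij} and the exact form of the agreement update are abridged in the text, so a uniform initialization and the dot product û^c_{j|i} · s_j are assumed here.

```python
import torch

def modified_dynamic_routing(u_c, r=1, eps=1e-8):
    """Sketch of Algorithm 1 with one higher-level capsule.

    u_c : (M, D) vectorized prediction vectors of the M lower-level capsules.
    """
    b = torch.ones(u_c.shape[0])              # unnormalized coupling coefficients (assumed init)
    for _ in range(r):
        c = b / (b.sum() + eps)               # Equation (7): normalize over lower-level capsules, no exp
        s = (c.unsqueeze(1) * u_c).sum(0)     # Equation (2): weighted sum of prediction vectors
        b = b + u_c @ s                       # accumulate the agreement between predictions and s
    return torch.relu(s)                      # ReLU replaces the "squashing" function
```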
As a plug-and-play module, MDRL can be plugged into a CNN to replace the addition or concatenation operation. As shown in Figure 1a, taking the concatenation operation as an example, many deep learning models directly concatenate the LRMS images and PAN images along the channel dimension, which means the PAN image is treated as simply one more channel of the LRMS image. By plugging in the MDRL, the modified dynamic routing convolutional neural network (MDRCNN) is proposed for pan-sharpening. In our MDRCNN, as shown in Figure 1b, the LRMS images and PAN images are treated as two lower-level capsules. Then, the transform operation and the modified dynamic routing algorithm handle them and obtain the higher-level capsules that represent the fusion of LRMS images and PAN images. Last, the higher-level capsule is passed to the remaining network layers to generate the HRMS images for pan-sharpening.
In addition, the proposed MDRL can be stacked within the MDRCNN. As shown in Figure 2, there are T MDRLs in the MDRCNN. Using the LRMS images and PAN images as input, the first fusion feature (i.e., the higher-level capsule of the first MDRL) is obtained via the first MDRL. Inspired by the deep residual network [43], the second MDRL takes the first fusion feature, the LRMS images and the PAN images as new lower-level capsules and obtains the second fusion feature. By analogy, an MDRCNN with T MDRLs can be constructed to fuse the LRMS images and PAN images effectively.
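To make the stacking of Figure 2 concrete, the sketch below wires T MDRLs in the way described: the fusion feature of each MDRL is appended as an extra lower-level capsule for the next one. The layer shapes, the single-convolution tail and the module name are illustrative assumptions rather than the exact MDRCNN architecture.

```python
import torch
import torch.nn as nn

class MDRCNNSketch(nn.Module):
    """Reduced sketch of an MDRCNN with T stacked MDRLs."""
    def __init__(self, T=2, ms_bands=4, pan_bands=1):
        super().__init__()
        c_out = ms_bands + pan_bands
        self.T = T
        # One transform convolution per capsule per MDRL: LRMS, PAN and, from the 2nd MDRL on, the previous fusion.
        self.ms_convs = nn.ModuleList([nn.Conv2d(ms_bands, c_out, 3, padding=1) for _ in range(T)])
        self.pan_convs = nn.ModuleList([nn.Conv2d(pan_bands, c_out, 3, padding=1) for _ in range(T)])
        self.fuse_convs = nn.ModuleList([nn.Conv2d(c_out, c_out, 3, padding=1) for _ in range(T - 1)])
        self.tail = nn.Conv2d(c_out, ms_bands, 3, padding=1)   # stand-in for the remaining network layers

    def route(self, capsules, r=1):
        # capsules: list of (batch, c_out, H, W) prediction maps; MDR runs on their vectorized forms.
        u = torch.stack([cap.flatten(1) for cap in capsules], dim=1)   # (batch, M, D)
        b = torch.ones(u.shape[:2], device=u.device)                   # assumed uniform init of b_ij
        for _ in range(r):
            c = b / b.sum(dim=1, keepdim=True)                         # Equation (7)
            s = (c.unsqueeze(-1) * u).sum(dim=1)                       # Equation (2)
            b = b + (u * s.unsqueeze(1)).sum(-1)                       # agreement update
        return torch.relu(s).view_as(capsules[0])                      # ReLU instead of squashing

    def forward(self, l_u, pan):
        # l_u: LRMS image up-sampled to the PAN resolution; pan: PAN image.
        fused = None
        for t in range(self.T):
            caps = [self.ms_convs[t](l_u), self.pan_convs[t](pan)]
            if fused is not None:
                caps.append(self.fuse_convs[t - 1](fused))             # previous fusion as an extra capsule
            fused = self.route(caps)
        return self.tail(fused)                                        # HRMS prediction
```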

The Relationship between the MDRL and Addition or Concatenation Operations
In this part, the relationship between the MDRL and the summation or concatenation operations is analyzed. Through this analysis, it is found that the concatenation and addition operations are special cases of the proposed MDRL when the affine transform and the modified dynamic routing algorithm coordinate with each other.
Taking concatenation as an example, the concatenation operation means that many deep learning methods directly concatenate the PAN images and LRMS images along the channel dimension. Assume that the size of a PAN image patch is m × n × 1, the size of an LRMS image patch is m × n × c after up-sampling, and there is only one higher-level capsule. In our MDRL, the size of the filter for LRMS images k^L is set to 3 × 3 × c, and there are (c + 1) such filters (i.e., k^L_1, k^L_2, ..., k^L_{c+1}). Similarly, the size of the filter for PAN images k^P is set to 3 × 3 × 1, and there are (c + 1) such filters (i.e., k^P_1, k^P_2, ..., k^P_{c+1}). Now, the size of the prediction vector is mn(c + 1) × 1. Based on Equation (7), suppose the coupling coefficient corresponding to the LRMS images is c_{11} = a (0 < a < 1); then, the coupling coefficient corresponding to the PAN images is c_{21} = 1 − a. Every LRMS filter k^L_i ∈ R^{3×3×c}, i = 1, ..., c + 1, can be split into c single-channel filters k^L_{ij} ∈ R^{3×3}, j = 1, ..., c, each of size 3 × 3, that is, k^L_i = [k^L_{i1}, k^L_{i2}, ..., k^L_{ic}], i = 1, ..., c + 1. The LRMS filters are then given special values: for i = 1, ..., c, k^L_{ii} is set to the identity (center-one) kernel scaled by 1/a, all other k^L_{ij} are set to zero, and k^L_{c+1} is set to zero. Similarly, the PAN filters can be expressed as k^P_i = [k^P_{i1}], k^P_{i1} ∈ R^{3×3}, i = 1, ..., c + 1, and they are given special values: k^P_{i1} is set to zero for i = 1, ..., c, and k^P_{(c+1)1} is set to the identity kernel scaled by 1/(1 − a). With these values, the weighted sum of the two prediction vectors is exactly the concatenation of the LRMS bands and the PAN band (which the ReLU leaves unchanged for non-negative image values), so MDRL becomes the concatenation operation. It is easy to see that MDRL can also degenerate into the summation operation.
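The construction above can be checked numerically. The short sketch below builds the special filter values just described and verifies that the weighted combination of the two prediction maps equals channel-wise concatenation; the scaling by 1/a and 1/(1 − a) is our reconstruction of the omitted equations and is stated here as an assumption.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
c, a = 2, 0.3                                    # LRMS band count and an arbitrary coupling coefficient
L = torch.rand(1, c, 8, 8)                       # up-sampled LRMS patch (non-negative, like real imagery)
P = torch.rand(1, 1, 8, 8)                       # PAN patch

delta = torch.zeros(3, 3)
delta[1, 1] = 1.0                                # 3x3 "identity" kernel that simply copies a band

# LRMS filters: filter i copies band i (scaled by 1/a); the (c+1)-th filter is zero.
kL = torch.zeros(c + 1, c, 3, 3)
for i in range(c):
    kL[i, i] = delta / a
# PAN filters: only the (c+1)-th filter is non-zero and copies the PAN band (scaled by 1/(1-a)).
kP = torch.zeros(c + 1, 1, 3, 3)
kP[c, 0] = delta / (1 - a)

u1 = F.conv2d(L, kL, padding=1)                  # prediction map of the LRMS capsule
u2 = F.conv2d(P, kP, padding=1)                  # prediction map of the PAN capsule
v = torch.relu(a * u1 + (1 - a) * u2)            # MDRL fusion with c_11 = a and c_21 = 1 - a

print(torch.allclose(v, torch.cat([L, P], dim=1)))   # True: MDRL reduces to concatenation
```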
However, the filter weights for LRMS images and PAN images are obtained by the BP algorithm [44], and the coupling coefficients are learned by the MDR algorithm. Thus, the proposed MDRL can fuse the information of PAN images and LRMS images more effectively than a simple concatenation or summation operation.

Dataset and Evaluation Metrics
Here, three satellite datasets are chosen for evaluation, namely Landsat8, QuickBird and GaoFen2. Specifically, Landsat8 has 350 samples in its training set, 50 samples in its validation set and 100 samples in its test set. GaoFen2 has the same number of training/validation/test images as Landsat8. QuickBird has 474/103/100 samples in its training/validation/test sets. The LRMS images of Landsat8 have 10 bands, and the spatial up-scaling ratio (SUR) is set to 2. QuickBird and GaoFen2 contain 4 bands, and the SUR is 4. Furthermore, all the samples are generated using the Wald protocol [45]. Table 1 lists the details of the three datasets. In the training process, the LRMS images are cropped into patches of size 32 × 32; thus, the corresponding PAN patches have a size of (32·SUR) × (32·SUR). In the test phase, to measure the performance of our method, we use the following four metrics: three spatial assessment metrics, namely structural similarity (SSIM) [46], relative dimensionless global error in synthesis (ERGAS) [47] and peak signal-to-noise ratio (PSNR) [48], and one spectral assessment metric, the spectral angle mapper (SAM) [49]. A fused image has higher quality when PSNR and SSIM are higher and when SAM and ERGAS are lower.
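For reference, the sketch below implements SAM, ERGAS and PSNR in their commonly used forms (SSIM is omitted for brevity). These are standard definitions and may differ in minor details from the exact implementations used in the experiments.

```python
import numpy as np

def sam(ref, fused, eps=1e-8):
    """Mean spectral angle (radians) between reference and fused pixels; inputs have shape (H, W, B)."""
    dot = (ref * fused).sum(-1)
    denom = np.linalg.norm(ref, axis=-1) * np.linalg.norm(fused, axis=-1) + eps
    return np.arccos(np.clip(dot / denom, -1.0, 1.0)).mean()

def ergas(ref, fused, sur):
    """ERGAS with spatial up-scaling ratio `sur` (2 for Landsat8, 4 for QuickBird/GaoFen2 above)."""
    bands = ref.shape[-1]
    rmse = np.sqrt(((ref - fused) ** 2).reshape(-1, bands).mean(0))
    mean = ref.reshape(-1, bands).mean(0)
    return 100.0 / sur * np.sqrt(((rmse / mean) ** 2).mean())

def psnr(ref, fused, max_val=1.0):
    """Peak signal-to-noise ratio in dB, assuming images scaled to [0, max_val]."""
    return 10.0 * np.log10(max_val ** 2 / ((ref - fused) ** 2).mean())
```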

Experimental Results
In order to evaluate the effectiveness of our method, MDRL is plugged into two famous deep learning methods (i.e., MIPSM and DRPNN) to construct two new models named MDR-MIPSM and MDR-DRPNN. They are chosen because MIPSM uses addition operation and DRPNN uses concatenation operation to fuse LRMS images and PAN images.
The two proposed models are compared with recent approaches on three benchmark datasets named Landsat8, QuickBird and GaoFen2. All the experiments are conducted on a computer with an NVIDIA GTX 1080 Ti GPU with 11 GB of memory.
Our MDR-MIPSM and MDR-DRPNN adopt the l1 loss ||H − Ĥ||_1 as the loss function, where H and Ĥ represent the ground-truth HRMS image and the reconstructed HRMS image, respectively. In most of our experiments, the configuration of MDR-MIPSM and MDR-DRPNN is set as follows: there are two MDRLs, the kernel size is set to 3 × 3, and the number of iterations of the modified dynamic routing is set to 1 for ease of calculation. These two models are trained with Adam [56] for 200 epochs, and the learning rate is set to 1 × 10^{-3} and decreased by a factor of 0.8 every 20 epochs. In addition, two layers of MIPSM and DRPNN are removed when constructing our MDR-MIPSM and MDR-DRPNN, since the number of MDRLs is set to 2 in our experiments.
Table 2 shows the quantitative results of our methods compared with all comparison methods on the QuickBird dataset. The proposed MDR-MIPSM and MDR-DRPNN are superior to the traditional pan-sharpening approaches. Among the deep learning methods, our MDR-DRPNN also achieves the best performance on all four indicators. Moreover, our MDR-MIPSM and MDR-DRPNN perform visibly better than MIPSM and DRPNN. Specifically, the result obtained by our MDR-MIPSM exceeds MIPSM in PSNR by almost 2.1 dB, which is a significant improvement.
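For reproducibility, a minimal training-loop sketch matching the configuration described above (l1 loss, Adam, 200 epochs, learning rate 1 × 10^{-3} decreased by a factor of 0.8 every 20 epochs) is shown below; `model` and `train_loader` are placeholders for MDR-MIPSM/MDR-DRPNN and the patch loader, not identifiers from any released code.

```python
import torch.nn as nn
import torch.optim as optim

def train(model, train_loader, epochs=200, device='cuda'):
    criterion = nn.L1Loss()                                  # l1 loss ||H - H_hat||_1
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.8)  # x0.8 every 20 epochs
    model.to(device)
    for _ in range(epochs):
        for lrms, pan, hrms in train_loader:                 # 32x32 LRMS patches with matching PAN/HRMS
            lrms, pan, hrms = lrms.to(device), pan.to(device), hrms.to(device)
            optimizer.zero_grad()
            loss = criterion(model(lrms, pan), hrms)
            loss.backward()
            optimizer.step()
        scheduler.step()
```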

Performance Comparison
In Tables 3 and 4, our MDR-MIPSM and MDR-DRPNN are compared with other state-of-the-art methods on the Landsat8 and GaoFen2 datasets. Our MDR-DRPNN performs slightly worse than PANNet on the SAM indicator but visibly better than DRPNN on all metrics in Table 3. In addition, FGF-GAN and PANNet perform slightly better than our MDR-DRPNN on the SAM indicator in Table 4. However, on PSNR, SSIM and ERGAS, our MDR-DRPNN achieves the best performance among all compared models. Moreover, we display a visual comparison of our models with other methods on the three datasets in Figures 3-5. Compared with other methods, our MDR-DRPNN not only preserves the spectral information of the LRMS images but also retains the rich detail information contained in the PAN images. From the amplified area, we can see that our MDR-DRPNN generates high-quality HRMS images and effectively avoids spatial and spectral distortion. In addition, taking the ground-truth image (upper left corner of Figures 3-5) as a reference, we show the corresponding residual maps to evaluate the quality of the images generated by our MDR-MIPSM and MDR-DRPNN in Figures 6-8. From the residual maps, we find that the fused images of our MDR-DRPNN are the closest to the ground truth, which demonstrates the effectiveness of our methods.
By comparing the results across the three datasets, a more careful analysis of our model is conducted. The spatial location information is preserved in our model by the convolutional operator in the transform operation and the vectorize operation; thus, our model can extract more spatial detail information, which is consistent with the experimental results. Examining the images of the three datasets, the Landsat8 dataset contains more buildings, which means it offers a wealth of spatial detail information compared with the QuickBird and GaoFen2 datasets. Taking the base model DRPNN [30] as an example, our MDR-DRPNN improves PSNR by 1.2237 dB on Landsat8, while it improves PSNR by 0.4211 dB and 0.5330 dB on the QuickBird and GaoFen2 datasets, respectively. This also reminds us that our future study should pay more attention to extracting spectral detail information.
In addition, the real HRMS image is not available in practice. Thus, pan-sharpening on real LRMS and PAN images (i.e., the full-scale experiment) is evaluated with reference-free image quality metrics. Here, Quality with No Reference (QNR), the spectral distortion index (D_λ) and the spatial distortion index (D_S) are chosen to evaluate the pan-sharpening quality of the full images on the Landsat8 and QuickBird datasets. The specific experimental results are shown in Table 5. As shown in this table, our MDR-DRPNN obtains excellent results on the three metrics and two datasets and is competitive with the other deep learning models. In particular, compared with the base model DRPNN, our MDR-DRPNN improves QNR by 0.012 and 0.047 on the Landsat8 and QuickBird datasets, respectively. Furthermore, MDR-MIPSM increases the QNR value compared with its base model MIPSM. The other two metrics, D_λ and D_S, show the same improvement. Thus, the proposed plug-and-play MDRL is helpful for fusing the LRMS image and PAN image.
Table 5. Quantitative results for the Landsat8 and QuickBird datasets at full scale. The best and the second best are highlighted in bold and underlined, respectively. The up or down arrow indicates whether a higher or lower metric corresponds to better images.
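For completeness, the three no-reference indices are related by the standard formula below (with the usual exponents α = β = 1); the text above does not restate these definitions, so this is quoted from the common usage of QNR.

```latex
% Quality with No Reference, expressed through the spectral and spatial distortion indices:
\mathrm{QNR} = (1 - D_{\lambda})^{\alpha} \, (1 - D_{S})^{\beta}, \qquad \alpha = \beta = 1
```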

Discussion
In this section, we use MDR-DRPNN as the backbone and conduct ablation experiments to further verify the effectiveness of our method. We discuss the influence of the number of MDRLs, the number of iterations in MDR and the number of parameters of MDR-DRPNN, since they all play important roles in MDR-DRPNN.

Influence of the Number of MDRL
The depth of MDR-DRPNN depends on the number of MDRLs; thus, it is discussed first. As shown in Table 6, the number of MDRLs is set to 1, 2 and 3, respectively. Taking PSNR, SSIM, SAM and ERGAS as metrics, the MDR-DRPNN model and the QuickBird dataset are used for this evaluation. From this table, it is found that most of the metrics improve slightly and then deteriorate as the number of MDRLs increases from 1 to 2 to 3. Thus, the number of MDRLs is set to 2 in our experiments.

Influence of the Number of Iterations in MDR
In this section, the influence of the number of iterations in MDR is studied, since it is the most important parameter in MDR. For ease of calculation, the number of higher-level capsules is set to 1 in this experiment. Table 7 shows the results on the QuickBird dataset, based on the MDR-DRPNN model and the PSNR, SSIM, SAM and ERGAS metrics. From this table, it is found that the performance of our MDR-DRPNN does not vary much regardless of the number of iterations, which also verifies the stability of our model. Thus, to save training and testing time, the number of iterations is set to 1 in most of our experiments.

Parameter Numbers
In our method, the MDRL is used to replace the addition or concatenation operations. Thus, the parameters of our MDR-MIPSM and MDR-DRPNN are analyzed here on the QuickBird dataset. As stated in the training details, two layers of MIPSM and DRPNN are removed when constructing our MDR-MIPSM and MDR-DRPNN, since the number of MDRLs is set to 2 in our experiments. As shown in Table 8, our models have visibly fewer parameters than MIPSM and DRPNN, yet they achieve better results. This may be due to the fact that our network can perform more efficient fusion and achieve a reasonable balance between spectral and spatial information, which proves the efficacy of our method.
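The parameter counts compared in Table 8 can be reproduced with a generic one-liner of the following kind (a sketch, not the authors' counting script):

```python
def count_parameters(model):
    """Total number of trainable parameters, as compared in Table 8."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g., print(count_parameters(mdr_drpnn), count_parameters(drpnn))
```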

Conclusions
In this study, because the addition and concatenation operations may cause a loss of information, we design a modified dynamic routing layer (MDRL) to replace them for pan-sharpening, via a modified dynamic routing (MDR) algorithm. Based on the convolutional operator in the transform operation and the vectorize operation, our MDRL can preserve more spatial location information. In addition, the coupling coefficients in MDRL vary with the input samples, which makes MDRL fuse the PAN and LRMS images effectively in the test stage. Using MDRL, two deep models named MDR-MIPSM and MDR-DRPNN are proposed, and extensive experimental results on benchmark pan-sharpening datasets demonstrate the efficacy of our method compared with other excellent models. In future work, we will pay more attention to extracting spectral detail information, which may be implemented via an attention mechanism.
Author Contributions: Formal analysis, K.S.; funding acquisition, K.S. and S.X.; methodology, K.S.; project administration, J.Z., X.C. and R.F.; writing-original draft, K.S.; writing-review and editing, J.L. and S.X. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The datasets generated during the study are available from the corresponding author upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.