A CNN-Based Sentinel-2 Image Super-Resolution Method Using Multiobjective Training

Deep learning methods have become ubiquitous tools in many Earth observation applications, delivering state-of-the-art results while proving to generalize for a variety of scenarios. One such domain concerns the Sentinel-2 (S2) satellite mission, which provides multispectral images in the form of 13 spectral bands, captured at three different spatial resolutions: 10, 20, and 60 m. This research aims to provide a super-resolution mechanism based on fully convolutional neural networks (CNNs) for upsampling the low-resolution (LR) spectral bands of S2 up to 10-m spatial resolution. Our approach is centered on attaining good performance with respect to two main properties: consistency and synthesis. While the synthesis evaluation, also known as Wald’s protocol, has spoken for the performance of almost all previously introduced methods, the consistency property has been overlooked as a viable evaluation procedure. Recently introduced techniques make use of sensor’s modulation transfer function (MTF) to learn an approximate inverse mapping from LR to high-resolution images, which is on a direct path for achieving a good consistency value. To this end, we propose a multiobjective loss for training our architectures, including an MTF-based mechanism, a direct input–output mapping using synthetically degraded data, along with direct similarity measures between high-frequency details from already available 10-m bands, and super-resolved images. Experiments indicate that our method is able to achieve a good tradeoff between consistency and synthesis properties, along with competitive visual quality results.

at three different spatial resolutions-10 m (B2, B3, B4, and B8), 20 m (B5, B7, B7, B8a, B11, and B12), and 60 m (B1, B9, and B10). S2 data come in the form of tiles spanning an approximate area of 100 × 100 km and can be downloaded for free from Copernicus Services Data Hub. 1 The various utilizations for such multispectral information include monitoring natural disasters, agricultural production, and deforestation oversight. However, the level of details is limited due to the lack of high-resolution (HR) profiles for 20-and 60-m bands, which narrows down the quality of some spectral indices used for evaluating and estimating applicationspecific characteristics. This motivates the development of methods for constructing HR bands that enclose the physical characteristics (light reflectance value distribution) of lowresolution (LR) bands while increasing the number of details in the spatial domain up to the maximum spatial resolution available in the current dataset.
Image fusion has maintained a popular position in Earth observation applications, serving a central role in monitoring the evolution of different environmental areas [1]. One such class of methods concerns multispectral image fusion [2], [3], which combines information from various multispectral images in order to increase the level of details in LR images. Approaches included in this category have shown applicability in super-resolving all LR S2 bands up to 10-m ground sampling distance (GSD). Applications that could benefit from a full set of HR images include time series evaluation of crop fields [4], [5], deforestation monitoring [6], ship detection and recognition [7], and so on.
Recently proposed methods include model-based approaches, constructed as inverse problems that include a degradation process applied on the super-resolved bands, and machine learning-based methods, which directly combine the HR and LR bands to produce super-resolved images by exploiting interband information.
Recent deep learning architectures, mainly convolutional neural networks (CNNs), have imposed a great interest among different Earth observation applications [8], [9], through their ability of automatically learning complex spatial relationships. S2 image super-resolution is one particular domain that benefited from the power of representation of neural networks [10], [11], [12]. A representative method for this class was introduced in [13] as DSen2, in which the authors train two separate networks for upsampling the 20-and 60-m bands, both following a ResNet [14] architecture. The training process is designed in a supervised manner, constructing a synthetic 1 https://scihub.copernicus.eu/ This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ dataset by degrading the original bands, further considering as input-output pairs the degraded-observed images. This method has been shown to deliver competing results for a wide range of environments, suggesting good generalization capabilities. Other approaches rely on the use of GANs for generating HR patches from the original LR bands [15], [16], [17], training the generator network using a synthetic dataset, constructed in the same degradation-based manner.
A subset of model-based approaches incorporate a downsampling mechanism that mimics the sensor's modulation transfer function (MTF). Such transformation should restore a super-resolved image back to the originally observed one, produced by the sensor. Area-to-point regression kriging (ATPRK) [18] is a pansharpening technique originally tested on WorldView-2 and Landsat images, subsequently adapted for S2 super-resolution in [19]. In ATPRK, the super-resolved band is modeled as a linear combination of HR bands, with coefficients determined through least square minimization, followed by an adaptive averaging of the regression residuals for each pixel's neighborhood. SupReME [20] is a method that depends on the observation model of the imaging, i.e., the sensor's MTF, and formulates the super-resolution process as an inverse problem, exploiting the correlation between spectral bands by projecting them onto a lower dimensional space. S2Sharp [21] is an approach related to SupReME, optimizing the same objective function while accounting for interband correlations, but in this case, the low-dimensional space is estimated for each optimization step, while in the case of SupReME [20], it is defined beforehand for all steps. Sen2res [22] is a method based on extracting the local information from each LR band, i.e., reflectance values distribution, and exploiting the high-frequency characteristics of HR bands. The super-resolved patches are constructed as linear combinations of subpixel constituents from HR bands, using least square optimization for estimating the weighting coefficients. Another technique, introduced as SSSS [23], also formulates the super-resolution through the use of an MTF degradation model, additionally incorporating convex regularization terms intended for learning a self-similarity graph. This method has been shown to exhibit solid results, especially for upsampling the 60-m bands [24].
A more recent method introduced as S2SUCNN [25] combines these two classes of techniques, training a CNN in an unsupervised manner by introducing an MTF-based degradation layer as their final processing step. Their method is based on the idea of deep image prior [26], which provides a solving mechanism for inverse problems, such as denoising, inpainting, and super-resolution [27], [28], using neural network structures to model a reliable inverse mapping through standard gradient-based optimization, under a difference minimization objective.
He and Siu [29] applied Gaussian process regression (GPR) for single-image super-resolution, by fitting such a model on the local structures of each pixel and using a two-stage process to recover and refine the super-resolved image. Blix et al. [30], [31] utilized GPR for estimating quad-polarimetric parameters from dual-polarimetric synthetic aperture radar observations, providing insight by constructing uncertainty maps through the fit GPR model, thus establishing the level of trust in their predictions. S2 super-resolution could therefore benefit on these aspects, providing a new way to assess super-resolution results through measures indicated by such uncertainty maps. This is, however, yet to become a complete evaluation framework due to the difficulty of modeling multiple response variables, as discussed in [32], which for S2 super-resolution is a necessity when super-resolving multiple LR bands at once. Another obstacle is represented by the huge amount of input-output pairs a GPR model should be optimized on, in order to provide good generalization capabilities for different environmental areas.
In this article, we present a multiobjective training approach for fully CNNs, aimed for ensuring a good tradeoff between properties of consistency, synthesis, and enhanced visual details of super-resolved S2 bands. Our contributions can be summarized as follows.
1) We designed and implemented a super-resolution mechanism based on fully convolutional networks for upsampling the 20-and 60-m bands of S2, except for band B10 (we excluded band B10 from our study for two reasons: it exhibits poor radiometric quality 2 and it does not contain relevant surface information, as it is mainly used in applications dealing with cloud covered areas, e.g., cirrus cloud detection). 2) We formulated a multiobjective training for the proposed architectures, aimed at ensuring good super-resolution results with respect to two properties of importance: consistency and synthesis. 3) We included a direct similarity maximization term between high-frequency information from already available 10-m bands and upsampled results for 20-and 60-m bands, resulting in competitive visual results in terms of detail authenticity. The remaining of this article is organized as follows. Section II presents the mathematical notations used throughout this article, along with an in-depth discussion about the elements implied by our method. Section III offers a view over super-resolution evaluation protocols, providing intuition for the adopted mechanisms in our method, and also their role in assuring a high level of reliability. Section IV begins with details regarding our method and a description for the multiobjective training process, followed by Section V, which includes discussions about various experimental results. The concluding remarks are given in Section VI.

A. Notations
Each S2 acquisition can be viewed as a 3-D tensor of spectral bands, each with its own spatial dimensions and spatial resolution. Let us denote the set of spectral bands for one acquisition as {X 10 , X 20 , X 60 }, where X r represents the channelwise concatenation of spectral bands with spatial resolution r . Considering each band from X 10 to be of spatial dimension H × W , the spatial dimensions of bands from X 20 and X 60 are (H/2) × (W/2) and (H/6) × (W/6), respectively. X r (i, j) denotes the vector of reflectance values from all spectral bands with spatial resolution r at location (i, j). The spatial degradation (blurring + downsampling) by a factor of s is denoted as (·) ↓ s .

B. Deep Learning Methods
Deep learning methods have been extensively used as stand-alone solutions for S2 super-resolution, fusing bands at different spatial resolutions according to the process described by various neural network architectures. Such computational structures allow for obtaining the most representative abstract features for the task at hand, without any prior knowledge on the data acquisition process. However, these methods often require large training sets for achieving good generalization on unseen data, which is, by default, unachievable in certain domains. Since there is no ground truth for the observed 60-and 20-m bands of S2 data, the majority of methods based solely on neural network structures are trained on synthetically constructed LR-HR image pairs, in a supervised training paradigm. Given a neural network structure T (·), most approaches learn the following mapping (either for ×2 or ×6 upsampling): The optimized structure T (·) is subsequently used on real, nonsynthetically degraded data. One specific architecture recently used for super-resolving S2 bands is that of conditional GANs [17], [33], which trains the generator network by conditioning them on available information to produce super-resolved bands intended to fool a supposedly optimal discriminator network-optimal in that it distinguishes as good as possible between real S2 bands and bands produced by the generator. However, this type of neural architecture was found out to be very hard to train due to the necessity of finding an equilibrium state for the min-max optimization problem, often resulting in suboptimal final states, which leads to inconsistent super-resolved image patches. Since the generated images are significantly dependent on the ability of the discriminator to capture realistic HR patterns when distinguishing between real and generated images, the final image quality is tightly bounded by its performance.
One emerging network architecture in the field of computer vision, initially adopted by various natural language processing applications, is represented by transformer models [34]. Given the attention mechanism such models imply, they are able to capture wider spatial relationships, however, with the disadvantage that they require substantially more training than standard CNNs. One application in remote sensing that leverages the attributes of transformers was discussed in [35] for multi-image super-resolution of images provided by PROBA-V satellite. To reduce the transformer necessity for intensive training, the authors propose optimizing the model for each input acquisition, eliminating the reliance on a training dataset. Another super-resolution approach using transformers for remote sensing applications is described in [36], combining a CNN-based encoder with a transformer model, the latter acting on embedded image patches.

C. MTF-Based Methods
These methods are based on computations that simulate the sensor's transfer function (MTF), planning the super-resolution task as an inverse problem. Thus, for each LR image, a search in the HR image space is performed, imposing that the degraded HR solution should yield as close as possible to the LR image. Usually, the degradation process is described as follows, for an observed image X and its unknown super-resolved version Y ∈ R H ×W : where * indicates a depthwise convolution operation, g σ represents a 2-D Gaussian smoothing filter with variance σ , and d s is an s × s averaging filter, applied with stride s. Thus, the problem of finding the unknown HR image Y can be formulated as follows: for some distance metric || · || p (usually, p ∈ {1, 2}). This problem is ill-posed, allowing for multiple possible solutions for the unknown image Y . Nguyen et al. [25] provided an iterative way of finding the above solution through the use of a CNN for constructing the possible image candidatesŶ that would minimize (4). However, their approach implies optimizing the network separately for each S2 acquisition, which could be bothersome given a computer with lower computing capabilities. Applying such formulation in a super-resolution context aims at ensuring the property of consistency [37]-every super-resolved image once degraded to their initial spatial resolution should resemble as good as possible the originally observed image. This is a necessary property, but not sufficient to ensure good super-resolution results [37], given the ill-posedness characteristic of the problem at hand. The advantage of formulating the super-resolution problem in such manner is the possibility of training directly on observed data (as in [25]), eliminating the need of constructing synthetic input-output image pairs. Shocher et al. [38] raised the concern of testing a model trained solely on synthetically constructed data in real-world scenarios, where it may yield unsatisfactory results.
Finding a solution that would minimize (4) raises the following question: how much detail couldŶ encompass such that its degraded version does not significantly deviate from X ? Since the super-resolution problem does not directly imply any term regarding the level of spatial detail, including an additional detail-related objective may force an optimization path to a visually better solution, while still preserving the consistency of results. We further test this assumption by combining a consistency-related objective with another that pushes the super-resolved image into maximizing the similarity between its level of detail with one of the other HR images.

III. EVALUATION PROTOCOLS
One of the mostly adopted evaluation protocols for S2 super-resolution methods is by verifying the synthesis property [37], [39] This protocol requires the method to be evaluated on synthetically degraded data, taking the observed images as the desired targets. This has been regarded as a sufficient property for ensuring good super-resolution results. However, the nature of the problem itself states that there exists no unique solution for the super-resolved image. This leads to the question of existence for a set of HR images, not necessarily similar to the observed one, that could still be considered viable candidate solutions.
A relatively unused evaluation protocol in the literature is the one based on assessing the consistency property, which states that degrading the obtained super-resolved HR band should yield a result close to the observed LR band. Thus, attaining a good consistency property does not impose only a specific solution for the super-resolved band but allows for a rather broad set of HR solutions. Advantages in using this property as the main objective for training a model are the removal for the need of constructing a synthetically degraded dataset to fit the model on, along with a wider range for possible output solutions. However, the latter does not account for the visual quality of each such possible solution, hence not necessarily implying good super-resolution results. As stated in [37], this property alone is not sufficient to always ensure reliable super-resolved images, but it is rather necessary.
While for consistency evaluation, one does not need an HR reference image to measure the performance, such an analysis does not consider all high-frequency details contained in the super-resolved band. An evaluation, which uses the full scale of predicted HR bands, is performed by measuring the quality no reference (QNR) metrics, which has received popularity among pan-sharpening applications [40], [41]. The QNR is usually described in terms of pan-sharpening evaluation, by considering a single HR image, which guides the fusion process of multiple LR bands. For multiple available HR bands (as in the S2 case), the QNR metric can be extended as follows: where Here, L and R are the numbers of LR and HR bands, respectively, Q is the universal image quality index [42], (·) denotes a super-resolved band, HR i /LR i denotes the ith HR/LR observed band. In the S2 case, HR = X 10 and LR is either X 20 or X 60 . All further results concerning this evaluation are obtained for t 1 = t 2 = p = q = 1, thus equal contribution from D λ and D S . A high QNR metric implies that the fusion process preserves the similarity between LR bands (measured through the spectral distortion D λ ) and the similarity between the degraded HR bands and original LR bands (spatial distortion D S ).
Another common evaluation is based on visual inspection, presenting a side-by-side comparison of observed LR bands and their super-resolved solutions. Displaying areas of interest analyzed through different super-resolution methods may provide meaningful insight for constructing a better comparison between these techniques. However, one should question the detail authenticity induced by each method, knowing the desired spatial resolution the images should be at. Each visual evaluation could therefore benefit from including an already available HR band next to the illustrations, providing authentic high-frequency details as guidance for the visual quality assessment.

IV. PROPOSED METHOD
For training/testing data, we selected Level-1C products, each tile spanning an area of 100 × 100 km 2 , from both Sentinel-2A and Sentinel-2B satellites. We used 14 such tiles for training and four for testing, capturing a diversity of environments. The RGB representation of test areas is shown in Fig. 1. Since S2 image data come in tiles with a spatial dimension of 10 980 × 10 980 pixels for 10-m bands, directly feeding them to a neural network would be impractical. Thus, we partitioned each training/testing tile into 192 × 192 patches to be directly processed by the network. Before any processing, the raw reflectance value is divided by 2000 (as in [13], for numerical stability). All networks were trained in typical settings: initializing the weights with Glorot uniform [43] and using Adam optimizer [44] with a learning rate of 10 −4 with hyperparameters β 1 = 0.9 and β 2 = 0.999.
The proposed CNN architecture is presented in Fig. 2, for 6× super-resolution. We trained two such networks, one for Fig. 2. Proposed neural network architecture (green boxes) and workflow diagram for 6× super-resolution. The two branches resemble the two consecutive forward propagations through the same network: one for the real S2 image patches and one for their degraded counterparts. Their outputs are further used for computing different error terms, from which a linear combination is formed to result in the final cost. The consistency branch aims at obtaining realistic high-frequency details while preserving the original physical characteristics in the super-resolved images. The synthesis branch, acting on degraded data, is designed to approximate a direct fusion to available ground truth, implying a supervised learning step in a reduced-resolution context.
super-resolving the 20-m bands and one for the 60-m bands. In the following, the network for super-resolving the 60-m bands is presented. All 20-and 60-m patches are upsampled using bilinear interpolation up to the spatial dimensions of 10-m patches, before any processing implied by the CNN. The network received as input the channelwise concatenation of these upsampled patches, which are first processed by a 2-D convolution layer to increase the number of channels. All 2-D convolution layers contain 64 filters of dimension 3 × 3 or 1 × 1. We used ReLU activation for all cases, due to its efficiency in terms of computation time and training stability. Following the first convolution layer, a series of residual blocks is utilized, each block implying two convolution operations. Based on the work in [45], the first convolution layer uses 1 × 1 filters, while the second one uses 3 × 3 filters, encapsulating a ReLU nonlinearity. The first three residual blocks are standard, combining the input from the previous residual block with the features computed by the current block. The last three residual blocks operate on the previous features channelwise concatenated with the upsampled 60-m patches, with the intent of retaining some of the radiometric properties from the original bands. Szegedy et al. [45] also introduced a mechanism called residual scaling by which the outputs of the final convolution layer in a residual block are scaled down by a subunitary constant value (usually chosen between 0.1 and 0.3). This additional step has been shown to stabilize training, which may present some benefit in cases where training data distribution spans over a wide domain, such as the multitude of environments captured by S2 images. The same mechanism is used in [13] by choosing a constant value of 0.1 for scaling the residual features. We have also included a similar step in our residual blocks, by introducing a trainable residual scaling layer (see Fig. 2) through which the scaling factor is learned during the optimization process. This avoids the need of arbitrarily setting each scaling factor to a constant value, thus allowing for an adaptive change for the importance of each residual block's computed features, by lowering/increasing their residual scaling factors. Each residual scaling factor is initialized with a fixed value of 0.05 and is constrained to always be positive during the optimization steps. Following the sequence of residual blocks, the computed feature maps are passed to a convolution layer with two 1 × 1 filters, fusing the previous feature volume in two channels. The final prediction is obtained through an elementwise addition between the original upsampled image patches and the previously computed two-channel features. For the ×2 super-resolution network, the only difference is the removal of 60-m input since it does not bring any helpful information to the process.
Both the proposed architectures, for 2× and 6× superresolution, are trained using a weighted sum of three different loss terms, mainly aimed at preserving the two previously discussed properties-consistency and synthesis-, along with a term penalizing the high-frequency details. This is achieved by implying two consecutive steps for each batch optimization: directly feeding the input patches to the network and forcing the degraded output to resemble as close as possible the LR input patches (consistency branch from Fig. 2), degrading the input patches and feeding them to the network in order to produce an output as close as possible to the original LR patches (synthesis branch from Fig. 2). In addition to preserving the consistency of results, the first branch also implies a direct similarity term between the high-frequency details contained in the super-resolved image and real details extracted from a subset of 10-m input patches. The motivation behind this optimization step resides in achieving a visual quality for the super-resolved image that closely matches the level of details contained by observed HR patches.
In the following, we present the training objective for the network tasked with super-resolving the 60-m bands up to 10-m spatial resolution. A similar mechanism can be easily derived for the loss implied in super-resolving the 20-m bands. Let T (·) denote the HR prediction of the proposed neural network model. The loss function L is described as follows: L = αL consistency + βL sym + γ L synthesis (6) L consistency = ||(T (X 10 , X 20 , X 60 )) ↓ 6 −X 60 || 1 (7) L synthesis = ||T (X 10 ↓ 6 , X 20 ↓ 6 , X 60 ↓ 6 ) − X 60 || 1 (10) where J represents the set of 10-m observed bands used for computing L sym , X 10, j is the input patch from the jth 10 m band, sym(x, y) denotes the similarity function (using cosine similarity), ⟨·, ·⟩ is the dot product between vectorized images, ∇ X denotes the gradient of image X obtained by linear 2-D filtering with a Laplacian operator and || · || 1 denotes the L1 norm. Note that before computing the 2-D gradient of an image we applied a 3 × 3 average smoothing filter to counter the sensitivity to noise of the Laplacian operator. In (6), α, β, and γ are hyperparameters meant for controlling the overall influence of each loss term over the final super-resolution properties. In Section V, we provide insight regarding the influence of the three loss terms included in (6), along with performance measures for our 6× and 2× super-resolution methods for S2 bands.

V. EXPERIMENTS AND RESULTS
Given the three hyperparameters (α, β, and γ ) from (6), we trained four separate networks using the architecture described in Section IV, with different loss configurations (further referred to by their corresponding (α, β, and γ ) triplet) for 6× super-resolution, in order to examine their influence on full-and reduced-resolution evaluation.
1) (α, β, γ ) = (1, 1, 1), equal contribution of all three loss terms. 2) (α, β, γ ) = (1, 0.1, 1), less focus on minimizing the similarity component.     Table I and on reduced-resolution evaluation in Table II  super-resolved band and the original band, by quantizing the range 0-10 000 of reflectance values into 1000 equal width bins in order to construct a discrete probability distribution using the frequency of occurrence in each bin. In Table I, root-mean-squared error (RMSE) is measured between the original LR band and the degraded super-resolved band, while in Table II, RMSE is measured between super-resolved bands and ground truth-similarly for the structural similarity index (SSIM). Note that the mean QNR values are very close between the four networks, indicating little to no discrepancy between results at full-resolution evaluation, not aligning with the other metrics and the clear visual differences shown in Fig. 3. Thus, we decided to exclude the QNR evaluation from all future experiments. Results show that including the synthesis term L synthesis during the training process is effective on reduced-resolution evaluation, as shown in Table II, network (1, 0.1, 1) delivering better results than (1, 0.1, 0). Training using only the synthesis term (network (0, 0, 1)) leads to the highest SSIM values for reduced-resolution evaluation but results in the highest RMSE. While network (1, 0.1, 1) obtains slightly lower SSIM values for the same evaluation, it leads to the best RMSE on both 60-m bands, indicating a more favorable solution with respect to radiometric accuracy, while also maintaining a similar visual aspect (Table II). The results on full-resolution evaluation indicate that configuration (0, 0, 1) achieves the highest mean RMSE on band B9 and the second highest on B1. Since both configurations (0, 0, 1) and (1, 0.1, 0) achieve relatively high RMSE values on fulland reduced-resolution evaluation, and given that configuration (1, 0.1, 1) results in the best performance on both evaluations, we concluded that using a mixture of all three components for optimization could lead to a reliable super-resolution algorithm. We found that assigning an equal contribution to all loss terms does not help in finding a good equilibrium state with both good visuals and good evaluation results. Reducing the value of β helps with achieving a good radiometric quality-according to Table I-while also not degrading the visual quality, as observed in Fig. 3. We hypothesize that this behavior is due to the opposing objective of L sym and the other two loss terms, raising the need for finding a good tradeoff between visual quality and good consistency and synthesis metrics. Adding too much focus on correlating the super-resolved bands with observed 10-m bands clearly does not guarantee either good consistency or synthesis. This need of achieving a good tradeoff between visual quality and metrics translates into finding the appropriate weighting coefficients for the three loss terms.
In the following, we assessed the performance of both networks for 2× and 6× super-resolution with respect to the consistency and synthesis property. In the case of consistency assessment, the RMSE, signal-to-reconstruction error (SRE), and SSIM are computed between the original LR band and the degraded network output. Note that all results are measured given pairs of images with pixel values from the original S2 spectral bands. For reduced-resolution evaluation, we degrade all bands in order to reduce their spatial resolution s times, where s ∈ {2, 6}, feed them to the network, and compare the results with the original LR bands. Following the work in [25], the degradation process is designed according to (3), using a Gaussian kernel with standard deviation σ s for depthwise filtering, followed by a downsampling operation using an s × s averaging filter applied with a stride of s. For 2× degradation, we use σ 2 = 1 with kernel size 7 × 7, and for 6× degradation, we use σ 6 = 3 with kernel size 15 × 15.
We focused on comparing the performance of the proposed model to another S2 super-resolution method based on  deep learning techniques, namely, DSen2 [13]. We trained our ×2 super-resolution architecture using the configuration (α, β, γ ) = (1, 0.1, 1). Through multiple experiments, we decided that for bands B5, B11, and B12, a direct similarity with any 10-m band resulted in worse visual and performance results; thus, the loss term from (8) is not considered during the optimization for these three bands. We decided for bands B6, B7, and B8a to include the similarity with band B8 (10 m) during training, after multiple experiments regarding the choice of β and the set of 10-m bands to extract details from.
For bands B11 and B12, poor performance when increasing the similarity with a 10-m band could be explained by considering their spectral characteristics since their central wavelength is a lot more distanced from any 10-m band, compared with the other 20-m bands. In Table III, the results for consistency evaluation for 2× super-resolution are presented. Our architecture performed better with regard to averaged results over all 20-m bands, in terms of RMSE and SRE. For the Ethiopia tile, DSen2 [13] performed better on bands B6, B7, and B8a in terms of RMSE, allowing for a better recovery of the observed  image. Since L sym was also used during training for these three bands, we hypothesized that the increased detail similarity with another 10-m band did not allow for a fairly good recovery of the observed image, for this particular test area. However, for the rest of the test images, our architecture performed better on these three bands, indicating that a majority of geographical areas could benefit from a direct transfer of details from 10-m bands while still being consistent with the observed LR bands. One thing to notice in Table III is the results for Canada tile, which are relatively high in magnitude compared to the other three areas, for both methods. This may be explained by a big difference in reflectance characteristics for snow-covered areas compared to other geographic zones, leading to poor results for deep learning methods that are usually exposed during the training process to areas with different radiometric properties. The results for reduced-resolution evaluation are presented in  results in lower RMSE, which also leads to a higher average SSIM value. Similar to the previous evaluation, the results for Canada tile show a big difference in terms of magnitude, for both methods, indicating possible shortcomings in applying data-driven methods for snow-covered areas. DSen2 performs better on bands B11 and B12 on Italy and Canada tiles, and also on band B8a for Ethiopia and B5 for Romania, with minor differences for the last two. The remaining results are, however, considerably better for our method, which leads to an overall improved mean performance. Two visual examples are presented, in Fig. 4 for an area from Italy tile and in Fig. 5 for an area from Romania tile. The example in Fig. 4 illustrates a better performance for our method on bands B5, B6, B7, and B8a while achieving visually similar results for bands B11 and B12. This similarity, also verified through the results from Table IV (synthesis evaluation), may be explained given how the error for these two bands is computed during training: since only L consistency and L synthesis are employed for bands B11 and B12, and given that DSen2 is optimized through a cost function similar to L synthesis , it is natural to expect a similar performance on synthesis evaluation, along with a better performance on consistency evaluation for our method (as shown in Table III). Note here that our method performs especially well on areas containing land-to-water transitions, resulting in less artifacts than DSen2 for such regions. In Fig. 5, the example on a Bucharest region with an increased number of details also suggests that our method outperforms DSen2 on bands B5, B6, B7, and B8a.
For 6× super-resolution assessment, we conducted similar evaluation steps as in the 2× case, additionally including a visual comparison to accompany the numerical results. Table V contains the consistency evaluation results for super-resolving bands B1 and B9. For band B9, our method outperforms DSen2 on all test images while also achieving better mean results. On band B1, DSen2 achieves better SSIM scores on Romania and Ethiopia tiles, with very small differences compared to our method. Both methods also indicate deficiencies for the snow-covered image from Canada, in terms of magnitude for RMSE and SRE when compared to the other test tiles, achieving, however, a good SSIM score. One important element to consider in Tables III and V is the relatively poor performance of bicubic interpolation. Given that the process of bicubic upsampling does not introduce high-frequency details in the generated images, not aligning with the characteristics of an HR image, the degradation process applied during consistency evaluation is performed on an image with a lower resolution than expected. This results in degraded images having less details than the observed ones and, hence, low-performance measures between the two. The reducedresolution evaluation results are presented in Table VI, along with one example from each test image shown in Fig. 6. Our method is shown to perform better on Italy and Canada tiles, while DSen2 performed overall better on the other two test images, given table results. Even though the numerical results indicate a poor performance on some test images, compared to DSen2, we determined that this is due to the increased number of details for the super-resolved images our method produces, not always aligning with the level of detail from the reference image. Visual results for full-resolution upsampling are presented in Fig. 7, with the intent of comparing the level of detail induced by each method with the level of detail from an observed 10-m band. Here, we include an additional visual comparison with another method, namely, SSSS [23], which was shown to produce among the best visual results on 6× super-resolution in an extensive comparative study conducted in [24]. Visually, our method produces super-resolved bands with an increased number of details, leveling up with the high-frequency components of existing 10-m bands. On band B1, SSSS results contain fairly different radiometric values, compared to the original band, while DSen2 and our method show little difference from the original reflectance distribution. On band B9, both SSSS and DSen2 are shown to obtain suboptimal results regarding the amount of detail, compared to our method. Note that along with an increased number of details, our method also performs considerably better on consistency evaluation, leading us to conclude that the method could be used as a reliable super-resolution mechanism for upsampling 60-m bands. In the light of the ill-posedness characteristic of the problem at hand, which was discussed in Section III, we acknowledge that our method leads to super-resolved images with good consistency properties and realistic high-frequency details, not always aligning with the reference solution considered in synthesis evaluation.
Choosing the right weighting parameters for the loss functions represents an important step, exerting a high influence over the numerical and visual results, as we have seen from Tables I and II and Fig. 3. However, training multiple models with slight differences in their loss configuration, in order to determine the right direction for modifying the weighting terms, is a highly time-consuming process. One solution would be to condition the model's output on these three weighting factors, eliminating the need of an intensive search process for the right combination. This approach has been recently discussed and presented in [46] as you only train once (YOTO), introduced as a new mechanism for loss-conditioned training of neural networks. Applying such method in applications with multiple training objectives, in particular to our solution for S2 super-resolution, could lead to a reliable searching algorithm Fig. 7. Results on real S2 data for 6× super-resolution. The first four rows correspond to super-resolving band B1, while the last four are for band B9.
(From left to right) Original patch, bicubic interpolation, SSSS [23], DSen2 [13], proposed method, band B2 (10 m) for the first four rows, and band B8 (10 m) for the last four rows. for the right hyperparameter combination, ensuring a better final performance.

VI. CONCLUSION
In this article, we presented a mechanism for super-resolving the 20-and 60-m bands provided by S2 up to 10-m spatial resolution, based on fully CNNs. The architectures were trained using a multiobjective loss function, aimed at achieving a good tradeoff between three distinct features: good consistency properties, good synthesis properties, and visually realistic high-frequency components. The first objective was tackled by adopting previously established MTF-based methods. The second one relied on an additional optimization step using degraded data, a common training strategy among many super-resolution methods. The third objective implied adding a direct similarity measure between details from generated super-resolved bands and real details extracted from already available 10-m bands. Our trained architectures delivered good results on both reduced-resolution evaluation (synthesis) and full-resolution evaluation (consistency) for 20-m bands, given a wide variety of environments, proving good generalization capabilities. For super-resolving the 60-m bands, our method was able to achieve a better consistency, along with enhanced realistic high-frequency components. While there exist a variety of evaluation protocols more or less adopted for super-resolution methods, we feel that the lack of an exact solution for any super-resolved real image should motivate the comparison between the results obtained using multiple evaluation schemes. Along with mechanisms designed for automatically finding a good tradeoff for multiobjective training, our future work will include building evaluation mechanisms for assessing the level of trust in super-resolution methods.