360° Image Reference-Based Super-Resolution Using Latitude-Aware Convolution Learned From Synthetic to Real

High-resolution (HR) 360° images offer great advantages wherever an omnidirectional view is necessary, such as in autonomous robot systems and virtual reality (VR) applications. One or more 360° images from adjacent views can be utilized to significantly improve the resolution of a target 360° image. In this paper, we propose an efficient reference-based 360° image super-resolution (RefSR) technique that exploits the wide field of view (FoV) shared among adjacent 360° cameras. Effective exploitation of spatial correlation is critical to achieving high quality, but the distortion inherent in the equi-rectangular projection (ERP) makes this nontrivial. Accordingly, we develop a long-range 360 disparity estimator (DE360) to overcome the large and distorted disparity, particularly near the poles. A latitude-aware convolution (LatConv) is designed to generate features that are robust to the distortion and preserve image quality. We also develop synthetic 360° image datasets and introduce a synthetic-to-real learning scheme that transfers knowledge learned from synthetic 360° images to a deep neural network performing super-resolution (SR) of camera-captured images. The proposed network learns useful features in the ERP domain from a sufficient number of synthetic samples and is then adapted to camera-captured images through transfer layers using a limited number of real-world samples.


I. INTRODUCTION
360° images are becoming increasingly popular: they are widely used in omnidirectional robot vision systems [1], virtual reality (VR) [2], and autonomous vehicles [3]. One or more 360° images captured from different viewpoints can display the real world more faithfully. An autonomous mobile robot reconstructs its surroundings using 360° images obtained from arbitrary locations during navigation [1], [3]. In VR, users can be further immersed by freely moving inside a virtual space created from multiple 360° cameras [4].
In these emerging applications, high-resolution (HR) 360° images are essential to provide a more faithful realization of the viewing space.
We shed light on reference-based 360° image super-resolution (RefSR) to exploit the spatial correlation among omnidirectional views. As shown in Fig. 1, we aim to improve the resolution of a target 360° image by synthesizing textures from a reference captured from an adjacent viewpoint in the same environment. In our previous study [5], a reference image was obtained from a multi-camera array, and depth data were used to synthesize textures. However, this assumption may limit the general applicability of 360° images in computer vision tasks. In contrast to a perspective image, a 360° image provides a wider field of view (FoV), but the shapes of objects change significantly with latitude in the ERP domain.
In this paper, we do not assume any structured camera array but use a reference image captured from an arbitrary viewpoint, with cameras approximately 0.5 to 6.0 meters apart. Because 360° images tend to capture the same objects across a wide FoV, the RefSR approach can be an enabling technique. Previous studies [6]–[11] attempted to super-resolve 2D images using a reference image, but there are few RefSR studies for 360° images. What are the major challenges?
RefSR attempts to obtain an HR image from a low-resolution (LR) source using an external reference [6]–[11]. It has been actively studied for perspective images, where it outperforms single image super-resolution (SISR) [12]–[16]. Current RefSR models are built on convolutional neural networks (CNNs) and achieve state-of-the-art performance for conventional 2D images. RefSR performs best when the reference contains many relevant textures [9]. However, conventional RefSR cannot be applied directly to a 360° image. Most spherical images are projected onto rectangular 2D coordinates using an equi-rectangular projection (ERP), which incurs geometric distortion because the sampling density varies along the y-axis of the image (i.e., the latitude on the sphere). As shown in Fig. 1, objects near the poles appear extremely magnified. This illustrates the challenge of applying conventional RefSR to spherical images: the performance can be even worse than SISR when the reference is less relevant [9].
There are two approaches to alleviating the distortion when adapting convolution filters to 360° images. First, deep models can be redesigned for specific tasks on 360° images [17]–[19]. Second, domain adaptation can be applied to transform kernels trained on perspective images into the ERP domain [20]–[22]. Both approaches require a large number of annotated images to learn the correspondence in an end-to-end manner. Unfortunately, no such dataset is available for training the RefSR task on 360° images.
Inspired by these previous studies, we propose a novel RefSR algorithm for 360° images, referred to as Lat360. Effective exploitation of the 360° spatial correlation between the target and the reference is critical to achieving high SR quality. Therefore, we address learning to transfer 360° image features from the reference and to merge the information under the distortion of the ERP domain. We develop a long-range 360 disparity estimator (DE360) with latitude-aware convolution (LatConv) to overcome the large and distorted disparity. LatConv generates features that alleviate the spherical distortion and improve the accuracy of finding corresponding features in the reference image. Furthermore, an occlusion mask generator (OMG) is employed to maintain reliability when aligning the features. In previous optical flow studies [23], [24], an occlusion mask was utilized to penalize outliers; however, it has not been used in RefSR.
Another important contribution is the training scheme for the proposed modules given the scarcity of annotated 360° image samples. DE360 requires pairs of 360° images with various disparity levels, which current datasets can hardly provide. Therefore, we introduce a synthetic-to-real learning scheme: we first train the model with computer-synthesized images in the same ERP domain and then transfer the knowledge of disparity estimation to real 360° images. We obtain several hundred real-world 360° images (using a consumer-level camera) for both transfer learning and testing.
Our primary contributions are summarized as follows:
• We propose an efficient RefSR network for 360° images, including an adaptive disparity estimator with LatConv to circumvent the ERP distortion.
• We construct synthetic and real-world ERP datasets and adopt the synthetic-to-real learning scheme. Experimental results demonstrate that our network outperforms the previous SR methods.

II. RELATED WORKS
A. REFERENCE-BASED SUPER-RESOLUTION
Deep-learning-based SR [12]–[14], [25]–[29] has superseded traditional SR in terms of peak signal-to-noise ratio (PSNR). Performance improved dramatically with the residual block [14], but results sometimes remain blurry (e.g., for upscaling factors of 4× or more). The studies in [15] and [16] adopted a generative adversarial network (GAN) to produce perceptually satisfying results. In [17], a GAN-based SISR model for 360° images was proposed; although it used a spherical-content loss to train the network, the textures from the generative model severely deteriorated when restoring polar regions.
Most RefSR studies adopt SISR as a backbone network while attempting to mitigate the blurry results of SISR [6]–[8], [30]. In RefSR, the reliability of the correspondence between the target and the reference substantially affects performance; accordingly, RefSR uses a matching scheme to measure similarity. CrossNet [6], [30] chose the reference from adjacent light-field views using simple optical flows [31], [32], which is reliable only for short disparities. For perspective images, [7], [8] explored neural texture transfer to extract semantically relevant texture features from the reference. However, these methods can hardly be applied to 360° images. In our study, RefSR is developed to improve the quality of 360° SR images using [14] as a backbone.

B. DISPARITY AND OPTICAL FLOW ESTIMATION
Flow estimation is essential for RefSR to accurately incorporate relevant textures from the reference. Single-level CNNs were used to produce optical flow [31]–[33]. In [31], an encoder-decoder architecture was trained with a large amount of labeled synthetic data, and [32] extended this architecture to improve accuracy at the cost of computational complexity.
Several advanced deep models have been proposed for optical flow [34], [35]. They use pyramidal structures that refine the estimation in a coarse-to-fine manner to resolve large displacements. SPyNet [34] iteratively warps the image using the previous estimation, but its results were less accurate than [32]. The authors of [24] addressed occlusion for reliable disparity estimation but required a large amount of labeled data, while unsupervised learning was applied to train with unlabeled data [23]. PWC-Net computes correlations to warp features in a multi-scale space [35]. As a state-of-the-art flow estimator, it is employed as the basis of DE360; however, PWC-Net produces inaccurate flows under deformation, so we improve its reliability using LatConv.

C. KERNEL TRANSFORM
Earlier studies processed 360° images with conventional CNN architectures by reprojecting onto the tangent plane or by applying kernels in the cube map projection (CMP) and ERP domains [36]–[42]. However, it was shown in [20], [22] that these approaches suffer from noticeable distortion as well as substantially increased complexity.
Several recent studies have transformed conventional kernels from the spherical domain to the perspective or ERP domains [18], [20]–[22], [43]. In [18], [22], the kernel was transformed for image recognition and detection or for saliency detection. In [20], the model was trained on conventional perspective and panoramic images and then used to regress depth. These studies required interpolation in both the latitudinal and longitudinal directions, and the extra filters increased computational complexity during projection to and inverse projection from the tangent plane [20], [21], [43].
We address these problems by developing LatConv as a 360° geometry-adaptive kernel transform with less computational complexity.

FIGURE 2. E^Ref is a reference image from an adjacent viewpoint. E^LR is upsampled using bicubic interpolation. The network produces E^SR as the SR 360° image for output.

III. PROPOSED NETWORK
A. OVERVIEW
Lat360 first establishes a correspondence between E^LR and E^Ref to extract associated features. We present DE360 to compute the disparity vector D^{R→L}, which transfers the reference feature to the SR output without any flow supervision. It consists of LatConv, which is critical for overcoming the large and distorted disparity between two ERP images, and a subsequent flow estimator; the LatConv layers share the same kernel parameters, as shown in Fig. 2.

B. DISPARITY ESTIMATOR FOR 360° IMAGES (DE360)

1) LATITUDE-AWARE CONVOLUTION (LatConv)
LatConv improves the accuracy of the model by adapting to the geometry of 360° images. When a pixel at (i, j) of an ERP image corresponds to (θ_i, φ_j) in polar coordinates, the arc length is uniform in the vertical direction but proportional to cos θ_i in the horizontal direction [20], [21]. Pixels at higher latitudes of a spherical image are densely sampled in the longitudinal direction, so it is appropriate to widen the horizontal sampling interval in the polar regions relative to the equatorial regions. This varying sampling density is the source of the geometric distortion in ERP that hinders flow estimation.
Accordingly, a sampling interval a_i at each i-th row is adjusted as follows to compensate for the latitude-dependent geometric distortion:

a_i = 1 / cos θ_i, with θ_i = (1/2 − (i + 0.5)/H)·π, (1)

where H and W are the height and width of an ERP image. Then, LatConv is conducted at each location (i, j) as follows:

s(i, j) = Σ_{m=−K}^{K} Σ_{n=−K}^{K} w(m, n) · f(i + m, j + a_i·n), (2)

where s and w are the output feature and the kernel, respectively. The kernel size is (2K + 1) × (2K + 1), and K is set to 1; m and n are kernel coefficient indices. The sampling locations are rounded to the nearest integer grid, and a_i acts as a sample-wise kernel stride for latitude-dependent scaled sampling over the input feature map f. This scheme is applied across all channels of the input feature. At image boundaries, we apply circular padding in the horizontal direction and mirror symmetry in the vertical direction, a natural extension for ERP.
Note that Eq. (2) requires only a 1D linear interpolation, and only in the longitudinal direction, to obtain samples off the integer grid. Along the latitude, the sampling interval is constant, so no interpolation is needed in that direction. This reduces computational complexity and alleviates interpolation errors compared with previous studies [20], [21], [43] using sphere-to-tangent-plane projection, which must interpolate in both the latitudinal and longitudinal directions. We compare the accuracy of LatConv with these previous studies in Sec. V-E3.
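To make Eqs. (1)-(2) concrete, below is a minimal PyTorch sketch of LatConv. It is not the authors' released code: it substitutes nearest-grid rounding for the 1D linear interpolation discussed above, and the class name, initialization, and clamping constants are our own assumptions.

```python
import math
import torch
import torch.nn as nn


class LatConv(nn.Module):
    # Hypothetical sketch of Eqs. (1)-(2): one c_out x c_in mixing matrix
    # per kernel tap (m, n). The horizontal tap spacing a_i grows toward
    # the poles, with circular wrap along longitude and mirrored rows
    # across the poles.
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(
            torch.randn(2 * k + 1, 2 * k + 1, c_out, c_in) * 0.01)
        self.bias = nn.Parameter(torch.zeros(c_out))

    def forward(self, f):
        b, c, h, w = f.shape
        i = torch.arange(h, device=f.device)
        # Eq. (1): latitude of row i; interval a_i = 1 / cos(theta_i),
        # clamped so that all taps stay within one row.
        theta = (0.5 - (i.float() + 0.5) / h) * math.pi
        a = (1.0 / theta.cos().clamp(min=1e-2)).clamp(max=w / (2 * self.k + 1))
        cols = torch.arange(w, device=f.device)
        out = 0.0
        for m in range(-self.k, self.k + 1):
            rows = (i + m).abs()                          # mirror at top pole
            rows = torch.where(rows > h - 1, 2 * (h - 1) - rows, rows)
            fm = f[:, :, rows, :]
            for n in range(-self.k, self.k + 1):
                off = torch.round(a * n).long()           # per-row tap offset
                idx = (cols.unsqueeze(0) + off.unsqueeze(1)) % w
                shifted = fm.gather(3, idx.view(1, 1, h, w).expand(b, c, h, w))
                # Eq. (2): accumulate w(m, n) * f(i + m, j + a_i * n).
                tap = self.weight[m + self.k, n + self.k]
                out = out + torch.einsum('oc,bchw->bohw', tap, shifted)
        return out + self.bias.view(1, -1, 1, 1)
```

At the equator a_i ≈ 1, so the operation reduces to an ordinary 3 × 3 convolution there; the latitude-dependent stride only takes effect toward the poles.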

2) CORRESPONDENCE MATCHING USING LatConv
In DE360, LatConv is applied before matching the correspondence between f^LR and f^Ref, as shown in Fig. 2, because features extracted by conventional convolution layers are susceptible to the ERP geometric distortion. Therefore, as shown in Fig. 3, s^LR and s^Ref are computed by LatConv for accurate matching.
We perform disparity estimation through a multi-scale pyramid [35] in the network, as in [7], [9]. This procedure searches for matching texture features between s_k^LR and s_k^Ref at each k-th level, as described in Fig. 3. The adaptive dilation of LatConv with circular padding recovers more accurate samples near the poles, and LatConv is used at each level to enlarge the receptive fields adaptively. Thus, s_k^LR and s_k^Ref provide robust features for disparity estimation. We use the same matching cost and flow estimator schemes as in [35]; however, the matching mechanism is changed to compute a feature volume that represents the correlation between s_k^LR and s_k^Ref.
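To illustrate this modified matching step, the following is a hypothetical sketch of such a correlation feature volume at one pyramid level. The search range, the circular horizontal wrap, and the simplified vertical handling are assumptions, not the exact implementation.

```python
import torch


def matching_cost(s_lr, s_ref, max_disp=4):
    # Illustrative PWC-Net-style matching cost [35]: the normalized
    # correlation between the LR feature and shifted reference features.
    # The horizontal shift wraps around the 360-degree seam; the vertical
    # roll is a simplification (a faithful version would mirror at the
    # poles instead of wrapping).
    costs = []
    for dy in range(-max_disp, max_disp + 1):
        ref_y = torch.roll(s_ref, shifts=dy, dims=2)
        for dx in range(-max_disp, max_disp + 1):
            ref = torch.roll(ref_y, shifts=dx, dims=3)
            costs.append((s_lr * ref).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)  # (B, (2*max_disp+1)^2, H, W)
```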
C. SR RECONSTRUCTION

1) OCCLUSION MASK GENERATOR (OMG)
We feed five input features, namely E^LR, Ê^Ref (the reference warped toward the target by D^{R→L}), D^{R→L} itself, the absolute difference between E^LR and Ê^Ref, and the matching cost feature at the finest level, into the OMG to learn an occlusion mask M. We implement it as a stack of six convolutional layers followed by a sigmoid activation.
The mask determines which regions should be used for reconstruction. Undesired textures transferred from the reference, which can appear in occluded regions or at object edges, are suppressed as follows:

Ê^Ref_M = M ⊙ Ê^Ref, (3)

where Ê^Ref_M is the masked reference image and ⊙ denotes element-wise multiplication.
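A minimal sketch of how the OMG could be assembled from this description is shown below; the hidden width and the cost-feature channel count are assumptions.

```python
import torch
import torch.nn as nn


class OMG(nn.Module):
    # Sketch of the occlusion mask generator: six convolutional layers
    # followed by a sigmoid, fed with the five inputs named in the text.
    def __init__(self, cost_ch=81, width=64):
        super().__init__()
        in_ch = 3 + 3 + 2 + 3 + cost_ch  # E^LR, warped ref, flow, |diff|, cost
        layers = []
        for i in range(5):
            layers += [nn.Conv2d(in_ch if i == 0 else width, width, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(width, 1, 3, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, e_lr, e_ref_warped, disp, cost):
        x = torch.cat([e_lr, e_ref_warped, disp,
                       (e_lr - e_ref_warped).abs(), cost], dim=1)
        mask = self.net(x)                  # occlusion mask M in [0, 1]
        return mask * e_ref_warped, mask    # Eq. (3): masked reference
```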

2) ReconNet
We stack the masked reference Ê^Ref_M with the upsampled E^LR and feed them into ReconNet, a series of residual blocks that reconstructs the final output E^SR. The intermediate results become progressively closer to the ground truth at each step. The mask helps avoid occlusion artifacts but can blur samples during the mixture; these artifacts are removed during reconstruction, as shown in the final output.
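The following sketch illustrates this reconstruction stage with EDSR-style residual blocks [14]; the block count, channel width, and residual output connection are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    # EDSR-style residual block [14] (no batch normalization).
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)


class ReconNet(nn.Module):
    # Sketch: fuse the upsampled LR image with the masked, warped
    # reference and reconstruct E^SR as a residual over E^LR.
    def __init__(self, ch=64, n_blocks=16):
        super().__init__()
        self.head = nn.Conv2d(6, ch, 3, padding=1)   # E^LR ++ masked ref
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, e_lr_up, e_ref_masked):
        x = self.head(torch.cat([e_lr_up, e_ref_masked], dim=1))
        return e_lr_up + self.tail(self.body(x))
```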

IV. TRAINING
A. TRANSFER LEARNING
We propose a transfer learning scheme using a synthetic-to-real approach to facilitate training. It is designed to overcome the limited amount of real-world data available for RefSR. Specifically, we employ transfer layers to shift the domain.
The learning scheme is illustrated in Fig. 5. We first train all the modules except the transfer layers in an end-to-end manner using a computer-synthetic dataset containing numerous pairs of LR and reference images with various disparity levels. In this initial learning, the network acquires the general features required for estimating the disparity characteristic of the ERP domain.
We then transfer the knowledge to the real domain during fine-tuning. We freeze the previous parameters because they are sufficiently trained with synthetic data. Thus, we train only the transfer layers using real-world data to reduce the discrepancy caused by the domain transition. Accordingly, the model is adapted for real-world data using fewer training samples. The transfer layers have eight residual blocks and produce a 3 × H × W output feature.
We add transfer layers at both ends of the model. Given real-world 360° input images, the front transfer layer generates domain-shifted features t^LR and t^Ref, which are delivered to the subsequent modules trained on synthetic data. The features are then restored to the real domain at the back end. Furthermore, the model can perform SR of synthetic data by simply removing the transfer layers.
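A sketch of this fine-tuning step is given below, assuming hypothetical attribute names transfer_in and transfer_out for the front and back transfer layers; cycling over the small real-world training set is also an assumption.

```python
import itertools
import torch


def charbonnier(x, rho=1e-3):
    # Charbonnier penalty (see Sec. IV-C).
    return torch.sqrt(x ** 2 + rho ** 2).mean()


def finetune_transfer_layers(lat360, loader, steps=30_000, lr=1e-5):
    # Freeze everything pre-trained on Synthetic360; only the transfer
    # layers are updated on real-world data.
    for p in lat360.parameters():
        p.requires_grad = False
    params = list(lat360.transfer_in.parameters()) + \
             list(lat360.transfer_out.parameters())
    for p in params:
        p.requires_grad = True
    opt = torch.optim.Adam(params, lr=lr)
    data = itertools.cycle(loader)  # the small real-world set is reused
    for _ in range(steps):
        e_lr, e_ref, e_gt = next(data)
        loss = charbonnier(lat360(e_lr, e_ref) - e_gt)  # L_rec only
        opt.zero_grad()
        loss.backward()
        opt.step()
```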

B. 360° IMAGE DATASETS FOR RefSR
We create the Synthetic360 and Real360 datasets of 360° images for RefSR. Each image group contains a ground truth (and its LR version) and several reference images. We explain how the groups are configured in the two datasets in the following subsections.
For training, a reference image is randomly chosen from each group. In contrast, for testing, we choose a reference frame to evaluate performance at different levels of disparity; the results are displayed in Table 4, motivated by [7]. We set the resolution of the ground truth to 256 × 512 for training and 512 × 1024 for testing.

1) Synthetic360 DATASET
We use Unity software to render 400 virtual scenes such as a park, restaurant, or house. The scenes display several foreground objects such as human avatars, animals and other 3D figures.
In each scene, we record six reference videos from virtual cameras positioned in 3D space around the center view used for SR. The cameras are displaced along the x, y, and z axes and randomly rotated up to 50° for data augmentation. In half of the scenes, each camera is located 25 cm from the center; in the other half, the distance is set to 50 cm to test various levels of similarity.
After a scene is generated, we create a group of images by sampling a target image and six reference images from the videos. In other words, the group consists of seven images captured at the same timestamp from different videos. The groups are constructed at distant time intervals to avoid temporal redundancies among neighboring frames. Consequently, we generate 4,356 groups of LR and reference images and use them for training.
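For illustration, a group could be wrapped for training as in this sketch, which draws one random reference per sample as described above; the field names and storage layout are hypothetical.

```python
import random
from torch.utils.data import Dataset


class RefSRGroups(Dataset):
    # Sketch of group-based sampling: each group holds one target
    # (ground truth plus its bicubic LR version) and the reference
    # images taken at the same timestamp from different cameras.
    def __init__(self, groups):
        # e.g., groups = [{'gt': ..., 'lr': ..., 'refs': [...]}, ...]
        self.groups = groups

    def __len__(self):
        return len(self.groups)

    def __getitem__(self, idx):
        g = self.groups[idx]
        return g['lr'], random.choice(g['refs']), g['gt']
```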

2) Real360 DATASET
We capture real-world 360° images in full ERP format using a Samsung Gear360 in various environments. We capture 251 groups in which each target has two or three references taken from arbitrary positions against the same indoor or outdoor backgrounds. We define three levels of similarity based on the camera distance between the target and the reference, categorizing the images as S, M, and L for distances of less than 50 cm, between 50 cm and 300 cm, and greater than 300 cm, respectively. We use 186 groups of the real-world 360° images for transfer learning.

C. LOSS FUNCTION AND TRAINING DETAILS
We minimize a reconstruction loss and a warping loss to preserve the HR context and the reliability of the alignment process, respectively. The photometric difference between the warped reference image and the ground-truth target is used as supervision, as in [44], [45], to train DE360 in the absence of ground-truth disparity.
For both loss terms, we adopt a Charbonnier penalty function for robustness to outliers, as in [6], [46]. The reconstruction loss L_rec and the warping loss L_warp are computed as

L_rec = √(‖E^GT − E^SR‖² + ρ²) (4)

and

L_warp = √(‖E^GT − Ê^Ref‖² + ρ²), (5)

where E^GT and E^SR are the ground truth and the output SR image, respectively, and ρ is set to 0.001. The overall loss is given as

L = L_rec + L_warp, (6)

where we compute the loss at each pyramidal level and sum the results as in [35]. The network is trained with an initial learning rate of 10⁻⁴ using the Adam optimizer with β₁ = 0.9 and β₂ = 0.999. The learning rate is halved at 20K and 40K iterations, and the batch size is 8. We first train the network for 100K iterations from scratch using Eq. (6) and then fine-tune it using only L_rec in Eq. (6) for approximately 50K iterations; this strategy yields an additional improvement of 0.3 dB. For transfer learning, we set the learning rate to 10⁻⁵ and update the transfer layers for 30K iterations.
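Under these definitions, the losses could be implemented as in the following sketch; the per-level summation of the warping term and the equal weighting of the two terms follow our reading of the text and are otherwise assumptions.

```python
import torch


def charbonnier(x, rho=1e-3):
    # Charbonnier penalty [6], [46]: a smooth, outlier-robust L1.
    return torch.sqrt(x ** 2 + rho ** 2).mean()


def total_loss(e_sr, e_gt, warped_refs, gt_pyramid):
    # Sketch of Eqs. (4)-(6): reconstruction loss on the SR output plus
    # the warping loss accumulated over pyramid levels, as in [35].
    l_rec = charbonnier(e_sr - e_gt)                      # Eq. (4)
    l_warp = sum(charbonnier(wr - gt)                     # Eq. (5)
                 for wr, gt in zip(warped_refs, gt_pyramid))
    return l_rec + l_warp                                 # Eq. (6)
```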

V. EXPERIMENTAL RESULTS
We implemented the software using PyTorch. Our code, datasets, and additional results are available on the project web page.¹ The experiments in Sec. V-A, V-B, V-D, and V-E are conducted for ×4 SR; further experiments with the ×8 upscaling factor are presented in Sec. V-C. For performance comparisons, we use the state-of-the-art SR models SRCNN [12], VDSR [13], EDSR [14], RCAN [47], and 360SR [17] for SISR, and CrossNet [6], SRNTT [7], and TTSR [8] for RefSR. EDSR serves as the SISR baseline because Lat360 uses a series of residual blocks for reconstruction, as in EDSR. 360SR [17] was developed for the SR of 360° images. All the comparisons are conducted on the same platform using an NVIDIA Titan XP GPU to measure the running time.
We evaluate all the methods on the Synthetic360, Real360, and ERA [48] datasets to demonstrate the superior performance of Lat360. For synthetic images, we train the compared methods using Synthetic360 and test them on held-out samples from the same set; the transfer layers in Lat360 are removed when super-resolving synthetic images. For natural images, all the methods are pre-trained with Synthetic360 and then fine-tuned with Real360, and both the Real360 and ERA datasets are used for testing. ERA was developed for object detection in 360° images; we collect the groups from ERA that are suitable for RefSR, i.e., those in full ERP format and captured from adjacent views. We use 444 groups from Synthetic360, 65 groups from Real360 in Sec. IV-B, and 42 groups from ERA [48]. The size of the target image is 512 × 1024.

A. QUANTITATIVE PERFORMANCE EVALUATION
We use various metrics including PSNR, SSIM [49], and WS-PSNR [50] for quantitative evaluations. For the comparisons, we use reference resolutions identical to that of the target, as in the previous RefSR studies [6]- [8].
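As a concrete reference for the weighted metric, a small sketch of WS-PSNR [50] is given below; it follows the standard cos-latitude weighting for ERP images, and the helper name and array conventions are our own.

```python
import numpy as np


def ws_psnr(img, ref, max_val=255.0):
    # WS-PSNR [50]: PSNR with latitude weights cos(theta_i) that
    # compensate for the horizontal oversampling of ERP rows near the
    # poles. Inputs are HxW or HxWxC arrays in [0, max_val].
    h = img.shape[0]
    theta = (np.arange(h) + 0.5 - h / 2) * np.pi / h
    weight = np.cos(theta).reshape((h,) + (1,) * (img.ndim - 1))
    weight = np.broadcast_to(weight, img.shape)
    err = (img.astype(np.float64) - ref.astype(np.float64)) ** 2
    wmse = (weight * err).sum() / weight.sum()
    return 10.0 * np.log10(max_val ** 2 / wmse)
```

Unlike plain PSNR, this down-weights errors near the poles, where each ERP pixel covers a much smaller spherical area.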
Table 1 demonstrates that Lat360 outperforms the previous studies on all datasets. Lat360 improves PSNR by approximately 1.29 dB and 0.25 dB over EDSR and SRNTT, respectively, on the Real360 dataset. Moreover, Lat360 outperforms previous methods by a large margin on the Synthetic360 dataset, improving approximately 0.98 dB in PSNR and 1.29 dB in WS-PSNR [50] over SRNTT [7].

TABLE 1. Performance comparisons with the state-of-the-art SISR and RefSR algorithms on the Real360, ERA [48], and Synthetic360 datasets for upscaling factor ×4. The best performance is marked in bold.
We investigate the computational complexity of the compared methods in Fig. 6. The running time of the proposed algorithm is 55 ms per image. The results indicate that Lat360 provides a good trade-off between SR performance and computational complexity, especially considering the execution time of SRNTT [7]: SRNTT takes 4.848 seconds, the highest computational cost among the compared methods. We observed that the matching module in SRNTT is particularly time-consuming.

B. QUALITATIVE PERFORMANCE EVALUATION
Fig. 7 and Fig. 8 illustrate the subjective visual quality of the compared methods. In Fig. 7, the three rows from the top are chosen from Real360, and the samples in the fourth row are from [48]. The samples in Fig. 8 are selected from Synthetic360. It is clearly shown that Lat360 produces more visually pleasing results. EDSR tends to produce blurry results compared with the RefSR methods due to the lack of a reference. CrossNet cannot capture details from the reference under a large disparity because its flow estimator is limited to narrow disparities. The results of TTSR and SRNTT vary with the transferred textures: they provide good perceptual quality, but the patches are sometimes mismatched and inconsistent with adjacent textures, producing undesirable artifacts (e.g., in the second and fourth rows of Fig. 7).

C. EXPERIMENTAL RESULTS FOR THE ×8 SCALE FACTOR
We report the quantitative and qualitative performance for the ×8 upscaling factor. The training details are the same as for ×4 SR except for the number of iterations: we train the model for 120K iterations, set the initial learning rate to 10⁻⁴, and halve it at 30K and 60K iterations. For transfer learning, we set the learning rate to 10⁻⁵ and update the transfer layers for 20K iterations. 360SR [17] and RCAN [47] are used for comparison because the other methods do not support the ×8 upscaling factor. Table 2 shows that Lat360 outperforms 360SR by approximately 1.96 dB and 1.69 dB on the Real360 and ERA [48] datasets, respectively. The qualitative comparisons for ×8 SR are shown in Fig. 9; the two rows from the top are chosen from Real360, and the others are from [48]. We achieve visually satisfying results even at the large ×8 upscaling factor compared with the SISR methods [17], [47].

TABLE 2. Performance comparisons for upscaling factor ×8 on the Real360 and ERA [48] datasets. The best performance is marked in bold.

D. FURTHER COMPARISONS WITH RECONSTRUCTION-LOSS REFERENCE-BASED SUPER-RESOLUTION METHODS
We compare the quantitative and qualitative performance of Lat360 with SRNTT-ℓ2 and TTSR-rec, which are variants of the original SRNTT [7] and TTSR [8], respectively. These variants were designed to improve PSNR by weighting the reconstruction loss more heavily and removing other regularization terms, such as adversarial losses, that enhance subjective quality. However, the modified versions tend to blur the reconstructed image and degrade perceptual quality despite the improved PSNR, a phenomenon clearly reported in SRNTT [7] and TTSR [8].
We demonstrate that the proposed algorithm significantly improves visual quality compared with SRNTT-ℓ2 and TTSR-rec, as shown in Fig. 10. SRNTT-ℓ2 and TTSR-rec fail to restore details, although their PSNRs improve over the original results in Table 3. Lat360 also produces superior quantitative results.

E. ANALYSIS
1) PERFORMANCE AT DIFFERENT DISPARITY LEVELS
Table 4 illustrates the performance comparisons when the levels of disparity between the target and the reference are short (S), medium (M), and long (L), as defined in Sec. IV-B.
The performance of Lat360 is compared with the other three RefSR methods [6]–[8]. The compared methods use their own strategies to handle large disparities. Note that mismatched textures from the reference substantially degrade quantitative performance.
Lat360 achieves superior performance at all three levels of disparity in terms of PSNR, SSIM, and WS-PSNR. The results demonstrate that the core modules of Lat360 efficiently exploit the inter-view correlation among the views.

TABLE 4. Performance comparisons at the three disparity levels on the Real360, ERA [48], and Synthetic360 datasets. The best performance is marked in bold.

2) EFFECT OF REFERENCE RESOLUTION
The SR performance improves as the reference resolution increases. Our algorithm with no reference, which reduces to SISR, outperforms the 360° SISR method [17] by approximately 0.92 dB in PSNR.

3) ABLATION STUDIES
We conduct ablation studies on the Real360 dataset by turning off each core module of Lat360 one by one. We test the performance changes when LatConv, the OMG, and the transfer learning modules are removed. We also validate the importance of LatConv by replacing it with normal convolution (NC) and SphereNet [21]. These conditions are denoted w/o LatConv, w/o OMG, w/o S-to-R, w/ NC, and w/ SphereNet, respectively, in Table 6.

TABLE 6. PSNR for ablation tests on the Real360 dataset with upscaling factor ×4, when LatConv, OMG, and transfer learning (S-to-R) are turned off one by one, and when LatConv is replaced with normal convolution and SphereNet. ∆ refers to the gap from Lat360 in terms of PSNR.

Table 6 presents the PSNR performance, in which the degradation is given by the values following ∆. In w/o LatConv, we observe performance decreases of approximately 1.04 dB, 0.41 dB, and 0.21 dB at S, M, and L, respectively. Because the disparity estimator plays an important role in aligning the reference, removing LatConv has the largest impact at S. Similarly, LatConv improves PSNR compared with NC and SphereNet. The number of parameters is the same, c²k², for all compared convolutions, where c and k are the number of channels and the kernel size, respectively. This validates that LatConv robustly captures the distorted correspondence needed to estimate disparity in ERP. When the model must use a more distant reference, the OMG becomes more effective: in w/o OMG, the performance decreases are approximately 0.46 dB, 0.64 dB, and 0.92 dB at S, M, and L, respectively. This result is expected because the reference degrades with a large and distorted disparity. In w/o S-to-R, the performance decrease is relatively small, at approximately 0.20 dB on average; however, it significantly affects perceptual quality, as shown in Fig. 12.

VI. CONCLUSION
In this paper, we presented a reference-based SR technique using LatConv in DE360 to resolve large and distorted correspondences between two ERP images. LatConv was introduced to extract more reliable features from ERP images and was integrated into a flow estimator. We constructed 360° image datasets to facilitate learning with a synthetic-to-real scheme: the model, trained with a sufficient number of synthetic images, was adapted to real-world images through the transfer layers. Experimental results demonstrated that the proposed algorithm outperforms various SISR and RefSR algorithms on Real360.