Aerial Visible-to-Infrared Image Translation: Dataset, Evaluation, and Baseline

Aerial visible-to-infrared image translation aims to translate aerial visible images into their corresponding infrared images, which can effectively generate the infrared images of specific targets. Although some image-to-image translation algorithms have been applied to color-to-thermal natural images and achieved impressive results, they cannot be directly applied to aerial visible-to-infrared image translation due to the substantial differences between natural images and aerial images, including shooting angles, multi-scale targets, and complicated backgrounds. In order to verify the performance of existing image-to-image translation algorithms on aerial scenes as well as advance the development of aerial visible-to-infrared image translation, an Aerial Visible-to-Infrared Image Dataset (AVIID) is created, which is the first specialized dataset for aerial visible-to-infrared image translation and consists of over 3,000 paired visible-infrared images. On the constructed AVIID, a complete evaluation system is presented to evaluate the generated infrared images from 2 aspects: overall appearance and target quality. In addition, a comprehensive survey of existing image-to-image translation approaches that could be applied to aerial visible-to-infrared image translation is given. We then provide a performance analysis of a set of representative methods under our proposed evaluation system on AVIID, which can serve as baseline results for future work. Finally, we summarize some meaningful conclusions, problems of existing methods, and future research directions to advance state-of-the-art algorithms for aerial visible-to-infrared image translation.


Introduction
With the rapid development of infrared technology, infrared cameras mounted on unmanned aerial vehicles (UAVs) are increasingly applied for aerial photography. Aerial infrared images have been widely used in military, industrial, agricultural, and environmental settings, such as moving target detection [1][2][3] and tracking [4][5][6], photovoltaic panel error detection [7][8][9], image registration [10][11][12], and visible-infrared image fusion [13][14][15][16], because of their advantages, including high sensitivity to temperature variation, a strong capability to penetrate fog, and robustness under weak-light conditions.
Due to the high cost of infrared cameras and the constraints of photography conditions, obtaining many aerial infrared images of specific targets is challenging. In this case, the mainstream approach to obtaining aerial infrared images is to employ a simulation software platform for target-scene infrared simulation [17][18][19][20][21]. These methods first analyze the target attributes to obtain a simulated 3D model scene and then compute the infrared radiation distribution of the different materials in the scene according to infrared radiation theory. Next, the attenuation of the infrared radiation on its way to the detector is calculated with an atmospheric transmission model. The imaging characteristics of the imaging sensor are then simulated and added to the infrared radiation distribution. Finally, the simulated scene is gray-scaled to produce the final infrared image.
Compared with actual photography, using infrared simulation software to simulate aerial infrared images of targets can significantly save manpower, material resources, and financial resources. At the same time, simulated infrared images for various periods and different bands can be obtained by adjusting the parameters of the infrared radiation distribution model and the imaging sensor. However, these methods suffer from problems such as a low-fidelity target temperature model, a huge number of intermediate parameters, tight coupling between subsystems, and complicated processing procedures, which make them unsuitable for quickly obtaining many aerial infrared images. In this paper, we propose a new task called aerial visible-to-infrared image translation, which aims to generate aerial infrared images from visible images and has 3 main advantages: • Due to the easy acquisition and lower photography cost of aerial visible images, aerial visible images can be translated into corresponding infrared images in a fast, efficient, and low-cost manner.
• Additional modality information can be provided by the aerial visible images to improve the performance of the aerial infrared images in downstream tasks.
• The translated aerial infrared and corresponding visible images can provide paired data support for cross-modality and domain adaptation tasks.
Though translating aerial visible images into corresponding infrared images has advantages in efficiency and speed compared with actual photography and infrared simulation, 3 significant issues seriously limit the development of aerial visible-to-infrared image translation.
• Lacking an available dataset for aerial visible-to-infrared image translation experiments: So far, most datasets consist of color images and lack paired infrared images. Although there are several color-to-thermal datasets [22,23], they contain only natural images, not taken from an aerial perspective, without the diverse targets and complicated backgrounds of aerial images. Therefore, to the best of our knowledge, there are currently no available datasets for aerial visible-to-infrared image translation.
• Lacking a survey of methods that could apply to aerial visible-to-infrared image translation: Aerial visible-to-infrared image translation can be considered cross-modality learning, which makes the mapping challenging to model. As far as we know, no specific approaches have been proposed to solve this problem. Therefore, a survey of methods that can be effectively applied to aerial visible-to-infrared image translation is still lacking.
• Lacking a complete evaluation system to evaluate the quality of generated images: Existing metrics for evaluating the similarity between images are mainly traditional perceptual indicators, such as MSE, peak signal-to-noise ratio (PSNR), and SSIM. However, these are shallow functions that fail to account for many nuances of human perception. In addition, evaluating the quality of generated images only by the similarity of their appearance is clearly insufficient. A more complete evaluation system for the quality of generated images is necessary.
In order to address the above issues and fully advance the development of aerial visible-to-infrared image translation, we propose a new specific dataset for aerial visible-to-infrared image translation, called AVIID (Aerial Visible-to-Infrared Image Dataset), consisting of over 3,000 paired visible-infrared images. The goal of AVIID is to provide researchers with an available data resource to evaluate and improve state-of-the-art algorithms. Aerial visible-to-infrared image translation aims to learn a mapping between 2 image domains, which can be regarded as a cross-modality image-to-image translation problem. Recently, image-to-image translation algorithms [16,[24][25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42] among color image domains, driven by deep convolutional neural networks (CNNs) [43][44][45] and generative adversarial networks (GANs) [46][47][48][49][50], have made significant progress in a wide range of tasks, including style transfer [40,51,52], image inpainting [53], colorization [54], super-resolution [55][56][57][58], dehazing [59,60], and denoising [61,62]. Some researchers have applied image-to-image translation approaches to color-to-thermal image translation tasks [22,63,64] and achieved impressive results. For example, Kniaz and Knyaz [65] achieve multi-spectral person re-identification by using a GAN for color-to-thermal image translation. In this paper, we apply these image-to-image translation approaches to aerial visible-to-infrared image translation and make a comprehensive survey of these methods. In addition, we propose a complete evaluation system to evaluate the generated infrared images in terms of overall appearance and target quality. The overall appearance determines the similarity between the generated infrared images and the real ones from visual perception. The target quality reflects the quality of the targets in the generated infrared images, which is important for some downstream tasks
such as object detection and tracking. We further evaluate several representative image-to-image translation methods on AVIID under this proposed evaluation system, and the results can serve as a baseline to advance the development of aerial visible-to-infrared image translation.
In summary, the major contributions of this paper are as follows: • The first specific dataset for aerial visible-to-infrared image translation, AVIID, is constructed, which provides researchers with an available data resource to evaluate and advance state-of-the-art algorithms.
• A comprehensive survey of up-to-date image-to-image translation algorithms that could be applied to aerial visible-to-infrared image translation is given to promote the development of this field.
• A complete evaluation system is presented to evaluate the generated infrared images in terms of overall appearance and target quality. Several representative image-to-image translation methods are evaluated on AVIID under our proposed evaluation system. These results can be regarded as a baseline for future work.
• Some meaningful conclusions, problems of existing methods, and future research directions are summarized to advance state-of-the-art algorithms for aerial visible-to-infrared image translation.
The rest of this paper is organized as follows. We first provide a comprehensive survey of image-to-image translation methods that can be applied to aerial visible-to-infrared image translation in the "A Survey of Methods for Aerial Visible-to-Infrared Image Translation" section. The details of AVIID are then described in the "A Specific Dataset for Aerial Visible-to-Infrared Image Translation" section. In the "Experiments and Results" section, the description of our proposed complete evaluation system and baseline results of representative methods on AVIID are given. Finally, the conclusion of our work is given in the "Conclusion" section.

A Survey of Methods for Aerial Visible-to-Infrared Image Translation
In this section, we comprehensively survey image-to-image translation methods that could be applied to aerial visible-to-infrared translation. Based on whether a method depends on paired images, we classify these methods into supervised and unsupervised categories. Supervised methods aim to learn a pixel-level mapping from the source domain to the target domain with paired training data, which limits their applications. In contrast, unsupervised methods only need 2 sets of images from 2 different domains as training data and achieve image-to-image translation by adopting additional constraints. According to whether multi-modal outputs are generated from a single input image, these unsupervised methods can be further divided into 2 types: one-to-one (single-modal) and one-to-many (multi-modal). In addition, depending on the mapping relationship between the source and target domains, one-to-one unsupervised approaches can be further classified into 1-sided and 2-sided methods. One-sided unsupervised image-to-image translation methods can only translate images from the source domain to the target domain. In contrast, 2-sided ones can achieve a bidirectional mapping between the source domain and the target domain. Figure 1 shows an overview of these methods. In what follows, we introduce each category of these methods in detail.

Supervised image-to-image translation methods
Supervised image-to-image translation methods aim to learn a pixel-level mapping to achieve image translation from one domain to another based on paired data. Paired data means that the training data are paired: every image from the source domain has a corresponding image in the target domain. In this setting, Pix2Pix was the first method to achieve task-agnostic image translation, using a conditional generative adversarial network (cGAN) [21] to learn a mapping from input images to output images. Based on the framework of Pix2Pix, BicycleGAN adds a variational autoencoder (VAE) to the cGAN to generate multiple outputs from a single input image. Additional details of Pix2Pix and BicycleGAN are as follows.
Pix2Pix [24]: Pix2Pix investigates cGANs, a variant of GAN, as a general solution to image-to-image translation problems.
The key idea of GAN is to simultaneously train the discriminator and the generator: the discriminator is designed to distinguish between real data and generated samples, while the generator aims to generate fake samples that are as realistic as possible in order to convince the discriminator that they come from the real data. Given paired image data (x, y), where x is from the source domain X and y is from the target domain Y, the cGAN aims to learn a mapping from the image x with a random latent vector z to the image y, y = G(x, z). The generator G is trained to produce outputs that cannot be distinguished from the "real" images in the target domain by an adversarial discriminator D, which in turn is trained to detect the generator's "fakes" as reliably as possible. The full objective of the cGAN can be expressed as

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D\left(x, G(x, z)\right)\right)\right],$$

where G attempts to minimize this objective against an adversarial D that tries to maximize it. In addition, Pix2Pix adds an additional L1 distance constraint on the generator to make the translated image visually similar to its corresponding ground truth, which can be formulated as

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\left\| y - G(x, z) \right\|_1\right].$$

Therefore, the final objective of Pix2Pix can be formulated as

$$G^{*} = \arg\min_{G}\max_{D} \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G),$$

where λ is a hyperparameter.
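To make the objective concrete, the following is a minimal numerical sketch of the generator side of the Pix2Pix objective (the non-saturating adversarial term plus the λ-weighted L1 term). It assumes precomputed discriminator probabilities and uses function and variable names of our own choosing; it is not the authors' implementation.

```python
import numpy as np

def pix2pix_generator_loss(d_fake, y, y_fake, lam=100.0):
    """Hypothetical sketch of the Pix2Pix generator objective.
    `d_fake`: discriminator probabilities that generated images are real;
    `y` / `y_fake`: ground-truth / generated images; `lam`: the lambda weight.
    """
    eps = 1e-8                                    # avoid log(0)
    adv = -np.mean(np.log(d_fake + eps))          # fool the discriminator
    l1 = np.mean(np.abs(y - y_fake))              # stay close to ground truth
    return adv + lam * l1
```

With a large λ (100 is a common default), the L1 term dominates and anchors the output to the ground truth, while the adversarial term pushes it toward the real-image manifold.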
BicycleGAN [26]: Though Pix2Pix has achieved impressive results for image-to-image translation, it is prone to mode collapse, generating very similar images. To address this issue, BicycleGAN strengthens the relationship between the output and the latent code, which helps to produce more diverse results. For paired image data (x, y), BicycleGAN first maps the target-domain image y to a specific latent code z with a VAE encoder, z = E(y). The latent code is encoded from real data during training, but a random latent code may not yield realistic images at test time. To avoid this, an additional KL loss is used to align the distribution of the latent code with the standard normal distribution. Then, BicycleGAN combines the latent code with the input image and translates it from the source domain to the target domain with a cGAN as in Pix2Pix, ŷ = G(E(y), x). The translated image ŷ does not necessarily need to be close to the ground truth, which may suffer from mode collapse, but must be realistic. To achieve this, BicycleGAN recovers the latent code with the VAE encoder, ẑ = E(ŷ), and uses an L1 loss to keep the recovered latent code consistent with the original one, which can be expressed as

$$\mathcal{L}_{latent} = \mathbb{E}\left[\left\| z - \hat{z} \right\|_1\right].$$

One-to-one
Unsupervised image-to-image translation algorithms aim to learn a joint distribution by using images from the marginal distributions of the individual domains. Since there exists an infinite set of possible joint distributions consistent with the given marginal distributions, it is impossible to guarantee that a particular input and output correspond in a meaningful way without additional assumptions or constraints. As a consequence, various constraints have been proposed to achieve unsupervised image-to-image translation.
DistanceGAN assumes that the distance between 2 images in the source domain should be preserved after mapping them to the target domain. GCGAN develops a geometry-consistency constraint from the special property of images that simple geometric transformations will not change the semantic structure of images. CUT proposes a contrastive learning-based constraint to maximize the mutual information between the input and the output. These methods can be seen as one-sided unsupervised image-to-image translation because the mapping from the source domain to the target domain is unidirectional. In addition, some methods construct various specific constraints to achieve 2-sided unsupervised image-to-image translation. For example, CycleGAN, DualGAN, and DiscoGAN employ the cycle-consistency constraint, which aims to transfer an image in the source domain to the target domain such that the translated image can also be transferred back to the source domain. UNIT makes a shared-latent space assumption that also implies the cycle-consistency constraint. DCLGAN takes advantage of CycleGAN and CUT, employing the idea of mutual information maximization to enable 2-sided unsupervised image-to-image translation. More details of these methods are as follows.
DistanceGAN [37]: Let x ∈ X denote a random image from the source domain, and let y ∈ Y represent a random target-domain image. Unsupervised training pairs are expressed as (x_i, y_j), i = 1, 2, …, N, where N is the size of the dataset. DistanceGAN presents a distance-preserving mapping, which enforces that the distance between images in the source domain is preserved after mapping them to the target domain and can be formulated as

$$\mathcal{L}_{dist} = \mathbb{E}_{x_i, x_j}\left[\left| a \, d\!\left(x_i, x_j\right) + b - d\!\left(G_{XY}(x_i), G_{XY}(x_j)\right) \right|\right],$$

where d(·) is a predefined metric function measuring the distance between 2 samples, a and b are the linear coefficient and bias, and G_XY(·) is the generator.
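A toy sketch of the distance-preservation idea, assuming L1 as the distance d(·) and treating the linear coefficient a and bias b as fixed inputs (in the actual method they are derived from domain statistics); all names here are illustrative:

```python
import numpy as np

def distance_preservation_loss(xs, ys_fake, a=1.0, b=0.0):
    """Average, over all image pairs, of |a*d(x_i, x_j) + b - d(G(x_i), G(x_j))|.
    `xs`: source images; `ys_fake`: their translations; d(.) is mean L1 here.
    """
    n = len(xs)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_src = np.mean(np.abs(xs[i] - xs[j]))
            d_tgt = np.mean(np.abs(ys_fake[i] - ys_fake[j]))
            loss += abs(a * d_src + b - d_tgt)
            pairs += 1
    return loss / max(pairs, 1)
```

An identity "translator" preserves all pairwise distances exactly, so the loss is zero; collapsing all outputs to one image maximally violates the constraint.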
GCGAN [36]: GCGAN presents a geometry-consistency constraint: a given geometric transformation between input images should be preserved after transferring them to the target domain. In detail, given a random image x from the source domain X, a specific geometric transformation f(·), and 2 related translators G_XY and G_X̃Ỹ, the geometry-consistency constraint can be expressed as

$$G_{XY}(x) = f^{-1}\!\left(G_{\tilde{X}\tilde{Y}}\left(f(x)\right)\right), \qquad G_{\tilde{X}\tilde{Y}}\left(f(x)\right) = f\!\left(G_{XY}(x)\right),$$

where f^{-1}(·) is the inverse of the transformation f(·).
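The constraint can be illustrated with a 90-degree rotation as f(·). The translators below are placeholder callables standing in for G_XY and G_X̃Ỹ, not the paper's networks; a consistent pair of translators drives both penalty terms to zero.

```python
import numpy as np

def geometry_consistency_loss(x, translate, translate_tilde, f, f_inv):
    """L1 penalty on both directions of the geometry-consistency constraint:
    the translation of the transformed input should equal the transformed
    translation of the original input.
    """
    y = translate(x)                 # G_XY(x)
    y_tilde = translate_tilde(f(x))  # G_X~Y~(f(x))
    return (np.mean(np.abs(y - f_inv(y_tilde)))
            + np.mean(np.abs(y_tilde - f(y))))

# example transformation: f = 90-degree rotation, f_inv = its inverse
rot = lambda im: np.rot90(im, 1)
rot_inv = lambda im: np.rot90(im, -1)
```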
CUT [40]: CUT proposes a novel constraint to maximize the mutual information between corresponding input and output patches, based on the intuition that each patch in the output should reflect the content of the counterpart patch in the input, independent of the domain. To achieve this, CUT uses a contrastive learning loss, the InfoNCE loss [66], which learns an embedding that associates a patch of the output v with its corresponding patch of the input v⁺, while separating it from the other N noncorresponding patches of the input v⁻. It can be formulated as

$$\ell\left(v, v^{+}, v^{-}\right) = -\log \frac{\exp\left(v \cdot v^{+} / \tau\right)}{\exp\left(v \cdot v^{+} / \tau\right) + \sum_{n=1}^{N} \exp\left(v \cdot v_{n}^{-} / \tau\right)},$$

where τ is a temperature hyperparameter. Intuitively, this loss can be seen as a classifier that attempts to classify v as v⁺.
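A self-contained sketch of the patchwise InfoNCE loss on already-extracted, L2-normalized embedding vectors (in the real method these come from a feature network; the names here are ours):

```python
import numpy as np

def info_nce(v, v_pos, v_negs, tau=0.07):
    """(N+1)-way cross-entropy classifying `v` as its positive patch `v_pos`
    against the rows of `v_negs` (the N negatives). `tau` is the temperature.
    """
    logits = np.concatenate([[v @ v_pos], v_negs @ v]) / tau
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                     # cross-entropy vs. class 0
```

When v aligns with its positive and is orthogonal or opposite to the negatives, the loss is near zero; swapping positive and negative roles makes it large.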
CycleGAN [51]/DualGAN [29]/DiscoGAN [67]: CycleGAN, DualGAN, and DiscoGAN propose the cycle-consistency constraint to achieve 2-sided unsupervised image-to-image translation. These methods construct 2 translators to learn 2 mappings simultaneously by transferring an image to the target domain and back, maintaining the fidelity of the input and the reconstructed image through the cycle-consistency constraint. Mathematically, for an image x from the source domain X, the translator G_XY translates it to the target domain Y, and the translated image is then transferred back to the source domain by the translator G_YX; the cycle-consistency constraint is used to preserve the semantic structure between the reconstructed image and the input. For the domain Y, it is the inverse process, and the whole cycle-consistency objective can be expressed as

$$\mathcal{L}_{cyc} = \mathbb{E}_{x}\left[\left\| G_{YX}\left(G_{XY}(x)\right) - x \right\|_{1}\right] + \mathbb{E}_{y}\left[\left\| G_{XY}\left(G_{YX}(y)\right) - y \right\|_{1}\right].$$

UNIT [25]: UNIT presents a shared-latent space assumption: a pair of corresponding images from different domains can be mapped to the same latent representation in a shared latent space. Consequently, the latent code can be computed from either image, and both images can be recovered from the shared latent code. Based on this assumption, UNIT proposes a 2-sided unsupervised image-to-image translation framework consisting of 6 sub-networks: 2 domain image encoders E_X and E_Y, 2 domain generators G_X and G_Y, and 2 domain discriminators D_X and D_Y. For any given pair of images (x, y), the shared latent code can be obtained by the encoders, z = E_X(x) = E_Y(y), and conversely, the images can be recovered from this latent code, x = G_X(E_Y(y)) and y = G_Y(E_X(x)). In this way, images from the source and target domains can be mutually transferred. However, a necessary condition for this to hold is the cycle-consistency constraint:

$$x = G_X\left(E_Y\left(G_Y\left(E_X(x)\right)\right)\right), \qquad y = G_Y\left(E_X\left(G_X\left(E_Y(y)\right)\right)\right).$$
Therefore, from this perspective, the shared-latent space assumption also implies the cycle-consistency constraint.
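The cycle-consistency constraint shared by these 2-sided methods reduces to a simple reconstruction penalty. The sketch below uses placeholder translator callables in place of the actual generator networks:

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 reconstruction penalty for both cycles: X -> Y -> X and Y -> X -> Y.
    `g_xy`, `g_yx` are placeholder translators between the two domains.
    """
    return (np.mean(np.abs(g_yx(g_xy(x)) - x))
            + np.mean(np.abs(g_xy(g_yx(y)) - y)))
```

If the two translators are exact inverses of each other, both cycles reconstruct their inputs perfectly and the loss vanishes.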
DCLGAN [34]: Although the cycle-consistency constraint can ensure that the translated images have semantic information similar to the target domain, it enforces the relationship between the 2 domains to be a bijection, which is too restrictive. Meanwhile, CUT has demonstrated the effectiveness of contrastive learning in one-sided unsupervised image-to-image translation, but one embedding for 2 separate domains may not capture the domain gap. To solve this, DCLGAN combines the advantages of CycleGAN and CUT, employing the idea of mutual-information maximization in a dual learning setting to enable efficient 2-sided domain mapping with unpaired data.

One-to-many
Though several methods have enabled unpaired image-to-image translation, they fail to generate multi-modal results. An effective way to handle multi-modal image-to-image translation is to condition the translation on both the input image and a specific latent code. To achieve this, DRIT/DRIT++ and MUNIT assume that the image representation can be disentangled into 2 spaces: a domain-invariant content space capturing shared information across domains and a domain-specific style space. To achieve translation, they recombine an image's content information with a random style feature sampled from the style space of the target domain. To improve diversity, MSGAN presents a mode-seeking regularization term that maximizes the ratio of the distance between translated images to the distance between their latent vectors. DSMAP leverages domain-specific mappings that remap latent features in the shared content space to domain-specific content spaces, which helps with more challenging style transfer tasks requiring more attention to local and structural-semantic correspondences. These methods are described in detail as follows.
DRIT [52]/DRIT++ [35]/MUNIT [27]: DRIT/DRIT++ and MUNIT assume that images from 2 domains can be decomposed into a domain-invariant content space and a domain-specific style space. The domain-invariant content space captures the shared information across the 2 domains, while the style space captures domain-specific attributes. To transfer an image from the source domain to the target domain, they recombine its content code with a random style code sampled from the target-domain style space. Mathematically, for unpaired images (x, y) randomly sampled from the source domain X and the target domain Y, DRIT/DRIT++ and MUNIT first use the content encoders E_X^c, E_Y^c and style encoders E_X^s, E_Y^s to disentangle the images into a domain-invariant content code, z_c = E_X^c(x) = E_Y^c(y), and domain-specific style codes, x_s = E_X^s(x) and y_s = E_Y^s(y). Then, they perform a cross-domain mapping to obtain translated images x̃, ỹ by feeding the recombined content and style codes to the generators, ỹ = G_XY(E_X^c(x), E_Y^s(y)) and x̃ = G_YX(E_Y^c(y), E_X^s(x)), where G_YX and G_XY are cross-domain generators. After that, they apply the above cross-domain mapping one more time and leverage the cycle-consistency constraint to enforce consistency between the reconstructed images and the original inputs, which can be formulated as

$$\mathcal{L}_{cc} = \mathbb{E}\left[\left\| G_{YX}\left(E_{Y}^{c}(\tilde{y}), E_{X}^{s}(\tilde{x})\right) - x \right\|_{1}\right] + \mathbb{E}\left[\left\| G_{XY}\left(E_{X}^{c}(\tilde{x}), E_{Y}^{s}(\tilde{y})\right) - y \right\|_{1}\right].$$

MSGAN [68]: Existing cGANs tend to focus on the conditional input images but ignore the random latent vectors that contribute significantly to the diversity of the outputs, and thus suffer from mode collapse. To address this issue and improve the diversity of the generated images, MSGAN proposes a simple yet effective mode-seeking regularization term, which maximizes the ratio of the distance between generated images to the distance between the corresponding latent vectors. Let x be an input image from the domain X, z_1 and z_2 be 2 latent vectors from the latent space Z, and G_XY be a cross-domain generator that translates the input image with a latent vector to the target domain. The mode-seeking regularization term then directly maximizes the ratio of the distance between the translated images to the distance between the latent vectors, which can be expressed as

$$\mathcal{L}_{ms} = \max_{G_{XY}} \frac{d\!\left(G_{XY}\left(x, z_{1}\right), G_{XY}\left(x, z_{2}\right)\right)}{d\!\left(z_{1}, z_{2}\right)},$$

where d(·) denotes the predefined distance metric.
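The mode-seeking ratio is straightforward to compute once two outputs for the same input are available. A hedged sketch with mean L1 as the distance metric (names are ours; training would maximize this quantity, often by minimizing its reciprocal):

```python
import numpy as np

def mode_seeking_term(img1, img2, z1, z2, eps=1e-8):
    """Ratio of the distance between two images generated from the same input
    with latent codes z1, z2 to the distance between those latent codes.
    A larger value means more diverse outputs per unit of latent change.
    """
    d_img = np.mean(np.abs(img1 - img2))
    d_z = np.mean(np.abs(z1 - z2))
    return d_img / (d_z + eps)   # eps guards against identical latents
```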
DSMAP [39]: Previous multi-modal unsupervised image-to-image translation methods often assume that the image representation can be decomposed into a shared domain-invariant content space and a domain-specific style space. However, this content space only considers the shared information across domains and ignores the relationship between content and style, which may weaken the representation of content. To address this issue, DSMAP leverages 2 additional domain-specific mapping functions to remap the content features in the shared domain-invariant content space into domain-specific content spaces for the different domains, which can be expressed as

$$c_{x \rightarrow Y} = \phi_{Y}\!\left(E_{X}^{c}(x)\right), \qquad c_{y \rightarrow X} = \phi_{X}\!\left(E_{Y}^{c}(y)\right),$$

where x, y are unpaired images randomly sampled from the domain X and the domain Y, φ_X and φ_Y are the domain-specific mapping functions, and E_X^c, E_Y^c are the domain-invariant content encoders. Through these domain-specific mapping functions, the features in the shared content space can be aligned with the target domain to encode domain-specific content features and thus improve the content representation ability for translation.

A Specific Dataset for Aerial Visible-to-Infrared Image Translation
In this section, we introduce AVIID, a specific dataset for aerial visible-to-infrared image translation, in detail. AVIID consists of paired aerial visible and infrared images taken by a dual-light camera mounted on a UAV. Figure 2 shows the dual-light camera and the UAV. Table 1 describes the detailed parameters of the dual-light camera. Depending on the shooting time, scenarios, and photography conditions, we further divide AVIID into 3 subdatasets named AVIID-1, AVIID-2, and AVIID-3. Table 2 shows an overall comparison of the 3 subdatasets, and their details are described in the following.

AVIID-1
AVIID-1 contains 993 paired visible-infrared images with an image size of 434 × 434. The scenes of AVIID-1 are roads, and the targets in the images are common vehicles, including cars, buses, vans, and trucks. These images were taken between 9 a.m. and 12 p.m. at temperatures ranging from 28 ∘C to 32 ∘C. When taking images, the height of the UAV is about 15 m, the distance from the road is about 90 m, and the shooting angle of the dual-light camera is 90∘ horizontally. The scenarios in these images are very similar, mainly including various cars, trees beside the road, and houses in the distance. Therefore, using this subdataset for aerial visible-to-infrared image translation is relatively simple. Figure 3 shows some examples of AVIID-1.

AVIID-2
AVIID-2 contains 1,090 paired visible-infrared images with an image size of 434 × 434. The shooting conditions and scenes of AVIID-2 are the same as those of AVIID-1, except that this subdataset was taken from 8 p.m. to 10 p.m. at temperatures between 26 ∘C and 28 ∘C. The images of AVIID-2 are taken under low-light conditions, resulting in considerable noise and even blurry targets and backgrounds, which makes aerial visible-to-infrared translation challenging compared with AVIID-1. Some examples of AVIID-2 can be seen in Fig. 4.

AVIID-3
AVIID-3 contains 1,280 paired visible-infrared images with an image size of 512 × 512. These images were taken by the UAV at 3 different heights of about 50 m, 100 m, and 150 m, and at 2 different shooting angles of 45∘ and 60∘ vertically. The shooting time is mainly from 2 p.m. to 5 p.m., at temperatures between 30 ∘C and 34 ∘C. Compared with AVIID-1 and AVIID-2, this subdataset contains more types of vehicles and numerous targets of multiple densities, viewpoints, and scales. In addition, AVIID-3 is collected in various scenarios with more complicated backgrounds, including roads, bridges across rivers, parking lots, and streets of residential communities. Therefore, this subdataset is more challenging for aerial visible-to-infrared image translation and can better evaluate the performance of different methods. Some examples of AVIID-3 are displayed in Fig. 5.

Experiments and Results
In this section, we evaluate some representative image-to-image translation methods on AVIID. First, we present our experimental settings, including dataset usage, baseline methods, and training and testing details. Then, our proposed complete evaluation system, which evaluates generated images from 2 aspects, overall appearance and target quality, is introduced in detail. Finally, the baseline results are given for future work.

Settings
We conduct experiments on all 3 subdatasets and set the ratio of the training set to 50% and 80%, respectively, with the remaining data used for testing. We select 10 representative methods as baselines: 2 supervised methods, Pix2Pix and BicycleGAN, and 8 unsupervised methods, GCGAN, CUT, CycleGAN, UNIT, DCLGAN, MUNIT, DRIT, and MSGAN. At training time, every image is first resized to 286 × 286, then randomly cropped to 256 × 256, and finally horizontally flipped with a probability of 0.5 for data augmentation. To train Pix2Pix, BicycleGAN, GCGAN, CUT, CycleGAN, and DCLGAN, we use the Adam optimizer with a learning rate of 0.0002 and a batch size of 4 for 1,000 epochs on an NVIDIA RTX3090. For DRIT and MSGAN, the networks are also optimized by Adam with a learning rate of 0.0001 for 1,200 epochs on a GTX1080Ti, and the batch size is also set to 4. For UNIT and MUNIT, we use the Adam optimizer for 200,000 iterations on an NVIDIA RTX3090 with a learning rate of 0.0001, a batch size of 4, and a weight decay of 0.0001. In the testing procedure, the input image is resized to 256 × 256 without any data augmentation.
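The augmentation pipeline above (resize to 286 × 286, random 256 × 256 crop, horizontal flip with probability 0.5) must be applied identically to both images of a pair so they remain pixel-aligned. A simplified sketch using nearest-neighbor resizing on square grayscale arrays (the actual training code presumably uses bilinear resizing on RGB tensors):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_pair(vis, ir):
    """Apply the same resize / crop / flip to a visible-infrared pair."""
    def resize(im, size=286):
        # nearest-neighbor resize; assumes a square 2D array
        idx = (np.arange(size) * im.shape[0] / size).astype(int)
        return im[np.ix_(idx, idx)]
    vis, ir = resize(vis), resize(ir)
    top = rng.integers(0, 286 - 256 + 1)     # shared random crop offsets
    left = rng.integers(0, 286 - 256 + 1)
    vis = vis[top:top + 256, left:left + 256]
    ir = ir[top:top + 256, left:left + 256]
    if rng.random() < 0.5:                   # same flip for both images
        vis, ir = vis[:, ::-1], ir[:, ::-1]
    return vis, ir
```

Sharing the crop offsets and flip decision is the essential point; independent augmentation would destroy the pixel-level correspondence that supervised methods rely on.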

Overall appearance evaluation
In order to evaluate the overall appearance quality of the generated images, we adopt the most widely used traditional perceptual metrics, including MSE, PSNR, and SSIM. The details of these metrics are as follows.
MSE: MSE evaluates the discrepancy between the pixels of the generated image and its ground truth, and can be defined as

$$\mathrm{MSE} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( y_{i,j} - \hat{y}_{i,j} \right)^{2},$$

where y and ŷ represent the generated image and the corresponding real one, and H and W are the height and width of the image, respectively.
PSNR: The PSNR measures the degree of distortion of the generated image with respect to its corresponding ground truth, and can be expressed as

$$\mathrm{PSNR} = 10 \log_{10} \frac{\max(\hat{y})^{2}}{\mathrm{MSE}},$$

where max(ŷ) is the maximum pixel value of the real image. A higher PSNR indicates smaller distortion of the generated image.
SSIM: SSIM estimates the structural similarity between the generated image and the real image, and can be formulated as

$$\mathrm{SSIM} = \frac{\left(2 \mu_{y} \mu_{\hat{y}} + c_{1}\right)\left(2 \sigma_{y\hat{y}} + c_{2}\right)}{\left(\mu_{y}^{2} + \mu_{\hat{y}}^{2} + c_{1}\right)\left(\sigma_{y}^{2} + \sigma_{\hat{y}}^{2} + c_{2}\right)},$$

where c_1 and c_2 are constants, μ_y, μ_ŷ and σ_y², σ_ŷ² are the means and variances of the generated image and the ground truth, respectively, and σ_yŷ is their covariance. A higher SSIM means the generated image is more similar to its corresponding real image.
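These 3 traditional metrics can be sketched directly from their definitions. Note that the SSIM below is a single-window (global) variant for illustration; the standard metric averages SSIM over local windows:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared pixel error between two images."""
    return np.mean((y.astype(float) - y_hat.astype(float)) ** 2)

def psnr(y, y_hat, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    m = mse(y, y_hat)
    return float('inf') if m == 0 else 10 * np.log10(max_val ** 2 / m)

def ssim_global(y, y_hat, max_val=255.0):
    """Single-window SSIM over the whole image (illustrative sketch)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_y, mu_h = y.mean(), y_hat.mean()
    var_y, var_h = y.var(), y_hat.var()
    cov = ((y - mu_y) * (y_hat - mu_h)).mean()
    return ((2 * mu_y * mu_h + c1) * (2 * cov + c2)) / \
           ((mu_y ** 2 + mu_h ** 2 + c1) * (var_y + var_h + c2))
```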
Though MSE, PSNR, and SSIM are the most widely used traditional perceptual metrics, they are relatively shallow functions and fail to account for many nuances of human perception. In recent years, using the deep features of CNNs as perceptual metrics has been demonstrated to be effective and more consistent with human perceptual judgment. Therefore, to evaluate the quality of the generated images more accurately, we adopt 3 CNN-based perceptual metrics: FID [69], KID [70], and LPIPS [71]. More details of FID, KID, and LPIPS are as follows.
LPIPS: LPIPS is a CNN-based perceptual metric that has been demonstrated to coincide closely with human judgment. It is computed as a weighted L2 distance between the deep features, extracted by a deep CNN, of the generated images and their ground truth. KID: KID, the Kernel Inception Distance, is a metric similar to FID, defined as the squared MMD [15] between Inception representations, and has a simple unbiased estimator. Correspondingly, a lower KID means better performance.
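A KID sketch in NumPy follows, using the standard unbiased MMD estimator with the cubic polynomial kernel k(a, b) = (a·b/d + 1)³ commonly used for KID (the kernel choice is our assumption, since the text does not specify it; in practice the inputs are Inception features):

```python
import numpy as np

def polynomial_kernel(x, y):
    # Cubic polynomial kernel k(a, b) = (a.b / d + 1)^3, d = feature dimension.
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(feat_real, feat_gen):
    """Unbiased squared-MMD estimate between two feature sets (lower is better)."""
    m, n = len(feat_real), len(feat_gen)
    k_rr = polynomial_kernel(feat_real, feat_real)
    k_gg = polynomial_kernel(feat_gen, feat_gen)
    k_rg = polynomial_kernel(feat_real, feat_gen)
    # Diagonal (self-similarity) terms are excluded for unbiasedness.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())

rng = np.random.default_rng(0)
real = rng.standard_normal((100, 64))
same = rng.standard_normal((100, 64))   # drawn from the same distribution
shifted = real + 1.0                    # mean-shifted "generated" features
print(kid(real, same), kid(real, shifted))
```

The score for same-distribution features is close to zero, while the mean-shifted set yields a much larger value.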
In the testing process, we randomly sample 150 test images and translate them to obtain the corresponding infrared images for Pix2Pix, BicycleGAN, CycleGAN, GCGAN, UNIT, CUT, and DCLGAN. For one-to-many methods, we generate 10 examples per input and randomly select one as the final result. These generated images and their real counterparts are used to calculate the metrics mentioned above for each method. We repeat the experiments 5 times and report the average score and standard deviation of each metric.
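The protocol above can be sketched as follows (a minimal sketch with placeholder names; `generate` and `metric_fn` stand in for a trained translator and any of the metrics above):

```python
import numpy as np

def evaluate(metric_fn, test_pairs, generate, rng, n_samples=150, n_runs=5,
             one_to_many=False, n_candidates=10):
    """Repeat the sampling/translation/scoring protocol n_runs times and
    return the mean and standard deviation of the metric over runs."""
    run_scores = []
    for _ in range(n_runs):
        # Randomly sample n_samples visible/infrared test pairs per run.
        idx = rng.choice(len(test_pairs), size=n_samples, replace=False)
        scores = []
        for i in idx:
            visible, real_ir = test_pairs[i]
            if one_to_many:
                # One-to-many methods: generate several outputs and
                # randomly keep one as the final result.
                outs = [generate(visible) for _ in range(n_candidates)]
                fake_ir = outs[rng.integers(n_candidates)]
            else:
                fake_ir = generate(visible)
            scores.append(metric_fn(real_ir, fake_ir))
        run_scores.append(float(np.mean(scores)))
    return float(np.mean(run_scores)), float(np.std(run_scores))

# Toy check with an identity "generator" and an MSE-style metric.
rng = np.random.default_rng(0)
pairs = [(np.full((8, 8), i % 255, np.float64),) * 2 for i in range(200)]
mean, std = evaluate(lambda a, b: float(np.mean((a - b) ** 2)),
                     pairs, lambda v: v, rng)
print(mean, std)  # 0.0 0.0 for a perfect (identity) translator
```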

Target quality evaluation
For aerial infrared images, generating targets that are as realistic as possible is essential for many tasks, such as object detection and tracking. However, existing perceptual metrics mainly consider the overall appearance of the generated images but ignore the evaluation of the targets within them. To address this issue, we propose a new metric named RmAP, which measures the similarity of the targets between the generated images and the real ones and is obtained by computing the absolute difference of the mAP between the real and generated images under the same object detection framework:

$$\mathrm{RmAP} = \left| \mathrm{mAP}_{\hat{y}} - \mathrm{mAP}_{y} \right|$$

where mAP is a widely used metric for evaluating the performance of object detection algorithms [72][73][74].

Fig. 3. Some examples of AVIID-1. The scenes of AVIID-1 contain roads with various kinds of vehicles, including cars, buses, vans, and trucks.
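The RmAP computation above can be sketched as follows (the mAP values below are hypothetical placeholders, not results from the paper):

```python
def rmap(map_real, map_generated):
    # RmAP = |mAP on real images - mAP on generated images|; a lower value
    # means targets in the generated images behave like real ones under
    # the same detector.
    return abs(map_real - map_generated)

# Hypothetical mAP scores (placeholders): one detector evaluated at two
# IoU thresholds on real vs. generated images.
map_scores = {0.50: (0.91, 0.84), 0.75: (0.78, 0.62)}
for iou, (real_map, gen_map) in map_scores.items():
    print(f"IoU={iou:.2f}: RmAP={rmap(real_map, gen_map):.2f}")
```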
At testing time, we first use 80% of the real aerial infrared images to train 4 object detection models: Faster RCNN [75], YOLOv3 [76], YOLOv5 [77], and YOLOx [78]. Then, we randomly select 150 generated images with their ground truth for each method and compute the absolute difference of their mAP on every object detection model under 3 IOU settings. Similar to the overall appearance evaluation, we repeat the experiments 5 times and report the average score and standard deviation of RmAP.

AVIID-1
Tables 3 and 4 show the means and standard deviations of the overall appearance evaluation metrics under the 50% and 80% training ratios on AVIID-1, respectively. The results show that Pix2Pix performs better than BicycleGAN on both traditional and CNN-based perceptual metrics. DCLGAN and CUT perform similarly and outperform the other unsupervised methods on all appearance evaluation metrics, with CUT performing slightly worse. These results reveal that contrastive learning constraints can achieve a patch-level alignment by maximizing the mutual information between the corresponding input and output patches, thereby improving the overall appearance quality of the generated images.

Tables 5 to 8 illustrate the means and standard deviations of the target quality evaluation metric under 4 object detection models with 3 IOU settings on AVIID-1. The RmAP results indicate that supervised methods give significantly superior performance compared with unsupervised ones in terms of target quality, which is contrary to the conclusions drawn from the overall appearance evaluation. This suggests that the pixel-level mapping learned from the paired data is beneficial for generating fine-grained targets, while also indicating that the RmAP metric complements the overall appearance evaluation metrics and thus evaluates the performance of algorithms more effectively. For the unsupervised methods, contrastive learning-based methods do not achieve as excellent a performance in target quality as in overall appearance quality and even perform worse than other approaches. For instance, DRIT achieves much lower RmAP values than DCLGAN on the Faster RCNN and YOLOv5 object detection models under the 50% training ratio with all 3 IOU settings. Similarly, GCGAN also gives better results on the YOLOv3 model under the 80% training ratio for all IOU settings. The possible reason for this phenomenon is that the patch-level alignment can be seen as a coarse-grained mapping between the input and output images compared with the pixel-level mapping, which can lead to blurriness and distortion of the targets in the generated images. This phenomenon becomes more serious in aerial images, mainly because there often exist many small targets with geometric discrepancies (such as the cars and buses in our dataset).
Figures 6 and 7 display some generated images for each method under the 50% and 80% training ratios on AVIID-1, respectively. By comparing these generated examples, we can find that the vehicles generated by DCLGAN and CUT, especially CUT, have geometric distortion and blurred edges compared with Pix2Pix, which further confirms our assumption.

AVIID-2
Tables 9 and 10 show the means and standard deviations of the overall appearance evaluation metrics under the 50% and 80% training ratios on AVIID-2, respectively. The results lead to conclusions similar to those on AVIID-1: Pix2Pix outperforms BicycleGAN, and DCLGAN achieves the best performance among the unsupervised methods, followed by CUT. It is worth noting that BicycleGAN performs much worse than Pix2Pix here, which differs from AVIID-1. The reason may be that the visible images in AVIID-2 are seriously affected by weak light and noise, resulting in large discrepancies between them and their corresponding infrared images, especially in the backgrounds. As a result, the generator may pay too much attention to the latent vector encoded from the infrared images during translation, which leads to the distortion of details in the generated images. In addition, the values of the overall appearance evaluation metrics obtained by each method are significantly lower than those on AVIID-1, indicating that AVIID-2 is more challenging.

Tables 11 to 14 illustrate the means and standard deviations of the target quality evaluation metric under 4 object detection models with 3 IOU settings on AVIID-2. From the RmAP results, we can find that Pix2Pix outperforms all other methods by a large margin in terms of target quality, which is similar to AVIID-1. Among the unsupervised approaches, DCLGAN achieves superior results on AVIID-2. For example, it gives the best performance on the Faster RCNN, YOLOv3, and YOLOv5 object detection models under the 80% training ratio and a lower RmAP on the Faster RCNN and YOLOv5 under the 50% training ratio when the IOU is set to 0.75.
Figures 8 and 9 display some generated images for each method under the 50% and 80% training ratios on AVIID-2, respectively. From these figures, we can see that some generated images have blurred backgrounds and geometric distortion of the targets, which is more severe for the supervised methods. This phenomenon may indicate that pixel-level mapping becomes too strict when the visible images are severely disturbed by weak light and noise, thus degrading the quality of the generated images. In this case, patch-level alignment is less strict than pixel-level mapping; thus, contrastive learning-based methods can better preserve the clarity of the backgrounds and the geometry of the targets in the generated images.

AVIID-3
Tables 15 and 16 show the means and standard deviations of the overall appearance evaluation metrics under the 50% and 80% training ratios on AVIID-3, respectively. From the results, we can find that Pix2Pix still performs better than BicycleGAN, as on AVIID-1 and AVIID-2. However, among the unsupervised methods, GCGAN clearly outperforms DCLGAN, which performed best on AVIID-1 and AVIID-2 under all overall appearance quality metrics. This phenomenon suggests that a simple geometry-consistency constraint can effectively maintain the geometric shape of the targets (particularly the tiny and dense cars in AVIID) during the translation process, which helps reduce the blur and detail distortions of the generated images in the case of various scenarios with more complicated backgrounds, whereas the contrastive learning and cycle-consistency constraints are too strict.

Tables 17 to 20 illustrate the means and standard deviations of the target quality evaluation metric under 4 object detection models with 3 IOU settings on AVIID-3. From the RmAP results, we can see that GCGAN achieves an overwhelming superiority in target quality compared with the other unsupervised methods, which further reflects the effectiveness of the geometry-consistency constraint in generating high-quality targets.
Figures 10 and 11 display some generated images for each method under the 50% and 80% training ratios on AVIID-3, respectively. From the figures, we can find that GCGAN maintains the geometric shape of the targets and thus reduces distortions and blur, especially in the case of dense cars, which further supports our conclusion.

Conclusion
From the above experimental results and discussion, we can sum up some meaningful conclusions as follows.
• The pixel-level mapping learned from paired data is beneficial for generating fine-grained targets. Therefore, supervised methods give significantly superior performance in target quality evaluation compared with unsupervised approaches.
• The contrastive learning constraint can be seen as a patch-level mapping that maximizes the mutual information between the corresponding input and output patches. This patch-level alignment can enhance the correspondence of the input and output patches, which helps to improve the quality of the generated images, especially in weak light and noisy conditions.
• The geometry-consistency constraint is a simple and effective way to maintain the geometric shape of the targets (particularly tiny and dense targets) during the translating process, which can meaningfully reduce the blur and detail distortions of the generated images in the case of various scenarios with complicated backgrounds.
In addition, several problems of existing methods can be summarized from the experimental results and discussion, as follows.
• Current approaches only consider migrating global styles or attributes onto the entire image but ignore the considerable discrepancy between targets and backgrounds in infrared attributes, resulting in unrealistic targets in the generated images.
• Existing methods can only transfer styles or attributes between aerial visible and infrared images without taking into account the different properties of each modality. Consequently, the authenticity of the generated images is poor.
• For aerial images with multi-scale dense targets, complex backgrounds, and diverse scenes, current methods struggle to capture the spatial differences between images, resulting in distortion and blurring of generated targets and backgrounds, significantly reducing the quality of generated images.
The above conclusions can provide meaningful guidance for investigating more efficient methods on more challenging datasets to facilitate the process of aerial visible-to-infrared image translation.
The above conclusions and problems of existing methods are summarized to advance state-of-the-art algorithms for aerial visible-to-infrared image translation. In addition, several future research directions in this field are analyzed and summarized as follows.
• Current image-to-image translation methods do not consider the imaging mechanism relating visible and infrared images. How to construct reasonable imaging-mechanism constraints to improve the realism of generated infrared images is a future research direction.
• The AVIID dataset proposed in this article consists of aerial remote sensing images taken by an infrared camera mounted on a UAV. Visible-to-infrared image translation on satellite platforms also deserves research in the future.
• Existing image-to-image translation methods are mainly based on deep CNNs. However, due to limited computational resources, the model parameters cannot grow without bound, so the size of the generated image is limited. Therefore, finding an effective way to transfer these approaches to large-scale areas is necessary.
• The quality of the images generated by image-to-image translation methods is highly correlated with the similarity between training and test data. Therefore, improving the transferability and generalizability of these methods is one of the future research directions.
• The radiation value of thermal images depends strongly on atmospheric conditions, and when infrared images are captured at a very high altitude above the ground, atmospheric compensation becomes a worthwhile problem to solve.

Moreover, AVIID and the PyTorch code of these methods can be freely downloaded to advance the progress of aerial visible-to-infrared image translation.

Fig. 1. Overview of image-to-image translation methods that could be applied to aerial visible-to-infrared image translation. Each color represents a category.
LPIPS can be formulated as

$$\mathrm{LPIPS}(Y,\hat{Y}) = \frac{1}{N}\sum_{n=1}^{N}\sum_{l} w_{l}\left\| y_{l}^{(n)} - \hat{y}_{l}^{(n)} \right\|_{2}^{2}$$

where $Y$ and $\hat{Y}$ represent the generated images and the real ones, $y_{l}$ and $\hat{y}_{l}$ are normalized deep features extracted from the $l$-th layer of the deep CNN, $w_{l}$ denotes the weight parameters, and $N$ is the number of images. We use AlexNet pretrained on ImageNet as the deep feature extractor, and a lower LPIPS score indicates a better quality of the generated images. FID: FID is a widely used metric that estimates the distributions of real and generated images from deep features extracted by the last pooling layer of the Inception-V3 model trained on ImageNet and computes the divergence between them, which can be formulated as

$$\mathrm{FID} = \left\| m_{y} - m_{\hat{y}} \right\|_{2}^{2} + \mathrm{Tr}\!\left( C_{y} + C_{\hat{y}} - 2\left( C_{y}C_{\hat{y}} \right)^{1/2} \right)$$

where $m$ indicates the mean of the deep features, $C$ the covariance matrix, and $\mathrm{Tr}(\cdot)$ the trace operation. Intuitively, if the generated images are similar to the real ones, they should have lower FID values.
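The FID formula above can be sketched in NumPy; the trace term uses the equivalent symmetric form $C_{y}^{1/2} C_{\hat{y}} C_{y}^{1/2}$ so the matrix square root stays real-valued (in practice, the features come from the last pooling layer of Inception-V3):

```python
import numpy as np

def _sqrtm_psd(a):
    # Matrix square root of a symmetric positive semidefinite matrix via
    # eigendecomposition (negative eigenvalues from rounding are clipped).
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(feat_real, feat_gen):
    """Frechet distance between Gaussians fitted to two feature sets."""
    m_r, m_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    c_r = np.cov(feat_real, rowvar=False)
    c_g = np.cov(feat_gen, rowvar=False)
    # Tr((C_r C_g)^(1/2)) computed through the symmetric, PSD form
    # C_r^(1/2) C_g C_r^(1/2).
    s_r = _sqrtm_psd(c_r)
    covmean = _sqrtm_psd(s_r @ c_g @ s_r)
    return float(np.sum((m_r - m_g) ** 2) + np.trace(c_r + c_g - 2.0 * covmean))

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 8))
print(fid(feats, feats))  # essentially zero for identical feature sets
```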

Table 1.
Detailed parameters of the dual-light camera

Table 3.
Overall appearance evaluation under 50% training ratio on AVIID-1. The best results are highlighted in bold.
Downloaded from https://spj.science.org on January 04, 2024

Table 4.
Overall appearance evaluation under 80% training ratio on AVIID-1. The best results are highlighted in bold.

Table 5.
RmAP under the Faster RCNN object detection model on AVIID-1. The best results are highlighted in bold.

Table 6.
RmAP under the YOLOv3 object detection model on AVIID-1. The best results are highlighted in bold.

Table 7.
RmAP under the YOLOv5 object detection model on AVIID-1. The best results are highlighted in bold.

Table 8.
RmAP under the YOLOx object detection model on AVIID-1. The best results are highlighted in bold.

Table 9.
Overall appearance evaluation under 50% training ratio on AVIID-2. The best results are highlighted in bold.

Table 10.
Overall appearance evaluation under 80% training ratio on AVIID-2. The best results are highlighted in bold.

Table 11.
RmAP under the Faster RCNN object detection model on AVIID-2. The best results are highlighted in bold.

Table 12.
RmAP under the YOLOv3 object detection model on AVIID-2. The best results are highlighted in bold.

Table 13.
RmAP under the YOLOv5 object detection model on AVIID-2. The best results are highlighted in bold.

Table 14.
RmAP under the YOLOx object detection model on AVIID-2. The best results are highlighted in bold.
Fig. 8. Some generated images for each method under 50% training ratio on AVIID-2.

Table 15.
Overall appearance evaluation under 50% training ratio on AVIID-3. The best results are highlighted in bold.

Table 16.
Overall appearance evaluation under 80% training ratio on AVIID-3. The best results are highlighted in bold.

Table 17.
RmAP under the Faster RCNN object detection model on AVIID-3. The best results are highlighted in bold.

Table 20.
RmAP under the YOLOx object detection model on AVIID-3. The best results are highlighted in bold.