UNet-like network fusing Swin Transformer and CNN for semantic image synthesis

Semantic image synthesis has been dominated by approaches based on Convolutional Neural Networks (CNNs). Due to the limitations of local perception, their performance improvement appears to have plateaued in recent years. To tackle this issue, we propose SC-UNet, a UNet-like network fusing Swin Transformer and CNN for semantic image synthesis. Photorealistic image synthesis conditioned on a given semantic layout depends on both high-level semantics and low-level positions. To improve synthesis performance, we design a novel conditional residual fusion module for the model decoder that efficiently fuses the hierarchical feature maps extracted at different scales. Moreover, this module combines an opposition-based learning mechanism and a weight assignment mechanism to enhance and attend to the semantic information. Compared to pure CNN-based models, our SC-UNet combines local and global perception to better extract high- and low-level features and better fuse multi-scale features. We have conducted extensive quantitative and qualitative comparison experiments to validate the effectiveness of the proposed SC-UNet model for semantic image synthesis. The results show that SC-UNet distinctly outperforms state-of-the-art models on three benchmark datasets (Cityscapes, ADE20K, and COCO-Stuff) containing numerous real-scene images.

• We propose a UNet-like network model based on Swin Transformer and CNN for semantic image synthesis, which outperforms pure CNN-based models in effectively extracting high- and low-level features at different scales.
• We propose a decoder based on the Conditional Residual Fusion (CRF) block, which produces more accurate feature representations through the hierarchical fusion of multi-scale features to improve synthesis performance.
• We propose two novel mechanisms embedded in the CRF block: the opposition-based learning mechanism effectively enhances the semantic feature information, while the weight assignment mechanism dynamically assigns attention weights in the channel and spatial dimensions.
• Extensive experiments are conducted on three public datasets: Cityscapes, ADE20K and COCO-Stuff. The results prove the effectiveness of our semantic image synthesis method, which achieves state-of-the-art performance.

Generative adversarial networks
Generative adversarial networks (GANs) 17,18 have become the mainstream method for image synthesis tasks. A GAN architecture is usually composed of two main networks, namely the generator and the discriminator. The generator is in charge of synthesizing the target images from the given input conditions, whereas the discriminator aims to distinguish between the synthetic image and the matched natural image. The input conditions used by GAN-based image synthesis methods are varied, such as sparse sketches [19][20][21], Gaussian noise 22,23, text descriptions [24][25][26], natural images 27,28, and semantic layouts [29][30][31][32]. Considering the great success of

Conventional residual block
The conventional residual block, as a classical structure of convolutional neural networks, has been extensively studied in prior research 36. It typically consists of two convolutional layers and a shortcut connection, allowing input features to be efficiently transferred to the output and thus facilitating cross-layer feature fusion. Additionally, the residual block helps mitigate the vanishing gradient problem and improves network trainability. He et al. 36 introduced a residual learning framework to train deeper networks effectively, paving the way for subsequent advancements. Ruofan et al. 37 proposed a deep residual network for end-to-end projection learning, demonstrating its applicability in tasks involving Bayer images and high-resolution images. Despite the achievements of conventional residual blocks, they still exhibit limitations in image synthesis tasks conditioned on semantic layout maps. Recognizing this, we augment the traditional residual block with an Opposition-based Learning Mechanism (OLM) and a Weight Assignment Mechanism (WAM). OLM is derived from the concept of opposition 38 and aims to enhance learning by considering both the positive and negative aspects of semantic features; it is employed to augment semantic information and thereby improve normalization performance. WAM, on the other hand, dynamically allocates attention weights in both the channel and spatial dimensions. Although similar mechanisms [39][40][41] such as channel attention and spatial attention have been studied previously, WAM is distinguished by its integrated feature weighting, which is novel for image synthesis tasks. By integrating these components, we seek to improve feature fusion and pay closer attention to semantic feature information, ultimately improving fusion performance in image synthesis tasks.

Method
SC-UNet is a semantic image synthesis model based on a UNet-like network structure composed of an encoder and a decoder. The overall architecture of the SC-UNet model is shown in Fig. 2. In the encoding stage, the input semantic layout map is first semantically augmented by a one-hot encoding operation and a Canny edge extraction operation; the augmented semantic features then pass through a patch embedding layer to obtain a sequence embedding as the input of the Swin Transformer module. Finally, the Swin-Transformer-based encoder extracts low-level features at different scales from the input sequence embedding. In the decoding stage, the decoder, which combines the Conditional Residual Fusion (CRF) block and the Swin Transformer module, hierarchically fuses the high-level semantic features with the low-level positional features. To recover a photo-realistic synthesized image with abundant details, our model finally applies a tanh activation function to the decoder's output to keep the pixel values within a specified range. Our SC-UNet model uses a supervised training strategy based on Generative Adversarial Networks (GANs) and takes advantage of a pre-trained Swin Transformer module to initialize part of the network weights; the GAN-based SC-UNet approach therefore reduces the probability of exploding gradients caused by poor initialization. During supervised training, our model is optimized with a weighted sum of multiple loss functions, thus achieving better synthesis performance.
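For illustration, the encoder/decoder composition described above can be sketched in PyTorch as follows. The module names `SwinEncoder`-style `encoder` and `CRFDecoder`-style `decoder`, and their interfaces, are placeholders we assume for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SCUNetSketch(nn.Module):
    """Minimal skeleton of the SC-UNet generator described in the text.

    Assumes `encoder` returns a list of multi-scale feature maps and
    `decoder` fuses them back to full resolution conditioned on F_s;
    both are placeholders for the modules described in this paper.
    """
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder      # Swin-Transformer-based encoder
        self.decoder = decoder      # CRF-block + Swin-Transformer decoder

    def forward(self, f_s: torch.Tensor) -> torch.Tensor:
        # f_s: augmented condition (one-hot label + edge map), B x (N_cls+1) x H x W
        skips = self.encoder(f_s)           # hierarchical low-level features
        out = self.decoder(skips, f_s)      # hierarchical fusion conditioned on f_s
        return torch.tanh(out)              # keep pixel values in [-1, 1]
```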

Model encoder
The encoder of our proposed SC-UNet method aims to extract low-level positional features at different scales from the input semantic layout map. Let M ∈ R^{H×W×1} be the input semantic layout map of the model, where H and W denote the height and width, respectively. To extract more accurate and comprehensive feature representations, the input semantic layout map is first semantically augmented by simultaneously performing a one-hot encoding operation 15 and a Canny edge extraction operation, and the augmented feature representation F_s is then used as the input of the Swin Transformer module. More specifically, the patch embedding layer exploits a non-overlapping convolution to partition the feature map F_s into a series of patch tokens of size 4, and each patch token is then flattened into a sequence embedding e_i by linear mapping. Compared to the default 16 × 16 patch setting, a smaller patch size facilitates the extraction of local features containing more detailed information, but also increases the computational workload. The above mapping process from an input semantic layout map to a one-dimensional sequence embedding is summarized as:

F_s = Concat(Encoding(M), Canny(M)),  (1)
e_i = Linear(Conv(F_s)),  (2)

where Concat, Encoding and Canny denote the concatenation operation, the one-hot encoding operation and the Canny edge extraction operation, respectively, and Conv and Linear realize the patch partition and linear mapping in the patch embedding layer.
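A minimal sketch of this semantic augmentation and patch embedding is given below, assuming the one-hot encoding is done with `torch.nn.functional.one_hot`, the edge map with OpenCV's Canny detector, and a stride-4 convolution for the patch embedding; the Canny thresholds (100, 200) and the 96-dimensional embedding are illustrative choices, the latter matching the dimensionality mentioned later for e_1.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def augment_condition(label_map: np.ndarray, n_cls: int) -> torch.Tensor:
    """Build F_s by concatenating the one-hot label map and a Canny edge map.

    `label_map` is an H x W integer array of class ids; the Canny thresholds
    (100, 200) are illustrative defaults, not values from the paper.
    """
    one_hot = F.one_hot(torch.from_numpy(label_map).long(), n_cls)   # H x W x N_cls
    one_hot = one_hot.permute(2, 0, 1).float()                       # N_cls x H x W
    edges = cv2.Canny(label_map.astype(np.uint8), 100, 200) / 255.0  # H x W edge map
    edges = torch.from_numpy(edges).unsqueeze(0).float()             # 1 x H x W
    return torch.cat([one_hot, edges], dim=0)                        # (N_cls+1) x H x W

class PatchEmbed(nn.Module):
    """Non-overlapping 4x4 patch partition plus linear projection to 96 channels."""
    def __init__(self, in_ch: int, embed_dim: int = 96, patch: int = 4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, f_s: torch.Tensor) -> torch.Tensor:
        x = self.proj(f_s)                    # B x 96 x H/4 x W/4
        return x.flatten(2).transpose(1, 2)   # B x (H/4 * W/4) x 96, the sequence embedding
```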
The backbone network of the encoder comprises four combinations of the Swin Transformer module and the patch merging layer, which hierarchically extract fine-grained features at different scales. Specifically, the patch merging layer downsamples the feature map, changing its width, height and number of channels, while the Swin Transformer module extracts the low-level positional features. Each Swin Transformer module T_i is made of two consecutive Transformer blocks, each based on a Layer Normalization (LN) 42 layer, a Multi-head Self-Attention (MSA) layer, a Feed-forward Network (FFN) layer, and skip connections. The two successive Transformer blocks have the same structure, but their MSA layers use window-based MSA (W-MSA) and shifted-window-based MSA (SW-MSA), respectively, according to the window division scheme. In the T_i module, the computation of the Transformer block with a sequence embedding e_i as input and e_{i+1} as output can be written as:

ê_i = W-MSA(LN(e_i)) + e_i,
ē_i = FFN(LN(ê_i)) + ê_i,
ê_{i+1} = SW-MSA(LN(ē_i)) + ē_i,
e_{i+1} = FFN(LN(ê_{i+1})) + ê_{i+1}.

The Multi-head Self-Attention (MSA) has three attention heads computed in parallel and independently, which effectively reduces the computational workload. Each head computes scaled dot-product self-attention as:

ESA(Q, K, V) = Softmax(QK^T / √d_K + B)V,

where ESA(Q, K, V) represents the output obtained by integrating the attention over Q, K and V. The symbols Q, K and V denote the query, key and value vectors, respectively, which are obtained by linear mappings of the same input vector LN(e_i). d_K, B and T denote the scaling factor of the dot-product attention, the bias vector and the transpose operation, respectively.
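The scaled dot-product attention with a bias term can be expressed compactly in code. The sketch below is a generic implementation of the formula above, not the authors' code; tensor shapes are one common convention for windowed attention.

```python
import torch

def windowed_self_attention(q: torch.Tensor, k: torch.Tensor,
                            v: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with an additive bias B:
    ESA(Q, K, V) = Softmax(Q K^T / sqrt(d_K) + B) V.

    q, k, v: (num_windows * batch, heads, tokens, d_K); bias: (heads, tokens, tokens).
    """
    d_k = q.size(-1)
    attn = (q @ k.transpose(-2, -1)) / d_k ** 0.5   # Q K^T / sqrt(d_K)
    attn = (attn + bias).softmax(dim=-1)            # add bias, normalize over keys
    return attn @ v                                 # weighted sum of value vectors
```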

Model decoder
The decoder hierarchically fuses the multi-scale features extracted by the encoder from the input semantic layout map to recover a realistic synthesized image at the original resolution. The hierarchical features at the smallest scale are first passed through a CNN-based Conditional Residual Fusion (CRF) block to obtain high-level features with more semantic information. In order to be concatenated with the low-level features output by the encoder along the channel dimension, the obtained high-level features are up-sampled to 2× resolution using a patch expand layer. Subsequently, the result of concatenating the high- and low-level features is passed through a CRF block conditioned on the feature representation F_s ∈ R^{H×W×(N_cls+1)} before being fed to the Swin Transformer module. Compared with CNNs, this series of operations built on the Swin Transformer module as the backbone better captures the contextual feature mapping, which incorporates comprehensive information about the low-level positions and the high-level semantics. Finally, the feature map output by the Swin Transformer module passes through an Image Block to recover a naturalistic synthesized image with dimensions H × W × 3. The Image Block, as the final layer of the decoder, is composed of two CRF blocks, a 3 × 3 convolution with a padding size of 1, an upsampling function, and a tanh activation function. The CRF block in the decoder network is described in detail below.
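One decoding stage could be sketched as below. The `crf_block` and `swin_block` arguments are placeholders for the paper's modules, and the patch expand layer is approximated here by bilinear upsampling followed by a 1 × 1 convolution that halves the channel count, which is an assumption rather than the authors' design.

```python
import torch
import torch.nn as nn

class DecoderStageSketch(nn.Module):
    """One decoding stage: upsample the high-level features (patch expand),
    concatenate them with the encoder skip features, refine with a CRF block
    conditioned on F_s, then apply a Swin Transformer block."""
    def __init__(self, high_ch: int, crf_block: nn.Module, swin_block: nn.Module):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(high_ch, high_ch // 2, kernel_size=1),
        )
        self.crf = crf_block
        self.swin = swin_block

    def forward(self, high: torch.Tensor, skip: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        high = self.expand(high)                 # 2x-resolution high-level features
        fused = torch.cat([high, skip], dim=1)   # channel-wise concatenation with skip features
        fused = self.crf(fused, f_s)             # conditional residual fusion
        return self.swin(fused)                  # contextual refinement
```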

Conditional residual fusion block
Earlier approaches for semantic image synthesis primarily focused on extracting low-level features at multiple scales, while neglecting the information fusion between high- and low-level features. These methods use channel-wise concatenation to form thicker features, on which the recovery of a high-quality synthesized image then relies. Motivated by ResNet 36, we design a Conditional Residual Fusion (CRF) block to achieve more effective information fusion between high- and low-level features at multiple scales.
As shown in Fig. 3a, the CRF block is composed of two successive convolutional blocks, an Opposition-based Learning Mechanism (OLM), and a Weight Assignment Mechanism (WAM).
For each convolutional block, our CRF block not only expands the single 3 × 3 convolution layer by adding an LReLU 22 activation function and an SM-Norm layer to effectively prevent overfitting, but also introduces novel mechanisms to enhance feature extraction. Here SM-Norm stands for normalization based on semantic modulation, which effectively improves the convergence speed by reducing feature differences. Unlike Batch Normalization (BN) 43, SM-Norm normalizes the input activation conditioned on F_s ∈ R^{H×W×(N_cls+1)}; its structure is given in Fig. 3b. The input activation h ∈ R^{H×W×C} of the SM-Norm layer is first normalized, without learnable parameters, along the batch dimension using Synchronized Batch Normalization (SyncBN). Then, the condition F_s, as the other input of the SM-Norm layer, is passed through a combined Resize-Conv-LReLU block to extract the semantic features, and two 1 × 1 convolution layers are used to produce the normalization parameters γ ∈ R^{H×W×C} and β ∈ R^{H×W×C}, respectively. Finally, the produced γ and β are multiplied with and added to the normalized activation element-wise. Formally, the SM-Norm layer can be defined as:

SM-Norm(h) = γ ⊙ (h − E[h]) / √(Var[h] + ε) + β,

where E[h] and Var[h] represent the mean and variance of the input activation h, respectively, and ε denotes a very small positive number. The γ and β learned from the condition F_s modulate the normalized activation in scale and bias.
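A minimal PyTorch sketch of such an SM-Norm layer is shown below, assuming a SPADE-style structure consistent with the description above; the hidden width of 128, the 3 × 3 kernel in the shared Conv-LReLU block, and the LReLU slope of 0.2 are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SMNorm(nn.Module):
    """Semantic-modulation normalization: parameter-free normalization of h,
    then per-pixel scale (gamma) and bias (beta) predicted from the resized
    condition F_s via a shared Conv-LReLU block and two 1x1 convolutions."""
    def __init__(self, channels: int, cond_ch: int, hidden: int = 128):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels, affine=False)  # stands in for SyncBN
        self.shared = nn.Sequential(
            nn.Conv2d(cond_ch, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, h: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        normed = self.norm(h)                                          # (h - E[h]) / sqrt(Var[h] + eps)
        cond = F.interpolate(f_s, size=h.shape[-2:], mode="nearest")   # Resize to h's spatial size
        cond = self.shared(cond)                                       # Conv-LReLU feature extraction
        gamma, beta = self.to_gamma(cond), self.to_beta(cond)          # two 1x1 convolutions
        return normed * gamma + beta                                   # element-wise modulation
```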
In addition, the CRF block embeds two novel mechanisms, WAM and OLM, to enhance the hierarchical fusion of high- and low-level features. The WAM added on the identity shortcut connection can adaptively assign different attention weights to the input features, thus obtaining effective feature representations enhanced in both the channel and spatial dimensions. Since the semantic information in the condition F_s is sparse, the designed OLM is used to enhance the semantic feature information.
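As a rough sketch of how these pieces compose, the CRF block could be organised as follows. The exact ordering of convolution, activation and normalization inside each convolutional block, and the assumption that the OLM-augmented condition and the WAM output have matching channel counts, are simplifications made for illustration.

```python
import torch
import torch.nn as nn

class CRFBlockSketch(nn.Module):
    """Two Conv-LReLU-SM-Norm blocks on the main path, OLM applied to the
    condition F_s, and WAM on the shortcut connection; `sm_norm_cls`, `olm`
    and `wam` are placeholders for the components defined in this section."""
    def __init__(self, in_ch, out_ch, cond_ch, sm_norm_cls, olm: nn.Module, wam: nn.Module):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)
        self.norm1 = sm_norm_cls(out_ch, cond_ch)
        self.norm2 = sm_norm_cls(out_ch, cond_ch)
        self.olm = olm          # augments the semantic condition
        self.wam = wam          # weight assignment on the shortcut

    def forward(self, x: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        cond = self.olm(f_s)                           # opposition-augmented condition
        h = self.norm1(self.act(self.conv1(x)), cond)  # first Conv-LReLU-SM-Norm block
        h = self.norm2(self.act(self.conv2(h)), cond)  # second Conv-LReLU-SM-Norm block
        return h + self.wam(x)                         # residual fusion via the shortcut
```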

Opposition-based learning mechanism
The condition F_s obtained from the input semantic layout map is used to positively influence the normalization layer of the CRF block. Accordingly, it is necessary to augment the semantic information of the sparse F_s to improve normalization performance. Opposition-based learning 38, which originates in computational intelligence, has been demonstrated to be an efficient way to improve various optimization methods. To augment the semantic information in the condition F_s, we propose a novel Opposition-based Learning Mechanism (OLM) to modulate the normalization layer.
The condition F_s is obtained by the channel concatenation of the one-hot label M and the edge map E. The semantic information of the condition F_s is mainly derived from the one-hot label M, which is the output of performing a one-hot encoding operation on the semantic layout map. The semantic augmentation of the condition F_s through the opposition-based learning mechanism is therefore expressed in terms of the opposition-based one-hot label M*. The central idea underlying opposition-based learning is that the opposing side of a solution is possibly closer to the optimal solution. Let M = {m_1, m_2, ..., m_C} ∈ R^{H×W×C} be the one-hot label with multi-channel feature maps, where the symbols W, H, and C represent the width, height, and number of channels of the semantic condition, respectively, and m_i ∈ {a, b}^{H×W×1} denotes the feature map of the ith channel. Each pixel value identifies the object class to which it belongs: in m_i, only the pixels of the ith object class are 1, and the other pixels are 0. Following the definition of the opposite point in opposition-based learning, the opposition-based one-hot label M* is described as:

m_i* = a + b − m_i,

where, according to the definition of the one-hot label, the thresholds a and b are set to 0 and 1, respectively, and m_i* ∈ {0, 1}^{H×W×1} represents the feature map of the ith channel in the opposition-based one-hot label.
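The opposite point computation itself is a one-line operation, sketched below with a toy two-class example; the function name is ours, not the paper's.

```python
import torch

def opposition_based_label(one_hot: torch.Tensor, a: float = 0.0, b: float = 1.0) -> torch.Tensor:
    """Opposition-based one-hot label: each channel m_i is mapped to its
    opposite m_i* = a + b - m_i, which for a = 0, b = 1 simply flips the
    binary mask (background pixels become 1, foreground pixels become 0)."""
    return a + b - one_hot

# Example: a tiny 2x2 one-hot map and its opposite.
m = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
print(opposition_based_label(m))   # tensor([[0., 1.], [1., 0.]])
```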

Weight assignment mechanism
The distribution of redundant information in the input features is usually uneven. Therefore, we design a Weight Assignment Mechanism (WAM) embedded in the shortcut connections, which adaptively assigns different attention weights to the input features. The detailed structure of WAM is presented in Fig. 3c. WAM first extracts important semantic and positional features to filter out the redundant information in the input features; the extracted important features are then fused to output a more powerful feature representation. The extraction of important semantic features relies on learning semantic correlations along the channel dimension. Since the sub-feature maps of each channel contain different amounts of semantic information, assigning an attention weight to them extracts enhanced semantic features. The input feature f_in ∈ R^{H×W×C} of WAM is first passed through a 1 × 1 convolution layer and an LReLU activation to produce the intermediate feature t_ch ∈ R^{H×W×C'}. Then, t_ch is fed to an adaptive average pooling layer and a sigmoid activation function to obtain the channel attention weights W_ch ∈ R^{1×1×C'}, which reflect the importance of the sub-feature map of each channel. Finally, t_ch and W_ch are fused by element-wise multiplication to extract the attended semantic features f_sc ∈ R^{H×W×C'}. This process can be defined as:

t_ch = LReLU(Conv_{1×1}(f_in)),
W_ch = σ(AvgPool(t_ch)),
f_sc = t_ch ⊙ W_ch,

where σ denotes the sigmoid function and ⊙ denotes the element-wise multiplication used to fuse the feature information.
Extracting important positional features depends on learning spatial correlations between positions. Similarly, each pixel in the spatial dimension is assigned an attention weight, which helps to extract enhanced positional features. To learn the spatial relationships, the input feature f_in ∈ R^{H×W×C} first combines the outcomes of the max pooling and average pooling layers by channel concatenation to produce the feature t_sp ∈ R^{H×W×2}. To reduce the number of channels, t_sp is passed through a 3 × 3 convolution layer with padding size 1, resulting in the intermediate feature t*_sp ∈ R^{H×W×1}. After that, t*_sp is fed to a sigmoid activation function to obtain the spatial attention weights W_sp ∈ R^{H×W×1}, which reflect the importance of each pixel position in the spatial dimension. Finally, we multiply t*_sp and W_sp to extract the attended positional features f_sp ∈ R^{H×W×1}. Mathematically,

t_sp = Concat(MaxPool(f_in), AvgPool(f_in)),
t*_sp = Conv_{3×3}(t_sp),
W_sp = σ(t*_sp),
f_sp = t*_sp ⊙ W_sp.

The semantic and positional features are then fused by channel concatenation, and a convolution layer followed by an LReLU activation function is applied to generate the final output feature f_out ∈ R^{H×W×C'} of the WAM. A sketch of the full mechanism is given below.
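The following PyTorch sketch puts the channel branch, the spatial branch and the fusion step together. The "matrix multiplication" between t*_sp and W_sp is interpreted here as element-wise weighting, and the 3 × 3 fusion convolution and LReLU slope of 0.2 are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WAMSketch(nn.Module):
    """Weight assignment mechanism: channel attention on a 1x1-projected
    feature, CBAM-style spatial attention from channel-wise max/mean maps,
    then channel concatenation and a Conv-LReLU fusion."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.ch_proj = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1),
                                     nn.LeakyReLU(0.2, inplace=True))
        self.sp_conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        self.fuse = nn.Sequential(nn.Conv2d(out_ch + 1, out_ch, 3, padding=1),
                                  nn.LeakyReLU(0.2, inplace=True))

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        # Channel branch: attended semantic features.
        t_ch = self.ch_proj(f_in)                                    # B x C' x H x W
        w_ch = torch.sigmoid(F.adaptive_avg_pool2d(t_ch, 1))         # B x C' x 1 x 1
        f_sc = t_ch * w_ch
        # Spatial branch: attended positional features.
        t_sp = torch.cat([f_in.max(dim=1, keepdim=True).values,
                          f_in.mean(dim=1, keepdim=True)], dim=1)    # B x 2 x H x W
        t_sp_star = self.sp_conv(t_sp)                               # B x 1 x H x W
        f_sp = t_sp_star * torch.sigmoid(t_sp_star)                  # weight by W_sp
        # Fuse the two branches by channel concatenation.
        return self.fuse(torch.cat([f_sc, f_sp], dim=1))             # B x C' x H x W
```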

Discriminator and loss function
Similar to GauGAN 22, we use an efficient multi-scale discriminator, which is trained adversarially against our SC-UNet network (regarded as the generator). The multi-scale discriminator integrates multiple PatchGAN 33 discriminators with the same structure, each receiving an input image of a different size. To distinguish between synthesized and real images, the multi-scale discriminator first scales the input image to different sizes and feeds each scale into the corresponding PatchGAN discriminator. The output matrices of all PatchGAN discriminators are then reduced to their mean values, and the sum of these mean values is used as the basis for the real-or-fake decision. In our experiments, the multi-scale discriminator actually uses only two PatchGAN discriminators, whose input images have the original resolution and half of the original resolution, respectively. Table 1 shows the size changes of an original-resolution image after being fed into the PatchGAN discriminator, where each PatchGAN discriminator consists of 6 convolution blocks based on a convolution layer, instance normalization 44, and the LReLU activation function.
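A minimal sketch of this two-scale arrangement is shown below, assuming average pooling for the 2× downsampling; `make_patchgan` is a placeholder factory for the 6-block PatchGAN described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminatorSketch(nn.Module):
    """Two PatchGAN discriminators with the same architecture: one sees the
    full-resolution image, the other a 2x downsampled copy, and their patch
    prediction maps are averaged and summed into a single score."""
    def __init__(self, make_patchgan):
        super().__init__()
        self.d_full = make_patchgan()   # original resolution
        self.d_half = make_patchgan()   # half resolution

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        half = F.avg_pool2d(img, kernel_size=3, stride=2, padding=1)
        score_full = self.d_full(img).mean()    # mean of the patch prediction map
        score_half = self.d_half(half).mean()
        return score_full + score_half          # summed real/fake evidence
```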
The multi-scale discriminator is optimized using only the hinge-based adversarial loss L_D^hadv 45 to distinguish between synthesized and real images, whereas the generator is optimized with a weighted sum of multiple loss functions, including the hinge-based adversarial loss L_G^hadv, the feature matching loss L_fm 30, and the perceptual loss L_vgg 30. All of the above losses are integrated to define the overall optimization objectives of the discriminator and generator as

L_D = L_D^hadv,
L_G = γ_hadv L_G^hadv + γ_fm L_fm + γ_vgg L_vgg,

where γ_hadv, γ_fm, and γ_vgg denote the weights of the corresponding losses, set to γ_hadv = 1, γ_fm = 10 and γ_vgg = 10 in our experiments.
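The sketch below shows one common way to combine these terms, using the 1/10/10 weights reported above; the feature-matching and perceptual terms are assumed to be computed elsewhere and passed in, and the specific hinge formulation is the standard one, not necessarily identical to the authors' code.

```python
import torch
import torch.nn.functional as F

def hinge_d_loss(real_logits: torch.Tensor, fake_logits: torch.Tensor) -> torch.Tensor:
    """Hinge adversarial loss for the discriminator."""
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()

def generator_loss(fake_logits: torch.Tensor, l_fm: torch.Tensor, l_vgg: torch.Tensor,
                   w_hadv: float = 1.0, w_fm: float = 10.0, w_vgg: float = 10.0) -> torch.Tensor:
    """Weighted sum of the generator losses: hinge adversarial loss plus the
    precomputed feature-matching and perceptual (VGG) terms."""
    l_hadv = -fake_logits.mean()                     # hinge loss, generator side
    return w_hadv * l_hadv + w_fm * l_fm + w_vgg * l_vgg
```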
Figure 4 shows the variation of the discriminator and generator loss values with the number of iterations during training on the Cityscapes dataset, where the black and blue curves indicate the total loss and its correlated losses, respectively. We can observe that the correlated losses of both the discriminator and the generator converge smoothly as the number of iterations increases. Moreover, the total losses are positively correlated with their correlated losses. This indicates that our model mitigates the possibility of over-fitting during training owing to the reasonable design of the loss functions.

Datasets
To validate the superiority of the proposed SC-UNet approach, we carry out extensive experiments on three public datasets: Cityscapes 14, ADE20K 15, and COCO-Stuff 16.

Evaluation metric
Following previous work, we adopt the Fréchet Inception Distance (FID) 65 as the image generation score to assess the perceptual quality and diversity of the synthesized images. Moreover, we also use the mean Intersection over Union (mIoU) 29 and the pixel Accuracy (Acc) 22 as semantic segmentation scores to measure segmentation accuracy. We use state-of-the-art segmentation networks for each dataset: DRN-D-105 66 for Cityscapes, UperNet101 67 for ADE20K, and DeepLabV2 68 for COCO-Stuff.
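For readers wishing to reproduce the FID measurement, the snippet below shows one possible way to compute it with the `torchmetrics` implementation; this library choice, the random toy batches, and the feature dimension of 2048 are our assumptions, not details taken from the paper.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Toy illustration: real_batch and fake_batch are uint8 image tensors (N, 3, H, W);
# in practice they would be the real validation images and the synthesized images.
fid = FrechetInceptionDistance(feature=2048)
real_batch = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)
fake_batch = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)
fid.update(real_batch, real=True)    # accumulate Inception features of real images
fid.update(fake_batch, real=False)   # accumulate Inception features of synthesized images
print(f"FID: {fid.compute():.2f}")   # lower is better
```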

Implementation details
We use the ADAM optimizer 69 with β_1 = 0 and β_2 = 0.999 to train our models on a single RTX 3090Ti GPU. The learning rates of the generator and the discriminator are set to lr/2 and lr × 2, respectively, where the initial learning rate lr is 0.0002. To find the global optimum more accurately, the learning rate is dynamically changed during training. Formally, the dynamic learning rate is

lr_t = lr for t ≤ m, and lr_t = lr · (n − t)/(n − m) for t > m,

where t is the current epoch, n is the total number of training epochs and m = n/2. According to the above formula, the learning rate decays linearly to zero after m epochs. Furthermore, we train for 200 epochs on the Cityscapes and ADE20K datasets to find the optimal solution, and for 100 epochs on the COCO-Stuff dataset due to its large number of training images.
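A minimal sketch of this schedule using PyTorch's `LambdaLR` is given below; the stand-in `nn.Linear` module replaces the actual generator, and the decay formula is our reading of the description above (constant for the first n/2 epochs, then linear decay to zero).

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

n_epochs, lr = 200, 0.0002
m = n_epochs // 2                         # decay starts at the midpoint

generator = torch.nn.Linear(8, 8)         # stand-in for the actual generator network
g_opt = Adam(generator.parameters(), lr=lr / 2, betas=(0.0, 0.999))

def linear_decay(epoch: int) -> float:
    # Constant lr for the first m epochs, then linear decay to zero at epoch n.
    if epoch < m:
        return 1.0
    return max(0.0, (n_epochs - epoch) / (n_epochs - m))

scheduler = LambdaLR(g_opt, lr_lambda=linear_decay)
for epoch in range(n_epochs):
    # ... one epoch of adversarial training would go here ...
    scheduler.step()
```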

Quantitative results
Table 2 gives the quantitative comparison of our method with the supervised baselines in terms of the image generation score (FID) and semantic segmentation scores (mIoU and Acc) on the Cityscapes, ADE20K and COCO-Stuff datasets. The results show that our method obtains a lower generation score (FID) than the previous supervised baselines on the validation set of each dataset; the lower the generation score, the higher the fidelity and diversity of the synthesized images produced by the network. In addition, our proposed method achieves higher semantic segmentation scores (mIoU and Acc) than previous state-of-the-art models on the Cityscapes dataset, which has a small data volume and a relatively homogeneous distribution of semantic classes. To improve the semantic alignment with the input layout map, the recent OASIS 32 and SAFM 64 use the idea of semantic segmentation to improve the discriminator network. Although OASIS and SAFM obtain higher Acc and mIoU scores than our approach, this slight improvement only appears on the ADE20K and COCO-Stuff datasets, which have large data volumes and unbalanced semantic class distributions. Therefore, the quantitative comparison with the baselines confirms the superiority of our proposed network model in semantic image synthesis. Furthermore, the quantitative comparison of our method with the unsupervised baselines is reported in Table 3. Compared to the unsupervised baselines, our supervised model achieves a better image generation score (FID) and semantic segmentation score (mIoU) on all three public datasets. Our improvement in the semantic segmentation score is particularly significant, mainly due to the supervised learning under the input semantic layouts. Moreover, the large margin of improvement indicates that the supervised strategy is more beneficial for the semantic image synthesis task.

Table 2. Quantitative comparison of our method with the supervised baselines in image generation score (FID) and semantic segmentation scores (mIoU and Acc) on all the datasets. "n/a" indicates that the visual result is not provided on the official website of the model. The boldface denotes the best performance.

Human perceptual evaluation
To further validate that our method performs better in semantic image synthesis, we carry out a human perceptual evaluation 22,23,56 to compare our approach with several baseline methods, namely GauGAN 22, DAGAN 63, OASIS 32, and SAFM 64, on the Cityscapes, ADE20K and COCO-Stuff datasets. Specifically, we first randomly select 200 semantic layout maps from the validation set of each dataset and synthesize images with our method and each competing method. We then randomly recruit 100 AMT workers to conduct the evaluation, where AMT (Amazon Mechanical Turk 72) is a crowdsourcing marketplace that allows researchers to outsource their tasks to distributed workers who volunteer to perform the tasks for pay. This experiment was carried out in accordance with the relevant guidelines and regulations, was approved by the AMT institution, and informed consent was obtained from all AMT workers. In each trial, workers are required to select the perceptually more photo-realistic image from two groups of synthesized images, produced by our method and a competing method, respectively. Finally, we use conventional statistical operations to obtain the average probability that the images synthesized by our method are selected by the workers on each dataset; the results are shown in Table 5. The comparison results of the human perceptual evaluation reaffirm that the images synthesized by our method are more acceptable in terms of quality.

Traditional statistical evaluation
To further emphasize the efficacy of our method in semantic image synthesis tasks, we employ conventional statistical assessment techniques, including the F-statistic 73, the p-value 74, and Analysis of Variance (ANOVA) 75. As shown in Table 6, our approach yields a lower F-statistic of 82.629 and a higher p-value of 5.2108. This suggests that, compared to existing methods such as GauGAN 22, OASIS 32, and SAFM 64, our method ensures minimal disparities among the synthesized image samples. Additionally, the ANOVA results indicate no discernible difference between the synthesized image set and the authentic image set, further substantiating the robustness of our approach.

Qualitative results
Figures 6, 7 and 8 give the qualitative comparison of our model with the competing methods 22,64 on the Cityscapes, ADE20K and COCO-Stuff datasets. We find that the images synthesized by our model not only have better perceptual quality, but are also closer to the ground-truth images in overall color and texture distribution. Note that the complex real-world scenes synthesized by our method show significant improvement on the Cityscapes dataset. Although SAFM 64 is the current state-of-the-art method, the images it synthesizes are too bright and even show color distortion. Compared with these methods, our proposed approach produces photo-realistic images while respecting the input semantic layout map, and can generate challenging scenes with high image fidelity.

Table 3. Quantitative comparison of our method with the unsupervised baselines in image generation score (FID) and semantic segmentation score (mIoU) on three public datasets. "↓" means lower is better, "↑" means higher is better, and "+" represents the amount of improvement. Significant values are in bold.

Mean power spectrogram
We also calculate the mean power spectrograms of the images synthesized by our method and the competing methods 22,64 on the Cityscapes dataset to compare the qualitative results from a signal perspective. The similarity matching result of the average power spectrum is shown in Fig. 9. It is intuitively obvious that, in terms of color, texture, and shape, the two power spectrograms drawn from the ground-truth images and from the images synthesized by our method are the most similar. By contrast, the mean power spectrograms drawn from the images synthesized by the competing methods show distinct spikes, and some even present pseudo-local maxima that are not observed in the average power spectrogram of the ground-truth images. These differences can be clearly observed in the comparison of the zoomed-in areas, which allows a more detailed examination of the discrepancies. Moreover, we use the ORB 70 and Histogram 71 algorithms to calculate the similarity between the ground-truth images and the images synthesized by our method; the results are shown in Table 4, where higher values indicate greater similarity. The similarity matching results calculated from the mean power spectrograms also validate that the images synthesized by our method are more photo-realistic in their details.
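The mean power spectrum can be computed with a simple FFT-based routine such as the one sketched below; the paper does not specify its exact preprocessing (e.g. windowing or log scaling), so this is only one straightforward interpretation.

```python
import numpy as np

def mean_power_spectrum(images: np.ndarray) -> np.ndarray:
    """Average 2-D power spectrum over a set of grayscale images.

    `images` has shape (N, H, W) with float values; each image is transformed
    with a 2-D FFT, shifted so the zero frequency sits in the centre, and the
    squared magnitudes are averaged over the set."""
    spectra = np.abs(np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1))) ** 2
    return spectra.mean(axis=0)
```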

Ablation on important components in SC-UNet
To verify the effectiveness of each component in our SC-UNet method, we compare SC-UNet with four variants on the Cityscapes dataset. These variants are obtained by gradually replacing or eliminating each component of the framework, with our method as the benchmark. Specifically: (i) "Ours" denotes our proposed SC-UNet model, which is used as the benchmark for the ablation experiments. (ii) "w/o SwinT" denotes that the Swin Transformer (SwinT) module is replaced by a traditional convolutional block to construct a pure CNN-based UNet-like network. (iii) "w/o CRF" denotes that the designed Conditional Residual Fusion (CRF) block is replaced by a conventional residual block to fuse the high- and low-level feature information. (iv) "w/o OLM" does not use the designed Opposition-based Learning Mechanism (OLM) to enhance the semantic feature information. (v) "w/o WAM" does not use the Weight Assignment Mechanism (WAM) to allocate attention weights in the channel and spatial dimensions. The results of the ablation study are shown in Table 7. From the pair-wise comparison between our SC-UNet method and the other variants, we can observe that using SwinT as the backbone network achieves better synthesis performance than a pure CNN. The study also validates the effectiveness of the CRF block, OLM and WAM components in SC-UNet for high-quality image synthesis based on semantic layout maps. Although the "w/o OLM" variant is slightly lower than ours in terms of the image synthesis score (FID), the images synthesized by our full SC-UNet approach perform better in terms of the two semantic segmentation scores, mIoU and Acc.

Ablation on discriminator and loss function
Our SC-UNet approach employs adversarial training based on a multi-scale discriminator, which improves synthesis performance. To highlight the superiority of the multi-scale discriminator, we consider two alternative discriminators: a single-scale Markovian discriminator 23 (denoted "PatchGAN") and a feature pyramid semantic embedding discriminator 16 (denoted "FPSE-D"). As shown in Table 8, our SC-UNet method with the aid of the multi-scale discriminator not only performs well in terms of the semantic segmentation scores (mIoU and Acc), but also excels in terms of the image generation score (FID).
To explore the effect of each loss function on semantic image synthesis, we use the combination of the three loss functions as a baseline and replace or eliminate one of them in each comparison. Specifically, "w/o L_hadv" denotes that the hinge-based adversarial loss is replaced by the conditional adversarial loss, while "w/o L_fm" and "w/o L_vgg" denote training without the feature matching loss and without the perceptual loss, respectively. As shown in Table 8, the hinge-based adversarial loss has a more obvious advantage over the conditional adversarial loss in semantic image synthesis. The fundamental difference between the feature matching loss and the perceptual loss lies in the image feature extraction network: the former relies on discriminator features that change dynamically during training, whereas the latter relies on a fixed, static network. As the image resolution increases, so does the amount of detailed information it contains, requiring the model to possess a stronger learning capacity for effective processing.

Conclusion
In this paper, we propose a new semantic image synthesis method (SC-UNet), which transforms a given semantic layout map into synthesized images with visual fidelity and semantic alignment. By building a U-shaped network with the Swin Transformer module as the basic unit, our SC-UNet model is able to decode more photo-realistic images from the hierarchical feature representations encoded from the input semantic layout maps. Furthermore, skip connections are added to the U-shaped network to combine the high- and low-level features from both sides: to compensate for the loss of semantic information resulting from down-sampling, the low-level features are copied to the high-level features through the skip connections. An effective Conditional Residual Fusion (CRF) block is designed to obtain the important semantic and positional information from the concatenation of the high- and low-level features, enabling higher-quality image synthesis with lower memory usage. The performance improvement of the CRF block is mainly attributed to the embedding of an opposition-based learning mechanism and a weight assignment mechanism: the former effectively enhances the semantic feature information, while the latter dynamically assigns attention weights in the channel and spatial dimensions. Experimental results show that our proposed method outperforms state-of-the-art methods on three benchmark datasets, both qualitatively and quantitatively. Moreover, our SC-UNet method can support widespread applications, such as content generation and image editing, by adding, deleting, or editing objects; two example applications based on the SC-UNet method are shown in Figs. 10 and 11. By editing the semantic layout map, an ordinary user is also able to interactively manipulate a real image. As can be seen from the results of the semantic control synthesis, our approach generates realistic and semantically aligned images.

Figure 1 .
Figure 1. Visual comparison of the synthesized images produced by our method and other baseline approaches. Key differences are marked with boxes on the synthesized images and shown magnified below each image. 'Hist-std' indicates the histogram's standard deviation, where lower values indicate more balanced colors in the synthesized image.
and a Canny edge extraction operation. Among them, the one-hot encoding operation maps each object class in the semantic layout map, which is discrete in nature, into a different channel, thus acquiring a more effective multi-channel feature representation. The Canny edge extraction operation quickly and accurately extracts the positional information of the object edges from the input semantic layout map M by performing several steps, such as Gaussian blurring, gradient computation, non-maximum suppression, double-threshold detection and edge tracking, thus acquiring a feature representation for further processing by the encoder. The two feature representations resulting from this semantic augmentation are fused into a fresh feature representation F_s ∈ R^{H×W×(N_cls+1)} by channel concatenation, where N_cls + 1 is equal to the total number of object classes in a given dataset plus 1. The new feature representation F_s is then fed to a patch embedding layer, thus obtaining a sequence embedding e_1 ∈ R^{(H/4)×(W/4)×96}.

Figure 3 .
Figure 3. Structure of the CRF block in the SC-UNet method.(a) The CRF block represents the conditional residual fusion block.(b) SM-Norm denotes the normalization based on semantic modulation.(c) WAM stands for the weighting assignment mechanism.
The Cityscapes dataset includes 35 semantic classes, with 2975 training and 500 validation images. The ADE20K dataset has 150 semantic classes, with 20,210 training and 2000 validation images. The COCO-Stuff dataset comprises 182 semantic classes, with 118,287 training and 5000 validation images. The distribution of the number of images for each semantic class on the three datasets is displayed in Fig. 5; as can be seen, the distribution of semantic categories is imbalanced. In addition, we adjust the resolutions of the images in the Cityscapes, ADE20K and COCO-Stuff datasets to 512 × 256, 256 × 256 and 256 × 256, respectively, so as to verify the robustness of the proposed SC-UNet under different image resolutions.

Figure 4 .
Figure 4. Variation trend of the discriminator and generator loss values with the number of iterations during training on the Cityscapes dataset. The black and blue curves indicate the total loss and its correlated losses, respectively.

Figure 5 .
Figure 5. Distribution in the number of images corresponding to each semantic class on the public datasets of Cityscapes, ADE20K and COCO-Stuff.

Figure 6 .
Figure 6. Qualitative comparison of our SC-UNet method with the competing methods on the Cityscapes dataset. Our method generates images with better visual quality and higher-fidelity details.

Figure 7 .
Figure 7. Qualitative comparison results on the ADE20K dataset. Despite diverse semantic classes and small textures, our approach still ensures high fidelity.

Figure 8 .
Figure 8. Qualitative comparison results on the COCO-Stuff dataset. The comparison results show that the images synthesized by our model have higher quality than those of GauGAN and SAFM.

Figure 9 .
Figure 9. Mean power spectra over the Cityscapes dataset. Key differences are marked with boxes on the mean power spectra and shown magnified below the image. Magnitude is on a linear scale.

Figure 10 .
Figure 10.An example application of semantic control synthesis based on our SC-UNet method.

Figure 11 .
Figure 11.An example application of multi-style image synthesis based on our proposed SC-UNet method.z 1 , z 2 and z 3 denote three different random noise tensors, respectively.The symbols µ and δ represent the mean and variance of the noise sampling, respectively.

Table 1 .
The size change of an original resolution image after being fed into the PatchGAN discriminator.ConvLayer i stands for the convolution block in the ith layer.

Table 4 .
Similarity matching result of the average power spectrogram.Significant values are in bold.

Table 5 .
Human perceptual evaluation. These values reflect the average probability of our method being preferred by the workers compared to the baseline methods in image synthesis.

Table 6 .
Traditional statistical evaluation.These values reflect the difference between the synthesised image and the real image.Significant values are in bold.

Table 7 .
Ablation studies on important components in SC-UNet.Bold denotes the best performance.

Table 8 .
Ablation studies on discriminator and loss function. Significant values are in bold.

Compared to the constraint of a single loss, the combined effect of two losses improves the quantitative quality of image synthesis.

Ablation on various image sizes
To explore the impact of image size on synthesis performance, we conducted an ablation study on different image sizes in Table 9. First, images from the Cityscapes dataset were resized to 1024 × 2048, 512 × 1024, and 256 × 512, respectively. Subsequently, we conducted model training with the image size as the only controlled variable. The results in the table demonstrate that lower resolutions correspond to better synthesis performance.

Table 9 .
Ablation studies on various image sizes.Significant values are in bold.