Single-Channel Blind Image Separation Based on Transformer-Guided GAN

Blind source separation (BSS) has been a great challenge in the field of signal processing due to the unknown distribution of the source signal and the mixing matrix. Traditional methods based on statistics and information theory use prior information such as source distribution independence, non-Gaussianity, sparsity, etc. to solve this problem. Generative adversarial networks (GANs) learn source distributions through games without being constrained by statistical properties. However, the current blind image separation methods based on GANs ignores the reconstruction of the structure and details of the separated image, resulting in residual interference source information in the generated results. This paper proposes a Transformer-guided GAN guided by an attention mechanism. Through the adversarial training of the generator and the discriminator, U-shaped Network (UNet) is used to fuse the convolutional layer features to reconstruct the structure of the separated image, and Transformer is used to calculate the position attention and guide the detailed information. We validate our method with quantitative experiments, showing that it outperforms previous blind image separation algorithms in terms of PSNR and SSIM.


Introduction
Image interference or pollution is very common in practical situations. In the same way that ears can distinguish the speaker from a noisy soundscape at a cocktail party [1], eyes can similarly distinguish between different objects in a complex visual situation. Blind image separation (BIS) obtains the source image according to the mixing image without knowledge of source distribution and mixing channel, which is what BSS does in computer vision. Single-channel blind image separation (SCBIS) is an extremely underdetermined BSS problem because the observed images are only available from a single receiver channel. The difficulty with the SCBIS problem lies in the high-quality recovery of multiple sources from the observed single image without knowledge of the transmission channel characteristics.
Independent component analysis (ICA) [2] assumes that source signals are independent of one another and non-Gaussian distributed. In robust principal component analysis (RPCA) [3], the dimensionality of the observed signals is reduced to separate the source signals. Non-negative matrix decomposition (NMF) [4] trains iteratively to decompose the mixture signal into two non-negative matrices: the weight matrix and the feature matrix. To acquire additional constraint information, the above methods make strong assumptions about the non-Gaussianity, independence, and non-negativity of the distribution of the source signal, which are not guaranteed in the image samples and result in a low-quality separated image.
GANs [5] are based on the idea of a zero-sum game, in which the generator and discriminator are treated as two faces of the game, and the generator and discriminator are continually optimized during the training process until the generated samples are able to successfully fool the discriminator. Combining the adversarial generative structures with other networks leads to a significant improvement in the quality of the images generated. By mapping information from random distributions to target domains via adversarial learning, GANs offer a solution to the problem of less priori in BSS.
Ref. [6] applied neurons to GANs and achieved one-to-many single-channel blind source separation by adjusting the number of neurons to map random signals to all domains of the source signal at the same time, but neuronal networks caused dimensional disasters when processing image data while losing spatial information. Ref. [7] trained the generators for each source signal, which mapped the randomness distribution onto the source signal domain, and then used maximum likelihood estimation to minimize the sum distance between the generated signals and the given mixture signal in order to separate the source signal. The two-step separation scheme caused a deviation between the source and generated signals. Ref. [8] used CYCLEGAN for separating and reconstructing two random mixture signals. This is equivalent to leaving one field of the generated separated signal unchanged and the other field swapped twice, such that the reconstructed mixed signal is identically distributed relative to the original mixture signal. The deep convolution networks resulted in the loss of valuable detail information. The above GANs focused only on mapping the mix to the source, ignoring the loss of detail caused by positional correlation, resulting in differences in texture representation between the separated image and the source.
Ref. [9] proposed the UNet structure, which links the convolution neural network into a U-shaped structure to gain context information, thereby reducing the valid feature loss. The Transformer framework proposed in [10] computes the association between different feature positions to obtain attentional information. Therefore, we use the generative adversarial network combined with the Transformer structure and UNet to guide the reconstruction of the separated image on the overall structure and detailed information. UNet [9], as an applicable generative model, does a good job of reconstructing the target image by embedding the input image feature. Some SCBIS work has used GAN structures combined with UNet models. Ref. [11] treated the source and observed signals as information in different areas and solved the multisource single-channel blind split problem by combining multiple UNet-model-based DualGAN networks to train PDualGAN, but the detail representation of the source signal was not taken into account in the generation process. Ref. [12] combined the Attention module and UNet in the AGAN to achieve full-detail image generation; however, the attention model only focuses on the importance of single-pixel value while ignoring the interrelationship of pixels at different locations, leading to excessive attention on interference information in the input mix image. Ref. [13] cascaded two layers of GANs based on UNet, UGAN, and PAGAN to implement SCBIS in the case of a few samples, using UGAN for the generation of mixture samples and PAGAN for the separation of images. The resulting cascade GAN focuses more on the insufficient training samples than on the detailed information between pixels at different positions of the image.

Transformer
Transformer [10] was originally proposed to compute the correlation between the input word vectors in machine translation. Refs. [14,15] introduced to the field of view by improving Transformer. Ref. [16] proposed a simple and extensible ViT model that splits images into a fixed number of patches and presented a multi-head self-attention mechanism for extracting feature information containing relational information. Transformer has shown excellent performance in various image detail processing. Refs. [17,18] used Transformer in an image restoration architecture to improve detail restoration by capturing pixel interaction relationships over long distances; Ref. [19] proposed UTNet, integrating Transformer architecture into convolutional neural networks to enhance finer detail segmentation through local information; Ref. [20] proposed CNN-Transformer for facial super-resolution tasks to improve fine facial details and enhance the fidelity and naturalness of reconstructed facial images.
Inspired by the detail-processing ability of Transformer and the extraction of global context features of UNetGAN, this paper proposes a Transformer-guided GAN to resolve the loss of valid information and to balance between the part and the whole. The GAN structure incorporates UNet for source reconstruction and a Transformer module for positional encoding and attention information extraction, improving separation accuracy and retaining detail representation.

Method
The overall structure of the Transformer-guided GAN is shown in Figure 1. The generator consists of a combination of a U-shaped connected convolution network with a Transformer structure. The discriminator is formed by the convolution neural network with the activation function only. For a given mix image x ∈ R W×H×C , where W, H, C are its width, height, and number of channels, respectively, the goal of the generator is to reconstruct a high-quality separated imageŝ ∈ R W×H×C that is as similar as possible to the source image s ∈ R W×H×C . The objective of the discriminator is to distinguish between s andŝ. Through continuous adversarial learning of the generator and discriminator, the generator eventually reconstructs a plausibleŝ for any given x. Figure 1. The overall structure of Transformer-guided GAN. The generator is used to reconstruct the separated image, and the discriminator to distinguish its authenticity.

Structure of the Generator
The structure is shown in detail in Figure 2. UNet consists of an M-layer convolution neural network connected by a U-shaped structure. Each layer uses a convolution kernel of size k and takes a step size of t for downsampling feature extraction of the input image, as shown in Equation (1): where f Conv m ∈ R W m ×H m ×C m denotes the result of m-th feature extraction; W m = W//2 t ; and H m = H//2 t . With the same setup as the convolution module, the deconvolution upsamples the m-th layer fusion feature f m back to the previous layer, as in Equation (2), where f m ∈ R W m ×H m ×2C m is the m-th layer fusion-feature concatenated convolution features and deconvolution features, as shown in Equation (3).

Transformer Scheme
Transformer connects the two ends of UNet. The structure we used in this paper, as shown in Figure 2, has a linear projection layer and an L-Layer Encoder.
As shown in Figure 3, after linear mapping, the convolution maps, the output of the UNet M-th convolution f Conv M ∈ R W m ×H m ×2C m , is projected into N D-dimension patches, and then the position of each patch is encoded, as shown in Equation (4), where f n ∈ R N×(W m ·H m ) denotes the 2D patches of f Conv M and N = C M ; E ∈ R (W m ·H m )×D denotes the patch projection and D = W m ·H m ; and E pos ∈ R N×D is the position embedding of each patch. The Encoder module first performs layer normalization on the patch features and then obtains the attention information through the multi-head attention mechanism as shown in Equation (5), where LN is a de-correlation operation based on the mean and variance calculated in the channel dimension from the patch features of the mix image in the input batch, as shown in Equation (5), where γ, β are trainable parameters. MHA is the key part for Transformer to explore the attention information, as shown in Figure 4. It first obtains the Q, K, V matrixes of the patch vectors by linear mapping as shown in Equation (7), where W Q , W K , and W V ∈ R N×D are trainable weight parameters. Then each matrix is divided and recombines them into h head vectors as shown in Equation (8), where and D h = D//h are trainable weight parameters. The self-attention operation within each head vector calculates the correlation by the dot product of q and transposes k to get the weight corresponding to v, as shown in Equation (9), where denotes the dot product operation, √ d k is used to scale the dot into the result to make the gradient of the model more stable, and softmax is used calculate the attention score by normalization, as shown in Equation (10). Finally, the output of MHA is obtained by concatenating the attention results of the obtained individual head vectors, as shown in Equation (11).

Discriminator
The design of the discriminator is shown in Figure 1. The M-layer convolution network acquires the image features, and then the feature values are normalized between −1 and 1 by the tanh activation function. If the value obtained is greater than 0, it is determined that the reconstructed separated image is true; otherwise, it is false, as shown in Equation (12), value = tanh(Conv M (ŝ)), (12) where the parameters of the convolution network are set to be the same as those of the generator. The convolution kernel size is k, step size is t, and the tanh activation function is an odd function centered at 0, as shown in Equation (13), which improves the convergence speed compared to the sigmoid activation function.

Loss Function
The generator and discriminator of the Transformer-guided GAN play off of each other in training; the generator optimizes the network parameters to reconstruct a reconstructed separated image that can fool the discriminator, and the discriminator improves the ability to judge true and false by adjusting the parameters. The objective function of the Transformer-guided GAN is shown in Equation (14).
L GAN guarantees that the reconstructed image given by the generator matches the distribution of the source image, while the SCBIS task requires the separation of the corresponding source image according to the given mix image x, so to obtain the separated image corresponding to the mix image, we add a reconstruction loss L Rec (G) to the generator, as shown in Equation (15), where L Rec (G) calculates the mean squared error between the separated image and the corresponding source image, as shown in Equation (16).

The MNIST Dataset
The MNIST dataset [21] is used to evaluate the proposed Transformer-guided GAN. To meet the experimental requirements, we select 1000 handwritten digital images randomly from each of the categories 0 to 9 of the original dataset, then classify them into two categories according to the rules of 0 to 4 and 5 to 9 as the source image sets of the training set, and during the training process, each mix image is obtained by randomly selecting one image from each source image set and mixing them up. The same approach is used for the test set, and 200 images from each category of the original test set 0 to 9 are selected and divided into the two categories of 0 to 4 and 5 to 9. In total, 1000 test images are obtained by mixing the images in the two source image sets one-to-one.

Bags-Shoes Dataset
For verifying the performance of the proposed model on more realistic images, we use the shoes dataset [22] and the bags dataset [23] to form the bags-shoes mix dataset. In this dataset, 5000 images are used as the source image set for the training set, and the remaining 1000 images are used as the source image set for the test set. As with the MNIST dataset, the mix images used in the training are obtained by mixing images selected randomly from the two source image sets respectively in the training set. Similarly, the test set of mix images is obtained by one-to-one mixing of images from the two source image sets in the test set.

Experimental Settings
The Transformer-guided GAN model is implemented based on TensorFlow version 2.1. For the network parameters mentioned above, we have made the following settings. We resize mix images and source images to 64 × 64 × 3 and normalize them before the experiment. The generator uses a 4-layer U-shaped structure to connect the convolution and deconvolution networks with 16, 64, 256, and 768 channels per layer, respectively; the size of the convolution kernel is 4, the step size is 2, and the zero padding is required during convolution. In Transformer, the linear projection maps the convolution features into 16 patches of 768 dimensions using 3 Encoder modules and 4 head vectors. The Transformer-guided GAN was trained with a learning rate of 1 × 10 −4 Adam optimizer, with 50 iterations. To ensure the fairness of the experiments, the same training parameter values were used for all experiments.

Results
We compare quantitatively and qualitatively the proposed Transformer-guided GAN with the frontier algorithms FastICA [2], NMF [4], Neural egg separation (NES) [24], AGAN [12], and PDualGAN [11] to indicate its superiority on SCBIS tasks. To ensure a fair comparison, the AGAN has the same UNet-GAN structure parameters as configured in this paper, except that the channel reduction factor k is set to 8 in the self-attentive module; the PDualGAN consists of 2 DualGANs, each using 2 UNet-GANs in the same configuration as this paper to implement mix to source mapping. In addition, to demonstrate the effectiveness of the combination of Transformer structure and UNet for the model, we designed ablation experiments. 6 show the qualitative results for the MNIST and bags-shoes datasets in the noiseless and noisy cases, respectively, from which it is clear that in any case, for both datasets, the quality of the separated images from our model is much higher than that of the other models.

Figures 5 and
Qualitative results from the horizontal comparison of methods on the same dataset show that the FastICA algorithm barely separates the source images, which affects its metrics. The separation results given by the NMF algorithm show that the source image information is enhanced, but its loss in color, brightness, etc. leads to its unsatisfactory metrics. NES reduces the interference source information while leading to an increase in the disparity between the separated image and the source image in terms of structure and texture. The separated images of the AGAN from the MNIST dataset, where the source information is relatively simple, retain the information of the source images while eliminating the interference, although there are negligible artifacts. However, on the information-rich bags-shoes dataset, this interference cannot be neglected. The separation result of PDualGAN is similar to that of the AGAN but slightly inferior in the processing of more detailed information. In contrast, our proposed model is superior in terms of the overall structure, brightness, color, and detailed information processing.
From a longitudinal comparison of the qualitative results of the same method in different noise situations, we can see that the performance of all methods mentioned is significantly worse in noisy environments compared to noise-free results. The MNIST dataset, with its simple graphical structure, is less affected by noise than the bags-shoes dataset, with its rich detail. In addition to the traditional FastICA and NMF algorithms, methods using the GAN structure can filter the environmental noise and thus perform better in noisy situations. Furthermore, our proposed method gives better quality results than other methods in the noisy case, showing greater noise immunity.

Quantitative Results
We used both PSNR and SSIM metrics to quantitatively compare the separation results. PSNR [25] calculates the peak signal-to-noise ratio between the separated image and the source image, and the higher the ratio, the higher the quality of the separated image, as shown in Equation (17), where MAX I is the image pixel maximum value, and MSE represents the mean variance between source image and separated image. SSIM [25] is used to evaluate the similarity between the source image and the separated image, as shown in Equation (18), where µ s , µˆs represent the mean values of s andŝ, respectively; σ 2 s , σ 2 s are the variance of s andŝ, respectively; σ sŝ is the covariance of s andŝ ; and c 1 , c 2 are constants designed to maintain stability, such that c 1 = (k 1 L) 2 , c 2 = (k 2 L) 2 , where L represents the dynamic range of pixel values, and k 1 , k 2 are empirical values, where k 1 = 0.01, and k 2 = 0.03. SSIM takes a value between 0 and 1. The closer it is to 1, the more similar the separated image is to the source image. Table 1 shows that the quantitative calculations are consistent with the results of the qualitative analysis, with our model achieving higher PSNR and SSIM metrics for the separated images than the other methods for the two datasets and their noisy cases. Horizontal comparisons of quantitative calculations show that our method outperforms the state-of-the-art AGAN by 7.7 dB PSNR and 0.07 SSIM on the simple MNIST dataset, while its PSNR in the noisy case is 6.3 dB higher than that of the AGAN, and its SSIM is 0.08 higher than that of the AGAN. On the bags-shoes dataset, with its more complex structure and details, the PSNR and SSIM of the Transformer-guided GAN are 7.7 dB and 0.06 higher, respectively, than those of the AGAN, and its PSNR and SSIM in noisy cases are 3.2 dB and 0.04 higher, respectively, than those of PDualGAN. The main reason is that, compared to the AGAN and PDualGAN, the Transformer-guided GAN can extract more detailed features through positional encoding and multi-headed attention.
A longitudinal comparison of the quantitative results shows that the PSNR and SSIM decreased in the noisy environment for all the methods mentioned. This is because the noise affects the extraction of source information by the models. The PSNR and SSIM of the AGAN in the noiseless case are 0.9 dB and 0.01 higher than those of the PDualGAN on the MNIST dataset, and similarly 0.9 dB and 0.01 higher on the bags-shoes dataset. This is due to the fact that the self-attention mechanism in the AGAN can capture detailed information. In the noisy environment, the AGAN and PDualGAN are close in qualitative results on the MNIST dataset, but on the bags-shoes dataset, PDualGAN is 1.3 dB and 0.03 higher than AGAN in PSNR and SSIM, respectively. This is because convolutional networks are more noise resistant than attention mechanisms for linear computing. Our model replaces self-attention with Transformer in the AGAN, and in combination with UNet, the PSNR and SSIM drop by 1.8 dB and 0.02 for the noisy MNIST dataset, but are still higher than the metrics of other methods in the noiseless environment, and drop by 7.8 dB and 0.07 for the noisy bags-shoes dataset, but were still comparable to the other metrics in the noiseless environment.

Ablation Test
We designed three comparison GANs for the ablation experiments: (1) a Transformerbased generator, (2) a generator using only UNet, and (3) a generator with the Transformer-UNet structure proposed in this paper, with the exception that the rest of the network structure and experimental setup remained unchanged.
The quantitative results of the separation effects of the three networks on the two datasets are given in Table 2, respectively, showing that without UNet, the separation ability of the network decreases sharply and loses global control, while without the Transformer structure, although usable separation results can be obtained in mix images with a simple structure, there is the problem of not being able to ignore the details for information-rich source images; therefore, our model combines the two to improve the separation ability of the network in the face of complex structure images.

Conclusions
In this paper, a GAN structure based on the combination of UNet and Transformer is proposed to address the loss of detail information caused by ignoring location g correlation in SCBIS. UNet performs feature extraction on the source, Transformer acquires the detailed representations, and the overall reconstruction and detail reconstruction are achieved through GAN training. Experimental and comparative results show that the Transformerguided GAN provides better quality separated images and more detailed representation of source information in the SCBIS task. For future research, there is ample scope for unsupervised blind image separation, as blind image separation in practice may not have a clean and usable source image on which to base the discriminator's judgement. For future research, there is great scope for unsupervised SCBIS, as in practice, there may not be a usable source image.