Green Fluorescent Protein and Phase-Contrast Image Fusion via Generative Adversarial Networks

In the field of cell and molecular biology, green fluorescent protein (GFP) images provide functional information embodying the molecular distribution of biological cells while phase-contrast images maintain structural information with high resolution. Fusion of GFP and phase-contrast images is of high significance to the study of subcellular localization, protein functional analysis, and genetic expression. This paper proposes a novel algorithm to fuse these two types of biological images via generative adversarial networks (GANs) by carefully taking their own characteristics into account. The fusion problem is modelled as an adversarial game between a generator and a discriminator. The generator aims to create a fused image that well extracts the functional information from the GFP image and the structural information from the phase-contrast image at the same time. The target of the discriminator is to further improve the overall similarity between the fused image and the phase-contrast image. Experimental results demonstrate that the proposed method can outperform several representative and state-of-the-art image fusion methods in terms of both visual quality and objective evaluation.


Introduction
In the field of cell and molecular biology, fluorescent imaging and phase-contrast imaging are two representative imaging approaches. As a widely used tool in fluorescent imaging, green fluorescent protein (GFP) displays bright green fluorescence when exposed to light in the range of blue to ultraviolet. The GFP image contains functional information related to the molecular distribution of biological cells but has very low spatial resolution. Phase-contrast imaging is an optical microscopy technique that visualizes phase shifts by converting them into variations of amplitude or contrast in the image. The phase-contrast image provides structural information with high resolution. Fusion of GFP and phase-contrast images is of great significance to the localization of subcellular structures, the functional analysis of proteins, and the study of gene expression [1].
In recent years, a variety of image fusion methods have been proposed. Generally, existing image fusion algorithms consist of three main steps: image transform, fusion, and inverse transform [2]. The representative fusion methods include multiscale transform-based ones [3][4][5][6][7][8], sparse representation-based ones [9][10][11][12][13], spatial domain-based ones [14][15][16][17], hybrid transform-based ones [18][19][20][21], etc. In most of the existing image fusion methods, the role of each input image is equivalent in terms of the fusion system, which means that the input images generally undergo identical transforms and uniform fusion rules. However, for the problem of GFP and phase-contrast image fusion, considering that the input images differ significantly from each other, different roles can be assigned to them in the fusion system by carefully addressing their own characteristics, which is likely to provide a more effective way to tackle this fusion issue.
In this paper, we propose a novel GFP and phase-contrast image fusion method based on generative adversarial networks (GANs). The fusion problem is modelled as an adversarial game between a generator and a discriminator. The aim of the generator is to obtain a fused image that integrates the functional information from the GFP image together with the structural information from the phase-contrast image, while the discriminator further ensures the overall similarity between the fused image and the phase-contrast image. This adversarial process enables the fusion result to capture the complementary information from different input images as much as possible. An example of the proposed method is illustrated in Figure 1, where the input GFP and phase-contrast images are shown in Figures 1(a) and 1(b), respectively. Figure 1(c) shows the fusion result obtained by the proposed method. By referring to the input images, it can be seen that our method achieves high performance in terms of the preservation of functional and structural information. The main contributions of this paper are summarized as follows:
(1) We propose a deep learning-(DL-) based GFP and phase-contrast image fusion method via generative adversarial networks (GANs). To extract information from these two kinds of biological images adequately, the input images are treated differently in the proposed fusion model according to their own characteristics.
(2) Extensive experiments on more than 140 pairs of input images demonstrate that the proposed method outperforms several representative image fusion methods in terms of both visual quality and objective evaluation.
The remainder of this paper is organized as follows. Section 2 describes some related works. In Section 3, the proposed GAN-based image fusion method is introduced in detail. The experimental results and discussions are given in Section 4. Finally, Section 5 concludes the paper.

Related Work and Motivations
2.1. GFP and Phase-Contrast Image Fusion. Fusion of GFP and phase-contrast images is conducive to the study of subcellular localization and functional properties of protein.
In the past few years, several image fusion methods have been proposed to address this issue [22][23][24]. Li and Wang [22] proposed an NSCT-based GFP and phase-contrast image fusion method. In their method, the intensity components of the input images are decomposed by NSCT and the obtained coefficients are then merged by a variable-weight fusion rule. In [23], Feng et al. introduced a fusion approach for GFP and phase-contrast images based on the sharp frequency localization contourlet transform (SFL-CT). To fuse the decomposed coefficients, they designed a maximum region energy-(MRE-) based rule, a maximum absolute value-(MAV-) based rule, and a neighborhood consistency measurement-(NCM-) based rule to merge the approximation subbands, the finest detailed subbands, and other detailed subbands, respectively. Recently, Qiu et al. [24] presented a complex shearlet transform-(CST-) based method to fuse GFP and phase-contrast images. The high-frequency subbands are fused with the traditional absolute-maximum rule, while a Haar wavelet-based energy rule is introduced to merge the low-frequency subbands.
It is worth noting that all of the above GFP and phase-contrast image fusion methods are based on conventional multiscale transforms. Moreover, the role of each input image is equivalent in these fusion methods, as they handle the GFP image (more precisely, its intensity component) and the phase-contrast image in the same way.

Deep Learning-Based Image Fusion. In recent years, owing to the effectiveness and convenience of deep learning (DL) models in feature representation, DL-based study has emerged as a very active direction in the field of image fusion [25]. Many DL models such as stacked autoencoders (SAEs) and convolutional neural networks (CNNs) have been employed in a wide range of image fusion problems including remote sensing image fusion [26,27], multifocus image fusion [28][29][30], multiexposure image fusion [31,32], medical image fusion [33,34], and infrared and visible image fusion [35][36][37]. In [26], Huang et al. first introduced deep learning into remote sensing image fusion by applying a sparse denoising autoencoder to characterize the nonlinear mapping between low- and high-resolution multispectral image patches. Liu et al. [28] proposed a CNN-based multifocus image fusion method in which a Siamese network is designed to perform activity level measurement and act as the fusion rule simultaneously. In [31], Kalantari and Ramamoorthi introduced a learning-based multiexposure image fusion approach via CNN to model the complex deghosting process in dynamic scenes. Hermessi et al. [33] presented a CNN-based medical image fusion method which pre-extracts the shearlet features of source images as network input. Most recently, Ma et al. [35] introduced a novel generative adversarial network-(GAN-) based infrared and visible image fusion method by modelling the fusion problem as an adversarial game, aiming to preserve infrared intensities and visible details at the same time. This work demonstrates the high potential of GAN models for multimodal image fusion.

Motivations of This Work.
In this work, considering that the characteristics of the GFP image and the phase-contrast image are significantly different, and unlike the existing fusion methods on this issue introduced in Section 2.1, different roles are assigned to the input images so as to extract information from them more effectively. To this end, and inspired by the great progress recently achieved in image fusion by deep learning, a GAN-based GFP and phase-contrast image fusion method is presented. We mainly adopt the GAN-based fusion scheme introduced in [35] due to its effectiveness and simplicity in multimodal image fusion, while carefully devising the loss functions according to the characteristics of the GFP and the phase-contrast images. To the best of our knowledge, this is the first time that a DL-based approach has been used in the field of GFP and phase-contrast image fusion.

Overview.
Figure 2 shows the schematic diagram of the proposed GFP and phase-contrast image fusion method. The fusion issue is formulated as an adversarial problem to preserve the complementary information contained in the input images as much as possible. The GFP image is treated as an RGB color image in the fusion process. It is first converted into the YUV color space, which can effectively separate the intensity or luminance component from the color image. This is a widely used approach in the field of functional and structural image fusion [6,38].
During the training process, the GFP image $I_g$ is converted into the YUV color space to acquire the Y, U, and V components: $I_g^Y$, $I_g^U$, and $I_g^V$. Then, $I_g^Y$ and the phase-contrast image $I_p$ are concatenated in the channel dimension to generate a two-channel map $I_c = [I_g^Y, I_p]$, in which the first channel is $I_c^1 = I_g^Y$ and the second channel is $I_c^2 = I_p$. Next, $I_c$ is fed into the generator G and the output is termed the intermediate fused image $I_f^Y$, which tends to maintain the functional information of $I_g$ and retain the structural information of $I_p$. $I_f^Y$ and $I_p$ are fed into the discriminator D to further ensure the overall similarity between them. In this way, an adversarial game between G and D is established.
During the testing process, $I_g^Y$ and $I_p$ are concatenated in the channel dimension and then fed into the trained generator to obtain the intermediate fused image $I_f^Y$. The final fused image $I_f$ is acquired by performing the inverse YUV conversion (i.e., YUV to RGB) over $I_f^Y$, $I_g^U$, and $I_g^V$.
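The testing pipeline above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the text does not state which YUV definition is used, so the BT.601-style coefficients below are an assumption, and `fuse`, `rgb_to_yuv`, `yuv_to_rgb`, and the `generator` callable (a stand-in for the trained network G) are hypothetical names.

```python
def rgb_to_yuv(r, g, b):
    """RGB -> YUV with BT.601-style weights (an assumed convention)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v


def yuv_to_rgb(y, u, v):
    """Exact algebraic inverse of rgb_to_yuv above."""
    b = y + u / 0.492
    r = y + v / 0.877
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b


def fuse(gfp_rgb, phase, generator):
    """Fuse one GFP/phase-contrast pair.

    gfp_rgb:   rows of (r, g, b) tuples (the GFP image I_g)
    phase:     rows of floats (the phase-contrast image I_p)
    generator: callable taking the Y plane of I_g and the plane I_p and
               returning the fused Y plane; a hypothetical stand-in for G
    """
    yuv = [[rgb_to_yuv(*px) for px in row] for row in gfp_rgb]
    y_g = [[px[0] for px in row] for row in yuv]
    y_f = generator(y_g, phase)  # intermediate fused image I_f^Y
    # Recombine the fused luminance with the original U and V planes.
    return [
        [yuv_to_rgb(yf, px[1], px[2]) for yf, px in zip(yf_row, row)]
        for yf_row, row in zip(y_f, yuv)
    ]
```

Only the luminance plane passes through the generator; the chrominance planes of the GFP image are carried over unchanged, which is why the green coloring of the functional regions is kept in the final result.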

Network Architecture. The network architecture of the generator is shown in Figure 3. The input of the generator is the concatenation of $I_g^Y$ and $I_p$, followed by a five-layer convolutional network. The filters used in the first two layers, the next two layers, and the last layer are 5 × 5, 3 × 3, and 1 × 1, respectively. The symbol "n256s1" denotes that the corresponding layer has 256 feature maps and a stride of 1, and so forth. In each convolutional layer, the stride is 1 and there is no padding operation. To preserve the details contained in the source images, no downsampling is applied in any layer. Besides, to overcome the problems of vanishing gradients and sensitivity to data initialization, batch normalization is employed in the first four layers. Leaky ReLU and tanh activation functions are used in the first four layers and the last layer, respectively. The output of G is the intermediate fused image $I_f^Y$.
The network architecture of the discriminator is shown in Figure 4. The inputs of the discriminator are $I_p$ and $I_f^Y$, followed by a five-layer convolutional network in which 3 × 3 filters with a stride of 2 are used in the first four layers. The discriminator plays the role of a classifier. Batch normalization is employed in the second, third, and fourth layers, the leaky ReLU activation function is used in the first four layers, and the last layer is a linear layer. The output of the discriminator is the predicted label (its dimension is one).
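Because the generator uses unpadded, stride-1 convolutions, each layer shrinks the spatial size by the kernel size minus one. The short sketch below works through that arithmetic for the five kernel sizes listed above; the helper name `valid_conv_output` is hypothetical, and the sizes 112, 100, and the 6-pixel padding are the values quoted in the Figure 3 caption and the training details.

```python
def valid_conv_output(size, kernels, stride=1):
    """Spatial size after a chain of unpadded ("valid") convolutions."""
    for k in kernels:
        size = (size - k) // stride + 1
    return size


# Kernel sizes of the five generator layers: 5x5, 5x5, 3x3, 3x3, 1x1.
GENERATOR_KERNELS = [5, 5, 3, 3, 1]

# Training: a 112 x 112 patch shrinks by 4 + 4 + 2 + 2 + 0 = 12 pixels.
train_out = valid_conv_output(112, GENERATOR_KERNELS)        # 100

# Testing: padding 6 pixels on each side of a 358 x 358 image
# compensates for the same 12-pixel shrinkage.
test_out = valid_conv_output(358 + 2 * 6, GENERATOR_KERNELS)  # 358
```

This is also why the discriminator is fed the central 100 × 100 part of each phase-contrast patch during training: it has to match the size of the generator output.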

The Definition of the Loss Functions.
The loss functions of our network are composed of two parts: the loss function of the generator G and the loss function of the discriminator D.
To improve the quality of the generated images and the stability of the training process, they are designed based on the least squares generative adversarial networks (LSGANs) introduced by Mao et al. [39].

The Loss Function of the Generator.
The loss function of G, denoted $L_G$, combines two terms, $V_{GAN}(G)$ and $L_C$, which denote the adversarial loss between the generator and the discriminator and the content loss, respectively. The parameter α is used to control the balance between $V_{GAN}(G)$ and $L_C$. The first term $V_{GAN}(G)$ is computed over a batch, where $N$ is the number of training samples in a batch and $I_f^{Y(i)}$ denotes the $i$-th fused image, with $i \in \{1, \ldots, N\}$. The parameter $c$ is the value that the generator expects the discriminator to believe for the fake data. The second term $L_C$ consists of three terms, where $H$ and $W$ indicate the height and width of the input images, respectively, $\|\cdot\|_F$ denotes the matrix Frobenius norm, and SSIM represents the structural similarity operation [40]. The first term is designed to preserve the functional information of the GFP image. The second term aims to extract the energy (represented by image intensity) of the phase-contrast image, and the third term is devised to maintain the structural information contained in the phase-contrast image. β and γ are trade-off parameters to balance these three terms.
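Read together with the LSGAN formulation [39], the definitions above suggest the following form for the generator objective. This is a sketch under stated assumptions rather than a statement of the exact equations: the squared norms, the $1/(HW)$ normalization, the $1 - \mathrm{SSIM}$ form of the third term, and the placement of β and γ are all assumed.

```latex
\begin{align}
L_G &= V_{GAN}(G) + \alpha L_C, \\
V_{GAN}(G) &= \frac{1}{N}\sum_{i=1}^{N}\bigl(D(I_f^{Y(i)}) - c\bigr)^2, \\
L_C &= \frac{1}{HW}\Bigl(\|I_f^Y - I_g^Y\|_F^2 + \beta\,\|I_f^Y - I_p\|_F^2\Bigr)
       + \gamma\,\bigl(1 - \mathrm{SSIM}(I_f^Y, I_p)\bigr).
\end{align}
```

Under this reading, the first content term pulls the fused luminance toward the GFP luminance, the second pulls its intensity toward the phase-contrast image, and the third rewards structural agreement with the phase-contrast image.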

Figure 2: Schematic diagram of the proposed GFP and phase-contrast image fusion method (training process and testing process).

3.3.2. The Loss Function of the Discriminator. The information of $I_p$ cannot be completely expressed by its energy and structural information alone. For example, the texture details may not be fully extracted in this way. To further improve the overall similarity between $I_p$ and $I_f^Y$, a discriminator D is introduced into the proposed framework. In the loss function of D, denoted $L_D$, $a$ and $b$ stand for the labels of $I_f^Y$ and $I_p$, respectively.
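As an illustration of how the labels a, b, and c enter the adversarial terms, the sketch below writes the two losses in the standard least-squares (LSGAN) form. It is an assumed form, not a confirmed transcription of the paper's equations; the function names are hypothetical, the discriminator outputs are passed in as plain lists of scalars, and drawing soft labels from a uniform distribution is likewise an assumption (the text only says they are random numbers within a range).

```python
import random


def soft_label(low, high, rng=random):
    """Draw a soft label from [low, high] (uniform sampling is assumed)."""
    return rng.uniform(low, high)


def discriminator_loss(d_real, d_fake, a, b):
    """Least-squares discriminator loss (assumed LSGAN form).

    d_real: discriminator outputs for phase-contrast patches I_p
    d_fake: discriminator outputs for fused patches I_f^Y
    a, b:   target labels for fused and phase-contrast patches
    """
    real_term = sum((d - b) ** 2 for d in d_real) / len(d_real)
    fake_term = sum((d - a) ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_adversarial_loss(d_fake, c):
    """Adversarial term V_GAN(G): push D's output on fused patches toward c."""
    return sum((d - c) ** 2 for d in d_fake) / len(d_fake)
```

With hard labels a = 0 and b = c = 1, a discriminator that outputs exactly 1 on phase-contrast patches and 0 on fused patches has zero loss, while the generator term is then at its largest; the two objectives pull in opposite directions, which is the adversarial game described above.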

Training Details.
The popular GFP database released by the John Innes Centre [1], which is available at http://data.jic.ac.uk/Gfp/, is employed as the training data in this work. The database contains 148 pairs of registered GFP and phase-contrast images of Arabidopsis thaliana cells, each of size 358 × 358 pixels.
In order to obtain sufficient data for network training, each input image is cropped into a large number of patches of the same size, 112 × 112 pixels. The stride for cropping is set to 12. As a result, we acquire a total of 65268 pairs of GFP and phase-contrast image patches, and the range of each patch is normalized to [−1, 1]. In each iteration during training, the input of the generator contains n pairs of input image patches (i.e., the batch size is n), and the output intermediate fused patches together with the phase-contrast patches (the central part of size 100 × 100 pixels) are employed as the input of the discriminator. Moreover, in each iteration, the discriminator is first trained m times (i.e., the training step is m) using the Adam optimizer [41], and then the generator is trained. Algorithm 1 summarizes the procedure of network training.
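The patch count quoted above follows directly from the image size, patch size, and stride. The sketch below reproduces that arithmetic, assuming patches are taken on a regular grid with no partial patches at the border; it also derives the iteration count using the batch size of 32, the 10 epochs, and the discriminator training step of 2 given in the training settings. The function names are hypothetical.

```python
def patches_per_axis(image_size, patch_size, stride):
    """Number of full patches along one axis of a regular cropping grid."""
    return (image_size - patch_size) // stride + 1


def training_counts(n_images=148, image_size=358, patch_size=112,
                    stride=12, batch_size=32, epochs=10, d_steps=2):
    per_axis = patches_per_axis(image_size, patch_size, stride)
    per_image = per_axis ** 2
    total_patches = per_image * n_images
    iterations = total_patches * epochs // batch_size
    return {
        "patches_per_axis": per_axis,
        "patches_per_image": per_image,
        "total_patches": total_patches,
        "iterations": iterations,
        # The discriminator is updated d_steps times per iteration.
        "discriminator_updates": iterations * d_steps,
        "generator_updates": iterations,
    }
```

Each axis yields 21 patch positions, so each image pair contributes 441 patch pairs and the 148 image pairs give 65268 in total.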
In our experiments, the parameters for training are set as follows. The batch size n and the number of epochs are set to 32 and 10, respectively. Accordingly, the number of training iterations is 65268 × 10/32 ≈ 20396. The training step of the discriminator m is fixed at 2, and the learning rate is set to $10^{-4}$. For easier training, as suggested in [35], soft labels are adopted for a, b, and c. That is, they are set to random numbers rather than specific ones. The label a of $I_f^Y$ and the label b of $I_p$ range from 0 to 0.3 and from 0.7 to 1.2, respectively. The label c of $I_f^Y$ ranges from 0.7 to 1.2.

Figure 3: Network architecture of the generator. In each convolutional layer, there is no padding operation. During the training process, the input is image patches of size 112 × 112 pixels and the output is of size 100 × 100 pixels (see Section 3.4 for more details). During the testing process, the input is the entire images with 6 pixels padded in each direction to ensure that the output has the same size as the input images. For better visualization, we adopt the entire images as the input and output in this figure.

The proposed method is compared with seven fusion methods: the dual-tree complex wavelet transform-(DTCWT-) based method [3], the curvelet transform-(CVT-) based method [4], the non-subsampled contourlet transform-(NSCT-) based method [5], the sparse representation-(SR-) based method [9], the convolutional neural network-(CNN-) based method [36], the sharp frequency localization contourlet transform-(SFL-CT-) based method [23], and the complex shearlet transform-(CST-) based method [24]. The first three are based on popular multiscale transforms, and their parameters are set to the optimal values reported in an influential comparative study [42]. The fourth one is based on sparse representation via the simultaneous orthogonal matching pursuit (SOMP) algorithm. The fifth one is a recently proposed deep learning-(DL-) based method, while the last two are fusion methods specially designed for GFP and phase-contrast images. The parameters in these methods are all set to the default values for unbiased comparison.

Objective Metrics.
In [43], Liu et al. presented a comprehensive review of the objective evaluation metrics for image fusion and classified them into four categories: the information theory-based ones, the image feature-based ones, the image structural similarity-based ones, and the human perception-inspired ones. In this paper, to conduct an all-round objective assessment, one widely used metric is chosen from each category. The first one is the normalized mutual information ($Q_{MI}$) [44], which measures the mutual dependence between the input images and the fused image. The second one is an image feature-based metric using phase congruency ($Q_P$) [45]. This metric assesses the fusion quality by comparing the local cross correlation of corresponding feature maps of the input and fused images. The third one is Yang's metric ($Q_Y$) [46], which evaluates the structural similarity between the input images and the fused one. The last one is proposed by Chen and Blum ($Q_{CB}$) [47] based on human visual system (HVS) models. In addition, the visual information fidelity (VIF) measure [48] between the input phase-contrast image and the fused image is also employed for objective assessment. By characterizing the relationship between image information and visual quality, the VIF measure has been widely verified to be highly consistent with subjective evaluation. It is worth noting that the same measure between the GFP image and the fused image is not included. As reported in [23] (Table 1), the result on the VIF measure between the GFP image and the fused image (the proposed method has the lowest score) is contrary to that of the VIF measure between the phase-contrast image and the fused image (the proposed method has the highest score). We also verify this point in our experiment.
Specifically, we find experimentally that the result on the VIF measure between the phase-contrast image and the fused image is highly consistent with the other fusion metrics, while the situation for the GFP image is just the opposite. One possible explanation is that most of the pixels or regions in the GFP image are dark (the intensity is zero), which is significantly different from the situation in the fused image or the phase-contrast image. Therefore, a higher VIF measure between the GFP image and the fused image may not indicate a better fusion result. Based on the above observations, only the VIF measure between the phase-contrast image and the fused image is used for evaluation in this work. For each of the above metrics, a higher score indicates better performance.
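To make the first metric concrete, the sketch below computes a normalized mutual information score from intensity histograms. It assumes the commonly used form in which each mutual information term is divided by the sum of the two marginal entropies and the total is multiplied by 2; whether this matches the exact definition in [44] is an assumption. Images are passed as flattened lists of quantized intensities, and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y), from pixel-wise paired samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def normalized_mi(a, b, f):
    """Assumed Q_MI form for input images a, b and fused image f.

    All three are flattened lists of quantized intensities with the same
    length. Constant images (zero entropy) are not handled here.
    """
    h_f = entropy(f)
    term_a = mutual_information(a, f) / (entropy(a) + h_f)
    term_b = mutual_information(b, f) / (entropy(b) + h_f)
    return 2.0 * (term_a + term_b)
```

Under this normalization the score lies between 0 (the fused image is statistically independent of both inputs) and 2 (the fused image determines, and is determined by, each input), which is consistent with the convention that a higher score is better.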

Parameter Analysis.
In this section, the impacts of the three trade-off parameters α, β, and γ in our method are quantitatively studied via the objective fusion metrics. Based on a large number of experiments, we obtain an appropriate setting: α = 6, β = 6, and γ = 6. The control-variable approach, a popular way of analysing the impacts of multiple parameters, is adopted to verify this point. The results are shown in Figure 5. Since it is impractical to show the results for all parameter combinations, only one set of results is provided to exhibit the impact of each parameter, with the other two fixed at their well-performing values (this is a widely used practice in the study of image fusion [8,38]). For each metric, the average score over the 148 image pairs is employed for evaluation in Figure 5. For each parameter, the best performance on the five metrics is mostly obtained when its value is 6. Accordingly, these three free parameters are all set to 6 in our method.

Results and Discussion.
Figures 6 and 7 provide two sets of fusion results, which include the input images and the fused images obtained by different methods. In each image, two representative regions are enlarged as close-ups for better comparison.
It can be seen that the DTCWT-based, CVT-based, NSCT-based, and SR-based methods can well capture the functional information from the GFP image and the spatial details from the phase-contrast image. However, these methods tend to lose a large amount of image energy from the phase-contrast image. As a result, the brightness of the fused images is obviously lower than that of the phase-contrast image, leading to undesirable visual artifacts (see the first close-ups in Figures 6(b)-6(f) and 7(b)-7(f)).
For the CNN-based method, the image energy can be well preserved, but the functional information is not well handled, as the green regions are overemphasized when compared with the GFP input image. As a consequence, some structural details are concealed by the green regions (see the second close-ups in Figures 6(g) and 7(g)).
The SFL-CT-based and CST-based methods achieve obvious improvement on this issue, but still suffer from this defect to a certain degree (see the second close-ups in Figures 6(h)-6(i) and 7(h)-7(i)). The proposed method achieves the highest visual quality among all the methods. On the one hand, the functional information from the GFP image is accurately preserved by our method. On the other hand, the fused images of our method well inherit both the structural information and the image energy from the phase-contrast image. The objective assessment results of different fusion methods on the above five metrics are listed in Table 1. For each method, the mean value (MV) and the standard deviation (SD) of each metric over 148 pairs of input images are reported. Moreover, the number of image pairs on which the corresponding method achieves the highest score is counted and termed winning times (WT) in Table 1. The maximum mean value, minimum standard deviation, and maximum winning times among all the methods are indicated in bold.
It can be seen that the proposed method clearly outperforms the DTCWT-based, CVT-based, NSCT-based, SR-based, CNN-based, and SFL-CT-based methods on all the five evaluation metrics. In comparison to the CST-based method, which ranks first on $Q_Y$ and $Q_{CB}$, our method has an obvious advantage on $Q_{MI}$, $Q_P$, and VIF, while achieving very close performance on $Q_Y$ and $Q_{CB}$. Besides, the proposed method obtains relatively small standard deviations on all the five metrics, which indicates that it can stably obtain high-quality fusion results.
Based on the above qualitative and quantitative comparisons, the proposed method exhibits clear advantages over the other seven methods. Moreover, the computational efficiency is sufficiently high for practical usage. Specifically, under the hardware environment consisting of an Intel Core i7-7820K CPU and an NVIDIA TITAN Xp GPU, it takes only about 0.06 seconds for our method to fuse two images of size 358 × 358 pixels. Since all the other methods are implemented in MATLAB, their running time is not provided for comparison.

Influence of Network Architecture.
In this section, we study the influence of network architecture on the fusion performance of the proposed method. Specifically, the impacts of the number of feature maps and the number of convolutional layers are studied. Firstly, two sets of experiments are conducted to investigate the influence of the number of feature maps: one halves the number of feature maps in the first four layers of the generator and the discriminator, and the other doubles them. Secondly, to analyse the impact of the number of convolutional layers, we perform another two sets of experiments: one removes the first layer of the generator and the fourth layer of the discriminator (both of which contain 256 feature maps), while the other adds a convolutional layer with 512 feature maps to the generator before the first layer and to the discriminator after the fourth layer. Table 2 lists the objective evaluation results of the above experiments, which are denoted by halved feature maps, doubled feature maps, reduced layers, and increased layers. The results of the original network architecture are also given as reference. For each approach, the mean value of each metric over 148 pairs of input images is reported. It can be seen that the proposed method can generally obtain better performance with more feature maps and convolutional layers. In particular, the number of feature maps has relatively more effect on the fusion performance in this task than the number of convolutional layers. Taking the results given in Table 1 into consideration as well, we can see that the proposed method with a lighter model (halved feature maps or reduced layers) is still competitive among all the fusion methods. A heavier model (doubled feature maps or increased layers) can provide some further improvement over the original network architecture, but the extent is not significant. Considering factors like memory consumption and computational efficiency, it is an appropriate choice to employ the network architectures described in Section 3 as the default settings.

ALGORITHM 1: The procedure of network training in our method.
…
Select n phase-contrast image patches $I_p^{(1)}, \ldots, I_p^{(n)}$;
(5) Update discriminator with the Adam optimizer: $\nabla L_D$;
(6) end for
(7) Select n GFP image patches $I_g^{Y(1)}, \ldots, I_g^{Y(n)}$ as well as n phase-contrast image patches $I_p^{(1)}, \ldots, I_p^{(n)}$ from training data;
(8) Update generator with the Adam optimizer: $\nabla L_G$;
(9) end for

Computational and Mathematical Methods in Medicine

4.5. Verification of the Overfitting Problem. As mentioned above, the proposed fusion method is essentially an unsupervised approach since no ground-truth fused images are used for training. Accordingly, the whole dataset can be employed for both training and testing in the above experiments, without dividing it into a training set and a testing set. Although this is a reasonable way to obtain the fusion results for all the images, the performance of the trained model on new testing data remains unknown.
To address this issue, we conduct a 5-fold cross validation to study whether the proposed fusion model suffers from overfitting. Specifically, all the 148 pairs of images are randomly divided into five groups, with 30 pairs in each of the first four groups and 28 pairs in the last group. In each fold, four groups are employed as training data and the remaining one is used for testing. Therefore, each pair of images is employed for testing only once, and all the 148 fused images obtained in the testing process are used for objective evaluation. Table 3 shows the objective assessment results of the five-fold cross validation experiment, along with the results of the original training/testing manner for comparison. For each approach, the mean value of each metric over 148 pairs of input images is given. It is not surprising that the performance of the cross validation approach shows a slight decrease when compared with that of the original manner. By referring to the performances of other fusion methods reported in
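The five-fold partition described above can be sketched as follows. This is an illustrative helper only: the function names and the fixed random seed are hypothetical, and the group sizes (30, 30, 30, 30, 28) are the ones stated in the text.

```python
import random


def five_fold_groups(n_pairs=148, sizes=(30, 30, 30, 30, 28), seed=0):
    """Randomly split image-pair indices into groups of the given sizes."""
    if sum(sizes) != n_pairs:
        raise ValueError("group sizes must add up to the number of pairs")
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)
    groups, start = [], 0
    for size in sizes:
        groups.append(indices[start:start + size])
        start += size
    return groups


def folds(groups):
    """Yield (train, test) index lists; each group is the test set once."""
    for k, test in enumerate(groups):
        train = [i for j, g in enumerate(groups) if j != k for i in g]
        yield train, test
```

Because every pair falls in exactly one group, concatenating the five test sets gives all 148 pairs, each fused by a model that never saw it during training.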

Conclusion and Future Work
In this paper, we propose a GFP and phase-contrast image fusion method based on generative adversarial networks. The fusion problem is addressed as an adversarial game between a generator and a discriminator by carefully considering the characteristics of the different input images. Experimental results demonstrate that the proposed method can simultaneously extract the functional information from the GFP image and the structural information from the phase-contrast image, leading to better performance than several existing methods in terms of both visual quality and objective assessment. The proposed fusion framework generalizes well to functional and structural image fusion problems. In the future, we will study its feasibility in multimodal medical image fusion issues such as magnetic resonance (MR) and positron emission tomography (PET) image fusion.

Data Availability
The data supporting this study are from previously reported studies and datasets, which have been cited. The dataset used in this research work is available at http://data.jic.ac.uk/Gfp/, released by the John Innes Centre.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.