Inflating 2D Convolution Weights for Efficient Generation of 3D Medical Images

The generation of three-dimensional (3D) medical images has great application potential since it takes into account the 3D anatomical structure. Two problems prevent effective training of a 3D medical generative model: (1) 3D medical images are expensive to acquire and annotate, resulting in an insufficient number of training images, and (2) a large number of parameters are involved in 3D convolution. Methods: We propose a novel GAN model called 3D Split&Shuffle-GAN. To address the 3D data scarcity issue, we first pre-train a two-dimensional (2D) GAN model using abundant image slices and inflate the 2D convolution weights to improve the initialization of the 3D GAN. Novel 3D network architectures are proposed for both the generator and discriminator of the GAN model to significantly reduce the number of parameters while maintaining the quality of image generation. Several weight inflation strategies and parameter-efficient 3D architectures are investigated. Results: Experiments on both heart (Stanford AIMI Coronary Calcium) and brain (Alzheimer's Disease Neuroimaging Initiative) datasets show that our method leads to improved 3D image generation quality (an improvement of 14.7 in Fréchet inception distance) with significantly fewer parameters (only 48.5% of the baseline method). Conclusions: We built a parameter-efficient 3D medical image generation model. Owing to its efficiency and effectiveness, it has the potential to generate high-quality 3D brain and heart images for real use cases.


Introduction
With the availability of large-scale annotated datasets like ImageNet (Russakovsky et al., 2015), convolution neural networks (CNNs) have achieved unprecedented success in computer vision (Krizhevsky et al., 2012). Benefiting from CNNs, medical imaging research has made great advancements in the classification (Frid-Adar et al., 2018), segmentation (Zhou et al., 2022; Song et al., 2022; Bian et al., 2022), detection (Nguyen et al., 2022), reconstruction (Wu et al., 2023), and registration (Makela et al., 2002) of two-dimensional (2D) medical images. However, 3D medical image research lags behind due to the lack of large-scale 3D medical image datasets. As a result of the complex collection procedure, expert annotation, privacy concerns and patient consent, it is challenging to build a large-scale 3D medical dataset similar to ImageNet.
One widely-used solution for the data deficit of medical images is Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). These networks create high-quality synthetic images to mimic realistic data distributions. An example is using GANs with Wasserstein distance and perceptual loss for low-dose computed tomography (CT) image denoising (Yang et al., 2018). Perceptual loss cannot be directly used for 3D medical images due to the lack of interpretable pre-trained 3D models. A cyclic loss GAN was used by Quan et al. (2018) to reconstruct MRI images. Using cycle-consistent GANs, Kearney et al. (2020) translated magnetic resonance (MR) images to CT images. Albeit effective in mitigating the data-deficit challenge, most existing GAN-based methods are designed for 2D medical image generation. Therefore, they do not incorporate information about the 3D anatomical structure (Ferreira et al., 2022). Various medical applications require the 3D anatomical structure, including calcium scoring (Greenland et al., 2018; Gharleghi et al., 2022) of cardiac CT Coronary Angiograms (CTCAs), and brain tumor segmentation (Isensee et al., 2017; Cai et al., 2022). Unfortunately, there are two practical issues that hinder the effective training of the 3D medical generative model, preventing the use of GANs in 3D medical imaging.
First of all, there are usually insufficient 3D medical images to train effective 3D generative models. The effective training of 3D CNNs with natural videos relies on large-scale datasets, such as Moments in Time (Monfort et al., 2019) with 1 million short videos, and Kinetics (Carreira et al., 2019) with 750k video clips. In comparison, medical datasets contain far fewer 3D images. For example, the Stanford AIMI Coronary Calcium (COCA) dataset (Stanford, 2022) only contains 787 CTCAs. To generate 3D images, Kwon et al. (2019) used 991 brain MRI images from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset (Weiner et al., 2017). Training an effective 3D generative model is difficult with such small datasets of 3D medical images.
Secondly, 3D convolution layers have a large number of parameters, making training slow and the model prone to overfitting. In 3D convolution, the weight parameters take on an additional dimension. For example, the conventional 3 × 3 convolution expands to 3 × 3 × 3 in the 3D case. Adding the third dimension allows the modeling of the 3D anatomical structure, but it also introduces an excessive number of parameters and computations, resulting in slower training. Moreover, the model is prone to overfitting due to the contrast between the large number of parameters and the small number of 3D training images.
To address the above two problems, we propose a novel GAN model, dubbed 3D Split&Shuffle-GAN, for effective and efficient 3D medical image generation. The proposed model improves existing state-of-the-art GANs (e.g., StyleGAN2 (Karras et al., 2020b)) from two perspectives: training strategy and network architecture (see Figure 1).
The proposed training strategy takes advantage of the availability of 2D image slices to train a 2D GAN model. It then inflates the 2D weights to initialize the 3D GAN model. As the 3D GAN model is initialized with informative 2D weights, it can focus more on 3D anatomy, which results in a better generation of 3D images. By design, the 2D GAN has an architecture similar to that of the 3D GAN, with the exception of the additional convolution dimension (e.g., 3 × 3 convolution vs. 3 × 3 × 3 convolution). This enables the 2D weights to be seamlessly expanded to 3D using the weight inflation technique (Carreira and Zisserman, 2017). Since the original inflation was designed for classification models rather than generative models, we evaluate five new inflation variants through extensive experiments to determine the most suitable one for the task of 3D image generation.
For the network architecture, we devise novel Channel Split&Shuffle modules to improve both the generator and discriminator networks. For the generator, since the state-of-the-art style-based models (e.g., StyleGAN2) incorporate style vectors into convolution weights as a modulated convolution, efficient convolution operations (e.g., depthwise separable convolution (Howard et al., 2017) or group convolution (Krizhevsky et al., 2012)) cannot be directly adopted. This is mitigated by our Split&Shuffle module, which splits the feature channels into two equal branches and performs a modulated 3D convolution for each branch. Then, the output channels are concatenated and shuffled to encourage feature exchanges. With this design, the number of parameters of the generator is reduced by a factor of 2. For the discriminator, the number of parameters is further reduced by nearly a factor of 4 by replacing one of the 3 × 3 × 3 convolutions with a 1 × 1 × 1 convolution. Although the number of parameters for both the generator and discriminator is significantly reduced, the devised modules achieve much better performance than the original ones. Given the extreme data scarcity in 3D medical image generation, our parameter-efficient model is less likely to overfit than the original model.
To demonstrate the effectiveness of the training strategy and network architecture, we investigated five novel weight inflation variants as well as five network design choices on the heart dataset (COCA). In addition, we performed experiments on the brain dataset (ADNI) to demonstrate the general applicability of our method.
To summarize, this paper makes the following contributions to 3D medical image generation:
• A novel 3D Split&Shuffle-GAN model for 3D medical image generation is proposed, and new inflation strategies are developed to facilitate training of 3D medical generation models.
• Parameter-efficient Channel Split&Shuffle modules are developed for both the generator and discriminator networks, which reduce the number of parameters (by a factor of at least 2) and improve generation quality (FID).
• We conducted comprehensive experiments to verify the effectiveness of the inflation strategy and network architecture. We achieved state-of-the-art performance on both the heart and brain datasets.

Generative Models for Medical Imaging
The most popular model for generating synthetic images is the generative adversarial network (GAN) (Goodfellow et al., 2014). A GAN synthesizes realistic images from a random noise variable and uses a discriminator to distinguish between the synthesized images and the real images. The distribution of the synthesized images gradually approaches the distribution of real images with the alternating training of the generator and discriminator. State-of-the-art GANs use the style-based generation technique (Karras et al., 2019, 2020b), in which style vectors are generated (for controlling the style of image generation) from a mapping network.
Providing annotations for large numbers of images in the field of medical imaging is a challenging task. GANs have thus been naturally adopted to solve a number of medical problems (Frid-Adar et al., 2018; Quan et al., 2018; Yang et al., 2018; Kearney et al., 2020; Ravi et al., 2022; Jung et al., 2021; Hu et al., 2020b, 2019, 2020a; You et al., 2022), such as classification, segmentation, registration, low dose CT denoising, and MR to positron emission tomography (PET) synthesis. GANs were used by Frid-Adar et al. (2018) to generate synthetic CT images for data augmentation to enhance liver lesion classification performance. RefineGAN (Quan et al., 2018) proposes a cyclic consistency loss for the modified variant of the deeper generator and discriminator networks to deal with the compressed sensing magnetic resonance imaging (CS-MRI) reconstruction problem. To improve the conventional GANs for the low dose CT (LDCT) denoising task, Yang et al. (2018) employed two practical methods, namely Wasserstein distance and perceptual loss. A-CycleGAN (Kearney et al., 2020) makes use of variational autoencoding (VAE), attention, and a cycle-consistent generative adversarial network (CycleGAN) to improve existing MR-to-CT image translation algorithms. Hu et al. (2019) proposed an effective adversarial U-Net architecture along with different normalization techniques to solve the MRI to PET image synthesis task.
Despite the wide range of models and GAN variants proposed for medical imaging problems, most of them only focus on generating 2D images, disregarding the 3D anatomical structure. Only a few attempts have been made to generate 3D images. Leveraging an α-GAN, Kwon et al. (2019) combined a variational autoencoder (VAE) and a GAN to generate 3D synthetic brain MRI images. Zhou et al. (2022) proposed a segmentation-guided style-based generative adversarial network (SGSGAN) for synthesizing full-dose PET images, where a style-based generator is directly used for style modulation. Hu et al. (2023) proposed a hierarchical shape-perception network (HSPN) for 3D brain reconstruction (point cloud) from a single incomplete image. In contrast, our method generates 3D medical images with only random variables as input. By extending StyleGAN2's 2D convolutions to 3D convolutions, Hong et al. (2021) used 3D-StyleGAN to generate 3D brain MRI images. A comprehensive review of the usage of GANs in 3D data can be found in Ferreira et al. (2022). Since most existing methods lift 2D GAN models to 3D in a straightforward manner, the number of parameters increases significantly, making it challenging to train the model effectively. In this paper, we propose both effective training strategies and efficient model architectures to generate 3D medical images using 3D GANs.

Training 3D Convolution Neural Networks
A multitude of research effort has been directed toward 3D CNNs in the field of natural images, especially for the spatiotemporal analysis of videos. The main idea is to introduce a third convolution dimension (k × k × k) to capture the temporal dependencies for video applications such as action recognition (Carreira and Zisserman, 2017). The training of 3D CNN models usually relies on large-scale video datasets, e.g., Kinetics-700 (Carreira et al., 2019) with 750k video clips, and Moments in Time (Monfort et al., 2019) with 1 million short videos. However, due to the high annotation costs, patient consent issues, and expert annotation challenges, creating 3D medical image datasets of similar scale is not feasible. As a result, training 3D medical models is challenging.
Using degenerated 2D spatial information, another line of work contributes to initializing 3D convolution weights by utilizing beneficial priors. For example, Carreira and Zisserman (2017) proposed an inflation strategy to stack 2D weights for the 3D weights initialization. The video vision Transformer is trained using the central frame initialization strategy in Arnab et al. (2021). To our knowledge, no similar initialization technique has been explored in 3D medical GANs. On the one hand, the third dimension in video analysis corresponds to temporally varying frames, while the third dimension in medical images describes the 3D anatomical structure. On the other hand, the interplay between the discriminator and generator makes the training process more complex than that of classification models. In this paper, we consider both the 3D anatomical structure and the interplay between the discriminator and the generator to facilitate the 3D GAN training and architecture design.

Parameter-efficient 3D Convolution Neural Networks
3D convolution neural networks are challenged by the large number of parameters introduced by the additional third dimension. There are two main approaches to addressing this issue: tensor decomposition and efficient module design. In tensor decomposition, low-rank tensor decomposition algorithms are applied to re-calculate the convolution weights, thereby compressing the network and reducing the number of parameters. For example, Tensor Train has been used in Novikov et al. (2015), CANDECOMP/PARAFAC (CP) decomposition is applied in Lebedev et al. (2014) and Kossaifi et al. (2020), and Tucker decomposition is adopted in Kim et al. (2015). In spite of their mathematical soundness, these methods require specific re-implementation of existing convolution operations and cannot take advantage of the latest hardware acceleration (e.g., the NVIDIA cuDNN library). In efficient module design, various parameter-efficient modules (e.g., bottleneck (Hara et al., 2018), group convolution, depthwise separable convolution, and pointwise convolution) are devised to replace the original module. These efficient modules are re-arranged and combined to form different network architectures. In MobileNet (Howard et al., 2017), for example, depthwise separable convolutions are used to construct a lightweight deep architecture for mobile devices. SqueezeNet (Iandola et al., 2016) combines pointwise convolution and regular convolution to form a Fire block. The computation cost of ShuffleNet (Zhang et al., 2018) is reduced by using pointwise group convolution. A comprehensive analysis of these modules can be found in Kopuklu et al. (2019).
All the above parameter-efficient designs are based on classification models and cannot be directly and easily adopted in 3D generative models, such as StyleGAN2. StyleGAN2's style modulation mechanism will be destroyed if these modules are trivially adopted. To address this issue, we propose customized 3D modules for the style-based generative models to enable parameter-efficient generation of 3D medical images.

Preliminary of 3D Medical Image Generation

Overview of StyleGAN2 model
Mapping Network. A key difference between style-based generative models (e.g., StyleGAN2) and previous GANs is the introduction of the mapping network f. Specifically, given a latent code z ∈ Z, f : Z → W first produces a vector w ∈ W. The learned affine transform A is then applied to w to obtain the generator's per-layer style vectors s.
Generator. In the generator, the original StyleGAN (Karras et al., 2019) directly utilizes the style vectors for adaptive instance normalization (AdaIN) on the feature maps, which causes characteristic artifacts such as droplets. To mitigate these unrealistic artifacts, StyleGAN2 incorporates the style vectors into the weight modulation (Mod) operation, then applies the demodulation (Demod) to serve as the instance normalization.

Modulation:
$$w'_{ijk} = s_i \cdot w_{ijk} \tag{1}$$
Demodulation:
$$w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} {w'_{ijk}}^{2} + \epsilon}} \tag{2}$$
where $w$, $w'$, $w''$ are the original, modulated and demodulated convolution weights, $s_i$ is the style vector corresponding to the $i$-th feature map, $j$, $k$ iterate the output feature maps and the spatial resolution, and $\epsilon$ is a small constant to avoid numerical issues. In the above modulation and demodulation operations, the style vectors are directly entangled with convolution weights, which removes the characteristic artifacts while retaining the style controllability. However, this also impedes straightforward modifications to the convolution layers, e.g., depthwise separable convolution (details in Sec. 3.3).
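The modulation and demodulation steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `modulate_demodulate`, the weight layout (input channels, output channels, then three spatial axes), and the example sizes are assumptions made for illustration.

```python
import numpy as np

def modulate_demodulate(w, s, eps=1e-8):
    """StyleGAN2-style weight modulation followed by demodulation.

    w: convolution weights, shape (C_in, C_out, k, k, k) (assumed layout)
    s: style vector with one scale per input feature map, shape (C_in,)
    """
    # Modulation: scale all weights attached to input feature map i by s_i.
    w_mod = w * s[:, None, None, None, None]
    # Demodulation: normalise each output feature map j by the L2 norm
    # taken over input channels and spatial positions.
    norm = np.sqrt(np.sum(w_mod ** 2, axis=(0, 2, 3, 4), keepdims=True) + eps)
    return w_mod / norm

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16, 3, 3, 3))   # hypothetical 16-channel 3D kernel
s = rng.uniform(0.5, 2.0, size=16)       # hypothetical style vector
w_dd = modulate_demodulate(w, s)

# After demodulation every output feature map has (approximately) unit norm.
out_norms = np.sqrt(np.sum(w_dd ** 2, axis=(0, 2, 3, 4)))
```

Because the style vector multiplies the weights along the input-channel axis, any module that changes how input channels map to weights (for example a depthwise convolution) also changes what the modulation means, which is the difficulty the text refers to.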
Discriminator. The discriminator of StyleGAN2 introduces a minibatch standard deviation layer to calculate the deviation of a minibatch and concatenates it to the original feature maps. This reduces the dependency on a minibatch to encourage diverse generations.
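A simplified sketch of the minibatch standard deviation idea is given below. It assumes a single statistic computed over the whole minibatch and a feature layout of (N, C, D, H, W); the grouping used in the official StyleGAN2 code is deliberately left out, and the function name is hypothetical.

```python
import numpy as np

def minibatch_stddev(x, eps=1e-8):
    """Append one constant feature map holding the minibatch std statistic.

    x: feature maps of shape (N, C, D, H, W); returns (N, C + 1, D, H, W).
    """
    # Standard deviation of every feature across the minibatch dimension.
    std = np.sqrt(x.var(axis=0) + eps)            # shape (C, D, H, W)
    # Collapse to a single scalar that summarises the variation in the batch.
    stat = std.mean()
    # Broadcast the scalar into one extra feature map per sample.
    extra = np.full((x.shape[0], 1) + x.shape[2:], stat, dtype=x.dtype)
    return np.concatenate([x, extra], axis=1)

x = np.random.default_rng(0).normal(size=(8, 32, 2, 4, 4))
y = minibatch_stddev(x)
```

A generator that collapses to near-identical samples yields a small statistic in the extra channel, which gives the discriminator a direct cue about sample diversity.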

3D Medical Image Generation
StyleGAN2 was originally designed for 2D natural image generation. To apply it to the generation of 3D medical images, a straightforward approach (Hong et al., 2021) is to lift all the 2D convolution operations to 3D convolution operations, e.g., expanding the 3 × 3 convolutions to 3 × 3 × 3 convolutions. Albeit simple, this approach will significantly increase the number of parameters, thereby posing two practical issues: (1) it will require a large number of 3D images for training; otherwise, the model suffers from overfitting and mode collapse issues, and (2) the largely increased parameter number will slow down training and generation. However, no existing methods have simultaneously addressed both issues. Therefore, it is non-trivial to improve StyleGAN2 for 3D medical image generation.
In this paper, we deal with the issues from the training strategy (Weight Inflation in Section 3.2) and network architecture (Split&Shuffle in Section 3.3) perspectives and propose an efficient 3D generative model as shown in Figure 2.

Inflating 2D Convolution Weights
This section discusses how to design a training strategy to generate synthetic 3D medical images and overcome the issue of data scarcity. Transfer learning (Weiss et al., 2016) is a widely-used technique to overcome the data paucity of the target task by employing additional external datasets. As an example, ImageNet (Russakovsky et al., 2015) is used as an external dataset for a variety of tasks, including object detection and segmentation. In MineGAN (Wang et al., 2020), knowledge is transferred from GANs to domains with few images. It is currently difficult to apply the transfer learning technique to 3D GAN models in medical research due to the lack of large datasets and effective transfer learning strategies. The 3DSeg-8 dataset (Chen et al., 2019) has been aggregated from eight datasets to facilitate transfer learning between 3D medical images for liver segmentation and nodule classification. However, there are only tens to hundreds of 3D images of organs/tissues in each dataset in 3DSeg-8, which makes GANs incapable of transferring detailed knowledge. Moreover, the interplay between the generator and discriminator poses a different transfer learning challenge compared to classification tasks with a single network.
Although direct transfer from external datasets is not feasible for 3D medical image generation, we note an interesting and helpful observation: the 2D slices in a medical dataset far outnumber its 3D images. As an example, COCA (Stanford, 2022) contains only 787 CTCA images, but these CTCA images contain 39,281 2D slices. Hence, the number of these 2D slices is sufficient to train a 2D generative model such as StyleGAN2. Since 2D and 3D generative models have weights of different shapes, it is not possible to transfer weights directly from 2D to 3D generative models. As a result, we are naturally drawn to another technique called weight inflation (Carreira and Zisserman, 2017), which enables effective 3D network training from the pre-trained 2D weights.
Weight inflation was first introduced in Carreira and Zisserman (2017) for the design and training of 3D video action recognition networks. The method has been applied to both CNN-based and Transformer-based 3D models (Carreira and Zisserman, 2017; Arnab et al., 2021; Solovyev et al., 2022). In technical terms, it extends/inflates/copies the 2D convolution weights along the third dimension (e.g., temporal dimension in the video) to provide a more favorable initialization for the 3D convolution networks. To ensure feasible inflation from 2D weights to 3D models, the 2D and 3D networks must have the same basic structure except for the additional third dimension (e.g., the 3 × 3 and 3 × 3 × 3 convolutions should have the same number of channels). To our knowledge, this technique has not been applied to medical image analysis for effective 3D medical image generation.
Considering the application differences between video action recognition and 3D medical image generation, we propose five customized inflation strategies to facilitate the training of 3D StyleGAN2. Here, we set the size of convolution weights to be 3 × 3 × 3, but our strategies can be easily adapted to other weight sizes. Let $w_2 \in \mathbb{R}^{C_I \times C_O \times 3 \times 3}$ denote the pre-trained 2D convolution weights and $w_3 \in \mathbb{R}^{C_I \times C_O \times 3 \times 3 \times 3}$ denote the corresponding 3D convolution weights ($C_I$ represents the number of input channels, $C_O$ represents the number of output channels). At first, we initialize $w_3$ from a random Gaussian $\mathcal{N}(0, 0.1)$. Then, the weight $w_3$ is modified by the following inflation strategies:
• Inflate-1: only inflating 1 center dimension.
w_3[:, :, i, : [...]

By design, different inflation strategies offer diverse ways of reusing the 2D weights: the reusing degree increases from Inflate-1 to Inflate-3; Inflate-ASC considers the anatomical views; Inflate-NWI modifies Inflate-1 with more attention on the center dimension and negative weights on others. Intuitively, these inflation strategies introduce helpful 2D structure priors through weight initialization, which significantly reduces the training burden of the 3D convolution weights. In this way, by focusing more on the third dimension (for anatomy learning), the generative model is able to generate high-quality 3D images quickly and efficiently.
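As an illustration of Inflate-1 ("only inflating 1 center dimension"), the following NumPy sketch copies a 2D kernel into the centre slice of a randomly initialised 3D kernel. Several details are assumptions rather than statements from the text: the new dimension is placed at axis 2 with centre index 1, the value 0.1 is treated as a standard deviation, and the helper name `inflate_1` is hypothetical. The remaining variants are not sketched.

```python
import numpy as np

def inflate_1(w2, std=0.1, seed=0):
    """Initialise 3D weights by filling only the centre slice with 2D weights.

    w2: pre-trained 2D weights, shape (C_I, C_O, 3, 3)
    returns w3 with shape (C_I, C_O, 3, 3, 3)
    """
    c_in, c_out, kh, kw = w2.shape
    rng = np.random.default_rng(seed)
    # Random Gaussian initialisation; reading 0.1 as the standard deviation
    # is an assumption of this sketch.
    w3 = rng.normal(0.0, std, size=(c_in, c_out, 3, kh, kw))
    # Copy the 2D kernel into the centre index of the added dimension.
    w3[:, :, 1, :, :] = w2
    return w3

# Hypothetical stand-in for weights taken from a pre-trained 2D GAN layer.
w2 = np.random.default_rng(1).normal(size=(16, 16, 3, 3))
w3 = inflate_1(w2)
```

With this initialisation, the centre slice of every 3D kernel carries the pre-trained 2D kernel, while the two outer slices start from small random values and have to be learned from the 3D data.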

Efficient 3D Architecture Design
Although the inflation strategy mitigates the lack of 3D data in training the 3D GAN model, the model still suffers from a large number of parameters. In this section, we address this issue from the perspective of efficient 3D architecture design.
We observe that most of the parameters in the 3D neural network architecture originate from the 3D convolution operation, which extends the 2D convolution weights to 3D to model the 3D contextual and anatomical structures (e.g., lifting 3 × 3 weights to 3 × 3 × 3). Existing efficient 3D architecture designs mainly focus on parameter-efficient 3D convolution. On the one hand, factorized high-order CNNs are proposed with different tensor decomposition algorithms such as Tensor-Train (TT) (Novikov et al., 2015), CP decomposition (Lebedev et al., 2014; Kossaifi et al., 2020), and Tucker decomposition (Kim et al., 2015). These methods compress networks and reduce their parameters by applying low-rank tensor decompositions to high-order weights. On the other hand, driven by the requirements of mobile devices, various parameter-efficient convolution variants have been devised and combined into efficient architectures, such as group convolution (Krizhevsky et al., 2012), bottleneck (He et al., 2016), and depthwise separable convolution (Howard et al., 2017). Group convolution divides the channels into groups and performs convolution only within each group. Bottleneck was introduced in ResNet (He et al., 2016) to reduce the number of channels of the 3 × 3 convolution by wrapping it with two 1 × 1 convolutions. With depthwise separable convolutions, the standard convolutions are factorized into a depthwise convolution (i.e., a group convolution with a group number equal to the channel number) followed by a 1 × 1 pointwise convolution.
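To make the savings of these modules concrete, the sketch below counts the weights of their 3D versions (biases ignored). The channel width of 32 matches the setting used later in the paper; the bottleneck width of 8 and all function names are hypothetical choices for illustration.

```python
def conv3d_params(c_in, c_out, k=3, groups=1):
    """Weight count of a (grouped) 3D convolution, biases ignored."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k ** 3

def depthwise_separable_params(c_in, c_out, k=3):
    """Depthwise k*k*k convolution followed by a 1x1x1 pointwise convolution."""
    depthwise = conv3d_params(c_in, c_in, k, groups=c_in)
    pointwise = conv3d_params(c_in, c_out, k=1)
    return depthwise + pointwise

def bottleneck_params(c, c_mid, k=3):
    """1x1x1 reduce, k*k*k convolution at reduced width, 1x1x1 expand."""
    return (conv3d_params(c, c_mid, k=1)
            + conv3d_params(c_mid, c_mid, k)
            + conv3d_params(c_mid, c, k=1))

standard = conv3d_params(32, 32)                 # 32 * 32 * 27
grouped = conv3d_params(32, 32, groups=2)        # half of the standard count
separable = depthwise_separable_params(32, 32)   # 32 * 27 + 32 * 32
bottleneck = bottleneck_params(32, 8)            # 256 + 8 * 8 * 27 + 256
```

The counts show why these modules are attractive for 3D networks: the cubic kernel term dominates the standard convolution, and each design removes most of it in a different way.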
The above designs were developed to improve the efficiency of various applications, e.g., HO-CPConv (Kossaifi et al., 2020) for spatiotemporal facial emotion analysis, MobileNet (Howard et al., 2017) for image classification and object detection, and 3D-MobileNet (Kopuklu et al., 2019) for video action recognition. Despite this, efficient 3D GAN architectures are rarely studied, especially for the state-of-the-art StyleGAN2 model. We attribute this to two possible reasons: (1) The StyleGAN2 architecture is more delicate than classification models, hindering the straightforward adoption of existing modules such as tensor decomposition, group convolution, or depthwise separable convolution. Specifically, the style vectors are absorbed in the modulation and demodulation operations (Equations 1, 2), which sets up hurdles for existing modules. (2) As a result of the interplay between the discriminator and generator, GAN training is difficult. It is non-trivial to directly use the same modules for the discriminator and generator to achieve the best performance.
Based on the above analysis, we propose a unique design customized to the StyleGAN2 model for parameter-efficient generation (Figure 3). For the generator, since the direct adoption of existing efficient modules (e.g., group convolution, depthwise separable convolution) will break the entangled structure of the convolution weights and style vectors (Equations 1, 2), we equally Split the feature maps, using the channel split operation, to create two branches. The modulation and demodulation operations for the style vectors and convolution weights are individually applied to each branch. Afterwards, the outputs of the two branches are concatenated, and the Channel Shuffle operation is performed, allowing information to be shared between the two branches. If the Channel Shuffle operation is not performed, the generator is effectively two independent networks. Channel Shuffle enables hybrid and diverse pattern combinations across branches to improve image generation quality. For the discriminator, as neither modulation nor demodulation is applied, there is more flexibility in improving the design. Therefore, we devised two asymmetric branches with 3 × 3 × 3 and 1 × 1 × 1 convolutions. The 1 × 1 × 1 convolution leads to a further parameter reduction while compromising the local spatial structure. But this is rectified by the Channel Shuffle operation, which exchanges information by shuffling the feature maps of the 1 × 1 × 1 and 3 × 3 × 3 convolutions.
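The split, per-branch processing, concatenation and shuffle steps can be sketched as follows. This is a structural illustration only: the branches are passed in as plain callables (identity functions in the example) standing in for the modulated 3D convolutions of the generator or the 3 × 3 × 3 and 1 × 1 × 1 convolutions of the discriminator, and the interleaving shuffle follows the common ShuffleNet-style reshape and transpose, which is an assumption about the exact permutation used.

```python
import numpy as np

def channel_shuffle(x, groups=2):
    """Interleave the channels of `groups` concatenated branches. x: (C, D, H, W)."""
    c = x.shape[0]
    assert c % groups == 0
    x = x.reshape(groups, c // groups, *x.shape[1:])
    x = np.swapaxes(x, 0, 1)          # (C // groups, groups, D, H, W)
    return x.reshape(c, *x.shape[2:])

def split_and_shuffle(x, branch_a, branch_b):
    """Split channels into two halves, process each branch, concatenate, shuffle."""
    half = x.shape[0] // 2
    out_a = branch_a(x[:half])
    out_b = branch_b(x[half:])
    return channel_shuffle(np.concatenate([out_a, out_b], axis=0), groups=2)

# Channel c is filled with the value c so that the permutation is visible.
x = np.arange(8, dtype=float).reshape(8, 1, 1, 1) * np.ones((8, 2, 2, 2))
y = split_and_shuffle(x, lambda t: t, lambda t: t)
order = y[:, 0, 0, 0].astype(int).tolist()   # [0, 4, 1, 5, 2, 6, 3, 7]
```

After the shuffle, each half of the channels that feeds the next layer's two branches contains features from both previous branches, which is what prevents the two branches from behaving as independent networks.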
Considering C = 32 in Figure 3, both the input and output feature maps are of size 32 × H × W × D. Without splitting, the size of the convolution weight is 32 × 32 × 3 × 3 × 3. With splitting, the input map is split into two 16 × H × W × D branches, each undergoing a 3D convolution (weight size 16 × 16 × 3 × 3 × 3). The outputs of the two branches are concatenated to get the feature map of size 32 × H × W × D. So, the total size of the convolution weight is 2 × 16 × 16 × 3 × 3 × 3, which means the generator enjoys a parameter reduction by a factor of 2. Similarly, the discriminator has a total weight size of 16 × 16 × 3 × 3 × 3 + 16 × 16 × 1 × 1 × 1, which means it enjoys a parameter reduction by a factor of nearly 4 (3.857).
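Written out, the arithmetic behind the two reduction factors for C = 32 is:

```latex
\begin{align*}
\text{Generator:}\quad
  \frac{32 \cdot 32 \cdot 3^3}{2 \cdot 16 \cdot 16 \cdot 3^3}
  &= \frac{27648}{13824} = 2, \\
\text{Discriminator:}\quad
  \frac{32 \cdot 32 \cdot 3^3}{16 \cdot 16 \cdot 3^3 + 16 \cdot 16 \cdot 1^3}
  &= \frac{27648}{6912 + 256} = \frac{27648}{7168} \approx 3.857.
\end{align*}
```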
We also propose several other parameter-efficient convolution architectures as baselines to verify the Split&Shuffle design's effectiveness. As a result of the modulation and demodulation constraint in the generator, several baselines only modify the discriminator (i.e., D only). All the model variants are listed below:
• Group Convolution (D only), which replaces the convolution in the discriminator with a group convolution.
• Depthwise Separable Convolution (D only), which replaces the convolution in the discriminator with a depthwise separable convolution.
• Split&Shuffle Convolution (D only), which applies the Split&Shuffle module only on the discriminator.
• Split Convolution, which applies the channel split without channel shuffle.
• Split&Shuffle Convolution, which is our final design.

Datasets
The Stanford AIMI Coronary Calcium (COCA) (Stanford, 2022) dataset is used for heart CTCA images. COCA contains 787 3D coronary CT images. Each 3D image has a different number (ranging from 27 to 156) of 2D slices on the axial plane, and they add up to 39,281 axial slices in total. For brain MRI images, we used the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset (Weiner et al., 2017). Specifically, we used 991 T1 structural images from the Cognitively Normal (CN) research group. MR images from non-brain areas were removed by the dataset provider using the recon-all function of the FreeSurfer software (https://surfer.nmr.mgh.harvard.edu/fswiki). The processed MR images have 256 slices from all three planes.

Evaluation Metric
In order to assess the quality of generated images, GANs usually use the Fréchet inception distance (FID) (Heusel et al., 2017) metric, which compares the feature distributions of real and generated images. By default, the Inception V3 network (Szegedy et al., 2016) pre-trained with 2D natural images is deployed for feature extraction. However, the method cannot be directly applied to 3D images. As such, taking into account the 3D medical structure, we measured the FID scores on the center slices of the axial, sagittal and coronal planes, i.e., FID-ax, FID-sag, and FID-cor. Lastly, we averaged the three FID scores to obtain FID-avg as an overall measurement.
However, the FID alone cannot evaluate 3D medical image generation comprehensively. The reasons are twofold: (1) the inception model only accepts 2D image slices of a specific plane, and (2) the inception model is pre-trained on natural images, which differ substantially from medical images. Therefore, we also adopted the widely-used metrics MS-SSIM and PSNR, and the t-Distributed Stochastic Neighbour Embedding (t-SNE), to evaluate the performance.
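For reference, the FID and the per-plane average described above can be written as follows. The FID expression is the standard definition from Heusel et al. (2017); the symbols $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ (mean and covariance of the Inception features of real and generated slices) are notation introduced here for illustration.

```latex
\begin{align*}
\mathrm{FID} &= \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right), \\
\mathrm{FID\text{-}avg} &= \tfrac{1}{3}\left( \mathrm{FID\text{-}ax} + \mathrm{FID\text{-}sag} + \mathrm{FID\text{-}cor} \right).
\end{align*}
```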

Pre-processing
Both the COCA and ADNI images were resized to 64 × 64 and 128 × 128 using bilinear interpolation in the sagittal and coronal planes. We aligned the axial slice number of the COCA dataset to 32 either by consecutive slice sampling or by zero padding. For the ADNI dataset, we resized the axial slice number to 64 directly. As a result, the image resolutions of the COCA and ADNI datasets are 32 × 64 × 64 (32 × 128 × 128) and 64 × 64 × 64, respectively. The Hounsfield Unit values were clipped to [−250, 650] and then normalized to [−1, 1].
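The intensity normalisation and axial slice alignment can be sketched as below. The text specifies only the clipping range, the target range and the two alignment options; the linear mapping, the choice of a central window of consecutive slices, padding at the end of the volume, and both function names are assumptions of this sketch.

```python
import numpy as np

def normalize_hu(volume, lo=-250.0, hi=650.0):
    """Clip Hounsfield Unit values to [lo, hi] and map them linearly to [-1, 1]."""
    clipped = np.clip(volume, lo, hi)
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0

def fit_axial_slices(volume, target=32):
    """Bring a (D, H, W) volume to `target` axial slices.

    Volumes with enough slices keep a window of consecutive slices (a central
    window is assumed here); shorter volumes are zero padded at the end.
    """
    d = volume.shape[0]
    if d >= target:
        start = (d - target) // 2
        return volume[start:start + target]
    pad = np.zeros((target - d,) + volume.shape[1:], dtype=volume.dtype)
    return np.concatenate([volume, pad], axis=0)

hu = np.array([-1000.0, -250.0, 200.0, 650.0, 3000.0])
scaled = normalize_hu(hu)                       # [-1, -1, 0, 1, 1]
short = fit_axial_slices(np.ones((27, 4, 4)))   # padded up to 32 slices
long = fit_axial_slices(np.ones((156, 4, 4)))   # cropped down to 32 slices
```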

StyleGAN2 architecture
The base layer (Const in Figure 2) for the COCA dataset was set to 2 × 4 × 4, followed by 5 upsampling stages. For the ADNI dataset, the base layer was set to 4 × 4 × 4, followed by 5 upsampling stages. The total number of convolution channels was set to 32, with each branch having 16. The feature dimension of the mapping network was set to 64.

Hyper-parameters
We followed the StyleGAN2 configuration for training, with the following exceptions: γ = 0.0512 for R 1 regularization, minibatch = 128, learning rate lr = 0.0025 for training from scratch, lr = 0.002 for inflation initialization.For 2D GANs pre-training, the models are fed with 25, 000K images in total.However, the 3D GANs were trained with 5, 000K images since 3D models are slow to train.The first step is to conduct comprehensive experiments to verify that inflation strategies are effective for initializing the 3D generative model.A 2D StyleGAN2 model is pre-trained using all the 39, 281 axial slices to obtain the 2-dimensional convolution weights.Since the number of images is sufficient, the 2D model achieves an FID of 7.71.This means that the pre-trained 2D weights capture rich slice-level contextual information to generate high-quality 2D slices.Thus, it verifies the rationale behind inflating 2D pre-trained weights for 3D generative models.
Starting from the same pre-trained 2D weights, we apply all the proposed inflation strategies as well as a "No Inflate" baseline to initialize and train the 3D StyleGAN2 model. The results are shown in Table 1. It can be seen that three inflation strategies outperform the "No Inflate" baseline (the other two are comparable), indicating that inflation strategies are generally effective as favorable initialization methods for 3D generative models. "Inflate-1", which only initializes one center dimension, achieves the best performance (FID-avg) among all inflation variants. Performance gradually degrades as more weights are initialized from 2D weights ("Inflate-2" and "Inflate-3"). We hypothesise that overly inflated 3D weights prevent the model from freely learning the third-dimensional anatomical structure. "Inflate-ASC" and "Inflate-NWI" perform slightly worse than "Inflate-1".
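To make the "Inflate-1" idea concrete (only the center dimension of the 3D kernel is initialized from the 2D weights), a minimal NumPy sketch is given below. The function name `inflate_center`, the (out, in, depth, height, width) weight layout, and the choice to leave the remaining depth slices at their initial values are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def inflate_center(w2d, w3d_init):
    """Copy a 2D kernel into the center depth slice of a 3D kernel.

    w2d:      (out_ch, in_ch, kH, kW) pre-trained 2D weights
    w3d_init: (out_ch, in_ch, kD, kH, kW) initial 3D weights (e.g. random)
    Assumption: the non-center depth slices keep their initial values.
    """
    if w3d_init.shape[:2] != w2d.shape[:2] or w3d_init.shape[3:] != w2d.shape[2:]:
        raise ValueError("2D and 3D kernel shapes are incompatible")
    w3d = w3d_init.copy()
    center = w3d.shape[2] // 2  # middle index along the depth axis
    w3d[:, :, center] = w2d
    return w3d


rng = np.random.default_rng(0)
w2d = rng.standard_normal((16, 16, 3, 3))        # stand-in for pre-trained weights
w3d_init = rng.standard_normal((16, 16, 3, 3, 3))  # stand-in for random 3D init
w3d = inflate_center(w2d, w3d_init)
```

Variants that copy the 2D kernel into more depth slices would follow the same pattern with a wider slice assignment.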
To understand the training dynamics of various inflation strategies, we plot the FID scores with respect to the training iteration (measured in units of 1,000 training images, i.e., kimgs) in Figure 4. In general, "Inflate-1" achieves the best performance during training, consistent with Table 1. Most inflation variants achieve significantly better FID scores than the baseline at the start of training (i.e., kimgs=0), demonstrating the benefit of initializing from pre-trained 2D weights. "Inflate-NWI" initially performs worse because it modifies the original 2D weights, but it eventually outperforms the baseline, since the informative 2D weights still provide a useful prior.
In Figure 5, we randomly generated the axial/sagittal/coronal slices of CT images for both the "No Inflate" baseline and our best variant "Inflate-1" to intuitively investigate why the inflation strategy works. Specifically, we show the generated images from the initial training iteration (i.e., the dashed lines in Figure 4) and the best training iteration (i.e., the red arrows in Figure 4). It is easy to observe that with inflation as a favorable initialization, the generated images already show meaningful anatomical structures even before training (e.g., FID-ax=201). In contrast, the randomly initialized "No Inflate" generates blurry, meaningless images before training (FID-ax=440). This comparison provides an intuitive explanation for the working mechanism of the inflation strategy: with effective inflation, the 3D generative model can inherit meaningful 2D anatomical priors for better subsequent training. Furthermore, starting from better initial weights, the inflated model ("Inflate-1") is trained to achieve superior generative performance compared to the "No Inflate" baseline (e.g., FID-ax=60 vs FID-ax=104).
StyleGAN2's discriminator and generator have different structures by design. This motivates us to examine how inflation affects the discriminator and the generator. Specifically, we selected the three best inflation variants ("Inflate-1", "Inflate-ASC", and "Inflate-NWI") and performed three sets of experiments (Table 2): inflating the generator only (G), inflating the discriminator only (D), and inflating both the generator and discriminator (G&D, default). This analysis reveals two observations: (1) inflating the entire model (G&D) always achieves the best performance, and (2) the generator plays a more important role in the inflation strategy, which is reasonable because the generator is responsible for generating images, with the style vectors controlling the generation.

At first, all model variants, including the baseline, are trained from scratch without weight inflation. The FID values and the number of parameters are shown in Table 3. The models with a "-D" suffix apply efficient modules only on the discriminator, and thus reduce the number of parameters only slightly (e.g., 0.434M vs. 0.600M). "Group-D", "Depthwise-D", and "Split&Shuffle-D" have different numbers of parameters because the architectures are different. By contrast, the proposed "Split&Shuffle" design removes more than half of the parameters compared to the baseline (0.291M vs 0.600M).

Fig. 6: Generated images of higher resolution (128 × 128) from the Baseline and our Split&Shuffle method. The Baseline sometimes shows anatomically inconsistent regions (marked in red boxes). Our method can generate images with better anatomical structure and possible calcium slices (marked in blue boxes), which will be helpful for downstream tasks.
With regard to generation performance, the proposed "Split&Shuffle" architecture achieved a better FID-avg with the fewest parameters, demonstrating its efficiency and effectiveness. All the "-D" models produce similar results to the baseline, indicating that modifying just the discriminator has little influence on the generation quality. Note that, despite having the same number of parameters, "Split" performs much worse than the "Split&Shuffle" design. This indicates that the Channel Shuffle in Figure 3 plays a crucial role in ensuring performance.
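The parameter saving and the shuffle step discussed above can be illustrated with a small NumPy sketch. It assumes two equal, independent branches and a ShuffleNet-style reshape-and-transpose shuffle; the function names and the 3 × 3 × 3 kernel size are hypothetical, and the actual modules in Figure 3 also include Mod&Demod and 1 × 1 × 1 convolutions that are not modeled here.

```python
import numpy as np


def conv3d_params(in_ch, out_ch, k=3):
    """Weight count of a k x k x k 3D convolution (bias ignored)."""
    return in_ch * out_ch * k ** 3


def split_params(channels, branches=2, k=3):
    """Weight count when channels are split into equal, independent branches."""
    per_branch = channels // branches
    return branches * conv3d_params(per_branch, per_branch, k)


def channel_shuffle(x, groups=2):
    """Interleave channels across groups; x has shape (channels, D, H, W)."""
    c = x.shape[0]
    if c % groups:
        raise ValueError("channels must be divisible by groups")
    x = x.reshape(groups, c // groups, *x.shape[1:])
    x = x.swapaxes(0, 1)  # (channels_per_group, groups, D, H, W)
    return x.reshape(c, *x.shape[2:])


full = conv3d_params(32, 32)   # one convolution over all 32 channels
split = split_params(32)       # two branches of 16 channels each

# Tag each channel with its index to see where the shuffle sends it.
tagged = np.arange(4).reshape(4, 1, 1, 1) * np.ones((4, 2, 2, 2))
order = channel_shuffle(tagged)[:, 0, 0, 0].astype(int).tolist()
```

With 32 channels split into two branches of 16, the split convolution holds half the weights of the full one, in line with the factor of 2 stated for the Generator module. Without the shuffle the two branches never exchange information, which is one plausible reading of why "Split" alone performs much worse.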
Then, we examined how efficient architectures can be combined with inflation strategies for further performance enhancement. Specifically, we adopted the best inflation strategy "Inflate-1" and applied it to all the models in Table 3. The results are shown in Table 4. In this experiment, each individual model has its own pre-trained 2D weights due to the differences in architecture. Table 4 demonstrates that all models achieved good FID-2D values, once again verifying the rationale for inflating informative 2D weights to train 3D GANs. Since the number of 2D image slices is sufficient for 2D pre-training, "Baseline", with the largest number of parameters, achieved the best FID-2D. However, for the final 3D training, our "Split&Shuffle" model achieves the best performance (FID-avg) with the fewest parameters. Compared with the "Baseline" model, "Split&Shuffle" reduces the FID-avg by 14.7 with only 48.5% of the parameters. The discriminator-only variants with the suffix "-D" were much worse than the "Inflate-1" baseline. Without the channel shuffle operation, "Split" achieved the worst performance, again showing the indispensable role of channel shuffle in our architecture design.

Image Resolution
To show that our method can generate high-resolution images for practical application, we increased the resolution of COCA to 128 × 128. We set the base layer to 1 × 4 × 4, followed by six upsampling stages. The model capacity was also increased by using 64 convolution channels.

The quantitative results are shown in Table 5. Compared with the Baseline (same as Table 3), Split&Shuffle achieves much better performance with fewer model parameters. The generated image slices are visualized in Figure 6. The slices generated by our method show a more plausible heart anatomy and higher image quality. In addition, since COCA contains coronary calcium, our method generated image slices with possible calcium, which are more realistic.

Comparison with State-of-the-art Methods
Finally, we compared the performance of our method with the published 3D generative models in Table 6 (COCA) and Table 7 (ADNI). The comparison methods included the following 3D generation baselines:
• 3D-WGAN-GP (Gulrajani et al., 2017), a 3D extension of Wasserstein GAN with Gradient Penalty to alleviate training instability.
• 3D-α-GAN (Rosca et al., 2017), which applies a code discriminator and an encoder on top of conventional GANs to alleviate mode collapse and blurriness.
On both the heart and brain datasets, the proposed method outperforms all the state-of-the-art methods by a large margin, demonstrating its effectiveness. Among all comparison methods, the first four baselines have a much greater number of parameters than our method but achieved inferior performance. Although 3D-StyleGAN2 has approximately twice as many parameters as our method, it still performed much worse. Because 3D medical image generation lacks sufficient 3D training images, most baselines cannot be trained effectively, leading to inferior performance. As an alternative, our method initializes the 3D model from inflated 2D weights pre-trained on abundant image slices.

Generation quality and generation diversity
In addition to FID scores, we considered two widely used evaluation metrics: PSNR and MS-SSIM. Specifically, PSNR is calculated between the real and generated images to evaluate the generation quality. Following Kwon et al. (2019), MS-SSIM is calculated on pairs of generated images to evaluate the generation diversity (a smaller value means better diversity). The results are shown in Table 8. Our method achieved the best PSNR and MS-SSIM performance on both the heart and brain datasets, demonstrating that our method can generate high-quality 3D medical images with better diversity. We also note that the PSNR scores on the heart dataset are lower than those on the brain dataset. This is due to the large variations among the unaligned heart images. In contrast, the brain images are aligned and exhibit smaller variations.
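For reference, PSNR between a real and a generated volume can be computed as in the following minimal NumPy sketch. The function name and the default `data_range=2.0` (matching images normalized to [−1, 1]) are assumptions; how real and generated volumes are paired for this metric is not specified in the text and is left to the caller.

```python
import numpy as np


def psnr(real, generated, data_range=2.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape.

    data_range is the width of the valid intensity interval
    (2.0 for images scaled to [-1, 1]).
    """
    real = np.asarray(real, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((real - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical volumes
    return float(10.0 * np.log10(data_range ** 2 / mse))


real = np.zeros((4, 8, 8))
fake = np.full((4, 8, 8), 0.2)  # constant offset of 0.2 from the real volume
score = psnr(real, fake)
```

Higher values indicate that the generated intensities lie closer to the reference volume. MS-SSIM is more involved and is not sketched here.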
t-Distributed Stochastic Neighbour Embedding (t-SNE)
To better understand the distributions of generated and real images, we performed t-SNE on real images, our method, 3D-α-WGAN-GP, and 3D-StyleGAN2. The visualization of the COCA dataset is shown in Figure 7. Although the distributions of both our method and 3D-α-WGAN-GP approach the real images, our method is closer to the real images. The distribution of 3D-StyleGAN2 is far from the real images, consistent with its large FID score. In the visualization of the ADNI dataset in Figure 8, only our method shows a distribution similar to that of the real images.

Figure 9 shows examples of brain slices generated using all comparison methods. Our method generates high-quality brain slices on three planes. Because of the random weight initialization, the comparison methods generate random images without any meaningful patterns at the start of the training process. Because it inherits weights trained on sufficient 2D axial slices, our model exhibits good anatomy right at the beginning. Combining our Split&Shuffle design with this anatomy prior allows our model to generate better results with fewer parameters.

Conclusion
The purpose of our study was to address the important problem of generating reliable synthetic 3D medical images. The lack of annotated 3D data and inefficient parameter settings hinder the effective training of 3D medical generative models. A novel GAN model (i.e., 3D Split&Shuffle-GAN) is proposed to remedy these problems from two perspectives: training strategy and network architecture. For the training strategy, we pre-trained a 2D GAN model and inflated its 2D convolution weights as a favorable initialization for a 3D GAN model. For the network architecture, we devised parameter-efficient Channel Split&Shuffle modules for the discriminator and generator of the GAN. We conducted comprehensive experiments to determine the best weight inflation variant and network architecture design. The effectiveness of our method is verified on both the heart and brain datasets. Further exploration of network weight initialization strategies beyond inflation and the design of new architectures is left for future work.

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Girish Dwivedi reports a relationship with Artrya Pty Ltd. that includes: consulting or advisory and equity or stocks.

Fig. 1: The general pipeline of our 3D generative model, which includes our contribution to both the training strategy (inflate 2D weights, colored in blue) as well as the network architecture (Split&Shuffle GAN, colored in orange).

Fig. 2: The overall architecture of the proposed 3D Split&Shuffle-GAN. It is composed of a Mapping Network, a Generator, and a Discriminator. The Mapping Network maps a latent variable z to the style vector space W and produces the per-layer style vectors with a learned affine transform (A). The Generator starts from a constant 3D input. Then it controls the 3D generation styles with the per-layer style vectors and adds details from the per-channel scaled (B) noise input. Weight Mod&Demod incorporates the style vector into the convolution operation ($w_3^{(1)}, w_3^{(2)}, w_3^{(3)}, \ldots$ represent the 3D convolution weights for the first, second, third, ... layers). The Discriminator tries to differentiate the real 3D images from the generated fake 3D images. Inside both the Generator and Discriminator, we devise novel Channel Split&Shuffle modules for parameter-efficient 3D convolution operations, which are customized for the style-based generation framework.

Fig. 3: The proposed Channel Split&Shuffle Convolution Modules for the Generator and Discriminator. The Discriminator module has two differences from the Generator module: the style Mod&Demod and 1 × 1 × 1 Convolution. Overall, the proposed module reduces the number of parameters by a factor of 2 in the Generator and by nearly a factor of 4 in the Discriminator. Here, $w_{11}$ and $w_{12}$ denote the 3D convolution weights of the left and right branches, which are used to perform the Mod&Demod operation in Equations 1-2.

Fig. 5: Generated images from the initial (dashed lines in Fig. 4) and the best (red arrows in Fig. 4) training iterations. (a) With random weights as an initialization (No Inflate), the initially generated images are meaningless. (b) With inflated weights as a favorable initialization (Inflate-1), the initially generated images already show basic anatomical structures.

Fig. 9: Generated images from the initial and the best training iterations on the brain ADNI dataset. Our method shows good initial brain anatomy, especially for the axial view.

Table 1: Performance of different inflation variants (lower numbers are better).

Table 3: Parameter and performance comparison on different architectures. Models with "-D" mean only the discriminator owns the modified architecture. All model variants, including the baseline, are trained from scratch without weight inflation.

Table 4: Model performance when the inflation strategy ("Inflate-1") is applied to different architectures. FID-2D denotes the pre-trained 2D model performance.

Table 5: Model performance comparison of higher resolution (128 × 128) images on the COCA dataset. Baseline is the same as in Table 3.

Table 6: Comparison with state-of-the-art methods on the COCA (heart) dataset.

Table 7: Comparison with state-of-the-art methods on the ADNI (brain) dataset.

Table 8: PSNR and MS-SSIM evaluations on the ADNI (brain) and COCA (heart) datasets. PSNR was calculated between real and generated images to evaluate generation quality (higher values are better). MS-SSIM was calculated on pairs of generated images to evaluate generation diversity (lower values are better).