Data synthesis and adversarial networks: A review and meta-analysis in cancer imaging

Despite technological and medical advances, the detection, interpretation, and treatment of cancer based on imaging data continue to pose significant challenges. These include inter-observer variability, class imbalance, dataset shifts, inter- and intra-tumour heterogeneity, malignancy determination, and treatment effect uncertainty. Given the recent advancements in Generative Adversarial Networks (GANs), data synthesis, and adversarial training, we assess the potential of these technologies to address a number of key challenges of cancer imaging. We categorise these challenges into (a) data scarcity and imbalance, (b) data access and privacy, (c) data annotation and segmentation, (d) cancer detection and diagnosis, and (e) tumour profiling, treatment planning and monitoring. Based on our analysis of 164 publications that apply adversarial training techniques in the context of cancer imaging, we highlight multiple underexplored solutions with research potential. We further contribute the Synthesis Study Trustworthiness Test (SynTRUST), a meta-analysis framework for assessing the validation rigour of medical image synthesis studies. SynTRUST is based on 26 concrete measures of thoroughness, reproducibility, usefulness, scalability, and tenability. Based on SynTRUST, we analyse 16 of the most promising cancer imaging challenge solutions and observe a high validation rigour in general, but also several desirable improvements. With this work, we strive to bridge the gap between the needs of the clinical cancer imaging community and the current and prospective research on data synthesis and adversarial networks in the artificial intelligence community.


The burden of cancer and early detection
The evident improvement in global cancer survival in the last decades is arguably attributable not only to health care reforms, but also to advances in clinical research (e.g., targeted therapy based on molecular markers) and diagnostic imaging technology, e.g., whole-body magnetic resonance imaging (MRI) (Messiou et al., 2019) and positron emission tomography-computed tomography (PET-CT) (Arnold et al., 2019). Nonetheless, cancers still figure among the leading causes of morbidity and mortality worldwide (Ferlay et al., 2015), with approximately 9.6 million cancer-related deaths in 2018 (World Health Organization, 2018). The most frequent causes of cancer death worldwide in 2018 were lung (1.76 million), colorectal (0.86 million), stomach (0.78 million), liver (0.78 million), and breast (0.63 million) cancer (World Health Organization, 2018). These figures are likely to continue increasing as a consequence of the ageing and growth of the world population (Jemal et al., 2011).
A large proportion of the global burden of cancer could be prevented through treatment and early detection (Jemal et al., 2011). For example, an early detection can provide the possibility to treat a tumour before it acquires critical combinations of genetic alterations (e.g., metastasis with evasion of apoptosis; Hanahan and Weinberg, 2000). Solid tumours become detectable by medical imaging modalities only at an approximate size of 10⁹ cells (≈ 1 cm³) after evolving from a single neoplastic cell, typically following a Gompertzian (Norton et al., 1976) growth pattern (Frangioni, 2008). To detect and assess such tumours, clinicians inspect, normally by visual assessment, medical imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT), ultrasound (US), X-ray mammography (MMG), and PET (Frangioni, 2008; Itri et al., 2018; McCreadie and Oliver, 2009).
Medical imaging data evaluation is time-demanding and therefore costly in nature. In addition, new technologies (e.g., digital breast tomosynthesis; Swiecicki et al., 2021) continue to become available, and studies generally show an extensive increase in analysable imaging volumes (McDonald et al., 2015). Also, the diagnostic quality in radiology varies and is very much dependent on the personal experience, skills and invested time of the data examiner (Itri et al., 2018; Elmore et al., 1994; Woo et al., 2020). Hence, to decrease cost and increase quality, automated or semi-automated diagnostic tools can be used to assist radiologists in the decision-making process. Such diagnostic tools comprise traditional machine learning, but also recent deep learning methods, which promise an immense potential for detection performance improvement in radiology.

The promise of deep learning and the need for data
The rapid increase in graphics processing unit (GPU) processing power has allowed training deep learning algorithms such as convolutional neural networks (CNNs) (Fukushima, 1980; LeCun et al., 1989, 1998) on large image datasets, achieving impressive results in Computer Vision (Cireşan et al., 2012; Krizhevsky et al., 2012) and Cancer Imaging (Cireşan et al., 2013). In particular, the success of AlexNet in the 2012 ImageNet challenge (Krizhevsky et al., 2012) triggered an increased adoption of deep neural networks for a multitude of problems in numerous fields and domains including medical imaging, as reviewed in Shen et al. (2017) and Litjens et al. (2017). Despite the increased use of medical imaging in clinical practice, the public availability of medical imaging data remains limited (McDonald et al., 2015). This represents a key impediment for the training, research, and use of deep learning algorithms in radiology and oncology. Clinical centres refrain from sharing such data for ethical, legal, technical, and financial (e.g., costly annotation) reasons (Bi et al., 2019).
Such cancer imaging data is necessary not only to train deep learning models, but also to provide them with sufficient learning opportunity to acquire robustness and generalisation capabilities. We define robustness as the property of a predictive model to remain accurate despite variations in the input data (e.g., noise levels, resolution, contrast, etc.). We refer to a model's generalisation capability as its property of preserving predictive accuracy on new data from unseen sites, hospitals, scanners, etc. Both of these properties are particularly desirable in cancer imaging considering the frequent presence of biased or unbalanced data with sparse or noisy labels. Both robustness and generalisation are essential to demonstrate the trustworthiness of a deep learning model for usage in a clinical setting, where every edge case needs to be detected and a false negative can potentially cost the life of a patient.

Synthetic cancer imaging data
We hypothesise that the variety of data needed to train robust and well-generalising deep learning models for cancer images can be largely synthetically generated using Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). The adversarial learning scheme in GANs is based on a generator that generates synthetic (alias 'fake') samples of a target distribution trying to fool a discriminator, which classifies these samples as either real or fake. Various papers have provided reviews of GANs in the medical imaging domain (Yi et al., 2019; Kazeminia et al., 2020; Tschuchnig et al., 2020; Sorin et al., 2020; Lan et al., 2020; Singh and Raza, 2020), but they focused on a general presentation of the main methods and possible applications. In cancer imaging, however, there are specificities and challenges that call for specific implementations and solutions based on GANs and the adversarial learning scheme at large, including: (i) the small size and complexity of cancerous lesions; (ii) the high heterogeneity between tumours within as well as between patients and cancer types; (iii) the difficulty of annotating, delineating and labelling cancer imaging studies at large scale; (iv) the high data imbalance, in particular between healthy and pathological subjects or between benign and malignant cases; and (v) the difficulty of gathering large consented datasets from highly vulnerable patients undergoing demanding care plans. (Fig. 2 gives an overview of the most common organs and modalities targeted by the surveyed cancer imaging publications; a histogram showing the number of papers per modality and per organ can be found in Fig. 15.) Hence, the present paper contributes a unique perspective and comprehensive analysis of adversarial networks attempting to address the specific challenges in the cancer imaging domain. To the authors' best knowledge, this is the first survey that exclusively focuses on GANs and adversarial training in cancer imaging.
In this context, we define cancer imaging as the entirety of approaches for research, diagnosis, and treatment of cancer based on medical images. Our survey comprehensively analyses cancer imaging GAN and adversarial training applications focusing on radiology modalities. As presented in Fig. 2, we recognise that non-radiology modalities are also widely used in cancer imaging. For this reason, we do not restrict the scope of our survey to radiology, but rather also analyse relevant publications in these other modalities including histopathology and cytopathology (e.g., in Section 4.5), and dermatology (e.g., in Sections 4.3 and 4.4).
Further, our survey uncovers and highlights promising research directions for adversarial networks and image synthesis that can facilitate the sustainable adoption of AI in clinical oncology and radiology.

Section organisation
The remainder of this paper is organised as follows. In Section 2, we introduce the methodology of this review. Section 3 provides an overview of GANs and highlights extensions of the adversarial learning framework relevant to cancer imaging. Section 4 contains the main contribution that encompasses the systematic review of challenges of cancer imaging and potential solutions based on adversarial networks. This organisation is depicted in more detail in Fig. 1.
The different challenges are categorised into groups in Sections 4.1, 4.2, 4.3, 4.4, and 4.5. Each challenge category contains several specific cancer imaging challenges, which we introduce and discuss in Sections 4.1.1-4.5.3. The sections are organised independently, allowing the reader to jump directly to a particular cancer imaging category (4.1-4.5) of interest without requiring context from previous sections. For each of the specific challenges, we survey and discuss potential solutions, as depicted in Fig. 1(a)-(p).
The subsequent Section 5 contains our second core contribution, which consists of the SynTRUST framework for systematic analysis of trustworthiness criteria of image synthesis and adversarial training publications in medical imaging. Based on this framework, we meta-analyse a set of studies selected based on their strong performance and promising methodology for solving a specific cancer imaging challenge.
After learning how and to what extent image synthesis and adversarial training solutions have addressed cancer imaging challenges in the past, we highlight and discuss prospective avenues of future research in the Discussion Section 6 and point out unexploited potential of image synthesis and adversarial networks in cancer imaging.

Review methodology
Our review comprises two comprehensive literature screening processes. The first screening process surveyed the current challenges in the field of cancer imaging with a focus on radiology imaging modalities. After screening and gaining a deepened understanding of AI-specific and general cancer imaging challenges, we grouped these challenges for further analysis into the following five categories.
• Data scarcity and usability challenges (Section 4.1); discussing dataset shifts, class imbalance, fairness, generalisation, domain adaptation and the evaluation of synthetic data.
• Data access and privacy challenges (Section 4.2); comprising patient data sharing under privacy constraints, security risks, and adversarial attacks.
• Data annotation and segmentation challenges (Section 4.3); discussing costly human annotation, high inter- and intra-observer variability, and the consistency of extracted quantitative features.
• Detection and diagnosis challenges (Section 4.4); analysing the challenges of high diagnostic error rates among radiologists, early detection, and detection model robustness.
• Treatment and monitoring challenges (Section 4.5); examining challenges of high inter- and intra-tumour heterogeneity, phenotype to genotype mapping, treatment effect estimation and disease progression.
The second screening process comprised first a generic and then a specific literature search to find all papers that apply adversarial learning (i.e. GANs) to cancer imaging. In the generic literature search, generic search queries such as 'Cancer Imaging GAN', 'Tumour GANs' or 'Nodule Generative Adversarial Networks' were used to recall a high number of papers. The specific search focused on answering key questions of interest to the aforesaid challenges, using queries such as 'Carcinoma Domain Adaptation Adversarial', 'Skin Melanoma Detection GAN', 'Brain Glioma Segmentation GAN', or 'Cancer Treatment Planning GAN'. In Section 4, we map the papers that propose adversarial training and GAN applications applied to cancer imaging (second screening) to the surveyed cancer imaging challenges (first screening). The mapping of these GAN-related papers to challenge categories facilitates analysing the extent to which existing solutions solve the current cancer imaging challenges and helps to identify gaps and further potential for adversarial networks in this field. The mapping is based on the evaluation criteria used in the GAN-related papers and on the relevance of the reported results to the corresponding section. For example, if a GAN generates synthetic data that is used to train and improve a tumour detection model, then this paper is assigned to the detection and diagnosis challenge Section 4.4. If a paper describes a GAN that improves a segmentation model, then this paper is assigned to the segmentation and annotation challenge Section 4.3, and so forth.
To gather the literature (e.g., first papers describing cancer imaging challenges, second papers proposing GAN solutions), we searched medical imaging, computer science and clinical conference proceedings and journals, but also freely on the web using the search engines Google, Google Scholar, and PubMed. After retrieving all papers with a title related to the subject, their abstract was read to filter out non-relevant papers. A full-text analysis was done for the remaining papers to determine whether they were to be included into our manuscript. We analysed the reference sections of the included papers to find additional relevant literature, which also underwent filtering and full-text screening. Applying this screening process, we reviewed and included a total of 164 GAN and adversarial training cancer imaging publications comprising both peer-reviewed articles and conference papers, but also relevant preprints from arXiv and bioRxiv.
Details about these 164 cancer imaging applications can be found in Tables 2-6. The distribution of these publications across challenge category, year, modality, and anatomy is outlined in Fig. 15.
The methodology for deriving and applying the SynTRUST meta-analysis framework, which assesses the validity and trustworthiness of medical image synthesis studies, is provided in Section 5.

Introducing the theoretical underpinnings of GANs
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a type of generative model with a differentiable generator network. GANs are formalised as a minimax two-player game, where the generator network (G) competes against an adversary network called discriminator (D). As visualised in Fig. 3, given a random noise distribution p_z, G generates samples x_g = G(z; θ_G) that D classifies as either real (drawn from the training data, i.e. x ∼ p_data) or fake (drawn from G, i.e. x ∼ p_g). A sample x is either drawn from p_data or from p_g with a probability of 50%. D outputs a value D(x; θ_D) indicating the probability that x is a real training example rather than one of G's fake samples x_g. As defined by Goodfellow et al. (2014), the task of the discriminator can be characterised as binary classification (CLF) of samples x. Hence, the discriminator can be trained using binary cross-entropy, resulting in the following loss function J^(D):

J^(D)(θ_D, θ_G) = -1/2 E_{x∼p_data}[log D(x)] - 1/2 E_{z∼p_z}[log(1 - D(G(z)))]

D's training objective is to minimise J^(D) (or maximise -J^(D)), while the goal of the generator is the opposite (i.e. minimise -J^(D)), resulting in the value function V(D, G) of a two-player zero-sum game between D and G:

min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 - D(G(z)))]

In theory, at convergence, the generator's samples become indistinguishable from the real training data (p_g = p_data) and the discriminator outputs D(x) = 1/2 for any given sample x. As this is a state where neither D nor G can improve further on its objective by changing only its own strategy, it represents a Nash equilibrium (Farnia and Ozdaglar, 2020; Nash et al., 1950). In practice, achieving convergence for this or related adversarial training schemes is an open research problem (Kodali et al., 2017; Mescheder et al., 2018; Farnia and Ozdaglar, 2020).
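These objectives can be illustrated with a minimal NumPy sketch (toy discriminator outputs only; no networks or training loop are involved):

```python
import numpy as np

def value_function(d_real, d_fake):
    # V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
    # d_real: discriminator outputs on real samples; d_fake: on fakes.
    return np.log(d_real).mean() + np.log(1.0 - d_fake).mean()

def discriminator_loss(d_real, d_fake):
    # Binary cross-entropy J^(D): D is rewarded for outputting 1 on
    # real samples and 0 on generated ones.
    return -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean()) / 2.0

# At the theoretical equilibrium the discriminator outputs 1/2 for
# every sample, and the value function equals -2 log 2.
d_half = np.full(4, 0.5)
v_star = value_function(d_half, d_half)
```

The sketch confirms the equilibrium property stated above: when D(x) = 1/2 everywhere, V(D, G) = -2 log 2 and the discriminator's binary cross-entropy loss equals log 2.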

Extensions of the Vanilla GAN methodology
As indicated by Fig. 4, numerous extensions of GANs have been shown to generate synthetic images with high realism (Karras et al., 2017, 2019, 2020; Chan et al., 2020) and under flexible conditions (Mirza and Osindero, 2014; Odena et al., 2017; Park et al., 2018). GANs have been successfully applied to generate high-dimensional data such as images and, more recently, have also been proposed to generate discrete data (Hjelm et al., 2017). Apart from image generation, GANs have also widely been proposed and applied for paired and unpaired image-to-image translation, domain adaptation, data augmentation, image inpainting, image perturbation, super-resolution, and image registration and reconstruction (Yi et al., 2019; Kazeminia et al., 2020; Wang et al., 2019b). Table 1 introduces a selection of common GAN extensions found to be frequently applied to cancer imaging. For each GAN methodology in Tables 1-6, we define the 'Task' describing the application of the respective adversarial network. For instance, in 'noise-to-image synthesis' the input into the generator G consists of a noise vector that G translates into an image. A further input into G can be a class label, as in 'class-conditional image synthesis', based on which an output is generated that corresponds to this class. Paired and unpaired translation refer to the task where the input into G is a sample (e.g. an image in the source domain) based on which G generates another sample (e.g. an image in the target domain). This translation is paired if the training data consists of target and source domain sample pairs. The key characteristics of each of the GAN extensions of Table 1 are described in the following paragraphs.

Noise-to-image GAN extensions
As depicted in blue in Fig. 3, cGAN adds a discrete label as conditional information to the original GAN architecture that is provided as input to both generator and discriminator to generate class conditional samples (Mirza and Osindero, 2014).
AC-GAN feeds the class label only to the generator while the discriminator is tasked with correctly classifying both the class label and whether the supplied image is real or fake (Odena et al., 2017).
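The cGAN conditioning mechanism can be sketched as follows; the vector sizes are arbitrary illustrative choices, not those of any surveyed model:

```python
import numpy as np

def one_hot(label, num_classes):
    # Encode a discrete class label as a one-hot vector.
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def cgan_generator_input(noise, label, num_classes):
    # cGAN: the class condition is simply concatenated to the noise
    # vector before it enters the generator; the discriminator receives
    # the same label alongside the (real or fake) image.
    return np.concatenate([noise, one_hot(label, num_classes)])

z = np.random.randn(100)               # noise vector
g_in = cgan_generator_input(z, label=2, num_classes=5)
```

In AC-GAN, by contrast, the label is not fed to the discriminator; instead the discriminator gains an auxiliary classification head that must predict the label of the supplied image.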
WGAN is motivated by mathematical rationale and based on the Wasserstein-1 distance (alias 'earth mover distance' or 'Kantorovich distance') between two distributions. WGAN extends the theoretic formalisation and optimisation objective of the vanilla GAN to better approximate the distribution of the real data. By applying an alternative loss function (i.e. the Wasserstein loss), the discriminator (alias 'critic') maximises, and the generator minimises, the difference between the critic's scores for generated and real samples. An important benefit of WGAN is the empirically observed correlation of the loss with sample quality, which helps to interpret WGAN training progress and convergence (Arjovsky et al., 2017).
In WGAN, the weights of the critic are clipped, which means they have to lie within a compact space [-c, c]. This is needed to ensure that the critic is constrained to the space of 1-Lipschitz functions. With clipped weights, however, the critic is biased towards learning simpler functions and prone to exploding or vanishing gradients if the clipping threshold c is not tuned with care (Gulrajani et al., 2017; Arjovsky et al., 2017).
In WGAN-GP, the weight clipping constraint is replaced with a gradient penalty. The gradient penalty of the critic is a tractable, soft version of the following notion: by constraining the norm of the gradients of a differentiable function to be at most 1 everywhere, the function (i.e. the critic) would fulfil the 1-Lipschitz criterion without the need for weight clipping. Compared, among others, to WGAN, WGAN-GP was shown to have improved training stability (i.e. across many different GAN architectures), training speed, and sample quality (Gulrajani et al., 2017).

DCGAN generates realistic samples using a convolutional network architecture with batch normalisation (Ioffe and Szegedy, 2015) for both generator and discriminator and progressively increases the spatial dimension in the layers of the generator using transposed convolution (alias 'fractionally-strided convolution') (Radford et al., 2015).

(Fig. 3 shows an example of a generic GAN architecture applied to the generation of synthetic mammography region of interest (ROI) images based on the INbreast dataset (Moreira et al., 2012). Including the 'Condition' depicted in blue colour extends the vanilla GAN architecture (Goodfellow et al., 2014) to the cGAN architecture (Mirza and Osindero, 2014).)
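As a numeric illustration of the WGAN-GP gradient penalty described above, the following sketch uses a hypothetical linear critic whose input gradient is known analytically, so no automatic differentiation is needed (an illustrative simplification, not a real critic network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic f(x) = w . x; its gradient w.r.t. the input is
# simply w, which lets us compute the penalty in closed form.
w = rng.normal(size=8)

def gradient_penalty(x_real, x_fake):
    # WGAN-GP: evaluate the critic's input gradient at a random point
    # interpolated between a real and a fake sample, and penalise its
    # L2 norm for deviating from 1 (a soft 1-Lipschitz constraint).
    eps = rng.uniform()
    x_hat = eps * x_real + (1.0 - eps) * x_fake  # interpolated sample
    grad = w                                     # analytic critic gradient
    return (np.linalg.norm(grad) - 1.0) ** 2

gp = gradient_penalty(rng.normal(size=8), rng.normal(size=8))
```

With a real critic, `grad` would be obtained by backpropagating the critic's output with respect to `x_hat`; the penalty is then added to the critic loss with a weighting coefficient.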
PGGAN is trained with the loss and configurations introduced in WGAN-GP. It starts by generating low pixel resolution images, but progressively adds new layers to the generator and discriminator during training, resulting in increased pixel resolution and finer image details. It is suggested that after early convergence of the initial low-resolution layers, the additionally introduced layers enforce the network to only refine the learned representations through increasingly smaller-scale effects and features (Karras et al., 2017).
In SRGAN, the generator transforms a low-resolution (LR) to a high-resolution (HR, alias 'super-resolution') image, while the discriminator learns to distinguish between real high-resolution images and fake super-resolution images. Apart from an adversarial loss, a perceptual loss called 'content loss' measures how well the generator represents higher-level image features. This content loss is computed as the Euclidean distance between feature representations of the reconstructed image and the reference image, based on feature maps of a pretrained 19-layer VGG network (Simonyan and Zisserman, 2014).
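A schematic version of SRGAN's loss terms, with the VGG feature extractor stubbed out by plain arrays and an assumed adversarial weighting (an illustrative simplification, not the original implementation):

```python
import numpy as np

def content_loss(feat_sr, feat_hr):
    # SRGAN 'content loss': Euclidean distance between feature maps of
    # the super-resolved image and the high-resolution reference. The
    # feature extractor (a pretrained VGG-19) is stubbed out here and
    # the feature maps are passed in directly.
    return np.sum((feat_sr - feat_hr) ** 2)

def srgan_generator_loss(feat_sr, feat_hr, d_fake, adv_weight=1e-3):
    # Perceptual loss: content loss plus a weighted adversarial term
    # that rewards the generator for fooling the discriminator.
    adversarial = -np.log(d_fake).mean()
    return content_loss(feat_sr, feat_hr) + adv_weight * adversarial

f = np.ones((4, 4))
loss_identical = content_loss(f, f)  # identical feature maps
```

When the super-resolved image reproduces the reference features exactly and the discriminator is fully fooled (output 1), both terms vanish.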

Image-to-image GAN extensions
In image-to-image translation, a mapping is learned from one image distribution to another. For example, images from one domain can be transformed to resemble images from another domain via a mapping function implemented by a GAN generator.
CycleGAN achieves realistic unpaired image-to-image translation using two generators (G, F) with one traditional adversarial loss each and an additional cycle-consistency loss. Unpaired image-to-image translation transforms images from one domain X to another domain Y in the absence of paired training data, i.e. corresponding image pairs for both domains. In CycleGAN, the input image x from domain X is translated by generator G to resemble a sample G(x) from domain Y. Next, the sample is translated back from domain Y to domain X by generator F, yielding F(G(x)). The cycle-consistency loss enforces that F(G(x)) ≈ x (forward cycle consistency) and that G(F(y)) ≈ y (backward cycle consistency).
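The cycle-consistency idea can be demonstrated with toy, perfectly invertible 'generators' (illustrative stand-ins for the learned mappings G and F):

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    # Forward cycle: x -> G(x) -> F(G(x)) should reconstruct x.
    # Backward cycle: y -> F(y) -> G(F(y)) should reconstruct y.
    forward = np.abs(F(G(x)) - x).mean()
    backward = np.abs(G(F(y)) - y).mean()
    return forward + backward

# With perfectly inverse toy mappings between the two domains the
# cycle-consistency loss vanishes:
G = lambda x: x + 1.0   # "generator" for domain X -> Y
F = lambda y: y - 1.0   # "generator" for domain Y -> X
loss = cycle_consistency_loss(np.zeros(3), np.ones(3), G, F)
```

In the actual model, G and F are neural networks and this L1-style reconstruction term is added to their two adversarial losses.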
Both pix2pix and SPADE are used in paired image-to-image translation, where corresponding image pairs for both domains are available. pix2pix (alias 'condGAN') is a conditional adversarial network that adapts the U-Net architecture (Ronneberger et al., 2015) for the generator to facilitate encoding a conditional input image into a latent representation before decoding it back into an output image. pix2pix uses an L1 loss to enforce low-level (alias 'low frequency') image reconstruction and a patch-based discriminator ('PatchGAN') to enforce high-level (alias 'high frequency') image reconstruction, which the authors suggest to interpret as a texture/style loss. Note that the input into the PatchGAN discriminator is a concatenation of the original image (i.e. the generator's input image; e.g. this can be a segmentation map) and the real/generated image (i.e. the generator's output image). In SPADE, the generator architecture does not rely on an encoder for downsampling, but uses a conditional normalisation method during upsampling instead: a segmentation mask as conditional input into the SPADE generator is provided to each of its upsampling layers via spatially-adaptive residual blocks. These blocks embed the mask and apply two convolutions to the embedded mask to obtain two tensors with spatial dimensions. These two tensors are multiplied with and added to each upsampling layer prior to its activation function. The authors demonstrate that this type of normalisation achieves better fidelity and preservation of semantic information in comparison to other normalisation methods that are commonly applied in neural networks (e.g., Batch Normalisation). The multi-scale discriminators and the loss functions from pix2pixHD are adapted in SPADE, which contains a hinge loss (i.e. as a substitute for the adversarial loss), a perceptual loss, and a feature matching loss (Park et al., 2019).
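The core SPADE operation, normalising activations and then modulating them with spatially varying scale and shift tensors, can be sketched as follows (the convolutions that predict gamma and beta from the embedded segmentation mask are omitted and the tensors are passed in directly):

```python
import numpy as np

def spade_normalise(x, gamma, beta, eps=1e-5):
    # SPADE-style conditional normalisation: activations are first
    # normalised, then modulated element-wise by spatially varying
    # scale (gamma) and shift (beta) tensors of the same spatial size.
    # In the actual SPADE generator, gamma and beta are predicted from
    # the segmentation mask by small convolutions.
    x_norm = (x - x.mean()) / (x.std() + eps)
    return gamma * x_norm + beta

x = np.arange(16.0).reshape(4, 4)     # toy activation map
gamma = np.ones((4, 4))               # identity scale
beta = np.zeros((4, 4))               # zero shift
out = spade_normalise(x, gamma, beta)
```

Because gamma and beta vary per spatial location, the mask can modulate each region of the feature map differently, which is what preserves the semantic layout through the upsampling path.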

GAN network architectures and adversarial loss
For further methodological detail on the aforementioned GAN methods, loss functions, and architectures, we point the interested reader to the GAN methods review by Wang et al. (2019b). Recently, transformer-based GANs (Jiang et al., 2021) and VQGAN (Esser et al., 2021) were proposed, which diverge from the CNN design pattern by using Transformer Neural Networks (Vaswani et al., 2017). Due to the promising performance of these approaches in computer vision tasks, we encourage future studies to investigate the potential of transformer-based GANs for applications in medical and cancer imaging.
Multiple deep learning architectures apply the adversarial loss proposed in Goodfellow et al. (2014) together with other loss functions (e.g., segmentation loss functions) for tasks other than image generation (e.g., image segmentation). This adversarial loss is useful for unsupervised learning of features and representations that are invariant to some part of the training data. For instance, adversarial learning can be used to discriminate a domain in order to learn domain-invariant representations (Ganin and Lempitsky, 2015), as has been successfully demonstrated for medical images (Kamnitsas et al., 2017). Such methods that apply the adversarial loss internally are referred to as 'adversarial training' methods and are included in the scope of our survey. That is, we include and consider all relevant cancer imaging papers that apply or build upon the adversarial learning scheme defined in Goodfellow et al. (2014), which comprises GANs as well as adversarial training methods.
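The gradient-reversal mechanism commonly used for such domain-adversarial training (Ganin and Lempitsky, 2015) can be sketched as two functions: the layer is the identity in the forward pass and negates (and scales) the gradient in the backward pass:

```python
import numpy as np

def grad_reverse_forward(features):
    # Forward pass: the gradient-reversal layer is the identity.
    return features

def grad_reverse_backward(grad_from_domain_clf, lam=1.0):
    # Backward pass: the gradient flowing back from the domain
    # classifier is negated (and scaled by lam), so the feature
    # extractor is updated to *confuse* the domain classifier,
    # encouraging domain-invariant representations.
    return -lam * grad_from_domain_clf

g = np.array([0.5, -0.25])
reversed_g = grad_reverse_backward(g, lam=2.0)
```

In a real framework this forward/backward pair is registered as a custom autograd operation between the feature extractor and the domain classifier head.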

Cancer imaging challenges addressed by data synthesis and adversarial networks
In this section we follow the structure presented in Fig. 1, where we categorise cancer imaging challenges into five categories consisting of data scarcity and usability (4.1), data access and privacy (4.2), data annotation and segmentation (4.3), detection and diagnosis (4.4), and treatment and monitoring (4.5). In each subsection, we group and analyse respective cancer imaging challenges and discuss the potential and the limitations of corresponding GAN-based data synthesis and adversarial training solutions. In this regard, we also identify and highlight key needs to be addressed by researchers in the field of cancer imaging GANs towards solving the surveyed cancer imaging challenges. We provide Tables 2-6, one for each of Sections 4.1-4.5, containing relevant information (publication, method, dataset, modality, task, highlights) for all of the reviewed cancer imaging GAN solutions.
Chronology of key innovations. The most commonly applied adversarial network methodologies in cancer imaging are summarised chronologically in Fig. 4. Next to each network (a)-(m), the number of occurrences per cancer imaging challenge category (4.1-4.5) is highlighted.
Following Vanilla GANs 4(a), four main lines of innovation have been widely adopted in cancer imaging. These are methods that condition the synthetic data generation, e.g. cGAN 4(b); methods that improve upon the network architecture, e.g. DCGAN 4(c); methods that improve upon the adversarial loss function, e.g. WGAN 4(g); and methods that backpropagate the adversarial loss for representation learning, e.g. domain-invariant representations 4(d).
As to conditional methods, further key innovations have been AC-GAN's 4(f) discriminator classifying the input condition, and methods that condition the generation on an input image using additional reconstruction (e.g., pix2pix 4(e), CycleGAN 4(h)) or perceptual (e.g., SRGAN 4(i)) losses. Recent approaches (e.g., SPADE 4(l)) innovate regarding how the input image is provided to the generator network, e.g., via spatially-adaptive residual blocks in upsampling layers.
WGAN's 4(g) loss based on the discriminator estimating the Wasserstein-1 distance between real and synthetic image distributions is a widely used and extended (e.g., WGAN-GP 4(j)) alternative to the vanilla binary-cross entropy adversarial loss in cancer imaging.
The architectural innovation of progressive network growing 4(k) unlocked high-resolution cancer image generation and is adopted by recent approaches such as StyleGAN 4(m), which introduced adaptive instance normalisation and pioneered noise (and style condition) input via intermediate activation maps.

Challenging dataset sizes and shifts
Although data repositories such as The Cancer Imaging Archive (TCIA) (Clark et al., 2013) have made a wealth of cancer imaging data available for research, the demand is still far from satisfied. As a result, data augmentation techniques are widely used to artificially enlarge the existing datasets, traditionally including simple spatial (e.g., flipping, rotation) or intensity transformations (e.g., noise insertion) of the true data. GANs have shown promise as a more advanced augmentation technique and have already seen use in medical and cancer imaging (Han et al., 2018;Yi et al., 2019).
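The traditional augmentations mentioned above can be sketched in a few lines (the operations and noise scale are illustrative choices, not a recommendation for any particular modality):

```python
import numpy as np

def augment(img, rng):
    # Traditional data augmentation: spatial transforms (flips,
    # rotations) and intensity perturbation (noise insertion),
    # applied to enlarge a dataset before or alongside GAN-based
    # synthetic augmentation.
    ops = [
        np.fliplr,
        np.flipud,
        lambda x: np.rot90(x, k=int(rng.integers(1, 4))),
        lambda x: x + rng.normal(0.0, 0.01, size=x.shape),
    ]
    op = ops[int(rng.integers(len(ops)))]
    return op(img)

rng = np.random.default_rng(42)
img = np.zeros((8, 8))   # toy square image patch
aug = augment(img, rng)
```

Unlike these fixed transformations, a GAN-based augmenter samples entirely new images from the learned data distribution, which is what makes it attractive for the scarcity problems discussed here.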
Aside from the issue of lacking sizeable data, data scarcity often forces studies to be constrained to small-scale single-centre datasets. The resulting findings and models are likely not to generalise well due to diverging distributions between the (synthetic) datasets seen in training and those seen in testing or after deployment, a phenomenon known as dataset shift (Quionero-Candela et al., 2009). An example of this in clinical practice are cases where training data is preselected from specific patient sub-populations (e.g., only high-risk patients), resulting in bias and limited generalisability to the broad patient population (Troyanskaya et al., 2020; Bi et al., 2019).
From a causality perspective, dataset shift can be split into several distinct scenarios:

• Population shift, caused by differences in age, sex, ethnicity, etc.
• Acquisition shift, caused by differences in scanners, resolution, contrast, etc.
• Annotation shift, caused by differences in annotation policy, annotator experience, segmentation protocols, etc.
• Prevalence shift, caused by differences in the disease prevalence in the population, often resulting from artificial sampling of data.
• Manifestation shift, caused by differences in how the disease is manifested.

GANs may inadvertently introduce such types of dataset shift (e.g., due to mode collapse; Goodfellow et al., 2014), but it has been shown that this shift can be studied, measured and avoided (Santurkar et al., 2018; Arora et al., 2018). (More concretely, the dataset shift discussed above describes a case of covariate shift (Quionero-Candela et al., 2009; Shimodaira, 2000), defined by a change of distribution of the independent variables between two datasets.) GANs can be a sophisticated tool for data augmentation or curation (Diaz et al., 2021) and, by calibrating the type of shift introduced, they have the potential to turn it into an advantage, generating diverse training data that can help models generalise better to unseen target domains. The research line studying this problem is called domain generalisation, and it has presented promising results for harnessing adversarial models towards the learning of domain-invariant features. GANs and adversarial training have been used in various ways in this context, using multi-source data to generalise to unseen targets (Rahman et al., 2019; Li et al., 2018) or in unsupervised domain generalisation using adaptive data augmentation to append adversarial examples iteratively (Volpi et al., 2018). As indicated in Fig. 1(a), the domain generalisation research line has recently been further extended to cancer imaging (Lafarge et al., 2019; Chen et al., 2021).
In the following, further cancer imaging challenges in the realm of data scarcity and usability are described and related GAN solutions are referenced. Given these challenges and solutions, we derive a workflow for clinical adoption of (synthetic) cancer imaging data, which is illustrated in Fig. 5.

Imbalanced data and fairness
Apart from the rise of data-hungry deep learning solutions and the need to cover the different organs and data acquisition modalities, a major problem that arises from data scarcity is that of imbalance, i.e. the overrepresentation of a certain type of data over others (Bi et al., 2019). In its more common form, imbalance of diagnostic labels can hurt a model's specificity or sensitivity, as a prior bias from the data distribution may be learned. The Lung Screening Study (LSS) Feasibility Phase exemplifies the common class imbalance in cancer imaging data: 325 (20.5%) suspicious lung nodules were detected in the 1586 first low-dose CT screenings, of which only 30 (1.89%) were lung cancers (Gohagan et al., 2004, 2005; NLST Research Team, 2011). This problem directly translates to multi-task classification (CLF), with imbalance between different types of cancer leading to worse sensitivity on the underrepresented categories (Yu et al., 2013).

Fig. 5. Illustration of a workflow that applies GANs to the challenges of data scarcity and data curation. After the GAN generates synthetic data specific to the issue at hand, the data is automatically and manually evaluated before being further used in medical AI research. Ultimately, both synthetic data and medical AI models are integrated as decision support tools into clinical practice.

It is important to note that by solving the imbalance with augmentation techniques,
bias is introduced as the prior distribution is manipulated, causing prevalence shift. As such, the test set should preserve the population statistics. Aside from imbalance of labels, more insidious forms of imbalance such as that of race/ethnicity (Adamson and Smith, 2018) or gender (Larrazabal et al., 2020) of patients are easily omitted in studies. This leads to fairness problems in real world applications as underrepresenting such categories in the training set will hurt performance on these categories in the real world (population shift) (Li et al., 2021a). Because of their potential to generate synthetic data, GANs are a promising solution to the aforementioned problems and have already been thoroughly explored in this regard in Computer Vision (Sampath et al., 2021;Mullick et al., 2019). Concretely, the discriminator and generator can be conditioned on underrepresented labels, forcing the generator to create images for a specific class, 7 as indicated in Fig. 1(d). Many lesions classifiable by complex scoring systems such as RADS reporting are rare and, hence, effective conditional data augmentation is needed to improve the recognition of such lesions by ML detection models (Kazuhiro et al., 2018). GANs have already been used to adjust label distributions in imbalanced cancer imaging datasets, e.g. by generating underrepresented grades in a risk assessment scoring system (Hu et al., 2018b) for prostate cancer. A further promising applicable method is to enrich the data using a related domain as proxy input (Addepalli et al., 2020). Towards the goal of a more diverse distribution of data with respect to gender and ethnicity, similar principles can be applied. For instance, Li et al. (2021a) proposed an adversarial training scheme to improve fairness in classification of skin lesions for underrepresented groups (age, sex, skin tone) by learning a neutral representation using an adversarial bias discrimination loss. 
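Before conditioning a GAN on underrepresented labels, one can first quantify how many synthetic samples each class would need. A toy sketch of such a per-class synthesis budget (class names and counts are illustrative, loosely mirroring the nodule figures above):

```python
from collections import Counter

def synthesis_budget(labels):
    """Number of synthetic samples per class that a label-conditional
    GAN would need to generate to equalise the class counts."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}

# Illustrative label list: many benign findings, few confirmed cancers.
labels = ["benign"] * 295 + ["cancer"] * 30
print(synthesis_budget(labels))  # {'benign': 0, 'cancer': 265}
```

The test set, as noted above, should remain untouched so that the population statistics are preserved.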
Fairness-imposing GANs can also generate synthetic data with a preference for underrepresented groups, so that models may ingest a more balanced dataset, improving demographic parity without excluding data from the training pipeline. Such models have been trained in computer vision tasks (Sattigeri et al., 2018; Wang et al., 2019a; Zhang et al., 2018a; Beutel et al., 2017), but corresponding research on medical and cancer imaging, denoted by Fig. 1(c), has been limited (Li et al., 2021a; Ghorbani et al., 2020).

7 The class can be something as simple as 'malignant' or 'benign', or a more complex score for the risk assessment of a tumour, such as the BI-RADS scoring system for breast tumours (Liberman and Menell, 2002).

Cross-modal data generation
In cancer, multiple acquisition modalities are enlisted in clinical practice (Kim et al., 2016; Chen et al., 2017; Barbaro et al., 2017; Chang et al., 2020b,a); thus, automated diagnostic models should ideally learn to interpret various modalities as well, or learn a shared representation of these modalities. Conditional GANs offer the possibility to generate one or multiple (Yurt et al., 2019; Zhou et al., 2020) modalities from another, alleviating the need to actually perform the potentially more harmful screenings (i.e. high-dose CT, PET) that expose patients to radiation, or require invasive contrast agents such as intravenous iodine-based contrast media (ICM) in CT (Haubold et al., 2021), gadolinium-based contrast agents in MRI (Zhao et al., 2020a) (in Table 5), or radioactive tracers in PET (Zhao et al., 2020b). Furthermore, extending the acquisition modalities used in a given task would also enhance the performance and generalisability of AI models, allowing them to learn shared representations among these imaging modalities (Bi et al., 2019; Hosny et al., 2018). Towards this goal, multiple GAN domain-adaptation solutions have been proposed to generate CT from MRI (Wolterink et al., 2017; Kearney et al., 2020b; Tanner et al., 2018; Kaiser and Albarqouni, 2019; Nie et al., 2017; Kazemifar et al., 2020; Prokopenko et al., 2019), PET from MRI, PET from CT (Ben-Cohen et al., 2017; Bi et al., 2017) (in Table 5), and CT from PET as in Armanious et al. (2020), where GAN-based PET denoising and MR motion correction are also demonstrated. If not indicated otherwise, these image-to-image translation studies are outlined in Table 2. Because of its complexity, clinical cancer diagnosis is based not only on imaging but also on non-imaging data (genomic, molecular, clinical, radiological, demographic, etc.).
In cases where this data is readily available, it can serve as conditional input to GANs towards the generation of images with the corresponding phenotype-genotype mapping, as is also elaborated in regard to tumour profiling for treatment in Section 4.5.1. A multimodal cGAN was recently developed, conditioned on both images and gene expression code; however, research along this line is otherwise limited.

Feature hallucinations in synthetic data
As displayed in Fig. 6 and denoted in Fig. 1(b), conditional GANs can unintentionally 8 hallucinate non-existent artifacts into a patient image. This is particularly likely to occur in cross-modal data augmentation, especially but not exclusively if the underlying dataset is imbalanced.

R. Osuala et al.

Fig. 6. Example of a GAN that translates Film Scanned MMG (source) to Full-Field Digital MMG (target). The generator transforms 'benign' source images (triangles) into 'malignant' target images (plus symbols). As opposed to the source, the target domain contains more malignant MMGs than benign ones. If the discriminator thus learns to associate malignancy with realness, this incentivises the generator to inject malignant features (depicted by dotted arrows). For simplicity, additional losses (e.g., reconstruction losses) are omitted.

For instance, Cohen et al. (2018a) describe GAN image feature hallucinations embodied by added and removed brain tumours in cranial MRI. The authors tested the relationship between the ratio of tumour images in the GAN target distribution and the ratio of images diagnosed with tumours by a classifier. The classifier was trained on the GAN-generated target dataset, but tested on a balanced holdout test set. It was thereby shown that the generator of CycleGAN effectively learned to hide source domain image features in target domain images, which arguably helped it to fool its discriminator. Paired image-to-image translation with pix2pix was more stable, but some hallucinations were still shown to have likely occurred. A cause for this can be a biased discriminator that has learned to discriminate specific image features (e.g., tumours) that are more present in one domain. Cohen et al. (2018a,b) and Wolterink et al. (2018) warn that models that map source to target images have an incentive to add/remove features during translation if the feature distribution in the target domain is distinct from the feature distribution in the source domain. 9 Domain-adaptation with unpaired image-to-image translation GANs such as CycleGAN has become increasingly popular in cancer imaging (Wolterink et al., 2017; Tanner et al., 2018; Modanwal et al., 2019; Fossen-Romsaas et al., 2020; Zhao et al., 2020b; Hognon et al., 2019; Mathew et al., 2020; Kearney et al., 2020b; Peng et al., 2020; Jiang et al., 2018; Sandfort et al., 2019). As described, these methods are hallucination-prone and, thus, can put patients at risk when used in clinical settings. More research is needed on how to robustly avoid, or detect and eliminate, hallucinations in generated data. To this end, we highlight the potential of investigating feature-preserving image translation techniques and methods for evaluating whether features have been accurately translated.
For instance, in the presence of feature masks or annotations, an additional local reconstruction loss can be introduced in GANs that enforces feature translation in specific image areas.
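A minimal numpy sketch of such a mask-restricted loss term (the images and lesion mask are toy values; in a real pipeline this term would be added to the GAN objective):

```python
import numpy as np

def local_l1_loss(source, translated, mask):
    """L1 reconstruction loss restricted to annotated feature regions
    (e.g., a lesion mask): penalises a translation GAN for altering
    tumour appearance while leaving the background free to change."""
    m = mask.astype(bool)
    return float(np.abs(source[m] - translated[m]).mean())

src = np.zeros((4, 4)); src[1:3, 1:3] = 1.0          # toy image with a 2x2 'lesion'
out = src.copy(); out[0, 0] = 0.7; out[1, 1] = 0.5   # background edit + lesion edit
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1          # lesion annotation
print(local_l1_loss(src, out, mask))  # 0.125: only the lesion edit is penalised
```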

Data curation and harmonisation
Aside from the limited availability of cancer imaging datasets, a major problem is that the datasets which are available are often not readily usable and require further curation (Hosny et al., 2018). Curation includes dataset formatting, normalising, structuring, de-identification, quality assessment and other methods to facilitate subsequent data processing steps, one of which is the ingestion of the data into AI models (Diaz et al., 2021). In the past, GANs have been proposed for the curation of data labelling, segmentation and annotation of images (details in Section 4.3) and for the de-identification of facial features, EHRs, etc. (details in Section 4.2). Particular to cancer imaging datasets, and of significant importance, is the correction of artifacts, such as those caused by patient motion, metallic objects, chemical shifts and other parts of the image processing pipeline (Pusey et al., 1986; Nehmeh et al., 2002), which run the risk of confusing models with spurious information. Towards the principled removal of artifacts, several GAN solutions have been proposed (Vu et al., 2020b; Koike et al., 2020; Armanious et al., 2020). As for the task of reconstructing compressed data (e.g., compressed sensing MRI; Mardani et al., 2017), Yang et al. (2018a) notably proposed DAGAN, which is based on U-Net (Ronneberger et al., 2015), reduces aliasing artifacts, and faithfully preserves texture, boundaries and edges (of brain tumours) in the reconstructed images. Kim et al. (2018a) feed down-sampled high-resolution brain tumour MRI into a GAN framework similar to pix2pix to reconstruct high-resolution images with different contrast. The authors highlight the possible acceleration of MR imagery collection while retaining high-resolution images in multiple contrasts, necessary for further clinical decision-making.
As relevant to the context of data quality curation, GANs have also been proposed for image super-resolution in cancer imaging (e.g., for lung nodule detection (Gu et al., 2020), abdominal CT (You et al., 2019), and breast histopathology (Shahidi, 2021)). Beyond the lack of curation, a problem particular to multi-centre studies is that of inconsistent curation between data derived in different centres. These discontinuities arise from different scanners, segmentation protocols, demographics, etc., and can cause significant problems for subsequent ML algorithms that may overfit or bias towards one configuration over another (i.e. acquisition and annotation shifts). GANs have the potential to contribute in this domain as well by bringing the distributions of images across different centres closer together. In this context, recent work by Li et al. (2021b) and Wei et al. (2020) used GAN-based volumetric normalisation to reduce the variability of heterogeneous 3D chest CT scans of different slice thickness and dose levels. The authors showed that features in subsequent radiomics analysis exhibit increased alignment. Other works in this domain include a framework that could standardise heterogeneous datasets with a single reference image and obtained promising results on an MRI dataset (Hognon et al., 2019), and GANs that learn bidirectional mappings between different vendors to normalise dynamic contrast-enhanced (DCE) breast MRI (Modanwal et al., 2019). An interesting research direction to be explored in the future is synthetic multi-centre data generation using GANs, simulating the distribution of various scanners/centres.
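As a simple non-adversarial baseline for inter-scanner harmonisation, classical histogram matching maps one scan's intensity distribution onto a reference; a compact numpy sketch with illustrative synthetic 'scans' (the means and standard deviations are invented):

```python
import numpy as np

def match_histogram(image, reference):
    """Map the intensity distribution of `image` onto that of
    `reference` via CDF matching, a simple baseline for harmonising
    scans acquired on different scanners."""
    src_vals, src_inv, src_cnt = np.unique(image.ravel(),
                                           return_inverse=True,
                                           return_counts=True)
    ref_vals, ref_cnt = np.unique(reference.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_cnt) / image.size
    ref_cdf = np.cumsum(ref_cnt) / reference.size
    mapped = np.interp(src_cdf, ref_cdf, ref_vals)
    return mapped[src_inv].reshape(image.shape)

rng = np.random.default_rng(1)
scan_a = rng.normal(0.2, 0.1, (32, 32))  # hypothetical scanner A intensities
scan_b = rng.normal(0.6, 0.2, (32, 32))  # hypothetical scanner B intensities
harmonised = match_histogram(scan_a, scan_b)
# `harmonised` now follows scanner B's intensity statistics.
```

GAN-based normalisation generalises this idea from global intensity statistics to learned, spatially aware mappings.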

Synthetic data assessment
As indicated in Fig. 1(e), the proper evaluation of GAN-generated or GAN-curated data is of paramount importance. This evaluation verifies that synthetic data is usable for a desired downstream task (e.g., segmentation, classification) and/or indistinguishable from real data, while ensuring that no private information is leaked. GANs are commonly evaluated based on fidelity (realism of generated samples) and diversity (variation of generated samples compared to real samples) (Borji, 2021). Different quantitative measures exist to assess GANs based on the fidelity and diversity of their generated synthetic medical images (Yi et al., 2019; Borji, 2021).
Visual Turing tests (otherwise referred to as Visual Assessment, Mean Opinion Score (MOS) Test, and sometimes used interchangeably with In-Silico Clinical Trials) are arguably the most reliable approach, where clinical experts are presented with samples from real and generated data and are tasked to identify which one is generated. Korkinof et al. (2020) showed that their PGGAN-generated (Karras et al., 2017) 1280 × 1024 mammograms could not be reliably distinguished from real ones by the majority of participants, including trained breast radiologists. A similar visual Turing test was successfully done in the case of skin disease (Ghorbani et al., 2020), super-resolution of CT (You et al., 2019), brain MRI (Kazuhiro et al., 2018; Han et al., 2018), lung cancer CT scans (Chuquicusma et al., 2018), and histopathology images (Levine et al., 2020). For instance, Chuquicusma et al. (2018) trained a DCGAN (Radford et al., 2015) on the LIDC-IDRI dataset (Armato III et al., 2011) to generate 2D (56 × 56 pixel) pulmonary lung nodule scans that were realistic enough to deceive two radiologists with 11 and 4 years of experience. In contrast to computer vision settings, where synthetic data can often be easily evaluated by any non-expert, the requirement of clinical experts makes visual Turing tests in this domain much more costly. Furthermore, a lack of scalability and consistency in medical judgement needs to be taken into account as well (Brennan and Silman, 1992), and visual Turing tests should in the ideal case engage a range of experts to address inter-observer variation in the assessments. Also, iterating over the same observer addresses intra-observer variation, i.e. repeating the process at certain intervals that could be days or weeks apart. These problems are further magnified by the shortage of radiology experts (Mahajan and Venugopal, 2020; Rimmer, 2017), which brings up the necessity for supplementary metrics that can automate the evaluation of generative models.
Such metrics allow for preliminary evaluation and can enable research to progress without the logistical hurdle of enlisting experts.
Furthermore, in cases where the sole purpose of the generated data is to improve a downstream task, i.e. classification or segmentation, the prediction success of the downstream task would be the metric of interest. The latter can reasonably be prioritised over other metrics, given that the underlying reasons why the synthetic data alters downstream task performance are examined and clarified. 10

Image quality assessment metrics. Wang et al. (2004) have thoroughly investigated image quality assessment metrics. The most commonly applied metrics include the structural similarity index measure (SSIM) 11 between generated image and reference image (Wang et al., 2004), the mean squared error (MSE) 12 and the peak signal-to-noise ratio (PSNR). 13 In a recent example that followed this framework of evaluation, synthetic brain MRI with tumours generated by the edge-aware EA-GAN was assessed using three such metrics: PSNR, SSIM, and the normalised mean squared error (NMSE). The authors integrated an end-to-end Sobel edge detector to create edge maps from real/synthetic images that are input into the discriminator in the dEa-GAN variant to enforce improved textural structure and object boundaries. Interestingly, aside from evaluating on the whole image, the authors demonstrated evaluation results focused on the tumour regions, which were overall significantly lower than those for the whole image. Other works that have evaluated their synthetic images in an automatic manner have focused primarily on the SSIM and PSNR metrics and include the generation of CT (Kearney et al., 2020b; Mathew et al., 2020) and PET scans (Zhao et al., 2020b). While indicative of image quality, these similarity-based metrics might not generalise well to human judgement of image similarity, the latter depending on high-order image structure and context (Zhang et al., 2018c). Finding evaluation metrics that are strong correlates of human judgement of perceptual image similarity is a promising line of research.
In the context of cancer and medical imaging, we highlight the need for evaluation metrics for synthetic images that correlate with the perceptual image similarity judged by medical experts. Apart from perceptual image similarity, further evaluation metrics in cancer and medical imaging are to be investigated that are able to estimate the diagnostic value of (synthetic) images and, in the presence of reference images, the diagnostic value proportion between target and reference image.
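The similarity metrics above can be sketched in a few lines of numpy (intensities assumed in [0, 1]; the SSIM here uses a single global window, whereas the standard metric averages the same expression over local windows):

```python
import numpy as np

def mse(x, y):
    """Mean squared error between generated and reference image."""
    return float(((x - y) ** 2).mean())

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB, derived from the MSE."""
    return float(10 * np.log10(max_val ** 2 / mse(x, y)))

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """SSIM over a single global window, combining luminance,
    contrast, and structure terms; constants assume a dynamic
    range of 1.0."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2)) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

ref = np.zeros((8, 8))
gen = ref + 0.1                  # uniform 0.1 intensity error
print(round(psnr(ref, gen), 2))  # 20.0
```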
Deep generative model-specific assessment metrics. In recent years, the Inception score (IS) (Salimans et al., 2016) and the Fréchet Inception distance (FID) (Heusel et al., 2017) have emerged, offering a more sophisticated alternative for the assessment of synthetic data. The IS uses a classifier to generate a probability distribution of labels given a synthetic image. If the probability distribution is highly skewed, this indicates that a specific object is present in the image (resulting in a higher IS), while if it is uniform, the image is more likely to contain a nonsensical jumble of objects (resulting in a lower IS). 14 The FID metric compares the distance between the synthetic image distribution and the real image distribution by comparing high-level features extracted from one of the layers of a classifier (e.g., Inception v3, as in IS). Both metrics have shown promise in the evaluation of GAN-generated data; however, they come with several bias issues that need to be taken into account during evaluation (Chong and Forsyth, 2020; DeVries et al., 2019; Borji, 2019). As these metrics have not yet been widely used in cancer imaging, their applicability to GAN-synthesised cancer images remains to be investigated. In contrast to computer vision datasets containing diverse objects, medical imaging datasets commonly contain images of only one specific organ. In this regard, we promote further research as to how object-diversity-based methods such as IS can be applied to medical and cancer imaging, which requires, among others, meaningful adjustments of the dataset-specific pretrained classification models (i.e. Inception v3) that IS and FID rely upon.

11 SSIM predicts perceived quality and considers image statistics to assess structural information based on luminance, contrast, and structure.
12 MSE is computed by averaging the squared intensity differences between corresponding pixels of the generated image and the reference image.
13 PSNR is an adjustment to the MSE score, commonly used to measure reconstruction quality in lossy compression.
14 Not only a low label entropy within an image is desired, but also a high label entropy across images: IS also assesses the variety of peaks in the probability distributions generated from the synthetic images, so that a higher variety is indicative of more diverse objects being generated by the GAN (resulting in a higher IS).
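Given per-image class posteriors from a pretrained classifier, the IS itself reduces to a few lines; the posterior arrays below are illustrative toy values:

```python
import numpy as np

def inception_score(p_yx):
    """Inception score from an array of per-image class posteriors
    p(y|x) (rows sum to 1): the exponential of the mean KL divergence
    between each p(y|x) and the marginal p(y)."""
    p_y = p_yx.mean(axis=0)
    kl = (p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))

confident_diverse = np.array([[0.98, 0.01, 0.01],
                              [0.01, 0.98, 0.01],
                              [0.01, 0.01, 0.98]])
uniform = np.full((3, 3), 1 / 3)
print(inception_score(confident_diverse))  # high (approaches the maximum of 3)
print(inception_score(uniform))            # 1.0 (lowest possible)
```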
Uncertainty quantification as GAN evaluation metric? A general problem facing the adoption of deep learning methods in clinical tasks is their inherent unreliability, exemplified by high prediction variation caused by minimal input variation (e.g., the one-pixel attack, Korpihalkola et al., 2020). This is further exacerbated by the non-transparent decision making process inside deep neural networks, which are thus often described as 'black box models' (Bi et al., 2019). Also, the performance of deep learning methods on out-of-domain datasets has been assessed as unreliable (Lim et al., 2019). To eventually achieve beneficial clinical adoption and trust, examining and reporting the inherent uncertainty of these models on each prediction becomes a necessity. Besides classification, segmentation (Alshehhi and Alshehhi, 2021), etc., uncertainty estimation is applicable to models in the context of data generation as well (Lim et al., 2019; Abdar et al., 2020; Hu et al., 2020). Edupuganti et al. (2019) studied a GAN architecture based on variational autoencoders (VAE) (Kingma and Welling, 2013) on the task of MRI reconstruction, with emphasis on uncertainty studies. Due to their probabilistic nature, VAEs allow for a Monte Carlo sampling approach, which enables the quantification of pixel-variance and the generation of uncertainty maps. Furthermore, they used Stein's Unbiased Risk Estimator (SURE) (Stein, 1981) as a measure of uncertainty that serves as a surrogate of the MSE even in the absence of ground truth. Their results indicated that adversarial losses introduce more uncertainty. Parallel to image reconstruction, uncertainty has also been studied in the context of brain tumours (glioma) in MRI enhancement (Tanno et al., 2021).
In this study, a probabilistic deep learning framework for model uncertainty quantification was proposed, decomposing the problem into two uncertainty types: intrinsic uncertainty (particular to image enhancement and pertaining to the one-to-many nature of the super-resolution mapping) and parameter uncertainty (a general challenge pertaining to the choice of the optimal model parameters). The overall model uncertainty in this case is a combination of the two and was evaluated for image super-resolution. Through a series of systematic studies, the utility of this approach was highlighted, as it resulted in improved overall prediction performance of the evaluated models, even for out-of-distribution data. It was further shown that predictive uncertainty correlated highly with reconstruction error, which not only enabled spotting unrealistic synthetic images, but also highlights the potential of further exploring uncertainty as an evaluation metric for GAN-generated data. A further use-case of interest for GAN evaluation via uncertainty estimation is the 'adherence' to provided conditional inputs. As elaborated in Section 4.1.4 for image-to-image translation, conditional GANs are likely to introduce features that do not correspond to the conditional class label or source image. After training a classification model on image features of interest (say, tumour vs non-tumour features), we can examine the classifier's prediction and estimated uncertainty 15 for the generated images. Given that the expected features in the generated images are known beforehand, the classifier's uncertainty about the presence of these features can be used to estimate not only image fidelity (e.g., image features are not generated realistically enough), but also 'condition adherence' (e.g., expected image features are altered during generation).
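The Monte Carlo pixel-variance idea can be sketched as follows; the 'reconstructions' are simulated toy draws, with an artificially unstable region standing in for an uncertain lesion (not the cited studies' actual models):

```python
import numpy as np

def uncertainty_map(samples):
    """Per-pixel mean and variance over repeated stochastic
    reconstructions (e.g., Monte Carlo draws from a VAE decoder);
    high-variance pixels flag unreliable synthetic content."""
    s = np.asarray(samples)
    return s.mean(axis=0), s.var(axis=0)

rng = np.random.default_rng(0)
base = np.zeros((16, 16))
draws = []
for _ in range(50):
    d = base + rng.normal(0, 0.01, base.shape)   # stable background
    d[4:8, 4:8] += rng.normal(0, 0.5)            # unstable 'lesion' region
    draws.append(d)
mean, var = uncertainty_map(draws)
# The unstable region shows orders-of-magnitude higher variance.
```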
Outlook on clinical adoption. Alongside GAN-specific and standard image assessment metrics, uncertainty-based evaluation schemes can further automate the analysis of generative models. To this end, the challenge of clinically validating predictive uncertainty as a reliability metric for synthetic data assessment remains (Tanno et al., 2021). In practice, building clinical trust in AI models is a non-trivial endeavour and will require rigorous performance monitoring and calibration, especially in the early stages (Kelly et al., 2019; Durán and Jongsma, 2021). This is particularly the case when CADe and CADx models are trained on entirely (or partially) synthetic data, given that the data itself was not first assessed by clinicians. Until a certain level of trust is built in these pipelines, automatic metrics will be a preliminary evaluation step that is inevitably followed by diligent clinical evaluation for deployment. A research direction of interest in this context would be 'gatekeeper' GANs, i.e. GANs that simulate common data (and/or difficult edge cases) of the target hospital, on which deployment-ready candidate models (e.g., for segmentation, classification, etc.) are then tested to ensure they are sufficiently generalisable. If the candidate model performance on such test data satisfies a predefined threshold, it has passed this quality gate for clinical deployment.

Data access and privacy challenges
Access to sufficiently large and labelled data resources is the main constraint for the development of deep learning models for medical imaging tasks (Esteva et al., 2019). In cancer imaging, the practice of sharing validated data to aid the development of AI algorithms is restricted due to technical, ethical, and legal concerns (Bi et al., 2019). The latter is exemplified by regulations such as the Health Insurance Portability and Accountability Act (HIPAA, 1996) in the United States of America (USA) or the European Union's General Data Protection Regulation (GDPR, 2016), with which the respective clinical centres must comply. While patient privacy preservation is both needed and beneficial in numerous ways, it can also limit data sharing initiatives and restrict the availability, size, and usability of public cancer imaging datasets. Bi et al. (2019) assess the absence of such datasets as a noteworthy challenge for AI in cancer imaging.
The published GANs and adversarial training methods that are suggested for or applied to cancer imaging challenges within Section 4.2 are summarised below in Table 3.

Decentralised data generation
As AI systems are often developed and trained outside of medical institutions, prior approval to transfer data out of their respective data silos is required, adding significant hurdles to the logistics of setting up a training pipeline or rendering it entirely impossible. In addition, medical institutions can often not guarantee a secured connection to systems deployed outside their centres (Hosny et al., 2018), which further limits their options to share valuable training data.
One privacy-preserving approach to this problem is federated learning (McMahan et al., 2017), where copies of an AI model are trained in a distributed fashion inside each clinical centre in parallel and are aggregated into a global model on a central server. This eliminates the need for sensitive patient data to leave any of the clinical centres (Kaissis et al., 2020; Sheller et al., 2020). However, it is to be noted that federated learning cannot guarantee full patient privacy. Hitaj et al. (2017) demonstrated that any malicious user can train a GAN to violate the privacy of the other users in a federated learning system. While difficult to avoid, the risk of such GAN-based attacks can be minimised, e.g., by using a combination of selective parameter updates (Shokri and Shmatikov, 2015) (sharing only a small selected part of the model parameters across centres) and the sparse vector technique 16 as shown by Li et al. (2019b).
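The central aggregation step of federated learning (FedAvg, McMahan et al., 2017) can be sketched as follows; the centre weights and cohort sizes are hypothetical:

```python
import numpy as np

def federated_average(local_weights, n_samples):
    """FedAvg aggregation: the central server averages locally trained
    model weights, weighted by each centre's number of training
    samples; no patient data leaves any centre."""
    total = sum(n_samples)
    return sum(w * (n / total) for w, n in zip(local_weights, n_samples))

# Three hypothetical clinical centres with differently sized cohorts.
centre_weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
cohort_sizes = [100, 300, 600]
print(federated_average(centre_weights, cohort_sizes))  # [4. 5.]
```

Only these weight vectors (or gradients) cross institutional boundaries, which is precisely the attack surface the GAN-based attacks above exploit.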

Table 2
Overview of adversarially-trained models applied to cancer imaging data scarcity and usability challenges. Publications are clustered by section and ordered by year in ascending order.
Fig. 7. Visual example of a GAN in a federated learning setup with a central generator trying to generate realistic samples that fool all of the discriminators, which are distributed across clinical centres as in Chang et al. (2020b,a). Once trained, the generator can produce training data for a downstream task model (e.g., segmentation, detection, classification). As depicted in blue colour, we suggest extending the federated learning setup by adding 'Noise' to the gradients, providing a differential privacy guarantee. This reduces the possibility of reconstruction of specific records of the training data (i.e. images of a specific patient) by someone with access to the trained GAN model (i.e. to the generator) or by someone intercepting the synthetic images while they are transferred from the central generator to the centres during training.
GANs can also be trained on sensitive patient data to generate synthetic training data. The technical, legal, and ethical constraints for sharing de-identified synthetic data are typically less restrictive than for real patient data. Such generated data can be used instead of the real patient data to train models for disease detection, segmentation, or prognosis. For instance, Chang et al. (2020b,a) proposed the Distributed Asynchronized Discriminator GAN (AsynDGAN), which consists of multiple discriminators deployed inside various medical centres and one central generator deployed outside the medical centres. The generator never needs to see the private patient data, as it learns by receiving the gradient updates of each of the discriminators. The discriminators are trained to differentiate images of their medical centre from synthetic images received from the central generator. After training AsynDGAN, its generator is used and evaluated based on its ability to provide a rich training set of images to successfully train a segmentation model. AsynDGAN is evaluated on MRI brain tumour segmentation and cell nuclei segmentation. The segmentation models trained only on AsynDGAN-generated data achieve a competitive performance when compared to segmentation models trained on the entire dataset of real data. Notably, models trained on AsynDGAN-generated data outperform models trained on local data from only one of the medical centres. To the best of our knowledge, AsynDGAN is the only distributed GAN applied to cancer imaging to date. Therefore, we promote further research in this line to fully exploit the potential of privacy preservation using distributed GANs. As demonstrated in Fig. 7 and suggested in Fig. 1(f), for maximal privacy preservation we recommend exploring methods that combine privacy during training (e.g., federated GANs) with privacy after training (e.g., differentially-private GANs), the latter being described in the following section. Shin et al.
(2018a) train a GAN to generate brain tumour images and highlight the usefulness of their method for anonymisation, as their synthetic data cannot be attributed to a single patient but rather only to an instantiation of the training population. However, it is to be scrutinised whether such synthetic samples are indeed fully private, as, given a careful analysis of the GAN model and/or its generated samples, a risk of possible reconstruction of part of the GAN training data exists (Papernot et al., 2016). For example, Chen et al. (2020a) propose a GAN for model inversion (MI) attacks, which aim at reconstructing the training data from a target model's parameters. A potential solution to avoid training data reconstruction is highlighted by Xie et al. (2018), who propose the Differentially Private Generative Adversarial Network (DPGAN). In Differential Privacy (DP) (Dwork, 2006), the parameters (ε, δ) denote the privacy budget (Torfi et al., 2020), where ε measures the privacy loss and δ represents the probability that a range of outputs with a privacy loss > ε exists. 17 Hence, the smaller the parameters (ε, δ) for a given model, the less effect a single sample in the training data has on the model output. The smaller the effect of such a single sample, the stronger the confidence that the model does not reveal samples of its training data.

Differentially-private data generation
Examples of GANs with differential privacy guarantees. In DPGAN, noise is added to the model's gradients during training to ensure training data privacy. Extending the concept of DPGAN, Jordon et al. (2018) train a GAN coined PATE-GAN based on the Private Aggregation of Teacher Ensembles (PATE) framework (Papernot et al., 2016, 2018). In the PATE framework, a student model learns from various unpublished teacher models, each trained on a data subset. The student model cannot access an individual teacher model nor its training data. PATE-GAN consists of k teacher discriminators, T_1, …, T_k, and a student discriminator that backpropagates its loss back into the generator. This limits the effect of any individual sample in PATE-GAN's training. In a (ε = 1, δ = 10^-5)-DP setting, classification models trained on PATE-GAN's synthetic data achieve competitive performance, e.g. on a non-imaging cervical cancer dataset (Fernandes et al., 2017), compared to an upper-bound vanilla GAN baseline without DP.
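The gradient-noising mechanism at the core of DPGAN — clip each per-sample gradient, then add calibrated Gaussian noise before the parameter update — can be sketched as follows. This is an illustrative NumPy sketch of DP-SGD-style gradient sanitisation; the function name and parameter values are our own, not taken from the cited papers:

```python
import numpy as np

def sanitize_gradients(per_sample_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip each per-sample gradient to `clip_norm` and add Gaussian noise,
    as in DP-SGD-style training of a differentially private discriminator."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_sample_grads),
                       size=mean_grad.shape)
    return mean_grad + noise
```

Clipping bounds each sample's influence on the update; the noise scale, together with the number of training steps, determines the achievable (ε, δ) budget via a privacy accountant.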
For the generation of biomedical participant data in clinical trials, Beaulieu-Jones et al. (2019) apply an AC-GAN under a (ε = 3.5, δ = 10^-5)-DP setting based on Gaussian noise added to AC-GAN's gradients during training. Bae et al. (2020) propose AnomiGAN to anonymise private medical data via some degree of output randomness during inference. This randomness of the generator is achieved by randomly adding, for each layer, one of its separately stored training variances. AnomiGAN achieves competitive results on a non-imaging breast cancer dataset and a non-imaging prostate cancer dataset for any of the reported privacy parameter values ε ∈ [0.0, 0.5] compared to DP, where Laplacian noise is added to samples.

17 For example, if an identical model is trained two times, once with training data D resulting in model M and once with marginally different training data D′ resulting in M′, the training is (ε)-DP if the following holds true: for any possible output x, the output probability for x of model M differs by no more than a factor of exp(ε) from the output probability for x of M′.
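The (ε, δ) guarantee referenced above corresponds to the standard differential privacy definition (Dwork, 2006), which for a randomised mechanism M and any two datasets D, D′ differing in a single record can be written as:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
\qquad \text{for every set of outputs } S,
```

where setting δ = 0 recovers the pure ε-DP case described in the footnote.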
Outlook on synthetic cancer image privacy. Despite the above efforts, DP in GANs has only been applied to non-imaging cancer data, which indicates research potential for extending these methods reliably to cancer imaging data. According to Stadler et al. (2021), using synthetic data generated under DP can protect outliers in the original data from linkage attacks, but likely also reduces the statistical signal of these original data points, which can result in lower utility of the synthetic data. Apart from this privacy-utility tradeoff, it may not be readily controllable or predictable which original data features are preserved and which are omitted in the synthetic datasets (Stadler et al., 2021). In fields such as cancer imaging where patient privacy is critical, desirable privacy-utility tradeoffs need to be defined and thoroughly evaluated to enable trust, shareability, and usefulness of synthetic data. Consensus is yet to be found as to how privacy preservation in GAN-generated data can be evaluated and verified in the research community and in clinical practice. Promising approaches include methods that define a privacy gain/loss for synthetic samples (Stadler et al., 2021; Yoon et al., 2020). Yoon et al. (2020), for instance, define and backpropagate an identifiability loss to the generator to synthesise anonymised electronic health records (EHRs). The identifiability loss is based on the notion that the minimum weighted Euclidean distance between two patient records from two different patients can serve as a desirable anonymisation target for synthetic data. Designing or extending reliable methods and metrics for standardised quantitative evaluation of patient privacy preservation in synthetic medical images is a line of research that we call attention to.
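The distance-based intuition behind such identifiability measures can be sketched as follows. This is our own simplified, unweighted variant for illustration, not the exact formulation of Yoon et al. (2020):

```python
import numpy as np

def identifiability_share(real, synthetic):
    """Fraction of real records whose nearest synthetic record is closer
    than their nearest *other* real record. High values suggest synthetic
    records sit suspiciously close to individual patients."""
    real, synthetic = np.asarray(real, float), np.asarray(synthetic, float)
    identified = 0
    for i, r in enumerate(real):
        d_real = min(np.linalg.norm(r - o) for j, o in enumerate(real) if j != i)
        d_syn = min(np.linalg.norm(r - s) for s in synthetic)
        identified += d_syn < d_real
    return identified / len(real)
```

A generator penalised for a high identifiability share is pushed to keep its synthetic records at least as far from any real patient as real patients are from each other.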

Obfuscation of identifying patient features in images
If the removal of all sensitive patient information from a cancer imaging dataset allows for sharing such a dataset, then GANs can be used to obfuscate such sensitive data. As indicated by Fig. 1(g), GANs can learn to remove the features from the imaging data that could reveal a patient's identity, e.g. by learning to apply image inpainting to pixel or voxel data of burned-in image annotations or of identifying body parts. Such identifying body parts could be the facial features of a patient, as was shown by Schwarz et al. (2019) on the example of cranial MRI. Numerous studies exist where GANs accomplish facial feature de-identification on non-medical imaging modalities (Hukkelås et al., 2019; Li and Lin, 2019; Maximov et al., 2020). For medical imaging modalities, GANs have yet to prove themselves as a tool of choice for anatomical and facial feature de-identification against common standards (Ségonne et al., 2004; Bischoff-Grethe et al., 2007; Schimke et al., 2011; Milchenko and Marcus, 2013) with solid baselines. These standards, however, have been shown to be susceptible to reconstruction achieved by unpaired image-to-image GANs on MRI volumes, with high reversibility for blurred faces and partial reversibility for removed facial features (Abramian and Eklund, 2019). Van der Goten et al. (2021) provide a first proof-of-concept for GAN-based facial feature de-identification in 3D (128^3 voxel) cranial MRI. The generator of their conditional de-identification GAN (C-DeID-GAN) receives brain mask, brain intensities and a convex hull of the brain MRI as input and generates de-identified MRI slices. C-DeID-GAN generates the entire de-identified brain MRI scan and, hence, may not be able to guarantee that the generation process does not alter any of the original brain features. A solution to this can be to only generate and replace the 2D MRI slices, or parts thereof, that contain non-pathological facial features, while retaining all other original 2D MRI slices.
Presuming preservation of medically relevant features and robustness of de-identification, GAN-based approaches can allow for subsequent medical analysis, privacy-preserving data sharing, and provision of de-identified training data. Hence, we highlight the research potential of GANs for robust medical image de-identification, e.g. via image inpainting GANs that have already been successfully applied to other tasks in cancer imaging, such as synthetic lesion inpainting into mammograms (Wu et al., 2018a; Becker et al., 2019) and lung CT scans (Mirsky et al., 2019). Also, GAN-based patient feature de-identification methods that are adjustable and trainable to remain quantifiably robust against adversarial image reconstruction are a research line of interest.

Identifying patient features in latent representations
In line with Fig. 1(g), a further example of privacy-preserving methods are autoencoders18 that learn patient identity-specific features and obfuscate such features when encoding input images into a latent space representation. Such an identity-obfuscated representation can be used as input to further models (classification, segmentation, etc.) or decoded back into a de-identified image. Adversarial training has been shown to be effective for learning a privacy-preserving encoding function, where a discriminator tries to succeed at classifying the private attribute from the encoded data (Raval et al., 2017; Wu et al., 2018c; Yang et al., 2018c; Pittaluga et al., 2019). Apart from being trained via the backpropagated adversarial loss, the encoder needs at least one further utility training objective to learn to generate useful representations, such as denoising (Vincent et al., 2008) or classification of a second attribute, e.g. facial expressions (Chen et al., 2018a; Oleszkiewicz et al., 2018). Siamese Neural Networks (Bromley et al., 1994) such as the Siamese Generative Adversarial Privatizer (SGAP) (Oleszkiewicz et al., 2018) have been used effectively for adversarial training of an identity-obfuscated representation encoder. In SGAP, two weight-sharing Siamese discriminators are trained using a distance-based loss function to learn to classify whether a pair of images belongs to the same person. As visualised in Fig. 8, Kim et al. (2019a) follow a similar approach with the goal of de-identifying and segmenting brain MRI data. Two feature maps are encoded from a pair of MRI scans and fed into a Siamese discriminator that evaluates via binary classification whether the two feature maps are from the same patient. The generated feature maps are also fed into a segmentation model that backpropagates a Dice loss (Sudre et al., 2017) to train the encoder. Fig. 8 illustrates the scenario where the encoder is deployed in a trusted setting after training, e.g.
in a clinical centre, and the segmentation model is deployed in an untrusted setting, e.g. outside the clinical centre at a third party. The encoder shares the identity-obfuscated feature maps with the external segmentation model without the need to transfer the sensitive patient data outside the clinical centre. This motivates further research into adversarial identity-obfuscated encoding methods, e.g. to allow sharing and usage of cancer imaging data representations and models across clinical centres.
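The distance-based loss used to train such Siamese discriminators is typically a contrastive loss over pairs of embeddings; a minimal NumPy sketch of this standard formulation (our own illustration, not the authors' exact loss) is:

```python
import numpy as np

def contrastive_loss(f1, f2, same_patient, margin=1.0):
    """Distance-based loss for a pair of feature maps: pulls embeddings of
    the same patient together and pushes different patients at least
    `margin` apart."""
    d = np.linalg.norm(np.asarray(f1, float) - np.asarray(f2, float))
    if same_patient:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

In the privacy setting, this loss is used adversarially: the discriminator minimises it, while the encoder is trained so that the discriminator can no longer tell same-patient pairs apart.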

Adversarial attacks putting patients at risk
Examples of GAN-based tampering with cancer imaging data. For instance, Mirsky et al. (2019) added and removed evidence of cancer in lung CT scans. Of two identical deep 3D convolutional cGANs (based on pix2pix), one was used to inject (diameter ≥ 10 mm) and the other to remove (diameter < 3 mm) multiple solitary pulmonary nodules indicating lung cancer. The GANs were trained on 888 CT scans from the Lung Image Database Consortium image collection (LIDC-IDRI) dataset (Armato III et al., 2011) and inpainted nodules into an extracted region of interest of 32^3 voxel cuboid shape. The trained GANs can be autonomously executed by malware and are capable of injecting nodules into standard CT scans that are realistic enough to deceive both radiologists and AI disease detection systems. Three radiologists with 2, 5 and 7 years of experience analysed 70 tampered and 30 authentic CT scans. Spending on average 10 min on each scan, the radiologists diagnosed 99% of the scans with added nodules as malignant and 94% of the scans with removed nodules as healthy. After disclosing the presence of the attack to the radiologists, the percentages dropped to 60% and 87%, respectively (Mirsky et al., 2019). Similarly, Becker et al. (2019) trained a CycleGAN on the BCDR (Lopez et al., 2012) and the INbreast (Moreira et al., 2012) datasets to generate suspicious features and were able to remove or inject them into existing mammograms. They showed that their approach can fool radiologists at lower pixel dimensions (i.e. 256 × 256), demonstrating that alterations in patient images by a malicious attacker can remain undetected by clinicians, influence the diagnosis, and potentially harm the patient.

Fig. 8. Example of an autoencoder architecture trained via adversarial loss to learn privacy-preserving feature maps, as in Kim et al. (2019a), and/or a privacy-preserving latent representation. Once trained, and after thorough periodic manual verification of its ability to preserve privacy, the representation and/or the feature maps can be sent to third parties outside the clinical centre for model training or inference requests.
Defending against adversarial attacks. With regard to fooling diagnostic models, one measure to counter adversarial attacks is to increase model robustness against adversarial examples (Madry et al., 2017), as suggested by Fig. 1(h). Augmenting the robustness has been shown to be effective for medical imaging segmentation models (He et al., 2019; Park et al., 2020), lung nodule detection models (Paul et al., 2020), skin cancer recognition (Huq and Pervin, 2020; Hirano et al., 2021), and classification of histopathology images of lymph node sections with metastatic tissue (Wetstein et al., 2020). Liu et al. (2020b) provide model robustness by adding adversarial chest CT examples to the training data. These adversarial examples are composed of synthetic nodules that are generated by a 3D convolutional variational encoder trained in conjunction with a WGAN-GP (Gulrajani et al., 2017) discriminator. To further enhance robustness, Projected Gradient Descent (PGD) (Madry et al., 2017) is applied to find and protect against noise patterns for which the detector network is prone to produce over-confident false predictions.
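PGD crafts an adversarial perturbation by repeatedly stepping along the sign of the input gradient of the loss and projecting back into an ε-ball around the original input. The following is a minimal NumPy sketch against a toy logistic model (real attacks target deep networks; model, names, and parameter values are our own illustration):

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=20):
    """L-infinity PGD against a logistic model p = sigmoid(w.x + b):
    ascend the cross-entropy loss w.r.t. the input, projecting each step
    back into the eps-ball around the clean input x."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad = (p - y) * w                        # dL/dx for cross-entropy
        x_adv = x_adv + alpha * np.sign(grad)     # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
    return x_adv
```

Adversarial training then simply feeds such perturbed inputs (with their original labels) back into the training set, which is the robustification strategy the cited works build on.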
Apart from being the adversary, GANs can also detect adversarial attacks and are thus applicable as a security counter-measure enabling attack anticipation, early warning, monitoring, and mitigation. Defense-GAN, for example, learns the distribution of non-tampered images and can generate an output close to an inference input image but without its adversarial modifications (Samangouei et al., 2018).
We highlight the research potential in adversarial attacks and examples, alongside prospective GAN detection and defence mechanisms that can, as elaborated, highly impact the field of cancer imaging.
Apart from the image injection of entire tumours and the generation of adversarial radiomics examples, a further attack vector to consider in future studies is the perturbation of the specific imaging features within an image that are used to compute radiomics features.

Data annotation and segmentation challenges 4.3.1. Annotation-specific issues in cancer imaging
Missing annotations in datasets. In cancer imaging, not only is the availability of large datasets rare, but so is the availability of labels, annotations, and segmentation masks within such datasets. The generation and evaluation of such labels, annotations, and segmentation masks is a task for which trained health professionals (radiologists, pathologists) are needed to ensure validity and credibility (Hosny et al., 2018; Bi et al., 2019). Nonetheless, radiologist annotations of large datasets can take years to generate (Bi et al., 2019). The tasks of labelling and annotating (e.g., bounding boxes, segmentation masks, textual comments) cancer imaging data are, hence, expensive both in time and cost, especially considering the large amount of data needed to train deep learning models.
Intra/inter-observer annotation variability. This cancer imaging challenge is further exacerbated by the high intra- and inter-observer variability between both pathologists (Gilles et al., 2008; Dimitriou et al., 2018; Martin et al., 2018; Klaver et al., 2020) and radiologists (Elmore et al., 1994; Hopper et al., 1996; Hadjiiski et al., 2012; Teh et al., 2017; Wilson et al., 2018; Woo et al., 2020; Brady, 2017) in interpreting cancer images across imaging modalities, affected organs, and cancer types. Automated annotation processes based on deep learning models make it possible to produce reproducible and standardised results in each image analysis. In one of the most common cases, where the annotations consist of a segmentation mask, reliably segmenting both tumour and non-tumour tissues is crucial for disease analysis, biopsy, and subsequent intervention and treatment (Hosny et al., 2018; Huynh et al., 2020), the latter being further discussed in Section 4.5. For example, automatic tumour segmentation models are useful in the context of radiotherapy treatment planning (Cuocolo et al., 2020).
Human biases in cancer image annotation. During routine tasks, such as medical image analysis, humans are prone to account for only a few of many relevant qualitative image features. On the contrary, the strength of GANs and deep learning models is the evaluation of large numbers of multi-dimensional image features alongside their (non-linear) interrelationships and combined importance (Hosny et al., 2018). Deep learning models are likely to react to unexpected and subtle patterns in the imaging data (e.g., anomalies, hidden comorbidities, etc.) that medical practitioners are prone to overlook, for instance due to any of multiple existing cognitive biases (e.g., anchoring bias, framing bias, availability bias) (Brady, 2017) or inattentional blindness (Drew et al., 2013). Inattentional blindness occurs when radiologists (or pathologists) have so much of their attention drawn to a specific task, such as finding an expected pattern (e.g., a lung nodule) in the imaging data, that they become blind to other patterns in that data.

Table 3
Overview of adversarially-trained models applied/applicable to data access and privacy cancer imaging challenges. Publications are clustered by section and ordered by year in ascending order.
Implications of low segmentation model robustness. As for the common annotation task of segmentation mask delineation, automated segmentation models can minimise the risk of the aforesaid human biases. However, to date, segmentation models have difficulties when confronted with intricate segmentation problems, including domain shifts, rare diseases with limited sample size, or small lesion and metastasis segmentation. In this sense, the performance of many automated and semi-automated clinical segmentation models has been suboptimal (Sharma and Aggarwal, 2010). This emphasises the need for expensive manual verification of segmentation model results by human experts (Hosny et al., 2018). The challenge of training automated models for difficult segmentation problems can be approached by applying methods for learning discriminative features without explicit labels. Such methods include GANs and variational autoencoders (Kingma and Welling, 2013) capable of automating robust segmentation (Hosny et al., 2018). In addition, segmented regions of interest (ROI) are commonly used to extract quantitative imaging features with diagnostic value, such as radiomics features. The latter are used to detect and monitor tumours (e.g., lymphoma (Kang et al., 2018)), biomarkers, and tumour-specific phenotypic attributes (Lambin et al., 2012; Parmar et al., 2015). The accuracy and success of such commonly applied diagnostic image feature quantification methods, hence, depends on accurate and robust ROI segmentations. Segmentation models need to be able to provide reproducibility of extracted quantitative features and biomarkers (Bi et al., 2019) with reliably low variation, among others, across different scanners, CT slice thicknesses, and reconstruction kernels (Balagurunathan et al., 2014; Zhao et al., 2016). To this end, we promote lines of research that use adversarial training schemes to target the robustification of segmentation models. Progress in this open research challenge can unlock trust, usability, and clinical adoption of biomarker quantification methods in clinical practice. Table 4 summarises the collection of segmentation publications that utilise such adversarial training approaches and GAN-based data synthesis for cancer imaging.

Table 4
Overview of adversarial training and GAN-based approaches applied to segmentation in cancer imaging tasks. Publications are clustered by organ type and ordered by year in ascending order. '*' indicates that the metrics are only available in figures and the baseline numbers are lower than using GANs in the corresponding paper. 'n.a.' indicates that there was no comparison with a specific baseline, with the reason for this being indicated in the 'Highlights' column.

GAN applications for cancer image segmentation
In Table 4, we further report the baseline performance alongside the performance increase attributable to applying GANs or adversarial training for each surveyed publication. For the common Dice Score segmentation performance metric, Fig. 10 visualises these differences. Comparing the figure's black identity line and the red trend line over publications, we observe a general improvement of approximately 5 percentage points for adversarial learning methods compared to their baselines. Fig. 11 further displays the variation in performance between baselines and adversarial network methods for the years 2017 to 2021. Based on visual analysis, performance gains seem to be both anatomy-invariant and invariant to the strength of the baseline, where similar gains are achieved for initially low (e.g., < 0.7) and high (e.g., >= 0.7) baseline Dice scores. While Figs. 10 and 11 offer interesting quantitative insights, we recommend taking potential publication bias19 into account when drawing conclusions from these figures. Trends in the presented data in these plots can be analysed holistically; however, they are not intended to benchmark and compare individual publications against each other. This is due to multiple limiting factors of such comparisons, including differences in (a) the used baselines, (b) organs, (c) modalities, (d) the segmentation task and its associated difficulty, (e) the amount of training and testing data, (f) data and annotation quality, (g) pre- and post-processing methods, or (h) the study's objectives. In regard to (h), some studies may focus on other benefits of adversarial learning methods instead of or apart from Dice Score improvement, such as reducing the needed training dataset size, domain adaptation in general, protecting patient privacy with synthetic data, or simply improving other metrics (e.g. Hausdorff distance, FID).
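For reference, the Dice Score underlying these comparisons measures the overlap between a predicted mask P and a ground-truth mask G as 2|P∩G|/(|P|+|G|); a minimal NumPy implementation of this standard metric is:

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks (1 = perfect overlap).
    `eps` avoids division by zero for empty masks."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

The Dice loss used during training (e.g., Sudre et al., 2017) is typically a soft, differentiable variant of 1 minus this score.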
In the following sections, we provide a summary of the commonly used techniques and trends in the GAN literature that address the challenges in cancer image segmentation.
Robust quantitative imaging feature extraction. For example, Xiao et al. (2019) addressed the challenge of robustifying segmentation models for reliable biomarker quantification by providing radiomics features as conditional input to the discriminator of their adversarially trained liver tumour segmentation model. Their learning procedure strives to inform the generator to create segmentations that are specifically suitable for subsequent radiomics feature computation. Apart from adversarially training segmentation models, we also highlight the research potential of adversarially training quantitative imaging feature extraction models (e.g., deep learning radiomics) for reliable application in multi-centre and multi-domain settings.
Synthetic segmentation model training data. By augmenting and varying the training data of segmentation models, it is possible to substantially decrease the amount of manually annotated images required during training while maintaining performance (Foroozandeh and Eklund, 2020). A general pipeline for such usage of GAN-based generative models is demonstrated in Fig. 9(a) and mentioned in Fig. 1(j).
Over the past few years, CycleGAN-based approaches have been widely used for synthetic data generation due to the possibility of using unpaired image sets in training, as compared to paired image translation methods like pix2pix or SPADE (Park et al., 2019). CycleGAN-based data augmentation has been shown to be useful for segmentation model training, in particular for generating images with different acquisition characteristics, such as contrast-enhanced MRI from non-contrast MRI, cross-modality image translation between different modalities such as CT and MRI images (Huo et al., 2018), and domain adaptation tasks (Jiang et al., 2018). The popularity of CycleGAN-based methods lies not only in image synthesis or domain adaptation, but also in the inclusion of simultaneous image segmentation in the pipeline.
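The cycle-consistency constraint that makes unpaired training possible penalises the round-trip reconstruction G_BA(G_AB(x)) ≈ x in both directions. A minimal NumPy sketch of this loss with arbitrary mapping functions (our own illustration; in practice G_AB and G_BA are the two generator networks):

```python
import numpy as np

def cycle_consistency_loss(x_a, x_b, g_ab, g_ba):
    """L1 cycle-consistency loss of CycleGAN-style training:
    A -> B -> A and B -> A -> B reconstructions should match the inputs."""
    loss_a = np.mean(np.abs(g_ba(g_ab(x_a)) - x_a))
    loss_b = np.mean(np.abs(g_ab(g_ba(x_b)) - x_b))
    return loss_a + loss_b
```

This term is added to the two adversarial losses, anchoring the translation so that anatomical content survives the domain change even without paired samples.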
Fig. 9. Overview of cancer imaging GAN applications for detection and segmentation. (a) describes training data augmentation of downstream task models (e.g., segmentation, detection, classification, etc.). In (b) a discriminator scrutinises the segmentations created by a segmentation model, while in (c) the discriminator enforces the model to create domain-agnostic latent representations. (d) illustrates domain adaptation, where the translated target domain images are used for downstream model training. In (e), the AC-GAN (Odena et al., 2017) discriminator classifies original data. In (f), one GAN generates ROIs while another inpaints them into full-sized images. (g) uses the discriminator's latent space to find abnormal/outlier representations.

Fig. 10. Scatter plot illustrating the segmentation performance improvement attributable to adversarial networks for the surveyed publications. Each publication is represented by a marker with a colour and shape encoding depicting the publication's anatomical category. Only those publications are included that measure performance via Dice Score and compare against a baseline, as reported in Table 4. For publications reporting multiple Dice Scores, their mean was computed and included herein. The black identity line indicates no change between baseline and adversarial network intervention, while dots below this line represent an improvement. The red regression line depicts the trend of improvement across publications. The author names of a few publications have been manually selected for highlighting based on the distance to the trend line.

Table 5
Overview of adversarially-trained models applied to detection and diagnosis tasks in cancer imaging. Publications are clustered by organ type and ordered by year in ascending order.

Fig. 11. Scatter plot displaying year of publication and Dice Score improvement (in %). Each marker represents a publication and its colour and shape encoding represents its corresponding anatomical category. Only those publications are included that report Dice Score alongside a baseline comparison. For publications reporting multiple Dice Scores, their mean was computed and included herein. Author names have been manually selected at random for highlighting.

Although pix2pix methods require paired samples, pix2pix is also a widely used type of GAN in data augmentation for medical image segmentation (see Table 4). Several works on segmentation have demonstrated its effectiveness in generating synthetic medical images. By manipulating its input, the variability of the training dataset for image segmentation could be remarkably increased in a controlled manner (Abhishek and
Hamarneh, 2019; Oliveira, 2020b). Similarly, conditional GAN methods have also been used for controllable data augmentation for improving lesion segmentation (Oliveira, 2020a). Providing a condition as an input to generate a mask is particularly useful to specify the location, size, shape, and heterogeneity of the synthetic lesions. One recent example demonstrates this in brain MRI tumour synthesis by conditioning the input on simplified, controllable concentric circles that specify lesion location and characteristics. A further method for data augmentation is the inpainting of generated lesions into healthy real images or into other synthetic images, as depicted in Fig. 9(f). Overall, the described data augmentation techniques have been shown to improve generalisability and performance of segmentation models by increasing both the number and the variability of training samples (Qasim et al., 2020; Foroozandeh and Eklund, 2020).

Segmentation models with integrated adversarial loss. As stated in Fig. 1(i), GANs can also be used as the algorithm that generates robust segmentation masks, where the generator is used as a segmenter and the discriminator scrutinises the segmentation masks given an input image. One intuition behind this approach is the detection and correction of higher-order inconsistencies between the ground truth segmentation maps and the ones created by the segmenter via adversarial learning (Luc et al., 2016; Hu et al., 2020; Cirillo et al., 2020). This approach is demonstrated in Fig. 9(b). With the additional adversarial loss when training a segmentation model, this approach has been shown to improve semantic segmentation accuracy (Hung et al., 2019; Sarker et al., 2019; Shi et al., 2020). Using adversarial training, the similarity of a generated mask to the manual segmentation of an input image is taken into consideration by the discriminator, allowing a global assessment of segmentation quality.
This approach further offers a practical solution towards handling intra- and inter-observer annotation variability, as the mask discriminator learns an average over observers, which is backpropagated to the segmenter via the adversarial loss.
A unique way of incorporating the adversarial loss from the discriminator has been recently proposed in Nie and Shen (2020). In their work, the authors utilise a fully-convolutional network as a discriminator, unlike its counterparts that use binary, single neuron output networks. In doing so, a dense confidence map is produced by the discriminator, which is further used to train the segmenter with an attention mechanism.
Overall, using an adversarial loss as an additional global segmentation assessment is likely to be a helpful further signal for segmentation models, in particular, for heterogeneously structured datasets of limited size (Kohl et al., 2017), as is common for cancer imaging datasets. We highlight potential further research in GAN-based segmentation models to learn to segment increasingly fine radiologic distinctions. These models can help to solve further cancer imaging challenges, for example, accurate differentiation between neoplasms and tissue response to injury in the regions surrounding a tumour after treatment (Bi et al., 2019).
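The combined training signal of such segmenters, a supervised overlap term plus the discriminator's global adversarial assessment, can be sketched schematically. This NumPy sketch uses an illustrative weighting and a non-saturating adversarial term; it is our own simplification, not any specific paper's loss:

```python
import numpy as np

def segmenter_loss(pred_mask, gt_mask, d_score_on_pred, adv_weight=0.1, eps=1e-7):
    """Supervised Dice loss plus an adversarial term that rewards the
    segmenter when the mask discriminator scores its output as 'real'
    (d_score_on_pred in (0, 1))."""
    pred, gt = np.asarray(pred_mask, float), np.asarray(gt_mask, float)
    dice = (2.0 * (pred * gt).sum() + eps) / (pred.sum() + gt.sum() + eps)
    dice_loss = 1.0 - dice
    adv_loss = -np.log(max(d_score_on_pred, eps))  # non-saturating GAN loss
    return dice_loss + adv_weight * adv_loss
```

The Dice term enforces per-pixel agreement with the annotation, while the adversarial term penalises masks that the discriminator can globally distinguish from expert delineations.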

Segmentation models with integrated adversarial domain discrimination.
Moreover, a similar adversarial loss can also be applied internally to the features of the segmentation model, as illustrated in Fig. 9(c). Such an approach can benefit unsupervised domain adaptation and domain generalisation by enforcing the segmentation model to learn to base its predictions on domain-invariant feature representations (Kamnitsas et al., 2017).

Limitations and future prospects for cancer imaging segmentation
As shown in Table 4, the applications of GANs in cancer image segmentation cover a variety of clinical requirements. Remarkable steps have been taken to advance this field of research over the past few years. However, the following limitations and future prospects can be considered for further investigation:
• Although data augmentation using GANs can increase the number of training samples for segmentation, the variability of the synthetic data is limited to that of the training data. Hence, it may limit the potential for improving segmentation accuracy. Moreover, training a GAN that produces high sample variability requires a large dataset with correspondingly high variability and, in most cases, with corresponding annotations. Considering the data scarcity challenge in the cancer imaging domain, this can be difficult to achieve.
• In some cases, using GANs can be excessive, considering the difficulties related to convergence of the competing generator and discriminator parts of GAN architectures. For example, the recently proposed SynthSeg model (Billot et al., 2020) is based on Gaussian Mixture Models to generate images and train a contrast-agnostic segmentation model. Such approaches can be considered as an alternative that avoids common pitfalls of the GAN training process (e.g., mode collapse). However, this approach needs to be further investigated for cancer imaging tasks, where the heterogeneity of tumours is challenging.
• A great potential of using synthetic cancer images is to generate common shareable datasets as benchmarks for automated segmentation methods (Bi et al., 2019). Although such a benchmark dataset needs its own validation, it can be beneficial in testing the limits of automated methods with systematically controlled test cases. Such benchmark datasets can be generated by controlling the shape, location, size, and intensities of tumours, and can simulate diverse images of different domains that reflect the distributions of real institutions. To avoid learning patterns that are only present in synthetic datasets (e.g., checkerboard artifacts), it is a prospect to investigate further metrics that measure the distance of such synthetic datasets to real-world datasets, as well as the generalisation and extrapolation capabilities of models trained on synthetic benchmarks to real-world data.

Detection and diagnosis challenges

Common issues in diagnosing malignancies
Clinicians' high diagnostic error rates. Studies of radiological error report high diagnostic error rates (e.g., discordant interpretations in 31%-37% of oncologic CT examinations and 13% major discrepancies in neuro CT and MRI) (Brady, 2017). These findings exemplify the uncomfortably high diagnostic and image interpretation error rates that persist in the field of radiology despite decades of interventions and research (Itri et al., 2018).
The challenge of reducing clinicians' high workload. In some settings, radiologists must interpret one CT or MRI image every 3-4 s of an average 8-h workday (McDonald et al., 2015). Automated CADe and CADx systems can provide a more balanced, quality-focused workload for radiologists, in which radiologists focus on scrutinising automatically detected lesions (false positive reduction) and areas/patches with high predictive uncertainty (false negative reduction). A benefit of CADe/CADx deep learning models is their real-time inference and strong pattern recognition capability, which is not readily susceptible to cognitive bias (discussed in 4.3.1), environmental factors (Itri et al., 2018), or inter-observer variability (discussed in 4.3.1).

Fig. 12. Scatter plot illustrating the performance improvement attributable to adversarial networks for the surveyed disease diagnosis and detection publications. The shown performance is based on the respective publication's metrics reported in Table 5, which include f1-score (F1), sensitivity (SEN), accuracy (ACC), area under the receiver operating characteristic curve (AUC), and detection rate (DR). Each publication is represented by a marker whose colour and shape encode the publication's anatomical category. The black identity line indicates no change between baseline and adversarial network intervention, while dots below this line represent an improvement. The red regression line depicts the trend of improvement across publications. The author names of a few publications have been manually selected for highlighting based on their distance to the trend line.
Detection model performance on critical edge cases. Challenging cancer imaging problems are the high intra- and inter-tumour heterogeneity (Bi et al., 2019), the detection of small lesions and metastases across the body (e.g., lymph node involvement and distant metastasis; Hosny et al., 2018), and the accurate distinction between malignant and benign tumours (e.g., for detected lung nodules that appear similar on CT scans; Hosny et al., 2018). Methods are needed to extend and further increase the current performance of deep learning detection models (Bi et al., 2019).

GAN applications for cancer detection and diagnosis
As we detail in the following, the capability of adversarial learning to improve malignancy detection has been demonstrated for multiple tumour types and imaging modalities. To this end, Table 5 summarises the collection of recent publications that utilise GANs and adversarial training for cancer detection, classification, and diagnosis.
Figs. 12 and 13 visualise the publications' performance metric values reported in Table 5. Fig. 12 provides a visual estimate of the effectiveness of GANs and adversarial training in increasing downstream task performance. A performance increase of approximately 5 percentage points can be observed by comparing the figure's black identity line with the red trend line over publications. We note that no visual pattern seems to be observable that indicates a difference in performance gain between anatomical categories. Across all publications, the performance gains do not seem to be a function of the strength of the baseline, as they remain approximately constant with increasing baseline performance. This is indicated by the minimal change in distance between the black identity line and the red trend line throughout the graph. Fig. 13 shows the GAN-induced variation in performance for the years 2017 to 2021, with multiple adversarial models achieving a performance increase of over 10% and most models over 3% on their respective diagnostic downstream task. As emphasised in Section 4.3.2, conclusions drawn from Figs. 12 and 13 have to take publication bias into account. Further, benchmarking and comparison of individual publications based on the data presented in these figures is not part of their intended use due to the differences in baselines, modalities, organs, train and test datasets, and publication objectives.

Adversarial anomaly and outlier detection examples.
Schlegl et al. (2017) captured imaging markers relevant for disease prediction using a deep convolutional GAN named AnoGAN. AnoGAN learnt a manifold of normal anatomical variability, accompanying a novel anomaly scoring scheme based on the mapping from image space to a latent space. While Schlegl et al. validated their model on retina optical coherence tomography images, their unsupervised anomaly detection approach is applicable to other domains including cancer detection, as indicated in Fig. 1(l). Chen et al. (2018b) used a Variational Autoencoder GAN for unsupervised outlier detection on T1- and T2-weighted brain MRI images. The scans of healthy subjects were used to train the autoencoder model to learn the distribution of healthy images and detect pathological images as outliers. Creswell et al. (2018) proposed a semi-supervised Denoising Adversarial Autoencoder (ssDAAE) to learn a representation based on unlabelled skin lesion images. The semi-supervised part of their CNN-based architecture corresponds to malignancy classification of labelled skin lesions based on the encoded representations of the pretrained DAAE. As the amount of labelled data is smaller than that of unlabelled data, the labelled data is used to fine-tune the classifier and encoder. In ssDAAE, not only the adversarial autoencoder's chosen prior distribution (Makhzani et al., 2015), but also the class label distribution is discriminated by a discriminator, the latter distinguishing between predicted continuous labels and real binary (malignant/benign) labels. Kuang et al. (2020) applied unsupervised learning to distinguish between benign and malignant lung nodules. In their multi-discriminator GAN (MDGAN), various discriminators scrutinise the realness of generated lung nodule images. After GAN pretraining, an encoder is added in front of the generator of the end-to-end architecture to learn the feature distribution of benign pulmonary nodule images and to map these features into the latent space. The benign and malignant lung nodules were scored similarly to the f-AnoGAN framework (Schlegl et al., 2019), computing and combining an image reconstruction loss and a feature matching loss, the latter comparing the discriminators' feature representations between real and encoded-generated images from intermediate discriminator layers. As exemplified in Fig. 9(g), the model yielded high anomaly scores on malignant images and low anomaly scores on benign images despite the limited dataset size.
Benson and Beets-Tan (2020) used GANs trained on multi-modal MRI images as a 3-channel input (T1-T2 weighted, FLAIR, ADC MRI) for brain anomaly detection. The training of the generative network was performed using only healthy images together with pseudo-random irregular masks. Despite the training dataset consisting of only 20 subjects, the resulting model increased the anomaly detection rate.
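The f-AnoGAN-style scoring mentioned above combines an image-space reconstruction residual with a discriminator feature-matching residual. A minimal numpy sketch, assuming the encoder, generator, and discriminator are already trained so that only their outputs are passed in (the helper name `anomaly_score` and the toy inputs are ours, not the published code):

```python
import numpy as np

def anomaly_score(x, x_rec, feat_real, feat_rec, k=1.0):
    """f-AnoGAN-flavoured anomaly score: image reconstruction error plus a
    discriminator feature-matching error, weighted by k. 'x_rec' stands for
    G(E(x)); 'feat_*' stand for intermediate discriminator features."""
    rec_loss = np.mean((x - x_rec) ** 2)              # image-space residual
    feat_loss = np.mean((feat_real - feat_rec) ** 2)  # feature-space residual
    return rec_loss + k * feat_loss

# Toy example: a well-reconstructed (benign-like) case versus a case the
# healthy-data model reconstructs poorly (malignant-like)
x = np.ones((4, 4))
benign = anomaly_score(x, x + 0.01, np.ones(8), np.ones(8) + 0.01)
malignant = anomaly_score(x, x + 0.5, np.ones(8), np.ones(8) + 0.5)
```

Because the generator is trained only on healthy anatomy, pathological inputs cannot be reconstructed faithfully, so their score is driven up by both residual terms.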
Synthetic detection model training data. Among the GAN publications trying to improve classification and detection performance, data augmentation is the most recurrent approach to balance, vary, and increase the detection model's training set size, as suggested in Fig. 1(k).

Fig. 13. Scatter plot displaying year of publication and change in performance between baseline and adversarial network method (in %). As in Table 5, the underlying performance metrics include f1-score (F1), sensitivity (SEN), accuracy (ACC), area under the receiver operating characteristic curve (AUC), and detection rate (DR). Each publication is represented by a marker whose colour and shape encode the publication's anatomical category. Author names have been randomly selected for highlighting.

R. Osuala et al.
For instance, in breast imaging, Wu et al. (2018a) trained a class-conditional GAN to perform contextual in-filling to synthesise lesions in healthy scanned mammograms. Guan and Loew (2019) trained a GAN on the same dataset (Heath et al., 2001) to generate synthetic patches with benign and malignant tumours. The synthetically generated patches had clear artifacts and did not match the original dataset distribution. Jendele et al. (2019) used a CycleGAN with both film-scanned and digital mammograms to improve binary (malignant/benign) lesion detection via data augmentation. Detecting mammographically-occult breast cancers is another challenging topic addressed by GANs. For instance, Lee and Nishikawa (2020) exploit asymmetries between mammograms of the left and right breasts as signals for finding mammographically-occult cancer. They trained an image-conditioned GAN (pix2pix) to generate a healthy synthetic mammogram image of the contralateral breast (e.g., left breast) given the corresponding single-sided mammogram (e.g., right breast) as input. The authors showed that there is a higher similarity (MSE, 2D-correlation) between simulated-real (SR) mammogram pairs than between real-real (RR) mammogram pairs in the presence of mammographically-occult cancer. Consequently, in distinguishing between healthy and mammographically-occult mammograms, their classifier yielded a higher performance when trained with both RR and SR similarity as input (AUC = 0.67) than when trained only with RR pair similarity as input (AUC = 0.57). 3-dimensional conditional image synthesis with GANs has been shown, for instance, by Han et al. (2019a), who proposed a 3D Multi-Conditional GAN (3DMCGAN) to generate realistic and diverse nodules placed naturally on lung CT images to boost sensitivity in 3D object detection. Bu et al. (2020) built a 3D image-conditioned GAN based on pix2pix, where the input is a 3D volume of interest (VOI) that is cropped from a lung CT scan and contains a missing region in its centre.
Both generator and discriminator contain squeeze-and-excitation (Hu et al., 2018a) residual neural network (SE-ResNet) modules to improve the quality of the synthesised lung nodules. Another example based on lung CT images is the method by Nishio et al. (2020), in which the proposed GAN model uses masked 3D CT images and nodule size information to generate images.
As to multi-modal training data synthesis, Van Tulder and de Bruijne (2015) replaced missing sequences of multi-sequence MRI with synthetic data. The authors illustrated that if the synthetic data generation model is more flexible than the classification model, the synthetic data can provide features that the classifier has not extracted from the original data, which can improve performance. During colonoscopy, depth maps can enable navigation alongside aiding detection and size measurement of polyps. For this reason, Rau et al. (2019) demonstrated the synthesis of depth maps using an image-conditioned GAN (pix2pix) with monocular endoscopic images as input, reporting promising results on synthetic, phantom, and real datasets. In breast cancer detection, Muramatsu et al. (2020) translated lesions from lung CT to breast MMG using CycleGAN, yielding a performance improvement in breast mass classification when training a classifier with the domain-translated generated samples.

Future prospects for cancer detection and diagnosis
Granular class distinctions for synthetic tumour images. A further research opportunity exists in exploring a more fine-grained classification of tumours that characterises different subtypes and disease grades instead of binary malignant-benign classification. Being able to robustly distinguish between different disease subtypes with similar imaging phenotypes (e.g., glioblastoma versus primary central nervous system lymphoma; Kang et al., 2018) addresses the challenge of reducing diagnostic ambiguity (Bi et al., 2019). GANs can be explored to augment training data with samples of specific tumour subtypes to improve the distinction capabilities of disease detection models. This can be achieved by training a detection model on training data generated by various GANs, where each GAN is trained on a different tumour subtype distribution. Another option we deem worth exploring is to use the tumour subtype or the disease grade (e.g., the Gleason score for prostate cancer; Hu et al., 2018b) as a conditional input into the GAN to generate additional labelled synthetic training data.
Cancer image interpretation and risk estimation. Besides the detection of prospectively cancerous characteristics in medical scans, ensuring a high accuracy in the subsequent interpretation of these findings is a further challenge in cancer imaging. Improving interpretation accuracy can reduce the number of unnecessary biopsies and harmful treatments (e.g., mastectomy, radiation therapy, chemotherapy) of indolent tumours (Bi et al., 2019). For instance, the reported rate of overdiagnosis of non-clinically significant prostate cancer ranges widely, from 1.7% up to a noteworthy 67% (Loeb et al., 2014). To address this, detection models can be extended to provide risk and clinical significance estimations. For example, given both an input image and an array of risk factors (e.g., BRCA1/BRCA2 status for breast cancer (Li et al., 2017), comorbidity risks), a deep learning model can weight and evaluate a patient's risk based on learned associations between risk factors and input image features. The GAN framework is an example of this, where clinical, non-clinical, and imaging data can be combined, either as conditional input for image generation or as prediction targets. For instance, given an input image, an AC-GAN (Odena et al., 2017; Kapil et al., 2018) can classify the risk as a continuous label (see Fig. 9(e)) or, alternatively, a discriminator can be used to assess whether a risk estimate provided by a generator is realistic. Also, a generator can learn a function for transforming and normalising an input image given one or several conditional target risk factors or tumour characteristics (e.g., a specific mutation status, a present comorbidity, etc.) to generate labelled synthetic training data.

Treatment and monitoring challenges
After a tumour is detected and properly described, new challenges arise related to planning and execution of medical intervention. In this section we examine these challenges, in particular: tumour profiling and prognosis; challenges related to choice, response and discovery of treatments; as well as further disease monitoring. Table 6 provides an overview of the cancer imaging GANs that are applied to treatment and monitoring challenges, which are discussed in the following.

Disease prognosis and tumour profiling
Challenges for disease prognosis. An accurate prognosis is crucial to plan suitable treatments for cancer patients. However, in specific cases, it can be more beneficial to actively monitor the tumours instead of treating them (Bi et al., 2019). Challenges in cancer prognosis include the differentiation between long-term and short-term survivors (Bi et al., 2019), patient risk estimation considering the complex intra-tumour heterogeneity of the tumour microenvironment (TME) (Nearchou et al., 2021), and the estimation of the probability of disease stages and tumour growth patterns, which can strongly affect outcome probabilities (Bi et al., 2019). In this sense, GANs (Li et al., 2021b; Kim et al., 2018b) and AI models in general (Cuocolo et al., 2020; Dimitriou et al., 2018) have shown potential in prognosis and survival prediction for oncology patients. Li et al. (2021b) (in Table 2) show that their GAN-based CT normalisation framework for overcoming the domain shift between images from different centres significantly improves the accuracy of classification between short-term and long-term survivors. Ahmed et al. (2021) trained omicsGAN to translate between microRNA and mRNA expression data pairs; it could be readily enhanced to also translate between cancer imaging features and genetic information. The authors evaluate omicsGAN on breast and ovarian cancer datasets and report improved prediction signals for synthetic data tested via cancer outcome classification. Another non-imaging approach is provided by Kim et al. (2018b), who apply a GAN for patient cancer prognosis prediction based on the identification of prognostic biomarker genes. They train their GAN on reconstructed human biology pathway data, which allows for highlighting genes relevant to cancer development, resulting in an improvement of prognosis prediction accuracy.
In regard to these works on non-imaging approaches, we promote future extensions combining prognostic biomarker genes and -omics data with the phenotypic information present in cancer images into multi-modal prognosis models.

GAN tumour profiling examples.
Related to Fig. 1(l), Vu et al. (2020a) propose that image-conditioned GANs (pix2pix) can learn latent characteristics of tumour tissues that correlate with a specific tumour grade. The authors show that when inferring their proposed BenignGAN on malignant tumour tissue images after training it exclusively on benign ones, it generates less realistic results. This allows for quantitative measurement of the differences between the original and the generated image, whereby these differences can be interpreted as tumour grade. Kapil et al. (2018) explore an AC-GAN (Odena et al., 2017) on digital pathology imagery for semi-supervised quantification of the non-small-cell lung cancer biomarker programmed death ligand 1 (PD-L1). Their class-conditional generator receives a one-hot encoded PD-L1 label as input to generate a respective biopsy tissue image, while their discriminator receives the image and predicts both the PD-L1 label and whether the image is fake or real. The AC-GAN method compares favourably to other supervised and non-generative semi-supervised approaches, and also systematically yields high agreement with visual tumour proportional scoring (TPS), i.e., pathologists' visual estimation of the percentage of tumour cells showing PD-L1 staining.
As for the analysis of the TME, Quiros et al. (2019) propose PathologyGAN, which they train on breast and colorectal cancer tissue imagery. This allows for learning the most important tissue phenotype descriptions and provides a continuous latent representation space, enabling quantification and profiling of differences and similarities between different tumours' tissues. Quiros et al. (2019) show that lesions encoded in a GAN's latent space enable using vector distance measures to find similar lesions that are close in the latent space within large patient cohorts. We highlight the research potential of lesion latent space representations to assess inter-tumour heterogeneity. Also, the treatment strategies and successes of patients with a similar lesion can inform the decision-making process of selecting treatments for a lesion at hand, as denoted by Fig. 1(m).
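The latent-space lesion retrieval described above reduces to a nearest-neighbour search over encoded lesions. A minimal numpy sketch (the helper `nearest_lesions`, the Euclidean metric, and the toy cohort are our assumptions for illustration; any encoder and distance measure could be substituted):

```python
import numpy as np

def nearest_lesions(query, latents, top_k=3):
    """Return the indices of the top_k lesions whose latent codes are closest
    (Euclidean distance) to the query lesion's code. 'latents' is an
    (n_lesions, dim) array of encoded lesions from a trained GAN encoder."""
    dists = np.linalg.norm(latents - query, axis=1)
    return np.argsort(dists)[:top_k]

# Toy cohort of four encoded lesions in a 2D latent space
cohort = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [5.0, 5.0]])
ranked = nearest_lesions(np.array([0.05, 0.05]), cohort, top_k=2)
```

In practice, the retrieved neighbours' treatment histories and outcomes could then be surfaced to clinicians as decision support for the query lesion.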
Outlook on genotypic tumour profiling with phenotypic data. A further challenge is that targeted oncological therapies require genomic and immunological tumour profiling (Cuocolo et al., 2020) and an effective linking of tumour genotype and phenotype. Biopsies only allow analysing the biopsied portion of the tumour's genotype, while also increasing patient risk due to the possibility of dislodging and seeding of neoplastically altered cells (Shyamala et al., 2014; Parmar et al., 2015). Therefore, a trade-off 22 exists between minimising the number of biopsies and maximising the biopsy-based information about a tumour's genotype. These reasons, and the fact that current methods are invasive, expensive, and time-consuming (Cuocolo et al., 2020), make genotypic tumour profiling an important issue to be addressed by AI cancer imaging methods. In particular, adversarial deep learning models are promising candidates to infer the non-biopsied portion of a tumour's genotype after being trained on paired genotype and radiology imaging data. 23 We recommend future studies to explore this line of research, which is regarded as a key challenge for AI in cancer imaging (Bi et al., 2019; Parmar et al., 2015).

Treatment planning and response prediction
Challenges for cancer treatment predictions. A considerable number of malignancies and tumour stages have various possible treatment options and almost no head-to-head evidence to compare them to. Due to that, oncologists need to subjectively select an approved therapy based on their individual experience and exposure (Troyanskaya et al., 2020).
Furthermore, despite existing treatment response assessment frameworks in oncology, inter- and intra-observer variability regarding the choice and measurement of target lesions exists among oncologists and radiologists (Levy and Rubin, 2008). To achieve consistency and accuracy in standardised treatment response reporting frameworks (Levy and Rubin, 2008), AI and GAN methods can identify quantitative biomarkers 24 from medical images in a reproducible manner, useful for risk and treatment response predictions (Hosny et al., 2018).
Apart from treatment response assessment, treatment response prediction is also challenging, particularly for cancer treatments such as immunotherapy (Bi et al., 2019). In cancer immunogenomics, for instance, unsolved challenges comprise the integration of multi-modal data (e.g., radiomic and genomic biomarkers; Bi et al., 2019), immunogenicity prediction for neoantigens, and the longitudinal non-invasive monitoring of the therapy response (Troyanskaya et al., 2020). In regard to the sustainability of a therapy, the inter- and intra-tumour heterogeneity (e.g., in size, shape, morphology, kinetics, texture, etiology) and potential sub-clone treatment survival complicate individual treatment prediction, selection, and response interpretation (Bi et al., 2019).
22 Due to this and due to the high intra-tumour heterogeneity, available biopsy data likely only describes a subset of the tumour's clonal cell population.
23 Imaging data on which the entire lesion is visible, allowing to learn correlations between phenotypic tumour manifestations and genotype signatures.
24 For example, characteristics and density variations of the parenchyma patterns on breast images (Bi et al., 2019).

Table 6
Overview of adversarially-trained models applied to treatment and monitoring challenges. Publications are clustered by section and ordered by year in ascending order.

GAN treatment effect estimation examples.
In line with Fig. 1(n), Yoon et al. (2018) propose the conditional GAN framework 'GANITE', in which individual treatment effect prediction allows for accounting for unseen, counterfactual outcomes of treatment. GANITE consists of two GANs: first, a counterfactual GAN is trained on feature and treatment vectors along with the factual outcome data. Then, the trained generator's output is used to create a dataset on which the other GAN, called the ITE (Individual Treatment Effect) GAN, is trained. GANITE provides confidence intervals along with the prediction, while being readily scalable to any number of treatments. However, it does not allow for taking time, dosage, or other treatment parameters into account. MGANITE, proposed by Ge et al. (2020), extends GANITE by introducing dosage quantification and thus enables continuous and categorical treatment effect estimations. SCIGAN (Bica et al., 2020) also extends upon GANITE and predicts outcomes of continuous rather than one-time interventions; the authors further provide theoretical justification for GANs' success in learning counterfactual outcomes.
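The core accounting behind GANITE-style counterfactual estimation can be sketched in a few lines: once a counterfactual generator has imputed each patient's unobserved outcome, the individual treatment effect is simply the difference between the treated and untreated outcomes. The numpy sketch below is illustrative only; `estimate_ite`, the mock outcomes, and the stand-in "generator" outputs are our assumptions, not GANITE's actual training procedure:

```python
import numpy as np

def estimate_ite(factual_y, treatment, counterfactual_y):
    """Individual treatment effect (ITE) per patient: y(treated) - y(untreated).
    The factual outcome is observed; the counterfactual outcome is imputed,
    here passed in as if produced by a trained counterfactual generator."""
    y1 = np.where(treatment == 1, factual_y, counterfactual_y)  # treated outcome
    y0 = np.where(treatment == 0, factual_y, counterfactual_y)  # untreated outcome
    return y1 - y0

# Two patients: one treated, one not; counterfactuals from a (mock) generator
factual = np.array([0.9, 0.4])
treated = np.array([1, 0])
counterfactual = np.array([0.5, 0.7])  # generator's imputed missing outcomes
ite = estimate_ite(factual, treated, counterfactual)
```

The quality of such estimates clearly hinges on how realistic the imputed counterfactuals are, which is exactly the part the adversarial training in GANITE is designed to enforce.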
As to the problem of individual treatment response prediction, we suggest that quantitative comparisons of GAN-generated expected post-treatment images with real post-treatment images can yield interesting insights for tumour interpretation. We encourage future work to explore generating such post-treatment tumour images given a treatment parameter and a pre-treatment tumour image as conditional inputs. With varying treatment parameters as input, it is to be investigated whether GANs can inform treatment selection by simulating various treatment scenarios prior to treatment allocation, or whether GANs can help to understand and evaluate treatment effects by generating counterfactual outcome images after treatment application. Goldsborough et al. (2017) present an approach called CytoGAN, in which they synthesise fluorescence microscopy cell images using DC-GAN, LSGAN, or WGAN. The discriminator's latent representations learnt during synthesis enable grouping together encoded cell images that show similar cellular reactions to treatment by chemicals of known classes (morphological profiling). 25 Even though the authors reported that CytoGAN obtained inferior results 26 compared to classical, widely applied methods such as CellProfiler (Singh et al., 2014), using GANs to group tumour cell representations to inform chemical cancer treatment allocation decisions is an interesting approach in the realm of treatment selection, development (Kadurin et al., 2017a,b), and response prediction.
GAN radiation dose planning examples. As radiation therapy planning is labour-intensive and time-consuming, researchers have been spurred to pursue automated planning processes (Sharpe et al., 2014). As outlined in the following and suggested by Fig. 1(o), the challenge of automated radiation therapy planning can be approached using GANs.
By framing radiation dose planning as an image colourisation problem, Mahmood et al. (2018) introduced an end-to-end GAN-based solution, which predicts 3D radiation dose distributions from CT without the requirement of hand-crafted features. They trained their model on oropharyngeal cancer data along with three traditional ML models and a standard CNN as baselines. The authors trained a pix2pix GAN on 2D CT imagery and then fed the generated dose distributions to an inverse optimisation (IO) model in order to generate optimised plans. Their evaluation showed that their GAN plans outperformed the baseline methods in all clinical metrics. Kazemifar et al. (2020) (in Table 2) proposed a cGAN with a U-Net generator for paired MRI-to-CT translation. Using conventional dose calculation algorithms, the authors compared the dose computed for real CT and generated CT, where the latter showed high dosimetric accuracy. The study, hence, demonstrates the feasibility of synthetic CT for intensity-modulated proton therapy planning for brain tumour cases where only MRI scans are available. Maspero et al. (2018) proposed a GAN-assisted approach to speed up the process of MR-based radiation dose planning, using a pix2pix model to generate the synthetic CTs (sCTs) required for this task. They show that a conditional GAN trained on prostate cancer patient data can successfully generate sCTs of the entire pelvis.
A similar task has also been addressed by Peng et al. (2020). Their work compares two GAN approaches: one based on pix2pix and the other on CycleGAN. The main difference between these two approaches was that pix2pix was trained using registered MR-CT image pairs, whereas CycleGAN was trained on unregistered pairs. Ultimately, the authors report that pix2pix achieves results (i.e., mean absolute error) superior to CycleGAN, and highlight difficulties in generating high-density bony tissues using CycleGAN.
The recently introduced attention-aware DoseGAN (Kearney et al., 2020a) overcomes the challenges of volumetric dose prediction in the presence of diverse patient anatomy. As illustrated in Fig. 14, DoseGAN is based on a variation of the pix2pix architecture with a 3D encoder-decoder generator (L1 loss) and a patch-based PatchGAN discriminator (adversarial loss). The generator was trained on concatenated CT, planning target volume (PTV), and organs at risk (OARs) data of prostate cancer patients, and the discriminator's objective was to distinguish the real dose volumes from the generated ones. Both qualitatively and quantitatively, DoseGAN was able to synthesise more realistic volumetric doses compared to current alternative state-of-the-art methods. Murakami et al. (2020) published another GAN-based fully automated approach to dose distribution for Intensity-Modulated Radiation Therapy (IMRT) of prostate cancer. The novelty of their solution is that it does not require the tumour contour information, which is time-consuming to create, to successfully predict the dose based on the given CT dataset. Their approach consists of two pix2pix-based architectures, one trained on paired CT and radiation dose distribution images, and the other trained on paired structure images and radiation dose distribution images. From the generated radiation dose distribution images, the dosimetric parameters for the PTV and OARs are computed. The generated dosimetric parameters differed on average by only 1%-3% with respect to the original ground-truth dosimetric parameters. Koike et al. (2020) proposed a CycleGAN for dose estimation for head and neck CT images with metal artifact removal in CT-to-CT image translation, as described in Table 2. Providing consistent dose calculation in the presence of metal artifacts for head and neck IMRT, their approach achieves dose calculation performance similar to commercial metal artifact removal methods.
25 CytoGAN uses an approach comparable to the one shown in Fig. 9(g).
26 i.e., mechanism-of-action classification accuracy.
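Several of the dose-prediction models above optimise a pix2pix-style generator objective: an adversarial term that rewards fooling the patch discriminator plus a lambda-weighted L1 term tying the generated dose to the ground truth. A minimal numpy sketch of that combined loss (the lam=100 default follows the original pix2pix paper; the surveyed models' exact weightings and loss variants may differ):

```python
import numpy as np

def generator_loss(pred_dose, real_dose, d_fake_logits, lam=100.0):
    """Pix2pix-style generator objective on toy arrays: a non-saturating
    adversarial term on the discriminator's logits for generated samples,
    plus a lambda-weighted voxel-wise L1 reconstruction term."""
    # -log D(G(x)) expressed as softplus(-logits) for numerical stability
    adv = np.mean(np.log1p(np.exp(-d_fake_logits)))
    l1 = np.mean(np.abs(pred_dose - real_dose))  # voxel-wise L1 to real dose
    return adv + lam * l1

pred = np.full((2, 2), 0.5)   # generated dose patch
real = np.full((2, 2), 0.6)   # ground-truth dose patch
loss = generator_loss(pred, real, d_fake_logits=np.array([2.0, 1.5]))
```

The large lambda makes the L1 term dominate, which is what keeps conditional dose synthesis anchored to the ground-truth plan while the adversarial term sharpens realism.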

Disease tracking and monitoring
Challenges in tracking and modelling tumour progression. Tumour progression is challenging to model and commonly requires rich, multi-modal longitudinal datasets. As cancerous cells acquire growth advantages through genetic mutation in a process arguably analogous to Darwinian evolution (Hanahan and Weinberg, 2000), it is difficult to predict which of the many sub-clones in the TME will outgrow the other clones. A tumour lesion is, hence, constantly evolving in phenotype and genotype (Bi et al., 2019) and might acquire dangerous further mutations at any time. The TME's respective impact is exemplified by the stage II colorectal cancer outcome classification performance gain in Dimitriou et al. (2018), which is likely attributable to the high prognostic value of the TME information in their training data.
In addition, concurrent conditions and alterations in the organ system surrounding a tumour, but also in distant organs, may not only remain undetected, but could also influence patient health and progression (Bi et al., 2019). GANs can generate hypothetical comorbidity data 27 to aid awareness, testing, finding, and analysis of complex disease and comorbidity patterns. A further difficulty for tumour progression modelling is the a priori unknown effect of treatment. Treatment effects may even remain partly unknown after treatment, for example in the case of radiation therapy 28 (Verma et al., 2013) or after surgery 29 (Bi et al., 2019).
27 For example from EHR (Hwang et al., 2017; Dashtban and Li, 2020), imaging data, or a combination thereof.
28 Radiation therapy can result in destruction of the normal tissue (e.g., radionecrosis) surrounding the tumour. Such heterogeneous normal tissue can become difficult to characterise and distinguish from the cancerous tissue (Verma et al., 2013).
29 It is challenging to quantify the volume of remaining tumour residuals after surgical removal (Bi et al., 2019).

Fig. 14. The DoseGAN architecture proposed by Kearney et al. (2020a) and based on pix2pix. Given concatenated CT scans, planning target volume (PTV) and organs at risk (OARs), the generator of DoseGAN addresses the challenge of volumetric dose prediction for prostate cancer patients.

GAN tumour progression modelling examples.
Relating to Fig. 1(p), GANs can not only diversify the training data, but can also be applied to simulate and explore disease progression scenarios (Elazab et al., 2020). For instance, Elazab et al. (2020) propose GP-GAN, which uses stacked 3D conditional GANs for growth prediction of glioma based on longitudinal MR images. The generator is based on the U-Net architecture (Ronneberger et al., 2015) and the segmented feature maps are used in the training process. Kim et al. (2019b) trained a CycleGAN on concatenated pre-treatment MR, CT, and dose images (i.e., resulting in one 3-channel image) of patients with hepatocellular carcinoma to generate follow-up enhanced MR images. This enables tumour image progression prediction after radiation treatment, whereby the CycleGAN outperformed a vanilla GAN baseline.
The deep convolutional (DC) (Radford et al., 2015)-AlexNet (AL) (Krizhevsky et al., 2012) GAN (DC-AL GAN) proposed by Li et al. (2020a) is trained on longitudinal diffusion tensor imaging (DTI) data of pseudoprogression (PsP) and true tumour progression (TTP) in glioblastoma multiforme (GBM) patients. Both of these progression types can occur after standard treatment 30 and are often difficult to differentiate due to similarities in shape and intensity. In DC-AL GAN, representations are extracted from various layers of its AlexNet discriminator, which is trained to discriminate between real and generated DTI images. These representations are then used to train a support vector machine (SVM) classifier to distinguish between PsP and TTP samples, achieving promising performance.
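The representation-reuse idea behind DC-AL GAN can be sketched in a few lines. In this illustrative toy version (not the authors' implementation), the "discriminator layer" is a fixed random projection with ReLU, and the SVM is replaced by a logistic-regression stand-in; all data and dimensions are synthetic assumptions.

```python
import numpy as np

# Toy sketch: features from an intermediate "discriminator" layer are reused
# to train a separate PsP-vs-TTP classifier (an SVM in Li et al. (2020a);
# a linear logistic-regression stand-in here).
rng = np.random.default_rng(0)

def discriminator_features(images, W):
    """Stand-in for an intermediate AlexNet-discriminator layer:
    a fixed projection followed by a ReLU."""
    return np.maximum(images @ W, 0.0)

# Toy "DTI" data: two progression types (PsP=0, TTP=1) as flattened vectors.
n, d, k = 200, 64, 16
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)
X[y == 1] += 0.8                      # separable shift for the toy example

W = rng.normal(size=(d, k)) / np.sqrt(d)
F = discriminator_features(X, W)      # features from the "discriminator"

# Linear classifier on the features, trained by gradient descent.
w, b = np.zeros(k), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    g = p - y
    w -= 0.1 * F.T @ g / n
    b -= 0.1 * g.mean()

acc = ((F @ w + b > 0) == (y == 1)).mean()
print(f"toy PsP-vs-TTP accuracy: {acc:.2f}")
```

The point of the sketch is the pipeline shape: adversarial training produces the feature extractor for free, and only a small downstream classifier is fitted on the labelled progression data.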
We recommend further studies to extend these first adversarial learning disease progression modelling approaches. One potential research direction is GANs that simulate environment- and tumour-dependent progression patterns based on conditional input data, such as the tumour's gene expression data or the time progressed between the original image and the generated progression image (e.g., time passed between image acquisitions or since treatment exposure). To this end, unexpected changes of a tumour may be uncovered between time points, as well as deviations from a tumour's biopsy-proven genotypic growth expectations. 31
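The time-conditioning idea above can be illustrated with a minimal sketch. Everything here is hypothetical: a toy linear "generator" receives noise plus a normalised time interval as conditioning input, standing in for a trained conditional progression GAN.

```python
import numpy as np

# Minimal sketch (all names hypothetical) of the proposed conditioning idea:
# a generator receives, besides noise, the time elapsed between the source
# image and the progression image to be synthesised.
rng = np.random.default_rng(1)

def generator(z, delta_t, W):
    """Toy conditional generator: concatenate noise and the (normalised)
    time interval, then apply a linear layer with tanh."""
    cond = np.concatenate([z, [delta_t / 365.0]])   # days -> years
    return np.tanh(W @ cond)

latent_dim, img_dim = 8, 32
W = rng.normal(size=(img_dim, latent_dim + 1)) * 0.5
z = rng.normal(size=latent_dim)

early = generator(z, delta_t=30, W=W)    # 1 month after baseline scan
late = generator(z, delta_t=720, W=W)    # 2 years after baseline scan

print(early.shape, float(np.abs(late - early).mean()))
```

Holding the noise fixed and varying only the time interval yields a family of progression hypotheses for the same tumour, which is the behaviour a real conditional progression model would be trained to produce.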

Trustworthiness of medical image synthesis studies
Section 4 presented an extensive analysis of the challenges, existing publications, and state-of-the-art data synthesis and adversarial network methods in cancer imaging. While the methodologies, experiments, and results of these studies were elaborated, their validity and trustworthiness were not specifically addressed. Validity and trustworthiness vary between studies and depend on the breadth and depth of the methodological evaluation and the analysis of potential limitations. In the absence of a rigorous evaluation indicating otherwise, the methodology and experimental results of a study cannot readily be assumed to be transferable across domains, settings, tasks, datasets, and modalities. Hence, while a study may report promising results for a particular task and seemingly solve the task's underlying (cancer imaging) challenge, modest changes in the dataset, evaluation method, or evaluation metrics can lead to different results and conclusions. This points to the need for a principled assessment of the trustworthiness and validity of studies in the cancer and medical imaging domains, in particular those contributing and evaluating synthetic data and data generation methodology.
Some frameworks have proposed guidelines and best practices for the development of trustworthy artificial intelligence solutions in medical imaging (e.g., Hasani et al., 2022). However, to the best of our knowledge, no framework has been proposed for the trustworthiness assessment of studies focused on medical image synthesis solutions. Building upon the FUTURE-AI consensus guidelines and the lessons learned from the extensive analysis of the 164 publications presented in Section 4, we propose the Synthesis Study Trustworthiness Test (SynTRUST) as a principled framework to evaluate medical image synthesis studies.

Proposing the SynTRUST framework
The Synthesis Study Trustworthiness Test (SynTRUST) framework consists of a principled set of measures to assess the trustworthiness and validity of studies proposing generative models, synthetic data, or adversarial training methods in medical and cancer imaging. It is based on five core principles, namely, (i) Thoroughness, (ii) Reproducibility, (iii) Usefulness, (iv) Scalability, and (v) Tenability, acceptability, and reliability of the properties of the model and respective synthetic data.
The methodology applied to derive the SynTRUST framework is composed of several consecutive steps, outlined as follows.
1. Observation of experimental evaluation methods in the surveyed cancer imaging papers.
2. Questioning to which extent an observed study concludes with a generally-applicable, scientifically-sound finding.
3. Definition of causes as to why the results of the study are limited in general applicability and trustworthiness.
4. Suggestion of additional validation methods that can increase the study's general applicability.
5. Grouping and formalisation of suggestions into 26 concrete validation measures.
6. Definition of an overarching principle for each group of measures, resulting in the 5 core principles: Thoroughness, Reproducibility, Usefulness, Scalability, and Tenability.
7. Refinement of the measures to complement and extend expert consensus on best practices for the application of artificial intelligence in medical imaging.
8. Importance rating of each measure from 1 to 3 based on its estimated impact on trustworthiness. A rating of 1 indicates essential measures with the highest importance, a rating of 2 characterises desirable measures, and a rating of 3 denotes measures that are recommended additions to a study.
The resulting SynTRUST framework is presented in Table 7, which contains the title, definition, importance rating, and reference ID for each of the 26 measures, grouped by the 5 SynTRUST principles.
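A SynTRUST-style assessment can be encoded as simple structured data. The sketch below is our own illustration: the measure IDs follow those used in this survey, but the subset shown, the example assessment, and the aggregate summary are hypothetical (the framework itself assigns ratings per measure in Table 7 and does not define a single score).

```python
# Illustrative encoding of a subset of SynTRUST-style measures.
measures = {
    # id: (principle, importance: 1=essential, 2=desirable, 3=recommended)
    "Th1": ("Thoroughness", 1), "Th3": ("Thoroughness", 2),
    "R1": ("Reproducibility", 1), "R2": ("Reproducibility", 2),
    "U1": ("Usefulness", 1), "S1": ("Scalability", 1),
    "Te1": ("Tenability", 1), "Te2": ("Tenability", 2),
}

# Hypothetical assessment of one study: the set of measures it fulfils.
fulfilled = {"Th1", "R1", "U1", "S1", "Te1", "Th3"}

def summarise(measures, fulfilled, importance):
    """Count fulfilled vs. total measures at a given importance level."""
    ids = [m for m, (_, imp) in measures.items() if imp == importance]
    return sum(m in fulfilled for m in ids), len(ids)

print("essential:", summarise(measures, fulfilled, 1))  # (5, 5)
print("desirable:", summarise(measures, fulfilled, 2))  # (1, 3)
```

Keeping assessments machine-readable in this way would make meta-analyses such as the one in Table 9 directly reproducible.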

SynTRUST study curation
Towards the objective of evaluating the trustworthiness of cancer imaging solutions, we demonstrate in the following how the SynTRUST framework can be used to analyse medical imaging publications. This not only shows the practicability of the SynTRUST framework, but also estimates the trustworthiness of current results in the field. The latter allows us to corroborate concrete quality-controlled conclusions about the progress and state-of-the-art of adversarial networks in cancer imaging.

Table 8: Selection of studies that employ data synthesis and adversarial network methodology, curated based on their promising potential towards solving the cancer imaging challenges surveyed in Sections 4.1-4.5. Each of the studies represents one concrete proposed solution to one of the challenges.

Table 9: Results of the in-depth analysis of all essential and desirable measures of the SynTRUST framework for studies proposing adversarial network methodology. The analysed studies are selected in Table 8 and represent solutions to key cancer imaging challenges. The SynTRUST measures are referenced by ID from Table 7.

In our analysis, we first sample the present-day challenges in cancer imaging that were surveyed in Section 4 and summarised in Fig. 1. Next, we carefully select representative adversarial network publications to represent a particular challenge and its solution. This selection is based on the criteria that the publication (a) proposes a particularly promising solution to its respective challenge, (b) contributes a methodology that is generally applicable across domains, and (c) reports promising results. Most of the sampled publications further (d) have shown more impact and were referenced in other relevant studies. The selected studies are displayed in Table 8 together with their representative solution and associated cancer imaging challenge.

SynTRUST study assessment
Next, we analyse each of the selected publications independently based on the SynTRUST framework. We choose to base our analysis on the most important measures of the SynTRUST framework that, as shown in Table 7, have received either a rating of 1 as essential or a rating of 2 as desirable. For the sake of conciseness, we leave the analysis of less critical measures rated as 3 (recommended) to further studies. The results of our analysis of each of the selected publications are summarised in Table 9.
Essential SynTRUST measures. We observe that the analysed studies overall show strong trustworthiness and validity considering the essential measures: 11 out of 16 studies fulfil all of the essential criteria, while the remaining 5 studies fulfil all but one essential measure. For 3 of these 5 studies, the only essential measure that is not fulfilled is Th1 (minimum test set size). For studies that pioneer methodologies for promising new clinical applications, such as generative tumour progression modelling (Elazab et al., 2020), it is particularly challenging to find datasets suitable for the clinical task at hand. Even though the number of test images exceeds the defined minimum of 100 in these studies, the number of different patients (cases) is lower than 30, which was defined as the indicative minimum of cases to allow for conclusions about the larger patient population. 32 All 16 studies report their design decisions in detail (R1), train and test on clinical data representing the real world (S1), and test the conditions of their adversarial network (Te1). Also, 15 out of 16 studies report multiple standardised performance metrics to evaluate the adversarial network (Th2) and demonstrate their method's usefulness on a clinically relevant downstream task (U1). In sum, the results for the essential measures demonstrate that the reported performance and progress of the analysed studies are considerably reliable and trustworthy.
Desirable SynTRUST measures. While the 6 essential basic trustworthiness requirements are mostly fulfilled, the results for the 8 desirable measures are more varied. This indicates that the studies have a generally high level of trustworthiness, but a lower one for the more specific and nuanced aspects of their reported results and validations. For instance, while 15 out of 16 studies included a comparison with a suitable baseline (Th4), multiple studies did not accomplish a positive evaluation of Th3 (8), Th5 (9), R2 (7), R3 (11), U2 (11), S2 (7), and Te2 (15).
• Regarding Th3, studies often defined a static train and test set without running experiments multiple times. For example, multiple different random-seed network weight initialisations or k-fold cross-validation are options to corroborate results by demonstrating stable performance, with mean and standard deviation reported across runs/folds.
• Regarding Th5, the train-test split in general ensured no data leakage between training and testing sets, e.g., with images from the same patient not appearing in both sets. However, the benchmark test sets were often not defined systematically to ensure validating the methods on a varied distribution of, e.g., cases, patients, pathologies, and acquisition parameters.
• For R2, we observe that the studies' datasets are often not publicly available, which limits the reproducibility of the results. Often, this is due to the collection and usage of private patient data from hospitals. Further limiting factors are the high effort of repeating the study on public datasets and the specificity of the clinical task, rendering its evaluation non-viable on the available public datasets.
• Analysing R3 shows that the software implementing the studies' methods and experiments is often not shared publicly in code repositories, which reduces reproducibility and impedes rerunning experiments with exactly the same code base used in the respective study.
• Regarding U2, the correlation between (a) the downstream tasks and (b) either the synthetic data quality (e.g., in the case of generative models) or the adversarial loss (e.g., in the case of adversarial training) is often not analysed. Such an analysis informs on the usefulness of the quality of the respective model and on its contribution to the results on the clinical task.
• As to S2, the method is often validated on both (a) a single dataset and (b) a single modality, while a desirable evaluation would use multiple datasets and modalities, ideally further demonstrating the method's transferability across organs, clinical domains, and acquisition protocols.
• For Te2, we note the general absence of an analysis of the bias that is transferred from the training dataset into the models. For instance, a model trained on a homogeneous patient population sample, e.g., in terms of gender, sex, ethnicity, or geography, is likely biased towards this subset of the overall population, which can result in unequal treatment of patients from other subsets. Model biases can be detected by reviewing (a) the dataset statistics, (b) the model performance shifts on carefully selected patient subsets, and (c) the exclusion and inclusion criteria applied in the data acquisition and curation processes. This enables reporting and potentially mitigating otherwise unknown model biases, which increases the knowledge of and reliability in a model's properties.
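The Th3 recommendation (repeated runs with reported mean and standard deviation) can be made concrete with a minimal k-fold sketch. The "model", data, and fold scheme below are toy assumptions purely to show the reporting pattern.

```python
import numpy as np

# Sketch of the Th3 recommendation: repeat the evaluation across k folds
# (or random seeds) and report mean +/- standard deviation rather than a
# single static train/test run.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

def accuracy_on_fold(train_idx, test_idx):
    # Trivial threshold "model" fit on the training fold: predict the sign
    # of the feature most correlated with the label.
    corr = [abs(np.corrcoef(X[train_idx, j], y[train_idx])[0, 1])
            for j in range(X.shape[1])]
    j = int(np.argmax(corr))
    preds = (X[test_idx, j] > 0).astype(int)
    return (preds == y[test_idx]).mean()

k = 5
idx = rng.permutation(len(X))
folds = np.array_split(idx, k)
scores = [accuracy_on_fold(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
          for i in range(k)]
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

For patient imaging data, the folds would additionally need to be split at patient level (Th5) so that images from one patient never appear in both the training and test portions of a fold.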
In concluding our meta-analysis, we highlight the high general level of trustworthiness of the selected adversarial network publications based on our assessment of the essential SynTRUST measures. This demonstrates technical maturity of adversarial training and image synthesis methods in cancer imaging. As described in the Sections 4.1-4.5, many approaches towards solving the challenges in cancer imaging are not yet fully explored. Nonetheless, the solutions that have been pioneered and validated are shown to be relatively trustworthy and solid.
However, our meta-analysis also revealed that specific desirable trustworthiness criteria that go beyond basic essential validation are often not fulfilled, even by the most promising and in-depth studies in the field. For instance, a wider practice of data and code sharing is desirable. Closing this gap will not only increase reproducibility, but also accelerate the adoption of existing methods and further innovation. Apart from that, the validation of biases and fairness criteria in datasets and models is largely overlooked despite its importance for ensuring a model's acceptability and trust in the clinical setting.
We motivate further studies to address and build upon the gaps our analysis has revealed regarding the trustworthiness of existing cancer imaging studies. In this regard, we highlight the SynTRUST framework not only as a means for study evaluation, but also as a guideline for the design of future image synthesis studies.
6. Discussion and future perspectives

6.1. Adversarial methods in cancer imaging over the years

As presented in Fig. 15(c), we have included 164 of the surveyed GAN-based data synthesis and adversarial training publications in the timeframe from 2017 until March 7th 2021. We observe that the number of these cancer imaging GAN publications increased from 10 in 2017 to 63 in 2020, with a surprising slight drop between 2018 and 2019 (41 to 38). The final number of respective publications for 2021 is still pending. The trend towards publications that propose GANs and adversarial training to solve cancer imaging challenges demonstrates the considerable research attention that the adversarial learning scheme has been receiving in this field. Following our literature review in Section 4, the need for further research on adversarial networks does not yet seem to be met. We were able to highlight various lines of research for GANs and adversarial training in oncology, radiology, and pathology that have received limited research attention or constitute untapped research potential. These potentials indicate a continuation of the trend towards more data synthesis and adversarial training applications and a standardised integration of GAN-generated synthetic data into medical image analysis pipelines and software solutions.

Modality biases
In regard to imaging modalities, we analyse in Fig. 15(b) how much research attention each modality has received in terms of the number of corresponding publications. By far, MRI and CT are the most dominant modalities with 61 and 53 publications, respectively, followed by MMG (13), dermoscopy (12), and PET (6). The wide spread between MRI and CT and less investigated domains such as endoscopy (3), ultrasound (3), and digital tomosynthesis (0) is to be critically remarked. Due to variations in the imaging data between these modalities (e.g., spatial resolutions, pixel dimensions, domain shifts), it cannot readily be assumed that a GAN application with desirable results in one modality will produce equally desirable results in another. Due to that, and with awareness of the clinical importance of MRI and CT, we suggest a more balanced application of GANs and adversarial training across modalities, including experiments on rare modalities, to demonstrate the clinical versatility and applicability of GAN-based solutions. Alongside the open-access datasets described by Diaz et al. (2021), we highlight the following additional recent open datasets to facilitate experiments on some of the cancer imaging modalities that we found to be less explored:
• Breast tomosynthesis: BCS-DBT
• PET-CT: Lung-PET-CT-Dx
• Endoscopy: HyperKvasir (Borgli et al., 2020)
• Dermatology: HAM10000 (Tschandl et al., 2018)
• Cytology: CERVIX93 (Phoulady and Mouton, 2018)
• Thoracic X-ray: Node21 (Sogancioglu et al., 2021)

R. Osuala et al., Fig. 15 caption (fragment): … Tables 2-6 of the respective Sections 4.1-4.5. Note that (b) and (d) contain more publications in total than (a) and (c), which is caused by GAN publications that evaluate on (and are assigned to) more than one modality (b) and/or anatomy (d) due to multiple experiments or cross-domain translation. In (c), the count for 2021 is not final, as the GAN papers herein analysed have been published on or before 7th March 2021.

Anatomy biases
In comparison, the GAN-based solutions per anatomy are more evenly spread, but still show a clear trend towards brain, head, and neck (50), lung, chest, and thorax (33), and breast (24). We suspect these spreads are due to the availability of a few well-known, widely-used curated benchmark datasets (Menze et al., 2014; Armato III et al., 2011; Heath et al., 2001; Moreira et al., 2012), resulting in underexposure of organs and modalities with fewer publicly available data resources. Where possible, we recommend evaluating GAN-based data synthesis and adversarial training on a range of different tasks and organs. This can avoid iterating towards non-transferable solutions tuned for specific datasets with limited generalisation capabilities. Said generalisation capabilities are critical for beneficial usage in clinical environments, where dynamic data processing requirements and dataset shifts (e.g., multi-vendor, multi-scanner, multi-modal, multi-organ, multi-centre) commonly exist. While the challenges of Section 4.3 data annotation and segmentation (42) and Section 4.1 data scarcity and usability (38) have received much research attention, Sections 4.5 treatment and monitoring (18) and 4.2 data access and privacy (12) contain substantially fewer GAN-related publications. This spread can be anticipated considering that classification and segmentation are popular computer vision problems and common objectives in publicly available medical imaging benchmark datasets. Early detected cancerous cells have likely had less time to acquire malignant genetic mutations (Hanahan and Weinberg, 2000, 2011) than their later-detected counterparts, which, by then, might have acquired more treatment-resistant alterations and subclone cell populations. Hence, automated early detection, localisation, and diagnosis can provide high clinical impact via improved cancer treatment prospects, which likely influences the trend towards detection and segmentation-related GAN publications.

Well-validated adversarial network solutions
Our survey uncovers in Sections 4.1, 4.3, and 4.4 that a vast amount of cancer imaging literature exists around a few common adversarial network solutions.
The most common application of GANs is data augmentation, where synthetic data is added to the training dataset to improve downstream task performance. Such data augmentation can further be used to balance imbalanced datasets, which, for instance, often contain many more benign tumour images than malignant ones.
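In code, GAN-based class balancing reduces to topping up the minority class with generated samples. The sketch below is schematic: `generate_malignant` is a placeholder for sampling a trained GAN, and the feature vectors stand in for images.

```python
import numpy as np

# Sketch of GAN-based class balancing: synthetic minority-class samples
# are appended until the training set is balanced.
rng = np.random.default_rng(3)

benign = rng.normal(0.0, 1.0, size=(90, 16))     # majority class
malignant = rng.normal(1.0, 1.0, size=(10, 16))  # minority class

def generate_malignant(n):
    """Placeholder for sampling n images from a GAN trained on the
    malignant class; here a matching toy distribution."""
    return rng.normal(1.0, 1.0, size=(n, 16))

deficit = len(benign) - len(malignant)
malignant_balanced = np.vstack([malignant, generate_malignant(deficit)])

print(len(benign), len(malignant_balanced))  # 90 90
```

In practice the quality of the balanced set hinges on the GAN having learned the minority-class distribution faithfully, which connects back to the evaluation issues discussed in Section 5.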
A further well-explored application of GANs is domain adaptation via adversarial training, where a domain-adversarial loss is backpropagated into a downstream task model. Domain mapping is a related application, where images are translated from one domain to another. In general, GANs learn to translate between one source and one target domain. However, promising work has extended this technique to cross-modal synthesis between multiple domains (Yurt et al., 2019;Zhou et al., 2020), which remains an area with much clinically-relevant research potential. Similarly, GANs for super resolution and data curation including artifact removal and image denoising achieve desirable performance and real-world applicability.
Image-to-image translating GANs can remove or hallucinate features such as tumours (Cohen et al., 2018b,a) into generated images. While this can be a major concern for clinical adoption, it also opens an avenue for future research into automated detection and assessment of removed or hallucinated features and sheds light on the need for additional metrics for GAN condition-adherence and synthetic data evaluation.
Furthermore, we observe that the discriminator and its associated adversarial loss can be flexibly used to classify any type of model output without necessarily following the purpose of data generation. For example, a discriminator can predict whether a segmentation mask is real or created by a segmentation model, which enables the model to learn to output more globally coherent segmentation masks.
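The combined objective behind such adversarial segmentation training can be written down in a few lines. This is a toy sketch under our own assumptions: the "discriminator" is a fixed linear scorer, and `lam` weights the adversarial term against the per-pixel loss.

```python
import numpy as np

# Toy sketch of an adversarial segmentation loss: a discriminator D scores
# whether a mask looks real; the segmentation model adds -log D(pred_mask)
# to its per-pixel loss, pushing predictions towards globally coherent masks.
rng = np.random.default_rng(4)
w_d = rng.normal(size=64)               # fixed "discriminator" weights

def d_prob_real(mask):
    """Probability that the discriminator deems the mask real."""
    return 1.0 / (1.0 + np.exp(-w_d @ mask.ravel()))

def seg_loss(pred, target, lam=0.1):
    pixel = np.mean((pred - target) ** 2)            # per-pixel term
    adv = -np.log(d_prob_real(pred) + 1e-8)          # adversarial term
    return pixel + lam * adv

target = (rng.random((8, 8)) > 0.5).astype(float)
pred = np.clip(target + rng.normal(0, 0.1, size=(8, 8)), 0, 1)
print(f"combined loss: {seg_loss(pred, target):.3f}")
```

In a real setup the discriminator would be trained jointly against the segmentation model, so that the adversarial term penalises mask statistics (shape, connectivity) that per-pixel losses cannot capture.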

New solutions for unexploited areas
Patient privacy. We promote future work on the less researched open challenges in Section 4.2, where we describe the promising research potential of adversarial networks in patient data privacy and security. We note that secure patient data is required for legal and ethical patient data sharing and usage, which, in turn, is required for the successful training of state-of-the-art downstream task models. For instance, sharing GANs instead of private patient data can reduce data sharing constraints while maintaining data utility (Szafranowska et al., 2022). Furthermore, GANs can be trained both in a federated learning setup and in a differential-privacy setup. Both of these techniques can be combined to further reduce privacy risks, such as the risk of generating synthetic imaging data attributable to a specific patient. Further unexploited research potential lies in adversarial identity obfuscation, both on the image level and on the latent feature representation level. In particular, devising privacy preservation testing methods to evaluate the success of adversarial identity obfuscation and related methods is a needed and not fully addressed research problem in cancer imaging and AI in healthcare at large.
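The federated setup mentioned above can be reduced to its core aggregation step. The sketch below follows the standard FedAvg weighting scheme with toy weight matrices; hospital names and dataset sizes are invented for illustration.

```python
import numpy as np

# Minimal federated-averaging sketch: each site trains a local copy of the
# generator on private data, and only the weights are shared and averaged,
# so raw patient images never leave the hospital.
rng = np.random.default_rng(5)

# Hypothetical generator weights from three hospitals after a local round.
local_weights = [rng.normal(size=(4, 4)) for _ in range(3)]
n_samples = np.array([120, 300, 80])    # images held by each site

coeffs = n_samples / n_samples.sum()    # FedAvg: weight by dataset size
global_weights = sum(c * w for c, w in zip(coeffs, local_weights))

print(global_weights.shape)  # (4, 4)
```

Differential privacy would be layered on top of this loop, e.g., by clipping and noising the local updates before aggregation, at the cost of some synthesis quality.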
Patient security. With the projected increase in clinical AI applications, adversarial learning based cybersecurity methodology becomes increasingly important to protect patients against the vulnerabilities inherent in clinically deployed deep learning solutions. Attacks can alter diagnostic markers in cancer imaging data, which can potentially result in diagnostic errors with dangerous consequences for the targeted patients. For instance, defences against adversarial examples (e.g., Samangouei et al., 2018) or the detection of imaging data that has been tampered with (Mirsky et al., 2019) are areas where solutions based on adversarial methods will increasingly gain practical importance.
Model debiasing. The versatile ability of adversarial training to curate a model's latent space is likely to continue to increase in popularity due to the need to remove certain features from clinical AI models. For example, it is desirable to minimise a model's learned biases to increase the fairness of clinical models across patient populations. Such bias removal has been shown to be achievable via adversarial loss backpropagation (Zhang et al., 2018a; Li et al., 2021a). As Elazar and Goldberg (2018) point out, some residual biases may remain in a model's latent space even after converged adversarial bias removal training. Therefore, research potential lies in automated test and evaluation methodology to assess the quantity of residual bias remaining in an adversarial network after debiasing, particularly when applied to data unseen during training.
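The combined objective behind adversarial debiasing, in the spirit of Zhang et al. (2018a), can be sketched without a full training loop. All tensors here are random stand-ins; the key line is the gradient-reversal objective, where the encoder maximises the adversary's loss.

```python
import numpy as np

# Conceptual sketch of adversarial debiasing: an adversary tries to predict
# a protected attribute from the model's representation; reversing its
# gradient removes that information from the representation.
rng = np.random.default_rng(6)

h = rng.normal(size=(32, 8))            # latent representations
y = rng.integers(0, 2, size=32)         # task labels
a = rng.integers(0, 2, size=32)         # protected attribute

def bce(logits, labels):
    """Binary cross-entropy on raw logits."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(labels * np.log(p + 1e-8)
                    + (1 - labels) * np.log(1 - p + 1e-8))

w_task = rng.normal(size=8)             # task head (stand-in)
w_adv = rng.normal(size=8)              # adversary head (stand-in)

task_loss = bce(h @ w_task, y)
adv_loss = bce(h @ w_adv, a)
lam = 1.0
# The encoder minimises the task loss while *maximising* the adversary's
# loss (gradient reversal): total = task_loss - lam * adv_loss.
total = task_loss - lam * adv_loss
print(f"task={task_loss:.3f} adv={adv_loss:.3f} total={total:.3f}")
```

The residual-bias concern raised by Elazar and Goldberg (2018) means that `adv_loss` reaching chance level during training does not guarantee the attribute is unrecoverable by a stronger post-hoc probe, which motivates the automated residual-bias tests suggested above.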

Generative model evaluation.
A key aspect this survey observes is the absence of interpretable, standardised, and exact evaluation methodology for synthetic data and generative models in the medical and cancer imaging domains. This is particularly noticeable for models that have neither a narrow downstream task performance objective usable as a surrogate evaluation metric nor a reconstruction objective that informs the evaluation technique. Generative models that generate a synthetic image with a clear reference value (i.e., a real image) can be evaluated based on the difference between the reference and the generated sample, e.g., via perceptual and reconstruction losses and metrics such as SSIM, PSNR, and MSE, as discussed in Section 4.1. In the absence of such reference images, the remaining methods at hand are image inspection techniques and real-versus-synthetic distribution comparisons, the latter including the Fréchet Inception Distance (FID) score (Heusel et al., 2017). The popularity of the FID metric for fidelity and diversity evaluation of synthetic data has largely translated from computer vision into medical imaging. The applicability of FID in the medical domain is nonetheless questionable, as it internally relies on an Inception classifier pretrained on the ImageNet dataset, which consists of 3-channel natural images as opposed to, for instance, grayscale images from radiological domains. This demonstrates a clear need for research on further evaluation methodologies for synthetic medical images. FID extensions that pretrain the internal classifier on medical imaging datasets are potential directions, but are limited by the acquisition techniques, scope, modalities, and, importantly, the size of these medical imaging datasets. Recent promising work proposed the automated generation of segmentation masks from GANs based on latent space exploration (Melas-Kyriazi et al., 2021).
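For reference, FID compares Gaussians fitted to real and synthetic feature sets: FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2(S_r S_g)^{1/2}). The sketch below makes one simplifying assumption, diagonal covariances, so that the matrix square root reduces to an element-wise square root; real implementations use full covariances of Inception features, which is exactly where the ImageNet-pretraining caveat enters.

```python
import numpy as np

# Toy FID between two Gaussians fitted to (stand-in) feature sets,
# restricted to diagonal covariances for a dependency-free sketch.
rng = np.random.default_rng(7)

real = rng.normal(0.0, 1.0, size=(1000, 4))   # "Inception" features, real
fake = rng.normal(0.5, 1.2, size=(1000, 4))   # features of synthetic images

def fid_diagonal(x, y):
    mu_x, mu_y = x.mean(0), y.mean(0)
    var_x, var_y = x.var(0), y.var(0)
    # Diagonal case: Tr(Sx + Sy - 2 sqrt(Sx Sy)) = sum over dimensions.
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.sum(var_x + var_y - 2.0 * np.sqrt(var_x * var_y)))

print(f"toy FID: {fid_diagonal(real, fake):.3f}")
print(f"self FID: {fid_diagonal(real, real):.3f}")  # ~0 by construction
```

A medical-domain FID variant would swap the feature extractor, not this distance computation, which is why the choice and size of the pretraining dataset dominates the metric's validity.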
Such latent space inspection approaches can offer further potential for generative model evaluation, e.g., by helping to measure the number and difference between modes or by providing quality and diversity estimates of the segmentation masks (or other extractable pieces of information) that the model produces.
Patient treatment. Sections 4.3 and 4.4 have shown that adversarial models for cancer detection, classification, and localisation are, at least for particular organs and modalities, well-explored research areas. These applications are mostly relevant in diagnostic activities, which comprise only one part of the clinical workflow. We encourage more research on GAN-based solutions in less explored subsequent clinical workflow steps such as oncological treatment planning and disease monitoring, as elaborated in Section 4.5. For example, adversarial learning offers research potential in tumour profiling and intra- and inter-tumour heterogeneity assessment via anomaly detection within the latent space of adversarial models (Schlegl et al., 2019; Quiros et al., 2019). The high intra- and inter-tumour heterogeneity increases the difficulty of assessing and selecting targeted treatment options. Research potential exists in precisely encoding a tumour based on imaging and/or non-imaging patient and tumour data in an adversarial model's multi-dimensional latent space. For example, this can unlock vector search applications to find similarly encoded tumours in databases to inform therapy selection, success probabilities, and progression patterns. Tumour progression modelling on the image level based on generative models such as GANs remains largely unexplored. Even though not strictly necessary (Xia et al., 2021), longitudinal and time-series cancer imaging datasets will likely trigger increased exploration of this research area once such data becomes available. For instance, given a tumour image at timepoint t1, a GAN can learn to simulate the tumour image at timepoint t2. To this end, the generation of image-level counterfactuals (Pawlowski et al., 2020) represents a clinically impactful solution for probing interventions.
For instance, GANs can generate a tumour at t2 given the tumour image at t1 alongside multiple input conditions such as tumour growth rate, tumour type, and applied treatments.
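The tumour vector-search idea mentioned above amounts to nearest-neighbour retrieval in a model's latent space. In this sketch the latent codes are random stand-ins (a real system would obtain them from a trained encoder), and cosine similarity is one of several reasonable choices of distance.

```python
import numpy as np

# Sketch of latent "tumour search": encode tumours into a generative
# model's latent space and retrieve the most similar past cases.
rng = np.random.default_rng(8)
database = rng.normal(size=(500, 64))          # latent codes of past cases
database /= np.linalg.norm(database, axis=1, keepdims=True)

query = database[42] + rng.normal(0, 0.05, size=64)   # a new, similar tumour
query /= np.linalg.norm(query)

sims = database @ query                        # cosine similarities
top3 = np.argsort(sims)[::-1][:3]
print("most similar cases:", top3)
```

Retrieved cases could then surface the therapies, outcomes, and progression patterns of comparable historical tumours, which is the clinical use case motivating this direction.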

Towards state-of-the-art GAN innovations in cancer imaging
In recent years, multiple novel adversarial networks have been introduced in the field of computer vision. A lesson learned from our survey is that many of these techniques are yet to be applied thoroughly to cancer imaging. These innovations open avenues in cancer imaging that extend upon the currently used methods shown in Fig. 4, for instance, enabling improved high-resolution image generation and input-conditioned image synthesis.
Overcoming dataset and computation limitations. For instance, the recent VQGAN (Esser et al., 2021) combines the efficiency of convolutional networks with the expressiveness of transformers, which model the composition of a reusable codebook of context-rich visual parts. This approach is particularly relevant to medical and cancer imaging, as it allows high-resolution image synthesis despite limited computing resources. Apart from containing high-resolution images, cancer imaging datasets are often limited in the number of images, which may not suffice to train a GAN. In these cases, training may fail to converge, with synthetic image quality remaining low and not improving any further, or suffer from mode collapse, where the generator learns a particular mode that fools the discriminator instead of generating a high diversity of samples; diagnosing such failures is complicated by the adversarial loss often being non-interpretable, as it does not correspond to synthetic image fidelity, diversity, or condition adherence. These issues are not only a function of the GAN architecture and loss function, but also of the size of the training dataset. FastGAN and SinGAN (Shaham et al., 2019) show great promise for overcoming this data scarcity problem in cancer imaging. FastGAN uses self-supervised training of the discriminator as an encoder for regularisation and generates high-resolution images despite limited computing resources and dataset size. SinGAN (Shaham et al., 2019) generates multiple synthetic images based on only a single training image. This has wide applicability and can substantially increase the usefulness of even very small cancer imaging datasets via SinGAN-based data augmentation. A first successful application of SinGAN and FastGAN to cancer imaging for polyp segmentation by Thambawita et al. (2022) shows the potential of using these models to generate not only a synthetic image, but also a corresponding segmentation mask by outputting an additional channel. This type of methodology enables training data generation for tumour detection, localisation, and segmentation models without the need to condition the GAN on input segmentation masks.
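The extra-channel trick described above is architecturally trivial, which is part of its appeal. The sketch below is a toy, untrained stand-in: the generator's output tensor simply carries one more channel, which is binarised and read as the mask of the synthetic image.

```python
import numpy as np

# Sketch of joint image+mask synthesis in the spirit of Thambawita et al.
# (2022): the generator outputs an extra channel interpreted as the
# segmentation mask, so image-mask training pairs come "for free".
rng = np.random.default_rng(9)

def generator(z, W):
    out = np.tanh(W @ z).reshape(2, 16, 16)    # channel 0: image, 1: mask
    image = out[0]
    mask = (out[1] > 0).astype(np.uint8)       # binarise the mask channel
    return image, mask

W = rng.normal(size=(2 * 16 * 16, 32)) * 0.1   # untrained toy weights
image, mask = generator(rng.normal(size=32), W)
print(image.shape, mask.shape, set(np.unique(mask)) <= {0, 1})
```

Because image and mask are produced by the same forward pass, they stay spatially consistent by construction, which conditioning-based pipelines have to enforce explicitly.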
Best practice combining GAN frameworks. As a vast amount of novel additions to the GANs framework has been suggested, some work (Brock et al., 2018) has focused on collecting the best working practices and combining them into novel architectures, which are promising and not yet widely applied to challenges in cancer imaging. For example, BigGAN (Brock et al., 2018) (a) scales model parameters by increasing the size of the feature maps, (b) applies large batch sizes, (c) uses self-attention based on SAGAN , (d) provides information about the class via class-conditional batch normalisation, and (e) uses hinge-loss. BigGAN and extensions thereof (e.g., Zhang et al. (2019b), Casanova et al. (2021) and Schonfeld et al. (2020)) achieve state-of-the-art performance on class-conditional image generation.
Extending on PGGAN (Karras et al., 2017) as shown in Fig. 4(m), another such example is StyleGAN (Karras et al., 2019) and its variants (Karras et al., 2020, 2021; Sauer et al., 2022), which accomplish state-of-the-art performance in conditional and unconditional computer vision image generation benchmarks. Yielding strong results, the StyleGAN family has introduced multiple architectural innovations, such as a fully connected mapping network that generates a style vector, adaptive instance normalisation, and, instead of sampling from a single noise vector, noise inputs injected into intermediate activation maps. These innovations can inform cancer image generation models and improve their latent space exploration capabilities, e.g., allowing comparison of different tumour types and manifestations.
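Adaptive instance normalisation (AdaIN), one of the innovations listed above, normalises each feature-map channel and then re-scales and shifts it with style-dependent parameters. A minimal NumPy sketch (illustrative shapes and parameter values, not taken from the StyleGAN papers):

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-5):
    """Adaptive instance normalisation: normalise each channel of a
    feature map to zero mean / unit variance, then re-scale and shift
    it with per-channel style parameters.

    x: feature map of shape (C, H, W)
    style_scale, style_bias: style parameters of shape (C,)
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalised = (x - mean) / (std + eps)
    return style_scale[:, None, None] * normalised + style_bias[:, None, None]

rng = np.random.default_rng(1)
features = rng.normal(size=(3, 8, 8))
out = adain(features, style_scale=np.array([2.0, 1.0, 0.5]),
            style_bias=np.array([0.0, 1.0, -1.0]))
```

After AdaIN, each channel's statistics match the style parameters, which is how the style vector steers the generated image's appearance.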
Image-to-image translation. Image-to-image translation problems in cancer imaging are commonly approached using pix2pix (paired) and CycleGAN (Zhu et al., 2017) (unpaired). Nonetheless, more recent models such as OASIS (Sushko et al., 2020), ResViT (Dalmaz et al., 2021), and StarGAN v2 (Choi et al., 2020) have been proposed, which are not only applicable to cancer imagery, but have also shown superior performance on computer vision benchmarks. ResViT (Dalmaz et al., 2021), for instance, diverges from common CNN architectures with inductive biases by using a vision transformer architecture (Dosovitskiy et al., 2020) alongside an adversarial loss (Goodfellow et al., 2014) and the common L1 losses between source and target and between source and reconstructed source images. StarGAN v2 (Choi et al., 2020) employs, besides the adversarial and cycle-consistency losses, a style reconstruction loss and a style diversification loss, while OASIS (Sushko et al., 2020) shows that a perceptual loss is not necessary given an adversarial loss and a segmentation-based discriminator.
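The two recurring losses in this paragraph, the paired L1 loss and the unpaired cycle-consistency loss, can be sketched as follows. This is an illustrative NumPy version with toy "generators"; the function names are our own, not from the cited papers:

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute error, as used for paired source-target supervision."""
    return np.mean(np.abs(a - b))

def cycle_consistency_loss(source, g_ab, g_ba):
    """Unpaired translation constraint: translating source -> target -> source
    with generators g_ab and g_ba should reconstruct the source image."""
    reconstructed = g_ba(g_ab(source))
    return l1_loss(source, reconstructed)

# Toy generators: a perfectly invertible pair gives zero cycle loss.
g_ab = lambda x: x + 1.0   # hypothetical "translation" to the target domain
g_ba = lambda x: x - 1.0   # and its inverse back to the source domain
source = np.ones((2, 2))
assert cycle_consistency_loss(source, g_ab, g_ba) == 0.0
```

In practice, both losses are combined with an adversarial loss and weighted against each other.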

GAN alternatives and complementary methods
Diffusion models. In image inpainting (Saharia et al., 2021a) and super-resolution (Saharia et al., 2021b), the recently proposed and increasingly popular diffusion models (Sohl-Dickstein et al., 2015; Song and Ermon, 2019; Ho et al., 2020) have been shown to achieve state-of-the-art or competitive performance on computer vision benchmarks and, thus, are an alternative to GANs. Diffusion models iteratively add noise to an image in a Markov chain of diffusion steps. Reversing this process, a noise vector z is gradually denoised and transformed into an image. While achieving promising generative modelling capabilities, diffusion models still take longer to sample from than GANs due to the multiple denoising steps, and further work is needed to explore the interpretability of their latent representations (Dhariwal and Nichol, 2021). A promising line of research suggests combining GANs with diffusion models to increase the stability and data efficiency of GAN training (Wang et al., 2022).
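The forward (noising) half of this Markov chain admits a well-known closed form: x_t can be sampled directly from x_0 as sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, where alpha_bar_t is the cumulative product of (1 - beta_t) over the noise schedule (Ho et al., 2020). A minimal NumPy sketch with an assumed linear schedule:

```python
import numpy as np

def forward_diffusion(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise,
    where alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # commonly used linear schedule
x0 = rng.normal(size=(8, 8))
x_late = forward_diffusion(x0, t=999, betas=betas, rng=rng)
```

At the final step alpha_bar is close to zero, so x_late is almost pure noise; the learned reverse model then denoises step by step, which is why sampling is slower than a single GAN generator pass.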
Variational autoencoders. GANs are commonly considered to achieve higher-quality outputs than variational autoencoders (VAEs) (Kingma and Welling, 2013), at the cost of a training process more prone to requiring manual intervention and tuning. A promising line of research improves upon vanilla VAEs by exploring combinations of GANs and VAEs (Larsen et al., 2016; Makhzani et al., 2015). Extending on VAEs, Van Den Oord et al. (2017) proposed the Vector Quantised Variational AutoEncoder (VQ-VAE), which learns discrete instead of continuous latent representations to avoid the issue of 'posterior collapse' that is common in VAEs. VQ-VAE has been shown to be an effective method for diverse, high-quality synthetic image generation (Razavi et al., 2019). A promising extension combines VQ-VAE with transformers (Vaswani et al., 2017) for unsupervised anomaly detection and segmentation and demonstrates its potential for tumour segmentation in brain MRI (Pinaya et al., 2021).
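The discrete bottleneck at the heart of VQ-VAE is a nearest-neighbour lookup into a learned codebook. A minimal NumPy sketch of that quantisation step (toy codebook and latents; the real model additionally uses commitment losses and a straight-through gradient estimator, omitted here):

```python
import numpy as np

def vector_quantise(z, codebook):
    """Map each continuous latent vector in z to its nearest codebook
    entry (Euclidean distance), as in VQ-VAE's discrete bottleneck.

    z: latents of shape (N, D); codebook: shape (K, D).
    Returns the quantised latents and the chosen code indices.
    """
    # Pairwise squared distances between latents and codebook entries.
    distances = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = distances.argmin(axis=1)
    return codebook[indices], indices

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2]])
quantised, indices = vector_quantise(z, codebook)
```

Because every latent is snapped to one of K codes, a downstream autoregressive model (or transformer) can model the discrete index sequence, which is what the anomaly detection extension by Pinaya et al. (2021) builds upon.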
Normalizing flows. The recently proposed Normalizing Flows (Rezende and Mohamed, 2015; Dinh et al., 2014, 2016) are an alternative deep generative model gaining increasing popularity for synthetic data generation tasks. As opposed to GANs and VAEs, which are implicit density models, Normalizing Flows explicitly learn the probability density function p(x) and are trained via maximum likelihood estimation. Knowing p(x), unobserved but realistic new data points can be sampled with exact likelihood estimates. Normalizing Flows have been shown to be combinable with GANs and the adversarial loss function, e.g., by being the building block of the generator network (Grover et al., 2018), and for image-to-image translation (Grover et al., 2020). To date, Normalizing Flows have seen less adoption in medical and cancer imaging than GANs, but promising initial applications exist. For example, Normalizing Flows have been proposed for uncertainty estimation of lung lesion segmentation (Selvan et al., 2020), counterfactual inference on brain MRI (Pawlowski et al., 2020), and low-dose CT image reconstruction (Denker et al., 2020).
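The exact likelihood property comes from the change-of-variables formula: for an invertible map with a tractable Jacobian, log p(x) equals the base-density log-probability of the inverse-mapped point minus the log-determinant of the transformation. A minimal NumPy sketch with a single element-wise affine flow and a standard-normal base density (an illustrative toy flow, not one of the cited architectures):

```python
import numpy as np

def affine_flow_log_likelihood(x, scale, shift):
    """Exact log-likelihood of x under a single affine flow
    z = (x - shift) / scale with a standard-normal base density:
    log p(x) = sum_d [ log N(z_d; 0, 1) - log|scale_d| ]."""
    z = (x - shift) / scale
    base_log_prob = -0.5 * (z ** 2 + np.log(2.0 * np.pi))
    log_det = -np.log(np.abs(scale))
    return np.sum(base_log_prob + log_det)

# With scale=1, shift=0 the flow is the identity and we recover
# the standard-normal log-density itself.
x = np.array([0.0])
ll = affine_flow_log_likelihood(x, scale=np.array([1.0]), shift=np.array([0.0]))
```

Real flows stack many such invertible layers (e.g., coupling layers) so that the composition stays invertible with a cheap log-determinant.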
Unsupervised domain adaptation. In unsupervised domain adaptation, self-training approaches have been described as an alternative to domain adversarial losses. For example, state-of-the-art methods like HRDA (Hoyer et al., 2022b) and DaFormer (Hoyer et al., 2022a) show the effectiveness of self-training in domain-adaptive semantic segmentation. DaFormer uses a transformer encoder (Vaswani et al., 2017; Dosovitskiy et al., 2020) and transfers knowledge from the source to the target domain via a teacher network that generates pseudo-labels for the data from the target domain. A promising avenue of research combines self-training approaches and adversarial losses (Li et al., 2019c; Kim and Byun, 2020; Wang et al., 2020a).
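The pseudo-labelling core of such self-training can be sketched in a few lines. This is a generic, illustrative version (confidence-thresholded argmax labels from teacher softmax outputs), not the exact scheme of any cited method:

```python
import numpy as np

def pseudo_labels(teacher_probs, threshold=0.9):
    """Self-training step: keep only target-domain predictions whose
    maximum class probability exceeds a confidence threshold, and use
    the argmax class as a pseudo-label for the student network.

    teacher_probs: softmax outputs of shape (N, num_classes).
    Returns (labels, keep_mask)."""
    confidence = teacher_probs.max(axis=1)
    labels = teacher_probs.argmax(axis=1)
    keep = confidence >= threshold
    return labels, keep

probs = np.array([[0.95, 0.05],   # confident -> kept as pseudo-label
                  [0.60, 0.40]])  # uncertain -> discarded
labels, keep = pseudo_labels(probs)
```

The student is then trained on source labels plus the retained target pseudo-labels, with the teacher typically updated as a slow-moving average of the student.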
Self-supervised learning. Given successes in learning useful representations from unlabelled data, self-supervised learning (SSL) approaches, such as BYOL (Grill et al., 2020), have become a common technique in the toolkit of deep learning researchers. Particularly when working with datasets limited in size or annotations, additional GAN-generated data can improve the learning of representations upon which a downstream task model produces its predictions. SSL can provide an alternative, often computationally less expensive, means towards representation learning, given a training task with an objective function where labels and inputs are extracted from an unlabelled dataset. A popular and powerful SSL method is contrastive learning, where a model's latent space is learned by minimising the distance between similar samples and maximising the distance between dissimilar ones. Effective model pretraining methods such as SimCLR rely on such contrastive loss functions, which, e.g., maximise agreement between differently augmented views of the same image. Multiple recent studies propose combining GANs with self-supervised (Patel et al., 2021) and contrastive learning, reporting promising results with improved performance and sample diversity, as well as reduced discriminator overfitting (Jeong and Shin, 2021; Kang and Park, 2020; Liu et al., 2021). In cancer imaging, for instance, this combination has been applied to address the problem of mode collapse while retaining phenotypic tumour features for the task of colour normalisation in histopathology images (Ke et al., 2021).
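A contrastive loss of the kind used for such pretraining can be sketched for a single anchor embedding. The following illustrative NumPy version computes an InfoNCE-style loss over cosine similarities (simplified to one positive pair; real implementations operate on full batches with symmetrised terms):

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Simplified contrastive (InfoNCE) loss for a single anchor:
    maximise cosine similarity with the positive view and minimise it
    with the negatives, as in SimCLR-style pretraining."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = sims / temperature
    # Cross-entropy with the positive view at index 0.
    return -logits[0] + np.log(np.sum(np.exp(logits)))

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])       # augmented view of the same image
negatives = [np.array([-1.0, 0.0])]   # a view of a different image
loss_good = info_nce(anchor, positive, negatives)
loss_bad = info_nce(anchor, negatives[0], [positive])
assert loss_good < loss_bad
```

Swapping which view is treated as the positive shows the intended behaviour: aligned views yield a much lower loss than mismatched ones.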

Conclusion
In closing, we emphasise the versatility and the resulting modality-independent wide applicability of the adversarial learning scheme of GANs. In this survey, we strive to consider and communicate this versatility by describing the wide variety of problems in the cancer imaging domain that can be approached with adversarial networks. For example, we highlight GAN and adversarial training solutions that range from unsupervised domain adaptation, to patient privacy-preserving distributed data synthesis, to adversarial segmentation mask discrimination, to multi-modal radiation dose estimation, amongst others.
Before reviewing and describing GAN and adversarial training solutions, we surveyed the literature to understand the current challenges in the field of cancer imaging with a focus on radiology, but without excluding non-radiology modalities common to cancer imaging. After screening and analysing the cancer imaging challenges, we grouped them into the challenge categories Data Scarcity and Usability, Data Access and Privacy, Data Annotation and Segmentation, Detection and Diagnosis, and Treatment and Monitoring. After categorisation, we surveyed the literature for adversarial networks applied to the field of cancer imaging and found 164 relevant publications, each of which we assigned to its respective cancer imaging challenge category. Finally, we provide a comprehensive analysis of each challenge and its assigned GAN-related publications to determine to what extent it has been and can be solved using GANs and adversarial training. We further establish the SynTRUST framework for assessing the trustworthiness of medical image synthesis studies. Based on SynTRUST, we analyse 16 carefully selected cancer imaging challenge solutions. Notwithstanding the overall high level of rigour and validity of these studies, we are able to recommend a set of unaddressed trustworthiness improvements in order to guide future studies. To this end, we also highlight research potential for challenges where we were able to propose data synthesis or adversarial training solutions that have not yet been fully explored by the literature.
With our work, we strive to uncover and motivate promising lines of research in data synthesis and adversarial networks that we envision to ultimately benefit the field of cancer imaging in clinical practice.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
No data was used for the research described in the article.