The optimisation of deep neural networks for segmenting multiple knee joint tissues from MRIs



Introduction
Osteoarthritis (OA) is a degenerative disease involving the entire synovial joint (Goldring et al., 2017;Hunter and Eckstein, 2009;Martel-Pelletier et al., 2016). Important risk factors for the development of OA include age, muscle weakness, abnormal joint loading due to joint malalignment or overloading (obesity, high impact sport), and injury to the menisci and ligaments (Ismail and Vincent, 2017;Lohmander et al., 2007;Martel-Pelletier et al., 2016). Distinctive hallmarks of OA include the progressive destruction of articular cartilage structure and alterations in the surrounding joint tissues, including bone, meniscus, ligament and peri-articular muscle. Magnetic resonance imaging (MRI) is a commonly used tool to evaluate clinical abnormalities of the knee (Blumenkrantz and Majumdar, 2016). Morphological changes due to OA are well demonstrated with MRI (Benhamou et al., 2001;Hunter et al., 2015;MacKay et al., 2018;Neogi et al., 2013;Wise et al., 2018). Tissue specific masks of the knee joint can be useful for the analysis of OA, especially as automated tools continue to be developed and validated (Bindernagel et al., 2011;Deniz et al., 2018;Lee et al., 2014;Liu et al., 2017;Ng et al., 2006;Patel and Singh, 2018;Seim et al., 2010;Shan et al., 2014;Shrivastava et al., 2014;Swanson et al., 2010;Xia et al., 2013;Zhou et al., 2016).
For both clinical and research usage, a significant amount of time is spent manually segmenting images to designate tissue-specific regional masks, also known as regions-of-interest (ROIs). Image masking remains a very significant challenge within medical imaging due to heterogeneity in organ appearance and disease progression and presentation. The segmentation of neighbouring soft tissues such as the cruciate ligaments, cartilages and muscles in the knee joint, which have similar image intensities (and therefore poor contrast resolution), is an especially demanding task. ROIs can be generated through manual or semi-manual delineation by a trained reader, or they may be generated automatically using signal thresholding (Swanson et al., 2010), shape-based (Bindernagel et al., 2011; Seim et al., 2010), atlas-based (Lee et al., 2014; Shan et al., 2014) or region-based (Ng et al., 2006; Patel and Singh, 2018; Shrivastava et al., 2014) approaches, as well as with machine learning approaches (Deniz et al., 2018; Liu et al., 2017; Xia et al., 2013; Zhou et al., 2016). Machine learning methods include unsupervised learning, such as k-means clustering, which segments based on spatial clusters of similar signal intensities in an image (Ng et al., 2006; Patel and Singh, 2018; Shrivastava et al., 2014), or supervised learning, in which the algorithm is trained on image masks obtained from any previous masking technique (Deniz et al., 2018; Liu et al., 2017; Xia et al., 2013; Zhou et al., 2016). The number of high-quality label maps for supervised learning is typically very small, and the performance of a machine learning network trained on little data is limited by the lack of heterogeneity of the images presented during training. Transfer learning may be used to mitigate this by pretraining a network on a large dataset for a different but related task, followed by network refinement on the small dataset (Shie et al., 2015).
Convolutional neural networks (CNNs), in particular U-Nets (Ronneberger et al., 2015), have demonstrated their capability to automate the segmentation of musculoskeletal MRIs (Liu et al., 2017; Norman et al., 2018). Nevertheless, a drawback of CNNs is that they usually use pixel-wise measures such as the absolute (L1) or squared (L2) error loss, which can be non-optimal for image data and, in the case of L2, result in blurry boundaries (Pathak et al., 2016). In contrast, generative adversarial networks (GANs) (Goodfellow et al., 2014) learn a similarity measure (a feature-wise metric) that adapts to the training task by implementing two competing, or adversarial, neural networks. During adversarial training, one network focusses on image discrimination and guides a second network, which focusses on image generation, to create "real" images whose data distribution is indistinguishable from the training data distribution. The generator and discriminator are trained simultaneously and competitively in a mini-max game, and convergence is achieved when the Nash equilibrium is reached, i.e. neither network can improve through further training while the other remains unchanged.
Conditional GANs (cGANs) modify the GAN approach to learn image-to-image mappings (Goodfellow et al., 2014; Isola et al., 2017). In comparison to traditional GANs, which learn a mapping from random noise to a generated output, cGANs learn a mapping from an observed variable, for example an image, to an output, such as a label map (Goodfellow et al., 2014; Isola et al., 2017). cGANs have been used to produce image labels for neurological (Rezaei et al., 2017), cardiac, abdominal (Huo et al., 2018), respiratory and musculoskeletal imaging (Gaj et al., 2019). Liu (2019) used unpaired image-to-image translation with a method called cycle-consistent generative adversarial network (CycleGAN) to perform semantic image segmentation of femorotibial cartilage and bone of the knee joint on unlabelled MRI datasets. The "pix2pix" framework is one cGAN approach that has demonstrated segmentation capability (Isola et al., 2017). Semantic segmentation with cGANs, particularly those combining U-Net generators and Markov random field discriminators (patch-based discriminators), is relatively unexplored. The method has previously been applied to semantic segmentation of the brain (Rezaei et al., 2017). In Gaj et al. (2019), a cGAN was used for semantic segmentation of knee cartilage and meniscus, but with an image-wise discriminator rather than a patch-wise discriminator.
The aim of this study was to implement and evaluate a cGAN for automated semantic segmentation of multiple joint tissues from MR images: the femoral, tibial and patellar bones and cartilage surfaces; the cruciate ligaments; and two selected muscles, the medial vastus and gastrocnemius. Our essential contributions are summarised as follows:
1. Implementation of a cGAN based on the "pix2pix" framework introduced by Isola et al. (2017), using a U-Net generator and a patch-based discriminator, for automatic segmentation of multiple knee joint tissues. As far as we know, cGANs have not previously been used for semantic segmentation of the patellar bone and cruciate ligaments, nor of muscles of the knee joint.
2. Evaluation of the segmentation performance of the cGAN with different objective functions, by combining the cGAN loss with different pixel-wise error losses and modifying the weighting hyperparameter between the cGAN loss and the pixel-wise error loss.
3. Assessment of the effect of the generator depth and the discriminator receptive field size on the performance of the cGAN for multi-tissue segmentation.
4. Quantitative comparison of the cGAN approach with the well-known U-Net approach.
5. Exploration of the use of transfer learning for improved segmentation performance of both cGAN and U-Net.

Image datasets
Three image datasets were used for network training and testing: the publicly available SKI10 and OAI ZIB datasets, consisting of 100 and 507 labelled knee MRIs, respectively, and a locally acquired dataset of ten segmented knee MRIs (Advanced MRI of Osteoarthritis (AMROA) study).

SKI10
The "Segmentation of Knee Images 2010" (SKI10) dataset (Heimann et al., 2010) consists of approximately 90 % 1.5 T and 10 % 3.0 T sagittal MR images acquired on systems from multiple vendors (GE, Siemens, Philips, Toshiba, and Hitachi). The sequences varied and included both gradient echo and spoiled gradient echo sequences, commonly with fat suppression. The images were segmented on a slice-by-slice basis by experts from Biomet, Inc., initially through intensity thresholds and thereafter with manual editing. One hundred 3D image datasets of the SKI10 challenge were provided with semi-manual masks of femoral and tibial cartilage and bone. In our study, 70 datasets were used for network training and 30 for network testing.

OAI ZIB
The OAI ZIB dataset (Ambellan et al., 2019) comprises segmentations of femoral and tibial cartilage and bone for 507 MR imaging volumes from the publicly available Osteoarthritis Initiative dataset (The Osteoarthritis Initiative, 2020). The MR images were acquired on Siemens 3 T Trio systems using a 3D double echo steady state (DESS) sequence with water excitation. Outlines of femoral and tibial bone and cartilage were generated using a statistical shape model (Seim et al., 2010) with manual adjustments performed by experts at Zuse Institute Berlin. The OAI ZIB data cover all degrees of OA (KL 0-4), with more cases having severe OA (KL ≥ 3) (Ambellan et al., 2019). As with the SKI10 dataset, we split the dataset into 70 % (355) for network training and 30 % (152) for testing.

AMROA
The locally acquired participant cohort consisted of ten subjects: five healthy volunteers and five patients with mild-to-moderate OA. The patients fulfilled at least one subset of the American College of Rheumatology criteria for OA and were recruited between April 2017 and April 2018 (Table 1). The healthy volunteers were approximately matched to the OA patients for age, sex, and body mass. Network training was performed on data from four subjects with OA and four healthy subjects. Two individuals (one with OA and one healthy) were used as a held-out set for test measurements. The number of test individuals was chosen such that roughly 80 % of the data could be used for training. Ethical approval was obtained from the National Research Ethics Service, and all subjects provided written informed consent before participation.
Semi-manual segmented masks (Fig. 2A) of the patella, tibia, and femur bones, as well as of their respective surrounding patellar, tibial and femoral cartilages (Fig. 2B), were created from the 3D-FS SPGR images by a musculoskeletal radiologist with 8 years' experience, using the Stradwin software v5.4a (University of Cambridge Department of Engineering, Cambridge, UK, now freely available as 'StradView' at http://mi.eng.cam.ac.uk/Main/StradView/) (MacKay et al., 2020). Additionally, masks of the vastus medialis and medial head of gastrocnemius muscles were created. This semi-manual segmentation pipeline consists of sparse manual contour generation (every 2nd-5th sagittal image / 2-5 mm) followed by automatic surface triangulation using the regularised marching tetrahedra method. Volume-preserving surface smoothing allows creation of an accurate segmentation from relatively sparse manual contours (Treece et al., 1999). Manual segmentations of the anterior cruciate ligament (ACL) and posterior cruciate ligament (PCL) were created on the 3D-FS SPGR images using ITK-SNAP (Yushkevich et al., 2006) by a radiologist with 3 years' experience.

Training data and masking
Each of the major structures was given a separate image value, i.e., colour, in the segmentation mask, such that the network determined the unique weights to generate a similar regional colour value from an MR image. On an 8-bit (0-255) colour scale, the three bones were stored in the blue colour channel, where the femur colour code was 50, the tibia 100, and the patella 150. The cartilages were stored in the green colour channel, where the femoral cartilage colour code was 50, the tibial 100 and the patellar 150. Additionally, for the AMROA dataset, the muscles were stored in the red colour channel, with the medial vastus muscle colour code set to 100 and the medial gastrocnemius muscle colour code set to 200. The ACL mask was stored in the blue colour channel and the PCL in the green colour channel, with both colour codes set to 200.
The MRIs and image masks were converted from the DICOM and NIfTI formats (Larobina and Murino, 2014), respectively, to a common image format (Portable Network Graphics, PNG) before training. Noise-only images were not used for training or testing, as training a network to fit against zero-valued masks results in a poor constraint. After network training, a tissue-/region-specific Boolean mask was created on the predicted test images by removing prediction values outside of ±20 colour-scale units of the tissue-specific value. 3D mask predictions were obtained by iterating over the 2D segmented slices.
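As an illustration, the colour-coded masking and the ±20-unit Boolean extraction described above can be sketched in NumPy (the tissue names and the helper function are ours, not from the paper's code):

```python
import numpy as np

# Colour codes from the text: bones and ACL in the blue channel,
# cartilages and PCL in green, muscles (AMROA only) in red.
TISSUE_CODES = {
    "femur_bone": ("blue", 50), "tibia_bone": ("blue", 100),
    "patella_bone": ("blue", 150), "acl": ("blue", 200),
    "femoral_cartilage": ("green", 50), "tibial_cartilage": ("green", 100),
    "patellar_cartilage": ("green", 150), "pcl": ("green", 200),
    "vastus_medialis": ("red", 100), "gastrocnemius_medial": ("red", 200),
}
CHANNEL = {"red": 0, "green": 1, "blue": 2}

def tissue_mask(pred_rgb, tissue, tol=20):
    """Boolean mask: keep pixels within +/- tol colour units of the tissue code."""
    channel, code = TISSUE_CODES[tissue]
    values = pred_rgb[..., CHANNEL[channel]].astype(int)
    return np.abs(values - code) <= tol

# Toy 2x2 RGB "prediction": one pixel near the femur code, one exact tibia code.
pred = np.zeros((2, 2, 3), dtype=np.uint8)
pred[0, 0, 2] = 55   # within 20 units of 50 -> femur bone
pred[0, 1, 2] = 100  # exact tibia code
mask = tissue_mask(pred, "femur_bone")
```

Iterating such 2D masks over all slices of a volume then yields the 3D prediction, as described above.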

Network specifications
This work uses the "pix2pix" framework of a conditional GAN (cGAN) described by Isola et al. (2017). The cGAN consists of two deep neural networks, a generator (G) and a discriminator (D). For our task, G learns to translate sagittal MR images of the knee joint (source images x) to semantic segmentation maps (G(x)), while D aims to differentiate between the real segmentation map (y) and the synthetically generated one. The structure of a cGAN is illustrated in Fig. 1.
Fig. 1. Conditional GAN structure. The generator is a U-Net that progressively down-samples / encodes and then up-samples / decodes an input by a series of convolutional layers, with additional skip-connections between each major layer. The generated, 'fake' segmentation image is then fed together with the ground truth segmentation image into a discriminator network (PatchGAN (Isola et al., 2017)) that gives its prediction of whether the generated image is a 'real' representation of the ground truth image, or not. A detailed description of the network architecture can be found in the Appendix.
The loss function for this cGAN is

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x}[log(1 − D(x, G(x)))].

The loss function describes how G is minimised against a maximised D. Since both optimisation processes are dependent on each other, convergence is achieved by reaching a saddle point (simultaneously a minimum / maximum for both networks' costs) rather than a minimum. The loss also incorporates an L1 distance to reduce image blurring and ensure that the images generated by G(x) are not significantly different from the target images y (Isola et al., 2017; Regmi and Borji, 2018). This L1 loss is given by

L_L1(G) = E_{x,y}[‖y − G(x)‖_1].

The overall objective of the cGAN is to find the optimal solution to

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G),

with λ being a hyper-parameter used for balancing the two losses (Regmi and Borji, 2018). The cGAN used in this work utilises the U-Net encoder-decoder architecture for the generator, which is frequently used for image segmentation problems (Ronneberger et al., 2015). The generator was trained to generate images that are indistinguishable from a target image (i.e., the segmented map). Spatial consistency of the data is not guaranteed with a U-Net segmented map, which can cause inaccurate boundaries (Ronneberger et al., 2015). However, the adversarial loss from the discriminator regularises the output and thereby improves accuracy with respect to higher-order shapes.
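To make the combined objective concrete, here is a minimal NumPy sketch of the generator's training signal: an adversarial term (written in the common non-saturating form, −log D(x, G(x))) plus the weighted L1 term. The function names and toy values are illustrative, not the paper's implementation:

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy for discriminator probabilities p in (0, 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def generator_objective(d_fake, y, g_x, lam=100.0):
    """Adversarial term (G tries to make D output 1 on fakes) + lambda * L1 term."""
    adv = bce(d_fake, 1.0)                  # -log D(x, G(x))
    l1 = float(np.mean(np.abs(y - g_x)))    # L1 distance to the target map
    return adv + lam * l1

# Toy example: D is 80 % fooled; generated map differs from target by 0.1 on average.
d_fake = np.array([0.8])
y = np.full((4, 4), 0.5)
g_x = np.full((4, 4), 0.4)
loss = generator_objective(d_fake, y, g_x, lam=100.0)
```

With λ = 100, the pixel-wise term dominates the adversarial term here, which mirrors the weighting behaviour discussed later in the paper.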
We modified the U-Net generator from the "pix2pix" network by enlarging the input layer to allow training on 512 × 512 resolution images. For this, an additional Convolution-BatchNorm-LeakyReLU layer was inserted in the encoding part and a Convolution-BatchNorm-ReLU layer in the decoding part of the network.
The discriminator is a patch-based fully convolutional neural network, PatchGAN (Li and Wand, 2016; Long et al., 2018), which models the image as a Markov random field. It performs a convolutional patch-wise (N × N) classification, with all the outputs in the patch averaged and taken as the output of D. D is therefore less dependent on distant pixels/voxels beyond a "patch diameter" and acts as a form of neighbouring texture loss. The PatchGAN can be applied to arbitrarily large images because the patch size is fixed.
To analyse the cGAN's performance we compared it to the performance of a U-Net, which is widely used for image segmentation tasks. We used the cGAN generator network as the U-Net to maintain an effective comparison.
The networks were implemented using PyTorch (Torch v1.0.1) and all training was performed on an Nvidia P6000 GPU card (3840 CUDA cores, 24 GB GDDR5X). The training phase of optimisation was performed as described for the "pix2pix" network, alternating stochastic gradient descent steps on the generator with stochastic gradient ascent steps on the discriminator, which is trained to distinguish real pairs D(x, y) from generated pairs D(x, G(x)). The Adam solver was used with a learning rate of 0.0002 and momentum parameters β1 = 0.5, β2 = 0.999. We introduced random noise (jitter) during training by resizing the input images to 542 × 542 using bi-cubic interpolation followed by random cropping back to 512 × 512.
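The jitter augmentation can be sketched as follows. Nearest-neighbour resizing stands in for the bi-cubic interpolation used in the paper, and the function name and seed are ours:

```python
import numpy as np

def jitter(img, enlarged=542, out=512, rng=None):
    """Resize to enlarged x enlarged (nearest-neighbour here for simplicity;
    the paper used bi-cubic), then randomly crop back to out x out."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = img.shape[:2]
    rows = np.arange(enlarged) * h // enlarged   # nearest source row per target row
    cols = np.arange(enlarged) * w // enlarged
    big = img[rows][:, cols]
    top = rng.integers(0, enlarged - out + 1)
    left = rng.integers(0, enlarged - out + 1)
    return big[top:top + out, left:left + out]

augmented = jitter(np.zeros((512, 512), dtype=np.uint8), rng=np.random.default_rng(0))
```

The random offset of up to 30 pixels in each direction gives the network slightly shifted, rescaled views of the same slice on every epoch.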
A detailed description of the network architectures can be found in the Appendix.

Segmentation evaluation metrics
The Sørensen-Dice Similarity Coefficient (DSC) (Dice, 1945; Sørensen, 1948) was used to evaluate the overlap between the generated segmentation and the manual segmentation. The DSC ranges between 0 and 1, with 0 representing no overlap and 1 complete overlap between the two sets. DSC is defined as twice the size of the intersection divided by the sum of the sizes of the two sample sets, given as

DSC(X, Y) = 2 |X ∩ Y| / (|X| + |Y|)

for Boolean metrics. For the experiments involving the SKI10 and OAI ZIB datasets, the volumetric overlap error (VOE) and the boundary distance-based metric average surface distance (ASD) were determined to assess segmentation accuracy and allow an appropriate comparison with previous studies using these datasets. The VOE can be calculated as

VOE(X, Y) = 1 − |X ∩ Y| / |X ∪ Y|,

with small values for VOE expressing greater accuracy. The ASD is expressed in mm and is defined as

ASD(X, Y) = (Σ_{x∈X} D_Y(x) + Σ_{y∈Y} D_X(y)) / (|X| + |Y|),

where D_X(y) = min_{x∈X} ‖y − x‖ is the distance of a voxel y to a surface X and ‖•‖ denotes the Euclidean norm.
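The overlap metrics above translate directly into code for Boolean masks (a NumPy sketch; ASD is omitted since it additionally requires surface extraction):

```python
import numpy as np

def dsc(x, y):
    """Sørensen-Dice: 2 |X ∩ Y| / (|X| + |Y|) for Boolean masks."""
    inter = np.logical_and(x, y).sum()
    return 2.0 * inter / (x.sum() + y.sum())

def voe(x, y):
    """Volumetric overlap error: 1 - |X ∩ Y| / |X ∪ Y| (often reported in %)."""
    inter = np.logical_and(x, y).sum()
    union = np.logical_or(x, y).sum()
    return 1.0 - inter / union

# Toy masks: 3 voxels in x, 3 in y, 2 shared -> DSC = 2/3, VOE = 0.5.
x = np.array([1, 1, 1, 0], dtype=bool)
y = np.array([1, 1, 0, 1], dtype=bool)
```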

Evaluation of network characteristics
This section aims at evaluating and adjusting specific network characteristics towards improving overall network performance, for both cGAN and U-Net. All networks in this section were trained for 100 epochs and all cGANs with a 70 × 70 PatchGAN discriminator unless otherwise stated.

Evaluation of network objective function
We evaluated the cGAN's performance with different objective functions by combining the cGAN loss with different pixel-wise error losses. In this work the cGAN is tasked to output a segmentation map of multiple tissues having different features and locations in the input MR image. We assessed the shortcomings and strengths of including the L_L1, L_L2 and Smooth L1 (L_SmL1) (Girshick, 2015) loss functions in the cGAN objective. The L_L2 loss and L_SmL1 loss are given by

L_L2(G) = E_{x,y}[‖y − G(x)‖_2^2]

and

L_SmL1(G) = E_{x,y}[smL1(y − G(x))], with smL1(z) = 0.5 z² if |z| < 1 and |z| − 0.5 otherwise, applied element-wise.

Furthermore, the weighting hyperparameter λ between the cGAN loss and the pixel-wise error loss was changed to vary the balance between the two task losses. λ = 0.01, 1, 100 and 10,000 were investigated. Network training with the cGAN loss alone (λ = 0) was additionally performed and evaluated.
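For reference, the three pixel-wise losses can be sketched element-wise in NumPy (illustrative helper names; the networks themselves used the PyTorch equivalents):

```python
import numpy as np

def l1_loss(y, g):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - g)))

def l2_loss(y, g):
    """Mean squared error."""
    return float(np.mean((y - g) ** 2))

def smooth_l1_loss(y, g):
    """Element-wise: 0.5 z^2 if |z| < 1 else |z| - 0.5, averaged (Girshick, 2015)."""
    z = np.abs(y - g)
    per_elem = np.where(z < 1.0, 0.5 * z ** 2, z - 0.5)
    return float(np.mean(per_elem))

# One small and one large residual: smooth L1 is quadratic for the first,
# linear for the second, so large outliers are penalised less than under L2.
y = np.array([0.0, 0.0])
g = np.array([0.5, 2.0])
```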
We also trained the U-Net with the same three pixel-wise error losses (L_L1, L_L2 and L_SmL1) as the cGAN to maintain an effective comparison.

Evaluation of altering the loss objective during training
After obtaining initial results, we observed that the cGAN was unable to segment muscle tissues, independent of the objective function used for training. Therefore, we decided to explore the effect of varying the loss objective during training. For this, we trained a cGAN with the L_cGAN + λL_L2 loss and a U-Net with the L_L2 loss for 50 epochs and then changed the loss functions for the ensuing 50 epochs to L_cGAN + λL_L1 and L_L1, respectively.
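The loss switch can be expressed as a simple epoch-based schedule (a sketch; the function and fraction names are ours):

```python
def loss_for_epoch(epoch, total_epochs=100, switch_frac=0.5):
    """First half of training uses the L2-based objective, second half L1-based."""
    return "L2" if epoch < int(total_epochs * switch_frac) else "L1"

# The 100-epoch schedule used in this experiment: 50 epochs L2, then 50 epochs L1.
schedule = [loss_for_epoch(e) for e in range(100)]
```

In a training loop, the returned label would select which loss function is applied to the generator (and U-Net) at that epoch.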

Evaluation of the generator depth
We analysed the effect of changing the depth of the generator network on the cGAN's and U-Net's quantitative performance. In addition to the generator down-sampling the input through nine convolutional layers, we tested generators consisting of seven and five convolutions during down-sampling. Furthermore, we assessed the quantitative performance of the generator network with different numbers of feature channels. We compared networks starting with different minimum numbers of feature channels (16, 32, 64 and 128), and thus ending at different maximum numbers of feature channels (128, 256, 512 and 1024). All cGANs were trained with the L_cGAN + λL_L1 loss with λ = 100 and all U-Nets with the L_L1 loss. Detailed descriptions of the generator network architectures can be found in the Appendix.
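Assuming the pix2pix convention that channel counts double at each down-sampling step and are capped at eight times the base width (an assumption on our part, but consistent with the reported minimum/maximum pairs 16→128, 32→256, 64→512 and 128→1024), the channel progression can be sketched as:

```python
def encoder_channels(base, n_down=9, cap_mult=8):
    """Feature channels per down-sampling step: double each step, capped at
    cap_mult * base (assumed; matches the reported min/max channel pairs)."""
    return [min(base * 2 ** i, base * cap_mult) for i in range(n_down)]

channels_64 = encoder_channels(64)  # [64, 128, 256, 512, 512, 512, 512, 512, 512]
```

Doubling the base width roughly quadruples the parameter count of each convolution, which is why the deeper/wider variants discussed below train noticeably more slowly.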

Evaluation of the PatchGAN receptive field size
We evaluated the effect of changing the PatchGAN receptive field size on the cGAN's qualitative (artefact emergence) and quantitative (segmentation accuracy) performance. In addition to the 70 × 70 PatchGAN, we tested a 1 × 1 (PixelGAN), a 34 × 34 and a 286 × 286 PatchGAN. All cGANs were trained with the L_cGAN + λL_L1 loss with λ = 100. Detailed descriptions of the discriminator network architectures can be found in the Appendix.
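The receptive field of a stacked-convolution discriminator follows from its kernel sizes and strides. The layer configurations below are assumptions in the pix2pix style (4 × 4 kernels, stride 2 then stride 1), chosen because they reproduce the 70 × 70 and 34 × 34 patch sizes named above:

```python
def receptive_field(layers):
    """Receptive field of stacked convolutions; layers = [(kernel, stride), ...]."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) * current jump
        jump *= s             # stride compounds the effective step between outputs
    return rf

rf_70 = receptive_field([(4, 2)] * 3 + [(4, 1)] * 2)  # assumed 70x70 PatchGAN config
rf_34 = receptive_field([(4, 2)] * 2 + [(4, 1)] * 2)  # assumed 34x34 PatchGAN config
rf_1 = receptive_field([(1, 1)] * 2)                  # 1x1 PixelGAN (1x1 kernels)
```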

Evaluation of transfer learning
Since the AMROA dataset comprises only a small number of subjects (N = 8) for training, we assessed the influence of transfer learning on network performance by initially training both a cGAN (L_cGAN + λL_L1) and a U-Net (L_L1) for 20 epochs on the larger SKI10 and OAI ZIB training datasets separately, followed by network fine-tuning for 80 epochs on the smaller AMROA training set. Additionally, a cGAN and a U-Net were trained for 20 epochs on the AMROA training dataset followed by network refinement training for 80 epochs on either the SKI10 or OAI ZIB training set, to analyse the potential segmentation improvement on SKI10 and OAI ZIB. Network performance evaluations were performed using the AMROA, SKI10 and OAI ZIB testing datasets. As determined in the previous sections, the cGAN trained with the L_cGAN + λL_L1 loss objective (λ = 100) and a 1 × 1 PixelGAN, as well as the U-Net trained with the L_L1 loss objective, achieved the highest segmentation accuracies for most knee joint tissues segmented in the AMROA dataset and were therefore used in this section.
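The pretraining/fine-tuning split can be sketched as a simple stage schedule (dataset labels and the helper are illustrative; the network weights would simply carry over from one stage to the next):

```python
def training_plan(stages):
    """Expand (dataset, epochs) stages into a per-epoch dataset list."""
    plan = []
    for dataset, epochs in stages:
        plan.extend([dataset] * epochs)
    return plan

# Pretrain 20 epochs on a large dataset, then fine-tune 80 epochs on AMROA,
# as described in the text; the reverse direction just swaps the stages.
plan = training_plan([("OAI_ZIB", 20), ("AMROA", 80)])
```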

Table 2
Results of the Network Objective Function: cGAN. The influence of mixing the cGAN objective with different pixel-wise error losses, and of varying their significance by changing the weighting hyperparameter λ, on the segmentation performance of the proposed cGAN was assessed. The highest DSCs achieved for each tissue are in bold. Training and testing were performed on the AMROA training and testing datasets, respectively. DSCs are presented as mean ± standard deviation. Abbreviations: F Bone – femoral bone, T Bone – tibial bone, P Bone – patellar bone, F Cartilage – femoral cartilage, T Cartilage – tibial cartilage, P Cartilage – patellar cartilage, VM Muscle – vastus medialis muscle, GM Muscle – medial head of gastrocnemius muscle, ACL – anterior cruciate ligament, PCL – posterior cruciate ligament, DSC – Sørensen-Dice similarity coefficient.

Network training and testing
Semi-manual segmentation of the AMROA images by the reader required ~30 min per subject-volume. After training, segmentation of a single slice was processed in ≈0.13 s. A detailed description of all cGAN and U-Net training durations for all datasets can be found in the Appendix. The highlights of the upcoming sections are:
3.2 The U-Net trained with the L_L1 loss objective outperformed the cGANs and the U-Nets trained with different loss objectives in the segmentation performance of most knee joint tissues.
3.3 Altering the network objective function midway through cGAN and U-Net training led to unanticipated but advantageous results. This variation resulted in improved segmentation performance for several tissues and in the cGAN's capability to segment muscle tissue, which previously had not been possible with non-altered objective function training.
3.4 The cGAN and U-Net trained with nine convolutions/transpose convolutions in the networks' encoding/decoding parts and a minimum feature channel number of 64 achieved the highest segmentation accuracies for most annotated knee joint tissues.
3.5 The greatest improvements in segmentation performance of the cGAN were achieved by reducing the receptive field size of the discriminator network. This resulted in segmentation accuracies equivalent to those of the U-Net.
3.6 Transfer learning not only increased the segmentation accuracy for some tissues of the fine-tuned dataset, but also increased the network's capacity to maintain segmentation capabilities for the pretrained dataset.
3.7 Overall, the cGAN trained with the L_cGAN + λL_L1 loss objective (λ = 100) and a 1 × 1 PixelGAN, as well as the U-Net trained with the L_L1 loss objective, achieved comparable and the highest segmentation accuracies for most knee joint tissues segmented.

Evaluation of network objective function
The quantitative results of assessing the impact of combining the cGAN objective with three different pixel error losses, with varying weightings λ, on the cGAN's segmentation performance are presented in Table 2, with the qualitative results depicted in Fig. 2B. The cGANs trained with larger values of λ (λ = 100 and 10,000) achieved the highest segmentation performance for all tissues, and the produced segmentation maps were less affected by artefacts compared to the cGANs trained with λ = 0.01 and 1. For instance, the images from the networks trained with L_cGAN + λL_L1 (λ = 0.01), L_cGAN + λL_L2 (λ = 1) and L_cGAN + λL_SmL1 (λ = 1) had artefacts where the networks seemed to detect bone or cartilage structures where there were none in the original MR input image. Increasing the weighting hyperparameter λ puts more emphasis on the pixel error losses, guiding the network to produce more accurate representations of the ground truth segmentation map, and reduces these artefacts. However, the influence of the GAN loss diminishes for very large values of λ, with the discriminator then having minimal effect on generator training.
The qualitative results of training a U-Net with different pixel error losses are presented in Fig. 2C, while the quantitative results are listed in Table 3. The U-Net trained with the L_L1 loss objective achieves the highest accuracy for all tissues compared to the L_L2 and L_SmL1 losses, except for the muscle tissues. Muscle tissues appeared on the majority of 2D MR knee images seen by the network during training; however, we only segmented two selective medial muscles in the AMROA dataset due to time constraints. It is interesting to note that although the U-Net trained with L_L1 was not able to capture the medial head of gastrocnemius and vastus medialis muscles, the cGAN trained with the L_cGAN + λL_L1 objective (λ = 10,000) was. The simple absolute difference (L_L1) was not capable of differentiating lateral muscle textures from medial ones. The U-Nets trained with the L_L2 and L_SmL1 losses were capable of segmenting the selective muscles with high accuracies, as they are penalised more, through the squaring term in their loss objectives, when the difference between ground truth and model prediction is large. Interestingly, although the patellar bone and cartilage only appear on very few slices in a 3D dataset, and the ACL and PCL on even fewer, the U-Net with L_L1 segmented these tissues better than the U-Nets with L_L2 and L_SmL1 (L_L2: DSC P Bone < 0.2 %, DSC P Cartilage < 5.3 %, DSC ACL < 15.2 %, DSC PCL < 21.3 %; L_SmL1: DSC P Bone < 0.4 %, DSC P Cartilage < 6.0 %, DSC ACL < 6.9 %, DSC PCL < 17.8 %). This could be explained by the cruciate ligament and patellar tissues either being present or not on a 2D training image, and by the network not being constrained to segment only medial tissues. Overall, the U-Net with L_L1 produced sharper boundaries, especially for the smaller ligament structures, compared to the segmentation maps produced by the U-Nets trained with L_L2 and L_SmL1, in which the boundaries are more diffuse.

Table 3
Training and testing were performed on the AMROA training and testing datasets, respectively. DSCs are presented as mean ± standard deviation.
Abbreviations: F Bone – femoral bone, T Bone – tibial bone, P Bone – patellar bone, F Cartilage – femoral cartilage, T Cartilage – tibial cartilage, P Cartilage – patellar cartilage, VM Muscle – vastus medialis muscle, GM Muscle – medial head of gastrocnemius muscle, ACL – anterior cruciate ligament, PCL – posterior cruciate ligament, DSC – Sørensen-Dice similarity coefficient.

Table 4
Results of additionally testing on noise-only images. The influence of including noise-only images in the testing set on the overall segmentation performance of a cGAN trained with the L_cGAN + λL_L1 (λ = 100) loss objective and a U-Net trained with the L_L1 objective. Training was performed on the AMROA training dataset without noise-only images.
We decided to assess the models' performance when including noise-only images in the testing dataset, as we excluded them during model training and this might limit the models' use in a clinical setting. This effect was only evaluated for the cGAN trained with the L_cGAN + λL_L1 (λ = 100) objective function and the U-Net trained with the L_L1 loss objective. The quantitative results are listed in Table 4, with qualitative results displayed in Fig. 3. Both networks showed comparable segmentation performances after testing with noise-only images, with percentage differences (%-Diff) of the DSC for all segmented tissues ≤ 2.3 %. Including noise-only images in the testing set had greater effects on the cGAN DSC of the medial vastus (VM) muscle (%-Diff = 1.5 %), the ACL (%-Diff = 1.6 %) and the PCL (%-Diff = 1.9 %), as well as on the U-Net DSC of the ACL (%-Diff = 2.3 %). These higher differences could be explained by the lower segmentation capability of the cGAN and U-Net models for these structures to begin with (cGAN: DSC VM muscle: 0.113 vs 0.098, DSC ACL: 0.577 vs 0.593; DSC PCL: 0.073 vs 0.092; U-Net: DSC ACL: 0.643 vs 0.620). Furthermore, the larger %-Diff in the DSC of the VM muscle is caused by the cGAN model irregularly segmenting VM muscle tissue on noise-only images (Fig. 3B). Table 5 compares the DSCs obtained from a cGAN and a U-Net in which the objective functions were changed midway through training to those of the cGANs and U-Nets trained with non-altered objective functions. Training a cGAN with a varied loss objective (L_cGAN + λL_L2 → L_cGAN + λL_L1) notably reduced its ability to segment the ACL, but considerably improved its segmentation performance on the medial vastus and gastrocnemius muscles, as well as the PCL, compared to the other cGANs (L_cGAN + λL_L1 and L_cGAN + λL_L2). The images in Fig. 4B show the improvements in muscle segmentation with the cGAN trained with a varied loss objective.
This was a surprising result, as neither the cGAN trained with L_cGAN + λL_L1 nor the one trained with L_cGAN + λL_L2 alone was able to segment muscle. Looking at the different training epochs of the cGAN trained with the varied loss, during L_cGAN + λL_L2 training no muscle tissue was being semantically segmented. However, after changing to L_cGAN + λL_L1, between training epochs 50 and 60 the network started segmenting muscle tissue (Fig. 5). After the initial 50 epochs of L_cGAN + λL_L2 training, the cGAN's weights must have been favourable for continuing training with L_cGAN + λL_L1 to additionally semantically segment muscle tissue.

Evaluation of altering loss objective during training
The U-Net trained with the altered objective function (L_L2 → L_L1) also showed notable improvements in the segmentation performance of the medial vastus and gastrocnemius muscles, while the segmentation scores of the other knee tissues remained comparable with those of the other U-Nets (L_L1 and L_L2). Fig. 4C qualitatively compares the results of a U-Net trained with the altered loss objective to those of the U-Nets trained with a single, non-altered loss objective. As mentioned in the corresponding methods section, this idea came after reviewing a few initial training results. While the U-Net trained with the L_L1 objective was not able to segment the medial vastus and gastrocnemius muscles after training, the U-Net with the L_L2 loss objective was. However, these images were slightly blurrier, and the segmentation accuracy of the remaining tissues was poorer than with L_L1. By varying the loss objective during training, the strengths of L_L2 and L_L1 were combined. We decided to first train the network with the L_L2 loss to capture all tissues and then to change to L_L1 halfway through training to sharpen the images and increase segmentation accuracy. This method created a more proficient network, capable of segmenting all tissues with higher or comparable accuracies relative to the networks trained with non-altered loss objectives.

Evaluation of the generator depth
The quantitative results of assessing the impact of generator network depth on the cGAN and U-Net segmentation performances are given in Tables 6 and 7. The cGAN with a generator down-sampling the input through nine convolutional layers achieved the highest DSC scores for tibial and patellar bone, as well as for femoral and patellar cartilage. Femoral bone and tibial cartilage were best segmented by the cGAN with five convolutions/transpose convolutions in the generator encoding/decoding parts. The medial vastus and gastrocnemius muscles, as well as the ACL and PCL, were best segmented by the cGAN with seven convolutions. Training the cGAN with a minimum feature channel number of 64 resulted in the highest segmentation scores for most tissues except for femoral bone, tibial cartilage and the medial vastus muscle.
The U-Net trained with nine convolutions/transpose convolutions in the network's encoding/decoding parts achieved the highest segmentation accuracies for all but one tissue (femoral cartilage), which was slightly better segmented by the U-Net with five convolutions/transpose convolutions. Training the U-Net with a minimum feature channel number of 64 resulted in the highest DSC scores for most tissues apart from patellar cartilage and the ACL, which were segmented best by the U-Net trained with a minimum feature channel number of 128.
It is important to note for this section that increasing the number of convolutions and feature channels in the generator network substantially increases the overall number of parameters in the network and the time per epoch required to train it (see network architectures in the Appendix for details). A considered trade-off between the increase in training time and any significant improvement in segmentation accuracy has to be made.

Evaluation of the discriminator patch size
Fig. 6 shows the qualitative comparison of the effect of using different patch sizes in the discriminator network, while the corresponding DSCs are listed in Table 8. The cGAN trained with the 1 × 1 PatchGAN (PixelGAN) achieved the highest segmentation accuracy for most tissues except for femoral and tibial cartilage and both muscle tissues, which were best segmented by the 34 × 34 PatchGAN. Increasing the receptive field size increases the number of parameters in the discriminator network, which may therefore be more difficult to train. Additionally, as in the 'pix2pix' paper (Isola et al., 2017), we also noticed the repetitive tiling / checkerboard artefact (Fig. 7). However, in our case the artefacts became more pronounced with every increase in patch size, the inverse of the tendency reported by Isola et al. (2017). This could be a result of assigning our cGANs the reverse task (image to label) compared to the one performed by Isola et al. (2017) (label to image). Fig. 8 depicts the loss evolution during network training of the cGAN trained with the 1 × 1 PatchGAN discriminator. The loss evolutions of the cGAN generator (L_cGAN and L_L1) and discriminator (L_real and L_fake) are shown in Fig. 8A and B, respectively. Fig. 8B highlights how the Nash equilibrium was reached for the discriminator network during cGAN training.
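The combined generator objective L_cGAN + λL_L1 (λ = 100) discussed throughout can be illustrated with a minimal NumPy sketch. This is not the training code used in the study (which relies on automatic differentiation); the function and array names are hypothetical, and only the forward loss computation is shown.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy on discriminator outputs in (0, 1).
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def generator_objective(disc_on_fake, fake_label_map, real_label_map, lam=100.0):
    # Adversarial term: the generator wants the discriminator to
    # classify its generated label maps as real (target = 1).
    l_cgan = bce(disc_on_fake, np.ones_like(disc_on_fake))
    # Pixel-wise term: L1 distance between generated and ground-truth maps.
    l_l1 = np.mean(np.abs(fake_label_map - real_label_map))
    return l_cgan + lam * l_l1
```

With a maximally uncertain discriminator (all outputs 0.5) and a perfect label map, the objective reduces to the adversarial term alone, log 2 ≈ 0.693.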

Evaluation of transfer learning
The quantitative results of this section are presented in Tables 9 and 10 with qualitative comparisons between single step (one dataset) and two step training (transfer learning) displayed in Figs. 9 and 10.
When comparing the segmentation performances of the proposed cGAN and U-Net without and with transfer learning when testing on the SKI10 testing dataset (Table 9, Fig. 9A-C), the AMROA-pretrained / SKI10-retrained (AMROA → SKI10) U-Net showed the highest DSC scores for femoral and tibial bone and the highest boundary accuracy (i.e. smallest ASDs) for femoral bone, while the SKI10-only trained U-Net segmented the tibial bone with the highest boundary accuracy. Femoral cartilage was best segmented by the AMROA-pretrained / SKI10-retrained (AMROA → SKI10) cGAN and tibial cartilage by the SKI10-only trained cGAN.
When testing the proposed cGAN and U-Net without and with transfer learning on the OAI ZIB testing dataset (Table 9, Fig. 9D-F), the AMROA-pretrained / OAI ZIB-retrained (AMROA → OAI ZIB) cGAN showed the highest accuracies for tibial bone and femoral cartilage, while the OAI ZIB-only trained cGAN segmented the femoral bone and tibial cartilage with the highest accuracies.
When testing the cGANs and U-Nets on the AMROA testing dataset (Table 10, Fig. 10), the SKI10-pretrained / AMROA-retrained (SKI10 → AMROA) U-Net had the highest DSCs for femoral and tibial bone as well as the ACL. Femoral cartilage, as well as patellar bone and cartilage, was segmented most accurately by the OAI ZIB-pretrained / AMROA-retrained (OAI ZIB → AMROA) U-Net. The AMROA-only trained U-Net showed the best segmentation accuracy for tibial cartilage. The SKI10-pretrained / AMROA-retrained (SKI10 → AMROA) cGAN provided the highest segmentation score for the vastus medialis muscle, while the medial head of gastrocnemius muscle and the PCL were best segmented

Table 5. Results of Altering the Loss Objective during Training. Assessing the influence of altering the loss objective function during training on the segmentation performance of the proposed cGAN and U-Net. A cGAN was trained with the L_cGAN + λL_L2 objective and a U-Net with the L_L2 objective for 50 epochs, followed by a further 50 epochs of training with the L_cGAN + λL_L1 and L_L1 objectives, respectively. Segmentation performances are compared with the previously trained cGANs (L_cGAN + λL_L1 and L_cGAN + λL_L2; λ = 100; 100 epochs) and U-Nets (L_L1 and L_L2; 100 epochs). Highest DSCs achieved for each tissue are in bold. Training and testing were performed on the AMROA training and testing datasets, respectively. DSCs are presented as mean ± standard deviation. Abbreviations: F Bone - femoral bone, T Bone - tibial bone, P Bone - patellar bone, F Cartilage - femoral cartilage, T Cartilage - tibial cartilage, P Cartilage - patellar cartilage, VM Muscle - vastus medialis muscle, GM Muscle - medial head of gastrocnemius muscle, ACL - anterior cruciate ligament, PCL - posterior cruciate ligament, DSC - Sørensen-Dice similarity coefficient.
The ability of the networks to be used under variable conditions was simulated by using three knee datasets (AMROA, SKI10 and OAI ZIB). Even without transfer learning, AMROA training enabled SKI10 and OAI ZIB segmentation and vice versa, albeit not with high accuracy, nonetheless indicating the robustness of deep learning methods. Transfer learning not only improved the segmentation accuracy for some tissues of the local dataset but also enhanced the networks' ability to segment the SKI10 / OAI ZIB test datasets by introducing more heterogeneity into the model. Even though the SKI10- and OAI ZIB-pretrained networks were subsequently fine-tuned to segment the local AMROA dataset, they could segment the SKI10 and OAI ZIB testing datasets with an improved performance compared to the AMROA-only trained network without pretraining. This effect was seen for both cGANs and U-Nets.
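The two-step training used for transfer learning can be sketched as a simple epoch plan; this is an illustrative sketch (the function name is hypothetical; the 20-epoch pretraining / 80-epoch fine-tuning split restates the schedule given in the figure legends):

```python
def transfer_learning_plan(pretrain_dataset, finetune_dataset,
                           pretrain_epochs=20, total_epochs=100):
    """Return (epoch, dataset) pairs for a two-step schedule: pretrain on
    a large external dataset, then fine-tune on the local one."""
    plan = []
    for epoch in range(1, total_epochs + 1):
        dataset = pretrain_dataset if epoch <= pretrain_epochs else finetune_dataset
        plan.append((epoch, dataset))
    return plan
```

For example, `transfer_learning_plan('SKI10', 'AMROA')` yields 20 SKI10 epochs followed by 80 AMROA epochs, i.e. the SKI10 → AMROA configuration.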

AMROA: comparison to previous studies
In this subsection, the results obtained for the different tissues semantically segmented in this study are compared to those of previous studies. The cGAN and U-Net achieving the highest segmentation accuracy on the AMROA dataset for each respective tissue are chosen for this purpose.
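The Sørensen-Dice similarity coefficient (DSC) used throughout these comparisons can be computed from a pair of binary masks as follows; this is a minimal NumPy sketch, assuming boolean mask arrays of equal shape:

```python
import numpy as np

def dice_coefficient(mask_a, mask_b, eps=1e-8):
    # DSC = 2 * |A ∩ B| / (|A| + |B|): 1.0 for perfect overlap, 0.0 for none.
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum() + eps)
```

Two masks sharing half their foreground pixels, e.g. [1, 1, 0, 0] and [1, 0, 1, 0], give a DSC of 0.5.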

Bone
While cartilage has traditionally been studied for OA, bone shape has come under increasing investigation (Ambellan et al., 2019;Felson and Neogi, 2004). Bone shape has been linked to radiographic OA (Hunter et al., 2015;Neogi et al., 2013;Wise et al., 2018) and associated with longitudinal pain progression (Hunter et al., 2015). Segmented bone can be used to separate out bone-specific diseases, such as osteochondral defects.
The SKI10-pretrained / AMROA-retrained U-Net achieved slightly higher segmentation accuracies for femoral and tibial bone (femoral: DSC = 0.974; tibial: DSC = 0.965) and the OAI ZIB-pretrained / AMROA-retrained U-Net for patellar bone (DSC = 0.948), compared to the cGANs. Near the top and bottom boundaries of any 2D slice, bone was not always fully segmented; this is where the MRI radiofrequency (RF) transmit and receive uniformity was poor due to characteristics of the MRI coil. Traditional semi-automatic approaches involving signal-threshold, region-based or clustering segmentation can be similarly sensitive to image non-uniformities (Swanson et al., 2010). These non-uniformities appear as a change in signal-to-noise ratio or a darkening of the surrounding muscle tissues (see lower regions of Fig. 2). These effects from RF transmit or receive non-uniformity could be mitigated with a larger training population, as more complex modelling of the data becomes possible. Nevertheless, segmentation of the patella achieved the lowest accuracy. The patella has the widest range of inter-subject variability when compared to the larger tibial and femoral bones; it can vary in both shape and position, shifting with the orientation and bend of the knee. Additionally, due to its smaller volume, fewer training images contain the patella.
The cGAN and U-Net bone segmentation scores achieved in this study are similar to those achieved by a CycleGAN method using unannotated knee MR images for femoral (DSC = 0.95–0.97) and tibial (DSC = 0.93–0.95) bone segmentation (Liu, 2019), and by a convolutional encoder-decoder network combined with a 3D fully connected conditional random field and simplex deformable modelling for femoral (DSC = 0.970), tibial (DSC = 0.962) and patellar (DSC = 0.898) bone segmentation (Zhou et al., 2018).

Cartilage
For a long time, OA was considered a disease primarily involving variations in articular cartilage composition and morphology. Therefore, attention was predominantly placed on the extraction of OA biomarkers from quantitative MR imaging techniques using manual or semi-manual segmentation techniques that suffer from intra- and inter-observer variability (Pedoia et al., 2016). Deep learning methods can provide a fast and repeatable alternative to overcome these time-consuming and operator-dependent procedures.

Muscle
As muscle weakness and atrophy can be regarded both as risk factors preceding the development of OA and as pain-related consequences of its progression, studying morphological changes in knee joint muscles has become increasingly important (Fink et al., 2007;Slemenda et al., 1997).
Our results are lower than those of a semi-automatic single-atlas (DSC = 0.95–0.96) and a fully-automatic multi-atlas (DSC = 0.91–0.94) based approach for medial vastus segmentation (Le Troter et al., 2016), and of a 2D U-Net for quadriceps (DSC = 0.98) segmentation (Kemnitz et al., 2019). A crucial difference between these studies and ours is the plane in which segmentation was performed. While muscles are typically segmented on axial images, as this provides a more straightforward task with clearer separation between different muscles, our multi-class tissue segmentation approach was performed on sagittal images. Segmenting different muscles in the sagittal plane is a demanding task, especially in areas of the calf where the two heads of the gastrocnemius muscle (medial and lateral) overlap while also overlaying the soleus muscle.

Cruciate ligament
There has been growing interest in investigating and understanding the mechanisms responsible for the post-traumatic development of OA following injury to the cruciate ligaments, especially the ACL (Chaudhari et al., 2008;Messer et al., 2019;Monu et al., 2017). Although ACL reconstruction and rehabilitation can help restore patients to normal life and previous activities, they cannot prevent the long-term risk of developing OA (Paschos, 2017). Accurate and repeatable segmentations of the cruciate ligaments are required when aiming to evaluate longitudinal changes in the ligaments following reconstructive surgery.
In our study, the OAI ZIB-pretrained / AMROA-retrained cGAN trained with the 1 × 1 PixelGAN and L_cGAN + λL_L1 loss objective (λ = 100) achieved the highest accuracy for ACL (DSC = 0.664) and PCL segmentation (DSC = 0.652). The SKI10-pretrained / AMROA-retrained U-Net (L_L1 loss objective) achieved a similar accuracy for ACL segmentation (DSC = 0.665) and the AMROA-only trained U-Net (L_L1 loss objective) achieved a marginally lower accuracy for PCL segmentation (DSC = 0.641), compared to the best performing cGANs. Lee et al. (2013) proposed a graph cut method for automatic ACL segmentation and attained a DSC score of 0.672, while Paproki et al. (2016) used a patch-based method for PCL segmentation to achieve a DSC score of 0.744. Using a 3D convolutional neural network (CNN), Mallya et al. (2019) achieved DSC scores of 0.40 and 0.61 for ACL and PCL segmentations, respectively. When combining their 3D CNN with a deformable atlas-based segmentation method, their ACL (DSC = 0.84) and PCL (DSC = 0.85) segmentation accuracies increased substantially. In general, 3D networks could provide higher segmentation accuracies, especially for fine structures such as the cruciate ligaments that only appear on a few 2D slices in a 3D dataset. However, 2D segmentation techniques are useful for broader applicability, as 2D imaging is often faster and currently still more clinically employed than 3D imaging.

Table 9. Results of Transfer Learning. Comparison of segmentation performance of the proposed cGAN and U-Net without and with transfer learning and testing on the SKI10 and OAI ZIB testing datasets. Highest network scores achieved for each tissue are in bold.
The lower similarity scores achieved in our study compared to the other studies could arise from the use of 3D-FS SPGR images as source images during training, as these are non-optimal for the segmentation of the cruciate ligaments due to their less than ideal soft tissue separation from surrounding structures and fluid. Fat-saturated proton-density-weighted fast spin echo and T2-weighted fast spin echo images are more suitable for segmentation purposes, as shown by Mallya et al. (2019) and Paproki et al. (2016), respectively. These sequences are clinically used for cruciate ligament assessment due to the dark appearance of the ligaments and their clear separation from fluid and other surrounding tissues.

SKI10 and OAI ZIB: comparison to previous studies
In this subsection, the segmentation results for the SKI10 and OAI ZIB datasets in this study are compared to those of previous studies. The cGAN and U-Net achieving the highest segmentation accuracy on these datasets are chosen for this purpose.

OAI ZIB
The OAI ZIB-only trained cGAN trained with the L_cGAN + λL_L1 loss objective (λ = 100) and a 1 × 1 PixelGAN generated segmentations of femoral bone (DSC = 0.985) and tibial cartilage (DSC = 0.839) with the highest accuracy. The AMROA-pretrained / OAI ZIB-retrained cGAN trained with the 1 × 1 PixelGAN and L_cGAN + λL_L1 loss objective (λ = 100) achieved the highest accuracy for tibial bone (DSC = 0.985) and femoral cartilage (DSC = 0.897) segmentation. The ASDs of both the femoral (ASD = 0.33 mm) and tibial (ASD = 0.29 mm) bones were smaller than the image resolution of the OAI DESS images (0.36 × 0.36 × 0.7 mm³). Although we achieved similar DSC scores for all tissues on the OAI ZIB dataset compared to those presented in (Ambellan et al., 2019), our ASD scores were larger. The pixel-wise error losses (L_L1, L_L2 and L_SmL1) used to train the networks in our work were chosen to maintain an effective comparison between the cGAN and the U-Net. However, training our models with loss functions more traditionally used for segmentation purposes, such as multi-class Dice similarity or cross entropy, might lead to more comparable results for boundary-distance-based metrics.

Fig. 9. Results of Transfer Learning: SKI10 and OAI ZIB. Assessing the influence of transfer learning on the segmentation performance of the cGAN and U-Net when tested on the SKI10 and OAI ZIB test datasets.
Fig. 10. Results of Transfer Learning: AMROA. Assessing the influence of transfer learning on the segmentation performance of the cGAN and U-Net when tested on the AMROA test dataset.
SKI10 / OAI ZIB → AMROA: Pretraining the network for 20 epochs on the SKI10 / OAI ZIB training dataset followed by network fine-tuning for 80 epochs on the AMROA training dataset. AMROA → SKI10 / OAI ZIB: Pretraining the network for 20 epochs on the AMROA training dataset followed by network fine-tuning for 80 epochs on the SKI10 / OAI ZIB training dataset. Abbreviations: FB - femoral bone, TB - tibial bone, PB - patellar bone, FC - femoral cartilage, TC - tibial cartilage, PC - patellar cartilage, VM Muscle - vastus medialis muscle, GM Muscle - medial head of gastrocnemius muscle, ACL - anterior cruciate ligament, PCL - posterior cruciate ligament, DSC - Sørensen-Dice similarity coefficient.
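A multi-class soft Dice loss of the kind mentioned above could be sketched as follows. This is an illustrative NumPy version, assuming predicted class probabilities and one-hot targets of shape (classes, H, W); it is not the loss actually used in this work.

```python
import numpy as np

def multiclass_soft_dice_loss(probs, one_hot_target, eps=1e-8):
    # probs: softmax outputs, shape (C, H, W); one_hot_target: same shape.
    # Soft Dice is computed per class and averaged; the loss is 1 - mean Dice.
    intersection = np.sum(probs * one_hot_target, axis=(1, 2))
    denom = np.sum(probs, axis=(1, 2)) + np.sum(one_hot_target, axis=(1, 2))
    dice_per_class = (2.0 * intersection + eps) / (denom + eps)
    return 1.0 - np.mean(dice_per_class)
```

A perfect prediction gives a loss near 0, while a prediction placing all probability on the wrong class gives a loss near 1.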

Limitations
The network performances are dependent on the accuracy of the ground truth segmentations. Inaccuracies or errors in the segmentation maps could result in a less accurate network, especially when trained on a low number of image volumes, as done in this study. Additionally, training a network on a low number of high-quality images restricts the network's applicability to highly controlled studies with homogeneous data. Therefore, the networks trained in this study might be limited in their application in clinical settings, where high image quality is not always achievable due to patient conditions and operator variabilities.
Network training on 2D MR image slices is considerably less computationally demanding than on 3D volumes. For the purposes of this study, such as investigating the effects of training with different loss objectives and cGAN discriminator networks, it was sufficient to train on 2D images. Nevertheless, the segmentation of small knee joint structures, such as the cruciate ligaments, could benefit from 3D networks that add spatial continuity along the slice dimension.
Furthermore, the segmentation results presented in this study are from standalone networks without further processing within a pipeline. Therefore, the obtained results, especially for cartilage segmentation, are not comparable to those of current state-of-the-art pipeline methods such as those described by Liu et al. (2017) and Ambellan et al. (2019), which initially perform automated segmentation using a CNN followed by further refinement with deformable or statistical shape models, respectively.
Lastly, additional investigations into varying the network architectures and optimisation strategies are warranted, as ever more loss functions, layer combinations and optimisation strategies are continuously being developed.

Conclusion
This work demonstrated the use of a cGAN, with a U-Net generator and a PatchGAN discriminator, for automatically segmenting multiple knee joint tissues on MR images. While DSCs > 0.95 were achieved for all segmented bone structures and DSCs > 0.83 for cartilage and muscle tissues, DSCs of only ≈ 0.66 were achieved for cruciate ligament segmentations. Nevertheless, this segmentation performance was attained despite the low number of subjects (N = 8) used for training on the local dataset. Although the U-Net outperformed the cGAN in most knee joint tissue segmentations, this study provides a platform for future technical developments utilising cGANs for segmentation tasks. By enabling automated and simultaneous segmentation of multiple tissues, we hope to increase the accuracy and time efficiency of evaluating joint health in osteoarthritis.

Declaration of Competing Interest
The authors report no declarations of interest.
Generator with five convolutions in encoder/decoder: The encoding part consists of the repeated application of five 4 × 4 convolutions with stride 2, down-sampling the input by a factor of 2 at each layer. In the ensuing decoding part, the input is repeatedly up-sampled by a factor of 2 by five 4 × 4 transpose convolutional layers with stride 2 and additional skip connections between each layer i and 5-i.

Generator with seven convolutions in encoder/decoder: The encoding part consists of the repeated application of seven 4 × 4 convolutions with stride 2, down-sampling the input by a factor of 2 at each layer. In the subsequent decoding part, the input is repeatedly up-sampled by a factor of 2 by seven 4 × 4 transpose convolutional layers with stride 2 and additional skip connections between each layer i and 7-i.

Generator with 16 as minimum number of feature channels: In this network, the number of feature channels is changed from 3 to 16 during the first encoding step. During the following three encoding steps, the number of feature channels is doubled (16-128).

Generator with 32 as minimum number of feature channels: The number of feature channels is changed from 3 to 32 during the first encoding step. In the following three encoding steps, the number of feature channels is doubled (32-256), while the subsequent five are kept at 256.

Generator with 128 as minimum number of feature channels: In the first encoding step the number of feature channels is changed from 3 to 128. In the following three encoding steps, the number of feature channels is doubled (128-1024), while the subsequent five are kept at 1024.

Discriminator:

70 × 70 PatchGAN: The discriminator network repeatedly down-samples the input by applying three 4 × 4 convolutions with stride 2 followed by two 4 × 4 convolutions with stride 1. Each convolution is followed by a batch normalisation layer (except the first and last layers) and a leaky ReLU (slope 0.2) (except for the last layer). The number of feature channels is doubled (64-512) during the first four convolutional steps.
The final convolutional layer is followed by a Sigmoid activation layer.
Total number of parameters: 2.769 M

1 × 1 PatchGAN (PixelGAN): This PixelGAN discriminator network applies three 1 × 1 convolutions with stride 1, where the first convolution is followed by a leaky ReLU (slope 0.2), the second convolution by a batch normalisation layer and a leaky ReLU (slope 0.2), and the final convolution by a Sigmoid activation function. The number of feature channels is doubled (64-128) during the first two convolutions.
Total number of parameters: 0.009 M

34 × 34 PatchGAN: This network repeatedly down-samples the input by using two 4 × 4 convolutions with stride 2 followed by two 4 × 4 convolutions with stride 1. Each convolution is followed by a batch normalisation layer (except the first and last layers) and a leaky ReLU (slope 0.2) (except for the last layer). The number of feature channels is doubled (64-256) during the first three convolutional steps. The final layer is followed by a Sigmoid activation layer.
Total number of parameters: 0.666 M

286 × 286 PatchGAN: This discriminator network consists of eight convolutional layers with 4 × 4 spatial filters. The first six convolutions have stride 2 while the last two have stride 1. Each convolutional layer is followed by a batch normalisation layer (except the first and last layers) and a leaky ReLU (slope 0.2) (except for the last layer). The number of feature channels is doubled (64-512) during the first four convolutions and kept at 512 for the ensuing layers. A Sigmoid activation layer succeeds the final convolution.
Total number of parameters: 11.159 M
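The parameter totals above can be checked with a small counting script. This is an illustrative sketch assuming six input channels (a conditional discriminator sees the concatenated 3-channel image and 3-channel label map) and counting only convolution weights/biases and the learnable batch-normalisation parameters.

```python
def conv_params(in_ch, out_ch, k):
    # Weights (in * out * k * k) plus one bias per output channel.
    return in_ch * out_ch * k * k + out_ch

def bn_params(ch):
    # Learnable scale and shift per channel.
    return 2 * ch

# 1 x 1 PatchGAN (PixelGAN): three 1 x 1 convolutions, BN after the second.
pixelgan = (conv_params(6, 64, 1) + conv_params(64, 128, 1) + bn_params(128)
            + conv_params(128, 1, 1))

# 34 x 34 PatchGAN: four 4 x 4 convolutions, BN on the middle two.
patchgan_34 = (conv_params(6, 64, 4) + conv_params(64, 128, 4) + bn_params(128)
               + conv_params(128, 256, 4) + bn_params(256)
               + conv_params(256, 1, 4))

# 70 x 70 PatchGAN: five 4 x 4 convolutions, BN on the middle three.
patchgan_70 = (conv_params(6, 64, 4) + conv_params(64, 128, 4) + bn_params(128)
               + conv_params(128, 256, 4) + bn_params(256)
               + conv_params(256, 512, 4) + bn_params(512)
               + conv_params(512, 1, 4))
```

Under these assumptions the totals come to 9,153, 666,817 and 2,769,601 parameters, consistent with the 0.009 M, 0.666 M and 2.769 M figures quoted above.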