Autoencoding Low-Resolution MRI for Semantically Smooth Interpolation of Anisotropic MRI

High-resolution medical images are beneficial for analysis but their acquisition may not always be feasible. Alternatively, high-resolution images can be created from low-resolution acquisitions using conventional upsampling methods, but such methods cannot exploit high-level contextual information contained in the images. Recently, better performing deep-learning based super-resolution methods have been introduced. However, these methods are limited by their supervised character, i.e. they require high-resolution examples for training. Instead, we propose an unsupervised deep learning semantic interpolation approach that synthesizes new intermediate slices from encoded low-resolution examples. To achieve semantically smooth interpolation in through-plane direction, the method exploits the latent space generated by autoencoders. To generate new intermediate slices, latent space encodings of two spatially adjacent slices are combined using their convex combination. Subsequently, the combined encoding is decoded to an intermediate slice. To constrain the model, a notion of semantic similarity is defined for a given dataset. For this, a new loss is introduced that exploits the spatial relationship between slices of the same volume. During training, an existing in-between slice is generated using a convex combination of its neighboring slice encodings. The method was trained and evaluated using publicly available cardiac cine, neonatal brain and adult brain MRI scans. In all evaluations, the new method produces significantly better results in terms of Structural Similarity Index Measure and Peak Signal-to-Noise Ratio (p<0.001 using one-sided Wilcoxon signed-rank test) than a cubic B-spline interpolation approach. Given the unsupervised nature of the method, high-resolution training data is not required and hence, the method can be readily applied in clinical settings.

: Visualization of proposed method. To perform upsampling in anisotropic medical images we exploit the ability of autoencoders to interpolate in latent space. The trained autoencoder is used to project two spatially adjacent slices onto a latent space. Thereafter, latent space encodings z n and z n+1 are combined using their convex combination. Increasing α from 0 to 1 results in a sequence of new slices where each subsequent slice is progressively less semantically similar to x n and more semantically similar to x n+1 . Anatomy in the obtained stack of slices appears semantically smooth in the direction that is perpendicular to the imaging plane.
reduced signal-to-noise ratio (SNR) or decreased temporal resolution. Higher image resolution can be achieved by increasing acquisition time. However, in clinical practice, fast scanning is often required to mitigate the risk for motion artifacts and to sustain patient comfort.
As a result, MRI scans with high temporal resolution are often highly anisotropic, which may hamper accurate analysis. There is ample research on methods that enable faster acquisition while maintaining high SNR and high spatial resolution such as compressed sensing ( [1]) and parallel MRI [2]. However, these methods are mainly available in a research setting and are not available for the majority of MRIs obtained in clinical practice.
Conventional interpolation methods like Linear, B-spline, and Lanczos resampling [3] are often used to increase throughplane resolution of anisotropic MRIs at post-processing. The possibility to apply such methods retrospectively, i.e. after image acquisition, is often advantageous because they do not require raw image data. Conventional interpolation methods are easily and broadly applicable. Nevertheless, they cannot exploit high-level contextual information contained in the images. Furthermore, to upsample low-resolution images these methods quintessentially compute weighted intensity averages arXiv:2202.09258v1 [eess.IV] 18 Feb 2022 using existing image voxel values.
More sophisticated super-resolution methods have been developed that either perform denoising, deblurring, antialiasing, upsampling, or a combination thereof, aiming to recover a high quality of medical images from their degraded versions. In the context of this work, super-resolution for medical images refers to the process of recovering information that was lost or degraded during the low-resolution sampling process. Hence, such methods can recover anatomical structures that are finer than the sampling grid. As a result of this process anatomical structures also appear smooth and plausible in through-plane direction.
Super-resolution methods can be broadly divided into two categories. First, approaches that combine several lowresolution images to estimate, or reconstruct, the highresolution image ( [4,5]). Typically, such methods require registering the low-resolution scans with each other. Therefore, performance of these approaches depends on the quality of image alignment. Given that alignment of non-moving organs can be achieved easier than for moving organs like the heart, early super-resolution approaches in the medical imaging domain were first developed for brain MRI (e.g. [6,7]). Second, to extend the applicability of super-resolution methods to moving organs approaches were developed that learn a nonlinear mapping between (paired) low-resolution (LR) and highresolution (HR) image patches [8,9,10,11,12]. Recently, these methods were superseded by deep-learning based superresolution approaches using convolutional neural networks (CNN) [13,14,15,16,17,18,19,20,21,22,23,24,25,26,27].
To learn the non-linear mapping between low and highresolution images the aforementioned methods require highresolution training examples. However, in clinical practice such images are impractical or even impossible to acquire. Hence, to circumvent this short-coming, several unsupervised methods [28,29,30,31,32,33,34] have been proposed to increase resolution of images without using high-resolution training data. These approaches take advantage of anisotropic images and exploit the high in-plane resolution to increase the low through-plane resolution. [28] proposed a Fourier-based method [35] that combines multiple augmentations of a single 3D low-resolution input image to estimate Fourier coefficients of higher frequency ranges absent in the anisotropic images. To create required image augmentations the method learns a regression function between blurred low-resolution slices and their corresponding high-resolution slices extracted from the high-resolution in-plane direction. Later [29] proposed to alter the method of [28] by replacing the regression approach with the deep-learning super-resolution approach by [17]. Both methods were evaluated on brain MRI with relatively mild anisotropy. Subsequently, [30,31] extended their approach [29] to suppress aliasing artifacts. In parallel, [32] proposed a Gaussian Mixture Model that learns to encode anatomical similarities extracted from a large collection of anisotropic brain MRIs. Using the learned latent structure, low-resolution scans are upsampled by imputing missing slices.
We propose a deep learning semantic interpolation approach that synthesizes new intermediate slices from encoded low-resolution examples. Specifically, in our preliminary study [33] we trained an autoencoder to compress and reconstruct highresolution slices taken from highly anisotropic volumes. Using the latent space of the trained autoencoder new intermediate slices are synthesized by mixing the encodings of two adjacent slices. The process is depicted in Figure 1. Our method can synthesize an arbitrary number of intermediate slices exploiting the mixing coefficient of the neighboring slice's latent vectors. Note that the method effectively imputes image slices that are of similar appearance, i.e. slice thickness is retained, but slice spacing will be decreased, resulting in anatomically and semantically smooth transitions in throughplane direction. The method was evaluated on cardiac cine MRI. In parallel to our study, [34] presented a super-resolution approach for cardiac cine MRI that employs a conditional generative adversarial network (GAN) that takes two spatially adjacent cardiac MRI (CMRI) slices as input to synthesize a slice in-between the input slices. To guide adversarial training the approach generates an auxiliary image using previously developed optical [36] and depth-aware [37] flow-based interpolation approaches. The approach increases anisotropic resolution with upsampling-factor of two, synthesizing one slice in through-plane direction. By recursively applying the method, higher upsampling-factors of two can be achieved.
Building further on our preliminary method using convolutional autoencoders, we introduce an additional training loss function that exploits the spatial relationship between neighboring slices in 3D images. This enables us to define a notion of semantic similarity for a given dataset. As a result, the autoencoder is encouraged to generate new slices that provide a semantically smooth morphing between two input images. Furthermore, we provide evidence that our extended approach leads to improved upsampling performance when compared to [33]. Unlike [34], our approach can be applied with any desired upsampling factor, i.e. it can synthesize an arbitrary number of slices between two given slices in a straightforward fashion. Moreover, our approach uses only a single encoder-decoder structure and does not rely on auxiliary networks. Compared to [32] our approach does not require a common atlas space to operate in. Moreover, our method is easy to implement and requires little GPU memory.
Like in our preliminary work [33], we evaluate performance on cardiac cine MRI. Moreover, to show that our approach generalizes to other anatomies the evaluation has been substantially extended. In the experiments, three publicly available MRI datasets were used: cardiac cine MRI from the MICCAI 2017 Automated Cardiac Diagnosis Challenge (ACDC) ( [38]); neonatal brain MRI of the developing Human Connectome Project (dHCP) ( [39]) and adult brain MRI from the OASIS project ( [40]). Evaluation on neonatal and adult brain MRI enabled performance comparison with related unsupervised [28,29] and supervised [41] super-resolution methods. Finally, to demonstrate that our model is invariant to MRI scanners and voxel intensity distributions, we apply a model trained on cardiac MRI scans from the ACDC dataset to cardiac MRIs of the Sunnybrook dataset [42].
II. DATA A. Cardiac Cine MRI 1) Automated Cardiac Diagnosis Challenge Cardiac cine MRIs from the MICCAI 2017 Automated Cardiac Diagnosis Challenge (ACDC) ( [38]) were used. The dataset consists of short-axis cardiac cine MRIs from 100 patients uniformly distributed over normal cardiac function and four disease groups: dilated cardiomyopathy, hypertrophic cardiomyopathy, heart failure with infarction, and right ventricular abnormality. Detailed acquisition protocol is described by [38].
Briefly, MRIs have an in-plane resolution ranging from 1.37 to 1.68 mm (average reconstruction matrix 243×217 voxels) with slice thickness and spacing varying from 5 to 10 mm. The ACDC dataset specifies slice spacing for each image volume while slice thickness is only specified as a range for the complete dataset. Per patient 28 to 40 time points cover the cardiac cycle. Each volume consists of on average ten slices including the heart. To correct for intensity differences among scans, in the current work, image intensities of each volume were rescaled and clamped between [0, 1] based on their 1 st and 99 th percentiles. Furthermore, to correct for differences in-plane voxel sizes, image slices were resampled to 1.4 ×1.4 mm 2 .
2) Sunnybrook Cardiac Data To demonstrate the ability of our proposed method to generalize to other datasets with the same modality and anatomy, cardiac cine MRI from the publicly available Sunnybrook Cardiac dataset [42] was used for additional model evaluation. The dataset contains 45 short-axis cine MRI images distributed over four pathology categories: healthy subjects, patients with hypertrophy, patients with heart failure and infarction, and patients with heart failure without infarction.
Each scan contains 20 time points (i.e. volumes) encompassing the entire cardiac cycle, which results in 45×20 volumes in total. All scans have a slice thickness and spacing of 8 mm and an in-plane resolution of 1.25×1.25 mm 2 . Scans are made with a 256×256 reconstruction matrix and consist of about 10 slices. In this work, image intensities of each volume were rescaled and clamped between [0, 1] based on their 1 st and 99 th percentiles.

B. Neonatal Brain MRI
In this study neonatal brain MRIs of the developing Human Connectome Project (dHCP) [39] were used (second release). The dataset constists of 508 infants with gestational age at birth ranging from 24 to 42 weeks. All infants were scanned without sedation in a scanner environment optimized for safe and comfortable neonatal imaging. A comprehensive description of the acquisition protocol can be found in [39].
The T 2 -weighted (T2w) images are provided with an isotropic resolution of 0.5×0.5×0.5 mm 3 . To reduce image size, in this work, volumes were cropped to cortical brain structures resulting in an axial reconstruction matrix of 256×256 voxels for all images. Finally, to correct for intensity differences among scans, voxel intensities of each volume were scaled to the [0, 1] range.

C. Adult Brain MRI
Brain MRIs of 416 subjects aged 18 to 96 from the OASIS project ( [40]) were used. Detailed information about the acquisition can be found in the paper by [40] and on the OASIS website 1 .
Briefly, T 1 -weighted brain MRIs were provided with an isotropic resolution of 1.0×1.0×1.0 mm 3 . To correct for intensity differences among scans, in this work, voxel intensities of each volume were scaled to the [0, 1] range.

III. METHOD
We propose a method to synthesize new slices in anisotropic 3D medical images by using the ability of a trained autoencoder to interpolate in latent space. The autoencoder is trained to compress and reconstruct high-resolution 2D slices taken from volumes with low through-plane resolution. We postulate that the autoencoder learns to encode anatomy from a collection of training images. While an individual image may only capture a partial aspect of a complete anatomical structure, such as the heart, the autoencoder may infer the missing information from similarly appearing images that captured different aspect of the anatomical structure.
After training, input slices can be reconstructed with minimal information loss. More important, new slices are synthesized through convex combinations of latent space encodings of the two adjacent slices, which is followed by decoding of the convex combinations to the new intermediate slices.
Note that increasing the mixing coefficient from 0 to 1 results in a sequence of new slices where each subsequent slice is progressively less semantically similar to the first input slice and more semantically similar to the second input slice. Although thickness of synthesized slices will be similar to the input slices, slice spacing will decrease. As a result, anatomical structures appear smooth and plausible in throughplane direction.

A. Autoencoder
An autoencoder ( [43]) is an unsupervised learning algorithm that aims to learn a lower-dimensional representation of the input. It consists of an encoder and decoder implemented as neural network. The encoder f θ parametrized by θ compresses the input x ∈ R dx into a lower-dimensional space z = f θ (x), z ∈ R dz , i.e. the latent space representation, which captures the most salient features of the input. Typically, this layer has the least amount of neurons and is also referred to as the bottleneck of an autoencoder.
The decoder g φ parametrized by φ uses the latent space representation to generate an approximate reconstruction of the input,x = g φ (z). The network layers in encoder and decoder are fully connected. In general, training an autoencoder aims to minimize a loss function L(x,x) that quantifies dissimilarity between the input and the corresponding reconstruction.
This work uses a convolutional autoencoder 2 (CAE) ( [44]) that has the same overall structure as a standard autoencoder but replaces the fully connected layers with convolutions. Latent space encodings generated by standard autoencoders are vectors with dimensionality equal to the size of the lowerdimensional space. In comparison, latent space representation of an input tensor generated by a convolutional autoencoder is a tensor with rank equal to the rank of the input tensor. The rank of an image is three (width, height, number of input channels). Throughout this work the number of input channels is one for gray scale images.

B. Autoencoder Architecture
The architecture of the convolutional autoencoder used in this work is the same for all datasets and experiments. The architecture of the encoder consists of two blocks, each with two consecutive convolutional layers using a kernel size of 3×3 voxels and zero-padding of size 1, followed by batch normalization and 2×2 voxels average pooling. The first and second block use 32 and 64 kernels, respectively. The last block is followed by two additional convolutional layers of 128, and 128 kernels for the final output layer. The output of the final convolutional layer is used as latent space representation of the input. All convolutional layers except for the final use a leaky ReLU nonlinearity. The combination of two average pooling layers of size 2×2 voxels and 128 kernels for the latent space representation results in an overcomplete autoencoder. In other words, the information that can potentially be stored in the latent space is larger than the information contained in the grayscale input image.
The architecture of the decoder is reverse of the encoder. It consists of two blocks of two consecutive convolutional layers with leaky ReLU nonlinearities followed by batch normalization and 2×2 voxels nearest neighbor upsampling. The number of kernels is halved after each upsampling layer. The last block is followed by two additional convolutional layers of 32 kernels, and 1 kernel for the last layer. All convolutional layers of the autoencoder use a kernel size of 3×3 voxels and zero-padding of size 1. To ensure that output values are in the range of [0, 1] the final layer uses the sigmoid function. Moreover, using two average pooling layers of size 2×2 voxels in the encoder requires the width and height of the input image each to be divisible by four. Nevertheless, test images do not have to match the size of the training patches.

C. Autoencoding for Semantic Interpolation
Our interpolation approach projects two spatially adjacent slices (x n , x n+1 ) onto a latent space. It requires high in-plane resolution. Thereafter, latent representations z n = f θ (x n ) and z n+1 = f θ (x n+1 ) are combined using a convex combination: where α denotes the mixing coefficient. Finally, the decoder generates a new slice x α n,n+1 = g φ (z α(zn,zn+1) ) by decoding the mixture of latent codes. We refer to this process as synthesizing slices for values of α ∈ (0, 1), as opposed to reconstructing encoded input slices when α ∈ {0, 1} for slices x n and x n+1 , respectively. Increasing α from 0 to 1 results in a sequence of new slices where each subsequent slice is progressively less semantically similar to x n and more semantically similar to x n+1 . As a result, anatomy in the obtained stack of slices also appears semantically smooth in the direction from which the slices were extracted. We refer to this approach as Autoencoding for Semantic Interpolation (ASI).
To attain upsampling of anisotropic images by factor K in through-plane direction, K − 1 slices need to be synthesized where the set of alpha values A is defined as follows: and |.| denotes the cardinality of a set.

D. Loss Function
The autoencoder is trained to compress and reconstruct high-resolution slices taken from an anisotropic 3D medical imaging dataset. The model aims to minimize the reconstruction loss between the original x and the reconstructedx slice.
To further constrain the model a notion of semantic similarity is defined for a given dataset. For this, the spatial relationship between slices of the same volume is exploited. During training an existing in-between slice x n (where n ∈ N + ) that has two neighboring slices, x n−1 and x n+1 , is synthesized using a convex combination of the neighboring slice encodings where α (the mixing coefficient) of Equation 1 is set to 0.5. The new slice encoding is mapped through the decoder to an approximationx α=0.5 n of the original in-between slice x n . Finally, a distance loss is computed between the original inbetween x n slice and its approximation i.e. synthesized slicê x α=0.5 n resulting in the following combined loss: d denotes a distance function between two images and λ is a hyperparameter weighting the contribution of the synthesis loss. Setting λ to zero disables the synthesis loss during training. Minimizing the synthesis loss during training should encourage the autoencoder to linearize the latent space of images. Therefore, a convex combination of slice encodings should result in smooth nonlinear interpolations in image space.
To compute the reconstruction loss, this work used the pixel-wise mean squared error between original x n and reconstructedx n image. In addition, to compute the synthesis loss between reference x n and synthesized imagex α=0.5 n the Learned Perceptual Image Patch Similarity (LPIPS) metric ( [45]) was used. The LPIPS metric is a perceptually-based pairwise image distance that is calculated as a weighted difference between the VGG-16 ( [46]) embedding of the reference and synthesized image. LPIPS uses the embeddings of VGG-16 layers conv_1 to conv_5. The VGG-16 CNN is pretrained on ImageNet and the weights to compute the weighted difference were fit so that the metric agrees with human perceptual similarity judgments.

IV. IMPLEMENTATION DETAILS
The method was implemented using the PyTorch framework ( [49]) and trained on one Nvidia GTX Titan X GPU with 12 GB memory. Model weights were initialized as zeromean Gaussian random variables with a standard deviation of 1/ n l (1 + 0.2 2 ) set in accordance with the leaky ReLU slope of 0.2. n l denotes the number of incoming network connections to layer l.
A model was trained using mini-batch stochastic gradient descent with a learning rate of 1 × 10 −5 . Image slices were provided once per epoch to the autoencoder in random order. In each experiment the training set was augmented by 90 degree rotations of the images and random intensity changes. Network parameters were optimized using the Adam optimizer ( [50]) minimizing the reconstruction and synthesis loss.
To compute the synthesis loss as described in Section III-D, this study used the LPIPS metric implementation 3 of [45]. Furthermore, in order to compute the synthesis loss minibatches of T slice pairs were randomly selected from the training set. A slice pair consisted of two slices (x n−1 and x n+1 ) originating from the same volume that are spatially separated by one in-between slice x n . To determine the optimal value for λ i.e. the contribution of the synthesis loss to the overall loss, a separate line search was performed for each dataset.
Finally, in all experiments model selection was performed on the validation set. The test set was not used during method development in any way.

V. EVALUATION
Performance of the method was quantitatively evaluated in terms of Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR) and Visual Information Fidelity (VIF) ( [51]). Recent work of [52] conveyed that 3 https://github.com/richzhang/PerceptualSimilarity Visual Information Fidelity demonstrates a high correlation with radiologists' opinions of MRI quality. The proposed method was compared against performance of cubic B-spline interpolation which is known to outperform methods like Nearest-Neighbor or Linear interpolation [53,54]. Statistical significance of performance differences between evaluated methods was tested using the one-sided Wilcoxon signed-rank test.
In addition, upsampling performance was qualitatively evaluated by visually inspecting the reconstructed and synthesized slices. Visual inspection mainly focused on anatomical plausibility and semantic similarity of synthesized slices compared to corresponding reference slices. Furthermore, generated images were visually examined for smoothness of interpolation.

A. Comparison of autoencoding approaches
Before applying the proposed approach to cardiac and brain MRI scans, several autoencoder approaches were investigated for latent space interpolation using MNIST data [55]. Given any digit and its 40 degree rotated variant, referred to as input images, intermediate rotations were synthesized by mixing the latent space encodings of the two input images. Results were compared with digits that were rotated in the image space. Three different approaches were compared: Variational Autoencoder (VAE) [47,56], Adversarially Constrained Autoencoder Interpolation (ACAI) [48] and the proposed approach (ASI). Interpolation performance was qualitatively evaluated using images of the MNIST dataset.

1) Experimental details
The dataset was randomly split into training (60 000 images), validation (1000 images), and test sets (9000 images). To train the models, patches of 32×32 were randomly chosen from the training set in mini batches of 32 images. The training set was augmented by random rotations γ ∈ [0, 2π] of the images. Models were trained for 100 epochs. The proposed model was implemented as described in section III-B except that 16 kernels were used for the latent space representation. Furthermore, λ as specified in Equation 3 was set to 10 after performing a line search (λ ∈ {0.05, 0.5, 1, 10, 100, 1000}).
To compute the synthesis loss as specified in Equation 3 each training image x n was augmented with two neighboring images {x n−1 , x n+1 }. x n−1 denotes a 15 degree counterclockwise rotation of image x n and x n+1 a 15 degree clockwise rotation of image x n . This enabled synthesizing imagex α=0.5 n in Equation 3 using a convex combination of the neighboring image encodings {z n−1 , z n+1 } where α (the mixing coefficient) in Equation 1 was set to 0.5.
During evaluation, for each test image x three new images were synthesized by interpolating between the image and a 40 degree counterclockwise rotation of the same image x rot40 • . For this, the set of mixing coefficients A as specified in Equation 2 was set to {0.25, 0.5, 0.75}. As a result, the three synthesized in-between images should be rotated versions of the original image x in steps of 10 degree Reconstructed Note that our approach follows the original image rotation more closely than VAE [47] and ACAI approach [48]. α denotes the mixing coefficient as specified in Equation 1. Variational Autoencoder (VAE) To improve smoothness of the latent space of an autoencoder [47,56] proposed to model the latent representations as a random variable distributed according to a prior distribution. The latent distribution constraint is enforced by an additional loss term which measures the Kullback-Leibler divergence between approximate posterior, modelled by the encoder, and prior distribution. In this work the prior was equal to a Gaussian distribution with diagonal covariance matrix. Implementation of the autoencoder was identical to the proposed approach except for the encoder that was extended with two linear layers to model the mean and covariance matrix of the posterior Gaussian distribution.

Adversarially Constrained Autoencoder Interpolation (ACAI)
To improve the ability of a convolutional autoencoder to interpolate in latent space [48] proposed to regularize the autoencoder by means of an adversarial training objective. Using a discriminator the autoencoder is encouraged to generate interpolated images that appear to be indistinguishable from reconstructions of real images. In this work, the approach was implemented following implementation details as described in [48].
2) Results Qualitative comparison of autoencoding approaches shown in Figure 2 demonstrates that our proposed model achieved the best interpolation performance. Performance differences become most apparent for interpolated images using a mixing coefficient equal to 0.5. Additionally, one can observe that linear steps taken in latent space using the set of mixing coefficients can approximate rotation steps in image space.

B. Semantic Interpolation of Cardiac Cine MRI
Short-axis cardiac cine MRIs are acquired to primarily investigate cardiac function. These images have a high temporal and in-plane resolution at the cost of lower through-plane resolution. The functional parameters extracted from these images, such as ejection fraction, may show high variability, which can be explained by the high anisotropic resolution, that may heavily influence volume measurements. These measurements may improve when extracted from upsampled images with smooth cardiac structures in through-plane direction. Therefore, the proposed approach was evaluated on highly anisotropic cardiac cine MRI using the ACDC dataset.

1) Experimental details
The dataset was randomly split into training (70 patients), validation (10 patients), and test sets (20 patients).  To train a model, patches of 128×128 pixels were randomly chosen from the training set in mini-batches of 12 slice pairs i.e. 24 image slices. A model was trained for 900 epochs. Furthermore, λ in Equation 3 was set to 0.05.
Test images were center-cropped to 140×140 pixels cover-   ing slice spacing. These excluded slices were subsequently recovered by synthesizing them using the proposed approach. For this, the upsampling factor K was set to 2 and A, the set of mixing coefficients was equal to 0.5. Downsampled test volumes had a slice spacing of 10 mm and 20 mm while slice thickness remained unchanged.
Comparison With Conventional Interpolation Method: Upsampling performance of the proposed unsupervised approach was quantitatively and qualitatively evaluated and compared with cubic B-spline interpolation.

2) Results
The primary goal of the proposed method is to synthesize new slices located in-between two spatially adjacent slices. However, the method's capacity to synthesize new slices depends on the ability of the autoencoder to reconstruct existing slices. Therefore, we report reconstruction and synthesis performance of the trained autoencoder separately.
Slice Reconstruction: Results for reconstructed and synthesized slices listed in Table I convey that the proposed approach achieved high reconstruction performance especially in terms of SSIM and PSNR. Figure 3 depicts qualitative results of reconstruction performance for the proposed method on cardiac MRI. The results show that the trained autoencoder can reconstruct high-quality images i.e. input slices. Nevertheless, difference image shown in Figure 3c depicts that some high Original axial slice to be synthesized

B-spline ours
Original coronal slice B-spline ours Original sagittal slice B-spline ours Fig. 9: Qualitative comparison of image synthesis performance on neonatal brain MRI (dHCP dataset) between conventional interpolation methods and proposed approach. Original volumes with slice thickness and spacing of 0.5 mm were downsampled to 2.5 mm by applying a Gaussian blur before including every fifth slice in the test volume. Differences between reference (minuend) and synthesized slice (subtrahend). Blue corresponds to negative and red to positive differences. Image intensities are scaled to a [0, 1] range. All difference images use the same color scale [−1, 1]. spatial frequency details of the input slice are lacking in the reconstructed slice. Slice Synthesis: Qualitative evaluation of synthesis performance of the proposed method conveys that synthesized slices, i.e. those that are generated using a convex combination of the neighboring slice encodings, show an anatomically and semantically meaningful transition between the two neighboring slices. Moreover, despite large anatomical variations between the neighboring slices for the right ventricle, left ventricle and trabecular structures of the left ventricle, the proposed method can generate slices that depict an anatomically smooth transition between the neighboring slices. Figure 4 illustrates three example evaluations for basal, mid-ventricular and apical MRI slices, respectively, where upsampling factor K was set to 7 and A, the set of mixing coefficients was equal to { 1 /7, 2 /7, 3 /7, 4 /7, 5 /7, 6 /7}. Furthermore, quantitative evaluation of synthesis performance assessed on downsampled cardiac cine MRI scans listed in Table I reveals that synthesis performance is lower than reconstruction performance.

Comparison With Conventional Interpolation Method:
Upsampling performance was compared with cubic B-spline in terms of SSIM, PSNR and VIF. For this, the methods synthesized cardiac slices that were excluded from the test volumes (see Section VI-B1). Results for upsampling factor 2 are shown in Figure 5. We observe that the proposed method achieved better performance when evaluated by all measures compared with the conventional interpolation method. These differences are statistically significant (p < 0.0001) using the one-sided Wilcoxon signed-rank test.
Moreover, qualitative comparison of the methods reveals that synthesized images using the proposed method contain fewer errors than images generated by conventional interpolation method. Results shown in Figure 6 depict that performance differences are especially pronounced for the left ventricle papillary muscles and right ventricle myocardium.
Finally, methods were qualitatively compared for a throughplane upsampling factor of ten using the original cardiac volumes. Visual inspection of the results discloses that proposed method can generate volumes with a higher image quality than cubic B-spline interpolation. Qualitative comparison shown in Figure 7 reveals that performance differences are most pronounced for the myocardial structures of the left ventricle. These structures show smoother transitions between adjacent axial slices when generated by proposed method compared to cubic B-spline interpolation. The latter method generates volumes that suffer from aliasing artifacts while the proposed method can mostly suppress such artifacts.

C. Semantic Interpolation of Neonatal Brain MRI
Acquisition of high-resolution neonatal brain MRIs is typically hampered by uncontrollable full-term infant motion and their small size of the brain [58]. As a result, acquired neonatal brain MRIs are often anisotropic and poorly capture the 3D brain structures [5]. Super-resolution of neonatal brain MRI may enhance the capacity of image analysis on the dynamics of brain maturation [59] and brain development [60,61]. Therefore, our proposed approach was evaluated using 240

1) Experimental Details
The dataset was randomly split into training (200), validation (20) and test set (20). The images with isotropic resolution of 0.5×0.5×0.5 mm 3 served as ground truth high-resolution (HR) images. To simulate low through-plane resolution (LR) images were downsampled by factor K ∈ {2, 3, 4, 5, 6} in the z-axis. More specifically, low-resolution images were generated by using a Gaussian blur with the full-width-athalf-maximum (FWHM) set to the desired slice thickness ( [4]). Subsequently, volumes were downsampled with factor K by including every K th slice in the test images to obtain 0.5×0.5×K * 0.5 mm 3 resolution. To assess upsampling performance of the proposed method downsampled test volumes were upsampled in through-plane direction by synthesizing K − 1 new slices between each pair of neighboring slices using the set of mixing coefficients as defined in Equation 2.
Resulting volumes were then compared with high-resolution ground truth data.
To train a model patches of 64×64 voxels were randomly chosen from the training set using mini-batches of 8 randomly selected slice pairs (16 slices) originating from the same volume as described in Section IV. To balance loss terms in Equation 3, λ was set to 0.001. A model was trained in 1300 epochs and the best performing model on the validation set was selected for final evaluation on the test volumes.

2) Results
Slice Synthesis: Qualitative evaluation of the proposed approach on neonatal brain MRI with reveals that generated slices using a convex combination of neighboring slice encodings, comprise a smooth anatomical transition between adjacent slices. Examples depicted in Figure 8 show upsampling performance of proposed method for neighboring slices with large anatomical variations. In the depicted figures slice spacing was improved from 2 to 0.29 mm by synthesizing six intermediate slices using latent space encodings of the two neighboring slices.
Visual inspection of Figure 9 conveys that our proposed approach was able to synthesize excluded high-resolution axial slices more accurately than cubic B-spline interpolation. These results are corroborated by the coronal and sagittal views revealing that volumes generated by the proposed method are less blurry and contain smoother transitions between slices compared to volumes generated by the conventional interpolation method.
Moreover, quantitative comparison shown in Figure 10 depicts that the proposed unsupervised method outperformed cubic B-spline interpolation in terms of SSIM and PSNR for all evaluated upsampling factors. Furthermore, the performance differences are statistically significant (p < 0.001) using the one-sided Wilcoxon signed-rank test.
Comparison With Supervised Super-Resolution Methods: Supervised deep-learning super-resolution methods developed by [41,57] were evaluated on a subset of 20 neonatal brain MRIs from the dHCP dataset. Table III lists results as reported by [41] together with quantitative evaluation of our proposed approach on the same dataset (see Section VI-B1). Although methods of [41,57] used high-resolution groundtruth data for model training their results can be used to put results reported in this work into perspective. One may carefully conclude that our proposed unsupervised deeplearning approach is on par with supervised super-resolution approaches by [41,57] when evaluated on neonatal brain MRI.

D. Semantic Interpolation of Adult Brain MRI
Upsampling performance of the proposed method was also evaluated using T 1 -weighted (T1w) adult brain MRIs of the OASIS project ( [40]). The images with isotropic resolution of 1.0×1.0×1.0 mm 3 served as ground truth high-resolution (HR) images.

1) Experimental Details
Model was trained on the first 200 unique patients, validated on the subsequent 20 patients, and tested on 50 randomly Original axial slice to be synthesized

B-spline ours
Original coronal slice B-spline ours Original sagittal slice B-spline ours Fig. 12: Comparing upsampling performance between cubic Bspline interpolation and proposed approach using T 1 -weighted adult brain MRI of the OASIS dataset. Original volumes with slice thickness and spacing of 1 mm were downsampled to 5 mm by applying a Gaussian blur before including every fifth slice in the test volume. Differences between reference (minuend) and synthesized slice (subtrahend). Blue corresponds to negative and red to positive differences. Image intensities are scaled to a [0, 1] range. All difference images use the same color scale [−1, 1].
selected scans from the remaining set. To ensure that test images were divisible by four, slices were zero-padded to 220×220 voxels. Other experimental settings were identical to experiments performed on neonatal brain MRI described in section VI-C.

2) Results
Slice Synthesis: Qualitative evaluation of proposed method on adult brain MRI with reveals that synthesized slices constitute a smooth anatomical transition between neighboring slices. The proposed method is able to bridge large anatomical variations between adjacent slices. These findings are depicted in Figure 11.

Comparison With Conventional Interpolation Method:
Qualitative comparison of generated axial brain MRI slices between cubic B-spline interpolation and proposed approach shown in Figure 12 reveals that the proposed method can synthesize excluded axial slices with higher image quality than conventional interpolation method. Moreover, visual inspection of coronal and sagittal slices shown in Figure 12 conveys that images generated by cubic B-spline interpolation more frequently suffer from aliasing artifacts than images generated by our proposed method.
In line with results reported for cardiac cine and neonatal brain MRI in Sections VI-B and VI-C, respectively, quantitative evaluation in terms of SSIM, PSNR, and VIF depicted in Figure 13 corroborate the qualitative findings. Measures were computed on sagittal slices through volume. The proposed method outperformed cubic B-spline interpolation and the differences are statistically significant (p < 0.0001) in terms of SSIM and PSNR for all upsampling factors (K ∈ {2, 3, 4, 5, 6}).
Comparison With Unsupervised Super-Resolution Methods: Previously developed unsupervised super-resolution methods by [28,29] were evaluated on T 1 -weighted adult brain MRIs from the Neuromorphometrics dataset. The dataset is not publicly available. From the 114 brain scans in the Neuromorphometrics dataset a subset of 60 scans (30 patients) was taken from the OASIS project. Therefore, reported results of previously developed methods for images from the Neuromorphometrics dataset can be used as estimates for a careful comparison. Quantitative results in terms of PSNR and SSIM listed in Table IV show that for upsampling factor 2 our proposed method is on par or better than previously developed super-resolution approaches. For upsampling factor 3 the method of [29] achieved the best results and our proposed approach is on par with method of [28].
Generative image synthesis approach of [32] was evaluated on 50 T 1 -weighted brain MRIs of the ADNI dataset [62]. To compare performances, our model was trained (100 scans) and evaluated (50 scans) on the ADNI dataset using identical experimental settings as described in section VI-D. Table V lists quantitative results for both methods in terms of mean squared error (MSE). One can observe that our method achieved a lower mean squared error compared with approach of [32]. IV: Comparison between proposed method (ASI) and unsupervised super-resolution approaches of [28,29] in terms of SSIM and PSNR. Approaches of [28] and [29] were evaluated on T 1 -weighted adult brain MRIs with 1.0 mm 3 isotropic resolution from the Neuromorphometrics dataset. Slice spacing was improved from 2 to 1 mm (factor 2) and 3 to 1 mm (factor 3). Results reported here are taken from the original work by [28] and [29].

E. Ablation Study
To investigate the effect of the synthesis loss on upsampling performance, the proposed model was trained and evaluated on cardiac cine MRIs by minimizing the reconstruction loss only. For this, λ in Equation 3 is set to zero (referred to as ASI λ=0 ). All other experimental conditions were held constant.  [32] in terms of mean squared error (MSE). Both approaches were evaluated on 50 randomly selected T 1weighted adult brain MRIs with 1.0 mm 3 isotropic resolution from the ADNI dataset. Slice spacing was improved from 6 to 1 mm. Result listed here was reported in the work by [32]. Values represent medians over 50 scans.
This setting enables performance comparison between model ASI λ=0 and a model trained with the combined reconstruction and synthesis loss (ASI λ=0.05 ). Figure 14 depicts qualitative comparison between the two models for image synthesis of cardiac MRI. The results reveal that performance decreased for a model trained with the reconstruction loss only. The performance difference is more pronounced for larger anatomical variations between neighboring slices e.g. the shape of the right ventricle. Furthermore, Figure 15 demonstrates synthesis performance of the two models. The model trained with just the reconstruction loss (ASI λ=0 ) generates cross-fade artifacts between the intensities of the two neighboring slices [63,64,65] whereas the model trained with a combination of reconstruction and synthesis loss (ASI λ=0.05 ) can substantially suppress such artifacts.
These results are corroborated by quantitative evaluation in terms of SSIM, PSNR, and VIF depicted in Figure 16. One can observe that the model trained with the combined reconstruction and synthesis loss achieved better performance when evaluated by SSIM and PSNR compared to a model trained with the reconstruction loss only (ASI λ=0 ). For the VIF measure the performance achievements are reversed. All differences are statistically significant (p < 0.0001) using the one-sided Wilcoxon signed-rank test. To further investigate the effect of the synthesis loss on upsampling performance, separate quantitative evaluations were performed on reconstructed and synthesized short-axis slices, respectively. The results listed in Table I reveal that a model trained with the reconstruction loss only (ASI λ=0 ) achieved better performance for the reconstruction task compared to a model trained with the combined reconstruction and synthesis loss (ASI λ=0.05 ). In contrast, the latter model performed better in terms of SSIM and PSNR when synthesizing the excluded slices. These results indicate that the additional synthesis loss resulted in increased interpolation performance but at the same time impacted reconstruction performance of the autoencoder.
Finally, Figure 17 shows the effect of different values of λ on reconstruction and synthesis performance of the proposed approach. Increasing the contribution of the synthesis loss, i.e. using larger values for λ, increases synthesis and lowers reconstruction performance of the model in terms of SSIM and PSNR.

F. Evaluation on Cardiac Cine MRIs from Sunnybrook dataset
To examine generalization performance of our proposed method a model trained on cardiac cine MRIs from the ACDC dataset was evaluated on cardiac MRI scans from the Sunnybrook dataset [42]. Figure 18 shows   for cubic B-spline compared with proposed method (ASI) in terms of SSIM, PSNR, and VIF. One can observe that the proposed approach trained on ACDC images outperformed cubic B-spline interpolation for all measures on Sunnybrook dataset. The latter methods do not require training. These differences are statistically significant (p < 0.0001) using the one-sided Wilcoxon signed-rank test. Furthermore, the results depict that relative performance differences between methods are nearly identical to those observed on ACDC dataset depicted in Figure 5. Finally, achieved performance on cardiac MRI scans of Sunnybrook dataset is higher for all measures compared with performance reported for evaluation on ACDC dataset depicted in Figure 5. This might have been caused by scans of Sunnybrook dataset having more consistent image quality, better alignment of adjacent slices, and higher bit depth compared with scans from ACDC dataset.

VII. DISCUSSION
A method for unsupervised deep-learning image synthesis of medical images has been presented. To synthesize new intermediate slices and thereby recovering spatial information, the method exploits the latent space interpolation ability of autoencoders. New intermediate slices are generated by mixing the latent space encodings of two spatially adjacent slices. High-resolution ground-truth images are not required to train the approach. Results of our preliminary experiments using MNIST data demonstrated that our proposed approach outperformed a variational autoencoder (VAE) and Adversarially Constrained Autoencoder Interpolation (ACAI) approach [48] for interpolating rotations of handwritten digits. Evaluation of the approach on cardiac and brain structures using four publicly available MRI datasets revealed that the method can outperform cubic B-spline interpolation. Performance differences between evaluated methods become more apparent when the upsampling task becomes more difficult i.e. for highly anisotropic volumes e.g. cardiac MRI or larger through-plane upsampling factors. This might indicate that the model can infer the missing information from contextual and anatomical information captured in the latent space. Furthermore, the experimental results revealed that our proposed approach can compete with related unsupervised [28,29] and supervised [41] super-resolution approaches. Compared with unsupervised super-resolution methods of [28,29,30,31,32,34] our approach can be applied with any desired upsampling factor and uses a single encoder-decoder structure. Moreover, applying methods of [28] and [29,30,31] requires optimization during inference for every image at hand and therefore, inference requires several minutes of GPU processing time [31]. In contrast, at test time our method can synthesize multiple intermediate slices between each pair of adjacent slices in an MRI scan in less than a second on a GPU. Furthermore, using the method of [32] necessitates creation of a common atlas space to which each image must be Proposed approach was trained on CMR scans from ACDC dataset. Hence, depicted results demonstrate generalization performance of proposed method. A higher score indicates better performance. Measures were computed on sagittal slices through short-axis volume. Performance differences are statistically significant (p < 0.0001) using the one-sided Wilcoxon signed-rank test. Triangle indicates mean value.
transformed. Finally, it is fair to note that the approach of [30,31] performs explicitly anti-aliasing by using an additional CNN.
Even though autoencoders are designed to learn a lowerdimensional representation of the input while minimizing information loss, in this work, we used them to perform semantic interpolation and thereby recover spatial information in anisotropic medical images. Specifically, we used an overcomplete autoencoder that can potentially retain all information contained in the input. Theoretically, such an approach could learn an identity to minimize the reconstruction loss. Nonetheless, in line with previous research ( [66]), the results presented in this work seem to indicate that such a model can learn a useful representation of its input. Furthermore, the experimental results revealed that interpolation between image representations is feasible to approximate information orthogonal to the input images. This suggests that the proposed approach learns to extract contextual and high level conceptual information from the input images. Moreover, the results demonstrate that the decoder learns to exploit this information to instantiate semantically meaningful intermediate slices. While previously developed shape-based interpolation approaches of [67,68] exploit anatomical shape information to achieve high-order interpolation between cross sections of 3D anatomical structures, we argue that our approach performs semantic interpolation between two spatially adjacent slices.
Synthesized images are not guaranteed to be semantically meaningful. However, the solution space of the proposed approach is constrained using encodings of two spatially adjacent slices. Moreover, compared with an adversarial training objective [48] training the autoencoder with the proposed synthesis loss enforces an explicit constraint on the solution space because the synthesized image is evaluated against its reference. Nevertheless, image interpolation in latent space can still result in cross-fade artifacts between the intensities of the two images i.e. neighboring slices [63,64,65]. Appearance of such an artifact can be observed in Figure 8 (second row, columns four to six). However, results presented in Figure 15 demonstrated that the proposed model can synthesize images with substantially less artifacts compared with a standard autoencoding approach. Although adversarial approaches are extremely difficult to train [69], adding a critic to our approach could further constrain the model and improve synthesis performance. Moreover, to further constrain the model an additional synthesis loss in latent space could have been proposed [65]. We deliberately leave the challenge of defining a useful distance metric in latent space for future work.
Our approach assumes that a linear spacing in latent space corresponds to slice spacing in image space. Although, alternatives such as spherical latent space interpolation [70,71] or an enforced Riemannian latent space [63,72] can be used, linear interpolation in the latent space of an autoencoder trained with the proposed synthesis loss showed excellent results.
Neighboring slice 1 Slice 2, located in-between slice 1 and 3 Synthesized slice 2 using proposed approach with α = 0.5 Neighboring slice 3 Neighboring slice 1 Slice 2, located in-between slice 1 and 3 Synthesized slice 2 using proposed approach with α = 0.5 Neighboring slice 3 Fig. 19: Qualitative comparison for upsampling factor 2 on cardiac MRI (ACDC dataset). Each row depicts an example of synthesizing intermediate slice 2 (second column) using latent space encodings of the two neighboring slices 1 and 3 (first and last column). The synthesized slice is shown in penultimate column. First row shows basal slices with large anatomical variations compared to second row that depicts mid-ventricular slices with mild anatomical variations. α denotes the mixing coefficient as specified in Equation 1. Scans have a slice spacing of 10 mm. .
Nevertheless, human anatomy does not change linearly along spatial dimensions. We conjecture that the model can learn such a nonlinearity from the training data. Furthermore, we presume that the synthesis loss encourages the model to learn the nonlinear mapping between distances in latent and image space. For example, our experiments on cardiac MRI revealed that the model has learned that structural changes at the base of the heart are substantially different than at the apex. Finally, our experiments on neonatal and adult brain MRI apply mixing coefficients (α) unequal to 0.5. Results of these experiments corroborate our assumption that linear steps taken in latent space can approximate anatomical distances in image space, however, the approach does not guarantee such a relationship to be exact. Performance of the proposed method is affected by large anatomical variations between adjacent slices as shown in Figure 19. Furthermore, quality of synthesized images is decreased when adjacent slices within the original volume are misaligned. This is a known problem in cardiac cine MR imaging caused by surrounding organ motion during breathhold acquisition. In these cases intermediate points along the interpolation path spanned by two adjacent slice encodings result in anatomically implausible images i.e. cross-fades. This might indicate that the latent space between the two endpoints is too sparsely populated. For cardiac cine MRI this could be alleviated by choosing additional interpolation endpoints e.g. from other time frames. This direction will be investigated in future work.
Training the autoencoder with a combination of reconstruction and synthesis loss slightly hampered reconstruction performance when compared to training with reconstruction loss only as shown in Table I. However, quality of synthesized slices considerably improved when adding the synthesis loss during training. To render 3D high-resolution images the method only relies on the quality of newly synthesized slices. Hence, all existing slices do not need to be reconstructed but can be taken from the original anisotropic 3D volumes.
The here presented approach was developed in parallel with the super-resolution method proposed by [34]. In the current work quantitative results are presented in terms of SSIM, PSNR and VIF for all cardiac MRI slices of test patients from the ACDC dataset and 45 subjects from the Sunnybrook dataset. In comparison, work of [34] depicts results in terms of PSNR and correlation coefficient for a limited selection of two cardiac MRI slices (mid-ventricular and basal) from the UK Biobank. Note that intensity statistics of images from different datasets may be very different and hence, PSNR measurements might be inaccurate [18]. Therefore, thorough quantitative comparison of the two methods is hardly feasible.
To conclude, we presented a method for unsupervised semantic interpolation of anisotropic 3D medical images achieving anatomically smooth transitions in through-plane direction. New intermediate slices are generated by mixing the latent space encodings of two spatially adjacent slices. The experiments using cardiac cine and brain MRIs demonstrated that the proposed approach outperforms cubic B-spline interpolation on cardiac cine and brain MRIs. Given the unsupervised nature of the method, high-resolution training data is not required and hence, the method can be readily applied in clinical settings.

ACKNOWLEDGMENT
This study was performed within the DLMedIA program (P15-26) funded by Dutch Technology Foundation with participation of Pie Medical Imaging.