Adversarial training with cycle consistency for unsupervised super-resolution in endomicroscopy

Highlights
• We propose an unsupervised SR framework based on an adversarial deep neural network with a physically-inspired cycle consistency.
• Our framework can exploit HR images, regardless of the domain where they are coming from, to transfer the quality of the HR images to the initial LR images.
• Our system can be particularly useful in all situations where pairs of LR/HR are not available during the training of a super-resolution algorithm.
• Our quantitative analysis, validated using a database of 238 endomicroscopy video sequences from 143 patients, shows the ability of the pipeline to produce convincing super-resolved images.
• A Mean Opinion Score (MOS) study also confirms this quantitative image quality assessment.


Introduction
According to a recent report by the World Health Organization, cancer is the second leading cause of death after cardiovascular disease and was responsible for 8.8 million deaths in 2015. Early detection, such as the ability to detect precancerous lesions, plays an important role in reducing cancer incidence and related mortality (Torre et al., 2016). Optical endomicroscopy, based for example on confocal microscopy, optical coherence tomography or spectroscopy, has the ability to perform optical biopsies and identify early pathology in tissues or organs including the colon, oesophagus, pancreas, brain, liver and cervix (Nguyen et al., 2015; Ravì et al., 2017). Although progress has been made in recent years to build reliable optical endomicroscopy devices (Neumann et al., 2010), the need to operate at micron scale through the use of endoscopes, fibre bundles, laparoscopes, and needles limits the final resolution of the images. Further hardware improvements are difficult to achieve, and one possibility to improve the image quality is to post-process the images using super-resolution (SR) techniques.
Recent methods for SR are based on training example-based models that learn how to improve image resolution by exploiting a database of aligned pairs of low-resolution (LR) and high-resolution (HR) images (Ledig et al., 2017; Ravì et al., 2018). Nonetheless, due to the lack of HR endomicroscopy images, these pairs are not typically available in this domain. An option is to generate these pairs synthetically, but achieving this in a sufficiently realistic manner is only feasible when the acquisition process is extremely well defined. In most cases, the acquisition process is only known approximately and supervised methods may thus not be applicable.
For this reason, we designed a deep learning architecture trained in an unsupervised manner where the aforementioned one-to-one alignment between LR and HR is no longer required. We formalize our framework so that LR images from an initial input domain I LR can be transformed into images of any target domain T HR. The target domain can be the same as or different from the initial one. An example of the difference between initial and target domain with paired and unpaired patches is shown in Fig. 1. To train the model using unpaired patches and prevent the network from learning to produce HR images with no direct relationship with the input images, a cycle consistency block is included in our architecture. This block imposes some physical acquisition properties so that the obtained HR images faithfully represent the initial LR images.
More specifically, during the training procedure, we make use of an adversarial network, a class of artificial intelligence algorithms which train two separate models that challenge each other in a zero-sum game. The first model is an SR network that learns how to improve the resolution of the images, and the second is a discriminative network DS that, looking at the target domain, tries to distinguish images generated by the SR network from the real T HR images. The aim of SR is to learn how to fool the DS network, and this leads to the generation of super-resolved images.
Adversarial training can learn how to produce outputs with the same distribution as the target domain. However, the target domain distribution could be sampled by simply mapping the input images to any random permutation of images in the target domain. Therefore, in this context and without specific constraints, an adversarial loss alone cannot guarantee that the learned function maps an input to a desired corresponding super-resolved image. Thus, following the idea proposed by (Zhu et al., 2017) in the context of style transfer, we add to the adversarial training a further cycle block that imposes a consistency between the HR images and the initial LR images. In contrast to the work in (Zhu et al., 2017), where the reverse mapping between the target and the source domain is also learned, in this work this consistency is obtained by constraining the super-resolved T HR image to have similar physical acquisition properties to the initial I LR image.
As explained in more detail in Section 3.3, the raw signal of probe-based Confocal Laser Endomicroscopy (pCLE) images is acquired from tens of thousands of fibres irregularly placed in a bundle. Moreover, the LR images are reconstructed using a Delaunay-based linear interpolation that interpolates pixels from the centres of the fibres to a regular grid. Starting from the super-resolved pCLE images created by our framework, the proposed physical constraints impose that the values obtained by inverting the aforementioned interpolation are similar to the raw signal acquired from the fibres for the corresponding LR image.
To the best of our knowledge, this paper is the first to propose an adversarial network that takes advantage of the knowledge of the physical acquisition process by imposing a cycle consistency to perform unsupervised SR of medical images. In our experiment, we show that the proposed framework does not require paired aligned patches for the training. This is an important property

Related work
With the recent outbreak of deep learning, example-based super-resolution (EBSR) has led to a dramatic leap in SR performance. These approaches are mainly based on a supervised training procedure where a database of aligned pairs of LR and HR images is required to create the model. Being supervised, these SR methods are restricted to specific training data, where the LR images are usually predetermined from their HR counterparts. However, in many contexts, such as in endomicroscopy, HR images are not available due to physical constraints and therefore these paired aligned images cannot be generated. A first attempt to train an EBSR network for endomicroscopy was proposed by (Ravì et al., 2018), where a video-registration technique is used to estimate the HR images from a sequence of LR images. A pipeline for the generation of synthetic data is then presented to produce the desired aligned pairs. Although models trained with generated synthetic data can obtain convincing SR images, the domain gap between synthetic LR images and original pCLE images raises questions about their reliability for clinical use. For this reason, we believe that unsupervised super-resolution techniques would be more suitable in these cases. An unsupervised method for image SR is presented in (Ayasso et al., 2012), based on a Variational Bayesian (VB) algorithm that combines a Bayesian technique with a Markovian model. The main issue with this approach is the difficulty of hand-crafting a good perceptual loss function, and the final images tend to be blurred. Rather than designing a suitable similarity loss function, (Goodfellow et al., 2014) proposed a general framework called Generative Adversarial Network (GAN) where the perceptual loss function is trained directly using a discriminative network. This allows the method to automatically verify if a generated sample is similar to a real one from the target domain. In particular, the adversarial process uses two models: i) a generative model G, and ii) a discriminative model D that are trained to play a zero-sum game. Following this general framework, (Ledig et al., 2017) proposed a single image super-resolution architecture called SRGAN. Although this approach is unsupervised, part of its loss is still supervised. In fact, a content loss term based on a per-pixel loss between the output and ground-truth images is used there. This term again requires alignment between LR and HR, thereby limiting its applicability in our context. Another drawback of SRGAN is that it is difficult to train, often generating SR images that are too sharp or have artefacts. To reduce these drawbacks, (Bao et al., 2017) proposed to combine a VB approach with GAN. They show that an asymmetric loss function, obtained using a cross-entropy loss for the discriminative network and a mean discrepancy objective for the generative network, makes the GAN training more stable. Similarly to this idea, an Adversarial Variational Bayes was proposed by (Mescheder et al., 2017), where a Variational Autoencoder (VAE) is trained using an auxiliary discriminative network. Contrary to the previous case, this approach provides a clearer theoretical justification. However, the problem of using paired LR/HR has not been resolved by any of the approaches described so far. One of the first approaches that formalizes the possibility of translating images from a source domain X to a target domain Y in the absence of paired examples was proposed by (Zhu et al., 2017) and is called CycleGAN. Using adversarial training, the goal of this method is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y. Since this mapping is highly under-constrained, the authors also introduced an inverse mapping F : Y → X and a cycle consistency loss to ensure F(G(X)) ≈ X.
Thanks to this two-step consistency, the need for paired images is eliminated. Varying the input-output domain, this framework can be used to perform artistic style transformation (Johnson et al., 2016) (where, for example, horses can be converted into zebras) or, as in our case, transfer the resolution from one domain to another. Some other interesting approaches were proposed by (Shocher et al., 2017) and (Bulat et al., 2018). Here the authors point out that the predetermined LR images obtained from standard bi-cubic down-sampling rarely look like real LR images. In (Shocher et al., 2017) the authors introduce a method called Zero-Shot SR that does not rely on prior training. To do so, they exploit the internal recurrence of information inside a single image and train a small image-specific CNN at test time. This facilitates self-training SR for biological data, old photos, noisy images, and other images where the acquisition process is unknown.
Following the CycleGAN concept of (Zhu et al., 2017), we propose an unsupervised framework that uses unpaired images and is designed to overcome the limitations that standard SR approaches have when aligned pairs of LR/HR images are required.

Database
To validate our solution, we used the database proposed by (André et al., 2011) containing 238 anonymized probe-based Confocal Laser Endomicroscopy (pCLE) video sequences from 143 patients captured on the colon and oesophagus regions. This database does not provide the real ground truth of the HR images; only estimated HR images, computed using a time-consuming video-registration technique on the LR images, are available. Video-registration may generate HR images that are not perfectly aligned with the LR images and might display further reconstruction artefacts. We define this set of data as DB orig. A second version of this database, called DB syn and based on the simulated framework proposed in (Ravì et al., 2018), is also used in our experiments. In this case, the LR images are synthetically generated from the HR images, which results in perfectly aligned paired images. More specifically, the simulated framework extracts fibre signals f s directly from the HR image by averaging the HR pixel values within the region defined by the Voronoi cell computed from the centre of the fibre's position. Moreover, to replicate realistic noise patterns on the simulated LR images, additive and multiplicative Gaussian noise is added to each fibre signal f s, obtaining a noisy fibre signal nf s. Finally, Delaunay-based linear interpolation is performed, thereby leading to the final simulated LR images. Fig. 3 shows some examples of images extracted from both versions of the database. In both these scenarios, the database was divided randomly into three subsets: a train set (70%), a validation set (15%), and a test set (15%). The number of images that belong to each clinical setting is kept equal in each of these sets.
We provide the results from two different case studies: CS 1, where the images that belong to the same video are only distributed within one of the subsets (train, test or validation), and CS 2, where, additionally, the images from the same patient are only distributed within one of these subsets. CS 1 allows us to understand if the system is capable of super-resolving new visual patterns that have never been seen before. Given the size of our dataset, CS 2 allows for a coarser but less bias-prone evaluation that mimics a more realistic scenario, where the effectiveness of the system in transferring the inter-patient super-resolution capability to new patients is analysed. In our experimental section, we show that these two scenarios demonstrate similar trends.
Following the same pre-processing steps proposed in (Ravì et al., 2018), the intensity values are first normalized to have mean zero and variance one, then the pixel values of every frame are individually scaled to the range [0, 1], and finally non-overlapping patches of 64×64 pixels are extracted only from the pCLE field of view of the train and validation sets. The patches in the validation set were used to monitor the loss and avoid overfitting. Test images of size 512×512 were processed at full size to compute the final results. We highlight that the proposed SR framework does not aim to increase the number of pixels, but rather to improve the quality of the LR images, which are initially oversampled with an average of 7 interpolated pixels for each informative fibre-pixel. The choice to initially oversample is made by the manufacturer to ensure that the image space is discretised in a sufficiently fine manner to map the fibre graph onto a square pixel grid without too much distortion. In conclusion, in our system, the output images have the same size as the input images but display refined content. The full-size processing of the test images is possible since the inference network is fully convolutional and no specific image size is required as input.
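The difference between CS 1 and CS 2 comes down to the unit that is kept inside a single subset: the video or the patient. The following is a minimal sketch of such a group-wise split, not the authors' procedure. The function name `grouped_split`, the dictionary layout of a frame, and the choice to apply the 70/15/15 proportions to the number of groups are assumptions made for illustration; the balancing across clinical settings described in the text is not reproduced here.

```python
import random
from collections import defaultdict


def grouped_split(frames, key, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split frames into train/validation/test so that no group straddles subsets.

    frames: list of dicts describing one frame each (hypothetical layout).
    key:    grouping field, e.g. "video" for CS 1 or "patient" for CS 2.
    """
    groups = defaultdict(list)
    for frame in frames:
        groups[frame[key]].append(frame)

    ids = sorted(groups)                 # deterministic order before shuffling
    random.Random(seed).shuffle(ids)

    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    parts = {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }
    # Every frame of a group ends up in the subset assigned to that group.
    return {name: [f for g in members for f in groups[g]]
            for name, members in parts.items()}
```

Calling `grouped_split(frames, "video")` mimics CS 1, while `grouped_split(frames, "patient")` mimics CS 2, where all videos of one patient stay together.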
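The pre-processing chain described above can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: the standardization is applied per frame here (the text does not say whether the statistics are per frame or per dataset), the field-of-view mask is ignored, and the function names are hypothetical.

```python
def standardize(frame):
    """Shift and scale a 2-D list of intensities to zero mean and unit variance."""
    values = [v for row in frame for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0              # guard against constant frames
    return [[(v - mean) / std for v in row] for row in frame]


def rescale01(frame):
    """Scale one frame individually to the range [0, 1]."""
    lo = min(min(row) for row in frame)
    hi = max(max(row) for row in frame)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in frame]


def extract_patches(frame, size=64):
    """Return the non-overlapping size x size patches that fit inside the frame."""
    height, width = len(frame), len(frame[0])
    return [[row[c:c + size] for row in frame[r:r + size]]
            for r in range(0, height - size + 1, size)
            for c in range(0, width - size + 1, size)]
```

A 512×512 frame would yield 64 such patches before any field-of-view filtering, since 512 / 64 = 8 patches fit along each axis.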

Adversarial training
The pipeline used for training our framework is presented in Fig. 2 and is divided into different sub-sections, each coded by a specific colour.
We formalize our training as an adversarial min-max problem where two networks, a discriminative network defined as DS γ (red sub-sequence in Fig. 2), and a super-resolution network defined as SR θ (grey sub-sequence in Fig. 2) are trained concurrently. More specifically, the first network DS γ is trained solving:

$$\max_{\gamma} \; \mathbb{E}_{y \sim p_{T_{HR}}}\left[\log DS_{\gamma}(y)\right] + \mathbb{E}_{x \sim p_{I_{LR}}}\left[\log\left(1 - DS_{\gamma}(SR_{\theta}(x))\right)\right] \qquad (1)$$

where p I LR and p T HR are respectively the patch distributions on the input and target domain, DS γ (·) estimates the probability that a patch comes from the target domain, whereas SR θ (x) is the predicted super-resolved patch obtained from x. The meaning of Eq. 1 is that the discriminator is trained to maximize its ability to discriminate predicted super-resolved patches from real T HR patches.
The second network SR θ is trained instead through the minimization of a composite loss function loss t obtained solving:

$$\hat{\theta} = \arg\min_{\theta} \; loss_t \qquad (2)$$

The proposed loss t, defined in Eq. 3, is a combination of three terms: l V ec, which models the physical acquisition characteristics of the predicted super-resolved patch, l Adv, which models the adversarial loss function, and l Reg, which is used to regularize the network training. The details of each term are provided later in this section.
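To make the discriminator objective concrete, the sketch below evaluates a Monte-Carlo estimate of it on a mini-batch, assuming the standard GAN form of (Goodfellow et al., 2014): the mean of log DS(y) over real target patches plus the mean of log(1 − DS(SR(x))) over super-resolved patches. The function name and the small constant `eps` guarding the logarithm are illustrative choices.

```python
import math


def discriminator_objective(p_real, p_fake, eps=1e-12):
    """Mini-batch estimate of the quantity the discriminator maximizes.

    p_real: discriminator outputs DS(y) on real target-domain patches.
    p_fake: discriminator outputs DS(SR(x)) on super-resolved patches.
    """
    real_term = sum(math.log(p + eps) for p in p_real) / len(p_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in p_fake) / len(p_fake)
    return real_term + fake_term
```

A discriminator at chance level (all outputs equal to 0.5) scores 2·log(0.5) ≈ −1.386, and the value rises towards 0 as real patches are scored close to 1 and generated patches close to 0.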
Both SR θ and DS γ are concurrently trained using the back-propagation algorithm that gradually adjusts the parameters θ and γ through a stochastic gradient descent for the former and a stochastic gradient ascent for the latter.

Input domain and cycle consistency

Input domain
The green blocks in Fig. 2 represent the data structures required as input for the proposed pipeline. The most obvious input is the reconstructed I LR that is used by the SR θ network to infer the super-resolved patch SR θ (I LR).
In pCLE imaging, image acquisition is achieved by illuminating one fibre at a time. Each fibre acts as an individual pinhole and a scan point for fibre confocality. The information from all the fibres is then collected in a vector that we refer to as a vectorized image V (I LR), which represents the main input block in our pipeline. I LR images are reconstructed by interpolating the values in V (I LR) from the centres of the fibre positions to the points of a regular grid. Therefore, the fibre positions are the other key input block required by our pipeline.

Cycle consistency
Starting from a generated high-resolution pCLE image SR θ (I LR), we can obtain a low-resolution representation of it by a process referred to as Voronoi vectorization V (SR θ (I LR)), which is equivalent to down-sampling for standard images. The details of the Voronoi vectorization used in our framework are described in Fig. 4. Here, the first step is to compute the Voronoi diagram from the fibre positions. The result is a partition of the plane where for each fibre there is a corresponding region, called a Voronoi cell, consisting of all points closer to this fibre than to any other fibre. The next step is to average the pixels in the SR θ (I LR) patch that belong to the same Voronoi cell, imitating the point spread function of the fibre acquisition process. All the elements in the vector are then normalized to the range [0, 1]. This normalization makes the training faster and reduces the chances of getting stuck in local optima. Since each patch may have a different number of fibres, the vectorization can produce vectors of different sizes. Therefore, as a final step, 0-padding is introduced so that each vector always has a fixed number of elements. We define this fixed number as N F, which is equal to the maximum number of fibres in a single patch. In our database N F is 682, which is commensurate with the ratio between the patch size (64×64) and the average factor (7) used to oversample each informative fibre-pixel.
The vectorized V (I LR) and the Voronoi vectorization V (SR θ (I LR)) are used in our pipeline to create the cycle consistency (blocks coloured in cyan in Fig. 2). These blocks are used to impose the requirement that the predicted super-resolved images SR θ (I LR) have the same physical acquisition properties as the initial I LR images. Without this cycle consistency, the network could simply produce arbitrary images in the target domain with no relationship to the structures contained in the input image, because our framework relies on unpaired patches. To avoid this, we force V (I LR) and V (SR θ (I LR)) to be similar using the l V ec term in the proposed loss function.
In contrast to CycleGAN (Zhu et al., 2017), our cycle consistency block is not a trainable network, but rather is used to constrain the SR θ network to generate images with the same physical acquisition properties as the initial I LR images.
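The steps of the Voronoi vectorization (cell assignment, per-cell averaging, normalization, zero-padding) can be sketched as below. This is a brute-force illustration rather than the pipeline's implementation: each pixel is assigned to its nearest fibre centre, which is equivalent to membership in that fibre's Voronoi cell, and min-max scaling is assumed for the normalization to [0, 1] because the text does not specify the scheme. The function name and argument layout are hypothetical.

```python
def voronoi_vectorize(patch, fibres, n_f=682):
    """Average the pixels of each Voronoi cell and return a fixed-length vector.

    patch:  2-D list of pixel intensities.
    fibres: list of (row, col) fibre centres inside the patch.
    n_f:    fixed output length; shorter vectors are zero-padded.
    """
    if len(fibres) > n_f:
        raise ValueError("more fibres than the fixed vector length n_f")

    sums = [0.0] * len(fibres)
    counts = [0] * len(fibres)
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            # Nearest fibre centre == Voronoi cell that contains this pixel.
            nearest = min(
                range(len(fibres)),
                key=lambda k: (fibres[k][0] - r) ** 2 + (fibres[k][1] - c) ** 2,
            )
            sums[nearest] += value
            counts[nearest] += 1

    means = [s / n if n else 0.0 for s, n in zip(sums, counts)]
    lo, hi = min(means), max(means)
    span = (hi - lo) or 1.0              # assumed min-max normalization
    vector = [(m - lo) / span for m in means]
    return vector + [0.0] * (n_f - len(vector))
```

As a sanity check on the stated vector length, a 64×64 patch holds 4096 pixels, so with about 7 interpolated pixels per fibre one expects roughly 4096 / 7 ≈ 585 fibres per patch on average, which sits below the reported maximum of 682.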
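One way to express the similarity between the two vectors is a mean squared difference, sketched below. The choice of distance is an assumption made purely for illustration, since the text here does not state which norm the l V ec term uses; the function name is hypothetical.

```python
def l_vec(v_lr, v_sr):
    """Mean squared difference between V(I_LR) and V(SR(I_LR)) (assumed distance)."""
    if len(v_lr) != len(v_sr):
        raise ValueError("both vectors must have the same fixed length")
    return sum((a - b) ** 2 for a, b in zip(v_lr, v_sr)) / len(v_lr)
```

Because both vectors are zero-padded to the same length, the padded entries contribute nothing to the sum, and the term vanishes only when the super-resolved patch reproduces the fibre signals of the input.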

Super-resolution network
We decided to use the layout for the SR network proposed in (Ledig et al., 2017). SR θ is aimed at producing images that are similar to those in the target domain by trying to fool the discriminator network. This is achieved through the term l Adv in the proposed loss function, defined as follows:

$$l_{Adv} = -\log DS_{\gamma}(SR_{\theta}(I_{LR}))$$

where DS γ (SR θ (I LR)) is the probability that the predicted image SR θ (I LR) is classified as a real T HR. As proposed by (Goodfellow et al., 2014), we minimize −log DS γ (SR θ (I LR)) instead of log[1 − DS γ (SR θ (I LR))] for better gradient behaviour.
In the inference phase, only SR θ is used for processing the I LR images.
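The "better gradient behaviour" of the −log form can be checked numerically. The sketch below assumes the discriminator output is a sigmoid of a logit a, which is a common but here assumed parameterization, and compares the gradient of both generator losses with respect to that logit.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the original minimax generator loss."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the form minimized in the text."""
    return -(1.0 - sigmoid(a))
```

Early in training the discriminator easily rejects generated patches, so the logit is strongly negative. At a = −6 the saturating gradient has magnitude of about 0.0025 while the non-saturating one stays close to 1, which is why the second form keeps the SR network learning.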

Regularization
The blocks displayed in orange in Fig. 2 are used to regularize the network training. This regularization is required since the Voronoi vectorization of each patch is normalized to the range [0, 1] and this may result in an expansion of its histogram range. To restore the correct histogram distribution, we impose that the mean values in each row and each column of the patch are identical between the initial I LR and the obtained T HR. This is achieved in our framework through the l Reg term of loss t:
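A possible reading of this constraint is sketched below: the row and column means of the input patch and of the super-resolved patch are compared, and their mean absolute difference is used as the penalty. The absolute difference and the equal weighting of rows and columns are assumptions for illustration only, and the function names are hypothetical.

```python
def row_means(patch):
    return [sum(row) / len(row) for row in patch]


def col_means(patch):
    return [sum(col) / len(col) for col in zip(*patch)]


def l_reg(lr_patch, sr_patch):
    """Mean absolute difference of row and column means (assumed penalty form)."""
    diffs = [abs(a - b) for a, b in zip(row_means(lr_patch), row_means(sr_patch))]
    diffs += [abs(a - b) for a, b in zip(col_means(lr_patch), col_means(sr_patch))]
    return sum(diffs) / len(diffs)
```

Such a term constrains only the marginal intensity profiles: two patches with identical row and column means give a zero penalty even if their pixels differ, so it restores the intensity range without dictating local detail.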

Training domain
In our pipeline we considered four different target domains to transfer the super-resolution to the initial LR images: i) T HR nat, where the HR patches are extracted from natural images (grey-scaled images from the Sun2012 database (Xiao et al., 2010)), ii) T HR orig, containing the HR patches obtained by the video-registration technique on the LR images, iii) T HR syn, containing paired HR patches obtained using the video-registration technique while the LR are synthetically aligned, and iv) T HR res, where the HR patches are obtained by down-sampling large portions of the LR images by a factor of four. Inspired by the work proposed in (Shocher et al., 2017), the idea behind this last target domain is based on the fact that patches in the images have recurrences at a different scale, and down-sampling large LR images may increase the high-frequency responses in the generated down-sampled HR patches.

Training details and parameters
In our implementation, Eq. 1 is solved by minimizing the cross-entropy of the number of samples correctly discriminated by DS. As proposed by (Arjovsky and Bottou, 2017), we add white noise to the inputs of the DS γ network to stabilize the adversarial training. We trained our networks on an NVIDIA GTX TITAN-X GPU card with 12 GB of memory. The training procedure converges after 50-80 thousand iterations on random mini-batches of 54 patches. For the optimization of the stochastic gradient descent, we use Adam with β1 = 0.9, β2 = 0.999 and ε = 10e-8. The networks were trained with a learning rate of 10e-4.
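For reference, a single Adam update with the stated hyper-parameters can be sketched as follows. The code follows the standard Adam update rule with bias correction; the function name and the list-based parameter representation are illustrative. The defaults reproduce the values exactly as written in the text: Python reads `10e-4` as 1e-3 and `10e-8` as 1e-7, and the text does not settle whether 10^-4 and 10^-8 were intended.

```python
def adam_step(theta, grad, m, v, t, lr=10e-4, beta1=0.9, beta2=0.999, eps=10e-8):
    """One Adam update. t is the 1-based iteration counter.

    theta, grad, m, v are equally long lists of floats; returns the updated
    parameters together with the new first and second moment estimates.
    """
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]      # bias-corrected mean
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]      # bias-corrected variance
    theta = [p - lr * mh / (vh ** 0.5 + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

On the first iteration the bias correction makes the step size approximately equal to the learning rate regardless of the gradient magnitude, which is one reason Adam is less sensitive to gradient scale than plain stochastic gradient descent.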

Experiments
Due to the lack of real ground truth in our database, the validation of our experiments is based on complementary quantitative and qualitative analysis. The quantitative analysis, presented in Section 4.1, uses four different metrics to evaluate the obtained images. The qualitative analysis is instead based on a MOS study carried out by clinicians and medical imaging experts who gave numerical indications of the perceived quality of the super-resolved images.

Quantitative analysis
The four metrics used in our quantitative analysis are: i) the Structural Similarity index (SSIM) proposed by (Wang et al., 2004), which evaluates the similarity between SR θ (I LR) and HR, ii) ∆GCF HR, which quantifies the improvement in the global contrast factor (a reference-free metric for measuring image contrast (Matkovic et al., 2005)) that the super-resolved image yields with respect to HR, iii) ∆GCF I LR, which is the improvement in the global contrast factor that the super-resolved image yields with respect to the initial I LR, and iv) a composite score T ot cs obtained by normalizing the values of SSIM and ∆GCF HR to the range [0, 1] and averaging the obtained results.
The formula used to compute T ot cs is described by Eq. 7.
This composite score leads to a more robust evaluation of the results since SSIM alone is not reliable when the ground truth is only estimated, while the GCF can be improved by merely adding random high-frequency content to the images.
Our first experiment is aimed at finding the best target domain for improving the pCLE images. The results computed on DB orig for both case studies are reported in Table 1. As we can see, the network trained with natural images (T HR nat) obtains the best T ot cs score. From these results, we can also deduce that using synthetic images for the training is worse than using images from the original domain. This is probably due to the fact that synthetic images may have a non-negligible domain gap with the real images. With this result, we can state that paired patches are no longer a requirement for our framework. Finally, down-sampling LR images to create patches with higher frequency content does not seem to provide good results, and the high-frequency signals are not recovered. These qualitative indications can be seen in the reconstructed images reported in Fig. 5.
Looking at the different case studies, the aforementioned considerations are consistent across both cases, although CS 2 shows slightly lower performance than CS 1, probably due to the fact that it has a coarser split of its dataset. To further validate our framework, we compare our best approach (the network trained with the target domain T HR nat) against some state-of-the-art single image super-resolution methodologies. These results are presented in Table 2 for the database DB orig and in Table 3 for the database DB syn. In these experiments we consider three different approaches: i) the unsupervised Wiener deconvolution tuned on the train set, ii) the unsupervised variational Bayesian inference approach with sparse and non-sparse priors (Villena et al., 2013), and iii) the supervised EBSR proposed by (Ravì et al., 2018). Finally, a contrast-enhancement approach obtained by sharpening the input was also used as a baseline.
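The composite score can be illustrated with a short sketch that follows the verbal description: normalize SSIM and ∆GCF HR to [0, 1], then average. The normalization is assumed here to be a min-max scaling across the compared methods, which the text does not confirm, and the function names are hypothetical.

```python
def min_max(values):
    """Scale a list of values to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]


def composite_scores(ssim, delta_gcf_hr):
    """Average of normalized SSIM and normalized delta-GCF, one score per method."""
    return [(s + g) / 2.0
            for s, g in zip(min_max(ssim), min_max(delta_gcf_hr))]
```

With purely hypothetical inputs `ssim = [0.5, 0.7, 0.9]` and `delta_gcf_hr = [0.4, 0.0, 0.2]`, the third method ranks first: it has the best SSIM and a mid-range contrast gain, whereas the first method's top contrast gain is offset by the lowest SSIM.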
In the dataset DB orig, although the sharpening algorithm produces the best contrast improvements, our approach obtains the highest SSIM and, according to T ot cs, the overall performance outperforms all the other approaches.
Differently from the results obtained with DB orig, with the database DB syn our approach is not able to surpass the results obtained by (Ravì et al., 2018). This is probably due to the fact that the supervised training in (Ravì et al., 2018) exploits principles that are similar to the ones used to generate the synthetic images in DB syn. Therefore, the results obtained by this approach in the database DB syn are obtained in a purely simulated scenario where the model is trained on data with no domain gap with the test set. Consequently, the supervised approach in (Ravì et al., 2018) has an advantage over our unsupervised one in this setting. What is interesting to see, nonetheless, is that our unsupervised framework can achieve performance close to that of the supervised approach of (Ravì et al., 2018) despite the evaluation being intrinsically favourable to the latter.
Also in these two experiments, close results are obtained between CS 1 and CS 2 confirming that the system is able to super-resolve images for both the considered cases (i.e. when the images contain new visual structures or when they are extracted from new patients).
The statistical significance of the improvements discussed in this section was assessed with a paired t-test, and the p-values are all less than 0.0001.
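A paired t-test compares two methods on the same test images by testing whether the mean of the per-image differences is zero. The sketch below computes the test statistic and its degrees of freedom from two equally long lists of per-image scores; the function name is hypothetical, and turning the statistic into a p-value requires the Student t distribution, which is available for instance through `scipy.stats.ttest_rel`.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two paired samples with at least two items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n), n - 1
```

The pairing matters: because each image serves as its own control, image-to-image variability in difficulty cancels out in the differences, which makes the test more sensitive than an unpaired comparison of the two score lists.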

Semi-quantitative analysis (MOS)
To perform the MOS, we asked 10 trained individuals to evaluate, on average, 20 images each, randomly selected from the test set of DB orig on the case study CS 1. At each step, the SR images obtained with two different configurations of the proposed approach, with (Ravì et al., 2018), and with a contrast-enhancement approach that sharpens the input (baseline), were shown to the user in a random order to reduce any possible bias in the evaluation of the images. The two configurations used for our approach were the ones based on training our model with the two best target domains (i.e. T HR nat and T HR orig). The input and the HR were also displayed on the screen as references for the participants. For each of the four images, the user assigned a score from 1 (strongly disagree) to 5 (strongly agree) on the following questions:
• Q1: Is the image artefact-free?
• Q2: Can you see an improvement in contrast with respect to the input?
• Q3: Can you see an improvement in the details with respect to the input?
• Q4: Would you prefer seeing the new image over the input?
To make sure that the questions were consistently interpreted, each participant received a short training before starting the study. The results of the MOS presented in Fig. 6 show that, between the two different configurations of our approach, the model trained with natural images T HR nat provides a better trade-off of artefacts, contrast and details than the training using the target domain T HR orig. The results also show that the proposed approach and (Ravì et al., 2018) provide complementary features. In fact, although the details (question Q3) and the contrast (question Q2) in both settings of our approach seem to be worse than (Ravì et al., 2018), our solution provides better scores for the absence of artefacts (question Q1), which is an important characteristic in clinical applications. Regarding the final preference (question Q4), our solution trained using natural images (T HR nat) shows the best results with respect to all the other approaches, confirming the validity of our solution to perform super-resolution on pCLE images. The approach that sharpens the images is, instead, the one that provides the lowest scores for Q2, Q3 and Q4, probably because it enhances the noise.
Visual results for some of these images are shown in Fig. 7, confirming these findings. More specifically, although the output proposed by (Ravì et al., 2018) shows better contrast and sharper images, the corresponding algorithm also behaves much worse in terms of artefact generation. In fact, as we can see in Fig. 7, it often enhances noise and makes up details that are visible neither in the input nor in the estimated HR images, and this can eventually lead to a wrong clinical interpretation of the images. In Fig. 7 we have marked with a white circle some of the regions where these issues are more evident. From a clinical point of view, we believe that the reliability of SR images in terms of details is a more important feature than having high contrast or high sharpness created by artefacts. According to these visual considerations and the MOS findings, we can conclude that our solution provides a more convincing representation for super-resolved pCLE images than the other state-of-the-art approaches.
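The per-question summary reported for a MOS study of this kind (mean and standard deviation of the replies) can be sketched as follows. The tuple layout of a reply, the function name, and the use of the sample standard deviation are assumptions for illustration; the text does not state which estimator was used.

```python
from statistics import mean, stdev


def summarize_mos(replies):
    """Group (method, question, score) replies and return mean and std per group.

    Scores are expected on the 1 (strongly disagree) to 5 (strongly agree) scale.
    """
    grouped = {}
    for method, question, score in replies:
        if not 1 <= score <= 5:
            raise ValueError("scores must lie between 1 and 5")
        grouped.setdefault((method, question), []).append(score)
    return {key: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
            for key, scores in grouped.items()}
```

Keeping the replies keyed by both method and question makes it straightforward to compare, for example, the artefact question Q1 across methods separately from the overall preference Q4.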

Discussion and conclusions
Obtaining medical images that accurately visualize tissue structures is still an open challenge. One of the main issues that researchers are trying to address here is to improve the image resolution. In endomicroscopy, low image resolution is often dependent on the intrinsic limitations of the acquisition systems. Current solutions propose SR methods to post-process the final images as an alternative to the more difficult hardware enhancements. Clinical impacts and benefits in the use of SR methods include: i) better localization of tissue structures, ii) improving the image contrast and iii) improving the Signal to Noise Ratio (SNR) (Greenspan, 2008). However, the validation of these benefits in terms of clinical outcome is often not straightforward. An attempt at this was proposed in (Kennedy et al., 2006), where Positron Emission Tomography (PET) scans on phantoms and patients were used to show that smaller visual features were better localized and visualized using SR techniques than without. Another similar study conducted in (Kennedy et al., 2007) shows that using SR techniques produces better contrast ratios and better target-to-background ratios than the standard reconstructions. (Plenge et al., 2012) designed, instead, an experimental framework to show that SR reconstructions are more advantageous in terms of SNR than direct HR acquisition. Finally, (Sano et al., 2017) proposed a novel measurement algorithm for joint space distance on X-ray images generated by an SR method. The results exhibit higher accuracy in the measured distances when SR images were used.
The studies above show that SR methods can improve clinical outcomes and can open the door to better diagnosis. In our case, however, the lack of a real ground truth can raise some scepticism about the validation of the results, since it is not simple to show that SR approaches do not emphasize or make up details that are not real. Our extensive quantitative and qualitative analyses, based also on experts' evaluations, were developed to show the reliability of the obtained SR images and the reduced presence of artefacts even in the absence of real ground truth. More specifically, these results, validated using two versions of a database containing 238 endomicroscopy video sequences captured from 143 patients, demonstrate the ability of the pipeline to produce convincing super-resolved images.
In conclusion, in our study we report a super-resolution framework for endomicroscopy images based on an unsupervised adversarial deep neural network that takes advantage of the knowledge of the physical acquisition process to impose a cycle consistency. The proposed framework is particularly useful in all situations where there is a lack of HR images and pairs of LR/HR images are not available for supervised training.
To the best of our knowledge, we are the first to propose an unsupervised super-resolution approach for medical images. Further clinical studies could validate the relevance of the proposed framework to specific clinical applications for super-resolution.

Figure 1: Example of aligned-paired and unpaired patches used for training super-resolution neural networks.

Figure 2: Pipeline used for training the proposed adversarial network with cycle consistency. Each component of the pipeline is identified by a different color.

Figure 3: Example of images contained in the two proposed versions of the database. Both versions share the same HR images. The LR in DB orig are images acquired on human tissues and they are not always aligned with the related HR. The LR in DB syn are instead generated synthetically and are always aligned with the corresponding HR.

Figure 4: Voronoi vectorization used in our pipeline to constrain the predicted super-resolved patches.

Figure 5: Example of visual results obtained by the proposed approaches when trained with different target domains. From left to right we have: Input, training with T HR syn, training with T HR orig, training with T HR res, training with T HR nat and HR.

Figure 6: Mean and standard deviation of the participants' replies to each of the four MOS questions for the evaluation of the results obtained by contrast-enhancement (baseline), (Ravì et al., 2018), and the proposed approach trained using two different target domains (T HR nat and T HR orig). These results were obtained on the test images of DB orig for the case study CS 1.

Figure 7: Example of visual results obtained by our approach in comparison with other state-of-the-art approaches. From left to right we have: Input image, proposed output, output from (Ravì et al., 2018), output from (Villena et al., 2013) and HR.

Table 1: Quantitative analysis results obtained by our approach when trained with different target domains on case study CS 1 and CS 2.

Table 2: Quantitative analysis results of the proposed approach against state-of-the-art methods on the database DB orig for case study CS 1 and CS 2.

Table 3: Quantitative analysis results of the proposed approach against state-of-the-art methods on the database DB syn for case study CS 1 and CS 2.