Convolutional neural networks for whole slide image superresolution

We present a computational approach for improving the resolution of images acquired from commonly available low magnification commercial slide scanners. Images from such scanners can be acquired cheaply and are efficient in terms of storage and data transfer. However, they are generally of poorer quality than images from high-resolution scanners and microscopes and lack the resolution needed in diagnostic or clinical environments, and hence are not used in such settings. The driving question of the research presented here is whether the resolution of these images can be enhanced such that they serve the same diagnostic purpose as high-resolution images from expensive scanners or microscopes. This need is generally known as the image super-resolution (SR) problem in image processing, and it has been studied extensively. Even so, none of the existing methods work directly for slide scanner images, due to the unique challenges posed by this modality. Here, we propose a convolutional neural network (CNN) based approach, which is specifically trained to take low-resolution slide scanner images of cancer data and convert them into high-resolution images. We validate these resolution improvements with computational analysis to show that the enhanced images offer the same quantitative results. In summary, our extensive experiments demonstrate that this method indeed produces images that are similar to images from high-resolution scanners, both in quality and in quantitative measures. This approach opens up new application possibilities for low-resolution scanners, not only in terms of cost but also in access and speed of scanning, for both research and possible clinical use.

can be easily distorted in low magnification images. Addressing these concerns requires a way to improve the resolution of the images on the fly, without a substantial increase in storage and computational requirements.
The general goal of extracting high resolution features from low resolution data is known in computer vision research as super-resolution (SR). SR is a widely researched framework which aims at constructing a high resolution (HR) image given only a low resolution (LR) image (or a set of them) as input. It is applicable in scenarios where such HR images are otherwise unavailable but may be needed for downstream processing. However, solving the super-resolution problem is challenging in practice because of its ill-posed nature: there is generally no unique solution for a given LR image, since a large number of different HR images, when downsampled, can give rise to the same LR image. This issue is especially evident at higher magnification ratios. While there is no one solution that works for all SR problem domains, the issue is typically mitigated by constraining the solution space with strong domain-specific a priori information. The SR problem occurs in a number of different scenarios, such as image enhancement, analyzing range images, face recognition, as well as medical/biological applications [8][9][10][11][12]. One area in which the super-resolution problem is naturally applicable is microscopic imaging, where insights into biological function depend on the ability to observe cellular dynamics, but are sometimes limited by the temporal resolution of the acquisition devices. Note that most existing techniques for image SR are designed for natural image based applications, where images are acquired using digital cameras. These methods make use of image features such as transformed exemplars [13], textures [14] and other high-level features. However, it is often difficult to obtain such high-level cues from low resolution whole slide images, making it hard to use available off-the-shelf solvers directly to improve their resolution. Next, we describe the main focus of the paper, which is to show how the SR problem can be adapted to address this important challenge in the context of whole slide imaging.
Our Contribution: Here, we investigate whether SR techniques can be used to generate high resolution images using only low resolution WSI images as input. The generated high resolution images should match images acquired using a high-quality, expensive scanner in terms of quality and should potentially be useful for diagnostic purposes. To this end, we develop a convolutional neural network (CNN) framework with the following functionality: once the model has been trained, it generates at test time a high-resolution image corresponding to the low resolution image provided as input. CNNs have been used in the past for the SR problem [15,16], since they can learn complex transformations across image domains. But we find that such methods do not work directly for our application, and often result in an overly smooth output image. This problem is explained by a recent paper [17], which shows that in such CNN frameworks, the single output tends to look like a "smoothed" average of all the potential images that could be generated. Such smoothed images can lead to poor quality of segmentation, which is needed for identifying the different tissue types for further diagnosis.
In this paper, we propose a CNN-KNN framework for super-resolution that is specifically designed for slide imaging and addresses the issues mentioned above in the following ways: 1) We design a customized CNN framework whose architecture is tailored to such microscopic images, including smaller filter sizes and an increased number of filters. For datasets like ours with smaller sample sizes, it has been shown that filters with fewer weights often work well. The scattering transform idea by Mallat's group takes this premise to the extreme [18]. The rationale for using smaller filter sizes was motivated and informed by this body of work. In the same vein, the reasoning behind increasing the number of filters is to account for the complexity of the transformation. It is well known that the filters in a neural network learn the features that are most relevant for the learning problem. Increasing the number of filters allows the network to learn a wider range of features, and hence model more complex changes in the input-output characteristic of each layer. Among the large set of experiments we conducted, the proposed architecture yielded the most consistent, reliable and reproducible results. We believe that it was important to choose a setting that is robust across characteristics that often differ across labs, such as sample size, and this served as an important design consideration in the architecture of the proposed network. 2) In training the model, we incorporate optimization objectives beyond mean square error (MSE), which is most commonly used to measure similarity between the HR ground truth image and its reconstruction from the LR observation. In particular, we include metrics that capture human perception of image quality and can lead to higher quality reconstructions. 3) In addition, we enhance the output of the CNN by capturing fine-grained details through a nearest-neighbor search over a dictionary. Such an approach is useful in restoring image specific high-frequency components and results in a much higher quality of reconstruction. Results on two different cell lines show our method outperforms other approaches for SR reconstruction, both qualitatively and quantitatively, generating images which match the HR images in quality and can be used for subsequent end-user applications. We describe our CNN and KNN architecture in Section 3. But first, we briefly review the prior work related to image super-resolution in the next section.

Related work
Here we review the existing literature on single image super resolution techniques. Unsupervised interpolation-based methods were among the first approaches for this problem, and include linear, bicubic or Lanczos filtering [19]. These can be very fast, but usually yield solutions which are blurry and corrupted with aliasing artifacts for natural images. Edge based interpolation is another popular method for this problem [20,21]. However, the most effective approaches for the SR problem are learning-based, where a correspondence/mapping function is learnt between LR and HR images. One family of approaches under this umbrella are sparsity-based techniques. Sparse coding is an effective mechanism that assumes any natural image can be sparsely represented as a combination of dictionary elements, which can be learnt through a training process [22,23]. Other important works in this regard include Glasner [11], who exploited patch redundancies across scales within the image to drive the reconstruction, Huang [13], who extended self dictionaries to further allow for small transformations and shape variations, and [24], who proposed a convolutional sparse coding approach that improves consistency by processing the whole image rather than overlapping patches. In addition, Zhang [25] proposed a multi-scale dictionary to capture redundancies of similar image patches at different scales. Another line of algorithms is neighborhood embedding approaches, which upsample an LR image patch by finding similar training patches in a low dimensional manifold and combining their corresponding high-resolution patches for reconstruction [26]. This was later improved by [12], who formulated a more general map of example pairs using kernel ridge regression. The regression problem can also be solved in various other ways [27].
Image representations derived from deep networks have also recently shown promise for SR problems. Stacked collaborative local auto-encoders have been used [28] to construct the LR image layer by layer. [29] suggested a method for SR based on an extension of the predictive convolutional sparse coding framework. A multiple layer convolutional neural network (CNN), similar to our model and inspired by sparse-coding methods, is proposed in [15,16,30]. Chen [31] proposed multi-stage trainable nonlinear reaction diffusion (TNRD) as an alternative to CNNs, where the weights and the nonlinearity are trainable. Wang [8] trained a cascaded sparse coding network end to end, inspired by LISTA (learning iterative shrinkage and thresholding algorithm) [32], to fully exploit the natural sparsity of images. Recently, [14] proposed a method for automated texture synthesis in reconstructed images by using a perceptual loss focused on creating realistic textures. Several recent ideas have involved reducing the training complexity of the learning models using approaches such as Laplacian pyramids [33], removing unnecessary components of the CNN [34] and addressing the mutual dependencies of low and high resolution images using deep back-projection networks [35]. In addition, generative adversarial networks (GANs) have also been used for the problem of single image super-resolution [36][37][38][39]. Other deep network based models for the image super-resolution problem include [40][41][42][43].

Methods
Here, we discuss our main model for obtaining high-resolution images. But first, we briefly outline the problem setting. Let $H$ and $L$ denote the high and low resolution image sets respectively. For training, we assume that the corresponding high resolution image $H_i$ for each low resolution image $L_i$ is available. We extract patches from the low resolution image $L_i$ and represent each patch as a high-dimensional vector (where the vector for the $j$th patch is referred to as $l_{ji}$). The goal of the training process is to learn a non-linear function $f$ that, when applied to $l_{ji}$, transforms it into a high-resolution reconstructed patch $r_{ji}$; that is, $r_{ji} = f(l_{ji})$. We then aggregate all such patches to form the reconstructed high resolution image $R$. The objective driving the training process typically minimizes some metric of difference between $R$ and $H$. Most CNN based models [15,16,30] employ a number of convolutional layers to learn the complex mapping between the imaging domains. But there are some salient properties of our data which make it hard to apply these existing approaches directly to our problem. First, these models are trained by upscaling the LR input image to the size of the HR image before it is passed through the CNN layers, which often leads to higher computational cost to train the model. Secondly, these methods are not designed to handle cases where the LR image may come from a different modality, as in our case, where the complexity of the transformation is much greater. Also, most CNN based methods use the mean square error (MSE) between $H_i$ and its reconstruction $R_i$ as the similarity metric. Such a loss function is easy to minimize, but it correlates poorly with human perception of image quality and, as a result, the resulting images are sometimes blurry and/or lacking the high-frequency components of the original images. We address this issue by adding image saliency based terms to the objective. In addition, a nearest neighbor based procedure is employed to restore image specific details which may have been lost in the convolutional process. We describe the network next.
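To make the patch-level notation concrete, the following NumPy sketch (our own illustration, not the paper's code) extracts patches from an image and tiles reconstructed patches back into an image. It assumes non-overlapping patches for simplicity, whereas the actual pipeline may use overlapping patches that are averaged during aggregation.

```python
import numpy as np

def extract_patches(img, size):
    """Split a 2-D image into non-overlapping size x size patches (edge remainder ignored)."""
    h, w = img.shape[0] - img.shape[0] % size, img.shape[1] - img.shape[1] % size
    patches = (img[:h, :w]
               .reshape(h // size, size, w // size, size)
               .swapaxes(1, 2)
               .reshape(-1, size, size))
    return patches, (h // size, w // size)

def reconstruct_image(patches, grid, size):
    """Tile reconstructed patches back into an image of shape (rows*size, cols*size)."""
    rows, cols = grid
    return (patches.reshape(rows, cols, size, size)
                   .swapaxes(1, 2)
                   .reshape(rows * size, cols * size))

# Usage idea: r_patches = f(l_patches) would be the learned mapping applied patch-wise,
# followed by reconstruct_image(r_patches, grid, size) to assemble R.
```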

Convolutional neural network design
In this section, we describe the internal architecture of our convolutional neural network, see Figure 1.
Feature extraction layer: The first step in the convolution process is to extract features from the low resolution input images. Note that for most feature extraction methods, such as Haar, DCT, etc., the key problem can be posed as the task of learning a function $\hat{f}$, which takes as input the low resolution images and outputs the learned features $\hat{f}(L_i)$. Therefore the feature extraction process can be learned as a layer of the convolutional neural network, which constitutes the first layer of our network. This can be expressed as $Y_1 = \sigma(\theta_1 * L + b_1)$, where $L$ is the entire corpus of low resolution images, $*$ denotes convolution, and $\theta_1$ and $b_1$ represent the weights and biases of the first layer. The weights are composed of $n_1 = 64$ convolutions on each image patch, with each convolution filter being of size $2 \times 2$. Therefore this layer has 64 filters, each of size $2 \times 2$. The bias vector is of size $b_1 \in \mathbb{R}^{n_1}$. We keep filter sizes small at this level so as to extract more fine grained features from each patch. The function $\sigma(x)$ implements a ReLU, which can be written as $\sigma(x) = \max(0, x)$.
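As a concrete illustration, a first layer of this form could be written as follows in Keras-style TensorFlow. This is a sketch on our part; the single input channel and the 'same' padding are assumptions not stated in the paper.

```python
import tensorflow as tf

# First (feature extraction) layer: n1 = 64 filters of size 2x2 followed by a ReLU.
feature_extraction = tf.keras.layers.Conv2D(
    filters=64, kernel_size=2, padding='same', activation='relu',
    name='feature_extraction')
```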

Feature mapping layer
The second layer is similar to the previous layer except that the filter sizes are set to $1 \times 1$. The number of filters is still set to 64. The purpose of this layer is to obtain a weighted sum pooling of features across the various feature maps of the previous layer. The output of this layer is referred to as $Y_2$.
Intermediate convolutional layers: The feature mapping layer is followed by three convolutional layers. In this setting, we assume that for the $i$th layer ($i \in \{3, 4, 5\}$), the previous layer output is $Y_{i-1}$, which serves as input to the $i$th layer. The convolutional filter functions in these intermediate layers can be written as $Y_i = \sigma(\theta_i * Y_{i-1} + b_i)$, where $\theta_i$ and $b_i$ represent the weights and biases of the $i$th layer. Each of the weights $\theta_i$ is composed of $n_i$ filters of size $n_{i-1} \times f_i \times f_i$. We set $n_i = 2^{8-i}$. This makes $n_3 = 32$, and the number of filters decreases by a factor of 2 with each subsequent layer. We observe this has computational advantages, without noticeable decay in reconstruction performance. The filter sizes $f_i$ are set to $\{3, 2, 1\}$ for the three layers respectively. This is akin to first applying the non-linear mapping to a $3 \times 3$ patch of the feature map and then progressively reducing the size to 1. This structure is inspired by hierarchical CNN models, as described in [44].

Subpixel layer: The purpose of the final (6th) layer is to increase the resolution of the LR image to that of the HR image from the learnt LR feature maps. For this, we use a subpixel layer similar to the one proposed in [45]. The advantage of using such a sub-pixel layer is that all previous layers operate on the reduced LR image, which reduces the computational and memory complexity substantially. The upscaling of the LR image to the size of the HR image is implemented as a convolution with a filter $\theta_{sub}$ whose stride is $1/r$ ($r$ is the resolution ratio between the HR and LR images). Let the size of the filter $\theta_{sub}$ be $f_{sub}$. A convolution with stride $1/r$ in the LR space with a filter $\theta_{sub}$ (weight spacing $1/r$) would activate different parts of $\theta_{sub}$ for the convolution; the weights that fall between the pixels are not activated. The patterns are activated at periodic intervals of $\mathrm{mod}(x, r)$ and $\mathrm{mod}(y, r)$, where $x, y$ are the pixel positions in HR space. Alternatively, this can be implemented as a filter $\theta_6$ of size $n_5 \times r^2 \times f_6 \times f_6$, given that $f_6 = f_{sub}/r$ and $\mathrm{mod}(f_{sub}, r) = 0$. This can be written as $R = \gamma(\theta_6 * Y_5 + b_6)$, where $\gamma$ is a periodic shuffling operator which rearranges the $r^2$ channels of the output to the size of the HR image (see [46] for the detailed reasoning).
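Putting the six layers together, a minimal TensorFlow/Keras sketch of the described architecture might look as follows. This is our own illustration: the padding, the last-layer filter size and the single-channel input are assumptions, and the periodic shuffling operator $\gamma$ is realized with tf.nn.depth_to_space.

```python
import tensorflow as tf

def build_sr_network(r=2, channels=1):
    """Sketch of the described 6-layer network; the LR input stays at its native
    size until the final sub-pixel (depth_to_space) upscaling by factor r."""
    x_in = tf.keras.Input(shape=(None, None, channels))
    # Layer 1: feature extraction, 64 filters of 2x2.
    x = tf.keras.layers.Conv2D(64, 2, padding='same', activation='relu')(x_in)
    # Layer 2: feature mapping, 64 filters of 1x1.
    x = tf.keras.layers.Conv2D(64, 1, padding='same', activation='relu')(x)
    # Layers 3-5: filter counts {32, 16, 8}, filter sizes {3, 2, 1}.
    for n_i, f_i in zip([32, 16, 8], [3, 2, 1]):
        x = tf.keras.layers.Conv2D(n_i, f_i, padding='same', activation='relu')(x)
    # Layer 6: produce r^2 * channels feature maps, then periodic shuffling (gamma)
    # rearranges them into the HR image (last-layer kernel size 3 is an assumption).
    x = tf.keras.layers.Conv2D(r * r * channels, 3, padding='same')(x)
    y = tf.keras.layers.Lambda(lambda t: tf.nn.depth_to_space(t, r))(x)
    return tf.keras.Model(x_in, y)
```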

Training and loss function
The objective function based on which the CNN is trained is crucial in determining the quality of the high resolution reconstructions. Most SR systems minimize the pixel-wise mean squared error (MSE) between the HR and the reconstructed image, which, while easy to optimize, often correlates poorly with human perception of image quality. This is because the MSE estimator returns the average of a number of possible solutions, which does not perform well for high-dimensional data [14]. The paper by [47] shows that two very different reconstructions of the same image can have the same MSE error, and reconstructions based on MSE alone have been shown to be blurry and/or lack high frequency components of the original image [14,48].
To address this issue, we train our CNN using a linear combination of multi-scale structured similarity (MSSIM) and the mean square error between the reconstructed image ($R$) and the high resolution image ($H$). We briefly describe this objective next. We choose MSSIM in particular because it is better calibrated to capture perceptual measures of image quality. Also, its pixel-wise gradient has a simple analytical form and is inexpensive to compute, and can therefore be easily incorporated in gradient descent based back-propagation. MSSIM is the multi-scale extension of structured similarity (SIM), which is defined as follows. Let $x$ and $y$ be two patches of equal size from the two images $H$ and $R$ being compared. Let $\mu_x$ ($\mu_y$) denote the mean and $\sigma^2_x$ ($\sigma^2_y$) the variance of patch $x$ ($y$) respectively, and let $\sigma_{xy}$ denote their covariance. The SIM function can then be defined as
$$\mathrm{SIM}(x, y) = [I(x, y)]^{\alpha}\, [C(x, y)]^{\beta}\, [S(x, y)]^{\gamma},$$
where $I(x, y) = \frac{2\mu_x \mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1}$ is the luminance based comparison, $C(x, y) = \frac{2\sigma_x \sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2}$ is a measure of contrast difference and $S(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x \sigma_y + c_3}$ is the measure of structural differences between the two images. The $c_i$ for $i = \{1, 2, 3\}$ are small values added for numerical stability, and $\alpha$, $\beta$ and $\gamma$ are the relative exponent weights in the combination. The structured similarity between the images $H$ and $R$ is averaged over all corresponding patches $x$ and $y$. This single-scale measure assumes a fixed image sampling density and viewing distance, and may only be appropriate for a certain range of image scales. To make it more broadly applicable, a variant of SIM, called the multi-scale structured similarity (MSSIM), has been proposed. Here, the inputs $x$ and $y$ are iteratively downsampled by a factor of 2 with a low-pass filter (with scale 1 denoting the original scale). The contrast and structural components of SIM are calculated at all scales (denoted by $C_p$ and $S_p$ for scale $p$), while the luminance component is applied only at the highest scale (say $P$). The multi-scale structured similarity function can be written as
$$\mathrm{MSSIM}(x, y) = [I_P(x, y)]^{\alpha_P} \prod_{p=1}^{P} [C_p(x, y)]^{\beta_p}\, [S_p(x, y)]^{\gamma_p}.$$
In our case, all the weights in the exponents are kept the same. We compute the MSSIM using 4 different scales, and use window sizes of $4 \times 4$ to calculate the metrics across both images.
Our loss function can be written as
$$\mathcal{L}(H, R) = \rho\, \mathrm{MSE}(H, R) + (1 - \rho)\,\big(1 - \mathrm{MSSIM}(H, R)\big),$$
where $\rho$ is between 0 and 1. Since both terms in the objective are differentiable, we can train the neural network using gradient descent, adopting standard back propagation methods.
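A hedged sketch of such a training objective in TensorFlow is shown below, assuming the convex-combination form above; the value of $\rho$ and the use of tf.image.ssim_multiscale with four equally weighted scales and a 4x4 window are our reading of the text, not values reported verbatim.

```python
import tensorflow as tf

def mse_msssim_loss(rho=0.5, max_val=1.0):
    """Convex combination of pixel-wise MSE and (1 - MS-SSIM), per the objective above."""
    def loss(hr, rec):
        mse = tf.reduce_mean(tf.square(hr - rec))
        msssim = tf.reduce_mean(
            tf.image.ssim_multiscale(hr, rec, max_val=max_val,
                                     power_factors=[0.25] * 4, filter_size=4))
        return rho * mse + (1.0 - rho) * (1.0 - msssim)
    return loss
```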

Nearest neighbor enhancement
Our goal here is to enhance the output of the CNN by producing a high-quality image which retains the finer details of the original HR image. As mentioned earlier, it has been observed by [17] and others that deep learning frameworks, including CNNs, are difficult to interpret, so it is hard to know precisely how the synthesized output was generated. Furthermore, CNNs in particular produce outputs which look like a "smoothed" average of all the potential HR images. This issue manifests in the quality of images produced by any CNN method. One way to mitigate it is to introduce a post processing framework that retrieves some of the image specific details which may have been lost in the convolution process. To do this, we introduce a K-nearest neighbor based image enhancement framework similar to the concept proposed in [17]. We describe this idea next.
We adopt a dictionary based approach for the KNN learning. We use a small dataset (about 50 images) for training. We assume that the corresponding high resolution image $H_i$ for each low resolution image $L_i$ is available in the training data. We scale each $L_i$ to the size of the corresponding $H_i$ and extract patches from both types of images. For simplicity, we refer to $l_{ji}$ as the $j$th patch (represented as a vector) of $L_i$ (similar notation is used for $h_{ji}$). We extract first and second order moments for each low-resolution patch, which are appended to $l_{ji}$. This creates a collection of corresponding patches for both high and low resolution images, which we call dictionaries $D_h$ and $D_l$ (whose columns are $h_{ji}$ and $l_{ji}$ respectively) for high and low resolution patches. We also maintain the mean and variance of each HR patch $h_{ji}$.
We use a simple bi-level KNN approach which works by matching a query to the training set and returning corresponding outputs in two nested levels. Given a new test image $l$, let the corresponding output of the CNN be $r$. We scale up $l$ to the size of $r$ and extract patches from both (in the same way the dictionaries are created). For a patch $l_j \in l$, we look up the $k_1$ closest patches in the low-resolution dictionary $D_l$. Once the closest matches in the low resolution dictionary have been determined, we create a sub-dictionary by choosing only those corresponding patches from $D_h$ for which matches were found in $D_l$. Next we do another KNN search, this time using $r_j$ as the query, and search the sub-dictionary for $k_2 < k_1$ nearest neighbors. The high resolution patches found in this search are averaged to create a template patch $h^t_j$. $r_j$ is then enhanced so that its mean and variance match the template patch $h^t_j$, generating a new patch $\hat{r}_j$. All such patches $\hat{r}_j$ are aggregated to produce the reconstructed image $\hat{r}$.
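The bi-level search can be sketched as follows using scikit-learn's nearest-neighbor index; the values of $k_1$ and $k_2$ and the moment-matching details are illustrative placeholders, since the paper reports a Matlab/CUDA implementation rather than this code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def bilevel_knn_enhance(l_patches, r_patches, D_l, D_h, k1=20, k2=5):
    """Bi-level nearest-neighbor enhancement of CNN output patches.
    l_patches: vectorized patches of the upscaled LR input (moment features appended).
    r_patches: corresponding vectorized patches of the CNN reconstruction.
    D_l, D_h: row-aligned low/high-resolution dictionaries (rows = patch vectors)."""
    nn_l = NearestNeighbors(n_neighbors=k1).fit(D_l)
    out = np.empty_like(r_patches)
    for j, (l_j, r_j) in enumerate(zip(l_patches, r_patches)):
        # Level 1: k1 closest LR dictionary entries define a HR sub-dictionary.
        idx = nn_l.kneighbors(l_j[None], return_distance=False)[0]
        sub_h = D_h[idx]
        # Level 2: k2 < k1 nearest HR patches to the CNN output, averaged into a template.
        nn_h = NearestNeighbors(n_neighbors=k2).fit(sub_h)
        idx2 = nn_h.kneighbors(r_j[None], return_distance=False)[0]
        template = sub_h[idx2].mean(axis=0)
        # Match the mean and variance of r_j to the template patch.
        std_r = r_j.std() + 1e-8
        out[j] = (r_j - r_j.mean()) / std_r * template.std() + template.mean()
    return out
```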
We use a patch size of $3 \times 3$ for the nearest neighbor search. The implementation is done in Matlab using CUDA-based fast KNN libraries and the parallel programming toolbox, which makes this bi-level KNN implementation efficient in practice. An example of the CNN+KNN output compared to the CNN output is shown in Figure 2. The images obtained from the CNN+KNN pipeline are sharper and capture minute details such as tissue characteristics better than the CNN output alone. We also find the resulting images have smaller reconstruction error, with the average PSNR increasing by 2 units.

Experiments
We performed experiments to evaluate our SR approach on two large tissue microarray (TMA) datasets: a breast TMA dataset consisting of 202 images [49], and a kidney TMA dataset with 129 images [50]. TMAs are a popular histopathology format as they provide many separate patient tissue cores in array fashion, allowing multiplexed histological analysis. For each dataset, we train on a subset of the HR images and then use the trained model to reconstruct HR images from the LR images. We compare our method with 6 other approaches: bicubic interpolation, which is a standard baseline, the patch based sparse coding approach (ScR) [23,51,52], the deep learning approach (CSCN) [8,53], the convolutional neural network based framework (FSRCNN), a sparse coding based dictionary learning method implemented using deep learning (SCDL) [54], and a GAN based implementation of SR (SRGAN) [55,56]. All methods were trained with the same training batch of images. We evaluate the following aspects in our experiments: 1) how well the obtained reconstruction matches the high resolution image, 2) how the resulting segmentation quality is affected when using the reconstructed images, 3) how the model parameters affect the quality of reconstruction, and finally 4) the running time of our model. But before that, we briefly discuss the acquisition setup.

Materials and methods
Human tissue microarrays: A human renal cell carcinoma tissue microarray (TMA) block was constructed by the Translational Research Initiatives in Pathology (TRIP) lab at the University of Wisconsin-Madison (UW-Madison). A section of 5 µm thickness was cut from the TMA block containing 600 µm diameter tissue cores. The section was then placed on a glass slide, stained with standard hematoxylin and eosin (H&E), and mounted under a 1.5 glass coverslip. Different tissue cores were from different patients. This TMA slide was originally prepared for another study [50]. A tissue microarray containing tumor tissue cores from 207 breast cancer patients was used for the analysis. Five samples were excluded because of staining issues or sample folding on the slide, so a total of 202 images were selected for this study. This TMA had previously been made in our lab and used by Conklin; full details can be found in [49].

Imaging systems: High resolution images were acquired and digitized at 20x using an Aperio CS2 Digital Pathology Scanner (Leica Biosystems) [57], with 4 pixels per micron, and low resolution images were acquired and digitized using a PathScan Enabler IV [58], with 0.29 pixels per micron.

Reconstruction quality
Comparison with state of the art: We evaluate the reconstruction quality of the images obtained by our approach relative to the HR ground truth image using eight different metrics: 1) root mean square error (RMSE), 2) signal to noise ratio (SNR), 3) structured similarity (SSIM), 4) mutual information (MI), 5) multi-scale structured similarity (MSSIM), 6) information fidelity criterion (IFC) [59], 7) noise quality measure (NQM) [60] and 8) weighted peak signal-to-noise ratio (WSNR) [61]. RMSE should be as low as possible, whereas SNR, SSIM (1 being the maximum), MSSIM (1 being the maximum) and the remaining metrics should be high for a good reconstruction. We use the same evaluation metrics for the other 6 methods. Also note that SNR and RMSE are correlated measures. For these experiments, we set the resolution difference to a factor of 2. The results are shown in Table 1 for breast images and Table 2 for kidney images. We can see that our method outperforms the other algorithms on most of the metrics used. This is especially true for the SIM and MSSIM measures, which are known to have high correlation with human perceptual scores. Qualitative reconstruction results of our method are shown in Figure 3 for breast cells and Figure 4 for kidney cells. Results of reconstructions by other methods are shown in Figure 5. These results show that for comparable methods such as SCDL, ESCNN and SRGAN, where the reconstruction looks most similar (in Rows 2 and 4) to the high resolution image, the MSSIM values are higher relative to the SNR. Note that our MSSIM values are the best among all the comparable methods on both datasets.

Reconstruction as a function of frequency: The above metrics summarize the quality of reconstruction as a single scalar value. However, it has been observed that a substantial amount of information may be lost whenever the characteristics of a high-dimensional image are summarized by a single scalar. In order to see the performance of the reconstruction algorithms with respect to spatial frequencies, we use the ESP algorithm proposed by [47], which outputs the Fourier radial error spectrum plot and provides a view of how the reconstruction error varies across different spatial frequency components. The results are shown in Figure 6 for a randomly chosen kidney image. We see that our algorithm reconstructs the image much better at the lower frequencies, whereas at higher frequencies all methods perform similarly. This makes intuitive sense: while all methods are able to capture the high-frequency components of the reconstruction, such as edges, our method outperforms the others in capturing the subtle variations in the images, which often correspond to changes in tissue density or other biological structures.
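For reference, several of the scalar metrics above (RMSE, a PSNR-style SNR, and SSIM) can be computed with off-the-shelf routines such as those in scikit-image; this is a convenience sketch, not the evaluation code used in the paper, and MI, IFC, NQM and WSNR would require separate implementations.

```python
import numpy as np
from skimage.metrics import (mean_squared_error, peak_signal_noise_ratio,
                             structural_similarity)

def basic_metrics(hr, rec, data_range=1.0):
    """A few of the reconstruction metrics used above, for grayscale float images."""
    rmse = np.sqrt(mean_squared_error(hr, rec))
    psnr = peak_signal_noise_ratio(hr, rec, data_range=data_range)
    ssim = structural_similarity(hr, rec, data_range=data_range)
    return {'RMSE': rmse, 'PSNR': psnr, 'SSIM': ssim}
```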

Quality of segmentation
Pathological diagnosis largely depends on nuclei localization and shape analysis. We used a simple color segmentation method to segment the nuclei: K-means clustering partitions the image into four different classes based on pixel values in Lab color space [62]. Following this, we take the Hadamard product of each class mask with the gray level version of the original bright-field image, compute the average pixel intensity in each class, and assign the class with the lowest value to the cell nuclei. To evaluate our results, we compare the segmentation of the reconstructed images with the results from the HR images (ground truth) for 50 samples each from the breast and kidney groups, by computing the misclassification error, i.e., the percentage of pixels misclassified. We compare our algorithm with the other methods used in the previous section.
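A minimal sketch of this segmentation step is given below, assuming scikit-image and scikit-learn; the exact clustering settings used in the paper are not specified, so the parameters here are illustrative.

```python
import numpy as np
from skimage.color import rgb2lab, rgb2gray
from sklearn.cluster import KMeans

def segment_nuclei(rgb_image, n_classes=4, seed=0):
    """K-means segmentation into 4 classes in Lab space; the class with the
    lowest mean gray-level intensity is taken as the nuclei."""
    lab = rgb2lab(rgb_image)
    h, w, _ = lab.shape
    labels = KMeans(n_clusters=n_classes, random_state=seed).fit_predict(
        lab.reshape(-1, 3)).reshape(h, w)
    gray = rgb2gray(rgb_image)
    class_means = [gray[labels == c].mean() for c in range(n_classes)]
    nuclei_mask = labels == int(np.argmin(class_means))
    return nuclei_mask, labels
```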
The results show that the number of pixels misclassified in images generated using our method is in most cases lower than for the other methods compared. Qualitative results of the segmentation masks (with blue lines as boundaries) are shown in Figure 7.

Model parameters
Here, we study how model parameters affect the reconstruction output. First, we study the effect of the CNN model parameters on the reconstructed image by systematically varying the filter sizes, the number of filters and the number of layers. These results are obtained before applying the nearest neighbor enhancement, to avoid confounds related to the mixed effects of the two stages of the reconstruction process (CNN and KNN). These results help justify the different choices made in the network design. We then examine the performance of our algorithm as a function of the resolution ratio. We discuss these issues next.
Filter Size: In order to study the network's sensitivity to filter sizes, we conducted a number of experiments with different filter sizes. As mentioned earlier, our experiments use filter sizes of 2, 1, 3, 2 and 1 for the first five consecutive layers, denoted as 2-1-3-2-1. In addition, we varied the filter sizes at the input layers (5-2-3-2-1) and at the output layers (2-1-3-5-2) and measured the PSNR in each case. Finally, we also ran experiments in a setting where both the input and output layer filter sizes are increased (10-5-6-10-5). The results are shown in Figure 8. As we can see, the reconstruction error increases slightly when the filter sizes on the input side are changed, and more so when they are increased on the output side. We see a significant increase in the error when the filter sizes are increased to 10-5-6-10-5. This shows that for this application, small filter sizes work best for a good reconstruction.
Number of Filters: Here we study the performance of our algorithm with respect to the number of filters. Recall that in our network design, the number of filters at each layer is assigned as follows: $n_1$ and $n_2$ are set to 64, and $n_3$, $n_4$ and $n_5$ are set to 32, 16 and 8 respectively. The results of varying these filter counts are shown in Figure 9. We see that in our case, keeping the number of filters fixed at all layers, or doubling the number of filters, has a negligible effect on the quality of reconstruction. This was observed by Dong as well [15], where increasing the number of filters had a marginal effect on the corresponding PSNR values. However, larger filter counts increase the computational time. Considering this, we find that filter counts of {64, 32, 16, 8} are a good choice for our model.

Number of Layers
It is generally believed that increasing the depth of a CNN by adding more layers improves the performance of the learning framework. However, for the image super resolution problem in particular, Dong [15] observed that "the effectiveness of simple deeper structures for super-resolution is not apparent" as it is for image classification tasks. We evaluate the effect of the depth of the network on reconstruction performance. For this, we apply additional convolutional layers after the sub-pixel upscaling layer; that is, the output of our network is passed as input to additional layers (identical to the convolutional layers in our network) which are stacked at the output. The results are shown in Figure 10. Similar to Dong, we do not observe significant changes in the quality of reconstruction as the number of layers increases.
Reconstruction as a function of resolution: We performed experiments to study the performance of our algorithm as the resolution ratio varies. Using a randomly selected subset of 20 images from the kidney dataset, we resized the images such that the resolution difference is a factor of {2, 3, 4} respectively. The training was done separately for each setting. We compared the mean PSNR values in each case; see Table 4. The PSNR is best when the resolution difference is lowest and degrades as the resolution factor increases, which is expected. However, the change in reconstruction error is not large, indicating that our method still performs well at higher resolution factors.

Running time
Here, we briefly discuss the computational issues related to our model. Since each of the prior methods discussed earlier has been implemented on different platforms and with different libraries, it is not
possible to do a meaningful and fair runtime comparison of these methods. Therefore, we report the computational times of our method only. We implemented our model in TensorFlow using Python, which has built-in GPU support. We used a workstation with an AMD processor with a 2.4 GHz CPU, 16 GB RAM and an NVIDIA Quadro K2200 graphics card. All our experiments were performed on the GPU, which shows significant performance gains compared to CPU runtimes. The training time of our models depends on various factors such as dataset volume, network size, learning rate, batch size and number of training epochs. To report training times, we fix the network size to {64, 32, 16, 8}, the learning rate to $10^{-3}$, the dataset volume to 100 images, the batch size to 10 and the number of training epochs to $10^5$. We then train the network for the different resolution factors {2, 3, 4} and record the training time for each setting. The results are shown in Figure 11 (runtime as a function of resolution).
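For concreteness, a training configuration with the reported hyperparameters might be wired up as below; the Adam optimizer and the reuse of the earlier sketches are our assumptions, since the paper only specifies gradient descent with back-propagation.

```python
import tensorflow as tf

# Hypothetical wiring of the reported settings: learning rate 1e-3, batch size 10,
# network filter counts {64, 32, 16, 8}; the choice of Adam is an assumption.
model = build_sr_network(r=2)                      # network sketch shown earlier
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=mse_msssim_loss(rho=0.5))       # loss sketch shown earlier
# model.fit(lr_patches, hr_patches, batch_size=10, epochs=...)  # data preparation not shown
```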
The results show that the running time of our method has an almost linear dependence on the resolution factor, since all images go through the same number of convolutions. Generating a new high resolution image once the network is trained takes 2-3 minutes. The test-time speed of our model can be further accelerated by approximating or simplifying the trained networks, with a possible slight degradation in performance.

Future directions
This paper provides an interesting way to utilize low-resolution images produced by slide scanners in end-user diagnostic applications. Until now, this was not feasible due to the lack of discriminatory features of tissue types, which are not observable in low-resolution images. In addition, this work leads to several interesting ideas which we will pursue as future work. We discuss these briefly next.
1. Besides the slide scanner images and the high-resolution images (20x), we have also acquired two intermediate resolutions. This gives us a sequence of resolutions for each image. This data is one of a kind and gives rise to a unique problem for super-resolution methodologies, where the data has a sequence structure. Note that the closest related work is multi-frame SR [63] applied to video reconstruction, which makes use of the motion across video frames. In our case, not only do we lack such motion information to help the reconstruction, but the number of frames is also far smaller than in traditional video sequences.
2. One of our future goals is to generalize our technique so that it can learn the mapping between any two modalities. In particular, we will adapt our technique to generate another modality, such as phase contrast images, given the high-resolution images as input. This would provide a way to automatically generate specific modalities without the need for an actual acquisition procedure.
3. Our last goal is to make our model scalable to large datasets and to accelerate the CNN so that it produces high-resolution images with a reduced computational and memory footprint. To do this, we will adopt recent developments in deep learning which show that one can substantially improve the running time of deep CNNs by approximating them with linear filters, among other related ideas [64].

Conclusion
This paper provides an efficient way to utilize LR slide scanner images for more fine-grained pathological diagnosis by generating high quality reconstructed images that perform similarly to images from expensive scanners. Experiments show promising results when compared against state-of-the-art methods on a number of test images. This approach is not only more cost effective than currently used approaches but may also open up new opportunities in histopathology research and clinical application, due to the ease of use and quick scanning speed of low-resolution scanners relative to their high-resolution counterparts.