Virtual UV Fluorescence Microscopy from Hematoxylin and Eosin Staining of Liver Images Using Deep Learning Convolutional Neural Network

The use of UV (ultraviolet) fluorescence light in microscopy improves image quality and allows the observation of structures that are not visible in the visible spectrum. The disadvantage of this method is the degradation of microstructures in the slide due to exposure to UV light. The article examines the possibility of using a convolutional neural network to perform this type of conversion without damaging the slides. Using hematoxylin and eosin stained slides, a database of image pairs was created for visible light (halogen lamp) and UV light. This database was used to train a multi-layer unidirectional convolutional neural network. The results of the study were assessed subjectively and objectively using the SSIM (Structural Similarity Index Measure) and SSIM (structure only) image quality measures. The results show that it is possible to perform this type of conversion (the studies used liver slides at 100× magnification), and in some cases there was an additional improvement in image quality.


Introduction
Classical optical microscopy is one of the foundations of medical and biological research [1,2]. The classic solution is based on the human eye, and many methods have been developed to improve image quality in order to highlight the desired features. Various physical methods are used, such as microstructure illumination methods and the use of optical filters, including multispectral and hyperspectral methods, as well as biochemical methods, where image segmentation is supported by dedicated staining [3].
An extremely important element in the development of microscopy, although often overlooked, is the change in the method of image analysis. The introduction of computer image analysis methods combined

Contribution of the Paper and Related Works
The paper proposes a solution for the digital conversion of the image of a slide stained with H&E (hematoxylin and eosin) [13] to a form resembling an image obtained with the use of fluorescence. The main contributions of the paper are:
• the possibility of using a digital bright field microscope in place of a fluorescence microscope,
• limiting the influence of photobleaching and photodamage on the slide microstructure,
• an analysis of the possibility of using deep learning convolutional neural networks to implement this type of conversion.
As the research material, slides with sections of human livers were used. The confirmation of the possibility of performing this type of image conversion gives us a chance to use a similar method to convert images of other biological structures.
The work is inspired by other research in the field of microscopic image conversion. One of the active areas of research is the digital staining of slide images [14][15][16][17][18]. As the staining process is a biochemical process, it requires proper preparation of a section, the use of chemical reagents (often expensive or hazardous to health), and time and human resources. The use of deep learning neural networks is an interesting alternative to classical biochemical staining.
Another area of inspiration for research is improving image quality. The microscopic images often have poor contrast and sharpness. This is due to many factors such as the quality of biological material, the quality of the staining process, and the quality of the image acquisition system (cameras and lenses).
A serious problem is the influence of thickness and thickness variation of biological material during the cutting. The use of computer methods allows one to improve the visual quality of the image. Additionally, it is possible to use super resolution methods to increase image resolution [19][20][21][22].

Content of the Paper
Section 2 describes the microscopic images of the liver sections. The same section introduces the virtual imaging solution and describes the individual blocks of the solution. Section 3 presents the results of the image processing, in particular the results for the applied image quality assessment metrics. The discussion takes place in Section 4. Final conclusions and future work are described in Section 5.

Materials and Methods
The proposed and tested solution uses a convolutional neural network [23][24][25] that transforms the recorded color images (RGB) into a virtual fluorescence UV image. In the learning and testing phases, the solution presented in Figure 1 is used. During normal operation, the solution presented in Figure 2 is used. Individual blocks are described in the following subsections.

Liver Microscopic Images
The material consisted of 52 slides showing cases of liver steatosis obtained from autopsy cases. There are various causes of fatty liver disease, including the use of ethyl alcohol, which can manifest as liver steatosis, steatohepatitis, and cirrhosis. It is well known that excessive ethanol consumption is a major public health problem and causes over 60% of chronic liver disease in some countries [26]. It should be noted that short-term consumption of 80 g of ethanol per day usually causes mild, reversible hepatic changes, including fatty liver. Chronic exposure to 40-80 g of alcohol per day can cause serious injury [26].
Steatosis is also a manifestation of non-alcoholic fatty liver disease (NAFLD), which can mimic the spectrum of changes seen in fatty liver in alcoholics. NAFLD is often associated with metabolic syndrome [26].
Hepatocellular steatosis begins with the accumulation of fat in the centrilobular hepatocytes (the epithelial cells of the liver). Steatosis is usually observed in two forms: microvesicular, where the lipid droplets are small, and macrovesicular, where the lipid droplets are large [26].
Both types of steatosis are very often observed in one slide. As a result, the hepatocyte takes the shape of a ring with the nucleus displaced to the peripheral part of the cell, which mimics the fatty tissue cells, adipocytes [27]. There are various causes of steatosis, not just alcohol consumption, including agents such as amiodarone and methotrexate. Drug/toxin-associated liver injury with steatosis manifests itself microscopically mainly as hepatocellular microvesicular steatosis of the liver, often associated with acute liver dysfunction [26].
An exemplary pair of acquired images of steatosis is shown in Figure 3. Hepatocytes with lipid droplets that were dissolved during tissue processing (dehydration in a series of alcohols), and signet-ring hepatocytes with a large lipid droplet in the cytoplasm, are visible as white rounded spots in the H&E image (black in the UV fluorescence image) [27], with a compressed nucleus in the peripheral part of the hepatocyte.
There are also hepatocytes without steatosis, which are large polyhedral cells. Regular hepatocytes have at least six surfaces. In H&E staining, the cytoplasm is eosinophilic. The nuclei of hepatocytes are large and spherical; cells with two or more nuclei are also observed, some of them polyploid. The tubular space between two hepatocytes is called the bile canaliculus, and the hepatocytes are also in contact with the wall of the sinusoid through the perisinusoidal space of Disse [28].

Data and Acquisition System
The study used our own database of liver samples, which were described by a pathomorphologist. The database contains 52 slides stained with H&E. An Imager D1 microscope (Carl Zeiss) and an AxioCam MRc5 camera with 2584 × 1936 resolution in the visible range were used for image acquisition, with an FS 43 filter set for fluorescence. The images were recorded in pairs, i.e., for a specific position of the slide in relation to the lens, an RGB image (bright field mode) and a UV image (fluorescence mode) were recorded successively at 100× magnification (10× objective). Images were recorded with the AxioVision (Zeiss) software in the uncompressed ZVI format and saved in 16-bit mode. Since the camera is a color camera with a Bayer filter matrix, some mosaic artifacts related to the demosaicing algorithm are observed [29]. The acquisition was done with the same camera settings (in particular the exposure time). Images were converted to 16-bit-per-channel TIFF using the Bio-Formats library/tools ('bfconvert' command) [30].

UV (RGB) Image Redundancy
Since the camera is a color RGB camera, the UV image may contain redundant information. The UV image is the result of the interaction of monochromatic light with biological structures. The emitted light is usually close to monochromatic, which allows the use of a more compact color space than RGB. In order to verify whether the UV (RGB) image can be converted to a grayscale UV image, the PCA (Principal Component Analysis) algorithm [31] was used. The PCA algorithm estimates a new non-redundant color space for the input RGB image [32][33][34]. This is important for the training algorithm, because if the output image can be a grayscale image, the number of parameters (weights) to be trained is smaller. An RGB color output image requires three times more weights in the last layer, and reducing the number of parameters is important for the speed of convergence and for reducing the generation of false colors.
An example of removing redundancy from a UV RGB image is shown in Figure 4. In order to improve the legibility of the plot of the first three principal components, a random sample of 1% of the pixels is shown.
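As a rough illustration of this redundancy check, the channel-wise PCA can be sketched in Python/NumPy. This is not the paper's Matlab implementation; the function name, array shapes, and the synthetic test image are assumptions for illustration only:

```python
import numpy as np

def pca_principal_component(rgb_image):
    """Project an RGB image onto its first principal component.

    Returns the grayscale projection and the fraction of variance
    explained, which indicates how redundant the three channels are.
    """
    h, w, _ = rgb_image.shape
    pixels = rgb_image.reshape(-1, 3).astype(np.float64)
    pixels -= pixels.mean(axis=0)            # center each channel
    cov = np.cov(pixels, rowvar=False)       # 3x3 channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    first_pc = eigvecs[:, -1]                # direction of maximum variance
    explained = eigvals[-1] / eigvals.sum()
    gray = (pixels @ first_pc).reshape(h, w)
    return gray, explained

# A near-monochromatic UV image: the channels are strongly correlated,
# so one principal component captures almost all of the variance.
rng = np.random.default_rng(0)
base = rng.random((64, 64, 1))
uv_like = np.repeat(base, 3, axis=2) + 0.01 * rng.random((64, 64, 3))
_, explained = pca_principal_component(uv_like)
print(explained)
```

A variance fraction close to 1 for the first component supports replacing the three-channel UV image with a single grayscale channel, as done in the paper.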

Spatial Alignment of Pairs of Images
Care was taken not to shift the stage with the slide while switching the optical system between the halogen and UV lighting modes. However, even slight mechanical vibrations can shift it, which is critical to the learning process. For this reason, the images were automatically aligned. The maximum observed shift was 25 pixels, which is unacceptable for a valid conversion.
Image alignment was limited to X-Y translation (image rotation was not necessary), which allowed for good convergence of the alignment process. Before alignment, it was necessary to adjust the image characteristics. Due to the difference in how the slide is illuminated (from the top for the UV lamp and from the bottom for the bright field halogen lamp), it was necessary to invert one of the images. Only one channel of the H&E RGB image (blue) was used. Automatic image alignment was done using the 'imregcorr' function [35] from Matlab.
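A translation-only alignment of the kind performed by 'imregcorr' can be approximated with phase correlation. The NumPy sketch below is an illustrative stand-in (not the Matlab routine used in the paper) that recovers an integer X-Y shift:

```python
import numpy as np

def estimate_translation(ref, moving):
    """Estimate the integer (dy, dx) shift of `moving` relative to `ref`
    using phase correlation (translation only, no rotation)."""
    cross = np.conj(np.fft.fft2(ref)) * np.fft.fft2(moving)
    cross /= np.abs(cross) + 1e-12           # keep only the phase
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Shifts beyond half the image size wrap around to negative values.
    h, w = corr.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

# Synthetic check: shift a random image and recover the translation.
ref = np.random.default_rng(1).random((128, 128))
moving = np.roll(ref, (5, -8), axis=(0, 1))
print(estimate_translation(ref, moving))  # → (5, -8)
```

Restricting the model to pure translation, as in the paper, keeps the correlation peak sharp and the alignment convergent even for images with different illumination characteristics.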

Contrast Normalization
Contrast normalization using the z-score was applied as part of image preprocessing [36]. Slides can vary in thickness, which affects the amount of light that is transmitted or reflected.
Normalization using the minimum and maximum values of the image is not a good choice, as there are often various artifacts in the slides that can strongly distort the normalization. For this reason, the normalization parameters were obtained by fitting a Gaussian model to the histogram. The slide histograms are in fact a Gaussian mixture [37] with two peaks plus asymmetry, but this type of normalization is far less sensitive to outliers, which are the main problem of min-max normalization. An example of normalization is shown in Figure 5. For the 'uint8' data type, the following formula is used for normalization:

Y(x, y) = (X(x, y) − μ) / σ

where x and y are pixel coordinates, X is the input pixel value, Y is the output pixel value, and μ and σ are the mean and standard deviation of the fitted Gaussian model.
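A minimal sketch of z-score contrast normalization for 'uint8' images is shown below. The target mean/std used to map the z-scores back to the uint8 range are illustrative assumptions (the paper does not state its rescaling constants), and for brevity the statistics are taken from the whole image rather than from a Gaussian fit to the histogram:

```python
import numpy as np

def zscore_normalize_uint8(image, target_mean=127.0, target_std=32.0):
    """Z-score contrast normalization for a uint8 image.

    target_mean and target_std are illustrative assumptions used to map
    the z-scores back into the 0-255 range.
    """
    x = image.astype(np.float64)
    z = (x - x.mean()) / (x.std() + 1e-12)   # z-score per pixel
    y = target_mean + target_std * z         # rescale to the uint8 range
    return np.clip(y, 0, 255).astype(np.uint8)

# A low-contrast image is mapped to a fixed mean and spread.
img = (np.random.default_rng(2).random((64, 64)) * 60 + 100).astype(np.uint8)
out = zscore_normalize_uint8(img)
print(out.dtype, out.mean(), out.std())
```

Because the statistics come from a model of the histogram rather than the raw min/max, isolated bright or dark artifacts barely affect the result, which is the robustness property the paragraph above describes.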

Deep Learning Convolutional Neural Network (ConvNN)
A unidirectional multi-layer convolutional network was proposed for the image conversion task. It is a heteroassociative network with four convolutional layers. The goal of training this type of network is to obtain an approximator (using regression) that reconstructs the grayscale UV image (the main PCA component) from the RGB color image. The regression criterion is the MSE (mean squared error). The input image and the training target are normalized. The network diagram is presented in Figure 6, and the exact configuration of the layers is given in Table 1.
The neural network was trained with the following parameters: the training algorithm is SGDM (stochastic gradient descent with momentum) [38,39], the initial learning rate is 0.05, the momentum is 0.9, the gradient threshold is 0.05, the mini-batch size is 50, and the maximum number of epochs is 50. The mini-batch size was chosen for maximum use of GPU memory.
Matlab R2020a with some modifications was used for training. The input images have a resolution of 1000 × 800 pixels. Such a large size is problematic due to the amount of RAM available on the GPU and the need to store several images simultaneously. A typical solution is training with lower-resolution patches selected at random positions from the original images. This network uses patches with a size of 256 × 256 pixels. One problem is that Matlab's 'randomPatchExtractionDatastore' function creates pairs of patches with the same resolution. This forces the network to generate output pixels at the edges, which is associated with a number of artifacts. For this reason, a custom function was created that creates pairs of sizes 256 × 256 × 3 and 242 × 242 × 1 for the network input and output, respectively.
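The custom pairing logic can be sketched as follows (a hypothetical Python re-implementation, not the author's Matlab code); the 7-pixel margin follows from (256 − 242) / 2:

```python
import numpy as np

def random_patch_pair(rgb, uv, in_size=256, out_size=242, rng=None):
    """Extract one aligned training pair: an in_size x in_size x 3 input
    patch and the centered out_size x out_size target patch.

    The smaller target matches the 'valid' output size of the network,
    so no border pixels have to be synthesized."""
    rng = rng or np.random.default_rng()
    h, w, _ = rgb.shape
    top = int(rng.integers(0, h - in_size + 1))
    left = int(rng.integers(0, w - in_size + 1))
    x = rgb[top:top + in_size, left:left + in_size, :]
    margin = (in_size - out_size) // 2       # 7 pixels on each side
    y = uv[top + margin:top + margin + out_size,
           left + margin:left + margin + out_size]
    return x, y

# Shapes match the 256 x 256 x 3 input and the 242 x 242 target.
rgb = np.zeros((800, 1000, 3), dtype=np.uint8)
uv = np.zeros((800, 1000), dtype=np.uint8)
x, y = random_patch_pair(rgb, uv, rng=np.random.default_rng(0))
print(x.shape, y.shape)  # → (256, 256, 3) (242, 242)
```

Keeping the target strictly inside the input patch means every output pixel is predicted from a full receptive field, avoiding the edge artifacts mentioned above.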

Evaluation of Results
The image conversion method was tested with the use of a second database derived from the original database. The resolution of the original images is high, and only a fragment of each image was used in the learning process. Other parts of the original images were used for testing. The learning criterion is the MSE, but image quality metrics are better suited to judging the final image conversion. There are many image quality metrics, both objective and perceptual (subjective) [40][41][42].
One of the metrics is the SSIM index [43] (Structural Similarity Index Measure), which combines three internal terms: the luminance term, the contrast term, and the structure term. SSIM enables the weighting of the individual components; the results report the overall SSIM (all weights equal to 1) and the SSIM for the structure component only (weights for the luminance term α and the contrast term β equal to 0, weight for the structure term γ equal to 1). The following formula is used to calculate the SSIM index:

SSIM(A, B) = l(A, B)^α · c(A, B)^β · s(A, B)^γ

where A and B are the reference and tested images, l(., .) is the luminance function, c(., .) is the contrast function, and s(., .) is the structure function.
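For intuition, the three SSIM terms can be computed globally over the whole image in a simplified sketch. Standard SSIM implementations use a sliding Gaussian window, so this single-window version is an illustration only, with the usual constants C1 = (0.01 L)², C2 = (0.03 L)², and C3 = C2/2:

```python
import numpy as np

def ssim_terms(a, b, data_range=255.0):
    """Global (single-window) luminance, contrast, and structure terms
    of SSIM, computed over the whole image for illustration."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    c3 = c2 / 2.0
    mu_a, mu_b = a.mean(), b.mean()
    sig_a, sig_b = a.std(), b.std()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    l = (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)
    c = (2 * sig_a * sig_b + c2) / (sig_a ** 2 + sig_b ** 2 + c2)
    s = (cov + c3) / (sig_a * sig_b + c3)
    return l, c, s

# Identical images give all three terms equal to 1, hence SSIM = 1.
img = np.arange(64, dtype=np.float64).reshape(8, 8)
l, c, s = ssim_terms(img, img)
print(l * c * s)  # → 1.0
```

Setting α = β = 0 and γ = 1 leaves only the s term, which is exactly the "structure only" variant used in the results: a uniform brightness offset changes l but leaves s at 1.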

Results
The results of the work of the entire system, in particular the neural network, are presented in this section. Two SSIM metrics were used to evaluate the quality of the virtual slides.

Mechanical Shifts
The estimated shifts for pairs of images are shown in the histogram in Figure 7. Most position errors are a few pixels (in the Euclidean metric), but there are occasional outliers. Shift estimation allows these acquisition errors to be compensated.

Exemplary Results
The testing database is derived from the original image database; the testing pairs are not part of the training base and are independent of it (no overlap).
There are 416 image pairs for network testing. Additionally, the results for the same number of training pairs are presented. A neural network usually performs better on training images than on testing images. The input H&E image has a resolution of 256 × 256, and the reference UV and output images have a resolution of 242 × 242.
Sample training and testing images are shown in Figures 8 and 9, where H&E is RGB input image, UV (main PC) is a preprocessed UV training image and virtual UV is the output image of the convolutional neural network.

SSIM and SSIM (Structure Only) Metrics
SSIM values range from 0 to 1, with 0 being no similarity and 1 being a perfect match between a pair of images. As the database is relatively large, instead of providing values for several images or only the average value for the databases, it was decided to use the Monte Carlo methodology [44,45] and show histograms of the SSIM indexes.
The histograms for the full SSIM metric and the simplified (structure only) are shown in Figure 10.

Discussion of Results
The paper shows that the preparation of learning data requires a thorough analysis of the acquisition process. Mechanical problems were identified and an algorithmic solution was proposed for compensation.
The proposed solution confirms the possibility of converting an H&E color image to a virtual UV image. The randomly selected results presented in Figures 8 and 9 show a very high similarity between the real UV image (main PC) and the image created by the neural network. It should be noted, however, that not all elements of the structure are mapped; for example, in Figure 8c, the dark area in the center of the image.
The output images have not undergone additional contrast correction, as it may change image perception (an image with higher contrast is usually considered better). The last layer of the neural network is of the regression type, so the mapping comparison should be made without contrast correction of the output image.
A feature of UV images is their higher optical resolution relative to images in the visible light range. This means that the neural network can, to some extent, perform image enhancement in terms of improving resolution.
In the subjective perceptual assessment of the images, the network appears to reproduce smaller structures better than larger ones. This is visible as a loss of brightness for larger uniform structures, for example the light areas in Figure 9c or the dark areas in Figure 8b,c (sinusoids). Since the network is of the convolutional type, individual convolution operations have a limited range of influence, both within one layer and across all layers. Increasing the size of the convolution kernels could improve the quality of the results, but it would be necessary to increase their number and the training time.
The loss of brightness for larger areas means that the neural network behaves like a high-pass or band-pass filter. This does not mean that the low-frequency spatial component is completely removed. Figure 9c shows the differences between the original UV image (main PC) and the reconstruction. The neural network introduces fringes around the edges (especially around the regions of lipid droplets that were dissolved during tissue processing). Interestingly, the introduction of this type of artifact improves the visual quality of the image; this is typical of unsharp mask filters [46]. The reconstructed image is visually sharper than the original. Although the image seems better to humans, this type of operation results in a deterioration of the SSIM values. For this reason, an attempt to unambiguously assess the quality with the use of SSIM is not fully adequate. The use of general blind measures is also not correct, because histological images have their own specificity. Since the neural network can correct images, metric values should be treated with caution.
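The unsharp-mask effect described above can be illustrated with a minimal NumPy sketch (the box blur, wrap-around padding, and parameters are illustrative choices, not the filter from [46]): sharpening amplifies the difference between the image and its blurred version, which overshoots at edges and produces exactly the kind of fringing the network exhibits.

```python
import numpy as np

def box_blur(img, k=5):
    """Separable moving-average blur (wrap-around padding for brevity)."""
    kernel = np.ones(k) / k
    blur_1d = lambda r: np.convolve(
        np.pad(r, k // 2, mode="wrap"), kernel, mode="valid")
    out = np.apply_along_axis(blur_1d, 0, img)   # blur columns
    return np.apply_along_axis(blur_1d, 1, out)  # then rows

def unsharp_mask(img, amount=1.0, k=5):
    """Sharpen by adding back the high-frequency residual."""
    return img + amount * (img - box_blur(img, k))

# A step edge: sharpening overshoots on both sides of the edge,
# producing bright/dark fringes around it.
img = np.zeros((32, 32))
img[:, 16:] = 1.0
sharpened = unsharp_mask(img)
print(sharpened.max() > 1.0, sharpened.min() < 0.0)
```

The overshoot makes the edge look crisper to a human observer, yet every fringe pixel differs from the reference, which is why SSIM penalizes the visually "better" result.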
The study proposes the use of two quality assessment indicators: SSIM and SSIM (structure only). Both indicators were used to evaluate the results for both the training and testing images. The mean values of SSIM and SSIM (structure only) are close to 0.6 and 0.5, respectively. In the case of the training image pairs, the values are obviously higher, because the network was trained on them and there is some degree of fitting. As the neural network has problems with mapping the brightness of larger areas, it was decided to also use the simplified SSIM for the structure only. The values for SSIM (structure only) are shifted on average by about +0.1. This means that the neural network reproduces the structure relatively well. Achieving high SSIM values of around 0.9-1.0 is in practice unlikely, because the original H&E and UV (RGB) images contain noise and tissue granulation. Noise arises during acquisition and can be reduced with longer exposure. Granulation is a feature of biological material, because it contains many small structures. In the process of image acquisition, depending on the lighting (visible/UV light), optical effects of invisible objects may be directly visible due to diffraction and interference. For these reasons, it is unlikely that a network with a very high SSIM will be obtained.

Discussion Related to Other Works
The image conversion task is an active research topic, in particular for colorizing black-and-white images and for adding styles to images (style transfer). These tasks are solved using various techniques; in particular, methods that use machine learning are effective. Among the methods of machine learning, the most spectacular results are obtained with deep learning convolutional neural networks [47]. There are basically two goals for this type of transformation: to achieve consistency at the microstructure level and to achieve a good visual transformation from the subjective human point of view. Subjective acceptance is important in, for example, artistic applications, where the mapping does not have to be repeatable and is even allowed to introduce innovations in the generated image using a noise generator. These types of algorithms do not meet the requirements of medical or biological applications, where achieving microstructure fidelity is a key property.
The main problem of image generating algorithms with the condition of preserving the microstructure is their complexity and sensitivity to new data that did not occur during the training. This means that it is necessary to obtain a good approximator with a good spatial location of the generated features (for example edges). With ConvNN, localization problems come from the use of pooling layers. Pooling layers are used in segmentation algorithms and are typical of most ConvNN architectures. For this reason, such layers were rejected in the proposed network, since a unidirectional multi-layer network can map the microstructures.
The problem of loss of spatial resolution and localization of microstructures is known, and there are other approaches such as U-Net [48]. The U-Net network uses an encoder-decoder architecture, where the decoder uses images at different scales from the encoder for synthesis. This allows for image segmentation while preserving the microstructure. A GAN framework was proposed in [15], where the generator is a U-Net. This shows that different approaches are possible, but a comparative evaluation of different solutions is difficult. The comparison of different network structures requires a Monte Carlo study to determine unbiased estimators of image conversion quality. This requires many training runs for each network (a minimum of 10, but hundreds or thousands are also possible). As networks can have different architectures (number of layers, number of neurons, sizes of convolution masks, etc.), the number of cases to analyze is huge. This process can be parallelized, but very high computing power is necessary. Unfortunately, the results depend on the training base, which in the case of medical images means that the result of the search for optimal solutions is specific and depends on the type of images (tissues, magnifications, etc.). Inferring the best architecture from a case-by-case comparison usually leads to wrong conclusions.
Another option to improve the quality or optimize the conversion system is to use the radiomics method [49,50]. This method extracts a large number of features from radiographic medical images using data characterization algorithms. The obtained features may be inputs to the ConvNN, so these or similar features, if well matched, would not need to be learned by the network. The use of the radiomics method with a ConvNN is, however, quite difficult. In the case of an RGB image there are only three image channels (R, G, B), and PCA uses one channel, while for radiomics there may be hundreds of channels. This means that there are many more weights to learn. The amount of GPU memory available for the calculations is also a serious problem. The use of feature optimization may be an interesting solution, but it is an expensive process, because the network has to be repeatedly retrained in order to determine the effectiveness of a feature set.

Conclusions and Further Work
The use of machine learning methods, shown in this work, indicates that there are many new research areas where it is possible to improve the quality of images or change the representation, for example, conversion to a different space. Images often contain a lot of information that is not fully exploited by humans due to the limitations of the brain. The use of computer methods allows for breaking these barriers.
The presented network is of the regression type, and it can serve as a preprocessing network for further image analysis. The analysis of structures, in this case microscopic structures of the liver, is interesting for research reasons, which was one of the motivations for this work.
The article shows the possibility of performing image conversion for microscopic liver slides. The ability to convert the H&E image to a virtual fluorescence UV image does not mean that this type of operation can always be performed for a slide of any tissue. Learning for a specific tissue type is essential each time that a system is designed.
It would be interesting to investigate the effect of network size on the results, but this requires quite a long study. The neural network presented in the article was selected after several attempts. In general, it is necessary to optimize the network structure (meta-optimization). This will be the subject of further research.

Funding: This work is supported by the UE EFRR ZPORR project Z/2.32/I/1.3.1/267/05 "Szczecin University of Technology-Research and Education Center of Modern Multimedia Technologies" (Poland). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used also for this research.

Conflicts of Interest:
The authors declare no conflict of interest.