Deep OCT image compression with convolutional neural networks

We report an end-to-end image compression framework for retinal optical coherence tomography (OCT) images based on convolutional neural networks (CNNs), which achieved an image compression ratio as high as 80. Our compression scheme consists of three parts: data preprocessing, compression CNNs, and reconstruction CNNs. The preprocessing module was designed to reduce OCT speckle noise and segment out the region of interest. Skip connections with quantization were developed and added between the compression CNNs and the reconstruction CNNs to preserve fine-structure information. The two networks were trained together, taking the semantically segmented images from the preprocessing module as input. To make the two networks sensitive to both low and high frequency information, we leveraged an objective function with two components: an adversarial discriminator to judge the high frequency information and a differentiable multi-scale structural similarity (MS-SSIM) penalty to evaluate the low frequency information. The proposed framework was trained and evaluated on ophthalmic OCT images with pathological information. The evaluation showed that the reconstructed images can still achieve above 99% similarity in terms of MS-SSIM when the compression ratio reached 40. Furthermore, the reconstructed images after 80-fold compression with the proposed framework presented quality comparable to those compressed at a ratio of 20 by state-of-the-art methods. The test results showed that the proposed framework outperformed other methods in terms of both MS-SSIM and visualization, which was more obvious at higher compression ratios. Compression and reconstruction were fast, taking only about 0.015 seconds per image. The results suggest a promising potential of deep neural networks for customized medical image compression, particularly valuable for effective image storage and tele-transfer.


Introduction
There is an increasing demand for methods that can efficiently store and transfer imaging data when considering, in medicine alone, the mounting number of medical images collected at each hospital and the increasing image size (partially owing to improved resolution). Efficient imaging data compression, transfer, and restoration are becoming increasingly critical for remote diagnosis and monitoring of diseases, such as age-related macular degeneration (AMD) detection with optical coherence tomography (OCT) [1,2]. For image compression, it is necessary to preserve the fine structural information (e.g., the detailed retinal layer information in ophthalmic OCT images) against the noisy background [3]. With recent advances in deep learning, a convolutional neural networks (CNNs)-based framework offers a potential solution, capable of efficient compression and high-fidelity restoration.
While volumetric image compression is of great interest for both industrial standards (e.g., MPEG [4]) and clinical applications (e.g., 3-D OCT image compression [5]), 2-D image compression is also widely used for everyday images and clinical data. In this study, we mainly focus on 2-D compression of OCT images. The commonly used compression formats, such as BPG [6], WebP [7], JPEG [8], and JPEG2000 [9], become suboptimal as the compression ratio increases [3]. Customized compression methods, such as compressive sensing (CS) [10], provide an alternative; however, CS-based methods, often involving convex optimization, are usually computationally expensive for achieving a high reconstruction accuracy [11][12][13]. Currently, CNNs provide an exciting new avenue for image compression [14][15][16][17]; yet most reported CNNs-based methods exhibit difficulty in preserving fine structural information, especially at a high compression ratio.
In this paper, we report an efficient image compression and restoration framework based on CNNs that achieves a high compression ratio (up to 80:1) and restores the images with high fidelity (with a similarity to the original images better than 98%). The proposed scheme works in two stages: first, distinguishing diagnostic features of interest from the noisy background by semantic segmentation [18][19][20]; second, compressing the segmented images with compression CNNs, then interpreting the compressed files and restoring the images with reconstruction CNNs. This paper presents the detailed network structure and training details of the CNNs for compression and reconstruction. The performance of the proposed framework is demonstrated by quantitative and qualitative comparisons with commonly used compression methods including BPG, WebP, JPEG, JPEG2000, and a recurrent neural network (RNN)-based method [21]. The paper ends with a brief discussion of the novelty and prospects of the proposed framework.

Methods
Our compression framework consists of three modules: a data preprocessing module (Fig. 1(a)), a CNNs-based compression module, and a CNNs-based reconstruction module (Fig. 1(b)). The data preprocessing module reduces OCT image noise and segments out the regions to be compressed, which helps efficiently train both the compression and reconstruction CNNs. The compression module generates the compressed file, which consists of a bitstream that contains the image information. The reconstruction module serves as a dictionary to interpret the bitstream into a high-resolution image. To preserve both high and low frequency information in the original images, we train the compression and reconstruction CNNs together with an adversarial objective function combining a patch discriminator module and a differentiable multi-scale structural similarity (MS-SSIM) penalty module. The training, validation and testing data were obtained from an open source dataset provided by the Vision and Image Processing (VIP) Laboratory, Duke University [1].

Denoising and segmentation
OCT imaging is based on low-coherence interferometry and intrinsically suffers from speckle noise. Speckle noise not only degrades OCT image quality and makes it difficult to identify fine structural details, but also provides false guidance when we train the compression and reconstruction CNNs. To efficiently compress the OCT images while keeping high-fidelity structural information, we chose to reduce the speckle noise in the OCT images as the first step (see Fig. 1(a)) by using the well-established denoising algorithm, block-matching and 3D filtering (BM3D), which has demonstrated superb performance in terms of peak signal-to-noise ratio and subjective visual quality [22][23][24]. To further increase the potential compression ratio, we segmented out the regions of interest (ROIs) so that both the compression and reconstruction CNNs could effectively learn the feature representations from the ROIs and fully utilize the limited size of the bitstreams. A semantic segmentation approach based on the U-Net architecture [19,25,26] was used in preprocessing. The U-Net segmentation architecture consisted of feature channels connecting the latent maps of the down-sampling phase with the corresponding ones in the up-sampling phase, which enabled transmission of pixel-level localization information to the up-sampling phase and vice versa during backpropagation [27][28][29]. The performance of the U-Net architecture has been demonstrated in various tasks in medical image analysis, especially biomedical image segmentation [30,31].
To run the image compression framework efficiently and avoid potential memory overflow, each original image with a size of 496 × 768 pixels was cropped along the upper and lower boundaries [see the green and yellow lines in Fig. 1(a)] and zero-padded along the axial direction to make a final image of 256 × 512 pixels before compression [20]. Each rectangular image was then evenly divided into two square images with a size of 256 × 256 pixels each [see Fig. 1(a)]. All the images in the dataset were processed by the above procedures and divided into three independent groups: 740 images for training, 60 for validation, and 180 for testing. The images in each group were randomly selected without any overlap. The images in the training dataset were augmented by random cropping with scaling, random rotation, and random horizontal flipping in each training epoch to prevent potential overfitting [19]. The compression CNNs and reconstruction CNNs were trained simultaneously. In the validation phase, hyper-parameters were fine-tuned, such as the initial learning rate, the number of skip connections, the quantization levels, and the number of channels in each convolution layer. We tested the as-developed compression and reconstruction CNNs with the 180 independent images and made the final quality assessment based on multi-scale structural similarity (MS-SSIM). 5-fold cross-validation was adopted to report the performance.
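As an illustration, the cropping, padding, and splitting geometry described above can be sketched as follows; the boundary row indices and the lateral center-crop are assumptions for illustration only (in the actual pipeline, the crop boundaries come from the segmentation step, and the paper does not state how the lateral 768 → 512 reduction is performed):

```python
import numpy as np

def crop_pad_split(bscan, top, bottom, target_h=256, target_w=512):
    """Crop a 496 x 768 B-scan between hypothetical retinal boundary rows
    `top` and `bottom`, zero-pad the axial (row) direction to target_h,
    center-crop laterally to target_w, and split into two square patches."""
    roi = bscan[top:bottom, :]                       # keep rows inside the ROI
    # zero-pad axially so the ROI height reaches target_h (assumed constant padding)
    pad = target_h - roi.shape[0]
    roi = np.pad(roi, ((0, max(pad, 0)), (0, 0)))[:target_h]
    # center-crop laterally to target_w (an assumption; the paper ends at 256 x 512)
    start = (roi.shape[1] - target_w) // 2
    roi = roi[:, start:start + target_w]
    # split the 256 x 512 image into two 256 x 256 squares, as in Fig. 1(a)
    left, right = roi[:, :target_w // 2], roi[:, target_w // 2:]
    return left, right
```

Each 256 × 256 patch is then fed to the compression CNNs independently.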

Compression CNNs and reconstruction CNNs
The schematic of the proposed compression and reconstruction CNNs with a quantizer is shown in Fig. 2. The compression CNNs contracted the dimension of the input images layer by layer, and the reconstruction CNNs expanded the dimension in a symmetrical way. In detail, six layers in the compression CNNs learned the contextual features and hierarchically contracted the dimension of the input images. A quantizer residing between the compression and reconstruction CNNs quantized the feature representations from three skip connections and the output of the last layer of the compression CNNs to generate a bitstream (the compressed image). Three skip connections (the three red arrows from left to right in Fig. 2) added additional concatenations between the two CNNs and fed the image information from different scales into the reconstruction CNNs via the quantizer. The skip connections preserved the fine structure information of the original images and transferred it to the corresponding reconstruction layers. The dimension of each input to the compression CNNs was 256 × 256 (H × W) pixels. The first layer applied convolution with 64 filters (each with a size of 7 × 7 pixels) and stride 1 to generate feature maps with a size of 256 × 256 × 64, followed by a normalization layer and a Leaky Rectified Linear Unit (Leaky ReLU) activation to prepare for the down-sampling phase. Five more layers with similar modules followed, each with a down-sampling rate of 2 × 2. At the bottom of the compression CNNs, the most contracted feature maps, with a dimension of (H/32 × W/32) × C0 (where C0 is the number of channels), were passed through the quantizer to the reconstruction CNNs. The reconstruction CNNs symmetrically reversed the process of the compression CNNs, but we replaced the Leaky ReLU activation functions with Rectified Linear Unit (ReLU) activation functions to achieve better reconstruction quality [32].
The output images from the reconstruction CNNs had exactly the same dimension as the input images. To enhance the resolution of the up-sampling process, three skip connections concatenated a part of the output feature maps of the third, fourth, and fifth layers of the compression CNNs to the corresponding layers of the reconstruction CNNs through the quantizer [33][34][35][36][37]. To control the size of the bitstream (the compressed image), we used C0, C1, C2, and C3 to set the number of channels of the feature maps from the output of the last layer of the compression CNNs and from each skip connection, respectively. A specific compression ratio could be achieved by multiple combinations of C0, C1, C2, and C3. Based on experimental results, we found the empirical combinations that generated the best reconstruction quality and used those in the final version of the framework.
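The layer dimensions described above can be verified with a short shape trace; this is a sketch of the spatial sizes only, not the authors' implementation, and the channel counts beyond the first layer's 64 filters are left out because the paper tunes them per target ratio:

```python
def encoder_shapes(h=256, w=256):
    """Trace feature-map spatial sizes through the compression CNNs:
    one stride-1 layer followed by five stride-2 layers, giving a
    bottleneck of (H/32, W/32)."""
    shapes = [(h, w)]          # layer 1: stride 1, 64 filters of 7 x 7
    for _ in range(5):         # layers 2-6: stride-2 down-sampling
        h, w = h // 2, w // 2
        shapes.append((h, w))
    return shapes

shapes = encoder_shapes()
# skip connections tap layers 3, 4, and 5 (1/4, 1/8, and 1/16 scale);
# the bottleneck (H/32 x W/32) feeds the quantizer directly
skips = [shapes[2], shapes[3], shapes[4]]
bottleneck = shapes[-1]
```

For a 256 × 256 input this reproduces the 64 × 64, 32 × 32, and 16 × 16 skip-connection scales and the 8 × 8 bottleneck quoted in Section 2.2.2.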

Multi-scale quantization and compressed output
Between the compression and reconstruction CNNs, a quantizer quantized the feature maps from different compression layers and generated a bitstream as output (red and black arrows in Fig. 2). The output contained two parts: (1) the bitstream from the last compression layer, which carried the major part of the residual information, and (2) three skip connections from the "upper" layers, which helped enhance the fine structures in the reconstructed images [38].
In the quantization process, a scalar variant method was used to quantize the feature maps into L quantization levels and then encode the quantized feature maps into a bitstream [16,39]. For a given quantization level L, the compression ratio (CR) is defined as the ratio of the size of the input image to the size of the compressed output, i.e.,

CR = (H_input · W_input · S) / (Σ_{i=1}^{n} H_i · W_i · C_i · log2 L),

where H_input and W_input are the height and width of the input images, respectively; S is the bit depth of the input image; n is the number of connections passing through the quantizer; and H_i · W_i · C_i is the dimension of the feature maps in each connection. In our work (Fig. 2), the dimensions of the input feature maps were (8 × 8 × C0), (16 × 16 × C1), (32 × 32 × C2), and (64 × 64 × C3), and the quantization level L was 7. By changing the number of channels in the feature maps (C0, C1, C2, and C3), we could achieve a compression ratio ranging from 10 to 80.

Fig. 2. Schematic of the proposed generator in a conditional GANs model, which contains compression CNNs (blue blocks on the left) and reconstruction CNNs (green blocks on the right). Leaky Rectified Linear Unit (Leaky ReLU) and Rectified Linear Unit (ReLU) activations induce nonlinearity for efficient training [7]. Convolution layers with stride 2 reduce feature maps by a factor of 2 along each dimension [13]. Arrows with different colors indicate different operations in the networks: the red arrows indicate quantized skip connections from the compression CNNs to the reconstruction CNNs via a quantizer; the black arrow represents the quantization operation on the most contracted feature map from the last layer of the compression CNNs; arrows with other colors represent the different combinations of convolution layers, normalization layers and activation functions. Please see Section 2.2.1 for more details.
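The scalar quantization and the compression-ratio accounting can be sketched numerically as follows; the uniform quantizer is a simplified stand-in for the scalar variant method cited above, and the channel combination used in the usage note is a hypothetical example, not one of the tuned configurations from the paper:

```python
import math
import numpy as np

L_LEVELS = 7  # quantization levels used in the paper

def quantize(features, levels=L_LEVELS):
    """Uniform scalar quantization of a feature map into `levels` bins
    (simplified illustration of the scalar variant method)."""
    lo, hi = features.min(), features.max()
    q = np.round((features - lo) / (hi - lo + 1e-12) * (levels - 1))
    return q.astype(np.uint8), (lo, hi)

def dequantize(q, lo_hi, levels=L_LEVELS):
    """Map quantized indices back to approximate feature values."""
    lo, hi = lo_hi
    return q.astype(np.float32) / (levels - 1) * (hi - lo) + lo

def compression_ratio(channels, levels=L_LEVELS, h_in=256, w_in=256, bit_depth=8):
    """CR = input bits / compressed bits, with one connection per entry
    of `channels` at the spatial sizes 8x8, 16x16, 32x32, and 64x64."""
    sizes = [8 * 8, 16 * 16, 32 * 32, 64 * 64]
    out_bits = sum(s * c for s, c in zip(sizes, channels)) * math.log2(levels)
    return h_in * w_in * bit_depth / out_bits
```

For example, a hypothetical channel combination (C0, C1, C2, C3) = (16, 4, 2, 1) with an 8-bit 256 × 256 input yields a compression ratio of roughly 23 under this formula.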

Objective functions and optimization
In the training phase, the compression CNNs and reconstruction CNNs were trained together as the generator of conditional GANs [32]. The generator took original images as input and produced reconstructed images as output, and the loss was represented by the difference between the original images and the reconstructed images in terms of the proposed objective function. The training of the proposed framework was thus self-supervised. Our objective function consisted of two parts: a patch-based discriminator (PatchGAN) for the conditional GANs to identify the fine-structure difference between the reconstructed images and the original images, and an MS-SSIM loss function to evaluate the difference between the input and output on a larger scale, which forced the generator to model the low-frequency information more efficiently [32]. The whole objective function L_G used in our framework is defined as

L_G = arg min_G max_D L_cGAN(G, D) + λ · L_MS-SSIM(G),

where G consists of the compression CNNs and reconstruction CNNs, D indicates the PatchGAN discriminator optimizing this objective function in an adversarial way, λ is the weight of the MS-SSIM loss, and L_MS-SSIM(G) penalizes the MS-SSIM difference between the original and reconstructed images. In our case, an empirical weight λ = 100 was selected. Specifically, the loss of the PatchGAN discriminator can be expressed as [32]

L_cGAN(G, D) = E_x[log D(x)] + E_{x,z}[log(1 − D(G(x, z)))].

Here x represents the original images; y = G(x, z) represents the reconstructed images; z represents random noise, which is optional in our model; and E represents the mathematical expectation. The task of the PatchGAN discriminator was to identify whether each N × N patch of the input image was real or not. Here, N × N was the size of the receptive field of the PatchGAN discriminator, and the patch size was set to 70 × 70. The discriminator consisted of four convolution layers. In each layer, the input feature maps were convolved by 4 × 4 filters with a stride of 2 and then sequentially passed through an instance normalization layer and a Leaky ReLU activation function.
After the last layer, the output feature maps were mapped to a one-channel output followed by a sigmoid activation to generate a final evaluation score [28,32,40]. For the optimization procedure, we used the Adam optimizer to minimize the loss function, with exponential decay rates β = (0.9, 0.999) for the moment estimates [41]. Training hyperparameters were set as follows: a constant learning rate of 2.0E−4 for the first 100 epochs, then linearly decaying to 0 over the remaining epochs; 200 maximum epochs; and a batch size of 1.
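The two-term objective and the learning-rate schedule described above can be sketched numerically; the discriminator scores below are toy scalar values rather than outputs of a real PatchGAN, and the generator term uses the common non-saturating form as an assumption:

```python
import numpy as np

LAMBDA = 100.0  # weight of the MS-SSIM penalty, as in the paper

def generator_loss(d_fake, ms_ssim_val, lam=LAMBDA):
    """Non-saturating generator loss: fool the patch discriminator while
    keeping MS-SSIM high.  `d_fake` holds per-patch scores in (0, 1);
    both inputs are toy values for illustration."""
    adv = -np.mean(np.log(d_fake + 1e-12))      # want D(G(x)) -> 1
    return adv + lam * (1.0 - ms_ssim_val)      # MS-SSIM penalty term

def discriminator_loss(d_real, d_fake):
    """PatchGAN discriminator loss averaged over the N x N patches."""
    return -(np.mean(np.log(d_real + 1e-12)) +
             np.mean(np.log(1.0 - d_fake + 1e-12)))

def learning_rate(epoch, base=2.0e-4, constant_until=100, total=200):
    """Constant LR for the first 100 epochs, then linear decay to 0."""
    if epoch <= constant_until:
        return base
    return base * (total - epoch) / (total - constant_until)
```

Note how a perfect reconstruction (MS-SSIM = 1, discriminator fooled) drives the generator loss toward zero, while λ = 100 makes the MS-SSIM penalty dominate early training.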

Results
In order to quantitatively assess the performance of our framework, we combined an MS-SSIM evaluator with direct human inspection, which is often more sensitive to certain types of distortion than others [30,42]. The MS-SSIM evaluator produced a score between 0 and 1, with a higher value implying a closer match between the reconstructed and original images. The MS-SSIM evaluator is defined as

MS-SSIM(x, y) = [l_M(x, y)]^{α_M} · Π_{j=1}^{M} [c_j(x, y)]^{β_j} · [s_j(x, y)]^{γ_j},

where x represents the original image and y represents the reconstructed image. The algorithm iteratively applied a low-pass filter and down-sampled the filtered image by a factor of 2. We denote the original image as Scale 1 and the highest scale as Scale M. l_M(x, y), c_j(x, y), and s_j(x, y) represent the luminance comparison at the highest scale M, and the contrast and structure comparisons at the j-th scale, respectively; α_M, β_j, and γ_j are the corresponding weighting exponents [40,42].
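A simplified numerical sketch of the MS-SSIM evaluator is given below; it uses global image statistics instead of the usual sliding Gaussian window, and the five-scale exponents are the commonly used values from the original MS-SSIM work, which the paper does not state explicitly:

```python
import numpy as np

# standard 5-scale exponents from Wang et al.; an assumption here,
# since the paper does not list its exact weights
WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

def _components(x, y, data_range=1.0):
    """Luminance, contrast, and structure terms from global statistics
    (a simplification of the usual windowed SSIM)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    c3 = c2 / 2
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + c1) / (mx * mx + my * my + c1)
    c = (2 * sx * sy + c2) / (sx * sx + sy * sy + c2)
    s = (sxy + c3) / (sx * sy + c3)
    return l, c, s

def ms_ssim(x, y, weights=WEIGHTS):
    """Multi-scale SSIM: contrast/structure at every scale, luminance
    only at the coarsest scale, down-sampling by 2 between scales."""
    score = 1.0
    for j, wj in enumerate(weights):
        l, c, s = _components(x, y)
        score *= (c * s) ** wj
        if j == len(weights) - 1:
            score *= l ** weights[-1]   # luminance term at Scale M only
        else:
            # 2x2 average-pool down-sampling to the next scale
            h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
            x = x[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
            y = y[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return score
```

For identical images every comparison term equals 1 at every scale, so the score is exactly 1; any distortion pushes it below 1.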

Comparison with other methods
To fairly compare the compression performance of each method, all the test images used for the assessment metrics were preprocessed in the same fashion as described in Section 2.1.1. We compared the performance of the proposed framework with the other methods shown in Table 1 under a set of compression ratios (10, 20, 40, and 80). By controlling the quantization level L and the channel numbers of the latent feature maps from the compression CNNs, we trained a set of models at each aforementioned compression ratio. We performed a 5-fold cross-validation for each method, and the quantitative comparison results are reported in Table 1 as the mean and the corresponding 95% confidence interval. Although the RNN-based approach achieved slightly better performance than the non-learnable methods as shown in Table 1, its reconstruction was significantly time-consuming due to the patch-based iterative reconstruction [1,21]. In this paper, we mainly focused on the discussion and comparison with BPG, which outperforms the other non-learnable methods. Both the proposed approach [Figs. 3(a) and 3(b)] and BPG [Figs. 3(b) and 3(c)] achieved a high similarity index in terms of MS-SSIM at low compression ratios (CR < 20), but the proposed method was able to maintain a high similarity index and far outperform BPG at high compression ratios. To further evaluate the performance of the proposed method, we applied the previously trained networks to a second set of ophthalmic OCT images from an independent source [2]; the performance is reported in Table 2. We noticed that the performance of our proposed method degraded slightly at low compression ratios, mainly due to the additional noise associated with a different image acquisition system that our networks had not been trained for. Similar degradation was observed in all tested methods. However, at high compression ratios (CR of 40 or higher), our proposed method still outperformed the others.
Comparing the two approaches, the proposed framework preserved fine structures, such as the Bruch's membrane (BM), with an MS-SSIM value of 0.985 even at a high compression ratio of 80 [see the corresponding zoomed-in images in Fig. 3(b)]. In contrast, the performance of BPG severely deteriorated (with an MS-SSIM value of 0.973) at a compression ratio of 80, with blurred fine structures. The reconstructed images from the proposed framework, in comparison, exhibited a highly consistent visual appearance over a large compression ratio range, as shown in Fig. 3. To further demonstrate the practicability of the proposed method, we also illustrate the reconstruction performance for images with diseases [i.e., age-related macular degeneration (AMD) and diabetic macular edema (DME)] in Fig. 4. The first row shows an early case of AMD, where the hyperreflective foci (HRF) and the drusen appearing as retinal pigment epithelium (RPE) deformation were clearly preserved in the reconstructed image. The second and third rows show examples of DME images, where a giant outer nuclear layer (ONL) cyst, cysts in the inner nuclear layer, and HRFs of different grades could be clearly identified in the reconstructed images.

Fig. 4. Comparison between the images with age-related macular degeneration (AMD) and diabetic macular edema (DME) and the reconstructed ones by the proposed compression method at a compression ratio of 80. The first row shows the original and reconstructed images with AMD. The second and third rows show the original and reconstructed images with DME. The circle indicates drusen, the arrows point to hyperreflective foci, the "*" indicates a giant outer nuclear layer (ONL) cyst, and the bracket indicates inner layer cysts.

Discussion
In this paper, we developed an OCT image compression framework based on deep neural networks that preserves fine structural features at a high compression ratio (as high as 80). It outperformed the general compression formats in both similarity index and visual examination. The proposed framework demonstrated superb performance mainly due to three factors: (1) the CNNs efficiently compress and reconstruct the image with high fidelity for both low and high frequency information by introducing an adversarial PatchGAN discriminator and an MS-SSIM penalty in the proposed objective function [43,44]; (2) speckle noise reduction and selection of only the regions of interest (ROIs) keep the model focused on the specific structural information [45]; (3) customized skip connections enhance the reconstruction quality by preserving fine structural information at different scales (from feature maps with a size of 1/4, 1/8 and 1/16 of the original images) [38]. Figure 5 shows the effectiveness of data preprocessing (i.e., denoising and segmentation) and skip connections in the proposed compression framework. We trained the models with the same hyper-parameters and reconstructed the images at the same compression ratio. Figure 5(a) illustrates the visual comparison among the reconstructed images from the proposed framework and those reconstructed either without image preprocessing (denoising and ROI segmentation), without skip connections in the CNNs, or without both. Without image preprocessing, the proposed framework was "fooled" by speckle noise and tried to reconstruct it, which degraded the reconstructed image. Without skip connections, it was more difficult to preserve the fine structure information, and the reconstructed image was blurred. Without both preprocessing and skip connections, the reconstructed image was far worse.
Figure 5(b) shows the quantitative comparison, where data preprocessing and skip connections clearly helped reduce image distortion in terms of the MS-SSIM index. For deploying the proposed method to OCT images collected with different systems or for non-ophthalmic applications, we suggest re-training the model to achieve the desired performance. With a larger training dataset and upgraded network frameworks, it is feasible to push deep-learning based compression algorithms from cross-sectional images to volumetric ones, which will be the direction of our future work. Owing to the three factors mentioned above, our model achieves performance comparable to the latest published general image compression methods based on CNNs, in which the networks had more complex architectures and were trained with larger datasets [17,21,46-48]. Furthermore, the computational efficiency of the proposed framework is also notable: it took only about 0.015 seconds to compress and reconstruct an image on an Ubuntu 18.04 computer with an NVIDIA 2080Ti GPU and a PyTorch implementation.