Application of deep learning and compressed sensing for reconstruction of images

Compressed sensing (CS) is a signal processing technique that reconstructs a signal sampled at a rate below the Nyquist rate, provided the signal is sparse and the sampling is incoherent. The central problem in CS is finding a sensing matrix that allows the original signal to be reconstructed from as few samples as possible. In recent years there has been considerable interest in using CS to reconstruct 2D images, and several machine learning and deep learning algorithms have been proposed for generating the sensing matrix. Using deep learning for this purpose is still an emerging concept, but several papers exploring deep learning combined with compressed sensing for images have produced promising results. These results warrant a comparison against more traditional methods to determine the best approach for reconstructing images; they are also compared with the popular JPEG and JPEG2000 codecs. This paper surveys deep learning algorithms for image reconstruction using CS. The comparison reveals that deep learning algorithms perform significantly better than traditional methods and hold up well against image compression codecs such as JPEG. Possible methods to improve the existing algorithms are also suggested.


Introduction and Background
Image compression has been carried out using traditional lossy and lossless compression algorithms for many years. Popular lossy compression methods such as the discrete cosine transform [1], used in JPEG, achieve roughly a 10:1 compression ratio. However, with the emergence of compressed sensing [2][3], applications satisfying the sparse sensing conditions can achieve a much better compression ratio while sampling at a rate below the Nyquist rate. To use compressed sensing to reconstruct images, two conditions must be met: the signal must be sparse in some domain, and the sampling must be incoherent. For a signal to be sparse, the majority of its samples must be zero, with only a few non-zero entries. Given a real-valued signal x ∈ ℝ^N, where N is the number of samples, the aim is to find a random sensing matrix Φ ∈ ℝ^(M×N) with M ≪ N; the compressed samples are then obtained as y = Φx, where y ∈ ℝ^M is the vector of CS measurements.
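The acquisition step above can be sketched in a few lines of NumPy. This is an illustrative example (the signal length, measurement count, and Gaussian sensing matrix are our choices, not values from the paper): a K-sparse signal x is compressed into M ≪ N measurements via y = Φx.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 256          # signal length
M = 64           # number of measurements, M << N
K = 8            # sparsity: number of non-zero entries in x

# Build a K-sparse signal: mostly zeros, a few random non-zero entries.
x = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
x[support] = rng.standard_normal(K)

# Random Gaussian sensing matrix Phi in R^{M x N}.
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

# CS measurements: y in R^M, far fewer samples than the original signal.
y = Phi @ x
assert y.shape == (M,)
```

The reconstruction problem, covered in the sections that follow, is to recover x from y and Φ.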
The sensing matrix must be chosen carefully based on the application. Choosing a random matrix leads to unpredictable outcomes, and random matrices have also been shown to have high computational complexity and large memory requirements. Alternatively, matrices can be designed for a specific application so that these limitations are reduced [4][5]; however, generating the sensing matrix even with these approaches is difficult and tedious. Using learning algorithms to generate the matrix is advantageous because the matrix is tailored to the application.
The next important step in CS is to successfully reconstruct the original image using the sensing matrix. There have been many reconstruction-based compressed sensing algorithms, such as minimization of total variation [57].
From [14], using the DWT algorithm, the best reconstruction algorithm in terms of execution time took around 10 seconds. Almost all of these methods require an optimization problem to be solved in order to perform the reconstruction, which in turn requires hundreds of iterations to arrive at a suitable solution [47], leading to high computation cost and poor execution time when deployed in an application. Discrete wavelet transform (DWT) [56] and total variation (TV) algorithms have lower execution times at the cost of lower-quality reconstruction.
These algorithms are not suitable for real-world applications when their execution times are on the order of 10 seconds; it simply would not be feasible. Further research in this field is being done to improve the performance of these algorithms [53], but reducing their complexity [50] does not seem to be a priority. Deep learning has been very useful in extracting complicated information from images, as previous researchers have shown [15][16][52][54][60][62]. Neural network-based algorithms directly learn the inverse mapping [55] from the CS measurement domain to the original image domain; thus the computation cost decreases and the performance of the algorithms improves.
Hence, deep learning can be used to reconstruct 2D images while producing better execution times than the existing non-learning algorithms.
The paper focuses on how deep learning solves two major problems in CS: generating the sensing matrix and reconstructing the image, while maintaining better performance than non-learning solutions and producing results usable in real-life applications. Performance is also measured against standard image codecs like JPEG and JPEG2000 to quantify where these methods stand relative to popular image codecs.

Our focus
We are focusing on three things: (1) to survey the latest research in CS using deep learning; (2) to compare the latest research in CS using deep learning with traditional methods, evaluated on the PSNR metric, the SSIM metric, and run time; (3) to compare the results of the latest CS deep learning algorithms with image codecs like JPEG and JPEG2000.

Analysis of algorithms
We analyze deep learning algorithms for CS and their performances with special reference to quality metrics like PSNR, SSIM and their run time.
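For concreteness, PSNR (the primary quality metric used throughout this survey) can be computed as follows. This is a generic textbook definition, not code from any of the surveyed papers; `peak` is assumed to be 255 for 8-bit images.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    reconstruction (higher is better). peak is the maximum pixel value."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# A constant error of 10 gray levels gives MSE = 100.
ref = np.full((8, 8), 100.0)
noisy = ref + 10.0
assert abs(psnr(ref, noisy) - 10.0 * np.log10(255.0 ** 2 / 100.0)) < 1e-9
```

SSIM is more involved (it compares local luminance, contrast, and structure) and is typically taken from a library implementation rather than written by hand.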

Deep learning sparse ternary projections [17]
This approach addresses hardware implementation of CS using DL algorithms [48], with storage efficiency and fast computation as the main focus. It simplifies the deep neural network by using binary weights {-1, +1}, and the algorithm uses the mini-batch gradient descent method for training. The basic idea lies in obtaining a matrix with ternary values, i.e., {-1, 0, +1}, instead of a real-valued dense matrix.
The model learns the projection matrix and performs non-linear reconstruction. The algorithm imposes sparsity and binary constraint by taking sparse matrices composed of {0, -1, +1}. This makes it efficient since all the real valued entries will not be used. This is combined with binarization techniques [49] to produce highly sparse ternary projections. The network trains the system by using image patches and producing output images in a block-based manner.
The network consists of a sensing module and a reconstruction module, as shown in figure 1. The sensing module takes in vectorized image patches of size N = S², where S is the side length of each square patch, and projects the input x of dimension N into a domain of dimension M. The number of units in this layer is therefore M = S²R, where R is the sensing rate with values between 0 and 1. To keep the projection matrix simple, no bias or non-linear activation is applied; this layer uses {0, -1, +1} as weights. The reconstruction module consists of a scaling layer, L hidden layers, and an output layer, followed by a batch normalization layer at the end, which helps reduce covariate shift.
The scaling layer scales the output of the sensing module by learnt factors α. All layers except the scaling layer are fully connected. After the scaling layer, there are L hidden layers which use the ReLU activation function [18]. ReLU has well-known advantages: it is cheap to compute, converges quickly, does not suffer from the vanishing gradient problem, and yields a sparsely activated network in which a unit fires only when needed. The output layer is linear and has the same dimensions as the input. The outputs are normalized using a batch normalization layer [19]. The training algorithm uses a non-conventional mini-batch gradient descent method with sparsifying and binarization steps for training the sensing layer. The continuous weights of the sensing layer, of dimensions N×M, are converted to sparse weights by a column-wise top_k_select function, which keeps only the largest-magnitude weights in each column. The continuous sparse weights are then mapped to sparse ternary weights containing the values {-1, 0, +1} in a step called binarization. The loss from binarizing the weights is compensated by hidden units in the scaling layer connected one-to-one to units in the sensing layer [20]. This results in a sparse matrix with K non-zero entries in {-1, +1}, and existing training algorithms are used for further training.
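The sparsify-then-binarize procedure described above can be sketched in NumPy. The name top_k_select comes from the paper, but this implementation is our own illustrative assumption (it keeps the k largest-magnitude weights per column and maps the survivors to their signs, omitting the training loop and the compensating scaling layer):

```python
import numpy as np

def top_k_select(W, k):
    """Keep the k largest-magnitude entries per column; zero the rest.
    (Illustrative version of the paper's column-wise top_k_select step.)"""
    out = np.zeros_like(W)
    idx = np.argsort(-np.abs(W), axis=0)[:k]   # row indices of top-k per column
    cols = np.arange(W.shape[1])
    out[idx, cols] = W[idx, cols]
    return out

def binarize(W_sparse):
    """Map the surviving weights to ternary values {-1, 0, +1} by sign."""
    return np.sign(W_sparse)

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 4))               # continuous sensing-layer weights
W_ternary = binarize(top_k_select(W, k=3))

# Each column keeps exactly k = 3 non-zero ternary entries.
assert all(np.count_nonzero(W_ternary[:, j]) == 3 for j in range(4))
assert set(np.unique(W_ternary)) <= {-1.0, 0.0, 1.0}
```

Storing only the positions and signs of the K non-zero entries is what makes the projection cheap to store and evaluate in hardware.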
Overall, this algorithm simplifies the neural network by binarizing the weights, giving a highly sparse ternary projection matrix. This simplification allows the reconstruction module to be non-linear, which helps improve performance. The results obtained indicate that the algorithm outperforms algorithms such as O-NL-SDA [21] and BP [22] while using only 5% non-zero binary entries.

CS framework using Convolutional Neural Networks (CSNet) [14]
This framework comprises a sampling network and a reconstruction network, jointly optimized to improve both reconstruction quality and run time. CSNet uses a CNN for block-based compressed sensing. The proposed network, shown in figure 2, uses the CNN for three main functions: block-based compressed sampling, initial reconstruction, and non-linear signal reconstruction.
In the sampling network, three sampling matrices are learned: a floating-point matrix, a binary {0, 1} matrix, and a bipolar {-1, +1} matrix. The binary and bipolar matrices are learned to allow hardware implementation and easy storage. Conventionally, block-based compressed sensing uses a pseudo-inverse matrix [10][23] to acquire an initial reconstruction; this paper instead proposes a network that uses adaptive optimization for reconstruction. The algorithm learns the sampling matrix from the training set so as to retain more structural information, and deterministic rather than stochastic binarization is used to quantize the elements of the sampling matrix. The reconstruction network comprises a linear initial reconstruction network, which generates an initial reconstructed image using a convolution layer and a combination layer [51], and a non-linear deep reconstruction network, which learns an end-to-end mapping using a deep residual network to improve the reconstruction quality of images from CS measurements. Multiple convolutional filters produce initial reconstructed blocks, which are reshaped and concatenated by the combination layer. The initial reconstruction optimizes the entire image, which enables the use of both intra-block and inter-block information. The deep reconstruction network uses residual learning for non-linear signal reconstruction, carried out in three operations: feature extraction, non-linear mapping, and feature aggregation. Feature extraction uses a convolution layer followed by an activation layer (ReLU), producing high-dimensional features from the local receptive field. The deep reconstruction network then cascades residual blocks, convolution layers, and activation layers to perform the non-linear mapping and increase the receptive field.
Finally, feature aggregation produces the reconstructed image from the previously extracted high-dimensional features. To train the floating-point matrix, the gradients of the parameters are calculated and the parameters updated. To train the bipolar and binary matrices, deterministic binarization is used as suggested in [24][25], and the elements of the sampling matrix are quantized to binary and bipolar form. During training, the sampling network and reconstruction network form an end-to-end network; during the application phase, the sampling network behaves as an encoder and the reconstruction network as a decoder.
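The deterministic quantization step can be sketched as follows. This is a minimal illustration under our own assumptions (a simple sign/threshold rule; the straight-through gradient estimator used during training, per [24][25], is omitted):

```python
import numpy as np

def binarize_deterministic(W, bipolar=True):
    """Deterministically quantize a floating-point sampling matrix:
    bipolar {-1, +1} via the sign of each entry, or binary {0, 1}
    by thresholding at zero. Illustrative sketch only."""
    if bipolar:
        return np.where(W >= 0, 1.0, -1.0)
    return np.where(W >= 0, 1.0, 0.0)

rng = np.random.default_rng(2)
W = rng.standard_normal((8, 64))               # stand-in floating-point matrix
assert set(np.unique(binarize_deterministic(W, bipolar=True))) <= {-1.0, 1.0}
assert set(np.unique(binarize_deterministic(W, bipolar=False))) <= {0.0, 1.0}
```

Unlike stochastic binarization, the same input always yields the same quantized matrix, which is what makes the sampling operator reproducible in hardware.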

Cascaded reconstruction networks (CSRNet and ASRNet) [26]
CSRNet reconstructs high-quality images from random samples (CS measurements) produced by a random sampling matrix. The output of CSRNet's reconstruction network is passed through a CNN-based residual network module to obtain higher-quality images. CSRNet is not a complete CS image reconstruction network, as it only reconstructs the image and does not contain a sampling network.
CSRNet, as shown in figure 3, contains three modules: an initial reconstruction module, a deep reconstruction module, and a residual reconstruction module. The initial reconstruction module reconstructs an initial image directly from the CS measurements. The deep reconstruction module uses three convolution layers; the first two are each followed by a ReLU layer and produce a good-quality image that serves as input to the third. The first layer uses an 11x11 kernel and generates 64 feature maps; the second uses a 1x1 kernel and generates 32 feature maps; the third uses a 7x7 kernel and generates the single feature map that is the module's output. The residual reconstruction module reconstructs a higher-quality image by learning the residual between the input data and the original image.
Figure 3: Framework for CSRNet [26].
The adaptively sampling reconstruction network (ASRNet), unlike CSRNet, is a complete compressed sensing network containing both a sampling module and a reconstruction module. ASRNet learns the sampling matrix using fully connected layers in the sampling module and matches it with the residual reconstruction module.
ASRNet, as shown in figure 4, has three modules: a sampling module, which uses a fully connected layer to perform traditional compressive sampling; an initial reconstruction module, which uses a fully connected layer to generate an initial reconstructed image, with the sensing matrix learned automatically rather than computed; and a residual reconstruction module identical to that of CSRNet. ASRNet also has the advantage that the entire network is end-to-end and can be trained jointly.
Figure 4: Framework for ASRNet [26].
A BM3D denoiser is used in the final stage to remove the artifacts introduced by patch-wise processing. Both ASRNet and CSRNet achieve better results than state-of-the-art algorithms, with ASRNet gaining more than 1 dB over CSRNet.

ReconNet [29]
Computer vision tasks such as recognition, where convolutional neural networks are typically applied, use images as input to the network. For reconstruction, however, images must be obtained as outputs of the CNN, so a typical CNN architecture that maps images to features cannot be used. This paper takes inspiration from [30], which uses bicubic interpolation to obtain initial estimates of high-resolution images from low-resolution inputs and then applies a 3-layer CNN that takes these initial estimates as input, with the ground truth of the required output as labels. ReconNet, by contrast, does not compute initial estimates; it proposes a CNN architecture that directly maps CS measurements to image blocks. The method is a non-iterative algorithm in which a CNN takes the CS measurements as input and produces an intermediate reconstructed image, which is then fed to a denoiser to obtain the final reconstruction. The algorithm is shown to be very effective, outperforming its competitors even at low measurement rates of 0.1 or below.
As shown in figure 8, the first layer of ReconNet is a fully connected layer that takes CS measurements as input and produces feature maps of size 33 x 33 as output. All subsequent layers are convolution layers, each followed by the ReLU activation function except the final layer, which generates the intermediate reconstructed image. Finally, the noise in the intermediate image is removed using a BM3D [31] denoiser.
Patches of size 33x33 are extracted from the input images with a stride of 14, yielding a set of 21760 patches; here, the stride is the step, in pixels, between the top-left corners of consecutive patches. Only the luminance component of each image is extracted, and these patches form the labels of the training set. The CS measurements of the patches form the inputs of the training set. To obtain the CS measurements, a random Gaussian matrix is generated to construct the measurement matrix Φ, each of its rows is orthonormalized, and y = Φx is applied, where x is the vectorized luminance component of a patch. Finally, the networks are trained using Caffe [32][63].
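The data preparation above can be sketched in NumPy. This is an illustrative version under our own assumptions (a random image stands in for the training set, and QR decomposition is one common way to orthonormalize the rows of the Gaussian matrix; the paper does not specify the method):

```python
import numpy as np

def extract_patches(img, patch=33, stride=14):
    """Slide a patch x patch window over the image with the given stride
    (step in pixels) and return each patch as a vectorized row."""
    H, W = img.shape
    patches = [img[i:i + patch, j:j + patch].reshape(-1)
               for i in range(0, H - patch + 1, stride)
               for j in range(0, W - patch + 1, stride)]
    return np.stack(patches)

rng = np.random.default_rng(3)
img = rng.random((256, 256))              # stand-in luminance channel
X = extract_patches(img)                  # one row per 33x33 patch

# Measurement matrix: random Gaussian with orthonormalized rows.
mr = 0.1                                  # measurement rate
M = int(round(mr * 33 * 33))
G = rng.standard_normal((M, 33 * 33))
Q, _ = np.linalg.qr(G.T)                  # Q has orthonormal columns
Phi = Q.T                                 # so Phi has orthonormal rows

Y = X @ Phi.T                             # CS measurements, one row per patch
assert np.allclose(Phi @ Phi.T, np.eye(M), atol=1e-8)
```

Each row of Y is the measurement vector y = Φx for one patch; the (Y, X) pairs form the training inputs and labels.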

Multi-channel deep networks for block-based CS (BCSNet) [33]
To deal efficiently with high-dimensional natural images, block-based compressed sensing was proposed: an image is partitioned into blocks, and each block is sampled and reconstructed independently. The problem block-based CS introduces is reduced image quality due to blocking artifacts, discussed in [34] and [35]. Methods proposed in [34][35] and [36] use an iterative algorithm in which, at each iteration, an approximation of each block is obtained via a projection operation, and a denoising operation [59] is applied to the image reassembled from the approximate blocks. Although the results obtained were satisfactory, the reconstruction time was too long. Many deep neural network-based algorithms were also proposed [27], [29], [37], [38], but blocking artifacts persisted, especially at low sampling rates. Reconstruction accuracy was also affected because neural network-based methods ignore the structural insight of CS reconstruction algorithms and are trained as black boxes, and the performance of deep learning image CS is reduced when different sampling rates are used for each block in an image. This paper proposes multi-channel deep networks for block-based CS, in which sampling rates are assigned per block and blocking artifacts are removed at the model level. It is called BCSNet because it builds the block-based CS algorithm into the learning network.
Figure 5: Framework for BCSNet [33].
The blocks, sampled at their multiple rates, are given to the deep reconstruction module as input. Blocking artifacts are removed by exploiting inter-block correlation. The reconstruction network is divided into residual layers, each corresponding to one iteration of the BCS algorithm. This approach is interesting because it combines both types of algorithms to achieve high-quality reconstruction.
It makes use of deep learning networks to learn the structure of BCS and produce good-quality outputs.

Deep Residual Reconstruction Network for Image CS (DR2Net) [39]
DR2Net comprises two modules: a linear mapping network and a residual network. The linear mapping network is implemented using a fully connected layer and generates a preliminary image, which must then be enhanced to achieve better image quality. This is done in the residual network using residual learning blocks [28][61]. The algorithm divides the initial image into 33 x 33 patches; CS measurements are recorded for each patch, and DR2Net uses them to reconstruct the image, as shown in figure 6. Once the linear mapping network generates the preliminary image, the residual network finds the residual using four residual learning blocks. Each block contains three convolution layers, the first two of which are followed by the ReLU activation function. In each block, the first convolution layer generates 64 feature maps using an 11x11 kernel, the second generates 32 feature maps with a 1x1 kernel, and the third uses a 7x7 kernel to generate the single feature map that is the output. Reconstruction proceeds by first extracting CS measurements from the 33x33 patches; these are taken as input to DR2Net, which outputs the reconstructed patches. The reconstructed patches form an intermediate reconstructed image, which is fed to a BM3D denoiser to remove any artifacts.
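DR2Net's two-stage structure can be summarized in a small sketch. Everything here is a placeholder under our own assumptions: the linear maps stand in for the trained fully connected layer and the four residual blocks; only the shape of the pipeline (preliminary reconstruction plus an additive residual correction) reflects the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 33 * 33                         # vectorized 33x33 patch
mr = 0.1                            # measurement rate
m = int(round(mr * n))

x = rng.random(n)                   # ground-truth patch (vectorized)
Phi = rng.standard_normal((m, n))
y = Phi @ x                         # CS measurements for this patch

# Stage 1: linear mapping network (placeholder random linear map).
W_lin = rng.standard_normal((n, m)) * 0.01
x_prelim = W_lin @ y                # preliminary reconstruction

def residual_net(patch):
    # Placeholder for the four residual learning blocks.
    return 0.1 * patch

# Stage 2: the residual network predicts a correction that is added back.
x_final = x_prelim + residual_net(x_prelim)
assert x_final.shape == (n,)
```

The key design choice is that the residual network only has to learn the (small, compressible) difference between the preliminary image and the ground truth, which is easier than learning the full mapping from measurements to pixels.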

Interpretable Optimization-Inspired deep network for image CS [38]
This algorithm, dubbed ISTA-Net and shown in figure 7, takes the iterative shrinkage-thresholding algorithm (ISTA) and maps it onto a deep network; each phase of the network corresponds to an iteration of ISTA. ISTA-Net uses back-propagation to learn parameters such as the shrinkage thresholds and nonlinear transforms end to end, so that all the parameters involved in ISTA are learned automatically by the deep network instead of being fixed by hand. ISTA-Net efficiently solves the proximal mapping problem associated with the nonlinear sparsifying transform, which also opens the way for other optimization algorithms to be mapped onto deep networks. The residuals of natural images and videos are more compressible, and this fact is used to design an enhanced version called ISTA-Net+, which strives for better CS performance and, from the results obtained, outperforms ISTA-Net in both speed and quality.
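For reference, the classical ISTA iteration that ISTA-Net unrolls looks like the sketch below. This is textbook ISTA for the l1-regularized least-squares problem, with the step size and threshold fixed by hand; in ISTA-Net, these quantities (and the sparsifying transform itself) are instead learned end to end.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm (shrinkage/thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(Phi, y, lam=0.01, n_iter=500):
    """ISTA for min_x 0.5*||y - Phi x||^2 + lam*||x||_1.
    Each iteration = gradient step on the data term + shrinkage,
    which is exactly what each ISTA-Net phase mimics."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - y)         # gradient of the data-fidelity term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Recover a sparse signal from compressed measurements.
rng = np.random.default_rng(5)
N, M, K = 128, 64, 5
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
x_hat = ista(Phi, Phi @ x_true)

rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
assert rel_err < 0.5
```

The hundreds of iterations this loop needs are precisely the cost that unrolling into a fixed, small number of learned phases eliminates.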
Figure 11 illustrates the outputs of these networks for Monarch: ISTA-Net outperforms the other algorithms, although at an MR of 0.3, CSNet (floating point) performs better. Figure 12 illustrates the outputs for Fingerprint.
For Fingerprint, BCSNet and CSNet (floating point) have similar performance with respect to PSNR and SSIM, and both perform much better than the other networks and the traditional methods.
In figure 13, the average runtimes are shown for all the networks and the traditional algorithms on a standard 256x256 image with MR 0.1. SDA has the fastest execution time, although its results are not as good as those of the networks surveyed in the previous sections. All the surveyed networks have execution times ranging from 0.01 s to 1.37 s, with ReconNet the fastest among them. These execution times are quite fast given the high reconstruction quality.
From comparisons with respect to PSNR and SSIM, it can be concluded that BCSNet and CSNet (floating point) outperform all the other networks. BCSNet does well because it uses a dynamic sampling rate depending on different blocks within an image. It also combines the traditional BCS with neural networks resulting in better PSNR and SSIM values. CSNet (floating point) does well because it uses neural networks to implement BCS [43] [44].
Figures 14 and 15 compare JPEG and JPEG2000 with all the surveyed networks, pitting the reconstruction algorithms against complete image codecs. Barbara (from the standard image data set) is used for this comparison at different quality factors. JPEG and JPEG2000 clearly perform significantly better than the networks with respect to PSNR and SSIM; figures 16 and 17 illustrate the outputs of these networks. However, these metrics are not perfect models of human perception, and from the outputs in [45], the images are subjectively of higher quality. There are several reasons why JPEG quantifiably performs better than these networks.

Conclusion
In this paper, we have surveyed deep learning algorithms for CS reconstruction of 2-D images. It is quite clear that deep learning coupled with CS is a potent combination, achieving both fast execution times and high-quality reconstructions. Where traditional algorithms fail to produce quality results within acceptable times, deep learning in CS succeeds, producing both acceptable run times and subjectively good reconstructed images. On comparison, BCSNet and CSNet (floating point) were the best-performing networks, and all the surveyed networks outperformed traditional methods. We conclude that deep learning in CS is the best way forward for reconstructing 2-D images. These networks were outperformed by JPEG and JPEG2000, but it should be remembered that the networks use only a fraction of the samples of the original image, whereas JPEG samples at a rate above the Nyquist rate.
As future work in this field, incorporating these networks into such codecs for image reconstruction could improve codec performance while exploiting the key advantage of CS: it collects only a fraction of the data, rather than discarding information after acquiring all of it, leading to more efficient data storage.