Patch-primitive driven compressive ghost imaging

: Ghost imaging has rapidly developed for about two decades and attracted wide attention from different research ﬁelds. However, the practical applications of ghost imaging are still largely limited, by its low reconstruction quality and large required measurements. Inspired by the fact that the natural image patches usually exhibit simple structures, and these structures share common primitives, we propose a patch-primitive driven reconstruction approach to raise the quality of ghost imaging. Speciﬁcally, we resort to a statistical learning strategy by representing each image patch with sparse coefﬁcients upon an over-complete dictionary. The dictionary is composed of various primitives learned from a large number of image patches from a natural image database. By introducing a linear mapping between non-overlapping image patches and the whole image, we incorporate the above local prior into the convex optimization framework of compressive ghost imaging. Experiments demonstrate that our method could obtain better reconstruction from the same amount of measurements, and thus reduce the number of requisite measurements for achieving satisfying imaging quality.


Introduction
Ghost imaging (GI) is a novel imaging technique which non-locally records a scene, and has drawn wide attention over the last two decades. In the scheme of GI, two correlated beams travel along different light paths, one light beam illuminates the scene and is collected by a bucket detector, the other one is directly recorded by a spatially-resolved detector. By correlating the outputs of these two light paths, one can reconstruct the spatially-resolved image of the scene. The "ghost" emphasizes that only a non-spatially-resolved detector is needed to detect the light interacting with the target scene.
GI has gone through three main development stages in terms of the adopted light sources: quantum entangled photons [1], classical thermal light [2][3][4], and programmable illuminations [5,6]. From quantum to classical to computational, GI is becoming more and more flexible, and has already been put into various practical applications, such as 3D reconstruction [7], fluorescence imaging [8], optical encryption [9,10], remote sensing [11], looking through atmospheric turbulence [12,13], object tracking [14,15], etc. Among the three, computational ghost imaging successfully transfers the complexity of ghost imaging from the experimental apparatus to computation, and makes it possible to enhance and extend GI with the aid of computational resources.
The reconstruction algorithms of ghost imaging mainly fall into two types: second-order (or higher-order) correlation and compressive sensing. The former one obtains the target image through calculating the second-order (or higher-order) correlations of the bucket detector measurements and the illumination patterns. This method suffers from low reconstruction quality due to limited measurements, although some variants have been proposed to improve the performance [16][17][18][19]. Differently, the latter one, compressive ghost imaging (CGI), reconstructs the ghost image based on compressive sensing, which has also been successfully used in other applications, such as phase retrieval [20][21][22]. Through exploiting the redundancy in the structure of natural images [23]. CGI enables ghost imaging from sub-Nyquist measurements and largely reduces the acquisition time. Besides, adaptive methods have been proposed to further decrease the requisite measurements [24,25].
The higher reconstruction quality of CGI is attributed to the utilization of pixel-wise prior knowledge (e.g., minimizing the total variation to enforce the local smoothness) or global prior features (e.g., forcing sparsity of DCT coefficients to ensure the dominance of low frequencies) of natural images. Other than the above priors, there also exist strong priors in image patches. Statistically, the image patches are of low dimension and exhibit simple structures. These structures can be decomposed into several primitives, and different structures may share some primitives. So far, the patch prior has been extensively studied and successfully utilized to achieve state-of-the-art performances in various computer vision tasks [26][27][28].
In this paper, we propose to unify the patch prior together with the pixel or global prior into the CGI reconstruction framework. As far as we know, this is the first time to utilizing patch prior in natural images for ghost imaging and it is nontrivial, because the collected measurements from the bucket detector encode the information of the holistic scene. Our studies demonstrate that, incorporating the patch prior provides great help to improve the reconstruction quality of the non-spatially resolved imaging technique. The remainder of the paper is organized as follows: Section 2 introduces our modeling and derivation. Section 3 demonstrates the effectiveness of our method on both synthetic and real captured data. Finally, Section 4 makes further discussions and summarizes this work.

Method
In general, we introduce a linear indexing operator to map each patch to the whole image and vice versa (as shown in Fig. 1), then the constraint defined locally and globally can be integrated together for an intensive utilization of natural image redundancy. As for the pixel-wise or global prior, we can either minimize the total variation or enforce the sparsity of DCT coefficients. As for the patch prior, we represent the image patches by a composition of several primitives, which depict the elementary local structures of natural images, and enforce the sparsity of the representation coefficients. Since both the total variation minimization and coefficient sparse- ness (either of local patch or the holistic image) can be achieved by minimizing a l 1 norm, we can unify different constraints within a convex optimization framework to reconstruct the target scene.
The image patches represent local regions of a natural image. Usually, the patches are defined in terms of fixed-size blocks (e.g. k × k pixels), which is much smaller than the original image size, as illustrated in the upper part of Fig. 1. Statistical studies suggests that natural images contain characteristic structures that set them apart from random images (i.e. the pixel intensities are random) [29]. Therefore, characterizing the structures of natural images, and formulating the properties effectively may lend insights into the recovery of natural images [30]. Researches have shown that, one can apply the sparse coding algorithm to a large image patch set to learn a primitive set indicating the basic structures of natural image patches [30,31], and each patch can be represented by sparse linear combination of these primitives. The subfigure framed with dashed box in Fig. 1 gives an schematic illustration of the process of learning the patch primitive set.
Concretely, given a large number of natural image patches {p 1 , ..., p U } ∈ R k×k which are randomly cropped from natural images databases as input, the goal of sparse coding is to find the patch primitive set {d 1 , ..., d V } ∈ R k×k and the sparse vectors of weight {s 1 , ..., where s u v denotes the vth element of s u , i.e. the representation coefficient upon patch primitive d v of patch p u . The sparse coding problem is formulated as: arg min Here the first term measures how well the patch primitives represent the image patches, and the second term adopts the L1 norm to enforce the sparsity of the representation coefficients.
The parameter β is a positive constant balancing the importance of two terms. We optimize the objective function over a large database of natural image patches to obtain an over-complete patch primitive set, which is applicable to sparsely represent various natural image patches. Note that the patch primitive set is usually over-complete (i.e. V > k 2 ). This point has been emphasized in [32], which states that the local structures described by the primitives may occur at continuum positions and scales, and over-complete patch primitives could allow for smooth interpolation along this continuum. In other words, over-completeness results in representation accuracy and coefficients sparseness tolerating small translations and scaling of local structures, and thus ensures high flexibility to diverse target images. Overall, the over-completeness of the dictionary ensures the coefficient sparsity an effective prior for general image patches. We also visualize the learned patch primitive set in Fig. 1, with each entry representing a primitive. Further looking into the details of the primitive set, we could find that the set of primitives describe visually elementary features, such as oriented and translated edges, curves, corners, blobs etc. For each patch from the target scene, we can represent it with a small number out of the whole primitive set, as plotted in Fig. 1. This patch prior has already been successfully incorporated to improve the performance of the classical computer vision tasks, such as denoising [26], superresolution [27] and deblurring [28], etc. In this paper, we propose to incorporate this patch prior into the CGI to further enhance its performance.
To formulate the problem more concisely in matrix form, we let D ∈ R k 2 ×V denote the overcomplete primitive set (each column as a vectorized form of a patch primitive), and term it as a over-complete dictionary. Given such a dictionary D, each patch could be represented as in which p ∈ R k 2 ×1 denotes the vectorized image patch, and s is the representation coefficient of p. We introduce the over-complete dictionary and the reconstruction coefficients into the CGI reconstruction as in [26]: where x denotes the whole image, R i j (x) is the linear image-to-patch mapping, which extracts the patch with its top left pixel locating at (i, j) in image x. Correspondingly, we denote R T i j (p i j ) as the inverse mapping, i.e. the patch-to-image operator, and the image x can be obtained by tiling all the non-overlapping patches, i.e.
Based on above notations, we can introduce the patch prior into the reconstruction algorithm of conventional CGI (CCGI): arg min The first objective term is used to impose the image prior defined over the whole image (e.g, enforcing a small total variation or sparse DCT coefficients), with Ψ being the transform matrix applied on the whole image. The second term imposes the local prior that each image patch is of sparse representation coefficients upon the over-complete dictionary. We use a weighting factor λ to balance these two objective terms. The first inequality constraint describes the data fidelity. Here Φ is the measurement matrix, with each row denoting a vectorized illumination pattern, y is the measurements, and variable ε is introduced for robustness to the measurement noise. The second constraint builds a mapping between the image patches and the whole image, and facilitates formulating local and global priors in a unified framework. The third constraint comes from the representation of image patches defined in Eq. (2). Comparatively, CCGI [23] recovers ghost images through minimizing the first objective term under the first inequality constraint. Since utilizing only the whole image prior is prone to smoothing out the thin structures in the target image, we introduce patch prior (the second objective term and two additional equality constraints) to preserve the local details better. In this paper, we name the extended optimization in Eq. (5) We could rewrite the problem in Eq. (6) in a similar way to [33] as arg min where µ is the penalty parameter of the measurement noise.
To solve the problem, we denote m, n as the number of patches along two dimensions of the image and i, j as the corresponding indices, and then reformulate the problem as arg miñ are two block diagonal matrices, with blkdiag(·) denoting diagonal concatenation of the input entries;s = [s T 11 , ..., s T mn ] T is a vector composed by concatenating the reconstruction coefficients of all m × n image patches.
By simple variable substitution, Eq. (8) could be further rewritten into arg miñ Eq. (9) falls into a typical convex optimization, and we resort to the alternating direction method of multipliers algorithm [34] to solve the model. The final reconstructed image could be obtained from the optimums * according to Eq. (4).

Experiments on simulation data
To demonstrate the performance of our method, we conduct a series of numerical simulations upon natural images. In implementation, the image size in this experiment are all 64 × 64 pixels. The measurements are generated by calculating their inner products with different random binary patterns. We set the patch size to be 8 × 8 pixels empirically. The over-complete dictionary is trained from 300 natural images from the Berkeley Segmentation Data Set 300 (BSDS300) [35]. For each natural image, we choose 250 patches at randomly chosen positions, and in sum we use 750, 000 patches. Each patch is normalized to zero mean before provided to the efficient sparse coding algorithm [31]. As shown in Fig. 1, there are in all 256 primitives. As discussed before, the number of primitives should be bigger than the dimension of the image patch to ensure the over-completeness, i.e., the sparsity of the patches' reconstruction coefficients. Although theoretically a larger primitive set leads to better reconstruction, adopting too large a primitive set would pose heavy computational load on the reconstruction algorithm. To balance the over-completeness and computation load, we set the primitive set size as 256, the same as [26], which was demonstrated empirically to achieve good performance for general natural images. Accounting for the de-mean normalization of the training patches, we add a DC patch (all the entries of the 8 ×8 primitive is set to be 1) to the trained dictionary. By vectorizing each primitive as a column vector, we could get the over-complete dictionary D (64 × 257).
For a better evaluation of the advantages of introducing the patch prior, we compare the reconstruction results with and without it. Without loss of generalization, we use two widely adopted Ψs-gradient operator and 2D-DCT domain transform matrix, as in [11,23,24]. To express compactly, we denote two transform matrices as Ψ tv and Ψ dct , respectively. As for the two parameters in Eq. (9), we found that experimentally, the algorithm is insensitive to their values as long as they fall into an appropriate but reasonably wide range. Throughout the experiments, we use the same parameter setting: λ = 0.1, β = 0.4 and µ = 1000. Besides, for a comprehensive evaluation of our approach, other than CCGI, we also compare our method with the correlation-based computational GI methods, including traditional GI (TGI) [6], differential GI (DGI) [16], normalized GI (NGI) [17] and iterative GI (IGI) [18]. Considering that our experiments were conducted with the sub-Nyquist sampling ratio (SSR), i.e, the number of illumination patterns is smaller than the resolution of the target image, in which case NGI and IGI exhibit similar performance to DGI (slightly better than TGI), we choose to show only the results of TGI and DGI here. To test the performance of our approach at different SSRs, for each scene we reconstruct the images with SSR ranging from 0.1 to 1 with an interval of 0.1. Here we choose to display the reconstruction results at the SSR with the most prominent performance difference in Fig. 2. Comparatively, CS-based methods (i.e. PCGI-TV, PCGI-DCT, CCGI-TV, CCGI-DCT) perform much better than the correlation-based methods (i.e. TGI, DGI). The suffixes 'TV' and 'DCT' respectively denotes using Ψ tv and Ψ dct as the transform matrix Ψ in Eq. (5). The large difference in reconstruction performance comes from their different reconstruction mechanisms: The correlation-based methods rely only on the statistical properties of the cross-correlation matrix of the illumination patterns, thus requires large number of measurements, usually more than the size of the target image. On the contrary, CS-based methods incorporate priors additionally and thus are of good quality even at sub-Nyquist ratio. Comparing PCGI methods (i.e. PCGI-TV, PCGI-DCT) with CCGI methods (i.e. CCGI-TV, CCGI-DCT), we could find that the latter demonstrates better reconstruction, especially in the areas with edges (e.g. the boundary of the letters 'GI', the blade of 'Leaf', the hat brim of 'Lena', the petal of the 'Flower', the gaps among the fish sales), lines (e.g. the petiole of the 'Leaf', the lines around the 'Eighttrigrams', the thin grooves of the 'Brickwall', the wicker of the 'Wickerwork'), and blobs (e.g. the dots in the 'Eight-trigrams', the stigma of the 'Flower'). The performance gain of PCGI is mainly attributed to the patch primitive prior. Observing the learned patch primitive set in Fig. 1, we can find that edges, lines and small blobs are typical patch primitives, thus image patches exhibiting such structures are more likely to be composed by a sparse combination of the learned primitives. Therefore, introducing patch prior would improve the performance by enhancing the high frequencies that tend to be smoothed out by imposing prior defined on the whole image-minimizing total variation (CCGI-TV) or favouring the dominance of low frequencies (CCGI-DCT).
Besides, we can find experimentally that the performances at different settings are slightly different for the images in Figs. 2(a) and 2(b). For images without periodic textures in Fig. 2(a), the reconstruction using total variation prior exhibits higher quality than using DCT prior. The advantage of introducing patch prior is more distinct when using Ψ dct . On the contrary, in Fig. 2(b), the algorithm with DCT prior works better in the cases with rich periodic textures. The patch prior exhibits a more marked improvement for CGI with total variation prior. The varying applicability to different cases can help choosing different scene specific algorithms.
For quantitative evaluation, we adopt mean square error (MSE) as the evaluation metric and compare the reconstruction with respect to SSR among four different algorithms. In Figs. Figs. 2(a) and 2(b). As the number of measurements is far from being sufficient for decent correlation-based reconstruction, the performances of both TGI and DGI are apparently inferior to those of CCGI and PCGI. Further comparing CCGI and PCGI, one can clearly see that reconstruction of PCGI is better than CCGI, especially at low SSRs from 0.2 to 0.5. This implies that our method can obtain the same reconstruction quality with much less measurements. As in Fig. 3(a), to retrieve an image with the same quality (MSE= 0.05), our method could reduce the measurements by 5.7% and 22% for Ψ tv and Ψ dct transform, respectively. Similarly, in Fig. 3(b), for image 'Wickerwork', with abundant periodic texture, our method can reduce 5.3% and 12.5% measurements to achieve reconstruction with MSE= 0.1. In all, these results demonstrate the contribution of our approach in enhancing the quality, and thus reducing the requisite acquisition time of CGI.

(a) and 3(b), we respectively plot the MSE comparison of 'Lena' and 'Wickerwork', as two representative examples for the images in
Considering the inevitable sensor noise in an imaging system, we further test the reconstruction performance at varying noise levels. We simulate imaging noise by superimposing additive Gausssian white noise with the signal to noise ratio (SNR) from 10dB to 70dB. In this experiment, we set SSR to be 0.25 and provide the reconstruction results of six different methods with MSE as the evaluation metric. We choose to show the comparison result of 'Lena' and 'Wickerwork' in Fig. 2. As plotted in Fig. 4, the reconstruction quality of all methods increases monotonously as the noise level decreases and become stable when the signal to noise is higher than 40dB. As can be seen, our method performs much better than the other methods. The comparison between using different Ψs is consistent with the comparison discussed above.

Experiments on real captured data
Then, we experimentally demonstrate our method on the data captured by the prototype system. The scheme of our experimental setup for computational GI is exhibited in Fig. 5. The illumination module is built by hacking a commercial projector. Firstly, the light emitted from the halogen lamp is successively converged by a condenser lens, collimated by an optical integrator and then adjusted by a shaping lens. Then, the light is modulated by the digital micro-mirror device (DMD) to generate random patterned illuminations. After beam expansion with a projector lens, the patterned illumination interacts with the scene and the outgoing photons are collected by a bucket detector after going through a converging lens. The measurements are collected by sampling the outputs of the detector using a 14bit acquisition board ART PCI8514, which performs analog-to-digital (AD) signal conversion and then transmits the data to the computer for follow-up reconstruction. The illumination pattern on the DMD is controlled by the computer, and the target image is retrieved from the illumination patterns and the bucket measurements. In our CGI setup, there are several factors that might influence the reconstruction performance. We investigate the influences from instrument specifications and adopt corresponding strategies to avoid the performance degeneration.
At the illumination side, since we use binary (i.e., {0, 1}) patterns, there is no quantization error during the DMD modulation. The AD at the acquisition board would introduce quantization error. Here we adopt a 14-bit digitalization depth, and the fluctuation range of the single pixel detector is 0-5000mV, so the resolution of AD is 0.3053mV. In our experiments, the measure-ment range among the patterns are respectively 825±67.6425mV and 1103±108.2650mV with respect to the letters 'GHOST IMAGE' and 'Ghost'. So the quantization error occurs at the acquisition board can also be ignored.
Theoretically, one can get stair shaped measurement transition, but the elapse caused by pattern transition of the illumination and integration of the detector would smooth the transition edges. In our experiment, the illumination frequency is set to be 100Hz. We set the acquisition frequency at 100 times of the illumination, i.e., 10000Hz. So the sampling period 100µs and there are 100 measurements for each pattern. Because the integration time is 0.625ns and the DMD transition takes 12µs, we discarded 5 points in the beginning and ending of the 100 measurements to avoid the influence of the elapse caused by either illumination or detector. In implementation, we take average over the 90 left samples as the measurement for each pattern.
The sampling frequency can also influence the reconstruction performance, so we specially compare the performances at different sampling frequencies (ranging from 1000Hz to 100000Hz). As in Fig. 6, as the sampling frequency increases, the fidelity of reconstruction improves. This trend is reasonable, since denser sampling would suppress the sensor noise better with respect to each pattern. Similar to the trend on simulated data, display in Fig. 5, the reconstruction quality stop improving when the sampling frequency exceeds 10000Hz. This is mainly due to that our method is robust to small noise, thus the reconstruction result will not be distinctly improved by further raising the SNR of the measurements. Therefore, in our experiment, we set the sampling frequency to be 10000Hz. In our experiments, we reconstruct the image 'Ghost' and the uppercase alphabet 'GHOST IMAGE'. The spatial resolution is 64 × 64 pixels. We collect 614 measurements (SSR= 0.15) for each image. As can been seen in Fig. 7, the comparison between our approach and conventional CGI exhibit a similar trend with that on simulation. Although the reconstructions are a little bit corrupted due to the noise during sampling, by introducing the patch prior, our method could still show its superiority in reconstructing the local details of the image. For example, we can obtain sharper edges of the alphabet 'GHOST IMAGE' and cleaner outline of the 'Ghost'.
Here we make further effort upon quantitative performance comparison. Although we have the image of the target scene used in film printing, taking the film image as ground truth suffers from misalignment with the reconstruction. Hence, we choose to use the high quality reconstruction result of PCGI-TV with a large SSR (i.e. SSR= 0.5) as a pesudo-ground truth (PGT), as shown in the first column of Fig. 7. Here we still use MSE as the evaluation metric. Corresponding to the reconstructed images from 2nd to 7th column in Fig. 7, the MSEs of 'GHOST IMAGE' are 0.460, 0.447, 0.091, 0.068, 0.149, 0.086, and those of 'Ghost' are 0.464, 0.441, 0.078, 0.054, 0.177, 0.089. The performance ranking is consistent to the visual comparison, and the effectiveness of the proposed approach is further validated.

Conclusions and discussions
In summary, this paper proposes to introduce the patch prior of natural images into the computational ghost imaging framework. Experiments show that our approach largely raises the reconstruction quality of ghost imaging. The superiority to conventional CGI is mainly attributed to utilizing the patch prior, which enforces each patch to be a sparse linear combination of primitives learned from a natural image database. Mathematically, the dimension of the optimization variable increases after introducing the patch primitives, so the proposed approach is of higher computational complexity than conventional CGI. Fortunately, the related heavy calculations, such as the matrix inversion to get greedy solution at each iteration, is scene independent and can be calculated off-line before collecting measurements and conducting reconstruction. Therefore, introducing patch prior does not lead heavy computation and can serve as a feasible solution for improving quality of ghost imaging.
One promising extension of our method is to perform adaptive dictionary learning to take advantage of the high flexibility of sparse coding, i.e. if we know the type of the target scene, we could choose the same type of training image patches to learn a specific over-complete dictionary for better reconstruction [36]. Convolutional sparse model [37,38] can also be utilized to incorporate the patch prior into the framework of CGI and has the potential of higher performance. Besides, we also plan to introduce self similarity existing in the target image to reduce the required measurements further. Then, dynamic ghost imaging tends to be feasible and will broaden the practical applications of CGI.