LEARNING CIRCULANT SENSING KERNELS

Abstract. In signal acquisition, Toeplitz and circulant matrices are widely used as sensing operators. They correspond to discrete convolutions and are easily or even naturally realized in various applications. For compressive sensing, recent work has used random Toeplitz and circulant sensing matrices and proved their efficiency in theory, by computer simulations, as well as through physical optical experiments. Motivated by recent work [8], we propose models to learn a circulant sensing matrix/operator for one- and higher-dimensional signals. Given the dictionary of the signal(s) to be sensed, the learned circulant sensing matrix/operator is more effective than a randomly generated one, and even slightly more so than a (non-circulant) Gaussian random sensing matrix. In addition, by exploiting the circulant structure, we extend the learning from the patch scale in [8] to the much larger image scale. Furthermore, we test learning the circulant sensing matrix/operator and the nonparametric dictionary together and obtain even better performance. We demonstrate these results using both synthetic sparse signals and real images.

1. Introduction. Compressive sensing (CS) [7,5] acquires a compressible signal from a small number of linear projections. Let x̄ denote an n-dimensional real or complex vector that is sparse under a certain basis Ψ, i.e., one can write x̄ = Ψθ̄ with a sparse θ̄. Let b := Φx̄ represent a set of m linear projections of x̄. The basis pursuit problem

(1) BP: min_θ ‖θ‖_1, s.t. ΦΨθ = b,

as well as several other methods, has been known to return a sparse vector θ and thus recover x̄ = Ψθ under certain conditions on the sensing matrix Φ and basis Ψ. The Gaussian random matrix Φ is among the first studied CS matrices and has nice properties. Given enough measurements, (1) with a Gaussian random matrix Φ is guaranteed to exactly recover a sparse x̄ with high probability. However, in many CS applications, the acquisition of the linear projections Φx̄ requires a physical implementation. In most cases, the use of an i.i.d. Gaussian random matrix Φ is either impossible or overly expensive. This motivates the study of easily implementable CS matrices. Two types of such matrices are the Toeplitz and circulant matrices, which have been shown to be almost as effective as the Gaussian random matrix for CS encoding/decoding. A Toeplitz matrix T has the same element on each diagonal, and a circulant matrix C is a special Toeplitz matrix with each row obtained by shifting the preceding row one element to the right, namely,

(2) T = [t_n, t_{n−1}, ..., t_1; t_{n+1}, t_n, ..., t_2; ...; t_{2n−1}, t_{2n−2}, ..., t_n].

When matrix T satisfies the additional property that t_i = t_{n+i}, ∀i, it becomes a circulant matrix C. For more about Toeplitz and circulant matrices, please refer to the review paper [11]. Since a (partial) Toeplitz matrix has very similar theoretical and computational properties to a (partial) circulant matrix of the same size, our discussions below are based exclusively on circulant matrices. Using Toeplitz, rather than circulant, matrices incurs only insignificant computational overhead to the methods proposed in this paper.
1.1. Circulant Compressive Sensing. In various physical domains, it is easy to realize Cx since it is equivalent to the discrete convolution c * x for a certain vector c; if P is a row-selection (downsampling) operator, P Cx becomes circulant CS measurements of x̄. Using either Toeplitz or circulant matrices, Tropp et al. [27] describe a random filter for acquiring a signal x̄; Haupt et al. [12] describe a channel estimation problem to identify a vector x̄ (called impulse responses) that characterizes a discrete linear time-invariant (LTI) system; Meng et al. [19,20] apply it to high-resolution OFDM channel estimation. Random convolutions can also be applied in some imaging systems in which convolutions either naturally arise or can be physically realized [3,13,17,16,24]. Furthermore, random convolutions can be realized by an optical correlator [24]. Since any circulant matrix C in (2) can be diagonalized by a Fourier transform, i.e., obeying

(3) C = F D F*,

where F is the discrete Fourier matrix of the same size as C and D is a diagonal matrix, implementing Cx is equivalent to implementing F D F* x, which can be realized through an optical lens and a static spatial light modulator (SLM) [29]. Recently, the use of Toeplitz and circulant matrices has been proposed for compressive MR imaging by Liang et al. [14].
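As a concrete check of the diagonalization (3), the following sketch (our own illustrative NumPy code, not the paper's; NumPy's FFT convention differs from the paper's F by a conjugate and scaling) verifies that multiplying by a circulant matrix equals an FFT-based product:

```python
import numpy as np

# Sketch: a circulant matrix times a vector equals an FFT-diagonalized
# product, i.e. C x = ifft(diag(d) fft(x)) with d the spectrum of the kernel.
rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)                 # first column of C (kernel)
# Build C explicitly: column k is c circularly shifted down by k places.
C = np.column_stack([np.roll(c, k) for k in range(n)])
x = rng.standard_normal(n)

# Convolution theorem: C x = ifft(fft(c) * fft(x)), i.e. D = diag(fft(c)).
y_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x))
assert np.allclose(C @ x, y_fft.real, atol=1e-10)
```

The explicit n × n matrix is never needed in practice; only the length-n spectrum fft(c) is stored.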
There are rich theoretical results on circulant matrices for CS. In [2], Toeplitz measurement matrices are constructed with i.i.d. random entries of ±1 or {−1, 0, 1}; their downsampling effectively selects the first m rows; and the number of measurements for stable ℓ1 recovery is shown to be m ≥ O(k^3 · log(n/k)), where k is the signal sparsity. The work [12] selects the first m rows of a Toeplitz matrix with i.i.d. Bernoulli or Gaussian entries for sparse channel estimation. Their scheme requires m ≥ O(k^2 · log n) for stable ℓ1 recovery. The work [20] establishes stable recovery under the condition m ≥ O(k^2 log(n/k)). In [24], random convolution with either random downsampling or random demodulation is proposed and studied. It is shown that the resulting measurement matrix is incoherent with any given sparse basis with high probability and ℓ1 recovery is stable given m ≥ O(k · log n + log^3 n). Recent results in [23] show that several random circulant matrices satisfy the restricted isometry property (RIP) in expectation and with high probability given m ≥ O(max{k^{3/2} log^{3/2} n, k log^2 k log^2 n}) with arbitrary downsampling, and thus guarantee exact and stable CS recovery. We note that all these results are based on random circulant matrices, so they do not apply to the optimized circulant matrices in this paper. We demonstrate that learned circulant matrices achieve even better performance.
The use of circulant sensing matrices also allows faster signal and image recovery. Practical algorithms (e.g., [28,30,31,34,32]) for CS are based on performing operations including multiplications with Φ and Φ*. For a partial circulant matrix Φ = P C, Φx and Φ*y can each be quickly computed by two fast Fourier transforms (FFTs) and simple component-wise operations. This is much cheaper than multiplying a general matrix with a vector. For image recovery, a splitting algorithm taking advantage of the circulant structure has been proposed in [33] and shows both satisfactory recovery quality and speed. These fast algorithms apply to the learned circulant matrices in this paper.
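The fast products with Φ = P C and Φ* can be sketched as follows (illustrative NumPy code with names of our choosing; the real kernel makes Φ* = Φᵀ here):

```python
import numpy as np

# Sketch: fast products with a partial circulant matrix Phi = P C using one
# FFT pair each, instead of dense matrix-vector multiplication.
rng = np.random.default_rng(1)
n, m = 16, 6
c = rng.standard_normal(n)                         # circulant kernel (real)
rows = rng.choice(n, size=m, replace=False)        # P: row selection
C = np.column_stack([np.roll(c, k) for k in range(n)])
Phi = C[rows, :]                                   # dense reference

d = np.fft.fft(c)                                  # spectrum of C

def phi_mul(x):                                    # Phi @ x via FFT/IFFT
    return np.fft.ifft(d * np.fft.fft(x))[rows]

def phi_adj(y):                                    # Phi^* @ y via FFT/IFFT
    z = np.zeros(n, dtype=complex)
    z[rows] = y                                    # P^T y (zero-filled upsample)
    return np.fft.ifft(np.conj(d) * np.fft.fft(z))

x = rng.standard_normal(n)
y = rng.standard_normal(m)
assert np.allclose(Phi @ x, phi_mul(x).real, atol=1e-10)
assert np.allclose(Phi.T @ y, phi_adj(y).real, atol=1e-10)
```

Both products cost O(n log n) regardless of m, which is what makes circulant sensing attractive inside iterative solvers.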

1.2. Learning dictionaries and sensing matrices. In CS, signal and image reconstructions are based on how they are sparsely represented. The sparse representation involves a choice of dictionary, a set of elementary signals (or atoms) used to sparsely decompose the underlying signal or image. There are analytic dictionaries and learned dictionaries. Examples of analytic dictionaries include the discrete cosine basis, various wavelet bases, as well as tight frames. Some of them are orthogonal while others are over-complete. Their analytic properties have been studied, and they feature fast implementations; hence, they have found wide applications. Properly learned (as opposed to analytic) bases can give rise to even sparser representations of signals and, in particular, images, so they can give better encoding and decoding performance than the analytic dictionaries; see [22,10,21] for examples and explanations. Although most theoretical results of CS recovery do not apply to learned dictionaries and optimized sensing matrices, one useful tool is the so-called mutual coherence between a dictionary Ψ and a sensing matrix Φ: with D := ΦΨ = [d_1, . . . , d_K], it is defined as [15]

(4) µ(D) := max_{1≤i<j≤K} |d_i* d_j| / (‖d_i‖_2 ‖d_j‖_2).

A smaller µ(D) tends to allow more Ψ-sparse signals x̄ to be successfully recovered from measurements Φx̄ via various CS algorithms. Hence, Elad [9] seeks means to reduce µ(D). His work demonstrates improved recovery quality with learned (non-circulant) sensing matrices Φ, and it has motivated the subsequent work [8], which is not based on minimizing µ(D) but instead pursues (ΦΨ)*(ΦΨ) ≈ I, where I denotes the identity matrix. Note that (ΦΨ)*(ΦΨ) ≈ I directly targets the RIP of ΦΨ, because in that case, any k × k leading minor of (ΦΨ)*(ΦΨ) also approximates the identity matrix.
The results of [8] are even better than those in [9] since µ(D) is a worst-case characteristic, whereas (ΦΨ)*(ΦΨ) ≈ I tends to reduce the average of the off-diagonal elements of (ΦΨ)*(ΦΨ) and improve the recovery performance on average. Also, the latter is easier to compute.
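The mutual coherence (4) is straightforward to evaluate; the sketch below (our own illustrative code, with arbitrary toy sizes) computes µ(D) for D = ΦΨ:

```python
import numpy as np

# Sketch: mutual coherence mu(D), the largest normalized inner product
# between distinct columns of D = Phi Psi.
def mutual_coherence(D):
    Dn = D / np.linalg.norm(D, axis=0)           # normalize columns
    G = np.abs(Dn.conj().T @ Dn)                 # Gram-matrix magnitudes
    np.fill_diagonal(G, 0.0)                     # ignore self inner products
    return G.max()

rng = np.random.default_rng(2)
Phi = rng.standard_normal((8, 32))               # toy sensing matrix
Psi = rng.standard_normal((32, 32))              # toy dictionary
mu = mutual_coherence(Phi @ Psi)
assert 0.0 <= mu <= 1.0
```

By contrast, the criterion of [8] scores the whole Gram matrix at once, e.g. as np.linalg.norm(Dn.conj().T @ Dn - np.eye(Dn.shape[1])), which penalizes off-diagonal entries on average rather than only the worst pair.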
1.3. Contributions. We are motivated by [8] and propose numerical methods to minimize ‖(ΦΨ)*(ΦΨ) − I‖_F, where Φ is either a full or partial circulant matrix.
Specifically, we determine the spectrum vector d = diag(D) of Φ = P C = P (F D F*) by numerically optimizing the magnitudes of the entries of d and then assigning them uniformly random phases. Note that if a circulant matrix C is generated by either assigning a random vector as its first row or following Romberg's strategy [24], then the corresponding vector d has uniformly random phases. The only, yet key, difference among the different approaches lies in the magnitudes of d. Romberg [24] uniformly assigns a unit magnitude, whereas we choose to optimize the magnitudes in order to adapt to the training data. Like [8], we also learn Φ and Ψ together, performing the so-called coupled learning of Φ and Ψ. While the results of [8] are limited to the one-dimensional case, we take advantage of the circulant structure and deal with more than one dimension, enabling learned circulant sensing for signals such as images and videos. Furthermore, we address the patch-scale limitation of [8], namely, that the sensing matrix size n must equal the dictionary atom size; if the dictionary for a 512 × 512 image is formed by 8 × 8 patches, the sensing matrix generated in [8] has only 64 = 8 × 8 columns and applies to vectors of length 64, instead of 512 × 512. Hence, the learned sensing matrix cannot be applied to the entire image. We remove this limitation and perform image-scale learning by generating circulant sensing operators applicable to entire signals or images.
Our approaches are tested on synthetic 1D signals, as well as images from the Berkeley segmentation dataset [18]. The learned circulant sensing matrix gives better recoverability than both random circulant and Gaussian random matrices. For real image tests, the coupled learning approach achieves even better recovery performance.
1.4. Notation. We let (·)ᵀ and (·)* denote transpose and conjugate transpose, respectively. conj(·) stands for the conjugate operator. Define A • B = Σ_{i,j} A_{ij} B_{ij} and ⟨A, B⟩ = Σ_{i,j} conj(A_{ij}) B_{ij} for any two matrices A, B of the same dimension. vec(A) is a vector formed by stacking all the columns of A, and diag(·) is defined in the same way as the MATLAB function diag, which either extracts the diagonal entries of a given matrix to form a vector or, given a vector, forms a diagonal matrix with the vector's entries. In addition, ⊙ and ⊘ denote component-wise multiplication and division, respectively.
The rest of this paper is organized as follows. Section 2 overviews one and two dimensional circulant correlations. A two-step procedure to optimize a partial circulant sensing matrix/operator is described in section 3. Section 4 discusses how to form an image-scale dictionary from a patch-scale one, and section 5 describes how the involved optimization problems are solved. Numerical results are presented in section 6.

2. Overview of circulant correlations. Given a kernel v ∈ C^n, the circulant correlation x ⊛ v of a vector x ∈ C^n is defined as

(5) (x ⊛ v)_k = Σ_{i=1}^n x_i v_{k_i}, with k_i = mod(n + i − k, n) + 1, for k = 1, . . . , n.

Then x ⊛ v is equivalent to the convolution x * c with c_i = v_{n−i+1}, where the convolution is defined as (x * c)_k = Σ_{i=1}^n x_i c_{k_i}, for k = 1, . . . , n, with k_i = mod(n + k − i − 1, n) + 1. We next introduce three methods to compute x ⊛ v. First, introducing the n × n cyclic permutation matrix

(6) P_n = [0, 0, ..., 0, 1; 1, 0, ..., 0, 0; 0, 1, ..., 0, 0; ...; 0, 0, ..., 1, 0],

we can write formula (5) as x ⊛ v = [v, P_n v, P_n^2 v, . . . , P_n^{n−1} v]ᵀ x. Multiplying by P_n from the left circularly down-shifts by a row, and multiplying by P_n from the right circularly right-shifts by a column. For example, given v = [1, 2, 3]ᵀ and x = [−1, 0, 1]ᵀ, y = x ⊛ v equals

y = [1, 2, 3; 3, 1, 2; 2, 3, 1] [−1; 0; 1] = [2; −1; −1].

Secondly, we can compute x ⊛ v via the matrix–vector multiplication Cx with the circulant matrix

(7) C = [v, P_n v, P_n^2 v, . . . , P_n^{n−1} v]ᵀ.

Thirdly, x ⊛ v can be quickly computed by two fast Fourier transforms and some component-wise multiplications as described in the following lemma, which is a restatement of the convolution theorem.
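The three methods can be sketched and cross-checked on the 3-vector example above (our own illustrative NumPy code, using NumPy's FFT convention and 0-based indexing):

```python
import numpy as np

# Sketch: three ways to compute the circulant correlation x ⊛ v —
# the direct definition, the permutation-built matrix, and an FFT pair.
def corr_direct(x, v):
    n = len(x)
    return np.array([sum(x[i] * v[(i - k) % n] for i in range(n))
                     for k in range(n)], dtype=float)

def corr_matrix(x, v):
    n = len(x)
    P = np.roll(np.eye(n), 1, axis=0)            # cyclic down-shift matrix
    rows = [np.linalg.matrix_power(P, k) @ v for k in range(n)]
    return np.stack(rows) @ x                    # C = [v, Pv, ..., P^{n-1}v]^T

def corr_fft(x, v):                              # correlation via FFTs
    return np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(v))).real

v = np.array([1.0, 2.0, 3.0])
x = np.array([-1.0, 0.0, 1.0])
for f in (corr_direct, corr_matrix, corr_fft):
    assert np.allclose(f(x, v), [2.0, -1.0, -1.0])
```

The FFT route is the one used throughout the paper, since it avoids forming any n × n matrix.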
Lemma 2.1. Any circulant matrix C in (7) can be written as C = F D F*, where F is the n × n unitary discrete Fourier transform matrix, i.e., F_{ij} = (1/√n) e^{−2π√−1 (i−1)(j−1)/n}, for i, j = 1, . . . , n, and D = diag(d) with d = √n F v. Remark 1. The matrix C is real-valued if d is conjugate symmetric, namely, d_i = conj(d_ī) for every i, where ī = mod(n − i + 1, n) + 1. Imposing conjugate symmetry leads to a real-valued C and reduces the degrees of freedom of d by nearly half.
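Remark 1 can be checked numerically; the sketch below (illustrative code in NumPy's FFT convention, which differs from the paper's F by a conjugate) builds a conjugate-symmetric spectrum from a real kernel and confirms the resulting C is real:

```python
import numpy as np

# Sketch: a conjugate-symmetric spectrum d yields a real-valued circulant C.
rng = np.random.default_rng(3)
n = 8
c = rng.standard_normal(n)                  # real kernel
d = np.fft.fft(c)                           # conjugate symmetric since c is real
assert np.allclose(d[1:], np.conj(d[1:][::-1]))   # d_i = conj(d_{i-bar})

F = np.fft.fft(np.eye(n)) / np.sqrt(n)      # unitary DFT matrix
C = F.conj().T @ np.diag(d) @ F             # assemble C from its spectrum
assert np.max(np.abs(C.imag)) < 1e-10       # C is real-valued
assert np.allclose(C[:, 0].real, c)         # and its kernel is c
```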
2D circulant correlation inherits the nice properties of 1D circulant correlation. Given a kernel M ∈ C^{n1×n2}, the 2D circulant correlation Y = X ⊛ M of a given matrix X ∈ C^{n1×n2} is defined by

(8) Y_{ts} = Σ_{i=1}^{n1} Σ_{j=1}^{n2} X_{ij} M_{t_i s_j},

where t_i = mod(n1 + i − t, n1) + 1 and s_j = mod(n2 + j − s, n2) + 1. Similar to the 1D case, X ⊛ M can be computed in three ways. First, with cyclic permutation matrices P_{n1} and P_{n2}, formula (8) can be written as

(9) Y_{ts} = M_{t,s} • X, with the notation M_{t,s} := P_{n1}^{t−1} M (P_{n2})^{s−1} for 1 ≤ t ≤ n1 and 1 ≤ s ≤ n2.

Secondly, Y = X ⊛ M can be computed by a matrix–vector multiplication via vec(Y) = C_2 vec(X), where C_2 is the n1n2 × n1n2 matrix whose (t + (s − 1)n1)th row is vec(M_{t,s})ᵀ. It is not difficult to verify that C_2 is a block-circulant matrix. Thirdly, we can quickly compute X ⊛ M through two fast 2D Fourier transforms and some component-wise multiplications according to the following lemma, which is a restatement of the 2D convolution theorem.
Lemma 2.2. Any 2D circulant correlation can be written as X ⊛ M = F_2(V(F_2*(X))), where F_2 is the orthogonal 2D discrete Fourier transform operator on C^{n1×n2}, F_2* is the adjoint operator of F_2 as well as its inverse operator, and V is an operator on C^{n1×n2} defined by V(X) = V ⊙ X for any X ∈ C^{n1×n2} with V = √(n1 n2) F_2(M).
Remark 2. The block-circulant matrix C_2 corresponding to the 2D circulant operator is real-valued if V is conjugate symmetric, namely, V_{ij} = conj(V_{ī j̄}) for every pair i, j, where ī = mod(n1 − i + 1, n1) + 1 and j̄ = mod(n2 − j + 1, n2) + 1.
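Lemma 2.2's FFT route can be checked against definition (8) directly (our own illustrative sketch in NumPy's convention, with 0-based indices):

```python
import numpy as np

# Sketch: 2D circulant correlation X ⊛ M by its definition (quadruple loop)
# and via two 2D FFTs plus a component-wise product.
def corr2_direct(X, M):
    n1, n2 = X.shape
    Y = np.zeros((n1, n2))
    for t in range(n1):
        for s in range(n2):
            for i in range(n1):
                for j in range(n2):
                    Y[t, s] += X[i, j] * M[(i - t) % n1, (j - s) % n2]
    return Y

def corr2_fft(X, M):
    return np.fft.ifft2(np.fft.fft2(X) * np.conj(np.fft.fft2(M))).real

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 5))
M = rng.standard_normal((4, 5))
assert np.allclose(corr2_direct(X, M), corr2_fft(X, M), atol=1e-10)
```

The FFT version costs O(n1 n2 log(n1 n2)) versus O((n1 n2)^2) for the direct sum.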

3. Learning circulant sensing matrix/operator. In this section, we illustrate how to learn the 1D kernel v in formula (5) and the 2D kernel M in formula (8), as well as their corresponding downsampling operators P (row selection) and P_Ω (sample selection); hence, we form a partial circulant sensing matrix P C ∈ C^{m×n} and a partial 2D circulant sensing operator P_Ω C_2, respectively. Here, P selects m out of the n rows of C, and P_Ω collects the entries in Ω ⊂ {1, . . . , n1} × {1, . . . , n2} and forms a column vector. Remark 3. In our experiments, we normalize each column of the sensing matrix Φ, for otherwise the recovery can be numerically unstable, in particular when Φ is a Gaussian random matrix. Note that normalizing P C (or P_Ω C_2) is equivalent to multiplying a diagonal matrix (or applying a component-wise operator) on the right, and thus it only slightly increases the computation.
3.1. Learning 1D circulant kernel and downsampler. Let Ψ ∈ C^{n×K} be a given dictionary such that the underlying signal x̄ = Ψθ̄, where θ̄ is a (nearly) sparse vector. Following [8], we would like to design a partial circulant sensing matrix Φ = P C such that Ψ*Φ*ΦΨ ≈ I, where C is an n × n circulant matrix and P is an m × n downsampling matrix. To construct a partial circulant matrix Φ, we shall determine P and C. We first learn the circulant matrix C and then the downsampler P. According to Lemma 2.1, C = F D F*, where the diagonal matrix D = diag(d) with entries d = √n F v and v is the kernel of C. Therefore, learning C is equivalent to choosing the best kernel v or vector d. Our approach is based on solving

(10) min_d ‖Ψ* C* C Ψ − I‖_F^2, where C = F diag(d) F*.

Given Ψ, the matrix B := F*Ψ is constant. Let

(11) z_i := |d_i|^2, i = 1, . . . , n,

which are the unknowns to be determined. We have diag(d)* diag(d) = diag(z) and Ψ* C* C Ψ = B* diag(z) B. Hence, we reduce (10) to

(12) min_z ‖B* diag(z) B − I‖_F^2, s.t. z ≥ 0.

Problem (12) is a convex quadratic program and can be reformulated as a nonnegative least-squares problem. Given a solution z = z_opt to (12), any d obeying (11) gives C = F diag(d) F* as a solution to (10). Since only the magnitudes |d_i|, i = 1, . . . , n, are specified, the phases of the d_i are still to be determined. Generally, phases encode the location of the information in a signal. Since such location is unknown at the time of sensing, we choose to generate the phase of every d_i uniformly at random, as suggested by [24]. We mention that if Ψ is an orthogonal basis or tight frame, i.e., ΨΨ* = I, the solution of (12) is simply the all-one vector, and in this case our generated d is the same as that in [24]. However, when ΨΨ* ≠ I, e.g., for all the learned dictionaries in our experiments, the solution of (12) is generally not the all-one vector, and the optimized d is different from that in [24]. Models (10) and (12) lead to significant improvements in sensing efficiency, as we shall demonstrate by numerical examples in section 6.
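Model (12) can be sketched on toy sizes as follows. The paper solves it with quadprog; this illustrative NumPy code (all names and sizes are ours) instead runs a simple projected gradient loop with exact line search, which suffices to show the reformulation:

```python
import numpy as np

# Sketch of (12): choose z = |d|^2 >= 0 minimizing ||B* diag(z) B - I||_F^2,
# where B = F* Psi, via projected gradient (not the paper's quadprog solver).
rng = np.random.default_rng(5)
n, K = 12, 8
Psi = rng.standard_normal((n, K))
Psi /= np.linalg.norm(Psi, axis=0)          # unit-norm columns, as in the paper
F = np.fft.fft(np.eye(n)) / np.sqrt(n)      # unitary DFT matrix
B = F.conj().T @ Psi

def obj(z):
    return np.linalg.norm(B.conj().T @ np.diag(z) @ B - np.eye(K)) ** 2

z = np.ones(n)                              # start from Romberg's flat spectrum
for _ in range(200):
    R = B.conj().T @ np.diag(z) @ B - np.eye(K)          # residual
    g = np.array([2 * np.real(B[i] @ R @ B[i].conj()) for i in range(n)])
    Rg = B.conj().T @ np.diag(g) @ B                     # directional term
    t = np.real(np.vdot(Rg, R)) / max(np.real(np.vdot(Rg, Rg)), 1e-12)
    z_new = np.maximum(z - t * g, 0.0)                   # project onto z >= 0
    if obj(z_new) < obj(z):                              # keep only descent steps
        z = z_new
assert np.all(z >= 0) and obj(z) <= obj(np.ones(n))
```

The optimized magnitudes sqrt(z) would then be paired with uniformly random phases to form d, as described above.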
Remark 4. The above procedure generates a complex-valued C. To obtain a real-valued C, one shall add the constraints z_i = z_{n−i+2} for i = 2, . . . , n to problem (12) and then generate a conjugate symmetric d from the solution z_opt.
When Ψ is square, i.e., n = K, we also tested an alternative model as follows.
Although it does not work as well as (10), we believe it is worth conveying to the readers. The model is

(13) min_d ‖CΨ − I‖_F^2, where C = F diag(d) F*,

which is equivalent to

(14) min_d d* A d − 2 Re(d* g),

where A is a diagonal matrix with diagonal entries given by the vector diag(F*ΨΨ*F) and g := conj(diag(F*ΨF)).
Since (14) is component-wise separable in d_i, it is easy to derive its closed-form solution

(15) d_i = conj((F*ΨF)_{ii}) / (F*ΨΨ*F)_{ii}, i = 1, . . . , n.

Our numerical experience suggests that this solution (and thus model (13)) is not as effective as that of (10). Moreover, Ψ is usually overcomplete, in which case (13) is inapplicable. After the circulant matrix C is determined, we optimize P and thus fully determine Φ. A simple yet effective choice is random selection, that is, letting P select m out of the n rows of C uniformly at random. This tends to work well on signals without dominating frequencies.
We can also choose to minimize ‖Ψ*Φ*ΦΨ − I‖_F so that Ψ*Φ*ΦΨ ≈ I. Let p_i be the binary variable indicating the selection of row i, i.e.,

(16) p_i = 1 if matrix P selects row i, and p_i = 0 otherwise.
Since Φ = P C and P*P = diag(p), we shall solve

(17) min_p ‖Ψ*C* diag(p) CΨ − I‖_F^2, s.t. p ∈ {0, 1}^n, Σ_i p_i = m.

Letting w_i* denote the ith row of CΨ, we have Ψ*C* diag(p) CΨ = Σ_i p_i w_i w_i*, so (17) is equivalent to the binary quadratic program

(18) min_p ‖W p − vec(I)‖_2^2, s.t. p ∈ {0, 1}^n, Σ_i p_i = m,

where W := [vec(w_1 w_1*), . . . , vec(w_n w_n*)] is a matrix of K^2 rows and n columns. By a simple calculation, we can write (18) in the equivalent form

(19) min_p p*(W*W)p − 2 Re((W* vec(I))* p), s.t. p ∈ {0, 1}^n, Σ_i p_i = m.

Problem (19) is NP-hard in general and can be solved, up to moderate sizes, by solvers such as Gurobi [4].
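The objective of (17) is easy to evaluate for any candidate row set. The toy sketch below (our own code; the paper uses Gurobi on the exact binary program, not this greedy heuristic) illustrates the objective with a simple greedy row selection:

```python
import numpy as np

# Sketch: pick m rows of C (the matrix P) to make Psi* C* P* P C Psi ≈ I,
# here by greedy forward selection on the Frobenius-norm objective.
rng = np.random.default_rng(6)
n, K, m = 16, 10, 6
Psi = rng.standard_normal((n, K))
c = rng.standard_normal(n)
C = np.column_stack([np.roll(c, k) for k in range(n)])
W = C @ Psi                                     # ith row is w_i^*

def fit(rows):
    G = W[rows].conj().T @ W[rows]              # Psi* C* P* P C Psi
    return np.linalg.norm(G - np.eye(K))

chosen = []
for _ in range(m):                              # greedily add the best new row
    best = min((i for i in range(n) if i not in chosen),
               key=lambda i: fit(chosen + [i]))
    chosen.append(best)
assert len(set(chosen)) == m
```

Unlike an exact binary-QP solver, greedy selection carries no optimality guarantee, but it evaluates the same objective and scales easily.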

3.2. Learning 2D circulant kernel and downsampler. A 2D circulant operator C_2 is often used in imaging [25,29]. It is not equivalent to a circulant matrix even if the image is vectorized, but it can be reduced to a block-circulant matrix with circulant blocks [6]. Given a dictionary Ψ = [ψ_1, ψ_2, . . . , ψ_K] where each ψ_i ∈ C^{n1×n2}, we define a linear operator Q on C^K by Q(θ) = Σ_{i=1}^K θ_i ψ_i for θ ∈ C^K. An array X ∈ C^{n1×n2} is sparse with respect to the dictionary Ψ if it can be represented as X = Q(θ) with a sparse θ ∈ C^K. Let P_Ω be the downsampling operator on C^{n1×n2} defined at the beginning of this section. Then the basis pursuit problem to recover X can be written as

(20) min_θ ‖θ‖_1, s.t. P_Ω C_2 Q(θ) = b,

where b = P_Ω C_2(X) contains the measurements of X. To improve the recoverability of the ℓ1 optimization problem (20), we shall try to make (C_2 Q)* C_2 Q close to the identity operator I on C^K. The adjoint operator of Q is defined as Q*(X) = (⟨ψ_1, X⟩, . . . , ⟨ψ_K, X⟩)ᵀ. Hence, for any θ ∈ C^K, ((C_2 Q)* C_2 Q(θ))_t = Σ_{s=1}^K ⟨C_2(ψ_t), C_2(ψ_s)⟩ θ_s, and making (C_2 Q)* C_2 Q ≈ I is equivalent to making G ≈ I, where G_{ts} := ⟨C_2(ψ_t), C_2(ψ_s)⟩, i.e., solving

(21) min_V ‖G − I‖_F^2.

G can be simplified as follows. According to Lemma 2.2, we have C_2(ψ_s) = F_2(V(F_2*(ψ_s))), and since F_2 preserves inner products, G_{ts} = ⟨V(F_2*(ψ_t)), V(F_2*(ψ_s))⟩ = ⟨V*V F_2*(ψ_t), F_2*(ψ_s)⟩, where V* is the adjoint operator of V. Let U = conj(V) ⊙ V and define U := V*V. We have U(X) = U ⊙ X for any X ∈ C^{n1×n2}. Let u = vec(U) and Y = [y_1, y_2, . . . , y_K] with y_s = vec(F_2*(ψ_s)) for s = 1, 2, . . . , K. Then G_{ts} = ⟨V*V F_2*(ψ_t), F_2*(ψ_s)⟩ = (u ⊙ y_t)* y_s = u*(conj(y_t) ⊙ y_s). Therefore, problem (21) is equivalent to the convex quadratic program

(22) min_u Σ_{t,s} |u*(conj(y_t) ⊙ y_s) − δ_{ts}|^2, s.t. u ≥ 0,

where δ_{ts} = 1 if t = s and 0 otherwise. Given a solution u = u_opt of (22), we can obtain the matrix U via u = vec(U). Any V satisfying U = conj(V) ⊙ V defines C_2 = F_2 V F_2* as a solution to (21). As done in section 3.1 for the 1D circulant matrix, we choose to generate the phase of each entry V_{ts} uniformly at random.
Given C_2, we can choose Ω uniformly at random or optimize P_Ω such that (P_Ω C_2 Q)*(P_Ω C_2 Q) ≈ I, which can be achieved by solving

(23) min_Ω ‖G − I‖_F^2, where G_{ts} := ⟨P_Ω C_2(ψ_t), P_Ω C_2(ψ_s)⟩.

Let n = n1 n2 and D_Ω ∈ C^{n×n} be a diagonal matrix with diagonal entries defined by (D_Ω)_{k_ij k_ij} = 1 if (i, j) ∈ Ω and 0 otherwise, where k_ij = i + (j − 1)n1. Then ⟨D_Ω vec(X), D_Ω vec(Y)⟩ = ⟨P_Ω X, P_Ω Y⟩ for any X, Y ∈ C^{n1×n2}. Letting y_t = vec(C_2(ψ_t)) for t = 1, . . . , K, we have G_{ts} = ⟨P_Ω C_2(ψ_t), P_Ω C_2(ψ_s)⟩ = ⟨D_Ω y_t, D_Ω y_s⟩ = (y_t)* D_Ω y_s.
Let Ȳ = Σ_s y_s (y_s)*. Then

‖G − I‖_F^2 = p* H p − 2 p* h + K,

where H = [|Ȳ_{ij}|^2], h = diag(Ȳ), and p = diag(D_Ω). Hence, like (19), choosing the best Ω with |Ω| = m is again a binary quadratic program over p ∈ {0, 1}^n with Σ_i p_i = m.

4. Image-scale dictionary. For natural images, current dictionary learning methods, e.g., KSVD [1], train a dictionary Ψ_0 from a set of image patches. Hence, the dictionary atoms have the same size as the patches. In practice, we often take measurements directly from an entire image X̄ instead of its small patches, so the learned dictionary Ψ_0 does not match the image X̄. How can we use Ψ_0 to recover X̄ from the directly taken measurements? One approach is to make an image-scale dictionary under which X̄ is sparse. Given a patch-scale atom, we construct a list of image-scale atoms that have no common non-zero parts by placing the patch-scale atom at each position of a non-overlapping tiling of the image. Special care is given near the boundary, where only part of the patch-scale atom is used. The precise treatment is specified below.
Let X̄ ∈ C^{N1×N2} be an image, and assume that each of its small patches x ∈ C^{n1×n2}, where n1 < N1 and n2 < N2, has a sparse representation under the dictionary Ψ_0 = [ψ_1, . . . , ψ_K]. We want to construct an image-scale dictionary Ψ from Ψ_0 such that X̄ is sparse under Ψ. Let n_r = ⌊N1/n1⌋, n_c = ⌊N2/n2⌋, s_r = N1 − n_r n1, s_c = N2 − n_c n2, and N = n_r n_c + sign(s_r)n_c + sign(s_c)n_r + sign(s_r s_c). For each atom ψ_k, we form N image-scale atoms Ψ^k_1, Ψ^k_2, . . . , Ψ^k_N ∈ C^{N1×N2} as follows. • For j = 1, . . . , n_c and i = 1, . . . , n_r, let t = (j − 1)n_r + i and Ψ^k_t be the matrix with all elements being zero except having ψ_k on the submatrix indexed by the (n1(i − 1) + 1)th through (n1 i)th rows and the (n2(j − 1) + 1)th through (n2 j)th columns.
• If s_r ≠ 0, then for j = 1, . . . , n_c, let t = n_r n_c + j and Ψ^k_t be the matrix with all elements being zero except having the last s_r rows of ψ_k on the submatrix indexed by the (n_r n1 + 1)th through N1th rows and the (n2(j − 1) + 1)th through (n2 j)th columns.
• If s_c ≠ 0, n_r more atoms are formed in a similar way using the last s_c columns of ψ_k. • If s_r s_c ≠ 0, we form the last atom Ψ^k_N with all elements being zero except having the bottom-right s_r × s_c submatrix of ψ_k on its bottom-right s_r × s_c submatrix. Note that any two atoms generated from ψ_k have no common non-zero parts. The image X̄ has a sparse representation under Ψ since each patch of X̄ is sparse under Ψ_0. Figure 1 illustrates how we form N image-scale atoms Ψ_1, Ψ_2, . . . , Ψ_N from a small atom ψ. In section 6, we will show that this generated image-scale dictionary works well in recovering an entire image from its linear measurements.
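The construction above can be sketched compactly (our own illustrative code; the tiling loop handles the interior, row strip, column strip, and corner cases uniformly by truncation):

```python
import numpy as np

# Sketch: tile one patch-scale atom psi over an N1 x N2 image grid to form
# image-scale atoms with disjoint supports; boundary copies are truncated.
def image_scale_atoms(psi, N1, N2):
    n1, n2 = psi.shape
    atoms = []
    for r0 in range(0, N1, n1):
        for c0 in range(0, N2, n2):
            A = np.zeros((N1, N2))
            block = psi[:N1 - r0, :N2 - c0]     # truncate at the boundary
            A[r0:r0 + block.shape[0], c0:c0 + block.shape[1]] = block
            atoms.append(A)
    return atoms

psi = np.ones((3, 3))
atoms = image_scale_atoms(psi, 7, 7)
# n_r = n_c = 2, s_r = s_c = 1, so N = 4 + 2 + 2 + 1 = 9 atoms
assert len(atoms) == 9
# disjoint supports: together the atoms cover each pixel exactly once
assert np.array_equal(sum(a != 0 for a in atoms), np.ones((7, 7), dtype=int))
```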

5. Algorithm and implementation. With a given dictionary Ψ, Algorithm 1 outlines our approach for optimizing a partial circulant matrix Φ.
For the 2D case, an optimized partial circulant operator P_Ω C_2 can be obtained in a similar way. At Step 1 of Algorithm 1, we use the MATLAB function quadprog with its default settings to solve (12). At Step 2, we use the commercial code Gurobi [4] with its MATLAB interface [35] to solve the binary program (19). In our tests, the maximum number of iterations was set to 2000. Usually, Gurobi terminated at the maximum number of iterations with the best solution obtained so far. Hence, (19) was only approximately solved.
We used both synthetic and real data for testing. For synthetic data, Ψ was a Gaussian randomly generated basis or the discrete Fourier basis, and Φ was learned by Algorithm 1. For real data, Ψ was learned in either an uncoupled way or a coupled way. In the uncoupled test, we first learned Ψ from a set of training data X = [X_1, . . . , X_L] ∈ C^{n×L} by applying KSVD [1] to

(26) min_{Ψ,Θ} ‖X − ΨΘ‖_F^2, s.t. ‖Θ_i‖_0 ≤ S, ∀i,

for a given sparsity level S, where Θ = [Θ_1, . . . , Θ_L] and ‖Θ_i‖_0 denotes the number of non-zeros of Θ_i. Then we learned Φ by Algorithm 1. In the coupled test, we simultaneously learned a pair of Ψ and Φ in a way similar to [8]. Specifically, during each loop of the coupled algorithm, we first compute Φ via Algorithm 1 with a fixed Ψ, then update Θ by applying OMP [26] to

(27) min_Θ α‖X − ΨΘ‖_F^2 + ‖ΦX − ΦΨΘ‖_F^2, s.t. ‖Θ_i‖_0 ≤ S, ∀i,

where α > 0 is a weight parameter, and finally update Ψ and Θ by solving (27) with respect to Ψ and the current nonzero entries of Θ jointly. Algorithm 2 summarizes our coupled learning method; in our experiments we terminated the algorithm when the relative change of ‖X − ΨΘ‖_F was less than 10^{−6}.
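The sparse-coding step of the coupled loop uses OMP. A generic OMP sketch (our own implementation, not the paper's code or the exact solver of [26]) is:

```python
import numpy as np

# Sketch of orthogonal matching pursuit (OMP): greedily build an S-sparse
# code for b over a dictionary D with unit-norm columns.
def omp(D, b, S):
    support, r = [], b.copy()
    theta = np.zeros(D.shape[1])
    for _ in range(S):
        j = int(np.argmax(np.abs(D.T @ r)))      # most correlated atom
        if j not in support:
            support.append(j)
        sol, *_ = np.linalg.lstsq(D[:, support], b, rcond=None)
        theta = np.zeros(D.shape[1])
        theta[support] = sol                     # least-squares on the support
        r = b - D @ theta                        # orthogonalized residual
    return theta

rng = np.random.default_rng(7)
D = rng.standard_normal((20, 40))
D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
true = np.zeros(40)
true[[3, 17]] = [1.0, -2.0]
b = D @ true
theta = omp(D, b, S=2)
assert np.count_nonzero(theta) <= 2
assert np.linalg.norm(D @ theta - b) < np.linalg.norm(b)
```

In the coupled model (27), OMP would be run on the stacked dictionary combining Ψ and ΦΨ with the weight α.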

Algorithm 2
1: Initialization: randomly generate Ψ or set Ψ as some learned dictionary. 2: while Not converged do
The stopping tolerance for YALL1 was set to 10^{−5} for the noiseless case and 10^{−3} for the noisy case, unless specified otherwise. The maximum number of iterations was set to 10^4, which is sufficiently large for YALL1 to converge within the given tolerance.
6. Numerical simulations. We compared the performance of optimized circulant sensing matrices/operators to that of random ones on both synthetic and real-world data. In addition, we tested a pair of circulant sensing matrix Φ and dictionary Ψ learned together on real-world data. The tested sensing matrices and 2D operators are listed in Table 1. To be fair, we either take all the compared sensing matrices and 2D operators to be real-valued or take all of them to be complex-valued in the same test. In addition, except for the coupled learned circulant matrix coupled-plus-rand-circ, all sensing matrices or 2D operators were generated and tested with the same set of synthetic or learned dictionaries. As in [8], throughout our tests, all columns of Ψ and Φ were normalized to have unit 2-norm. The normalization of Ψ is only for convenience, while that of Φ is critical, in particular when Φ is a Gaussian random matrix, for otherwise the recovery by YALL1 or other solvers may become unstable. Similarly, we normalized all partial 2D circulant operators. We next introduce an efficient way to normalize the 1D matrix P C and the 2D operator P_Ω C_2.

Table 1 (excerpt):
rand-2D-circ: circulant, complex, random kernel, random downsampler, 2D
opt-plus-rand-2D-circ: circulant, complex, optimized+random kernel, random downsampler, 2D
* The kernel v for a real random 1D circulant matrix was generated by the MATLAB command randn(n,1) and that for a complex random 1D circulant matrix by randn(n,1)+1i*randn(n,1); a real Gaussian matrix was generated by randn(m,n) and a complex Gaussian matrix by randn(m,n)+1i*randn(m,n); the kernel M for a real random 2D circulant operator was generated by randn(n1,n2) and that for a complex random 2D circulant operator by randn(n1,n2)+1i*randn(n1,n2).
† 60% of the normalized optimized circulant plus 40% of a normalized real random circulant.
6.2. Synthetic data. We tested different optimized circulant sensing matrices Φ on synthetic data along with two kinds of bases Ψ: the Gaussian random basis and the discrete Fourier basis. The performance was compared to that of unoptimized random circulant, as well as random Gaussian, sensing matrices. The Gaussian random basis was generated by the MATLAB command randn(n) and then turned into an orthogonal matrix by QR decomposition, and the Fourier basis was generated by the MATLAB command dftmtx(n)/sqrt(n). Both bases were 512 × 512, and the sensing matrices had size 64 × 512, i.e., an 8x downsampling. In this test, θ̄ was generated with k randomly located nonzero entries sampled from the standard Gaussian distribution and then normalized to have unit 2-norm. The stopping tolerance for YALL1 was set to 5 × 10^{−8}. Figure 2 depicts the comparison results of the sensing matrices rand-circ, Gaussian, opt-circ, and opt-circ-and-P, as well as their real-valued counterparts. We call a recovery θ* successful if the relative error ‖θ* − θ̄‖_2 / ‖θ̄‖_2 was below 10^{−4}. We calculated the success rate out of 50 independent trials at every sparsity level k.
All four subfigures of Figure 2 reveal that optimized circulant sensing matrices led to performance equally good as that of random Gaussian matrices. For the random basis, random circulant matrices achieved a similar recovery success rate, while they performed extremely badly for the discrete Fourier basis. The reason for the bad performance of random circulant matrices can be found in [33]. On the other hand, optimizing the selection matrix P hardly made further improvement. We believe that unless the underlying signal has dominant frequencies, optimizing the selection matrix P will not lead to consistent improvement. Therefore, in the subsequent tests, we let P be a random selection matrix. In addition, the improvement of optimized circulant sensing matrices over randomly generated ones is similar in the real- and complex-valued cases. For convenience, we use complex-valued sensing matrices/operators in the remaining tests.
Although the tests above are based on synthetic signals that are exactly sparse, similar relative performance of the different sensing matrices was observed on signals that are nearly sparse. Since real images are nearly sparse (under certain dictionaries), we skip presenting the results on nearly sparse synthetic signals and proceed to the real image tests.

6.3. Image tests with circulant matrices. In this subsection, we present the performance of the sensing matrices rand-circ, Gaussian, opt-plus-rand-circ, and coupled-plus-rand-circ on real images. The dictionaries for all of them were learned from the same training set and had size 64 × 256, namely, we set K = 256. The first three sensing matrices shared the same set of dictionaries learned by KSVD [1] with sparsity level S = 6, 8 in (26), and the last one was learned simultaneously with its corresponding dictionary by the coupled method described in section 5 with sparsity level S = 6, 8 and α = 1/32 in (27). This choice of α was recommended in [8]. All images used in these tests were scaled to have unit maximum pixel value. The training data consists of 20,000 8 × 8 patches, i.e., 100 patches randomly extracted from each of the 200 images in the training set of the Berkeley segmentation dataset [18]. The 100 images in the testing set were used to measure the recovery performance.
In the first set of tests, we uniformly at random extracted 600 non-overlapping patches from the 100 test images and recovered them from their measurements obtained with the four sensing matrices. Figure 3 depicts the recovery results. All four pictures in Figure 3 reveal that both the uncoupled and coupled learning approaches achieved significantly better recovery than random circulant matrices, and they did even better than Gaussian random matrices when m = 16. The coupled learning approach made a slight improvement over the uncoupled one.
In the second set of tests, we chose two of the 100 test images to recover from their measurements obtained with the four different sensing matrices. The first image had resolution 481 × 321, and the second one had resolution 321 × 481. For convenience, we removed the last row and the last column of both images and partitioned each image into 1,200 8 × 8 patches. Then we used YALL1 to solve (30) for each patch with the dictionary Ψ learned by KSVD with sparsity level S = 6. We used the peak signal-to-noise ratio (PSNR) and mean squared error (MSE) to measure the recovery performance. Table 2 lists the average results of 20 independent trials for measurement sizes m = 16, 24 and noise levels σ̄ = 0, 0.01, 0.05, 0.10, 0.15. One set of the recovered images is shown in Figure 4. From the results, we can see that both the uncoupled and coupled learning approaches made significant improvements over random circulant matrices, especially for m = 16. When σ̄ ≥ 0.10 or m = 16, the coupled learning approach did even better than random Gaussian matrices; coupled learned circulant matrices tend to do better than Gaussian random ones when m is smaller. Although the best PSNRs were achieved with Gaussian random matrices, the best visual results are those with the coupled learned circulant matrices. In addition, the coupled approach did slightly better than the uncoupled one.
6.4. Image tests with 2D circulant operators. For the tests in Section 6.3, each selected patch was reshaped to a vector by stacking its columns. In this subsection, we compare rand-2D-circ and opt-plus-rand-2D-circ to illustrate how to apply a 2D circulant operator directly to the square patches. The dictionary learned by KSVD in Section 6.3 with sparsity level S = 6 was used in this test. Note that each atom in the dictionary becomes a square patch as opposed to a vector. We used the same 600 patches and the same two images as in Section 6.3 for this test. Instead of solving (30), YALL1 was now employed to solve

(31)  min_Θ ‖Θ‖₁  s.t.  P_Ω C_{2L}(Q(Θ)) = b,

where P_Ω C_{2L} denotes the normalized partial 2D circulant operator, Q is defined in Section 3.2, b = P_Ω C_{2L}(X_i + ση) with the same η and σ as in Section 6.3, and X_i is the ith tested patch corresponding to the reshaped vector x_i in (30).

Figure. One set of recovered images Plane (left two) and Gulf (right two) by YALL1 with two different 2D circulant operators, rand-2D-circ and opt-plus-rand-2D-circ, for sample ratio SR = 0.30 and noise level σ̄ = 0.01.

6.5. Image-scale recovery. In the previous tests, we divided an image into a number of small patches and recovered each patch from its linear measurements, since the learned circulant sensing matrices/operators must match the dictionary atoms in size. In practice, we often take measurements directly from the entire image instead of its small patches, so the previously learned patch-scale dictionaries do not match the image. In this section, we first use the method discussed in Section 4 to generate an image-scale dictionary Ψ from the patch-scale one learned by KSVD in Section 6.3 with sparsity level S = 6.
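The 2D circulant operator C_{2L} acts on a square patch by 2D circular convolution with a kernel, and the partial operator P_Ω then keeps a subset of the outputs. Assuming the standard 2D FFT convolution identity (variable names are ours, and the kernel here is a random stand-in for a learned one), this can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8                                  # patch side length
L = rng.standard_normal((n, n))        # 2D kernel (random stand-in for a learned one)
mask = rng.random((n, n)) < 0.25       # Omega: output entries kept by P_Omega

def C2L(X):
    """2D circular convolution of the kernel L with patch X via the 2D FFT."""
    return np.fft.ifft2(np.fft.fft2(L) * np.fft.fft2(X)).real

def P_omega_C2L(X):
    return C2L(X)[mask]                # partial 2D circulant measurements

X = rng.standard_normal((n, n))
b = P_omega_C2L(X)

# sanity check of one output entry against the circular-convolution sum
i, j = 3, 5
direct = sum(L[p, q] * X[(i - p) % n, (j - q) % n]
             for p in range(n) for q in range(n))
assert np.isclose(C2L(X)[i, j], direct)
```

Working on the patch directly in this way avoids the column-stacking step of Section 6.3 while keeping every operator application at FFT speed.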
Then we use Ψ to recover an image X̂ ∈ C^{N1×N2} from its measurements b = G(X + ση) by solving

(32)  min_Θ ‖Θ‖₁  s.t.  G(Q(Θ)) = b,

where G = P_Ω C_{2L} is a normalized partial 2D circulant operator defined on C^{N1×N2}, η ∼ N(0, I) is Gaussian noise, σ = σ̄‖X‖_∞/‖η‖_∞, and Q is a 2D operator (corresponding to Ψ) defined in the same way as in Section 3.2. At first glance, the recovery by solving (32) may take a long time since the dictionary Ψ has so many atoms. However, the sparsity structure of these atoms enables recoveries to be as fast as those using patch-scale dictionaries. Note that fully random sensing operators are out of the game at the image scale: the number of free entries of such an operator is at least ten thousand times the image resolution, making it impossible to fit in the memory of a typical workstation.
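The speed claim rests on that sparsity structure. Under the simplifying assumption of a non-overlapping-patch construction (the construction in Section 4 may differ in detail, and `Psi` below is a random stand-in for the KSVD dictionary), the image-scale synthesis is block-diagonal, so the image-scale dictionary never needs to be formed explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
p, K = 8, 256                  # patch side and number of atoms
N1, N2 = 24, 16                # image size (multiples of p in this sketch)
Psi = rng.standard_normal((p * p, K))   # patch-scale dictionary (stand-in)

def synthesize(theta):
    """Apply the image-scale dictionary: each p-by-p block of the image is
    Psi times that block's K coefficients (patches column-stacked, order='F'),
    so the cost is one small matrix-vector product per patch."""
    X = np.empty((N1, N2))
    blocks = theta.reshape(-1, K)       # one coefficient vector per patch
    idx = 0
    for i in range(0, N1, p):
        for j in range(0, N2, p):
            X[i:i+p, j:j+p] = (Psi @ blocks[idx]).reshape(p, p, order="F")
            idx += 1
    return X

theta = rng.standard_normal((N1 // p) * (N2 // p) * K)
X = synthesize(theta)
```

Each atom of the image-scale dictionary is supported on a single patch, which is exactly why applying Q (and its adjoint inside the solver) costs no more than in the patch-scale tests.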
We compared rand-2D-circ and opt-plus-rand-2D-circ on two images. Both had resolution 241 × 129 and were obtained by cropping their larger originals Plane and Gulf in the testing set of the Berkeley segmentation dataset. The noise level σ̄ was set to 0, 0.01, 0.05, 0.10, 0.15 and the sample ratio SR := |Ω|/(N1N2) to 0.20, 0.30. Table 3 lists the average results of 20 independent trials, and Figure 7 plots one set of recovered images Plane (left two) and Gulf (right two) for σ̄ = 0.01 and SR = 0.30. From the table, we can see that the generated image-scale dictionary performed well in recovering the images with sufficiently many measurements, and that the optimized circulant operators did significantly better than the random ones on average.

Table 3. Comparison results of recovered full-size images by YALL1 with two different 2D circulant operators, rand-2D-circ and opt-plus-rand-2D-circ; SR = sample ratio.

Acknowledgments. […] complex-valued random sensing matrices. They also would like to thank three anonymous referees and the associate editor for their very valuable comments. The authors' work has been supported in part by AFOSR, ARL and ARO, DARPA, NSF, and ONR.