Compressed sensing for photoacoustic computed tomography based on an untrained neural network with shape prior

: Photoacoustic (PA) computed tomography (PACT) shows great potential in various preclinical and clinical applications. A great number of measurements are the premise that obtains a high-quality image, which implies a low imaging rate or a high system cost. The artifacts or sidelobes could pollute the image if we decrease the number of measured channels or limit the detected view. In this paper, a novel compressed sensing method for PACT using an untrained neural network is proposed, which decreases a half number of the measured channels and recovers enough details. This method uses a neural network to reconstruct without the requirement for any additional learning based on the deep image prior. The model can reconstruct the image only using a few detections with gradient descent. As an unlearned strategy, our method can cooperate with other existing regularization, and further improve the quality. In addition, we introduce a shape prior to easily converge the model to the image. We verify the feasibility of untrained network-based compressed sensing in PA image reconstruction and compare this method with a conventional method using total variation minimization. The experimental results show that our proposed method outperforms 32.72% (SSIM) with the traditional compressed sensing method in the same regularization. It could dramatically reduce the requirement for the number of transducers, by sparsely sampling the raw PA data, and improve the quality of PA image significantly.


Introduction
Photoacoustic imaging (PAI) is a hybrid imaging modality, which originates from the principle of photoacoustic (PA) effect [1][2][3][4]. The PA signal is induced by a short-pulsed laser light, which propagates in medium and is detected by the ultrasonic transducers. In recent decades, PAI has enabled many interesting imaging applications including hemoglobin and oxygen saturation detection, small animal imaging, and pre-clinical cancer diagnosis [5][6][7][8][9][10][11]. One of the implementations of PAI is photoacoustic computed tomography (PACT), which uses unfocused light to illuminate the tissue, and detects the PA signals by a transducer array.
In PACT, the number of the detector should satisfy the Nyquist sampling theorem. However, increments of the detector will increase the cost of the system. Meanwhile, the transducer could not encircle the field of view (FOV) in some scenes, e.g., the imaging of human carotid. The under-determined setup of the PA image reconstruction is achieved in these cases.
Compressed sensing (CS) has been used to reconstruct the PA image in sparse or limited-view conditions, which could recover the signal/image under the Nyquist sampling rate [12,13]. CS leverages the sparsity of data to reconstruct the PA image based on different optimizations and 4. We demonstrate this method with conventional regularization (TV regularization in this paper). Simulated and experimental results show that our method outperforms conventional unlearned optimization method with the same regularization. Furthermore, our method embodies a robust and shows generalized performance on different data. Other CS methods are also suitable for integrating into our method, not just the TV regularization.

Photoacoustic imaging
In PAI,the initial pressure is excited by a single short laser pulse, which can be expressed as [2]: where Γ 0 is the Gruneisen coefficient, η th is the conversion efficiency from light to heat, µ a is the optical absorption coefficient, and F is the optical fluence. The pressure propagation in the medium can be described by below equation: where p(r, t) is the spatiotemporal pressure wave, v s is the speed of sound, H denotes the heating function, β denotes the thermal coefficient of volume expansion, and C P denotes the specific heat capacity at constant pressure. To compute PA pressure in any heterogeneous medium, we solve this equation with Green function [2], and derives the forward model: where p 0 (r ′ ) is the initial pressure at detection position r ′ .We use a linear operator A indicates the forward procedure from initial pressure distribution f to the PA signals b: where ϵ is noise. The light uniformly illuminates the whole target in PACT, which excites the PA signals simultaneously. The transducer array is used to receive the PA data at different positions. The inversion of Eq. (3) can be described from p(r, t) to p 0 (r) using universal back-projection (UPB) operation [31]: where θ 0 is the angle between the vector pointing to the reconstruction point r and transducer surface S 0 .

Compressed sensing
In CS, we use Ψ as a proper sparsity transform that results in an overdetermined representation, and the sparsity transform can be represented as: where f ∈ R n is original data, and g ∈ R N is coefficient on basis Ψ.
We can project data f into a series of sensing vector b with noise ϵ, and represents compressive measurements as: we assume it is deterministic noise ∥e∥ 2 ≤ ϵ. We can formulate the original data f obtained by solving the following basis pursuit denoising problem: Two conditions should be satisfied if we can successfully recover the ground-truth data f 0 : • The two basis sets Ψ and Φ should be incoherent.
In the CS theory, we should find a basis Ψ that sparsely represents f , and minimize the l 1 norm of Ψf promotes sparsity, and the constraint enforces data consistency. In CS-based PACT, a one-step scheme is described to solve the following minimization problem:

Untrained CNN for CS PACT
Given the measured PA signals b and the measurement matrix M (M = ΦA), we has b = Mf + ϵ.
We aim to recover f from b: in which ∥Mf − b∥ 2 2 is the data consistency term, and R(f ) is the regularization term. In CS, the optimal solution of Eq. (10) depends on R(f ). Namely, we should find a sparse basis as the embedded prior. Some sparse basis has been mentioned before (TV, Wavelets, Curvelets, Fourier), and many papers have studied the use of CNN as NETT regularization for CS-PAT [32,33]. However, a large number of the dataset are required to train the model.
Recently, DIP exposed that a generator model is sufficient to capture a great deal of natural images prior without any learning, which can also be used to recover the compressed signal. In our work, we aim to find a set of weights for the output of CNN to fit the reconstructed image, which is applied to the measurement matrix M by matching the given sparse measured data b.
To implement that, we should design an over-parametrized CNN decoder model D.
Therefore, we initialize the untrained model MD(Θ, z) with a fixed random matrix z, and solve the non-linear least squares solution: Generally, the over-parameterized deep decoder D can fit any image f * , including unstructured noise. Furthermore, an implicit prior can be expressed if we stop the procedure at the correct stage. Namely, further regularization is unnecessary if the procedure of optimization can be early stopped. Also, we can retain the sparse basis as the regularization term like the model-based optimization: In this work, a convolutional decoder, D, is used as the generator network, and the architecture will be described in the next section. These CNN models can provide a satisfied prior for natural images in problems such as inpainting and denoising due to their convolutional operations. Therefore, this approach also applies to another differentiable forward operator A, not only PA forward operator.
Note that our method is a learning-free method since it has not the training phase with much data. Meanwhile, this method leverages an untrained generative decoder to optimize the weights Θ of the model. By using different models, our results further support a hypothesis that network structure, not representation learning, is the key component in image reconstruction.

Network architecture
To demonstrate our method, we introduce a CNN that generates an image through convolutions and non-linear operations. Given a random fixed input z, we use a decoder D to generate the final PA image. For l layers' decoder(in our work, l=5), the model can be defined as: where Herein, ReLU (Rectified Linear Unit) is an activation function, BN means the batch normalization operation. κ is the convolutional kernel, and κ i is 3 × 3 size. Note that B i contains the coefficient of the convolutional layer.
As mentioned above, a five-layer decoder D is used to fit the initial pressure f * , and D is implemented here by Eq. (13) as shown in Fig. 1. This architecture is a U-Net [34] without the encoder and skipped connection. In this paper, the input data z is a Gaussian random matrix with 8 × 8 size, which should be fixed in the procedure of optimization. A decoder model generates an image with 128 × 128 size through five convolutional layers and de-convolution. For each layer, double combinations of convolution with 3 × 3 kernel size, BN, and ReLU are used in series and followed by a de-convolution to up-sample the feature map.
The output image, multiplied by the matrix M, should be restricted by measured raw PA data b. Namely, we can directly optimize this model by minimizing the data consistency (DC) loss function:

Shape prior
In CS-based PACT, different sparse basis is used. For instance, TV regularization can enforce smoothness as R(f ) in CS theory. Furthermore, the l 1 norm of TV regularization can suppress the small coefficients, whose solution can be sparse. The TV regularization can be described that: In our method, TV regularization also penalizes the output of the decoder. Therefore, an additional TV term can be contained to restrain the deep generative model, i.e., TV(D(Θ, z)). Furthermore, this scheme has the advantage that we do not consider the differentiability of the regularization term (l 1 norm) since we optimize the loss function by back-propagation and gradient descent (GD). Now, we rewrite the loss function based on Eq. (12) and Eq. (16): We can iterate this procedure and update the weight with GD. The solo data could cause the stochastic direction of the gradient. We further propose a shape prior to improve the performance and create a robust, efficient objective function. Considering that a direct texture of the target could provide a guided optimization at the beginning phase, we calculate the error between the rough image and output of D as shape prior (SP).
In our work, shape prior is proposed to penalize the output of the model, and the rough texture is created by sparse conventional reconstruction. Therefore, we estimate the prior with the decoder model by minimizing the least squares loss ∥D(Θ, z) − f d ∥ 2 2 , and f d comes from the conventional reconstruction. We combine this prior with the loss function in Eq. (17). Finally, we optimize the weights of the Decoder model by minimizing the final loss function as follow : λ 1 and λ 2 are hyperparameters, which are chosen for different data empirically. f d is a direct reconstruction from UBP [31] or time-reversal [35]. Note that, for PACT, this function cannot optimize the model directly since the gradient of the data consistency term cannot be tracked completely with the chain rule. We introduce a decomposed gradient descent to resolve this problem, which will be described in the next section.
Moreover, our main result shows that the estimate Θ, obtained by running gradient descent on the loss until convergence, yields an output D(Θ, z) which is close to f * , i.e., D(Θ, z) ≈ f * .

Implementation
Generally, the proximal gradient method is used to solve TV minimization in Eq. (10): which is a classical and widely-used solution for TV minimization [36,37]. The forward and adjoint operator A and A * are used to compute the gradient of the data consistency, which has been implemented in the MATLAB toolbox k-Wave [38]. For DL model, we do not consider the analytic gradient of loss function since GD and back-propagation are used to update the weight at each iteration. However, in our work, the data consistency contains A and D. Namely, A and A * cannot be back-propagated, and the gradient of D cannot be calculated directly.
On the other hand, the forward operator can be discretized and written as a matrix, which is limited by computing resources. The matrix-related gradient cannot be tracked since the size of the matrix is large. Therefore, no matter whether we use the function or matrix, we cannot directly update the weights by back-propagation. The key solution is to decompose the gradient calculation of forward operation and DL model.
To decouple the data consistency term, we compute the gradient of Eq. (15) for D: which can be calculated by k-Wave based on the output of D. We rewrite the derivative of Eq. (15) for Θ based on chain rule: For ∂D/∂Θ, the weights automatically optimize based on the chain of the gradient. Therefore, the gradient is decomposed into two terms, the first term can be computed by Eq. (20), and the second term can be updated by back-propagation. To transfer the gradient of the first term, we multiply these two terms and update the weight of DL model by back-propagation. Thus, the gradient of the data consistency term can be transferred to Θ. The procedure of this optimization can be described in Algorithm 1. We can calculate this loss and optimize the weights by back-propagation. We decompose the procedure as back-propagation and analytic gradient descent. For each iteration, the data consistency loss (lines 3 in Algorithm 1) is calculated based on the output of D since ∂D/∂Θ can be regarded as a constant for Θ. Next, this loss needs to be combined with other regularizations to form the final loss. Finally, the back-propagation is used to update the weights. In Fig. 2, we further illustrate the pipeline of this optimization, and CNN is our decoder model D in this paper.
In this paper, the optimizer of all experiments is RMSRrop, and the size of output image is 128 × 128 . We implement this algorithm including M, M * , and the framework D(Θ, z) by MATLAB. The initial step rate is 0.001. All methods are implemented on a 64-bit operating system with an Intel Core i7-6700 CPU and an NVIDIA GTX 1080 Ti GPU.

Experiments
To experimentally validate our approach, simulation data and in-vivo data are used. Furthermore, we compare our method with other methods. To validate CS-based PACT, the data is 128 channels, and a 50% random sub-sampling matrix is used to sub-sample the channel number.
Conventional method with TV norm is compared with our method, which leverages Eq. (19) to solve the objective function. Furthermore, we also compare our method with conventional Tikhonov regularization. Since our method is unlearned, we only compare it to other unlearned methods. Meanwhile, some comparison experiments are demonstrated, e.g., different regularization. Moreover, through experiments, we illustrate the effects of the priors and determine the appropriate number of iterations.

Synthetic setup
For the synthetic data, we use k-Wave to generate the data. 128 elements circular ultrasound (US) array receives the PA signals with 14.5mm radius. The pixel number of the initial pressure map is 380 × 380, and the total grid size is 30mm × 30mm. The sampling rate of PA signal is 40 MSa/s, and noise is added to signal with 40 dB SNR. The center frequency of the US transducer is set as 2.5 MHz with 80% bandwidth, and the speed of sound is 1500 m/s. The reconstructed region is 30 mm × 30 mm with 128 × 128 pixels.

In-vivo data
Moreover, we also compare our method with the conventional method on the in-vivo data, which contains the brain of mice and the cross-section of the human finger. All data is acquired from a self-built PACT system in Fig. 3. The transducer array is a customized 128-elements full-view circular with 30mm radius (2.5 MHz, Doppler Inc.), which is placed in a 3-D printed water tank. The laser source is a pulsed laser (720 nm wavelength, 10 Hz repetition rate), which illuminates the object by a fiber optic bundle as Fig. 3 shows. The data sampling rate of the data acquisition system is 40 MSa/s. The region of image reconstruction is 30 mm × 30 mm with 128 × 128 pixels.

Synthetic results
We firstly validate our method on the synthetic data, λ 1 and λ 2 are 0.006 and 0.05 respectively. In Fig. 4, we show the results of TV method and our method. Specially, our method is minimized by Eq. (18), which refers to this function by default in this paper. This holds by simply running TV method until convergence (300 iterations). All methods are implemented on a system with Intel i7-7600 processor with 3.40 GHz, 32 GB memory, and a GTX 1080Ti graphics processing unit. For each iteration, the conventional TV costs 0.23s, our proposed method costs 0.29s and the Tikhonov method costs 0.07s. Note that, for all experiments, the number of iterations is 700, which is terminated by pre-running different iterations. Moreover, the procedure of optimization can automatically stop when the value of metrics starts to decline if we use an additional metrics of quality. Due to the reduction in the number of channels, the background of the conventional result is polluted, which causes a poor contrast compared with Fig. 4. For our approach, most structures of the object are reconstructed well with few artifact. It shows that the decoder D fits the object at the initial phase, and the artifacts are reconstructed after continuously optimizing. For Tikhonov method, it is less sensitive to edges compared with TV, which causes blurry edges for artifacts. We can compute the Structural Similarity (SSIM) of these results to quantitatively compare the performance. The SSIM values of Fig. 4 are 0.6312, 0.8377 and 0.3618 respectively, which also indicates our method outperforms the conventional TV method over 32%.
Furthermore, we should validate the effects of different priors and the appropriate iteration times. A series of comparison experiments are set up.

Iteration times
We use an untrained model D to compare the performance of different numbers of iterations without any regularization. Different iteration times have been validated from 100 to 8000 as Fig. 5 shows. As the number of iterations increases, the main object is reconstructed firstly (from 1 to 500), and the best reconstruction is achieved between 500 and 1000. And then, the reconstruction result starts to blur (after 1000) since the artifacts near the object are appearing. It also indicates our model will first fit the major structure and then fit the noise and the artifacts. We should appropriately stop the iteration to obtain a better result [27,30]. We should further quantitatively evaluate these results. Three metrics are used to quantify the performance of different results, which are SSIM, Peak Signal to Noise Ratio (PSNR), and Signal to Noise Ratio (SNR). Table 1 demonstrates the quantitative results of these different iterations. The quantitative results show that the reconstruction quality first increases and then decreases with the number of iterations increasing. Namely, the model fits the correct object at the beginning, and the best quality is 500 iterations in Table 1. Therefore, for PACT, the best iterative times could be selected from 100 to 1000. After comparing different iterations, we determined to use 700 iterations for all experiments without unnecessary artifacts.

Comparison experiments
For DIP, the DL model can express the implicit prior generally, thus the additional prior term. In this section, we evaluate the effects of two priors (TV and SP). The two parameters are same with before. The synthetic vessel results have been shown in Fig. 6. The two results show similar reconstructions from Fig. 6. The background of D's result has some noises as the yellow arrows indicated in Fig. 6(a). By contrast, Fig. 4 has a purer background and higher contrast compared with Fig. 6. To further evaluate the contrast of these results, we can compute the contrast-to-noise rate (CNR) for these results. We list the quantitative results of Fig. 4 and Fig. 6 in Table 2. The first three columns of the table indicate that each of the two priors items has improved the reconstructed quality. Although the conventional sparse basis can be surpassed only using a deep model, different regularization can further boost the robustness and efficiency of this method. Compared with the decoder D without regularization terms, the decoder D with regularization performs better in terms of noise suppression, i.e., higher SNR (3.1046 and 1.6595). Similarly, the results of the quantitative comparisons also reflect the same performance from Table 2, and our method has higher contrast compared with others. These results further show that effective priors can improve the performance of untrained CNN.

In-vivo results
In addition, we demonstrate our method on in-vivo data, λ 1 and λ 2 are 0.005 and 0.1, respectively. Firstly, a real mice brain data is validated. We also compare TV method with our method. Figure 7(a)-(d) shows the brain imaging results, where the TV obtains the result with 300 iterations. To further evaluate the results, we also demonstrate the full-view results for all in-vivo results, which use TV method to reconstruct the image with 128 channels full-view PA data. Obviously, the conventional method cannot suppress the artifacts only using 64 channels data from Fig. 7(a). The untrained CNN model method shows a superior result, purer background creates a higher contrast in Fig. 7(b). The yellow arrows in Fig. 7(a), (b) show that the vessel in the sulcus has a clear shape compared with TV result, which contains a few artifacts in Fig. 7(a). For Fig. 7, the Tikhonov regularization is not sensitive for artifacts in this sparse condition compared with TV. Note that the required number of detectors N is related to the size of region of interest and the acoustic wavelength (Nλ/2 = πD) to satisfy Nyquist criteria [8]. For this setup (D=30mm, λ=400 µm), the number of detectors should be greater than 470. Although full-view result is formed from 128 channel data, the number of detectors still cannot satisfy the Nyquist criteria in this work. Therefore, the full-view result still has artifacts as Fig. 7(d) and 7(h) shown. For the background, our method even outperforms the full-view result (Fig. 7(d)) in the way of artifacts. Since the artifacts still exist in the full-view results, we do not further calculate the quantitative metrics for these full-view results. We further use a cross-section of the human finger to demonstrate different methods. We also compare these two different methods in Fig. 7(e)-(h). Similarly, some artifacts are retained in the result of conventional method as shown in Fig. 7(e),(g). The 50% sub-sampling rate, i.e., 64 channels, causes a blurry result showing that the object is disturbed by the artifacts. For our result, Fig. 7(f) eliminates most of the obvious artifacts compared with Fig. 7(e) and (g). Note that the artifacts near the objects are also beginning to be reconstructed from Fig. 7(f). However, for the SNR and contrast, our method still outperforms the conventional method. In addition, these results further verify the merits of this method, which will first fit the target and then fit the noise and the artifacts. It implies this method can fit any signal with appropriately stopping.

Conclusion
In this paper, we introduce an untrained CNN model to reconstruct a sparse PACT image, which outperforms unlearned methods without plenty of data. In addition, a direct reconstructed image is used to penalize the output of DL model. This prior improves the reconstruction error and efficiency. We further demonstrate how to implement this method on PACT, which further decomposes the analytic gradient and chained gradient in the data consistency term. The experimental results show our approach outperforms the conventional CS method with the same sparse basis. Note that DIP method can fit any signal given an over-parameterized model in empirical. This method provides insight for CS-based PACT, and explores more solid works combined with other conventional CS methods. Meanwhile, we will compare another sparse basis in future works. Disclosures. The authors declare no conflicts of interests.
Data availability. Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.