Enabling fast and high quality LED photoacoustic imaging: a recurrent neural networks based approach

: Photoacoustic (PA) techniques have shown promise in the imaging of tissue chromophores and exogenous contrast agents in various clinical applications. However, the key drawback of current PA technology is its dependence on a complex and hazardous laser system for the excitation of a tissue sample. Although light-emitting diodes (LED) have the potential to replace the laser, the image quality of an LED-based system is severely corrupted due to the low output power of LED elements. The current standard way to improve the quality is to increase the scanning time, which leads to a reduction in the imaging speed and makes the images prone to motion artifacts. To address the challenges of longer scanning time and poor image quality, in this work we present a deep neural networks based approach that exploits the temporal information in PA images using a recurrent neural network. We train our network using 32 phantom experiments; on the test set of 30 phantom experiments, we achieve a gain in the frame rate of 8 times with a mean peak-signal-to-noise-ratio of 35.4 dB compared to the standard technique.


Introduction
Photoacoustic (PA) is a combination of optical (photo) and ultrasound (acoustic) imaging techniques; it has abilities to exploit the spectral contrast of optical imaging as well as spatial resolution of ultrasound signal. The underlying principle of this imaging technology is based on the photoacoustic effect [1], that explains the phenomenon of generation of acoustic waves following light or near infrared energy absorption in a tissue sample. Usually endogenous tissue chromophores (e.g., hemoglobin in blood, melanin in skin) in human organs demonstrate such phenomenon due to their high optical absorption coefficients [2]. The mapping of optical absorption properties could, therefore, provide important anatomical and functional information that may be useful in many clinical applications, e.g., ischemic stroke, breast cancer, functional brain mapping or skin melanomas [3][4][5]. In addition to tissue chromophores, PA technology has been successfully used for imaging of exogenous contrast agents in a number of applications including molecular imaging and prostate cancer detection [6,7].
The PA imaging work-flow starts with excitation of a soft-tissue sample with high intense short light pulse, followed by local transient thermo-elastic expansion that eventually leads to generation of acoustic waves, usually known as PA signal. To excite a tissue sample, the current standard technology uses heavy and complex laser system, e.g., Nd:YAG or Ti:Sapphire laser [8].
The main advantage of such lasers is ability to generate high energy light pulse; the typical energy is in the range of mJ with a pulse repetition frequency, number of pulses in one second, of 10-30 Hz. However, due to the irradiation of high intense laser source, protective measures (e.g., wearing glasses and gloves) must be taken by the operator to prevent himself/herself from potential health hazards. In addition, such high intensity may decrease the optical contrast of the exogenous contrast agents due to photobleaching effect [9].
Light emitting diode (LED) is a potential alternative of laser as a light source; it has advantages of inexpensive, portable and safe light source. However, the key drawback of LED light source is its limited output power, even series of LEDs can generate energy in range of µJ. As a result, the received PA signal of an LED-based system significantly suffers from low signal-to-noise-ratio (SNR). To improve the SNR, the current technology is based on acquiring multiple frames of PA signals, and subsequently performs an averaging over them to minimize the noise. Though the pulse repetition frequency of LED-based system is much higher (in range of kHz), an averaging over a large number of frames, typically thousands, reduces the effective frame rate of PA images. In addition, a large number of averaging frames requires longer scanning time that can lead to potential motion artifacts in reconstructed images. A few researches have been carried out to improve the SNR as well as keep the averaging frame number low using standard signal processing approaches, e.g., using adaptive denoising [10], empirical mode decomposition [11], wavelet transform [12], Wiener deconvolution [13] or coded excitation [14][15][16].
In recent years, deep neural networks based approaches have outperformed the previous state-of-the-art signal and image processing techniques in various computer vision tasks including image classification [17,18], image segmentation [19], image denoising [20] and image superresolution [21][22][23][24]; where the latter two categories closely fit to our problem of image enhancement. The published image enhancement techniques are based on stacked denoising auto-encoder [20], densely connected convolutional net [24] or including perceptual loss to enhance the spatial structure of images [22]. In addition to computer vision, neural networks based techniques have been reported in various PA applications, e.g., image reconstruction in PA tomography [25][26][27] and elimination of reflection artifacts in PA images [28].
In this work, we present a deep neural networks based image enhancement approach to improve the quality as well as reduce the scanning time of LED-based PA images. Our proposed architecture consists of two key components; one is convolutional neural networks (CNN) to extract the spatial features, and the other one is recurrent neural networks (RNN) to leverage the temporal information in PA images. The CNN is built upon a state-of-the-art densenet-based architecture [24] that uses series of skip-connections to enhance the image content. For the RNN component, we use convolutional variant of short-long-term-memory [29,30] to exploit the temporal dependencies in a given PA image sequence. For an effective feature propagation and an elimination of vanishing gradient, we propose to incorporate skip-connections throughout the network including both CNN and RNN components.

Methods
We start this section with describing the CNN-based image enhancement technique in Section 2.1 that can be used to improve the quality of a single PA image. Next in Section 2.2, we present how to incorporate recurrent connection in the proposed network to exploit the temporal dependencies in a given PA image sequence for a further quality improvement. Figure 1(a) shows the densenet-based CNN architecture [24] to improve the quality of a single PA image. The network takes a low quality PA image as input and generates high quality PA image as output. The details of generating a PA image using a standard beamforming technique is mentioned in Section 3.1. For convenience, we mention the number of feature maps in each convolutional layer as 'xx' in 'Conv xx' in Fig. 1(a), in addition, we indicate the number of features maps in the illustration. The architecture consists of three dense blocks, where each dense block consists of two densely connected 3 × 3 convolutional layers followed by rectified linear units (ReLU). Unlike [24] we do not use any upsampling layers in the architecture as the sizes of the input and output images are same (168 × 168) in our case. The key advantage of dense convolutional layer is that it uses all of the generated features from previous layers as inputs through skip connections, therefore, features are propagated more effectively and it leads to the elimination of the vanishing gradient problem of a deep networks [31]. Finally, to estimate the output image, all of the features from the dense blocks are concatenated, and a convolution with one feature map is performed at the end.

Recurrent networks for temporal features
RNN [32,33] is established as a powerful tool to extract the temporal dependencies in a given sequence. The vanilla RNN is usually difficult to train for a long sequence due to the vanishing or exploding gradient. Several improved variants of RNN have been reported and long-short-term-memory (LSTM) [29] has shown the most promising performance in various applications.
ConvLSTM [30] is an extension of LSTM that leverages the convolution operation in extracting the temporal features from a series of 2D maps. As shown in Fig. 1(b), it takes current input X t , previous cell state C t−1 and previous hidden state H t−1 ; subsequently produces current cell state C t and current hidden state H t .
Here, * and • stand for convolution operation and element-wise matrix multiplication, respectively. i t , f t and o t represent input, forget and output gates, respectively.
the convolutional parameters to model the 2D time-series sequence. σ indicates the sigmoid operation, and we utilize 'tanh' as an activation function. Figure 1(c) shows our proposed architecture combining CNN and ConvLSTM together to improve the quality of PA image. The architecture takes a sequence of PA images as inputs, initially uses CNN to extract the spatial features and subsequently utilizes ConvLSTM to exploit the temporal dependencies. For the recurrent connection, we use two layers of ConvLSTM with including skip connections. At the end, we concatenate all of the generated features in the previous layers to predict the final output. The details of training the network with an input sequence and a high quality target image is described in Section 3.3.

Experiments and image reconstruction
We conducted an experimental study to train the proposed network as well as evaluate its performance. The excitation source in our experimental setup consisted of two sets of LED matrices (PreXionLED AcousticX, Tokyo, Japan) that were attached on both sides of the ultrasound transducer of our SonixDAQ system (BK Ultrasound, MA, USA). Each LED matrix consisted of 144 LEDs arranged in four rows. The imaging setup also included an LED driver to generate LED pulses with a pulse repetition frequency of 1 kHz, and a synchronizer was used to synchronize between LED excitation and PA signal reception. The ultrasound transducer consisting of 128 piezoelectric elements with a pitch of 0.3 mm was used to acquire the raw pre-beamformed PA signal. The experiment consisted of scanning of phantoms and in vivo human fingers. For the phantom experiments, we acquired PA signal for a duration of 11 sec that led to sets of 11,000 frames of pre-beamformed signals. It is important to note that we manged a steady condition during the whole scanning period of 11 sec in the phantom experiments. In contrast, for the in vivo, we could not be able to manage a steady condition, and acquired data for a short duration of 5 sec. After acquisition, we averaged the PA signals over a number of frames N, followed by beamforming using delay-and-sum technique, subsequently detecting the envelope to reconstruct the PA image PA N .
Two kinds of phantoms were used in our study; wire and magnetic nanoparticle phantoms. We chose wire and magnetic nanoparticles as optical contrast agents in our study due to their high optical absorption coefficients. For the wire phantom, we acquired a total of 62 sets of PA data corresponding to 62 different image planes, where each set provided a total of 11,000 frames. For the gold magnetic nanoparticle phantom, we provide a schematic in Fig. 2. This phantom was specially designed to investigate the sensitivity of our proposed approach with respect to possible variations of imaging depth, contrast of targeted objects and optical scattering of the imaging medium (details in Section 4.3). The phantom was built with fives cylindrical tubes (cross-section in Fig. 2) that were located at three different depths. Tubes 1-3 were located at successively decreasing depth, where the concentrations of the nanoparticles inside the tubes were same (10 nM). In addition to tube 3, two tubes (no 4 and 5) were placed at the lowest depth with concentrations of 5 nM and 2.5 nM, respectively; i.e., the concentration successively decreased from left (tube 3) to the right (tube 5). It is interesting to note that we considered the nanoparticles with the highest concentration at different depths for the sensitivity analysis since our existing LED-based PA system is not sufficient enough to image the nanoparticles with lower concentration (e.g. 2.5 nM) at higher depth (> 2 cm). In the first part of the experiment, we used water as a phantom medium. After acquisition of one set of PA signals of 11,000 frames, we mixed a very small amount of milk (1% of total volume of water) into the water in order to increase the optical scattering of the phantom medium, followed by acquiring another set of PA signals of 11,000 frames. For this phantom experiment, we acquired a total of 10 sets of PA data at 10 different image planes.

Materials
We divide all of our experimentally acquired data into three independent groups: training, validation and test sets. The training set is used for optimizing the network parameters; the validation set, in contrast, is used to fix the hyper-parameters of our architecture that mainly include number of dense blocks, number of convolutional layers in each dense block and number of ConvLSTM layers; and the test set is used to evaluate the proposed network. For the training and validation sets, we use a subset of PA images from the wire phantom. Out of 62 sets, we follow a 50% vs. 15% vs. 35% dividing rule to categorize the images from the wire phantom into training, validation and test sets, respectively. In other words, the training, validation and test sets consist of 32, 10 and 20 sets of PA data from the wire phantom experiment, respectively. Since a deep network requires lots of data for an effective training, we use most of our data in training and the rest of the data is divided between the validation and test sets.
We have chosen PA images from the wire phantom to train the neural networks, because those PA images with multiple point targets at different depths allow our network to learn how to improve the quality of point spread function in PA images. As a result, such a trained network with point spread function can be applied to any arbitrary function of optical target. To evaluate the performance of our technique using an independent test set, we form the test set from the remaining samples in our study, including 20 sets (different image planes) from the wire phantom and 10 sets from the gold nanoparticle experiment. Since in the in vivo experiment we could not be able to manage a steady condition during the scanning period, we did not use them in any quantitative analysis, instead we performed a qualitative analysis on them.

Generation of input and target sequences
We perform training of the network using different qualities of input PA images. To achieve a difference in the quality, we exploit the proportional effect of the averaging frame number (N) on the image quality. Note that such proportional effect is only true when there is a steady condition maintained during the scanning period. Since we maintain a steady condition during the scanning period in our phantom experiments, we can exploit such proportional effect to generate the PA images with different qualities. The values of N are, therefore, chosen from a range starting from a very low value to the highest possible value of 11,000; where an increased value of N indicates a higher image quality.
For a chosen value of N = N 0 , we generate one input sequence, consisting of PA images reconstructed with different values of the averaging frame numbers ≤ N 0 . Mathematically, the averaging frame numbers in the sequence can be represented as {N s , 2N s , 3N s , · · · , N 0 } corresponding to time index {t 1 , t 2 , t 3 , · · · , t N 0 }, i.e., the averaging frame number in the sequence starts with N s at t = t 1 , subsequently increasing with a step of N s frames to the highest value of N 0 at t = t N 0 . Figure 3 shows an example of generating one sequence of length of 4 for N 0 = 800 and N s = 200. The length of input sequence is, therefore, floor((N 0 )/N s ), where floor(x) indicates the nearest integer less than or equal to x. Once we fix the averaging frame numbers in a sequence, we can reconstruct a sequence of PA images using the averaging frame numbers in the sequence.
For a fixed N 0 , the length of the sequence is dependent on N s . A lower value of N s indicates a longer sequence that may cross the memory limit of our GPU during training. Based on our GPU memory size, we could be able to set N s = 200 and it gives a maximum length of the input sequence to 55 corresponding to N 0 = 11, 000.
Given our collection of 11,000 frames for one phantom sample, we can achieve the highest  possible quality PA image by reconstructing it from the averaged signal over all of the frames. Though an averaging over 11,000 frames significantly improves the SNR of the PA images, still there may be small amount of noise present in the averaged signal. To suppress this small noise, we apply an anisotropic diffusion filter on the reconstructed image PA 11000 without affecting the image structure. The anisotropic diffusion [34], also named as Perona-Malik diffusion, is a filtering technique aiming at suppressing the image noise while preserving the image content (e.g. edges). Such a filtering method has been shown to improve the image interpretation in a significant number of applications. In short, the filtering operation is not only space-variant but also data-driven that encourages intra-region smoothing rather than the inter-region smoothing. In our study, the filtered image on PA 11000 is considered as gold-standard target image . Note that for each experiment we have only one gold-standard target image corresponding to more than one input sequences.

Optimization
We use mean square losses between the predicted and gold-standard target PA images as a loss function. To minimze the loss function, TensorFlow library (Google, Mountain View, CA) with Adam [35] optimization technique is utilized. A summary of all of the steps of our training procedure is given below: 1. Start an iteration by randomly shuffle the collection of 11,000 frames.
3. For the chosen N 0 , generate a sequence of averaging frame numbers as {N s , 2N s , 3N s , · · · , N 0 } corresponding to time index {t 1 , t 2 , t 3 , · · · , t N 0 }. Note that the sequence length is dependent on N 0 and it may change after each iteration.
4. Generate the PA image sequence based on the averaging frame numbers in the sequence. In addition, the gold-standard target image is obtained by applying anisotropic diffusion filter on PA 11000 .

5.
Perform random cropping of input and target images, followed by resizing the cropped images to 168 × 168 pixels.
6. Given the input sequence and target image of the proposed network shown in Fig. 1(c), compute the loss function using forward-propagation, subsequently update the network parameters using back-propagation.
7. Go back to step (1) to start next iteration until convergence achieves.

Evaluation indices
For quantitative evaluation of the proposed approach using the test set, we use signal-to-noise-ratio (SNR), peak-signal-to-noise-ratio (PSNR) and structural similarity index (SSIM) as evaluation indices.

SNR
SNR is defined as the ratio of peak signal intensity and standard deviation of the background intensities in decibels. Unlike PSNR and SSIM, we do not need any reference signal/image to compute the SNR, it is purely based on signal strength and noise statistics. It can be mathematically defined as: where, µ I and σ b represent the peak signal amplitude at the target and the standard deviation of the background, respectively.

PSNR
The PSNR is a conventional measurement of the image quality in decibels (dB) based on the mean square differences between the estimated and reference images as: where,

MSE
Here, µ ref (σ ref ) and µ est (σ est ) are the mean (variance) of the reference and estimated images, respectively; σ cov represents the covariance between them. c 1 and c 2 are used as constants to stabilize the division with weak denominator and we use their same values as suggested in the original work [36].

Comparison
In addition to evaluation of the proposed CNN+RNN approach, we provide comparisons with simple averaging and densenet [24] based techniques. The simple averaging method is based on solely averaging operation, i.e., it does not perform any further processing on the PA image that is reconstructed from the already averaged PA signal. The densenet based technique, in contrast, uses a series of dense convolutional layers to improve the quality of a PA image; the architecture is shown in Fig. 1(a) that does not use any recurrent connection to extract the temporal features. We refer this technique as CNN-only method in rest of the literature. Note that the comparison is performed for different values of the averaging frame numbers ≤ 11, 000.

Sensitivity analysis
We also investigate the sensitivity of our proposed approach with respect to possible variations of imaging depth, contrast of targeted objects and optical scattering of the imaging medium. As mentioned earlier, we used the experimental data from the gold magnetic nanoparticle phantom for this purpose; we select five region of interests of area of 5 mm× 5 mm around the tubes in Fig. 2, followed by comparing the SNRs among those ROIs.

Computation time
We compute the run-time of our proposed method using a GPU (NVIDIA GeForce GTX 1080 Ti) and a CPU (Intel Core i7-7700K @4.20GHz with 32 GB RAM). rate (with respect to the averaging frame number) of our technique than that of the CNN-only method, in other words, the CNN-only method reaches to a saturation in image quality for higher averaging frame numbers.

Quantitative comparison
To quantify a statistical significance of the improvement, we perform student t-test between the evaluation indices of our and each of the comparing methods. Compared to the simple averaging technique, our p-values are obtained less than 0.05 at all of the averaging frame numbers for both PSNR and SSIM. In contrast, for CNN-only technique, we achieve p < 0.05 only at N > 4400.
We can also interpret the improvement in terms of gain in imaging speed. For this purpose, we choose the simple averaging technique as a reference method, i.e., we compute how much gain in frame rate we could achieve with respect to the averaging approach. For example, at the mean PSNR of 35.4 dB, the proposed, CNN-only and averaging techniques need 1,360, 3,680 and 11,000 averaging frames, respectively (dotted line in Fig. 4(a)). If we consider the frame numbers of 11,000 (corresponding to 35.4 dB) of the averaging approach as a reference, then the proposed and CNN-only techniques achieve gains in the frame rate of 11,000/1,360≈8.1 and 11,000/3,680≈3.0 times, respectively. By calculating the gains in the frame rate at other mean PSNRs of the averaging approach, we can plot gain in the frame rate vs mean PSNR as in Fig. 4(c). Such a plot demonstrates how much gain in the imaging speed we can achieve using our proposed RNN+CNN method, in addition, it shows significant improvement in imaging speed achieved compared to the CNN-only method.  blood veins, where we can notice improved detections of the blood vessels (shown by arrows) for our RNN+CNN approach. It is also evident that the the PA image averaged from all of the 5,000 frames (mentioned as high quality in the figure) contains some artifacts because of the movement during the scanning period.

Sensitivity analysis
It is mentioned earlier that the nanoparticle phantom in Figure 2 is used for the sensitivity analysis. A summary of the analysis using our approach is provided in the following.
Imaging depth For depth analysis, we choose the imaging plane consisting of tubes 1-3 of the nanoparticle phantom (please refer to Figure 2). Figure 7 shows a comparison among the simple averaging, CNN-only and our proposed techniques for three different values of the averaging frame numbers for that imaging plane. The PA images in Fig. 7 indicates that we need more number of averaging frame numbers to detect an object at higher depth.
To investigate the SNR across these tubes, we plot SNR vs. averaging frame numbers in Fig. 8(a) for three different ROIs corresponding to tube 1 (higher depth), 2 (medium depth) and 3 (lower depth) for the water phantom. Note that for computation of SNR, we use our presented technique to compute the peak intensity and standard deviation of the background of each ROI. A comparison among the SNRs at those ROIs indicates a successive decrease in SNR with an increase in the imaging depth. The key reason of such decreasing SNR can be attributed to lower PA peak intensity at higher depth.
Optical contrast The imaging plane for optical contrast analysis consists of tubes 3-5 of the nanoparticle phantom in Fig. 2, where the concentration successively decreases from tube 3 to 5 (left to right).  the optical absorber. In addition, we can observe better recovery of the target object for our RNN+CNN method (shown by arrows in Fig. 9). A comparison among the SNRs for tubes 3-5 (for the water phantom) is provided in Fig. 8(b) for different averaging frame numbers to demonstrate the effect of optical contrast on image quality. As expected, we can notice a successive reduction in SNR with a decrease in optical contrast.
Optical scattering In addition to comparing SNRs for the water phantom, we also include comparative results for tubes 3-5 with water+1% milk phantom in Fig. 8(b). We can observe a corresponding reduction in accuracy for all the tubes 3-5, when the phantom medium is changed from a lower (water only) to higher scattering (water+1% milk) liquid.

Computation time
The GPU computation times are obtained as 15 and 4 mili-seconds for our and CNN-only methods, respectively. The corresponding run-times in the CPU are 190 and 105 mili-seconds, respectively.  Fig. 7. An example effect of depth on the PA image quality. We choose tubes 1-3 of the nanoparticle phantom for this purpose. A comparison among the simple averaging, CNN-only and RNN+CNN techniques is shown for three different values of the averaging frame numbers. As shown by the arrows, a decrease in the image quality can be observed with an increase in the imaging depth. It is also interesting to notice the increased image quality with an increase in the averaging frame number.

Discussion and conclusion
We have presented a deep neural networks approach to improve the quality of PA images in real-time while simultaneously decreasing the scanning period. In addition to using CNN to extract the spatial features, we employ RNN in our architecture to exploit the temporal information in PA images. We have trained our network using sequence of PA images from 32 phantom experiments. On the test from 30 samples, we could achieve a gain in the frame rate of 8.1 times with a mean PSNR of 35.4 dB compared to the conventional averaging approach.
We have demonstrated that a temporal PA sequence allows the neural networks to learn the image and noise contents more effectively than a single image based CNN-only network does. In addition, for the CNN-only method we could observe saturation in both image quality indices for higher averaging frame numbers (Figs.4(a-b)) that indicate to a reduction in the improvement rate with an increase in the averaging frame number, in contrast to the higher improvement rate for our method.
Though we train our network using PA images with point targets, we could also notice its effectiveness in other target object (Fig. 5). In addition, we have observed improved performance of our method with respect to CNN-only and averaging techniques in detection of point targets at higher depth and with lower optical contrast (Figs. 7 and 9).
We have also shown the improved performance of the proposed approach for in vivo example (Fig. 6). In addition, we have observed the effect of artifacts (due to motion during scanning) in the reconstructed image with a higher averaging frame number. The key advantage of our technique is that it could enhance the image quality from a reconstructed image with low averaging frame number, subsequently eliminating the potential effect of the artifacts.
We have investigated the sensitivity of our approach with respect to imaging depth, contrast of the targeted object and optical scattering of the imaging medium. In summary, we could observe a reduction in image quality for a higher depth, for a lower contrast and for a higher scattering medium. For all of these effects, we can attribute the key reason to lower SNR of the PA signals. For example, either for a higher depth or for a higher optical scattering, the receiving light energy in the target decreases, followed by a reduced SNR. In contrary, a less optical contrast corresponds to less amount of converted acoustic energy that in turn provides lower SNR of the PA signal.
Future work includes an extended evaluation of the proposed approach using more in vivo experiments. In addition, we aim to integrate the proposed technique in our PA imaging work-flow. Fig. 8. Sensitivity analysis of our method. (a) Effect of depth on SNR vs. averaging frame numbers. We can notice a successively reduced accuracy from lower to higher depth. (b) Effect of optical contrast and scattering of phantom medium on image quality. The image quality is degraded with a decrease in the concentration of optical absorber. In addition, a higher optical scattering medium leads to a reduction in image quality. We can notice a successive decrease in signal quality with a decrease in concentration. In addition, the improved performance of our RNN+CNN method can be observed (marked by arrow) with respect to two other methods.
In conclusion, we have demonstrated the potential of our approach to exploit the temporal information in PA images for an improvement in image quality as well as a gain in the imaging frame rate.

Disclosures
The authors declare that there are no conflicts of interest related to this article.