Denoising of pre-beamformed photoacoustic data using generative adversarial networks

: We have trained generative adversarial networks (GANs) to mimic both the effect of temporal averaging and of singular value decomposition (SVD) denoising. This effectively removes noise and acquisition artifacts and improves signal-to-noise ratio (SNR) in both the radio-frequency (RF) data and in the corresponding photoacoustic reconstructions. The method allows a single frame acquisition instead of averaging multiple frames, reducing scan time and total laser dose significantly. We have tested this method on experimental data, and quantified the improvement over using either SVD denoising or frame averaging individually for both the RF data and the reconstructed images. We achieve a mean squared error (MSE) of 0.05%, structural similarity index measure (SSIM) of 0.78, and a feature similarity index measure (FSIM) of 0.85 compared to our ground-truth RF results. In the subsequent reconstructions using the denoised data we achieve a MSE of 0.05%, SSIM of 0.80, and a FSIM of 0.80 compared to our ground-truth reconstructions.


Introduction
Photoacoustic imaging (PAI) is a biomedical imaging technique that leverages the advantages of both ultrasound and optical imaging to non-invasively image vasculature and other optically absorbing targets [1,2]. Significant progress both in hardware and data reconstruction techniques have moved this modality from the lab to the clinic [3], where it has shown promise in the detection and staging of cancer [4,5], the detection of inflammatory conditions [6,7], and for intraoperative guidance [8,9].
Photoacoustic signals are inherently weak, with their intensity being proportional to the dose of laser light reaching the imaging target [2]. A fundamental tradeoff exists between maximizing this intensity while minimizing the laser dose to which the patient is exposed [10]. Although different reconstruction schemes including modified delay-and-sum techniques [11], filtered back-projection [12] and iterative approaches [13][14][15] have been developed to enhance the quality of the reconstructed images, it is still common to average the raw radio-frequency (RF) ultrasound data generated from multiple consecutive laser pulses, reducing stochastic noise and improving the signal-to-noise ratio (SNR) [16,17]. This approach not only increases the scan time proportionally to the number of frames acquired at each imaging position, lowering the frame rate of the system, but it also increases the total laser dose. More specific noise-reduction techniques such as bandpass filtering to the working range of the ultrasound transducer [12], or the removal of laser-induced noise using singular value decomposition (SVD) denoising [18] have also been employed to improve the quality of photoacoustic RF data. In this paper we propose a deep learning based method that is capable of reducing Gaussian background noise similar to averaging multiple acquisition frames, while simultaneously replicating more sophisticated denoising, in particular the SVD denoising method of Hill et al. [18] , which we have previously found to be particularly effective for our data [19].
Generative adversarial networks (GANs) [20] and conditional GANs (cGANs) [21] emerged in 2014, and were quickly adopted as the state-of-the-art deep learning models for a variety of tasks in different fields [22]. For instance, the introduction of cGANs capable of mapping different image domains, such as the Pix2Pix Model of Isola et al. [23], can be used for tasks such as segmentation, grayscale to RGB transformation, and super-resolution reconstruction and rendering. In recent years, there have been numerous studies investigating applications of deep learning in PAI [24][25][26][27]. More specifically, GANs have been applied to PAI both for artifact removal and as a substitution for iterative solutions [28,29] where the input and output of the GAN resemble the initial guess and final iterative solution provided by iterative reconstruction algorithms respectively. However, these studies have mainly focused on post-beamforming processing of photoacoustic data for enhancing reconstruction. Since we already have a gold standard (frame averaging and SVD denoising) for our data, a generative supervised learning approach such as a GAN is ideally suited.
In this study, we have developed a GAN-based method using the Pix2Pix model for denoising pre-beamformed RF data that can mimic both the behaviour of multi-frame temporal averaging noise reduction using only one frame of data while also being able to remove sensor-specific artifacts. Since the Pix2Pix model was designed for image-to-image translation tasks involving highly structured graphical outputs, we hypothesized that it might work as an alternative to our SVD denoising, which excels at removing structured noise.
We achieved comparable results both in terms of raw RF data and also in the resultant reconstructed images using the denoised data. To the best of our knowledge, this study is the first to use GANs as a pre-processing step in photoacoustic imaging to enhance the quality of the raw RF data and consequently the resultant reconstructed images.

Methods
In the following sections, scalars are represented by lower-case letters, data matrices by bold lower-case letters, operators by upper-case italic letters. We begin by describing the photoacoustic model and defining our reference dataset. We then propose our method of GAN denoising, as well as our reconstruction methods and metrics used for the quantitative assessment. Finally we describe our data acquisition hardware and experimental parameters.

Photoacoustic model
The photoacoustic effect describes the generation of propagating acoustic waves by a temporally short pulse of laser light. These waves can be detected by standard ultrasound transducers, and the received signals can be processed to reconstruct the initial pressure distribution p 0 which, under the assumption of an acoustically homogeneous sample and spatially uniform illumination, will be proportional to the optical absorption in the sample [2]. When the measured time-varying pressure p D (r D , t) is acquired at multiple positions r D in 3D space and p 0 (r) is reconstructed at sample points r, this is referred to as PAI reconstruction [12]. In imaging systems, the discretized versions of p 0 (r) and p D (r D , t) are denoted by p 0 and p D , respectively.

Reference RF dataset
At each imaging location across a volume, a frame of RF data is acquired using a sensor with the general shape of (m, n) where m is the number of detection elements of the sensor and n is the number of time samples taken, which correspond to how far away is the source from the detection element. This noisy frame of RF data is denoted by p D,noisy . We pre-process these frames by first performing temporal averaging at each location where n f frames (n f = 20 in our example) are averaged together denoted by p D,n f −avg . We then perform SVD denoising to remove cross-channel noise bands from the data [18,19]. These steps will result in a reference frame for the experiments performed in this paper, which we will denote by p D,ref . Additionally, SVD denoising applied to a single noisy frame of RF data (as opposed to a temporal average of frames) will be denoted by p D,SVD .
For example, a frame of our RF data contains n = 1792 time samples across m = 384 elements of the transducer, where for each 128-element, 1792-sample segment of a frame of RF data, we construct a 128 × 1792 matrix and compute its SVD (our transducer array of 384 elements is multiplexed into 128 channels; three acquisitions are performed at each transducer location). By isolating the first k singular value components and reverting to the original representation, we are left with only the noise, which dominates those k singular values. We subtract this noise from the original matrix, resulting in the denoised data. This process is illustrated in Fig. 1. An example frame pair of p D,noisy and p D,ref can also be seen in Fig. 2.
In SVD denoising, as the k singular values increase, lower-frequency structures are suppressed at the expense of an increased level of background noise, which can be observed in Fig. 3. To strike the best balance between removing the noise bands and preserving the photoacoustic signals, the data in the present study was denoised using k = 15, obtained empirically. Fig. 1. Preparation of a reference frame. After imaging one location for n f times and performing temporal averaging, the resultant frame is then denoised using SVD denoising. We also denoise each frame with the SVD approach.

GAN denoising
We use the Pix2Pix Model [23] which trains a cGAN to learn a mapping from an input image and random noise vector z, to an output image. Due to the size of p D,noisy and p D,ref in our example (384, 1792), we cannot use these frames at once as an input/output pair and therefore we need to divide them into smaller patches denoted by i p D,noisy and i p D,ref with i indicating the location of the patch within each frame (see Fig. 4 for an example). Letting G(., .) and D(., .) represent the outputs of the generator and discriminator respectively, the training objective of the model presented in [23] will be where L cGAN denotes the cGAN loss, L L1 the L1 distance norm, and E the expectation values [23]. Combining Eqs. (1) and (3) results in G * which is the cGAN loss where the generator tries to minimize the objective and the discriminator tries to maximize it, in addition to a sparsifying distance which encourages a more focused image controlled by the parameter λ 1 . In training the Pix2Pix model, the loss function does not provide a descriptive source for determining the quality of training [23], therefore in addition to having a training dataset and a testing dataset, we also utilized a validation dataset to tune the parameters of the model, after each training trial. While this requires performing an additional experiment, it ensures that the model training is completely blind to the validation data, unlike cross-validation methods such as k-fold or Monte Carlo. We found that we achieved optimal performance using our validation dataset with a patch size of (128, 128), without overlap, a batch size of n b = 5, a learning rate of α = 0.0002, and a regularizer λ 1 = 100. We used the Adam optimizer with the model weights being initialized randomly from a Gaussian distribution with mean 0 and standard deviation 0.02 as recommended in [23]. As seen from Eq. (4), training occurs with the generator learning to output images that are closer to the reference, while the the discriminator learns to distinguish between real images and fake images from the generator. This training process is outlined in Fig. 6. Our choice of hyper-parameters and input/output image sizes will result in the GAN architecture illustrated in Fig. 5 where the generator is a U-Net architecture [30] and the discriminator is a PatchGAN classifier [23]. Our model was trained using a Tesla V100 GPU on an NVIDIA DGX-1 system.   patches. If a different patch size is used, the size of the layers will be adjusted accordingly. Each output of the generator is subsequently used as an input to the discriminator which has to determine if it is a real patch or a fake patch produced by the generator.  6. The training process of the GAN involves A the generator learning to output more realistic data and B the discriminator learning to distinguish between fake and real images. The combined objective function L cGAN + λ 1 L L1 is back-propagated through both networks updating their respective weights.
Performing inference on our unseen testing data using the GAN is similar and requires the frames to be first divided into sub-patches, denoised, and subsequently put back together as outlined in Fig. 7. A frame of RF data denoised using the GAN will be denoted by p D,GAN . During inference, only the generator is used to output denoised patches using noisy input patches, while the discriminator is left unused.

Reconstructions
Improving the pre-processing and denoising of RF data ultimately serves to improve the quality of reconstructed photoacoustic images. We therefore need to assess the reconstructions that use our GAN-denoised RF dataset compared to reconstructions using temporal averaging and SVD denoising.

Filtered back-projection
In this study, we use the model of Xu et al. [12] for spherical scanning geometry, which relates p 0 (r) to p D (r D , t) in a backprojection form as where v s is the acoustic speed in the sample, Ω 0 is the solid angle of the surface containing the detection points r D , and dΩ 0 is the solid angle of the surface element at a location r D relative to a sample point r. While this form is exact only in the case in which this surface completely encloses the sample (Ω 0 = 4π), the dΩ 0 Ω 0 term serves as a weight to mitigate the effects of the well-known partial view problem [12]. We have previously described our reconstruction scheme Fig. 7. Once the noisy RF data p D,noisy is acquired using the imaging system, each frame is divided into i p D,noisy sub-patches (128 × 128 in our examples). These patches are then denoised using the trained generator model making i p D,GAN which are combined into a denoised p D,GAN frame subsequently used by the reconstruction algorithms to approximate the initial pressure distribution p 0 . in detail [19] but we note that in the present study we assume spatially uniform illumination. We refer to this single-shot reconstruction as filtered backprojection (FBP) in the later sections.

Fast iterative shrinkage thresholding algorithm
To implement the iterative reconstruction algorithm, we also need a forward model describing the generated p D as a function of p 0 . If we define a matrix A * which describes the action of Eq. (5) on a discrete sampling of p D in vector form p D (the RF data) as where p 0 is a vector of initial pressure values in the sample (the image), then we can also define a matched [13,31] forward (adjoint) operator A such that where A * = A T since A contains only real-valued elements [13]. We will refer to A and A T as the projection and backprojection operators, respectively. Iterative frameworks for solving the acoustic inverse problem in PAI have been well studied in the past [14]. In this project, we have chosen the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [32] with a Total Variational (TV) regularizer [33] which is used widely in the field of PAI [13,15]. As we are dealing with the limited-view problem [12], we can never fully compute p 0 , therefore formulating PAI as an inverse problem and following the steps outlined in [13,15,32,33] with our best possible estimation to p 0 being denoted byp 0 , the minimization objective will be min p 0 In Algorithm 1 we are using the monotone version of FISTA (MFISTA) outlined in [33] to prevent divergence of the cost function F due to the lack of an exact solution for the TV denoising problem [33]. We iteratively updatep 0,j , the j-th guess, from our initial guess ofp 0,0 being the FBP reconstruction to converge top 0 . TV 2λ 2 /L j is a solution to the TV denoising problem, implemented using the method of Chambolle [34], and L j is the Lipschitz constant, used as the denoising weight found using backtracking line search [33]. Note that p D in Algorithm 1 may be any of the RF datasets explained, i.e. Noisy, Reference, GAN Denoised, and SVD Denoised. In summary, as the algorithm progresses,p 0,j converges top 0 , a TV-regularized solution to Eq. (6).

Metrics
In this section we present the metrics we use to both assess the result of the GAN denoising directly using the RF data, as well as indirectly by inspecting the resultant reconstructions. In the formulas presented in this section, y refers to our ground-truth and expected output andŷ refers to our estimated output. Due to the dual purpose of our assessment, y andŷ can be substituted in by both the RF data and the reconstructed images in the formulas below. We begin by using the traditional mean squared error where n p is total number of pixels in an image and y i andŷ i the value of the pixel i in the expected and estimated images respectively. The second metric we use to quantify the performance is the structural similarity index measurement (SSIM) [35] SSIM(y,ŷ) = (2µ y µŷ + c 1 )(2σ yŷ + c 2 ) (µ 2 y + µ 2 y + c 1 )(σ 2 y + σ 2 y + c 2 ) where µ is the mean intensity and σ the standard deviation of the signal, and σ yŷ is the correlation coefficient of y andŷ The constant terms c 1 and c 2 are used to avoid ill-defined values in Eq. (10); we use the same values presented in [35].
The final metric used for assessing the quality of the denoised RF data directly and indirectly is the feature similarity index measurement (FSIM) [36,37] which assesses the similarity of the low-level features in the images similar to the features the human visual system uses. FSIM is defined as where the vector x represents the spatial domain of the the images y andŷ, and Ω is the entire spatial domain. PC m (x) is the maximum phase congruency [38]-a dimensionless quantity providing an absolute measure of significance of feature points, between the two images. S L (x) is a similarity measure between the two images calculated using where with G i (x) being the gradient magnitude and PC i (x) the phase congruency of image i in domain x. s 1 and s 2 are constants to avoid ill-defined values of S PC (x) and S G (x) which we chose the same values used in [36,37].

Data acquisition
All RF data were acquired using the SonixEmbrace Automated Breast Ultrasound Scanner (ABUS -Ultrasonix Medical Corporation, Richmond, BC, Canada). This scanner consists of a 384-element transducer with −12 cm radius of curvature, 10 MHz centre frequency, and 90 % bandwidth. The transducer is embedded in a spherical dome attached to a motor, which rotates through 360°to collect volumetric data. A SonixDAQ module (DAQ -BK Medical, Peabody, MA) was used to acquire pre-beamformed RF data at 40 MHz. Our illumination source consists of a Continuum Surelite II laser (Continuum, Santa Clara, CA) pumping an optical parametric oscillator (OPO) from the same manufacturer. The OPO output is homogenized [39] and coupled into a 1 mm silica-core optical fiber, which is coupled to a custom illuminator designed to deliver a fan-shaped beam of diffuse illumination to the sample surface through a window parallel to the ABUS transducer [19]. Synchronization of the RF data acquisition with the laser illumination was accomplished using a custom Arduino-based circuit controlled over USB. With the data download from the SonixDAQ to the PC being the rate-limiting step, we are able to acquire one frame per transducer position at 50 equally spaced positions in about 20 minutes. Our training data was acquired from a custom, modular wire phantom [19], consisting of black spray-painted monofilament fishing line suspended between attachment points on a 3D printed template as seen in Fig. 8(a). To validate our GAN, we imaged a commercial photoacoustic phantom containing optically absorbing spheres purchased from Computerized Imaging Reference Systems Incorporated (CIRS -Norfolk, VA). For testing and quantification, we used the same black fishing line described above, tied into a small figure-of-eight knot, as shown in Fig. 8(b), which provides several features at various angles within a small imaging volume.
To test how the model performs on out-of-distribution data, we also developed an anatomically realistic 3D printed vessel phantom using labelled magnetic resonance angiography patient data made publicly available by Lou et al. [40]. Beginning with dataset "Neg_35_Left", we extracted voxels labelled as blood vessels. We added a rectangular base for the vessels to attach to, and discarded any vessel segments which were not attached to either the base or another vessel, as these would be unsupported in the final print. Finally, we converted the voxel data to a surface mesh which was exported to STL format for 3D printing. The phantom was printed on a Form 2 3D printer (FormLabs, Somerville, MA) with a layer height of 0.1 mm, using "Tough 2000" resin from the same manufacturer. Finally, the print was cleaned up and spray painted black. While some of the smallest vessels were not faithfully reproduced, and others still were lost during the cleanup process, the phantom still provides realistic geometry, and vessels ranging in diameter from 0.2 mm (the voxel size in the original dataset) up to 5 mm. The numerical data, as well as a photo of the final phantom are shown in Fig. 8(c).
The training phantom was imaged at 100 imaging locations, the validation phantom at 10, and the two different testing phantoms each at 50 equally spaced imaging locations in the ABUS dome. Each location was imaged 20 times in all three cases. Since we train our GAN using (128, 128) patches, this corresponds to 4300, 430, and 2150 patch pairs for the training, validation and hyperparameter optimization, and testing datasets, respectively.

Results
In this section, we first provide an example of a training trial for the model. We then provide our results by first assessing the pre-beamformed RF data followed by the resulting reconstructions.

Model hyperparameter tuning
As mentioned previously, in order to train our GAN model we used a validation dataset to tune the different hyper-parameters using the CIRS phantom described in 2.6. Figure 9 is an example of this process where we chose the optimal number of training epochs. Ideally we would want the minimum MSE value and the maximum FSIM and SSIM to occur at the same epoch, however, Fig. 9 suggests that considering all three metrics, epoch 820 (dashed red line) provides a good balance of a low MSE value alongside high SSIM and FSIM values which are desirable for our task. This is due to the fact that while low MSE corresponds to the removal of the background noise, SSIM and FSIM correspond to the fidelity of the output signals' shape compared to the true signals. These 820 training epochs took an average of 28 seconds each, for a total training time of about 6.5 hours.

RF data
In applying our metrics, we compared our reference RF dataset to the other possible approaches of only frame-averaging, only SVD denoising, and finally GAN denoising. The results are the mean values from 50 imaging locations along the ABUS dome. Additionally, we have also added a 10-frame averaged case for comparison purposes to the 20 frames averaged case. These results are summarized in Table 1 for the knot phantom, and in Table 2 for the vessel phantom. Figure 10 and Fig. 11 illustrates sample denoised frames of the knot and vessel phantom, respectively, using the different approaches for one of the aforementioned 50 imaging locations. We note that increasing the number of averaged frames beyond 20, we found diminishing returns with respect to our quality metrics which did not justify the proportional increase in scan time. A plot of this relationship is included in Figure S1.    Table 3 summarizes the reconstruction results of the knot phantom using the different RF data available with reconstructions resultant from the reference RF dataset as ground-truth. The 2D reconstructions are in the axial/lateral plane of the ABUS transducer at a resolution of 0.1 mm. We achieved convergent FISTA reconstructions with the regularizer parameter λ 2 = 0.01 after 20 iterations of Algorithm 1. By examining the minimization objective as a function of iteration number, we can see that there is little improvement beyond 20 iterations. An example of this relationship is included in Figure S2. As before, these results are the mean of 50 imaging locations with Figs. 12 and 13 providing examples for FBP and iterative reconstructions at one of these imaging locations.   Fig. 12. 2D FBP reconstruction of the knot phantom data after the RF data has been processed using different denoising approaches. The line plot in each image is a profile plot along the dashed white line. Reference refers to reconstructions using the reference RF dataset. Fig. 13. 2D FISTA reconstruction of the knot phantom data after the RF data has been processed using different denoising approaches. The line plot in each image is a profile plot along the dashed white line. Reference refers to reconstructions using the reference RF dataset.

Discussion
For the knot phantom, as shown in Table 1, the GAN outperforms all of the other denoising cases except in its SSIM value which is slightly below that of 20 frames averaged. This is mainly due to the GAN being trained on the SVD and 20 frames averaged combination which is the theoretical upper limit of its performance. In Table 3, we can see that the reconstructions of the knot phantom data denoised using the GAN outperform all of the other cases, both in FBP and FISTA. This is significant since all of these performance benefits come at a fraction of a time of the frame averaged cases due to the GAN needing only a single frame of data per imaging location. These results are consistent with characteristics of the metrics used. While MSE mainly focuses on the absolute error between images, SSIM and FSIM focus on the structural differences and similarities [42]. Additionally, if we take a closer look at Fig. 10, we can see that the GAN and SVD are the only two cases that have managed to remove the straight line noise band visible at sample 550 (indicated in the A-lines in Fig. 10 with a red arrow). However, the GAN displays less background noise which is consistent with the results of Table 1, one such example being the signal at sample 400 (indicated in the A-lines in Fig. 10 with a green arrow) which is strongest in the GAN output. Once again this is due to the fact that the GAN learns the optimal behaviour from both frame averaging and SVD denoising since it has been trained on their combination. Similar behaviour is also seen in the reconstruction results in Fig. 12 and Fig. 13. The effects of the leftover noise bands can clearly be seen in the FBP(p D,10−avg ), FBP(p D,20−avg ), FISTA(p D,10−avg ), and FISTA(p D,10−avg ) cases as circular arcs, while the increased background noise can be seen in the FBP(p D,SVD ) and FISTA(p D,SVD ) cases.
Despite being significantly different from the training data, we see that the GAN performed similarly well on the vessel phantom data. Qualitatively, we see again that the GAN denoising removed the worst noise bands from the RF data, as is clear in the A-line plots in Fig. 11. Table 2 shows that the GAN performed best with respect to FSIM, and very near to the best result with respect to MSE and SSIM.
The timing benefits of using only a single frame of data for imaging will result in a better PAI frame rate as the GAN model can be loaded prior to the imaging session and denoising a frame takes on average 0.3 seconds on an Nvidia GeForce GTX 1060 GPU using the PyTorch library. This computation time will be reduced once inference-only libraries like TensorRT are used in the clinical deployment stage [43]. This is similar to our GPU implementation of SVD denoising, and since our 384-element transducer requires three 128-element acquisitions per frame (with one laser pulse each), we are still limited by the 10 Hz repetition rate of our laser. Additionally, averaging several frames per imaging location will most likely cause distortions in the data as clinical imaging subjects, i.e. human organs, will move throughout the data acquisition, therefore imaging only a single frame per location provides additional benefits.
Future work will benefit from a set of training data that is independent of both the SVD and frame averaging. One possible approach would be to use simulation data and add noise for training purposes; however, adding sensor specific noise to simulation data correctly will bring forth a new set of challenges. Ultimately, this will enable the GAN to overcome the shortcomings that might arise from using either the SVD or averaging. Further, while the effectiveness of this denoising method is fundamentally limited by the SNR of the training data, it would be useful to explore the absolute limits of this technique in terms of the minimum SNR which still allows recovery of photoacoustic signals. We are exploring the application of this technique to photoacoustic data acquired using a laser diode illuminator, where significant frame averaging is usually required to attain an SNR similar to pulsed laser systems.
While this technique is promising based on the metrics we have chosen, it will be important moving forward to further study the specific effects of this method on the RF data and corresponding reconstructions to ensure, for example, that the attainable resolution is not affected, or that certain reconstruction methods are not rendered less effective.
Finally, for this method to be useful, it must be tested on in-vivo data, which raises the question of how one would gather sufficient training data for such a study. While it would be prohibitively time consuming to acquire such a dataset from a single patient, we are hopeful that smaller scans of a modest cohort of patients or volunteers (10 to 15) would provide a sufficient dataset to train our model. Whether or not such a model would be robust enough for the high variability of in-vivo data will be the ultimate test of this technique. Our tests on the vessel phantom suggest that some out-of-distribution data can be effectively denoised, providing some promise in this regard.

Conclusion
In this paper we have shown that using a GAN to denoise pre-beamformed RF data is a viable option both for removing general background noise for which frame-averaging is standard [16]. The approach can be used to remove artifacts where algorithms such the SVD denoising are traditionally employed. Using SVD and frame-averaging as our training standard, we have improved upon the performance of single frame SVD output, producing results similar to SVD and frame averaging combined. Additionally we have shown that improving the quality of the RF data results in improvements in the corresponding reconstructions.