Deep learning approach to improve tangential resolution in photoacoustic tomography.

In circular scan photoacoustic tomography (PAT), the axial resolution is spatially invariant and is limited by the bandwidth of the detector. The tangential resolution, however, is spatially variant and depends on the aperture size of the detector: it improves as the aperture size decreases. However, using a detector with a smaller aperture reduces the sensitivity of the transducer, so large-aperture detectors are widely preferred in circular scan PAT imaging systems. Although several techniques have been proposed to improve the tangential resolution, they have inherent limitations such as high cost and the need for customized detectors. Herein, we propose a novel deep learning architecture to counter the spatially variant tangential resolution in circular scanning PAT imaging systems. We used a fully dense U-Net based convolutional neural network architecture along with 9 residual blocks to improve the tangential resolution of the PAT images. The network was trained on simulated datasets and its performance was verified by experimental in vivo imaging. Results show that the proposed deep learning network improves the tangential resolution eightfold, without compromising the structural similarity or quality of the image.


Introduction
Photoacoustic computed tomography/photoacoustic tomography (PACT/PAT) is a hybrid imaging modality that combines the merits of both optical and ultrasound imaging [1][2][3][4]. Over the past decade, PAT has shown many potential applications in various clinical and preclinical studies [5][6][7][8][9]. In PAT, nanosecond laser pulses are used to irradiate the target object (e.g., biological tissue). The light energy incident on the target is absorbed by chromophores, leading to thermoelastic expansion and the generation of pressure waves [also known as photoacoustic (PA) waves]. These PA waves are then acquired around the target boundary by employing an ultrasound transducer (UST). Several scanning geometries are used in PAT; the most widely used is the circular scanning geometry, in which the UST is rotated around the sample in a circular configuration to acquire signals at different locations. The acquired signals are then reconstructed into a cross-sectional PAT image using various reconstruction algorithms [10][11][12][13].
In circular scan PAT, the axial (radial, along the ultrasound axis or perpendicular to the scan axis) resolution is spatially invariant and is determined by the finite bandwidth of the detector, whereas the tangential (lateral, or parallel to the scan direction) resolution is spatially varying and depends on the aperture size (active area) of the detector [14,15]. For a detector with a large aperture, the tangential resolution is poor near the detector surface in comparison with that near the scanning center. A pictorial representation of how the axial and tangential resolutions are defined in the circular scanning configuration is shown in Fig. 1. An easy way to overcome this problem is to increase the scanning radius by moving the UST far away from the imaging region of interest. However, increasing the scanning radius is not practically feasible owing to the degradation of sensitivity and constraints on space. Another way to improve the tangential resolution is to use a small-aperture transducer (ideally a point detector), but the sensitivity of such detectors is low. Without increasing the radius or decreasing the aperture of the transducer, the tangential resolution can be improved experimentally by attaching a negative acoustic lens to the detector surface [16]. However, the negative acoustic lens is custom-made, its attachment reduces the sensitivity of the detector, and building a transducer with a negative lens built in can be expensive. The tangential resolution can also be improved by replacing the conventional USTs with high numerical aperture (NA)-based virtual point detectors, which offer a wide acceptance angle, high sensitivity, and a negligible aperture effect [17,18]. However, virtual point detectors are also homemade and are not commercially available. Alternatively, the tangential resolution can be improved by employing a modified delay-and-sum beamformer for image reconstruction [19].
In this approach there is no need for external lenses or custom-made detectors. In the conventional delay-and-sum, a large-aperture detector is treated as a point detector during the beamforming process; this approximation introduces artefacts in the reconstructed image. In the modified delay-and-sum, the entire surface of the detector is instead treated as many small segments. The modified delay-and-sum has also been experimentally validated, showing a threefold improvement in tangential resolution [20]. The impact of sensor apodization on the tangential resolution has also been studied with the modified delay-and-sum [21]. The modified delay-and-sum suffers from limitations such as the inability to preserve the shape of the target at the farthest point from the scanning center and a higher computational load. Therefore, an alternative approach to alleviate the problem of tangential resolution needs to be explored. Deep learning is a powerful machine learning approach that utilizes a range of neural networks to improve the quality of images [22][23][24]. Specifically, convolutional neural networks (CNNs) are the most widely employed neural networks and have proven powerful in solving complex imaging problems [25,26]. Recognizing the potential of CNNs, there has been significant interest in the photoacoustic community in applying them to overcome current limitations and challenges [27,28]. In general, the CNN based reconstruction techniques employed in PAT can be classified into four different approaches, namely, post-processing, pre-processing, direct processing, and hybrid processing, based on the type of input data to the CNN [29]. In the post-processing approach, the images resulting from conventional reconstruction methods are fed into the CNN to improve their quality.
In the pre-processing approach, the quality of the raw PA signals (sinograms) is improved using a CNN, and the improved sinograms are then used with a conventional reconstruction method to produce the final PA images. In direct processing, the images are obtained from the CNN by directly feeding it the raw PA signals (sinograms), without the need for any conventional reconstruction technique. Lastly, in hybrid processing, both the sinograms and the conventionally reconstructed PA images are fed into the CNN to obtain the final PA images. Several CNN based architectures have been successfully demonstrated for post-processing applications such as artefact removal and contrast improvement in PAT images [30][31][32]. The potential of CNNs for direct processing and hybrid processing has also been successfully demonstrated for PA imaging [33][34][35]. However, the application of a CNN to directly process images from the raw data is limited by the quality of the initial reconstruction, and certain features cannot be recovered by this approach. Furthermore, direct processing suffers from computational complexity and requires tedious parameter adjustments [27]. Thus, it is optimal to use a conventional reconstruction technique to obtain the images from the raw data and then employ a CNN to improve their quality (the post-processing approach).
In this work, we propose a post-processing based deep learning approach to improve the tangential resolution in PAT using a novel CNN based architecture termed the TARES network. This is the first work in which a CNN based architecture has been applied to improve the tangential resolution of PAT images. The proposed network was trained on simulated PA data, and its performance was verified on experimental phantom data and in vivo PA imaging. All simulated PA data was generated using the k-Wave toolbox in MATLAB [36]. The trained model performed well on the test dataset as well as on the experimental phantom and in vivo animal imaging. In comparison with the conventional delay-and-sum, the deep learning-based method improved the tangential resolution eightfold. Furthermore, the shapes of the target objects were also preserved with the deep learning approach.

Proposed TARES network architecture
Since its inception, U-Net has been widely used for various imaging related tasks. It incorporates downsampling and upsampling layers along with skip connections, resembling a symmetrical U shape [37,38]. However, U-Net requires additional extensions for improved accuracy [39]. An improved variant of U-Net, termed the fully dense (FD) U-Net, was first proposed for artefact removal in PAT [40] and was adopted for other PA applications due to its better performance [41][42][43]. The superior performance of the FD U-Net in comparison with the traditional U-Net has already been demonstrated [40]. The FD U-Net comprises dense blocks in the contracting and expanding paths. These dense blocks connect the earlier convolutional layers to all subsequent layers via concatenation, thus enabling each layer to learn additional feature maps from the knowledge gained by previous layers. Furthermore, the dense layers increase the depth of the network, and the dense connectivity mitigates the problem of vanishing gradients. To improve the tangential resolution in PAT, we propose a network architecture termed the TARES (tangential resolution) network. The proposed network uses the basic structure of the FD U-Net with 9 residual blocks in the bottom layer. Figure 2 shows the detailed architecture of the TARES network. The network expects an input image of size 128×128 pixels and returns a resultant image of size 128×128 pixels. The addition of residual blocks to the network is inspired by the de-raining model [44]. The inclusion of the residual blocks improves the denoising performance of the FD U-Net, and the shorter paths from earlier to later layers alleviate the problem of vanishing gradients. In the contracting path of our network, the downsampling operation is performed by a 1×1 convolution block followed by a 3×3 convolution block of stride 2, and the upsampling operation in the expanding path is carried out by a 3×3 transposed convolution block of stride 2.
Each convolution block incorporated in the network consists of a convolution and a nonlinear activation (SELU) followed by batch normalization. Using SELU as the activation function improves the learning speed and robustness of the network [45]. All the dense blocks employed in our model take f feature maps (channels) as input and return 2 × f feature maps (channels) as output, with a growth rate of f/4; their structure is analogous to the dense blocks used in the FD U-Net [40]. To prevent the loss of spatial information, skip connections were used for each layer, and the resultant image is obtained by adding the input image to the residual image.
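The channel bookkeeping of such a dense block can be sketched as follows. This is a minimal NumPy sketch in which a random channel mixing stands in for the convolution, SELU, and batch-normalization blocks; only the dense connectivity and the f → 2f channel growth described above are illustrated, not the actual trained layers.

```python
import numpy as np

def conv_stub(x, out_ch, rng):
    # stand-in for a convolution block (convolution + SELU + batch norm):
    # a random channel mixing that only tracks tensor shapes
    w = rng.standard_normal((x.shape[-1], out_ch))
    return x @ w

def dense_block(x, rng):
    """Dense block as in FD U-Net: takes f channels; each sub-layer adds
    growth-rate f/4 new feature maps via concatenation, so four
    sub-layers return 2f channels."""
    f = x.shape[-1]
    growth = f // 4
    feats = x
    for _ in range(4):
        new = conv_stub(feats, growth, rng)            # f/4 new maps
        feats = np.concatenate([feats, new], axis=-1)  # dense connectivity
    return feats  # f + 4 * (f/4) = 2f channels

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 128, 16))  # H x W x f
y = dense_block(x, rng)                  # H x W x 2f
```

Each sub-layer sees the concatenation of all earlier outputs, which is what lets later layers reuse earlier feature maps and keeps gradients flowing through short paths.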

Network training and Implementation
The proposed TARES network was implemented in Python 3.7 with the TensorFlow v2.3 deep learning library [46]. The network training was performed on an NVIDIA Tesla V100-32GB GPU using the nodes of the Gekko cluster, High Performance Computing Centre, Nanyang Technological University, Singapore. The loss function used for training the network is a combination of the mean absolute error (MAE) and the Fourier loss (FMAE), calculated as Total loss = λ1 × MAE + λ2 × FMAE [42,43]. The weights assigned to λ1 and λ2 are 1 and 0.01, respectively. For optimizing the network, the Adam optimization algorithm was used in conjunction with a learning rate scheduler that reduces the learning rate by a factor of 0.5 when the monitored metric stops improving. The initial learning rate was 0.001. The saving metric used is a combination of the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR); it is analogous to the saving metric used in [42]. The proposed network was trained for 100 epochs with a batch size of two, and the training process took approximately 4 hours to complete. We observed the best SSIM of 0.96147 and PSNR of 31.909 at the 62nd epoch.
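A minimal sketch of this combined loss is given below. It assumes the Fourier loss FMAE is the mean absolute error between the magnitude spectra of the predicted and target images; the exact form used in training follows the cited references [42,43].

```python
import numpy as np

def mae(a, b):
    # mean absolute error between two images
    return np.mean(np.abs(a - b))

def total_loss(pred, target, lam1=1.0, lam2=0.01):
    """Total loss = lam1 * MAE + lam2 * FMAE, where FMAE is assumed to
    be the MAE between the 2-D FFT magnitude spectra of the images."""
    fmae = mae(np.abs(np.fft.fft2(pred)), np.abs(np.fft.fft2(target)))
    return lam1 * mae(pred, target) + lam2 * fmae
```

With λ2 = 0.01, the Fourier term acts as a small regularizer that penalizes spectral differences which the pixel-wise MAE alone can miss.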

Simulated photoacoustic data sets for training
Deep learning is a data driven approach, and its performance depends on the quality of the data used for training [47]. In general, the training data used in a deep learning model comprises the input data and the expected output data (ground truth). Although it is possible to obtain a large amount of input data experimentally using the PAT system, the expected ground truths are unavailable or difficult to obtain. Thus, we used the MATLAB toolbox k-Wave to generate the photoacoustic datasets for training [36]. Three numerical phantoms were used for the simulation [Figs. 3(a)-3(c)]: point targets (four point sources, in which the size [0.5-0.7 mm], position [distance from the scanning center], orientation, and source strength of the point sources were varied randomly), a triangular shape (whose size, orientation, and position [distance from the scanning center] were varied randomly), and vessel shapes mimicking the venous sinuses of the rodent brain (whose orientation, magnitude, and position [distance from the scanning center] were changed randomly). The simulation geometry used is shown in Fig. 3(d). A computational grid of 410×410 pixels (0.1 mm/pixel) with a perfectly matched layer was used for generating the forward PA data. The imaging region was restricted to 40 mm, and the PA data was acquired at 800 sensor locations. A time step of 40 ns with 1500 time steps was used, and 1% noise was added, thus maintaining the signal-to-noise ratio (SNR) at 40 dB. The speed of sound was 1500 m/s, and the medium was acoustically homogeneous. For the input PA data, two large-aperture unfocused USTs (13 mm active area) of central frequency 2.25 MHz and 5 MHz with 70% nominal bandwidth were considered, whereas the ground-truth PA data were generated with ideal point detectors of central frequency 2.25 MHz and 5 MHz with 70% nominal bandwidth.
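The 1% noise level quoted above corresponds to a 40 dB peak SNR, since 20·log10(100/1) = 40. The noise-addition step can be sketched as follows; the sinusoidal trace is a hypothetical stand-in for a simulated PA time series, not actual k-Wave output.

```python
import numpy as np

def add_noise(signal, noise_percent=1.0, rng=None):
    """Add white Gaussian noise whose standard deviation is the given
    percentage of the peak signal amplitude: 1% noise corresponds to a
    peak SNR of 20*log10(100/1) = 40 dB."""
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = (noise_percent / 100.0) * np.max(np.abs(signal))
    return signal + rng.normal(0.0, sigma, signal.shape)

# stand-in for one simulated PA time series of 1500 time steps
clean = np.sin(np.linspace(0.0, 8.0 * np.pi, 1500))
noisy = add_noise(clean)
snr_db = 20.0 * np.log10(np.max(np.abs(clean)) / np.std(noisy - clean))
```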
The simulated forward PA data was then reconstructed into cross-sectional PAT images of size 128×128 pixels using the traditional delay-and-sum beamformer. The generated dataset comprises 3000 PAT images simulated using both USTs of central frequency 2.25 MHz and 5 MHz. The dataset was randomly divided into training, validation, and testing subsets in the ratio 90:5:5. The model was optimized using the training dataset, the hyperparameters were tuned using the validation dataset, and the performance of the model was assessed using the testing dataset.
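For illustration, the delay-and-sum back-projection used for this initial reconstruction, and the modified variant discussed in the Introduction that splits the detector face into small segments, can be sketched as follows. This is a minimal 2-D NumPy sketch with idealized delta-like signals and hypothetical grid, sampling, and aperture parameters, not the exact implementation used in this work.

```python
import numpy as np

def delay_and_sum(signals, sensor_pos, gx, gy, c=1500.0, fs=25e6):
    """Conventional delay-and-sum: every detector is treated as a point
    detector; signals[i] is the time series recorded at sensor_pos[i]."""
    n_sensors, n_samples = signals.shape
    image = np.zeros(gx.shape)
    for i in range(n_sensors):
        d = np.hypot(gx - sensor_pos[i, 0], gy - sensor_pos[i, 1])
        idx = np.clip(np.round(d / c * fs).astype(int), 0, n_samples - 1)
        image += signals[i, idx]  # back-project at the time-of-flight
    return image

def modified_delay_and_sum(signals, sensor_pos, sensor_dir, aperture,
                           gx, gy, n_seg=8, c=1500.0, fs=25e6):
    """Modified delay-and-sum: the flat detector face is divided into
    n_seg small segments, each back-projected as a point detector."""
    n_sensors, n_samples = signals.shape
    image = np.zeros(gx.shape)
    offsets = (np.arange(n_seg) - (n_seg - 1) / 2.0) * (aperture / n_seg)
    for i in range(n_sensors):
        tang = np.array([-sensor_dir[i, 1], sensor_dir[i, 0]])  # along the face
        for off in offsets:
            seg = sensor_pos[i] + off * tang
            d = np.hypot(gx - seg[0], gy - seg[1])
            idx = np.clip(np.round(d / c * fs).astype(int), 0, n_samples - 1)
            image += signals[i, idx] / n_seg
    return image

# demo: a point source at the center of a 20 mm radius circular array
n, radius = 64, 0.02
ang = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
pos = np.stack([radius * np.cos(ang), radius * np.sin(ang)], axis=1)
dirs = -pos / radius                                # detectors face the center
sig = np.zeros((n, 600))
sig[:, int(round(radius / 1500.0 * 25e6))] = 1.0    # delta at time-of-flight
grid = np.linspace(-0.005, 0.005, 21)
gx, gy = np.meshgrid(grid, grid)
img = delay_and_sum(sig, pos, gx, gy)
img_mod = modified_delay_and_sum(sig, pos, dirs, 0.013, gx, gy)
```

Splitting the 13 mm aperture into segments is what restores the tangential localization for pixels far from the scanning center, at the cost of n_seg times more back-projections.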

Experimental phantom photoacoustic imaging
To validate the usefulness of the proposed TARES network in real experimental PA imaging, phantom imaging was done first. Two types of phantoms, namely four-point targets (pencil leads located at 0, 8.5, 17, and 25.5 mm from the scanning center) and a triangular horsehair phantom, were used for obtaining the PAT images [Figs. 3(e), 3(f)]. A schematic representation of the experimental setup employed for imaging the phantoms is shown in Fig. 4(a). A ∼532 nm Q-switched Nd:YAG laser capable of delivering 10 pulses per second at 5 ns pulse width was used as the excitation source. Two unfocused USTs of 2.25 MHz (Olympus-NDT, V306-SU) and 5 MHz (Olympus-NDT, V309-SU) central frequency (70% nominal bandwidth) were employed to acquire the generated PA waves. The acquired PA signals were amplified by a low-noise signal amplifier (Olympus-NDT, 5072PR) and stored in a desktop computer (Intel Xeon, 3.7 GHz 64-bit processor, 16 GB RAM) using a data acquisition (DAQ) card (GaGe, CompuScope 4227). The collected PA signals were then reconstructed into PAT images by the conventional delay-and-sum beamformer.

In vivo photoacoustic rat brain imaging
The feasibility of the proposed TARES network was also verified with in vivo PA imaging. For the in vivo imaging, two Sprague Dawley female rats (weighing ∼90 ± 5 g) procured from InVivos Pte. Ltd., Singapore were utilized. A combination of ketamine (100 mg/mL) and xylazine (20 mg/mL) was used to anesthetize the rats. Before imaging, the hair on the rat head was removed using depilatory cream and eye gel was applied to the eyes of the rat. A layer of ultrasound gel was applied on the rat scalp, and the animal was mounted on the system as shown in Fig. 4(b). During imaging, a continuous supply of 0.75% isoflurane and 1.0 L/min oxygen was provided to maintain anesthesia. The experimental setup used for acquiring the in vivo PAT images of the rat brain is shown in Fig. 4(b). For in vivo experiments, a near-infrared (NIR) wavelength is usually used to achieve better penetration depth. Thus, a pulsed laser diode (PLD) (Quantel, France) of ∼816 nm wavelength (2 kHz pulse repetition rate, 107 ns pulse width) was used as the excitation source. The PLD was controlled by a laser driving unit (LDU), and the output laser beam was homogenized using an optical diffuser. The laser energy density was maintained at ∼0.17 mJ/cm², which is within the ANSI safety limits [48]. Two unfocused USTs of 2.25 MHz (Olympus-NDT, V306-SU) and 5 MHz (Olympus-NDT, V309-SU) central frequency (70% nominal bandwidth) were used to acquire the generated PA signals. The acquired PA signals were amplified with a 48 dB low-noise amplifier (Mini-Circuits, ZFL-500LNBNC). The amplified signals were then stored in a desktop computer (Intel Xeon, 3.7 GHz 64-bit processor, 16 GB RAM) using a DAQ card (Spectrum, M2i.4932-Exp) and reconstructed into PAT images using the conventional delay-and-sum beamformer. After the experiments, the animals were euthanized by intraperitoneal administration of Valabarb (sodium pentobarbitone 300 mg/mL).

Model comparison
Initially, we compared the performance of the proposed TARES network with that of the FD U-Net using a variety of metrics [SSIM, PSNR, MAE, MSE], as shown in Table 1. For the comparison, the FD U-Net architecture was also trained with the same training dataset for 100 epochs, and the metrics were evaluated over the test data using the best model (63rd epoch) [40,42]. The TARES network exhibited the best performance over all the metrics [SSIM, PSNR, MAE, MSE] in comparison with the FD U-Net. Figure 5 shows the images of two numerical phantoms obtained using the conventional delay-and-sum, the modified delay-and-sum, and the TARES network. The two numerical phantoms shown in Fig. 5 were simulated with a UST frequency of 2.25 MHz. The reconstructed image of the four-point targets using the conventional delay-and-sum is shown in Fig. 5(a), and the zoomed images of the individual point targets 1 to 4 are shown in Figs. 5(d)-5(g). The elongation of the point targets in the tangential direction can be clearly visualized from these images. Note that the elongation of a point is larger (i.e., the tangential resolution is poorer) when the point is farther from the scanning center (closer to the UST). Figure 5(b) shows the reconstructed PAT image of the four-point targets using the modified delay-and-sum beamformer. The zoomed-in images of the individual point targets 1 to 4 are shown in Figs. 5(h)-5(k). From Fig. 5(b), it is evident that the modified delay-and-sum improves the tangential resolution as expected. However, it can also be inferred that the ability of the modified delay-and-sum to improve the tangential resolution and to preserve the shape of the target object degrades with increasing distance from the scanning center. The tangential-resolution-improved images using the TARES network are shown in Fig. 5(c), and the respective zoomed-in images of the reconstructed individual point targets 1 to 4 are depicted in Figs. 5(l)-5(o).
A similar improvement was also noted in the point target images simulated with a 5 MHz UST frequency; these images are not shown here for brevity. The tangential resolution was also quantified at each point target by using the edge spread function (ESF). In the ESF approach, the resolution is determined by calculating the distance required by the edge response (from the sharp discontinuity at the edge of the point target) to rise from 10% to 90% of the maximum intensity value. The quantified tangential resolution at each point target position was plotted for both USTs of central frequency 2.25 MHz and 5 MHz [Figs. 5(p), 5(q)]. Note that the proposed TARES network outperforms the modified delay-and-sum in improving the tangential resolution over all the points located at different distances from the scanning center. In particular, at the farthest point, a threefold improvement in the tangential resolution was observed with the TARES network in comparison with the modified delay-and-sum, and a tenfold improvement in comparison with the conventional delay-and-sum. The axial resolution over all the points remains almost the same in all three reconstructed images shown in Figs. 5(a)-5(c). Furthermore, the potential of the TARES network to improve the image quality (owing to the tangential resolution improvement) can also be visualized by comparing the conventionally reconstructed and TARES network reconstructed images of the numerical triangular phantom shown in Figs. 5(r), 5(s). The yellow arrows in Fig. 5(r) indicate that the reconstructed image quality suffers at the corners of the triangle (owing to their location far from the scanning center). In Fig. 5(s), the corresponding corners have been recovered very well by the TARES network. This improvement in image quality is further substantiated by the higher SSIM [SSIM = 0.96] of the tangential-resolution-improved image of the numerical triangular phantom.
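The 10%-90% criterion described above can be sketched as follows. This is a minimal sketch assuming a monotonically rising edge profile; the linear ramp is a hypothetical profile used only to illustrate the interpolation, not measured data.

```python
import numpy as np

def edge_resolution(profile, pixel_size_mm):
    """Resolution from an edge spread function: the distance over which
    the normalized edge response rises from 10% to 90% of its maximum,
    located by linear interpolation along a monotonically rising profile."""
    p = (profile - profile.min()) / (profile.max() - profile.min())
    x = np.arange(len(p)) * pixel_size_mm
    x10 = np.interp(0.1, p, x)   # position of the 10% crossing
    x90 = np.interp(0.9, p, x)   # position of the 90% crossing
    return x90 - x10

# hypothetical edge profile: a linear ramp sampled at 0.1 mm/pixel
ramp = np.linspace(0.0, 1.0, 101)
res_mm = edge_resolution(ramp, 0.1)  # 10%-90% rise distance of a 10 mm ramp
```

In practice this measurement is applied to a line profile taken through each reconstructed point target along the tangential direction.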

Experimental phantom photoacoustic imaging
The success of the TARES network in drastically improving the tangential resolution on simulated PA imaging data led us to apply the network to real experimental PA images. As described earlier, two types of phantoms were used for the experiments. Two different USTs of central frequency 2.25 MHz and 5 MHz were employed to acquire the PAT data. The acquired PAT data was then reconstructed using the conventional delay-and-sum and the TARES network. Figure 6(a) shows the PAT image of the four-point targets obtained using the 2.25 MHz UST with the conventional delay-and-sum. The zoomed images of each point target are shown in Figs. 6(b)-6(e). Elongation of the point targets can also be visualized in the conventionally reconstructed experimental point target PA images. The tangential-resolution-improved image of the four-point targets using the TARES network is shown in Fig. 6(f), and the zoomed-in images of the individual point targets are shown in Figs. 6(g)-6(j). Tangential resolution plots versus distance from the scanning center for both USTs (2.25 MHz and 5 MHz) are shown in Figs. 6(k), 6(l). The individual zoomed-in PA images and the tangential resolution vs distance plots confirm the ability of the TARES network to recover the points very well, even at the farthest point, in the real experimental PAT images. In particular, the TARES network improved the tangential resolution eightfold at the farthest location. The ability of the TARES network to preserve the shape and to improve the image quality in the reconstructed PAT images can be visualized by comparing the PAT images of the triangular horsehair phantom shown in Figs.

In vivo photoacoustic rat brain imaging
The TARES network's performance was also evaluated on in vivo rat brain imaging. Two different USTs of central frequency 2.25 MHz and 5 MHz were employed to acquire the PAT data using the PAT system described earlier [Fig. 4(b)]. Figures 7(a) and 7(c) show the in vivo PAT images obtained using the 2.25 MHz and 5 MHz USTs with the conventional delay-and-sum. Note that in these images the boundaries of the venous sinuses are not clearly visible (blurred, shown with dashed yellow arrows); in particular, the demarcation of the transverse sinus is faint and, in some cases, the transverse sinus is not visible at all. This degradation of the venous sinuses is a major hurdle in employing the single-element UST based PAT system for in vivo animal studies such as the detection of intracranial hypotension [49] and the detection of cerebral hemorrhage. Furthermore, the white spots (artefacts, shown with dashed yellow boxes) appearing around the venous sinuses also degrade their visibility and hamper the analysis. The tangential-resolution-improved images of the in vivo rat brain using the TARES network are shown in Figs.

One of the common problems encountered in circular scan PAT is the degradation of tangential resolution with increasing aperture size of the detector. Although numerous techniques have been proposed to overcome this problem, they are hindered by high cost and the need for customized detectors. To overcome this problem, we proposed an FD U-Net based CNN architecture with 9 residual blocks, termed the TARES network. The use of residual blocks in the network improves its denoising performance along with the quality of the resultant images. Furthermore, the dense connectivity in the network increases the depth of the network without increasing the number of contracting and expanding layers. Simulated numerical phantoms were used to train the network, and the performance of the network was verified with both simulated and experimental PAT imaging.
The results of the TARES network on the simulated and experimental imaging data consistently show that the network is highly efficient in improving the tangential resolution and is independent of the PAT imaging system used. However, the approach of employing deep learning to improve the tangential resolution also has certain limitations. First, a large training dataset is required to optimize the network for the application; this can be overcome by using simulated phantoms akin to the intended application. Indeed, developing the simulated dataset to be close to the intended application enhances the performance of the deep learning network, as exemplified by the TARES network's performance on the experimental phantom and in vivo PAT images. Instead of a simulated training dataset, it is possible to generate the training dataset from experimental images alone by using the conventional reconstruction for the input PAT images and tangential-resolution-improved PAT images as the ground truth. However, in such instances the performance of the deep learning network is constrained by the performance of the algorithm used to improve the tangential resolution of the PAT images. Another limitation is the time taken to train the network: increasing the training data size increases the training time. One way to overcome this problem is to distribute the training load over multiple GPUs. The training time also depends on the specifications of the employed GPU, such as the number of CUDA cores, clock speed, bus, and memory bandwidth [50], as well as on the number of epochs and the batch size used for training. Using deep learning, several drawbacks of photoacoustic imaging can be addressed in one go, the only prerequisite being an appropriate dataset. We are happy to share the source code of the TARES network; it can be found in the data availability section of this paper.

Conclusion
In this work, a CNN based architecture termed TARES was implemented to improve the tangential resolution in PAT images. The network was trained on simulated phantoms, and its performance was verified on both simulated and experimental datasets. Our results substantiate that the proposed network improves the tangential resolution eightfold compared with the conventional delay-and-sum. In addition, the proposed network improves the quality of the PAT images. Therefore, the proposed TARES network can improve the tangential resolution of PAT images without the use of an external negative acoustic lens on the transducer surface or a modified delay-and-sum beamformer.