Convolutional neural network for resolution enhancement and noise reduction in acoustic resolution photoacoustic microscopy

: In acoustic resolution photoacoustic microscopy (AR-PAM), a high numerical aperture focused ultrasound transducer (UST) is used for deep tissue high resolution photoacoustic imaging. There is a signiﬁcant degradation of lateral resolution in the out-of-focus region. Improvement in out-of-focus resolution without degrading the image quality remains a challenge. In this work, we propose a deep learning-based method to improve the resolution of AR-PAM images, especially at the out of focus plane. A modiﬁed fully dense U-Net based architecture was trained on simulated AR-PAM images. Applying the trained model on experimental images showed that the variation in resolution is ∼ 10% across the entire imaging depth ( ∼ 4 mm) in the deep learning-based method, compared to ∼ 180% variation in the original PAM images. Performance of the trained network on in vivo rat vasculature imaging further validated that noise-free, high resolution images can be obtained using this method.


Introduction
Photoacoustic imaging (PAI) is an emerging imaging modality which has grown exponentially in the last two decades due to its advantages like good resolution, high contrast, label-free imaging, deep imaging depth, and the ability to perform multiscale structural and functional imaging [1][2][3][4][5]. In PAI, a pulsed laser is used to irradiate the sample. Absorption of laser pulses by the sample leads to transient temperature rise, resulting in generation of acoustic waves [known as photoacoustic (PA) waves]. In photoacoustic microscopy (PAM) focused optical illumination and/or focused ultrasound detection is used to obtain high resolution images at depth beyond optical mean free path. PAM has been successfully used for various clinical and pre-clinical applications [6][7][8].
Depending on the type of foci (optical or acoustic) used for imaging, PAM can be divided in two categories -optical [9] and acoustic-resolution [10] PAM. Optical resolution PAM (OR-PAM) utilizes optical focusing (diffraction limited resolution can be as high as ∼wavelength/2) but imaging depth is limited to ballistic regime (∼1-2 mm). Imaging deeper than the optical diffusion limit can be achieved by using acoustic resolution PAM (AR-PAM), which employs a weakly focused (or unfocused) light illuminating the sample, and a tightly focused ultrasound transducer (UST) acquiring the generated PA signals [11]. The lateral resolution of AR-PAM system is determined by the properties of UST and is given by LR AR−PAM = 0.71λ a NA a = 0.71v NA a ×f c , where, λ a is the center wavelength of the UST, NA a is the acoustic numerical aperture, ν is the speed of sound, and f c is the central frequency of the UST [6]. Lateral resolution can be improved by increasing the central frequency and the NA of the UST. However, high frequency ultrasound waves attenuate more in biological tissue and therefore, decrease the imaging depth [12]. Increasing the NA of the UST results in improved lateral resolution, but it reduces the focal zone [13]. This leads to degradation of lateral resolution outside the focal plane.
Synthetic focusing aperture technique (SAFT) using the concept of virtual detector (VD) has been introduced and modified to improve the out-of-focus resolution in AR-PAM [14][15][16][17][18]. Although these algorithms improve the out-of-focus resolution, they generally suffer from limitations like lower signal to noise ratio (SNR), high post-processing time, or dependence of improvement on orientation of the sample. Here we propose a deep learning (DL) based approach to improve primarily the lateral resolution of out-of-focus AR-PAM images, with a corresponding increase in PA signal strength and decrease in background noise. Advancement in computing techniques has led to rapid growth of artificial intelligence (AI). Among different AI techniques, DL has particularly become popular due to its ability to get trained and learn from pre-defined dataset [19,20]. DL uses multilayered artificial neural networks to enable supervised learning. Computer vision, the branch of DL associated with image analysis and classification, commonly employs convolutional neural network (CNN) for this purpose. CNNs use a convolution layer with a nonlinear operator to extract essential features of an image by learning weights and filters [21,22]. Last few years saw a significant growth in application of DL for image classification and processing. Deep learning methods have also been used for reconstruction and enhancement in photoacoustic imaging [23][24][25][26][27][28].
A simple deep neural network was trained to increase the bandwidth of detected PA signal and improve the quantitative accuracy of acquired PA images [29]. Following this, CNNs were used on PA images for reconstructing from pre-beamformed data [30], and from undersampled data [26,31,32], correcting motion artefacts [33], removing reflection artefacts [34], and deblurring [35]. Among different architectures, there has been an increased popularity of U-Net based-architectures due to their advantages of being fast and providing promising results on segmentation and reconstruction of medical images even when trained with limited dataset [36]. Reconstruction of PA images from sparse data has been demonstrated by following filtered backpropagation algorithm with a trained U-Net architecture [37]. This network was further expanded by including a dynamic aperture length correction algorithm to decrease under-sampling artifacts and blurring of PA images [38]. Following this, fully dense U-Net (FD-U-Net) was developed and demonstrated to be superior in performance for removing artifacts and denoising PA images [39]. Modified FD-U-Net has also been shown to be computationally efficient for reconstructing and removing artifacts of PA images from sparse and undersampled data in various types of PA systems [26,31,40]. Recently, application of U-Net combined with a fully connected network for vascular segmentation of PA images was also shown [41].
In this work, we propose a U-Net based CNN architecture to improve out-of-focus lateral resolution of AR-PAM images. So far in the literature no one has applied a U-Net based architecture for improving resolution of ARPAM images. A fully dense U-Net (FD-U-Net) [26,31] was trained with simulated (using k-wave toolbox of MATLAB [42]) AR-PAM images. The performance of the network was tested on phantom as well as in vivo rat vasculature imaging. The CNN-based images showed negligible change in resolution across the entire imaging depth. Moreover, compared to original PA images, up to 250% increase in signal strength and 100% decrease in background noise in out-of-focus regions was observed in CNN-based images.

Network architecture
U-Nets generally consist of a contracting-, a bridge-, and an expanding-path and produce an image the same size as the input [36]. For each layer in the contracting path, the image size is decreased by a factor of 2, while the number of channels is doubled. Reverse is done in the expanding path to symmetrically modify the spatial dimensions of the output. Loss of important features during contracting is prevented by adding skip connections between contracting and expanding paths. The network therefore learns relevant features at various spatial scales, and simultaneously reduces the artefacts. Although a basic U-Net model can effectively perform medical image segmentation, a denser network is preferred for higher level image processing. Addition of a fully dense (FD) block in each layer makes the network dense thus allowing it to learn additional feature maps. By increasing the number of trainable parameters from ∼8 to ∼18 million, this FD-U-Net permits the network to learn more information. Dense connectivity also alleviates learning of redundant features and improves the flow of information. Therefore, a fully dense U-Net (FD-U-Net) similar to the one proposed in [31,39] was trained for this work. Figure 1 shows the network architecture used in our model. Briefly, the proposed CNN architecture consisted of 5 layers each in contracting and expanding paths and 1 bridge block. 1×1 convolution was present in each layer to learn the features. Downsampling and upsampling were attained via 3×3 convolution and deconvolution, respectively, with a stride of 2. Each layer had an FD block designed to double the number of channels by first learning f /k feature maps (for f initial channels) via convolution, and then concatenating them with the input. Convolution of the concatenated output again produced an output with f /k channels. Repeating the process k times yielded 2f channels. For this network, k was equal to 4. Subset (yellow dashed box) in Fig. 1 shows different layers of FD block. Convolutions and deconvolutions were followed by batch normalization [43] and rectified linear unit (ReLU) activation [44]. 70% dropout was added in the FD layer for regularization of the network [45]. Contracting and expanding layers were joined using skip connections which also enhanced network training by decreasing the probability of vanishing gradient. Following the expanding path, a 1×1 convolution yielded an output of the same size as the initial image. Adding this output to the initial input image and performing ReLU activation resulted in the final image. This was done to facilitate residual learning i.e., the network learned R where ReLU(R + X) = Y. It was easier for the network to learn R rather than Y, hence enabling faster training of the network [39]. Compared to the previously developed FD-U-Net, 3 modifications were made in this network. (1) Addition of layers at the beginning and end of the network allowed the number of pixels to be 256 × 256, which thereby helped in maintaining the pixel size to be smaller than system resolution; (2) presence of dropout layer aided in regularization of network, thus preventing overfitting; (3) ReLU activation in the end enabled faster training, as the ground truth images (Y) contained non-negative values only.
During training, 10% of the images were randomly selected as test dataset. The remaining images were split in a ratio of 8:2 for training and validation, respectively. The model was trained keeping a batch size of 8, Adam optimizer with its default values (learning rate = 0.001, beta_1 = 0.9, beta_2 = 0.999, epsilon = 10 −07 ) as optimizer, Glorat normal (Xavier normal) weight initializer, and mean squared loss as the loss function. Over-fitting of data was prevented by applying early stopping callback. Loss on validation set was monitored and the training was stopped once the loss stopped improving. The training was implemented on Python (version 3.7.6) using Tensorflow (2.3.0) framework [46] and Keras library running on GPU (Nvidia V100-PCIE-16GB) using the nodes of Gekko cluster, high performance computing center (HPC), Nanyang Technological University, Singapore.

Image acquisition in AR-PAM
In a typical AR-PAM system, a tightly focused UST detects the PA signal generated due to irradiation of sample by a loosely focused optical beam (dark field illumination). Irradiation of the sample by a single pulse results in generation of an A-line (one-dimensional PA signal along the depth direction, say along z-axis). Multiple A-lines were acquired by sweeping the confocal alignment of the optical beam and UST along one axis parallel to the object (say y-axis). 2-dimensional absorption map (z-by-y axis), known as a B-scan, was formed by placing the A-lines next to each other. Multiple B-scans were collected by raster scanning method, (i.e., moving the UST in steps along the x-axis) to form a 3-dimensional PA image. After acquisition of A-lines, 3D volume rendering or 2D projection images [i.e., maximum amplitude projection (MAP)] were used to visualize the PA images.

Curation of synthetic dataset
For supervised learning, a large dataset (acquired images and their ground truth) is required for training. Here, synthetic dataset was generated by simulating B-scans using k-wave toolbox in MATLAB [42]. Figure 2 shows the geometry used for simulations. A 16 × 16 mm grid with a pixel size of 6 µm was used. An arc-shaped 6 mm sensor having a radius of curvature of 4.6 mm and a focal length of 10 mm, moved parallel to the y-axis to collect A-lines. The sensor had a 50 MHz center frequency and 70% bandwidth and was designed to mimic the UST of the AR-PAM system used for experimental validation [47]. The UST traversed 10 mm with a step size of 36 µm collecting one A-line (time step = 8 ns) at each position. B-scan was formed by stacking these A-lines next to each other and taking the absolute values of their Hilbert transformation. Sources were formed by randomly placing multiple objects at a distance of 5 mm above and below the focal plane (shaded region in Fig. 2) of the UST. Four different geometries (points, circles, discs, and lines) of sources of sizes between 6 and 180 µm were used to mimic cross-section of different biological samples. Initial pressure rise of the sources was varied, and noise was added in some images such that the SNR was 30, 20, or 10 dB. For all the simulations, the medium was assumed to be homogeneous with 1500 m/s sound speed. All simulations were performed on GPU (Nvidia V100-PCIE-16GB) using nodes of Gekko cluster, HPC, Nanyang Technological University, Singapore.
The ground truth images of source position and their corresponding B-scans were both downsampled such that the final pixel size in both directions was 36 µm. Following this, the source image was normalized, and the generated B-scans were scaled accordingly. Some source images along with their generated B-scans were further decreased in intensity by randomly multiplying with a factor ξ (0.4<ξ<1). Range of ξ was chosen to include weaker PA vasculature signal while removing background noise. This was done to bridge the gap in PA signal amplitude between simulated and experimental PA images. Using different combinations of type, number, position, initial pressure rises of sources, and different noise levels, 2095 B-scans were generated for training (1508), validation (377), and testing (210). These processed B-scans were considered as the input images (X) while the source images were considered the desired output or the ground truth (Y) during training of the neural network. Figure 3 shows examples of input B-scans and their corresponding ground truth images. Sources of different shapes [e.g., discs ( Fig. 3(b)), points ( Fig. 3(d)), circles ( Fig. 3(f)), and lines ( Fig. 3(h))], size, and initial pressure rise were randomly placed to generate B-scans [Figs. 3(a,c,e,g), respectively]. Background noise was also added in some images as shown in Fig. 3(c), where SNR was 10 dB. The simulated B-scans were all used as test image, while the images of the source were their respective ground truth images.

Evaluation parameters
During model training, its performance was monitored on simulated data before testing it on experimental PA images. Mean absolute error (MAE) and mean squared error (MSE) for both the training and the validation dataset were the regression metrics calculated during training. MAE computes the average of the absolute values of difference while MSE computes the average of the squared difference between the ground truth and the CNN-based image i.e., where, n is the total number of images, Y true is the ground truth image, and Y out is the model output (CNN-based) image. A decreasing MAE ensures that the difference between the CNN-based and the ground truth is decreasing. MSE gives more weightage to larger errors, thereby decreasing the outliers in the images. By giving higher error for larger mistakes, MSE suppresses large variations between output and ground truth, and hence it was the loss function during training.
Post training, MAE, MSE, structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR) between the network output and ground truth images of test dataset were calculated. SSIM and PSNR were calculated using default TensorFlow equations. SSIM computes structural similarity between two images based on their luminance, contrast, and structure [48]. PSNR computes the peak signal-to-noise ratio between two images such that PSNR = 10log 10 where, L is the number of maximum possible intensity levels in an image and MSE is the mean squared error. Higher values of SSIM and PSNR denote higher resemblance between the output and ground truth images. Multiple models containing different layers and hyperparameters were evaluated. The model which delivered best performance on these parameters on test dataset was chosen as the appropriate model and was tested on the experimental PA images.

Experimental AR-PAM imaging system
Performance of the trained model was analyzed on experimental AR-PAM images. The imaging system is a dark field AR-PAM system [49]. It uses a diode-pumped solid-state Nd-YAG laser (INNOSLAB, Edgewave, Wurselen, Germany) combined with a dye laser (Credo-DYE-N, Sirah dye laser, Spectra Physics, Santa Clara, CA, USA) to generate 570 nm laser pulse at a pulse repetition rate of 5 kHz. After passing the laser beam through a variable neutral density filter (NDC-50C-4M, Thorlabs), it was focused on a multimode fiber (M29L01, Thorlabs) using an objective (M-10X, Newport, Irvine, CA, USA). The output beam from the multimode fiber was collimated (LA1951, Thorlabs), and then converted to a ring-shaped beam by passing through a conical lens (1-APX-2-B254, Altechna, Vilnius, Lithuania). A confocally aligned UST (V214-BB-RM, Olympus-NDT, Waltham, MA, USA), having a central frequency of 50 MHz and a 70% bandwidth, housed at the center of the optical condenser acquired the PA signal. An acoustic lens (LC4573, Thorlabs) with radius of curvature of 4.6 mm, 6 mm diameter, and 10 mm focal length was attached to the UST to provide an acoustic focal spot of ∼46 µm. This scanning set-up was mounted on a 3-axis motorized stage by a 3-axis controller. A water tank placed right below the scanning head housed the phantoms used in the experiments. A 10 cm imaging window in the tank sealed with polyethylene membrane provided passage for transmission of ultrasound waves from under the tank to the UST, during in vivo imaging. PA signals acquired by the UST was amplified by two mini-circuit amplifiers (ZFL-500LN, Mini Circuits) having a gain of 24 dB each, before being stored by a data acquisition (DAQ) card (M4i.4420, Spectrum, Germany) installed in a desktop. Data was acquired at a sampling rate of 250 Megasamples/second by performing two-dimensional raster scanning of imaging head using LabView as an interface. During raster scanning, step size between two A-lines was 3 µm. Following data acquisition, absolute value of Hilbert transformation of stacked A-lines was taken to generate B-scans. These were resized, cropped, and zero padded to make their dimensions same as that of training dataset. The trained CNN model was applied on the resultant B-scans to generate CNN-based PA images.
Model performance was analyzed by acquiring MAP of the original and CNN-based B-scans and smoothening them by applying maximum and Gaussian filters using SciPy (Python).

Phantom preparation
To study the effects of training model on samples present at various distances from the focal plane, phantom experiments were conducted. Seven horse hairs (diameter ∼80 µm) were placed horizontally in the water tank at different depths for imaging. Another phantom was prepared by embedding six such horse hairs in agar phantom. The phantoms were kept in the water tank directly below the scanning head. B-scan image containing all the hairs was obtained by moving the translation stage horizontally, perpendicular to the direction of the hair samples. Multiple B-scans were collected by raster scanning of the stage to obtain 3-dimensional and MAP PA images.

In vivo rat vasculature PA imaging
All the animal experiments were conducted in accordance to the guidelines and regulations approved by the Institutional Animal Care and Use Committee (IACUC) of Nanyang Technological University, Singapore (Animal Protocol Number ARF-SBS/NIEA0331). A 75 g adult female Sprague Dawley rat was anaesthetized by intraperitoneal injection of 0.15 mL cocktail containing ketamine and xylazine of dosages 85 and 15 mg/kg, respectively and maintained in anaesthesia by providing vaporized isolfluorane (1.2 L/min oxygen and 0.75% isoflurane, Medical Plus Pte Ltd, Singapore) via a custom-made nose cone. A pulse oximeter clipped to its forearm measured its vitals. Vasculature of the region surrounding the sentinel lymph node was imaged. Prior to imaging, a commercially available hair removal cream was applied to deplete the hair from the imaging region. The rat was placed on its side directly below the polyethylene membrane of the water tank. Transparent ultrasound gel between the imaging window and the rat prevented any air gaps. To evaluate network performance for both, two sets of PA images of the same area were collected. Initially, the rat was placed such that most blood vessels were in the focal plane. Following image acquisition, distance between the UST and the rat was increased, thereby moving the blood vessels outside the focal plane of the UST. An overdose of pentobarbital was injected in the rat to euthanize it after image acquisition.

Performance of the network on simulated AR-PAM images
The performance of the trained FD-U-Net model was evaluated on test data set to confirm that the model had been trained. Figure 4 shows (a) a simulated B-scan, (b) the ground truth, and (c) the CNN-based image when (a) was fed into the model. This simulated B-scan was similar to the B-scans used for our phantom experiments and was not a part of training dataset. The similarity between the ground truth and the CNN-based images validated that the model had been trained. CNN-based and ground truth images of the simulated data were also quantitatively analyzed to evaluate model performance. Figures 4(d) and (e) show the variation in MAE and MSE, respectively, for different epochs during model training. As shown, errors for both the training and the validation dataset decreased and converged with an increase in number of epochs. Training of the model was stopped after saturation of MSE loss of the validation set, which happened after 147 epochs and took ∼75 minutes for the given model. Table 1 shows the evaluation parameters obtained when original and CNN-based images of the test dataset were compared with ground truth images. On the test dataset, MAE and MSE were 0.0171 and 0.7009, respectively. Such low values of MAE and MSE ascertained that model had been trained. Further assessment was done by computing SSIM and PSNR. Average SSIM between ground truth and CNN-based image of the test data was 99.88% compared to just 38.44% between the simulated B-scans and the ground truth images. A corresponding improvement in PSNR from 29.19 between the simulated and the ground truth image to 52.8 between the CNN-based and the ground truth images was also observed. Prior to testing on FD-U-Net, the images were also trained on a simple U-Net based architecture. Although this network was able to produce a binary mask determining whether signal is present or not, it did not perform well in inferring the amplitude of PA signal. In U-Net, the validation loss saturated when MSE was greater than 300. Therefore, the model was modified to include fully dense layers to enable better training. To further evaluate effects of different layers, max pooling was used for downsampling in the contracting layer, instead of convolution with stride 2. Although the resultant model exhibited training in the similar manner as was depicted by our FD-U-Net model, the final MSE, SSIM, and PSNR values on testing set were slightly worse (0.91, 99.8%, and 50.62, respectively), therefore, convolution layer with stride 2 was preferred for downsampling. Hence, multiple layers and architectures were optimized, trained, and tested on simulated data, before being examined on experimental B-scans. Figure 5 shows the AR-PAM images of different horse-hair phantoms. Initially the hairs were placed directly in water such that the leftmost hair was deepest, and the rightmost hair was shallowest. The middle (4 th ) hair was in the focal plane. In the original MAP image [ Fig. 5(a)], both the resolution and the SNR degraded on moving away from the focal plane i.e., the fourth hair was captured very sharp while the others were comparatively blurred, and had a weaker PA signal, even though they all had similar geometries and absorbance originally. MAP image obtained from CNN-based B-scans [ Fig. 5(b)] validate the improvement in resolution. B-scans along the dotted line in Figs. 5(a,b) are shown in Figs. 5(c,d), respectively. Figures 5(e(1-3)), (f (1-3)) show the zoomed in images of the dashed boxes in Figs. 5(c,d), respectively. The images showed an improvement in resolution even when the hair is at the focal plane. This may be attributed to the perfect ground truth images used during the model training. The B-scans further substantiate the degradation of resolution on moving outside the focal plane of original images, and the improvement in resolution after FD-U-Net network was used on the images. Figures 5(g,h) show the original and CNN-based image of phantom prepared by embedding horsehairs in agar phantom. During this experiment, the focal plane was between the second and the third hair from the left, and the depth of hair increased and decreased on moving left and right, respectively. Similar trend as above was observed in original and CNN-based PAM images.  Figures 5(i,j) show the resolution and PA signal and noise at different depths, respectively, for both original and CNN-based images. Resolution degraded by ∼180% on moving ∼2.2 mm (between first and fourth point) out of focus in original PA image [ Fig. 5(c)], while a variation of only ∼10% in resolution was present in different hairs in CNN-based image [ Fig. 5(d)]. It should be noted that even though in the B-scans it appeared that the resolution decreased in the focal plane, the line spread function does not show that. This may be due to weak PA signal surrounding the target, which although appears large in the given colormap, does not impact the resolution. An improvement in axial resolution was also observed in B-scan images. However, since expected axial resolution of the system (∼33 µm) was less than the pixel size, axial resolution was not characterized here. Figure 5(j) shows decrease in signal strength in both the original and the CNN-based images (Figs. 5(g,h)) on moving out of focus, however, the decrease was just ∼19% in CNN-based images compared to ∼75% in original images. Up to 250% increase in signal strength (in right-most hair) was observed in CNN-based images. Additionally, a 100% decrease in background noise increased the SNR further. It is to be noted that background noise was absent (i.e., = 0) in the CNN-based images, hence the SNR could not be computed.

Performance of the network on experimental AR-PAM images
Comparison of original and CNN-based images showed that such model can improve outof-focus resolution of PA images while simultaneously enhancing the signal strength and decreasing the background noise. This is better than existing VD-based algorithms where usually improvement in resolution is accompanied by signal degradation or noise addition.  Figs. 6(a,c)) did not change prominently between the original and the CNN-based images. However, resolution of these vessels deteriorated on moving out-of-focus (blue arrows in Fig. 6(b)). Improvement in resolution was observed on applying the DL model on these out-of-focus images (blue arrows in Fig. 6(d)). Moreover, the signal strength of some vessels of out-of-focus region also improved in the CNN-based images, as shown by green arrows in Figs. 6(b,d). Hence, applying the DL model led to a significant improvement in resolution and signal strength along with a substantial decrease in the background signal.

Performance of the network on in vivo vasculature images
B-scans along the dotted lines in Figs. 6(a-d) are shown in Figs. 6(e-h), respectively. B-scans confirmed the decrease in background signal in CNN-based images. After applying DL model, most of the blood vessel signal (as shown by yellow arrows in the B-scans) was resolved as required. Further, some of the skin signal (shown by green dashed boxes in Figs. 6(e,f)) present in B-scan images also got eliminated. However, due to elimination of background signal, some of the weaker blood vessel signal also decreased, as pointed out by pink arrows in Figs. 6(a,c). Including the light illumination pattern during B-scan simulation may help in preventing this, as discussed later. Comparison between the original and CNN-based images validated that deep learning can enhance in vivo AR-PAM images by improving the resolution outside the focal plane and decreasing the background noise. Furthermore, weak skin signal can be added as an image source during B-scan generation to accurately train the model to differentiate between unwanted skin signal and important vascular signal. Including more shapes and types of objects in the training data may further make the model more versatile and improve its performance.
Testing the model on experimental images proved the network's ability to efficiently enhance the resolution, improve the signal strength, and decrease the background noise of out-of-focus AR-PAM images. These results demonstrate that DL is a promising tool for post-processing AR-PAM images. However, one major limitation of employing DL is in obtaining dataset for training. By simulating AR-PAM images, we waived the need of ground truth of experimental data (which are difficult to obtain, especially for in vivo experiments). However, simulating the AR-PAM images was time consuming and required computational resources. This limits the way step size or pixel size were chosen. Generation of one simulated B-scan in k-wave took ∼2 hours using GPU of the HPC. Decreasing step size would have resulted in a corresponding increase in simulation time. Here, same pixel size was maintained via downsampling which led to degradation of lateral resolution from ∼50 µm to ∼80 µm in focal plane. Decrease in pixel size may help in precise characterization of both axial and lateral resolution. This can be attained by decreasing the step size of UST during simulations and increasing the input and output image size during training (i.e., making input/output image size 512 × 512 instead of 256 × 256). Moreover, increasing the number of training dataset may have improved the performance further, however, that would have led to further increase in simulation time. Additionally, all the simulations were done in 2D in this work. Doing a 3D simulation for the training may improve the efficacy of these types of algorithm even further, but with a heavy computational overhead. In the CNN-based MAP images, some places had a discontinuity in vasculature structure which was not present in the original MAP images. Such artifacts may get resolved by training the network on volumetric images containing both sources and UST in 3D. It may also be noted that although simulation and training was done only on B-scans, an improvement in resolution was observed even in perpendicular direction. This may have been due to removal of weak PA signal which improves the image resolution. Nevertheless, further optimization, simulation, and training, specially using 3D models needs to be done to homogeneously improve resolution in all directions.
Another limitation of this work was the absence of light illumination pattern during simulations. Although application of DL improved the signal strength in out-of-focus regions, there is still room for further enhancement. In AR-PAM, decrease in PA signal in moving away from focus is due to the combined effect of decrease in ultrasound detection sensitivity and optical fluence decrease. In this deep learning model, only the ultrasound signal part was accounted for, hence a decrease in signal strength on moving outside the focal plane was expected due to the light fluence variation at different depth in the medium. To compensate for the fluence variation, it is necessary to incorporate light illumination patterns during the simulation. Furthermore, a decrease in signal strength was also observed for some weak but focused vasculature signal. Further optimization of range of values of initial pressure rise during simulation needs to be done so that the network can be precisely trained to differentiate between weak vasculature signal and noise. Performing simulations including light pattern (dark field illumination) and optimizing the range of initial pressure rises will be a part of our future work. Hence, obtaining simulated B-scans containing illumination pattern, sources of various shapes and sizes, and larger number of A-lines can help in further image refinement. We believe that by modifying the simulated images and algorithms, DL has the potential to simultaneously enhance the images further by increasing the contrast to noise ratio, decreasing reflection artefacts, and removing skin signal. We are happy to share the source codes and our training dataset. Details can be found at the Data availability section of this paper.

Conclusion
In this work, a fully dense U-Net model was trained on simulated AR-PAM images to improve the image resolution outside the focal plane. The trained model was tested on experimental phantom images as well as in vivo rat vasculature PA images. The original phantom images showed ∼180% degradation in resolution on moving outside the focal plane. Application of DL nullified the degradation and provided almost constant resolution across the entire depth of imaging (∼4 mm). The in vivo experiments further validated the improvement confirming that the model can perform well on vasculature imaging. Moreover, 100% decrease in background noise and up to 250% increase in PA signal were observed in experimental images after application of DL model.