PA-NeRF, a neural radiance field model for 3D photoacoustic tomography reconstruction from limited Bscan data

We introduce a novel deep-learning-based photoacoustic tomography method called Photoacoustic Tomography Neural Radiance Field (PA-NeRF) for reconstructing 3D volumetric PAT images from limited 2D Bscan data. In conventional 3D volumetric imaging, a 3D reconstruction requires transducer element data obtained from all directions. Our model employs a NeRF-based PAT 3D reconstruction method, which learns the relationship between transducer element positions and the corresponding 3D imaging. Compared with convolution-based deep-learning models, such as Unet and TransUnet, PA-NeRF does not learn the interpolation process but rather gains insight from 3D photoacoustic imaging principles. Additionally, we introduce a forward loss that improves the reconstruction quality. Both simulation and phantom studies validate the performance of PA-NeRF. Further, we apply the PA-NeRF model to clinical examples to demonstrate its feasibility. To the best of our knowledge, PA-NeRF is the first method in photoacoustic tomography to successfully reconstruct a 3D volume from sparse Bscan data.


Introduction
Photoacoustic tomography (PAT) is a valuable functional imaging technique that offers optical contrast and ultrasound (US) spatial resolution [1]. PAT has many advantages over traditional imaging modalities such as X-ray, US, CT, and MRI because of its ability to image functional parameters, such as blood oxygen saturation and perfusion [2]. It is particularly useful for imaging soft tissues, such as the breast, ovary, thyroid, and skin, in cancer detection and diagnosis [3].
Our group previously utilized a real-time co-registered PAT-US system with a curved transvaginal transducer to measure PA signals and diagnose ovarian cancer [4][5][6]. Research has shown that ovarian tissue morphology and function serve as important biomarkers in classifying malignancy [7][8][9]. However, 2D US and PAT images provide only a cross-sectional view of the lesion. Additionally, because they are limited by the imaging window of the transvaginal probe, US and PAT images can reveal only a portion of large ovarian lesions.
3D images of ovarian lesions can improve diagnosis by providing physicians with more complete morphological and functional information [10]. Several methods have been explored to reconstruct 3D volumes in PAT. Typically, either a 2D transducer is needed to collect 3D data [11][12][13][14][15], or a 1D transducer is scanned to perform 3D reconstruction [16][17][18]. However, these methods all require expensive US arrays or time-consuming data acquisition, and their limited imaging windows lose target information needed for clinical applications.
Deep learning has made significant advances in computer vision and biomedical imaging. Many deep learning methods have been explored for image reconstruction, image post-processing, image segmentation, and lesion classification [19][20][21][22]. In PAT, researchers are exploring various deep learning architectures to improve image resolution, reduce noise and image artifacts, achieve better reconstruction quality, reconstruct images from limited-view data, and recover images from sparsely sampled data [23][24][25][26][27].
A neural radiance field, commonly abbreviated as NeRF, is a groundbreaking technique in computer graphics and computer vision. Originally introduced in 2020 [28], this method has redefined the process of generating three-dimensional reconstructions from two-dimensional image data.
At its core, a NeRF synthesizes photorealistic 3D visualizations by predicting both the structural and appearance elements of a scene through a continuous 5D function (3D position plus 2D viewing direction) called a radiance field. Instead of relying on traditional 3D models or point clouds, NeRF leverages deep learning to predict how light interacts with surfaces at every point in space. This nuanced technique can capture intricate details, light reflections, and shadows, generating renderings that are almost indistinguishable from real-life scenes.
Beyond graphics and visual content generation, NeRF is making notable strides in other fields. In biomedical imaging, recent studies [29,30] have showcased the potential of NeRF in applications such as 3D CT imaging. The ability of NeRF to filter out noise, rectify artifacts, and improve resolution has made CT scans more interpretable. The increased clarity and contrast pave the way for precise diagnostic evaluations and comprehensive treatment planning. These benefits are particularly pronounced in 3D synthesis tasks that require meticulous attention to detail, such as tumor identification and organ boundary delineation [29][30][31][32].
Moreover, the versatility of NeRF extends to ultrasonography as well. In traditional US imaging, capturing high-resolution 3D visuals of biological structures remains a challenge. With NeRF, however, the generation of detailed 3D reconstructions has become more feasible. A notable contribution was made by Wysocki et al., who integrated physics-inspired rendering to make US image synthesis more realistic, emphasizing direction-dependent changes in the scene [32]. Another pivotal study, by Li et al., demonstrated the utility of NeRF in 3D US imaging of the spine. Their technique yielded superior-quality images that accentuated bone structures and minimized noise, ultimately leading to a more accurate assessment of spinal curves. Impressively, their approach outperformed traditional methods, reducing the error margins in spine curve estimations [30].
Inspired by NeRF in 3D rendering, we developed a deep learning model for PAT volumetric reconstruction from sparse Bscan data, called PA-NeRF. PA-NeRF differs from a typical NeRF model in that it reconstructs a 3D volume of optical absorption profiles based on both PAT images and US images. Additionally, a forward loss is implemented to improve the PA-NeRF model's performance. To demonstrate the superiority of the PA-NeRF model over other machine learning models, we conducted phantom experiments and applied the PA-NeRF model to patient data.

Methodology
We employed a curved array transducer to capture cross-sectional Bscan data of the target and applied the delay-and-sum technique to reconstruct 2D Bscan images. Since our PAT system was designed for in-vivo imaging of ovarian lesions, the most feasible way to scan the entire lesion area was to rotate the probe to acquire the necessary volumetric data, as depicted in Fig. 1.
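To make the delay-and-sum step concrete, it can be sketched in a few lines of numpy. This is a simplified illustration with an idealized element layout and assumed sampling parameters (c = 1540 m/s, fs = 40 MHz), not the exact beamformer of our system; a curved array changes only the element coordinates.

```python
import numpy as np

def delay_and_sum(rf, elem_x, pixel_x, pixel_z, c=1540.0, fs=40e6):
    """Delay-and-sum beamforming of photoacoustic channel data.

    rf       : (n_elements, n_samples) raw channel data
    elem_x   : (n_elements,) lateral element positions [m]
    pixel_x, pixel_z : lateral and depth pixel coordinates [m]
    Photoacoustic propagation is one-way, so the delay for a pixel is
    simply its distance to each element divided by the sound speed c.
    """
    n_elem, n_samp = rf.shape
    image = np.zeros((len(pixel_z), len(pixel_x)))
    for iz, z in enumerate(pixel_z):
        for ix, x in enumerate(pixel_x):
            dist = np.sqrt((elem_x - x) ** 2 + z ** 2)   # pixel-to-element
            idx = np.round(dist / c * fs).astype(int)    # sample index
            valid = idx < n_samp
            image[iz, ix] = rf[np.arange(n_elem)[valid], idx[valid]].sum()
    return image
```

Summing the appropriately delayed channel samples focuses each pixel on signals that originated at that location.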
In this study, we implemented five different 3D reconstruction methods for comparison. The first method used direct interpolation (DI) from Bscans collected at different angles inside the volume of interest. DI was computationally the most efficient method, and it produced reasonable reconstructions from dense angular sampling. The second method was Bscan-wise universal back projection (BBP), adapted from universal back projection (UBP) to better suit the imaging conditions of this study. Because BBP is an analytical method for 3D reconstruction, it produced more accurate results with sparser angular sampling than DI. The third method was learning-based: a UNet was trained to predict volumetric PAT directly from a series of Bscans from a small number of angles. The UNet proved particularly powerful at generating images under sparse sensing conditions. The fourth method, TransUnet, incorporates transformer blocks into the Unet model, combining the local features from the Unet with the global features from the transformers. Lastly, the best method was a NeRF-based machine learning model, which used a series of Bscans, specifically Bscan beams, collected at various angles to predict new Bscan beams at unscanned angles through a learned coordinate transform. NeRF could synthesize new views of the same target with a much simpler model than UNet and was more robust against overfitting or biased training data [28][29][30][31][32][33].

PA-NeRF
In this comprehensive study, we have leveraged the capabilities of the NeRF technique to enhance the reconstruction process of PAT. This enhanced model, termed PA-NeRF, offers a refined approach to PAT reconstruction by harnessing the power of deep learning. Traditionally, reconstructing a detailed PAT image from scattered scan lines has posed significant challenges, particularly when these lines originate from varied positions and orientations. However, by integrating the NeRF model, PA-NeRF is trained to handle these irregularities. By utilizing beamformed data from scan lines whose positions and orientations are already known, PA-NeRF learns the necessary coordinate transformations. This acquired knowledge empowers it to predict how beamformed data would appear if it were derived from scan lines in previously unscanned positions and orientations.
The PA-NeRF architecture is built on a supervised, fully connected neural network framework. Its input contains the positional information of each scan line, which includes the spatial coordinates (x, y, z) on the detection surface and the scan angle (ξ), which indicates the transducer rotation between Bscans, as depicted in Fig. 1. Additionally, an object identifier (α) is incorporated as a marker to denote the specific object under reconstruction. This identifier is essential for the model, as it is used to calculate the appropriate coordinate transformation for each object.
The network comprises three hidden layers, each with 256 neurons. For optimization, PA-NeRF employs the Adam optimizer, which adapts the learning rate during training.
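The layer shapes described above can be sketched as a plain numpy forward pass: five input coordinates (x, y, z, ξ, α), three hidden layers of 256 units, and two 128-dimensional output heads. This is a shape-level sketch only; the ReLU activation, the random placeholder weights, and the omitted Adam training loop are assumptions for illustration, not details taken from the model.

```python
import numpy as np

def init_mlp(in_dim=5, hidden=256, out_dim=128, seed=0):
    """Random placeholder weights: three hidden layers of 256 units and
    two 128-dim heads (absorption beamline and initial pressure beamline)."""
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, hidden]
    layer = lambda a, b: (rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
    return {"hidden": [layer(a, b) for a, b in zip(dims[:-1], dims[1:])],
            "head_mu": layer(hidden, out_dim),
            "head_p0": layer(hidden, out_dim)}

def forward(params, coords):
    """coords: (n_lines, 5) rows of (x, y, z, xi, alpha) per scan line."""
    h = coords
    for W, b in params["hidden"]:
        h = np.maximum(h @ W + b, 0.0)                         # ReLU hidden layer
    mu = h @ params["head_mu"][0] + params["head_mu"][1]       # absorption beamline
    p0 = h @ params["head_p0"][0] + params["head_p0"][1]       # initial pressure beamline
    return mu, p0
```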
In terms of output, PA-NeRF generates two distinct categories. The first is a 128-dimensional representation of the absorption coefficient beamline, and the second is a matching 128-dimensional representation of the initial pressure beamline. The model's primary function is to predict the beamformed data corresponding to the input scan line. In addition, it learns the absorption coefficient profile of the scan line, which is pivotal in accurately reconstructing PAT images because it captures subtle variations in the properties of the scanned tissue. Finally, the PA-NeRF model outputs a Bscan PAT image and a US image for each rotation angle.
To generate a holistic Bscan, the model takes in the positional data of every individual scan line that constitutes the Bscan. The outputs for each of these lines are then concatenated, ensuring alignment with their respective scan line positions. This procedure ensures the generation of a seamless and comprehensive Bscan. As an advanced capability, the PA-NeRF model can also produce volumetric reconstructions by making predictions for Bscans over a complete 360° rotation and subsequently combining these predictions. The Bscan images do not overlap; they are cross-sections through the center of the volume. Figure 2 depicts this process and its results.
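The assembly procedure above amounts to two stacking steps, sketched below; the function names are illustrative, and the inputs are assumed to arrive already ordered by scan-line position and keyed by rotation angle.

```python
import numpy as np

def assemble_bscan(beamlines):
    """Stack per-scan-line 128-sample predictions (ordered by scan-line
    position) into one Bscan image of shape (n_lines, 128)."""
    return np.stack(beamlines, axis=0)

def assemble_volume(bscans_by_angle):
    """Collect predicted Bscans over the sweep, sorted by rotation angle.
    Each Bscan is a non-overlapping cross-section through the volume's
    central axis at its angle."""
    angles = sorted(bscans_by_angle)
    return np.stack([bscans_by_angle[a] for a in angles]), np.asarray(angles)
```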
Compared to other supervised models, PA-NeRF exhibits greater generalizability, which is attributed to its fine-tuning process. When aiming to reconstruct a new object from a set of Bscans, we initially fine-tune the model using the available Bscans. Subsequently, by inputting all transducer positions, we can acquire Bscans from every orientation. This comprehensive collection of Bscans then facilitates the reconstruction of the 3D volume. Fine-tuning the PA-NeRF model involves several steps. Initially, the model is provided with a new object identifier (α) along with the known transducer positions and angles. The ground truth in this context comprises the available US and PAT images. The goal is to reconstruct the 3D volume. To achieve this, the model generates PAT images from a variety of rotation angles and transducer positions. In the final step, these individual slices are synthesized to form a comprehensive 3D volume.

In addition to the mean square error (MSE) loss for the absorption coefficient images and the Bscan data, a forward loss is implemented to improve the model's performance:
L_forward = γ ‖Γ µ_a ϕ − p̂_0‖²
Here, µ_a and µ̂_a are the absorption coefficients of the model output and the ground truth, γ is a constant coefficient of 0.1, Γ is the Grüneisen parameter, and ϕ represents the fluence map. In our PAT-US system, four optical fibers deliver light to the tissue; similarly, four point sources generate a homogeneous field in simulations. p_0 and p̂_0 are the beamformed initial pressure scanlines of the model output and the ground truth, respectively. From the principle of PAT (p_0 = Γ µ_a ϕ), this loss should be 0 if the images are reconstructed perfectly.
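A numpy sketch of this loss under one plausible reading of the forward term: the predicted absorption is rendered through the PAT forward relation p_0 = Γ µ_a ϕ and compared with the ground-truth pressure scanline. The exact form used in our implementation may differ; only γ = 0.1 is taken from the text.

```python
import numpy as np

def forward_loss(mu_pred, p0_true, phi, grueneisen=1.0, gamma=0.1):
    """Physics-consistency ('forward') loss, assumed form: render the
    predicted absorption through p0 = Gamma * mu_a * phi and compare
    with the ground-truth beamformed initial pressure scanline."""
    p0_rendered = grueneisen * mu_pred * phi
    return gamma * np.mean((p0_rendered - p0_true) ** 2)

def total_loss(mu_pred, mu_true, p0_pred, p0_true, phi):
    """MSE on both outputs plus the forward term; zero for a perfect
    reconstruction, since the ground truth satisfies p0 = Gamma*mu_a*phi."""
    mse = lambda a, b: np.mean((a - b) ** 2)
    return mse(mu_pred, mu_true) + mse(p0_pred, p0_true) \
        + forward_loss(mu_pred, p0_true, phi)
```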

Direct interpolation
Direct interpolation uses 3D interpolation to build the entire 3D volume by combining several cross-sectional Bscan images. In photoacoustic imaging, the absorption of each short laser pulse produces an initial pressure rise, p_0, within the tissue. p_0 is proportional to the Grüneisen parameter (Γ), the target absorption coefficient (µ_a), and the light fluence (ϕ(r)) [1][33][34][35]:
p_0(r) = Γ µ_a(r) ϕ(r)
To reconstruct the target's optical properties from photoacoustic signals, delay-and-sum beamforming is typically used, which takes into account the time of flight of the ultrasound signal and the location of its origin to create an image with good spatial resolution. Once several Bscan images have been reconstructed from multiple cross-sections, they are positioned in a 3D volume based on their coordinates, and the missing voxels between successive Bscan images are interpolated to generate the full 3D volume.
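The angular interpolation at the heart of DI can be sketched as below, assuming every Bscan contains the probe's rotation axis, the first Bscan is acquired at 0 degrees, and the angles are in ascending order; resampling the angularly dense stack onto Cartesian voxels is omitted.

```python
import numpy as np

def direct_interpolation(bscans, angles_deg, n_out=180):
    """Fill unscanned angles by linear interpolation between the two
    nearest measured Bscans.

    bscans     : (n_angles, H, W) cross-sections measured at angles_deg
    angles_deg : ascending angles in [0, 180), starting at 0
    Returns (n_out, H, W) cross-sections at uniformly spaced angles.
    """
    bscans = np.asarray(bscans, dtype=float)
    angles = np.asarray(angles_deg, dtype=float)
    out_angles = np.linspace(0.0, 180.0, n_out, endpoint=False)
    # after a 180-degree rotation the 0-degree cross-section reappears mirrored
    angles_ext = np.concatenate([angles, angles[:1] + 180.0])
    bscans_ext = np.concatenate([bscans, bscans[:1, :, ::-1]], axis=0)
    idx = np.searchsorted(angles_ext, out_angles, side="right")
    lo, hi = idx - 1, idx
    t = ((out_angles - angles_ext[lo]) /
         (angles_ext[hi] - angles_ext[lo]))[:, None, None]
    return (1.0 - t) * bscans_ext[lo] + t * bscans_ext[hi]
```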

Bscan-based back projection (BBP)
BBP is modified from the universal back projection (UBP) algorithm, an analytical reconstruction method for obtaining volumetric PAT results. UBP can be applied to different scanning geometries and has been used in various clinical applications [36][37][38][39]. Compared to other reconstruction methods, such as filtered back projection and iterative model-based reconstruction, UBP strikes a good balance between reconstruction quality, computational cost, and robustness to noise and artifacts. The algorithm projects the measurement from each detector element back into the entire imaged volume as a spherically propagating wave, following the equation below [40]:
p_0(r) = (1/Ω_0) ∫_S [2 p(r_0, t) − 2t ∂p(r_0, t)/∂t]_{t = |r − r_0|/c} dΩ_0, with dΩ_0 = (dS_0/|r − r_0|²) cos θ_0
where r_0 and r are the detector and voxel positions, respectively; θ_0 is the angle between the vector r − r_0 and the normal of the detection surface at r_0; Ω_0 is the solid angle subtended by the entire detection surface S; and the integral is over the entire detection surface.
To prepare for reconstruction, we discretized the volume into voxels with dimensions of 0.4 mm (x) × 0.4 mm (y) × 0.3 mm (z), where the axial dimension (z) was chosen to be on the same scale as, and slightly greater than, the ultrasonic pulse length. In a clinical setting, slight probe motion other than pure rotation between scans cannot be avoided, so the raw data from different Bscans cannot be summed coherently. For this reason, unlike in UBP, in BBP we back projected the data of each Bscan separately into the imaged volume and summed the resultant envelope data of the Bscans from all angles to obtain the final reconstruction, as shown in Fig. 3. We refer to this modified procedure as Bscan-wise universal back projection (BBP). In simulations with no motion artifacts between Bscans, we compared the BBP reconstruction with the Hilbert transform of the conventional UBP reconstruction and found comparable results.
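The Bscan-wise combination can be sketched as below: each Bscan is assumed to have been back projected into the volume already, and only the envelope volumes are summed. The envelope helper is an FFT-based Hilbert transform (equivalent in effect to scipy.signal.hilbert) applied along the last axis.

```python
import numpy as np

def envelope(x, axis=-1):
    """Envelope via the analytic signal (FFT-based Hilbert transform)."""
    n = x.shape[axis]
    h = np.zeros(n)
    if n % 2 == 0:
        h[0] = h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[0] = 1.0
        h[1:(n + 1) // 2] = 2.0
    shape = [1] * x.ndim
    shape[axis] = n
    X = np.fft.fft(x, axis=axis) * h.reshape(shape)
    return np.abs(np.fft.ifft(X, axis=axis))

def bbp_combine(backprojected_bscans):
    """Sum envelope volumes across angles instead of summing raw data
    coherently, making the result robust to probe motion between Bscans."""
    return sum(envelope(v) for v in backprojected_bscans)
```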


Convolutional neural network
For comparison, we implemented a 3D Unet model. It includes two parts: an encoder network and a decoder network. The encoder network employs convolutional layers to extract relevant features from the input image while downsampling it. The decoder network then uses upsampling layers to reconstruct the image, preserving high-resolution features through skip connections between the encoder and decoder networks.
To reconstruct the 3D PAT volume from Bscan images, the 3D CNN Unet model input was a sparse 3D matrix representing the volume of interest. The input Bscan images were inserted into this volume as 2D slices at their respective scan angles, while the space between Bscan images was left unfilled. This input design allowed for a flexible number of input Bscan images while keeping the model input size constant. In addition, it was compatible with the UNet architecture and convenient for the subsequent 3D rendering, because the volume of interest to be reconstructed was directly used as the model input. To mitigate the high computational cost associated with 3D convolution layers, we implemented three downsampling and upsampling layers in the network structure.
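Constructing that sparse input can be sketched as below, with nearest-voxel slice placement on an assumed cubic grid whose central (z) axis lies in every Bscan plane; the grid size is illustrative, not the size used in our model.

```python
import numpy as np

def make_sparse_input(bscans, angles_deg, size=64):
    """Build the UNet input: a mostly empty (size, size, size) volume with
    each Bscan inserted as a 2D slice at its rotation angle; voxels between
    Bscans stay zero.

    bscans : iterable of (size, size) images, rows = depth (z), cols = width
    """
    vol = np.zeros((size, size, size))
    c = (size - 1) / 2.0
    r = np.arange(size) - c                      # lateral samples through center
    for img, ang in zip(bscans, angles_deg):
        th = np.deg2rad(ang)
        xs = np.round(c + r * np.cos(th)).astype(int)
        ys = np.round(c + r * np.sin(th)).astype(int)
        vol[xs, ys, :] = img.T                   # place slice at its angle
    return vol
```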

TransUnet
In our implementation, we adapted the TransUnet model, originally proposed by Chen et al. [41], integrating the comprehensive feature-extraction capabilities of both the transformer and U-Net architectures. The model tokenizes image patches derived from CNN feature maps, which the transformer then processes to incorporate global context. These globally informative features are subsequently upsampled and combined with high-resolution CNN feature maps, improving localization precision. Our modification adjusted the input and output channels to align with those used in the 3D U-Net model, as shown in Fig. 4. This adaptation was aimed at extending the model to a broader range of reconstruction tasks, where TransUnet has already demonstrated its superiority over existing methods.

Real-time US-PAT system
All phantom and clinical data were acquired with an in-house, real-time co-registered PAT and US system [5]. The co-registered PAT system has three parts: 1) a tunable Ti:sapphire laser (700 nm to 900 nm) pumped by a Q-switched Nd:YAG laser (Symphotics TII, LS-2134), which generates 10 ns laser pulses at a 15 Hz repetition rate; 2) an optical fiber-based light delivery system [44]; and 3) a commercial US system (EC-12R, Alpinion Medical Systems, Republic of Korea) [45].
A phantom experiment validated four of the 3D PAT reconstruction methods (DI, BBP, 3D CNN Unet, and PA-NeRF). A transvaginal probe, held by a rotation stage, was placed on top of a water tank. The agar-based phantom was placed on the bottom of the tank, which was filled with 1% intralipid solution, as shown in Fig. 5.
Simulation data

To generate simulation data from digital phantoms, we used a three-step strategy. First, a volume with the shape of the target letter was defined in a homogeneous background, and the optical fluence inside each digital phantom was calculated using Monte Carlo simulation [44]. Second, we computed the initial pressure value after laser pulse excitation at each pixel of the phantom by multiplying the Grüneisen parameter at that pixel by its absorption coefficient and the optical fluence. Third, from this initial pressure map, the raw photoacoustic signal at each transducer element was generated using the k-Wave toolbox [45]. The transducer configuration matched that of our co-registered US and PAT system. We simulated five letters (W, A, S, H, and U), and for each letter we generated Bscans at 18 different angles by rotating the US probe. A total of 1350 (3 × 5 × 5 × 18) Bscan data were simulated, as shown in Table 1.

3D Phantom data
Two 3D letter phantoms were made from 4% agar. To achieve a reduced scattering coefficient of ~4 cm⁻¹, we added 0.4% intralipid, and to achieve an absorption coefficient of ~1 cm⁻¹, we added 0.04% v/v ink. After being refrigerated for 24 hours, the phantoms were placed in an intralipid solution with calibrated absorption and reduced scattering coefficients of 0.02 cm⁻¹ and 6 cm⁻¹, respectively, and their PAT/US signals were measured.


SSIM
To quantify the 3D reconstructions from the different methods, we used the structural similarity index measure (SSIM) to compare the reconstructions with the ground truth. The SSIM is calculated by Eq. (4):
SSIM(x, y) = l(x, y)^α · c(x, y)^β · s(x, y)^γ (4)
Here, x and y are respectively the reconstructed and ground truth PA Bscans, l is their luminance similarity, c is their contrast similarity, and s is their structural similarity. α, β, and γ are coefficients with values less than or equal to 1.
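A single-window (global) SSIM can be sketched as below, using the standard stabilizing constants C1 = (0.01·MAX)² and C2 = (0.03·MAX)²; a practical evaluation may instead use a windowed implementation that averages this quantity over local patches.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0, alpha=1.0, beta=1.0, gamma=1.0):
    """Single-window SSIM: luminance^alpha * contrast^beta * structure^gamma."""
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    C3 = C2 / 2.0
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    cov = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)   # luminance term
    c = (2 * sx * sy + C2) / (sx ** 2 + sy ** 2 + C2)   # contrast term
    s = (cov + C3) / (sx * sy + C3)                     # structure term
    return l ** alpha * c ** beta * s ** gamma
```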

PSNR
We also used the peak signal-to-noise ratio (PSNR) to quantify the quality of the reconstructed images against the ground truth. The calculation of the PSNR is shown in Eq. (5):
PSNR(x, y) = 10 log10(MAX² / MSE(x, y)) (5)
Here, x and y represent the reconstructed and ground truth images, and MAX is the maximum possible pixel value of the image (for gray-scale images, MAX is 255). MSE(x, y) is the mean square error of the two images.

RMSE
In addition to the SSIM metric, we employed the root mean squared error (RMSE) to gauge the accuracy of the 3D reconstructions. The RMSE is formulated as
RMSE(x, y) = sqrt((1/N) Σ_i (x_i − y_i)²)
where N is the number of voxels.
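Minimal numpy implementations of the PSNR and RMSE metrics:

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10*log10(MAX^2 / MSE).
    Undefined (infinite) when the images are identical."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def rmse(x, y):
    """Root mean squared error over all voxels."""
    return float(np.sqrt(np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)))
```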

Simulation results
Figure 6 shows the simulation of the letter 'U' and the corresponding 3D reconstruction results from each of the reconstruction methods. The visualization of the 3D volume is presented using four images: (1) a top-down projection image, which shows the maximum projection of the reconstructed volume from the top-down view; and (2-4) three different reconstructed cross-sectional images that show the details of each reconstruction. In this particular case, nine Bscans were used for rendering the 3D volume, with the transducer rotation angles ranging from 0 to 180 degrees with a step of 20 degrees. All the reconstruction images are normalized to their maximum amplitude. With nine Bscans, the direct interpolation method shows significant artifacts and distortions in the reconstructed volume, which are most obvious in the top-down projection image. The BBP method also shows some artifacts, but they are less prominent than with the DI method. The Unet and TransUnet generate similar output, with the TransUnet model reconstructing a smoother volume than the Unet. The PA-NeRF method also produces smooth and accurate reconstruction results, although there are slight differences between its cross-sections and the ground truth, and its top-down projection image is twisted due to the shift in each cross-section. Nine Bscans cover all directional cross-sections, so the Unet model could easily learn the interpolation process from the sparse data, whereas the PA-NeRF model instead learns the mapping from transducer element positions to the corresponding beamline data. Therefore, with nine Bscans, the simulation results look slightly better for the Unet than for PA-NeRF.
From the SSIM and PSNR metrics in Fig. 6b and 6c, and from the RMSE values in Table 2, the Unet and TransUnet models still outperform the other methods, followed by PA-NeRF. Overall, these results demonstrate the superiority of deep-learning-based methods in PAT image reconstruction, especially in scenarios with limited data.
Next, we used fewer Bscans, which provide less data for 3D reconstruction. As shown in Fig. 7, we used five Bscans, with rotation angles of 0, 40, 80, 120, and 160 degrees.
It is interesting to see that even with fewer Bscans, the Unet, TransUnet, and PA-NeRF methods can still achieve good reconstruction quality, while the direct interpolation and BBP methods fail to reconstruct the correct shape. This demonstrates the effectiveness of deep learning methods in handling the data insufficiency problem in PAT. Additionally, the results suggest that the Unet, TransUnet, and PA-NeRF methods are more robust to the data variability caused by the rotation angles of the transducer array. The higher PSNR values of the PA-NeRF method also indicate that the model better preserves the details of the reconstruction. Overall, the results demonstrate the potential of deep learning methods for 3D PAT reconstruction with reduced data acquisition.
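The SSIM, PSNR, and RMSE comparisons reported here follow their standard definitions; PSNR and RMSE reduce to a few lines on normalized volumes (SSIM is typically computed with a library routine such as `skimage.metrics.structural_similarity`). A sketch, assuming reconstruction and ground-truth volumes normalized to [0, 1]:

```python
import numpy as np

def rmse(recon, truth):
    """Root-mean-square error between reconstruction and ground truth."""
    diff = np.asarray(recon, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(recon, truth, data_range=1.0):
    """Peak signal-to-noise ratio in dB for volumes spanning `data_range`."""
    diff = np.asarray(recon, dtype=float) - np.asarray(truth, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical volumes
    return float(10.0 * np.log10(data_range ** 2 / mse))
```

Higher PSNR and SSIM, and lower RMSE, indicate a reconstruction closer to the ground-truth volume.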
Because the ultimate goal of this research is to recover volume from fewer Bscans, we tested the reconstruction results with only two Bscans, using 0 and 90 degree Bscan data.
Based on the results shown in Fig. 8, the direct interpolation method only extends the two available Bscans along their positions, resulting in a poor reconstruction. The BBP method can reconstruct some parts of the letter "U", but not the entire shape. With only two Bscans, the Unet and TransUnet models do not perform as well as with five or nine Bscans, because there is limited data for interpolation. On the other hand, the PA-NeRF model is still able to reconstruct the letter "U" shape, demonstrating its ability to learn the relationship between transducer element positions and the corresponding cross-sections, although the reconstruction quality is not as good as with five or nine Bscans, as indicated by the SSIM and PSNR values. Judging from the RMSE values in Table 2, the PA-NeRF model is much better than Unet and TransUnet. When there are enough Bscans to reconstruct the 3D volume, Unet, TransUnet, and PA-NeRF have similar performance, but with very limited Bscans, PA-NeRF still maintains a reasonable error where Unet falters. The PA-NeRF model does not interpolate between Bscans, but predicts Bscans from transducer positions.

Phantom results
To demonstrate the performance of the PA-NeRF model with real data, we conducted phantom studies using the letters U and S. Based on the simulation results, the Unet model and PA-NeRF model were selected to reconstruct the 3D volumes.
As shown in Fig. 9, the top-down projections of U and S have similar reconstructed shapes. However, the volume profile and top-down projection images show that the Unet and TransUnet models learn an interpolation to reconstruct the volume, so there are blank regions between two Bscans. PA-NeRF achieves a better volume reconstruction, as demonstrated by the SSIM values.
With only the 0 and 90 degree Bscans (Fig. 10), the Unet model barely reconstructs the shapes of the U and S, as it apparently interpolates between the given Bscans to form a 3D volume. The TransUnet reconstruction is similar: it cannot reconstruct the shape from only two Bscans. The PA-NeRF model instead learns the relationship between the transducer elements' positions and their corresponding Bscan beams, so it is more generalizable and better reconstructs the letter shapes.
As can be seen in Table 3, unlike in the RMSE results for the simulations, PA-NeRF performed better than the Unet and TransUnet models even with five Bscans, possibly because the Unet and TransUnet models are limited to the training data. In contrast, the PA-NeRF model is fine-tuned with the phantom data, and the learned relationship between the transducer elements and the corresponding beamlines is more generalizable than interpolation.

Clinical examples
For ovarian cancer diagnosis, imaging multiple views of lesions is impractically time-consuming. Physicians often take only longitudinal and transverse views, and our PA-NeRF model can reconstruct a 3D volume from the data of just two Bscans. For homogeneous cystic lesions, we assume that either one of the two Bscan images can represent both the longitudinal and transverse orientations. Under this assumption, although we captured PAT data from only one cross-section, this data can represent the lesion in two directions. Below, we describe the reconstruction of two patients' ovarian lesions. One lesion is reconstructed from longitudinal and transverse Bscan data, but to reconstruct the other, a homogeneous lesion, we used the data from only one Bscan.
The longitudinal view in Fig. 11 reveals two distinct cysts, and the transverse view (rotated 90 degrees) shows the two cysts merged into one large lesion. These details are successfully reconstructed in the PA-NeRF output. Even though the PA signals in the input images are sparse and appear in only one cyst area, the co-registered US and PAT images show dense PA signals in the shape of the lesion.
The second patient has a relatively homogeneous lesion, so the longitudinal and transverse scans are quite similar, and we can use just one longitudinal image to represent both orientations. The PA-NeRF model constructs a good volume profile: as shown in the co-registered US-PAT images, the central area is the cyst, with dense PA signals. Figure 12 presents a prediction that accurately captures the spherical shape of the target, consistent with the US images. It depicts both longitudinal and transverse views, revealing the inner signal of the dark area, which is likely indicative of the lesion. The PA-NeRF model effectively unveils the 3D volume shape and reconstructs the PA signal from a single homogeneous Bscan, offering valuable insights into the lesion's structural and functional characteristics.

Ablation study
In this section, we describe comprehensive ablation studies to identify the features essential to the effectiveness of our PA-NeRF model. Our primary configuration, formulated in Eq. (1), combines a mean squared error (MSE) loss with the forward loss as the primary loss function. We also experimented with substituting a total variation (TV) loss for the forward loss, as shown in Table 4. However, our findings indicate that the TV loss did not enhance the model's performance, suggesting that noise and artifacts are not significant hindrances in reconstructing the 3D volume. Moreover, we explored variations in the coefficient applied to the forward loss, testing values of 0 (effectively omitting the forward loss), 0.01, 0.1, and 1. Our results, presented in Table 5, demonstrate that a coefficient of 0.1 offers an optimal balance. This balance appears to be crucial to the model's success, highlighting the importance of fine-tuning loss function parameters.
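The training objective studied in this ablation is a weighted sum of the MSE reconstruction term and the forward loss. The schematic version below is for illustration only: the exact form of the forward term is given by Eq. (1) in the paper, and here both terms are simply written as MSE over a predicted volume and a re-simulated Bscan:

```python
import numpy as np

def pa_nerf_loss(pred_vol, true_vol, pred_bscan, true_bscan, lam=0.1):
    """MSE reconstruction loss plus a forward loss comparing Bscans
    re-simulated from the predicted volume against measured Bscans.
    lam=0.1 was the best-performing coefficient in the ablation (Table 5)."""
    mse = np.mean((np.asarray(pred_vol) - np.asarray(true_vol)) ** 2)
    forward = np.mean((np.asarray(pred_bscan) - np.asarray(true_bscan)) ** 2)
    return float(mse + lam * forward)
```

Setting lam=0 recovers the plain MSE objective, corresponding to the "forward loss omitted" row of the ablation.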

Discussion
In this study, we implemented five different methods for 3D reconstruction from sparse Bscans, namely DI, BBP, Unet, TransUnet, and PA-NeRF, with the last being the first attempt to use a NeRF model in PAT reconstruction. The DI method is easy to implement and efficient in terms of time, but its reconstructed volume is highly limited by the quality of the Bscan images, and the reconstruction quality degrades with fewer Bscans. The BBP method produces smoother reconstructions, but with insufficient data, the back projection may produce a non-realistic output. With a sufficient number of Bscans, the Unet model successfully reconstructs a high-quality volume. However, because it learns only an interpolation, its reconstruction may be poor with only a few Bscans. Furthermore, the Unet model is limited by its dataset: if trained with simulation data, its output is limited to the simulation domain. The TransUnet model is similar to the Unet model in that it is limited to its dataset and can only interpolate between Bscans. Moreover, the transformers in the TransUnet model need a huge dataset for training, and our study does not have enough data to train the model to its best performance.
In contrast, the PA-NeRF model learns the relationship between a transducer element's position and its corresponding beamformed data, yielding a more realistic reconstruction and enhanced generality, especially with a small number of Bscans.Moreover, we introduced a forward loss to improve the reconstruction quality.Both qualitative and quantitative results from simulation and phantom data show that the PA-NeRF method can reconstruct a high-quality volume.
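The coordinate-to-beam mapping at the heart of PA-NeRF follows the standard NeRF recipe: input coordinates (spatial position plus the transducer rotation angle ξ) are lifted with a Fourier positional encoding before being fed to an MLP. A sketch of that encoding step (the frequency count is illustrative; the paper's exact embedding may differ):

```python
import numpy as np

def positional_encoding(coords, n_freqs=6):
    """Map each coordinate c to [sin(2^k * pi * c), cos(2^k * pi * c)]
    for k = 0..n_freqs-1, as in the original NeRF input embedding.

    coords: (..., d) array, e.g. (x, y, z, xi) per sample.
    Returns (..., 2 * d * n_freqs) features.
    """
    coords = np.asarray(coords, dtype=float)
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi  # (n_freqs,)
    scaled = coords[..., None] * freqs           # (..., d, n_freqs)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*coords.shape[:-1], -1)
```

This high-frequency lifting is what lets a small MLP represent fine spatial detail in the mapping from (x, y, z, ξ) to initial pressure and absorption.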

Figure 1.
Figure 1. Scanning schematic diagram. The transducer is fixed on top of the target and rotates around the center. ξ represents the transducer rotation angle.

Figure 2.
Figure 2. PA-NeRF model structure. F is the operator that maps (x, y, z, ξ, α) to initial pressure and absorption coefficient, and θ represents the weights of the PA-NeRF model.

Figure 3.
Figure 3. Reconstruction diagrams for UBP and BBP. In UBP, the algorithm projects each Bscan data set to the entire imaged volume and adds them coherently. In BBP, the algorithm projects Bscan envelope data to the volume and adds them to form 3D images.

Figure 4.
Figure 4. Overview of the framework. (a) Schematic of the Transformer layer; (b) architecture of the modified TransUnet [43].

Figure 6.
Figure 6a. Comparative 3D reconstructions of a simulated "U" shape using nine Bscans. Displayed from left to right are the ground truth and the results from direct interpolation, back projection via Bscan, the CNN-Unet model, the TransUnet model, and the PA-NeRF model. The five rows illustrate, respectively, the volume profile, top-down maximum projection, and cross-sectional reconstructions at 30, 90, and 150 degrees. Figure 6b. SSIM values for each reconstruction method benchmarked against the ground truth. Figure 6c. PSNR values for each reconstruction approach in comparison with the ground truth.

Figure 7a.
Figure 7a. 3D reconstruction results for a simulated letter 'U', from five Bscans. Columns and rows show the same information as in Figure 6, as do plots b and c.

Figure 8a.
Figure 8a. 3D reconstruction results for a simulated letter 'U', from two Bscans. Columns and rows show the same information as in Figure 6, as do plots b and c.

Figure 9.
Figure 9. Reconstruction comparisons using five Bscans for phantom letters "U" (a, b, bordered in red) and "S" (c, d, bordered in blue). In the absence of a ground truth, US images were utilized as the benchmark in these phantom studies. For each letter, from left to right, the columns sequentially present the US image-based ground truth and the reconstruction results from the Unet model, the TransUnet model, and the PA-NeRF method. From top to bottom, the rows for each letter display the volume profile, top-down projection, and cross-sectional reconstructions at angles of 30, 90, and 150 degrees. Figure 9 b, d. Corresponding SSIM values, calculated by comparing Bscan and US image reconstructions to their respective ground truths.

Figure 10.
Figure 10 a, c. Reconstructed images of phantom letters U and S, from only two Bscans. Columns and rows show the same information as in Figure 9, as do plots b and d.

Figure 11.
Figure 11. Volume reconstruction from a 76-year-old woman with simple cysts (the lesion is 3.1 cm in diameter). Left shows input images obtained from Bscan data (blue boxes mark the region of interest), with the corresponding PA-NeRF output volume in the middle. The volume is shown in three different slices: front, top, and profile, respectively. Right presents two views of the volume and their US co-registered PAT images.

Figure 12.
Figure 12. Volume reconstruction from a 57-year-old woman with a benign mucinous cystadenoma, 3 cm in diameter. Left shows Bscan images obtained from input Bscan data (blue boxes mark the region of interest), with the corresponding PA-NeRF output volume in the middle. The volume is shown in three different slices: front, top, and profile, respectively. Right presents two different views of the volume, with their US co-registered PAT images.
