Unsupervised Segmentation of 3D Microvascular Photoacoustic Images Using Deep Generative Learning

Abstract Mesoscopic photoacoustic imaging (PAI) enables label‐free visualization of vascular networks in tissues with high contrast and resolution. Segmenting these networks from 3D PAI data and interpreting their physiological and pathological significance is crucial yet challenging due to the time‐consuming and error‐prone nature of current methods. Deep learning offers a potential solution; however, supervised analysis frameworks typically require human‐annotated ground‐truth labels. To address this, an unsupervised image‐to‐image translation deep learning model is introduced: the Vessel Segmentation Generative Adversarial Network (VAN‐GAN). VAN‐GAN integrates synthetic blood vessel networks that closely resemble real‐life anatomy into its training process and learns to replicate the underlying physics of the PAI system in order to segment vasculature from 3D photoacoustic images. Applied to a diverse range of in silico, in vitro, and in vivo data, including patient‐derived breast cancer xenograft models and 3D clinical angiograms, VAN‐GAN demonstrates its capability to facilitate accurate and unbiased segmentation of 3D vascular networks. By leveraging synthetic data, VAN‐GAN reduces the reliance on manual labeling, thus lowering the barrier to entry for high‐quality blood vessel segmentation (F1 score: VAN‐GAN vs. U‐Net = 0.84 vs. 0.87) and enhancing preclinical and clinical research into vascular structure and function.

1 Supplementary Notes

Imaging Artefacts in Photoacoustic Mesoscopy
Photoacoustic mesoscopy is a non-invasive imaging technique that uses optical excitation and acoustic detection to produce high-resolution images of biological tissues. However, the technique is susceptible to imaging artefacts that can distort images and reduce image quality. [1] These artefacts arise from acoustic attenuation, acoustic reflection, optical scattering, acoustic diffraction, and thermal effects. Illumination artefacts arise from the limited top illumination of the field of view: optical excitation and acoustic wave generation occur only in the upper part of absorbing structures facing the illumination and transducer array, resulting in inaccurate diameter estimations in the XZ plane compared to the XY plane. [2] Shadow artefacts arise from obscuring objects, causing a loss of signal in underlying objects due to strong optical or acoustic attenuation of the overlying structure. Reflection artefacts occur due to the presence of acoustic reflectors/scatterers or strong acoustic reverberations of the absorbing object itself, [3][4][5] leading to a signal echo near the object of interest. They can occur in-plane or out-of-plane depending on the position of the absorber/reflector. [6]

A Comparison of Segmentation Methods for Tumour Vasculature
A comparison of performance between RF and VAN-GAN was carried out on our PDX dataset for comparison to the results of [2] (Figure S7A-C). Results were consistent with prior analyses: VAN-GAN predicted significantly larger vascular density (P < 0.0001, Figure S7D) and vessel diameters (P < 0.01, Figure S7E). Vasculature segmented by VAN-GAN also exhibited greater connectivity compared to RF (P < 0.001, Figure S7F), resulting in a significant difference in computed vessel looping structures (P < 0.001, Figure S7G). VAN-GAN allowed us to distinguish structural differences between the ER+ and ER- subtypes, where ER+ tumours exhibited significantly larger mean blood vessel lengths (P < 0.01), a difference that was not statistically significant using RF in [2] (Figure S8A). Similarly to [2], VAN-GAN predicts a significant difference in mean vessel diameters between the subtypes (Figure S8B); however, the trend is reversed, with ER+ tumours on average exhibiting significantly larger diameters compared to ER- (P < 0.01). This highlights how VAN-GAN can provide unbiased insight into the biological characteristics of solid tumours.
2 Supplementary Methods

V-System Stochastic Grammar
The aim of VAN-GAN is to learn the mappings between imaging and segmentation domains. Whilst datasets containing image volumes of real vascular networks are readily available, 3D blood vessel segmentations are not, creating a need for realistic 3D blood vessel segmentation labels for VAN-GAN to learn from. Mathematical angiogenesis models exist that incorporate biological mechanisms to simulate realistic, interconnected blood vessel networks; [7] however, generating a library of vascular networks large enough for deep learning can be computationally expensive.
Our synthetic vascular networks are generated using stochastic L-system definitions (termed here, V-System) that define a rule as name → successor, where symbols are replaced according to the respective rule. [8,9] Following [9], rules are composed of symbols with the following actions:
• f(l, d): move forward l units along the direction vector and record vessel diameter, d;
• +(θ) and −(θ): rotate the direction vector clockwise or anti-clockwise by θ° on the plane defined by the current direction vector;
• /(ϕ): rotate the current plane by ϕ° about the direction vector;
• [ : store the current state, formed by the diameter;
• ] : restore the last saved state, allowing the generation of a new branch;
• { and }: define the limits of a vessel segment.
As the number of recursive iterations, n, increases, the string of instructions becomes progressively more complex. For example, in the case of modelling the growth of algae, [8] using the grammatical rules A → AB and B → A, the axiom A becomes:

A → AB → ABA → ABAAB → ABAABABA → …

The following string grammar was used to generate our synthetic vascular trees. [11] The vessel tilt angles, ϕ_1 and ϕ_2, are sampled from a uniform distribution over the interval [22.5°, 27.5°] and multiplied by a random ± sign. The string grammar S is defined in terms of a random number r generated over the interval [0, 1). The function S_i for i = 0, 1 is parameterised by growth rates λ_j for j = 1, 2, 3 and decay rates δ_k for k = 1, 2, sampled from the interval [1, 1.5]. We note that λ and δ are vectors sorted in ascending and descending orders of magnitude, respectively. The string S_i defines a branching vessel, which can randomly contain an aneurysm of random size (probability of 1 in 10).
As an example, applying our grammatical rules to the axiom F produces progressively branching instruction strings. The resulting synthetic vascular networks were quantified for a set of statistical and topological vascular descriptors [2]. Exemplar vascular skeletons of increasing complexity are illustrated in Figure S1, with distributions for our metrics given in Figure S2.
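The iterative rewriting process above can be sketched as a small stochastic L-system expander. The rules below are illustrative placeholders in our own notation, not the paper's actual vascular grammar (which is given in [11]):

```python
import random

# Illustrative rule table: each symbol maps to a list of
# (successor, probability-weight) pairs. A rule with a single
# successor is deterministic; multiple successors make it stochastic.
RULES = {
    "A": [("AB", 1.0)],             # A -> AB
    "B": [("A", 0.5), ("B", 0.5)],  # B -> A or B -> B, equally likely
}

def rewrite(axiom: str, n: int, rules=RULES, rng=random) -> str:
    """Apply the production rules n times to the axiom string."""
    s = axiom
    for _ in range(n):
        out = []
        for ch in s:
            if ch in rules:
                successors = [succ for succ, _ in rules[ch]]
                weights = [w for _, w in rules[ch]]
                out.append(rng.choices(successors, weights=weights)[0])
            else:
                out.append(ch)  # terminal symbols are copied unchanged
        s = "".join(out)
    return s
```

With the deterministic algae rules A → AB and B → A, `rewrite("A", 3)` reproduces the derivation shown above; swapping in weighted successors yields a different vascular tree on every call, which is how a large synthetic library can be generated cheaply.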

Supervised Random Forest (RF) Pixel Classifier
Using the open-source ilastik tool, users can pre-define a set of possible image features and manually annotate a subset of voxels in training images (either 2D or 3D) for a random forest model to learn to recognise complex image features. The trained model can then be applied to the remaining unlabelled data and its accuracy improved through retrospective interactive training. Photoacoustic images were preprocessed for ilastik's random forest toolbox by applying a slice-wise median filter with a kernel size of 3 × 3 to remove noise and improve contrast. For each dataset, a sample of images was manually annotated with two label categories: blood vessel and tissue background. Here, approximately 10 XY images were labelled from each training image volume. All 3D colour/intensity, edge, and texture features were selected for all Gaussian mean variational scales (0.3 to 10.0) to extract fine details at small scales as well as features from larger voxel neighbourhoods. RF models were trained by authors PWS, LH, TLL, and ELB. For synthetic data, all vessel labels were provided for training along with a proportion of surrounding background labels. [2] Trained models were applied to unlabelled images, with predictions thresholded using default settings to create binary segmentation masks.
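The slice-wise median filtering described above can be sketched in Python with SciPy. The function name and the rescaling to [0, 1] are our own assumptions for illustration, not ilastik's internal pipeline:

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess_volume(volume: np.ndarray, kernel: int = 3) -> np.ndarray:
    """Apply a 2D median filter to every XY slice of a 3D (Z, Y, X) volume.

    Mirrors the slice-wise 3x3 median filtering described in the text;
    the final rescale to [0, 1] is an assumed normalisation convention.
    """
    filtered = np.stack(
        [median_filter(sl, size=(kernel, kernel)) for sl in volume]
    )
    lo, hi = filtered.min(), filtered.max()
    # Guard against a constant volume to avoid division by zero.
    return (filtered - lo) / (hi - lo) if hi > lo else filtered
```

Filtering slice-by-slice (rather than with a 3D kernel) preserves the anisotropy of mesoscopy stacks, where in-plane resolution differs from the depth axis.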

Model Training
VAN-GAN models were trained pragmatically as and when new datasets and computational hardware became available. Three VAN-GAN models were trained in total (see Table S7). In each case, VAN-GAN predictions were thresholded at an intensity value of 0.5 (images scaled between 0 and 1.0) to create a segmentation mask. Increasing the batch size led to a noticeable improvement in VAN-GAN's certainty in separating vessels from tissue background: workstation (2), which employed a larger batch size of 4 (global batch size of 16), achieved better results than workstation (1), which used a batch size of 1 (global batch size of 2).
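The thresholding step can be written as a one-liner; this minimal sketch assumes, per the text, that generator outputs are already scaled to [0, 1]:

```python
import numpy as np

def binarise(prediction: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert a [0, 1]-scaled prediction volume into a binary mask.

    Voxels at or above the threshold are labelled vessel (1),
    the rest tissue background (0).
    """
    return (prediction >= threshold).astype(np.uint8)
```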

Comparing Vessel Diameters between Axial Planes
To compare the diameters of vessels in the axial XY- and XZ-planes, analysis was performed using MATLAB v2022b (MathWorks, Natick, MA, USA). Maximal intensity projections (MIPs) were created in each axial plane for each image. Diameters were analysed by computing a distance transform (using the in-built function bwdist) to provide vessel diameters, and vascular skeletons (using the in-built function bwmorph).
The resulting data provided a means to calculate the mean vessel diameter for each plane. We note that MIPs were used because skeletonisation assumes radial symmetry; however, the accuracy of this method declines with increasing vascular density, as vessels overlap in the MIPs.
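A Python analogue of the MATLAB bwdist/bwmorph pipeline (not the authors' code) illustrates the diameter estimate. The ridge-based skeleton proxy here is a deliberate simplification of bwmorph skeletonisation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, maximum_filter

def mean_vessel_diameter(mask: np.ndarray, pixel_size: float = 1.0) -> float:
    """Estimate mean vessel diameter from a binary 2D MIP mask.

    The Euclidean distance transform gives each foreground pixel's
    distance to the background; sampling it on a skeleton-like ridge
    and doubling approximates the local diameter (assumes radial
    symmetry, as noted in the text).
    """
    fg = mask.astype(bool)
    dist = distance_transform_edt(fg)
    # Crude skeleton proxy: foreground pixels whose distance equals
    # the maximum over their 3x3 neighbourhood (stand-in for bwmorph).
    ridge = fg & (dist == maximum_filter(dist, size=3))
    if not ridge.any():
        return 0.0
    return float(2.0 * dist[ridge].mean() * pixel_size)
```

Applied to MIPs of the XY- and XZ-planes, the two returned means can be compared directly, mirroring the plane-wise diameter comparison above.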

Figure S6:
Figure S6: VAN-GAN training and validation losses when applied to in vivo photoacoustic images of mouse skin, ear, breast cancer patient-derived xenograft tumours, and MCF7 and MDA-MB-231 tumours. Training and validation losses are given per epoch for: discriminators (A) D_x and (D) D_y; generators (B) G and (E) F; and total losses (C) x → y and (F) y → x. Dotted lines indicate the epoch (140) used to segment the ear dataset. Dashed lines indicate the optimal epoch (epoch 196).

Figure S7:
Figure S7: VAN-GAN improves quantification of pathological vascular architecture. Maximal intensity projections of: (A) in vivo signal intensity; (B) segmentation masks for (top left) VAN-GAN and (bottom left) the random forest pixel classifier (RF), and (right) a 2D overlay; and (C) vascular skeletons computed from (left) RF and (right) VAN-GAN segmentation predictions. We compute our vascular descriptors to compare results between the RF and VAN-GAN models. (D) Vascular density (the percentage of vascular volume with respect to tumour volume, %). (E) Mean vessel diameters (µm). (F) Connectivity (the volume percentage of the largest vascular subnetwork with respect to the sum of all subnetwork volumes, %). (G) Loop density (the number of network loops normalised by tumour volume, mm^-3). Mean and standard deviation of data are shown. Statistical significance is indicated by * (P < 0.05), ** (P < 0.01), *** (P < 0.001), and **** (P < 0.0001).

Figure S8:
Figure S8: Segmentation method can influence biological conclusions. Computed mean vessel (A) lengths and (B) diameters for the random forest classifier (RF, top) and VAN-GAN (bottom). Mean and standard deviation of data are shown. Statistical significance is indicated by * (P < 0.05) and ** (P < 0.01) (ns = not significant).

Table S1:
Network architecture for the 3D deep residual U-Net. Image input and output sizes are 128 × 128 × 128 × 1 (depth × width × height × channels). Conv_i represents the i-th 3D convolutional layer. Note that for each encoding and decoding layer an identity mapping also exists in each residual block: a 3D convolution of size 1 × 1 × 1 is applied to the layer input and added to the layer output. The number of filters is equal to the other filters defined in each layer, and stride lengths are equal to 2 and 1 for the encoding and decoding paths, respectively.

Table S2:
Network architecture for the 3D convolutional discriminator. Conv_i represents the i-th 3D convolutional layer. Note that Gaussian noise was added prior to each convolution and spatial dropout was applied post-convolution.

Table S4:
CycleGAN generator architecture. Each residual block (RB) is of the same type as used in VAN-GAN, consisting of two 3D convolutional layers with a kernel size of 3 × 3 × 3 and a stride of 1 × 1 × 1.

Table S5:
Evaluating segmentation performance against 2D maximal intensity projections. Segmentation masks are calculated for a 3D image using two random forest pixel classifiers (RF) trained separately by two experts (indicated by subscripts 1 and 2) and our VAN-GAN method. Maximal intensity projections were calculated for each segmentation prediction for a direct comparison against the 2D ground-truth masks using the F1 score, intersection over union (IoU), sensitivity, and specificity performance metrics. Note that two scores per metric indicate that the metric was computed against the ground truth generated by user (left) one or (right) two. The highest metric score per ground truth is shown in bold.

Table S6:
Mesoscopic photoacoustic image dataset details. Note that voxel size and image dimensions are defined by X × Y × Z, where the Z-axis is perpendicular to the skin surface. Dataset size indicates the number of image volumes. PA = photoacoustic; PDX = breast cancer patient-derived xenograft.

Table S7:
Trained VAN-GAN models with corresponding datasets and workstation specifications. Workstations are labelled: 1) a Dell Precision 7920T with dual Intel Xeon Gold 5120 CPUs, 128 GB RAM, and two NVIDIA Quadro GV100 32 GB GPUs with NVLink; 2) a custom-built workstation with an Intel Xeon Gold 5220 CPU, 128 GB RAM, and four NVIDIA RTX A6000 48 GB GPUs with NVLink. Note that the synthetic vascular dataset is used in all models. PA = photoacoustic; PDX = breast cancer patient-derived xenograft tumours.