Hybrid deep learning network for vascular segmentation in photoacoustic imaging

Photoacoustic (PA) technology has been used extensively in vessel imaging due to its capability of identifying molecular specificities and achieving high optical-diffraction-limited lateral resolution down to the cellular level. Vessel images carry essential medical information that guides professional diagnosis. Modern image processing techniques contribute substantially to vessel segmentation. However, these methods suffer from under- or over-segmentation. Thus, we demonstrate the results of adopting a fully convolutional network and U-net, and propose a hybrid network consisting of both, applied to PA vessel images. Comparison results indicate that the hybrid network can significantly increase the segmentation accuracy and robustness.


Introduction
Photoacoustic (PA) imaging involves applying a nanosecond pulsed laser to biological tissues and retrieving the induced pressure waves. Acoustic waves are emitted through laser absorption, an abrupt rise in temperature and transient thermoelastic expansion. An ultrasonic detector collects the data from which the target image is reconstructed. It has been shown that the unique optical absorption of substances provides quantification functionalities for endogenous nonfluorescent chromophores (hemoglobin, melanin, lipids, collagen, etc.) and contributes to a deeper understanding of the pathology of biological tissues [1,2]. Among the various PA imaging categories, optical-resolution photoacoustic microscopy (OR-PAM) stands out due to its high optical-diffraction-limited lateral resolution at the cellular level of micrometers [3]. This non-invasive imaging technique performs well, offering simultaneous high contrast and high spatial resolution, in vascular biology, ophthalmology, thyroid cancer and breast cancer detection [1,4,5].
Vessel segmentation is an essential task in biomedical image analysis. However, it is challenging to provide reliable and accurate vessel segmentation on PA images. Boink et al. [6] demonstrated good segmentation accuracy on blood vessels acquired with photoacoustic computed tomography (PACT). OR-PAM features high spatial resolution and simple image reconstruction [3], and thus has found broad applications in vascular imaging. To date, many efforts have focused on segmenting the blood vessels captured by OR-PAM systems. Among various image processing tasks, image segmentation has been widely discussed in medical image processing (e.g. skin surface detection [7], surgical vessel guidance [8] and vessel filtering [9]). Further research has shown that various image processing techniques are suitable for vascular segmentation and structural information extraction. Among these, adaptive thresholding based on pixel intensities is one of the simplest approaches, but its low morphological sensitivity to blood vessels often leads to over-segmentation [10]. Feature descriptors such as the Hessian matrix have been extensively used in vascular segmentation [11][12][13] to extract image details and preserve morphological information. However, traditional Hessian feature maps blur small vessels when the Gaussian scale is not chosen with care [5]. The multiscale Hessian filter proposed by Frangi et al. demonstrated the advantage of considering all possible aspect ratios [5]. In addition, iterative algorithms such as region growing (RG) and K-means clustering have been frequently used in a wide range of image processing fields [14][15][16]. Diyar et al. demonstrated breast cancer segmentation results based on region growing techniques [17], and Yang et al. presented segmentation enhancement on OR-PAM for quantitative microvascular imaging [18].
However, it is generally acknowledged that PA images suffer from imaging distortions and noise, which significantly degrade the segmentation quality.
Deep learning (DL) has been extensively used in recent years due to developments in hardware (GPUs) as well as novel algorithms. Highly accurate segmentation approaches supported by deep learning methods can overcome the aforementioned difficulties. Such rapid development has reinforced the growing field of image processing tasks such as 3D tomography of red cells [19] and image classification and segmentation [20]. These tasks demand the use of DL methods and require the capability of handling 2D images. Specifically, the convolutional neural network (CNN), a class of DL neural network, has shown its feasibility and superiority when tackling segmentation problems [21]. A recent survey conducted by Yang et al. [22] indicated that deep learning has been widely used in PA imaging fields such as PA image reconstruction [23], quantitative imaging [24], PA image detection [25] and classification [26], PA-assisted intervention [27] and PA segmentation [26]. Apart from the examples mentioned in this survey, other relevant work also provides evidence that deep learning methods applied to medical imaging are feasible and beneficial. Zhao et al. introduced an image deblurring deep learning network for optical microscopic systems [28] and motion correction for OR-PAM [29]. However, we found that deep learning models have not been widely applied to PA images for vessel segmentation. We therefore reviewed relevant research on DL-based medical image segmentation. For instance, Zhang et al. [30] proposed a CNN method for segmenting brain tissues in magnetic resonance imaging (MRI) and Li et al. [31] proposed a CNN model for liver tumor segmentation on computed tomography (CT) images. However, the fully connected layers that appear before the classification node lack consistency with the spatial context [32]. These difficulties can be addressed by using a fully convolutional network (FCN) [33] or U-net [34].
Furthermore, Milletari et al. [35] demonstrated an FCN solution for 3D MRI image segmentation and Li et al. [36] demonstrated that U-net can perform well with liver tumor segmentation on CT volumes.
Hence, in this paper, FCN and U-net were separately applied to PA imaging for vascular segmentation, and a hybrid network combining both via a voting scheme was proposed for PA vascular images. The results are compared qualitatively and evaluated on the Dice coefficient, Intersection over Union (IoU), sensitivity and accuracy. The following sections examine the proposed hybrid network, simulation results and conclusions. Open-source Python code for the simulation of our proposed system can be found at the following link: https://github.com/szgy66/Vessel-Segmentation-of-photoacoustic-imaging. Manual annotations of capillaries vary between individuals due to the complex structure. Therefore, the proposed models were not trained for capillary segmentation.

Data description
The in vivo vascular images were acquired from the ear of a Swiss Webster mouse using an OR-PAM system that incorporates a surface plasmon resonance sensor as the ultrasonic detector [37,38]. The maximum-amplitude-projection (MAP) image (Fig. 4) was reconstructed by projecting the maximum amplitude of each PA A-line along the depth direction. The system's lateral resolution was estimated at ∼4.5 µm, enabling the visualization of capillaries in addition to major blood vessels (Fig. 4). Our surface plasmon resonance sensor responds to ultrasound over a broad bandwidth, which determines the depth resolution of the OR-PAM system at ∼7.6 µm [38]. It took around 10 minutes to capture a vascular image consisting of 512 × 512 pixels. All experimental animal procedures were performed in compliance with laboratory animal protocols approved by the Animal Studies Committee of the Shenzhen University. With the OR-PAM system, 38 images were obtained, of which 5 were discarded due to noise, breakpoints or discontinuities that would affect later tests. Because the number of images acquired with the PA system was small, data augmentation methods such as cropping, flipping and mapping were applied to the PA images to tackle model overfitting and low training accuracy. Furthermore, the dataset images were cropped to 256 × 256 pixels to accelerate the training process. The final dataset consists of 177 images, of which 10 were randomly selected as the testing set and the remaining images were randomly placed into either the training set (133 images) or the validation set (34 images) with ratios of 80% and 20%, respectively. In addition, all dataset images were manually annotated with LabelMe, a graphical image annotation tool developed by the Massachusetts Institute of Technology that can be found at the following link: https://github.com/CSAILVision/LabelMeAnnotationToo.
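The cropping, flipping and splitting steps described above can be sketched as follows. This is a minimal illustration only: it assumes non-overlapping 256 × 256 tiles and simple horizontal and vertical flips, and the function names (crop_tiles, flip_augment, split_dataset) are hypothetical. The exact crop positions, the "mapping" transform and the way the 177 images were produced are not specified in the text.

```python
import numpy as np

def crop_tiles(image, tile=256):
    """Split a 2D image into non-overlapping tile x tile patches (assumed layout)."""
    h, w = image.shape[:2]
    return [image[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

def flip_augment(patch):
    """Return the patch together with its horizontal and vertical flips."""
    return [patch, np.fliplr(patch), np.flipud(patch)]

def split_dataset(n_items, n_test=10, train_fraction=0.8, seed=0):
    """Randomly hold out n_test indices, then split the rest into train/validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    test, rest = order[:n_test], order[n_test:]
    n_train = int(train_fraction * len(rest))
    return rest[:n_train], rest[n_train:], test

# Stand-in for one 512 x 512 MAP image (synthetic values, not real data).
image = np.arange(512 * 512, dtype=np.float32).reshape(512, 512)
patches = [aug for t in crop_tiles(image) for aug in flip_augment(t)]

train_idx, val_idx, test_idx = split_dataset(177)
print(len(patches), len(train_idx), len(val_idx), len(test_idx))
```

With 177 items, holding out 10 for testing and splitting the remaining 167 at 80%/20% gives the 133/34 training/validation sizes reported above.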

Traditional methods
Most existing PA image processing technologies are based on traditional optimization algorithms. In this paper, we mainly discuss four traditional methods (threshold segmentation, region growing, maximum entropy and K-means clustering) as well as three deep learning approaches (FCN, U-net and Hy-Net).
The threshold segmentation method selects an appropriate pixel-intensity threshold as the dividing line, giving a clear classification between the foreground and background [39]. The two main drawbacks of thresholding as a segmentation method are its high sensitivity to the chosen threshold and its disregard for morphological information.
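As a minimal sketch of global thresholding, the snippet below marks every pixel brighter than a fixed value as foreground. The strict "greater than" comparison and the function name threshold_segment are assumptions made for illustration; the example value of 100 is the intensity threshold reported later in the results.

```python
import numpy as np

def threshold_segment(image, threshold):
    """Label pixels brighter than `threshold` as foreground (True)."""
    return np.asarray(image) > threshold

# Tiny synthetic 8-bit image (not real PA data).
image = np.array([[10, 120, 200],
                  [90, 101, 30],
                  [100, 250, 5]], dtype=np.uint8)
mask = threshold_segment(image, 100)
print(mask.astype(int))
```

Because the rule looks at each pixel in isolation, a small change in the threshold can flip many pixels at once, which reflects the sensitivity noted above.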
Region growing (RG) is the process of aggregating pixels or sub-regions into larger regions according to pre-defined criteria [40]. The basic idea is to start from a group of manually chosen seed points. A seed point can be either a single pixel or a small area. The first step is to merge adjacent pixels or areas with similar properties into the seed, forming a new, grown seed region. This process is then repeated until the region has converged (i.e. no additional points can be added). The key issue with RG is that the choice of the initial growth points cannot be empirically determined.
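The growth loop can be sketched as a breadth-first search over 8-connected neighbours. The similarity criterion used here (absolute intensity difference to the pixel from which a neighbour is reached, bounded by a hypothetical `tolerance`) is an assumption for illustration and is not necessarily the criterion used in [40] or in the experiments.

```python
from collections import deque
import numpy as np

def region_grow(image, seeds, tolerance):
    """Grow a region from seed pixels over 8-connected neighbours.

    A neighbour joins the region when its intensity differs from the
    pixel it is reached from by at most `tolerance` (assumed criterion).
    """
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    region = np.zeros((h, w), dtype=bool)
    queue = deque()
    for r, c in seeds:
        region[r, c] = True
        queue.append((r, c))
    while queue:  # terminates once no further pixel can be added
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < h and 0 <= nc < w
                if (dr or dc) and inside and not region[nr, nc]:
                    if abs(image[nr, nc] - image[r, c]) <= tolerance:
                        region[nr, nc] = True
                        queue.append((nr, nc))
    return region

# Synthetic image with a bright diagonal "vessel".
image = np.array([[0.9, 0.8, 0.1],
                  [0.1, 0.9, 0.1],
                  [0.1, 0.1, 0.8]])
grown = region_grow(image, seeds=[(0, 0)], tolerance=0.2)
print(grown.astype(int))
```

Starting from a different seed, for example a dark background pixel, yields a different region, which illustrates why the seed choice matters.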
Information entropy, as shown in Eq. (1), is used to describe the degree of information uncertainty. The essence of the principle of maximum entropy is that the probability of an event occurring in the system satisfies all known constraints without making assumptions about any unknown information; in other words, the unknown is treated as equally probable. In maximum entropy image segmentation [41], the total entropy of the image under every candidate segmentation threshold is calculated, and the threshold that yields the maximum entropy is used as the final threshold. Pixels whose gray level is greater than this threshold are classified as foreground; all others are classified as background.
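A minimal sketch of this threshold search is given below. It assumes that Eq. (1) is the Shannon entropy H = -Σ p_i log p_i and that the total entropy is the sum of the background and foreground entropies, each computed from the normalized gray-level histogram on its side of the threshold (a Kapur-style formulation). Whether [41] uses exactly this form is not confirmed by the text, and the function names are illustrative.

```python
import numpy as np

def shannon_entropy(p):
    """H = -sum(p * log(p)) over the non-zero entries of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def max_entropy_threshold(image, levels=256):
    """Return the gray level t that maximises background + foreground entropy."""
    hist = np.bincount(np.asarray(image).ravel(), minlength=levels)
    counts = np.cumsum(hist)
    total = counts[-1]
    best_t, best_h = 0, -np.inf
    for t in range(levels - 1):
        n_bg, n_fg = counts[t], total - counts[t]
        if n_bg == 0 or n_fg == 0:
            continue  # one class would be empty for this threshold
        h = (shannon_entropy(hist[:t + 1] / n_bg)
             + shannon_entropy(hist[t + 1:] / n_fg))
        if h > best_h:
            best_t, best_h = t, h
    return best_t

# Synthetic bimodal data: dark background (0..39) and bright vessels (200..249).
image = np.concatenate([np.repeat(np.arange(40), 20),
                        np.repeat(np.arange(200, 250), 2)]).astype(np.uint8)
t = max_entropy_threshold(image)
mask = image > t  # gray levels above the threshold are foreground
print(t, int(mask.sum()))
```

On this synthetic histogram the selected threshold falls in the empty gap between the two modes, so all 100 bright samples are labelled as foreground.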
K-means clustering is an iterative algorithm, which can be divided into the following four steps: a) randomly select the initial centroids of the K classes; b) label each sample according to its distance to each cluster centroid; c) calculate and update the centroid of each class; d) repeat steps b) and c) until the centroids converge.
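Steps a) to d) can be sketched for scalar pixel intensities as follows. This NumPy version is only an illustration with assumed defaults (K = 2, a fixed random seed, foreground taken as the brighter cluster); it is not the MATLAB imsegkmeans implementation used in the experiments.

```python
import numpy as np

def kmeans_1d(values, k=2, max_iter=100, seed=0):
    """Cluster scalar pixel intensities following steps a) to d)."""
    values = np.asarray(values, dtype=float).ravel()
    rng = np.random.default_rng(seed)
    # a) pick k distinct intensities as the initial centroids
    centroids = rng.choice(np.unique(values), size=k, replace=False)
    for _ in range(max_iter):
        # b) assign every sample to its nearest centroid
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # c) recompute each centroid as the mean of its members
        updated = np.array([values[labels == j].mean() if np.any(labels == j)
                            else centroids[j] for j in range(k)])
        # d) stop once the centroids no longer move
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels, centroids

# Synthetic image: dark background with a few bright vessel pixels.
image = np.array([[10, 12, 200],
                  [11, 210, 205],
                  [9, 13, 198]])
labels, centroids = kmeans_1d(image, k=2)
mask = (labels == np.argmax(centroids)).reshape(image.shape)
print(mask.astype(int))
```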

Deep learning methods
Convolutional Neural Networks (CNNs) are powerful visual models that produce a hierarchical structure of features. The application of CNNs to semantic segmentation has surpassed the previous state of the art. Although past models, such as GoogleNet [42], VGG [43] and AlexNet [44], have demonstrated superior performance, none of them achieved end-to-end training, owing to the fully connected layers before the network output and the need for a consistent label size. Furthermore, the fully connected layers of the network flatten the extracted features into a one-dimensional vector, thus discarding the spatial information of the extracted feature maps. In contrast, a Fully Convolutional Network (FCN) replaces the fully connected layers with convolutional layers, which preserves the spatial information and avoids pre-processing and post-processing of the images.
Hence, an FCN model has been introduced in this paper, where the convolutional kernel size has been uniformly set to 3 × 3 with a step size of 2. The number of channels and the image size corresponding to each convolution process are written below the corresponding blocks in Fig. 1. Two up-sampling operations with a step size of 2 and a single operation with a step size of 8 have been performed in the deconvolution process. Each convolution block except for the last layer of the network is followed by a ReLU nonlinearity. Jun et al. [45] demonstrated that no significant difference can be observed when selecting between up-sampling and ConvTranspose. Hence, up-sampling has been adopted in this network to reduce the number of training parameters. In addition, we use two convolution operations and a dropout block to prevent overfitting during the transition from convolution to deconvolution. This network implementation was based on an existing FCN method [33].
U-net is a model developed from FCN that demonstrates strong robustness and is widely applied in both academia and industry. Although both networks consist of fully convolutional layers, a subtle difference lies in the concatenation layer, which combines the low-level features of the encoding part of the network with the high-level features of the decoding part. This effectively avoids the feature loss caused by the pooling layers in the network. In this paper, the U-net consists of convolution kernels of size 3 × 3 and four down-sampling and up-sampling steps with a step size of 2. The corresponding image size and number of channels are written above the blocks in Fig. 2. In this network, the addition layers were replaced by concatenation layers to fuse the low-level features with the high-level features, so that the channel capacity is expanded instead of simply adding the corresponding pixels.
Furthermore, a hybrid deep learning network, Hy-Net, shown in Fig. 3, was proposed based on both FCN and U-net as previously mentioned. The Hy-Net combines the results from FCN [33] and U-net [34] with a concatenation block followed by an activation (sigmoid) block. The final probability map, in other words the network output, is produced by the sigmoid function as expressed in Equation (2). A default threshold has been empirically set to 0.5, meaning that map entries greater than 0.5 are classified as foreground while the remaining entries are considered background. Other threshold values were also tested, but only limited improvement was observed. However, cases of over- or under-segmentation still exist despite the high segmentation accuracy achieved.
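The fusion and binarization steps can be sketched as follows, assuming that Equation (2) is the standard logistic sigmoid σ(x) = 1 / (1 + e^(-x)). The per-pixel mixing of the two concatenated maps (the `weights` and `bias` arguments) is a hypothetical stand-in for whatever learned layer sits between the concatenation and sigmoid blocks; the text does not specify it.

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), mapping scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def fuse_and_binarize(fcn_map, unet_map, weights=(0.5, 0.5), bias=0.0,
                      threshold=0.5):
    """Concatenate two score maps, mix them per pixel, apply sigmoid, threshold."""
    stacked = np.stack([fcn_map, unet_map], axis=-1)  # concatenation block
    scores = stacked @ np.asarray(weights) + bias     # hypothetical mixing step
    probability = sigmoid(scores)                     # activation block
    return probability, probability > threshold       # foreground where p > 0.5

# Synthetic pre-activation scores from the two branches (illustrative values).
fcn_scores = np.array([[2.0, -1.0], [0.5, -3.0]])
unet_scores = np.array([[1.0, 1.5], [-2.0, -2.5]])
prob, mask = fuse_and_binarize(fcn_scores, unet_scores)
print(mask.astype(int))
```

In the second pixel of the first row the two branches disagree and the fused score decides the label, which is the complementary behaviour the hybrid design aims for.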

Evaluation methods
The following four metrics, Dice Coefficient (DC), Intersection over Union (IoU), Sensitivity (Sen) and Accuracy (Acc), were applied to each test experiment to quantify the performance of the various segmentation methods. We abbreviate the variables as TP (True Positive), FP (False Positive), TN (True Negative) and FN (False Negative), as shown in Table 1. In addition, GT (Ground Truth) and SR (Segmentation Result) were defined to represent our manual segmentation standard and the result of our network output, respectively. The formulas for the four metrics are shown in Table 2.
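A minimal sketch of the four metrics computed from binary GT and SR masks is shown below. It uses the standard definitions DC = 2TP / (2TP + FP + FN), IoU = TP / (TP + FP + FN), Sen = TP / (TP + FN) and Acc = (TP + TN) / (TP + TN + FP + FN), which are assumed to match Table 2. The function name is illustrative, and the masks are assumed to contain at least one foreground pixel so that no denominator is zero.

```python
import numpy as np

def segmentation_metrics(gt, sr):
    """Compute DC, IoU, Sen and Acc from binary ground-truth and result masks."""
    gt = np.asarray(gt, dtype=bool)
    sr = np.asarray(sr, dtype=bool)
    tp = int(np.sum(gt & sr))    # vessel pixels correctly detected
    fp = int(np.sum(~gt & sr))   # background labelled as vessel
    tn = int(np.sum(~gt & ~sr))  # background correctly rejected
    fn = int(np.sum(gt & ~sr))   # vessel pixels missed
    return {
        "DC": 2 * tp / (2 * tp + fp + fn),
        "IoU": tp / (tp + fp + fn),
        "Sen": tp / (tp + fn),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
    }

# Tiny synthetic masks: TP = 2, FP = 1, FN = 1, TN = 4.
gt = [[1, 1, 0, 0],
      [0, 1, 0, 0]]
sr = [[1, 0, 0, 0],
      [0, 1, 1, 0]]
metrics = segmentation_metrics(gt, sr)
print({name: round(value, 4) for name, value in metrics.items()})
```

Because background pixels dominate vessel images, TN inflates Acc; this is consistent with the traditional methods below reaching accuracies above 97% while their DC and IoU remain much lower.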

Implementation
The proposed method was implemented in Python 3 using the public Keras [46,47] front-end package with the TensorFlow [48] package as the backend. The learning rate is fixed at 0.0001 during the training process. We use Adam to optimize the cross-entropy loss function, and the mini-batch size is set to 2. The total number of training epochs is 50. Training FCN, U-net and Hy-Net took about 3.052, 5.525 and 6.426 hours, respectively, on an Intel Xeon Platinum 8158 CPU @ 3.00 GHz with 256 GB RAM. In the test phase, the prediction of 10 images took about 14 s.

Traditional methods
Four traditional non-deep-learning methods are compared in terms of DC, IoU, Sen and Acc, as shown in Table 3. A pixel intensity threshold value of 100 was chosen for the thresholding method, which resulted in a segmentation accuracy of 97.40%. However, the remaining three metrics were considerably lower, with mean values of 70.98%, 56.09% and 61.64% for DC, IoU and Sen, respectively. As mentioned previously, RG requires a selection of initial seed points. Hence, the image pixels were sorted in ascending order of pixel curvature, and the pixels with the smallest curvature were taken as the initial seed points. This selection method ensures that the algorithm starts from the smoothest area in the image, thus reducing the number of divisions taken. The threshold value (the maximum density distance among the 8 pixels around the centroid) was set to 0.8. The RG-based method resulted in evaluation scores of 64.30%, 49.70%, 51.96% and 97.26% for DC, IoU, Sen and Acc, respectively.
The K-means clustering segmentation was implemented in MATLAB using the imsegkmeans function and resulted in 75.21%, 60.93%, 70.92% and 97.59% for DC, IoU, Sen and Acc, respectively. Visual comparison results are provided in Fig. 4 to better illustrate the major segmentation differences among the methods mentioned. The original test image, shown in Fig. 4(a), is presented in RGB channels and represents the raw data captured by the PA imaging system. The ground truth was manually annotated, as shown in Fig. 4(b). The segmentation results produced by the traditional methods (Fig. 4(c)-(f)) are presented in light green and overlaid on the test image so that the overlapping sections can be easily observed.
A few observations and conclusions can be made from Fig. 4. It is clearly seen from columns 3, 4 and 5 that traditional methods perform well on bright images with clear contour boundaries. However, these methods perform poorly on images with unclear boundaries (columns 1, 2 and 6). Hence, it can be concluded that the four traditional segmentation approaches lack robustness and generalize insufficiently.

Proposed method
Significant improvements in the four evaluation indicators can be observed (Table 4) for the three deep learning methods. Among the deep learning methods, FCN has the worst performance, followed by U-net, whereas Hy-Net has the best. Specifically, in terms of DC, IoU, Sen and Acc, U-net outperforms FCN by 13.71%, 17.97%, 13.37% and 1.46%; Hy-Net outperforms FCN by 15.34%, 20.05%, 18.62% and 1.55%; and Hy-Net outperforms U-net by 1.63%, 2.08%, 5.25% and 0.09%. Meanwhile, the visualization results in Fig. 5 show that Hy-Net achieves a high degree of overlap with the label for both large and small vessels. The segmentation results of the network are represented in green and overlaid on the original image for better comparison.
On the one hand, traditional methods, as mentioned previously, focus on either global or local features and do not consider the spatial information of the image, which leads to sub-optimal segmentation solutions. On the other hand, deep learning methods have demonstrated accurate segmentation results for PA blood vessel images, and a significant improvement can be seen when compared with traditional methods. This improvement can be attributed to the use of convolutional kernels (feature descriptors) and the parameter sharing of these kernels, which exploits the overall characteristics of the target image.
To further analyse the performance of the various methods in a statistical manner, a box plot comparison is presented in Fig. 6. As shown by the quantification and visualization results (Table 4, Fig. 4, Fig. 5), Hy-Net outperforms FCN and U-net and demonstrates good stability and robustness. The main difference is that FCN and U-net both under-segment the vessels. This phenomenon can be further explained by the following two aspects.
1. FCN and U-net both suffer from limitations inherent to the models, regardless of hyperparameter tuning, additional iterations or a larger training set.
2. Hy-Net optimizes the results of the two models by combining the feature outputs of both FCN and U-net, which effectively avoids relying on the output of a single model.
Despite the outstanding results of the Hy-Net model, manual annotations do not follow a uniform standard, which causes the segmentation accuracy to fluctuate. A visualization of the three worst test images, along with the under-segmented (highlighted in red) and over-segmented (highlighted in blue) results, is depicted in Fig. 7 to illustrate the issue. Two main reasons can be identified from Fig. 7.
1) The over-segmented regions recognized by the model were actual blood vessels that were mislabeled by the human annotator.
2) The high uncertainty in choosing the binarization thresholds could lead to under- or over-segmentation. In this paper, a set of thresholds was tested, and the following configuration (FCN: 80, U-net: 100, Hy-Net: 150) gave the best results. Because a single probability map is one-sided, the final result can be sensitive to the threshold chosen when converting from grayscale to binary, which may severely affect the overall segmentation accuracy. Therefore, the combined network output, which is decided by two probability maps, forms a final complementary mask.
It can be seen from Table 5 that FCN is much smaller than U-Net in terms of parameter count and memory usage.
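The under- and over-segmentation maps of the kind shown in Fig. 7 can be derived from the ground truth and a binarized probability map, as sketched below. The sketch assumes an 8-bit (0 to 255) probability map binarized with the Hy-Net threshold of 150 reported above; the function names and the overlay colours on a black canvas are illustrative choices, not the authors' plotting code.

```python
import numpy as np

def error_maps(gt, sr):
    """Return (under, over): missed vessel pixels and spurious vessel pixels."""
    gt = np.asarray(gt, dtype=bool)
    sr = np.asarray(sr, dtype=bool)
    under = gt & ~sr  # annotated as vessel but not predicted
    over = sr & ~gt   # predicted as vessel but not annotated
    return under, over

def overlay(under, over):
    """Paint under-segmentation red and over-segmentation blue on a black canvas."""
    rgb = np.zeros(under.shape + (3,), dtype=np.uint8)
    rgb[under] = (255, 0, 0)
    rgb[over] = (0, 0, 255)
    return rgb

# Synthetic 8-bit probability map and annotation (not real data).
prob8 = np.array([[200, 160, 20],
                  [140, 30, 180]], dtype=np.uint8)
gt = np.array([[1, 0, 0],
               [1, 0, 1]], dtype=bool)

sr = prob8 > 150  # assumed 8-bit binarization with the Hy-Net threshold
under, over = error_maps(gt, sr)
rgb = overlay(under, over)
print(int(under.sum()), int(over.sum()))
```

Lowering the threshold to 100 in this toy example would recover the missed pixel at the cost of keeping the spurious one, which mirrors the threshold sensitivity discussed above.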

Conclusion and further works
In this paper, we proposed a deep learning network (Hy-Net) for blood vessel segmentation on PA images. From the results of our experiments, it can be concluded that our method achieves higher accuracy and robustness than traditional methods (thresholding, region growing, maximum entropy and K-means clustering), and furthermore that Hy-Net outperforms both FCN and U-net on all four evaluation indicators. These promising results motivate our future work, which is mainly committed to the following aspects: 1) annotate our images with a double-blind method, instead of single-person annotation, to avoid personal subjective opinions; 2) focus on smaller tissue structures provided by photoacoustic images; 3) investigate the practicability of applying deep learning models to other photoacoustic images and conduct photoacoustic image segmentation on mice or even human tissues; 4) explore and propose other deep learning networks that provide high segmentation accuracy; 5) propose a novel network for capillary segmentation; 6) focus on model optimization and parameter simplification to reduce the number of parameters, and conduct studies on microvessel segmentation via transfer learning.