Liver vessel segmentation based on inter-scale V-Net

Segmentation and visualization of liver vessels is a key task in preoperative planning and computer-aided diagnosis of liver diseases. Because of the irregular structure of liver vessels, accurate liver vessel segmentation is difficult. This paper proposes a liver vessel segmentation method based on an improved V-Net network. First, dilated convolutions are introduced so that the network can enlarge its receptive field without further down-sampling and thus preserve detailed spatial information. Second, a 3D deep supervision mechanism is introduced to speed up convergence and help the network learn semantic features better. Finally, inter-scale dense connections are designed in the decoder to prevent the loss of high-level semantic information during decoding and to effectively fuse multi-scale feature information. Liver vessel segmentation experiments were performed on the public 3Dircadb dataset. The average dice and sensitivity of the proposed method reached 71.6% and 75.4%, respectively, both higher than those of the original network. The experimental results show that the improved V-Net network can automatically and accurately segment labeled, and even unlabeled, liver vessels from CT images.


Introduction
Liver cancer is the second most deadly cancer after lung cancer and one of the cancers with the fastest increasing morbidity and mortality in China [1]. At present, a common computer-assisted treatment for liver cancer is thermal ablation, an effective method for eliminating tumors. However, liver vessels have irregular shapes and low contrast with the surrounding tissue, and annotated 3D medical training samples are scarce. Moreover, image segmentation is essentially a classification problem at the voxel level: if the foreground and background categories are imbalanced, training easily falls into a poor local optimum, and small foreground regions are lost or go undetected. Although the spatial information in the V-Net decoder is relatively coarse, it carries powerful semantic features. However, de-convolution and convolution often cause loss of high-level semantic information, so contextual information cannot be propagated to higher-resolution layers. These problems make it difficult to segment liver vessels with conventional deep convolutional networks.
V-Net is chosen as the basic network structure for liver vessel segmentation and is improved. The main contributions are as follows: 1) The original network structure is optimized and a 3D deep supervision mechanism [19] is introduced, which helps the network learn semantic features better, accelerates convergence, and improves prediction accuracy. 2) Inter-scale dense connections are designed in the decoder to reduce the loss of high-level semantic information during decoding and to effectively fuse multi-scale feature information. 3) A loss function combining binary cross-entropy and the dice coefficient is used so that the network can still be trained effectively under category imbalance.

Preprocessing
The preprocessing steps are as follows: 1) CT values are clipped to [−200, 200] HU, which filters out other organs in the image. 2) Due to GPU memory limits, the original 512 × 512 resolution is down-sampled to 256 × 256. 3) Because most training slices are 1.6 mm thick, data with slice thickness below 1.25 mm or above 2 mm is resampled to 1.6 mm by trilinear interpolation. Each three-dimensional training sample consists of 48 consecutive slices, extracted with a sliding step of 5. 4) Rotation and mirroring operations are used to augment the data.
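The steps above can be sketched as follows. This is a minimal illustration, assuming a (depth, 512, 512) NumPy CT array and a known per-case slice thickness in mm; the function names and linear-interpolation choice for in-plane down-sampling are our own.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(ct, spacing_z, target_thickness=1.6):
    """Illustrative preprocessing: clip HU range, halve in-plane resolution,
    and normalize slice thickness to 1.6 mm when outside [1.25, 2] mm."""
    # 1) Clip CT values to [-200, 200] HU to suppress unrelated organs.
    ct = np.clip(ct, -200, 200).astype(np.float32)
    # 2) Down-sample in-plane resolution from 512x512 to 256x256.
    ct = zoom(ct, (1.0, 0.5, 0.5), order=1)
    # 3) Resample slice thickness to 1.6 mm if it is < 1.25 mm or > 2 mm.
    if spacing_z < 1.25 or spacing_z > 2.0:
        ct = zoom(ct, (spacing_z / target_thickness, 1.0, 1.0), order=1)
    return ct

def sliding_window_blocks(ct, depth=48, step=5):
    """Extract overlapping 48-slice training blocks with a sliding step of 5."""
    return [ct[i:i + depth] for i in range(0, ct.shape[0] - depth + 1, step)]
```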

Improvement of V-Net network framework
V-Net is a 5-layer symmetrical network architecture with an encoder that extracts spatial features from images and a decoder that constructs segmentation maps from the encoded features, plus skip connections that combine the position information in the encoding path with the context information in the decoding path to compensate for edge features and spatial information lost during decoding. To mitigate vanishing gradients, residual units are added to the network:

x_{l+1} = x_l + F(x_l, W_l),

where F represents the residual function, x_l is the input feature, and W_l is a group of weights related to the residual unit. Any deeper feature x_L (L > l ≥ 1) can then be expressed as the shallow feature x_l plus an accumulated residual:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i).

Because the original V-Net has many parameters, the network easily overfits. Therefore, 3 × 3 × 3 convolution kernels replace the original 5 × 5 × 5 kernels in each layer of the network, and a PReLU activation function is adopted throughout the network. Down-sampling in V-Net is performed by convolution: the feature map is convolved with a 2 × 2 × 2 kernel with stride 2 in the encoder to halve its resolution. At the same time, the number of feature channels in each layer is doubled to learn deep features more accurately and fully.
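One encoder stage as described above can be sketched in PyTorch as follows. This is a minimal sketch under our own naming; the number of convolutions per residual unit varies by layer in the actual network.

```python
import torch
import torch.nn as nn

class ResidualStage(nn.Module):
    """Illustrative encoder stage: a residual unit of 3x3x3 convolutions with
    PReLU (x_out = x_in + F(x_in)), followed by a strided 2x2x2 convolution
    that halves the resolution and doubles the channel count."""
    def __init__(self, channels, n_convs=2):
        super().__init__()
        layers = []
        for _ in range(n_convs):
            layers += [nn.Conv3d(channels, channels, 3, padding=1),
                       nn.PReLU(channels)]
        self.residual = nn.Sequential(*layers)
        # Down-sampling by convolution: 2x2x2 kernel, stride 2, channels doubled.
        self.down = nn.Sequential(nn.Conv3d(channels, 2 * channels, 2, stride=2),
                                  nn.PReLU(2 * channels))

    def forward(self, x):
        x = x + self.residual(x)   # residual unit: x_out = x_in + F(x_in)
        return self.down(x)
```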
Although down-sampling can enlarge the receptive field, it also reduces spatial resolution. Therefore, the last layer of the network is changed: only the number of feature channels is increased, without changing the feature map size (see Figure 1). Three dilated convolutions are introduced in each of the third and fourth layers of the encoder to enlarge the receptive field without losing resolution. The dilation rates of the third layer are 1, 2, and 4, with corresponding receptive fields of 3, 7, and 15, respectively; the dilation rates of the fourth layer are 3, 4, and 5, with corresponding receptive fields of 11, 15, and 19, respectively. Adjusting the dilation rate extracts context information at different scales of the feature map, and the improved resolution allows the network to locate the target more accurately. Each layer in the decoding path uses a 2 × 2 × 2 de-convolution with stride 2 for up-sampling; the number of feature channels is halved, followed by 3 padded convolutions (2 in the last layer). Finally, in the output layer, a 1 × 1 × 1 convolution adjusts the number of channels of the feature map. Because the image resolution was reduced in the preprocessing stage, trilinear interpolation restores the feature map to the original image size, and a sigmoid function yields the final probability map. A dropout layer is added at the end of each layer's residual unit to prevent overfitting.
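A stack of dilated convolutions like the one described can be sketched as below (the helper name is our own). With rates (1, 2, 4), the cumulative receptive field of the stacked 3 × 3 × 3 kernels grows 3 → 7 → 15, matching the third encoder layer; with `padding = dilation`, the spatial size is preserved.

```python
import torch
import torch.nn as nn

def dilated_block(channels, rates):
    """Illustrative stack of 3x3x3 dilated convolutions that enlarges the
    receptive field without any further down-sampling."""
    layers = []
    for r in rates:
        # padding = dilation keeps spatial size unchanged for a 3x3x3 kernel
        layers += [nn.Conv3d(channels, channels, 3, padding=r, dilation=r),
                   nn.PReLU(channels)]
    return nn.Sequential(*layers)
```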

3D deep supervision mechanism
A 3D deep supervision mechanism is introduced into the network to optimize the model, speed up learning, and prevent information loss during forward propagation. Because the parameters of each path are initialized randomly, this mechanism allows different paths to update their weights independently without interfering with each other, so the network's learning does not get stuck in the same local minimum. Moreover, deep supervision gives the network more feedback during back propagation than using only the last output layer (see Figure 1). Three output layers are added to the decoder of the improved V-Net network: each layer's feature map is up-sampled by trilinear interpolation, and the loss is computed after a sigmoid function. The 3D deep supervision loss has the form

L = L(X; W) + Σ_d η_d L_d(X; W_d) + λ (‖W‖² + Σ_d ‖W_d‖²),

where W_d represents the weights of the d-th supervised layer in the main network, η_d is its supervision weight, the third term is the weight-decay regularization, and λ is its hyper-parameter.
The 3D deep supervision mechanism promotes the expression of high-level features by the hidden layers, thereby improving the discrimination capability of the model. As these different loss components propagate backward, the equivalent training data expands, effectively preventing overfitting and further boosting the network's generalization capability.
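An auxiliary supervision branch and the weighted loss sum can be sketched as follows. This is a minimal illustration under our own naming; weight decay is assumed to be handled by the optimizer rather than written into the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxOutput(nn.Module):
    """Illustrative deep-supervision branch: reduce a hidden decoder feature
    map to one channel, up-sample to full resolution by trilinear
    interpolation, and apply a sigmoid."""
    def __init__(self, in_channels):
        super().__init__()
        self.to_logit = nn.Conv3d(in_channels, 1, kernel_size=1)

    def forward(self, feat, out_size):
        logit = self.to_logit(feat)
        logit = F.interpolate(logit, size=out_size, mode="trilinear",
                              align_corners=False)
        return torch.sigmoid(logit)

def deeply_supervised_loss(main_prob, aux_probs, target, loss_fn, eta=0.33):
    """Total loss = main loss + eta * sum of auxiliary losses; eta is
    initialized to 0.33 and decayed during training."""
    loss = loss_fn(main_prob, target)
    for p in aux_probs:
        loss = loss + eta * loss_fn(p, target)
    return loss
```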

Inter-scale dense connections
Inter-scale dense connections are introduced in the decoder to further reduce information loss during decoding. The network builds the encoder and decoder in top-down and bottom-up fashion. Although the spatial information in the decoder is coarse, it carries powerful semantic features. Because of the large semantic gap between layers, these inter-scale dense connections directly propagate feature information from one scale stage to another, fusing feature information of different scales and preventing the loss of high-level semantic information.
The improved V-Net network is a four-layer network structure, and we use the feature activation of the residual block output of each stage from bottom to top in the decoder. We denote the residual block outputs as {p1, p2, p3, p4} and the up-convolution blocks as {u1, u2, u3} (see Figure 1). To achieve inter-scale dense connections at the decoder (see Figure 2), each p in p4→u3, p3→u3, and p4→u2 is passed through a connection block (trilinear interpolation for up-sampling followed by a 1 × 1 × 1 convolution to reduce the number of channels) and fused with the corresponding u. The result is then fused again with the feature maps propagated through the skip connection in the same layer, achieving multi-stage fusion. The inter-scale dense connections effectively avoid the loss of deep semantic information caused by up-sampling and repeated convolutions. They can be written as

u_j = Γ_j(x_j, w_j) + Σ_i Θ_{j+i}(x_{j+i}, w_{j+i}), 1 ≤ j < L,

where L represents the number of network layers, Γ_j is the j-th de-convolution block in the decoder, x_j is its input feature, and w_j is a group of weights related to the de-convolution block; Θ_{j+i} is the connection block after the residual structure of the (j+i)-th layer, x_{j+i} is its input feature, and w_{j+i} is a group of weights related to the connection block.
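The connection block described above can be sketched as follows (the class name is ours; whether the paper fuses by addition or concatenation is not specified, so element-wise addition is assumed here for illustration).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConnectionBlock(nn.Module):
    """Illustrative inter-scale connection block: trilinear up-sampling to the
    target scale followed by a 1x1x1 convolution that reduces the channel
    count, so a deeper decoder feature p can be fused with a shallower u."""
    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.scale = scale
        self.reduce = nn.Conv3d(in_channels, out_channels, kernel_size=1)

    def forward(self, p):
        p = F.interpolate(p, scale_factor=self.scale, mode="trilinear",
                          align_corners=False)
        return self.reduce(p)
```

For a connection spanning two stages (e.g., p4→u2), `scale=4` would be used instead.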

Loss function
A combined loss function, composed of the binary cross-entropy loss and the dice loss [20], enables the network to be trained effectively on imbalanced categories:

L = α L_dice + (1 − α) L_bce,

where α is a weighting factor. The binary cross-entropy loss is

L_bce = −(1/N) Σ_i [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)],

and the dice loss is

L_dice = 1 − (2 Σ_i ŷ_i y_i) / (Σ_i ŷ_i + Σ_i y_i),

where ŷ represents the prediction result of the network and y represents the true label of the corresponding voxel.
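The combined loss can be sketched as below. This is a minimal version, assuming probabilities in [0, 1] as input; `alpha` is our stand-in symbol for the paper's weighting factor, consistent with α = 0 giving pure binary cross-entropy and α = 1 giving pure dice loss.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Soft dice loss: 1 - 2*|pred ∩ target| / (|pred| + |target|)."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def combined_loss(pred, target, alpha=0.7):
    """Combined loss L = alpha * L_dice + (1 - alpha) * L_bce."""
    bce = F.binary_cross_entropy(pred, target)
    return alpha * dice_loss(pred, target) + (1 - alpha) * bce
```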

Post-processing
In post-processing, the volume of each connected region is calculated. Small regions (less than 450 mm³) caused by classification errors are removed by this volume threshold, while larger disconnected vessel components are preserved, effectively reducing false positives in the segmentation results (see Figure 3).
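This volume-based filtering can be sketched with SciPy's connected-component labeling as follows (the function name and voxel-spacing convention are our own assumptions).

```python
import numpy as np
from scipy import ndimage

def remove_small_components(mask, spacing, min_volume_mm3=450.0):
    """Illustrative post-processing: label 3D connected components and drop
    those smaller than 450 mm^3, assumed to be classification noise.
    `spacing` is the voxel size (dz, dy, dx) in mm."""
    voxel_mm3 = float(np.prod(spacing))
    labels, n = ndimage.label(mask)
    # Voxel count per component; index 0 is the background, which is skipped.
    counts = np.bincount(labels.ravel())
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = counts[1:] * voxel_mm3 >= min_volume_mm3
    return keep[labels]
```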

Experimental environment and experimental data
The hardware used for the experiments is an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz and an NVIDIA Tesla T4 GPU (16 GB memory); the development tools are Python 3.7 and PyTorch.
The experimental data were selected from the public CT image dataset 3Dircadb, provided by the Research Institute against Digestive Cancer (IRCAD). The dataset contains 20 three-dimensional contrast-enhanced portal-venous-phase images with pixel spacing from 0.56 to 0.86 mm, slice thickness from 1 mm to 4 mm, 64 to 502 slices, and an in-plane resolution of 512 × 512. Twelve cases were manually selected for training and 8 for testing.

Parameter settings and training
The dropout parameter was set to 0.5, and the 3D deep supervision weight was initialized to 0.33 and decayed as training progressed. A standard Adam optimizer was selected for network training, with an initial learning rate of 0.0001. Considering the computing resources, the batch size was set to 1, and the final number of epochs was 35. Because the inter-scale dense connections were designed into the improved V-Net network, the probability map of the last output layer is used as the final segmentation result at prediction time. The training time of the model was about 10 h; the testing time over the 8 test datasets ranged from 9.58 to 26.53 s, with an average of 13.46 s.

Evaluation metrics
The following four evaluation metrics were selected: the dice coefficient (Dice), accuracy (Acc), sensitivity (Sen), and specificity (Spe):

Dice = 2TP / (2TP + FP + FN), Acc = (TP + TN) / (TP + TN + FP + FN), Sen = TP / (TP + FN), Spe = TN / (TN + FP),

where TP and TN are the numbers of voxels correctly classified as liver vessel and background, and FP and FN are the numbers of voxels incorrectly classified as liver vessel and background, respectively.
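These metrics can be computed directly from voxel-wise counts, as in the following sketch (the function name is ours):

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Dice, accuracy, sensitivity and specificity from voxel-wise
    TP/TN/FP/FN counts, following the standard definitions."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.sum(pred & target)
    tn = np.sum(~pred & ~target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "acc": (tp + tn) / (tp + tn + fp + fn),
        "sen": tp / (tp + fn),
        "spe": tn / (tn + fp),
    }
```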

Selection of the weighting factor
The selection of the weighting factor α in the combined loss function was analyzed, with α set to 0, 0.3, 0.5, 0.7, 0.9, and 1, respectively. Table 1 shows the effect of the combined loss function on the improved V-Net network's performance under different weighting factors. As Table 1 shows, the network performs best when α is 0.7. In particular, when α is 0 the loss reduces to the binary cross-entropy loss, and when α is 1 it reduces to the dice loss. Therefore, the weighting factor α was set to 0.7 in this experiment. As shown in Figure 4, the combined loss allows the network to segment even smaller liver vessels than the dice loss alone, although disparities remain compared with the annotated data. The combined loss function is therefore used to train the network in the following experiments.

Evaluation and comparison
Each improved method was tested on the 8 3Dircadb test datasets, and the results were post-processed. As shown in Table 2, after introducing the 3D deep supervision mechanism into the improved V-Net network, the average dice, sensitivity, accuracy, and specificity improved by 1.3%, 0.7%, 0.5%, and 0.2%, respectively. The 3D deep supervision mechanism can alleviate gradient vanishing or explosion during training, lets the network update parameters along different paths without interference, and helps the network learn discriminative features better. When inter-scale dense connections were introduced into the improved V-Net network, the average dice was 71.2%, sensitivity 74.8%, and accuracy and specificity 98.4% and 99.4%, respectively. Compared with the improved V-Net network alone, the average dice improved by 2.5% and sensitivity by 1.4%, which shows that this method effectively compensates for the loss of high-level semantic information caused by repeated up-sampling and convolution and achieves inter-scale feature fusion. As shown in Figure 5, this method extracts more thin, small liver vessels and further optimizes the segmentation results. Finally, the 3D deep supervision mechanism and the inter-scale dense connections were introduced into the network simultaneously. The final average dice, sensitivity, accuracy, and specificity on the test data were 71.6%, 75.4%, 98.5%, and 99.5%, respectively; without post-processing, they were 71.5%, 75.5%, 98.4%, and 99.5%. As shown in Table 3, the average sensitivity of our proposed method is slightly lower than that of method [9], but [9] is a semi-automatic segmentation method, and our other metrics are significantly higher than those of the comparison methods, indicating better segmentation performance.
As shown in Figure 6, the narrow liver vessels segmented by our method are closer to the real vessel contours, and the method is accurate and robust on images with high noise, low contrast, and varied intensity distributions. Kitrungrotsakul et al. [12] reported an average dice of 83%; although this value is high, unlabeled liver vessels were not extracted in their results. Our experiments show that the proposed method can extract liver vessels not labeled by experts, and these vessels have been confirmed by experts, as shown in Figure 7. Therefore, our evaluation results are closer to clinical reality than comparison results based on incomplete annotations. The proposed method is shown to extract liver vessels effectively and accurately; it can replace interactive liver vessel segmentation in clinical practice and assist surgical planning through three-dimensional visualization. In the future, we will also validate the proposed method on more vascular datasets, such as aortic vessels [21].

Conclusions
This paper proposes a method for automatically segmenting liver vessels from CT images based on an improved V-Net network. Rotation and mirroring operations are performed to augment the data. A combined loss function is used to improve the segmentation accuracy and sensitivity of liver vessels under category imbalance. Dilated convolutions are introduced into the network encoder to increase the receptive field while reducing down-sampling. The 3D deep supervision mechanism is introduced to speed up learning and improve the network's discrimination ability. In addition, inter-scale dense connections are designed into the network to effectively fuse multi-scale feature information. The final experimental results show that all metrics are significantly improved, and the additionally extracted vessels have been recognized by experts. The algorithm can automatically and accurately segment liver vessels with complex structures and low contrast with surrounding tissue from CT images.