
1 Introduction

In interventional endoscopy for pancreatic and biliary diseases, navigating the endoscope to specific gastrointestinal (GI) positions and orientations is critical for both diagnosis and treatment [4]. The endoscope’s small field of view and lack of visual orientation cues make this navigation task challenging, particularly for novice endoscopists [4]. Image guidance can support intra-procedural navigation and endoscopy training by revealing the wider anatomical context.

As the shapes and positions of abdominal organs vary widely between patients, patient-specific anatomical models of the GI tract and surrounding organs should enable more accurate alignment with intra-procedural imaging and may improve navigation performance. These models can be generated from segmented abdominal CT; however, integration into clinical workflows is only practical if the segmentation can be highly automated.

Multi-organ segmentation has been the subject of extensive study. The most common approaches, statistical shape models [2, 9] and multi-atlas label fusion [9, 12,13,14, 16], rely on registration to establish anatomical correspondence. However, interpatient image registration is less accurate for abdominal imaging than for other anatomical sites (e.g. brain), due to highly variable anatomy [14]. Thus, there is a need for automated multi-organ segmentation that does not rely on registration or shape-model fitting.

Deep-learning-based fully convolutional networks offer an approach to segment anatomy from voxel-based features directly. These networks have been successfully applied to segment individual organs from medical images, such as prostate [8] and pancreas [11]. They have also shown promise in abdominal CT for multi-organ segmentation [5] of the liver, spleen and kidney.

In this study, we present a fully convolutional network to segment the liver, pancreas, stomach and esophagus from abdominal CT. Such segmentations enable patient-specific 3D modelling of the GI tract and surrounding anatomy, providing a navigational reference for endoscopists. The network is trained and evaluated on 72 abdominal CT images from two centres, and directly compared to an existing approach based on multi-atlas image registration and label fusion.

2 Methods

Imaging. Abdominal CT images from two datasets were used in this study: 42 images from the Cancer Imaging Archive Pancreas-CT data set [3, 10, 11] and 30 images from the ‘Beyond the Cranial Vault’ segmentation challenge (doi:10.7303/syn3193805). For the ‘Beyond the Cranial Vault’ images, manual reference segmentations of the liver, pancreas, stomach and esophagus were available. For the Pancreas-CT images, manual reference segmentations of the pancreas were available; the liver, stomach and esophagus were interactively segmented using Matlab 2015b and ITK-SNAP 3.2 (http://itksnap.com), under the supervision of a board-certified radiologist with 8 years of experience in gastrointestinal CT and MRI interpretation. Images were cropped transversely to the ribcage and, in the inferior-superior direction, to the extent of the segmented organs.

Dense Dilated Convolutional Network Segmentation. The proposed segmentation used a fully-convolutional neural network with dilated convolution units with dense skip connections, described below and illustrated in Fig. 1a.

Dilated convolutions [17] use sparse convolution kernels allowing a large kernel spatial extent without increasing the number of learned parameters. Compared to using a cascade of downsampling layers [8], dilated convolutions maintain a high-resolution representation of non-local non-linear image features deeper in the network. This is particularly important in organ segmentation where thin structures (e.g. thin liver-adjacent stomach walls) must be inferred based on high-level information (e.g. adjacent stomach and liver tissue). Dilated convolutions can be implemented efficiently by reordering input data in memory (Fig. 1b) and convolving with a corresponding non-sparse convolution, leveraging efficient algorithms and hardware-support. Each convolutional layer is followed by a batch normalization and a rectifier linear unit defining a structure referred to hereafter as a convolutional unit.
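
To illustrate the reordering equivalence sketched in Fig. 1b, the following one-dimensional NumPy example (an illustration only; the network itself uses 3D convolutions) checks that a dilated valid convolution equals ordinary convolutions applied to the interleaved phases of the reordered input:

```python
import numpy as np

def correlate_valid(x, w):
    """Ordinary (non-dilated) valid cross-correlation."""
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def dilated_correlate_valid(x, w, d):
    """Valid cross-correlation with kernel taps spaced d samples apart."""
    span = (len(w) - 1) * d + 1
    n = len(x) - span + 1
    return np.array([np.dot(x[i:i + span:d], w) for i in range(n)])

rng = np.random.default_rng(0)
x, w, d = rng.standard_normal(20), rng.standard_normal(3), 2

# Direct dilated convolution.
y_direct = dilated_correlate_valid(x, w, d)

# Equivalent computation: split the input into d interleaved phases
# (the memory reordering of Fig. 1b), run an ordinary convolution on
# each phase, then interleave the per-phase outputs again.
phases = [correlate_valid(x[p::d], w) for p in range(d)]
y_reordered = np.empty_like(y_direct)
for p, yp in enumerate(phases):
    y_reordered[p::d] = yp[:len(y_reordered[p::d])]

assert np.allclose(y_direct, y_reordered)
```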

The network uses an initial convolutional feature layer, 8 convolutional units, a segmentation layer and a spatial prior. The feature layer applies 25 \(5^3\) convolution kernels with stride 2, outputting 25 feature maps together with a downsampled copy of the image. The convolutional units use \(3^3\) kernels with dilation scales of 1, 1, 2, 2, 4, 4, 2 and 1, each outputting 20 feature maps. The segmentation layer outputs one logit map for each label (liver, pancreas, stomach, esophagus, and other).

The convolutional units have dense skip connections [6]; i.e., the input to each unit is the concatenated output of all previous units. This enables efficient use of intermediate features, as intermediate layers do not need to re-encode information from previous layers. Additionally, like shortcut connections in residual networks, the dense skip connections allow effective propagation of gradients through the network and combine multiple network depths within the same network.

Finally, we introduce a new spatial prior map, added to the segmentation layer output. Spatial priors are better suited to medical images than to natural images because medical images are commonly acquired in standard anatomically aligned views. The map comprised a \(12^3\) block of trainable parameters which was upsampled and added to the logit maps. This is analogous to the logit maps representing the log-probability of the input given the class label, the spatial prior map representing the log-probability of the class label at given image coordinates, and the resulting output representing the posterior probability of the class label; however, the spatial map parameters were learned (per fold) during training using gradient descent and may not represent true prior probabilities.
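
For concreteness, a minimal PyTorch sketch of the architecture described above follows. The layer counts, dilation schedule and \(12^3\) prior size are taken from the text; the class names, the trilinear upsampling of the prior and the omission of the downsampled-image pathway are assumptions for illustration rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class ConvUnit(nn.Module):
    """Dilated 3D convolution -> batch normalization -> ReLU."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm3d(out_ch)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))

class DenseDilatedSegNet(nn.Module):
    """Sketch: feature layer, 8 densely connected dilated units,
    segmentation layer and a trainable spatial prior."""
    def __init__(self, n_labels=5, n_features=25, unit_ch=20,
                 dilations=(1, 1, 2, 2, 4, 4, 2, 1), prior_size=12):
        super().__init__()
        # Initial feature layer: 5^3 kernels with stride 2.
        self.features = nn.Conv3d(1, n_features, kernel_size=5,
                                  stride=2, padding=2)
        units, in_ch = [], n_features
        for d in dilations:
            units.append(ConvUnit(in_ch, unit_ch, d))
            in_ch += unit_ch  # dense skips: inputs are concatenated outputs
        self.units = nn.ModuleList(units)
        self.segment = nn.Conv3d(in_ch, n_labels, kernel_size=1)
        # Trainable 12^3 spatial prior, upsampled and added to the logits.
        self.prior = nn.Parameter(torch.zeros(1, n_labels,
                                              *([prior_size] * 3)))

    def forward(self, x):
        feats = [self.features(x)]
        for unit in self.units:
            feats.append(unit(torch.cat(feats, dim=1)))
        logits = self.segment(torch.cat(feats, dim=1))
        prior = nn.functional.interpolate(
            self.prior, size=logits.shape[2:],
            mode='trilinear', align_corners=False)
        return logits + prior  # per-voxel logits for the 5 labels
```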

Training used the Adam optimiser, minimizing a loss that combined per-organ Dice scores with L2 weight regularization. After training, the label with the maximum voxel-wise softmax probability was chosen. Segmentations were post-processed by eliminating, for each organ, every connected component comprising <10% of the total label volume, and upsampling to the original resolution.
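
A sketch of the training loss and the connected-component post-processing is given below; the soft Dice formulation, the explicit L2 penalty and its weight are assumptions consistent with, but not specified by, the text.

```python
import numpy as np
import torch
from scipy import ndimage

def soft_dice_loss(logits, target_onehot, eps=1e-5):
    """Mean per-organ soft Dice loss computed from softmax probabilities.
    logits / target_onehot: (N, C, D, H, W) tensors."""
    probs = torch.softmax(logits, dim=1)
    dims = (0, 2, 3, 4)                       # sum over batch and space
    intersect = (probs * target_onehot).sum(dims)
    denom = probs.sum(dims) + target_onehot.sum(dims)
    dice_per_class = (2 * intersect + eps) / (denom + eps)
    return 1.0 - dice_per_class.mean()

def total_loss(logits, target_onehot, model, weight_decay=1e-5):
    """Dice loss plus an explicit L2 penalty on the network parameters."""
    l2 = sum((p ** 2).sum() for p in model.parameters())
    return soft_dice_loss(logits, target_onehot) + weight_decay * l2

def remove_small_components(mask, min_fraction=0.1):
    """Remove connected components smaller than 10% of an organ's volume."""
    labelled, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labelled, index=range(1, n + 1))
    keep = [i + 1 for i, s in enumerate(sizes)
            if s >= min_fraction * mask.sum()]
    return np.isin(labelled, keep)
```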

Fig. 1. (a) Network architecture; (b) Dilated convolutions (top) generate the same output as more efficient regular convolutions on reordered input data (bottom).

Multi-atlas-based Segmentation (for Comparison). Our proposed algorithm is directly compared to an existing algorithm using multi-atlas registration and label fusion. First, multiple atlas images were registered to the input image using NiftyReg (http://niftk.org/niftyreg) to maximize normalized mutual information under affine then B-spline transformations. Then, transformed reference labels were combined using two fusion algorithms – majority voting, and joint label fusion [15] – yielding two sets of segmentations. Majority voting is a fast fusion algorithm where the segmentation labels are the voxel-wise modes of the transformed segmentation labels, and was implemented in Matlab. Joint label fusion is a statistical fusion algorithm where the segmentation label is the weighted average of the transformed labels, with weights computed based on the local image similarity between the transformed atlas and input images, while modelling correlations between atlas images. Joint label fusion, implemented in the publicly available PICSL Multi-Atlas Segmentation Tool (https://www.nitrc.org/projects/picsl_malf), and its variants have yielded the highest performance in MICCAI multi-atlas labeling grand challenges in 2012 [7] and 2015. Default parameters were used except for a B-spline control point spacing of 10 mm (instead of 5 voxels) to allow for anisotropic voxels. Segmentations were post-processed as in the previous section.
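
As a reference for the fusion step, a minimal NumPy sketch of majority voting is shown below (joint label fusion itself is left to the cited PICSL tool); breaking ties toward the lower label index is an assumption.

```python
import numpy as np

def majority_vote(transformed_labels):
    """Voxel-wise mode of registered atlas label maps.
    transformed_labels: (n_atlases, D, H, W) integer volumes already
    warped into the target image space."""
    labels = np.asarray(transformed_labels)
    n_classes = int(labels.max()) + 1
    # Count the votes for each class at each voxel, then take the arg-max
    # (ties resolve to the lowest label index).
    votes = np.stack([(labels == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)
```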

Evaluation. The segmentation algorithms were evaluated with an 8-fold cross-validation over all 72 subjects. In each fold, we compared the segmentation of each organ in each test image from each algorithm to the reference segmentation using 3 metrics: Dice coefficient – \(2|A\cap B|/(|A|+|B|)\); symmetric mean boundary distance – \((\overline{D(A,B)}+\overline{ D(B,A)})/2\); and symmetric 95% Hausdorff distance – \((percentile(D(A,B),95)+percentile(D(B,A),95))/2\); where A is the algorithm segmentation, B is the reference segmentation, \(\varOmega _A\) is the set of boundary pixels of A, and \(D(A,B)=\{\min \limits _{x\in \varOmega _B}\left| \left| x-y\right| \right| |y\in \varOmega _A\}\) is the set of boundary distances from \(\varOmega _A\) to \(\varOmega _B\). The Dice coefficient reflects the voxel-wise overlap. The mean boundary and 95% Hausdorff distances reflect the agreement between segmentation boundaries, with the latter being more sensitive to localized disagreements.
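
The three metrics can be computed from boolean masks as in the sketch below; using a Euclidean distance transform over the boundary maps is an implementation choice for illustration, not necessarily the one used in the study.

```python
import numpy as np
from scipy import ndimage

def boundary(mask):
    """Boundary voxels: the mask minus its one-voxel erosion."""
    return mask & ~ndimage.binary_erosion(mask)

def surface_distances(a, b, spacing=(1.0, 1.0, 1.0)):
    """D(A,B): distances from each boundary voxel of a to the boundary of b."""
    dt_to_b = ndimage.distance_transform_edt(~boundary(b), sampling=spacing)
    return dt_to_b[boundary(a)]

def segmentation_metrics(a, b, spacing=(1.0, 1.0, 1.0)):
    """Dice, symmetric mean boundary distance and symmetric 95% Hausdorff
    distance between boolean masks a (algorithm) and b (reference)."""
    dice = 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())
    d_ab = surface_distances(a, b, spacing)
    d_ba = surface_distances(b, a, spacing)
    mean_bd = 0.5 * (d_ab.mean() + d_ba.mean())
    hd95 = 0.5 * (np.percentile(d_ab, 95) + np.percentile(d_ba, 95))
    return dice, mean_bd, hd95
```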

We compared the three algorithms for each organ and metric using Friedman tests (non-parametric repeated-measures ANOVA) with Benjamini–Hochberg false-discovery rate multiple comparison correction (\(\alpha =0.05\)) for pairwise tests.
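
A sketch of this comparison for one organ and metric is given below; the choice of Wilcoxon signed-rank tests for the pairwise post-hoc step is an assumption, as the text does not name the pairwise test.

```python
from scipy import stats
from statsmodels.stats.multitest import multipletests

def compare_algorithms(scores_a, scores_b, scores_c, alpha=0.05):
    """Friedman test over three paired score vectors (one value per test
    subject and algorithm), then pairwise post-hoc tests with
    Benjamini-Hochberg false-discovery-rate correction."""
    _, p_friedman = stats.friedmanchisquare(scores_a, scores_b, scores_c)
    pairs = [(scores_a, scores_b), (scores_a, scores_c), (scores_b, scores_c)]
    p_pairwise = [stats.wilcoxon(x, y).pvalue for x, y in pairs]
    reject, p_adjusted, _, _ = multipletests(p_pairwise, alpha=alpha,
                                             method='fdr_bh')
    return p_friedman, p_adjusted, reject
```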

3 Results

The median and quartiles of the segmentation evaluation metrics are reported in Table 1, with representative segmentations shown in Fig. 2.

Majority voting consistently underperformed the other algorithms, and failed to correctly identify any voxels as pancreas in the majority of subjects. The deep-learning-based algorithm yielded more accurate segmentations than the joint-label-fusion algorithm for the smaller organs – pancreas, stomach and esophagus – showing significantly higher Dice scores for the pancreas (median 66 vs 37), stomach (median 83 vs 72), and esophagus (median 73 vs 54) and significantly lower boundary distances for the pancreas and esophagus, as determined by the Friedman tests. Conversely, the joint-label-fusion algorithm yielded statistically significantly more accurate segmentations for the liver by all three measures, although the differences in median values were small. As seen in Fig. 2, both label fusion methods frequently under-segmented the pancreas, suggesting challenges in consistently registering this thin organ with variable abdominal position [14].

Fig. 2. Posterior view of four segmentations with Dice scores closest to the median (1 & 2), and to the 75th and 25th percentiles (3 & 4). Liver (red), pancreas (green), stomach (yellow) and esophagus (cyan) segmentations were generated, from top to bottom, by manual segmentation, deep learning, joint label fusion and majority voting methods.

Table 1. Segmentation metrics from the cross-validation (Median [first, third quartile])

4 Discussion

This paper presents a deep-learning-based algorithm to segment liver, pancreas, stomach and esophagus on abdominal CT, while avoiding challenging inter-patient registration of abdominal organs.

Endoscope navigation through the gastrointestinal tract could benefit from segmentations of multiple gastrointestinal and surrounding organs. Many previous studies have proposed methods for multi-organ segmentation of abdominal CT, principally based on multi-atlas segmentation [9, 12,13,14, 16] or statistical shape models [2, 9]. Organs surrounding the GI tract, such as the liver and pancreas, are included in many of these studies, but esophagus and stomach segmentation has received little attention, likely due to the lack of available reference segmentations. Liver segmentation has consistently yielded higher Dice scores (82–95) than other anatomy (pancreas [45–74], stomach [10–87], esophagus [36–43]). Dice scores from some previous studies are given in Table 2.

Table 2. Dice scores for previous abdominal CT multi-organ segmentation methods. Different data sets and segmentation of unlisted organs preclude direct comparisons.

As in previous work, observed Dice scores were substantially higher for the liver than for the other smaller organs. While noting that quantitative metrics for algorithms evaluated on different data sets are not directly comparable, the proposed segmentation yielded stomach and esophagus segmentations with Dice scores higher than previous studies, and liver and pancreas segmentations with Dice scores in the range observed in previous studies. Notably, compared to the two previous studies generating segmentations of all four organs segmented in this work, Dice scores for deep-learning-based esophagus, stomach and pancreas segmentations were higher, and for liver segmentations were within 2%.

The dilated convolutions with dense skip connections in our network address two key challenges in information and gradient propagation in deep convolutional networks for segmentation: (1) using both local and distant image information and (2) using both low- and high-resolution image information. While large convolutional kernels enable the propagation of high-resolution local and distant image information, larger kernels result in higher parameter counts (particularly for 3D convolutions), making learning more challenging and increasing the risk of over-fitting. A second approach is to use spatial pooling or down-sampling layers so that small convolution kernels (with low parameter counts) have a large effective spatial extent, followed by upsampling or transpose convolutions to regain the high-resolution representation needed for segmentation. This approach limits the propagation of high-resolution information through the network, motivating skip connections between early high-resolution representations and later upsampled representations. Our architecture avoids this by maintaining high-resolution representations throughout the network and using dilated convolutions to propagate high-resolution information at large spatial scales.
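
To make the trade-off concrete, the short calculation below (which ignores the initial \(5^3\) stride-2 feature layer) shows that the dilation schedule from the Methods section reaches a 35-voxel receptive field per axis on the downsampled grid, while each unit keeps the parameter count of an ordinary \(3^3\) kernel:

```python
# Each 3-voxel kernel with dilation d widens the receptive field by 2*d per
# axis, while the number of weights per unit is independent of d.
dilations = (1, 1, 2, 2, 4, 4, 2, 1)
receptive_field = 1 + 2 * sum(dilations)   # 35 voxels per axis (downsampled grid)
weights_per_channel_pair = 3 ** 3          # 27 weights, regardless of dilation
print(receptive_field, weights_per_channel_pair)
```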

Our conclusions should be qualified by some limitations. Algorithm parameters were not extensively optimized for this application; the reported performance of both the proposed and comparison algorithms may therefore underestimate their potential. The evaluation metrics measure segmentation fidelity with respect to the manual reference, not the clinical utility of the resulting segmentations for aiding endoscopic navigation. Future work will evaluate whether the proposed algorithm is already sufficiently accurate to provide a 3D patient-specific anatomical reference for endoscopic navigation. Finally, to fully support guidance for endoscopy in the gastrointestinal tract, segmentations of the duodenum, gallbladder, left kidney and vasculature would be a valuable addition; sufficiently large reference segmentation data sets for these structures are not yet available.

Dilated convolutional networks with dense skip connections can segment the liver, pancreas, stomach and esophagus in abdominal CT without image registration. Our proposed method achieved lower boundary distance errors (for pancreas and esophagus) and higher overlap (for pancreas, esophagus and stomach) with manual segmentations than a recent multi-atlas label fusion algorithm. Such automatically generated segmentations of abdominal anatomy have the potential to support image-guided navigation in pancreatobiliary endoscopy procedures.