An end-to-end approach to segmentation in medical images with CNN and posterior-CRF

Conditional Random Fields (CRFs) are often used to improve the output of an initial segmentation model, such as a convolutional neural network (CNN). Conventional CRF approaches in medical imaging use manually deﬁned features, such as intensity to improve appearance similarity or location to improve spatial coherence. These features work well for some tasks, but can fail for others. For example, in medical image segmentation applications where different anatomical structures can have similar intensity values, an intensity-based CRF may produce incorrect results. As an alternative, we propose Posterior-CRF , an end-to-end segmentation method that uses CNN-learned features in a CRF and optimizes the CRF and CNN parameters concurrently. We validate our method on three medical image segmentation tasks: aorta and pulmonary artery segmentation in non-contrast CT, white matter hyperintensities segmentation in multi-modal MRI, and ischemic stroke lesion segmentation in multi-modal MRI. We compare this with the state-of-the-art CNN-CRF methods. In all applications, our proposed method outperforms the existing methods in terms of Dice coeﬃcient, average volume difference, and lesion-wise F1 score. © 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.


Introduction
After the breakthrough of deep learning in computer vision ( Krizhevsky et al., 2012;He et al., 2016;Long et al., 2015 ), deep convolutional neural networks (CNNs) and their variants ( Ronneberger et al., 2015;Çiçek et al., 2016;Kamnitsas et al., 2017 ) quickly started to dominate medical image segmentation, outperforming traditional machine learning methods in many applications ( Yu et al., 2016;Bakas et al., 2018;Kuijf et al., 2019;Maier et al., 2015 ).To refine the prediction from the CNN, it is common to combine CNN with a conditional random field (CRF) ( Krähenbühl and Koltun, 2011 ).By modeling pairwise relationships and interactions between voxel-wise variables over the whole image, the CRF can improve the coherence of the segmentation.In previous work, CRFs based on predefined features such as intensity similarity and spatial coherence have been used as an efficient post-processing technique or trained in an end-to-end manner in a recurrent neural network to refine the CNN outputs ( Chen et al., 2017;Dou et al., 2017;Kamnitsas et al., 2017;Zheng et al., 2015 ).
Most often, a CRF uses a combination of voxel intensity and voxel location as pairwise potentials.Although this works well in several computer vision applications ( Zheng et al., 2015;Schwing and Urtasun, 2015 ), there can be challenges in other applications.The approach assumes that voxels that have similar intensity and are close to each other in the image are likely to belong to the same class.There are many applications among others in medical image analysis in which this assumption does not hold.For example, the intensity-based features of the CRF are not sufficient for problems where the intensity is not informative enough to identify object boundaries, such as the artery segmentation problem in Fig. 2 a.The spatial component of the CRF, on the other hand, requires extra careful tuning when the CRF is applied to data with isolated small objects, such as the white matter hyperintensities in Fig. 2b, which may be erroneously removed by excessive smoothing.In stroke lesion segmentation, a large appearance difference between lesion objects of the same class also goes against the CRF assumption that the same class objects should have similar intensity (see Fig. 2 c).
In this paper, we propose Posterior-CRF , a new learning-based CRF approach for image segmentation that allows the CRF to use features learned by a CNN, optimizing the CRF and CNN parameters concurrently.The learning-based CRF makes the CNN features update to work best with CRF in an end-to-end manner.During training, the CRF inference works in the CNN feature space, which is more likely to contain useful high-level features for segmentation compared to the original intensity values.
We demonstrate our method in three medical image analysis applications.Our first application is the segmentation of the aorta and pulmonary artery in non-contrast, non-ECG-gated chest CT scans.In these images, the aorta and the pulmonary artery share similar intensity values, which goes against the CRF assumption that similar classes should share similar intensity ( Sedghi Gamechi et al., 2018;Xie et al., 2014 ).The boundaries between the objects are not recognizable by intensity alone, making a standard CRF less effective ( Fig. 2 a).Our second application is the segmentation of white-matter hyperintensities in brain MRI.These small objects are sparsely distributed in the brain (see Fig. 2 b) and may be removed by the CRF, which optimizes for the spatial coherence of segmentation.Our third application is the segmentation of ischemic stroke lesions in brain MRI, which have very heterogeneous intensities and shapes within the same lesion class ( Fig. 2 c).

Contributions
1. We present a new end-to-end trainable algorithm for image segmentation called Posterior-CRF using learnable features in CRF pairwise potentials.We explore how the proposed method affects CNN learning during training.
2. We compare the performance of a fully-connected CRF in several settings: post-processing, end-to-end training with predefined features, and end-to-end training with learned features.Ablation experiments are conducted to investigate the influence of CRF parameters and which level of the CNN feature maps are more likely to benefit the CRF inference.We found that the features in the last CNN feature maps provide a more consistent improvement than features in early CNN layers and predefined intensity features.
3. We evaluate our methods in three applications: aorta and pulmonary artery segmentation in non-contrast CT, which can be used to compute important biomarkers such as the pulmonary artery to aorta diameter ratio ( Sedghi Gamechi et al., 2018 ); white matter hyperintensities segmentation in multi-sequence MRI, which is of key importance in many neurological research studies ( Kuijf et al., 2019 ); and ischemic stroke lesion segmentation in multi-sequence MRI, which can provide biomarkers for stroke diagnosis ( Maier et al., 2015 ).In the experiments, the proposed Posterior-CRF outperforms CNN without CRF, post-processing CRF, end-to-end intensity-based CRF, and end-to-end spatial-based CRF.
A preliminary version of this work, focused on a single application and with less validation, appeared as an extended abstract in ( Chen and de Bruijne, 2018 ).

End-to-end training of CRF and CNN
CRF is widely used as an efficient post-processing method to refine the output of CNN segmentation models (for example, Chen et al., 2017;Dou et al., 2017;Kamnitsas et al., 2017 ).However, applying a CRF as post-processing means that the CNN is not able to adapt its output to the CRF.Zheng et al. (2015) proposed to optimize CNN and CRF jointly by reformulating the CRF inference as a recurrent neural network (RNN) operation, such that the CRF weights can be learned together with the CNN.This approach makes the unary potentials and the kernel weights in pairwise potentials trainable, which saves the computational cost of grid search for other approaches to tune these weights, although the CRF still works in the predefined fixed feature space.In this paper, we focus on a new CRF approach where the CRF inference works in a learning-based CNN feature space.( Chen et al., 2017;Kamnitsas et al., 2017 ); (b) End-to-end training CRF with predefined features ( Zheng et al., 2015 ); (c) Proposed Posterior-CRF, which uses CNN feature maps as CRF reference maps.Best viewed in color with zoom.

Locally-connected CRFs with learned potentials
While conventional CRFs use predefined Gaussian edge potentials, the potentials can also be learned through a neural network.Vemulapalli et al. (2016) learn the pairwise potentials of a Gaussian CRF in a bipartite graph structure.This approach uses a simpler continuous CRF model which provides better convergence of mean-field inference than the conventional discrete CRF models.In this paper, we focus on the most widely used discrete CRF model which is a natural fit for the dense segmentation problem.Lin et al. (2016) ; Li and Ping (2018) and Wang et al. (2018a) learn pairwise CRF potentials to model patch-wise (or local) relationships using free form functions learned by neural network rather than a combination of predefined Gaussians to calculate the pairwise potentials.The patch-wise potentials provide a better ability to model the semantic compatibility between image regions and have different effects compared to our approach, where we do not consider patch-wise relationships.Our method uses traditional Gaussian edge potentials ( Krähenbühl and Koltun, 2011 ) similar to Zheng et al. (2015) which are easier to compute in a fullyconnected manner.Unlike Zheng et al., we derive the potentials from the feature space learned by a CNN.This allows us to model global interactions between voxel-wise variables using learningbased features.

Other methods related to CRF
Next to CRF, there are several other approaches that aim to model interactive relationships or add global information to neural networks.Graph neural networks (GNN) ( Scarselli et al., 2008;Selvan et al., 2018 ) model interactions between variables by applying graph convolution filters, which allow them to learn global relationships between voxels.We further address GNN in the Discussion.The recently proposed non-local CNN ( Wang et al., 2020 ) uses layer-wise self-attention ( Vaswani et al., 2017;Wang et al., 2018b;Yuan et al., 2019 ) to make each layer in the network focus on the areas that encoded the most non-local information in the preceding layer.While this allows non-local CNNs to model long-range dependencies, they are unable to model the interactions that can be learned by a CRF or GNN.In this paper, we focus on the fullyconnected CRF model which is an efficient approach of modeling both interactive relationships and global information.

Methodology
Our method consists of two parts that are optimized jointly: 3D CNN and 3D CRF.In Section 3.1 , we describe the CNN model, which provides unary potentials for the CRF inference as well as features for the pairwise potentials for the proposed Posterior-CRF.Then we introduce the CRF in Section 3.2 .We show two previously proposed ways to perform CRF inference using predefined features: post-processing ( Section 3.3.1 ) and end-to-end training with predefined features ( Section 3.3.2).Our proposed end-to-end training with learned features is presented in Section 3.4 , followed by Section 3.4.1 about the back-propagation of the proposed learningbased CRF.The mean-field inference algorithm used in the proposed method is explained in Appendix Section 8.

CNN Model
Our CNN model is based on UNet ( Ronneberger et al., 2015 ), the most widely used network architecture for medical image segmentation.It has a multi-scale design with skip-connections that connect the encoding and decoding parts of the network, which allow the decoding path to use the early, high resolution feature maps without losing information through pooling.We use 3D UNet as the basic CNN architecture to provide the unary potentials for CRF inference as well as features for the pairwise potentials for the proposed Posterior-CRF.Details of the network layout used in our experiments are given in Fig. 3 .

Conditional random fields
In this section, we describe the CRF as proposed in ( Krähenbühl and Koltun, 2011 ).In image segmentation, a CRF models voxel-wise variable x i taking values in { 1 , . . ., C} as a set of random variables X = { x 1 , . . ., x N } , where C is the number of classes and N is the number of voxels in the image.During training, x i is converted into a soft classification vector of length C, indicating for each class the probability that the i th voxel belongs to that class, with the L 1 norm | x | = 1 .x i obey a Markov property conditioned on a global observation, the image I consisting of variables I = { I 1 , . . ., I N } .In this paper, I is the observed 3D CT/MRI scans, with its length given by the number of imaging modality channels M times the number of voxels per channel N.
Consider a fully-connected pairwise CRF model ( X , I ) characterized by a prior Gibbs distribution: where i and j range from 1 to N. The first term ϕ u (x i ) in Eq. ( 2) is the unary potential, which in our case is the current C length vector of voxel i representing the class probabilities in the CNN posterior probability maps.The second term ϕ p (x i , x j ) is the pairwise potential: where μ(x i , x j ) is the label compatibility function that describes the interactive influences between different pairs of classes, ω m is the linear combination weight of different pre-defined kernels k m and K is the total number of kernels.Each k m is a modified Gaussian kernel with specific feature vector f : The feature vector f is defined from S arbitrary feature spaces.is a symmetric positive-definite precision matrix that defines the shape of each kernel.In semantic segmentation, typically a combination of intensity ( I) and position features ( p) has been used ( Krähenbühl and Koltun, 2011;Zheng et al., 2015;Kamnitsas et al., 2017 ): where the first kernel controlled by ω 1 is called appearance kernel and the second kernel controlled by ω 2 is called smoothness kernel .The parameters θ α , θ β and θ γ control the influence of the corresponding feature spaces.The appearance kernel is inspired by the observation that nearby voxels with similar intensity are likely to be in the same class, while voxels that are either further away or have larger intensity difference are less likely to be in the same class.The smoothness kernel can remove isolated regions and produce smooth segmentation results ( Krähenbühl and Koltun, 2011;Kamnitsas et al., 2017 ).Note that the position feature appears in both appearance kernel and smoothness kernel, where spatial information has different contributions to each of the two kernels, depending on the spatial standard deviations θ α and θ γ .

CRF with predefined features
Conventional CRFs use predefined features, such as the image intensity and spatial position shown in Eq. ( 6) .These features are commonly used in CRFs to encourage intensity and spatial coherence, based on the assumption that voxels that have a similar intensity or are close together are likely to belong to the same class.
We evaluate two state-of-the-art approaches to combine CRFs with predefined features with a CNN: 1. Apply the CRF as postprocessing to refine the CNN outputs ( Section 3.3.1 ); 2. Implement the CRF as a neural network layer that can be trained together with the CNN in an end-to-end manner ( Section 3.3.2).

CRF as post-processing
After we train a CNN model and get its predictions, we can apply CRF as a post-processing method to refine the results ( Chen et al., 2017 ).We refer to this method as Postproc-CRF ( Fig. 1 a).

End-to-end training CRF
The CNN and CRF can be combined more elegantly by optimizing them together in an end-to-end manner ( Zheng et al., 2015 ) ( Fig. 1 b), which allows the CRF to influence the CNN optimization.The end-to-end CRF uses the same pairwise potentials as that in the post-processing CRF ( Eq. ( 6) ).We refer to this variant as Intensity-CRF .
To investigate the spatial term in the end-to-end CRF, we can also use only the position features as the CRF feature space, which means that the CRF layer will only encourage nearby voxels to have the same class.We implement this CRF by setting the weight of the appearance kernel ω 1 to zero and make it not trainable.We refer to this method as Spatial-CRF .

Proposed CRF with learning-based features
Our proposed CRF uses a learning-based feature space.We replace the intensity feature vector I in the CRF kernel ( Eq. ( 6) ) with the new feature vector F (I ) from the CNN feature maps.The information in these CNN feature maps differs per level: in the first level of UNet the feature maps contain information close to the intensity, while in the last level of the UNet they contain more context for each voxel and potentially more class-discriminative information.
We refer to the CRF that uses features learned by CNN as feature-learning-based CRF (see Fig. 1 c) and refer to the specific form of CRF using the features in the last CNN softmax layer as Posterior-CRF (see Fig. 3 ).
Unlike the CRFs with predefined features, our CRF takes CNN feature maps as the reference maps and updates the random field X based on F (I ) instead of on I directly.Compared to the original CRF pairwise potential in Eq. ( 6) , the feature I is replaced with F (I ) and the new pairwise potential becomes:

Back-propagation of the learning-based CRF
The back-propagation of the proposed end-to-end featurelearning-based CRF is shown in Fig. 4 .There are five steps within one optimization iteration.Steps 1 ∼3 are the forward process that generates the output of the CNN.In the 4th step, CRF weights will adapt to the outputs calculated by the reference maps and unary maps, both given by CNN feature maps before back-propagation.In the 5th step, CNN weights are updated to provide new unary maps and reference maps for CRF for the next iteration.When the optimization converges, both CNN and CRF weights become stable close to their optimal values.Note that the mean-field inference in CRF happens in the forward process (after step 2 and before step 3) and thus contributes to the gradient updates of both CNN and CRF weights.The derivation of the mean-field inference gradient is omitted due to the length of the paper and can be found in Section 4.2 of the paper by Zheng et al. (2015) .

Experiments
In this section, we present experiments to evaluate the proposed method and compare it to the baseline methods: 3D UNet, Post-processing CRF, Intensity-CRF, and Spatial-CRF.Implementation details are discussed in Section 4.1 , followed by the experimental settings ( Section 4.2 ), the description of the datasets and pre-processing ( Section 4.3 ), data augmentation and training details ( Section 4.4 ) and evaluation metrics ( Section 4.5 ).

CNN Implementation
We implement all the algorithms in the TensorFlow framework.The detailed CNN architecture for the experiments is shown in Fig. 3 .All convolution layers use ReLU as the activation function except for the last output layer, which uses softmax to produce the final probability maps.For a fair comparison, the 3D UNet architecture that is tuned for the CNN baseline method is applied to all the CRF methods in Table 3 .The 5-layer depth of UNet (tuned from 3 to 6) and 32 base feature maps (tuned from 8 to 64) are tuned based on all three datasets.
All segmentation models are optimized by minimizing the Dice loss ( Isensee et al., 2020 ): where v c i is the predicted probability that voxel i belongs to the cth class.u j i is the true label.The loss is minimized using the Adam optimizer ( Kingma and Ba, 2014 ).

CRF Implementation
In CRF, mean-field approximation can be used to calculate the maximum a posteriori probability (MAP) of the inference.We use an efficient approximation algorithm for mean-field inference ( Krähenbühl and Koltun, 2011;Monteiro et al., 2018 ) built on a fast high-dimensional filtering using the permutohedral lattice ( Adams et al., 2010 ) that allows voxel-wise fully-connected CRF to be iteratively computed in linear time.For a fair comparison, all the CRF methods in this paper are implemented in 3D fully-connected manner.The codes are publicly available: https://github.com/ShuaiChenBIGR/Posterior-CRF.

Post-processing CRF
For Postproc-CRF , we fix the label compatibility μ in Eq. ( 6) to the identity matrix, which means that the CRF does not model label-specific interaction.In the case of multi-modal input, each imaging modality has a specific θ β to control the strength of the intensity term.

End-to-end CRF with predefined features
We consider two forms of end-to-end CRFs with predefined features: Intensity-CRF uses intensity of the input image I and position information as its feature space.Spatial-CRF uses only the position information (the smoothness term in Eq. ( 6) ).The label compatibility is a C × C parameter matrix which is optimized during training to allow the CRF to learn the label compatibility automatically.The weights ω 1 of the appearance kernel for Intensity-CRF and ω 2 of the spatial kernel for Spatial-CRF are C × C matrices, which we restrict to diagonal matrices because the relationship between classes is already covered by the label compatibility matrix.Inner product is calculated by multiplying the matrices.For simplicity, only one θ β is applied for all modalities.

End-to-end CRF with learned features
The proposed Posterior-CRF uses the last softmax layer of the CNN as its reference map.The hyperparameters are the same as end-to-end CRF with predefined features.Note that Posterior-CRF is a special case of the feature-learning-based CRF.We can also use early CNN feature maps as CRF reference maps.An ablation study investigating other CRF variants can be seen in Section 5.4 .

CRF Parameters
Parameters in the post-processing CRF for each dataset were obtained by grid search on the validation set and are shown in Table 1 .We computed results with 500 different configurations of Postproc-CRF on each dataset for grid-search.Parameters in the end-to-end CRFs ( Intensity-CRF, Spatial-CRF, Posterior-CRF ) are initialized with the same values as were used in post-processing CRF.Although the end-to-end CRF approaches have the ability to learn CRF weights automatically during training, we initialize all CRF approaches in the same way to facilitate visualization of the evolution of CRF parameters during training (see Fig. 5 ).We study the sensitivity to different CRF parameter initializations in Section 5.3 .
The initial label compatibility matrix is set to an identity matrix and can be optimized during training.In the multi-modality case, the initial value of θ β is averaged over all modalities.The initial values for each dataset are shown in Table 2 .

Computation costs of CRF
The training and testing time of the proposed CRF method is the same as Intensity-CRF but a bit slower than Spatial-CRF, since there is no bilateral term in Spatial-CRF.Although the proposed CRF uses CNNs features to compute the pairwise potential, the gradients only flow through the unary map path but not the reference map path which is the same as that in traditional Intensity-CRF.Therefore, there is no additional time and memory cost of the proposed method compared to traditional end-to-end CRF approaches with fixed feature space.Post-processing CRF is after the CNN training and takes more time for inference compared to the end-to-end CRFs, since the inference is done by CPU but not GPU.

Datasets and preprocessing
We evaluate the proposed method on three segmentation problems: CT arteries, MRI white matter hyperintensities, and MRI ischemic stroke lesions.We chose these problems to study the gen-eralizability of the method as these applications differ a lot in object shapes and appearances, imaging modalities, and suffer from different problems (see Fig. 2 ).

CT Arteries dataset
We use 25 non-contrast lung CT scans from 25 different subjects enrolled in the Danish Lung Cancer Screening Trial (DLCST) ( Pedersen et al., 2009 ).The selection of the 25 subjects was completely random and it was done before the development of this algorithm for an unrelated study.The aorta and pulmonary artery were manually segmented by a trained observer (ZS).

MRI White matter hyperintensities (WMH) dataset
The White Matter Hyperintensities (WMH) Segmentation Challenge ( Kuijf et al., 2019 ) provided images from 60 subjects (T1 and FLAIR) acquired from three hospitals and manually segmented for background and white matter hyperintensities.We randomly split these in 36 subjects for training, 12 for validation, and 12 for testing.For each subject, we cropped/padded MRI images into a constant size 200 × 200 × Z , where Z is the number of slices in the image.We use Gaussian normalization to normalize the intensities inside the brain mask in each image to zero mean and unit standard deviation.We extract training patches of size 200 × 200 × 16 with 80% overlap in z-axis between patches.In total, there are 528 3D patches for training.

MRI Ischemic stroke lesions (ISLES) dataset
The ISLES 2015 Challenge ( Maier et al., 2017 ) is a public dataset of diverse ischemic stroke cases.There are 4 MRI sequences available for each patient (T1, T2, FLAIR, and DWI).We use the subacute ischemic stroke lesion segmentation (SISS) dataset (28 subjects) with the lesion labels for experiments and randomly split them as 14 for training, 7 for validation and 7 for testing.The images are cropped/padded to the size 200 × 200 × Z. Gaussian normalization is applied for normalizing the intensities in each image.Training patches of the size 200 × 200 × 16 with 80% overlap in zaxis are extracted.In total, there are 560 3D patches for training.

Data augmentation and training details
The network is trained on all mini-batches (each minibatch contains one 3D patch).For each 3D patch in the current mini-batch we apply 3D random rotation sampled from ([ −5 , 5] , [ −5 , 5] , [ −10,10]) degrees, shifting ([ −24 , 24] , [ −24 , 24] , [ −7,7]) voxels, as well as random horizontal (left and right) flipping.We stopped training when the validation loss is not decreasing anymore and chose the model that achieved the best validation performance.The experiments are run on an Nvidia GeForce GTX1080 GPU.The average training time is 5 ∼10 h for one CNN baseline model and 1 ∼2 h more when the CRF layer is added.

Evaluation metrics
We use four voxel-wise metrics of segmentation quality: Dice similarity coefficient (DSC), indicating the relative overlap with the ground truth (larger is better); 95th percentile Hausdorff distance (H95), showing the extremes in contour distance from ground truth to the prediction (smaller is better); Average volume difference (AVD) as a percentage of the difference between ground truth volume and segmentation volume over ground truth volume (smaller is better), and Recall score (larger is better).For the lesion segmentations (WMH and ISLES), we additionally assess accuracy of lesion detection by computing the lesion-wise Recall and lesionwise F1 score (larger is better).The lesion-wise metrics use the 3D connected components, while the voxel-wise metrics do not use 3D connected components.The correct detection of a lesion is determined by the overlap (at least one voxel) of the 3D components.F1 score is equivalent to lesion-wise Dice score and is calculated by 2 * (precision * recall)/(precision+recall), where precision is calculated by true positives/(true positives+false positives).

Segmentation results
Table 3 shows the segmentation results for all three datasets.In most metrics, Posterior-CRF had the best performance in all datasets.For all datasets, CNN without CRF provides good baseline results, which indicates that 3D UNet is an efficient architecture to extract useful features for segmentation in these applications.Intensity-CRF performed worse on DSC than Posterior-CRF (statistically significant in aorta segmentation and WMH segmentation), which reveals the limitation of intensity features.Among all endto-end CRF methods, Spatial-CRF performs worst for all datasets except ISLES.From these results, we conclude that spatial coherence alone is not sufficient and often detrimental to segmentation accuracy, and that the CNN features in the last layer are more informative for CRF than the intensity features in the original images.
CRFs that depend strongly on intensity-based features have difficulties detecting objects that are similar in intensity.Examples of this problem can be observed in the segmentations for the CT arteries and ISLES datasets ( Fig. 6 ).In CT arteries segmentation, the aorta and pulmonary artery have very similar intensities, which causes most of the methods in our experiments to sometimes misclassify part of the aorta as pulmonary artery.This is especially true for Post-processing CRF but also for Intensity-CRF.
Posterior-CRF achieves a DSC segmentation overlap of 95.4% and an H95 lower than 2.87mm in aorta segmentation, which is significantly better than all other methods on this dataset.We argue that this is because the features from the last CNN feature maps are more informative than the intensity-based features, which allows the CRF inference to focus on refining the object boundary without expanding into neighboring class voxels with similar intensities.The Posterior-CRF also gives a performance improvement in the segmentation of the pulmonary artery, but this is not always statistically significant.One reason is that the blurred boundary between the aorta and pulmonary artery often results in the oversegmentation of pulmonary artery, the errors in pulmonary artery are emphasized because the overall pulmonary artery volume is lower.Another reason could be the curved shape of the pulmonary artery, which makes the results vary a lot between patients.
We see similar behavior on the ISLES dataset.The intensity boundaries of the large ischemic stroke lesions are ambiguous and their appearance varies a lot between lesions.Most of the methods fail to segment the boundaries accurately (see Fig. 6 ISLES).Postprocessing CRF hardly solves the problem and performs slightly worse than CNN.Posterior-CRF achieves better (while less significant due to the large prediction variance between samples) segmentation performance on DSC, AVD, lesion-wise F1.
A properly tuned spatial component of the post-processing CRF can benefits CT arteries and ischemic stroke lesion segmentation (Appendix Section 9, Fig. 2 (a) and (c)).However, it can cause problems to white matter hyperintensities no matter how we try to tune it (Appendix Section 9, Fig. 2 (b)), where we can see a positive ω 2 always leads to a decreased performance since the spatial smoothing contributes to remove both isolated true positives and false positives if they are small enough.The complete SHAP analysis will be discussed in Appendix Section 9.
The negative effect of the spatial smoothing results in the low average lesion-wise recall score in WMH segmentation for Postproc-CRF (34.8%) and can be observed in the WMH segmentation results (see Fig. 6 ).In this case, Postproc-CRF is always worse than vanilla CNN (within our grid-search range).This is because the scenario where post-processing CRF has no influence (with both ω 1 and ω 2 set to zero) was not included in the grid search range (0.1,10).Intensity-CRF has a higher lesion-wise average recall than CNN baseline (68% to 64.8%) but a lower (not significantly) voxel-wise recall (77.5% to 79.8%): although it detects more correct lesions than CNN due to the intensity features, its use of spatial features causes it to undersegment individual lesions (see Fig. 6 ).Spatial-CRF also suffers from this problem, with a high lesion-wise recall of 68.8% but low lesion-wise F1 of 65.7%.
For CT arteries, the proposed method performs better than the state-of-the-art ( Sedghi Gamechi et al., 2018 ) in aorta segmentation (0.95 vs. 0.94) and worse in pulmonary segmentation (0.89 vs. 0.92).Note that five-fold cross-validation is applied in ( Sedghi Gamechi et al., 2018 ) and in this paper we apply five random data splits, which may lead to different test data.Unlike in ( Sedghi Gamechi et al., 2018 ), we do not cut the pulmonary artery prediction from the bottom level.In some cases, our method produces segments that extend beyond the manual annotations, which leads to a lower Dice performance.For WMH, the proposed method performs slightly worse than the best performance in the leaderboard using 5 2D UNet ensembles (0.78 vs. 0.81) using the same test data.The top 3 methods in the leaderboard are all 2D UNet ensembles (0.81 vs. 0.80 vs. 0.80), which shows a well-tuned UNet can provide strong baseline performance for WMH segmentation.The best non-ensemble approach is brain atlas guided attention UNet which is more comparable to the proposed method (0.79 vs. 0.78).For ISLES, note that the test sets used in this paper are different from the ones that are used to calculate the leader-board performance.The performance of the proposed method using 14 training images is quite comparable to the best performance in the leaderboard (0.61 vs. 0.59), which is the only CNN-based method ( Kamnitsas et al., 2017 ) among the top-3 methods in Dice metrics (0.59 vs. 0.55 vs. 0.47).

Optimization of the end-to-end CRF
We show the evolution of the trainable CRF parameters in one data split of WMH dataset in Fig. 5 .For the four parameters in the 2 × 2 compatibility matrix μ and the two diagonal spatial kernel weights ω 2 , Spatial-CRF falls into different local optimal values compared to other CRF methods, probably because different parameter scaling due to the lack of the appearance kernel.In contrast, Intensity-CRF and Posterior-CRF converged to similar optimal values for μ and ω 2 .For the two diagonal bilateral kernel weights in ω 1 that control the appearance kernel, Intensity-CRF and Posterior-CRF converged to two different optimal values.This suggests that different CRF f eature spaces contribute mostly through the appearance kernel and less through the compatibility matrix or the spatial kernel.Interestingly, for the second diagonal bilateral weight ω (2) Intensity-CRF, but at the later stage it finds and learns another set of features that may help categorize the lesion class better, which are more reliable than the original intensity features.

Influence of CRF hyperparameters
We conduct experiments to investigate the influence of CRF hyperparameters on both end-to-end CRF with predefined features and the proposed CRF with learned features.
Trainable CRF parameters.The CRF weights μ, ω 1 , and ω 2 in the end-to-end CRF learning can be automatically updated together with CNN weights.We run Intensity-CRF and Posterior-CRF using WMH datasets with five different initializations of CRF weights randomly sampled from the search scale with all other parameters the same as in Table 2 .The CNN initializations are the same for all experiments.The results in Table 4 show that Intensity-CRF and Posterior-CRF converge to similar optimal points across different initializations.Spatial-CRF shows higher variances across experiments and is less stable to the change of initializations.Posterior-CRF is more robust to changes in initialization, achieving higher average performance and smaller standard deviations compared to Intensity-CRF and Spatial-CRF.
Empirically tuned parameters.The CRF standard deviation parameters θ α and θ γ , controlling the spatial terms, and θ β controlling the appearance term, were tuned empirically to give the best results for post-processing CRF.We here test, for WMH segmentation, five different values of θ α , θ β , and θ γ for Intensity-CRF and Posterior-CRF and five different values of θ γ for Spatial-CRF within the search scale.All other parameters are the same as in Table 2 .
The results are shown in Fig. 7 .We can see that Posterior-CRF is more robust to θ α and θ β and has consistently better performance than Intensity-CRF within the search scale, suggesting that Posterior-CRF parameters are more easy to tune.All CRF methods degenerate performance when θ γ becomes larger and show the best performance when using a similar value as that in the grid search for post-processing CRF.Spatial-CRF is more robust to θ γ compared to other CRF methods and has similar performance as CNN baseline with larger θ γ .This indicates that large θ γ reduces the CRF effect and the spatial term may introduce more incorrect segmentation when there is also an appearance term in the endto-end CRF like Intenity-CRF and Posterior-CRF.

Influence of hierarchical CNN features as CRF reference maps
We conduct experiments to investigate which level of features -earlier or deeper in the network -are more useful for the feature-learning-based CRF.We implement nine variants of feature-learning-based CRF with different levels of CNN feature maps as reference maps in the same 3D UNet architecture.For example, the method FL-CRF-e-1 indicates the feature-learning-based CRF using the level 1 feature maps in the UNet encoder path as CRF reference maps.The implementation detail of FL-CRF-e-1 is shown in Fig. 3 .To reduce the computational cost and keep the same layer capacity as Posterior-CRF, the 32-channel (or more in deeper layers) feature maps are encoded into C-channel feature maps and go through a softmax layer as the CRF reference maps.Since there is no gradient flowing back through the reference map path, we optimize the softmax layer with the segmentation loss directly in order to preserve as much semantic information as possible.Note that for CRF methods that use deeper CNN layers as reference maps, such as FL-CRF-e-2 to FL-CRF-d-2, we upsample the reference maps to the original image scale using nearest neighbor interpolation and optimize them with the segmentation loss, similar to FL-CRF-e-1.
The results are shown in Fig. 8 .Note that if we use the CNN input as CRF reference maps, it turns into Intensity-CRF; if we use the last CNN layer as CRF reference maps, it turns into Posterior-CRF.In the figure, we can see that all feature-learning-based CRF approaches (including Posterior-CRF) outperform Intensity-CRF and the overall Dice performance in the decoder path is better than that in the encoder path, indicating that CNN learned features are more useful to the CRF inference than intensity is and later CNN features are more useful than early features.The performance degenerates towards the middle part of the UNet (from FL-CRF-e-1 to FL-CRF-e-5 and FL-CRF-d-1 to FL-CRF-d-4) but fluctuates at the 2nd/3rd level.We argue that this may be due to the pooling effect which enables CNN to extract higher-level features but loses the spatial information at the same time.Posterior-CRF achieves the best performance among all variants and we argue that this is because the last CNN layer are more likely to contain more useful information for CRF inference and it still keeps the same spatial scale as the original image.

Evolution of CNN and CRF outputs
The concurrent optimization of CNN and CRF in our end-to-end models allows the CNN and CRF to interact during training.We observed that this has a strong effect on what the CNN learns in the early training epochs.Fig. 9 shows the evolution of CNN and CRF outputs for three typical examples.The baseline CNN without CRF converges quickly and focuses on the large lesions, already producing a fairly sparse output after the first epoch.The end-to-end models converge more slowly, and in this case the output of the CNN is influenced by the choice of CRF mostly in the early stage of training.For example, the CNN in the Intensity-CRF model initially tends to highlight voxels with similar intensity as the foreground (1 to 20 epoch), while the CNN in the Spatial-CRF model preserves the spatial coherence between voxels and outputs many

Discussion
In this paper, we explored efficient methods to combine the global inference capabilities of a CRF with the feature extraction from a CNN.Our end-to-end approach optimizes the CRF and CNN at the same time, and allows the two components of the approach to cooperate in learning effective feature representations.This gives our method an advantage over traditional CRFs that only use the original image intensities and position information.Intensity-based features can be suboptimal for problems where the intensity does not provide sufficient information to find the object boundaries, for example because the contrast between objects is too small.
Unlike other CRF methods, our Posterior-CRF uses adaptive learning-based features that are learned by the CNN and can combine spatial and appearance information in a way that suits the CRF.The results show our method can achieve stable, good performance across a range of segmentation applications and imaging modalities.FL-CRF variants that use early CNN features in Section 5.4 achieve in-between performance between Intensity-CRF and Posterior-CRF, using learning-based features that range from more similar to intensity to more similar to posterior probability maps.Finally, we found that integrating learned features into the CRF model reduces the need to fine-tune CRF parameters, making the method easier to apply than CRF methods with predefined features.

Interaction between CRF and CNN
Fig. 9 leads to the counter-intuitive observation that, at least initially, the CNNs in end-to-end models seem to imitate the CRF instead of complementing it.For example, the CNN output in Intensity-CRF highlights the ground truth, but also finds areas with similar intensities, producing something that looks very similar to the original image (20 epoch).The CNN output in Spatial-CRF selects the ground truth but also includes clusters of voxels in other areas (5 epoch).
This effect can be explained by the way the CNN and CRF interact during training.In Intensity-CRF and Spatial-CRF, the only interaction between CRF and CNN takes place through the unary map ( Fig. 4 , step 5, green arrow).For example, consider how this works in the Intensity-CRF.In WMH segmentation, the ground truth is usually high-intensity area.However, for the voxels with high intensities but not the target lesions, it is difficult to get both low pairwise CRF potentials and low segmentation loss, since labeling them as non-lesion goes against the CRF assumption that voxels with similar high-intensities are more likely to be the lesion class.For convenience, we call these voxels as hard voxels , indicating the voxels that do not fit the CRF assumption.In order to keep the correctly segmented lesions and reduce the CRF effect on the hard voxels at the same time, the CNN tends to provide unary maps that 1) highlight the ground truth area for lower segmentation loss, and 2) look similar to the CRF reference maps on the hard voxels for lower pairwise CRF potentials.In the later stage of training, CNN is encouraged to push the confidence of its outputs even further to minimize unary potentials and thus prevent CRF from undoing segmentation improvement on the hard voxels.From Fig. 9 , we can see that there are many hard voxels in Intensity-CRF (1 to 20 epoch, areas that look like the original image) and Spatial-CRF (5 epoch, clusters of voxels that do not belong to the ground truth) which may harm the segmentation.This indicates that the predefined features may not be the optimal feature space for the end-to-end CRF.
In the Posterior-CRF model, the CRF inference happens within the CNN feature space, which can improve the interaction between CNN and CRF.First, the features learned by CNN during training may contain information that is more useful for segmentation than that in the predefined features, which makes CRF benefit most from the CNN features.Second, using the learning-based features as CRF reference maps avoids the CRF assumption of the predefined features which may introduce many hard voxels, e.g., Intensity-CRF and Spatial-CRF, as discussed in the previous paragraph.With fewer hard voxels, the CNN in Posterior-CRF may provide better unary maps for the CRF inference.

Posterior-CRF vs. mean-field network
The mean-field approximation (MFA) in Posterior-CRF is somewhat similar to that in Mean-field networks (MFN) ( Li and Zemel, 2014 ), since both methods use it to get the posterior probabilities of the variables.Therefore, MFN could be a promising alternative to the MFA process in our method.MFN has the advantage that it utilizes each layer of the network as an iteration of MFA, which has the advantage of allowing more relaxation on parameters and provides some efficiency improvements.This makes the idea of formulating Posterior-CRF as a feed-forward network like MFN very attractive.There are, however, a few limitations that would need to be solved.
The first limitation is in training.MFN is designed to provide a faster and more flexible way to obtain the prediction of MFA, by fitting a powerful function that predicts the real MFA result.
To train an MFN, we first need to acquire the ground truth calculated by conventional mean-field iterations, which takes time during training but saves time during inference.On the other hand, Posterior-CRF provides a flexible and adaptive feature space for the conventional MFA, speeding up the procedure by applying Gaussian convolution in the message passing updates.As a result, the thing Posterior-CRF does is difficult to replicate with a MFN because the feature space of a Posterior-CRF changes during training, while MFN requires a predefined feature space to get the ground truth.
The second limitation is the tradeoff between dense inference and computation cost in the MFN.In its feed-forward network implementation, the computation cost increases exponentially when more neighbor nodes and number of layers are included, which limits its ability to model dense prediction problems such as segmentation tasks.

Posterior-CRF vs. graph neural networks
The proposed Posterior-CRF shares some similarities with graph neural networks (GNN) ( Scarselli et al., 2008;Selvan et al., 2018 ): both approaches aim to model interactions between variables within a graph model.The difference is that Posterior-CRF predefines the global relationship between variables through the mean-field assumptions and solves the maximum a posteriori problem, whereas GNN learns the global variable relationship by applying graph convolution filters and mapping the input graph to the output graph ( Selvan et al., 2018 ).
It could be interesting to combine the global view of the Posterior-CRF and the more local view of the GNN.The Posterior-CRF might benefit from using a GNN to replace its CNN component for feature extraction.The graph-based network may extract better features for Posterior-CRF than a CNN, which is not designed to extract unary and pairwise features for a graphical model.Similarly, the GNN may benefit from the efficient message passing of the Posterior-CRF, which would allow it to use the local graph-based features as CRF features for global interactive modeling in a computationally efficient way.

Limitations
In this paper, we show that the proposed Posterior-CRF method has benefits in the three medical imaging applications.Considering the medical imaging datasets are usually small largely because the manual annotations are very expensive to make, difference between Posterior-CRF and UNet may be smaller in larger training sets.But we know from literature that Intensity-CRF helps in some computer vision applications with large training sets (e.g., 10k 2D images or even more), it would be promising to test our method on these datasets.This is considered as our future work.
In Section 5.3 , we show that Posterior-CRF is robust to different CRF initializations and hyperparameters.However, the standard deviation parameters still require careful tuning, especially for θ γ in the spatial term.θ γ is sensitive to the image scale of different datasets and the size of the target object in different applications.Nevertheless, we recommend the researchers to use the default (or optimal if it is available) setting of post-processing CRF as a reference for tuning Posterior-CRF rather than random initialization.
Posterior-CRF is more robust to θ α and θ β compared to Intensity-CRF, which facilitates exhaustive tuning of these parameters.
The computational expense of the CRF also restricts the choice of applications.Compared to UNet ( ∼5 mins for 1 epoch in WMH experiment), there is around 20% training time increased on average when applying a CRF layer on top of the network ( ∼6 mins for 1 epoch).All end-to-end CRFs share similar computational costs.Given that Posterior-CRF uses posterior probability maps as its reference maps, it can become computationally expensive in multiclass segmentation problems.For a similar reason, Intensity-CRF and Postproc-CRF can become expensive when there are too many imaging modalities in the input channels M.
In the experiments, we use a plain 3D UNet as the backbone network for all methods.The training pipeline and hyperparameters are determined empirically and kept the same for all datasets, which could be suboptimal compared to elaborate automatic configuration strategies like nnU-Net ( Isensee et al., 2020 ).On the WMH dataset we therefore checked the performance of nnU-Net (3D version without ensembling).Average Dice score of nnU-net (0.77) was slightly higher than our CNN baseline (0.76, difference not statistically significant) but lower than the proposed posterior CRF using the CNN baseline as a backbone (0.79), which performed significantly better than the CNN baseline (see Table 3 ).Though our experiments have been limited to a standard 3D Unet architecture, We expect that posterior CRF can improve results of other segmentation architectures and other hyperparameter settings (such as nnU-net) as well.

Conclusions
In conclusion, we present a novel end-to-end segmentation method called Posterior-CRF that uses learning-based, classinformative CNN features for CRF inference.The proposed method is evaluated in three medical image segmentation tasks, including different MRI/CT imaging modalities and covering a range of object sizes, appearances and anatomical classes.In the quantitative evaluation, our method outperforms end-to-end CRF with early CNN features, end-to-end CRF approaches with predefined features, post-processing CRF, as well as a baseline CNN with similar architecture.In two of the three applications, our method significantly improves the segmentation performance.The qualitative comparison demonstrates that our method has good performance on segmenting blurred boundaries and very small objects.

Declaration of Competing Interest
Role of the funding source Chinese Scholarship Council, Iranian Ministry of Science, Research and Technology, and The Netherlands Organisation for Scientific Research (NWO) have no involvement in the study design, data collection, analysis and interpretation of data; in the writing of the report; and in the decision to submit the article for publication.

Fig. 2 .
Fig. 2. Difficult cases for conventional CRF inference in medical image segmentation.(a) Segmentation of arteries in CT: first row shows two axial slices of the CT scan with red arrows indicating indistinguishable boundaries; second row shows the corresponding ground truth of the aorta (yellow) and pulmonary artery (green); (b) White matter hyperintensities segmentation in MRI: four examples are shown with the ground truth of the lesions (green), red arrows indicate small isolated lesions that can be easily removed by CRF; (c) Ischemic stroke lesions segmentation in MRI: first row shows the ground truth of the lesions (green) where large appearance difference between lesions can be observed (red arrows); second row shows a close-up view of the lesions.Best viewed in color with zoom.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) an undirected graph describing the random field X .Each clique c in a complete set of unary and pairwise cliques C ζ in ζ , and φ is the potential for each clique.We seek a maximum a posteriori probability (MAP) estimation x that minimizes the corresponding Gibbs energy E( X = x | I ) :

Fig. 4 .
Fig. 4. One end-to-end optimization iteration of the proposed CRF method.Best viewed in color with zoom.
Images have an anisotropic voxel resolution of 0 .78 mm × 0 .78 mm × 1 .00 mm and are of size 512x512 with on average 336 slices (range 271-394).The 25 scans are split into three parts of 10, 5, and 10 scans for training, validation, and testing respectively.Due to the limitation of GPU memory, we first crop the original CT images and only keep the axial central part of 256 × 256 voxels for all slices.Then, 3D patches of the size 256 × 256 × 16 are extracted from the cropped images.All training patches have 80% overlap in z-axis between neighboring patches to mitigate border effects.In total, there are 840 3D patches for training.We use the original CT intensities without normalization.

Fig. 6 .
Fig. 6.Example segmentation results.From left for each row : (1) Original image (2) Manual annotation (3) CNN baseline (4) Postproc-CRF (5) Intensity-CRF (6) Spatial-CRF (7) Posterior-CRF.Aorta is colored with yellow and the pulmonary artery is green, white matter hyperintensities and ischemic stroke lesions in yellow.Red/blue rectangles indicate areas with over/under segmented voxels and the orange rectangle indicates another branch of pulmonary artery whose annotation starts in the next few slices and merged with the main branch gradually.In the WMH example (second row), only detections that do not overlap with any ground truth voxel (false positive lesions) or ground truth lesions for which no voxel is detected (false negative lesions) are highlighted, and in the zoomed patches red and blue voxels indicate false positive and false negative lesions respectively.Better viewed in color with zoom.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 7 .
Fig. 7. Dice performance of varying θfor CRF methods on WMH dataset.CNN result is shown as the black dash line.Purple crosses indicate the values used in Table 4 .Best viewed in color with zoom.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)Table4 Performance (Dice score) across 5 different initializations of CRF weights on WMH dataset.Methods Intensity-CRF Spatial-CRF Posterior-CRF Mean (std) 0.7570 (0.008) 0.7507 (0.02) 0.7833 ( 0.003 )

Fig. 8 .
Fig. 8. Dice performance of end-to-end CRFs using different CNN f eature maps in an independent run on WMH dataset.Different blocks indicate different level of CNN feature maps used as CRF reference maps.Best viewed in color with zoom.

Fig. 9 .
Fig. 9. Evolution of CNN and CRF outputs during training.The CNN output maps and CRF results for WMH segmentation in 3 different MRI images (columns) are shown at, from top row to bottom row, epoch 1, 5, 20, and the best epoch.The best epoch is chosen when the model shows the best validation performance till the end of training (usually at 50 ∼80 epoch).FLAIR: the input FLAIR image of the current training sample.GT: ground truth.CNN baseline: the last layer (softmax output) of CNN.Intensity-CRF, Spatial-CRF, Posterior-CRF: the probability maps before/after the CRF layer at different epochs during training.Best viewed with zoom.

Fig. 3. Proposed feature-learning-based CRF using early/later CNN feature maps. The
backbone architecture is based on 3D UNet.The skip-connections concatenate the feature maps from the encoder path with the upsampled ones from the decoder path.The CRF module is placed on top of the CNN and infers the most likely posterior class probability conditioned on the CRF features.M is the number of input imaging modalities.C is the number of output classes.Two proposed CRF variants are shown in this figure: 1. Posterior-CRF (red rectangle and arrows), which uses the last CNN layer as CRF reference maps; 2. FL-CRF-e-1 (blue rectangles and arrows), which uses the first level CNN layer as CRF reference maps.Best viewed in color with zoom.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 1 Post-processing CRF parameters for each dataset.
Search range indicates the range of parameter values explored during grid search.

Table 2
Initial end-to-end CRF parameters for each dataset.

Table 3
Results.Mean (standard deviation).The best results are marked in bold.Each experiment is repeated 5 times with different random data split.The last two colomns are lesion-wise metrics.* : significantly better than CNN baseline ( p < 0 .05 ).: significantly worse than Posterior-CRF ( p < 0 .05 ).P-values are calculated by two-sided paired t -test.All CRF methods are implemented in 3D fully-connected manner and share the same CNN architecture and hyperparameters.

Fig. 5. CRF parameters during training in WMH dataset. The
initial values of the CRF parameters can be found in Table2.Best viewed in color with zoom.