Evaluation of state-of-the-art segmentation algorithms for left ventricle infarct from late Gadolinium enhancement MR images

Abstract
Studies have demonstrated the feasibility of late Gadolinium enhancement (LGE) cardiovascular magnetic resonance (CMR) imaging for guiding the management of patients with sequelae to myocardial infarction, such as ventricular tachycardia and heart failure. Clinical implementation of these developments necessitates a reproducible and reliable segmentation of the infarcted regions. It is challenging to compare new algorithms for infarct segmentation in the left ventricle (LV) with existing algorithms. Benchmarking datasets with evaluation strategies are much needed to facilitate comparison. This manuscript presents a benchmarking evaluation framework for future algorithms that segment infarct from LGE CMR of the LV. The image database consists of 30 LGE CMR images of both humans and pigs that were acquired from two separate imaging centres. A consensus ground truth was obtained for all data using maximum likelihood estimation.
Six widely-used fixed-thresholding methods and five recently developed algorithms are tested on the benchmarking framework. Results demonstrate that the algorithms have better overlap with the consensus ground truth than most of the n-SD fixed-thresholding methods, with the exception of the Full-Width-at-Half-Maximum (FWHM) fixed-thresholding method. Some of the pitfalls of fixed thresholding methods are demonstrated in this work. The benchmarking evaluation framework, which is a contribution of this work, can be used to test and benchmark future algorithms that detect and quantify infarct in LGE CMR images of the LV. The datasets, ground truth and evaluation code have been made publicly available through the website: https://www.cardiacatlas.org/web/guest/challenges.

Introduction
In recent years, the translation of image analysis tools to the clinical environment has remained limited despite their rapid development, in part because of the challenging task of cross-comparing algorithm performance. The absence of a common pool of data along with evaluation strategies has limited algorithm translation into the clinical workflow. Moreover, as larger cohort data sets become available, the need for reducing the manual labour involved in image analysis is becoming more important. Benchmarking of algorithms on common datasets provides a fair test-bed for comparison. It is thus a very important activity as we move from bench to bedside in the medical image processing community. In recent years, several conferences and meetings within the medical image processing community have provided a platform to benchmark algorithms from multiple research groups. These challenges invite participants to submit their algorithms and test them on common data. The results from the test are then evaluated and compared using common evaluation metrics. In the past, a few challenges have been organised, each with its own unique theme. An index of past challenges within the medical image processing community can be found on the Cardiac Atlas project page at https://www.cardiacatlas.org/web/guest/challenges. In the cardiovascular imaging domain, some recent challenges include left atrial fibrosis and scar segmentation (Karim et al., 2013), left ventricle segmentation (Suinesiaputra et al., 2014), right ventricle segmentation (Petitjean et al., 2015), cardiac motion tracking (Tobon-Gomez et al., 2013) and coronary artery stenosis detection (Kirisli et al., 2013).

Motivation for left ventricle infarct segmentation
Cardiovascular magnetic resonance (CMR) imaging can be used to comprehensively assess the viability of myocardium in patients with ischaemic heart disease. Myocardial infarction can be visualised and quantified using inversion recovery imaging 10-15 min after intravenous administration of Gadolinium contrast. This imaging technique is known as late Gadolinium enhancement (LGE) imaging. Experimental models have shown excellent agreement between size and shape in LGE CMR and areas of myocardial infarction by histopathology (Kim et al., 1999; Wagner et al., 2003). Infarct size from CMR is also a primary endpoint in many clinical trials (see Desch et al., 2011 for a complete list).
Recent studies have also demonstrated how infarct size, shape and location from pre-procedural LGE can be useful in guiding ventricular tachycardia (VT) ablation (Estner et al., 2011; Andreu et al., 2011). These procedures are often time-consuming due to the preceding electrophysiological mapping study required to identify slow conduction zones involved in re-entry circuits. Post-processed LGE images provide scar maps, which can be integrated with electroanatomic mapping systems to facilitate these procedures (Andreu et al., 2011). Clinical implementation of these developments necessitates a reliable, fast, reproducible and accurate segmentation of the infarcted region. Moreover, as the use of LGE-based infarct volume estimation becomes more clinically relevant, standardisation will facilitate more consistent interpretation.

State-of-the-art for cardiac infarct segmentation
A short overview of previously published infarct detection algorithms for the left ventricle (LV) is presented here. Table 1 lists the algorithms surveyed and highlights some of their important features. A common method for detecting infarct in the LV is the fixed-model approach, whereby intensities are thresholded to a fixed number of standard deviations (SD) from the mean intensity of nulled myocardium or blood pool (Flett et al., 2011). In the rest of the paper this will be known as the n-SD method, where n = 2, 3, 4, 5 or 6. A second common fixed-model approach is the full-width-at-half-maximum (FWHM) approach, where half of the maximum intensity within a user-selected hyper-enhanced region is selected as the fixed intensity threshold (Amado et al., 2004). Using this threshold, a region-growing process is employed from user-selected seeds. These seeds are selected to be within infarcted regions such that they can be segmented with region-growing.
As the aforementioned approaches require user input, making them prone to inter- and intra-observer variation, other approaches that are automatic have been developed. Hennemuth et al. (2008) modelled the intensities of homogeneous tissue in LGE CMR with a Rician distribution, and an expectation-maximization (EM) algorithm was used for fitting the data. Pop et al. (2013) fitted Gaussian mixture models to myocardial tissue pixel intensities and correlated with histology. In Detsky et al. (2009), clustering in a feature space of steady-state and T1* intensity values provided the segmentation, which was shown to provide good correlation with FWHM. Tao et al. (2010) employed automatic thresholding using the Otsu method on bi-modal intensity histograms of myocardium and blood pool. More recently, the graph-cut technique from image processing has been applied to segment infarct in several methods (Lu et al., 2012; Karim et al., 2014; Karimaghaloo et al., 2012). An advantage of this technique is that constraints can be placed on the resulting segmentation, allowing segmentation boundary regularization with region-based properties. It also predicts which pixels are statistically most likely to be infarct based on prior probability distribution models.

Proposed evaluation framework
In this paper we propose an evaluation framework for future algorithms that segment and quantify infarct from LGE CMR images of the LV. To demonstrate the framework, five algorithms were evaluated by comparing against a consensus segmentation of experienced observers. The algorithms and observers were both provided with the myocardium segmentation. The algorithms were also provided with training data sets. Algorithms evaluated in this work were submitted as a response to the open challenge put forth to the medical imaging community at the Medical Image Computing and Computer Assisted Intervention (MICCAI) annual meeting's workshop entitled Delayed Enhancement MRI segmentation challenge. Thirty LGE CMR datasets of the LV from both human and porcine cohorts were used for the challenge. The data were divided into test (n = 20) and training (n = 10) sets. Each participant designed and implemented an algorithm which segmented the infarct in each dataset. The datasets are publicly available via the Cardiac Atlas project challenge website https://www.cardiacatlas.org/web/guest/ventricular-infarction-challenge.

Data acquisition database
LGE images were collected from two imaging centres: Imaging Sciences at King's College London (KCL-IM) and Universiteit Leuven (UL). A total of fifteen human and fifteen porcine datasets were collected, of which five in each cohort were used as a training set for the algorithms. For all datasets, a short-axis stack of DE-MRI images covering the LV was provided. The myocardial mask in each image was made available. This was delineated carefully by an expert observer using short-axis slices. A first step was to determine the basal, mid and apical slices based on the standard American Heart Association (AHA) guidelines (Cerqueira et al., 2002). The contours for epicardial and endocardial borders, excluding the papillary muscles, were carefully drawn on each slice before the enclosed region between them was filled to produce the mask. The images in the database were limited to the above two different types but varied in their quality. Refer to Table 2 for a summary of the two different types of data that were included in this study.
The human data (n = 15) were from randomly selected patients who had a known history of ischaemic cardiomyopathy and were under assessment for an implantable cardioverter defibrillator (ICD) device for primary or secondary prevention after infarction. In addition to this, the patients chosen had a history of myocardial infarction at least three months prior to their MRI scan. There was also evidence of significant coronary artery disease on angiography and evidence of impaired left ventricular systolic function on echocardiography. The images were acquired on a clinical 1.5T MRI unit (Philips Achieva, The Netherlands). All patients gave written informed consent.
The porcine data (n = 15) were randomly selected from an experimental database of a pre-clinical model of chronic myocardial ischemia (Wu et al., 2011), with induced lesions obtained by occluding either the left-anterior descending or left-circumflex artery. The data were acquired six weeks after the induction of the coronary lesion on a clinical 3T MRI unit (Siemens Healthcare, Germany). Representative images are shown in Fig. 1.
Five research groups segmented the above datasets, leaving ten images aside, which were utilised for training. A brief summary of their algorithms is given in Table 3. They are described in greater detail in the sections below with a brief background on each technique implemented and details of their implementation.

Background:
Support vector machines (SVM) and level set methods were used to segment scar in this method. SVM is a machine learning technique which first computes the optimal hyperplane on a set of training data mapped to some feature space (Hearst et al., 1998). The hyperplane is a decision boundary which maximally separates the pre-labelled data. Once the hyperplane is obtained, the unseen data is mapped to the same feature space to see which side of the hyperplane it lies in. This labels and thus classifies the unseen data. Level-sets (Sethian, 1999) were also used in this method. In this technique a region evolves from an initial position within the region to be segmented. Level-sets have the added advantage of imposing shape constraints on the evolving region.

Implementation:
A number of image processing techniques were employed. In the first stage, an Otsu-based thresholding was used. Here the threshold between healthy and scar tissue was computed by maximising the between-class intensity variance of the two labels (Otsu, 1975). However, as this method was subject to limitations, especially in instances where healthy and scar tissues had overlapping intensities, further steps were necessary. An ensuing connected-component analysis found groups of connected pixels. On these pixel groups, several features relevant to scar were extracted: area, bounding box, major and minor axes, eccentricity, convex-hull area and Euler number (Teague, 1980). This allowed pixel groups to be mapped to a feature space. Several classifiers were tested on the training data provided, namely SVM, K-nearest neighbours, linear Bayesian discriminant, and linear perceptron classifiers. SVM was chosen based on the best trade-off between error and sensitivity on the training data (Hearst et al., 1998). Following classification using SVM, a further level-set step refined the segmentations obtained (Sethian, 1999). The contours obtained from the SVM classification step were used to initialise a level-set. The level-set was constrained by the search area obtained in the initial step of the algorithm. It evolved in a speed image P(x) derived from the SVM-classified pixels and parameterised by an upper value U and a lower value L. The values U and L were obtained from grey-level intensity I(x) statistics of the SVM output, i.e. U = μ + 5σ and L = μ − 5σ. These parameters are in line with the standard deviation approach for classifying scar (Karim et al., 2013).
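The Otsu step of the first stage can be illustrated with a minimal sketch. This is not the participants' implementation: it assumes a flat list of myocardial pixel intensities and a hypothetical bin count of 64, and it simply returns the histogram split that maximises the between-class variance.

```python
def otsu_threshold(values, bins=64):
    """Return the intensity that maximises between-class variance.

    `values` is a flat list of pixel intensities; `bins` is an
    illustrative choice, not a value taken from the paper.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # constant image: no meaningful split
    width = (hi - lo) / bins

    # Build the intensity histogram.
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1

    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    sum_all = sum(c * h for c, h in zip(centers, hist))

    best_var, best_t = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += centers[i] * hist[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2  # unnormalised
        if var_between > best_var:
            best_var, best_t = var_between, lo + (i + 1) * width
    return best_t


# Toy example: a dark (healthy-like) cluster and a bright (scar-like) cluster.
pixels = [10, 11, 12, 13] * 5 + [50, 51, 52] * 3
threshold = otsu_threshold(pixels)
```

When the two intensity clusters overlap, the split found this way becomes unreliable, which is the limitation that motivated the subsequent connected-component and classification steps.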

Background:
Region-growing is a well-known image processing technique which finds a group of connected pixels with intensity homogeneity. It is an iterative process which starts from a seed point, and the region increases in size by including neighbouring pixels that fit certain pre-defined criteria. Region-growing can, however, leak into neighbouring areas, which is an important limitation of the technique.

Implementation:
Seed selection for region-growing was automatic and repeated for each slice, making it essentially a 2D technique. A minimum of two seeds were selected for each tissue class: scar and healthy. The criterion for selecting seeds for the scar tissue class was an intensity test applied to each pixel in the kth slice, in which the pixel intensity I was compared against a threshold based on the mean (μ_k) and variance (σ_k²) of myocardium intensity in that slice. Individual regions satisfying this criterion were analysed for their shape and size. Elongated and thin regions near the epicardium were deleted in an automated manner by computing the eccentricity and width (in proportion to the myocardial mask) of the region in question, on which a thresholding was performed based on empirical values obtained from the training set. The size of negligible regions was defined in proportion to the pixel size and the size of the myocardial mask. The two largest and brightest regions were selected as the seeds for the scar tissue class. For the healthy tissue class, a similar standard deviation approach was utilised (i.e. I < μ_k + 2σ_k) and the two largest and darkest regions were selected as seeds. Region-growing was initiated from each seed region, and these generated segmented regions for the healthy or scar tissue classes. The choice of two seeds, per slice, for each tissue class is important as it generates two separate disconnected regions. However, this places a limit on the maximum number of scar or healthy regions possible (i.e. two) in each slice.
The region-growing process was followed by a region-labelling step in which pixels that were not labelled as scar or healthy tissue were analysed; if they had any adjacent neighbour belonging to either the scar or healthy class, they were labelled as such. This was followed by a post-processing step to fill holes or small gaps in the segmentations. Also, regions that were small islands containing a negligible number of pixels were removed from the segmentation. Finally, dark regions that lacked contrast but were surrounded by scar pixels were re-labelled as scar. This is characteristic of a microvascular obstruction.
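The core region-growing loop can be sketched as follows. This is only an illustration under stated assumptions: the inclusion criterion is a simple intensity window [low, high], connectivity is 4-neighbour within one slice, and the seed is a single pixel rather than a seed region. None of these choices is confirmed by the description above.

```python
from collections import deque


def region_grow(image, seed, low, high):
    """Grow a region from `seed` over 4-connected pixels in [low, high].

    `image` is a 2D list of intensities for one slice and `seed` is a
    (row, col) tuple. Returns the set of (row, col) pixels in the region.
    """
    rows, cols = len(image), len(image[0])
    region = set()
    if not (low <= image[seed[0]][seed[1]] <= high):
        return region  # seed itself fails the criterion

    region.add(seed)
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region and low <= image[nr][nc] <= high:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Toy slice: one bright connected patch and one isolated bright pixel.
slice_k = [
    [10, 10, 10, 10],
    [10, 60, 62, 10],
    [10, 61, 10, 10],
    [10, 10, 10, 65],
]
scar = region_grow(slice_k, seed=(1, 1), low=50, high=255)
```

The isolated bright pixel at (3, 3) is not reached from the seed, which shows why one seed per disconnected region is needed and why the number of seeds limits the number of regions found in a slice.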

Background:
The previous methods described are geometrical in their nature; a region's intensity and its geometrical shape are used to determine its classification. The method described in this section is different from the above approaches in that a probabilistic classifier model was used. Based on the training dataset, the classifier can infer the posterior distribution of a pixel's label to be healthy or scar given the observation. There are two sets of observations made: (1) the pixel's intensity, and (2) the pixel's neighbourhood. Since labels of neighbouring pixels are typically correlated, neighbourhood information is incorporated by building a graphical model G(V, E), where voxels are represented by a set of nodes (V) and the relationships among them are represented by edges (E). In the generative Markov random field (MRF) (see Boykov et al., 2001), Bayes' relationship is used to determine the posterior distribution, p(Y|X) ∝ p(X|Y) p(Y), where X is the unseen image to be segmented and Y is the labelling into healthy and scar. The likelihood p(X|Y) of the unseen image is estimated by assuming that the voxel intensities in X are independent given the labels. Also, a uni-modal Gaussian is often used. However, in the context of medical image segmentation, regions are not random collections of independent pixels. Instead, structures usually form coherent and continuous shapes. In this work, a conditional Markov random field (CRF) (Lafferty et al., 2001) is used, which is a discriminative framework in which the posterior p(Y|X) is estimated by learning a direct map from observations to the class labels (i.e. in training images). This is how it differs from other MRF approaches used in binary classification, where the posterior is estimated using Gaussian distributions.

Implementation:
The CRF implemented in this work used a hierarchical approach and is described in Karimaghaloo et al. (2012). There are two levels of CRF: in the first level, image intensity information was used, and in the second level, a so-called spin image feature vector derived from intensity information was used. In the first-level CRF, the posterior distribution p(Y|X) was estimated as in a conventional CRF (Lafferty et al., 2001) from unary (φ), pairwise (ϕ) and triplet (ψ) potentials together with a normalization term Z. Pairwise and triplet potentials measure the interaction between pixels that are immediate neighbours (pairwise) and neighbours' neighbours (triplet). As regions in MRI images are not random collections of independent pixels but part of coherent and continuous shapes, the pairwise and triplet potentials reinforce this notion. The unary potentials p(y_i|x_i) computed the inference on the healthy or scar labels (y_i) from the MRI intensity observed at pixel i. This potential was modelled from labelled training data provided within the challenge, where y_i is the label and x_i is the observed intensity at voxel i. A binary classifier was employed for the purpose of distinguishing between healthy and scar. The decision boundary was learned from training data using a variant of support vector machines (SVM) known as relevance vector machines (RVM) (Tipping, 2001). The final classification of the first-level CRF was performed using a graph-cut optimization framework (Boykov et al., 2001).
In the second-level CRF, using infarction candidates from the first level, a two dimensional histogram encoding the distribution of image brightness values in the neighbourhood of a particular reference point was constructed. This is the spin image which encoded local information around infarct candidates. Besides voxel intensity, these spin image features were also used for CRF. Similar to the first-level CRF, the final inference was performed using a graph-cut optimisation framework.

Background:
The method presented in this work assumes that the voxel intensity distribution in MR images can be modelled using statistical distribution models. Depending on acquisition parameters and the reconstruction algorithm, it can either be modelled using a Gaussian, Rayleigh or non-central χ-distribution (Dietrich et al., 2007).
These distributions are also closely related to the Rician distribution, making it suitable for modelling healthy myocardium intensities. For diseased myocardium the Rician-Gaussian mixture was found to be appropriate, and for necrotic tissues, the non-central χ-distribution was shown to be suitable (Hennemuth et al., 2008).
The watershed segmentation approach was used in this method (Hennemuth et al., 2008). Watershed is a classical image segmentation technique where the gradient image is considered as a topographic surface. Structures such as scar can be assumed to have high intensity gradients at edges and low gradients in the interior. This high-low-high intensity gradient profile creates basins in the image. Once points are located inside each basin they can be segmented by following paths of decreasing altitudes on the topography of the gradient image.

Implementation:
In this work, three separate models were considered: Rician, Rician-Gaussian and Gaussian models. Each model was fitted to the myocardium intensity distribution in the unseen image. The model with the least mean fitting error was chosen. To achieve an optimal fit, the Expectation-Maximization (EM) algorithm was used. Two classes corresponding to healthy and scar were chosen to initialise the EM fit.
A threshold was then derived from the mixture distribution obtained from the EM-fitting process. This is the higher of the two means in the two-class mixture model. Using Euclidean distance in 3D and endocardial voxels computed from the myocardium segmentation, voxels with intensity higher than the threshold and closer to the endocardium were chosen as seeds for the watershed process. These seeds were used to define the basins and the watershed transformation determined the extent of each basin. The basins determined each location to be labelled as scar. An ensuing connected-components analysis step removed small noisy structures.
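The threshold derivation can be sketched with a two-class EM fit. As a loud simplification, the example below fits two Gaussians only; the Rician and Rician-Gaussian models and the model-selection step described above are not reproduced. The initialisation (splitting the sorted intensities in half) and the iteration count are assumptions made for illustration.

```python
import math
import statistics


def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, iters=50):
    """EM fit of a two-component 1D Gaussian mixture (healthy, scar)."""
    srt = sorted(values)
    mid = len(srt) // 2
    means = [statistics.fmean(srt[:mid]), statistics.fmean(srt[mid:])]
    sd0 = statistics.pstdev(values) or 1.0
    sds = [sd0, sd0]
    weights = [0.5, 0.5]

    for _ in range(iters):
        # E-step: responsibility of each class for each intensity.
        resp = []
        for x in values:
            p = [weights[k] * gaussian_pdf(x, means[k], sds[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s) if s > 0 else (0.5, 0.5))
        # M-step: update weight, mean and standard deviation per class.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # avoid a collapsed component
            weights[k] = nk / len(values)
    return means, sds, weights


def scar_threshold(values):
    """Threshold taken as the higher of the two class means."""
    means, _, _ = fit_two_gaussians(values)
    return max(means)


myocardium = [10, 11, 12, 9, 10, 11] * 5 + [50, 52, 48, 51] * 3
threshold = scar_threshold(myocardium)
```

Voxels above this threshold would then be candidate seeds; the endocardial-distance test and the watershed transform themselves are not shown here.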

Algorithm 5: KCL - Graph-cuts with EM-algorithm (KCL)

Background:
The background of the method used in this work is in some ways similar to the method proposed by MCG in Section 2.4, except that it employs a non-conditional MRF solved using graph-cuts. The image to be segmented is modelled as a graph with paths or links between neighbouring pixels. For each pixel there is also a link to two special nodes, known as the source and sink nodes, that correspond to scar and healthy myocardium. Each link is assigned a weight based on its intensity. The graph-cuts approach computes a partitioning to divide the graph into two sub-graphs, one containing the source node and the other the sink node. This partitioning assigns a label (source or sink) to each pixel, solving the segmentation as an optimisation problem. It searches for a globally-optimal solution.

Implementation:
In the graph-cuts approach implemented in this work, each pixel in the myocardium was modelled as a node in the graph with links to source and sink nodes. These links were assigned weights representing the affinity to healthy (i.e. source) and scar (i.e. sink) nodes. The weights were derived from statistical distribution models developed from training images. There were separate intensity distribution models for healthy and scar tissue, both of which were derived from the training images. For scar, the ratio of delayed enhancement intensity to mean blood pool was modelled using a Gaussian distribution. For healthy tissue, a Gaussian mixture was used. The number of mixtures in the model was fixed at three. The standard EM-algorithm computed mean and variance for each mixture from the training images. In the graph-cuts framework there are also links between adjacent pixels and these were derived from a measure of intensity similarity of two pixels. Adjacent pixels with similar intensities attained a high weight. This enforced coherence in the segmentation output. The final segmentation was obtained using global optimisation over the entire image. This allowed for disjointed infarct regions to be identified in the image.
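The source/sink construction can be illustrated on a single row of pixels. This is a toy sketch, not the submitted method: the terminal weights use a squared normalised distance to hypothetical class means in place of the Gaussian and Gaussian-mixture models learned from training data, the neighbour weight is an assumed exponential similarity, and the minimum cut is found with a plain Edmonds-Karp max-flow.

```python
import math
from collections import deque


def graph_cut_labels(intensities, healthy_mean, scar_mean, sd, smooth=1.0):
    """Label a 1D row of pixels as 'healthy' (source side) or 'scar' (sink side)."""
    n = len(intensities)
    source, sink = n, n + 1
    cap = [dict() for _ in range(n + 2)]  # residual capacities

    def add_edge(u, v, c_uv, c_vu=0.0):
        cap[u][v] = cap[u].get(v, 0.0) + c_uv
        cap[v][u] = cap[v].get(u, 0.0) + c_vu

    for p, val in enumerate(intensities):
        cost_healthy = ((val - healthy_mean) / sd) ** 2
        cost_scar = ((val - scar_mean) / sd) ** 2
        add_edge(source, p, cost_scar)   # cut if p is labelled scar
        add_edge(p, sink, cost_healthy)  # cut if p is labelled healthy
    for p in range(n - 1):
        diff = intensities[p] - intensities[p + 1]
        w = smooth * math.exp(-(diff ** 2) / (2.0 * sd ** 2))
        add_edge(p, p + 1, w, w)  # similar neighbours get a high weight

    def reachable_from_source():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break  # no augmenting path left: the cut is minimal
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow

    return ["healthy" if p in parent else "scar" for p in range(n)]


row = [10, 12, 11, 50, 52, 9]
labels = graph_cut_labels(row, healthy_mean=10, scar_mean=50, sd=5)
```

Because the cut is computed over the whole graph at once, disjoint bright runs in the row would each be labelled as scar without needing a separate seed for each.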

Reference standard: consensus ground truth
A reference standard for scar in each case was obtained by combining volumetric segmentations from three separate observers. All observers were cardiologists with several years' experience in CMR assessment of LV function and tissue viability. They also had several years' experience working with patients suffering from ischaemic heart diseases. For both datasets, they were blinded to the underlying clinical situation of patients and pigs. For pigs, lesions were obtained by occluding either the left-anterior descending or left-circumflex artery, and the observers were blinded to this fact. The observers were not instructed to look for areas of grey zones.
For regions affected by microvascular obstructions, they were instructed to avoid these by looking for regions of significant hypoenhancement surrounded by enhanced regions.
Scars in the images were segmented as follows: (1) Each slice in the LGE CMR was analysed separately in the short-axis view. The segmentation of the myocardium was loaded as an overlay. (2) The basal, mid and apical slices were identified along with the LV orientation, i.e. the posterior and anterior ends. (3) The short-axis slices were then analysed one at a time sequentially from basal to apical or apical to basal. (4) The basal slices were then examined for non-scar related enhancements (see Turkbey et al., 2012) such as the right ventricle (RV) insertion point, and partial voluming in the basal slices due to the outflow tract and appendage. The mid and apical slices were also examined for coronary arteries carrying blood that could be enhanced, and microvascular obstructions. (5) Pixels enhanced within myocardium were labelled as scar and generally noisy pixels or regions were avoided. Noise observed in the lungs was used as a reference.
Each observer was provided with the same set of guidelines as above. However, their segmentations differed in some instances. This was generally due to differences in their opinion and experience. Such inter-observer variability is now widely accepted. It was thus important to merge the segmentations and obtain a consensus ground truth. A maximum likelihood estimation of ground truth was obtained using a published algorithm known as STAPLE (Warfield et al., 2004). For every voxel, a probabilistic estimate of the true segmentation was computed using an optimal combination of the observers' segmentations. The final consensus segmentation was then obtained by thresholding this probability above 0.7 (70%). This is referred to in the rest of the text as the consensus ground truth.
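The consensus step can be sketched with a simplified binary STAPLE iteration followed by the 0.7 threshold. This is a minimal illustration, not the published implementation: the initial sensitivity and specificity of 0.9, the fixed iteration count and the use of the overall labelled fraction as the prior are all assumptions.

```python
def staple_binary(decisions, iters=30):
    """Simplified binary STAPLE.

    `decisions[j][i]` is observer j's 0/1 label for voxel i.
    Returns (per-voxel probability of scar, sensitivities, specificities).
    """
    n_obs, n_vox = len(decisions), len(decisions[0])
    prior = sum(sum(d) for d in decisions) / (n_obs * n_vox)
    sens = [0.9] * n_obs  # assumed starting values
    spec = [0.9] * n_obs
    prob = [0.0] * n_vox

    for _ in range(iters):
        # E-step: probability that each voxel is truly scar.
        for i in range(n_vox):
            a, b = prior, 1.0 - prior
            for j in range(n_obs):
                if decisions[j][i]:
                    a *= sens[j]
                    b *= 1.0 - spec[j]
                else:
                    a *= 1.0 - sens[j]
                    b *= spec[j]
            prob[i] = a / (a + b) if a + b > 0 else 0.0
        # M-step: re-estimate each observer's performance.
        total_pos = sum(prob)
        total_neg = n_vox - total_pos
        for j in range(n_obs):
            if total_pos > 0:
                sens[j] = sum(w for w, d in zip(prob, decisions[j]) if d) / total_pos
            if total_neg > 0:
                spec[j] = sum(1.0 - w for w, d in zip(prob, decisions[j]) if not d) / total_neg
    return prob, sens, spec


def consensus(decisions, threshold=0.7):
    prob, _, _ = staple_binary(decisions)
    return [1 if w > threshold else 0 for w in prob]


observers = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]
truth = consensus(observers)
```

Voxels on which all observers agree end up with probabilities close to 0 or 1, while the disputed voxels are weighted by each observer's estimated sensitivity and specificity rather than by a simple vote.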

Common algorithms: n-SD and FWHM
Quantification of scar in LGE CMR images using a fixed model is often desirable and commonly used as it includes fewer image processing steps, with some studies advocating its reproducibility (Flett et al., 2011; Amado et al., 2004). In fixed models, scar is quantified by thresholding intensities at a fixed distance from a reference intensity value. Two types of fixed models were used, namely FWHM and the n-SD method. FWHM is a technique where half of the maximum intensity within a user-selected hyper-enhanced region is selected as the fixed intensity threshold for an ensuing region-growing step (Amado et al., 2004). In the region-growing step, infarcted regions are segmented based on user-selected seed points. These are used to initialise the region-growing step. The n-SD method (where n = 2, 3, 4, 5, 6) uses a fixed number of standard deviations from the mean signal within healthy myocardium. A manual region-of-interest (ROI) selection was required in both techniques. In FWHM, a ROI was delineated in hyper-intense myocardium. In n-SD, a ROI was delineated in remote myocardium. Remote myocardium was defined as a region with no enhancement and normal wall motion. Endocardial and epicardial surfaces were avoided in the delineation.
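The two fixed thresholds can be written down directly from the definitions above. The sketch below assumes the ROIs are given as flat lists of intensities and uses the population standard deviation, which is an assumption since the estimator is not specified; the region-growing step that follows FWHM thresholding is not included.

```python
import statistics


def n_sd_threshold(remote_roi, n):
    """Threshold at n standard deviations above the mean of remote myocardium."""
    return statistics.fmean(remote_roi) + n * statistics.pstdev(remote_roi)


def fwhm_threshold(hyper_enhanced_roi):
    """Threshold at half of the maximum intensity in the hyper-enhanced ROI."""
    return 0.5 * max(hyper_enhanced_roi)


remote = [10, 12, 8, 10]   # hypothetical remote-myocardium intensities
bright = [70, 80, 75]      # hypothetical hyper-enhanced intensities

thresholds = {f"{n}-SD": n_sd_threshold(remote, n) for n in (2, 3, 4, 5, 6)}
thresholds["FWHM"] = fwhm_threshold(bright)
```

Both thresholds depend entirely on where the ROI is drawn, which is one reason these methods are sensitive to observer input.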

Evaluation metrics
Segmentations from each algorithm were compared against the reference standard for scar. As no single metric is advocated as the best metric, two different types of metric were chosen for evaluating the segmentations. These were overlap and volumetric measures, and they are briefly described below:
1. Overlap metric: The Dice similarity is a metric for segmentation overlap measuring the proportion of true positives in the segmentation, D(X, Y) = 2|X ∩ Y| / (|X| + |Y|), where X is the segmented region in the ground truth and Y is the region in the challenger's algorithm.
2. Volumetric-based metric: The total volume error δV between the algorithm's output and the reference standard was found from V_T and V_G, where V_T is the volume of scar in the algorithm segmentation and V_G is the volume of scar in the consensus segmentation.
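Both metrics are short enough to state as code. In the sketch below, segmentations are represented as sets of voxel coordinates, the voxel volume is a hypothetical parameter, and the volume error is taken as an absolute difference, which is an assumption about the form of δV.

```python
def dice(ground_truth, segmentation):
    """Dice overlap 2|X ∩ Y| / (|X| + |Y|) between two sets of voxels."""
    if not ground_truth and not segmentation:
        return 1.0  # convention chosen here for two empty regions
    overlap = len(ground_truth & segmentation)
    return 2.0 * overlap / (len(ground_truth) + len(segmentation))


def volume_difference(segmentation, ground_truth, voxel_volume_ml):
    """Absolute scar volume difference in ml (assumed form of the volume error)."""
    v_t = len(segmentation) * voxel_volume_ml
    v_g = len(ground_truth) * voxel_volume_ml
    return abs(v_t - v_g)


consensus_scar = {(0, 0), (0, 1), (1, 0), (1, 1)}
algorithm_scar = {(0, 1), (1, 1), (2, 1)}

overlap_score = dice(consensus_scar, algorithm_scar)
delta_v = volume_difference(algorithm_scar, consensus_scar, voxel_volume_ml=0.5)
```

The two measures are complementary: a segmentation can match the reference volume closely while overlapping it poorly, and the Dice score exposes that case.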

Objective evaluation
In LGE CMR of the LV, hyper-enhanced areas not relating to scar are not uncommon (Turkbey et al., 2012). Unless the characteristics and geometry of these pseudo infarcts are explicitly modelled into the technique, it is challenging for an algorithm to distinguish them. Some common sources of pseudo infarcts seen in LGE CMR of the LV are: (1) the location of the RV insertion point, (2) partial voluming in basal slices due to the outflow tract and the appendage, and (3) hyper-enhanced areas due to epi- and pericardial fat. An experienced observer selected regions containing the aforementioned enhancements. These were identified using simple techniques such as checking for continuity of scar or artefact in the adjacent slices, i.e. if it continues then it is likely to be scar. Some instances of pseudo infarcts occurring in the patient dataset are shown in Fig. 2. To evaluate how the algorithms handled pseudo infarcts, each algorithm's output was evaluated separately on these regions. The percentage of voxels detected by each method in these spurious regions was determined.
A good contrast between normal myocardium, blood pool and infarct is challenging and greatly depends on achieving the optimal inversion time. Each scan in the image database was scored by five raters experienced in LGE CMR images. The rating with maximum votes determined the scan's rating. Scans in the image database were ranked into three categories: good, average and poor. The Dice metric was computed separately in each category. This indicated how robust the algorithms were against contrast enhancement quality.
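The quality stratification can be sketched as a majority vote followed by a per-category summary. The data layout, the use of the median as the summary and the tie-breaking rule (first label encountered) are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from statistics import median


def majority_rating(votes):
    """Return the rating with the most votes; ties go to the first label seen."""
    return Counter(votes).most_common(1)[0][0]


def median_dice_by_category(scans):
    """Group Dice scores by each scan's majority rating and take the median.

    `scans` is an iterable of (votes, dice) pairs, one pair per scan.
    """
    grouped = defaultdict(list)
    for votes, dice_score in scans:
        grouped[majority_rating(votes)].append(dice_score)
    return {category: median(scores) for category, scores in grouped.items()}


# Hypothetical ratings from five raters and Dice scores for three scans.
scans = [
    (["good", "good", "average", "good", "poor"], 80),
    (["poor", "poor", "poor", "average", "good"], 40),
    (["good", "good", "good", "good", "good"], 70),
]
summary = median_dice_by_category(scans)
```

With five raters and three categories a tie is possible (for example two, two and one votes), so a real analysis would need an explicit rule for that case.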

Segmentation accuracy against consensus ground truth
On the patient and porcine LGE CMR scans, segmentations from the algorithms were compared to the consensus ground truth. A consensus was available by combining segmentations from three separate observers as described in Section 2.7.1. Segmentation accuracies measured using the Dice metric are shown in Fig. 3 for the patient dataset. The Dice overlaps between algorithm and consensus were determined on an automatically-determined region-of-interest (ROI) enclosing each individual region of infarction labelled in the consensus. The medians of these individual Dice overlaps were as follows: AIT= 73, KCL= 74, MCG= 85, MV= 44, and UPF= 70. Fixed model approaches for segmenting scar (i.e. n-SD and FWHM) were also compared with the consensus ground truth. The median Dice overlaps were: 2-SD= 47, 3-SD= 54, 4-SD= 55, 5-SD= 62, 6-SD= 64, FWHM= 78. An example of a single slice from the patient dataset is shown in Fig. 5.
On the porcine LGE CMR scans, segmentations from the algorithms and fixed-model approaches were compared in a similar way to the patient dataset. The Dice overlap metric is plotted in Fig. 4 for each submitted algorithm and fixed model. The Dice overlaps were determined, as above, on ROIs enclosing each region of infarction labelled in the consensus. The medians of these individual Dice overlaps were as follows: AIT= 86, KCL= 80, MCG= 73, MV= 33, and UPF= 73. Standard methods using fixed models were also compared with the consensus ground truth and the median Dice overlaps were: 2-SD= 64, 3-SD= 65, 4-SD= 67, 5-SD= 74, 6-SD= 76, FWHM= 69. An example of a single slice from the porcine dataset is given in Fig. 6.
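The region-wise Dice used for both datasets can be sketched as follows. This is an illustration under stated assumptions rather than the released evaluation code: each infarct region is taken to be a face-connected component of the consensus, and the ROI is assumed to be that region's bounding box enlarged by a hypothetical `margin`; how the ROI was determined automatically in the framework is not specified here.

```python
from collections import deque


def connected_regions(voxels):
    """Split a set of index tuples into face-connected regions."""
    remaining = set(voxels)
    regions = []
    while remaining:
        seed = remaining.pop()
        region, queue = {seed}, deque([seed])
        while queue:
            current = queue.popleft()
            for axis in range(len(current)):
                for step in (-1, 1):
                    neighbour = (current[:axis] + (current[axis] + step,)
                                 + current[axis + 1:])
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        region.add(neighbour)
                        queue.append(neighbour)
        regions.append(region)
    return regions


def bounding_box(region, margin=0):
    """Per-axis (low, high) limits of a region, enlarged by a margin."""
    dims = range(len(next(iter(region))))
    return [(min(v[d] for v in region) - margin,
             max(v[d] for v in region) + margin) for d in dims]


def inside(voxel, box):
    return all(low <= c <= high for c, (low, high) in zip(voxel, box))


def localized_dice(consensus, segmentation, margin=1):
    """One Dice score per consensus region, computed only inside that region's ROI."""
    scores = []
    for region in connected_regions(consensus):
        box = bounding_box(region, margin)
        x = {v for v in consensus if inside(v, box)}
        y = {v for v in segmentation if inside(v, box)}
        scores.append(2.0 * len(x & y) / (len(x) + len(y)))
    return scores


# Hypothetical 2D example: two consensus regions and a distant false positive.
consensus = {(0, 0), (0, 1), (5, 5)}
segmentation = {(0, 0), (0, 1), (5, 5), (9, 9), (9, 10)}
print(sorted(localized_dice(consensus, segmentation)))
```

In this toy case both regions score a Dice of 1.0 even though the segmentation contains two extra voxels at (9, 9) and (9, 10), because those voxels lie outside every ROI.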
The Dice scores reported above were evaluated within ROIs enclosing scar in the consensus segmentation. These areas can often be large sections within the image, especially if the scar is continuous and extends to several slices. This provided for a more objective evaluation. However, an algorithm's false positives outside the ROIs are not accounted for. To counteract this issue, segmentations were also compared by quantifying volume differences. This was determined by measuring the difference in total volume of scar between the consensus and algorithm segmentation. An algorithm could be deemed accurate only when it yielded a good Dice together with a small volume difference. Table 4 lists the mean volume differences and variance (in millilitres) over the entire image database for patient and porcine datasets.

Table 4
Segmentation accuracy with volume difference (δV) on patient and porcine data for submitted algorithms and fixed-models. The standard deviation of each metric is quoted in brackets. Column headings: patient data and porcine data, each reporting |δV| (ml).
For a further objective evaluation, the Dice overlaps of the algorithms' segmentations with the consensus were compared based on slice position (basal, mid and apical). Short-axis slices were subdivided according to the standard guidelines (Cerqueira et al., 2002). The results are plotted in Fig. 7. It is not clear what constitutes a good Dice overlap for datasets of this type. To address this issue, the degree of agreement between observers and the computed consensus was analysed and plotted in Fig. 8. It provided an estimate of a reasonable target (i.e. a good Dice score) for the evaluated algorithms.
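Grouping slices by position can be sketched as follows. The sketch assumes that the short-axis slices are ordered from base (index 0) to apex and that the stack is split into equal thirds; both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def slice_position(slice_index, n_slices):
    """Label a short-axis slice as basal, mid or apical by equal thirds.

    Assumes slices are ordered from base (index 0) to apex.
    """
    third = n_slices / 3.0
    if slice_index < third:
        return "basal"
    if slice_index < 2 * third:
        return "mid"
    return "apical"


def group_slices(n_slices):
    """Map each position label to the list of slice indices it contains."""
    groups = defaultdict(list)
    for index in range(n_slices):
        groups[slice_position(index, n_slices)].append(index)
    return dict(groups)


print(group_slices(10))
```

A slice-based Dice can then be computed over the slices in each group and compared across positions.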

Pseudo infarct regions
The algorithms were evaluated on hyper-enhanced regions which mimic scar. These pseudo infarct regions occur for several reasons mentioned in Section 2.7.4 and illustrated in Fig. 2. In each image, pseudo infarct was manually segmented by an experienced observer. These regions were confirmed either anatomically, in the case of the outflow tract, or by checking adjacent slices for scar continuity, in the case of partial voluming. In each image, the total volume of pseudo infarct labelled by the observer was quantified. The total volume of these spurious infarct regions present in each algorithm and fixed model segmentation was also quantified by comparing each segmentation to the manual labellings of pseudo infarcts. Results are represented in Fig. 10. KCL and MCG detected a higher proportion of manually labelled pseudo infarct on average than the other methods, at 21% and 23%, respectively, of the pseudo infarct labelled by the observers. This is in comparison to MV, AIT and UPF with only 3%, 9% and 3%, respectively. The fixed models 2-, 3-, 4-, 5- and 6-SD and FWHM contained 53%, 44%, 36%, 30%, 24% and 23%, respectively, of the manually labelled pseudo infarct volume. Pseudo infarcts were most successfully avoided by the MV and UPF algorithms and least successfully by the 2-, 3-, 4- and 5-SD methods.
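The pseudo infarct measure can be sketched as the percentage of manually labelled pseudo infarct voxels that a segmentation includes. The sketch assumes uniform voxel size, so that a voxel proportion equals a volume proportion; the function name and the toy masks are hypothetical.

```python
def pseudo_infarct_percentage(pseudo_infarct, segmentation):
    """Percentage of labelled pseudo-infarct voxels that the segmentation includes."""
    labelled = set(pseudo_infarct)
    if not labelled:
        return 0.0  # assumption: no labelled pseudo infarct means nothing to detect
    return 100.0 * len(labelled & set(segmentation)) / len(labelled)


# Hypothetical example: ten pseudo-infarct voxels, three of them segmented.
pseudo = {(0, 0, column) for column in range(10)}
segmentation = {(0, 0, 0), (0, 0, 1), (0, 0, 2), (1, 1, 1)}
print(pseudo_infarct_percentage(pseudo, segmentation))  # prints 30.0
```

A lower percentage indicates that a method was better at avoiding the spurious enhancement.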

Image quality on segmentation
The LGE CMR images in the database were acquired at different imaging centres with differing protocols and scanners (see Table 2). The quality of enhancement is known to vary and depends on a number of factors including optimal inversion times, signal-to-noise (SNR) and contrast-to-noise (CNR) ratios. The images in the database were qualitatively rated by five observers experienced in LGE. Images were rated as poor, average or good depending on the overall quality of the image. The Dice overlap was measured separately in each category and the results are given in Table 5. In both the good and average categories, 40% of the images came from the patient dataset and 60% from the porcine dataset; in the poor category, 75% came from the patient dataset and 25% from the porcine dataset. A representative set of images for each quality is shown in Fig. 9.

Table 5
Analysis of segmentation accuracy based on image quality (good, average and poor) on human and porcine datasets combined. The mean, standard deviation (SD) and median of the Dice for each challenger (AIT to UPF) and fixed-model method (2-SD to FWHM) are quoted.

Discussion
We have presented a framework which standardises evaluation of algorithms for segmenting scar in the LV. The framework was used to evaluate and compare five algorithms and six separate fixed model thresholding approaches (i.e. n-SD and FWHM). The algorithms were submitted as part of the STACOM challenge, a workshop organised at MICCAI in 2012. The data is publicly available via the website at: https://www.cardiacatlas.org/web/guest/ventricular-infarction-challenge.

Evaluation framework
The presented evaluation framework comprises both human and animal LV LGE CMR datasets and their respective myocardial segmentation masks. Human datasets were acquired from patients with a history of ischaemic cardiomyopathy. The animal datasets were acquired in a pig model of myocardial infarction induced by coronary stenosis. Datasets were also acquired using different scanner vendors and resolutions. The human datasets were acquired with a 1.5T Philips scanner and the animal datasets were acquired with a 3T Siemens scanner. There were both 2D and 3D (non-isotropic) acquisitions. This ensured that algorithms evaluated on the framework were not biased to a specific acquisition protocol, scanner vendor or resolution. The proposed framework provides data acquisitions that are both commonly used and modern, making it suitable for testing and evaluating state-of-the-art algorithms.

Fig. 9. Images in the patient and porcine datasets that are representative of good, average and poor quality images. The arrow labels indicate sites of possible infarction as labelled by an observer. There are two images shown for every quality.

It is often challenging to establish ground truth on infarcted regions in LGE CMR. This makes algorithm evaluation difficult. The framework addresses this issue by proposing a reference standard against which the algorithms can be reliably evaluated. To achieve a reference standard, the human and animal datasets were manually segmented by three experienced observers provided with epi- and endocardial boundaries and a set of guidelines. Although their delineations were consistent, some differences remained. The three expert delineations were therefore combined into a consensus segmentation using the STAPLE algorithm (Warfield et al., 2004), which derives the consensus from a probabilistic estimate of the true segmentation. The degree of agreement between observers and the computed consensus was analysed in Fig. 8; this not only allows the assessment of agreement but also provides a quantitative estimate of a good Dice score in such datasets. In addition to the reference standard for scar, six commonly used and established fixed thresholding models were used to see how they compare with the algorithms. These were the n-SD (where n = 2, 3, 4, 5, 6) and FWHM methods (Amado et al., 2004; Schmidt et al., 2007). The FWHM method was implemented as described in Amado et al. (2004), where the user clicked on hyper-enhanced regions within myocardium and an ensuing multi-pass region growing algorithm segmented infarct using the FWHM criterion.
Algorithms are often evaluated on various different metrics. This makes comparison of algorithms challenging. Most of the methods surveyed in Table 1 either use LGE volume or represent it as a percentage to evaluate detected enhancement (for example in Flett et al. (2011); Harrison et al. (2014)), or compare the amount of overlap with manual segmentation using the Dice metric (for example in Tao et al. (2010); Ravanelli et al. (2014)). The framework evaluated algorithms on both scales: volume and Dice metric. For the Dice metric, segmentations were evaluated on individual infarcted regions in the image. A Dice metric on the entire image has its pitfalls, as it is difficult to ascertain within which local regions algorithms fail or succeed. This was addressed using a localised Dice evaluation strategy. Future algorithms tested on the framework will be subjected to the same metrics, enabling algorithms and their segmentations to be compared in a reliable manner.
The presence of pseudo infarct, which mimics scar in LGE CMR images, poses various challenges for algorithms. Most earlier algorithms have not addressed it or incorporated it into their segmentation models. The framework provided delineations of pseudo infarct regions from an experienced observer. Algorithms were assessed on the proportion of false positives due to pseudo infarct regions. This has allowed a more objective evaluation within this framework. The n-SD and FWHM fixed models segmented a large proportion of the pseudo infarct labelled by the observer. The algorithms segmented significantly fewer pseudo infarcts than the fixed models (paired t-test, p < 0.05). Furthermore, images in the database were qualitatively rated for their quality by five different observers. Algorithms' segmentations were also evaluated separately based on the image's rating.
The proposed framework has several limitations. An important limitation is that the framework cannot be used to directly evaluate clinical utility or anatomic accuracy of the algorithms. This is because the reference standard does not include any information about outcomes (for the patient data set) or histology (for the pig data set). Another limitation is the size of the image database, which is 30 images, of which 20 can be used for testing and 10 for training. However, within this small sample, it provides a range of datasets from different scanner vendors, scanner resolutions and cohorts.
A second limitation is the dimensionality of the dataset. The human datasets are 2D acquisitions with 8 mm slice thickness. 2D images are commonly employed clinically for treatment stratification. For example, based on the infarct volume and ejection fraction from 2D images, a patient could be subjected to certain therapeutic strategies, such as implantable cardioverter defibrillator (ICD) implantation or ventricular ablation. 3D images provide more detailed quantification of infarct, and only the porcine datasets within this framework are 3D non-isotropic acquisitions. A third limitation is the manner in which the Dice metric is computed individually on each region of infarction labelled by the consensus. The Dice is computed only within ROIs enclosing each consensus-labelled infarct. Segmentation errors outside these regions are not accounted for by the Dice. Thus, algorithms which over-segment can still exhibit a good Dice but a poor volume error. The Dice needs to be combined with the volume error to give a clearer understanding.
Intensity variation across the images due to coil shading may have an impact on segmentation, especially for methods which process absolute signal intensities. A coil sensitivity scan is a routine part of the acquisition protocol used to acquire the datasets of this study. However, no further coil sensitivity correction was carried out. This was in line with the principle of this study to use only routine MRI scans.
A final limitation is that only one observer was employed to segment the myocardial masks. The observer was a cardiologist with several years of experience in CMR assessment of LV function and ischaemic heart diseases. The issue of variability with different myocardial masks is counteracted by providing the human observers with these masks. The algorithms are also provided with the same masks. This ensures that only infarct within the mask is labelled and computed. Thus, the evaluation is only carried out in the myocardial mask space.
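The idea behind the consensus step can be sketched with a minimal binary expectation-maximisation scheme in the spirit of STAPLE (Warfield et al., 2004). This is an illustrative sketch, not the implementation used to build the reference standard: the initial sensitivity and specificity of 0.99, the fixed number of iterations, the single global prior and the 0.5 cut-off on the resulting probabilities are all assumptions, and the three toy observer labellings are hypothetical.

```python
def staple(decisions, iterations=50, init=0.99):
    """Binary STAPLE-style EM.

    decisions[j][i] is observer j's label (0 or 1) for voxel i.
    Returns (w, p, q): per-voxel probability of scar, and each observer's
    estimated sensitivity p[j] and specificity q[j].
    """
    n_raters, n_voxels = len(decisions), len(decisions[0])
    # Global prior: fraction of all observer decisions that are positive.
    prior = sum(map(sum, decisions)) / (n_raters * n_voxels)
    p = [init] * n_raters
    q = [init] * n_raters
    w = [prior] * n_voxels
    for _ in range(iterations):
        # E-step: probability that each voxel is truly scar.
        w = []
        for i in range(n_voxels):
            a, b = prior, 1.0 - prior
            for j in range(n_raters):
                if decisions[j][i]:
                    a *= p[j]
                    b *= 1.0 - q[j]
                else:
                    a *= 1.0 - p[j]
                    b *= q[j]
            w.append(a / (a + b) if a + b > 0 else prior)
        # M-step: re-estimate each observer's performance parameters.
        total_w = sum(w)
        total_not_w = n_voxels - total_w
        p = [sum(w[i] for i in range(n_voxels) if decisions[j][i]) / total_w
             for j in range(n_raters)]
        q = [sum(1.0 - w[i] for i in range(n_voxels) if not decisions[j][i]) / total_not_w
             for j in range(n_raters)]
    return w, p, q


# Hypothetical labellings of eight voxels by three observers.
observers = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]
probabilities, sensitivity, specificity = staple(observers)
consensus = [int(value >= 0.5) for value in probabilities]
print(consensus)
```

In this toy case the third voxel, labelled by two of the three observers, is kept in the consensus, whereas the fourth voxel, labelled by only one observer, is not.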

Evaluated algorithms
Table 6
The mean infarct volume (in millilitres) and average number of regions (i.e. infarct) per slice in the consensus segmentation.

                              Patient data    Porcine data
Mean infarct volume (ml)      5.38 (6.73)     13.81 (8.70)
Average regions per slice     1.2 (0.5)       1.0 (0.1)

Quantifying infarct in the LV can have important clinical implications. A 3D rendering of the LV with infarct areas can be integrated into electroanatomical systems for facilitating catheter ablation. As the resolution and SNR of LGE CMR continue to improve, detailed quantification of infarct is becoming possible. The pitfalls of fixed thresholding models advocated in past literature (Amado et al., 2004; Kim et al., 1999) have been highlighted in recent studies (Harrison et al., 2014). A fixed threshold model makes crude assumptions about the contrast levels between nulled blood pool and infarct, deeming a fixed cut-off threshold sufficient. However, as these contrast levels are directly dependent on the inversion time selected in LGE CMR, the preset threshold often requires user readjustments.
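The fixed thresholding models can be sketched as simple intensity cut-offs. The sketch assumes that the n-SD cut-off is the mean of remote (healthy) myocardium plus n sample standard deviations, and that the FWHM cut-off is half of the maximum intensity in a user-identified hyper-enhanced region. It applies the cut-off to every myocardial voxel and omits the region growing step of the FWHM method; the intensities, the function names and the choice of `>=` are hypothetical.

```python
from statistics import mean, stdev


def nsd_threshold(remote_intensities, n):
    """n-SD cut-off: mean of remote myocardium plus n standard deviations."""
    return mean(remote_intensities) + n * stdev(remote_intensities)


def fwhm_threshold(hyperenhanced_intensities):
    """FWHM cut-off: half of the maximum intensity in the hyper-enhanced region."""
    return 0.5 * max(hyperenhanced_intensities)


def threshold_mask(myocardium, cutoff):
    """Label myocardial voxels with intensity at or above the cut-off as infarct."""
    return {voxel for voxel, intensity in myocardium.items() if intensity >= cutoff}


# Hypothetical intensities: remote myocardium samples and a small myocardial mask.
remote = [100, 110, 90, 105, 95]
myocardium = {(0, 0): 100, (0, 1): 120, (0, 2): 140, (0, 3): 300, (0, 4): 280}

for n in (2, 6):
    print(n, sorted(threshold_mask(myocardium, nsd_threshold(remote, n))))
print("FWHM", sorted(threshold_mask(myocardium, fwhm_threshold([300, 280]))))
```

With these toy values the 2-SD cut-off labels four voxels, including two only mildly brighter than remote myocardium, whereas the 6-SD and FWHM cut-offs label only the two brightest voxels. This shows how strongly the result depends on the preset threshold.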
The algorithms were evaluated based on the slice position (basal, mid and apical) (see Fig. 7). In the analysis, there was no significant difference between the basal and mid slices. The apical slices showed better overlap for some algorithms. However, apical slices enclose a smaller myocardial area and thus the overlap assessments in these regions can be biased. It is also important to note that the Dice overlap used here was slice-based and not region-based as in the other results in this work. In general, it is difficult to ascertain what constitutes a good Dice score for datasets of the nature included in this study. The analysis of agreement between the observers' segmentations (see Fig. 8) provides a reasonable estimate and target for the algorithms.
Comparing the algorithms with the commonly used fixed-model methods is important. The difference with FWHM remains small except for MCG, which was able to provide high accuracy in the patient dataset, and AIT, which provided the same in the porcine set. Both methods have considerable strengths, with the former using a state-of-the-art probabilistic technique for image segmentation, and the latter benefitting from post-processing steps which rectify the segmentation. The Dice results reflect the strengths of these methods. On the patient datasets, algorithms AIT, MCG and UPF performed similarly, while KCL and MV also performed similarly to each other but with a lower average Dice. This was due to greater variability in Dice for KCL and MV. AIT and UPF are both capable of rectifying errors in their segmentations with post-processing steps: AIT employs level sets following SVM classification and UPF employs shape discriminants. Both KCL and MV rely heavily on their core segmentation process, with no post-processing. As a result, spurious regions are included. Models that are sub-optimal were able to benefit from post-processing.
The algorithms were also evaluated on the total infarct volume they segmented (see Table 4) and these volumes were compared to the consensus volumes. This is important as the Dice computed in this work has the aforementioned limitations. Also, when evaluating the myocardium, quantification of infarct volume is an important step. The average volume error in the challengers' algorithms was 1.04 ml and 0.76 ml for the patient and porcine datasets, respectively (from Table 4). This was low compared to the overall average infarct volume in the datasets (see Table 6).
The algorithms evaluated on the framework have common traits: most employ region-based image processing techniques, for example level-set (AIT), region-growing (UPF and FWHM) and watershed (MV). This is justifiable as the algorithms are meant to segment infarcts, which are contiguous regions. However, key considerations, such as the shape of candidate regions, are not always taken into account. UPF searches for regions that are elongated, as this is a strong characteristic of LV infarcts. A second important consideration is the seed selection step. If only a single seed is allowed per slice for capturing the infarct (for example UPF, see Table 3), other infarct areas on the same slice cannot be included. The average number of infarct regions per slice was computed for both patient and porcine datasets in Table 6. With the average number of regions found to be 1.2 in the patient dataset, more than a single seed may be necessary.
A further consideration is the spatial positioning of the scar candidate in relation to the image slices or the 17-segment model of the AHA (Cerqueira et al., 2002). Enhancement in the basal slices due to the outflow tract or RV insertion point should be discriminated as a pseudo infarct. None of the algorithms or fixed models have classified enhancement based on its location. Thus, pseudo infarcts have not been addressed in the evaluated methods.
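The effect of the seed selection step can be illustrated with a generic seeded region growing sketch. It is not the UPF method: the 4-connectivity, the fixed intensity cut-off, the function name and the toy slice are all assumptions made for illustration.

```python
from collections import deque


def region_grow(image, seeds, cutoff):
    """Grow 4-connected regions from the seeds over pixels with intensity >= cutoff.

    image maps (row, column) to intensity; pixels absent from it are background.
    """
    lowest = float("-inf")
    queue = deque(seed for seed in seeds if image.get(seed, lowest) >= cutoff)
    grown = set(queue)
    while queue:
        row, col = queue.popleft()
        for neighbour in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
            if neighbour not in grown and image.get(neighbour, lowest) >= cutoff:
                grown.add(neighbour)
                queue.append(neighbour)
    return grown


# Hypothetical slice with two separate bright regions along one row.
intensities = [200, 210, 50, 50, 220, 205, 50]
image = {(0, column): value for column, value in enumerate(intensities)}

single_seed = region_grow(image, [(0, 0)], cutoff=150)
two_seeds = region_grow(image, [(0, 0), (0, 4)], cutoff=150)
print(sorted(single_seed))
print(sorted(two_seeds))
```

With a single seed only the first bright region is captured; the second region on the same slice is recovered only when a second seed is placed in it.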
A third consideration is the extent of scarring. Sub-classification of infarct as sub-endocardial, mid-wall and epicardial helps stratify treatment. But first and foremost, these formations are indicative of scar, one which the algorithms should be able to distinguish based on Euclidean distances measured on the myocardium segmentation. Equipped with this information, algorithms should be able to better distinguish scar, especially when enhancements arise due to partial voluming or a fat-related cause.
LGE CMR for the LV can be acquired either in 2D or 3D, with the former being more common as 2D images can be obtained relatively quickly. However, 3D acquisitions are preferred over 2D when post-processing involves detailed quantification. As scanner engineering and technology continue to improve, 3D acquisitions will become more common. All algorithms evaluated within this framework, except UPF, and those surveyed in Table 1 use 3D techniques that also work on 2D datasets. The UPF technique performs region-growing with seed selection on a slice-by-slice basis. For the porcine 3D datasets, it works on a chosen slice orientation (x, y or z), which increases the load on the operator, who must select seeds in each slice of the 3D volume. The framework supplies both types of acquisition to enable future algorithms to be evaluated separately.

Future algorithms
Infarct quantification in the LV is an important assessment criterion for many cardiac therapies. Furthermore, heterogeneity within infarct, especially in the peri-infarct regions, was shown to be a predictor of tachycardia and sudden cardiac death (Schmidt et al., 2007). This work proposes an evaluation framework for future algorithms which segment and quantify LV infarct. To demonstrate its usability, five different algorithms were evaluated on the framework. Three of these have been published (Hennemuth et al., 2008; Karimaghaloo et al., 2012; Karim et al., 2014). Six different fixed-model approaches were also evaluated. The framework provides thirty datasets, of which ten are for algorithm training and the rest for testing. Although they represent a specific pulse sequence, some algorithms evaluated here could be re-trained on new sequences. The consensus ground truths are derived from manual segmentations of three separate observers. Future algorithms can be evaluated either objectively with overlap metrics or, less objectively and more conventionally, with pixel volumes. Most importantly, they can be compared and benchmarked against existing algorithms. To our knowledge, this is the first proposed framework for evaluating LV infarct segmentation and quantification algorithms from LGE CMR images. For the left atrium, a benchmarking evaluation framework already exists (Karim et al., 2013).

Conclusions
CMR continues to play an important role in imaging and quantifying infarct in the LV. Several algorithms have been proposed for its quantification but it is not clear how they compare or perform relative to one another. Furthermore, algorithms have only been tested on centre- and vendor-specific images. The translation of such algorithms into the clinical environment thus remains challenging. Benchmarking frameworks, providing a common dataset and evaluation strategies, are important for clinical translation of these algorithms. The proposed benchmarking framework provides thirty datasets, with fifteen datasets in each cohort: patient and porcine. Datasets in the two separate cohorts were acquired using different scanner vendors and field strengths (1.5T and 3T), resolutions and acquisition protocols (2D and 3D). The ground truth is often absent in such datasets, and to this end, the framework provides a consensus ground truth derived from expert observers. The proposed framework remains publicly available for accessing the image database, uploading segmentations for evaluation and contributing manual segmentations for improving the consensus ground truth on the datasets.