Simultaneous left atrium anatomy and scar segmentations via deep learning in multiview information with attention

Three-dimensional late gadolinium enhanced (LGE) cardiac MR (CMR) of left atrial scar in patients with atrial fibrillation (AF) has recently emerged as a promising technique to stratify patients, to guide ablation therapy and to predict treatment success. This requires a segmentation of the high intensity scar tissue and also a segmentation of the left atrium (LA) anatomy, the latter usually being derived from a separate bright-blood acquisition. Performing both segmentations automatically from a single 3D LGE CMR acquisition would eliminate the need for an additional acquisition and avoid subsequent registration issues. In this paper, we propose a joint segmentation method based on multiview two-task (MVTT) recursive attention model working directly on 3D LGE CMR images to segment the LA (and proximal pulmonary veins) and to delineate the scar on the same dataset. Using our MVTT recursive attention model, both the LA anatomy and scar can be segmented accurately (mean Dice score of 93% for the LA anatomy and 87% for the scar segmentations) and efficiently (∼0.27 s to simultaneously segment the LA anatomy and scars directly from the 3D LGE CMR dataset with 60–68 2D slices). Compared to conventional unsupervised learning and other state-of-the-art deep learning based methods, the proposed MVTT model achieved excellent results, leading to an automatic generation of a patient-specific anatomical model combined with scar segmentation for patients in AF.


INTRODUCTION
Three-dimensional late gadolinium enhanced (LGE) cardiac MR (CMR) of left atrial (LA) scars in patients with atrial fibrillation (AF) has recently emerged as a promising technique to stratify patients, to guide ablation therapy and to predict treatment success [1][2][3]. Visualisation and quantification of LA scar tissue from LGE CMR require a segmentation of the LA anatomy (including proximal pulmonary veins (PV)) and a segmentation of the LA scars [4]. In clinical practice, the LA anatomy and LA scars are generally segmented manually by radiologists, which is time-consuming, subjective and lacks reproducibility [5]. Therefore, automatic LA anatomy and LA scar segmentation methods are in high demand for improving the clinical workflow.
Automatic segmentations of LA anatomy and LA scar from LGE CMR are very challenging tasks due to the low visibility of the LA boundaries and the small discrete regions of the LA scars.
The LGE CMR technology is widely used to visualise scar tissues by enhancing their signal intensities, while the nulling of signals from healthy tissue reduces the visibility of the LA boundaries [6]. Moreover, LA scars occupy only a small portion of the LA wall and are distributed discretely; therefore, detection of LA scars is highly susceptible to noise interference. In the AF patient population, prolonged scanning time, irregular breathing pattern and heart rate variability during the scan can result in poor image quality, which can further complicate both segmentation tasks. Because of these issues, previous studies have segmented the LA anatomy from an additional bright-blood data acquisition, and have then registered the segmented LA anatomy to the LGE CMR data for visualisation and delineation of the LA scars [7][8][9]. This approach is complicated by motion (bulk, respiratory or cardiac) between the two acquisitions and subsequent registration errors. Furthermore, it is based on a two-phase framework. It is inadequate to achieve accurate and efficient estimation for the LA scars because the LA anatomy and LA scars segmentations are handled separately, and no feedback connection exists between them during the algorithm training.
To address the above problems, we propose a fully automated multiview two-task (MVTT) recursive attention model to segment LA anatomy and LA scars from LGE CMR simultaneously.
In the same way that reporting clinicians typically step through 2D axial slices to find correlated information while also using complementary information from orthogonal views, we parse 3D LGE CMR images into continuous 2D slices and apply 2D convolutions instead of a 3D convolution. Our proposed MVTT method mainly consists of a multiview learning network and a dilated attention network. The multiview learning network learns the correlation between 2D axial slices by a sequential learning subnetwork. At the same time, two dilated residual subnetworks learn the complementary information from the sagittal and coronal views. Then, we integrate the two kinds of complementary information into the axial slice features to obtain fused multiview features to achieve the segmentation of the LA anatomy. Since LA scars are very small, the dilated attention network learns an attention map from the image to force our network to focus on these small regions and to reduce the influence of background noise. In our proposed MVTT, the LA anatomy and LA scars share the multiview features to handle the two segmentation tasks, which can mitigate the error accumulation problem.

Major contributions of this article are as follows:
 An MVTT framework is proposed to provide clinicians with the segmented LA anatomy and LA scars directly and simultaneously from the LGE CMR images, avoiding the need for an additional data acquisition for anatomical segmentation and subsequent registration errors.
 A multiview learning network is presented to fuse multiview features. It mainly correlates the 2D axial slices while integrating complementary information from orthogonal views to relieve the loss of 3D spatial information.
 A dilated attention model is presented to force our network to focus on the small targets of LA scars. It mainly learns an attention map for the localisation and representation of the LA scars but neglects high intensity signals from noise.

Segmentation of the LA Anatomy
The LA anatomy would ideally be segmented from the cardiac and respiratory-gated LGE CMR dataset that is used to segment the scar tissue. However, this is difficult as the nulling of signal from healthy tissue reduces the visibility of the LA wall boundaries. Other options are to segment the anatomy from a separately acquired breath-hold magnetic resonance angiogram (MRA) study [8][10] or from a respiratory and cardiac gated 3D balanced steady state free precession (b-SSFP) acquisition [4][7][9]. While MRA shows the LA and PV with high contrast, these acquisitions are generally un-gated and usually acquired in an inspiratory breath-hold. The anatomy extracted from MRA can therefore be highly deformed compared to that in the LGE CMR study. Although a 3D b-SSFP acquisition takes longer to acquire, it is in the same respiratory phase as the LGE CMR and the extracted anatomy can be better matched.
Ravanelli et al. [10] manually segmented the LA wall and PV from MRA images, for which both efficiency and accuracy have been achieved. The segmented LA and PV were then mapped to the 3D LGE CMR dataset and this was followed by a thresholding based segmentation of the LA scars. Recently, Tao et al. [8] combined atlas based segmentation of LGE CMR and MRA to define the cardiac anatomy. After image fusion of the LGE CMR and MRA, accurate LA chamber and PV segmentation was achieved by a level set based local refinement, based on which an objective LA scars assessment is envisaged in future development.
Instead of using MRA, Karim et al. [7] used a respiratory and cardiac gated 3D b-SSFP acquisition to define the cardiac anatomy. This was resolved using a statistical shape model, and the LA scars were then segmented using a graph-cut model assuming that the LA wall is ~3 mm from the endocardial border obtained from the LA geometry extraction. In Yang et al. [4][9], LA anatomy was derived by a whole heart segmentation (WHS) method [11] applied to the 3D b-SSFP data, and was then propagated to the corresponding LGE CMR images. All of these methods, which rely on a second bright-blood dataset (either MRA or 3D b-SSFP), are complicated by motion (bulk, respiratory or cardiac) between the two acquisitions and suffer from subsequent registration errors.
More recently, convolutional neural network (CNN) based approaches have been proposed to segment the LA and PV [12][13][14][15][16] and a grand challenge has been held for LA anatomy segmentation [17]. These research studies on LA anatomy segmentation can potentially be useful for LA scars segmentation although, to the best of our knowledge, this has not been done to date.

Segmentation of the LA Scar
For segmentation of scar tissue within the LA, Oakes et al. [18] analysed the intensity histogram within the manually segmented LA wall to determine a threshold above mean blood pool intensity for each slice within the 3D LGE volume. In an alternative approach, Perry et al. [19] applied k-means clustering to quantitatively assess normal and scarred tissue from manual LA wall segmentation. A grand challenge was carried out for the evaluation and benchmarking of various LA scars segmentation methods, including histogram analysis, simple and advanced thresholding, k-means clustering, and graph-cuts [1]. Although these pioneering studies have shown promising results on the segmentation and quantification of LA scars using LGE CMR images, most have relied on manual segmentation of the LA wall and PV from a second dataset (MRA or 3D b-SSFP).
This has several drawbacks: (1) it is a time-consuming task; (2) there are intra- and inter-observer variations; (3) it is less reproducible for a multi-centre and multi-scanner study; and (4) there are registration errors between the LA and PV segmentation from a second dataset and the LGE CMR acquisition. Inaccurate segmentation of the LA wall and PV can further complicate the delineation of the LA scars, making their quantification error-prone. This is potentially one of the reasons that there are currently on-going concerns regarding the correlation between LA scars identified by LGE CMR (enhanced regions) and electro-anatomical mapping systems (low voltage regions) [20][21] used during an electrophysiology procedure. Yang et al. [4][22] proposed a supervised learning based method (using Support Vector Machine or Autoencoder) to delineate LGE regions that were initially over-segmented into super-pixel patches. Although this method achieved high accuracy in LA scars segmentation fully automatically, the scar boundaries and continuity of the LA scars in 3D could be affected due to this 2D slice by slice processing. As for the LA anatomy delineation, deep learning based architectures, e.g., U-Net [23] and V-Net [24], have been proposed to solve semantic segmentation for many computer vision and medical image analysis problems including segmentation of the LA anatomy; however, to the best of our knowledge, they have not yet been developed and validated for LA scars segmentation.

PROPOSED METHODS
Our work mimics the inspection process of radiologists who step through 2D axial slices to find correlated information while also using complementary information from orthogonal views.
Hence, we slice the 3D LGE CMR volume into contiguous 2D slices and perform 2D slice segmentation. This has two major advantages: (1) it increases the number of training data samples and (2) 2D convolution has better memory efficiency. The workflow of our MVTT is summarised in Figure 1. It consists of three major subnetworks (a multiview learning network, a dilated residual network and a dilated attention network) that perform the segmentations of the LA and proximal PV and LA scars automatically and simultaneously.

Multiview Learning Network for Feature Fusion
We slice the 3D LGE CMR volume into many 2D slices, thus losing the spatial correlation between these 2D slices. In order to learn the complementary information from the sagittal and coronal views, we propose to use a full CNN with shortcut connections that is similar to the residual network [25]. To reduce the information loss, we introduce the dilated convolution [26], which can efficiently increase the receptive field while keeping the size of the feature map unchanged. In addition, it can aggregate multiscale contextual information with the same number of parameters. However, standard dilated convolution can cause a gridding problem. We alleviate the gridding problem by introducing a hybrid dilated convolution (HDC) into our network [27]. Thus, the complementary information … To compensate for the loss of spatial information in the axial view slice, we incorporate the complementary information into the correlated sequence features, where the transposition operation transposes the 2D slice features of the sagittal and coronal views to the slice features of the axial view, and $F$ represents the fused features.
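As an illustration of the re-slicing and transposition described above, the following NumPy sketch splits a volume into axial, sagittal and coronal slice stacks and then transposes per-view features back into the axial layout before fusing them. The axis order `(depth, height, width)`, the function names `split_views` and `fuse_views`, and the use of element-wise addition for fusion are all assumptions of this sketch; the text only states that the complementary information is incorporated into the axial features.

```python
import numpy as np

def split_views(volume):
    """Return axial, sagittal and coronal slice stacks of a (D, H, W) volume.

    The (depth, height, width) axis order is an assumption of this sketch.
    """
    axial = volume                              # (D, H, W): one slice per depth index
    sagittal = np.transpose(volume, (2, 0, 1))  # (W, D, H): one slice per width index
    coronal = np.transpose(volume, (1, 0, 2))   # (H, D, W): one slice per height index
    return axial, sagittal, coronal

def fuse_views(feat_axial, feat_sagittal, feat_coronal):
    """Transpose orthogonal-view features to the axial layout and fuse them.

    Element-wise addition is an assumed fusion rule, used here only to
    make the transposition step concrete.
    """
    sag_as_axial = np.transpose(feat_sagittal, (1, 2, 0))  # (W, D, H) -> (D, H, W)
    cor_as_axial = np.transpose(feat_coronal, (1, 0, 2))   # (H, D, W) -> (D, H, W)
    return feat_axial + sag_as_axial + cor_as_axial
```

With identity "features" (the raw slices themselves), the fused result is three times the input volume, which is a quick check that the transpositions are mutually consistent.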

Dilated Attention Network for LA Scars Representation Enhancement
Regions of LA scar are relatively small and discrete; therefore, in this study we tackle the delineation of LA scars using the attention mechanism to force the model to focus on the locations of the LA scars, and to enhance the representations of the LA scars at those locations. Conventional pooling operations can easily lose the information of these small LA scar regions. Therefore, a novel dilated attention network is designed to integrate a feed-forward attention structure with the dilated convolution to preserve the fine information of the LA scars [28]. The dilated attention network mainly learns an attention mask $S(\theta_S): I \rightarrow AM$, where $\theta_S$ denotes the parameters of $S$, $I$ is the 2D axial slice, and $AM$ is the attention mask. In our proposed dilated attention network, the attention is provided by a mask branch, which changes adaptively according to the learned trunk branch. We utilise a sigmoid layer, which connects to a $1 \times 1$ convolutional layer, to normalise the feature maps from the mask branch into a range of [0,1] for each channel ($c$) and spatial position ($i$) of the feature vector $x_{i,c}$ to get the AM across all the channels [28]. This sigmoid layer can be defined as follows:

$$AM(x_{i,c}) = \frac{1}{1 + \exp(-x_{i,c})}$$

The attention mask obtained from the mask branch is directly applied to the maps derived from the trunk branch in order to get the attention feature maps via a product operation. Because the attention mask can potentially affect the performance of the trunk branch, a skip connection with sum operation is also applied to mitigate such influence. The output $O$ of the attention model can be denoted as

$$O(x_{i,c}) = \left(1 + AM(x_{i,c})\right) \cdot F(x_{i,c})$$

in which $i$ ranges over all spatial positions, $c$ ranges over all the channels, $AM(x_{i,c})$ is the attention mask, which ranges from [0,1], $F(x_{i,c})$ represents the fused multiview features, and $\cdot$ denotes the dot product.

Hybrid Loss for Two Segmentation Task Learning
There is a significant class imbalance between LA scars and background voxels. This can cause the network to pay more attention to the majority of background voxels, but neglect LA scars during training, which can lead to sub-optimal performance. In order to mitigate the class-imbalance problem, we adopt a Dice loss function to make the network biased towards the LA scars as well as the LA anatomy [29]. Hence, we use a hybrid loss in which $\delta$ denotes the Dice loss function, evaluated against the ground truth of the LA anatomy and the ground truth of the LA scars respectively.
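The attention gating and the Dice-based hybrid loss described in this and the preceding subsection can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the gating combines the product with the mask and the skip connection with sum as `(1 + AM) * F`, and the hybrid loss is assumed to be an unweighted sum of the two Dice losses, since the exact combination is not stated in the text. The function names and the smoothing term `eps` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Squash mask-branch activations into the range [0, 1]."""
    return 1.0 / (1.0 + np.exp(-x))

def apply_attention(trunk_features, mask_logits):
    """Gate trunk features with an attention mask plus a skip connection.

    The product with the mask highlights scar locations; adding the
    un-gated features back (skip connection with sum) gives (1 + AM) * F.
    """
    attention_mask = sigmoid(mask_logits)
    return (1.0 + attention_mask) * trunk_features

def dice_loss(pred, target, eps=1e-7):
    """Return 1 - Dice overlap between a prediction and a binary target.

    eps is a small smoothing constant (an assumption of this sketch)
    that avoids division by zero for empty masks.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    intersection = np.sum(pred * target)
    dice = (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return 1.0 - dice

def hybrid_loss(pred_la, gt_la, pred_scar, gt_scar):
    """Sum the Dice losses of the two tasks (unweighted sum is assumed)."""
    return dice_loss(pred_la, gt_la) + dice_loss(pred_scar, gt_scar)
```

Because the Dice loss is normalised by the size of the foreground, a small scar region contributes as much to the loss as the much larger LA anatomy, which is the motivation given above for preferring it under class imbalance.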

Network Configuration
Our proposed MVTT mainly consists of a multiview learning network and a dilated attention network. For the segmentation of the LA scars, the first two of the three convolutional layers contain 16 kernels with the size of 3×3 and each is followed by a BN layer and a ReLU layer. The output maps of the two layers are concatenated to connect with the last layer, which is a 1×1 convolution with one kernel and is followed by a sigmoid activation function.

Evaluation Metrics
The evaluation has been done quantitatively using multiple metrics, e.g., the Dice score and also the segmentation accuracy, sensitivity and specificity considering that the semantic segmentation is essentially solving a classification problem [4][33].In addition, for the LA scars segmentation, we also calculate the correlation between the LA scars extent [4] derived from the segmentation algorithms and the ground truth by assuming the LA wall thickness is fixed at 2.25mm [34].
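For reference, the four overlap metrics named above can be computed from the voxel-wise confusion counts of a binary prediction against a binary ground truth. The sketch below assumes both masks are already binarised and that each contains at least one foreground and one background voxel, otherwise the ratios are undefined; the function name `segmentation_metrics` is hypothetical.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Compute accuracy, sensitivity, specificity and Dice for binary masks.

    Assumes the ground truth has at least one foreground and one
    background voxel; no guard against zero denominators is included.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)     # foreground predicted as foreground
    tn = np.sum(~pred & ~truth)   # background predicted as background
    fp = np.sum(pred & ~truth)    # background predicted as foreground
    fn = np.sum(~pred & truth)    # foreground predicted as background
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "dice": 2 * tp / (2 * tp + fp + fn),
    }
```

Because background voxels dominate in scar segmentation, accuracy and specificity tend to be close to 100% regardless of method, so sensitivity and the Dice score are the more discriminating of the four.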

Implementation Details
We used the Adam method to perform the optimisation with a decayed learning rate (the initial learning rate was 0.001 and the decay rate was 0.98). Our deep learning model was implemented using TensorFlow 1.2.1 on an Ubuntu 16.04 machine, and was trained and tested on an NVidia Tesla P100 GPU (3584 cores and 16GB GPU memory).
Training multiple subnetworks with limited data may pose a risk of over-fitting. In this study, we applied two strategies to mitigate the issue. First, we applied the early stopping strategy, which can be considered as an additional and efficient regularisation technique to avoid over-fitting.
Second, we used networks with a moderate number of parameters for each subnetwork in our framework to find a balance between a sufficient complexity to perform an accurate segmentation and a relatively low likelihood of over-fitting.
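The learning rate schedule and the early stopping strategy can be sketched as below. Only the initial learning rate (0.001) and the decay rate (0.98) come from the text; applying the decay once per epoch, the `patience` value and both function names are assumptions made for illustration.

```python
def decayed_learning_rate(epoch, initial_lr=0.001, decay_rate=0.98):
    """Exponentially decayed learning rate.

    Decaying once per epoch is an assumption; the text gives only the
    initial rate and the decay rate.
    """
    return initial_lr * decay_rate ** epoch

def early_stopping_epoch(val_losses, patience=5):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs (patience is a hypothetical setting).
    If that never happens, the last epoch index is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1
```

In practice the model weights from the best validation epoch would be restored after stopping; that bookkeeping is omitted here.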
In order to test the efficacy of our proposed MVTT recursive attention model, we retrospectively studied 190 3D LGE CMR scans, and divided these data into a training/ten-fold cross-validation dataset (170 3D scans) and an independent testing dataset (20 3D scans with randomly selected 10 pre-ablation and 10 post-ablation cases). For the ten-fold cross-validation, we divided the 170 scans into 10 folds randomly. Each fold contains 17 scans. When training the model, 153 scans were used as training data and the remaining 17 scans were used for testing. We performed the cross-validation loop ten times to test the stability of our proposed methods.
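The ten-fold partition described above (170 scans, 17 per fold, 153 for training in each loop) can be reproduced with a few lines of NumPy. The random seed and the function name `ten_fold_splits` are hypothetical; the text states only that the folds were drawn randomly.

```python
import numpy as np

def ten_fold_splits(n_scans=170, n_folds=10, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold.

    Scans are shuffled once and cut into equally sized folds; each fold
    serves as the held-out set exactly once. The seed is arbitrary.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_scans)
    folds = np.array_split(order, n_folds)
    splits = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k]
        )
        splits.append((train_idx, test_idx))
    return splits
```

Splitting at the scan level, rather than at the 2D slice level, keeps all slices of one patient in the same fold and so avoids leakage between training and held-out data.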
We pre-processed the data with mean normalisation of the voxel intensities of the image. It is worth noting that we performed the mean normalisation on each slice of the 3D image instead of using the entire 3D image.
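A per-slice normalisation of this kind might look as follows. The exact normalisation formula is not recoverable from the text, so this sketch assumes that each slice is shifted to zero mean and scaled by its standard deviation; the axis order `(slices, height, width)`, the constant `eps` and the function name are likewise assumptions.

```python
import numpy as np

def normalise_per_slice(volume, eps=1e-8):
    """Normalise each 2D slice of a (slices, H, W) volume independently.

    Zero mean and unit standard deviation per slice is an assumed
    formula; eps guards against division by zero on constant slices.
    """
    volume = np.asarray(volume, dtype=np.float64)
    mean = volume.mean(axis=(1, 2), keepdims=True)  # one mean per slice
    std = volume.std(axis=(1, 2), keepdims=True)    # one SD per slice
    return (volume - mean) / (std + eps)
```

Computing the statistics per slice rather than per volume makes each 2D input comparable in intensity range, which matters here because the network consumes the data slice by slice.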

Performance of the LA Anatomy Segmentation
The experimental results show that our MVTT framework can accurately segment the LA and PV (Table 1 and Table 2). The accuracy, sensitivity, specificity and Dice scores are 98.59%, 91.96%, 99.36% and 93.11%, respectively, via independent testing (Table 2). The additive value of including the multiview learning and CLSTM is apparent from higher Dice scores. Figure 5 shows example segmentation results of the LA anatomy for pre- and post-ablation cases from the independent testing dataset.

Performance of the LA Scars Segmentation
Our MVTT framework has also performed well for segmenting the LA scars (Table 3 and Table 4). We achieve an overall scar segmentation accuracy of 99.95%, with a sensitivity of 86.77%, a specificity of 99.98% and a Dice score of 86.59% on the independent testing dataset (Table 4).
The additive value of multiview learning, CLSTM and attention mechanism is seen through higher Dice scores. Figure 6 shows LA scars segmentations from all methods in an example pre- and post-ablation patient. Visualisation of the atrial scar segmentation using our MVTT (Figure 6 (b) and (k)) shows excellent agreement with the ground truth (Figure 6 (a) and (j)). In addition, Figure 7 shows the 3D segmentation results of the LA anatomy overlaid with LA scars, showing high consistency with the manually delineated ground truth.

Model Variation Studies
To demonstrate the additive value of the multiview learning, ConvLSTM and attention mechanism, we performed several model variation studies: (1) for the LA and PV segmentation, we compared our MVTT model with the single axial view learning with ConvLSTM (SV+CLSTM) and multiview learning without using ConvLSTM (MV); (2) for the LA scars segmentation, we tested the multiview learning with attention network but without ConvLSTM (MV+AT), multiview learning with ConvLSTM but without attention network (MV+CLSTM), and the single axial view learning with attention network and ConvLSTM (SV+CLSTM+AT). In order to show that our MVTT was effective for delineating both LA and PV and LA scars simultaneously, we also implemented two single segmentations of LA and PV and LA scars (S-LA/PV and S-Scar).
Results on both cross-validation and independent testing showed that our MVTT model yielded superior results (see Table 1, Table 2, Table 3 and Table 4). In particular, for the LA scars delineation, our MVTT improved the Dice scores from 77%-82% to ~86% (Table 4). We also showed that our MVTT model could accurately segment LA and PV with LA scar simultaneously instead of performing these two tasks sequentially (rows S-LA/PV and S-Scar in Table 1, Table 2, Table 3 and Table 4). These superior results obtained by our proposed MVTT indicate the effectiveness of the ConvLSTM for the sequence learning, the multiview learning for the information complement and the attention mechanism for the small target learning. In addition, they demonstrate the effective integration of multiview learning, ConvLSTM and attention mechanism for the simultaneous segmentation of LA and LA scars.

Model Parameter Validation
To demonstrate the effectiveness of the parameters in our proposed MVTT, we carried out three extra experiments: (1) we replaced the kernel size of 3 × 3 with 5 × 5 in our proposed MVTT (K5) for kernel validation; (2) the activation function of ReLU was applied to the convolutional LSTM to validate the LSTM performance (AFT); (3) for learning of dilated convolution, we replaced the dilated convolution with general convolution (NDC). The experimental results on both cross-validation and independent testing are shown in Table 6 and Table 7. As shown in the two tables, for the validations of the kernel size, the activation function of the convolutional LSTM and the dilated convolution, our proposed MVTT with the 3 × 3 kernel size, the activation function of ReLU for the convolutional LSTM and the HDC obtains superior results. These superior results can be explained as follows: (1) the 3 × 3 kernel size can reduce the parameters of MVTT to decrease the …
LA and PV Segmentation: Compared to WHS, our MVTT framework obtained much higher sensitivity (91.96% vs. 80.31%) and similar specificity and therefore a higher Dice score (93.11% vs. 82.94%) in the independent testing dataset. Our MVTT model also showed better quantitative results compared to other deep learning based models (Table 1 and Table 2).
LA Scars Segmentation: Dice scores for pre- and post-ablation studies in the training/cross-validation and independent datasets are shown in Figure 10 and in Table 3 and Table 4. All the unsupervised learning methods, e.g., SD based thresholding and clustering, obtained high specificities, but very low sensitivities and poor Dice scores. Qualitative visualisation in Figure 6 shows that the 2-SD, k-means and Fuzzy c-means (FCM) methods clearly over-estimated the enhanced scar regions, especially for the pre-ablation cases. The U-Net and V-Net based methods improved the delineation, but still struggled to segment the LA scars accurately. Using the independent testing dataset, our MVTT model achieved a Dice score of 87% for the LA scars segmentation (83% ± 6% for the pre-ablation cases and 91% ± 3% for the post-ablation cases). The superior results achieved by our proposed MVTT are mainly derived from the following aspects: (1) we fully consider the limited data, thus slicing the 3D LGE CMR volume into contiguous 2D slices to augment the data, while integrating the multiview features to improve the feature effectiveness for segmentation target learning; (2) we fully consider the small target learning for LA scars, for which a dilated attention mechanism is proposed to focus on the small LA scars for accurate learning; and (3) we fully consider the multi-task learning that leverages the shared features to improve the segmentation performance.

Analysis of Potential Practical Application
We proposed an automated method to segment the LA with proximal PV and LA scars aiming to use such information to stratify AF patients, guide ablation therapy and predict treatment success.
Patient stratification is based on scar burden, defined as the LA scar tissue as a percentage of the LA volume. Hence, we further analyse the calculated scar percentage between our MVTT and the ground truth. Figure 8 shows the linear regression analysis of the calculated scar percentage between our MVTT and the ground truth for both training and independent testing datasets. The Pearson correlation coefficients for the independent testing data show excellent agreement between the two (r = 0.983, 95% CI 0.966 to 0.996 [pre-ablation] and r = 0.990, 95% CI 0.950 to 0.998 [post-ablation]; r ∈ [0.8, 1.0]). Bland-Altman plots showing the difference in scar percentage (between our MVTT and manual segmentations) against the manual segmentation scar percentage (as gold standard) are presented in Figure 9. From these figures, we find that our calculated scar percentage has high consistency with manual delineation by our physicist. For the independent testing, it took 5.34 seconds to segment 20 cases (to derive both the LA anatomy with proximal PV and LA scars simultaneously), and therefore ~0.27 seconds per case, which is similar to the 2D U-Net and 2D V-Net models (~0.2 seconds per case) and faster than the 3D U-Net and 3D V-Net models (~1.12 seconds and ~0.46 seconds per case). These results indicate the potential of our proposed MVTT in real clinical applications.
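The agreement analysis above rests on three simple quantities, sketched here in NumPy: the scar percentage of a case, the Pearson correlation between automatic and manual scar percentages, and the bias with limits of agreement used in a Bland-Altman plot. The ±1.96 SD limits follow the usual Bland-Altman convention and are an assumption, as the text does not state them; the function names are hypothetical and no study data are reproduced.

```python
import numpy as np

def scar_percentage(scar_mask, la_mask):
    """Scar burden: scar voxels as a percentage of LA voxels."""
    return 100.0 * np.count_nonzero(scar_mask) / np.count_nonzero(la_mask)

def pearson_r(x, y):
    """Pearson correlation coefficient between two paired samples."""
    return float(np.corrcoef(x, y)[0, 1])

def bland_altman(auto, manual):
    """Return (bias, lower limit, upper limit) of agreement.

    The difference is automatic minus manual; limits are bias +/- 1.96
    sample standard deviations (the conventional choice, assumed here).
    """
    diff = np.asarray(auto, dtype=float) - np.asarray(manual, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

A high correlation alone does not rule out a systematic offset between the two measurements, which is why the Bland-Altman bias is reported alongside it.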

DISCUSSION
In this study, we have developed a fully automatic MVTT deep learning framework for segmenting both LA and atrial scar simultaneously. Our MVTT framework combines a sequential learning network that imitates 3D data scrutinisation routinely performed by the reporting clinicians, a dilated residual learning network and an attention model to delineate the LA scars more accurately.
Our proposed framework only requires a 3D LGE CMR dataset as the input and avoids acquiring/using additional scans for the delineation of the cardiac anatomy. In addition to reducing scanning time, this also eliminates the inevitable errors which occur when multiple datasets are registered. This has been achieved mainly because (1) our 3D LGE CMR studies are reliable so that most scans (~93.27% of all pre-ablation cases and ~94.90% of all post-ablation cases) can be used for training, validation and testing and (2) our developed MVTT framework is robust in detecting and segmenting not only the LA anatomy but also the LA scars, which are relatively small. Our segmentation results have been validated against manual ground truth delineation carried out by experienced physicists and radiologists and have demonstrated promising potential for a direct application in the clinical environment.
The performance of our proposed MVTT model did not rely on a comprehensive tuning of network parameters. In our preliminary study [37], we found that our initial MVTT model suffered from over-fitting by visualising the loss functions of training/cross-validation. We subsequently incorporated an early stopping strategy that has effectively reduced this and resulted in excellent performance in the independent testing dataset. Compared with our preliminary study, a new dilated attention network and a new dilated residual network, which integrated the hybrid dilated convolution, were proposed for a more efficient feature extraction and a more efficient generation of the attention map. In addition, we replaced the mean squared loss in our preliminary study with the Dice loss to further focus on the problem of small target segmentation. Furthermore, experiments in our current work were extended to a larger database with 190 cases, and more experimental validations and detailed discussions were added for the current study.
A limitation of this work is that the 'ground truth' segmentations that our MVTT framework was developed from and validated against were derived manually. While this is not ideal due to intra- and inter-operator variability, it is the most commonly used method for establishing the ground truth for such tasks and there is no real alternative available. Our ground truth was determined by a single expert due to limited resources and we are unable to provide an assessment of inter-rater agreement. While this is not ideal, the single-expert delineations were checked by a second expert who made changes (by consensus) if necessary.
In addition, many studies have demonstrated that a multiscale network is an efficient architecture to acquire different receptive fields and capture information at different scales to improve the performance of a trained deep learning model [38][39]. However, integrating a multiscale network into our MVTT will further increase the network complexity. Further investigation is required on how to make the combination of MVTT and a multiscale network work efficiently.
A key challenge of imaging LA scars using LGE CMR remains the limited spatial resolution [40]. Autopsy studies showed that the mean LA transmural thickness is 2.2-2.5 mm (endocardium to epicardium) [34] but this may be reduced for persistent AF patients [41]. Most current 3D LGE CMR sequences have a spatial resolution of about 1-2 mm [18][30][42][43], which is usually reconstructed/interpolated to a higher value [43]; however, current LGE CMR sequences still suffer from partial volume effects [40] and this may affect the delineation and quantification of the LA scars. Furthermore, the quantification of LA scars is based on the segmentation of LA (and proximal PV) and LA scars. It is important for us to perform the segmentation and quantification of LA scars simultaneously to further improve the efficiency [44].
We have performed comprehensive comparison studies in the current work, comparing our results with conventional unsupervised learning based methods and supervised deep learning models. It is of note that the WHS method derived the LA anatomy from an additionally acquired bright-blood image that was then registered to the LGE CMR for the further scar segmentation.
Our MVTT method derived both LA anatomy and scar segmentations from a single 3D LGE CMR dataset. This is a challenging task which eliminates the need for an additional acquisition to define atrial anatomy and subsequent registration errors. Interestingly, by comparing Table 1 and Table 2 with Table 3 and Table 4, we found that the U-Net and V-Net based methods achieved a Dice score over 90% for the LA and PV segmentation, but the performance of these methods was much worse for the LA scars delineation (<81% Dice score). This may be due to the fact that these U-Net and V-Net based architectures are more suitable for segmenting relatively larger areas but are not so effective on small LA scars regions.

CONCLUSIONS
In this study, we propose a fully automatic MVTT recursive attention model, which consists of three major subnetworks that incorporate multiview learning, ConvLSTM and attention mechanism. The proposed MVTT model can resolve the connections in-between the axial image slices and preserve the overall information from the other two views. This intuitively mimics the way reporting clinicians scrutinise the 3D data. For the abnormal and small LA scars regions, our developed attention network also imitates the human attention mechanism that can efficiently exclude interferences and lets the network focus on the abnormalities it tries to segment. Validation has been performed against manually defined ground truth, and both model variation studies and comparison studies demonstrate the efficacy of our MVTT model in pre- and post-ablation studies.
In conclusion, the proposed MVTT framework outperformed other state-of-the-art methods and it can be integrated into the clinical routine for a fast, reproducible and reliable LA scars assessment for individual AF patients.
The current study is based on single-centre data. Multi-centre and multi-scanner studies are essential to validate the robustness and the generalisation of the proposed method. However, the possible domain shift among multi-centre and multi-scanner data will pose potential challenges for accurate segmentation. Therefore, in future work, we will investigate feasible solutions to cope with the domain shift problems and tackle the multi-centre and multi-scanner data.

Figure 1: Overall workflow of our proposed MVTT recursive attention model that consists of three major subnetworks.

Figure 2: Architecture of the proposed sequential learning network with corresponding kernel size (k), number of feature maps (n) and stride (s) indicated for each convolutional layer.

Figure 3: Architecture of the proposed dilated residual network with corresponding kernel size (k), number of feature maps (n), stride (s) and dilation rate (d) indicated for each convolutional layer.

Figure 4: Architecture of the proposed dilated attention network with corresponding kernel size (k), number of feature maps (n), stride (s) and dilation rate (d) indicated for each convolutional layer.
network. The multiview learning network contains three subnetworks: a sequence learning network and two dilated residual networks. The detailed configurations of the sequence learning network and the dilated residual networks are shown in Figure 2 and Figure 3. The multiview learning network mainly learns the multiview features. Based on the learned multiview features, three convolutional layers are connected to perform the segmentation of the LA anatomy. The first two layers each contain 16 kernels of size 3×3, and each is followed by a BN layer and a ReLU layer. The output maps of the two layers are concatenated and fed to the last layer, which is a 3×3 convolution with one kernel followed by a sigmoid activation function. The detailed configuration of the dilated attention network is shown in Figure 4. The dilated attention network mainly learns an enhanced feature map for the LA scars. Based on the learned enhanced feature map, three convolutional layers are connected to perform the segmentation of the LA scars.
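As a rough illustration of the three-layer LA anatomy segmentation head described above, the following sketch only tracks channel counts and learnable parameters. It rests on several assumptions that the text does not confirm: the two 16-kernel layers are applied one after the other, their two 16-channel outputs are concatenated into 32 channels, convolutions carry a bias term, each BN layer has a learnable scale and shift per channel, and the input feature map has a hypothetical 64 channels. The names `conv_params`, `bn_params` and `head_summary` are illustrative and not taken from the original implementation.

```python
def conv_params(in_ch, out_ch, k=3, bias=True):
    """Learnable parameters of a k x k 2D convolution."""
    return out_ch * (in_ch * k * k + (1 if bias else 0))


def bn_params(ch):
    """Learnable scale and shift of a batch normalisation layer."""
    return 2 * ch


def head_summary(in_ch):
    """Return (name, in_channels, out_channels, parameters) per layer.

    Assumed layout: conv1 -> conv2, then concat(conv1, conv2) -> conv3.
    """
    layers = [
        ("conv1 3x3 + BN + ReLU", in_ch, 16,
         conv_params(in_ch, 16) + bn_params(16)),
        ("conv2 3x3 + BN + ReLU", 16, 16,
         conv_params(16, 16) + bn_params(16)),
    ]
    concat_ch = 16 + 16  # outputs of the first two layers are concatenated
    layers.append(("conv3 3x3 + sigmoid", concat_ch, 1,
                   conv_params(concat_ch, 1)))
    return layers


# 64 input channels is a hypothetical value chosen for illustration.
summary = head_summary(in_ch=64)
for name, cin, cout, n in summary:
    print(f"{name}: {cin} -> {cout} channels, {n} parameters")
```

Under these assumptions the final layer sees 32 input channels and produces a single-channel probability map, which is then thresholded or compared against the manual mask.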

(TI) was set to null the signal from normal myocardium and varied on a beat-by-beat basis, dependent on the cardiac cycle length [6]. Detailed scanning parameters are: 30-34 slices at (1.4-1.5)×(1.4-1.5)×4 mm³, reconstructed to 60-68 slices at (0.7-0.75)×(0.7-0.75)×2 mm³, field-of-view 380×380 mm². For each patient, prior to contrast agent administration, coronal navigator-gated 3D b-SSFP (TE/TR 1 ms/2.3 ms) data were acquired with the following parameters: 72-80 slices at (1.6-1.8)×(1.6-1.8)×3.2 mm³, reconstructed to 144-160 slices at (0.8-0.9)×(0.8-0.9)×1.6 mm³, field-of-view 380×380 mm². Both LGE CMR and b-SSFP data were acquired during free-breathing using a prospective crossed-pairs navigator positioned over the dome of the right hemi-diaphragm with a navigator acceptance window size of 5 mm and CLAWS respiratory motion control [31][32]. Navigator artefact resulting from the use of a navigator restore pulse in the LGE acquisition was reduced by introducing a navigator-restore delay of 100 ms [32]. In agreement with the local regional ethics committee, CMR data were collected from 2011 to 2018 for persistent AF patients. The image quality of each 3D LGE dataset was scored by a senior cardiac MRI physicist on a Likert-type scale, 0 (non-diagnostic), 1 (poor), 2 (fair), 3 (good) and 4 (very good), depending on the level of SNR, appropriate TI, and interference from navigator and/or other artefact. In total, 190 cases (out of a total of 202) with image quality greater than or equal to 2 were retrospectively entered into this study. This included 97 pre-ablation cases (~93% of all) and 93 post-ablation cases (~95% of all). Manual segmentations of the LA anatomy and LA scars were performed by a cardiac MRI physicist with >3 years of experience and specialised in LGE CMR, with consensus from a second senior radiologist (>25 years of experience and specialised in cardiac MRI). These segmentations were then used as the ground truth for training and evaluation of our MVTT recursive attention model.

Figure 5: Qualitative visualisation of the LA anatomy segmentations (via independent testing) in multiple slices from an example pre-ablation (a-g) and an example post-ablation (h-n) study. Red contour: manually delineated ground truth. Green contour: segmentation using MVTT.

Figure 6: Qualitative visualisation of LA scars delineation (independent testing results) in an example pre-ablation (a-i) and post-ablation (j-r) study using different methods. Red = manual segmentation (ground truth), green = algorithm segmentation.

Figure 7: 3D visualisation of the LA anatomy and LA scars for the independent testing results (DI_L represents the DI value for the predicted LA anatomy; DI_S represents the DI value for the predicted LA scars). (a-c) Ground truth and (d-f) segmentation results using our MVTT method.

Figure 8: Correlation between the estimated LA scars percentage (ESP) of our MVTT method and the LA scars percentage from the manual delineation (MSP) (diagonal lines represent lines of identity). (a) and (b) show the correlations for pre- and post-ablation studies in the training/cross-validation datasets, and (c) and (d) show the correlations for pre- and post-ablation studies in the independent testing datasets.

Figure 9: Bland-Altman plots for the calculated LA scars percentage of our MVTT method and the LA scars percentage of the manual delineation. (a) and (b) were calculated on the 170 LGE CMR images using training/cross-validation results. (c) and (d) were calculated on the 20 LGE CMR images using independent testing results. Horizontal lines show the mean difference and the 95% CI of limits of agreement (confidence limits of the bias), which are defined as the mean difference plus/minus 1.96 times the standard deviation of the differences. The mean differences are near the 0-line (bias=−1% [95% CI −6% to 4%] and bias=−1% [95% CI −8% to 5%] for the pre-ablation and post-ablation cases respectively via training/cross-validation, and bias=−0.2% [95% CI −2% to 1.7%] and bias=−0.1% [95% CI −2.3% to 2.6%] for the pre-ablation and post-ablation cases respectively via independent testing). In summary, no significant systematic differences between the two methods can be discerned. MSP: Manual Segmented Atrial Scar Percentage; ESP: Estimated Atrial Scar Percentage.

overfitting of model compared to the 5×5 kernel size; (2) ReLU can reduce the vanishing gradient problem; (3) The HDC can alleviate the gridding problem to help extract more robust features.
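The limits of agreement quoted in the Figure 9 caption (mean difference plus/minus 1.96 times the standard deviation of the differences) can be sketched as follows. This is a minimal illustration, not the analysis code used in the study. It assumes that the difference is taken as ESP minus MSP and that the sample standard deviation (n − 1 denominator) is used, neither of which is stated in the caption. The function name `bland_altman` and the input values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(estimated, manual):
    """Return (bias, lower_limit, upper_limit) for paired measurements."""
    if len(estimated) != len(manual):
        raise ValueError("paired inputs must have the same length")
    if len(estimated) < 2:
        raise ValueError("at least two pairs are required")
    # Assumed sign convention: estimated minus manual.
    diffs = [e - m for e, m in zip(estimated, manual)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical scar percentages for four cases, not data from the study.
esp = [10.0, 12.0, 15.0, 20.0]
msp = [11.0, 12.0, 17.0, 19.0]
bias, lower, upper = bland_altman(esp, msp)
print(f"bias={bias:.2f}%, limits of agreement [{lower:.2f}%, {upper:.2f}%]")
```

A bias close to zero with narrow limits, as reported in the caption, indicates that the automatic and manual scar percentages differ by little and without a consistent direction.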

Figure 10: Boxplot of the Dice scores for comparison studies on LA scars segmentation. Training/cross-validation on the pre-ablation (a) and post-ablation (b) cases. Independent testing on the pre-ablation (c) and post-ablation (d) cases.
$$f_t = \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + W_{cf} \circ c_{t-1} + b_f)$$

$$c_t = f_t \circ c_{t-1} + i_t \circ \mathrm{ReLU}(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + W_{co} \circ c_t + b_o)$$

[…] from the sagittal and coronal views is learned by two dilated residual subnetworks: […] and […], where […] denotes the parameters of S, and […] are the high-resolution […]

To achieve the two segmentation tasks of delineating LA anatomy and LA scars simultaneously, our proposed MVTT shares the fused feature […]. For the segmentation of LA anatomy, two convolutional layers with parameters of […] are used to further learn the final segmentation map of the LA anatomy. Therefore, through integrating the multiview learning network, the segmentation of LA anatomy can be achieved by the maximum likelihood estimation based on the conditional […] map of LA scars. It is of note that […]

Table 1: Quantitative results (mean ± standard deviation) of the cross-validated LA and PV segmentation, compared to the performance using the WHS, 2D U-Net, 3D U-Net, 2D V-Net and 3D V-Net. AC: Accuracy, SE: Sensitivity, SP: Specificity and DI: Dice score.
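For readers unfamiliar with the four metrics abbreviated in the Table 1 caption, the sketch below computes them for a pair of binary masks. It uses the standard confusion-matrix definitions, which are assumed here rather than quoted from the paper. The function name `segmentation_metrics` and the toy masks are hypothetical, and the masks are flattened to plain lists for simplicity.

```python
def segmentation_metrics(pred, truth):
    """Compute AC, SE, SP and DI for flat sequences of 0/1 labels."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "AC": ratio(tp + tn, tp + tn + fp + fn),  # accuracy
        "SE": ratio(tp, tp + fn),                 # sensitivity
        "SP": ratio(tn, tn + fp),                 # specificity
        "DI": ratio(2 * tp, 2 * tp + fp + fn),    # Dice score
    }


# Toy 8-voxel masks, not data from the study.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 0, 0, 0, 1, 1, 0, 0]
metrics = segmentation_metrics(pred, truth)
print(metrics)
```

Because background voxels dominate a 3D volume, accuracy and specificity tend to be high even for poor segmentations of small structures, which is one reason the Dice score is the more informative figure for the LA scars.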

Table 2: As Table 1, but using the independent testing dataset.

Table 3: Quantitative results (mean ± standard deviation) of the cross-validated LA scars delineation. For the LA

Table 4: As Table 3, but using the independent testing dataset.

Table 6: Comparison of different parameter settings for the LA segmentation.

Table 7: Comparison of different parameter settings for the scar segmentation.