Medical Image Analysis
Volume 47, July 2018, Pages 31-44

Deep embedding convolutional neural network for synthesizing CT image from T1-Weighted MR image

https://doi.org/10.1016/j.media.2018.03.011

Highlights

  • We propose a very deep network architecture for estimating CT images from MR images directly. It learns an end-to-end mapping between different imaging modalities, without any patch-level pre- or post-processing.

  • We present a novel embedding strategy that embeds the tentatively synthesized CT image into the feature maps and further transforms these feature maps forward to better estimate the final CT image.

  • The experimental results show that our method can be flexibly adapted to different applications. Moreover, our method outperforms the state-of-the-art methods in terms of both the accuracy of the estimated CT images and the speed of the synthesis process.

Abstract

Recently, increasing attention has been drawn to the field of medical image synthesis across modalities. Among these tasks, the synthesis of computed tomography (CT) images from T1-weighted magnetic resonance (MR) images is of great importance, although the mapping between them is highly complex due to the large appearance gap between the two modalities. In this work, we aim to tackle this MR-to-CT synthesis task with a novel deep embedding convolutional neural network (DECNN). Specifically, we generate feature maps from MR images and then transform these feature maps forward through the convolutional layers of the network. A tentative CT synthesis can further be computed midway along the flow of feature maps and then embedded back into the feature maps. This embedding operation yields better feature maps, which are further transformed forward in DECNN. After repeating this embedding procedure several times in the network, the final CT image is synthesized at the end of the DECNN. We have validated our proposed method on both brain and prostate imaging datasets and compared it with state-of-the-art methods. Experimental results suggest that our DECNN (with repeated embedding operations) demonstrates superior performance, in terms of both the perceptual quality of the synthesized CT image and the run-time cost of synthesizing a CT image.

Introduction

Computed tomography (CT) and structural magnetic resonance (MR) images are both important and widely applied in the treatment planning of radiotherapy (Balter et al., 1998, Chen et al., 2004, Khoo et al., 1997, Schad et al., 1987). Recently, it has become desirable to synthesize the CT image from the corresponding MR scan. For example, quantitative positron emission tomography (PET) requires a CT image for attenuation correction (Carney et al., 2006, Kinahan et al., 1998, Pan et al., 2005). The approach for CT-based attenuation correction is to transform the CT image, which is expressed in Hounsfield units, into an estimate of the linear attenuation map; this map is then projected to obtain the attenuation correction factors for PET (Carney et al., 2006). However, unlike in the traditional PET/CT scanner, the MR signal in the cutting-edge PET/MR scanner is not directly correlated with tissue density and thus cannot be used for attenuation correction after a simple intensity transform (Wagenknecht et al., 2013). As a possible solution, one may first segment MR images to identify different tissues. This is often challenging because some structures in MR images, e.g., bone and air-filled cavities, present similar intensities while having very different attenuation properties. With a CT image synthesized from the T1-weighted MR image, however, this challenge can be greatly alleviated.
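To make the CT-based attenuation correction step above concrete, the following is a minimal sketch that maps a CT image in Hounsfield units to a 511 keV linear attenuation map with a bilinear model; the breakpoint and slope values are illustrative assumptions, not the calibrated parameters of Carney et al. (2006).

```python
import numpy as np

def hu_to_mu(ct_hu, breakpoint_hu=0.0,
             mu_water=0.096, soft_slope=9.6e-5, bone_slope=5.1e-5):
    """Map CT values (Hounsfield units) to 511 keV linear attenuation
    coefficients (1/cm) using a simple bilinear model.

    The slopes and breakpoint here are illustrative; published CT-based
    attenuation correction methods calibrate them to the scanner and kVp.
    """
    ct_hu = np.asarray(ct_hu, dtype=np.float64)
    mu = np.empty_like(ct_hu)
    soft = ct_hu <= breakpoint_hu                        # air / soft-tissue segment
    mu[soft] = mu_water + soft_slope * ct_hu[soft]
    mu[~soft] = mu_water + bone_slope * ct_hu[~soft]     # bone-dominated segment
    return np.clip(mu, 0.0, None)                        # attenuation cannot be negative
```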

In this work, we aim to address the problem of synthesizing CT from T1-weighted MR. These two modalities, however, differ greatly in image appearance, which makes the synthesis problem challenging to solve. Examples of T1-weighted MR images and their corresponding CT images are shown in Fig. 1. These images are acquired from the same patients, i.e., (a) for the brain and (b) for the prostate, respectively. In the MR images, the intensity values of ‘air’ and ‘bone’, as indicated by the blue and orange arrows, are both low. In the CT images, however, the ‘air’ appears dark while the ‘bone’ appears bright. In general, the intensity mapping between the MR and CT modalities is highly complex, encoding both spatial and contextual information in a non-linear mapping.

Several reports in the literature focus on inter-modality medical image synthesis, i.e., from MR to CT. These methods can be broadly categorized into the following three classes.

  • (a)

    Atlas-based methods. In the atlas-based methods (Arabi et al., 2016, Hofmann et al., 2008, Kops and Herzog, 2007), a set of atlases, each consisting of both MR and CT acquisitions, is prepared in advance. Given a new subject with only the MR image, all atlases are first registered to the new subject by referring to their respective MR images. The resulting deformation fields are then applied to warp the respective CT images of the atlases into the new subject space, and the subject CT image is synthesized through the fusion of the aligned atlas CT images (Burgos et al., 2014). Clearly, the performance of these methods depends heavily on registration accuracy, and the quality of the synthesized CT image also relies on sophisticated strategies for fusing the warped CT images. Note that the atlas-based methods may also incur a high computational cost for registering all images.

  • (b)

    Sparse-coding-based methods. These methods (Yang et al., 2012, Yang et al., 2008) usually involve several steps in their pipelines. First, overlapping patches are extracted from the new subject MR image. These subject MR patches are then encoded with an MR patch dictionary built from the linearly aligned MR atlases. The obtained sparse representation coefficients are transferred to the coupled CT patch dictionary (also built from the linearly aligned CT atlases) to fuse the respective CT atlas patches and finally synthesize the subject CT image (a minimal sketch of this coupled-dictionary scheme is given after this list). Roy et al. (2010) applied this framework to predict the FLAIR image from T1- and T2-weighted MR images. Similarly, Ye et al. (2013) estimated T2- and diffusion-weighted MR images from T1-weighted MR. One main drawback of these methods is that the estimation is computationally expensive (Dong et al., 2016a), since sparse coding must be optimized at every image location: each location requires extracting a patch and passing it through all the operations to obtain its predicted patch. Moreover, a global dictionary must be large to ensure good prediction performance, which further increases the time required to solve for the sparse representation coefficients (Dong et al., 2016a, Yang et al., 2008).

  • (c)

    Learning-based methods. These methods learn the complex mapping from the local detailed appearances of MR images to those of CT images of the same subjects (Huynh et al., 2016, Johansson et al., 2011, Roy et al., 2014). To address the expensive computation of sparse-coding-based methods, Huynh et al. (2016) presented an approach to estimate the CT image from MR using a structured random forest and the auto-context model. Vemulapalli et al. (2015) proposed an unsupervised approach that maximizes both global mutual information and local spatial consistency for inter-modality image synthesis. However, such methods often have to first decompose the whole input MR image into overlapping patches and then map each MR patch to the corresponding CT patch. The additional computational cost of assembling the overlapping CT patches into a single output image can also be high.
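As referenced in (b) above, the following is a minimal sketch of the coupled-dictionary, sparse-coding idea, assuming pre-built MR and CT patch dictionaries and using scikit-learn's OMP-based sparse coder. The patch size, dictionary size, and random toy data are purely illustrative; real pipelines add dictionary learning, patch extraction, and overlap-averaging steps.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def synthesize_ct_patches(mr_patches, mr_dict, ct_dict, n_nonzero=5):
    """Sparse-coding-based patch synthesis (schematic).

    mr_patches : (n_patches, patch_dim) MR patches from the new subject.
    mr_dict    : (n_atoms, patch_dim) MR patch dictionary from aligned atlases.
    ct_dict    : (n_atoms, patch_dim) coupled CT patch dictionary.
    Returns    : (n_patches, patch_dim) synthesized CT patches.
    """
    # Encode each subject MR patch as a sparse combination of MR atoms.
    coder = SparseCoder(dictionary=mr_dict,
                        transform_algorithm="omp",
                        transform_n_nonzero_coefs=n_nonzero)
    codes = coder.transform(mr_patches)          # (n_patches, n_atoms)
    # Transfer the coefficients to the coupled CT dictionary.
    return codes @ ct_dict

# Toy usage with random data (real methods build the dictionaries from atlases).
rng = np.random.default_rng(0)
mr_dict = rng.standard_normal((256, 81))                        # 256 atoms of 9x9 patches
mr_dict /= np.linalg.norm(mr_dict, axis=1, keepdims=True)       # unit-norm atoms for OMP
ct_dict = rng.standard_normal((256, 81))                        # coupled CT atoms
subject_patches = rng.standard_normal((1000, 81))
ct_patches = synthesize_ct_patches(subject_patches, mr_dict, ct_dict)
```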

Recently, the convolutional neural network (CNN) has shown tremendous popularity and good performance in the computer vision and medical image computing fields (Liao et al., 2013, Ren et al., 2018, Xiang et al., 2017, Xu et al., 2016). A CNN is capable of modeling the non-linear mapping between different image spaces without defining hand-crafted features. Moreover, CNN-based methods can overcome the time-consuming nature of patch-based methods by taking the whole image as input and producing the whole predicted image in a single pass at the testing stage. Successful applications include reconstructing high-resolution images from low-resolution images (Dong et al., 2016a) and enhancing PET signals from the simultaneously acquired structural MR (Li et al., 2014). Han (2017) also proposed a deep convolutional neural network method for CT synthesis from MR images, which achieved reasonable performance compared to atlas-based methods. However, this method can only process a single slice in each forward pass. To handle 3D MR-to-CT synthesis, it has to process multiple slices independently, which can cause discontinuity and artifacts in the synthesized CT images. Besides CNN-based networks, Van Nguyen et al. (2015) proposed the location-sensitive deep network (LSDN) for synthesizing images across domains by integrating intensity features from image voxels and their spatial information.

In this paper, we propose a deep embedding convolutional neural network (DECNN) to synthesize CT images from T1-weighted MR images. As the examples in Fig. 1 show, the mapping from MR to CT can be highly complex, since the appearances of the two modalities vary significantly across spatial locations (Wagenknecht et al., 2013). This large inter-modality appearance gap challenges the accurate learning of a CNN. To this end, we decompose the CNN model into two stages: 1) the transform stage and 2) the reconstruction stage. The transform stage is a collection of convolutional layers that are responsible for forwarding the feature maps, while the reconstruction stage aims to synthesize the CT image from the transformed feature maps. In addition, we propose a novel embedding block, which synthesizes a CT image from the tentative feature maps in the CNN. The tentative CT synthesis is then embedded with the feature maps, so that the newly embedded feature maps become more closely related to the CT image and can be further refined by the subsequent layers of the CNN. More importantly, we insert multiple embedding blocks into the transform stage to derive our DECNN accordingly. Note that the embedding block is similar to deep supervision (Lee et al., 2015), which has been adopted in many computer vision tasks (Chen et al., 2016, Xie and Tu, 2015). The holistically-nested edge detection (HED) method (Xie and Tu, 2015), for example, leverages multi-scale and multi-level feature learning to perform image-to-image edge detection; it produces multiple outputs and fuses them at the end of the network. DCAN (Chen et al., 2016) takes advantage of auxiliary supervision by introducing multi-task regularization during training. Our embedding block goes beyond deep supervision, since the midway outputs are further embedded into the subsequent layers of the network. The embedding block thus provides consistent supervision to facilitate the modality synthesis and improve the quality of the final results. This embedding strategy also resembles the auto-context model (Tu and Bai, 2010). We note that the auto-context model generally requires independent learning for each stage, whereas our DECNN integrates all embedding blocks into a unified network for end-to-end training.
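A minimal PyTorch-style sketch of this design is shown below. The layer counts, kernel sizes, and channel widths are illustrative assumptions rather than the paper's actual configuration; the sketch only captures the three essential ideas: a transform stage of convolutional layers, a reconstruction step that produces a tentative CT from midway feature maps, and an embedding operation that concatenates the tentative CT back with the feature maps before further transformation.

```python
import torch
import torch.nn as nn

class EmbeddingBlock(nn.Module):
    """Reconstruct a tentative CT from the current feature maps, then
    concatenate (embed) it back with those feature maps."""
    def __init__(self, channels=64):
        super().__init__()
        self.reconstruct = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        # Fuse the embedded (channels + 1)-channel tensor back to `channels`.
        self.fuse = nn.Sequential(
            nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats):
        tentative_ct = self.reconstruct(feats)              # midway CT estimate
        embedded = torch.cat([feats, tentative_ct], dim=1)  # embedding operation
        return self.fuse(embedded), tentative_ct

class DECNNSketch(nn.Module):
    """Transform stage with repeated embedding blocks, followed by a final
    reconstruction of the CT image from a single-channel MR input."""
    def __init__(self, channels=64, n_embeddings=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.blocks = nn.ModuleList(
            [EmbeddingBlock(channels) for _ in range(n_embeddings)])
        self.tail = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, mr):
        feats = self.head(mr)
        tentatives = []
        for block in self.blocks:
            feats, ct_mid = block(feats)
            tentatives.append(ct_mid)        # midway syntheses, can be supervised
        return self.tail(feats), tentatives
```

Because the network is fully convolutional, it can take a whole slice (or, with 3D convolutions, a whole volume) as input and synthesize the corresponding CT in one forward pass, avoiding per-patch processing.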

The advantage of the embedding block is that it greatly strengthens the inter-modality mapping capability of DECNN between MR and CT images. In particular, the tentatively synthesized CT images are embedded to generate better feature maps, which are transformed forward for the purpose of refining the synthesis of the CT image. Through our experiments, we also find that the embedding block contributes to faster convergence when training the deep network with back-propagation. Moreover, DECNN allows us to process all test subjects in a very efficient end-to-end way.

Our main contributions can be summarized as follows.

  • We propose a very deep network architecture for estimating CT images from MR images directly. The network consists of convolutional and concatenation operations only. It can thus learn an end-to-end mapping between different imaging modalities, without any patch-level pre- or post-processing.

  • To better train the deep network and refine the CT synthesis, we propose a novel embedding strategy that embeds the tentatively synthesized CT image into the feature maps and further transforms these feature maps forward for better estimation of the final CT image. This embedding strategy helps back-propagate the gradients in the network and also makes the training of the end-to-end mapping from MR to CT much easier and more effective.

  • We carry out experiments on two real datasets, i.e., human brain and prostate datasets. The experimental results show that our method can be flexibly adapted to different applications. Moreover, our method outperforms the state-of-the-art methods in terms of both the accuracy of the estimated CT images and the speed of the synthesis process.

The rest of this paper is organized as follows. In Section 2, we present the details of our proposed DECNN for estimating CT image from MR image. Then, in Section 3, we conduct extensive experiments, evaluated with multiple metrics, on both real brain and prostate datasets. Finally, we conclude this paper in Section 4.

Section snippets

Method

CNN is capable of learning the mapping between different image spaces. We adopt a CNN model similar to that of Dong et al. (2016a) for the task of MR-to-CT image synthesis and then develop our DECNN accordingly. As mentioned above, we decompose the CNN model into two stages, i.e., (1) the transform stage and (2) the reconstruction stage, as also illustrated in Fig. 2(a). The transform stage is used to forward the feature maps (i.e., derived from MR images), such that the CT image can be synthesized
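The method section is truncated in this preview. As one hedged illustration of how a network with embedding blocks can be trained end-to-end, the sketch below supervises both the final output and each tentative midway synthesis with an L2 loss. It assumes the DECNNSketch interface from the earlier sketch, and the auxiliary loss weight and optimizer choice are illustrative assumptions rather than the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def decnn_training_step(model, optimizer, mr_batch, ct_batch, aux_weight=0.5):
    """One training step: supervise the final synthesis and, with a smaller
    weight, every tentative CT produced by the embedding blocks."""
    optimizer.zero_grad()
    final_ct, tentatives = model(mr_batch)
    loss = F.mse_loss(final_ct, ct_batch)
    for ct_mid in tentatives:
        loss = loss + aux_weight * F.mse_loss(ct_mid, ct_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (purely illustrative):
# model = DECNNSketch()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = decnn_training_step(model, optimizer, mr_batch, ct_batch)
```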

Experimental result

In this section, we evaluate the performance of our method on two real CT datasets, i.e., (1) a brain dataset and (2) a prostate dataset, which are the same datasets used in (Huynh et al., 2016, Nie et al., 2016). We first describe the datasets used for training and testing our method. Next, the training setup is described in more detail. Subsequently, we analyze the effect of the embedding blocks in our architecture. We also present both qualitative and quantitative comparisons between our DECNN model and
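The results section is likewise truncated here. For reference, synthesis accuracy in this kind of study is commonly quantified by the mean absolute error (MAE) and peak signal-to-noise ratio (PSNR) between the synthesized and ground-truth CT; the sketch below computes both, though the exact metrics and settings reported in the full paper should be checked against the original text.

```python
import numpy as np

def mae(ct_pred, ct_true):
    """Mean absolute error between synthesized and ground-truth CT (e.g., in HU)."""
    return float(np.mean(np.abs(ct_pred - ct_true)))

def psnr(ct_pred, ct_true, data_range=None):
    """Peak signal-to-noise ratio in dB; data_range defaults to the
    intensity range of the ground-truth image."""
    if data_range is None:
        data_range = float(ct_true.max() - ct_true.min())
    mse = float(np.mean((ct_pred - ct_true) ** 2))
    return 10.0 * np.log10(data_range ** 2 / mse)
```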

Discussion

We have presented a novel MR-to-CT synthesis method and evaluated it on both brain and prostate datasets. Compared with traditional learning-based methods, our DECNN model not only achieves the best synthesis results, but also runs several times, or even orders of magnitude, faster at the testing stage. Our method also has some limitations. First, training is time-consuming, generally taking 2–3 days to obtain a model, while traditional methods

Conclusion

In this paper, we propose a novel DECNN model to synthesize the CT image from the T1-weighted MR image. Deep learning is well known for its capability of encoding the highly complex mapping between two different image spaces. The embedding block, which embeds the tentative CT estimation into the flow of feature maps, is integrated with the CNN in our work. Thus, our derived DECNN can transform the embedded feature maps forward and reconstruct better CT synthesis results in the

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2017YFC0107600), the National Natural Science Foundation of China (61473190, 81471733, 61401271), and the Science and Technology Commission of Shanghai Municipality (16511101100, 16410722400). This work was also supported in part by NIH grants (EB006733, CA206100, AG053867).

References (51)

  • Dong, C., et al. Accelerating the super-resolution convolutional neural network.
  • Han, X. MR-based synthetic CT generation using a deep convolutional neural network method. Med. Phys. (2017).
  • He, K., Zhang, X., Ren, S., Sun, J., 2015a. Deep residual learning for image recognition. arXiv preprint...
  • He, K., et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification.
  • Hofmann, M., et al. MRI-based attenuation correction for PET/MRI: a novel approach combining pattern recognition and atlas registration. J. Nucl. Med. (2008).
  • Hui, T.-W., et al. Depth map super-resolution by deep multi-scale guidance.
  • Huynh, T., et al. Estimating CT image from MRI data using structured random forest and auto-context model. IEEE Trans. Med. Imaging (2016).
  • Iglesias, J.E., et al. Is synthesizing MRI contrast useful for inter-modality analysis?
  • Jain, V., et al. Natural image denoising with convolutional networks.
  • Jia, Y., et al. Caffe: convolutional architecture for fast feature embedding.
  • Johansson, A., et al. CT substitute derived from MRI sequences with ultrashort echo time. Med. Phys. (2011).
  • Kinahan, P., et al. Attenuation correction for a combined 3D PET/CT scanner. Med. Phys. (1998).
  • Klein, S., et al. elastix: a toolbox for intensity-based medical image registration. IEEE Trans. Med. Imaging (2010).
  • Kops, E.R., et al. Alternative methods for attenuation correction for PET images in MR-PET scanners.
  • Lay, N., et al. Rapid multi-organ segmentation using context integration and discriminative models.