Non-local DWI Image Super-resolution with Joint Information Based on GPU Implementation

Since the spatial resolution of diffusion weighted magnetic resonance imaging (DWI) is subject to scanning time and other constraints, its spatial resolution is relatively limited. In view of this, a new non-local DWI image super-resolution with joint information method was proposed to improve the spatial resolution. Based on the non-local strategy, we use the joint information of adjacent scan directions to implement a new weighting scheme. The quantitative and qualitative comparison of the datasets of synthesized DWI and real DWI show that this method can significantly improve the resolution of DWI. However, the algorithm ran slowly because of the joint information. In order to apply the algorithm to the actual scene, we compare the proposed algorithm on CPU and GPU respectively. It is found that the processing time on GPU is much less than on CPU, and that the highest speedup ratio to the traditional algorithm is more than 26 times. It raises the possibility of applying reconstruction algorithms in actual workplaces.

In spite of all DWI advantages, the use of this technique is still very limited because an important characteristic of DWI is the low spatial resolution and the low signal-to-noise ratio. On the other hand, the acquisition of a set of high-resolution images is constrained by scanning costs, scanning time, scanner availability and patient comfort. In the clinical practice, DWI is usually acquired at a low-resolution. It can provide a better sensitivity for the analysis of brain structure and clinical disease by the improvement of DWI spatial resolution and high SNR [Mori and van Zijl (2002) ;Zeineh, Holdsworth, Skare et al. (2012); Coupé, Manjón, Chamberland et al. (2013)]. Among the post-processing processes that improve the resolution of images, the Super-Resolution (SR) appears as an efficient tool to enhance the information resolution [Vemulapalli, Nguyen and Zhou et al. (2015)]. With the development of SR, its methods can be explored from two stages: constrained reconstruction method and learning based method [Yue, Shen, Li et al. (2016)]. The former is based on sampling theory for image registered multimodal pairs, and the latter focuses on leaning the end-to-end mapping function of LR/HR image pairs. In the pioneer work, example-based SR methods was widely used. These methods focus on learning a compact dictionary or manifold space to relate LR/HR image pairs and presume the high-frequency (HF) details of LR images [Wang, Qiao, Li et al. (2014)]. Trinh et al. [Trinh, Luong and Dibos et al. (2014)] utilized a non-negative sparse linear representation of the input patch over the LR patches to estimate an HR image. Roy et al. [Roy, Carass, Prince (2013)] proposed an image restoration technique called MR image example-based contrast synthesis (MIMECS), which relied on sparse reconstruction from image patches. This type of non-local approach has high complexity and long computation time due to the patch-based image traversal, and it cannot be used in actual scenes. In recent years, the functions of GPU have been gradually extended from specialized graphics processing core to high-performance computing in computer systems. In the operation of SR algorithm, there is no interdependent relationship between data computation and, therefore, it has good parallelism and is suitable for processing on GPU [Eklund, Dufort, Forsberg et al. (2013)]. Medical image processing technology on GPU makes it possible to apply more advanced algorithms and perform computationally intensive tasks quickly in a clinical environment. Hence, medical image processing on GPU has become a very popular technology [Pratx and Xing (2011)]. In this paper, we propose a super-resolution method based on Non-Local Mean (NLM) using collaborative joint information to solve the SR problem in DWI dataset. Simultaneously, in order to improve the computational efficiency, the algorithm proposed in this paper is transplanted from CPU to GPU, which is implementation through CUDA language.

Related work 2.1 Super reconstruction with non-local mean
Image SR can be regarded as an ill-posed inverse problem, which requires to build an appropriate model for the low-resolution (LR) image Y and the high-resolution (HR) image X. The general model can be expressed as follows: (1) where D is the decimation operator, H is the degradation operator, and η is acquisition noise. The SR reconstruction problem is to estimate the underlying HR version X from Y as follows: Based on this model, the SR image can be estimated by minimizing a least square cost function as follows: where X � is a likelihood estimate value of HR image Y, ‖Y − DHX‖ is a fidelity term, R(x) is the regularization term, and is a parameter that balances the regularization term and the fidelity term. The non-local mean method can be used into an SR reconstruction, the NLM expression was defined as where X and X are the voxels at location and q, w is a searching window, w weighs the similarity between patch S(X ) and S(X ) below: In which h is the decay parameter, is a normalization factor and is defined as the sum of all the weights below: As shown in Manjón et al. [Manjón, Coupé, Buades et al. (2010); Zhou, Qiu, Li et al. (2018)], after computing this regularization term, the fidelity term is then applied for subsampling consistency [Banerjee and Jawahar (2008)]: Finally, this non-local interpolation framework is implemented iteratively until convergence, and the two-step iteration can be defined as: where NN is the nearest neighbor interpolation, and t is the iteration number. Eqs. (7) and (8) correspond to the non-local reconstruction and mean correction, respectively. This non-local SR method was proposed primarily for 3D MRI, and was then adapted for DWI application, which uses b0 information as the HR reference to guide the reconstruction [Coupé, Manjón, Chamberland et al. (2013)]. On the other hand, joint information, as indicated before, gathers the information from all correlated gradient images, providing extra redundancy, which is beneficial for SR reconstruction. Let X and X denote the DWI patches; and N corresponds to the Nth gradient direction. As shown in Manjón et al. [Manjón, Coupé, Buades et al. (2012); Chang, El-Araby, Dang et al. (2014)], a more efficient non-local estimation can be acquired through a more accurate weighting scheme. Intuitively, it is possible for joint information from correlation gradient directions to improve weighting accuracy, which leads to better SR reconstruction: ) , n is a constant that means the number of gradient directions.
It should be noted that, the similarity measure in Eq. (10) is not rotationally invariant. As pointed out in Manjón et al. [Manjón, Coupé, Buades et al. (2012)], the rotationally invariant measure can be applied to the proposed SR method for further improvement of the high-resolution image reconstruction. Manjon et al. [Manjón, Coupé, Buades et al. (2012)] proposed a simple but effective rotationally invariant measure, which is based on voxel intensity and the corresponding patch means. In this work, we adapted this rotationally invariant measure into our non-local SR method with joint information of DWI and, therefore, the similarity measure in Eq. (10) can be rewritten as: where μ is the mean of the patches around voxel X (or X ) and its corresponding n nearby gradient directions. As shown in (8), the mean correction step in order to ensure that the reconstructed high-resolution image will be consistent with the original low-resolution image can be represented as: where NN is the nearest neighbor interpolation, H is the degradation operator, D is the decimation operator and X � is the reconstructed high-resolution image.

Implementation based on GPU acceleration
With the rapid development of Graphics Processing Unit (GPU), the computational performance of the algorithm has been greatly improved. GPU is very suitable for big data and parallel computing tasks. Compute Unified Device Architecture (CUDA) is a parallel programming development model running on GPU, which is an extension of the C language. In the CUDA paradigm, a parallel task is executed by launching a multi-threaded program called a kernel. The kernel in the host is executed on the hardware according to the thread grid, which contains multiple thread blocks, which in turn contain multiple threads. When the kernel function starts on the host, its execution migrates to the device. The computation of a kernel is distributed to many threads, which are grouped into a grid of blocks (Fig. 1). GPU acceleration of medical image reconstruction algorithms has become much more common during recent years. Many research teams are interested in using GPU to visualize medical datasets in real-time, taking advantage of GPU's inherent graphics capabilities. GPU computing is also used in medical image reconstruction. Optical imaging and microscopy have also started to take advantage of GPUs, mainly for acceleration of demanding reconstruction algorithms [Chang, El-Araby, Dang et al. (2014)]. The SR algorithm has a large amount of computing tasks and requires repeated calculation of a large amount of data. In addition, the various data computation tasks necessary to be performed during the execution of the algorithm do not depend on each other, which makes the algorithm suitable for transplanting to GPU platforms.

Figure 1:
It is a two-layer thread hierarchy consisting of thread blocks and thread block grids. All threads in the same grid share the same global memory space. Thread collaboration within the same thread block can be achieved through synchronization and shared memory It can be seen from the above formula that the SRNLM algorithm has a huge amount of calculation and a high time complexity. Considering the ability of multithreading parallel computing from GPU, the repetitive computing needs to be mapped to the GPU platform for parallel processing. The flow (Fig. 2) of parallelization using CUDA for SRNLM in this article is as follows. On CPU, it needs to initialize and load low-resolution image. Afterwards, when GPU allocates memory space, it copies the data into its display memory space. On the device, it firstly gets texture memory and allocates array memory in CUDA. It defines the kernel function (SRNLM Kernel) on the device and calls it to map the image to each thread in pixels. According to the ability of graphics card computing, blocks and grids are set up so that GPU can carry out the SRNLM algorithm in parallel on several pixel points simultaneously according to the thread block. It needs to loop through the kernel function until all image reconstruction is done.
At last, it returns the reconstruction result from display memory to the CPU. The result is integrated and normalized on the CPU, and then the final reconstruction image is obtained.

Figure 2:
The Serial code is executed on CPU of the host, and the parallel code is executed on GPU of the device

Experiments based on SRNLM
A high field in vivo DWI dataset is selected as the experimental data. In addition, B-spline interpolation, which was introduced for DWI resolution enhancement in the literature [Raffelt, Tournier, Rose et al. (2012)] is used. The in vivo DWI dataset was acquired by a 7T Philips Achieva whole body scanner (Philips Healthcare, Cleveland, OH) with a volume head coil for transmission and 32-channels. A DW dual spin-echo, SENSE accelerated msh-EPI was used to acquire the DWI data (b-value: 700 s/mm 2 ; 15 diffusion directions); FOV=210×30×21 mm 3 ; matrix size=300×300 with 15 slices and a spatial resolution of 0.7×0.7×2 mm 3 . A gold standard image was constructed based on this in vivo HR DWI dataset to quantitatively and qualitatively validate the proposed approach. To do that, we averaged 10 acquisitions of high-resolution DW images in the image space (0.7×0.7×2 mm 3 ). Then the LR images used for the experiment were simulated by down-sampling our gold standard by a factor of 2 using the nearest-neighbor interpolation along each axis, which resulted in simulated LR images of 1.4×1.4×4 mm 3 . Both HR and LR data were filtered using the UNLM3D filter to remove random noise before high-resolution reconstruction. Two objective measure matrixes, namely the Peak Signal to Noise Ratio (PSNR) and Structural Similarity (SSIM) were used to quantitatively evaluate the super-resolved DWI dataset. The PSNR measures the differences between each images and image quality and SSIM measures the structure and perceptual similarities between the original and reconstructed images.

SSIM( , ) =
(2 )(2 + 2 ) where and are the mean value of images x and y, σ and σ are standard deviation of x and y, respectively, is the covariance between them, and constants 1 and 2 are chosen as suggested in Wang et al. [Wang, Bovik, Sheikh et al. (2004)]. For the in vivo dataset, the DWI reconstruction results were compared quantitatively and qualitatively in Fig. 3 and Fig. 4. For our proposed method, the super-resolution using joint information enhanced the reconstructed image quantitatively for every DWI image with different directions. In addition, both PSNR and SSIM improved with the increasing involvement of nearby DWI images. Fig. 4 shows the visual comparison of the reconstructed DWI images. Reconstructed image (Fig. 4(b)) has the blurriest result of all; the proposed method using seven nearby gradient images (Fig. 4(e)) achieved the image most similar to the golden standard ( Fig. 4(a)). The crack in the enlarged region is still clear in the image using the proposed method (Fig. 4(e)) compared with other method without any involvement of joint information.  Therefore, the impact of misalignments on the quality of results using the proposed method was studied. First, the displacements between b0 and DW images were obtained with an FSL eddy current correction [Smith, Jenkinson and Woolrich (2004)] and the mean displacements estimated from the reconstruction results are proposed in Fig. 5(a). Next, Fig. 5(b) demonstrates the correlation between image quality in terms of PSNR and the estimated mean displacements. It can be seen that there is no significant linear relationship between the results from the proposed method and the estimated mean displacements, which demonstrates the robustness of the proposed method to the limited misalignment.

Experiments based on GPU
In this section, the device parameters of CPU and GPU are respectively as follows. The CPU is Intel core i7-4600 U with 8 computing cores, and the main frequency is 2.9 GHz. The GPU is NVIDIA Tesla M40 based on Maxwell architecture, which has 3,072 computing cores and 12 GB storage capacity. The SRNLM algorithm was respectively run on CPU and GPU, which should reconstruct five times of super resolution. The experimental results show that the parallel SRNLM algorithm can get a speed ratio of 26 times as fast as the serial algorithm on the basis of maintaining the algorithm performance. The results of different image processing for a matrix size of 128×128, 60 slices and a single direction are shown in Tab. 1. where the speedup ratio is defined as In which is the calculating time from CPU, and is the calculating time from GPU. And then, we did experiments for the same matrix with 32 directions. The results are presented in Tab. 2. It can be seen from Tab. 1 and Tab. 2 that the processing time on GPU is significantly lower than on CPU, and the highest speedup ratio to the traditional algorithm is more than 26 times. It raises the possibility of applying reconstruction algorithms in actual workplaces.

Discussion and conclusion
Firstly, compared with the traditional interpolation algorithm, the proposed framework introduced joint information to improve the weighting scheme yet with a better image reconstruction. Meanwhile, the reconstruction of high-resolution image was further improved by introducing rotationally invariant similarity measure, which ensures a more accurate regularization procedure exists in SR while effectively reducing the computational burden. Secondly, GPU has the advantage of fast computing speed. Considering the high computational complexity of SRNLM reconstruction algorithm, it is transplanted from CPU to GPU. The experimental results show that the parallel SRNLM algorithm can achieve a speed of 19 times faster than the serial algorithm. Finally, our next goal is to conduct research on multi-GPU parallel computing to improve the processing efficiency of massive data with the popularization of general GPU and the appearance of massive data.