An improved parallel fuzzy connected image segmentation method based on CUDA

Purpose Fuzzy connectedness method (FC) is an effective method for extracting fuzzy objects from medical images. However, when FC is applied to large medical image datasets, its running time will be greatly expensive. Therefore, a parallel CUDA version of FC (CUDA-kFOE) was proposed by Ying et al. to accelerate the original FC. Unfortunately, CUDA-kFOE does not consider the edges between GPU blocks, which causes miscalculation of edge points. In this paper, an improved algorithm is proposed by adding a correction step on the edge points. The improved algorithm can greatly enhance the calculation accuracy. Methods In the improved method, an iterative manner is applied. In the first iteration, the affinity computation strategy is changed and a look up table is employed for memory reduction. In the second iteration, the error voxels because of asynchronism are updated again. Results Three different CT sequences of hepatic vascular with different sizes were used in the experiments with three different seeds. NVIDIA Tesla C2075 is used to evaluate our improved method over these three data sets. Experimental results show that the improved algorithm can achieve a faster segmentation compared to the CPU version and higher accuracy than CUDA-kFOE. Conclusions The calculation results were consistent with the CPU version, which demonstrates that it corrects the edge point calculation error of the original CUDA-kFOE. The proposed method has a comparable time cost and has less errors compared to the original CUDA-kFOE as demonstrated in the experimental results. In the future, we will focus on automatic acquisition method and automatic processing.

to segment vessel. Jiang et al. [3] employed a region growing method based on spectrum information to perform vessel segmentation.
In 1996, Udupa et al. [4] addressed a theory of fuzzy objects for n-dimensional digital spaces based on a notion of fuzzy connectedness of image elements and presented algorithms for extracting a specified fuzzy object and identifying all fuzzy objects present in the image data. Lots of medical applications of the fuzzy connectedness are proposed, including multiple abdominal organ segmentation [5], tumor segmentation [6], vascular segmentation in liver, and so on. Based on fuzzy connectedness algorithm, Harati et al. [6] developed a fully automatic and accurate method for tumor region detection and segmentation in brain MR images. Liu et al. [7] presented a method for brain tumor volume estimation via MR imaging and fuzzy connectedness.
However, with the size of medical data increasing, the sequential FC algorithm, which depends on the sequential performance of CPU, is greatly time-consuming. On the other hand, parallel technology developments in many domains, such as high-through DNA sequence alignment using GPUs [8], accelerating advanced MRI reconstructions on GPUs [9]. Therefore, some researchers proposed parallel implementations of FC. An OpenMP-based FC was proposed in 2008, the authors adapted a sequential fuzzy segmentation algorithm to multiprocessor machines [10]. Thereafter, Zhuge et al. [11] addressed a CUDA-kFOE algorithm which is based on NVIDIA's compute unified device architecture (CUDA) platform. CUDA-kFOE computes the fuzzy affinity relations and the fuzzy connectedness relations as CUDA kernels and executes them on GPU. The authors improved their method in 2011 [12] and 2013 [13]. However, their methods has expensive computational cost because their method is in an iterative manner and lacks of interblock communication on the GPU [13].
In this paper, we proposed a novel solution to the limited communication capability between threads of different blocks. The purpose of our study is to improve the implementation of CUDA-kFOE and enhance the calculation accuracy on GPU by CUDA. The main contributions of the proposed method are in two folds. Firstly, the improved method doesn't need large memory for large data set since we use a look up table. Secondly, the error voxels because of asynchronism are updated again and corrected in the last iteration of the proposed method.
The paper is organized as follows. In "Background" section, we first summarize the literature of fuzzy connectedness and the CPU-based FC algorithms. Then a brief description of fuzzy connectedness and the original CUDA-kFOE is presented in the "Fuzzy connectedness and CUDA executing model" and "Previous work" sections respectively. The proposed improved CUDA-kFOE is explained in the "Methods" section. The experiments and conclusion are given in the "Results and discussion" and "Conclusion" sections respectively.

Fuzzy connectedness
Fuzzy connectedness segmentation method [14] was first proposed by Udupa et al. in 1996. The idea of the algorithm is by comparing connectivity of seed points between target area and background area to separate the target and background. Let's define X be any reference set. Fuzzy subset A of X is a set of ordered pairs, where µ A : X → [0, 1] is the member function of A in X. A fuzzy relation ρ in X is a fuzzy subset of X × X, ρ = x, y , µ ρ x, y |x, y ∈ X , where µ ρ : X × X → [0, 1].
Let C = (C, f ) be a scene of (Z n , a), and if any fuzzy relation k in C is reflexive and symmetric, we said k to be a fuzzy spel affinity in C. We define µ k as respectively. The mean and variance of g 1 are computed by the intensity of objects surrounded in fuzzy scene, g 2 is a zero-mean Gaussian.

CUDA executing model
The basic strategy of CUDA is for all computing threads to run concurrently in logic. Actually, tasks will divide thread blocks according to the equipments of different CUDA devices, and GPU will automatically distribute task blocks to each stream multiprocessor (SM). Figure 1 shows a procedure of blocks divided from software level to hardware level. In this procedure, all SMs will run in parallel independently. This means any task blocks in different SMs won't execute synchronization instructions [15].

Previous work
In this section, a brief introduction of the CUDA-kFOE Algorithm proposed by Ying Zhuge et al. is presented, in which the kFOE is well parallelized. The CUDA-kFOE algorithm consists of two parts.
1. Affinity computation. We can use Eq. (2) to compute the affinity of voxel (c, d), and the result of affinity µ k (c, d) is stored in the special GPU device memory. 2. Updating fuzzy connectivity. The nature of computation for the fuzzy connectivity is a single-source-shortest-path (SSSP) problem. How to parallelize the SSSP is a challenge problem. Fortunately, CUDA-based SSSP algorithm proposed by Harish and Narayanan solves the problem [16]. With the computing capability of Eq. (2), the atomic operations are employed to solve multiple threads by accessing the same address conflict which basically achieve SSSP parallelization, and the algorithm is presented in [11].

Performance analysis and improvement
In the first step of CUDA-kFOE algorithm, we need release enormous memory space to store the six-adjacent affinity when computing large CT series data. In addition, CUDA-kFOE will suffer from errors in some voxels in the scenario of different blocks hard to execute synchronously. In order to overcome these drawbacks of the CUDA-kFOE algorithm, in this section, we propose an improved double iterative method which can be implemented easily and has more accurate performance. The main advantages of the improved method are as follows.
1. The proposed algorithm needs less memory compared to CUDA-kFOE when processing large data sets. (We change the affinity computation strategy by using look up table for memory reduction). 2. The proposed algorithm doesn't need CPU involved to handle extra computing and therefore achieve more accurate results. (The main idea is to process twice the error voxels because of asynchronism. Therefore those error voxels will be processed again in the last iteration).
Let's analyze the performance of CUDA-kFOE. Considering a single seed to start the CUDA-kFOE algorithm, and using breadth-first for computing fuzzy scenes. Figure 2 illustrates the processing of edge points, where red points represent its neighbors required to be updated and blue points represent being updated points. If the red points denote fuzzy affinity for propagation outside, the competition problem will be triggered when red points reach the blocks' edge. The reason is that the fuzzy affinity must be propagated between different blocks. Since the procedure of outward propagation of seed point looks like a tree shape and therefore the path will not appear in a circle. Thus the calculation procedure can be seen as the generation of tree structure which is built on seed points as the tree root.
In Fig. 2, pixel 1, (2, 4), 3 and 5 locate at different thread blocks. Pixel 1, 2 and 3 are in C 1 (c) array and pixel 4 and 5 are updated points which are the neighbors of pixel 2. Considering the worst situation: because the runnings of thread blocks are disorder, when judging f min > f (e), pixel 5 will be influenced by pixel 2 and 3 together. The running orders have six situations: Because updating the pixel 5 only need selecting the max values of fuzzy affinity between pixel 1 and 2, the orders of situation (a) and (b) won't influence the propagating result of fuzzy affinity. Therefore, situation (a) and (b) won't generate errors because of thread block asynchrony. In the situation (c) and (d), if the pixel 1 doesn't influence the values of pixel 2 and 3, the results are the same as the situation (a) and (b). However, If pixel 1 influences the pixel 2 or 3, the pixel 5 will be influenced by updating the pixel 2 and 3. At this condition, if run 2 → 5, 3 → 5 or 3 → 5, 2 → 5 first, new value of pixel won't reach pixel 5, thus pixel 5 can't compute the correct value. Therefore, we can run a correction iterator to propagate the correct value of pixel 1. Double iterations can solve the problem of situation (c) and (d). In the situation (e) and (f ), pixels will cross 3 thread blocks. It's the same situation as (c) and (d), thus we can run triple iterations to solve the asynchronous problem.

Improved algorithm and implementation
The flow chart of improved GPU implementation is illustrated in Fig. 3, which is modified from Ref. [13]. The pseudo code of the proposed method is given in the following algorithm.
As shown in the procedure of the algorithm, improved CUDA-FOE is an iteration algorithm. In the first iteration, only one voxel will participate in computing affinity and updating the six-adjacent connectivity. While the number of iteration increase, more and more voxels will be computed in parallel until there is no any update operation from all threads, which means every voxel value in C 1 is all false. In the step 6 of algorithm improved CUDA-kFOE, we use atomic operation for consistency [16] since more than one thread in update operation may access the same address simultaneously. In addition, the edges of different blocks can not be easily controlled which may cause error values for the voxels at the edge of blocks. Therefore we use two iterations to solve the problem.

Results and discussion
In the experiments, the accuracy of the proposed method is evaluated by compared to original CUDA-kFOE and the CPU version of FC at the same condition. The CPU version source code of fuzzy connectedness is from Insight Segmentation and Registration Toolkit (ITK).
The experiments use a computer of DELL Precision WorkStation T7500 Tower which is equipped with two quad-cores 2.93 GHz Intel Xeon X5674 CPU. It runs Windows 7 (64 bit) with 48 GB device memory. We use NVIDIA Quadro 2000 for display and NVIDIA Tesla C2075 for computing. The NVIDIA Tesla C2075 is equipped with 6 GB memory and 14 multiprocessors, in which each multiprocessor consists of 32 CUDA cores. Table 1 shows the data set used in the experiments and the results of CPU version, original GPU version and improved GPU version in running time and accuracy. Error pointers is defined as the difference between CPU version and GPU version and its result is displayed in a new image.  In addition, the improved method is further evaluated in different iteration directions as shown in Table 2. The results are also visualized in the Fig. 6. It is illustrated that the results have higher accuracy and less number of error points when choosing more adjacent edges during iterations. The time cost of each iteration direction is shown in the Fig. 7. For each data set, time cost slightly change while increase the iteration directions, because in the proposed twice-iteration method, most pointers reach their right values and only a few threads will participate in re-computing step.

Conclusions
In this study, we proposed an improved CUDA-kFOE to overcome the drawbacks of the original one. The improved CUDA-kFOE is in an two iterations manner. Two advantages are in the improved CUDA-kFOE. Firstly, the improved method doesn't need large memory for large data set since we use a look up table. Secondly, the error voxels because of asynchronism are updated again in the last iteration of the improved CUDA-kFOE. To evaluate the proposed method, three data sets of different size are used. The improved CUDA-kFOE has a comparable time cost and has less errors compared with the original one as demonstrated in the experiments. In the future, we will study automatic acquisition method and complete automatic processing.