Three-dimensional GPU-accelerated active contours for automated localization of cells in large images

Cell segmentation in microscopy is a challenging problem, since cells are often asymmetric and densely packed. Successful cell segmentation algorithms rely identifying seed points, and are highly sensitive to variablility in cell size. In this paper, we present an efficient and highly parallel formulation for symmetric three-dimensional contour evolution that extends previous work on fast two-dimensional snakes. We provide a formulation for optimization on 3D images, as well as a strategy for accelerating computation on consumer graphics hardware. The proposed software takes advantage of Monte-Carlo sampling schemes in order to speed up convergence and reduce thread divergence. Experimental results show that this method provides superior performance for large 2D and 3D cell localization tasks when compared to existing methods on large 3D brain images.


Introduction
Quantifying the size and distribution of cell nuclei in optical images is critical to understanding the underlying tissue structure [1] and organization [2,3]. Segmentation is crucial to this analysis, by providing quantitative data that pathologists can use to characterize diseases and evaluate their progression [4]. Since manual analysis of microscopy images is time consuming and labor intensive, automated cell localization is essential for detecting and segmenting cells in massive images. Microscopy images exhibit a large degree of variability and complexity, due to large numbers of overlapping cells and variations in cell types and stages of cell division, imaging systems, and staining protocols. In order to deal with this complexity, a large number of algorithms have been proposed [5,6]. Most current algorithms use basic techniques combined with complicated pipelines to overcome those challenges. These methods include thresholding [7][8][9], feature extraction [10,11], classification [12], c-means [8] and k-means [13] clustering, region growing [14][15][16], and deformable models [17][18][19].
Recently, learning based approaches using artificial neural networks (ANN) and convolutional neural networks (CNN) have gained increased attention. These methods rely on example data to train a machine learning algorithm to identify boundary pixels [20]  perform binary segmentation [21]. In general, most pipelines include prepossessing, finding cell bounding boxes, extracting either spatial or frequency-based features [2,[22][23][24] or using several convolution layers followed by max-pooling [25,26], and finally classifying the image. In these approaches, the training phase is time consuming and requires massive amounts of labeled data [27]. Most current algorithms focus on two-dimensional data, such as histology slides, and utilize a variety of techniques to deal with specific tissue types, stains, and labels. For example, deep CNNs have been used for overlapping clumps in Pap smear images [28], and support vector machines (SVMs) have been employed to segment epithelial cells [29] and skeletal muscle [30]. Finally, active contours have been shown to be effective for cell nuclei [31].
The major limitation of histology slices is that they are limited to 2D sampling. Although histological assesments convey some of the structure and morphology of the tissue, they do not provide proper insights into the 3D layout of cells. In addition, 3D images provide much better separability when cells are overlapping or hidden in the corresponding 2D images.
To date, several software solutions are available for specialized cell segmentation on 3D images. FARSIGHT [32] uses graph cuts and multi-scale Laplacian of Gaussian filters to detect cell seed points. Region growing is then used based on local-maximum clustering. MINS [33] performs blob detection by smoothing the image with Gaussian kernels at different scales and computing eigenvalues of the Hessian matrix at each pixel from these smoothed images. It then thresholds the respective eigenvalues to obtain a mask of nuclei and a connected component analysis assigns a unique ID to each nucleus. The 3D object counter plugin for ImageJ [34] is a simple 3D cell counter which uses a user-specified intensity threshold to separate foreground and background, resulting in an over-segmented image. Since fundamental thresholding is not robust, adaptive and iterative thresholding on smoothed 3D images can also be used [35,36]. In almost all cases, cell localization is a necessary initial step.
One method for addressing large-scale localization relies on simple active contours, such as snakuscules [37,38], which are fast to evaluate and rely on very few user-specified input parameters. However, a three-dimensional application of this algorithm has not been derived. In addition, the sampling required to evolve a primitive active contour is computationally intractable for images containing thousands of cells.

Approach
Most deformable models transform an image segmentation task to an optimization problem. An energy function is defined based on the image content and desired behavior of a curve. Snakuscules, introduced in [37], are region-based snakes optimized for identifying approximately circular features. In this section, we will describe the previously published snakuscule algorithm as well as generalize the mathematics to three-dimensional images.

Snakuscules
Snakuscules are active contours optimized for fast convergence around circular image features. Their fast evaluation time allows the initialization of many contours that cover an entire image, allowing detection of blob-like features without manual initialization.
A Snakuscule is defined by a pair of concentric disks parameterized by two points p and q (Fig 1a). The optimization attempts to minimize an energy function measuring the contrast between the inner disk and annulus in order to completely surround a bright round object with a circular curve. The energy function is defined to balance the weighted inner area against the weighted outer area of the curve: where I(x) is a two-dimensional image, x = [x 1 , x 2 ] T is an image coordinate, R is the radius of the snake, and c is its center. The value r ¼ 1 ffi ffi 2 p , derived previously for the two-dimensional case [37], enforces the equal area for both inner disk and outer annulus.
One snakuscule can find and segment one light blob in the image. To catch all interesting features, many initial contours are specified to cover the image (Fig 1b). The contours are then evolved, and trivial contours that do not converge to image features are eliminated (Fig 2).

3D snakuscules
We first propose a mathematical framework for evolving snakuscules in 3D by moving and expanding/contracting an initial 3D contour to fit cell nuclei. This generalization makes it viable to extend similar contours to higher-dimensional or hyperspectral data (ex. hypersnakuscules). The 3D snakuscule is based on a pair of concentric spheres that are parameterized by two points p = [p x , p y , p z ] T and q = [q x , q y , q z ] T (Fig 3a). The optimization process minimizes a local energy function, which favors high contrast between weighted inner and outer volumes. Contours move and evolve within the spatial domain of an image to minimize the contrast energy function (Eq 2).

Eðp; qÞ
This optimization leads the contour toward a bright spherical object on a dark background. To ensure the snake does not move in uniform regions with constant intensity where 8x 2 R 3 : I(x) = I 0 , the energy is defined using two sub-terms that cancel each other out; therefore, p . This prevents the contour from sliding when the surrounding gradient is zero [37]. We illustrate energy minimization for a generic model of a light blob I(r, θ) = 1 + sgn(r 0 − r),  where sgn function is defined as: It creates a sphere of radius r 0 in a black background. When the contour is concentric within the blob, the resulting energy is given by: The energy achieves an optimal value for any contour equal or larger than the blob, so there is no unique optimal contour. To ensure that the volume occupied by the contour is also minimized, a normalization term α is used to reformulate the energy function: where α > 0 to apply a penalty when the contour becomes larger than the blob. To balance the expansion and contraction speed when the contour is approaching the blob size, we force the energy gradient to be symmetric as the optimal value is approached:  the contour radius with and without normalization. Note that this normalization term can be optimized as desired for objects that are not binary indicator functions (ex. Gaussian kernels). A discrete formulation of the energy function is generated by substituting summation for integration in the pixel domain. The final discrete energy function is given by: where K is the set of all pixels within R þ 1 2 DR of the 3D snake center, r = |k − c|, and S(r) is a differentiable weight function (Fig 3d), so that R 1 0 SðrÞr 2 dr ¼ 0. The 3D snake is composed of four different regions (Fig 3c); two dynamic and two fixed regions. During evolution, the entire footprint becomes smaller or larger while ΔR and Δr remain unchanged: To simplify calculations, the two identifier points p and q are considered to be in the same line along both the y and z directions (p y = q y and p z = q z ). The energy function can be rewritten as: SðrÞIðkÞ ð7Þ

3D contour evolution
The 3D snakuscule evolves by movements of p and q in the opposite direction of rE to minimize the energy function using gradient descent. Therefore, partial derivatives of the energy function E with respect to the identifier points p and q are required: where We minimize the energy (Eq 7) using gradient descent to update the position of the identifier points. Each point k 2 K applies a force to p and q that dictate its motion over time: where � ¼ � 0 ffi ffi n p is learning rate, � 0 is constant and n is the iteration number.

Parallelizing the process
Regarding cell localization and counting, 3D snakuscules can be initially placed on the 3D image in a lattice (Fig 3b) similar to the 2D case (Fig 1b). They evolve independently to segment a nearby spherical structure (blob). However, the higher dimensional integration results in excessive computing time, making a serial implementation impractical for large high resolution images.
Since the evolution of each contour is completely independent from the others, this process is highly data-parallel and an ideal application for graphic processing units (GPUs). GPUs consist of a large number of parallel processors that can be used for general purpose parallel computing to improve the performance of algorithms that are highly data parallel and can be split into a large number of independent threads. A GPU has a local single-instruction on multiple data (SIMD) architecture, making execution of the same program on multiple values extremely efficient. The set of instructions applied on each element is called a kernel [39]. We define our evolutionary instructions as a GPU kernel that can be executed for thousands of snakes in parallel.
For instance, snakuscules are run on various pieces of a 2D DAPI stained rat brain tissue image. The image is a whole rat brain slice with resolution of 350 nm/pixel. The initial and final snakuscules configurations for a 1000 × 1000 image are shown (Fig 2). Fig 5 illustrates performance of GPU implementation in comparison to an optimized CPU version. By increasing number of contours, image size, CPU execution time increases significantly, quickly becoming impractical.
The 3D snakuscule is computationally more expensive because of integration over a 3D space using a uniform grid. Therefore, parallel computing using a GPU is employed to assign one contour evolution to one GPU thread.

Monte-Carlo integration
In order to further accelerate contour evolution, Monte-Carlo (MC) integration is used. It estimates the integral values using a uniform distribution of randomized samples. In the 2D case, samples are chosen from a uniform distribution inside of a circle with radius (R + ΔR/2). If r and θ are random numbers in [0, 1] and [0, 2π) respectively; a uniform set of points within the circle with radius r are computed: MC integration is selected because it provides two advantages over uniform sampling: • Convergence is significantly faster for higher-dimensional data sets, providing an error of 1 N , regardless of the number of dimensions.
• The use of MC sampling allows us to specify a constant number of samples per snake, minimizing branch divergence in the GPU-based SIMD algorithm.
One constraint of MC integration is that we are relying on an underlying assumption that the integral is well-behaved (smooth). Given that we expect cell nuclei to be relatively consistent in size, this assumption is well founded. However, it can be mathematically enforced using a low-pass filter that forces the image to be smooth.
For 3D images, uniform sampling is done within a sphere with radius (R + ΔR/2). Execution time using Monte-Carlo sampling in comparison with the original integration for different number of snakes on the 2D (Fig 6) and the 3D (Fig 7) images shows significant improvement. As expected, a significantly greater acceleration can be seen in the 3D algorithm, with an � 4X gain in performance on average.

Parallel contour evaluation
In order to improve the GPU efficiency by utilizing more GPU resources, we further parallelize each 3D contour. We instead assign each block to one contour so that threads in that block are responsible for smaller parts of Eqs 13 and 14. For each snake, if MC integration selects N random samples and the CUDA kernel is launched with T threads (the maximum number of threads per block), each thread calculates a portion of the energy (Eqs 8-11) corresponding to N/T spatial locations within the contour. The results are stored in shared memory and combined (Eqs 13 and 14) to calculate the final contour at each iteration. This allows employing more GPU threads to cooperatively walk through a snake evolution process (Fig 8).

Results and discussion
In order to find all sphere-like objects in an 3D-image without user interaction, the image is covered by close initial 3D contours. The 3D contours update their current configurations by individually optimizing their energies. Contour evolution is stopped when either (a) they meet maximum number of iterations or (b) they converge where contours reach the minima and the moving step is much less than one pixel such as 0.001.Overlapping snakes, as defined by kc 0 À c @ k < max ðR 0 ; R @ Þ= ffi ffi ffi 2 3 p , undergo a competition with the lower energy snake surviving. 3D snakuscules with energy greater than a threshold (E 0 ) are also removed.  Images of the hilus region of the dentate gyrus in the mouse hippocampus were collected with a 40X oil objective on a Leica TCS SP8 confocal microscope (1024 × 1024 pixels; 387.5 × 387.5μm). A 405 nm laser excited the DAPI signal that was detected between 415 to 500nm). A 1μm step size was set for z stack collection of the entire tissue thickness. Acquisition speed was set to 600 HZ, with a 0.75 zoom factor. Raw images for all data analysis were exported as TIFFs. Transgenic mice that model Dravet syndrome with spontaneous seizure onset at postnatal day 15 were housed in a 12 hour light/dark cycle. These mice have a knockin mutant Scn1A gene containing a nonsense substitution (CgG to TgA) in exon 21 [40]. All animal experiments were approved by the Institutional Animal Care and Use Committee of the University of Houston.
We applied our method to the image of size 256 × 256 × 40 (Fig 9a). In order to deal with pixel anisotropy, where z-axis resolution is commonly worse than lateral (x,y) resolution, the pixel size is specified using lateral pixels and the axial direction is linearly interpolated using GPU hardware. the images were re-sampled to obtain a uniform pixel size. The contours are initialized as a lattice of 3D snakuscules. The 3D snakuscules are evolved and culled using the proposed methods. Fig 9b depicts the final configuration of them on a 2D slice.
To quantitatively evaluate the performance of our method, 3D snakuscules are considered as either a cell (foreground) or non-cell (background) using a K-nearest neighborhood (KNN) search. The four evaluation parameters, precision (P r ), recall (R e ), F-measure (F) and Jaccard (J), are calculated as follows: Where the true positive (TP) value is number of accurately detected cells, the false positive (FP) value is the number of falsely detected cells, and the false negative (FN) value is number of undetected cells. The ground truth is the manual detection of cell centers. The annotations are done using Gimp for 2D and an in-house tool for 3D images under an expert supervision. Fig 10 illustrates the precision-recall curve for MINIS, FARSIGHT, 3D object counter, Cell-Segm and the proposed method on a DAPI-labeled image with 53 annotated cells. The performance for each algorithm is shown in Table 1. We initialized the 3D snake parameters with an initial radius of 15 pixels (� 6 μm), the energy threshold E 0 = −3(E 0 � 0), and the maximum number of iterations to 400. Also, We adjusted the hyperparameters of other methods to optimize their performance on our dataset. The results clearly demonstrate that 3D snakuscules are suitably capable of capturing round cell nuclei, and provide considerable performance advantages over other conventional methods with F-measure of 90% in comparison with that of 82%, 74%, 62% and 82% for MINIS, FARSIGHT, 3D object counter and CellSegm respectively. Additional advantages include the minimal number of parameters required for initialization.
Segmentation results for the proposed method are shown in Table 2.
The sensitivity of our algorithm to noise was tested by generating a phantom based on manually segmented three-dimensional images of cells acquired using KESM [43]. Incremental reduction in signal-to-noise ratio (SNR) demonstrates robust localization with an F-measure ranging from 0.96 (original image) to 0.75 (SNR = 0.2dB) (Fig 12).

GPU occupancy
Occupancy is a measure of how many warps the kernel has active on the GPU, relative to the maximum number of warps supported by the GPU. The graphic processor used in our experiments is GetForce GTX-1070 with 1920 CUDA cores, 8GB of global memory, 2MB of L2 cache size, and 48kB of on-chip shared memory. The compute capability is 6.1, the global memory bandwidth is 256.256GB/s, and the single precision FLOP/s is 6.852TeraFLOP/s. Theoretical occupancy provides an upper bound while achieved occupancy indicates the kernel's actual performance. When the GPU does not have enough work, resources are wasted.
The theoretical occupancy for our algorithm is 50%, limited by the number of registers required for contour evolution. This is a relatively standard theoretical occupancy for complex  Three-dimensional snakuscules calculations, however a more rigorous optimization may yield better results in the future. Since any 3D contour is assigned to one GPU thread, the number of utilized threads is equal to the number of initial 3D snakes. Therefore, a small image with a small number of cells occupies fewer resources, resulting in low compute performance that is unable to hide operation and memory latency (Fig 13a). Comparing achieved occupancy with and without Monte-Carlo sampling shows that MC integration improves GPU performance and occupancy by reducing  Three-dimensional snakuscules thread stalls. Since there are cells with various sizes in the image, uniform integration takes longer for contours corresponding to bigger cells, resulting in additional stalls. These are mitigated using MC integration. Since the reduction in efficiency is due primarily to low occupancy, further parallelizing each 3D contour allows for the generation of more threads. This provides an achieved occupancy much closer to the theoretical limit (Fig 13a), further reducing processing time (Fig  13b).

Conclusion
We developed a 3D blob-detector, based on the miniscule snakes (snakuscule) algorithm, that provides a method for 3D nuclei detection with minimal user interaction. This paper describes a unified formulation of snakuscules in three dimensional space, so that the new cost function is minimized with respect to two points which define the contour. Although the method is initially computationally expensive, it is extremely data parallel and can be efficiently implemented using GPU hardware. A GPU implementation, combined with Monte-Carlo sampling, results in a simple and fast blob detector for large images with numerous cells. Our method requires the specification of a minimum contour size, which is usually readily available for microscopy images. We have illustrated that the GPU implementation and Monte Carlo sampling significantly increase performance, making 3D snakuscules viable for cell localization. The experimental results demonstrate that the proposed method outperforms state of art methods in overall accuracy.
One major limitation of this algorithm is that a significant portion of the evaluation is devoted to the evolution of snakes that will ultimately be culled. This suggests that any method that reliably places initial contours could significantly increase snakuscule performance. A more optimal placement of initial contours could significantly improve performance beyond what we were able to achieve with a lattice. However, methods such as iterative voting and Laplacian of Gaussian blob detection resulted in reduced accuracy when tested. Therefore more advanced algorithms, such as dynamic culling or insertion of contours during evolution, may be a better approach.
In addition, further optimization of the evolution kernel to increase theoretical occupancy limited by register usage could double performance.