Parallel N-Dimensional Exact Signed Euclidean Distance Transform

The computation speed for distance transforms becomes important in a wide variety of image processing applications. Current ITK library filters do not see any benefit from a multithreading environment. We introduce an N-dimensional signed parallel implementation of the exact Euclidean distance transform algorithm developed by Maurer et al. [1] with a theoretical complexity of O(n/p) for n voxels and p threads. Through this parallelization and efficient use of data structures we obtain approximately 3 times mean speedup on standard tests on a 4-processor machine compared with the current ITK exact Euclidean distance transform filter [4].


Introduction
The Euclidean distance transform (EDT) of an N-dimensional binary image $I$ is an N-dimensional image $I_{EDT}$ such that for all indices $i$, the element $e_{EDT,i} \in I_{EDT}$ is the Euclidean distance from the element $e_{I,i}$ to the nearest foreground element in $I$. If we assume the typical convention of negative distances for the interior of a feature, we can further define the signed EDT (SEDT) as $I_{SEDT}$ such that $\forall i$, $e_{SEDT,i} \in I_{SEDT}$, $|e_{SEDT,i}|$ is the Euclidean distance from $e_{I,i}$ to the nearest surface foreground element, and $e_{SEDT,i}$ is negative if $e_{I,i}$ is internal to a feature, positive if external, and zero if on an edge.
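As an illustration of the SEDT definition above, the following minimal Python sketch computes it by brute force for a 2D image. It is a reference for checking small cases only, not the algorithm of [1]. The function name `brute_force_sedt` is hypothetical, and the rule used for "surface" elements (a foreground pixel with at least one face-connected neighbour that is background or outside the image) is an assumption; the definition above does not fix the neighbourhood.

```python
import math

def brute_force_sedt(image, spacing=(1.0, 1.0)):
    """Signed EDT of a 2D binary image (list of rows), straight from the definition."""
    rows, cols = len(image), len(image[0])

    def is_foreground(r, c):
        return 0 <= r < rows and 0 <= c < cols and image[r][c] != 0

    # Assumption: a surface element is a foreground pixel with at least one
    # face-connected neighbour that is background or outside the image.
    neighbours = ((1, 0), (-1, 0), (0, 1), (0, -1))
    surface = [
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if is_foreground(r, c)
        and not all(is_foreground(r + dr, c + dc) for dr, dc in neighbours)
    ]
    if not surface:
        raise ValueError("image has no foreground elements")

    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Distance to the nearest surface element, in physical units.
            d = min(
                math.hypot((r - sr) * spacing[0], (c - sc) * spacing[1])
                for sr, sc in surface
            )
            # Negative inside a feature, positive outside, zero on the edge.
            out[r][c] = -d if (image[r][c] != 0 and d > 0.0) else d
    return out
```

The cost is proportional to the number of pixels times the number of surface pixels, which is why this form is useful only as a correctness check on small images.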

ITK currently provides three filters for computing a signed distance transform: the Danielsson filter, itk::ApproximateSignedDistanceMapImageFilter, and itk::SignedMaurerDistanceMapImageFilter. The first of these computes an approximation of the SEDT with less than 1 pixel error in two dimensions [2]. The second computes an approximation using the Chamfer distance [3]. The third computes the exact SEDT and is an implementation of Maurer et al. [1][4]. All three can compute the SEDT to arbitrary dimensions and can handle anisotropic image dimensions.
In this paper we present a new ITK filter, SignedMaurerParallelDistanceMapImageFilter, for computing the signed exact Euclidean distance transform for N-dimensional images in parallel. On 3D images it is approximately 3 times faster than itk::SignedMaurerDistanceMapImageFilter when run on 4 simultaneous multithreading (SMT) processors with 2 threads each. We compare the output and speed of our parallel implementation of Maurer's algorithm with the Maurer and Danielsson filters and give implementation details for our filter.

Algorithm
In Maurer [1], the authors present a fast (linear time) algorithm for computing exact distance transforms using the $L_p$ distance metrics and, more generally, the weighted $L_p$ distance metrics, the difference between weighted (anisotropic) and non-weighted (isotropic) calculations being that between $w_i \neq 1$ and $w_i = 1$ values of the $w_i$ term. A 3D Euclidean distance is thus

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{3} \left( w_i (x_i - y_i) \right)^2}, \quad w_i > 0. \qquad (1)$$

The base algorithm for SignedMaurerParallelDistanceMapImageFilter relies on dimensional reduction. First, the x dimension is considered and the closest feature voxel (CFV) is determined for every element in the image. Using the CFV calculations for x, the same can be done for y, and then z in turn, or for as many or as few dimensions as are necessary.

Our parallelization distributes contiguous groups of planes divided along the largest-dimensional axis to the threads (e.g., z planes for 3D images). After all threads have finished their lower-dimensional CFV calculation, groups of contiguous x planes are assigned to the threads for the final step. This a priori method maximizes cache hits and has a low run queue creation/access cost. Each thread calculates its processing region independently of the others, and this region is protected from access by other threads. Thus the region calculation cost is minimized and there are no synchronization costs directly associated with region access management.
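The dimensional reduction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the filter's implementation: the per-line step `line_pass` is a brute-force quadratic minimization substituted for the linear-time scan-line step of [1], and the names `line_pass` and `squared_edt` as well as the flat row-major layout are hypothetical. The sketch computes the unsigned squared distance with per-axis weights.

```python
from itertools import product

INF = float("inf")

def line_pass(values, weight):
    """Exact 1D step: g[i] = min over j of values[j] + (weight * (i - j))**2.

    Brute force, quadratic in the line length; it stands in for the
    linear-time scan-line step of the algorithm in [1].
    """
    n = len(values)
    return [
        min(values[j] + (weight * (i - j)) ** 2 for j in range(n))
        for i in range(n)
    ]

def squared_edt(flat, shape, spacing):
    """Squared EDT of a flat, row-major binary image, one dimension at a time."""
    # Start with 0 at feature voxels and infinity elsewhere.
    dist = [0.0 if v else INF for v in flat]

    # Row-major strides: the last axis (x) varies fastest.
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]

    # Process x first, then y, then z, and so on for higher dimensions.
    for axis in range(len(shape) - 1, -1, -1):
        others = [range(n) if d != axis else range(1) for d, n in enumerate(shape)]
        for index in product(*others):
            start = sum(i * s for i, s in zip(index, strides))
            line = [start + k * strides[axis] for k in range(shape[axis])]
            updated = line_pass([dist[i] for i in line], spacing[axis])
            for i, v in zip(line, updated):
                dist[i] = v
    return dist
```

Each pass reads only the results of the previous pass along one axis, so all scan lines of a given axis are independent of one another; this independence is what the plane-based parallelization relies on.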
A combination of this method and a run queue would also maximize cache hits but would suffer from queue-associated costs and, because of the highly symmetric nature of the image processing, would not generally provide significant benefits in load-balancing. The batch regions of slices would need to be determined statically and access to the shared queue would need to be controlled (e.g., by a mutex), both resulting in performance losses. The number of operations is largely the same across an image, varying somewhat with the number of features in a scan line (and thus, region). High load imbalances are therefore not especially likely, and the cost of dynamic load-balancing can be eliminated. Experimental comparisons with a load-balancing version of our filter also fail to show consistent benefits for any particular degree of load-balancing tested.
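A static partition of this kind requires no shared state: each thread can derive its own region from the plane count and the thread count alone. The following Python sketch shows one way to do so; the function name `split_planes` and the choice to hand any leftover planes to the first regions are assumptions for illustration.

```python
def split_planes(num_planes, num_threads):
    """Return one half-open (start, stop) range of contiguous planes per thread."""
    if num_planes <= 0 or num_threads <= 0:
        raise ValueError("num_planes and num_threads must be positive")
    used = min(num_planes, num_threads)      # never create an empty region
    base, extra = divmod(num_planes, used)
    regions, start = [], 0
    for t in range(used):
        # The first `extra` regions take one additional plane each.
        stop = start + base + (1 if t < extra else 0)
        regions.append((start, stop))
        start = stop
    return regions
```

Because the regions are disjoint and fixed before processing begins, no mutex or queue is needed to manage access to them. Fewer regions than requested threads are returned when there are fewer planes than threads.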

Filter Use and Implementation
This filter was designed to adhere to the expectations of users of the Danielsson and itk::SignedMaurerDistanceMapImageFilter filters and as such shares many of the same options. It takes as input an image of any type and can output either a squared or an actual distance transform.
The parameters that differ between this filter and Danielsson or Maurer are as follows:
• SetForegroundValue allows the specification of the foreground element value (typically 1, which is the default). All other values are assumed to be background values.
• SetBinaryConvert specifies whether the input image must first be converted to a binary image. A value of false indicates that the input is already a binary image of values 1 and 0. The default value is true. This option is useful for skipping a processing step if more information about the input is known.
• SetNumberOfThreads sets the requested number of threads for the filter. The actual number of threads used may differ. The request applies to the filters used internally as well as to the filter's own processing. The default behavior is to request the maximum available threads on the system.
Our filter uses the following filters internally under at least one option:
• itk::AddImageFilter (for signed transform only)
• itk::BinaryErodeImageFilter (for signed transform only)
• itk::BinaryThresholdImageFilter (for non-preprocessed binary images)
All three of these filters run in O(n/p) time when used in parallel in the current release of ITK. The distance transform itself is O(n/p) as stated by Maurer [1]. Therefore, these pre-transform operations result only in a higher constant factor in the run time and not an increase in order.
Our filter does not use the standard ITK threading provision of including a ThreadedGenerateData function. We do this because this filter cannot be parallelized purely through per-pixel parallelization. It is essential that individual scan lines be processed by the same thread. Further, the lower dimensions must be completed before the highest is calculated. This behavior cannot be achieved with the provided threading mechanism, so we provide slightly altered versions of the relevant functions which guarantee the desired behavior.
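The two constraints above (whole scan lines stay with one thread, and the lower dimensions finish before the highest begins) can be illustrated with a two-phase Python sketch. It is not the ITK implementation: the helper names are hypothetical, the per-line step is a brute-force stand-in for the step in [1], unit spacing and a nested-list volume `vol[z][y][x]` are assumed, and Python threads illustrate region ownership and phase ordering rather than real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

INF = float("inf")

def line_pass(values):
    """Brute-force 1D squared-distance step (stand-in for the step in [1])."""
    n = len(values)
    return [min(values[j] + (i - j) ** 2 for j in range(n)) for i in range(n)]

def slabs(count, workers):
    """Contiguous half-open ranges, one per worker (at most `count` workers)."""
    workers = max(1, min(count, workers))
    bounds = [count * k // workers for k in range(workers + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

def lower_dims(vol, z0, z1):
    """Phase 1: x then y scan lines inside the z planes [z0, z1)."""
    ny, nx = len(vol[0]), len(vol[0][0])
    for z in range(z0, z1):
        for y in range(ny):
            vol[z][y] = line_pass(vol[z][y])
        for x in range(nx):
            col = line_pass([vol[z][y][x] for y in range(ny)])
            for y in range(ny):
                vol[z][y][x] = col[y]

def highest_dim(vol, x0, x1):
    """Phase 2: z scan lines inside the x planes [x0, x1)."""
    nz, ny = len(vol), len(vol[0])
    for x in range(x0, x1):
        for y in range(ny):
            col = line_pass([vol[z][y][x] for z in range(nz)])
            for z in range(nz):
                vol[z][y][x] = col[z]

def parallel_squared_edt(binary, threads=4):
    """Unsigned squared EDT of a 3D binary volume given as nested lists."""
    vol = [[[0.0 if v else INF for v in row] for row in plane] for plane in binary]
    nz, nx = len(vol), len(vol[0][0])
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list(...) waits for every task, so phase 2 starts only after phase 1 ends.
        list(pool.map(lambda r: lower_dims(vol, *r), slabs(nz, threads)))
        list(pool.map(lambda r: highest_dim(vol, *r), slabs(nx, threads)))
    return vol
```

In phase 1 every x and y scan line lies entirely inside one z slab, and in phase 2 every z scan line lies entirely inside one x slab, so no thread reads or writes a voxel owned by another thread within a phase. The only synchronization point is the wait between the two phases.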

Comparative Results
Our filter was compared with the itk::SignedMaurerDistanceMapImageFilter and Danielsson filters for accuracy and speed. The arithmetic difference between our filter and Maurer was found to be equal to zero in all cases using analogous settings, as expected from their use of the same base algorithm. The absolute difference between our filter and Danielsson was found to be less than 1, again as expected, and equalled the difference between the nonparallel Maurer filter and Danielsson. The speedup relative to Danielsson is in line with that demonstrated in Tustison et al. [4]. A comparison with itk::ApproximateSignedDistanceMapImageFilter was also performed, and our tests show a marked advantage in speed here as well.
Figures 1 and 2 display the various comparisons of run times for the three filters on a large number of image sizes for a standard 3D test image (a cube of volume 1/8 the total of the image). They clearly show the benefits of our filter in a multiple-thread environment. The advantage over the current ITK single-threaded implementation when running in a single-threaded environment can also be seen. The lower coefficient (both are still O(n)) is primarily due to efficient memory access and iterator use.

Table 1 shows the mean speedups for the filter. First is the unscaled speedup, the ratio of the processing time using 1 thread to the time using p threads, $T_1/T_p$, for the 3D image cube test data. Second is the mean projected scaled speedup, which uses the processing time for an image of size n with 1 thread and the time for an image of size np with p threads, $p \times T_1(n)/T_p(np)$. The scaled speedup thus represents a parallelization with constant region size per thread for each n. The values used for the scaled speedup are the projected times for the sample 3D image cube data based on a linear regression of the data.
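The two speedup measures defined above can be written out as a short Python sketch. The function names and the use of an ordinary least-squares line for the projected times are assumptions for illustration; the exact regression procedure used for Table 1 is not detailed here.

```python
def fit_line(sizes, times):
    """Ordinary least-squares fit: time ~ slope * size + intercept."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def speedup(t1, tp):
    """Unscaled speedup T_1 / T_p for one image size."""
    return t1 / tp

def scaled_speedup(p, fit_1, fit_p, n):
    """Scaled speedup p * T_1(n) / T_p(n * p), with times projected from the fits.

    fit_1 and fit_p are (slope, intercept) pairs for 1 thread and p threads.
    """
    t1 = fit_1[0] * n + fit_1[1]
    tp = fit_p[0] * (n * p) + fit_p[1]
    return p * t1 / tp
```

For perfectly linear timings in which p threads cut the time per voxel by a factor of p, both measures equal p; deviations from that ideal show up as values below p.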
All tests were performed on a machine with 4 Intel Xeon 1.5 GHz processors (2 SMT threads per processor) and 2 GB of physical memory, running SuSE Linux 9.2 and using gcc 3.3.4 and ITK 2.8.1 (using revision 1.3 of itk::SignedMaurerDistanceMapImageFilter).
The test results shown cover the full number of hardware threads on the tested system (8). However, a decrease in the benefit of additional threads is clearly visible after 4 threads. When the filter actively uses both SMT threads on a processor, that processor incurs resource access costs that offset some of the parallelization benefits.


Figure 1: Comparison of run times for 3D image of our parallel Maurer implementation running with a single thread and the existing ITK Maurer implementation.

Figure 2: Comparison of run times for 3D image of our parallel Maurer distance map implementation for a varying number of requested threads.

Table 1: Mean speedup $T_1/T_p$ for cube data and mean projected scaled speedup $p \times T_1(n)/T_p(np)$ for 3D image cube data.

Number of threads                          2      3      4      5      6      7      8
Mean speedup, T_1/T_p                      1.819  2.461  2.972  2.767  2.920  2.955  3.060
Mean scaled speedup, p × T_1(n)/T_p(np)    1.740  2.445  3.054  2.880  3.085  3.172  3.325

Conclusions
Our introduction of SignedMaurerParallelDistanceMapImageFilter, an ITK filter which calculates the signed Euclidean distance in parallel, will be of assistance in various parallel image processing applications. Its theoretical complexity of O(n/p) is a substantial improvement over filters which calculate the estimated EDT or which calculate the exact EDT sequentially. The filter behaves similarly to the established filters and should thus be easily adapted to existing applications.