PARALLEL IMPLEMENTATION OF A MULTI-VIEW IMAGE SEGMENTATION ALGORITHM USING THE HOUGH TRANSFORM

We report on the parallel implementation of a multi-view image segmentation algorithm that works by segmenting the corresponding three-dimensional scene. The algorithm includes the reconstruction of a three-dimensional scene model in the form of a point cloud, and the segmentation of the resulting point cloud in three-dimensional space using the Hough space. The developed parallel algorithm was implemented on graphics processing units using CUDA technology. Experiments were performed to evaluate the speedup and efficiency of the proposed algorithm. The developed parallel program was tested on modelled scenes.


Fig. 1. The scheme of the technology
According to the scheme, a three-dimensional scene model is first constructed from two images. In this paper, we use the camera parameter estimation algorithm described in [10]. Using the Lucas-Kanade method [11], we compute an optical flow that matches points between the first and second images. The point cloud is then formed from the obtained matches by triangulation [12]. Next, the Hough transform is applied to all points of the resulting three-dimensional scene. The maximum of the accumulator space identifies, among all candidate planes, the one that is selected as the background plane. Further, by calculating the distance from each point to the detected plane, we divide the model into two: one model consists of the background points, and the other contains the points of the objects. After these steps, the initial images can be segmented using the segmented scene. The key stages of the technology are considered below.
The goal of the 3D scene model segmentation is to separate the objects from the background of the scene. To detect the planes (background and objects in the scene), the three-dimensional Hough transform is performed. The Hough transform is a method for detecting parametric objects, commonly used to detect lines, circles, and other shapes in an image. For example, in paper [13] the generalized Hough transform is used to detect a variety of two-dimensional objects given a reference contour.
When performing the Hough transform, each point of the initial space is tested against the hypothesis that it belongs to the desired object. For this purpose, the equation of the object is solved for each point of the scene to determine the parameters that span the Hough space. At the final step, the maximum values in the Hough space are determined. Thus, we obtain the parameters of the equation of the desired object, whether it is a line, a circle, or some other figure.
There are also several modifications of the Hough transform: probabilistic, randomized, hierarchical, with phase-space blurring, with the use of the image gradient, and others.
As the input values we use a set of points from three-dimensional real space. A plane can be represented by its normal vector n and the distance ρ from the origin to the plane. Then, for each point p on the plane the following equation is satisfied:
$$ \rho = \mathbf{p} \cdot \mathbf{n} = p_x n_x + p_y n_y + p_z n_z. $$
After substituting the expressions for the angles between the normal vector and the axes of the selected coordinate system, the plane equation can be written as follows:
$$ x \cos\theta \sin\phi + y \sin\theta \sin\phi + z \cos\phi = \rho, \tag{1} $$
where θ and ϕ are the angles defining the normal vector. The coordinates ϕ, θ and ρ form a three-dimensional Hough space such that each point in this space corresponds to a plane in real three-dimensional space. In turn, each point (x0, y0, z0) of real three-dimensional space corresponds to a surface in the Hough space, so that each point (ϕ, θ, ρ) of this surface characterizes a certain plane passing through the point (x0, y0, z0).
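The mapping from a scene point to its surface in the Hough space can be sketched in a few lines of C++. This is a minimal illustration of the angular plane equation above, not the authors' code; the names `Point3`, `HoughSample`, `normal_from_angles`, `plane_rho` and `hough_surface`, as well as the uniform sampling of both angles over [0, π), are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration; the names are not from the paper.
struct Point3 { double x, y, z; };
struct HoughSample { double theta, phi, rho; };

// Unit normal defined by the angles theta and phi of the plane equation.
Point3 normal_from_angles(double theta, double phi) {
    return { std::cos(theta) * std::sin(phi),
             std::sin(theta) * std::sin(phi),
             std::cos(phi) };
}

// rho of the plane with normal direction (theta, phi) that passes through p.
double plane_rho(const Point3& p, double theta, double phi) {
    const Point3 n = normal_from_angles(theta, phi);
    return p.x * n.x + p.y * n.y + p.z * n.z;
}

// Samples the Hough-space surface of point p: one (theta, phi, rho) triple
// per plane through p on an n_theta x n_phi grid of normal directions.
std::vector<HoughSample> hough_surface(const Point3& p, int n_theta, int n_phi) {
    assert(n_theta > 0 && n_phi > 0);
    const double pi = std::acos(-1.0);
    std::vector<HoughSample> surface;
    surface.reserve(static_cast<std::size_t>(n_theta) * n_phi);
    for (int it = 0; it < n_theta; ++it) {
        const double theta = it * pi / n_theta;
        for (int ip = 0; ip < n_phi; ++ip) {
            const double phi = ip * pi / n_phi;
            surface.push_back({theta, phi, plane_rho(p, theta, phi)});
        }
    }
    return surface;
}
```

Every sample returned by `hough_surface` describes one plane through the given point, which is exactly the set of accumulator cells that the point votes for.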
In this paper, we solve the problem of determining the background plane containing the greatest number of points from the formed point cloud. After the parameters $(\hat{\phi}, \hat{\theta}, \hat{\rho})$ of the background plane have been determined, each point of the initial cloud is tested for membership in the plane. To do this, the coordinates of the point are substituted into the plane equation, and the resulting value is compared with a certain threshold ε:
$$ \left| x \cos\hat{\theta} \sin\hat{\phi} + y \sin\hat{\theta} \sin\hat{\phi} + z \cos\hat{\phi} - \hat{\rho} \right| < \varepsilon. $$
All the points satisfying this inequality belong to the plane; the others are considered objects of the scene.
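The threshold test can be sketched as follows. This is an illustrative C++ fragment under stated assumptions: the names `Point3`, `Plane`, `Split`, `split_by_plane` and the parameter `eps` (the threshold) are hypothetical, and the paper does not state the threshold value it uses.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical types for illustration; the names are not from the paper.
struct Point3 { double x, y, z; };
struct Plane { double theta, phi, rho; };  // detected background plane
struct Split { std::vector<Point3> background, objects; };

// Splits the cloud by the distance of each point to the detected plane.
// eps is the threshold; its value is an assumption left to the caller.
Split split_by_plane(const std::vector<Point3>& cloud, const Plane& pl, double eps) {
    assert(eps >= 0.0);
    // Unit normal of the plane from its two angles.
    const double nx = std::cos(pl.theta) * std::sin(pl.phi);
    const double ny = std::sin(pl.theta) * std::sin(pl.phi);
    const double nz = std::cos(pl.phi);
    Split out;
    for (const Point3& p : cloud) {
        const double dist = std::fabs(p.x * nx + p.y * ny + p.z * nz - pl.rho);
        if (dist < eps) {
            out.background.push_back(p);  // point lies on the background plane
        } else {
            out.objects.push_back(p);     // point belongs to a scene object
        }
    }
    return out;
}
```

Because the normal has unit length, the compared value is the Euclidean distance from the point to the plane, so `eps` is expressed in the same units as the point coordinates.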
The results of the model segmentation can be used for the initial image segmentation, since there is a one-to-one correspondence between the pixels of the images and the reconstructed points of the three-dimensional model.
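As a sketch of how this correspondence could be used, the fragment below transfers per-point labels to an image mask. It assumes that each reconstructed point stores the pixel it was triangulated from; the names `PixelRef` and `build_mask` and the label values 0, 1 and 2 are hypothetical and are not taken from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pixel (column u, row v) from which a 3D point was reconstructed.
// Hypothetical type; the paper does not describe its data layout.
struct PixelRef { int u, v; };

// Builds a row-major label mask for one image:
// 0 = no reconstructed point, 1 = background, 2 = object.
std::vector<std::uint8_t> build_mask(int width, int height,
                                     const std::vector<PixelRef>& pixels,
                                     const std::vector<bool>& is_object) {
    assert(width > 0 && height > 0);
    assert(pixels.size() == is_object.size());
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(width) * height, 0);
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        const PixelRef& px = pixels[i];
        // Skip points whose pixel falls outside the image.
        if (px.u < 0 || px.u >= width || px.v < 0 || px.v >= height) continue;
        mask[static_cast<std::size_t>(px.v) * width + px.u] = is_object[i] ? 2 : 1;
    }
    return mask;
}
```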

Sequential implementation of the three-dimensional Hough transform algorithm
Consider the algorithm used in this paper to implement the three-dimensional Hough transform. A three-dimensional array of integer values is used as the accumulator array. Each element of this array corresponds to a plane whose parameters are specified by the coordinates of the element.
Since an exact mapping is impossible due to the discreteness of the array elements, for each point of the point cloud the algorithm increments the values of those elements of the accumulator array that correspond to planes passing through the given point or through its neighbourhood.
In pseudocode, the above algorithm can be written as follows:

Sequential implementation
Input data: Point cloud
Output data: Accumulator array
For each point (x0, y0, z0) in point cloud
    For each angle θ from 0 to π with step π/180
        For each angle ϕ from 0 to π with step π/360
            Calculate ρ according to (1)
            Cast ρ to integer type
            Increment the accumulator element (θ, ϕ, ρ)

As a result of this algorithm, each element of the resulting array holds the number of points of the initial point cloud that lie in the neighbourhood of the plane specified by this element. The element of the array with the maximum value is the required point specifying the background plane.
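A compact C++ sketch of the sequential procedure is given below. It follows the loop structure of the pseudocode, but several details are assumptions made for the example and are not stated in the paper: the names `HoughGrid`, `Peak`, `hough_accumulate` and `find_peak`, the flat row-major layout of the accumulator, and the quantization of ρ into `n_rho` bins over the interval [-rho_max, rho_max] (the paper only says that ρ is cast to an integer).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point3 { double x, y, z; };

// Discretization of the Hough space. The paper uses 180 steps for theta and
// 360 steps for phi; n_rho and rho_max are assumptions of this sketch.
struct HoughGrid {
    int n_theta;     // theta in [0, pi), step pi / n_theta
    int n_phi;       // phi in [0, pi), step pi / n_phi
    int n_rho;       // number of rho bins
    double rho_max;  // rho bins cover [-rho_max, rho_max)
};

struct Peak { int i_theta, i_phi, i_rho, votes; };

// Fills the accumulator: every point votes once per (theta, phi) pair.
std::vector<int> hough_accumulate(const std::vector<Point3>& cloud, const HoughGrid& g) {
    assert(g.n_theta > 0 && g.n_phi > 0 && g.n_rho > 0 && g.rho_max > 0.0);
    const double pi = std::acos(-1.0);
    std::vector<int> acc(static_cast<std::size_t>(g.n_theta) * g.n_phi * g.n_rho, 0);
    for (const Point3& p : cloud) {
        for (int it = 0; it < g.n_theta; ++it) {
            const double theta = it * pi / g.n_theta;
            for (int ip = 0; ip < g.n_phi; ++ip) {
                const double phi = ip * pi / g.n_phi;
                const double rho = p.x * std::cos(theta) * std::sin(phi)
                                 + p.y * std::sin(theta) * std::sin(phi)
                                 + p.z * std::cos(phi);
                // Quantize rho; votes outside the covered range are dropped.
                const double t = (rho + g.rho_max) / (2.0 * g.rho_max) * g.n_rho;
                if (t < 0.0 || t >= g.n_rho) continue;
                const std::size_t ir = static_cast<std::size_t>(t);
                ++acc[(static_cast<std::size_t>(it) * g.n_phi + ip) * g.n_rho + ir];
            }
        }
    }
    return acc;
}

// Returns the first accumulator cell holding the maximum number of votes.
Peak find_peak(const std::vector<int>& acc, const HoughGrid& g) {
    assert(!acc.empty());
    std::size_t best = 0;
    for (std::size_t k = 1; k < acc.size(); ++k) {
        if (acc[k] > acc[best]) best = k;
    }
    const std::size_t per_theta = static_cast<std::size_t>(g.n_phi) * g.n_rho;
    Peak pk;
    pk.i_theta = static_cast<int>(best / per_theta);
    pk.i_phi = static_cast<int>((best % per_theta) / g.n_rho);
    pk.i_rho = static_cast<int>(best % g.n_rho);
    pk.votes = acc[best];
    return pk;
}
```

The innermost increment is the only write to shared data, which is why it becomes the critical operation once the loops are distributed over threads.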

Parallel implementation of the proposed algorithm
The Hough transform is computationally demanding because of the irregular memory access during the increment operation on the accumulator array. CUDA (Compute Unified Device Architecture) technology makes it possible to parallelize this operation. However, because of this irregular and unpredictable memory access, an efficient implementation of the Hough transform algorithm on a graphics processing unit is nontrivial [14].
The architecture of an NVIDIA GPU (graphics processing unit) is built around streaming multiprocessors (SMs) and scales with the number of threads. Each GPU multiprocessor can keep on the order of a thousand threads in flight at a time. When a CUDA program on the host CPU launches a GPU kernel grid, the thread blocks that form the grid are distributed among the SMs. The GPU kernel is the part of the CUDA program code that runs on the GPU. All threads execute the same kernel, but not necessarily simultaneously; threads combined in one block are executed concurrently on the same multiprocessor. The threads of a block are grouped into warps of 32 threads each, and the threads of a warp execute the same instruction at a time [15].
The proposed three-dimensional Hough transform algorithm is implemented as a CUDA program. In a CUDA program, each part of the code is executed either on the CPU (host) or on the GPU (device). The implemented program consists of the five successive steps given below. The side that performs each step is indicated in parentheses (host or device).
The main steps of the CUDA program:
1. allocation of memory for input and output data in the global memory of the GPU (host);
2. copying the input data from RAM into the global memory of the GPU (host);
3. execution of the GPU kernel grid and saving of the calculated values of the accumulator array in the global memory of the GPU (device);
4. copying the results from the GPU global memory to RAM (host);
5. release of the global memory (host).
After the accumulator array is formed, determining the parameters of the required plane becomes trivial.
For the above-mentioned scheme of the CUDA program, two implementations differing in the third step were considered. These implementations differ in the number of parallel processes (threads) and the computational complexity of each of these processes.
In the first parallel implementation, each thread calculates the values of ρ for all angles θ, ϕ for one point of the three-dimensional space. In the second implementation, each thread calculates the values of ρ for all points for one pair of angles θ, ϕ. The drawback of the second implementation is that every thread repeatedly accesses the global memory of the GPU to read the coordinates of the three-dimensional points. For both implementations, however, it is difficult to estimate the number of collisions that arise when different threads need to modify the content of the same memory cell.
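The two decompositions can be contrasted with a host-side C++ sketch. This is not the authors' CUDA code: the functions `thread_per_point` and `thread_per_angle_pair` stand for the work of one GPU thread in implementations 1 and 2, the `run_*` loops stand in for the kernel launch and execute these bodies one after another, and `std::atomic<int>::fetch_add` plays the role of CUDA `atomicAdd()`. All names, the flat accumulator layout and the ρ quantization over [-rho_max, rho_max) are assumptions of the example.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point3 { double x, y, z; };
struct HoughGrid { int n_theta, n_phi, n_rho; double rho_max; };
using Accumulator = std::vector<std::atomic<int>>;

// One vote of point p for the angle pair (it, ip).
// fetch_add stands in for CUDA atomicAdd() on the shared accumulator.
void vote(Accumulator& acc, const HoughGrid& g, const Point3& p, int it, int ip) {
    const double pi = std::acos(-1.0);
    const double theta = it * pi / g.n_theta;
    const double phi = ip * pi / g.n_phi;
    const double rho = p.x * std::cos(theta) * std::sin(phi)
                     + p.y * std::sin(theta) * std::sin(phi)
                     + p.z * std::cos(phi);
    const double t = (rho + g.rho_max) / (2.0 * g.rho_max) * g.n_rho;
    if (t < 0.0 || t >= g.n_rho) return;  // outside the covered rho range
    const std::size_t idx = (static_cast<std::size_t>(it) * g.n_phi + ip) * g.n_rho
                          + static_cast<std::size_t>(t);
    acc[idx].fetch_add(1, std::memory_order_relaxed);
}

// Implementation 1: the thread assigned to point i sweeps all angle pairs.
void thread_per_point(Accumulator& acc, const HoughGrid& g,
                      const std::vector<Point3>& cloud, std::size_t i) {
    assert(i < cloud.size());
    const Point3 p = cloud[i];  // the point is read from memory once
    for (int it = 0; it < g.n_theta; ++it)
        for (int ip = 0; ip < g.n_phi; ++ip)
            vote(acc, g, p, it, ip);
}

// Implementation 2: the thread assigned to (it, ip) sweeps all points,
// so every thread re-reads the whole point cloud.
void thread_per_angle_pair(Accumulator& acc, const HoughGrid& g,
                           const std::vector<Point3>& cloud, int it, int ip) {
    for (const Point3& p : cloud) vote(acc, g, p, it, ip);
}

// Sequential stand-ins for the two kernel launches.
void run_implementation_1(Accumulator& acc, const HoughGrid& g,
                          const std::vector<Point3>& cloud) {
    for (std::size_t i = 0; i < cloud.size(); ++i) thread_per_point(acc, g, cloud, i);
}

void run_implementation_2(Accumulator& acc, const HoughGrid& g,
                          const std::vector<Point3>& cloud) {
    for (int it = 0; it < g.n_theta; ++it)
        for (int ip = 0; ip < g.n_phi; ++ip)
            thread_per_angle_pair(acc, g, cloud, it, ip);
}
```

Both variants produce the same accumulator; they differ only in how the same set of votes is divided among threads. In this sketch, threads of implementation 2 never write to the same cell, because each owns its (θ, ϕ) slice of the accumulator, whereas threads of implementation 1 may collide whenever two points fall into the same ρ bin for the same pair of angles.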
As can be seen from the pseudocode of the parallel implementations, the threads of the two implementations execute loops with different parameters and a different number of operations. The size of the grid also differs.
The speedup of the parallel implementations in comparison with the sequential one was calculated by the following formula:
$$ S = \frac{t_{CPU}}{t_{HtoD} + t_{kernel} + t_{DtoH}}, $$
where $t_{CPU}$ is the execution time of the sequential algorithm; $t_{HtoD}$ is the transfer time of the input data from the RAM of the CPU to the global memory of the GPU (host-to-device); $t_{kernel}$ is the execution time of the CUDA kernel; $t_{DtoH}$ is the transfer time of the resulting data from the global memory of the GPU to the CPU RAM (device-to-host).
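In code, the speedup is the ratio of the sequential time to the sum of the three GPU-side terms defined above. The helper below is a minimal sketch; the names `GpuTiming` and `speedup` are hypothetical, and the timing values in the usage note are made up for illustration and are not measurements from the paper.

```cpp
#include <cassert>

// GPU-side time components in milliseconds (hypothetical struct name).
struct GpuTiming {
    double h_to_d_ms;  // host-to-device transfer of the input data
    double kernel_ms;  // CUDA kernel execution
    double d_to_h_ms;  // device-to-host transfer of the results
};

// Speedup of a parallel run relative to the sequential time t_cpu_ms.
double speedup(double t_cpu_ms, const GpuTiming& t) {
    const double gpu_total_ms = t.h_to_d_ms + t.kernel_ms + t.d_to_h_ms;
    assert(gpu_total_ms > 0.0);
    return t_cpu_ms / gpu_total_ms;
}
```

For example, `speedup(1000.0, {10.0, 80.0, 10.0})` evaluates to 10. Including both transfer terms in the denominator means that the reported speedup reflects the whole GPU round trip rather than the kernel alone.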

Experimental results
To test the efficiency of the CUDA implementations of the parallel algorithm, the following experiments were carried out. A point cloud of 158877 points was used as input data. The experiments were carried out using the following equipment: CPU: Intel Core i7-6700K, 4 GHz; GPU: GeForce GTX 750 Ti. The results of the comparative study of the execution time of the algorithm are shown in Fig. 2, Fig. 3 and Table 1. Fig. 2 illustrates the dependence of the execution time of the sequential and parallel implementations on the number of points. Fig. 3 shows the dependence of the speedup of the parallel implementations on the number of points. Both parallel implementations show the same execution time at 2500 points. However, for larger numbers of points, the first implementation runs about 1.5 times faster than the second one.
The sequential implementation of the Hough transform algorithm took 13 seconds. For parallel implementations 1 and 2, the execution time was 1353 and 2139 milliseconds, respectively. A feature of the parallel implementations is that different threads may simultaneously perform an increment operation on the same variable. To make these increments atomic, the atomicAdd() operation was used. The smallest execution time was registered for parallel implementation 1, in which parallelism was implemented by decomposition over the points of the cloud.

Conclusion
The proposed algorithm was implemented as a C++ program using CUDA technology. Experimental studies of the achievable accuracy and reliability were carried out. The experiments demonstrated the operability of the technology, and a comparative study of the efficiency of the different parallel implementations of the proposed algorithm was carried out. The greatest speedup (by a factor of 9.7) was obtained for parallel implementation 1.