Simultaneous Pose and Correspondence Estimation Based on Genetic Algorithm

Although several algorithms have been presented to solve the simultaneous pose and correspondence estimation problem, the correct solution may not be reached to with the traditional random-start initialization method. In this paper, we derive a novel method which estimates the initial value based on genetic algorithm, considering the influences of different initial guesses comprehensively. First, a set of random initial guesses is generated as candidate solutions. Second, the assignment matrix and the perspective projection error are computed for each candidate solution. And then each individual is modified (selection, crossover, and mutation) in current iterative process. Finally, the fittest individual is stochastically selected from the final population. With the presented initialization method, the proper initial guess could be first calculated and then the simultaneous pose and correspondence estimation problem could be solved easily. Simulation results with synthetic data and experiments on real images prove the effectiveness and robustness of our proposed method.


Introduction
Estimating the position and orientation using 2D images and known 3D targets, which is usually called pose estimation, is a basic problem in computer vision, including hand-eye coordination system, object tracking, augmented reality, and autonomous navigation [1][2][3][4][5][6][7][8].However, when the correspondences between object features and image features are unknown, for example, because the scene contains repetitive patterns or because the 3D points are simply salient features on a geometric model without associated texture or because occlusion and clutter should be taken into account, it becomes difficult because no additional information is available.
In this case, pose and correspondence should be determined simultaneously, which is simultaneous pose and correspondence estimation problem.So far, designing a proper method that can be generally applied to estimate position and orientation when the correspondences are unknown is still a challenging problem.Particularly, this problem also takes occlusion, clutter, image noise, and model deformation into account.The problem can be iteratively solved by optimizing a global cost function.One limitation of such algorithm is that the global minimum cannot be guaranteed.This is alleviated by randomly initializing it at different initial guesses.Meanwhile, the algorithm is quite slow because hundreds of different initial poses are needed to try randomly and it may succeed only when the initial pose is close to the real pose.Obviously, the consideration of occlusion and clutter makes this problem complicated.Yet, certain configurations of the data or situations with large amounts of occlusion and clutter still cause the algorithm to fail.
In this paper, we formalize the simultaneous pose and correspondence problem comprehensively and present a novel initialization method based on genetic algorithm.We note that the relationships between each randomly initial guess are neglected by the traditional random-start initialization method.In this case, an initial guess may still be used to minimize the global objective function, even when a very similar guess has also been attempt.What is worse, an appropriate initial guess may not be selected in a long time if there are many local optima.Actually, each initial guess has different influence to the global optimum and a valid method should consider the different results of previous attempts during 2 International Journal of Distributed Sensor Networks searching for the global optimum.Our initialization method is derived based on the analysis.Simulation results and experiments on real images suggest that our method has faster convergence speed and higher convergence rate than the traditional random-start method, which can be used to solve the simultaneous pose and correspondence problem effectively and robustly.
The rest of this paper is organized as follows: Section 2 carries out a study of simultaneous pose and correspondence method and some relevant proposals using the genetic algorithm.Section 3 provides a thorough analysis of the simultaneous pose and correspondence estimation problem and genetic algorithm.We describe our initialization method based on genetic algorithm in Section 4. Section 5 presents experimental results using both synthetic data and real images and compares our algorithm with the state-of-theart pose estimation algorithms.Conclusions are drawn in Section 6.

Related Works
In literature, many algorithms have been developed to solve the simultaneous pose and correspondence problem.Most of them determine correspondences between 3D targets and 2D image features explicitly or implicitly before computing pose [9][10][11].The hypothesize-and-test algorithm, for example, RANSAC, is frequently used.In this approach, a small set of 2D image features and 3D object features are selected first and the correspondences between them are then hypothesized.Based on the hypothesized correspondences, the pose can be easily computed.All the 3D object features are reprojected into the image plane with the estimated pose.If the original and reprojected images features are sufficiently similar, the pose is accepted; otherwise, a new hypothesis is formed and the process is repeated.
In the above algorithms, the correspondence problem and the pose estimation problem are solved separately, but the relationships between pose and correspondence are neglected.Time complexity may become great, especially when the number of 3D object points or 2D image points is high.
Skrypnyk and Lowe [12] presented another similar algorithm which uses view-variant 2D image features to index 3D models.In his approach, a process named off-line training is executed to learn 2D feature groupings associated with large number of different views of the 3D models.Then, the online recognition stage is performed to index 3D object models in a database of learned object-to-image correspondence hypotheses.Correspondences could be determined based on the object recognition results, which are used for pose estimation and final verification.
In [13], Wunsch and Hirzinger formalized the problem as the optimization of an objective function combining all the correspondence and pose constraints.Combined with a hybrid pose estimation algorithm, a random-start local search method is performed in the object-to-image space.However, in their algorithm, the correspondence constrains are not represented analytically.Instead, each object feature is explicitly matched to the closest sight line of the image features.
David et al. derived a very similar algorithm named SoftPOSIT in [14] by defining a global objective function, which combined an iterative correspondence assignment technique called Softassign [11] and an iterative pose estimation technique called POSIT [15] into a single iteration loop.SoftPOSIT stands out in the simultaneously estimating pose and correspondence approaches because of its accuracy and speed.The drawback is that it tries different initial poses and succeeds when the initial pose is close to the real pose.
SoftSI algorithm [16] proposed by Zhou et al. is also based on the global objective function, which can simultaneously obtain pose and correspondences.The SoftSI algorithm combines the SI algorithm and two singular value decomposition-(SVD-) based shape description theorems.By analyzing the calculation process of SI algorithm, the method can avoid pose ambiguity and quickly eliminate bad initial values.
This objective function can be also solved by different optimization algorithms [17][18][19][20][21] and can contains line features [22], cornerless images [23], and nonrigid shape [24].Unfortunately, there is no guarantee of finding the global optimum given a single random-start initial guess in the algorithm, especially when the number of points is not equal because of occlusion, clutter, and image noise.
Genetic algorithm (GA), which simulated the natural evolution process, is a useful tool for many optimization problem.Recently, GA and its variations have been used in many fields, such as computer vision, artificial intelligence, pattern recognition, telecommunication, smart sensing, and mobile computing.In telecommunications, the quality of service (QoS) parameters may conflict with each other, so this problem is actually a multiobjective optimization [25].A genetic algorithm can be used to solve complicated global optimization problems.In quantum computing, genetic algorithm can be used in stimulated annealing for combinatorial optimization so as to avoid premature convergence [26].

Problem Formulation
In this section, we first give a description of camera model and the formulation of pose estimation algorithm with known correspondences, using the closed-form global optimal function.Then the function is modified to characterize the global pose-correspondence problem without known correspondences.The genetic algorithm is introduced at last.

Camera Model.
Generally, a camera can be seen as a pinhole model, which describes the mathematical relationship between the coordinates of a 3D point and its projection onto the image plane, where the camera aperture is described as a point.Some of the effects that the pinhole camera model does not take into account can be compensated, for example, by applying suitable coordinate transformations on the image coordinates.This means that the pinhole camera model often can be used as a reasonable description of how a camera depicts a 3D scene, for example, in computer vision and computer graphics.The geometry related to the mapping of a pinhole camera is illustrated in Figure 1.
In the model, a 3D orthogonal coordinate system is abstracted with its origin at   , which is also where the camera aperture is located.The three axes of the coordinate system are referred to as   ,   , and   .Axis   is pointing in the viewing direction of the camera and is referred to as the optical axis, principal axis, or principal ray.The image plane is parallel to axes   and   and is located at a distance  from the origin   in the negative direction of the   axis.A practical implementation of a pinhole camera implies that the image plane is located such that it intersects   axis at coordinate − where  > 0.  is also referred to as the focal length of the pinhole camera.
A point   at the intersection of the optical axis and the image plane is referred as the principal point or image center.
A point  somewhere in the world has the coordinates (, , ) relative to the axes   ,   , and   .The projection line of point  into the camera (the green line in Figure 1) passes through point  and the point   .The projection of point  onto the image plane is denoted as .This point is given by the intersection of the projection line (green line) and the image plane.In any practical situation we can assume that  > 0, which means that the intersection point is well defined.
There is also a 2D coordinate system in the image plane, with origin at   and with axes  and  which are parallel to   and   , respectively.The coordinates of point  relative to this coordinate system is (, ).
As depicted in Figure 1, the projection function can be formulated as which can be rewritten in homogenous coordinate system: The perspective projection of a pinhole model.
In image coordinate system, there is an equation as follows: where  0 and V 0 are the image center in image coordinate system and   and   are scale factors.
According to ( 2) and ( 3), the pinhole model of a camera can be described as where   =   and   =   are defined as equivalent focal length of axes   and   , respectively.  ,   ,  0 , and V 0 are called the camera intrinsic parameters.

Pose Estimation with Known Correspondences.
The mapping from 3D to 2D coordinates described by a pinhole camera is a perspective projection followed by a 180 ∘ rotation in the image plane, which corresponds to how a real pinhole camera operates.The resulting image is rotated 180 ∘ and the relative size of projected objects depends on their distance to the focal point and the overall size of the image depends on the distance  between the image plane and the focal point.
In order to produce an unrotated image, which is what we expect from a camera, we can place the image plane so that it intersects the   axis at  instead of at − and rework the previous calculations.This would generate a virtual (or front) image plane which cannot be implemented in practice but provides a theoretical camera which may be simpler to analyze than the real one.This relationship is illuminated in Figure 2.
Assuming that the coordinates of corresponding 3D object points and normalized 2D image points are P  = [  ,   ,   ]  and k  = [  ,   ]  , the perspective projection camera model can be described as where   is a scale factor.R and T are the rotation matrix and translation vector, respectively, which describe the relationship between the camera and the 3D target.Pose estimation with known correspondences is then to seek the optimal rotation matrix R and translation vector T that best satisfy the equations in (5).According to the above equation, R is a 3 × 3 matrix and T is a 3 × 1 vector, which can described as where R  1 , R  2 , and R  3 indicate the row vectors of rotation matrix R and each of them is a unit vector.  ,   , and   represent the coordinates of the original point of object coordinate system onto the three axes of the camera.
The perspective projection camera model can then be rewritten as If we multiply the same factor  = 1/  on all the elements of the right-side perspective projection matrix and the leftside homogeneous image point coordinates, the equality is not affected.Introducing a scaling factor   =   /  , we obtain with In photogrammetry, the pose estimation problem is usually formulated as the problem of optimizing the following objective function: where  is the number of 2D image points.To simplify the subsequent notation, we introduce three new vectors Q  , Q  , and P with four homogeneous coordinates: We call Q  and Q  the pose vectors.
In the objective function , the pose vectors are those for which all the partial derivatives of the objective function with respect to the coordinates of these vectors are zero.Q  and Q  can then be solved using the conditions: P  ) . (12)

Pose Estimation with Unknown
Correspondences.According to the above analysis with known correspondences, pose vectors can be computed in each iterative loop by minimizing the object function .When correspondences are unknown, each image point k  can potentially match any of the object points P  , and therefore the scaling factor  should be corrected specific to the coordinates of P  : where  is the number of 3D object points.Therefore, the simultaneous pose and correspondence problem can then be formulated as follows: where   (equal to 0 or 1) are correspondence weights.
In next stage, we should find a zero-one assignment matrix m = {  }, which specifies the matching between a set of  image points and a set of  object points explicitly.The assignment matrix m, which has one row for each of image point k  and one column for each of object point, must satisfy the constraint that each 2D image point matches at most one 3D object point and vice versa.Given that pose vectors have been estimated, the assignment matrix can be calculated by the iterative Softassign technique [11].
Assuming the correspondence variables   are known and fixed, the pose vectors Q  and Q  can then be determined through minimizing the objective function in an iteration step.The solutions are with    = ∑  =1   .The objective function is minimized iteratively, with the following three operations at each iteration step: (1) Given the initial guess of pose vectors Q  and Q  , compute the assignment matrix m. (2) Assuming the scaling factor   is known and fixed, calculate the pose vectors using the assignment matrix m. (3) Using the estimated pose vectors Q  and Q  , update the scaling factor   .
The above iterative approach can be summarized as follows.First, given an appropriate initial guess for the pose vectors according to the set of 2D image points and 3D object points, then the assignment matrix which represents correspondence between the image points and object points is estimated.Finally, the pose vectors are updated, using the results of the correspondences between 2D image points and 3D object points.This process is repeated until these estimations converge.In this case, through the final pose vectors pose variables can be calculated as follows: Then the final rotation matrix R and translation vector T are  3.4.Genetic Algorithm.Genetic algorithm (GA) was an optimal model which simulated the natural evolution process [25][26][27].The algorithm was first presented by John Holland in 1975.
In a genetic algorithm, a population of candidate solutions, called individuals, to an optimization problem is evolved toward better solutions.Each candidate solution has a set of properties, its chromosomes, which can be mutated and altered.Traditionally, candidate solutions are represented in binary as strings of 0 and 1, but other encodings are also possible.
The evolution usually starts from a population of randomly generated individuals and is an iterative process, and the population in each iteration is called a generation.In each generation, the fitness of every individual in the population is evaluated, which is usually the value of the objective function in the optimization problem being solved.The more fit individuals are stochastically selected from the current population, and each individual's genome is modified to form a new generation.The new generation of candidate solutions is then used in the next iteration of the genetic algorithm.Commonly, the genetic algorithm terminates when either a maximum number of generations has been produced or a satisfactory fitness level has been reached for the population.
A typical genetic algorithm requires two aspects.The first one is a genetic representation of the solution domain, and the second one is a fitness function to evaluate the solution domain [25][26][27].
The flow chart of genetic algorithm is depicted in Figure 3, where   is the probability of crossover and the   is the probability of mutation.
Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion, and selection operators.

Our Initialization Method
To solve the simultaneous pose and correspondence problem, an iterative process starting from an initial guess is necessary [14].Usually, the minimized objective function has many local optimal because of occlusion, clutter, and image noise.If the initial guess is not appropriate, the correct global optimum will not be reached to or the objective function will converge to its other local optimal.Therefore we present an effective strategy based on genetic algorithm to solve the simultaneous pose and correspondence problem.

Initialization.
The random-start method is a common initialization method, which starts from a random initial guess.There are 12 different variables, including a 3 × 3 rotation matrix R and a 3×1 translation vector T. All the variables can be generated through six degrees of freedom and include three Euler angles and the translation parameters.Therefore the traditional random-start initialization method generates a random 6D vectors  0 = (, , , , , ) in a 6D hypercube.Then the rotation matrix R and translation vector T are defined as

Algorithm Framework.
In the traditional random-start initialization method, beginning from a random initial guess, pose and correspondence are iteratively estimated until meeting a specified termination criterion.If the calculated pose and correspondence are not correct, all the steps are processed again with another new random initial guess.However, the random-start method does not consider the different influence of each random initial guess because they are independent with each other in this method.Even worse, the initial guess  0 may repeat in some range, and the count of initial guess will increase rapidly and the convergence rate will decrease sharply.
To get an appropriate initial guess, we present a new initialization method based on genetic algorithm [21].First, instead of generating an initial guess one by one, a set of  random initial guesses { ()  0 },  = 1, 2, . . ., , is generated as candidate solutions called a population.Second, the assignment matrix m is computed for each candidate solution in the population and then the perspective projection error  is evaluated as the fitness of every individual in the population.Finally, the more fit individuals are stochastically selected from the current population, and each individual is modified (selection, crossover, and mutation) to form a new population.The new population of candidate solutions is then used in the next iteration of the algorithm.
In practice, the search for better solution should be terminated when the current solution is such that the number of matching points is smaller than the threshold   , which is defined as where  is the total number of the image points and  is a scale factor of matching rate, which determines what percent of the detected object points must be matched.In this case, we can deduce that the current initial guess is not suitable for the simultaneous pose and correspondence problem and a new initial guess should be given.Due to image noise, occlusion, and measurement error,  will not reach to 1 even when a good initial guess is found.Therefore, in the experiments discussed below, we take  ≡ 0.75.This test is not perfect, as it is possible for an initial guess to be very accurate even when the number of matched points is less than this threshold, which occurs mainly in cases of high noise.Conversely, a wrong initial guess may be accepted when the ratio of clutter features to detected object points is high.However, these situations are relatively uncommon both in simulation case and in practice.

Experiments
We first show the simulation results to confirm the effectiveness of our initialization method, compared with traditional random-start initialization method.Then experiments with real images intuitively present the computed pose and correspondences.

Synthetic Experiments.
We generate synthetic data as follows: 3D object has eight points, distributing in the vertexes of a cube; there are seven 2D image points with unknown correspondence to 3D object points, where one vertex of the cube is occluded, as shown in Figure 4.All the coordinates of image points have been normalized for the sake of convenience.Connecting edges are used for understanding, instead of computing.
In synthetic experiments, the convergence rate and convergence speed of global object function are evaluated, respectively, initialized with the traditional random-start method and our novel initial method.
An initial guess is convergent only when the global objective function  in ( 14) reaches to a stable minimum with finite iterations.It means that, with the initial guess, pose and correspondence can be calculated simultaneously in a few of iterations.
We first compute the convergence rate (CR) at different noise levels, which is defined as CR = number of convergent initial guesses total number of initial guesses , where total number of initial guesses is set to 1000 in each different noise level.
Figure 5 shows the convergence rates of our initial method and the traditional random start method as a function of noise level.In this figure, each point depicts results averaged over 1000 random initial guesses.It can be seen that our initial method has a higher convergence rate and decreases more slowly than the traditional random-start method.This is due to the fact that our initial method using genetic algorithm considers different influences between each initial guess and has a higher probability of approaching the minimum of the global objective function than the traditional random start method.
In the synthetic experiment, a set of random initial guesses is first generated as candidate solutions in our method based on genetic algorithm.Then certain processes are done with all the initial guesses until that the proper pose is determined.Similarly, the traditional random-start method also generates a list of globally aware guesses and tests against these random guesses as a whole.At this time, if the list of initial guesses contains the proper pose, the selection process terminates.However, the proper pose may be not in the list, and in this case, another list of initial guesses is needed to randomly generate.The whole procedure does not consider International Journal of Distributed Sensor Networks the former initial guess that has already been computed.On the opposite, our method based on genetic algorithm takes all the initial guesses as a whole population and new initial guess is generated according the results of former population.
We then compare the convergence speed when an initial guess computed by our method or random start method is convergent.The results are depicted in Figures 6 and 7.
From Figures 6 and 7, we can see that the value of objective function  decreases quickly with our initial guess and the number of iterations is always low.However, initialized with random start, the value of object function  decreases quickly only at the first several iterations and becomes very slowly at the subsequent iterations, which leads to a very large number of iterations.Meanwhile, the random start method does not consider the different influence of each random initial guess because every initial guess is independent with each other.Even worse, the initial guess may repeat in some range, and the count of iterations will increase rapidly and the convergence speed will decrease greatly.When in presence of large amounts of clutter, occlusions, or image noise, the random-start method searches for the proper initial pose in the pose space by using a RANSAC-style approach.It should consider all possible correspondences between 2D image points and 3D object points.What is worse, the initial guesses are represented in a 6D pose space ( 0 = (, , , , , )).As a consequence, it would lead to computational explosion and quickly become computationally intractable with realistic numbers of features.Figure 8 depicts the average time consumption until the initial guess is convergence.
When the noise level is low, the initial guess in the 6D pose space has high probability to converge.Only a few of times is needed to determine the proper guess with the randomly selected initial pose.However, when the noise level becomes high, the time consumption increases obviously because the convergence rate fast decreases and more times are needed to try to find a proper initial value.On the opposite, the average time consumption increases slowly with the different noise level.Figure 9 shows an example computation of simultaneous pose and correspondence algorithm initialized with our method, using the cube model depicted in Figure 4.
In Figure 9, we can see that the perspective projection of 3D object points overlaps with original 2D image points gradually after few iterations, giving the appreciate initial guess calculated by our method, which means that pose between the coordinate system of 3D object and the coordinate system of camera can be estimated with unknown correspondences and occlusion.

Real
Images.We test our initialization method for simultaneous pose and correspondence problem on real images.
In the test, there are 11 or 10 image points in 2D image and 28 object points in the 3D model.An initial pose is first given by random start method (the results are showed in Figure 10) and then by our initialization method (the results are showed in Figure 11).Figure 12 is another experiment results initialized using our method.
3D object model with 28 points is known and 2D image points are recognized by our implementation of Harris algorithm [12].
As depicted in Figure 10, with the unsuitable initial pose given by random start method, the final pose is not correct and the simultaneous pose and correspondence problem cannot be solved very well.
As shown in Figures 11 and 12, the number of image points is less than 3D object points; what is more, the correspondences between 2D image points and 3D object points are unknown due to occlusion, clutter, and image noise.Using our initialization method based on genetic algorithm, an appropriate initial guess is first calculated and then pose and correspondences are estimated with the initial value.Finally, 3D object model is accurately reprojected onto image plane using the rotation matrix and translation vector contained in pose.The effectiveness of the initial guess can be verified by observing the overlapping degree between the perspective projection and the original image.

Conclusions and Future Work
In this paper, we present an initialization method for determining the pose of 3D objects from images to solve the simultaneous pose and correspondence problem.Pose and correspondences can be computed simultaneously by minimizing International Journal of Distributed Sensor Networks   Compared with the traditional random-start initialization method, our proposed method has higher convergence rate and lower number of iterations, which has been verified through experiments.Future work will involve initializing the simultaneous pose and correspondence algorithm automatically using special features extracted in 3D object and 2D image plane.The effectiveness of other optimization algorithms would be analyzed.We are also interested in implementing a more thorough formalism to include initialization, pose estimation, and correspondence determination.

Figure 1 :
Figure 1: The pinhole model of a camera.

Figure 3 :
Figure 3: The flow chart of genetic algorithm.

Figure 4 :
Figure 4: 3D object points and 2D image points with one occluded point.

Figure 5 :Figure 6 :
Figure5: Convergence rate at different noise level.The results initialized with random start method are plotted as circles (I) and the results using our initialized method are plotted as squares (◻).

Figure 7 :Figure 8 :
Figure 7: Number of iterations at different noise level.

Figure 9 :
Figure 9: The iteration step of perspective projection with an initial value calculated by our method.

Figure 10 :
Figure 10: The results of real image initialized by random start method (11 image points): (a) 2D image points  (point number 9); (b) the projection of 3D model with the calculated initial pose; (c) the projection results at each iteration; (d) the final result of simultaneous pose and correspondence algorithm.

Figure 11 :
Figure 11: The results of real image initialized by our method (11 image points): (a) 2D image points; (b) the projection of 3D model with the calculated initial pose; (c) the projection results at each iteration; (d) the final result of simultaneous pose and correspondence algorithm.

Figure 12 :
Figure 12: The results of real image initialized by our method (10 image points): (a) 2D image points; (b) the projection of 3D model with the calculated initial pose; (c) the projection results at each iteration; (d) the final result of simultaneous pose and correspondence algorithm.
cos  − sin  cos  + cos  sin  sin  sin  sin  + cos  sin  cos  sin  cos  cos  cos  + sin  sin  sin  −cos  sin  + sin  sin  cos