A family of globally optimal branch-and-bound algorithms for 2D–3D correspondence-free registration

We present a family of methods for 2D–3D registration spanning both deterministic and non-deterministic branch-and-bound approaches. Critically, the methods exhibit invariance to the underlying scene primitives, enabling e.g. points and lines to be treated on an equivalent basis, potentially enabling a broader range of problems to be tackled while maximising available scene information, all scene primitives being simultaneously considered. Being a branch-and-bound based approach, the method furthermore enjoys intrinsic guarantees of global optimality; while branch-and-bound approaches have been employed in a number of computer vision contexts, the proposed method represents the ﬁrst time that this strategy has been applied to the 2D–3D correspondence-free registration problem from points and lines. Within the proposed procedure, deterministic and probabilistic procedures serve to speed up the nested branch-and-bound search while maintaining optimality. Experimental evaluation with synthetic and real data indicates that the proposed approach signiﬁcantly increases both accuracy and robustness compared to the state of the art. © 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license. ( http://creativecommons.org/licenses/by/4.0/ )


Introduction
This paper deals with the general problem of 2D-3D registration where given an image taken by a calibrated camera and a 3D model, the objective is to determine the pose of the camera with respect to the model.While there exist established solutions to this problem in the case where correspondences are known, there are many situations where it is not possible to reliably extract such correspondences across modalities, thus requiring the use of a correspondence-free registration algorithm.Existing correspondence-free methods rely on local search strategies and consequently have no optimality guarantee.In this paper we present a family of globally optimal solutions to the 2D-3D registration problem from points and lines without correspondences and in the presence of outliers.Fig. 1 illustrates how these solutions can be used within a 2D-3D registration pipeline.2D-3D registration finds use in a range of tasks such as motion segmentation [1] , object localisation and recognition [2] , with practical applications in many areas including vehicle navigation [3] , media visualisation [4] , medicine [5,6] and forensics [7] .
Despite considerable progress in feature extraction and singlemodality registration (e.g.2D-2D or 3D-3D), the general 2D-3D registration problem remains challenging.While there exist techniques to extract features in the 2D and 3D domains (e.g.corners [8] , salient features [9] or lines [10,11] ), it is an open problem to automatically establish correspondences between them.This may be explained by a variety of reasons.First, feature appearance can vary dramatically between 3D and its 2D projection due to the non-linear nature of the transformation; a 3D feature may be projected from a large range of viewpoints and perspective distortion may occur as well as view-dependent appearance variations if the material is non-Lambertian.Second, in the specific case of lines, there are many scenes where it is difficult to establish correspondences based on appearance, for example in highly repetitive manmade scenes or where low-width structures are present [12] .Finally, and more generally, correspondences of any feature type are particularly difficult to hypothesise when the 3D model is untextured, as is often the case if it is obtained by a laser range scanner.
The lack of feature correspondences renders traditional hypothesise-and-test approaches (e.g.RANSAC [13] ) practically obsolete due to the very high computational complexity of the problem.State-of-the-art approaches e.g.[14,15] search over the transformation space and scale cubically with the number of features, but are not robust to the high rates of outliers required for the problem at hand.However, existing approaches only search for local maxima and hence i) require a good initialisation and ii) are sub-optimal, particularly for higher rates of outliers.
In this paper we propose a globally optimal solution to this problem, achieved via a Branch-and-Bound (BnB) strategy.It recursively searches the transformation space, bounding the objective function at each stage and discarding parts of the transformation space for which it is impossible for the solution to lie in.Eventually, the remaining transformation space is tightly bounded and it may be concluded that transformations in the remaining space must be within of the globally optimal solution.Furthermore, the approach is not restricted to one feature type, but instead can be applied to the case where points, lines, or a mixture of each are present.
Within the proposed BnB algorithm a nested BnB structure is used (similarly to Yang et al. [16] ), whereby an outer BnB searches over the rotation component, with an inner BnB searching for the camera centre at each stage.It is in general faster than searching the full 6D parameter space directly since large parts of the rotation space may be unconditionally discarded, and since evaluating each bound is faster as features are only rotated once for the outer BnB.We extend upon this idea by proposing two extensions to the nested BnB structure in order to speed up the convergence without compromising on the accuracy of the solution.
In the first instance, a deterministic annealing procedure is implemented that gradually increases the accuracy of the search as the algorithm progresses.As such, early regions of rotation space may be more quickly evaluated, and the algorithm can focus its search at the later stages where it is nearing convergence.Secondly, we propose a probabilistic variant, whereby the inner BnB of less promising areas of rotation space is evaluated to a lower accuracy compared to more promising areas of rotation space.Both approaches result in a significant speed-up to the algorithm as demonstrated across a range of experiments on synthetic and real data.
The paper makes the following contributions.Firstly, we propose a globally optimal solution to this problem, achieved via a Branch-and-Bound strategy.Its formulation readily allows for both point and line features to be used, allowing it to be applicable to a broader range of scenes and also exploiting the complementarity of these different types of features to improve registration accuracy and robustness.Secondly, we propose novel formulations that allow for the speed-up of nested BnB algorithms while preserving the optimality properties of the solution.The approach is evaluated against the state of the art where significant improvements are demonstrated: our approach is more accurate and sig-nificantly more robust to high rates of outliers compared to existing approaches.
The paper is based on our previous work [17] which it extends in several ways.Firstly, the formulation is generalised to simultaneously allow use of both point and line features for globally optimal registration.This broadens the applicability of the method over our previous approach and improves its performance.Secondly, the methodology is further developed to include deterministic and probabilistic nested BnB formulations, resulting in a significant performance speed up while preserving optimality.Finally, the experimental evaluation is considerably improved through consideration of a broader range of synthetic and real datasets, comparison against an additional RANSAC approach, and use of a more realistic evaluation protocol based on features obtained from recently proposed 2D and 3D salient feature detectors (as opposed to 2D features backprojected to the 3D domain which were used in our previous work).
The structure of this paper is as follows.Related work is discussed in Section 2 .Section 3 formally defines the scope of the problem before the proposed Branch-and-Bound approach is detailed in Section 4 .Section 5 describes the deterministic and probabilistic nested BnB formulations.The different approaches are then evaluated against the state of the art on synthetic and real dataset in Section 6 .Finally, conclusions and avenues for future work are discussed in Section 7 .

Related work
A traditional approach to the feature registration problem is the hypothesise-and-test RANSAC algorithm [13] .RANSAC relies upon hypothesising small sets of 2D-3D correspondences (of size 3 for the 6 parameter 2D-3D registration problem), determining the transformation parameters from the small set of correspondences, and verifying the transformation against the rest of the features.Assuming there are N 2D features and M 3D features, there are a total of NM 3 hypothetical sets of size 3 correspondences to choose from.Assuming there are only kN inlying feature correspondences (where k is the inlier ratio, k < 1), there are a total of kN 3 sets of size 3 of inlying correspondences.As a result, the expected number of correspondences that must be hypothesised before finding an inlier set is O(( M 3 ) 3 ) .However, the hypothesis verification stage requires projecting the 3D features onto an image plane and determining their nearest neighbours from the 2D features.Hence, for 2D-3D feature registration where correspondences are unknown,

RANSAC has complexity O(
The above analysis is too simple-in reality, a set of 3 inlying correspondences may not lead to the optimal transformation due to noise.This was observed by Chum et al. [18] who propose an outer and an inner RANSAC loop, whereby whenever a new best solution is found the inner RANSAC locally searches from the smaller, inlying set of correspondences.It was more formally addressed by Imre and Hilton [19] who minimise the total number of iterations within each stage of such a two-stage RANSAC approach.Alternative extensions have been proposed to improve the speed of RANSAC e.g.WALDSAC [20] that evaluates the potential correspondences of a transformation in an optimal order.However, no RANSAC variant is able to reduce the very high complexity for this particular problem.The high complexity of RANSAC for this problem has led to more recent approaches e.g.[14,15] that search over the transformation space rather than potential correspondences leading to lower complexity of O(N 3 ) .Machine learning approaches have recently been applied to the 2D-3D registration problem.PoseNet [21] by Kendall et al. uses a CNN for 2D-3D registration of an outdoor scene, where the scene is obtained by Structure-from-Motion (SfM).Its accuracy is, however, somewhat limited-an issue later addressed by Kendall and Cipolla [22] , where the authors specifically focus on applying a geometric loss function to the network, thereby improving the accuracy over their previous work.Conversely, a Random Forest approach has been proposed by Shotton et al. [23] , however, this is for the slightly easier task of registering a 3D scene to an RGB-D image.ML approaches may also be applied to specific subcomponents of the 2D-3D registration problem, for example DSAC [24] who replace the deterministic RANSAC hypothesis with a smooth, differentiable objective function.However, RANSAC-based approaches fundamentally scale poorly where correspondences are unknown.
In the next two subsections, we review specific approaches that have been proposed in the cases of point features and line features respectively.The authors are not aware of any approach that explicitly uses points and lines within the same framework, therefore these two types of approaches are discussed separately.The section is concluded with a survey of branch-and-bound approaches that have been proposed to solve related geometry estimation problems.

Point-based methods
One of the best, early approaches to 2D-3D registration using points where correspondences are unknown is the SoftPosit algorithm [14] .It locally searches the transformation space while simultaneously determining the correspondences between 2D and 3D points.At each iteration, multiple, weighted correspondences are hypothesised based on the pose and points' nearest neighbours under the pose; and subsequently the pose is determined from the multiple, weighted correspondences.An annealing parameter is used within the weighting that ensures the algorithm converges towards hypothesising one-to-one correspondences as it progresses.
Moreno-Noguer et al. [15] have proposed a solution known as BlindPnP , by modelling an initial set of poses by a Gaussian Mixture Model and using each component to initialise a Kalman filter.Potential 2D and 3D points are considered in turn by the model to update the mean and covariance; eventually the algorithm determines a solution with high confidence.It performs comparably to SoftPosit in a similar amount of time except in large amounts of clutter, where SoftPosit is outperformed by BlindPnP .
An interesting solution has been proposed by Enqvist et al. [25] who compute pairwise constraints between pairs of potential correspondences.By creating a graph of all possible pairs of correspondences, the optimal solution is found by determining the largest set of pairwise consistent correspondences, formulated as a vertex cover problem.However, results were only given when cor-respondences were hypothesised and the problem was inlier set maximisation; it is unclear how it would perform if no correspondences could be known between the 2D and 3D points.
Other proposed solutions are a lot more restrictive, e.g. both [26,27] solve the problem where no outliers are present.Zhou and Zhang [26] use this to obtain global information e.g. that the mean of the 3D points should project onto the mean of the 2D points, and Marques et al. [27] view the problem as a correspondence permutation problem, which they solve by a convex relaxation procedure.The assumption however is unreasonable in many scenarios, where an algorithm that is robust to high outlier rates is required.

Line-based methods
An early solution to 2D-3D registration from correspondencefree lines is proposed by Beveridge and Riseman [28] who use a local search procedure to iteratively arrive at local optima.They investigate how easy the problem is; evaluating expected run-time as a function of the number of lines and amount of clutter.Bhat and Heikkilä [29] systematically sample and rank the space of potential poses however it is computationally inefficient for large numbers of lines.Alternatively, the SoftPosit algorithm has been extended to use lines [30] .At each iteration, the algorithm finds the nearest point of each 2D line for the endpoint of each 3D line.This point assignment enables it to adapt to the original SoftPosit algorithm for points.Some approaches to registration using lines can make restrictive assumptions.It is not uncommon to assume a Manhattan World where all lines are orthogonal, which may be used to speed up the algorithm e.g. by restricting the search space [31] .Alternatively, detected lines may be viewed as edges on a graph, leading to a graph matching approach [32] .However the graph structure is typically not preserved under a projective transformation, and the approach is more suited to other tasks e.g.aerial image registration.
All existing approaches to 2D-3D correspondence-free registration are heuristic, with no guarantee of optimality.In contrast, here we present a globally optimal solution to the problem, achieved via a branch-and-bound approach.By solving the problem in a globally optimal manner, our approach is demonstrably more robust to high rates of outliers compared to the state of the art.Furthermore, the approach naturally allows for both points and lines to be used within the same framework, in contrast to the approaches reviewed above.

Branch-and-bound methods
Branch-and-bound solutions to geometry estimation in computer vision have been proposed for a number of different problems, typically requiring novel derivations of bounds in each case.
Many BnB approaches in registration rely on linear programming (LP) techniques to compute bounds, e.g.[33] , whereby bounds may be computed as solutions to a LP.In a naive form they may only be applied to linear transformations, so to be more widely applicable nonlinear constraints are relaxed into linear convex and concave envelopes to compute upper and lower bounds respectively (e.g.[33,34] ).The optima of each envelope are determined as the bounds for the region of space: as the size of the region decreases the difference between the optima decreases and so the algorithm converges.LP relaxation techniques have been developed for complex and highly non-linear problems e.g.[35] , where it is used for inlier set maximisation where correspondences are unknown.With respect to the 2D-3D registration problem, Jurie [36] approximates perspective pose by orthographic pose (a linear transformation) to create a problem that may be solved by similar techniques without the need for convex or concave envelopes.However, its use of the Gaussian error model results in an approach that is not robust to outliers.
Alternative BnB approaches compute bounds that are geometrically meaningful .The earliest approaches are due to Breuel [37] who focuses mainly on 2D-2D registration problems with up to 4 degrees of freedom.He derives geometrically meaningful bounds that describe the maximum distance a feature can move by under a bounded set of transformations.He also proposes the use of matchlists: potential correspondences are kept when searching new parts of the transformation space so as to speed up nearest neighbour searches.The geometrically meaningful approach to computing bounds has been used for more complex problems, e.g.two-view translation estimation [38] and relative orientation estimation [39] .Geometric bounds have been non-trivially derived for the group of 3D rotations by Hartley and Kahl [40] by considering rotations in their minimal axis-angle representation.This has allowed for globally optimal relative pose estimation [40] , and 3D-3D registration [16] .In the latter case an outer BnB algorithm searches over the rotation space while an inner BnB searches for the translation.
Recent BnB approaches have focused on creating efficient search mechanisms.For example, Parra Bustos et al. [41] propose an efficient bounding mechanism for 3D rotations, based on the insight that a rotation leaves the magnitude of a point unchanged.Alternatively, a novel, efficient approach was proposed by Chin et al. [42] .Unlike the majority of other approaches that search over the transformation space, this explicitly searches over potential correspondences.Initially it hypothesises all correspondences, then runs a tree search to determine which correspondences are invalid.An A * algorithm is used to significantly speed up the search.While very good run-times are reported, it has not been tested for large numbers of outliers-this may be significantly more challenging, since the search tree becomes exponentially larger with the number of outliers.In [43] , Paudel et al. use a sum-of-squares optimisation framework to determine whether a point is an inlier for pointto-plane registration and show how plane visibility conditions can be used to boost registration.
Very recently, in [44] Campbell et al. introduced an approach for optimal 2D-3D alignment from point features.Unlike our approach which minimises a continuous objective function measuring the misalignment between 2D and 3D features, Campbell et al. propose an inlier maximisation framework which solves for the camera pose maximising the cardinality of the set of 2D features that are within a set inlier threshold from a projected 3D feature.Their approach also follows a Branch-and-Bound formulation, introducing new bounds which are proved to be tighter than those used in our formulation.Similarly to our approach, theirs guarantees global optimality, albeit for a different metric to that considered in this paper.[44] presents the advantage of not requiring an estimate of the proportion of inliers as it does not require trimming.However, it relies upon a user-defined threshold, which controls whether or not a match is classified as an inlier.
Our approach, originally introduced in [17] and extended here, is the first globally optimal approach to 2D-3D registration using points and lines without correspondences.We use a similar search mechanism to the globally optimal 3D-3D registration algorithm Go-ICP [16] , whereby an outer BnB searches over the rotation space, and an inner BnB searches over the camera centre.In contrast, our problem firstly requires the derivation of new bounds for the 2D-3D problem.Unlike our original formulation which considered either points or lines, the formulation is extended to simultaneously consider both types of features in the same optimisation framework.Secondly, we propose novel deterministic and probabilistic implementations that allow for the speed-up of nested branch-and-bound algorithms.Thirdly we propose a more general solution, extending the framework to use points and lines, allowing for broader scene applicability.

Problem formulation
Initially we give the problem definition for 2D and 3D features in general, before moving onto the specifics for points and/or lines.
Let there be N 2D features { i } N i =1 and M 3D features { j } M j=1 , and denote the distance between a 3D and 2D feature as d ( j , i ).The objective is to determine the pose of the camera that optimally aligns the sets of features.The pose is an element of 3D motion space SE (3) = SO (3) × R 3 , composed of a 3D rotation and 3D translation.Hence, where no outliers are present, the objective is to find the rotation R ∈ SO (3) and camera centre C ∈ R 3 that minimise: To make (1) robust to outliers, we use trimming : instead of minimising the sum over all 2D features it is minimised over the smallest k values, where k represents the expected number of inliers.Without loss of generality, assume the terms of the sum in (1) have been re-ordered in ascending order, yielding the trimmed objective : finding R ∈ SO (3) and C ∈ R 3 that minimise: where * denotes the sum rearranged in ascending order (note this depends upon R and C ).To apply (2) for points (denoted (P)   i and (P) j ) or lines (denoted (L )   i and (L ) j ) simply requires the distance measure to be defined.
In the case of points, denote each 2D point by X i and each 3D point by Y j .It is initially tempting to use the Euclidean reprojection error as the most principled distance measure.However, such a distance measure may still not be perfect where there are potential errors in the location of both the 2D and 3D features, and it makes bound computation difficult (and hence more time consuming) due to how it changes non-linearly with respect to the pose of the camera.Instead, we use a more geometrically meaningful distance measure.For convenience, assume the 2D point has been reprojected onto the unit sphere i.e.X i ∈ R 3 , X i = 1 where .denotes the 2 norm.Then we define the distance between a 2D point and 3D point as: In the case of lines, a suitable distance measure is less obvious.Approaches to pose estimation from line correspondences (e.g.[45] ) often decouple the problem into the determination of the rotation by using the direction of the 3D line, then determine the camera centre by using an arbitrary point on a line.Inspired by this approach, our line distance measure is a weighted sum of two terms, where the first term is dependent solely on the rotation of the 3D line, and the second term is the distance of a point on the 2D line to the 3D line.With such a construction, the distance will be quite large when the rotation is incorrect regardless of the camera centre-this is of use within the subsequent nested BnB approach, where it can potentially allow for unpromising areas of rotation space to be discarded more quickly.
Our line distance measure is as follows: for each 3D line, denote its normalised direction vector as d j .For each 2D line, denote its midpoint as P i , and backproject the line, denoting the normal to this plane as n i (see Fig. 2 for an illustration of these terms).In the ideal, noiseless case, d j will lie on the backprojected plane and Fig. 2.An illustration of the terminology used in defining a distance measure for lines. (L )   i denotes a 2D line, P i its midpoint and n i the normal to its backprojected plane. (L )   j denotes a 3D line and d j its normalised direction vector.P i will lie on the projection of line (L )  j .Hence, a suitable distance between the lines is defined as: where λ defines the relative weighting between the two terms.∠ ( (L ) j , P i ) denotes the angle between P i and the nearest point of the projected (finite) line segment either between the endpoints of (L ) j or is one of its endpoints, whichever is closest.This is low for lines that overlap slightly with endpoints that are not well aligned (to account for occlusion), but is higher when the lines are significantly further away.By using this we are implicitly considering 2D lines as infinitely long but 3D lines as finitely long.This assumption has been made elsewhere e.g.[46] due to the poor reliability of determining the endpoints of a 2D line.
In the case where both points and lines are present, we compute a weighted sum of the two objective functions.Assuming there are M 1 3D points and M 2 3D lines, the objective function becomes: where k 1 and k 2 represent the expected numbers of inlying points and lines respectively.For the relative weighting term we take μ = 2 .This is on the principle that the line distance ( 4) is composed of two equally weighted terms (after setting λ correctly).
The second of these is an angular distance which is comparable to the point distance (3) ; hence, the line distance should be approximately twice that of the point distance.

Branch-and-bound
Branch-and-Bound (BnB) is a very general framework for global optimisation.Assume the objective is to minimise some function f over an N -dimensional bounded space ⊂ R N .Assume further that for any subset ω⊆ (hereafter, known as a branch ) a lower bound and an upper bound may be determined for the minimal value of f in this branch, and that these bounds converge as the size of the branch tends to zero.For example, the upper bound could simply be the value of the function at the midpoint of the branch, and the lower bound could be the upper bound minus some expression for how much the function can deviate in an interval of that size.These assumptions allow for the determination of a solution to f whose value is within of the globally optimal solution, for any user-specified > 0. It relies upon recursively subdividing the space, calculating upper and lower bounds for each branch.Initially the input to the algorithm is simply the branch , and, at any stage in the algorithm, there is a set of branches that are subsets of , each with a lower and upper bound to the minimum value f can take in that branch.At each stage of the algorithm the following two steps are performed: 1. Determine the distance between the lowest lower bound and lowest upper bound of the bounds in the set of branches.If this distance is less than the algorithm terminates, outputting the lowest upper bound and its branch.2. Otherwise, consider the branch that has the lowest lower bound and subdivide it further, computing upper and lower bounds for each sub-branch.
The algorithm will converge because, eventually, the size of the branches considered will be sufficiently small that the distance between the upper bound and lower bound of a newly divided branch will be less than .When this occurs, the outputted value is within of the globally optimal solution because the entirety of has been (recursively) searched and so it is known that any better solution is no more than better than the one returned.For the 2D-3D registration problem, optimisation takes place over the space SE (3).This space is unbounded, so it is assumed the camera centre is known to lie within a bounded set C , which is typically a reasonable assumption when C encapsulates a suitably large space.
This section is structured as follows: in Section 4.1 , we give geometrically meaningful bounds that describe how much the features can be transformed by within a given neighbourhood and in Section 4.2 how these are used to bound the objective function.Then we describe the nested BnB structure in Section 4.3 .Finally, local refinement techniques are detailed in Section 4.4 .

Geometric bounds
Bounds are considered separately for the rotation component and camera centre component.Firstly, the rotation bound is computed.Rotations are considered in the axis-angle representation : a rotation is represented by a vector r ∈ R 3 whose direction specifies the axis of rotation and whose magnitude specifies the angle (hence, r ≤ π ).The rotation matrix that r represents may be computed via Rodrigues' rotation formula: where ˆ r = r / r .The notation [ v ] × for vector v ∈ R 3 denotes the skew-symmetric matrix representation of v , defined as: Note that [ v ] × x = v × x for any vector x ∈ R 3 .Lemma 1 : Let R 0 , R be rotation matrices and r 0 , r their corresponding axis-angle representations.Then, for any point X ∈ R 3 : Proof.[40] has already established that it follows that In the context of BnB, if one considers a branch as a cube of rotations r in their axis-angle representation where the centre of the branch is r 0 and the cube has half side-length δ R , then we have r 0 − r ∞ ≤ δ R .By the above result, it follows that for any rotation ( R ) within the cube and for any point Next, bounds on the camera centre are derived. where Proof.Lemma 2 can be intuitively understood by referring to Fig. 3 .The bound is trivially satisfied in the case where √ 3 δ C ≥ X − C 0 since π is the largest possible value allowed under our axis-angle representation parametrisation.The rest of the proof therefore assumes that √ 3 δ C < X − C 0 .The proof is conducted by searching for the camera centre C that maximises the angle θ and verifying that the corresponding angle is no greater than the bound defined in (11) .
Consider the triangle with sides of length X − C 0 , X − C , and C − C 0 (e.g. the triangle in the right diagram in Figure 3 ).By the cosine rule one obtains . Since sin θ is a strictly increasing function in this interval, obtaining an upper bound on sin θ will yield an upper bound on θ .By the sine rule: Without loss of generality X and C 0 may be assumed to be constant (since we are searching for C maximising the angle), hence the expression is maximised when and the result follows.

A uniformly continuous bound
The function governing the bounds on the camera centre ( 11) is not uniformly continuous: the relationship between and δ C is dependent on X .This causes real difficulties for the algorithm: if precision C is desired and a point X is arbitrarily close to C 0 , an arbitrarily small branch ( δ C ) is required.Hence the algorithm will not converge in finite time.
To alleviate this we modify the objective function slightly so as to be uniformly continuous: when computing (2) we only take into account 3D features whose distance from the camera centre is larger than a specified threshold ( γ ).For a suitably small threshold this is sensible in practice: in general very few features will be located immediately in front of the camera.
Note that an alternative way of addressing this issue is to restrict the search space to prevent camera centres from being located within a very small distance from an existing 3D point as proposed by Campbell et al. in [44] .
This now defines a uniformly continuous function since the re- to guarantee a minimum branch size, hence guaranteeing the convergence of the algorithm.By combining the above lemmas, the following result is obtained: Theorem 1.Let R 0 , R be rotation matrices and r 0 , r their corresponding axis-angle representations.Further, let C 0 , C ∈ R 3 .Then, for any point X ∈ R 3 : where R = √ 3 δ R and C = arcsin The proof follows by combining Lemmas 1 and 2 with the triangle inequality:

Function bounds
In this subsection, the bounds defined in Section 4.1 are related to the objective functions described in Section 3 .Assume we are minimising the trimmed objective (2) with the angular distance measure for point features (3) .It is required to determine upper and lower bounds for (2) when the pose space SE(3) is bounded.At each stage in the BnB algorithm, the pose space will be divided up into cubes, where we consider jointly a rotation cube centred at r 0 of half side-length δ R and a camera centre cube centred at C 0 of half side-length δ C .
To compute the upper bound for (2) using points (3) the objective function is simply evaluated at (R 0 , C 0 ) .To compute the lower bound the expression is derived by evaluating the function at (R 0 , C 0 ) and subtracting the maximum amount by which the function may deviate within that branch.Denote z( ) = R + C and hence, the lower bound is obtained as: The lower bound for lines (4) is derived in a similar way; the angles for each of the two terms in (4) are bounded in the same manner (by R + C ).Hence, the lower bound for (2) using lines (4) is obtained as:

Nested branch-and-bound
In a similar manner to [16] , we use a nested BnB structure for efficiency: an outer BnB searches over the rotation space SO (3) and, for each rotation branch, the upper and lower bounds are solved by an inner BnB algorithm for the camera centre.In doing so, all features may be rotated at the beginning of an inner BnB, leaving only their translation component ( −R C ) to be added at each stage; this is more efficient than directly implementing a full 6D search.We shall now describe the computation of bounds in the inner BnB algorithm.
Firstly, the case for determining the upper bound of a rotation cube is considered.To do so, the rotation is considered at the centre of the cube ( r 0 ) and the aim is to determine the minimum value of (2) where r is fixed to r 0 and C is allowed to vary.The upper bound used in the inner algorithm is simply the value of the function at that point, i.e. computed using (18) with z( ) = 0 , with the lower bound computed using z( ) = C .There is an early bailout condition: if the inner lower bound is greater than the outer upper bound then the inner BnB may terminate.This allows for speed-up of the algorithm if the outer upper bound is small (i.e. the algorithm is faster the closer it is to the optimal solution).
Secondly the lower bound of a rotation cube is considered.The same computation is performed as for the upper bound, but takes into account the maximum amount the objective function can deviate within the rotation branch.Hence, the upper bound used in the inner algorithm in this case is computed using (18) with z( ) = R ; the lower bound with z( At this point we should point out some minor differences between our nested BnB implementation and that of Yang et al. [16] .In [16] the authors compute the inner BnB to the same accuracy as the outer BnB and return the (inner) upper bound as the bound for that rotation branch.However, if the lower bound of a rotation branch is being considered, clearly the inner lower bound will be desired rather than the inner upper bound.Subsequently, for the outer BnB to be calculated to an accuracy of the inner BnBs will need to be computed to accuracy / τ , where τ > 2; this will ensure the difference between the outer upper and lower bounds is less than , hence guaranteeing the convergence of the algorithm.Detailed descriptions of the algorithms are provided in Algorithms 1 and 2 and a proof of the convergence of the algorithm is provided in a supplementary report [47] .

Local refinement
Similarly to other BnB approaches (e.g.[16] ) we locally optimise the solution whenever a promising part of the search space is found.If the output of the local optimisation results in a new best solution (according to (2) ), the upper bound is updated with the new solution.In our case, we use two refinement algorithms: one with a large basin of convergence that does not assume correspondences between features are known, and a more precise refinement requiring known correspondences.The first refinement is called whenever a solution is within 50% of the current best solution and a local refinement has not been called in a neighbourhood of this point.The second refinement is called whenever a new best solution is found (similarly to [16] ) and uses the correspondences given by the trimmed nearest neighbours.
Algorithm 1: Nested BnB algorithm to compute optimal rotation and camera centre.
Input : 2D and 3D feature sets, initial rotationand camera centre cubes R and C , desired accuracy .Output : Optimal rotation r res and camera centre C res .
Remove rotation cube with lowest lower bound from Q R and sub-divide into 8 sub-cubes.Compute lower bound U I using (18) with For the first local refinement algorithm with a large basin of convergence we use SoftPosit in the case of either points, lines, or both [14,30] .In the case where both points and lines are used, we modify the existing SoftPosit algorithm; specifically, the assignment matrix is adjusted to account for both points and lines such that it is impossible to assign any weighting to a point-line correspondence.For the second algorithm we use EPnP [48] for points and the approach by Kumar and Hanson [49] for lines.For both points and lines we use the approach by Dornaika and Garcia [50] that is based on the Posit algorithm [51] .
It should be noted that none of these algorithms directly minimise the objective function defined in (2) and if local refinement does not result in a better function value the algorithm will not update its best solution.Furthermore, it is not necessary to perform local refinement since the approach will still eventually find the optimal solution without it.Despite this, these refinement techniques allow the BnB algorithm to more efficiently find and discard local optima and concentrate on finding the global optimum.

Deterministic and probabilistic nested branch-and-bound methods
In Section 4.3 a nested BnB was proposed, where the outer BnB is computed to an accuracy of by computing the inner BnBs to an accuracy of I = /τ with τ > 2. However, it is not necessary to always compute the accuracy of an inner BnB to / τ and there is a trade-off here: calculating the inner BnBs to a high degree of accuracy (low I ) will result in tighter upper and lower bounds meaning the outer BnB will converge in fewer iterations, however each inner BnB will take more iterations.
In this section, we shall present two variants of the algorithm that take advantage of the above insight by computing the inner BnBs to different degrees of accuracy ( I ).Under an appropriate choice of I , both variants retain the global optimality of the approach by ensuring the outer BnB converges to within .A detailed analysis proving the convergence of both variants is provided in our supplementary material [47] .
For proposed variants of the algorithm, the accuracy of the inner BnBs ( I ) is a function of , the current outer upper and lower bounds ( U O and L O ), and the current inner upper and lower bounds ( U I and L I ) at that stage of the algorithm.For the first proposed variant the accuracy is computed in a deterministic way; I is large at the beginning of the algorithm and gradually decreases as it progresses.For the second variant, I is computed probabilistically whereby branches that look promising are evaluated to a higher degree of accuracy (lower I ) than those that do not.

Deterministic BnB
The deterministic BnB that we propose initially computes inner BnBs to a large I , and gradually decreases it to / τ with τ > 2 as the algorithm progresses.Hence, it terminates to the same accuracy as the original algorithm, despite computing many previous branches to a worse accuracy.
At any stage in the algorithm the outer upper bound ( U O ) and outer lower bound ( L O ) are known.Then we deterministically take as the accuracy to use for the inner BnB.This is for two reasons: firstly, it guarantees the difference between U O and L O to decrease as better parts of the search space are explored, i.e. the algorithm will continue to converge.Secondly, it naturally leads to a final accuracy of / τ , guaranteeing the same accuracy as the original algorithm.To begin with, I is set to an arbitrarily large number, hence U O is quickly set to a reasonable value after the first inner BnB.

Probabilistic BnB
We furthermore propose a probabilistic BnB formulation that, informally, calculates an inner BnB to a high degree of accuracy if it looks promising (e.g. it will lead to a new best solution) and a low degree of accuracy otherwise.More formally, we shall determine the trade-off between the amount of time taken evaluating an inner BnB and the expected benefit of taking that amount of time.
We shall assume for simplicity that there are two outcomes of interest for evaluating an inner BnB.When using the inner BnB to determine an upper bound the outcome of interest is whether or not it leads to a new global upper bound-if it does, this will speed up the algorithm (since there is an early bail-out condition, see Section 4.3 ) or the algorithm may potentially converge (i.e.terminate).When using the inner BnB to determine a lower bound the outcome of interest is whether the lower bound is high enough such that the branch may be discarded as this will further narrow the search space.
The probabilities for the outcomes of interest vary depending on the accuracy I that is desired.Denote the probability that the outcome of interest occurs as p ( I ) and the time taken to evaluate the inner BnB to an accuracy of I as t ( I ).If the outcome of interest occurs, assume the rest of the BnB algorithm takes time t 1 and time t 2 otherwise (hence t 1 < t 2 is assumed).Hence, the expected amount of time taken is: To determine I that minimises the expected amount of time taken we set the derivative of ( 20) to zero to give: We shall assume that t ( I ) ∝ 1/ I because all bounds derived are first order bounds that scale linearly with respect to the branch size.Hence, ( 21) may be re-written to give 1 for constants a and b .To guarantee the convergence of the algorithm, we constrain the maximum value of I to be Furthermore, we set a minimum value of I as / τ for all inner BnBs so that the algorithm does not spend too much time in an inner BnB.
These conditions may be substituted into (22) such that, when p( I ) = 1 , I takes its minimum value of / τ ; and similarly when p( I ) = 0 , I takes its maximum value of U O −L O τ .These allow for the constants a and b to be determined, yielding the relationship: Computation of p ( I ): If the inner BnB is being used to determine the outer upper bound (UB), we are interested in the probability that the inner BnB will find a new outer UB, i.e. p ( I ) is the probability that the inner UB is found to be less than U O when evaluated to accuracy I .Conversely, if the inner BnB is being used to determine the outer lower bound (LB), we are interested in the probability that the inner BnB will lead to this branch being discarded, i.e. p ( I ) is the probability that the inner LB is found to be greater than U O − when evaluated to accuracy I .
In either case, the estimate is determined by firstly considering the optimal value of the objective function in the inner BnB, denoted g .At this stage, all that can be said is that g lies between L I and U I .However, we have observed it to have a tendency to lie significantly closer to U I than L I , i.e.L I is a very pessimistic lower bound.
The reason for this is that the lower bound computation in (18) or (19) are computed as the sum of minima, but it is very unlikely that the summands simultaneously obtain their minimum at the same point in space.It is however true that any one of the summands obtains its minimum value within the branch, reducing the inner UB by an approximate value r where r = U I −L I k , since the difference between the inner UB and inner LB is split between the k summands.The other k − 1 summands are very unlikely to also obtain their minimum value at this point in the branch, and we assume each summand to reduce the inner UB by a uniformly distributed amount from the interval [ −r, r] at this point in the branch.
As a result, we assume the expected value of g to be U I − r, and its variance is that of sum of k − 1 uniformly distributed variables from the interval [ −r, r] .Using the central limit theorem, we approximate the distribution of g as: To use the distribution of g to estimate p ( I ) where the inner BnB is being used to determine the outer UB, we use the approximation: (25) may be computed using the error function.To determine I requires solving ( 23) and ( 25) simultaneously-we use an iterative approach to this with initial condition p( I ) = 0 .5 .Where the inner BnB is being used to determine the outer LB, we take p( I ) = P (g > ) and proceed in a similar manner.

Experimental evaluation
We compare between the three proposed approaches: BnB, BnB-D for the deterministic BnB, and BnB-P for the probabilistic BnB.They are furthermore compared to existing methods for 2D-3D feature matching without correspondences.Specifically, we compare against the traditional RANSAC [13] algorithm, SoftPosit [14] , and the state-of-the-art BlindPnP [15] approaches.
The structure of this section is as follows.In Section 6.1 we give implementation details for all approaches evaluated in this section, and in Section 6.2 the evaluation measures (accuracy and speed) are described.Subsequently results are presented, in Section 6.3 for synthetic data and in Section 6.4 for real data.

Implementation details
BnB/BnB-D/BnB-P: Few parameters need to be set for our globally optimal approaches, and we use the same parameters for all experiments with the exception of k (the expected number of inliers).For the synthetic data, k is set to the exact number of inliers (unless otherwise stated); for real data it is fixed to 25% of the total number of 2D features.In (4) we use λ = 0 .3 , and, for the uniformly continuous bound, we take γ = 0 . 1 .We set = 0 .0025 k for where only point features are used, and = 0 .006 k for when line features, or both point and line features, are used (with the exception of in Fig. 4 , where is a free parameter).For all approaches, the parameter τ setting the inner bound accuracy I was set to the limit case τ = 2 .Experiments confirmed the converge of all the BnB variants under this setting.

RANSAC:
The RANSAC algorithm [13] relies upon hypothesising transformations from minimal subsets and determining how many inliers there are with respect to the hypothesised transformation.In our case there is no inlier threshold as trimming is used, therefore the transformation that minimises (2) is taken.Since minimal samples of inlying features typically do not produce optimal transformations in the presence of noise, we use the LO-RANSAC algorithm [18] .Alternative, more efficient variants of RANSAC are inapplicable in our case.For example, PROSAC [52] relies upon the similarity of feature descriptors to obtain a better evaluation order, however we assume no feature descriptors are used.Alternatively, WALDSAC [20] evaluates the potential correspondences of a transformation in an optimal order-this is difficult to apply in our case where a trimmed objective function is used.To determine the transformation from minimal samples we use the approach by Kneip et al. [53] in the case of points, and the approach by Dhome et al. [54] in the case of lines.We do not test against RANSAC in the case where both points and lines are present.
SoftPosit: The SoftPosit algorithm has been implemented for points [14] and lines [30] .We extend it to the case where both points and lines are present by adjusting the assignment matrix used, such that it is impossible to assign any weighting to a pointline correspondence.It is run from a number of random starting points in SE(3) covering the same space the proposed BnB algorithms search from.
BlindPnP: The BlindPnP algorithm [15] has only been proposed for points.Furthermore, it is observed that BlindPnP relies upon the ability to use pose priors on where the possible camera pose may be -represented by a Gaussian Mixture Model (GMM) of typically 20 components.In their experiments the pose is constrained such that the camera lies on a torus around the 3D scene.However, it is often unrealistic to assume such prior knowledge can be obtained, and it is difficult to alter their approach to work with a significantly larger number of priors over a greater space of SE(3).Therefore, for a fair comparison, our approach was altered to use these pose priors for some of the synthetic experiments.

Evaluation measures
Throughout the experiments we aim to measure the accuracy of the available algorithms, and the speed of the approaches, where possible.
Accuracy: The accuracy is defined as the proportion of experiments from which an inlying solution is produced by an algorithm.A solution is deemed an inlier if the distance between the ground truth and estimated rotation, and ground truth and estimated camera centre, are both less than a given threshold.
For the rotation, the angle between the ground truth and estimated rotations is required to be less than 0.1 radians to be deemed an inlier.The angle between two rotations R a and R b is computed by constructing R c = R T a R b , and computing the angle of rotation of R c in its axis-angle form [40] .
For the camera centre, the relative error between the two camera centres (expressed as C true − C / C ) is required to be less than a threshold of 0.1 to be deemed an inlier, the same as in [15] .However, we note that the relative error between camera centres is coordinate system dependent, therefore we also use the absolute error between the camera centres ( C true − C ).It will be made clear which error on the camera centres is used in each case.
Speed: Timings are obtained by running the algorithms on servers with 2 × 10 core CPUs running at 2.6 GHz.Note that timings should be interpreted with care as they can only provide a coarse estimate of algorithm performance being influenced by server load at the time of the experiments.To complement this, we also include information on the number of iterations as this provides a more meaningful basis for comparing the different BnB algorithms proposed.Both run-times and numbers of iterations have high variance, hence we report the three quartiles, and give the proportion of experiments that converged within an iteration limit (denoted T I ).

Synthetic data
In this subsection, we compare against existing approaches for synthetically generated data.However, to fairly compare against Fig. 4. Median number of iterations (first column) and run-time (second column) with bar showing first and third quartiles, proportion of trials that converged within the iteration limit (third column) and proportion of inlying solutions (fourth column) for BnB, BnB-D , and BnB-P across different levels of desired accuracy, for a feature size of 50.No local refinement is used here.Each experiment was terminated after T I = 7 .5 × 10 6 inner BnB iterations if it had not already converged by then.different approaches requires certain assumptions be placed on the data for each approach.For example, BlindPnP places a prior on the camera pose; assuming it to lie on a torus around and facing the 3D scene, represented by a GMM of 20 components.However, RANSAC searches all potential correspondences regardless of pose priors, hence, to give a fair comparison against RANSAC there should be very little prior placed on the camera pose.Therefore, this section is split into two subsections; the first where pose priors are used, and the second where significantly fewer assumptions are placed on the camera pose.

Pose priors
Throughout this subsection the accuracy and speed of the approaches are tested where pose priors are assumed.For a fair comparison with [15] , the pose priors are generated in the same way as in [15] ; and the relative error between camera centres will be used in this section to determine whether a solution is an inlier.Our al-gorithms are modified to use pose priors in the following way: the input to our algorithm is a set of branches corresponding to each pose prior.Hence, each pose prior is defined by an initial rotation branch (centred at the prior) with each branch initiating its own camera centre branch (centred at the prior).Due to the potentially large running times, the proposed approaches are terminated after T I = 7 .5 × 10 6 inner BnB iterations if they had not already converged by then.For a fair comparison, RANSAC is also terminated after T I iterations.SoftPosit and BlindPnP are run 20 times; from the centre of each of the GMM components.
We generate the 2D and 3D features in a similar manner to [15] : firstly, we randomly generate a set of 3D features (points or lines) and randomly choose a camera position in SE(3) from the torus.A proportion of these 3D features are deemed inliers and are projected onto the image.Noise is added to their position (the endpoints in the case of lines) of variance 2 pixels.A number of outlying 2D features are then randomly generated on the image Fig. 5. Proportion of inlying solutions for all algorithms tested.From top to bottom: using points, using lines, using both.From left to right: 60% inliers, 40% inliers, 20% inliers.
such that the number of 2D and 3D features is equal (none of the algorithms require the two feature sets to be of equal size-we simply test in this way for simplicity).
In this subsection, three sets of experiments are performed.The first is without local refinement ( Section 4.4 ), and is to test the proposed deterministic and probabilistic BnB algorithms in isolation without being affected by the other aspects of the algorithm.Secondly, experiments are performed with local refinement, across a range of feature quantities and proportion of inliers.Finally, we present results of varying the expected number of inlier features ( k ) from their ground truth, since this cannot be assumed to be known in practice.
Without Local Refinement: Initially we test the three proposed approaches ( BnB, BnB-D , and BnB-P ) without local refinement.In this case, we test for a feature size of 50 (either points, lines, or 25 of each) for 40% inliers, for varying levels of accuracy ( ). 30 trials were performed in each case.The results are shown in Fig. 4 .Due to the high variance of timings obtained, the quartiles of the number of iterations and run-time are shown, along with the proportion of trials to converge within the iteration limit.Based on Fig. 6.Median number of iterations (first column) and run-time (second column) with bar showing first and third quartiles, proportion of trials that converged within the iteration limit (third column) and proportion of inlying solutions (fourth column) for BnB, BnB-D , and BnB-P across different assumed inlier ratios, for a feature size of 90.Each experiment was terminated after T I = 7 .5 × 10 6 inner BnB iterations if it had not already converged by then.
the median number of iterations, BnB-P performs the fastest, however its iteration count has higher variance than BnB-D .Both proposed approaches ( BnB-D and BnB-P ) use fewer iterations than the original BnB .All methods perform similarly well in terms of the quantity of inlying solutions obtained.As could be expected; all algorithms converge faster where a lower accuracy (higher ) is desired, often to the detriment of the quality of the solution.
With Local Refinement: Next we test with local refinement ( Section 4.4 ), against all other algorithms.The feature sizes range from 40 to 90, with inlier rates at 60%, 40%, and 20%.30 trials were performed in each case.Results are shown in Fig. 5 .From these graphs it is seen that our approaches are consistently more accurate than the state of the art.Interestingly, our approach sometimes does not get the right solution with 20% inliers, despite being globally optimal.It is in fact observed that, in some cases, it obtains a solution whose function value (by (2) ) is lower than the function value of the ground truth solution, despite being an out-lying solution!This is indicative of the intrinsic difficulty of the problem, and the capacity of noise to redefine the global minimum.Also of note is the observation that RANSAC performs better than the state-of-the-art approaches SoftPosit and BlindPnP .This is largely due to the fact that it is run for a very large number of iterations-the same number that the BnB approaches are run forin order to compare it against BnB.However, in doing so it performs a much larger search than SoftPosit and BlindPnP , both of which search locally from one of the 20 pose priors and are orders of magnitude faster than RANSAC.Furthermore, RANSAC has a much higher complexity ( O ( N 4 log ( N )) than SoftPosit and BlindPnP ( O ( N 3 )).
Varying Expected Inlier Ratio: A potential point of concern is that the expected number of inliers, k , cannot be known beforehand.We therefore run experiments to test how the proposed algorithms cope when k is varied away from the true inlier ratio.In Fig. 6 we show results for a feature size of 90, for 40% inliers, for Fig. 7. Median run-time with bar showing first and third quartiles (left column), proportion of trials that converged within the iteration limit (middle column) and proportion of inlying solutions (right column) for RANSAC, BnB, BnB-D , and BnB-P across different feature sizes for inlier ratios of 40% (rows 1, 3, and 5) and 60% (rows 2, 4, and 6), where no pose priors are used.Results are given for points (top two rows), lines (middle two rows), and both points and lines (bottom two rows).Each trial was terminated after 10 0 0 s if it had not already converged by then.All experiments were performed on the same machine.
varying expected inlier ratio (15%-50%).20 trials were performed in each case.From these graphs it appears that varying the expected inlier ratio has little effect on the accuracy of the results, with the vast majority of trials converging on the correct answer for the expected inlier ratio anywhere between 25% and 50% (compared to the true ratio of 40%).It is only when the expected inlier ratio is very small (15% -20%) that the algorithms fail, and this is mostly in the case of point features.However, the number of iterations the approaches take can vary drastically with respect to the expected inlier ratio.In particular, when the expected inlier ratio is higher than the true ratio, the number of iterations significantly increases.This could be due to the fact that, when the expected inlier ratio is at or less than the true ratio, the ground truth pose estimate is sufficiently small to warrant the algorithm to terminate (i.e. the ground truth pose estimate has objective function value in (2) less than ).Thus, the algorithm may only have to find a solution with function value less than , without explicitly verifying it.This is not always the case, particularly where only line features   are used and the number of iterations increases more gradually with the expected inlier ratio, however it is a contributing factor.

No pose priors
For a fair comparison against RANSAC that searches over the correspondences (and therefore searches over a large volume of SE( 3)), we test our approaches over a much larger prior volume.In this case, we do not test against BlindPnP since we are unable to adjust their approach to operate over a significantly larger prior search space.We also do not test against SoftPosit : it is observed that SoftPosit performs very poorly when the 3D features are so close to the camera centre since it relies upon approximating perspective projection by an orthographic projection.For this reason, SoftPosit is also not used as a subroutine for the BnB approaches in this section.
Experiments are performed for larger numbers of features ( 150 − 350 ) for 40% and 60% inliers.Each trial is terminated after 10 0 0 seconds if it has not already converged by then.Therefore, all experiments were performed on the same machine, and as such, only 10 trials were recorded in each case.Due to the high variance of timings, the median, and lower and upper quartiles of the time taken are recorded, along with the proportion of trials to converge within the time limit, as shown in Fig. 7 .The proportion of inlying solutions is also shown on the right of Fig. 7 , where an inlying solution is defined such that the angle between the hypothesised rotation and ground truth rotation is less than 0.1 and the absolute error between the two camera centres is less than 0.05.The absolute error is used here due to the camera centres lying in a neighbourhood of the origin (where the relative error is not meaningful).
Based on these results it can be seen that our approaches perform favourably to RANSAC, particularly in the case of lines.RANSAC performs better using points than for lines; this may be due to the different minimal solvers used in each case.The proposed approaches BnB-D and BnB-P result in a significant speed-up over the original BnB , particularly when both points and lines are used.

Real data
In this section, we evaluate the performance of the registration algorithms on real data.Specifically, we are interested in using the real dataset (as shown in Fig. 8 ) that comprises five models with between 7 and 11 images with known projection matrices per model.The features used here are salient points ( Sal-P ) by [55] and salient lines ( Sal-L ) by [56] ; denoted Sal-PL where a mixture of the two are present.State-of-the-art feature detectors are also used here as GFT and Harris for 2D and 3D points and LSD for lines; referred to as GL -P, GL -L , and GL -PL .The top-120 features are used in 2D and the top-240 used from 3D; except where both points and lines are used where we take the top-80 points and top-80 lines in 2D, alongside the top-160 points and top-160 lines in 3D.
The BnB methods proposed here require priors to be placed on the camera pose.This should not be seen as a significant barrier to the method; indeed, Moreno-Noguer et al. [15] assume the camera to lie on a torus around the object, and Svarm et al. [57] assume the 3D ground plane is known, and the orientation of the image with respect to the ground plane.In our case, we assume the camera centre to lie in a cube of diameter 1.5 m in the case of the indoor models ( Reception, Room , and Studio ), and a cube of diameter 5 m for the outdoor Cathedral and Courtyard models.We place no assumption on the rotation parameters.Each trial is run for a maximum of T I = 5 × 10 6 iterations for RANSAC, BnB, BnB-D , and BnB-P.SoftPosit is run for a maximum of 10 0 0 iterations from random starting locations in the prior so as to take a similar amount of time to the other tested methods.
Firstly qualitative results are presented.Figs. 9 and 10 show estimated poses obtained from all five approaches, using the six types of features.The globally optimal approaches all perform better than the sub-optimal RANSAC and SoftPosit , particularly in the case of lines.Furthermore, Sal-L is more robust in comparison to GL-L due to the tendency of GL-L to detect multiple lines in a similar location, whereas the Sal-L features are more representative of the scene.The problem is mitigated for GL-PL where the complementarity of the two feature types results in a more robust objective function.
Secondly, quantitative results are presented, where we firstly measure the proportion of inlying solutions returned.For this we use a threshold of 0.1 radians for the angle between the rotations, and for the camera centre threshold we use a function of the prior size of the camera centre, so as to obtain a fair evaluation between models.For the prior camera centre to have volume d , we take threshold t such that 4   3 πt 3 = 0 .025 d, i.e. there is only a 2.5% chance of obtaining the correct camera centre by chance.For the outdoor Cathedral and Courtyard with prior camera centre over a volume 125 m 3 the inlier threshold is about 0.91 m, whereas for the indoor Reception, Room , and Studio it is 0.27 m.Fig. 11 shows the proportion of inlying solutions returned for the sets of 2D-3D features for each model.Note that we compare against the different registration approaches outlined in this paper, and the different feature types.
It is observed that the proposed globally optimal approaches perform significantly better than SoftPosit and RANSAC .In particular, SoftPosit never gets the correct solution-this is in part due to the high rates of outliers, and partly due to SoftPosit performing poorly whenever 3D features are close to the camera.BnB-D and BnB-P generally perform better than BnB since they are able to search the transformation space more quickly and are more likely to find the correct solution within the iteration limit.Lines are significantly more robust than points, with some improvement when considering both types of features in the GL-PL case.The Room model sees the highest proportion of inliers, where all images were registered correctly using certain approaches.
In contrast to the previous quantitative results in this paper, we also measure the quantity of inlying solutions when varying the inlier threshold.In doing so, we jointly measure the accuracy of the proposed approaches and the accuracy of the detected features.The rotation and camera centre inlier thresholds are both varied, where the camera centre threshold is based on the ratio of the inlier volume to the total volume of the camera centre prior (where the ratio was 0.025 for results presented in Fig. 11 ).The results are given in Fig. 12 where results are given per feature type, and averaged across all datasets.
Similar conclusions may be made as from the tables in Fig. 11 : the proposed globally optimal approaches significantly outperform the sub-optimal RANSAC and SoftPosit ; and lines are much more robust than points.Sal-L features are registered more accurately than GL-L -this may be due to GL-L detecting repetitive lines in a similar location and causing ambiguity in determining a correspondence, while Sal-L detects a more representative set.On the whole however, GL-PL appears to perform the best, despite features that have lower 2D-3D repeatability.Due to the fact that GL-P outperforms Sal-P we are led to believe this may be due to the proposed salient points being less suited to registration than corners.

Conclusions and future work
This paper presented the first globally optimal framework for 2D-3D registration where feature correspondences are unknown.This framework introduced a family of methods, covering both deterministic and non-deterministic formulations, which are applicable to points, lines or a combination of the two, thereby maximising the use of available scene information and broadening the range of practical registration problems that can be tackled.Being based on BnB optimisation, the approaches have intrinsic guarantees on global optimality.Furthermore, the proposed deterministic annealing and probabilistic formulations of nested BnB algorithms have the advantage of allowing for greater efficiency without loss of optimality.This has resulted in algorithms that are significantly better than the state of the art, both in terms of accuracy and robustness to high outlier rates.These advances have been demonstrated and experimentally evaluated on a range of synthetic and challenging real datasets, where significant improvements can be observed.
An interesting avenue for future work would be to explore different ways to apply a BnB algorithm to the problem and their effects on performance.Bazin et al. [35] solve geometry estimation problems using BnB by relaxing non-linear constraints into linear convex and concave envelopes from which upper and lower bounds may be computed by linear programming techniques.Chin et al. [42] explicitly search over feature correspondences; initially hypothesising all correspondences and running a tree search to determine which are invalid.It is unclear at this stage which class of globally optimal method is preferable.However, we have presented the first globally optimal approach to the 2D-3D registration problem that is significantly better than the state of the art for the specific problem.

Research Data
The authors confirm that the indoor and outdoor 2D-3D datasets used for this research are freely available under the terms and conditions detailed in the license agreement enclosed in the data repositories.Details of the data and how to obtain access are available for the Room dataset at [58] ; and for the Cathedral, Courtyard, Reception , and Studio datasets at [59] .

Declaration of Interest
None.

Fig. 1 .
Fig. 1.Diagram illustrating the pipeline for correspondence-free 2D-3D registration.The proposed nested Branch-and-Bound algorithms are the central part of the pipeline, enabling global optimisation from point and line features extracted from 2D and 3D data.

Fig. 3 .
Fig. 3. Left: When √ 3 δ C ≥ X − C 0 , the maximum angle is π by placing X − C behind (or on) the origin.Right: Otherwise, the maximum angle obtained is when X − C is at a right angle to C − C 0 .

Algorithm 2 :
foreach sub-cube ω R do Compute upper bound U I and corresponding optimal C res I by calling Algorithm 2with r 0 at centre of ω R , rotation uncertainty R = 0 , current best error U O andinner bound accuracy I .Compute lower bound L I by calling Algorithm 2with r 0 at centre of ω R , rotation uncertainty R = √ 3 δ R , current best error U O andinner bound accuracy I .if U I < U O then Set U O = U I , r res = r 0 and C res = C res I .Run local refinement (see Section 4.4) and update U O , r res and C res if better solution found.end if L I ≤ U O then Insert ω R with priority L I into Q R .end end Set L O to lowest lower bound value in Q R .end BnB algorithm to compute optimal camera centre given rotation.Input : 2D and 3D feature sets, initial camera centre cube C , rotation r 0 , rotation uncertainty R , current best error U O , desired accuracy I .Output : Lower and upper bounds L I and U I on error and corresponding optimal camera centre C res I .Set U I = U O , L I = 0 .Insert C with priority L I into priority queue Q C .while ( U I − L I > I ) do Remove cube with lowest lower bound from Q C and sub-divide into 8 sub-cubes.foreach sub-cube ω C do Compute upper bound U I using (18) with z( ) = R .

Fig. 8 .
Fig. 8. Top : The 3D models used in the 2D-3D dataset.Bottom : An example image from each model used in the dataset.From left to right: Cathedral, Courtyard, Reception, Room, Studio .

Fig. 9 .
Fig. 9. Qualitative result for solutions returned using all methods on an image from Reception dataset.Blue features are 2D, green are 3D.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article).

Fig. 10 .
Fig. 10.Qualitative result for solutions returned using all methods on an image from Room dataset.Blue features are 2D, green are 3D.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article).

Fig. 11 .
Fig. 11.Proportion of inlying solutions (%) returned per 3D model and per feature type for each scene.

Fig. 12 .
Fig.12.Proportion of inlying solutions obtained when varying the inlier threshold.There are two graphs for each feature, for the rotation inlier threshold and camera centre inlier threshold.