Targetless Lidar-Camera Calibration via Cross-Modality Structure Consistency

Ni Ou, Hanyu Cai, and Junzheng Wang

Abstract-Lidar and cameras serve as essential sensors for automated vehicles and intelligent robots, and they are frequently fused in complicated tasks. Precise extrinsic calibration is the prerequisite of Lidar-camera fusion. Hand-eye calibration is arguably the most commonly used targetless calibration approach. This article presents a particular degeneration problem of hand-eye calibration that arises when sensor motions lack rotation. This situation is common for ground vehicles, especially those traveling on urban roads, and it leads to a significant deterioration in translational calibration performance. To address this problem, we propose a novel targetless Lidar-camera calibration method based on cross-modality structure consistency. Our method ensures global convergence within a large search range and achieves highly accurate translation calibration even in challenging scenarios. Through extensive experiments, we demonstrate that our approach outperforms three other state-of-the-art targetless calibration methods across various metrics. Furthermore, we conduct an ablation study to validate the effectiveness of each module within our framework.

Index Terms-Calibration, lidar, camera, automated vehicles.

I. INTRODUCTION
In the past decade, since Lidar and camera have complementary characteristics, Lidar-camera fusion has sparked increasing interest in the field of autonomous driving. Lidars are able to directly measure the distances of sparse points in the surrounding environment, while cameras can capture dense pictures with rich texture information. By fusing data from both sensors, intelligent vehicles are capable of effectively handling perception [1], [2], [3] and navigation tasks [4], [5], [6] in sophisticated environments.
Lidar-camera extrinsic calibration provides the relative transformation between the two sensors, allowing Lidar and camera data to be processed in a shared coordinate system. The establishment of cross-modality correspondences is a crucial step in Lidar-camera calibration, and loss functions are normally designed based on these correspondences for calibration purposes [7], [8]. Lidar-camera calibration algorithms can be broadly classified into two types based on whether a specific target is used for correspondence extraction: target-based and targetless.
Target-based calibration algorithms typically rely on man-made targets with known dimensions. The chessboard is the most common and widely studied target for Lidar-camera calibration [8], [9], [10]. With a known grid size, its position and orientation can be easily estimated by a monocular vision sensor through corner detection algorithms [11], [12]. Meanwhile, when this target is scanned by an adequate number of Lidar beams, its position in the Lidar coordinate system can be precisely determined through plane fitting [13]. Aside from chessboards, planar objects with regular features such as tags [14] and holes [15], along with specifically shaped objects [16], are also feasible targets for calibration. Overall, target-based methods are useful when a calibration target with high machining accuracy is available, but they can be labor-intensive since the target needs to be manually displaced to different positions. Additionally, in certain cases, target-based methods may be less effective than targetless ones [17], [18].
By contrast, targetless methods establish correspondences in natural scenes rather than relying on specific objects. According to [19], these methods can be broadly categorized into two branches: appearance-based and motion-based. The former utilizes cross-modality correspondences obtained either by designing artificial rules or by learning them through self-supervision. For example, Lidar-camera edge association is a widely studied appearance-based correspondence [20], [21], which is established on geometric relations between Lidar depth-discontinuous edges and camera intensity-discontinuous edges. Recent research has focused on extracting Lidar depth-continuous edges [17], [22] through voxel-level plane fitting, as this kind of Lidar edge is immune to the foreground inflation and blending point problems. Other branches of appearance-based algorithms rely on statistical similarity metrics [23], [24], [25] or self-supervised training [26], [27], [28]. In contrast, motion-based approaches rely on cross-frame geometric constraints for calibration optimization. Most motion-based approaches [29], [30] are initialized through hand-eye calibration (HECalib), a well-studied technique based on pose-based constraints [31], [32]. HECalib is a globally convergent algorithm that only requires the relative poses (motions) of the Lidar and camera as inputs. Some recent studies have focused on improving its performance on the input side, including refining Lidar and visual odometry [18], [33] and optimizing timing offsets [34], [35].
In addition to appearance-based and motion-based calibration algorithms, a growing number of hybrid methods have been developed to combine their respective strengths. For instance, certain proposals employ HECalib for an initial calibration and then utilize specific appearance-based metrics for further improvement [35], [36], [37]. However, the effectiveness of these methods tends to diminish when the HECalib solution becomes singular, especially in degenerate situations where rotational sensor motions are scarce [38]. Unfortunately, few methods have thoroughly resolved this degeneration problem in calibration, which becomes particularly severe when the vehicle travels in straight lines.
This article presents a targetless method to effectively tackle the aforementioned problem. Our approach comprises a modified HECalib mechanism with regularization and a global optimization module. The former helps mitigate the solution failure of HECalib caused by degeneration, while the latter refines the initial calibration parameters over an extensive range of values. The global optimization is developed based on the principle of cross-modality structure consistency (CMSC), which will be defined in Section III-C. Unlike other hybrid methods [35], [36], [37], our approach does not include any predefined appearance-based metrics or densification procedures, allowing it to generalize effectively across a wide range of scenarios. We validate the effectiveness of our approach across various degenerate scenarios, including an extreme case where the vehicle undergoes only unidirectional translational motions (Sequence 04 in Table I). Our main contributions are summarized below.
- We analyze the degeneration problem that occurs in HECalib and propose a regularization approach to suppress the translational calibration error.
- We propose a novel targetless Lidar-camera calibration method based on CMSC. This method is globally convergent and immune to degeneration, and it can also be applied to gray-scale cameras.
- We evaluate our method and compare it to three popular targetless methods [20], [34], [39] using six sequences of the KITTI Odometry [40] dataset (a total of 14,316 frames). We also conduct an ablation study to verify the effectiveness of each module we designed. Additionally, we make our code openly available on GitHub to benefit the research community.

II. RELATED WORK

A. Sensor Motion Estimation
The inputs of HECalib are the individual sensor motions of the Lidar and camera, and we review the relevant literature from two perspectives: Lidar odometry and visual odometry. Regarding the former, the Iterative Closest Point (ICP) algorithm [41] has been widely applied to compute relative transformations between adjacent frames in many motion-based calibration methods [33], [35], [36]. It is closed-form and efficient when the initial alignment is good enough. In addition to ICP, our previous work [18] addresses the challenge of robustly estimating the pose of a low-resolution Lidar by leveraging camera data for assistance. Moreover, Lidar SLAM (Simultaneous Localization and Mapping) has demonstrated its reliability in accurately estimating consecutive Lidar poses [42], [43], [44]. By incorporating motion models and utilizing scan-map registration, Lidar SLAM can effectively prevent the unlimited accumulation of localization error over time. In the case of large-scale scenes, back-end optimization techniques are commonly employed to further enhance the performance of Lidar SLAM [45].
In contrast, the estimation of monocular motions is more challenging due to the scale ambiguity problem. Although the scale factor can be solved by HECalib, maintaining a consistent scale throughout the entire trajectory remains difficult [46], [47]. Visual SLAM techniques [48], [49], [50] are able to address this problem by employing global and local bundle adjustment (BA) [51], especially when integrated with loop closure detection [52]. In comparison to BA, Structure from Motion (SfM) excels at estimating camera poses and recovering 3D mappoints [53], but its computational demands increase significantly with the number of frames, making it less feasible for large-scale scenes. Aside from these systematic methods, an optical-based pipeline [33] has been proposed to address the scale ambiguity; it jointly optimizes the extrinsic parameters and camera motions by tracking Lidar points.

B. Cross-Frame Geometric Constraint
In the field of Lidar-camera calibration, cross-frame geometric constraints utilize relative sensor poses to constrain the optimization of the calibration parameters, an approach that has become increasingly popular in recent calibration studies. Some methods utilize the hand-eye calibration equations [38] to convert camera poses into functions of the extrinsic parameters and Lidar poses, then treat the Lidar poses as constant and indirectly optimize the extrinsic parameters by building multi-view constraints on the camera poses. For instance, the calibration parameters can be optimized in a BA problem within a Visual SLAM architecture [34]. Additionally, CalibRCNN [54] incorporates synthetic view constraints and epipolar geometry constraints into this indirect calibration optimization.
Another category of methods treats Lidar-camera calibration as an equivalent task of joint 3D reconstruction. In this case, both camera and Lidar poses are fixed and known, and the calibration error is proportional to the 3D distance between the visual and Lidar maps. This theory can be directly applied to the calibration between stereo (or RGB-D) cameras and Lidar [55], and it can also be extended to the calibration between monocular cameras and Lidar with the initial scale factor provided by HECalib [18], [39], [56]. Techniques such as vision densification [57] and semantic information [58] can also be applied to enhance the calibration performance in these scenarios. Additionally, researchers have drawn inspiration from Lidar odometry techniques [42] and have explored the use of point-line and point-plane distances to enhance the alignment between visual SfM points and Lidar points for calibration purposes [56]. However, for ground vehicles, implementing SfM [59], [60] in degenerate scenes remains challenging because this technique requires the camera to capture a fixed object from multiple angles.

III. METHODOLOGY

A. Overview
The whole framework of our method is presented in Fig. 1. In the first stage, Visual and Lidar SLAM predict camera and Lidar poses from the collected data, respectively. Meanwhile, some intermediate data of the Visual SLAM are recorded, including the positions of triangulated keypoints and cross-frame keypoint-keypoint correspondences. Then, HECalib estimates the initial extrinsic matrix $T_{CL}^{(0)}$ and the initial monocular scale $s^{(0)}$ using the predicted sensor poses. Finally, a global optimization process is performed to find the best extrinsic matrix $T_{CL}^*$ and monocular scale $s^*$. The two losses depicted in Fig. 1, namely the Cross-modality BA (CBA) Loss and the Cross-modality Alignment (CA) Loss, are designed using the principle of CMSC. These losses will be introduced in detail in Sections III-C and III-D.
The remaining parts of Section III are structured as follows. Firstly, Section III-B reviews the principles of HECalib and investigates the degeneration caused by the absence of rotational sensor motions. In the same section, we apply a regularization term to the ordinary HECalib to suppress the negative impacts of degeneration. Then, Sections III-C and III-D introduce the principle of CMSC and the two losses derived from it, and Section III-E provides implementation details.

B. HECalib With Regularization
Let $F_i$ denote Frame $i$, and let $T^C_{ij}, T^L_{ij} \in \mathbb{R}^{4\times4}$ represent the relative poses between $F_i$ and $F_j$ for the camera and Lidar, respectively. As described in [31], the constraint for HECalib is formulated in (1), which can be expanded into (2) and (3):

$$T^C_{ij}\, T_{CL} = T_{CL}\, T^L_{ij}, \tag{1}$$

$$R^C_{ij}\, R_{CL} = R_{CL}\, R^L_{ij}, \tag{2}$$

$$\left(R^C_{ij} - I\right) t_{CL} + s\, t^C_{ij} = R_{CL}\, t^L_{ij}, \tag{3}$$

where $R$ and $t$ denote the rotation and translation parts of the corresponding transformations and $s$ is the monocular scale factor. According to [38], $R_{CL}$ can be solved individually from (2) using singular value decomposition (SVD). After substituting the obtained value of $R_{CL}$ into (3), $t_{CL}$ and $s$ can be solved with the linear least-squares method.
Unfortunately, HECalib is prone to performance deterioration when rotational movements are insufficient to activate the calibration. Mathematically, when there is no rotation in the sensor motions, i.e., $R^C_{ij} = R^L_{ij} = I$, Constraint (2) is no longer applicable. Moreover, the coefficient of $t_{CL}$ in Constraint (3) becomes the zero matrix, indicating that the value of $t_{CL}$ is no longer restricted by (3). We refer to this occurrence as degeneration. It is worth noting that degeneration always occurs in both sensors: given that the Lidar and camera are mounted on the same rigid body, if either $R^C_{ij}$ or $R^L_{ij}$ equals the identity, both of them must be the identity. Although $R^C_{ij}$ and $R^L_{ij}$ cannot be strictly equal to the identity in practice, empirical observations indicate that a substantial number of sensor movements with minor rotation can lead to highly inaccurate solutions for $t_{CL}$.
Based on the analysis above, the error in solving $t_{CL}$ can primarily be attributed to a lack of appropriate constraints. Therefore, we develop a regularization term that imposes a loose constraint on $t_{CL}$ while ensuring its adherence to the HECalib constraints. The extrinsic parameters are solved by non-linear optimization on the error items formulated in (4). The initial values of $R_{CL}$ and $s$ are solved through ordinary HECalib, while the initial value of $t_{CL}$ is replaced with a rough estimation $t_\mu$. This regularization method is practical in real-world applications because $t_\mu$ does not require a high level of precision; even errors within a few tens of centimeters are acceptable:

$$\xi_{ij} = \left\| \left(R^C_{ij} - I\right) t_{CL} + s\, t^C_{ij} - R_{CL}\, t^L_{ij} \right\|_2 + w_r \left\| t_{CL} - t_\mu \right\|_2, \tag{4}$$

where $w_r$ is a constant regularization weight for each $\xi_{ij}$.
This non-linear optimization problem can be easily implemented using specialized libraries such as g2o [61] or ceres-solver [62]. We set $t_\mu = 0$ for our experiments on the KITTI dataset. The effectiveness of translational regularization is demonstrated in the first two rows of each sequence group in Table I. With the proposed regularization, the calibration error in rotation remains almost unchanged, while the error in translation decreases significantly to a more reasonable magnitude.
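To make the regularized solve concrete, here is a minimal numpy sketch. It follows (2)-(4) but, unlike the nonlinear g2o/ceres-solver formulation used in our implementation, solves the translation and scale in linear least-squares form with the regularizer appended as extra rows; the function names and input layout are hypothetical.

```python
# Sketch of regularized HECalib, assuming relative poses are given as
# numpy arrays: rotations (3x3) and translations (3,). Hypothetical API.
import numpy as np
from scipy.spatial.transform import Rotation

def solve_rotation(rot_cam, rot_lidar):
    """Solve R_CL from (2) via SVD (Kabsch on axis-angle vectors)."""
    A = np.stack([Rotation.from_matrix(R).as_rotvec() for R in rot_lidar])
    B = np.stack([Rotation.from_matrix(R).as_rotvec() for R in rot_cam])
    H = A.T @ B                      # correlation of Lidar/camera rotation axes
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T            # R_CL such that alpha_C ~ R_CL alpha_L

def solve_translation(R_CL, motions, t_mu, w_r):
    """Solve t_CL and s from (3) with the regularizer of (4) as extra rows."""
    rows_A, rows_b = [], []
    for (R_c, t_c), (R_l, t_l) in motions:   # (camera motion, Lidar motion)
        rows_A.append(np.hstack([R_c - np.eye(3), t_c.reshape(3, 1)]))
        rows_b.append(R_CL @ t_l)
    # Regularization rows: sqrt(w_r) * (t_CL - t_mu) ~ 0
    rows_A.append(np.sqrt(w_r) * np.hstack([np.eye(3), np.zeros((3, 1))]))
    rows_b.append(np.sqrt(w_r) * t_mu)
    A, b = np.vstack(rows_A), np.concatenate(rows_b)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3], x[3]               # t_CL and the scale s
```

Even with pure-translation motions, the regularization rows keep the stacked system well conditioned, which is exactly the effect observed in Table I.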

C. Cross-Modality Structure Consistency
In this section, we provide a definition of CMSC and introduce the concept of the Cross-modality Reprojection BA Error based on this theory, aiming to improve the accuracy of $t_{CL}$. As illustrated in Fig. 2, CMSC indicates that the 3D structure observed by the camera and the Lidar should be consistent in their common field of view (FOV). Derived from this definition, when the extrinsic parameters, camera poses, and monocular scale are all accurate, the scaled visual mappoints should align with certain Lidar points. Furthermore, in this case, due to the scale equivalence of the camera projection transformation, the Lidar points projected to the covisible frames should also coincide with certain image feature points.

Annotations in Fig. 2 also intuitively demonstrate the meanings of the CA Loss and the CBA Loss, which correspond to the above two hypotheses, respectively. To enhance comprehension of this theory, we first provide a concise review of Visual SLAM and then introduce the formulation of the proposed CBA error from the concept of geometric BA.
Taking monocular SLAM as an example, the first step is to extract keypoints in each frame and employ keypoint matching between successive frames. Subsequently, two frames with sufficient parallax are selected for initialization using the matched keypoints. The initialization process includes the recovery of relative poses through fundamental matrix computation and the construction of an initial visual map through triangulation. Following initialization, the visual map is tracked in each subsequent frame. Meanwhile, BA [63] is implemented to jointly optimize camera poses and mappoint positions. If new triangulation operations are successful, the relevant mappoints are also added. As a consequence, a mappoint can correspond to several keypoints in different frames (at most one keypoint per frame), and keypoint-keypoint correspondences across frames are also established through the mappoint-keypoint connections, as sketched below. The above steps also apply to stereo and RGB-D Visual SLAM. Detailed information on this subject can be found in [49], [50].
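As a toy illustration of this bookkeeping, the sketch below stores at most one keypoint observation per frame for each mappoint, so cross-frame keypoint-keypoint correspondences fall out of the mappoint-keypoint links. The class and field names are illustrative, not taken from any SLAM codebase.

```python
# Minimal sketch of mappoint/keypoint bookkeeping; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class MapPoint:
    position: tuple                                   # triangulated 3D position
    observations: dict = field(default_factory=dict)  # frame_id -> keypoint index

    def add_observation(self, frame_id: int, kp_idx: int) -> None:
        # At most one keypoint in one frame may observe this mappoint.
        assert frame_id not in self.observations
        self.observations[frame_id] = kp_idx

def cross_frame_match(mp: MapPoint, frame_i: int, frame_j: int):
    """Return the keypoint-keypoint correspondence between two covisible frames."""
    if frame_i in mp.observations and frame_j in mp.observations:
        return mp.observations[frame_i], mp.observations[frame_j]
    return None
```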
The above preliminaries explain the elements in Fig. 2: the keypoints $k^{(t)}_i$ and $k^{(t)}_j$ both correspond to the triangulated mappoint $m^{(t)}_i$. For the convenience of the following description, we present the following assumptions for reference.
1) The extrinsic matrix $T_{CL}$ is completely accurate;
2) The 3D point in the real world corresponding to $m^{(t)}_i$ is also scanned by the Lidar, and it is denoted as $q^{(t)}_i$.

In fact, $p^{(t)}_i$ in Fig. 2 is transformed from $q^{(t)}_i$ using (5):

$$p^{(t)}_i = R_{CL}\, q^{(t)}_i + t_{CL}. \tag{5}$$
First, we review the form of the BA reprojection error [51]. The reprojection errors of $m^{(t)}_i$ on $F_i$ and $F_j$ are formulated in (6) and (7), correspondingly:

$${}^{BA}e^{(t)}_i = \pi\big(m^{(t)}_i\big) - k^{(t)}_i, \tag{6}$$

$${}^{BA}e^{(t)}_j = \pi\big(T^C_{ij}\, m^{(t)}_i\big) - k^{(t)}_j, \tag{7}$$

where $\pi(\cdot)$ stands for the projection function of the camera and $m^{(t)}_i$ is expressed in the camera coordinate system of $F_i$. Note that ${}^{BA}e^{(t)}_i$ and ${}^{BA}e^{(t)}_j$ are vectors, and the same applies to the following error items.
In contrast to ordinary BA, the proposed CBA reprojection is predicated on Lidar points instead of triangulated visual mappoints, and errors are constructed between projected Lidar points and image feature points. Since the Lidar point cloud $Q_i$ corresponds to the image frame $F_i$, the CBA error for $F_i$ can be directly written as (8). To extend this reprojection to a covisible frame of $F_i$, namely $F_j$, we introduce the concept of virtual frames. The only difference between frame $F_j$ and the virtual frame $\widetilde{F}_j$ lies in their poses. As the scale of $T^C_{ij}$ is determined by visual initialization and $p^{(t)}_i$ is in real size, we modulate the relative translation between $F_i$ and $\widetilde{F}_j$ to $s \cdot t^C_{ij}$, thereby maintaining the reprojection similarity relationship depicted in Fig. 2. With the concept of virtual frames, our reprojection error in $\widetilde{F}_j$ can be formulated as (9):

$${}^{CBA}e^{(t)}_i = \pi\big(p^{(t)}_i\big) - k^{(t)}_i, \tag{8}$$

$${}^{CBA}e^{(t)}_j = \pi\big(R^C_{ij}\, p^{(t)}_i + s\, t^C_{ij}\big) - k^{(t)}_j. \tag{9}$$
This form of reprojection resembles that of BA reprojection; hence, we term it Cross-modality BA (CBA) reprojection. Intriguingly, CBA reprojection devolves to BA reprojection if Assumptions (1) and (2) both hold. This property can be validated simply by substituting the ideal condition $p^{(t)}_i = s\, m^{(t)}_i$ into (8) and (9). In fact, the CBA errors (8) and (9) are both implicitly restricted by $T_{CL}$ and can be applied to extrinsic parameter optimization. By substituting (5) into (8) and (9), we can explicitly view $T_{CL}$ and $s$ as independent variables and formulate the CBA errors in (10) and (11):

$$e^{(t)}_i = \pi\big(R_{CL}\, q^{(t)}_i + t_{CL}\big) - k^{(t)}_i, \tag{10}$$

$$e^{(t)}_j = \pi\big(R^C_{ij}\,(R_{CL}\, q^{(t)}_i + t_{CL}) + s\, t^C_{ij}\big) - k^{(t)}_j. \tag{11}$$

For convenience, the superscript notation CBA is omitted in the formulations henceforth.
Unfortunately, it is infeasible to directly utilize (10) and (11) for the optimization of $T_{CL}$ and $s$, because the acquisition of $q^{(t)}_i$ (or $p^{(t)}_i$) relies on the satisfaction of Assumptions (1) and (2). If Assumption (1) is not met, it is impossible to select the correct $p^{(t)}_i$ that corresponds to $k^{(t)}_i$, since the entire Lidar point cloud is transformed using the erroneous $T_{CL}$. Likewise, if Assumption (2) is not satisfied, a corresponding $p^{(t)}_i$ does not exist in the Lidar point cloud. In other words, the Lidar sensor fails to capture the pertinent 3D structural point in the real-world context.
Furthermore, it is apparent that Assumption (1) will never be satisfied exactly, because $T_{CL}$ and $s$ (hereafter collectively denoted as $x$) are the variables to be optimized. To address this problem, we alternate between identifying $p^{(t)}_i$ and optimizing $x$, rather than performing these actions simultaneously. This alternating optimization starts with the initial values $T_{CL}^{(0)}$ and $s^{(0)}$ calculated by HECalib. Taking $F_i$ as an example, we first transform the Lidar point cloud $Q_i$ into the camera coordinate system using (5). Subsequently, we identify the $p^{(t)}_i$ that corresponds to each keypoint $k^{(t)}_i$. In the third step, $x$ is optimized by minimizing (11) with the selected $p^{(t)}_i$. Finally, these steps are carried out iteratively, commencing each cycle from the first step.
In contrast, Assumption (2) can hold for the subset of image keypoints that lie within the FOV of the Lidar. We propose an error-based criterion to evaluate the satisfaction of Assumption (2), i.e., the existence of a $p^{(t)}_i$ that corresponds to keypoint $k^{(t)}_i$. Specifically, we employ a KD-tree to locate the projected Lidar point that is nearest to $k^{(t)}_i$. If the minimum distance exceeds a predefined threshold $\delta_\pi$, we infer that $k^{(t)}_i$ does not correspond to any Lidar point. Finally, we also incorporate an error threshold $\delta_1$ to filter out outliers. Denote $P_i$ as the $i$-th Lidar point cloud transformed into the camera coordinate system and $N_f$ as the number of frames. The complete algorithm to compute the CBA Loss for one iteration is summarized in Algorithm 1. Please note that the values of $R_{CL}$, $t_{CL}$, and $s$ are obtained from the current optimization iteration, and the same applies to Algorithm 2.
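As a rough sketch of one iteration of this procedure (the article's Algorithm 1 is not reproduced here), the code below assumes each frame object carries its Lidar cloud, its keypoints, the keypoints of one covisible frame aligned index-wise via the mappoint correspondences, and the relative camera motion; these attribute names are hypothetical.

```python
# Condensed sketch of the CBA Loss for one iteration; names are hypothetical.
import numpy as np
from scipy.spatial import cKDTree

def project(K, pts):                        # pi(.): pinhole projection
    uv = (K @ pts.T).T
    return uv[:, :2] / uv[:, 2:3]

def cba_loss(frames, K, R_CL, t_CL, s, delta_pi, delta_1):
    total, n_valid, n_total = 0.0, 0, 0
    for f in frames:
        front = f.cloud @ R_CL.T + t_CL     # (5): Lidar cloud in camera frame
        front = front[front[:, 2] > 0]      # keep points in front of the camera
        tree = cKDTree(project(K, front))
        # Select p_i: nearest projected Lidar point to each keypoint k_i
        dist, idx = tree.query(f.kps_i)
        keep = dist < delta_pi              # Assumption (2): a match exists
        p_sel = front[idx[keep]]
        # Reproject to the covisible virtual frame with translation s*t_ij, cf. (11)
        R_ij, t_ij = f.rel_pose             # camera motion from F_i to F_j
        e = np.linalg.norm(project(K, p_sel @ R_ij.T + s * t_ij)
                           - f.kps_j[keep], axis=1)
        inlier = e < delta_1                # outlier rejection by delta_1
        total += e[inlier].sum()
        n_valid += int(inlier.sum()); n_total += len(e)
    return total, n_valid / max(n_total, 1)  # loss and the inlier ratio for (20)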
Finally, as depicted in Fig. 3, we showcase the visual disparity in the CBA error (11) before and after optimization for one specific frame $F_j$. In this figure, $F_j$ is one of the covisible frames of $F_i$, and the matched Lidar points are selected by minimizing (8) in $F_i$ and then reprojected to $F_j$ using (9). Overall, these reprojected Lidar points (solid circles) are closer to the keypoints (hollow circles) in $F_j$ after optimization. This visual change is illustrated in Fig. 3(a) and (b), with noticeable distinctions marked by red squares. Before optimization, the two hollow circles were positioned on foreground objects, whereas their corresponding solid circles were located in the background. After optimization, both circles are correctly projected onto the foreground objects.

D. Global Optimization
Following the introduction of the CBA Loss, we describe the remaining components of the proposed global optimization in this section, including the Cross-modality Alignment (CA) Loss and the inequality constraints. We devise the CA Loss as an auxiliary loss function to complement the CBA Loss and enhance the convergence performance of our optimization approach. The inspiration for the CA Loss comes from the Iterative Closest Point (ICP) algorithm [41]. Our optimization process involves optimizing either the point-point or the point-plane distance between the scaled visual mappoints $s \cdot m^{(t)}_i$ and the transformed Lidar points. When the Lidar points surrounding $p^{(t)}_i$ can fit a plane, we replace the point-point distance (12) with the point-plane distance formulated in (13). Ultimately, similar to CBA, an error-based threshold $\delta_2$ is also employed for outlier rejection:

$$d_{pt} = \left\| s\, m^{(t)}_i - p^{(t)}_i \right\|_2, \tag{12}$$

$$d_{pl} = \left| \left\langle s\, m^{(t)}_i - p^{(t)}_i,\; n^{(t)}_i \right\rangle \right|, \tag{13}$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors, $\|\cdot\|_2$ denotes the second norm of a vector, and $n^{(t)}_i$ denotes the normalized normal of $p^{(t)}_i$.
Moreover, we devise two criteria (14) and (15) to assess the validity of the plane fitted by the Lidar points surrounding $p^{(t)}_i$: (14) constrains the regression error of the plane, while (15) ensures the fitted plane has sufficient size, with $\epsilon_p$ and $r_{min}$ as the corresponding preset thresholds. Both criteria must be satisfied for the fitted plane to be considered valid. All the above procedures for computing the CA Loss are outlined in Algorithm 2. Furthermore, it is unnecessary to apply $\delta_2$ to verify $d_{pl}$, because $d_{pl}$ is theoretically no greater than $d_{pt}$.
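A rough sketch of this computation under the same assumptions is given below: for each scaled mappoint, find the nearest transformed Lidar point, fit a local plane by PCA, and use the point-plane distance (13) only when the fit passes the validity checks; the helper names and the PCA-based reading of criteria (14) and (15) are assumptions, not the article's exact formulas.

```python
# Sketch of the CA Loss with local plane fitting; names are hypothetical.
import numpy as np
from scipy.spatial import cKDTree

def ca_loss(mappoints, cloud_cam, s, delta_2, eps_p, r_min, k=10):
    tree = cKDTree(cloud_cam)              # Lidar points already in camera frame
    total = 0.0
    for m in mappoints:
        q = s * m                          # scaled visual mappoint
        d_pt, _ = tree.query(q)            # nearest Lidar point p, cf. (12)
        if d_pt > delta_2:                 # error-based outlier rejection
            continue
        _, nbr_idx = tree.query(q, k=k)    # neighborhood for the plane fit
        nbrs = cloud_cam[nbr_idx]
        center = nbrs.mean(axis=0)
        # PCA plane fit: normal = eigenvector of the smallest eigenvalue
        eigval, eigvec = np.linalg.eigh(np.cov((nbrs - center).T))
        normal, residual = eigvec[:, 0], np.sqrt(eigval[0])
        extent = np.linalg.norm(nbrs - center, axis=1).max()
        if residual < eps_p and extent > r_min:        # criteria (14) and (15)
            total += abs(np.dot(q - center, normal))   # point-plane distance (13)
        else:
            total += d_pt                              # point-point distance (12)
    return total
```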
Similar to Fig. 3, the visual difference of the CA errors before and after optimization for a specific example frame $F_1$ is presented in Fig. 4. The visual mappoints observed by $F_1$ have been scaled by $s$, and the Lidar points of $F_1$ have been transformed into the camera coordinate system of $F_1$ using $T_{CL}$. In comparison to Fig. 4(a), the transformed Lidar points (yellow) in Fig. 4(b) exhibit better alignment with the scaled visual mappoints (red). Three pairs of white circles highlight noticeable visual differences between the two figures.
Finally, as indicated in (16), we compute the aggregate loss by combining the CBA loss $L_1$ and the CA loss $L_2$ using their respective weights $w_1$ and $w_2$. This aggregate loss is the actual objective function to be optimized:

$$L = w_1 L_1 + w_2 L_2. \tag{16}$$
Regarding the constraints of the global optimization, we first utilize the Lie algebra formulation to remove the implicit constraints in $T_{CL}$. Specifically, we formulate the 7-dimensional optimization variable $x$ (used to denote $T_{CL}$ and $s$ before) in (17):

$$x = [\omega, \rho, s], \tag{17}$$

where $\omega, \rho \in \mathbb{R}^3$ are the rotation and translation vectors of the Lie algebra, while the scalar $s$ denotes the scale factor. With this representation, the initial value of $x$ can be derived from $T_{CL}^{(0)}$ and $s^{(0)}$. We define the bilateral bound constraint in (18) to determine the search range of $x$:

$$x_0 - \Delta_x \leq x \leq x_0 + \Delta_x, \tag{18}$$

where $x_0$ denotes the initial value of $x$ and $\Delta_x$ is a 7-dimensional vector that represents the search radius in each degree.
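A small sketch of how the packing in (17) can be unpacked into a pose, assuming the SE(3) exponential map is what connects $[\omega, \rho]$ to $T_{CL}$; the helper names are illustrative.

```python
# Sketch: unpack x = [omega, rho, s] into (T_CL, s) via the SE(3) exponential.
import numpy as np
from scipy.linalg import expm

def skew(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def unpack(x):
    omega, rho, s = x[:3], x[3:6], x[6]
    xi = np.zeros((4, 4))            # se(3) element in 4x4 matrix form
    xi[:3, :3] = skew(omega)
    xi[:3, 3] = rho
    return expm(xi), s               # matrix exponential lands in SE(3)

# The bound constraint (18) is then just an elementwise box around x0:
# x0 - dx <= x <= x0 + dx.
```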

Furthermore, an inequality constraint (19) is imposed to ensure that the HECalib constraint (1) is satisfied within a specified error range. Despite the potential degeneration in hand-eye calibration, it can serve as a necessary but insufficient condition for verifying the correctness of the obtained solutions:

$$\frac{1}{N_f} \sum_{i,j} \left\| \mathrm{Log}\big(T_{CL}\, T^L_{ij}\, T_{CL}^{-1}\, (T^C_{ij})^{-1}\big) \right\|_2 \leq \delta_h, \tag{19}$$

where $\mathrm{Log}(\cdot)$ transfers Lie groups to Lie algebras, $N_f$ represents the number of instances of Constraint (19), and $\delta_h$ is a preset threshold.
The final constraint is imposed to limit the minimum inlier ratio of the CBA errors. When the Lidar and camera are accurately calibrated, the inlier ratio of the CBA errors should remain above a certain threshold, since these errors are linked to correct correspondences. By applying Constraint (20), we can effectively exclude extreme cases where both the inlier ratio and the value of $L_1$ are low:

$$\frac{n^v_1}{n_1} \geq \delta_v, \tag{20}$$

where $n^v_1$ denotes the number of valid $e^{(t)}_j$ that satisfy $\|e^{(t)}_j\| \leq \delta_1$, $n_1$ denotes the total number of $e^{(t)}_j$, and $\delta_v$ is a preset threshold.

Thus far, we have presented all the components of our proposed global optimization. Different from [22], [26], [28], [58], our method does not involve any appearance-based metrics or training steps in the calibration process. Instead, we utilize the two hypotheses derived from CMSC to design adaptable and unsupervised loss functions. This strategy enhances generalization ability and eliminates the need for human-labeled data and training.
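For reference, the two feasibility checks can be sketched as follows, assuming the residual of (19) takes the relative-pose form reconstructed above; the function names are hypothetical, and the matrix logarithm of a rigid transform is used to obtain its se(3) element.

```python
# Sketch of the feasibility checks (19) and (20); names are hypothetical.
import numpy as np
from scipy.linalg import logm

def hand_eye_residual(T_CL, T_L_ij, T_C_ij):
    """||Log(T_CL T_L_ij T_CL^-1 T_C_ij^-1)||, cf. Constraint (19)."""
    E = T_CL @ T_L_ij @ np.linalg.inv(T_CL) @ np.linalg.inv(T_C_ij)
    xi = logm(E).real                # 4x4 se(3) matrix
    omega = np.array([xi[2, 1], xi[0, 2], xi[1, 0]])  # skew part -> rotation vec
    return np.linalg.norm(np.concatenate([omega, xi[:3, 3]]))

def is_feasible(T_CL, motions, delta_h, n_valid, n_total, delta_v):
    mean_res = np.mean([hand_eye_residual(T_CL, Tl, Tc) for Tl, Tc in motions])
    return mean_res <= delta_h and n_valid / n_total >= delta_v  # (19) & (20)
```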

E. Implementation Details
Due to the KD-trees in Algorithms 1 and 2, neither $L_1$ nor $L_2$ is differentiable. As a consequence, we opt to employ a derivative-free optimization algorithm called Mesh Adaptive Direct Search (MADS) [66], [67] for the optimization process. MADS effectively explores the variable space and accommodates both bound and nonlinear constraints. In terms of the primary parameter settings of the MADS algorithm, the minimum mesh size is set to $10^{-6}$ for each degree, and the maximum number of black-box evaluations (BBE) is set to 5000. Furthermore, the Variable Neighborhood Search (VNS) option proposed in [68] is enabled to suppress the occurrence of local optima. Ultimately, the weights $w_1$ and $w_2$ given in (16) are both set to 1 in our experiments.
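Our pipeline drives this black box with MADS; purely for illustration, the sketch below wires the same objective and constraints into scipy's differential_evolution as a stand-in optimizer, handling the nonlinear constraints with a crude penalty rather than MADS's native mechanism. The wrapper functions (and the unpack helper from the earlier sketch) are hypothetical.

```python
# Sketch of the black-box optimization loop; differential_evolution is a
# stand-in for MADS, and the *_from wrappers are hypothetical.
import numpy as np
from scipy.optimize import differential_evolution

def objective(x, data, w1=1.0, w2=1.0):
    T_CL, s = unpack(x)                                # from the sketch above
    L1, inlier_ratio = cba_loss_from(T_CL, s, data)    # hypothetical wrapper
    L2 = ca_loss_from(T_CL, s, data)                   # hypothetical wrapper
    if not feasible_from(T_CL, s, inlier_ratio, data): # checks (19) and (20)
        return 1e9                                     # penalize infeasibility
    return w1 * L1 + w2 * L2                           # aggregate loss (16)

def calibrate(x0, dx, data):
    bounds = list(zip(x0 - dx, x0 + dx))               # bound constraint (18)
    res = differential_evolution(objective, bounds, args=(data,),
                                 maxiter=200, polish=False, seed=0)
    return res.x
```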
Concerning the selection of the Visual SLAM and Lidar SLAM pipelines, we select ORB-SLAM [49] and F-LOAM [43], respectively. ORB-SLAM is a state-of-the-art monocular SLAM method that incorporates loop closure detection [52] and global BA. Considering that F-LOAM lacks similar loop closure capabilities, we integrate Scan Context [45] into F-LOAM to enable loop closure detection and back-end optimization.

IV. EXPERIMENTS

A. Dataset Introduction
The KITTI dataset [40] is a widely used benchmark for evaluating computer vision and autonomous driving algorithms. It comprises 22 sequences of data captured by multiple sensors, including Lidar and camera, and it also provides the ground-truth Lidar-camera extrinsic matrix. In our experiments, six KITTI sequences (00, 02, 03, 04, 05, 07) are selected for calibration evaluation. The sensor motions in these sequences predominantly consist of straight-line movements, with only a small portion involving slight rotations on curved road segments. Sequence 04 is the most extreme case, where the vehicle exclusively travels along a straight highway without any perceptible turns.
The sensors to be calibrated consist of a Velodyne HDL-64E Lidar and a gray-scale Point Grey camera (camera-0). To provide clarity, we present a schematic diagram illustrating the relative transformation between the Lidar ($O_L$) and the camera ($O_C$) in Fig. 5. In this diagram, axes $X_L$ and $Z_C$ are parallel to the forward direction of the vehicle, while axes $Z_L$ and $Y_C$ are perpendicular to the ground. For sequence 04, the sensor motions pertain solely to translation along axis $X_L$ ($Z_C$). For the other sequences, considering the presence of vehicle turns, there are additional translational movements along axis $Y_L$ ($X_C$) and rotational motions around axis $Z_L$ ($Y_C$).

The remaining content of this section is organized as follows. First, in Section IV-B, we evaluate the calibration performance of our method and three baseline methods [20], [34], [39] and conduct an ablation study on six KITTI sequences. Next, in Section IV-C, we present detailed information about the optimization process and analyze the properties of the proposed loss functions. Finally, we evaluate the degeneration-resistance performance of our method through a unidimensional analysis in Section IV-D. The experiments in Sections IV-C and IV-D focus on sequence 04, which is the most representative degenerate KITTI sequence.

B. Calibration Accuracy
First, we compare the calibration accuracy of our method with that of the baselines. Eight metrics are defined for evaluation based on the error matrix $T_e$ formulated in (21):

$$T_e = \big(T^{gt}_{CL}\big)^{-1}\, T^{p}_{CL}, \tag{21}$$

where $T^{gt}_{CL}$ is the ground-truth extrinsic matrix provided by the KITTI dataset and $T^{p}_{CL}$ is the predicted extrinsic matrix. The eight metrics include the three Euler angles ($\Delta_{roll}$, $\Delta_{pitch}$, $\Delta_{yaw}$) and the three translation components ($\Delta_X$, $\Delta_Y$, $\Delta_Z$) of $T_e$. The remaining two metrics are the root mean squared errors (RMSE) in rotation and translation, i.e., the root mean squared results of the Euler angles and the translation components, respectively.
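A small sketch of how the eight metrics can be computed from (21) follows; the "xyz" Euler convention used here is an assumption.

```python
# Sketch of the eight evaluation metrics from the error transform T_e.
import numpy as np
from scipy.spatial.transform import Rotation

def calib_metrics(T_gt, T_pred):
    T_e = np.linalg.inv(T_gt) @ T_pred                  # error transform (21)
    roll, pitch, yaw = Rotation.from_matrix(T_e[:3, :3]).as_euler(
        "xyz", degrees=True)                            # three Euler angles
    dX, dY, dZ = T_e[:3, 3]                             # three translation parts
    rmse_rot = np.sqrt(np.mean(np.square([roll, pitch, yaw])))
    rmse_trans = np.sqrt(np.mean(np.square([dX, dY, dZ])))
    return roll, pitch, yaw, dX, dY, dZ, rmse_rot, rmse_trans
```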
As mentioned in Section I, we reproduce three targetless methods [20], [34], [39] as our baselines. For fairness, they are all initialized using our proposed hand-eye calibration introduced in Section III-B. Since [20] works in single-frame mode, we record its best single-frame result (the minimum sum of rotational and translational RMSE) throughout each sequence as its final result in Table I.
Quantitative results are presented in Table I. Our method (the last row) remarkably outperforms the baseline methods in terms of both the rotational and translational RMSE metrics. In contrast, the appearance-based approach (3rd row) does not effectively reduce the calibration error from the initial estimation (2nd row). Such methods can only converge from a small bias toward the ground truth, while the initial calibration obtained by HECalib is still far from the ground-truth values. The results of the other two baselines are not satisfactory either, especially in terms of the translational metrics. Although [34] (4th row) is globally convergent, it directly substitutes the HECalib constraint (1) into (7) to construct its reprojection error, resulting in a degeneration issue similar to that of HECalib. Consequently, its rotational errors are low while its translational errors are high, aligning with our previous analysis that the degeneration solely causes significant errors in $t_{CL}$. PBA [39] achieves good translational calibration performance on sequence 00 but performs poorly on the other sequences. It can be roughly viewed as our CA Loss with joint optimization over camera poses, mappoints, and intrinsic parameters. We attribute its failure in our experiments to the absence of CBA-like constraints and the use of a locally convergent algorithm [69] for optimization.
Subsequently, we conduct ablation experiments by creating two variants of the proposed framework. The first variant eliminates the CA Loss, while the second retains the CA Loss but exclusively applies the point-point distance (12) for optimization. As illustrated in the final three rows of Table I, our method and both variants yield similar rotational errors. However, notable differences arise in translation. Comparison between the last and the third-to-last rows showcases the effect of the CA Loss on reducing the translational errors. Meanwhile, compared with the second-to-last row, our proposed version (the last row) achieves lower translational RMSE on all the sequences except sequence 00.
Nevertheless, the second variant does not consistently perform better than the first. Its translational RMSE is worse than that of the first variant on sequences 02 and 03, illustrating the importance of the proposed point-plane distance. Regarding computational efficiency, the first variant runs faster than the second due to the absence of the CA Loss, and the second is more efficient than our complete version because Criteria (14) and (15) no longer need to be validated.
Finally, we choose examples from four sequences (00, 03, 04, 07) to qualitatively demonstrate the calibration accuracy of our method. In Fig. 6, Lidar points are projected onto images using the predicted extrinsic matrix (left column) and the ground-truth one (right column), respectively. Only slight visual differences can be recognized between the two columns of corresponding images, and they are marked with yellow circles.

C. Optimization Process
In this section, we analyze the trend of the final loss function $L$ in the MADS algorithm. We plot the curve of $L$ using the feasible points generated by MADS in Fig. 7. Feasible points refer to solutions that satisfy Constraints (19) and (20). Based on our observations, all the infeasible points depicted in Fig. 7 satisfy (20) but not (19). Therefore, all subsequent analyses concerning the infeasible points will focus on (19).
Initially, $L$ decreases rapidly, but its rate of decrease gradually slows down as the optimization proceeds. The variation of $L$ follows a step-like pattern, with noticeable drops at indices 1022, 2624, and 3603. Even toward the end of the optimization, $L$ still shows a decreasing trend and has not fully converged (although the optimization terminates because the minimum mesh condition is met). This reflects the difficulty of optimizing $L$, which is mainly attributed to its discontinuity.

Fig. 6. Left column: Lidar points projected onto images using the predicted extrinsic matrix; right column: Lidar points projected onto images using the ground-truth extrinsic matrix. From top to bottom: samples from KITTI sequences 00, 03, 04, and 07. Some perceptible differences are denoted with paired yellow circles.
As expressed in (16), the final loss function $L$ is a weighted sum of $L_1$ and $L_2$, both of which exhibit discontinuity due to variations in correspondences. The cause of the discontinuity in the CA Loss is similar to that in the ICP algorithm [41]: CA correspondences can vanish or change when the relevant Euclidean distances exceed the predefined threshold $\delta_2$, which explains why the number of correspondences changes after optimization. The variation in CA correspondences is visually depicted by the white circles in Fig. 4.
Comparatively, the discontinuity in the CBA Loss primarily stems from the reprojection operation. By minimizing (8), the position of $p^{(t)}_i$ can shift during selection, leading to a sudden value change in the reprojection error calculated in (9). Discontinuity also occurs when some $p^{(t)}_i$ selected by (8) cannot be tracked in the next iteration. Moreover, similar to CA, a subset of CBA correspondences may change or disappear when filtered by the threshold $\delta_1$. To illustrate, we mark three changed CBA correspondences with pairs of orange circles in Fig. 3(a) and (b).

D. Unidimensional Analysis
Taking sequence 04 as an example, we show the degeneration that occurs in HECalib and the degeneration-resistance of our method. The values of $L$ and the meeting statuses of (19) under unidimensional offsets of the optimization variable $x$ are drawn in Fig. 8. As discussed in Section III-B, the HECalib solution for $t_{CL}$ becomes degenerate in this case. As shown in Fig. 8(d), (e), and (f), no infeasible points appear as the translational offset increases. According to the theory of Lie algebra, the physical translation $t$ equals the translation vector $\rho$ when the rotation degrades to zero. Therefore, this observation implies that the meeting status of (19) does not change with the variation of $t_{CL}$, which aligns with our theoretical derivations in Section III-B.
However, our designed loss function is sensitive to translation changes in the calibration parameters. On the theoretical side, it can be observed in (11) that the coefficient of $t_{CL}$ does not degrade to zero when $R^C_{ij} \approx I$, signifying that the solution of $t_{CL}$ is not degenerate in the CBA Loss. Similar derivations can be applied to the CA Loss if we substitute (5) into (12) and (13). On the experimental side, Fig. 8(d), (e), and (f) jointly demonstrate that the ground-truth solution (red inverted triangle) is close to the zero-offset point, indicating the accurate estimation of $t_{CL}$ through our optimization method. In addition, these three curves indicate that the optimization in the translational dimensions of $x$ is almost convex. Despite the local minima in Fig. 8(d), the shape of $L$ still guarantees that the minimum value is achieved near the ground-truth point.
Conversely, Fig. 8(a), (b), and (c) demonstrate that the optimization in the rotational dimensions is much easier. The HECalib constraint (19) successfully narrows the search range down to the vicinity of the ground truth, and the trends of the function on both sides of the zero offset are almost strictly monotonic.
Concerning the scale dimension, the ground-truth scale is unknown because we do not have a ground-truth visual 3D map, so we adopt the initial scale $s^{(0)}$ from HECalib as the zero-offset point and plot the relevant curves in Fig. 8(g). The scale dimension is not directly associated with our calibration task, but its value is required by the optimization process. Fig. 8(g) demonstrates that the HECalib constraint (19) also works in the scale dimension.
As for the other constraints, Constraint (18) is predefined to determine the initial search range, while Constraint (20) is applied to prevent runtime errors in extreme cases.

V. CONCLUSION
In this article, we propose a novel targetless Lidar-camera calibration method based on cross-modality structure consistency. The performance of this method in degenerate scenes is showcased through extensive experiments, and its degeneration-resistance property is theoretically derived. This property enables the application of the method in scenarios where the vehicle only moves forward without significant rotation, which is common in the field of autonomous driving.
However, due to the discontinuity of the loss functions, it is difficult to use derivative-based algorithms in our framework. In future research, we aim to devise differentiable and globally convergent loss functions for calibration, since derivative-based optimization methods are typically more computationally efficient than derivative-free ones. Additionally, by exploiting gradient information, joint optimization of the extrinsic matrix and camera poses is expected to produce more accurate extrinsic parameters.

Fig. 1. Schematic diagram of our method. $T_{CL}$ denotes the extrinsic matrix from Lidar to camera, and $s$ denotes the scale factor used to transform the visual mappoints to real size.

Fig. 2. Illustration of cross-modality structure consistency. $F_i$: Frame $i$; $F_j$: Frame $j$; $\widetilde{F}_j$: Virtual Frame $j$; CA Loss: cross-modality alignment loss; CBA Loss: cross-modality bundle adjustment reprojection loss; $p^{(t)}_i$: a Lidar point of $F_i$ that has been transformed into the camera coordinate system, whose index in the Lidar point cloud is determined via a KD-tree-based nearest neighbor search within the projected Lidar points around $k^{(t)}_i$.

Fig. 3. Visual difference of the CBA Loss before (a) and after (b) global optimization. To improve visibility, we only sampled 50 keypoints using Farthest Point Sampling [64] for drawing. The number of error items is denoted by "#error". The solid and hollow small circles respectively denote the reprojections of the matched Lidar points onto $F_j$ and their corresponding keypoints, which are the minuend and subtrahend on the right side of (9). A pair of solid and hollow circles of the same color indicates a match.

Fig. 4. Visual difference of the CA Loss before and after global optimization (bird's-eye view). Yellow points denote the Lidar points $P_i$ in Algorithm 2; red points denote the scaled visual mappoints $s \cdot m^{(t)}_i$; light blue points mark the nearest yellow point matched by each red point.


Fig. 5. Coordinate system relationship between the Lidar and camera in the KITTI dataset. $O_C$ and $O_L$ are the coordinate origins of the camera and Lidar, respectively. Each axis of the camera and Lidar coordinate systems is annotated in the figure.

Fig. 7. Curve of the loss (16) during optimization. Part of the curve (index ≥ 1022) has been zoomed for better viewing. The label of the x-axis indicates the index of feasible points in the MADS algorithm.

Fig. 8. Values of $L$ and meeting statuses of (19) under unidimensional offsets of the optimization variable $x$. The zero-offset point is our optimization result $x^*$. Blue curves ($L$): the curve of (16); black triangles (IFeas): infeasible points that do not satisfy (19); red inverted triangles (gt): the ground-truth point.