Optimal calibration of a prism-based videoendoscopic system for precise 3D measurements

Modern videoendoscopes are capable of performing precise three-dimensional (3D) measurements of hard-to-reach elements. An attachable prism-based stereo adapter allows one to register images from two different viewpoints using a single sensor and apply stereoscopic methods. The key condition for achieving high measurement accuracy is the optimal choice of a mathematical model for calibration and 3D reconstruction procedures. In this paper, the conventional pinhole camera models with polynomial distortion approximation were analyzed and compared to the ray tracing model based on the vector form of Snell’s law. We, first, conducted a series of experiments using an industrial videoendoscope and utilized the criteria based on the measurement error of a segment length to evaluate the mathematical models considered. The experimental results confirmed a theoretical conclusion that the ray tracing model outperforms the pinhole models in a wide range of working distances. The results may be useful for the development of new stereoscopic measurement tools and algorithms for remote visual inspection in industrial and medical applications.


Introduction
Measurement technologies are now transforming videoendoscopes from pure visual inspection tools to precise equipment for characterization of hard-to-reach elements. With the measurement technologies available today, remote visual inspection is complementing or even replacing other non-destructive testing methods in the rocket and aircraft engineering, oil and gas industry and other fields [1,2]. The mostly used approach to videoendoscope-based 3D measurements is the utilization of a miniature prism-based optics and matching two images of a surface from two different points [3]. This stereoscopic technique employs a calibration procedure and processing algorithms to compute a 3D coordinate for every matched pixel prior to the start of the measurement process resulting in a full 3D surface map of the inspected object [4 -6]. The data processing pipeline of a 3D measurement endoscopic system with a prism-based stereo adapter is shown in Fig. 1.
Many researchers have proposed their understandings and implementation methods for prism-based single-lens stereovision systems in the last two decades. Lee and Kweon [7] considered an arbitrary point in 3D space as two virtual points after the refraction of a prism and derived 3D reconstruction technique for this kind of systems. Lim and Xiao [8] described that one image acquired by the system could be divided into two halves considered as captured by two virtual cameras. This concept converted the prism-based stereovision system to the conventional stereoscopic system using two cameras, which made possible to use conventional camera calibration, image processing and 3D reconstruction methods. Furthermore, this concept was extended from twoocular to multiocular system [9]. This approach was also used to implement epipolar lines calculation method [10], to analyze the influence of prism's angle and position on a common field of view for the virtual cameras [11] and to estimate depth measurement error caused by errors of the prism and camera parameters estimation [12]. However, they made significant assumptions about the relative camera to prism position and analyzed refraction in 2D space only. Cui et al. [13] pointed out the limitations of the 2D approach and proposed an accurate geometrical optics model of the prism-based stereovision system using the 3D vector form of Snell's law. It made possible to derive a transformation matrix which could express the relationship between an object point and its image as 3D affine transformation. This matrix could be composed with other transformation matrices describing rotation, translation and projection in traditional projective geometry to build epipolar lines and 3D reconstruction. They noted that this matrix is different for each image point and the prism-based stereovision system can not be represented with pure projective camera models with high precision. Wu et al. [14] used similar approach for the digital image correlation system. They implemented a modified virtual point model based on backward ray tracing from the image plane to the object space considering the refraction on prism faces and adapted it for bilateral telecentric lens later [15].
In all reviewed papers it was supposed that the camera parameters could be either estimated through a standard calibration procedure [16] or known from a technical documentation and that the position and the parameters of the prism were known exactly. In contrast, Cui et al. [17] proposed a calibration technique to estimate multiocular prism position after a separate camera calibration. They used transformation matrices from [13] and made several assumptions about prism to reduce the number of parameters.
In this paper, we introduce a classification and analysis of mathematical models for a prism-based single-lens stereovision system based on a considered model of an optical distortion induced by a prism. The first group [8 -12] includes pure projective pinhole models ignoring prism distortion. Nevertheless, the concept of virtual camera allows to use any camera model with distortion to calibrate virtual cameras as independent ones. Following this way, Cui et al. [18] approximated scaling factor for every image point in [13] by radial distortion. Lim and Qian [19] made Taylor series analysis of the equations for ray refraction and offered the extended polynomial model to undistort acquired images. These models can be unified as the second group containing pinhole camera models with polynomial distortion. The third group [13 -15, 17] consists of the accurate geometrical optics models derived from the 3D vector form of Snell's law. Finally, we may refuse global distortion model and use model-free local distortion correction as shown in [20]. Furthermore, the same approach can be used in 3D object space to correct the result of 3D measurements directly or to undistort images. Genovese et al. [21] used an additional complicated equipment and the second camera without a prism to measure deviations of 3D coordinates for thousands of control points in the working volume to obtain correction tables.
However, an endoscopic system with a prism-based stereo adapter has some peculiarities. First of all, the main lens has a very short focal distance and a wider field of view, the prism angles are steeper and the prism is placed close to the lens. These factors lead to significantly bigger image distortions than shown in the most of reviewed papers. The lack of techniques allowing to assess 3D measurement error and the applicability of the proposed mathematical models for different camera and prism parameters makes it problematic to apply the results for large-scale prism-based systems to endoscopes. Moreover, the stereo adapter contains an additional lens. We consider the prism parameters and position to be unknown and to be estimated during a calibration procedure. Hence, we can not apply techniques based on known prism parameters as well as independent calibration of camera without a prism and estimation of prism parameters afterwards.
The patent [22] describing the stereo-measurement borescope included the description of calibration procedure using a single image of the calibration tool manufactured as a flat gridded base with a high step. It contains the distortion correction equations generated from the distorted positions of the grid points on the image, but didn't provide further details. Another patent [23] offered to apply polynomial distortion model to the main lens and to the prism with additional lens separately. Hence, we can refer it to the second group of our classification. Takata et al. [24] developed distortion correction method for images captured by endoscope with two wedge prisms using backward ray tracing similar to [14]. Unfortunately, none of these papers presented any results to estimate the uncertainty of 3D measurements which can be achieved for the endoscope with stereo adapter.
The results of computer simulation in our previous work [25] have shown that the pinhole camera model with polynomial distortion could not be equal to the ray tracing model based on the 3D vector form of Snell's law for the typical industrial endoscope with a prism-based stereoscopic adapter. The reason is that pinhole model doesn't consider a ray shift and non-homocentric beams after refraction. Additionally, the best criteria for industrial endoscope evaluation is based on the measurement uncertainties of length or area. These uncertainties vary significantly depending on position and direction inside the working space [25,26] which makes a problem to choose an optimal mathematical model for these systems challenging.
This paper presents theoretical analysis and experimental evaluation of the mathematical models for the industrial endoscopes with the prism-based adapter considering the peculiarities of their optical systems and the specificity of the application. We introduce the basic description of three models and the calibration procedure; present the experimental verification of each model effectiveness and the comparison of the models.

Mathematical models
The general geometric model of the image formation may be written in the form pi = Pi  Ei(lw), where lw is the vector of 3D line coordinates in the world coordinate system (WCS) and pi is the vector of 2D coordinates for the corresponding point on the image plane of the i-th camera, i = 1...N, N -number of cameras [4,5,26]. The vector of line coordinates lw includes 3D coordinates of the origin point cw and the direction vector vw: We use Ei to denote Euclidean mapping xi = Ei(xw) = Rixw+ti translating 3D point xw from the WCS to the coordinate system (CS) of i-th camera (Ri is the rotation matrix and ti is the translation vector). Hence, mapping Pi determines the unique correspondence pi = Pi (li) between the ray li in the i-th camera object space and the point pi in the image plane. We use «•» notation here to define a composition of transformations, i.e. PiEi(lw) ≡ Pi(Ei(lw)). The set of transformations Ei and Pi is parametrized by the vector k of intrinsic and extrinsic camera parameters, which components are to be determined during the system calibration. According to [8][9][10][11][12]19], the prism-based stereoscopic imager can be considered as two virtual cameras and the same notation can be used. Pinhole camera models. The pinhole camera model is widely used in computer vision and is well described [4,5,8,16]. It assumes that all rays Ii for the i-th camera Hence, it is possible to use one point xi for each ray and write xi instead of Ii in the equations for the pinhole models (see Fig. 2). Using the same approach as [5,27], we consider the mapping Pi as the combination of simple transformations where F is the central projection onto the unit plane x and Ai is the 2D affine transformation to the image coordinates in pixels x [4,5].

Fig. 2. A prism-based stereoscopic imager considered as two virtual pinhole cameras
One can find many possible approximations for the lens distortion, based on polynomial [16,19,23,27,28] or more complex [5] mathematical models. The widely used model (introduced by Brown [28]) represents distortion as the combination of radial (3rd, 5th and 7th orders) and tangential part to consider lens decentering as follows ′ ′ is the distortion center of the image, which is usually assumed to coincide with the center of projection; , , , , k k k ρ ρ are the distortion parameters. We further refer to this model as "model 1". This model requires 10 parameters to describe each camera (5 parameters for Ai and 5 parameters for Di) and 6 parameters for relative orientation of two cameras [4,16]. As a result, vector k includes 26 parameters. We can extend the polynomial distortion model using the generic equation where coefficients kxm,n and kym,n are the distortion parameters (2B 2 totally). We further refer to the extended polynomial model as "model 2". Generally, this model requires 2(5+2B 2 )+6 = 16+4B 2 parameters, but the number of the distortion coefficients can be reduced significantly based on theoretical analysis or computer simulation of the system typical parameters [19,25] to simplify optimization procedures on the calibration stage.
In order to allow reconstruction of a ray in 3D object space for each image point, camera model should be invertible and should provide the back-projection transformation ( ) . Hence, each of the transformations in Eq.
(1) should have a corresponding inverse one: F -1 , D -1 and A -1 . It is essential that polynomial models described by Eqs. (2) and (3) do not allow finding a closed-form solution for the inverse transformation D -1 , an iterative solver should be used in this case [5]. Thus, D -1 notation actually stands for this iterative procedure. Ray tracing models. The alternative model using a ray tracing from the image plane to the object space is formulated similar to [14,24]. We use Ik,i = Sk,i(Ik-1,i) notation to describe the refraction of each ray Ik,i on the k-th surface for the i-th image part. According to the vector form of Snell's law, we can derive transformation Sk,i as , , , where nk,i is the refractive index between the k-th and the (k + 1)-th surfaces, sk,i is the normal to the k-th surface and dk,i is the 3D coordinates of any point on the k-th surface (see Fig. 3). In order to find the ray coordinates I2,i in the object space for each image pixel pi, we should define ray coordinates I0,i. We consider the pinhole model with distortion for the main lens and use Eqs. (1) and (2) to find depends only on the parameters of the image sensor and the main lens, it is the same for both image parts. Similar to Eq. (1), the complex inverse transformation for the ray tracing model can be represented in the form where Euclidean mapping E is the same for both channels due to our choice of the same CS for the left and the right parts of the prism, vectors sk,i and dk,i are determined in this CS.

Fig. 3. Ray tracing through a prism
We assume that sk,i is the unit vector and the point dk,i is on the z-axis, so we can use 3 parameters for each prism surface. Accordingly, we have 9 parameters for 3 surfaces and one more parameter for the refractive index. Again, if we let E to be the unitary transformation, we need the vector k of 20 parameters to describe this model. In contrast to the pinhole model, the ray-tracing model cannot provide closed-form solution for the forward transformation pi = Pi  E(Iw) because it requires initially unknown direction vector of the line Iw from 3D point xw (shown in Fig. 3). This problem is usually solved by the iterative technique called ray aiming or by look-up-table interpolation. We can formally write the forward transformation as pi = Pi  E(xw) and extend it the same way as Eq. (5), if we consider that this notation stands for the iterative solver. We further refer to the ray tracing model as "model 3".
3D reconstruction problem. The reconstruction of 3D point coordinates xw from N projections pi may be considered as the ray intersection problem. Because of the uncertainties in coordinates pi, these rays are skew, and the triangulation algorithm T is used for the estimation of ŵ x by minimizing a functional C, i.e. If we assume that the deviation of the measured image coordinates pi from their true values i p follows the Gaussian distribution with zero mean and covariation matrix Σpi, then we can use the Maximum Likelihood Estimator, which minimizes the Mahalanobis distance in the image plane [4,29] ( ) ( ) where ( ) is the estimated position of the 3D point ŵ x projection on the image plane of the i-th camera, and 1 p i − Σ is the inverse (rank-constrained generalized inverse) covariation matrix of the coordinate measurement error for pi. We assume that the measurements errors are independent for each image. The cost function (7) and its derivatives are calculated at each iteration, so it is sufficient to have a closed-form solution for the forward transformation. Hence, we can use this cost function for models 1 and 2. Alternatively, we can formulate the cost function based on the sum of distances from the 3D point ŵ x to the rays l, for example, we can use the mid-point of the common perpendicular for two skew rays [4]. This alternative cost function does not allow to achieve the theoretical bound for minimal variance (the Cramer-Rao lower bound) [29], but it is more suitable for the models that can faster calculate the inverse transformation than the forward one, such as model 3, though we can also use it for models 1 and 2.
Calibration problem. The aim of the calibration procedure is to determine the value of parameter vector k using different types of calibration targets (such as boards and corners [16] or steps [22]). We consider that we register images of the calibration target with M points in R positions; 3D coordinates of the points xtj are known with a certain accuracy in the CS of the calibration target. Next, the image coordinates pi,j,k are calculated for each point j = 1...M and each position k = 1...R. The complete transformation for these points is ( ) It is not necessary for each point at each position to be projected for each camera, some points or positions may be skipped. Following the same reasons as for the triangulation algorithm above, the most appropriate choice of the cost function C is the Mahalanobis distance in the image plane. The equation for this cost function is similar to Eq. (7) Again, we can use this cost function for models 1 and 2 because they provide the closed-form solution for the forward transformation. For the model 3 we should derive the alternative cost function based on the sum of distances from 3D point xt to rays t l . If we use a flat calibration target, we can replace rays t , , i j k l by points t , , i j k x of ray intersections with XOY plane in the CS of the target. In that case, the cost function can be represented as follows where 1 x , , i j k − Σ is the inverse covariation matrix of calibration target coordinate measurement error for xt,i,j,k (see [29] for details about the estimation of this matrix). We can also perform calibration for models 1 and 2 with the cost function (10) which leads to slightly different results than calibration with (9).

Equipment.
In order to evaluate and compare the considered mathematical models, we have conducted a series of experiments using industrial videoendoscope Mentor Visual iQ Videoprobe (GE Inspection Technologies) with forwardview stereo adapter. The probe has a diameter 6.1 mm and a 1/6" CCD image sensor with 440000 pixels. A prism-based stereo adapter provides two registration channels with the fields of view 55°/55°.
To acquire images for calibration and tests, we utilized two types of plane calibration targets: (1) with black circular dots on a white background and (2) with black and white chessboard pattern. Due to sufficient range of distances and magnifications, three samples of each calibration target type with 0.5 mm (small-sized), 1 mm (middle-sized) and 2 mm (large-sized) distance between markers have been manufactured. This distance corresponded to the distance between the centers of the dots and to the chessboard square size for two types of target respectively. Each target had a grid of 25×25 markers.
The distal end of the endoscope was fixed on a mechanical stand that allows adjustable movement along z-axis and rotation around y and z-axis. The axis of the probe was approximately coincident with the z-axis of the stand. Another mechanical stand was placed in front of the endoscope to mount calibration targets. The plane of calibration target was set approximately perpendicular to the z-axis of the stand.
Image acquisition and processing. We have conducted a few experiments using different types of calibration targets described above. At first, dot and chessboard targets have been used for independent experiments. Next, we have carried out an experiment replacing calibration target in each position to obtain equal conditions for two types of targets. For each experiment we have captured two series of images: calibration sequence and test sequence.
The calibration sequence included 18 images captured at different positions of three calibration targets: smallsized, medium-sized and large-sized. Each of these targets has been placed approximately perpendicular to zaxis of the stand at the distance which allows to see at least eight markers in a horizontal row on both image halves. Then, the distal end was consequently rotated 30° around x and y axes to capture images at oblique angles. Finally, we have positioned the target perpendicular to zaxis at approximately 1.5× larger working distance than initial position. As a result, we have acquired 6 images for each size of the calibration target covering the total range of distances from 8 to 40 mm.
Test sequences were registered for medium-sized and large-sized calibration targets in the following way. Again, the calibration target has been placed approximately perpendicular to z-axis of the stand at reasonably close distance and oriented so that horizontal rows of markers were approximately horizontal on the image. We used translation stage to shift the distal end of the endoscope along zaxis and captured images with 1 mm step. Thus, we have obtained series of images for each calibration target. These series contained 16 images in the range from 12 to 27 mm for medium-size targets and 19 images in the range from 22 to 40 mm for large-sized ones. The range of distances was limited by the condition that the number of visible markers should not be too small and the condition of minimal marker scale suitable for robust image processing.
In order to calculate image coordinates pi,j,k for each marker, we have implemented in MATLAB the image processing algorithms for detection of dot and chessboard pattern from [30]. All automatically processed images of calibration and test sequences have been checked manually afterwards to delete points in low-illuminated areas, around glare, etc. Captured images of two calibration targets with estimated marker positions are shown in Fig. 4. These images correspond to the calibration series.
Calibration. We have implemented the calibration algorithms for each earlier considered mathematical model using non-linear iterative solver from MATLAB for the constrained minimization problem formulated in Eq. (8). The constraints for the values of calibration parameters k are determined by the physical limitations of prism parameters: size (less than 5 mm), refraction index (from 1.4 to 1.8), etc. We used the cost function C based on distance in the image plane for models 1 (usual pinhole model) and 2 (pinhole model with extended polynomial distortion) according to Eq. (9).
The covariation matrix Σp i,j,k was assumed to be the identity matrix due to the lack of a priory information about the error distribution of the image coordinates pi,j,k. As mentioned above, model 3 (ray tracing model) does not provide closed-form solution for the forward transformation which makes this cost function inappropriate. Hence, we used another cost function based on the distance in the XOY plane of the CS of the calibration target as shown in Eq. (10).

Fig. 4. Captured images with estimated marker positions for two types of calibration target
For simplicity, the covariation matrix Σx i,j,k was also considered as the identity matrix, but this is sufficiently less reasonable assumption than the same one for Σp i,j,k according to [4,29]. In order to assess the impact of the calibration algorithm on the measurement accuracy we have calibrated model 1 using the cost function (10). We further refer to this hybrid model as model 1' (usual pinhole model calibrated in 3D space).
We have processed four calibration sequences and have calculated four vectors of parameters for each of the considered mathematical models. Two of these sequences have been captured during the first experiment. The first sequence included images of the dot calibration target and the second one included images of the chessboard target. These sequences were acquired independently and the positions of calibration target were different. The next two sequences have been captured replacing the calibration target in each position during the second experiment. Thus, the planes of the targets were approximately the same for each position. We have compared the calibration results for each model calculated with different targets to assess the impact of this factor. Our analysis has shown that the difference of parameters calculated for two sequences of the second experiment is close to the difference of parameters calculated for two series of the same calibration target.
The calibration results for the chessboard target from the first experiment are listed in Table 1. The affine transformation Ai is described by the focal length fx, fy and the image center cx, cy [4,16], the skew parameter is considered to be zero. The parameters of distortion transformation Di for model 1, 1' and 3 correspond to Eq. (2). Model 2 uses 17 parameters for the distortion transformation: kx1,4, kx0,4, kx3,2, kx1,2, kx0,2, kx1,1, kx5,0, kx3,0, kx2,0,  ky0,5, ky2,3, ky0,3, ky0,2, ky4,1, ky2,1, ky1,1 and ky2,0 (7) based on the distance in the image plane for models 1 and 2 with the identity covariation matrix. The cost function based on the sum of the distances from the 3D point to the back-projected rays was used for models 1' and 3. We have considered all covariation matrices equal to the identity ones so that all coordinates, points and rays had equal weights in the cost functions.
In most applications of 3D measurement endoscopic systems, the calculated 3D point coordinates are used to measure the geometric parameters such as distances or areas. Hence, we should utilize the criteria based on the deviations of the length or the area to evaluate the mathematical models for these systems. We have processed 3D points calculated for each image of the test sequences to estimate the distance between neighbouring markers. Due to the chosen orientation of the calibration targets these segments are approximately parallel to x-and y-axis of the stand. The orientation of the CS is the same as shown in Fig. 3. The true value of the distance is assumed to be 1 mm for the medium-sized target and 2 mm for the large-sized one. Next, we have processed 3D points for pairs of consequently captured images to estimate the distance between points corresponding to the same marker. The obtained segment is approximately parallel to z-axis and its true length is assumed to be 1 mm (equal to the shift step for the distal end of the endoscope). Additionally, the position of the calibration target has been estimated for each image relying on the calculated 3D point coordinates.
We have estimated the distance measurement errors for all captured test sequences and all considered mathematical models using corresponding vectors of parameters from the calibration stage. Since the results for different types of calibration targets are almost coincident, we present the results for the chessboard target from the first experiment only. The absolute errors ∆r are shown in Fig. 5 for the medium-sized target. The error value is indicated by dots' color according to the grayscale bar in the right side of the figure. The dots indicating segments with error value larger than 0.2 mm are filled with the black color intended for 0.2 mm values. The coordinates of the dots correspond to the estimated positions of the calibration target for a visual clarity, the coordinates of the first point of segment are shown. Each row of the plots represents one of four considered mathematical models, each column corresponds to the segments parallel to x-, y-and z-axis respectively. The CSs for the plots are slightly different. The origin for model 3 is determined by the center of the internal pinhole camera as shown in Fig. 3. The middle point between two camera centers is chosen as the origin for models 1, 1' and 2 (see Fig. 2).
One can see that the error value varies significantly for different parts of the observed volume and the different orientation of the segment. Thus, the choice of the integral criterion for the quantitative evaluation representing the values distribution is not clear. According to common applications of 3D measurement endoscopic systems, we have divided the obtained data set into zones by z-coordinate and calculated mean and standard deviation of the segment length for every zone. The number of zones is equal to the number of images in the test sequence, hence, each captured image belongs to single zone. The results are shown in Fig. 6. The top row of the plots represents test sequence for large-sized calibration target, the true value of x-and y-segments is 2 mm and the true value of z-segments is 1 mm. The bottom row of plots represents test sequence for medium-sized target, the true value of all segments is 1 mm.
The results presented in Figs. 5 and 6 indicate that the error for x-segments grows significantly on the edges of the field of view for models 1, 1' and 2. In contrast, all models demonstrate relatively similar performance for ysegments. Furthermore, the error for x-and y-segments varies for each zone which leads to a large standard deviation due to a chosen evaluation method. In contrast, the standard deviation for z-segments is comparable for all models, but the usage of models 1, 1' and 2 leads to unacceptable bias which increases with distance. The correlation of spikes on the plots for the mean error of the length for z-segments reveals uncertainties in shifting of the distal end during the experiment. Model 1' demonstrates slightly better results than model 1 for x-segments, but the overall performance is relatively similar. Hence, the choice of the cost function has no great impact on the accepted evaluation of models. The main conclusion based on the experimental results is that model 3 has a superb performance in comparison to other models. Model 2 may be used for the working distances less than 20 mm but its usage is limited due to bias in measurements of length along z-axis at larger distances. Each column corresponds to segments parallel to x-, y-and z-axis respectively. The true value of x-and y-segments is 2 mm for the large-sized target and 1 mm for the medium-sized one. The true value of z-segments is 1 mm

Conclusion
In this paper, we first theoretically and experimentally analyzed and compared three different mathematical models of prism-based stereoscopic imagers. The experimental data confirmed the results previously obtained by computer simulation: any of considered pinhole models can not provide the required accuracy of the measurement over a wide range of working distances. The model based on ray tracing through the prism shows much better results. An increase in the number of degrees for distortion polynomial model does not help to improve the accuracy, apparently, because of the assumption that all rays pass through the single point. The error of the length segment measurement has a complex spatial distribution and depends strongly on the orientation of the segment. This fact should be taken into account for evaluation of the videoendoscopes and formulating the criteria for their measurement accuracy testing. The distribution can be used for the interactive user assistance by indicating the estimated measurement errors or giving recommendations on the location of the probe relative to the object. We have shown that the simple calibration method using a flat target allows to conduct the calibration of the endoscope with attached prism-based stereo adapter and reach the desired accuracy. The method does not require the results of other specific calibrations or the prism parameters. Type of the calibration target pattern selected for the calibration and test measurements has an insufficient effect on the result.
The obtained theoretical and experimental results may be useful for the improvement of the existing prism-based videoendoscopic imagers in terms of the measurement range and accuracy as well as for the development of the new small-size tools for remote visual inspection, especially for the important industrial tasks where high precision of defect characterization is crucial.